Amazon SageMaker
Amazon SageMaker is AWS's end-to-end machine learning platform, launched at AWS re:Invent in November 2017. It is, by raw customer count, almost certainly the largest ML platform in the world. Its position is the same as every other AWS service: not the best in any single dimension, but the default for any team that already lives in AWS, which turns out to be most enterprises. The default has enormous value in enterprise sales.
If Databricks ML is the data-team-led ML platform, SageMaker is the engineering-team-led ML platform. The buyer is usually a Director of ML Engineering whose team already runs production services on EC2, S3, and Lambda, and who wants ML to live in the same VPC, IAM, and billing system as everything else.
In 2016-2017, AWS noticed that machine learning was about to become a huge category and that AWS had no managed offering for it. Customers were stitching together EC2 instances, custom AMIs, S3 buckets, and DIY orchestration to train models. The internal Amazon ML teams (the ones training models for product recommendations, fulfillment, and Alexa) had built sophisticated internal tooling, but none of it was exposed to customers.
The decision to build SageMaker was made by AWS leadership specifically to commoditize the ML platform layer before any startup could build a defensible category around it. (The same playbook AWS ran against MongoDB with DocumentDB, against Elastic with OpenSearch, etc.) Andy Jassy announced SageMaker on stage in his re:Invent 2017 keynote with a typical AWS pitch: "managed Jupyter, managed training, managed deployment, all integrated with the AWS services you already use."
The first version of SageMaker was crude. Notebooks worked. Training worked. Deployment worked. Almost nothing else did. But the service shipped, AWS sales started selling it, and the product got better fast. By 2019, SageMaker had added Pipelines (for MLOps workflows), Experiments, Model Monitor, and a model registry. By 2021, it was a full end-to-end platform competitive with Databricks ML for AWS-native customers.
In 2024, AWS announced the next-generation SageMaker — a major redesign that reframed SageMaker as a unified data, analytics, and AI platform, integrating with Redshift, EMR, and Bedrock. This is partly a response to Databricks' lakehouse pitch, and partly an admission that the original SageMaker was too narrowly scoped to be the AWS answer to Databricks.
SageMaker is genuinely sprawling. The major components include:
- Studio notebooks — managed Jupyter for development
- Managed training jobs, plus HyperPod for large distributed runs
- Managed deployment — real-time endpoints, serverless endpoints, and batch transform jobs
- Pipelines, for MLOps workflows
- Experiments, for tracking training runs
- Model Monitor, for detecting drift in production
- A model registry, for versioning and approving models
This is a longer feature list than any single competitor. The flip side is that SageMaker has the fragmented-AWS-product-suite feel: the components are powerful but often inconsistent, with overlapping features and confusing names. (There are at least three different ways to deploy a model.)
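That parenthetical is not an exaggeration. As a rough illustration of the deployment sprawl, here is a sketch of three distinct deployment paths for the same trained model, expressed as the request payloads you would hand to boto3's SageMaker client. Every resource name and S3 path below is hypothetical, and nothing is actually sent to AWS:

```python
# Three ways to deploy one trained SageMaker model, as boto3-style request
# payloads. Names and S3 URIs are hypothetical; no AWS calls are made.

MODEL_DATA = "s3://example-bucket/models/propensity/model.tar.gz"  # hypothetical

# 1. Real-time endpoint: a long-lived instance, billed per instance-hour.
realtime_config = {
    "EndpointConfigName": "propensity-realtime",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "propensity-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
}

# 2. Serverless endpoint: scales to zero, billed per request duration.
serverless_config = {
    "EndpointConfigName": "propensity-serverless",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "propensity-model",
        "ServerlessConfig": {"MemorySizeInMB": 2048, "MaxConcurrency": 5},
    }],
}

# 3. Batch transform: a one-off job that scores an S3 prefix and exits.
batch_job = {
    "TransformJobName": "propensity-weekly-batch",
    "ModelName": "propensity-model",
    "TransformInput": {"DataSource": {"S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": "s3://example-bucket/scoring-input/",  # hypothetical
    }}},
    "TransformOutput": {"S3OutputPath": "s3://example-bucket/scoring-output/"},
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}

deployment_modes = {
    "realtime": realtime_config,
    "serverless": serverless_config,
    "batch": batch_job,
}
print(sorted(deployment_modes))  # → ['batch', 'realtime', 'serverless']
```

Each path has its own console pages, quotas, and pricing meter, which is exactly the kind of overlap newcomers find confusing.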
SageMaker is the inevitable choice for AWS-native organizations and the frustrating choice for everyone else. Its strengths are entirely about distribution and integration: every AWS account already has it, every AWS DevOps team already understands IAM and CloudFormation, and every AWS-native data pipeline can wire into SageMaker without leaving the VPC. For a Fortune 500 already running on AWS, picking anything other than SageMaker for ML is a meaningfully harder political conversation.
The frustrations are also real. The UX has historically been inferior to Databricks, Vertex AI, and most pure-play ML platforms. The component sprawl is overwhelming for newcomers. The classic AWS pricing model — billed by instance-hour for everything, with separate charges for storage, data transfer, and endpoints — is hard to predict and harder to optimize. And SageMaker has long had a reputation for shipping features that look great in keynotes but turn out to be limited in practice (see: the original SageMaker Pipelines).
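To make the pricing point concrete, a back-of-envelope estimate for a single always-on real-time endpoint already involves several independent meters. Every rate in this sketch is a hypothetical placeholder, not an actual AWS price; the point is the number of line items, not the totals:

```python
# Back-of-envelope monthly cost for one always-on SageMaker endpoint.
# All rates below are hypothetical placeholders, not real AWS prices.

HOURS_PER_MONTH = 730

instance_rate = 0.23                 # $/instance-hour (hypothetical)
instance_count = 2                   # two instances behind the endpoint
storage_gb, storage_rate = 50, 0.10  # attached volume, $/GB-month (hypothetical)
egress_gb, egress_rate = 200, 0.09   # data transfer out, $/GB (hypothetical)

compute = instance_rate * instance_count * HOURS_PER_MONTH  # instance-hours
storage = storage_gb * storage_rate                         # separate meter
transfer = egress_gb * egress_rate                          # separate meter

total = compute + storage + transfer
print(f"compute={compute:.2f} storage={storage:.2f} "
      f"transfer={transfer:.2f} total={total:.2f}")
```

Multiply this by dozens of endpoints, training jobs, and notebook instances across teams, and the difficulty of forecasting an actual bill becomes clear.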
The LLM era is mixed for SageMaker. On one hand, AWS Bedrock is a genuinely strong foundation model API and integrates cleanly with SageMaker. On the other hand, the most exciting LLM training and inference workloads are happening on specialized platforms (Together, Anyscale, Modal, MosaicML/Databricks), not on SageMaker. AWS is racing to close the gap with Trainium chips, HyperPod, and the next-gen SageMaker, but the LLM-era story is still being written.
The honest prediction: SageMaker will continue to be the largest ML platform by customer count for the foreseeable future, simply because AWS has the largest cloud customer base. It will not necessarily be the best, but it will be the most inevitable. Databricks will continue to win in data-team-led purchases, and SageMaker will continue to win in engineering-team-led purchases. Both will keep growing.
SageMaker lives inside AWS: there is no standalone product or separate contract, and it is billed, secured, and governed through the same AWS account, IAM policies, and VPC boundaries as everything else.
A typical SageMaker buyer is an ML platform team at an AWS-native enterprise that needs enterprise-grade ML infrastructure with the same governance and security as the rest of their AWS environment.
TextQL Ana connects to AWS data sources — Redshift, Athena, RDS, S3-via-Iceberg — to answer questions in natural language. When customers run SageMaker for classical ML, the outputs of those models (predictions, scores, segments) typically land back in Redshift or S3, and Ana can query those outputs alongside the rest of the warehouse data. A business user can ask "show me customers with the highest propensity-to-buy scores from last week's model run" and get an answer pulled from the table SageMaker wrote. TextQL is complementary to SageMaker: SageMaker builds the models, Ana lets business users query their outputs in plain English.
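The pattern described above reduces to ordinary SQL over the table a SageMaker job wrote. A minimal sketch of that query, using sqlite as a stand-in for Redshift (the table and column names are hypothetical):

```python
import sqlite3

# Simulate a warehouse table that a SageMaker batch scoring job populated,
# then run the SQL a question like "customers with the highest
# propensity-to-buy scores from last week's model run" compiles down to.
# sqlite stands in for Redshift; all names here are hypothetical.

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE propensity_scores (
        customer_id    TEXT,
        score          REAL,
        model_run_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO propensity_scores VALUES (?, ?, ?)",
    [
        ("cust-001", 0.91, "2024-05-06"),
        ("cust-002", 0.47, "2024-05-06"),
        ("cust-003", 0.88, "2024-05-06"),
        ("cust-004", 0.95, "2024-04-29"),  # older run, excluded below
    ],
)

top = conn.execute("""
    SELECT customer_id, score
    FROM propensity_scores
    WHERE model_run_date = '2024-05-06'
    ORDER BY score DESC
    LIMIT 2
""").fetchall()
print(top)  # → [('cust-001', 0.91), ('cust-003', 0.88)]
```

The model-building happens in SageMaker; the querying is plain SQL over its outputs, which is the layer a natural-language interface sits on.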
See TextQL in action