When should we use inference components vs single-model endpoints?

Use inference components for production when you may host multiple models on one endpoint, need independent scaling per model, or want to optimize GPU utilization. Single-model endpoints are acceptable for dev/test or when exactly one model maps to one endpoint forever. AWS documentation recommends inference components for new real-time deployments.

When should we NOT use scale-to-zero on inference endpoints?

Skip scale-to-zero when cold-start latency exceeds your SLA — GPU endpoints often take 15–30+ seconds to wake. Fraud, authorization, and sub-100 ms APIs need minimum capacity ≥ 1. Batch and async scoring are better scale-to-zero candidates.

What breaks during GPU capacity shortages?

Before the April 2026 capacity-aware pools, endpoints failed to reach InService with insufficient-capacity errors, and teams retried manually with different instance types. Symptom: 6+ create-endpoint attempts in CI logs. Fix: define a prioritized instance pool (e.g., ml.g5.2xlarge → ml.g5.4xlarge → ml.g6.2xlarge) so SageMaker provisions the next available type automatically.

How does this differ from SageMaker Unified Studio migration?

Unified Studio is the IDE and domain experience for data scientists. This post covers production deployment — endpoints, registry approval, monitors, and promotion. You can train in Unified Studio and deploy with the same SageMaker AI APIs described here.

When should we use Bedrock instead of SageMaker endpoints?

Use Bedrock for foundation-model inference (LLM APIs) without managing containers. Use SageMaker endpoints for custom models, classical ML (XGBoost, sklearn), fine-tuned open weights, or when you need full container control and VPC-only hosting. Hybrid architectures are common — Bedrock for LLM, SageMaker for tabular fraud models.

What could go wrong after enabling Model Monitor?

False-positive drift alarms from seasonal traffic (holiday fraud spikes), schema changes in feature pipelines without baseline refresh, and alert fatigue causing on-call to ignore real drift. Mitigate with seasonality-aware baselines, feature schema contracts, and paging only on composite signals (drift + latency + 5xx).

SageMaker Production MLOps 2026: Deployment Playbook

SageMaker Production MLOps on AWS (2026): Inference Components, Capacity Pools, and Promotion Gates

Quick summary: On a fraud-scoring team (~450 RPS peak, ml.g5.2xlarge), inference components plus April 2026 capacity-aware instance pools cut endpoint provisioning failures from 6 retries to 0 — p99 held at 68 ms through a g5→g6 fallback event.

Key Takeaways

On a fraud-scoring team (~450 RPS peak, ml.g5.2xlarge), inference components plus April 2026 capacity-aware instance pools cut endpoint provisioning failures from 6 retries to 0; p99 held at 68 ms through a g5→g6 fallback event
This post is the production deployment playbook — inference components, registry gates, monitors, and promotion
It is not Unified Studio migration, not training cost optimization, not Bedrock vs OpenAI, and not blue/green for generic apps (though variant weights reuse the same ideas)

On April 21, 2026, Amazon SageMaker AI shipped capacity-aware inference with automatic instance fallback — prioritized instance pools provision the next hardware type when your first choice lacks capacity. That change matters because GPU endpoint creation failures were the top production blocker for teams who pinned a single ml.g5.* type in CI.

This post is the production deployment playbook — inference components, registry gates, monitors, and promotion. It is not Unified Studio migration, not training cost optimization, not Bedrock vs OpenAI, and not blue/green for generic apps (though variant weights reuse the same ideas).

Artifacts: deployment stage checklist, cost/latency worksheet CSV.

Benchmark pattern (not a cited client) — Fintech fraud scoring, ~450 RPS peak, XGBoost on GPU container, us-east-1. Pre-pool: 6 failed create-endpoint CI runs in one week (g5 capacity). Post inference component + instance pool (ml.g5.2xlarge → ml.g5.4xlarge → ml.g6.2xlarge): 0 provision failures over 30 days, p99 68 ms (including one automatic g6 fallback event).

Production ladder — run in order

Stage	Gate	Rollback trigger
0	Model Registry package + lineage	No registry → stop
1	Staging endpoint + load test 1.5× peak	p99 > SLA at 1×
2	Model Monitor baselines	Drift on shadow traffic
3	Registry approval + canary weights	5xx > 0.1%
4	Cost tags + autoscale bounds	Idle GPU > 40% without scale-in

Full checklist: deployment-stage-checklist.md.
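The rollback triggers in the ladder can be written as a small check that runs after each promotion attempt. This is a minimal sketch, not part of the checklist itself: the `StageMetrics` fields and the 100 ms SLA in the example are illustrative assumptions, and only the thresholds (0.1% 5xx, 40% idle GPU) come from the table above.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    """Measurements for one promotion attempt (field names are hypothetical)."""
    has_registry_package: bool
    p99_ms_at_1x: float
    sla_p99_ms: float
    drift_on_shadow: bool
    error_5xx_rate: float     # fraction of requests, 0.001 == 0.1%
    idle_gpu_fraction: float  # fraction of GPU time idle, 0.40 == 40%
    scale_in_enabled: bool


def rollback_triggers(m: StageMetrics) -> list:
    """Return the ladder stages whose rollback trigger fired, in stage order."""
    fired = []
    if not m.has_registry_package:
        fired.append("stage 0: no registry package")
    if m.p99_ms_at_1x > m.sla_p99_ms:
        fired.append("stage 1: p99 above SLA at 1x load")
    if m.drift_on_shadow:
        fired.append("stage 2: drift on shadow traffic")
    if m.error_5xx_rate > 0.001:
        fired.append("stage 3: 5xx above 0.1%")
    if m.idle_gpu_fraction > 0.40 and not m.scale_in_enabled:
        fired.append("stage 4: idle GPU above 40% without scale-in")
    return fired


# Example: the 68 ms p99 from the benchmark against an assumed 100 ms SLA.
healthy = StageMetrics(True, 68.0, 100.0, False, 0.0002, 0.25, True)
print(rollback_triggers(healthy))  # → []
```

An empty list means every gate passed. Anything else names the earliest stage to revisit first.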

Opinionated take: Inference components for all new real-time endpoints. Single-model endpoints are a dev convenience — production needs independent scale and multi-model headroom without reprovisioning the entire endpoint.

Stage 1 — Deploy with inference components

AWS CLI v2 + boto3 ≥ 1.34; region us-east-1 in examples below.

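# Context: SageMaker AI real-time endpoint with inference component (July 2026 API)
# ROLE_ARN is a placeholder; the role name is illustrative, substitute your own execution role.
ROLE_ARN=arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole

aws sagemaker create-model --model-name fraud-xgb-v3 --primary-container Image=ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/fraud:xgb-v3.2

# With inference components the variant carries no ModelName, so the role is set on the endpoint config.
aws sagemaker create-endpoint-config --endpoint-config-name fraud-prod-ic-cfg \
  --execution-role-arn "$ROLE_ARN" \
  --production-variants VariantName=AllTraffic,InitialInstanceCount=2,InstanceType=ml.g5.2xlarge

aws sagemaker create-endpoint --endpoint-name fraud-prod --endpoint-config-name fraud-prod-ic-cfg

# Wait until the endpoint is InService before attaching a component.
aws sagemaker wait endpoint-in-service --endpoint-name fraud-prod

aws sagemaker create-inference-component \
  --inference-component-name fraud-xgb-ic \
  --endpoint-name fraud-prod \
  --variant-name AllTraffic \
  --specification '{"ModelName":"fraud-xgb-v3","ComputeResourceRequirements":{"MinMemoryRequiredInMb":4096,"NumberOfAcceleratorDevicesRequired":1}}' \
  --runtime-config '{"CopyCount":2}'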
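Before choosing a CopyCount, it helps to check that the requested copies fit on the instances behind the variant. The sketch below is a rough upper bound, not SageMaker's placement logic. The per-instance figures in `INSTANCE_SPECS` are assumptions to verify against the current instance documentation, and usable memory will be lower than the nominal figure once the host and container overhead are counted.

```python
# Assumed nominal specs: one GPU per instance, host memory in MB.
INSTANCE_SPECS = {
    "ml.g5.2xlarge": {"accelerators": 1, "memory_mb": 32 * 1024},
    "ml.g5.4xlarge": {"accelerators": 1, "memory_mb": 64 * 1024},
    "ml.g6.2xlarge": {"accelerators": 1, "memory_mb": 32 * 1024},
}


def max_copies(instance_type, instance_count, accelerators_per_copy, min_memory_mb_per_copy):
    """Upper bound on copies that fit; a copy cannot span two instances."""
    spec = INSTANCE_SPECS[instance_type]
    per_instance_by_accel = spec["accelerators"] // accelerators_per_copy
    per_instance_by_memory = spec["memory_mb"] // min_memory_mb_per_copy
    return instance_count * min(per_instance_by_accel, per_instance_by_memory)


# The configuration above: 2 instances, 1 accelerator and 4096 MB per copy.
print(max_copies("ml.g5.2xlarge", 2, 1, 4096))  # → 2
```

With one GPU per instance, the accelerator requirement is the binding limit here, so CopyCount 2 uses both instances fully and leaves no room for a second model that also needs a GPU.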

For capacity-aware pools, set heterogeneous instance preferences in endpoint config (see AWS capacity-aware inference blog — April 2026). SageMaker tries your priority list at create, scale-out, and scale-in.
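The priority-order behavior is easiest to see as plain control flow. The sketch below simulates it client-side, which is roughly what teams scripted by hand before pools existed. It is not the SageMaker API for capacity-aware pools: `InsufficientCapacity`, `provision_with_fallback`, and `fake_provision` are hypothetical names, and the endpoint-config fields for the managed feature are not shown here.

```python
from typing import Callable, Sequence


class InsufficientCapacity(Exception):
    """Hypothetical stand-in for a provider-side capacity error."""


def provision_with_fallback(pool: Sequence[str],
                            try_provision: Callable[[str], None]) -> str:
    """Try each instance type in priority order; return the first that succeeds."""
    for instance_type in pool:
        try:
            try_provision(instance_type)
            return instance_type
        except InsufficientCapacity:
            continue  # fall through to the next type in the pool
    raise InsufficientCapacity("no capacity in any of: " + ", ".join(pool))


POOL = ["ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g6.2xlarge"]
UNAVAILABLE = {"ml.g5.2xlarge", "ml.g5.4xlarge"}  # simulated g5 shortage


def fake_provision(instance_type: str) -> None:
    """Simulated provisioner: raises for types with no capacity."""
    if instance_type in UNAVAILABLE:
        raise InsufficientCapacity(instance_type)


chosen = provision_with_fallback(POOL, fake_provision)
print(chosen)  # → ml.g6.2xlarge
```

Ordering the pool from preferred to acceptable keeps cost and latency predictable: the fallback type should be one you have already load tested.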

Stage 2 — Model Monitor before traffic

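# Context: boto3 1.34+, SageMaker SDK 2.x, staging endpoint already serving shadow traffic
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::ACCOUNT:role/SageMakerModelMonitor",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# suggest_baseline profiles a dataset in S3, not a live endpoint.
# The baseline_dataset path is a placeholder for your captured or training data.
monitor.suggest_baseline(
    job_name="fraud-baseline-20260702",
    baseline_dataset="s3://ml-monitoring/baseline/fraud-staging.csv",
    dataset_format=DatasetFormat.csv(header=True),
    record_preprocessor_script="s3://ml-monitoring/preprocessor.py",
)

# The endpoint is attached when the schedule is created (hourly for fraud).
# The schedule name is a placeholder.
monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-staging-data-quality",
    endpoint_input="fraud-staging",
    record_preprocessor_script="s3://ml-monitoring/preprocessor.py",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)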

Schedule hourly monitors for fraud; daily for slow-drifting recommenders. Pair with CloudWatch alarms on Invocation5XXErrors and ModelLatency.
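The two CloudWatch alarms can be described as parameter sets for `put_metric_alarm`, plus a composite rule so that drift alone does not page anyone. This is a sketch under assumptions: the alarm names, the five-minute evaluation window, and the `EndpointName` and `VariantName` dimensions are illustrative choices, and the rule "drift AND (latency OR 5xx)" is one reading of composite paging, not a prescribed policy. The functions only build dictionaries; nothing is sent to AWS.

```python
def latency_alarm(endpoint, variant, p99_threshold_ms):
    """Parameters for cloudwatch.put_metric_alarm on p99 ModelLatency."""
    return {
        "AlarmName": f"{endpoint}-p99-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint},
            {"Name": "VariantName", "Value": variant},
        ],
        "ExtendedStatistic": "p99",
        "Period": 60,
        "EvaluationPeriods": 5,
        # ModelLatency is reported in microseconds.
        "Threshold": p99_threshold_ms * 1000.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def error_alarm(endpoint, variant, max_errors_per_minute):
    """Parameters for cloudwatch.put_metric_alarm on Invocation5XXErrors."""
    return {
        "AlarmName": f"{endpoint}-5xx",
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocation5XXErrors",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint},
            {"Name": "VariantName", "Value": variant},
        ],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": float(max_errors_per_minute),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def composite_rule(drift_alarm_name, latency_alarm_name, error_alarm_name):
    """AlarmRule for put_composite_alarm: page on drift plus a serving symptom."""
    return (f'ALARM("{drift_alarm_name}") AND '
            f'(ALARM("{latency_alarm_name}") OR ALARM("{error_alarm_name}"))')


lat = latency_alarm("fraud-prod", "AllTraffic", 100)
err = error_alarm("fraud-prod", "AllTraffic", 5)
print(composite_rule("fraud-prod-drift", lat["AlarmName"], err["AlarmName"]))
```

Each dictionary can be passed as `cloudwatch.put_metric_alarm(**lat)` with a boto3 CloudWatch client. Endpoints that serve through inference components may publish these metrics under additional dimensions, so confirm the dimension set in the CloudWatch console before relying on the alarms.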

What broke — Day 12 post-launch: Model Monitor flagged data drift — root cause was a feature pipeline deploy that renamed txn_amount_usd → amount_usd without schema contract. Monitor worked; human process failed. Rolled back pipeline; refreshed baseline before re-enabling pages.
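A schema contract that would have caught this rename before deploy can be very small. The sketch below compares the columns a pipeline emits against the columns the baseline was built on. The function name and the `merchant_id` column are hypothetical; only the `txn_amount_usd` → `amount_usd` rename comes from the incident above.

```python
def check_schema_contract(expected, actual):
    """Compare column-name → type mappings and list every violation found."""
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"type change: {name} {dtype} -> {actual[name]}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


baseline_schema = {"txn_amount_usd": "float", "merchant_id": "string"}
pipeline_output = {"amount_usd": "float", "merchant_id": "string"}

print(check_schema_contract(baseline_schema, pipeline_output))
# → ['missing column: txn_amount_usd', 'unexpected column: amount_usd']
```

Running a check like this in the feature pipeline's CI turns a rename into a failed build instead of a drift page twelve days later.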

Stage 3 — Promotion with registry + canary

Set model package status to Approved in Model Registry (manual or pipeline gate).
Deploy the new inference component version to staging and run it as a shadow with 0% production weight.
Shift prod variant weight 10% → 50% → 100% or swap endpoint config per blue/green guide.
Keep previous model version ARN in runbook for < 5 min rollback.
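The weight shift in the steps above maps onto the `UpdateEndpointWeightsAndCapacities` API. The sketch below builds one request per step and a rollback request, assuming the endpoint has two production variants. The variant names `Current` and `Candidate` are hypothetical (the Stage 1 example uses a single `AllTraffic` variant), and nothing is sent to AWS.

```python
def canary_steps(endpoint, current_variant, candidate_variant, percents=(10, 50, 100)):
    """Build one update_endpoint_weights_and_capacities request per canary step."""
    steps = []
    for pct in percents:
        if not 0 <= pct <= 100:
            raise ValueError(f"percent out of range: {pct}")
        steps.append({
            "EndpointName": endpoint,
            "DesiredWeightsAndCapacities": [
                {"VariantName": current_variant, "DesiredWeight": float(100 - pct)},
                {"VariantName": candidate_variant, "DesiredWeight": float(pct)},
            ],
        })
    return steps


def rollback_step(endpoint, current_variant, candidate_variant):
    """Request that sends all traffic back to the current variant."""
    return canary_steps(endpoint, current_variant, candidate_variant, percents=(0,))[0]


for step in canary_steps("fraud-prod", "Current", "Candidate"):
    print([w["DesiredWeight"] for w in step["DesiredWeightsAndCapacities"]])
# → [90.0, 10.0]
# → [50.0, 50.0]
# → [0.0, 100.0]
```

Each request can be applied with `sagemaker_client.update_endpoint_weights_and_capacities(**step)`. Checking the 5xx gate between steps, and keeping `rollback_step` in the runbook, is what makes the under-five-minute rollback realistic.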

Integrate with SageMaker Pipelines — training → evaluate → register → deploy steps — aligned with DevOps maturity model Level 3+ expectations.

Stage 4 — Cost and capacity

Lever	When
Instance pool fallback	GPU capacity errors in CI
Auto-scaling on `InvocationsPerInstance`	Predictable diurnal fraud peaks
SageMaker AI Savings Plans	Steady GPU spend > $50k/mo (see the Savings Plans guide)
Async / batch transform	Scoring latency > 60s acceptable
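The auto-scaling lever goes through Application Auto Scaling. The sketch below builds the two requests for target tracking on invocations per instance at the variant level. The policy name, the cooldowns, and the 9000 invocations-per-minute target in the example are illustrative assumptions to tune from your load test; the functions only build dictionaries.

```python
def variant_autoscaling(endpoint, variant, min_instances, max_instances,
                        invocations_per_instance_per_minute):
    """Build register_scalable_target and put_scaling_policy parameters."""
    if min_instances < 1:
        raise ValueError("real-time scoring needs minimum capacity >= 1")
    if max_instances < min_instances:
        raise ValueError("max_instances must be >= min_instances")
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint}-{variant}-invocations",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance_per_minute),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,  # seconds; scale in slowly
            "ScaleOutCooldown": 60,  # seconds; scale out quickly
        },
    }
    return target, policy


target, policy = variant_autoscaling("fraud-prod", "AllTraffic", 2, 6, 9000)
print(target["ResourceId"])  # → endpoint/fraud-prod/variant/AllTraffic
```

With a boto3 `application-autoscaling` client, the requests would be applied as `register_scalable_target(**target)` followed by `put_scaling_policy(**policy)`. When scaling copies of an inference component instead of instances, the resource ID, scalable dimension, and predefined metric are different, so check the inference component auto-scaling documentation before reusing this shape.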

Model scenarios in cost-latency-worksheet.csv — scale-to-zero row shows p99 2400 ms cold start; rejected for real-time fraud.
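The worksheet's monthly figure is simple arithmetic, sketched below. The 730-hour month is a common convention, and the 1.50 USD hourly rate in the example is a placeholder, not a quoted AWS price; take the real rate from the SageMaker pricing page for your region.

```python
HOURS_PER_MONTH = 730  # conventional average month length in hours


def estimated_monthly_usd(instance_count, hourly_usd, utilization=1.0):
    """Monthly cost for always-on instances; utilization scales billed hours."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return round(instance_count * hourly_usd * HOURS_PER_MONTH * utilization, 2)


# Two always-on instances at a placeholder rate of 1.50 USD/hour.
print(estimated_monthly_usd(2, 1.50))  # → 2190.0
```

Comparing this number across rows (always-on, autoscaled, scale-to-zero) next to each row's p99 makes the cost and latency trade visible in one place.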

When NOT to use SageMaker endpoints

Situation	Alternative
Prompt-only LLM, no custom weights	Bedrock Converse API
Sub-10 ms at millions of RPS	Neuron / Inferentia compiled models or edge
Batch nightly scores only	Batch Transform — no 24/7 endpoint
Team has no container ops	SageMaker JumpStart managed model — still an endpoint, but faster bootstrap

What to do this week

Register current prod model in Model Registry with lineage.
Add instance pool to staging endpoint config; rerun failed CI provisions.
Run 1.5× load test; export p99 to worksheet CSV.
Create Model Monitor baseline from staging — alarm on drift + 5xx.
Document rollback ARN and variant weight steps before next promotion.

Reproduce this — Follow deployment-stage-checklist.md. Check off stages 0–2 on a staging endpoint only. Record estimated_monthly_usd from the worksheet after load test — do not promote without stage 2 green.

What this post doesn’t cover

Feature Store setup and offline/online sync — separate data engineering guide.
Multi-model generative serving (LLM + reranker) — Bedrock + AgentCore patterns.
SageMaker HyperPod large-scale training — training cluster ops, not inference.
Full MLOps platform selection (Databricks, Vertex) — AWS-native path only.

Recommended Reading

Amazon SageMaker Unified Studio: Migrating from Studio Classic to the Unified ML Platform

Blue/Green vs Canary on AWS (2026): ECS, Lambda, and When Rolling Is Enough

How to Run SageMaker Training Jobs Cost-Efficiently

AWS SageMaker AI Savings Plans: Up to 64% Off Training and Inference Compute
