Bedrock Provisioned Throughput vs On-Demand: Break-Even Math for Production Workloads (2026)
Quick summary: Most teams buy Bedrock Provisioned Throughput too early or too late. This is the break-even math — by token volume, by model family, and by traffic shape — that we use in real FinOps engagements to decide which Bedrock pricing mode wins.
Key Takeaways
- Most teams buy Bedrock Provisioned Throughput too early or too late
- This is the break-even math — by token volume, by model family, and by traffic shape — that we use in real FinOps engagements to decide which Bedrock pricing mode wins
Most engineering teams encounter Bedrock Provisioned Throughput in the same way: a finance review surfaces a Bedrock line item that has tripled month-over-month, someone screenshots the AWS pricing page, and a Slack thread starts about whether to “buy reserved capacity.”
The screenshot rarely contains enough information to make the decision. Provisioned Throughput is not a discount on on-demand pricing — it is a different pricing model with different units (Model Units per hour, not tokens), different commitment terms (1 month or 6 months, not pay-as-you-go), and break-even economics that depend heavily on traffic shape, not just total token volume.
This is the math we use in real FinOps engagements to decide whether a Bedrock workload should move to Provisioned Throughput, stay on on-demand, or apply the cheaper optimizations (Prompt Caching, Batch Inference) first.
The four Bedrock pricing modes, mapped
Bedrock has four pricing modes that you can mix and match per model and per workload. Each mode has a different cost shape — and the right answer for a given workload is usually a combination, not a single mode.
| Mode | What you pay for | Commitment | When it wins |
|---|---|---|---|
| On-Demand | Per input and output token, billed per request | None | Spiky traffic, dev/test, low total volume |
| Provisioned Throughput | Reserved Model Units (MUs) per hour | 1 month or 6 months | Sustained high-throughput steady-state traffic |
| Batch Inference | Per token, ~50% discount vs on-demand | None — async jobs only | Overnight or queued non-real-time workloads |
| Prompt Caching | Cached input tokens at ~10% of on-demand input rate | None | Long static prefixes reused across many requests |
The most common cost-optimization mistake we see is teams jumping straight to Provisioned Throughput because it sounds like a “reserved instance for AI.” Provisioned Throughput has the highest commitment and the longest payback time of the four modes. The cheaper optimizations should be exhausted first.
For deeper context on the model-selection and token-budget side of Bedrock cost, see our existing guide on Bedrock cost optimization, token budgets, and model selection.
How Provisioned Throughput is actually priced
Bedrock Provisioned Throughput is sold in Model Units (MUs). Each MU delivers a guaranteed minimum throughput — measured in input tokens per minute and output tokens per minute — for one specific foundation model.
The pricing has three dimensions:
- Hourly MU rate — Varies by model family. Smaller models (Haiku-class, Nova Micro/Lite) have lower hourly rates than larger models (Sonnet, Opus, Nova Pro/Premier).
- Commitment term — No commit is the most expensive (and is only available for some models); a 1-month commit is cheaper; a 6-month commit has the cheapest per-hour rate.
- Number of MUs — You can provision multiple MUs of the same model in the same region for higher aggregate throughput. Capacity scales linearly; pricing scales linearly.
A single MU of a Claude Sonnet-class model on a 6-month commit can deliver tens of thousands of input tokens per minute — enough for most production applications below a few hundred concurrent users. A single MU of a Haiku-class model delivers materially more tokens per minute at a lower hourly rate, because the model is cheaper to serve.
Verify before committing. Per-MU hourly rates and per-MU throughput vary by model and change as AWS releases updated foundation model versions. The decision math below uses the ratio of provisioned-to-on-demand cost, which is more stable than any specific dollar figure. Always cross-check the current numbers on the Amazon Bedrock Pricing page before signing a 6-month commit.
The break-even formula
There are two equivalent ways to think about the break-even point. Pick whichever matches the data you actually have.
Formulation 1: Sustained tokens per minute
If you know your sustained throughput (input + output tokens per minute averaged across the hour), the break-even question is:
At what tokens-per-minute does the per-token economics of one MU undercut on-demand?
Let:
- `T_in` = input tokens per minute delivered by one MU
- `T_out` = output tokens per minute delivered by one MU
- `R_mu` = MU hourly cost (commit-term-adjusted)
- `R_in` = on-demand input rate (per token)
- `R_out` = on-demand output rate (per token)
The hourly on-demand cost of producing the same throughput as one MU is:
```
hourly_on_demand_cost = (T_in × 60 × R_in) + (T_out × 60 × R_out)
```

The break-even utilization is:

```
break_even_utilization = R_mu / hourly_on_demand_cost
```

If your actual hourly utilization of one MU is above the break-even percentage, Provisioned Throughput is cheaper; below it, on-demand is cheaper.
For most current Claude and Nova model families on a 1-month commit, the break-even utilization tends to land between 60% and 75% of one MU’s full capacity — sustained, every hour of every day, for the entire commitment.
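Formulation 1 fits in a few lines of Python. Every figure below is a hypothetical placeholder chosen only so the arithmetic lands inside the typical 60–75% band — none of it is drawn from the current pricing page, so substitute real numbers before using it:

```python
def break_even_utilization(t_in, t_out, r_mu, r_in, r_out):
    """Fraction of one MU's capacity you must sustain for the MU to beat on-demand.

    t_in, t_out -- MU throughput in input/output tokens per MINUTE
    r_mu        -- MU hourly cost (commit-term-adjusted), in dollars
    r_in, r_out -- on-demand rates, in dollars per token
    """
    # Hourly on-demand cost of producing the same throughput as one MU:
    hourly_on_demand_cost = (t_in * 60 * r_in) + (t_out * 60 * r_out)
    return r_mu / hourly_on_demand_cost

# Hypothetical Sonnet-class figures -- NOT current AWS pricing:
u = break_even_utilization(
    t_in=120_000, t_out=24_000,   # tokens/minute per MU (assumed)
    r_mu=28.0,                    # $/hour per MU (assumed)
    r_in=3e-6, r_out=15e-6,       # $/token on-demand (assumed)
)
print(f"break-even utilization: {u:.0%}")  # ~65% with these placeholders
```

With these made-up inputs the break-even lands near 65% — inside the 60–75% band noted above — but the whole point of the function is that the answer moves with the actual MU rate and throughput of your model.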
Formulation 2: Total tokens per month
If you only know your total monthly token volume (not the throughput shape), use:
```
break_even_monthly_tokens ≈ (R_mu × 730) / weighted_per_token_rate
```

where `weighted_per_token_rate` is the average of the input and output rates weighted by your actual input-to-output ratio, and 730 is the approximate number of hours in a month.
This formulation is less accurate than the throughput-based one because it ignores traffic shape — a workload that produces all its tokens in 4 hours per day will under-utilize Provisioned Throughput dramatically even if the monthly total looks high.
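For completeness, Formulation 2 as code. The rates fed in below are synthetic unit values used only to sanity-check the arithmetic — plug in current pricing-page figures for a real answer:

```python
def break_even_monthly_tokens(r_mu, r_in, r_out, input_fraction,
                              hours_per_month=730):
    """Monthly token volume at which one MU's cost matches on-demand.

    Ignores traffic shape entirely -- see the caveat above.
    r_mu           -- MU hourly cost, dollars
    r_in, r_out    -- on-demand rates, dollars per token
    input_fraction -- share of total tokens that are input tokens (0..1)
    """
    weighted_rate = input_fraction * r_in + (1 - input_fraction) * r_out
    return (r_mu * hours_per_month) / weighted_rate

# Sanity check with synthetic rates (not real pricing):
# doubling the MU hourly rate doubles the break-even volume.
assert break_even_monthly_tokens(2.0, 1e-6, 1e-6, 0.5) == \
       2 * break_even_monthly_tokens(1.0, 1e-6, 1e-6, 0.5)
```

The linearity in `r_mu` is the useful intuition: a 6-month commit's lower hourly rate lowers the break-even volume by exactly the same proportion.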
Five worked scenarios
To make the math concrete, here are five typical workload shapes and the right call for each. The dollar figures are illustrative for the structure of the decision — the absolute amounts will move with model pricing, but the relative call is stable.
Scenario 1: Internal AI assistant, 12M tokens/month, business hours only
A SaaS company runs an internal Q&A assistant on Claude Haiku. Token volume is 12M/month, but it concentrates in a 9-hour business-hours window across a 5-day workweek. Effective active hours per month ≈ 195.
- On-demand monthly cost: 12M tokens × Haiku weighted rate
- One MU at the smallest commit term: still pays for 730 hours/month, of which only 195 are active
Call: Stay on-demand. Even though token volume is non-trivial, the traffic shape leaves the MU idle 73% of the time. Apply Prompt Caching to the static system prompt for further savings.
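The idle-time arithmetic behind that call is worth making explicit — a provisioned MU bills every hour of the month regardless of traffic:

```python
# Scenario 1: business-hours-only traffic vs. an always-on Model Unit.
hours_per_day = 9
days_per_week = 5
weeks_per_month = 52 / 12            # ~4.33 weeks per month

active_hours = hours_per_day * days_per_week * weeks_per_month  # ~195
billed_hours = 730                   # a provisioned MU bills 24/7

best_case_utilization = active_hours / billed_hours  # ~0.27
idle_fraction = 1 - best_case_utilization            # ~0.73
```

Even if the assistant saturated the MU during every active hour — the best possible case — 73% of the hours you pay for produce nothing.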
Scenario 2: Customer-facing chatbot, 80M tokens/month, 24/7 with mild diurnal pattern
A consumer-facing AI feature on Claude Sonnet 4. Token volume is steady around 80M/month with a 2× peak-to-trough ratio.
- On-demand monthly cost: ~80M × Sonnet weighted rate
- Throughput requirement at the 70th-percentile hour: roughly 110K tokens/hour
- One MU of Sonnet 4 (1-month commit) likely delivers more than that at full utilization
Call: Provision one MU of Sonnet 4 (1-month commit) for the steady traffic, leave on-demand to absorb the spikes. Re-evaluate at the next renewal — if utilization holds, move to a 6-month commit for the lower hourly rate.
Scenario 3: RAG-heavy enterprise search, 350M tokens/month, steady 24/7
A regulated enterprise running a Knowledge Base-backed search assistant on Claude Sonnet 4. Long retrieved-context inputs (input-heavy ratio), steady 24/7 throughput.
- The input-heavy ratio means weighted per-token rate is closer to the input rate than output rate
- Steady 24/7 means MU utilization can plausibly hit 80%+
- 350M tokens/month sustained is firmly in the territory where Provisioned Throughput wins
Call: Provision two MUs of Sonnet 4 on a 6-month commit. Apply Prompt Caching to the system prompt and the RAG retrieval template — this further reduces input-token cost on top of the provisioned discount. Use on-demand only as overflow.
Scenario 4: Overnight document processing, 200M tokens/month, all between 1am and 6am
A legal-tech platform that ingests case files and generates structured summaries. All inference is asynchronous and runs in a 5-hour overnight window.
- Provisioned Throughput utilization across 24 hours: ~21% — far below break-even
- On-demand for 200M tokens is expensive
- Batch Inference matches the workload shape exactly: async, non-real-time, can tolerate hours of latency
Call: Move to Bedrock Batch Inference. The ~50% per-token discount applies to the entire 200M monthly volume with zero capacity commitment. Provisioned Throughput would be wasted on this traffic shape; Batch is the correct primitive.
Scenario 5: Custom fine-tuned model, 30M tokens/month
A healthcare SaaS fine-tuned a Claude Haiku model on internal terminology. Total volume is modest at 30M tokens/month.
- Custom-imported and fine-tuned Bedrock models do not have an on-demand tier — they are served exclusively via Provisioned Throughput
- The decision is not “Provisioned vs on-demand” — it is “Provisioned, or do not use a custom model at all”
Call: Either provision the minimum MU for the custom model (and accept the 24/7 hourly cost as the floor), or step back and re-evaluate whether RAG against a base Claude Haiku on on-demand could deliver the required quality at a fraction of the cost. We have walked many teams off fine-tuning at this volume — the operational and cost burden of a permanent provisioned MU rarely pays back below ~100M tokens/month.
For more on this trade-off, see our guide on Bedrock as the fastest path to enterprise GenAI.
The order to apply optimizations
Provisioned Throughput should not be the first lever. The cost-optimization sequence we follow in real engagements:
- Right-size the model. Measure on-demand cost per use case at Haiku-class models first. Most workloads do not need Sonnet or Opus for 60–70% of their invocations. Model selection is a 5–20× cost lever; everything below is fractional in comparison.
- Trim the prompt. System prompts above 500 tokens are usually inflated. Cut redundant instructions and verbose role-play preambles.
- Enable Prompt Caching. For long static prefixes (system prompts, RAG context, agent orchestration scaffolds), cached input tokens are roughly 10% of the on-demand rate. Free to enable, near-zero risk.
- Move async work to Batch Inference. Anything that is not real-time user-facing — overnight summarization, document classification, embedding generation, scheduled reports — belongs in Batch.
- Then, and only then, evaluate Provisioned Throughput. Run the break-even math against your measured 70th-percentile-hour utilization, not your peak.
Steps 1–4 typically cut Bedrock spend by 40–70% before any commitment to Provisioned Throughput is on the table. We have seen teams sign 6-month Provisioned Throughput commits and then realize three weeks later that Prompt Caching alone would have made the commitment redundant.
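The per-lever percentages below are assumptions for illustration, not measurements — but they show why the levers compound multiplicatively rather than add, and why steps 1–4 alone can land in the 40–70% range:

```python
# Illustrative only: each saving fraction is an assumption, not a measurement.
lever_savings = {
    "right_size_model": 0.40,   # step 1 (assumed)
    "trim_prompt":      0.10,   # step 2 (assumed)
    "prompt_caching":   0.15,   # step 3 (assumed)
    "batch_inference":  0.10,   # step 4 (assumed, async share only)
}

remaining_cost = 1.0
for saving in lever_savings.values():
    remaining_cost *= (1 - saving)   # levers apply to what is left, so they compound

combined_saving = 1 - remaining_cost  # ~0.59 with these assumptions
```

Note that the combined saving (~59% here) is less than the naive sum (75%) — each lever only shrinks the cost the previous levers left behind.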
Common mistakes when buying Provisioned Throughput
These are the patterns we see in cost reviews after a Provisioned Throughput commitment is already in place.
Provisioning for peak traffic. A workload with a 4× peak-to-trough ratio that provisions for peak utilizes the MU at 25% on average. The on-demand cost of the peak hours alone would have been cheaper than the wasted provisioned hours. Provision for the 70th-percentile-hour and let on-demand absorb the spikes.
Ignoring the model-version refresh cycle. A 6-month commit on a model that gets superseded by a better, cheaper version after month 2 is a stranded asset. Anthropic and Amazon refresh model families on roughly 6–12 month cycles; commit terms longer than 1 month should be a deliberate bet on model stability, not a default.
Double-counting Bedrock Agents costs. Agent-orchestrated invocations are billed against the underlying model’s pricing mode. If the model is on Provisioned Throughput, the agent’s token consumption draws against the provisioned MU. If you separately budget for “Agent costs” on top of foundation-model costs, you are double-counting.
Underestimating Knowledge Base baseline cost. Bedrock Knowledge Base on the OpenSearch Serverless backing store has a 2-OCU minimum at $0.24/OCU-hour — about $345/month before a single query runs. This is independent of whether the foundation model is on-demand or Provisioned Throughput. For low-query workloads under 10K queries/month, evaluate Aurora PostgreSQL with pgvector as a lower-floor alternative.
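The floor math from that paragraph, spelled out (verify the OCU rate on the current OpenSearch Serverless pricing page before relying on it):

```python
# OpenSearch Serverless floor behind a Bedrock Knowledge Base.
ocu_minimum = 2          # minimum OCUs for an active collection
ocu_hourly_rate = 0.24   # $/OCU-hour (check current pricing)
hours_per_month = 720

monthly_floor = ocu_minimum * ocu_hourly_rate * hours_per_month  # ~$345
```

This cost accrues whether zero queries or ten million queries run, which is why it dominates the bill for low-query workloads.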
Forgetting cross-region inference profiles. Cross-region inference profiles can change the effective per-token economics for some models because they unlock capacity that a single-region on-demand call would have throttled. Run the on-demand baseline with inference profiles enabled before assuming you need Provisioned Throughput for capacity reasons.
Multi-region Provisioned Throughput
Provisioned Throughput is purchased per region. Workloads with users distributed across geographies face a choice:
- Single-region provisioned, cross-region inference profiles for the rest — Provision in your primary region; let inference profiles route excess traffic to other regions on on-demand pricing. This is the cheapest setup for most workloads with a clear primary region.
- Multi-region provisioned with regional load balancing — Provision in two or three regions for genuine multi-region active-active. Adds operational complexity (capacity management per region) and ties up commitment in regions that may be lower-utilized. Worth it only when latency or data-residency requirements force per-region capacity.
- One region only, accept latency — For internal tools or non-latency-sensitive workloads, single-region Provisioned Throughput with no failover is operationally simplest. Document the failure mode in your runbook.
Treat multi-region provisioned throughput the same way you would treat multi-region RDS: only buy it when there is a real availability or latency requirement that single-region with failover cannot meet. Multi-region is rarely the cheapest answer.
Measuring before you commit
Before signing any Provisioned Throughput commitment, run a 14-day on-demand measurement. The data you need:
- `Invocations` per model per hour — from CloudWatch Bedrock metrics, partitioned by model ID
- `InputTokenCount` and `OutputTokenCount` per hour — same dimensions
- Per-hour utilization curve — `(InputTokenCount + OutputTokenCount) / (one_MU_tokens_per_minute × 60)`
Plot the per-hour utilization across the 14 days. Look at:
- Median hour — If the median is below 50%, Provisioned Throughput is unlikely to win even on a 6-month commit.
- 70th-percentile hour — This is the right size to provision. Above this, on-demand handles the spikes.
- Peak hour — Used for capacity planning, not for sizing the commitment.
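A minimal sketch of that percentile analysis over a 14-day window (336 hourly samples), using only the standard library — the sample traffic shape at the bottom is synthetic:

```python
import statistics

def utilization_summary(hourly_utilization):
    """Median, 70th-percentile, and peak hour of per-hour MU utilization.

    hourly_utilization -- one float per hour (0.0..1.0+), e.g. 336 values
    for a 14-day measurement window.
    """
    s = sorted(hourly_utilization)
    p70 = s[round(0.70 * (len(s) - 1))]   # nearest-rank 70th percentile
    return {
        "median": statistics.median(s),   # below 0.5 -> PT unlikely to win
        "p70": p70,                       # size the commitment to this
        "peak": s[-1],                    # capacity planning only
    }

# Synthetic shape: 14 days alternating quiet nights and busy days.
sample = [0.2] * 168 + [0.8] * 168
summary = utilization_summary(sample)
```

With this synthetic shape the median sits exactly at the 50% boundary while the 70th-percentile hour is at 80% — a reminder that the two statistics answer different questions (whether to commit vs. how much to commit).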
This 14-day exercise also surfaces the data you need for the cheaper optimizations:
- Hours with high prefix-overlap rate are Prompt Caching candidates
- Hours dominated by asynchronous batch jobs are Batch Inference candidates
- Hours where one model is invoked at high volume but a smaller model would have sufficed are model-selection candidates
We run this measurement as part of every Cloud Cost Optimization engagement that touches GenAI workloads. The output is a single dashboard that surfaces which optimization to apply in which hour-block.
When Provisioned Throughput is the right call
To summarize, Provisioned Throughput is the right answer when all of these are true:
- Steady-state traffic that sustains 60%+ of one MU’s capacity for the majority of hours
- A 1-month or 6-month commitment is acceptable given your model-refresh tolerance
- The cheaper optimizations (right-sizing the model, trimming prompts, Prompt Caching, Batch Inference) have been applied or are not applicable
- The workload uses a custom-imported or fine-tuned model (in which case the choice is forced)
Below those bars, on-demand plus the cheaper optimizations is almost always the right answer. The most common pattern at scale is Provisioned Throughput for the predictable steady-state, on-demand for the spikes, Batch for the async work, and Prompt Caching applied across all of the above — not one mode picked exclusively.
What about Claude Platform on AWS?
While this post was being written, AWS announced Claude Platform on AWS — a new path that delivers Anthropic’s native Claude experience billed and authenticated through your AWS account. IAM handles access, billing rolls into your existing AWS invoice, and CloudTrail logs every call. There is no separate Anthropic account or API key.
The early reaction in the engineering community captures the appeal honestly. As one practitioner put it: “Bedrock is fine but it always felt like one layer too many when all you want is Claude with decent enterprise controls.” The simplification argument is real — for teams that wanted Claude on AWS billing without Bedrock’s wrapper, this is a more direct path.
But the official AWS page is explicit about the trade-off: customer data on Claude Platform is processed by Anthropic outside the AWS boundary. Bedrock keeps data within AWS infrastructure; Claude Platform does not. For regulated workloads — HIPAA-eligible healthcare, PCI cardholder data, FedRAMP environments, EU data-residency obligations — that single line moves the decision back to Bedrock unconditionally. The Provisioned Throughput math in this guide still applies; the alternative path simply is not available for those workloads.
For non-residency-constrained workloads (internal productivity tools, public marketing copy generation, R&D), Claude Platform on AWS may turn out to be the simpler operational answer once pricing is published. As of writing, the page lists the service as “Coming Soon” with no pricing details — so any direct cost comparison is premature. We will update this post when pricing and regional availability are announced.
The practical effect of the announcement on the Provisioned Throughput decision is to sharpen it, not weaken it: if you have already concluded that data-residency forces you onto Bedrock, the only remaining question is whether your sustained traffic justifies a commitment. That is the math this post is here to answer.
Where this fits in your AWS cost program
Bedrock is rarely the largest line item on an AWS bill — but it is the fastest-growing one across our portfolio in 2026, and the one most likely to surprise finance leaders quarter-over-quarter. The teams that handle this well treat Bedrock cost the same way they treat compute and database cost: with a measured baseline, an explicit optimization sequence, and commitments sized to the 70th-percentile-hour rather than the peak.
If you are at the point of evaluating Provisioned Throughput, you are in the territory where a structured FinOps review usually pays back inside the first month. Our Cloud Cost Optimization and Bedrock consulting engagements include this measurement and the resulting commitment plan. If you would rather work the numbers yourself, the AWS Bedrock Token Cost Calculator is a starting point — and the existing Bedrock cost optimization guide covers the optimization levers that come before any commitment.
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.