AWS Bedrock Cost Optimization: Token Budgets, Model Selection, and Inference Profiles

Quick summary: Bedrock billing is not a single line item — it is a composition of model invocation costs, Knowledge Base retrieval, Agent orchestration, Guardrails evaluation, and cross-region inference profile routing. Each component has its own pricing model and its own set of cost traps.


Amazon Bedrock billing has a structure most teams do not fully model before their first production deployment. The per-token cost for a single model invocation is the visible line item. The real cost driver is the composition of invocations your architecture produces — how many tokens flow through Bedrock per user action, across how many model calls, with what supporting services alongside.

This guide covers the cost model for each Bedrock component and the practical decisions that reduce Bedrock spend without reducing application quality.

How Bedrock Pricing Works

Bedrock charges on a per-token basis for model invocations. The pricing structure has two components:

  • Input tokens — The tokens in your prompt (system prompt + user message + any context injected)
  • Output tokens — The tokens the model generates in its response

Output tokens are consistently more expensive than input tokens — typically 3–5× per token — because generation requires more compute than encoding. This asymmetry has architectural implications: architectures that generate long outputs are more expensive per invocation than architectures that inject large contexts but receive structured short responses.

Example cost calculation (Claude Haiku 3.5):

A customer service chatbot receives a message, retrieves 2,000 tokens of context from a Knowledge Base, and generates a 150-token reply:

  • Input tokens: ~100 (system prompt) + 30 (user message) + 2,000 (retrieved context) = ~2,130 input tokens
  • Output tokens: ~150
  • Cost per invocation: (2,130 × input rate) + (150 × output rate)

At this message structure, the retrieved context dominates input cost. Injecting 2,000 tokens of context is significantly more expensive than the 30-token user message. This makes context efficiency — retrieving only the most relevant chunks, not the most tokens — a primary cost lever.
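As a sketch, the calculation above in Python — the per-token rates here are placeholder numbers for illustration, not published prices; pull current Claude Haiku 3.5 rates for your region from the Bedrock pricing page:

```python
# Placeholder per-token rates (assumed, not published prices).
INPUT_RATE_PER_1K = 0.0008   # USD per 1,000 input tokens (assumed)
OUTPUT_RATE_PER_1K = 0.004   # USD per 1,000 output tokens (assumed, ~5x input)

def invocation_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one invocation: input and output tokens bill at different rates."""
    return (input_tokens / 1000) * INPUT_RATE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_RATE_PER_1K

# The chatbot example above: 2,130 input tokens, 150 output tokens.
per_call = invocation_cost(2_130, 150)

# Of the input tokens, the 2,000 retrieved-context tokens are ~94% --
# which is why context efficiency is the lever that matters here.
context_fraction = 2_000 / 2_130
```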

Model Selection: The Biggest Cost Variable

Model choice determines the per-token rate, and per-token rates vary by an order of magnitude across the Bedrock model catalog.

The Selection Framework

  • Summarization, classification, extraction → Claude Haiku 3.5 — high accuracy, lowest cost, fast
  • Structured output generation (JSON, tables) → Claude Haiku 3.5 — reliable structure at minimal cost
  • RAG-based Q&A with injected context → Claude Haiku 3.5 — context handling is strong; reasoning not required
  • Multi-step reasoning, complex analysis → Claude Sonnet 4.x — reasoning quality justifies the cost step-up
  • Code generation, technical documentation → Claude Sonnet 4.x — code quality requires stronger reasoning
  • Long-form content, nuanced judgment → Claude Sonnet 4.x — quality-sensitive; cost is secondary
  • Maximum reasoning quality, critical decisions → Claude Opus 4.x — reserved for tasks where errors are expensive

The practical rule: Start with Haiku for every use case. Evaluate quality on a representative sample. Escalate to Sonnet only if Haiku quality is insufficient for the specific task. For most organizations, 60–70% of Bedrock invocations can run on Haiku with acceptable output quality.

Routing by Task Complexity

For applications that serve multiple task types, implement model routing that selects the appropriate model per task:

def select_model(task_type: str, estimated_complexity: str) -> str:
    """Return the Bedrock model ID for a (task type, complexity) pair.

    Model IDs here are illustrative; substitute the exact IDs from the
    Bedrock model catalog in your region.
    """
    routing = {
        ("extraction", "low"): "anthropic.claude-haiku-3-5",
        ("summarization", "low"): "anthropic.claude-haiku-3-5",
        ("summarization", "high"): "anthropic.claude-sonnet-4-x",
        ("reasoning", "high"): "anthropic.claude-sonnet-4-x",
        ("critical_decision", "high"): "anthropic.claude-opus-4-x",
    }
    # Default to the cheapest model for any unmapped combination
    return routing.get((task_type, estimated_complexity), "anthropic.claude-haiku-3-5")

A routing layer adds minimal latency and can cut Bedrock costs by 40–60% compared to using a single high-capability model for all tasks.

Token Budget Management

System Prompt Efficiency

System prompts are injected on every invocation. A 2,000-token system prompt costs 2,000 × input rate on every call, regardless of what the user asks. System prompt inflation is the most common source of avoidable token cost.

Common system prompt inefficiencies:

  • Repeated instructions that are already part of the model’s base behavior
  • Long role-play preambles when a shorter version produces the same behavior
  • Full policy documents injected as context when only relevant sections are needed
  • Verbose formatting instructions that could be condensed

Target: System prompts under 500 tokens for most use cases. Benchmark your current system prompt by measuring output quality with progressively shorter versions. Most models maintain 95%+ of quality at 50% prompt length.
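One way to run that benchmark is sketched below. `eval_fn` is your own quality metric (a hypothetical callable here, e.g. pass rate on a labeled sample), and character truncation is only a crude stand-in for hand-edited shorter prompts — but it gives a first signal:

```python
def benchmark_prompt_lengths(system_prompt: str, eval_fn,
                             fractions=(1.0, 0.75, 0.5, 0.25)) -> dict:
    """Score output quality as the system prompt is progressively shortened.

    eval_fn(prompt) -> float is caller-supplied: run your eval set through
    the model with the trimmed prompt and return a quality score.
    """
    results = {}
    for frac in fractions:
        trimmed = system_prompt[: int(len(system_prompt) * frac)]
        results[frac] = eval_fn(trimmed)
    return results
```

If quality at 0.5 is within a few points of quality at 1.0, the shorter prompt wins on every invocation thereafter.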

Context Window Cost vs Quality Trade-off

For RAG-based applications, the number of retrieved chunks directly determines input token cost. More chunks = more context = higher accuracy (up to a point) = higher cost.

The optimal chunk count for cost efficiency is typically lower than the maximum the context window allows:

  • Test your application’s accuracy at 3, 5, 10, and 15 retrieved chunks
  • Measure accuracy gain vs cost increase at each step
  • Most RAG applications reach diminishing returns at 5–7 chunks; beyond that, accuracy gains are marginal while cost increases linearly
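The sweep can be summarized with a small helper that surfaces accuracy gained per added chunk. The measurements and the 400-token chunk size below are hypothetical — substitute your own eval results and your Knowledge Base's actual chunking:

```python
def marginal_gains(accuracy_by_chunks: dict, tokens_per_chunk: int = 400) -> list:
    """For each step up in retrieved-chunk count, report the accuracy gained
    and the extra input tokens paid per invocation. The knee of the curve is
    where gain per extra token collapses."""
    counts = sorted(accuracy_by_chunks)
    steps = []
    for prev, curr in zip(counts, counts[1:]):
        gain = accuracy_by_chunks[curr] - accuracy_by_chunks[prev]
        extra_tokens = (curr - prev) * tokens_per_chunk
        steps.append((curr, round(gain, 4), extra_tokens))
    return steps

# Hypothetical measurements from the 3/5/10/15-chunk test:
measured = {3: 0.81, 5: 0.88, 10: 0.90, 15: 0.905}
```

With these numbers, going from 3 to 5 chunks buys 7 points of accuracy for 800 tokens; going from 10 to 15 buys half a point for 2,000 tokens — the classic diminishing-returns signature.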

Caching Repeated Context

Bedrock Prompt Caching (available for supported models) caches prompt prefixes across invocations. If your system prompt + static context is the same for every user in a session, cached tokens are significantly cheaper than re-processing them on every call.

When caching helps most: Applications with a fixed, long system prompt + static knowledge base context that are reused across many invocations. Agentic workflows where multiple model calls share the same initial context.
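A minimal sketch of placing a cache checkpoint via the Converse API — the `cachePoint` block marks the end of the cacheable prefix. Verify that your chosen model supports prompt caching, and treat the model ID as a placeholder:

```python
def build_cached_request(model_id: str, system_text: str, user_message: str) -> dict:
    """Converse API request with a cache checkpoint after the static system
    prompt, so repeat invocations reuse the cached prefix instead of
    re-processing it on every call."""
    return {
        "modelId": model_id,
        "system": [
            {"text": system_text},
            {"cachePoint": {"type": "default"}},  # everything above is cacheable
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_message}]},
        ],
    }

# client = boto3.client("bedrock-runtime")
# response = client.converse(**build_cached_request(MODEL_ID, SYSTEM_PROMPT, msg))
```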

Knowledge Base Cost Model

Bedrock Knowledge Base has charges beyond model invocations:

  • Vector storage — OpenSearch Serverless backing store has a minimum of 2 OCUs ($0.24/OCU-hour = ~$346/month minimum for the 2-OCU floor)
  • Retrieval queries — Per-query charge for each Knowledge Base lookup
  • Embedding generation — Tokens processed to generate embeddings during ingestion

The OpenSearch Serverless minimum is the most commonly missed Bedrock cost. A Knowledge Base that does 100 queries per day has a small query cost, but the underlying OpenSearch Serverless cluster costs $346/month minimum regardless of query volume. For low-query-volume applications, this floor cost dominates.
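The floor cost quoted above is straightforward to reproduce from the rates:

```python
OCU_HOURLY = 0.24      # $/OCU-hour for OpenSearch Serverless (from above)
MIN_OCUS = 2           # the minimum OCU floor
HOURS_PER_MONTH = 720  # 30-day month

monthly_floor = MIN_OCUS * OCU_HOURLY * HOURS_PER_MONTH
# ~$345.60/month before a single query is served
```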

Cost-reducing alternatives:

  • Aurora PostgreSQL with pgvector — Pay only for database instance cost, no minimum floor. Correct for low-to-medium query volumes.
  • Amazon OpenSearch Service (managed clusters) — For high-volume production workloads where the per-hour node cost is predictable and lower than Serverless at scale.
  • DynamoDB — not currently a supported Knowledge Base backing store, so not an option today; listed only as one to watch given AWS's trajectory.

For most Knowledge Base use cases under 10,000 queries/month, Aurora PostgreSQL with pgvector has a lower total cost than OpenSearch Serverless.

Bedrock Agents: Orchestration Cost

Bedrock Agents add an orchestration layer on top of model invocations. Each agentic loop iteration (user input → model call → action → model call → response) generates multiple model invocations. The total token cost of an agentic interaction is multiplicative relative to a single-turn Q&A.

Cost drivers unique to Agents:

  • Multi-turn orchestration — Each planning step, action call, and response synthesis is a separate model invocation
  • Large system prompts — Agents inject substantial orchestration context into every model call
  • Action result injection — Tool call results (API responses, code execution output) are injected back into the model context

Token multiplier effect: A user action that requires 3 agent steps (plan, execute, synthesize) consumes roughly 3× the tokens of a single-turn response to the same question, plus the overhead of the agent orchestration prompt.

For applications where agentic orchestration is optional, provide a direct RAG path for simple queries and only route to the agentic path for queries that genuinely require multi-step reasoning or tool use. Most “does this need an agent?” classification can be done with a cheap Haiku call.
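That pre-classification step might look like the sketch below. The model ID, prompt wording, and YES/NO protocol are illustrative, not a published recipe; the client is injectable so the function can be exercised without AWS access:

```python
def needs_agent(user_query: str, client=None) -> bool:
    """Route a query: True -> agentic path, False -> direct RAG path.
    Uses a single cheap classification call."""
    if client is None:  # lazy import so a stub client can be injected in tests
        import boto3
        client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId="anthropic.claude-haiku-3-5",  # placeholder ID
        messages=[{"role": "user", "content": [{"text": (
            "Does answering this question require multi-step planning or "
            "tool use? Reply with exactly YES or NO.\n\n" + user_query)}]}],
        inferenceConfig={"maxTokens": 5, "temperature": 0},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    return answer.strip().upper().startswith("YES")
```

The classification call itself costs a handful of tokens — cheap insurance against sending a one-hop lookup through a 3x-token agentic loop.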

Guardrails: Evaluation Cost

Bedrock Guardrails evaluates content against your configured policies (content filters, sensitive information redaction, topic denial, grounding check). Each evaluation is charged per text unit.

Guardrails cost optimization:

  • Apply Guardrails only to user-facing inputs and outputs, not to intermediate agentic steps where the content is not directly user-controlled
  • For grounding checks (hallucination detection), scope to final responses only — checking every intermediate step multiplies cost without improving user safety
  • Test whether content filters at lower sensitivity levels provide sufficient protection before enabling maximum sensitivity, as over-filtering increases the reject-and-retry loop cost

Cross-Region Inference Profiles

Cross-region inference profiles route Bedrock API calls to whichever supported region has available capacity for the requested model. They provide:

  • Higher effective throughput limits — Your requests distribute across regions rather than competing for single-region capacity
  • Reduced throttling errors — Fewer requests hit ThrottlingException during peak periods
  • Automatic failover — If one region has a model outage, requests route to another automatically

Cost implications: For some models, inference profiles offer the same per-token pricing as single-region on-demand calls. For others, the profile pricing is slightly different — verify against the Bedrock pricing page for your specific model before assuming parity. The primary value is availability and throughput, with cost being secondary.

When to use inference profiles: Any production application with sustained throughput requirements or strict latency SLAs. The cost of throttling errors (retries, user-visible failures) often exceeds any pricing difference.

Monitoring Bedrock Costs

CloudWatch Metrics to Track

  • InvocationLatency — P99 latency per model; slow responses may indicate model degradation or routing issues
  • InputTokenCount / OutputTokenCount — Track per-model, per-application to catch prompt inflation
  • InvocationClientErrors / InvocationThrottles — Throttle rate indicates whether capacity is becoming a cost risk
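Pulling daily token totals for one model might look like this; the `AWS/Bedrock` namespace and `ModelId` dimension match current CloudWatch documentation, but verify against your account before relying on it:

```python
from datetime import datetime, timedelta, timezone

def token_metric_query(model_id: str, metric: str = "InputTokenCount",
                       days: int = 7) -> dict:
    """Parameters for cloudwatch.get_metric_statistics: daily token sums
    for one Bedrock model over the last `days` days."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": metric,
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": 86_400,        # one datapoint per day
        "Statistics": ["Sum"],
    }

# cw = boto3.client("cloudwatch")
# daily = cw.get_metric_statistics(**token_metric_query("anthropic.claude-haiku-3-5"))
```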

Cost Attribution

Tag Bedrock API calls using the bedrock:InferenceProfileARN or application-level tags to attribute costs per feature, team, or customer tier. Without tagging, Bedrock costs appear as a single line item in Cost Explorer, making it impossible to determine which feature or use case is driving spend.
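Once those tags are activated as cost-allocation tags in Billing preferences, the per-feature breakdown can come straight from Cost Explorer. A sketch of the query — the `feature` tag key is an example, not a convention:

```python
def bedrock_cost_by_tag(start: str, end: str, tag_key: str = "feature") -> dict:
    """Parameters for ce.get_cost_and_usage: Bedrock spend grouped by a
    cost-allocation tag. Dates are ISO strings, e.g. '2026-01-01'."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}},
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }

# ce = boto3.client("ce")
# report = ce.get_cost_and_usage(**bedrock_cost_by_tag("2026-01-01", "2026-02-01"))
```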

Estimating Costs Before Production

Before deploying a new Bedrock-backed feature, run a cost estimate:

  1. Sample 100 representative user inputs for the feature
  2. Run them through the intended model and record input/output token counts
  3. Calculate average tokens per invocation
  4. Multiply by expected daily invocation volume
  5. Apply current pricing to get estimated daily and monthly cost

This 30-minute exercise catches architectures with inadvertently high token costs before they hit production billing.
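Steps 3–5 reduce to a few lines once the sample run is recorded; the rates are arguments so you plug in current pricing rather than hard-coding it:

```python
def estimate_monthly_cost(samples: list, daily_invocations: int,
                          input_rate_per_1k: float,
                          output_rate_per_1k: float) -> float:
    """samples: (input_tokens, output_tokens) pairs from the representative run.
    Returns estimated monthly spend at the given rates (30-day month)."""
    avg_in = sum(i for i, _ in samples) / len(samples)
    avg_out = sum(o for _, o in samples) / len(samples)
    per_call = (avg_in / 1000) * input_rate_per_1k \
             + (avg_out / 1000) * output_rate_per_1k
    return per_call * daily_invocations * 30
```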

Getting Started

For organizations deploying their first production Bedrock workloads, the cost optimization sequence is:

  1. Start with Haiku for all use cases — validate quality before upgrading models
  2. Keep system prompts under 500 tokens — measure impact of prompt reduction
  3. Use Aurora pgvector over OpenSearch Serverless for Knowledge Base unless query volume justifies the $346/month floor
  4. Enable cross-region inference profiles for production workloads
  5. Tag all Bedrock calls by feature and team for attribution

For AWS Bedrock architecture and implementation consulting, including cost-aware RAG design, agent orchestration patterns, and Bedrock deployment on your AWS account, talk to our team.

Contact us about Bedrock cost optimization →

Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.

