AWS Bedrock Cost Optimization: Token Budgets, Model Selection, and Inference Profiles

Quick summary: Bedrock billing is not a single line item — it is a composition of model invocation costs, Knowledge Base retrieval, Agent orchestration, Guardrails evaluation, and cross-region inference profile routing. Each component has its own pricing model and its own set of cost traps.


Amazon Bedrock billing has a structure most teams do not fully model before their first production deployment. The per-token cost for a single model invocation is the visible line item. The real cost driver is the composition of invocations your architecture produces — how many tokens flow through Bedrock per user action, across how many model calls, with what supporting services alongside.

This guide covers the cost model for each Bedrock component and the practical decisions that reduce Bedrock spend without reducing application quality.

How Bedrock Pricing Works

Bedrock charges on a per-token basis for model invocations. The pricing structure has two components:

  • Input tokens — The tokens in your prompt (system prompt + user message + any context injected)
  • Output tokens — The tokens the model generates in its response

Output tokens are consistently more expensive than input tokens — typically 3–5× per token — because generation requires more compute than encoding. This asymmetry has architectural implications: architectures that generate long outputs are more expensive per invocation than architectures that inject large contexts but receive structured short responses.

Example cost calculation (Claude Haiku 3.5):

A customer service chatbot receives a message, retrieves 2,000 tokens of context from a Knowledge Base, and generates a 150-token reply:

  • Input tokens: ~100 (system prompt) + 30 (user message) + 2,000 (retrieved context) = ~2,130 input tokens
  • Output tokens: ~150
  • Cost per invocation: (2,130 × input rate) + (150 × output rate)

At this message structure, the retrieved context dominates input cost. Injecting 2,000 tokens of context is significantly more expensive than the 30-token user message. This makes context efficiency — retrieving only the most relevant chunks, not the most tokens — a primary cost lever.
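As a sketch, the calculation above in Python — the per-token rates here are placeholder numbers for illustration, not published prices; pull current Claude Haiku 3.5 rates for your region from the Bedrock pricing page:

```python
# Placeholder per-token rates (assumed, not published prices).
INPUT_RATE_PER_1K = 0.0008   # USD per 1,000 input tokens (assumed)
OUTPUT_RATE_PER_1K = 0.004   # USD per 1,000 output tokens (assumed, ~5x input)

def invocation_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one invocation: input and output tokens bill at different rates."""
    return (input_tokens / 1000) * INPUT_RATE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_RATE_PER_1K

# The chatbot example above: 2,130 input tokens, 150 output tokens.
per_call = invocation_cost(2_130, 150)

# Of the input tokens, the 2,000 retrieved-context tokens are ~94% --
# which is why context efficiency is the lever that matters here.
context_fraction = 2_000 / 2_130
```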

Model Selection: The Biggest Cost Variable

Model choice determines the per-token rate, and per-token rates vary by an order of magnitude across the Bedrock model catalog.

The Selection Framework

  • Summarization, classification, extraction → Claude Haiku 3.5 — high accuracy, lowest cost, fast
  • Structured output generation (JSON, tables) → Claude Haiku 3.5 — reliable structure at minimal cost
  • RAG-based Q&A with injected context → Claude Haiku 3.5 — context handling is strong; reasoning not required
  • Multi-step reasoning, complex analysis → Claude Sonnet 4.x — reasoning quality justifies the cost step-up
  • Code generation, technical documentation → Claude Sonnet 4.x — code quality requires stronger reasoning
  • Long-form content, nuanced judgment → Claude Sonnet 4.x — quality-sensitive; cost is secondary
  • Maximum reasoning quality, critical decisions → Claude Opus 4.x — reserved for tasks where errors are expensive

The practical rule: Start with Haiku for every use case. Evaluate quality on a representative sample. Escalate to Sonnet only if Haiku quality is insufficient for the specific task. For most organizations, 60–70% of Bedrock invocations can run on Haiku with acceptable output quality.

Routing by Task Complexity

For applications that serve multiple task types, implement model routing that selects the appropriate model per task:

def select_model(task_type: str, estimated_complexity: str) -> str:
    """Return the Bedrock model ID for a (task type, complexity) pair.

    Model IDs here are illustrative; substitute the exact IDs from the
    Bedrock model catalog in your region.
    """
    routing = {
        ("extraction", "low"): "anthropic.claude-haiku-3-5",
        ("summarization", "low"): "anthropic.claude-haiku-3-5",
        ("summarization", "high"): "anthropic.claude-sonnet-4-x",
        ("reasoning", "high"): "anthropic.claude-sonnet-4-x",
        ("critical_decision", "high"): "anthropic.claude-opus-4-x",
    }
    # Default to the cheapest model for any unmapped combination
    return routing.get((task_type, estimated_complexity), "anthropic.claude-haiku-3-5")

A routing layer adds minimal latency and can cut Bedrock costs by 40–60% compared to using a single high-capability model for all tasks.

Token Budget Management

System Prompt Efficiency

System prompts are injected on every invocation. A 2,000-token system prompt costs 2,000 × input rate on every call, regardless of what the user asks. System prompt inflation is the most common source of avoidable token cost.

Common system prompt inefficiencies:

  • Repeated instructions that are already part of the model’s base behavior
  • Long role-play preambles when a shorter version produces the same behavior
  • Full policy documents injected as context when only relevant sections are needed
  • Verbose formatting instructions that could be condensed

Target: System prompts under 500 tokens for most use cases. Benchmark your current system prompt by measuring output quality with progressively shorter versions. Most models maintain 95%+ of quality at 50% prompt length.
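One way to run that benchmark is sketched below. `eval_fn` is your own quality metric (a hypothetical callable here, e.g. pass rate on a labeled sample), and character truncation is only a crude stand-in for hand-edited shorter prompts — but it gives a first signal:

```python
def benchmark_prompt_lengths(system_prompt: str, eval_fn,
                             fractions=(1.0, 0.75, 0.5, 0.25)) -> dict:
    """Score output quality as the system prompt is progressively shortened.

    eval_fn(prompt) -> float is caller-supplied: run your eval set through
    the model with the trimmed prompt and return a quality score.
    """
    results = {}
    for frac in fractions:
        trimmed = system_prompt[: int(len(system_prompt) * frac)]
        results[frac] = eval_fn(trimmed)
    return results
```

If quality at 0.5 is within a few points of quality at 1.0, the shorter prompt wins on every invocation thereafter.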

Context Window Cost vs Quality Trade-off

For RAG-based applications, the number of retrieved chunks directly determines input token cost. More chunks = more context = higher accuracy (up to a point) = higher cost.

The optimal chunk count for cost efficiency is typically lower than the maximum the context window allows:

  • Test your application’s accuracy at 3, 5, 10, and 15 retrieved chunks
  • Measure accuracy gain vs cost increase at each step
  • Most RAG applications reach diminishing returns at 5–7 chunks; beyond that, accuracy gains are marginal while cost increases linearly
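The sweep can be summarized with a small helper that surfaces accuracy gained per added chunk. The measurements and the 400-token chunk size below are hypothetical — substitute your own eval results and your Knowledge Base's actual chunking:

```python
def marginal_gains(accuracy_by_chunks: dict, tokens_per_chunk: int = 400) -> list:
    """For each step up in retrieved-chunk count, report the accuracy gained
    and the extra input tokens paid per invocation. The knee of the curve is
    where gain per extra token collapses."""
    counts = sorted(accuracy_by_chunks)
    steps = []
    for prev, curr in zip(counts, counts[1:]):
        gain = accuracy_by_chunks[curr] - accuracy_by_chunks[prev]
        extra_tokens = (curr - prev) * tokens_per_chunk
        steps.append((curr, round(gain, 4), extra_tokens))
    return steps

# Hypothetical measurements from the 3/5/10/15-chunk test:
measured = {3: 0.81, 5: 0.88, 10: 0.90, 15: 0.905}
```

With these numbers, going from 3 to 5 chunks buys 7 points of accuracy for 800 tokens; going from 10 to 15 buys half a point for 2,000 tokens — the classic diminishing-returns signature.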

Caching Repeated Context

Bedrock Prompt Caching (available for supported models) caches prompt prefixes across invocations. If your system prompt + static context is the same for every user in a session, cached tokens are significantly cheaper than re-processing them on every call.

When caching helps most: Applications with a fixed, long system prompt + static knowledge base context that are reused across many invocations. Agentic workflows where multiple model calls share the same initial context.
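A minimal sketch of placing a cache checkpoint via the Converse API — the `cachePoint` block marks the end of the cacheable prefix. Verify that your chosen model supports prompt caching, and treat the model ID as a placeholder:

```python
def build_cached_request(model_id: str, system_text: str, user_message: str) -> dict:
    """Converse API request with a cache checkpoint after the static system
    prompt, so repeat invocations reuse the cached prefix instead of
    re-processing it on every call."""
    return {
        "modelId": model_id,
        "system": [
            {"text": system_text},
            {"cachePoint": {"type": "default"}},  # everything above is cacheable
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_message}]},
        ],
    }

# client = boto3.client("bedrock-runtime")
# response = client.converse(**build_cached_request(MODEL_ID, SYSTEM_PROMPT, msg))
```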

Knowledge Base Cost Model

Bedrock Knowledge Base has charges beyond model invocations:

  • Vector storage — OpenSearch Serverless backing store has a minimum of 2 OCUs ($0.24/OCU-hour = ~$346/month minimum for the 2-OCU floor)
  • Retrieval queries — Per-query charge for each Knowledge Base lookup
  • Embedding generation — Tokens processed to generate embeddings during ingestion

The OpenSearch Serverless minimum is the most commonly missed Bedrock cost. A Knowledge Base that does 100 queries per day has a small query cost, but the underlying OpenSearch Serverless cluster costs $346/month minimum regardless of query volume. For low-query-volume applications, this floor cost dominates.
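The floor cost quoted above is straightforward to reproduce from the rates:

```python
OCU_HOURLY = 0.24      # $/OCU-hour for OpenSearch Serverless (from above)
MIN_OCUS = 2           # the minimum OCU floor
HOURS_PER_MONTH = 720  # 30-day month

monthly_floor = MIN_OCUS * OCU_HOURLY * HOURS_PER_MONTH
# ~$345.60/month before a single query is served
```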

Cost-reducing alternatives:

  • Aurora PostgreSQL with pgvector — Pay only for database instance cost, no minimum floor. Correct for low-to-medium query volumes.
  • Amazon OpenSearch Service (managed clusters) — For high-volume production workloads where the per-hour node cost is predictable and lower than Serverless at scale.
  • DynamoDB — not currently a supported Knowledge Base backing store, so not an option today; listed only as one to watch given AWS's trajectory.

For most Knowledge Base use cases under 10,000 queries/month, Aurora PostgreSQL with pgvector has a lower total cost than OpenSearch Serverless.

Bedrock Agents: Orchestration Cost

Bedrock Agents add an orchestration layer on top of model invocations. Each agentic loop iteration (user input → model call → action → model call → response) generates multiple model invocations. The total token cost of an agentic interaction is multiplicative relative to a single-turn Q&A.

Cost drivers unique to Agents:

  • Multi-turn orchestration — Each planning step, action call, and response synthesis is a separate model invocation
  • Large system prompts — Agents inject substantial orchestration context into every model call
  • Action result injection — Tool call results (API responses, code execution output) are injected back into the model context

Token multiplier effect: A user action that requires 3 agent steps (plan, execute, synthesize) consumes roughly 3× the tokens of a single-turn response to the same question, plus the overhead of the agent orchestration prompt.

For applications where agentic orchestration is optional, provide a direct RAG path for simple queries and only route to the agentic path for queries that genuinely require multi-step reasoning or tool use. Most “does this need an agent?” classification can be done with a cheap Haiku call.
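That pre-classification step might look like the sketch below. The model ID, prompt wording, and YES/NO protocol are illustrative, not a published recipe; the client is injectable so the function can be exercised without AWS access:

```python
def needs_agent(user_query: str, client=None) -> bool:
    """Route a query: True -> agentic path, False -> direct RAG path.
    Uses a single cheap classification call."""
    if client is None:  # lazy import so a stub client can be injected in tests
        import boto3
        client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId="anthropic.claude-haiku-3-5",  # placeholder ID
        messages=[{"role": "user", "content": [{"text": (
            "Does answering this question require multi-step planning or "
            "tool use? Reply with exactly YES or NO.\n\n" + user_query)}]}],
        inferenceConfig={"maxTokens": 5, "temperature": 0},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    return answer.strip().upper().startswith("YES")
```

The classification call itself costs a handful of tokens — cheap insurance against sending a one-hop lookup through a 3x-token agentic loop.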

Guardrails: Evaluation Cost

Bedrock Guardrails evaluates content against your configured policies (content filters, sensitive information redaction, topic denial, grounding check). Each evaluation is charged per text unit.

Guardrails cost optimization:

  • Apply Guardrails only to user-facing inputs and outputs, not to intermediate agentic steps where the content is not directly user-controlled
  • For grounding checks (hallucination detection), scope to final responses only — checking every intermediate step multiplies cost without improving user safety
  • Test whether content filters at lower sensitivity levels provide sufficient protection before enabling maximum sensitivity, as over-filtering increases the reject-and-retry loop cost

Cross-Region Inference Profiles

Cross-region inference profiles route Bedrock API calls to whichever supported region has available capacity for the requested model. They provide:

  • Higher effective throughput limits — Your requests distribute across regions rather than competing for single-region capacity
  • Reduced throttling errors — Fewer requests hit ThrottlingException during peak periods
  • Automatic failover — If one region has a model outage, requests route to another automatically

Cost implications: For some models, inference profiles offer the same per-token pricing as single-region on-demand calls. For others, the profile pricing is slightly different — verify against the Bedrock pricing page for your specific model before assuming parity. The primary value is availability and throughput, with cost being secondary.

When to use inference profiles: Any production application with sustained throughput requirements or strict latency SLAs. The cost of throttling errors (retries, user-visible failures) often exceeds any pricing difference.

Monitoring Bedrock Costs

CloudWatch Metrics to Track

  • InvocationLatency — P99 latency per model; slow responses may indicate model degradation or routing issues
  • InputTokenCount / OutputTokenCount — Track per-model, per-application to catch prompt inflation
  • InvocationClientErrors / InvocationThrottles — Throttle rate indicates whether capacity is becoming a cost risk
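Pulling daily token totals for one model might look like this; the `AWS/Bedrock` namespace and `ModelId` dimension match current CloudWatch documentation, but verify against your account before relying on it:

```python
from datetime import datetime, timedelta, timezone

def token_metric_query(model_id: str, metric: str = "InputTokenCount",
                       days: int = 7) -> dict:
    """Parameters for cloudwatch.get_metric_statistics: daily token sums
    for one Bedrock model over the last `days` days."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": metric,
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": 86_400,        # one datapoint per day
        "Statistics": ["Sum"],
    }

# cw = boto3.client("cloudwatch")
# daily = cw.get_metric_statistics(**token_metric_query("anthropic.claude-haiku-3-5"))
```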

Cost Attribution

Tag Bedrock API calls using the bedrock:InferenceProfileARN or application-level tags to attribute costs per feature, team, or customer tier. Without tagging, Bedrock costs appear as a single line item in Cost Explorer, making it impossible to determine which feature or use case is driving spend.
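Once those tags are activated as cost-allocation tags in Billing preferences, the per-feature breakdown can come straight from Cost Explorer. A sketch of the query — the `feature` tag key is an example, not a convention:

```python
def bedrock_cost_by_tag(start: str, end: str, tag_key: str = "feature") -> dict:
    """Parameters for ce.get_cost_and_usage: Bedrock spend grouped by a
    cost-allocation tag. Dates are ISO strings, e.g. '2026-01-01'."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}},
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }

# ce = boto3.client("ce")
# report = ce.get_cost_and_usage(**bedrock_cost_by_tag("2026-01-01", "2026-02-01"))
```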

Estimating Costs Before Production

Before deploying a new Bedrock-backed feature, run a cost estimate:

  1. Sample 100 representative user inputs for the feature
  2. Run them through the intended model and record input/output token counts
  3. Calculate average tokens per invocation
  4. Multiply by expected daily invocation volume
  5. Apply current pricing to get estimated daily and monthly cost

This 30-minute exercise catches architectures with inadvertently high token costs before they hit production billing.
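Steps 3–5 reduce to a few lines once the sample run is recorded; the rates are arguments so you plug in current pricing rather than hard-coding it:

```python
def estimate_monthly_cost(samples: list, daily_invocations: int,
                          input_rate_per_1k: float,
                          output_rate_per_1k: float) -> float:
    """samples: (input_tokens, output_tokens) pairs from the representative run.
    Returns estimated monthly spend at the given rates (30-day month)."""
    avg_in = sum(i for i, _ in samples) / len(samples)
    avg_out = sum(o for _, o in samples) / len(samples)
    per_call = (avg_in / 1000) * input_rate_per_1k \
             + (avg_out / 1000) * output_rate_per_1k
    return per_call * daily_invocations * 30
```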

Getting Started

For organizations deploying their first production Bedrock workloads, the cost optimization sequence is:

  1. Start with Haiku for all use cases — validate quality before upgrading models
  2. Keep system prompts under 500 tokens — measure impact of prompt reduction
  3. Use Aurora pgvector over OpenSearch Serverless for Knowledge Base unless query volume justifies the $346/month floor
  4. Enable cross-region inference profiles for production workloads
  5. Tag all Bedrock calls by feature and team for attribution

For AWS Bedrock architecture and implementation consulting, including cost-aware RAG design, agent orchestration patterns, and Bedrock deployment on your AWS account, talk to our team.

Contact us about Bedrock cost optimization →

Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.

