RAG at ~$200/mo, not $2K

Generative AI RAG on Bedrock — S3 Vectors + Knowledge Bases

Production RAG on AWS for ~$200/month at Series-A scale instead of $1,500–3,000/month on OpenSearch Serverless — Bedrock Knowledge Bases on S3 Vectors, Guardrails, and per-tenant inference profiles. The 2026 AWS-native default since S3 Vectors went GA.

Last updated: July 5, 2026Author: FactualMinds AWS ArchitectsReviewed by: Palaniappan P · AWS Solutions Architect — Professional

Problem

Most teams stand up RAG by stitching OpenSearch Serverless to a custom Python service, then watch the retrieval bill grow linearly with corpus size — $1,500–3,000/month on a 5M-vector workload before the first LLM token is generated. Latency, evaluation, and per-tenant spend control stay manual. By the time the workload reaches production, the retrieval layer is more expensive than the inference, and there is no clean answer to 'what does Customer X cost us in tokens this month?'

Solution

Use Bedrock Knowledge Bases on S3 Vectors for the retrieval layer — managed ingestion, managed chunking, managed embeddings, and storage-class economics that scale to billions of vectors at ~10% of OpenSearch Serverless cost above the 10M-vector mark. Wrap inference in Bedrock Guardrails for PII and content safety, and isolate per-tenant spend through application-level inference profiles. Move to OpenSearch Serverless only when you need hybrid BM25 + vector search or sub-50ms p95.

AWS services in this pattern

Service	Role
Amazon Bedrock Knowledge Bases	Managed RAG control plane — handles document ingestion, chunking strategy, embedding model selection, and the Retrieve / RetrieveAndGenerate APIs
Amazon S3 Vectors	Storage-tier vector index (GA Dec 2025) — up to 2B vectors per index at S3 economics; default retrieval store for new RAG workloads in 2026
Amazon Bedrock (Claude Sonnet 5, Nova, Titan)	Foundation model inference — Sonnet 5 for agentic lanes, Nova Micro/Lite for volume, Sonnet 4.6 or Haiku 4.x for stable production until benchmarked
Amazon Bedrock Guardrails	PII redaction, denied topic filtering, contextual grounding checks, and prompt-injection mitigation applied uniformly across every model call
Bedrock Application Inference Profiles	Per-tenant or per-workload tagging of inference calls — the foundation for spend caps, cost attribution, and audit
Amazon OpenSearch Serverless (escalation only)	Vector + lexical hybrid search and sub-50ms retrieval — used when S3 Vectors latency or BM25 requirements force the upgrade
Amazon Bedrock AgentCore	Production agent runtime (GA Oct 2025) for multi-step retrieval, tool use, and stateful agent workflows on top of the RAG layer
Amazon S3	Source-of-truth document store — versioned, encrypted with KMS, ingested into Knowledge Bases via managed connectors
AWS Lambda	Pre-processing hooks (PII scrubbing, document normalization) and post-retrieval re-ranking when needed
Amazon CloudWatch + Bedrock model invocation logs	Token usage, latency, guardrail trip rate, and per-tenant cost telemetry — feeds CUR 2.0 for unit-cost reporting

Architecture components

Ingestion path

S3 source bucket → Knowledge Base data source → managed chunking + embedding → S3 Vectors index. Schedule incremental sync; full re-embed only on chunking strategy change.

Retrieval path

Application calls Bedrock Retrieve or RetrieveAndGenerate with tenant_id metadata filter; S3 Vectors returns top-K chunks; Bedrock composes the prompt and invokes the chosen model under a Guardrail.

Guardrail layer

Single Guardrail definition reused across every Bedrock invocation — PII masking, denied topics, contextual grounding score threshold, and a model-agnostic safety contract.

Per-tenant isolation

Application Inference Profile per tenant carries cost-allocation tags; metadata filter on the vector store scopes retrieval to that tenant's documents.

Evaluation harness

Bedrock Evaluations + a golden-question dataset run on every model or chunking change; faithfulness, answer-relevance, and context-precision tracked in CloudWatch dashboards.

Cost telemetry

Bedrock model invocation logs to CloudWatch; daily Lambda flattens to S3; AWS CUR 2.0 joins on the inference-profile tag to expose per-tenant token spend in QuickSight.

AWS lifecycle notice (June 30, 2026) — Amazon Bedrock Agents Classic is now Bedrock Agents Classic, in maintenance for new customers after July 30, 2026. Net-new agent builds should use Bedrock AgentCore. Full matrix: lifecycle roundup.

Why this pattern

RAG on AWS has had three architectural eras. The first (2023) was OpenSearch + Lambda + a custom embedding pipeline — high engineering cost, high operational tax, total control. The second (2024) was Bedrock Knowledge Bases on OpenSearch Serverless — managed ingestion, managed retrieval, but the OpenSearch Serverless bill remained the dominant cost line on most workloads. The third (2026) is Knowledge Bases on S3 Vectors, which collapses the retrieval-layer economics by an order of magnitude for storage-bound workloads.

The pattern below is what we deploy in AWS Architecture Review engagements when a team is either standing up RAG for the first time in 2026 or migrating off an OpenSearch Serverless retrieval layer that has gotten too expensive.

Choosing the retrieval store

Signal	S3 Vectors	OpenSearch Serverless	Self-managed (pgvector, Pinecone, etc.)
Corpus over 10M vectors	✅
Sub-50ms p95 retrieval		✅	maybe
Hybrid BM25 + vector search		✅	maybe
Tight cost per query at scale	✅		depends
Tightest possible operational footprint	✅	✅
Existing Postgres team and a single small workload			✅
Need to run RAG inside a tenant’s VPC for compliance	escalate	escalate	✅

For most enterprise RAG in 2026, the answer is S3 Vectors.

What “production RAG on Bedrock” looks like

Every layer carries safety, cost, and tenancy context:

Ingestion: documents land in an encrypted S3 source bucket; Knowledge Bases incremental sync picks up changes; the embedding model and chunking strategy are versioned alongside the corpus.
Application call: client request carries tenant_id from the JWT; the application calls RetrieveAndGenerate against the Knowledge Base with a metadata filter scoping retrieval to that tenant’s documents.
Guardrail wrapper: every Bedrock invocation passes through a single Guardrail definition — PII masking on input, denied topics, grounding-score threshold on output.
Inference profile: the call is routed through a per-tenant Bedrock Application Inference Profile so the cost-allocation tag flows into CUR 2.0.
Evaluation: every PR that changes a model, prompt, chunking strategy, or retrieval parameter triggers a Bedrock Evaluation run against the golden-question set; faithfulness and context-precision deltas are PR review fodder.

Where this pattern shows up in our consulting

We deploy this stack most often in Amazon Bedrock and Generative AI on AWS engagements at SaaS, healthtech, and enterprise customers — typically as a six-week engagement that delivers the Knowledge Base, the Guardrail policy, the per-tenant inference-profile pattern, and the evaluation harness. For multi-tenant SaaS specifically, this pattern composes with the Multi-Tenant SaaS on AWS pattern — the tenant boundary is the same, just extended into the AI layer.

Trade-offs

Pro

S3 Vectors cuts the retrieval layer cost by up to 90% versus OpenSearch Serverless at corpus sizes above ~10M vectors — the dominant economic line item in most RAG workloads disappears.

Con

S3 Vectors is optimized for cost-per-vector, not for sub-50ms p95 retrieval. Real-time interactive workloads with strict latency SLAs may need OpenSearch Serverless or a small in-memory cache layer.

Pro

Bedrock Knowledge Bases handles ingestion, chunking, embedding selection, and the metadata-filter API — the parts of RAG that consume engineering quarters elsewhere become a managed control plane.

Con

The managed chunking strategy is opinionated. Heavily structured corpora (legal contracts, code, scientific papers) often outperform with custom chunking + a self-managed embedding pipeline writing into S3 Vectors directly.

Pro

Bedrock Guardrails decouples safety policy from model choice — swap Claude for Nova for Llama without rewriting the guardrail layer or the audit story.

Con

Guardrails add 50–150ms to every invocation and a small per-call cost. For high-throughput internal workloads, evaluate per-route guardrail policy instead of a global wrapper.

Cost notes

Typical Series A workload (5M vectors, 200K queries/month, average 800 input tokens, 400 output tokens with Claude Haiku-class model): retrieval on S3 Vectors lands around $200/month versus $1,500–3,000/month on OpenSearch Serverless at equivalent recall. Inference dominates the bill — $400–900/month on Haiku-class models, $3K–8K/month if you default to a Sonnet-class model. Bedrock Guardrails add ~5% to inference cost. The economic tipping point versus self-hosting embeddings + a vector DB is well below 1M vectors — the managed stack wins on TCO at virtually all enterprise RAG scales.

Related patterns

Multi-Tenant SaaS on AWS — Pool, Silo, and Bridge

Production-ready multi-tenant architecture for SaaS on AWS. Covers tenant isolation models (pool, silo, bridge), per-tenant cost attribution, noisy-neighbor mitigation, and the trade-offs CTOs actually wrestle with at Series B and beyond.

HIPAA on AWS for healthtech — The Smallest Defensible Footprint

BAA-eligible reference architecture for a Series A healthtech on AWS — Cognito, ALB, Fargate, Aurora encrypted with KMS CMKs, S3 with object-level encryption, CloudTrail Lake, AWS Config HIPAA conformance pack, GuardDuty, Macie, Audit Manager, and Bedrock for HIPAA-eligible AI features.

Consulting engagements that deliver this pattern

Amazon Bedrock Consulting for Production LLM Applications

Amazon Bedrock implementation consulting — Knowledge Bases, Agents, Guardrails, model routing, and production RAG. Hands-on Bedrock engineering, not GenAI strategy.

Generative AI on AWS — Production-Ready LLM Apps in Weeks

Generative AI strategy and delivery on AWS — use-case selection, Bedrock + SageMaker architecture, governance, evaluations, and production rollout across the AWS AI stack.

AWS Well-Architected Review — Free Assessment

Free AWS Well-Architected Review from FactualMinds. Identify risks, compliance gaps, and optimization opportunities.

Deep dives

How to Build a RAG Pipeline with Amazon Bedrock Knowledge Bases

Amazon Bedrock Knowledge Bases automate the RAG (Retrieval-Augmented Generation) pipeline — semantic search, chunking, embedding, and context injection into Claude or other foundation models. This guide covers setup, data ingestion, cost optimization, and production patterns.

Fine-Tuning vs RAG on AWS Bedrock: When to Use Each

Compare fine-tuning and RAG (retrieval-augmented generation) for customizing LLMs on Bedrock. Cost, latency, and accuracy trade-offs.

AWS Bedrock Cost Optimization: Token Budgets, Model Selection, and Inference Profiles

Bedrock billing is not a single line item — it is a composition of model invocation costs, Knowledge Base retrieval, Agent orchestration, Guardrails evaluation, and cross-region inference profile routing. Each component has its own pricing model and its own set of cost traps.

How to Build Multi-Tenant GenAI on AWS Bedrock

Build SaaS with AI: multi-tenant architecture on Bedrock, cost isolation, and tenant data security.

Frequently asked questions

When should we use S3 Vectors versus OpenSearch Serverless for the vector store?

Default to S3 Vectors. It went GA in December 2025 and is purpose-built for the storage-economics tier of RAG workloads — billions of vectors at S3 cost. Move to OpenSearch Serverless only when you need hybrid BM25 + vector search, sub-50ms p95 retrieval for an interactive product surface, or filtering semantics S3 Vectors metadata filters cannot express. Most enterprise RAG workloads do not need either.

Do we need Bedrock AgentCore for RAG, or is Knowledge Bases enough?

If your interaction is single-shot retrieve-then-generate, Knowledge Bases alone is enough. Add AgentCore (GA October 2025) when the workload needs multi-step retrieval, tool use, persistent agent memory, or guarded code execution — the production agent runtime is what AgentCore exists to provide. Many teams ship Knowledge Bases first and graduate to AgentCore once a single use case demands it.

How do we control per-tenant or per-workload spend?

Create a Bedrock Application Inference Profile per tenant or per workload, tag it with tenant_id, and route every invocation through it. The model invocation logs carry the profile tag, AWS CUR 2.0 picks the tag up on the Bedrock line items, and Cost Optimization Hub recommendations respect the boundary. Layer application-level monthly token quotas on top — Bedrock has no built-in hard cap on cross-region inference.

Can we mix custom embeddings with Bedrock Knowledge Bases?

Yes — write your embeddings directly to S3 Vectors and call the index from your application. You lose the managed ingestion path but keep S3 Vectors economics. The hybrid model is common for teams that have a tuned embedding pipeline they cannot give up but want to retire OpenSearch Serverless for the cost reasons.

Is this pattern HIPAA-compliant out of the box?

Bedrock and Bedrock AgentCore became HIPAA-eligible in February 2026. Combined with HIPAA-eligible S3, S3 Vectors, KMS, and CloudWatch, this pattern can be deployed under a BAA. The compliance work is in the controls — encryption with customer-managed KMS keys, no PHI in prompts that lack redaction, signed BAA from AWS, and audit logging via CloudTrail. The HIPAA on AWS for healthtech pattern walks the full control map.

What does the evaluation pipeline look like?

An evaluation set — 200–500 questions you already know the right answer to — stored in S3 and re-run through Bedrock Evaluations after every chunking-strategy or model change. The three metrics that matter day-to-day are faithfulness (does the answer stay grounded in the retrieved context), context precision (is the retrieval returning the right chunks), and answer relevance. CloudWatch dashboards trend the metrics; PRs that move them ship with a justification.

Want this pattern deployed end-to-end?

Our team builds these patterns in production for SaaS, healthcare, fintech, and enterprise customers. Tell us your constraints and we'll scope the engagement.

Talk to AWS Experts

See more patterns