RAG at ~$200/mo, not $2K

Generative AI RAG on Bedrock — S3 Vectors + Knowledge Bases

Production RAG on AWS for ~$200/month at Series-A scale instead of $1,500–3,000/month on OpenSearch Serverless — Bedrock Knowledge Bases on S3 Vectors, Guardrails, and per-tenant inference profiles. The 2026 AWS-native default since S3 Vectors went GA.

Last updated: May 1, 2026
Author: FactualMinds AWS Architects
Reviewed by: Palaniappan P · AWS Solutions Architect - Professional

Problem

Most teams stand up RAG by stitching OpenSearch Serverless to a custom Python service, then watch the retrieval bill grow linearly with corpus size — $1,500–3,000/month on a 5M-vector workload before the first LLM token is generated. Latency, evaluation, and per-tenant spend control stay manual. By the time the workload reaches production, the retrieval layer is more expensive than the inference, and there is no clean answer to 'what does Customer X cost us in tokens this month?'

Solution

Use Bedrock Knowledge Bases on S3 Vectors for the retrieval layer — managed ingestion, managed chunking, managed embeddings, and storage-class economics that scale to billions of vectors at ~10% of OpenSearch Serverless cost above the 10M-vector mark. Wrap inference in Bedrock Guardrails for PII and content safety, and isolate per-tenant spend through application-level inference profiles. Move to OpenSearch Serverless only when you need hybrid BM25 + vector search or sub-50ms p95.

AWS services in this pattern

Service | Role
Amazon Bedrock Knowledge Bases | Managed RAG control plane: handles document ingestion, chunking strategy, embedding model selection, and the Retrieve / RetrieveAndGenerate APIs
Amazon S3 Vectors | Storage-tier vector index (GA Dec 2025): up to 2B vectors per index at S3 economics; default retrieval store for new RAG workloads in 2026
Amazon Bedrock (Claude, Nova, Titan) | Foundation model inference: model choice driven by latency, cost, and reasoning depth requirements per use case
Amazon Bedrock Guardrails | PII redaction, denied topic filtering, contextual grounding checks, and prompt-injection mitigation applied uniformly across every model call
Bedrock Application Inference Profiles | Per-tenant or per-workload tagging of inference calls: the foundation for spend caps, cost attribution, and audit
Amazon OpenSearch Serverless (escalation only) | Vector + lexical hybrid search and sub-50ms retrieval: used when S3 Vectors latency or BM25 requirements force the upgrade
Amazon Bedrock AgentCore | Production agent runtime (GA Oct 2025) for multi-step retrieval, tool use, and stateful agent workflows on top of the RAG layer
Amazon S3 | Source-of-truth document store: versioned, encrypted with KMS, ingested into Knowledge Bases via managed connectors
AWS Lambda | Pre-processing hooks (PII scrubbing, document normalization) and post-retrieval re-ranking when needed
Amazon CloudWatch + Bedrock model invocation logs | Token usage, latency, guardrail trip rate, and per-tenant cost telemetry; feeds CUR 2.0 for unit-cost reporting

Architecture components

Ingestion path

S3 source bucket → Knowledge Base data source → managed chunking + embedding → S3 Vectors index. Schedule incremental sync; full re-embed only on chunking strategy change.
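
As a rough illustration of the scheduled sync step, the sketch below starts a Knowledge Base ingestion job through the boto3 `bedrock-agent` client. The function name `start_sync` and the IDs in the usage note are hypothetical. The client is passed in so the function can run from a scheduled Lambda or be exercised against a stub.

```python
def start_sync(client, knowledge_base_id: str, data_source_id: str) -> tuple[str, str]:
    """Start a Knowledge Base sync and return (job id, initial status).

    `client` is expected to be boto3.client("bedrock-agent").
    """
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    job = response["ingestionJob"]
    return job["ingestionJobId"], job["status"]
```

A caller would typically do something like `start_sync(boto3.client("bedrock-agent"), "KB_ID", "DATA_SOURCE_ID")` on a schedule, with both IDs replaced by real values.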

Retrieval path

Application calls Bedrock Retrieve or RetrieveAndGenerate with tenant_id metadata filter; S3 Vectors returns top-K chunks; Bedrock composes the prompt and invokes the chosen model under a Guardrail.
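
The retrieval path can be sketched as a single RetrieveAndGenerate request. This is a minimal example under assumptions: the helper names `build_rag_request` and `ask` are hypothetical, the metadata key is assumed to be `tenant_id` as in the text, and every ID and ARN is supplied by the caller. The request builder is kept separate from the AWS call so it can be inspected without credentials.

```python
from typing import Any


def build_rag_request(query: str, tenant_id: str, kb_id: str, model_arn: str,
                      guardrail_id: str, guardrail_version: str,
                      top_k: int = 5) -> dict[str, Any]:
    """Build keyword arguments for bedrock-agent-runtime retrieve_and_generate."""
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                # May be a foundation model ARN or a per-tenant inference profile ARN.
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": top_k,
                        # Scope retrieval to this tenant's documents only.
                        "filter": {"equals": {"key": "tenant_id", "value": tenant_id}},
                    }
                },
                "generationConfiguration": {
                    "guardrailConfiguration": {
                        "guardrailId": guardrail_id,
                        "guardrailVersion": guardrail_version,
                    }
                },
            },
        },
    }


def ask(query: str, tenant_id: str, **ids: str) -> str:
    """Send the request to Bedrock and return the generated answer text."""
    import boto3  # imported lazily so the builder has no AWS dependency

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(**build_rag_request(query, tenant_id, **ids))
    return response["output"]["text"]
```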

Guardrail layer

Single Guardrail definition reused across every Bedrock invocation — PII masking, denied topics, contextual grounding score threshold, and a model-agnostic safety contract.
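
One way to express such a shared Guardrail is as a single definition passed to the boto3 `bedrock` client's `create_guardrail` call. The sketch below only builds the arguments. The denied topic, the blocked-message strings, the PII entity choices, and the 0.75 threshold are illustrative assumptions, not recommended values.

```python
def guardrail_definition(name: str, grounding_threshold: float = 0.75) -> dict:
    """Build keyword arguments for bedrock.create_guardrail (illustrative values)."""
    return {
        "name": name,
        "blockedInputMessaging": "This request cannot be processed.",
        "blockedOutputsMessaging": "The response was blocked by policy.",
        # PII masking: anonymize rather than block, so answers stay usable.
        "sensitiveInformationPolicyConfig": {
            "piiEntitiesConfig": [
                {"type": "EMAIL", "action": "ANONYMIZE"},
                {"type": "PHONE", "action": "ANONYMIZE"},
            ]
        },
        # Denied topics: one hypothetical example topic.
        "topicPolicyConfig": {
            "topicsConfig": [{
                "name": "legal-advice",
                "definition": "Requests for legal advice or opinions on specific legal matters.",
                "type": "DENY",
            }]
        },
        # Contextual grounding: score thresholds for grounding and relevance.
        "contextualGroundingPolicyConfig": {
            "filtersConfig": [
                {"type": "GROUNDING", "threshold": grounding_threshold},
                {"type": "RELEVANCE", "threshold": grounding_threshold},
            ]
        },
    }
```

The returned dictionary would be passed as `boto3.client("bedrock").create_guardrail(**guardrail_definition("rag-default"))`, and the resulting guardrail ID and version reused on every invocation.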

Per-tenant isolation

Application Inference Profile per tenant carries cost-allocation tags; metadata filter on the vector store scopes retrieval to that tenant's documents.
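
A per-tenant profile could be created roughly as follows, using the boto3 `bedrock` client's `create_inference_profile` call. The `tenant-<id>` naming scheme and the helper names are assumptions for illustration; the source model or system inference profile ARN is supplied by the caller.

```python
def tenant_profile_request(tenant_id: str, source_model_arn: str) -> dict:
    """Build keyword arguments for bedrock.create_inference_profile."""
    return {
        "inferenceProfileName": f"tenant-{tenant_id}",  # hypothetical naming scheme
        "modelSource": {"copyFrom": source_model_arn},
        # Cost-allocation tag that billing reports can group by.
        "tags": [{"key": "tenant_id", "value": tenant_id}],
    }


def create_tenant_profile(tenant_id: str, source_model_arn: str) -> str:
    """Create the profile and return its ARN for use as the model identifier."""
    import boto3  # imported lazily so the builder has no AWS dependency

    bedrock = boto3.client("bedrock")
    response = bedrock.create_inference_profile(
        **tenant_profile_request(tenant_id, source_model_arn)
    )
    return response["inferenceProfileArn"]
```

The returned ARN is then used in place of the model ID on that tenant's invocations, so usage is attributed to the tagged profile.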

Evaluation harness

Bedrock Evaluations + a golden-question dataset run on every model or chunking change; faithfulness, answer-relevance, and context-precision tracked in CloudWatch dashboards.

Cost telemetry

Bedrock model invocation logs to CloudWatch; daily Lambda flattens to S3; AWS CUR 2.0 joins on the inference-profile tag to expose per-tenant token spend in QuickSight.
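
The flattening step might end in an aggregation like the one below. The record shape (`tenant_id`, `input_tokens`, `output_tokens`) is a hypothetical output format for the daily Lambda, not the raw Bedrock invocation-log schema, so the field names would need to match whatever the flattening job actually writes.

```python
from collections import defaultdict


def tokens_by_tenant(records: list[dict]) -> dict[str, dict[str, int]]:
    """Sum input and output tokens per tenant from flattened log records."""
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0}
    )
    for record in records:
        tenant_totals = totals[record["tenant_id"]]
        tenant_totals["input_tokens"] += record["input_tokens"]
        tenant_totals["output_tokens"] += record["output_tokens"]
    return dict(totals)
```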

Why this pattern

RAG on AWS has had three architectural eras. The first (2023) was OpenSearch + Lambda + a custom embedding pipeline — high engineering cost, high operational tax, total control. The second (2024) was Bedrock Knowledge Bases on OpenSearch Serverless — managed ingestion, managed retrieval, but the OpenSearch Serverless bill remained the dominant cost line on most workloads. The third (2026) is Knowledge Bases on S3 Vectors, which collapses the retrieval-layer economics by an order of magnitude for storage-bound workloads.

The pattern below is what we deploy in AWS Architecture Review engagements when a team is either standing up RAG for the first time in 2026 or migrating off an OpenSearch Serverless retrieval layer that has gotten too expensive.

Choosing the retrieval store

Signal | S3 Vectors | OpenSearch Serverless | Self-managed (pgvector, Pinecone, etc.)
Corpus over 10M vectors
Sub-50ms p95 retrieval: maybe
Hybrid BM25 + vector search: maybe
Tight cost per query at scale: depends
Tightest possible operational footprint
Existing Postgres team and a single small workload
Need to run RAG inside a tenant’s VPC for compliance: escalate, escalate

For most enterprise RAG in 2026, the answer is S3 Vectors.

What “production RAG on Bedrock” looks like

Every layer carries safety, cost, and tenancy context.

Where this pattern shows up in our consulting

We deploy this stack most often in Amazon Bedrock and Generative AI on AWS engagements at SaaS, healthtech, and enterprise customers — typically as a six-week engagement that delivers the Knowledge Base, the Guardrail policy, the per-tenant inference-profile pattern, and the evaluation harness. For multi-tenant SaaS specifically, this pattern composes with the Multi-Tenant SaaS on AWS pattern — the tenant boundary is the same, just extended into the AI layer.

Trade-offs

Pro

S3 Vectors cuts the retrieval layer cost by up to 90% versus OpenSearch Serverless at corpus sizes above ~10M vectors — the dominant economic line item in most RAG workloads disappears.

Con

S3 Vectors is optimized for cost-per-vector, not for sub-50ms p95 retrieval. Real-time interactive workloads with strict latency SLAs may need OpenSearch Serverless or a small in-memory cache layer.

Pro

Bedrock Knowledge Bases handles ingestion, chunking, embedding selection, and the metadata-filter API — the parts of RAG that consume engineering quarters elsewhere become a managed control plane.

Con

The managed chunking strategy is opinionated. Heavily structured corpora (legal contracts, code, scientific papers) often retrieve better with custom chunking and a self-managed embedding pipeline that writes into S3 Vectors directly.

Pro

Bedrock Guardrails decouples safety policy from model choice: swap Claude for Nova or Llama without rewriting the guardrail layer or the audit story.

Con

Guardrails add 50–150ms to every invocation and a small per-call cost. For high-throughput internal workloads, evaluate per-route guardrail policy instead of a global wrapper.

Cost notes

Typical Series A workload (5M vectors, 200K queries/month, average 800 input tokens, 400 output tokens with Claude Haiku-class model): retrieval on S3 Vectors lands around $200/month versus $1,500–3,000/month on OpenSearch Serverless at equivalent recall. Inference dominates the bill — $400–900/month on Haiku-class models, $3K–8K/month if you default to a Sonnet-class model. Bedrock Guardrails add ~5% to inference cost. The economic tipping point versus self-hosting embeddings + a vector DB is well below 1M vectors — the managed stack wins on TCO at virtually all enterprise RAG scales.
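
The inference line can be sanity-checked with simple arithmetic. The sketch below uses the workload figures above; the per-million-token prices ($1.00 input, $5.00 output) are placeholder assumptions, not quoted AWS pricing, and should be replaced with the current rate for the chosen model. Retrieved context added to the prompt would raise the effective input token count.

```python
def monthly_inference_cost(queries: int, input_tokens: int, output_tokens: int,
                           usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Monthly inference cost in USD for a fixed per-query token profile."""
    input_cost = queries * input_tokens / 1_000_000 * usd_per_m_input
    output_cost = queries * output_tokens / 1_000_000 * usd_per_m_output
    return input_cost + output_cost


# 200K queries/month, 800 input and 400 output tokens per query,
# at ASSUMED prices of $1.00 and $5.00 per million tokens.
cost = monthly_inference_cost(200_000, 800, 400, 1.00, 5.00)
print(f"${cost:,.0f}/month")  # $560/month
```

At those assumed prices, the 160M input tokens cost $160 and the 80M output tokens cost $400, which falls inside the $400–900 range quoted for Haiku-class models.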

Frequently asked questions

When should we use S3 Vectors versus OpenSearch Serverless for the vector store?

Default to S3 Vectors. It went GA in December 2025 and is purpose-built for the storage-economics tier of RAG workloads: billions of vectors at S3 cost. Move to OpenSearch Serverless only when you need hybrid BM25 + vector search, sub-50ms p95 retrieval for an interactive product surface, or filtering semantics that S3 Vectors metadata filters cannot express. Most enterprise RAG workloads need none of these.

Do we need Bedrock AgentCore for RAG, or is Knowledge Bases enough?

If your interaction is single-shot retrieve-then-generate, Knowledge Bases alone is enough. Add AgentCore (GA October 2025) when the workload needs multi-step retrieval, tool use, persistent agent memory, or guarded code execution — the production agent runtime is what AgentCore exists to provide. Many teams ship Knowledge Bases first and graduate to AgentCore once a single use case demands it.

How do we control per-tenant or per-workload spend?

Create a Bedrock Application Inference Profile per tenant or per workload, tag it with tenant_id, and route every invocation through it. The model invocation logs carry the profile tag, AWS CUR 2.0 picks the tag up on the Bedrock line items, and Cost Optimization Hub recommendations respect the boundary. Layer application-level monthly token quotas on top — Bedrock has no built-in hard cap on cross-region inference.
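
An application-level quota of the kind described could look like the sketch below. The class and exception names are hypothetical, and the counter lives in process memory purely for illustration; a real deployment would keep it in a shared store so that every application instance sees the same totals.

```python
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Raised when a tenant would exceed its monthly token allowance."""


@dataclass
class MonthlyTokenQuota:
    limit: int  # tokens allowed per tenant per month
    used: dict = field(default_factory=dict)  # (tenant_id, "YYYY-MM") -> tokens

    def charge(self, tenant_id: str, month: str, tokens: int) -> int:
        """Record usage and return the remaining allowance, or raise."""
        key = (tenant_id, month)
        total = self.used.get(key, 0) + tokens
        if total > self.limit:
            raise QuotaExceeded(f"{tenant_id} over quota for {month}")
        self.used[key] = total
        return self.limit - total
```

The application would call `charge` with the token counts reported on each Bedrock response and reject or queue further requests once `QuotaExceeded` is raised.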

Can we mix custom embeddings with Bedrock Knowledge Bases?

Yes — write your embeddings directly to S3 Vectors and call the index from your application. You lose the managed ingestion path but keep S3 Vectors economics. The hybrid model is common for teams that have a tuned embedding pipeline they cannot give up but want to retire OpenSearch Serverless for the cost reasons.
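
Writing and querying custom embeddings directly might look like the following sketch, which builds arguments for the boto3 `s3vectors` client's `put_vectors` and `query_vectors` calls. The helper names, the bucket and index names, and the `tenant_id` metadata key are assumptions; the embeddings themselves come from whatever pipeline the team already runs.

```python
def put_request(bucket: str, index: str, key: str,
                embedding: list[float], tenant_id: str) -> dict:
    """Build keyword arguments for s3vectors.put_vectors (one vector)."""
    return {
        "vectorBucketName": bucket,
        "indexName": index,
        "vectors": [{
            "key": key,
            "data": {"float32": embedding},
            "metadata": {"tenant_id": tenant_id},
        }],
    }


def query_request(bucket: str, index: str, embedding: list[float],
                  tenant_id: str, top_k: int = 5) -> dict:
    """Build keyword arguments for s3vectors.query_vectors, scoped to one tenant."""
    return {
        "vectorBucketName": bucket,
        "indexName": index,
        "queryVector": {"float32": embedding},
        "topK": top_k,
        "filter": {"tenant_id": tenant_id},  # equality match on metadata
        "returnMetadata": True,
        "returnDistance": True,
    }
```

With a client from `boto3.client("s3vectors")`, these would be used as `client.put_vectors(**put_request(...))` and `client.query_vectors(**query_request(...))`.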

Is this pattern HIPAA-compliant out of the box?

Not out of the box, but it can be deployed under a BAA. Amazon Bedrock is a HIPAA-eligible service, and Bedrock AgentCore became HIPAA-eligible in February 2026. Combined with HIPAA-eligible S3, S3 Vectors, KMS, and CloudWatch, every service in the pattern is covered. The compliance work is in the controls: encryption with customer-managed KMS keys, redaction applied before any PHI reaches a prompt, a signed BAA from AWS, and audit logging via CloudTrail. The HIPAA on AWS for healthtech pattern walks the full control map.

What does the evaluation pipeline look like?

An evaluation set — 200–500 questions you already know the right answer to — stored in S3 and re-run through Bedrock Evaluations after every chunking-strategy or model change. The three metrics that matter day-to-day are faithfulness (does the answer stay grounded in the retrieved context), context precision (is the retrieval returning the right chunks), and answer relevance. CloudWatch dashboards trend the metrics; PRs that move them ship with a justification.
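
A simple regression gate over those three metrics could be sketched as follows. It assumes the scores have already been produced by the evaluation run and collected into plain dictionaries; the function name and the 0.02 tolerance are illustrative choices.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.02) -> dict[str, tuple[float, float]]:
    """Return metrics where the candidate scores below baseline minus tolerance.

    A metric missing from the candidate run is treated as a score of 0.0.
    """
    return {
        metric: (score, candidate.get(metric, 0.0))
        for metric, score in baseline.items()
        if candidate.get(metric, 0.0) < score - tolerance
    }


baseline = {"faithfulness": 0.90, "context_precision": 0.80, "answer_relevance": 0.88}
candidate = {"faithfulness": 0.85, "context_precision": 0.79, "answer_relevance": 0.91}
print(regressions(baseline, candidate))  # {'faithfulness': (0.9, 0.85)}
```

A CI job could fail the build when the returned dictionary is non-empty, which forces the justification step described above.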
