Decoupled services, no dual-writes
Event-driven microservices on AWS — EventBridge, Pipes, and the Outbox Pattern
Event-driven architecture on AWS without the silent dropped events that surface a month later as inventory or billing reconciliation work — transactional outbox via EventBridge Pipes, schema discipline on the bus, and a DLQ-and-replay path operators can run without paging an architect.
Last updated: May 1, 2026
Author: FactualMinds AWS Architects
Reviewed by: Palaniappan P · AWS Solutions Architect - Professional
Problem
Dual-writes fail silently 1–5% of the time in production. A service writes to its database and then publishes to SNS or EventBridge in the same request; the publish step fails; the event is gone. A month in, the team is debugging missing inventory updates with no audit trail. Six months in, the bus carries a hundred undocumented event shapes, and any service migration requires an archaeology dig through six handlers and a custom retry queue.
Solution
Use the transactional outbox pattern on top of AWS-native primitives: write the state change and its outbox record to your service's database in a single transaction, then let EventBridge Pipes pull from DynamoDB Streams (or from a Kinesis data stream fed by DMS CDC) and deliver to a dedicated EventBridge custom bus. Treat the bus as a versioned schema contract via the EventBridge Schema Registry, route every consumer through SQS for durable retry, and orchestrate multi-step workflows with Step Functions instead of choreographing them on the bus.
AWS services in this pattern
| Service | Role |
|---|---|
| Amazon EventBridge (custom bus) | Application event bus — one custom bus per business domain; rule-based routing to consumers; carries versioned event schemas registered in Schema Registry |
| Amazon EventBridge Pipes | Source-to-target plumbing without Lambda glue — pulls from DynamoDB Streams, Kinesis, SQS, MSK, or Amazon MQ and lands events on the bus with optional filtering and enrichment |
| Amazon EventBridge Schema Registry | Versioned event-schema catalog: every producer registers; consumers generate strongly-typed bindings; compatibility checks in CI, run against the registered versions, prevent breaking changes |
| Amazon SQS (per-consumer queue + DLQ) | Durable buffer between the bus and each consumer — controls consumer concurrency, isolates noisy neighbors, and captures poison messages in a per-consumer DLQ |
| AWS Lambda | Default consumer compute — short, idempotent handlers; partial-batch-failure on SQS; auto-scaling without operator intervention |
| Amazon ECS Fargate | Long-running consumer for handlers that need >15 minute execution, sticky in-memory state, or libraries that do not cold-start well on Lambda |
| AWS Step Functions | Orchestration plane for multi-step workflows — replaces choreography-on-the-bus when the workflow has a defined state machine; handles retries, timeouts, and compensation |
| Amazon DynamoDB | Source-of-truth store with streams enabled — the outbox source for EventBridge Pipes when the service is DynamoDB-backed |
| AWS DMS (CDC) or Amazon Aurora DSQL | Outbox source for relational stores: DMS streams CDC into Kinesis Data Streams, which Pipes consumes; Aurora DSQL provides active-active SQL when the service needs multi-region writes |
| Amazon CloudWatch + AWS X-Ray | End-to-end tracing across the bus, consumers, and downstream services — required for debugging the failure modes that event-driven systems make subtle |
Architecture components
Transactional outbox
Service writes the state change and its outbox record to its database in a single transaction; DynamoDB Streams captures the change directly, or DMS CDC publishes it to a Kinesis data stream; EventBridge Pipes pulls from the stream and lands the event on the bus. No dual-write, no lost events.
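A minimal sketch of the single-transaction write, assuming a relational store. In-memory SQLite stands in for the service database here, and the table names, the `place_order` function, and the `OrderPlaced.v1` detail type are all hypothetical. The point it illustrates is that the business row and the outbox row commit or roll back together, so the CDC stream never sees one without the other.

```python
import json
import sqlite3
import uuid
from typing import Optional


def open_db() -> sqlite3.Connection:
    # In-memory SQLite is a stand-in for the service's relational store.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "event_id TEXT PRIMARY KEY, detail_type TEXT NOT NULL, detail TEXT NOT NULL)"
    )
    return conn


def place_order(conn: sqlite3.Connection, order_id: str, event_id: Optional[str] = None) -> str:
    """Write the state change and its outbox row in one transaction."""
    event_id = event_id or str(uuid.uuid4())
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "PLACED"))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?)",
            (event_id, "OrderPlaced.v1", json.dumps({"orderId": order_id})),
        )
    return event_id
```

If the outbox insert fails, the order insert is rolled back with it. The service never calls the bus itself; the change stream on the outbox is the only publisher.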
Event bus per domain
One custom EventBridge bus per business domain — orders, billing, identity. Each bus has its own rules, its own consumers, and its own schema namespace; cross-domain events require an explicit cross-bus rule, which is also the natural audit point.
Per-consumer SQS queue + DLQ
Bus rule lands events on a consumer-specific SQS queue with a configured DLQ; consumer Lambdas/Fargate workers poll the queue with partial-batch-failure semantics; poison messages go to the DLQ with the original event for replay.
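The partial-batch-failure behaviour can be sketched as a Lambda handler like the one below. It assumes the SQS event source mapping has `ReportBatchItemFailures` enabled and that the bus rule delivers the unmodified EventBridge envelope as the message body; `make_handler` and the `process` callback are illustrative names, not part of any AWS SDK.

```python
import json
from typing import Any, Callable, Dict, List


def make_handler(process: Callable[[Dict[str, Any]], None]):
    """Build an SQS-triggered Lambda handler that reports per-message failures."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, List[Dict[str, str]]]:
        failures: List[Dict[str, str]] = []
        for record in event.get("Records", []):
            try:
                # The SQS body is assumed to be the JSON-encoded EventBridge envelope.
                envelope = json.loads(record["body"])
                process(envelope)
            except Exception:
                # Only this message becomes visible on the queue again; once it
                # exceeds the queue's maxReceiveCount, SQS moves it to the DLQ.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler
```

Returning an empty `batchItemFailures` list marks the whole batch as successful, so one poison message no longer forces the rest of its batch to be reprocessed.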
Schema contract
Every event registered in Schema Registry; CI fails the build if a producer publishes an unregistered schema or breaks a consumer-mandated field; evolution rules enforce backward-compatible changes.
Step Functions for orchestration
When a workflow is multi-step with state — order saga, document processing, multi-tenant onboarding — write it as a Step Functions state machine that publishes domain events on transitions instead of trying to choreograph the workflow across loose handlers.
Replay tooling
DLQ → operator-triggered replay back to the queue or directly to the bus, with idempotency keys on every event; consumers ALWAYS handle retries and replays correctly because they design for them, not because they get lucky.
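What "designing for replays" means in a consumer can be sketched as follows. The `idempotencyKey` field name and the `IdempotentConsumer` class are hypothetical, and the in-memory set is only a stand-in for a durable store; in production the key check and the side effect would need to be made atomic, for example with a conditional write.

```python
from typing import Any, Callable, Dict, Set


class IdempotentConsumer:
    """Apply each event at most once, keyed on its idempotency key."""

    def __init__(self, apply: Callable[[Dict[str, Any]], None]) -> None:
        self._apply = apply
        self._seen: Set[str] = set()  # stand-in for a durable dedupe table

    def handle(self, event: Dict[str, Any]) -> bool:
        key = event["detail"]["idempotencyKey"]  # hypothetical field name
        if key in self._seen:
            return False  # duplicate delivery or replay: acknowledge, do nothing
        self._apply(event)
        # Recorded only after success, so a failed attempt can still be retried.
        self._seen.add(key)
        return True
```

With this shape, replaying an entire DLQ back onto the queue is safe: events that were already applied are acknowledged and skipped.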
Why this pattern
Event-driven microservices on AWS go wrong in two predictable ways. The first is the dual-write — the service writes to its database, then publishes to SNS or EventBridge, and the publish silently fails some fraction of the time. The second is bus archaeology — six months in, nobody knows what events live on the bus, what shape they take, or which consumers depend on which fields, so any service migration becomes a multi-week trace.
The pattern below addresses both directly. The transactional outbox via EventBridge Pipes eliminates the dual-write. The Schema Registry plus per-consumer SQS queues and explicit DLQs eliminate the archaeology. Step Functions absorbs the multi-step workflows that should never have been choreography in the first place.
Choosing the orchestration shape
| Workload | Raw EventBridge | EventBridge + Step Functions | MSK / Kafka |
|---|---|---|---|
| Fire-and-forget fan-out | ✅ | | |
| Multi-step workflow with state | | ✅ | |
| Cross-domain events between business contexts | ✅ | | |
| Replay-from-offset analytics ingestion | | | ✅ |
| High-throughput log aggregation | | | ✅ |
| Saga / compensating transaction | | ✅ | |
| Real-time CDC into a lakehouse | Pipes | | ✅ |
What the failure modes look like (and how this design handles them)
- Dual-write between database and bus → eliminated by the outbox + Pipes.
- Lost events from a bad consumer deploy → caught by per-consumer SQS + DLQ; replayed once the consumer is fixed.
- Schema drift between producer and consumer → caught by Schema Registry + CI; consumers fail at build time, not in production.
- Choreography that nobody can debug → replaced with Step Functions where the workflow is genuinely stateful.
- Idempotency violations on retry → every event carries an idempotency key; consumers MUST treat retries as safe.
Where this pattern shows up in our consulting
We deploy event-driven architectures most often in AWS Serverless and Architecture Review engagements at growing SaaS companies — usually when a monolith-to-microservices migration is underway and the team has felt the pain of the dual-write firsthand. The lakehouse pattern composes downstream: domain events on EventBridge feed Kinesis Data Streams via Pipes, land in S3 Tables, and become the analytics source of truth — see Lakehouse on AWS for that side of the design.
Trade-offs
Pro
EventBridge Pipes eliminates the Lambda-glue layer that previously sat between every CDC source and the bus — fewer cold starts, fewer stack frames, fewer places for the dual-write to silently fail.
Con
Pipes adds another AWS-specific concept to learn; teams already deep on Kafka with Debezium often prefer to keep the Debezium pattern even on AWS. Both are defensible; pick one and stick with it.
Pro
Per-consumer SQS queues with explicit DLQs make the failure modes legible — every consumer has one place to look when events 'disappear', and replay is a single CLI command, not a forensic exercise.
Con
Per-consumer queues multiply the AWS resource count and the IaC surface. For small services with few consumers, the queue-per-consumer pattern can feel heavy; for systems past about three consumers per bus, it pays for itself within a quarter.
Pro
Step Functions for orchestration keeps multi-step workflows debuggable — the state machine visualization is the single best post-incident review artifact in AWS for event-driven workloads.
Con
Step Functions Standard pricing is per state transition. A 10,000-event fan-out costs roughly $0.25 on Step Functions (at one transition per event) versus about $0.01 on raw EventBridge + SQS. Use Step Functions where the workflow has explicit state; do not use it as a generic event router.
Cost notes
EventBridge custom-bus events are $1 per million; SQS standard messages are $0.40 per million; Lambda invocations are dwarfed by the egress and the state-storage costs in DynamoDB or Aurora that drive most service costs. A typical mid-size SaaS event-driven workload (50M events/month across 5 domains and 12 consumers) lands at $1,500–4,000/month in event-plane spend — small relative to the underlying compute and storage. Step Functions Standard at 10M state transitions/month is around $250; Express workflows are dramatically cheaper for high-throughput orchestration. The expensive failure mode is operational, not financial — uncaught dual-writes generate inventory or billing reconciliation work that costs engineer-weeks to clean up.
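The per-unit figures above can be turned into a quick back-of-envelope calculation. This sketch covers only the per-message list prices quoted in this section; the Step Functions rate is the one implied by "around $250" for 10M transitions, and the function name is illustrative. It leaves out Lambda, Pipes, logging, tracing, and data transfer, so it is not an estimate of total event-plane spend.

```python
# USD per million units. The first two rates are quoted in the text above;
# the third is derived from "around $250" per 10M Standard state transitions.
PRICE_PER_MILLION = {
    "eventbridge_custom_event": 1.00,
    "sqs_standard_request": 0.40,
    "sfn_standard_transition": 25.00,
}


def monthly_cost(units: int, kind: str) -> float:
    """Return the monthly list-price cost in USD for `units` of one kind."""
    return units / 1_000_000 * PRICE_PER_MILLION[kind]
```

For example, `monthly_cost(50_000_000, "eventbridge_custom_event")` gives the bus-ingestion component for the 50M-event workload, and `monthly_cost(10_000_000, "sfn_standard_transition")` reproduces the Step Functions figure.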
Related patterns
Lakehouse on AWS — S3 Tables, Iceberg, Athena, and Redshift Spectrum
Production lakehouse reference architecture on AWS — S3 Tables (managed Apache Iceberg), Glue Data Catalog, Athena, Redshift Spectrum, Lake Formation, and Managed Service for Apache Flink for streaming ingest. The AWS-native default for unified analytics in 2026.
Multi-Tenant SaaS on AWS — Pool, Silo, and Bridge
Production-ready multi-tenant architecture for SaaS on AWS. Covers tenant isolation models (pool, silo, bridge), per-tenant cost attribution, noisy-neighbor mitigation, and the trade-offs CTOs actually wrestle with at Series B and beyond.
Consulting engagements that deliver this pattern
AWS Serverless Architecture & Lambda Consulting
Scalable, cost-efficient applications with AWS serverless — Lambda, API Gateway, DynamoDB, Step Functions. Consulting from an AWS Select Tier Partner.
AWS Well-Architected Review — Free Assessment
Free AWS Well-Architected Review from FactualMinds. Identify risks, compliance gaps, and optimization opportunities.
Frequently asked questions
When should we use EventBridge versus MSK (managed Kafka)?
EventBridge for application events between AWS services, especially when consumers are heterogeneous (Lambda, Fargate, Step Functions, third-party SaaS via API destinations). MSK for high-throughput streams whose consumers need replay-from-offset semantics: analytics ingest, log aggregation, and change-data-capture pipelines that feed lakes. Many teams run both: EventBridge as the application-event plane, MSK as the data-pipeline plane. Picking one for everything is the most common mistake.
Do we still need Lambda glue between sources and EventBridge?
Almost never as of 2026. EventBridge Pipes covers the source-to-bus plumbing for DynamoDB Streams, Kinesis, SQS, MSK, and MQ without writing a Lambda; DMS CDC reaches Pipes by targeting a Kinesis data stream. The Pipes filter and enrichment steps cover most of the transformation cases the old Lambda glue handled. You write a Lambda only when the enrichment is genuinely service-specific.
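As a rough illustration of what replaces the glue Lambda, the sketch below builds the request for a pipe from a DynamoDB stream to a custom bus. The field names follow the EventBridge Pipes `CreatePipe` API as best understood here and should be checked against the current API reference; the function name, the batch and retry settings, the `orders.outbox` source, and the `OrderPlaced.v1` detail type are assumptions made for the example.

```python
import json
from typing import Any, Dict


def build_outbox_pipe_params(
    name: str, stream_arn: str, bus_arn: str, role_arn: str, dlq_arn: str
) -> Dict[str, Any]:
    """Build CreatePipe parameters for a DynamoDB-stream-to-bus outbox pipe."""
    # Forward only newly inserted outbox items; updates and deletes are dropped.
    insert_only = json.dumps({"eventName": ["INSERT"]})
    return {
        "Name": name,
        "RoleArn": role_arn,
        "Source": stream_arn,
        "SourceParameters": {
            "DynamoDBStreamParameters": {
                "StartingPosition": "TRIM_HORIZON",
                "BatchSize": 10,                       # illustrative value
                "MaximumRetryAttempts": 5,             # illustrative value
                "DeadLetterConfig": {"Arn": dlq_arn},  # records that exhaust retries
            },
            "FilterCriteria": {"Filters": [{"Pattern": insert_only}]},
        },
        "Target": bus_arn,
        "TargetParameters": {
            "EventBridgeEventBusParameters": {
                "Source": "orders.outbox",       # hypothetical event source name
                "DetailType": "OrderPlaced.v1",  # hypothetical versioned detail type
            }
        },
    }
```

With boto3 installed and credentials configured, the result would be passed as `boto3.client("pipes").create_pipe(**params)`; in practice most teams would express the same shape in their IaC tool instead.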
How do we handle ordering and exactly-once semantics?
Most application events do not need ordering — handlers should be idempotent (every event carries an idempotency key) and partial-batch-failure on SQS handles retries correctly. When ordering is genuinely required (per-account ledger, financial transactions), use FIFO SQS queues with the appropriate group key, or move that specific stream onto MSK with a partitioning strategy. Designing every consumer for ordering when it does not need it is the most common over-engineering mistake on EventBridge.
What about Step Functions versus orchestration on the bus?
If the workflow has a state machine you can draw — order placed → payment captured → inventory reserved → shipment created → order completed, with explicit compensation paths — write it as a Step Functions state machine. The state machine becomes the canonical artifact for incident review and onboarding. Reserve raw bus choreography for fan-out events that have no defined sequence (a NewUserCreated event triggering ten independent welcome workflows).
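The order flow described above could be drafted in Amazon States Language roughly as follows, written here as a Python dict so it can be serialized with `json.dumps`. The state names, the placeholder Lambda ARNs, and the `task` helper are hypothetical, and the compensation chain (release inventory, then refund payment) is one plausible reading of "explicit compensation paths", not a prescribed design.

```python
import json
from typing import Any, Dict, Optional

ARN = "arn:aws:lambda:us-east-1:123456789012:function:"  # placeholder account and region


def task(resource: str, next_state: Optional[str] = None,
         on_error: Optional[str] = None) -> Dict[str, Any]:
    """Build a Task state, optionally routing any error to a compensation state."""
    state: Dict[str, Any] = {"Type": "Task", "Resource": resource}
    if on_error:
        state["Catch"] = [{"ErrorEquals": ["States.ALL"], "Next": on_error}]
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


ORDER_SAGA: Dict[str, Any] = {
    "Comment": "Order saga sketch with compensation paths",
    "StartAt": "CapturePayment",
    "States": {
        "CapturePayment": task(ARN + "capture-payment", "ReserveInventory", "OrderFailed"),
        "ReserveInventory": task(ARN + "reserve-inventory", "CreateShipment", "RefundPayment"),
        "CreateShipment": task(ARN + "create-shipment", "OrderCompleted", "ReleaseInventory"),
        "OrderCompleted": {"Type": "Succeed"},
        # Compensation runs in reverse order of the steps that succeeded.
        "ReleaseInventory": task(ARN + "release-inventory", "RefundPayment"),
        "RefundPayment": task(ARN + "refund-payment", "OrderFailed"),
        "OrderFailed": {"Type": "Fail", "Error": "OrderSagaFailed"},
    },
}

definition_json = json.dumps(ORDER_SAGA, indent=2)
```

Each failure path is visible in the definition itself, which is what makes the state machine usable as the artifact for incident review.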
How do we evolve event schemas without breaking consumers?
Register every event in EventBridge Schema Registry; treat the schema as a contract; only make additive changes (new optional fields). Breaking changes require a new event-type version (v1 → v2) and a transition period where the producer publishes both. Consumer CI tests against the schema registry catch the common mistake — a producer accidentally renaming a field — before deploy.
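The "additive changes only" rule can be approximated in CI with a check like the one below. It is a simplified sketch that compares two JSON-Schema-style dicts at the top level only (no nested objects); the `breaking_changes` function is illustrative and not part of the Schema Registry API, and fetching the registered versions is left out.

```python
from typing import Any, Dict, List


def breaking_changes(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """List the changes between two schema versions that would break consumers."""
    problems: List[str] = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    for name, spec in old_props.items():
        if name not in new_props:
            problems.append(f"removed field: {name}")
        elif new_props[name].get("type") != spec.get("type"):
            problems.append(f"type changed: {name}")
    old_required = set(old.get("required", []))
    for name in new.get("required", []):
        if name not in old_required:
            # A new required field breaks producers and replays of old events.
            problems.append(f"new required field: {name}")
    return problems
```

A renamed field shows up as a removal plus an addition, so the accidental-rename case fails the build; adding a new optional field returns an empty list and passes.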
Where does Aurora DSQL fit?
Aurora DSQL (GA May 2025) is the answer when an event-driven service genuinely needs multi-region active-active SQL state — for example, a global SaaS where any region can serve writes for any tenant. Use it as the source-of-truth store for that service; the outbox pattern still applies, just sourced from DSQL CDC. For services that do not need multi-region writes, DynamoDB or single-region Aurora remain the simpler defaults.
Want this pattern deployed end-to-end?
Our team builds these patterns in production for SaaS, healthcare, fintech, and enterprise customers. Tell us your constraints and we'll scope the engagement.