---
title: Event-driven microservices on AWS — EventBridge, Pipes, and the Outbox Pattern
description: Production event-driven architecture on AWS — EventBridge custom buses, EventBridge Pipes for the transactional outbox, SQS dead-letter queues, Step Functions for orchestration, and Lambda or Fargate workers. Decouple services without dual-writes.
url: https://www.factualminds.com/patterns/event-driven-microservices/
category: serverless
publishDate: 2026-05-01
updateDate: 2026-05-01
---

# Event-driven microservices on AWS — EventBridge, Pipes, and the Outbox Pattern

> Event-driven architecture on AWS without the silent dropped events that surface a month later as inventory or billing reconciliation work — transactional outbox via EventBridge Pipes, schema discipline on the bus, and a DLQ-and-replay path operators can run without paging an architect.

## Why this pattern

Event-driven microservices on AWS go wrong in two predictable ways. The first is the **dual-write** — the service writes to its database, then publishes to SNS or EventBridge, and the publish silently fails some fraction of the time. The second is **bus archaeology** — six months in, nobody knows what events live on the bus, what shape they take, or which consumers depend on which fields, so any service migration becomes a multi-week trace.

The pattern below addresses both directly. The transactional outbox via EventBridge Pipes eliminates the dual-write. The Schema Registry plus per-consumer SQS queues and explicit DLQs eliminate the archaeology. Step Functions absorbs the multi-step workflows that should never have been choreography in the first place.

## Choosing the orchestration shape

| Workload                                      | Raw EventBridge | EventBridge + Step Functions | MSK / Kafka |
| --------------------------------------------- | --------------- | ---------------------------- | ----------- |
| Fire-and-forget fan-out                       | ✅              |                              |             |
| Multi-step workflow with state                |                 | ✅                           |             |
| Cross-domain events between business contexts | ✅              |                              |             |
| Replay-from-offset analytics ingestion        |                 |                              | ✅          |
| High-throughput log aggregation               |                 |                              | ✅          |
| Saga / compensating transaction               |                 | ✅                           |             |
| Real-time CDC into a lakehouse                | Pipes           |                              | ✅          |

## What the failure modes look like (and how this design handles them)

- **Dual-write between database and bus** → eliminated by the outbox + Pipes.
- **Lost events from a bad consumer deploy** → caught by per-consumer SQS + DLQ; replayed once the consumer is fixed.
- **Schema drift between producer and consumer** → caught by Schema Registry + CI; consumers fail at build time, not in production.
- **Choreography that nobody can debug** → replaced with Step Functions where the workflow is genuinely stateful.
- **Idempotency violations on retry** → every event carries an idempotency key; consumers MUST treat retries as safe.
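
The idempotency rule in the last bullet can be sketched as a thin wrapper around a handler. This is a minimal illustration, not a production implementation: the `idempotencyKey` field name and the `IdempotentConsumer` class are hypothetical, and the in-memory set stands in for a durable store such as a conditional database write.

```python
from typing import Callable


class IdempotentConsumer:
    """Skips events whose idempotency key has already been processed."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        # Assumption: in-memory stand-in for a durable store. A real consumer
        # would persist keys so that duplicates are caught across restarts.
        self._seen: set[str] = set()

    def handle(self, event: dict) -> bool:
        """Return True if the handler ran, False if the event was a duplicate."""
        key = event["detail"]["idempotencyKey"]  # hypothetical field name
        if key in self._seen:
            return False
        self._handler(event)
        # Record the key only after the handler succeeds, so a failed attempt
        # is retried instead of being silently skipped.
        self._seen.add(key)
        return True
```

Recording the key after the handler runs favors a repeated attempt over a lost one, which is only safe when the handler's own side effects are idempotent as well.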

## Where this pattern shows up in our consulting

We deploy event-driven architectures most often in [AWS Serverless](/services/aws-serverless/) and [Architecture Review](/services/aws-architecture-review/) engagements at growing SaaS companies, usually when a monolith-to-microservices migration is underway and the team has felt the pain of the dual-write firsthand. The lakehouse pattern composes downstream: a bus rule delivers domain events from EventBridge to Kinesis Data Streams, the events land in S3 Tables, and they become the analytics source of truth. See [Lakehouse on AWS](/patterns/lakehouse-on-aws/) for that side of the design.

## Problem

Dual-writes fail silently 1–5% of the time in production. A service writes to its database and then publishes to SNS or EventBridge in the same request; the publish step fails; the event is gone. A month in, the team is debugging missing inventory updates with no audit trail. Six months in, the bus carries a hundred undocumented event shapes, and any service migration requires an archaeology dig through six handlers and a custom retry queue.

## Solution

Use the transactional outbox pattern on top of AWS-native primitives: write the state change, together with the outbox record describing it, to your service's database in a single transaction, then let EventBridge Pipes pull from DynamoDB Streams (or, for relational stores, from a Kinesis data stream fed by DMS CDC) and deliver to a dedicated EventBridge custom bus. Treat the bus as a versioned schema contract via the EventBridge Schema Registry, route every consumer through SQS for durable retry, and orchestrate multi-step workflows with Step Functions instead of choreographing them on the bus.

## AWS Services

- **Amazon EventBridge (custom bus)** — Application event bus — one custom bus per business domain; rule-based routing to consumers; carries versioned event schemas registered in Schema Registry
- **Amazon EventBridge Pipes** — Source-to-target plumbing without Lambda glue — pulls from DynamoDB Streams, Kinesis, SQS, MSK, or Amazon MQ and lands events on the bus with optional filtering and enrichment
- **Amazon EventBridge Schema Registry** — Versioned event-schema catalog — every producer registers; consumers generate strongly-typed bindings; schema evolution rules, enforced in CI, prevent breaking changes
- **Amazon SQS (per-consumer queue + DLQ)** — Durable buffer between the bus and each consumer — controls consumer concurrency, isolates noisy neighbors, and captures poison messages in a per-consumer DLQ
- **AWS Lambda** — Default consumer compute — short, idempotent handlers; partial-batch-failure on SQS; auto-scaling without operator intervention
- **Amazon ECS Fargate** — Long-running consumer for handlers that need more than 15 minutes of execution (Lambda's maximum timeout), sticky in-memory state, or libraries that do not cold-start well on Lambda
- **AWS Step Functions** — Orchestration plane for multi-step workflows — replaces choreography-on-the-bus when the workflow has a defined state machine; handles retries, timeouts, and compensation
- **Amazon DynamoDB** — Source-of-truth store with streams enabled — the outbox source for EventBridge Pipes when the service is DynamoDB-backed
- **AWS DMS (CDC) or Amazon Aurora DSQL** — Outbox source for relational stores — DMS streams CDC into Kinesis Data Streams, which Pipes reads as a source; Aurora DSQL provides active-active SQL when the service needs multi-region writes
- **Amazon CloudWatch + AWS X-Ray** — End-to-end tracing across the bus, consumers, and downstream services — required for debugging the failure modes that event-driven systems make subtle

## Components

### Transactional outbox
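The service writes its state change, together with the outbox record describing it, to its database in a single transaction. DynamoDB Streams captures the change directly; for relational stores, DMS CDC writes it to a Kinesis data stream. EventBridge Pipes pulls from the stream and lands the event on the bus. No dual-write, no lost events.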
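
For a DynamoDB-backed service, the single-transaction write might look like the sketch below. It only builds the request for DynamoDB's `TransactWriteItems` operation. The single-table key layout (`pk`, `sk`), the `entity` marker attribute, and the `OrderPlaced.v1` detail type are assumptions made for illustration.

```python
import json
import uuid


def build_outbox_transaction(table: str, order_id: str, total_cents: int) -> dict:
    """Build a TransactWriteItems request that writes state and outbox together."""
    event_id = str(uuid.uuid4())
    detail = {
        "orderId": order_id,
        "totalCents": total_cents,
        "idempotencyKey": event_id,  # hypothetical field name
    }
    return {
        "TransactItems": [
            # 1. The business state change.
            {"Put": {"TableName": table, "Item": {
                "pk": {"S": f"ORDER#{order_id}"},
                "sk": {"S": "STATE"},
                "totalCents": {"N": str(total_cents)},
            }}},
            # 2. The outbox record, committed atomically with the state change.
            {"Put": {"TableName": table, "Item": {
                "pk": {"S": f"ORDER#{order_id}"},
                "sk": {"S": f"OUTBOX#{event_id}"},
                "entity": {"S": "OUTBOX"},  # assumed marker for stream filtering
                "detailType": {"S": "OrderPlaced.v1"},
                "detail": {"S": json.dumps(detail)},
            }}},
        ]
    }
```

With boto3 available, the result would be passed as `boto3.client("dynamodb").transact_write_items(**params)`. Either both items commit or neither does, so the stream never sees a state change without its event.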

### Event bus per domain
One custom EventBridge bus per business domain — orders, billing, identity. Each bus has its own rules, its own consumers, and its own schema namespace; cross-domain events require an explicit cross-bus rule, which is also the natural audit point.

### Per-consumer SQS queue + DLQ
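A bus rule lands events on a consumer-specific SQS queue with a configured DLQ. Consumer Lambdas or Fargate workers poll the queue with partial-batch-failure semantics, so only the failed messages in a batch are retried. Poison messages move to the DLQ once they exceed the queue's maximum receive count, with the original event preserved for replay.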
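
A Lambda consumer with partial-batch-failure semantics could be sketched as follows. It assumes the event source mapping has `ReportBatchItemFailures` enabled and that each SQS message body is the EventBridge envelope delivered by the rule. The `process` function and its `orderId` check are hypothetical business logic.

```python
import json


def process(detail: dict) -> None:
    """Hypothetical business logic; raises on a message it cannot handle."""
    if "orderId" not in detail:
        raise ValueError("missing orderId")


def handler(event: dict, context: object) -> dict:
    """Process an SQS batch and report only the messages that failed."""
    failures = []
    for record in event["Records"]:
        try:
            envelope = json.loads(record["body"])  # EventBridge event envelope
            process(envelope["detail"])
        except Exception:
            # Report this message only; the rest of the batch is not retried.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Messages listed in `batchItemFailures` become visible again on the queue and, after the configured number of receives, move to the DLQ.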

### Schema contract
Every event is registered in Schema Registry. CI fails the build if a producer publishes an unregistered schema or breaks a field that a consumer depends on, and evolution rules enforce backward-compatible changes.
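
The CI gate could be approximated with a check like the one below. This is a deliberately simplified sketch over JSON-Schema-style dictionaries (top-level `properties` and `required` only). The `breaking_changes` function is hypothetical and not part of any AWS SDK.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List the ways `new` breaks consumers that were built against `old`."""
    problems = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    for name, spec in old_props.items():
        if name not in new_props:
            # Covers renames too: a renamed field looks like a removal.
            problems.append(f"removed field: {name}")
        elif new_props[name].get("type") != spec.get("type"):
            problems.append(f"type changed: {name}")
    for name in new.get("required", []):
        if name not in old.get("required", []):
            # Only new optional fields count as additive under this policy.
            problems.append(f"new required field: {name}")
    return problems
```

A CI job would fail the build whenever the returned list is non-empty, and a genuinely breaking change would ship as a new event-type version instead.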

### Step Functions for orchestration
When a workflow is multi-step with state — order saga, document processing, multi-tenant onboarding — write it as a Step Functions state machine that publishes domain events on transitions instead of trying to choreograph the workflow across loose handlers.
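
As an illustration, an order saga with one compensation path might be expressed in Amazon States Language like this. The state names and the keys of the `arns` dictionary are hypothetical, and the sketch omits the event-publishing steps for brevity.

```python
import json


def order_saga_definition(arns: dict) -> dict:
    """Return an Amazon States Language definition for a small order saga."""
    return {
        "Comment": "Order saga sketch with one compensation path",
        "StartAt": "CapturePayment",
        "States": {
            "CapturePayment": {
                "Type": "Task",
                "Resource": arns["capture_payment"],
                # Retry transient task failures before giving up.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "Next": "ReserveInventory",
            },
            "ReserveInventory": {
                "Type": "Task",
                "Resource": arns["reserve_inventory"],
                # Compensation: any failure here refunds the captured payment.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RefundPayment"}],
                "Next": "OrderCompleted",
            },
            "RefundPayment": {
                "Type": "Task",
                "Resource": arns["refund_payment"],
                "Next": "OrderFailed",
            },
            "OrderCompleted": {"Type": "Succeed"},
            "OrderFailed": {
                "Type": "Fail",
                "Error": "OrderSagaFailed",
                "Cause": "Inventory reservation failed; payment refunded",
            },
        },
    }
```

Serializing the result with `json.dumps` gives the definition string a state machine expects. The compensation path is visible in the definition itself, which is what makes it useful in incident review.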

### Replay tooling
DLQ → operator-triggered replay back to the queue or directly to the bus, with idempotency keys on every event; consumers ALWAYS handle retries and replays correctly because they design for them, not because they get lucky.
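
One way to implement the DLQ-to-queue replay is the SQS `StartMessageMoveTask` operation, sketched below. The `start_dlq_replay` wrapper and its default rate are assumptions. The client is passed in so the function can run against a boto3 SQS client or a test double.

```python
def start_dlq_replay(sqs_client, dlq_arn: str, queue_arn: str,
                     max_per_second: int = 50) -> str:
    """Start moving messages from a DLQ back to the consumer queue.

    `sqs_client` is expected to behave like boto3.client("sqs").
    Returns the task handle, which identifies the move task afterwards.
    """
    response = sqs_client.start_message_move_task(
        SourceArn=dlq_arn,            # must be a queue configured as a DLQ
        DestinationArn=queue_arn,     # the consumer's main queue
        # Throttle the replay so the fixed consumer is not flooded.
        MaxNumberOfMessagesPerSecond=max_per_second,
    )
    return response["TaskHandle"]
```

Replayed messages reach the consumer exactly like first deliveries, which is why the idempotency keys matter: a message that partially succeeded before landing in the DLQ will be seen again.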

## Trade-offs

- **Pro:** EventBridge Pipes eliminates the Lambda-glue layer that previously sat between every CDC source and the bus — fewer cold starts, fewer stack frames, fewer places for the dual-write to silently fail.
- **Con:** Pipes adds another AWS-specific concept to learn; teams already deep on Kafka with Debezium often prefer to keep the Debezium pattern even on AWS. Both are defensible; pick one and stick with it.

- **Pro:** Per-consumer SQS queues with explicit DLQs make the failure modes legible — every consumer has one place to look when events 'disappear', and replay is a single CLI command, not a forensic exercise.
- **Con:** Per-consumer queues multiply the AWS resource count and the IaC surface. For small services with few consumers, the queue-per-consumer pattern can feel heavy; for systems past about three consumers per bus, it pays for itself within a quarter.

- **Pro:** Step Functions for orchestration keeps multi-step workflows debuggable — the state machine visualization is the single best post-incident review artifact in AWS for event-driven workloads.
- **Con:** Step Functions Standard pricing is per state transition. A 10,000-event fan-out, at one state transition per event, costs roughly $0.25 on Step Functions versus about $0.01 on raw EventBridge + SQS. Use Step Functions where the workflow has explicit state; do not use it as a generic event router.

## Cost Estimate

EventBridge custom-bus events are $1 per million and SQS standard requests are $0.40 per million. Lambda invocation charges are small next to the data-egress and state-storage costs in DynamoDB or Aurora, which drive most of a service's bill. A typical mid-size SaaS event-driven workload (50M events/month across 5 domains and 12 consumers) lands at $1,500–4,000/month in event-plane spend, which is small relative to the underlying compute and storage. Step Functions Standard at 10M state transitions/month is around $250; Express workflows are dramatically cheaper for high-throughput orchestration. The expensive failure mode is operational, not financial: uncaught dual-writes generate inventory or billing reconciliation work that costs engineer-weeks to clean up.
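
The per-unit prices quoted above can be turned into a small calculator for sanity-checking an estimate. The Step Functions rate is inferred from the figure of roughly $250 per 10M transitions. Treat all three rates as assumptions that vary by region and change over time, and note that this covers only the per-request charges, not the full event-plane spend.

```python
# Assumed list prices in USD per million units, taken from the text above.
EVENTBRIDGE_USD_PER_MILLION = 1.00
SQS_USD_PER_MILLION = 0.40
STEP_FUNCTIONS_USD_PER_MILLION = 25.00  # implied by ~$250 per 10M transitions


def usd(units: int, usd_per_million: float) -> float:
    """Cost in USD for `units` billed at a per-million rate."""
    return units * usd_per_million / 1_000_000


def bus_and_queue_usd(events: int, sqs_requests_per_event: int) -> float:
    """EventBridge ingestion plus the SQS requests each event generates."""
    bus = usd(events, EVENTBRIDGE_USD_PER_MILLION)
    queue = usd(events * sqs_requests_per_event, SQS_USD_PER_MILLION)
    return bus + queue


def step_functions_usd(transitions: int) -> float:
    """Step Functions Standard cost for a number of state transitions."""
    return usd(transitions, STEP_FUNCTIONS_USD_PER_MILLION)
```

The `sqs_requests_per_event` parameter matters more than it looks: each delivered message typically costs a send, a receive, and a delete, multiplied by the number of consumer queues the event fans out to.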

## Related Patterns

- lakehouse-on-aws
- multi-tenant-saas-on-aws

## FAQ

### When should we use EventBridge versus MSK (managed Kafka)?
EventBridge for application events between AWS services, especially when consumers are heterogeneous (Lambda, Fargate, Step Functions, third-party SaaS via API destinations). MSK for high-throughput streams whose consumers need replay-from-offset semantics: analytics ingest, log aggregation, and change-data-capture pipelines that feed lakes. Many teams run both, with EventBridge as the application-event plane and MSK as the data-pipeline plane. Picking one for everything is the most common mistake.

### Do we still need Lambda glue between sources and EventBridge?
Almost never as of 2026. EventBridge Pipes covers the source-to-bus plumbing for DynamoDB Streams, Kinesis, SQS, MSK, and Amazon MQ without writing a Lambda; DMS CDC reaches Pipes by targeting a Kinesis data stream. The Pipes filter and enrichment steps cover most of the transformation cases the old Lambda glue handled. You write a Lambda only when the enrichment is genuinely service-specific.
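
As a rough sketch of what replaces the glue Lambda, the parameters for a pipe from a DynamoDB stream to a custom bus might look like this. The parameter names follow the Pipes `CreatePipe` API as we understand it and should be checked against the current reference. The `entity` marker attribute, the `orders.service` source, and the `OrderPlaced.v1` detail type are assumptions.

```python
import json


def outbox_pipe_params(name: str, role_arn: str, stream_arn: str, bus_arn: str) -> dict:
    """Build CreatePipe parameters for a DynamoDB-stream-to-bus outbox relay."""
    # Forward only newly inserted items that carry the assumed outbox marker.
    pattern = {
        "eventName": ["INSERT"],
        "dynamodb": {"NewImage": {"entity": {"S": ["OUTBOX"]}}},
    }
    return {
        "Name": name,
        "RoleArn": role_arn,
        "Source": stream_arn,
        "SourceParameters": {
            "FilterCriteria": {"Filters": [{"Pattern": json.dumps(pattern)}]},
            "DynamoDBStreamParameters": {
                "StartingPosition": "LATEST",
                "BatchSize": 10,
            },
        },
        "Target": bus_arn,
        "TargetParameters": {
            "EventBridgeEventBusParameters": {
                "Source": "orders.service",      # hypothetical producer name
                "DetailType": "OrderPlaced.v1",  # hypothetical versioned type
            },
        },
    }
```

With boto3 available, the result would be passed as `boto3.client("pipes").create_pipe(**params)`. Without an input template the event detail is the raw stream record, so a real pipe would usually add a transformation that reshapes it into the registered event schema.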

### How do we handle ordering and exactly-once semantics?
Most application events do not need ordering. EventBridge and standard SQS queues deliver at least once, so effectively-once processing comes from idempotent handlers (every event carries an idempotency key), and partial-batch-failure on SQS limits retries to the messages that actually failed. When ordering is genuinely required (per-account ledger, financial transactions), use FIFO SQS queues with the appropriate group key, or move that specific stream onto MSK with a partitioning strategy. Designing every consumer for ordering when it does not need it is the most common over-engineering mistake on EventBridge.

### What about Step Functions versus orchestration on the bus?
If the workflow has a state machine you can draw — order placed → payment captured → inventory reserved → shipment created → order completed, with explicit compensation paths — write it as a Step Functions state machine. The state machine becomes the canonical artifact for incident review and onboarding. Reserve raw bus choreography for fan-out events that have no defined sequence (a NewUserCreated event triggering ten independent welcome workflows).

### How do we evolve event schemas without breaking consumers?
Register every event in EventBridge Schema Registry; treat the schema as a contract; only make additive changes (new optional fields). Breaking changes require a new event-type version (v1 → v2) and a transition period where the producer publishes both. Consumer CI tests against the schema registry catch the common mistake — a producer accidentally renaming a field — before deploy.

### Where does Aurora DSQL fit?
Aurora DSQL (GA May 2025) is the answer when an event-driven service genuinely needs multi-region active-active SQL state, for example a global SaaS where any region can serve writes for any tenant. Use it as the source-of-truth store for that service; the outbox pattern still applies, with the outbox table living in DSQL and its rows relayed to the bus through whichever change-capture path DSQL supports. For services that do not need multi-region writes, DynamoDB or single-region Aurora remain the simpler defaults.

---

*Source: https://www.factualminds.com/patterns/event-driven-microservices/*
