Kafka on MSK: Partition Rebalancing and Exactly-Once Semantics
Quick summary: Consumer group rebalance storms stall processing longer than broker outages. This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK, and notes when SQS FIFO is the simpler choice.
Key Takeaways
- Amazon MSK (June 2026) runs managed Kafka brokers with IAM auth and tiered storage options—but consumer group protocol behavior is still Kafka
- A rolling EKS deploy that replaces 30 consumers at once triggers stop-the-world rebalance unless you configure cooperative assignment
- Benchmark pattern (24-partition topic, 24 consumers, rolling replace): the eager assignor paused processing for 38 s; with cooperative-sticky, the longest partition move took under 6 s
- Enabling EOS added ~8% producer latency compared with idempotent-only
Amazon MSK (June 2026) runs managed Kafka brokers with IAM auth and tiered storage options, but consumer group protocol behavior is still Kafka. A rolling EKS deploy that replaces 30 consumers at once triggers a stop-the-world rebalance unless you configure cooperative assignment.
Benchmark pattern (24-partition topic, 24 consumers, rolling replace): the eager assignor paused processing for 38 s; with cooperative-sticky, the longest partition move took under 6 s. Enabling EOS added ~8% producer latency compared with idempotent-only. Artifact: examples/engineering-guides/kafka-msk-partition-rebalancing-exactly-once-semantics/.
Partition rebalancing internals
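- A consumer misses heartbeats for longer than session.timeout.ms → the group coordinator marks the member dead.
- Partitions are revoked and reassigned (eager = every member gives up all of its partitions at once).
- In-flight messages must finish processing before the next poll; otherwise the consumer exceeds max.poll.interval.ms and is removed from the group.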
AWS tip: Run MSK consumers on EKS with a PodDisruptionBudget (PDB; see the Kubernetes track) so Kubernetes does not kill half the group simultaneously.
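To make the difference between the two protocols concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not the Kafka assignor implementation: the function names are hypothetical, the "cooperative" path only re-homes partitions whose owner left the group, and it ignores the extra balancing that the real CooperativeStickyAssignor performs when members join without anyone leaving.

```python
from typing import Dict, List, Tuple

Assignment = Dict[str, List[int]]


def round_robin(members: List[str], partitions: List[int]) -> Assignment:
    """Spread partitions over members in round-robin order."""
    assignment: Assignment = {m: [] for m in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


def eager_revoked(before: Assignment) -> int:
    """Eager protocol: every member gives up every partition before reassignment."""
    return sum(len(owned) for owned in before.values())


def cooperative_rebalance(before: Assignment, members: List[str]) -> Tuple[Assignment, int]:
    """Simplified sticky behavior: survivors keep what they own, only orphans move."""
    after: Assignment = {m: list(before.get(m, [])) for m in members}
    # Partitions whose previous owner is no longer in the group.
    orphaned = sorted(p for m, owned in before.items() if m not in after for p in owned)
    for partition in orphaned:
        # Hand each orphan to the currently least-loaded member.
        target = min(members, key=lambda m: len(after[m]))
        after[target].append(partition)
    return after, len(orphaned)


if __name__ == "__main__":
    old_members = [f"c{i}" for i in range(24)]
    before = round_robin(old_members, list(range(24)))
    # One pod is replaced during a rolling deploy: c0 leaves, c24 joins.
    new_members = old_members[1:] + ["c24"]
    after, moved = cooperative_rebalance(before, new_members)
    print(f"eager revokes {eager_revoked(before)} partitions; cooperative moves {moved}")
```

In this model, replacing a single consumer in a 24-partition, 24-consumer group makes the eager protocol revoke all 24 partitions, while the cooperative path moves only the one partition that lost its owner. The other 23 consumers keep processing throughout.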
Exactly-once semantics (EOS)
Requires:
- enable.idempotence=true on producer
- transactional.id unique per producer instance
- Consumer isolation.level=read_committed
- Idempotent sink still required: EOS is broker-to-consumer, not into your database without dedupe keys
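The idempotent-sink requirement can be sketched with SQLite from the Python standard library. This is a minimal illustration, not a prescribed schema: the table name, the column names, and the choice of (topic, partition, offset) as the dedupe key are all assumptions made for the example. A business key such as an order ID may be a better fit for a real sink.

```python
import sqlite3


def open_sink(path: str = ":memory:") -> sqlite3.Connection:
    """Create the sink table; the primary key doubles as the dedupe key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            topic        TEXT    NOT NULL,
            partition_id INTEGER NOT NULL,
            msg_offset   INTEGER NOT NULL,
            payload      TEXT    NOT NULL,
            PRIMARY KEY (topic, partition_id, msg_offset)
        )
        """
    )
    return conn


def write_once(conn: sqlite3.Connection, topic: str, partition_id: int,
               msg_offset: int, payload: str) -> bool:
    """Insert a record; return False if the same key was already written."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (topic, partition_id, msg_offset, payload) "
            "VALUES (?, ?, ?, ?)",
            (topic, partition_id, msg_offset, payload),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    sink = open_sink()
    print(write_once(sink, "orders", 3, 42, '{"id": 1}'))  # first delivery
    print(write_once(sink, "orders", 3, 42, '{"id": 1}'))  # redelivery after a rebalance
```

A redelivered record hits the primary key and is ignored, so replaying a batch after a rebalance or consumer restart leaves the table unchanged.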
AWS services map
| Need | Service | Skip when |
|---|---|---|
| Ordered log | MSK | <1k msgs/s, simple queue OK |
| Simpler ordering | SQS FIFO | Need log replay / compaction |
| Stream fan-out | Kinesis | Kafka ecosystem not required |
When this advice breaks
- Cross-Region active-active: MirrorMaker 2 replication between MSK clusters lags, and EOS does not span clusters without careful offset mapping.
- RabbitMQ workloads: different protocol; see the Amazon MQ guide in this track.
What to do this week
- Set partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor.
- Enable idempotent producer; add EOS only after sink dedupe proven.
- Alarm on kafka.consumer.group.rebalance rate and consumer lag p99.
- Document rebalance behavior in GameDay runbooks.
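For the lag alarm, the sketch below shows one way to derive a p99 from sampled lag values. It rests on several assumptions: the helper names are hypothetical, lag is taken as log end offset minus committed offset, a partition with no committed offset is treated as fully lagging, and the percentile uses the nearest-rank method. A real setup would more likely take these values from an existing metrics pipeline than compute them by hand.

```python
from typing import Dict, List


def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: log end offset minus committed offset, never negative."""
    # Assumption: a partition with no committed offset counts as lagging from 0.
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


def percentile(samples: List[int], pct: int) -> int:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max((pct * len(ordered) + 99) // 100, 1)
    return ordered[rank - 1]


def should_alarm(lag_samples: List[int], threshold: int) -> bool:
    """Fire when the p99 of the sampled lag exceeds the threshold."""
    return percentile(lag_samples, 99) > threshold


if __name__ == "__main__":
    lag = partition_lag({0: 100, 1: 250}, {0: 90, 1: 250})
    print(lag)
    print(should_alarm(list(lag.values()), threshold=1000))
```

Alarming on p99 rather than on the maximum means a single outlier sample among a hundred does not page anyone, while sustained lag on a stuck partition does.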
What this guide doesn’t cover
SQS/EventBridge patterns: see the canonical posts in parts 3–5 of this track.