Kafka on MSK: Partition Rebalancing and Exactly-Once Semantics

Quick summary: Consumer group rebalance storms can stall processing longer than broker outages. This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK, and notes when SQS FIFO is the simpler choice.

Key Takeaways

  • This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK, and notes when SQS FIFO is the simpler choice.
  • Amazon MSK (as of June 2026) runs managed Kafka brokers with IAM auth and tiered storage options, but the consumer group protocol behaves as it does in any other Kafka cluster.
  • A rolling EKS deploy that replaces 30 consumers at once triggers a stop-the-world rebalance unless you configure cooperative assignment.
  • Benchmark pattern (24-partition topic, 24 consumers, rolling replace): the eager assignor paused processing for 38 s; with cooperative-sticky, the longest partition move took under 6 s.
  • Enabling EOS added about 8% producer latency compared with an idempotent-only producer.

Amazon MSK (as of June 2026) runs managed Kafka brokers with IAM auth and tiered storage options, but the consumer group protocol behaves as it does in any other Kafka cluster. A rolling EKS deploy that replaces 30 consumers at once triggers a stop-the-world rebalance unless you configure cooperative assignment.

Benchmark pattern (24-partition topic, 24 consumers, rolling replace): the eager assignor paused processing for 38 s; with cooperative-sticky, the longest partition move took under 6 s. Enabling EOS added about 8% producer latency compared with an idempotent-only producer. Artifact: examples/engineering-guides/kafka-msk-partition-rebalancing-exactly-once-semantics/.
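To make the difference between the two assignors concrete, here is a minimal Python sketch of the rolling-replace scenario. It is a toy model, not the Kafka client: `sticky_reassign`, `paused_eager`, and `paused_cooperative` are hypothetical helpers, the consumer names are made up, and the model counts partitions that stop being processed rather than measuring time.

```python
from typing import Dict, List


def sticky_reassign(old: Dict[int, str], members: List[str]) -> Dict[int, str]:
    """Keep each partition with its current owner if that owner is still in
    the group; give orphaned partitions to the least-loaded member."""
    new = {p: owner for p, owner in old.items() if owner in members}
    load = {m: 0 for m in members}
    for owner in new.values():
        load[owner] += 1
    for p in sorted(old):
        if p not in new:
            target = min(members, key=lambda m: (load[m], m))
            new[p] = target
            load[target] += 1
    return new


def paused_eager(old: Dict[int, str]) -> int:
    """Eager protocol: every member revokes all partitions before reassignment."""
    return len(old)


def paused_cooperative(old: Dict[int, str], new: Dict[int, str]) -> int:
    """Cooperative protocol: only partitions that change owner are revoked."""
    return sum(1 for p in old if old[p] != new[p])


# 24 partitions, 24 consumers, one consumer replaced by a fresh pod.
old = {p: f"consumer-{p}" for p in range(24)}
members = [f"consumer-{i}" for i in range(1, 24)] + ["consumer-24"]
new = sticky_reassign(old, members)

print(paused_eager(old), paused_cooperative(old, new))  # prints "24 1"
```

In this model a single pod replacement pauses all 24 partitions under eager assignment but only the one orphaned partition under a sticky, cooperative assignment. The real cooperative protocol takes more than one rebalance round to move a partition, which this sketch does not attempt to reproduce.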

Partition rebalancing internals

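  1. A consumer misses heartbeats for longer than session.timeout.ms → the group coordinator marks the member dead.
  2. Partitions are revoked and reassigned. With the eager protocol, every member gives up all of its partitions at once; with the cooperative protocol, only the partitions that change owner are revoked.
  3. Records returned by the last poll must finish processing before max.poll.interval.ms elapses; otherwise the consumer is treated as failed, leaves the group, and triggers another rebalance.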
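Step 3 can be turned into a quick budget check. The sketch below assumes records are processed one at a time with a roughly known worst-case cost per record; `poll_budget_ok` and `max_safe_records` are hypothetical helper names, and the 50% headroom is an arbitrary illustrative choice. The defaults mirror the Kafka consumer defaults of max.poll.records=500 and max.poll.interval.ms=300000.

```python
def poll_budget_ok(max_poll_records: int,
                   per_record_ms: float,
                   max_poll_interval_ms: int = 300_000) -> bool:
    """True if a full batch can be processed before the poll deadline."""
    worst_case_ms = max_poll_records * per_record_ms
    return worst_case_ms < max_poll_interval_ms


def max_safe_records(per_record_ms: float,
                     max_poll_interval_ms: int = 300_000,
                     headroom: float = 0.5) -> int:
    """Largest max.poll.records that uses at most `headroom` of the interval."""
    return int(max_poll_interval_ms * headroom // per_record_ms)


print(poll_budget_ok(500, 100))   # 50 s of work against a 300 s deadline
print(poll_budget_ok(500, 700))   # 350 s of work against a 300 s deadline
print(max_safe_records(700))      # batch size that fits in half the interval
```

If the check fails, the usual levers are a smaller max.poll.records, a larger max.poll.interval.ms, or faster per-record processing.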

AWS tip: Run MSK consumers on EKS with a PodDisruptionBudget (PDB; see the Kubernetes track) so that Kubernetes does not evict half the group simultaneously.
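As a rough illustration of the tip above, the following Python sketch builds a PodDisruptionBudget manifest as JSON, which kubectl accepts alongside YAML. The resource name `msk-consumer-pdb`, the label `app: msk-consumer`, and the choice of `maxUnavailable: 1` are assumptions for the example, not values from this guide.

```python
import json


def build_pdb(name: str, app_label: str, max_unavailable: int) -> dict:
    """Return a policy/v1 PodDisruptionBudget for pods labelled app=<app_label>."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # At most this many matching pods may be evicted at the same time.
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# Hypothetical names; adjust to match the consumer Deployment's labels.
pdb = build_pdb("msk-consumer-pdb", "msk-consumer", max_unavailable=1)
print(json.dumps(pdb, indent=2))
```

A PDB limits voluntary evictions such as node drains. The pace of a Deployment rolling update is controlled separately by the Deployment's own maxUnavailable and maxSurge settings, so both are worth reviewing together.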

Exactly-once semantics (EOS)

Requires:

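  • enable.idempotence=true on the producer
  • A transactional.id that is unique per producer instance and stable across restarts of that instance
  • isolation.level=read_committed on the consumer
  • An idempotent sink is still required: Kafka's EOS guarantee ends at the consumer, so writes into your database are not covered unless you dedupe on a key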
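The last requirement is the one most often skipped, so here is a minimal sketch of an idempotent sink. It assumes the dedupe key is the record's (topic, partition, offset) coordinates and uses an in-memory SQLite table as a stand-in for the real database; the table, column, and function names are hypothetical.

```python
import sqlite3


def open_sink() -> sqlite3.Connection:
    """Create a sink table whose primary key is the Kafka record coordinates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE events (
               topic           TEXT    NOT NULL,
               kafka_partition INTEGER NOT NULL,
               kafka_offset    INTEGER NOT NULL,
               payload         TEXT    NOT NULL,
               PRIMARY KEY (topic, kafka_partition, kafka_offset)
           )"""
    )
    return conn


def write_once(conn: sqlite3.Connection, topic: str, partition: int,
               offset: int, payload: str) -> bool:
    """Insert the record unless it was already written; True if newly written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
        (topic, partition, offset, payload),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_sink()
first = write_once(conn, "orders", 3, 1042, '{"id": 7}')
replay = write_once(conn, "orders", 3, 1042, '{"id": 7}')  # redelivery after a rebalance
rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(first, replay, rows)  # prints "True False 1"
```

A business key such as an order ID can serve the same purpose when the same logical event may arrive at different offsets, for example after replication to another cluster.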

AWS services map

Need             | Service  | Skip when
Ordered log      | MSK      | <1k msgs/s and a simple queue is OK
Simpler ordering | SQS FIFO | You need log replay or compaction
Stream fan-out   | Kinesis  | The Kafka ecosystem is required

When this advice breaks

  • Cross-Region active-active: MirrorMaker 2 replication between MSK clusters adds lag, and EOS does not span clusters without careful offset mapping.
  • RabbitMQ workloads: a different protocol; see the Amazon MQ guide in this track.

What to do this week

  1. Set partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor.
  2. Enable the idempotent producer; add EOS only after sink dedupe is proven.
  3. Alarm on the consumer group rebalance rate (kafka.consumer.group.rebalance) and on p99 consumer lag.
  4. Document rebalance behavior in GameDay runbooks.
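Steps 1 and 2 come down to a handful of client settings. The sketch below collects them as Python dictionaries and renders them in Java .properties form. The property keys are standard Kafka client configs; the helper names, the group ID `orders-workers`, and the transactional ID `orders-writer-0` are hypothetical.

```python
from typing import Dict, Optional


def consumer_props(group_id: str, read_committed: bool = False) -> Dict[str, str]:
    """Consumer settings for cooperative rebalancing, optionally transactional reads."""
    props = {
        "group.id": group_id,
        "partition.assignment.strategy":
            "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
    }
    if read_committed:
        props["isolation.level"] = "read_committed"
    return props


def producer_props(transactional_id: Optional[str] = None) -> Dict[str, str]:
    """Idempotent producer; pass a transactional_id only once sink dedupe is proven."""
    props = {"enable.idempotence": "true", "acks": "all"}
    if transactional_id is not None:
        props["transactional.id"] = transactional_id
    return props


def render_properties(props: Dict[str, str]) -> str:
    """Render settings as key=value lines, sorted for stable diffs."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


# Phase 1: cooperative assignor plus idempotent producer only.
print(render_properties(consumer_props("orders-workers")))
print(render_properties(producer_props()))

# Phase 2 (later): add transactions and read_committed.
print(render_properties(consumer_props("orders-workers", read_committed=True)))
print(render_properties(producer_props("orders-writer-0")))
```

Keeping the two phases as separate, reviewable config outputs makes it easier to roll out idempotence first and defer the transactional settings.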

What this guide doesn’t cover

SQS and EventBridge patterns. See the canonical posts in parts 3–5 of this track.

Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.

AWS Architecture, Cloud Migration, GenAI on AWS, Cost Optimization, DevOps
