---
title: Kafka on MSK: Partition Rebalancing and Exactly-Once Semantics
description: Consumer group rebalance storms stall processing longer than broker outages. This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK—with when SQS FIFO is simpler.
url: https://www.factualminds.com/blog/kafka-msk-partition-rebalancing-exactly-once-semantics/
datePublished: 2026-06-12T00:00:00.000Z
dateModified: 2026-06-12T00:00:00.000Z
author: Palaniappan P
category: Cloud Architecture
tags: engineering-guide, kafka, msk, messaging, aws
---

# Kafka on MSK: Partition Rebalancing and Exactly-Once Semantics

> Consumer group rebalance storms stall processing longer than broker outages. This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK, and notes when SQS FIFO is simpler.

**Amazon MSK (June 2026)** runs managed Kafka brokers with IAM auth and tiered storage options, but **consumer group protocol** behavior is still Kafka. An EKS deploy that replaces 30 consumers at once triggers a **stop-the-world rebalance** unless you configure cooperative assignment.

> **Benchmark pattern:** 24-partition topic, 24 consumers, rolling replace. The **eager** assignor paused processing for **38 s**; with **cooperative-sticky** the longest partition move took **<6 s**. Enabling EOS added **~8%** producer latency compared with idempotent-only. Artifact: `examples/engineering-guides/kafka-msk-partition-rebalancing-exactly-once-semantics/`.
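
Switching a running group from an eager assignor to the cooperative one is normally done in two rolling bounces, so that old and new members always share at least one assignor. The sketch below only renders the setting for each bounce. It assumes the group currently uses `RangeAssignor` and the Java client's properties format; the helper names `assignor_setting` and `render_properties` are hypothetical.

```python
# Assumption: the group runs the Java client and currently uses RangeAssignor.
EAGER = "org.apache.kafka.clients.consumer.RangeAssignor"
COOPERATIVE = "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"


def assignor_setting(bounce: int) -> str:
    """Return the partition.assignment.strategy value for rolling bounce 1 or 2."""
    if bounce == 1:
        # Both assignors listed: members on old and new config can still agree.
        return f"{EAGER},{COOPERATIVE}"
    if bounce == 2:
        # Every member now supports the cooperative assignor; drop the eager one.
        return COOPERATIVE
    raise ValueError("bounce must be 1 or 2")


def render_properties(bounce: int) -> str:
    """Render one line for a consumer.properties file (hypothetical helper)."""
    return f"partition.assignment.strategy={assignor_setting(bounce)}"


if __name__ == "__main__":
    for bounce in (1, 2):
        print(f"# rolling bounce {bounce}")
        print(render_properties(bounce))
```

Clients built on librdkafka use short names such as `cooperative-sticky` for the same setting, so the class names above apply to the Java client only.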

## Partition rebalancing internals

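1. A consumer misses heartbeats for `session.timeout.ms` (or leaves the group), and the group coordinator marks the member dead.
2. Partitions are revoked and reassigned. With an eager assignor, every member gives up all of its partitions at once.
3. In-flight records must finish processing before the next `poll()`; a consumer that exceeds `max.poll.interval.ms` is removed from the group, which triggers another rebalance.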
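
The difference between the two protocols can be shown with a toy model. This is not the real assignor logic (the actual cooperative protocol also runs a follow-up rebalance round); it only counts which partitions stop being processed when one member leaves. All function names are hypothetical.

```python
from typing import Dict, List

Assignment = Dict[str, List[int]]


def paused_partitions_eager(assignment: Assignment, leaving: str) -> List[int]:
    """Eager: every member revokes everything, so all partitions pause."""
    return sorted(p for parts in assignment.values() for p in parts)


def paused_partitions_cooperative(assignment: Assignment, leaving: str) -> List[int]:
    """Cooperative: only partitions that need a new owner pause."""
    return sorted(assignment.get(leaving, []))


def reassign_sticky(assignment: Assignment, leaving: str) -> Assignment:
    """Give the leaver's partitions to the least-loaded survivors; keep the rest."""
    survivors = {m: list(ps) for m, ps in assignment.items() if m != leaving}
    if not survivors:
        raise ValueError("no members left to take over partitions")
    for partition in sorted(assignment.get(leaving, [])):
        # Ties are broken by member name so the result is deterministic.
        target = min(survivors, key=lambda m: (len(survivors[m]), m))
        survivors[target].append(partition)
    return survivors


if __name__ == "__main__":
    group = {"c0": [0, 1], "c1": [2, 3], "c2": [4, 5]}
    print("eager pauses:      ", paused_partitions_eager(group, "c1"))
    print("cooperative pauses:", paused_partitions_cooperative(group, "c1"))
    print("after reassignment:", reassign_sticky(group, "c1"))
```

In this model the eager case pauses all six partitions while the cooperative case pauses only the two owned by the departing member, which is the effect the benchmark numbers reflect.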

**AWS tip:** Run MSK consumers on **EKS with a PodDisruptionBudget (PDB)** (see Kubernetes track) so Kubernetes does not evict half the group simultaneously.

## Exactly-once semantics (EOS)

Requires:

- `enable.idempotence=true` on producer
- `transactional.id` unique per producer instance and stable across restarts
- Consumer `isolation.level=read_committed`
- **Idempotent sink** still required: Kafka transactions cover reads and writes inside Kafka, so writes into your database still need dedupe keys
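
The first three requirements are plain client settings, so they can be checked before deploy. The following is a minimal sketch of such a check. The configuration keys are the real Kafka property names; the function `eos_config_problems` and the example `transactional.id` values are hypothetical.

```python
from typing import Dict, List


def eos_config_problems(producers: List[Dict[str, object]],
                        consumer: Dict[str, object]) -> List[str]:
    """Return human-readable problems; an empty list means the basics are set."""
    problems: List[str] = []
    seen_ids = set()
    for i, cfg in enumerate(producers):
        # Accept both the string "true" and a Python bool.
        if str(cfg.get("enable.idempotence", "")).lower() != "true":
            problems.append(f"producer[{i}]: enable.idempotence must be true")
        tid = cfg.get("transactional.id")
        if not tid:
            problems.append(f"producer[{i}]: transactional.id is missing")
        elif tid in seen_ids:
            problems.append(f"producer[{i}]: transactional.id {tid!r} is reused")
        else:
            seen_ids.add(tid)
    if consumer.get("isolation.level") != "read_committed":
        problems.append("consumer: isolation.level must be read_committed")
    return problems


if __name__ == "__main__":
    producers = [
        {"enable.idempotence": "true", "transactional.id": "orders-writer-0"},
        {"enable.idempotence": "true", "transactional.id": "orders-writer-0"},
    ]
    consumer = {"isolation.level": "read_uncommitted"}
    for problem in eos_config_problems(producers, consumer):
        print(problem)
```

A check like this cannot verify the fourth requirement. Whether the sink is idempotent depends on how the downstream write is keyed.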

## AWS services map

| Need             | Service  | Skip when                      |
| ---------------- | -------- | ------------------------------ |
| Ordered log      | MSK      | <1k msgs/s, simple queue OK    |
| Simpler ordering | SQS FIFO | Need log replay / compaction   |
| Stream fan-out   | Kinesis  | Kafka ecosystem not required   |

## When this advice breaks

- **Cross-Region active-active**: MirrorMaker 2 replication on MSK lags, and Kafka transactions are scoped to a single cluster, so EOS does not span clusters without careful offset mapping.
- **RabbitMQ workloads**: Different protocol; see the Amazon MQ guide in this track.

## What to do this week

1. Set `partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor`.
2. Enable the idempotent producer; add EOS only after sink dedupe is proven.
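3. Alarm on the `kafka.consumer.group.rebalance` rate (or whichever rebalance metric your client exposes, such as the Java client's `rebalance-rate-per-hour`) and on consumer lag p99.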
4. Document rebalance behavior in GameDay runbooks.
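
For step 2, one way to prove sink dedupe is to key every write on the record's coordinates and let the database reject repeats. The sketch below uses SQLite from the standard library as a stand-in for the real sink. The table layout and the names `open_sink` and `write_once` are assumptions for illustration, not part of any Kafka API.

```python
import sqlite3


def open_sink(path: str = ":memory:") -> sqlite3.Connection:
    """Open the stand-in sink and create the table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            topic         TEXT    NOT NULL,
            part_id       INTEGER NOT NULL,
            record_offset INTEGER NOT NULL,
            payload       TEXT    NOT NULL,
            PRIMARY KEY (topic, part_id, record_offset)  -- the dedupe key
        )
        """
    )
    return conn


def write_once(conn: sqlite3.Connection, topic: str, partition: int,
               offset: int, payload: str) -> bool:
    """Insert the record unless its key already exists; True if newly written."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (topic, part_id, record_offset, payload) "
            "VALUES (?, ?, ?, ?)",
            (topic, partition, offset, payload),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    sink = open_sink()
    print(write_once(sink, "orders", 3, 42, "created"))  # first delivery
    print(write_once(sink, "orders", 3, 42, "created"))  # redelivery after a rebalance
```

Replaying the same batch twice in a test and asserting that the row count does not change is a cheap way to show the sink tolerates redelivery before transactions are turned on.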

## What this guide doesn't cover

SQS and EventBridge patterns; see the canonical posts in parts 3–5 of this track.

## FAQ

### Does MSK support Kafka exactly-once semantics?
Yes, on supported Kafka versions: idempotent producer plus transactions, read by `read_committed` consumers. It requires a correct `transactional.id`, broker version alignment, and consumers that skip records from aborted transactions (which `read_committed` does for you); it is not automatic on upgrade.
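
What `read_committed` changes for a consumer can be illustrated with a toy log. This is a simplified model, not the broker's implementation: it assumes each transaction ID appears once and that the log is ordered by offset. The `Entry` type and `visible_records` function are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    offset: int
    txn: Optional[str]          # None means a non-transactional record
    kind: str                   # "data", "commit" or "abort"
    value: Optional[str] = None


def visible_records(log: List[Entry], isolation_level: str) -> List[str]:
    """Return the values a consumer would see under the given isolation level."""
    if isolation_level == "read_uncommitted":
        return [e.value for e in log if e.kind == "data"]
    outcome = {e.txn: e.kind for e in log if e.kind in ("commit", "abort")}
    # Last stable offset: reads stop at the first record of any open transaction.
    open_offsets = [e.offset for e in log
                    if e.kind == "data" and e.txn is not None and e.txn not in outcome]
    end = log[-1].offset + 1 if log else 0
    last_stable = min(open_offsets) if open_offsets else end
    seen = []
    for e in log:
        if e.offset >= last_stable:
            break
        if e.kind != "data":
            continue  # commit/abort markers are never handed to the application
        if e.txn is None or outcome.get(e.txn) == "commit":
            seen.append(e.value)  # records of aborted transactions are skipped
    return seen


if __name__ == "__main__":
    log = [
        Entry(0, None, "data", "a"),
        Entry(1, "t1", "data", "b"),
        Entry(2, "t2", "data", "c"),
        Entry(3, "t1", "commit"),
        Entry(4, "t2", "abort"),
        Entry(5, "t3", "data", "d"),   # t3 is still open
        Entry(6, None, "data", "e"),
    ]
    print(visible_records(log, "read_uncommitted"))
    print(visible_records(log, "read_committed"))
```

In the model, aborted records are filtered out rather than raising an error, and an open transaction holds back everything after it, including non-transactional records. Long-running transactions therefore show up as consumer lag.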

### Why do MSK consumers pause during deploys?
Classic eager rebalance revokes all partitions before reassignment. Use cooperative-sticky assignors and scale consumers gradually; tune `session.timeout.ms` and `max.poll.interval.ms` for processing time.
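
A rough way to size the two timeouts is to start from the worst-case processing time of one `poll()` batch. The helper below is a sketch under stated assumptions: the 2x headroom factor and the function name `suggest_timeouts` are illustrative choices, the 45 s session timeout mirrors the Java client default since Kafka 3.0, and the one-third heartbeat ratio follows the usual Kafka guidance.

```python
import math
from typing import Dict


def suggest_timeouts(max_poll_records: int,
                     worst_case_ms_per_record: float,
                     headroom: float = 2.0,
                     session_timeout_ms: int = 45_000) -> Dict[str, int]:
    """Suggest consumer timeouts from batch size and per-record processing time."""
    if max_poll_records <= 0 or worst_case_ms_per_record <= 0:
        raise ValueError("batch size and per-record time must be positive")
    # One poll() batch must finish, with headroom, before the interval expires.
    batch_ms = max_poll_records * worst_case_ms_per_record * headroom
    return {
        "max.poll.records": max_poll_records,
        "max.poll.interval.ms": math.ceil(batch_ms),
        "session.timeout.ms": session_timeout_ms,
        # Heartbeats are commonly kept at no more than a third of the session timeout.
        "heartbeat.interval.ms": session_timeout_ms // 3,
    }


if __name__ == "__main__":
    for key, value in suggest_timeouts(500, 200).items():
        print(f"{key}={value}")
```

If the suggested `max.poll.interval.ms` comes out uncomfortably large, lowering `max.poll.records` is usually the simpler lever, since a long interval also delays detection of a stuck consumer.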

---

*Source: https://www.factualminds.com/blog/kafka-msk-partition-rebalancing-exactly-once-semantics/*
