---
title: High-Throughput Event Processing on AWS (2026): SQS, Kinesis, MSK, and Flink Tier Selection With Cost-Cliff Math
description: On a composite ingest workload (~8k ordered TPS, 1 KB payloads), staying on SQS FIFO without high-throughput mode capped effective throughput near 300 TPS/API and modeled queue backlog cost near $95/mo before ops time — enabling high-throughput FIFO or switching to Kinesis on-demand changed the ceiling, not the consumer code.
url: https://www.factualminds.com/blog/aws-high-throughput-event-processing-tier-selection-2026/
datePublished: 2026-06-30T00:00:00.000Z
dateModified: 2026-06-30T00:00:00.000Z
author: palaniappan-p
category: Cloud Architecture
tags: aws, aws-sqs, amazon-kinesis, amazon-msk, apache-flink, event-driven, data-streaming, cost-optimization
---

# High-Throughput Event Processing on AWS (2026): SQS, Kinesis, MSK, and Flink Tier Selection With Cost-Cliff Math

> On a composite ingest workload (~8k ordered TPS, 1 KB payloads), staying on SQS FIFO without high-throughput mode capped effective throughput near 300 TPS/API and modeled queue backlog cost near $95/mo before ops time — enabling high-throughput FIFO or switching to Kinesis on-demand changed the ceiling, not the consumer code.

**On November 25, 2025**, AWS raised the per-stream **enhanced fan-out consumer limit on Kinesis Data Streams to 50**. Each registered consumer gets a dedicated **2 MB/s** read pipe per shard instead of sharing the shard's read cap. That change matters because many "we need MSK for throughput" conversations are really **consumer fan-out** problems, not Kafka protocol problems.

This post is the **throughput tier ladder** — when to stay on **SQS**, when **Kinesis on-demand** wins, when **MSK** is worth broker hours, and when **Managed Service for Apache Flink** is compute, not transport. It is **not** the [Kinesis vs MSK platform pick](/blog/amazon-kinesis-data-streams-vs-msk-which-streaming-platform/) alone, **not** [sync vs async boundaries](/blog/aws-event-driven-async-messaging-boundaries/), **not** [SQS reliability patterns](/blog/aws-sqs-reliable-messaging-patterns-for-production/), **not** the [Kinesis→Lambda→DynamoDB reference pipeline](/blog/real-time-data-pipeline-kinesis-lambda-dynamodb/), and **not** [JVM runtime throughput tuning](/blog/virtual-threads-lock-free-concurrency-high-throughput-aws/).

Artifacts: [throughput tier decision matrix](https://www.factualminds.com/examples/architecture-blog-2026/event-throughput/throughput-tier-decision-matrix.md), [throughput cost model CSV](https://www.factualminds.com/examples/architecture-blog-2026/event-throughput/throughput-cost-model.csv). Pricing math uses [SQS calculator](/tools/amazon-sqs-pricing-calculator/) assumptions where applicable.

> **Benchmark pattern (not a cited client)** — Composite order-ingest platform, **~8k TPS** peak with **per-customer ordering**, **~1 KB** payloads, **us-east-1**, three downstream consumers (fraud scoring, fulfillment, analytics). Phase 1 used **SQS FIFO** without high-throughput mode — effective ceiling **~300 TPS/API action**, backlog age peaked **~14 min**, modeled FIFO line **~$95/mo** plus on-call time. Phase 2 enabled **high-throughput FIFO** with **50 message groups** — same business TPS, backlog age **&lt; 30 s**, modeled line **~$118/mo**. Phase 3 (analytics-only fork) moved firehose-style telemetry to **Kinesis on-demand** at **~25k events/s** — modeled **~$890/mo** ingest/retrieval vs mis-sized **MSK provisioned "for jobs"** at **~$1,100/mo** on the CSV failure row.

## The four-tier ladder

| Tier                       | Throughput shape                                   | You buy                                          | You do not get              |
| -------------------------- | -------------------------------------------------- | ------------------------------------------------ | --------------------------- |
| **SQS Standard**           | Nearly unlimited horizontal scale                  | Per-request $, polling discipline                | Global order, stream replay |
| **SQS FIFO**               | 300 TPS/API → 3k batched → **70k high-throughput** | Per-group ordering                               | Parallelism within a group  |
| **Kinesis Data Streams**   | Shard or on-demand MB/s                            | Retention, Lambda ESM, **50 EFO consumers**      | Kafka wire protocol         |
| **MSK**                    | Partition + broker hours                           | Kafka Connect, consumer groups, compacted topics | Zero broker thinking        |
| **Flink (on Kinesis/MSK)** | Stateful parallelism                               | Windows, joins, CEP                              | Simple queue semantics      |

**Opinionated take:** **Default SQS Standard for work queues; Kinesis on-demand for AWS-native multi-consumer streams; MSK only with dated Kafka requirements; Flink only when stateful stream SQL/joins are the product.** Escalate tier when a **documented ceiling** bites — not when a resume mentions Kafka.
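The ladder above can be read as a short decision function. The following is a minimal sketch under stated assumptions, not the linked decision matrix: the `Workload` fields and the `pick_tier` name are hypothetical, and the rules only encode the defaults described in this section.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All four flags are illustrative names, not fields from the linked matrix.
    needs_kafka_protocol: bool = False    # existing Kafka clients, Connect, compacted topics
    needs_stateful_compute: bool = False  # windows, joins, CEP
    needs_replay_or_fanout: bool = False  # stream retention or many independent readers
    needs_per_entity_order: bool = False  # e.g. ordering per customer


def pick_tier(w: Workload) -> str:
    """Return the lowest rung of the ladder that satisfies the stated needs."""
    if w.needs_kafka_protocol:
        transport = "MSK"
    elif w.needs_replay_or_fanout:
        transport = "Kinesis on-demand"
    elif w.needs_per_entity_order:
        transport = "SQS FIFO"
    else:
        transport = "SQS Standard"

    # Flink is compute layered on a log, so it needs Kinesis or MSK underneath.
    if w.needs_stateful_compute:
        if transport.startswith("SQS"):
            transport = "Kinesis on-demand"
        return f"Flink on {transport}"
    return transport


print(pick_tier(Workload(needs_per_entity_order=True)))
```

The point of writing it down is that Kafka only appears when a Kafka-specific flag is set; throughput alone never selects it.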

## Tier 1 — SQS: request economics beat broker hours

AWS documents **nearly unlimited throughput** on **SQS Standard** queues. Your ceiling is almost always set by **consumer count, handler duration, and idempotency handling**, not by SQS itself.

**FIFO** is different. Without high-throughput mode, plan around **300 transactions per second per API action** ([async messaging boundaries](/blog/aws-event-driven-async-messaging-boundaries/), May 2026 refresh). Batching up to 10 messages per call raises the practical ceiling to roughly 3,000 messages per second; **high-throughput FIFO mode** targets up to **~70,000 TPS** with explicit opt-in and per-message-group parallelism.
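To make the FIFO arithmetic concrete, here is a hedged sketch of the non-high-throughput ceiling and one way to spread traffic across message groups. The function names and the `grp-NNN` naming scheme are hypothetical; the 300 calls/s figure is the planning number used in this post, and 10 is the SQS batch size limit.

```python
import hashlib

FIFO_API_CALLS_PER_SEC = 300  # per API action, without high-throughput mode
MAX_BATCH_SIZE = 10           # SQS batch APIs accept up to 10 messages per call


def fifo_ceiling_msgs_per_sec(batch_size: int = 1) -> int:
    """Messages per second one API action can move on a non-high-throughput FIFO queue."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError("batch_size must be between 1 and 10")
    return FIFO_API_CALLS_PER_SEC * batch_size


def message_group_id(customer_id: str, group_count: int) -> str:
    """Map a customer to one of group_count stable groups.

    The same customer always lands in the same group, so per-customer
    ordering is preserved while different customers can proceed in parallel.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % group_count
    return f"grp-{bucket:03d}"


peak_tps = 8_000  # composite workload from this post
print(fifo_ceiling_msgs_per_sec(10) >= peak_tps)  # batching alone does not cover the peak
print(message_group_id("customer-42", 50))
```

Using `customer_id` directly as the group ID is also valid; the bucketed form simply mirrors the fixed group counts used in the benchmark.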

| Mistake                      | Symptom            | Fix                                                                                            |
| ---------------------------- | ------------------ | ---------------------------------------------------------------------------------------------- |
| Single FIFO message group    | One lane at peak   | Shard groups by `customer_id`, `order_id`, etc.                                                |
| Short polling on idle queues | Empty-receive bill | `WaitTimeSeconds=20` — see [SQS pricing](/blog/amazon-sqs-pricing-64kb-rule-fifo-vs-standard/) |
| 200 KB bodies                | 4× request chunks  | S3 pointer pattern                                                                             |

> **What broke** — Black Friday prep for a retail order pipeline. Ops raised FIFO `maxReceiveCount` but not throughput mode. `ApproximateAgeOfOldestMessage` hit **~22 min** while CloudWatch `NumberOfMessagesSent` plateaued near **~280/s**. Producers reported "no errors." Detection: backlog age alarm + CSV failure row `sqs_fifo_wrong_tier`. Fix: high-throughput FIFO + **40 message groups**; fraud lane stayed FIFO, analytics fork moved to **Kinesis** the following sprint.

## Tier 2 — Kinesis: MB/s, shards, and fan-out

**Provisioned mode** (per shard): **1 MB/s write**, **2 MB/s read** for standard consumers. **On-demand mode** scales shards automatically — ideal for variable ingest if you model retrieval and fan-out.
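A rough way to turn the per-shard write limit into a shard count, either for provisioned sizing or as a sanity check on what on-demand has to scale to. This is a sketch with assumptions: the 1,000 records/s per-shard write limit comes from the Kinesis quotas rather than this post, the function name is illustrative, and partition keys are assumed to be evenly distributed.

```python
import math

SHARD_WRITE_MIB_PER_SEC = 1.0       # per-shard write limit cited above
SHARD_WRITE_RECORDS_PER_SEC = 1000  # per-shard record limit (assumption: from Kinesis quotas)


def shards_needed(events_per_sec: float, payload_kib: float) -> int:
    """Minimum shard count for a write load, taking the tighter of the two limits."""
    ingest_mib_per_sec = events_per_sec * payload_kib / 1024
    by_bytes = math.ceil(ingest_mib_per_sec / SHARD_WRITE_MIB_PER_SEC)
    by_records = math.ceil(events_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(by_bytes, by_records, 1)


# Analytics fork from the benchmark: ~25k events/s at ~1 KiB each.
print(shards_needed(25_000, 1.0))
```

Small payloads hit the record limit before the byte limit, which is why both are checked.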

**Enhanced fan-out (EFO):** each registered consumer gets a dedicated **2 MB/s** of read throughput per shard, which becomes critical when **&gt;5 services** read the same stream. AWS increased the per-stream EFO consumer maximum to **50** (November 2025). Standard consumers still share each shard's 2 MB/s read limit, so adding Lambdas does not add read bandwidth.
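The difference between shared and dedicated reads is easy to see as per-consumer bandwidth on a single shard. A minimal sketch, assuming every standard consumer reads the full stream; the function name is hypothetical.

```python
SHARD_READ_MB_PER_SEC = 2.0  # shared by all standard consumers of a shard
EFO_MB_PER_SEC = 2.0         # dedicated to each enhanced fan-out consumer, per shard
MAX_EFO_CONSUMERS = 50       # per-stream limit cited in this post


def read_mb_per_consumer(consumers: int, enhanced_fan_out: bool) -> float:
    """Per-shard read bandwidth available to each consumer."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced_fan_out:
        if consumers > MAX_EFO_CONSUMERS:
            raise ValueError("exceeds the per-stream EFO consumer limit")
        return EFO_MB_PER_SEC
    return SHARD_READ_MB_PER_SEC / consumers


# Three downstream consumers, as in the benchmark pattern.
print(round(read_mb_per_consumer(3, enhanced_fan_out=False), 2))
print(read_mb_per_consumer(3, enhanced_fan_out=True))
```

With three standard consumers each one gets about 0.67 MB/s per shard, which is below the 1 MB/s a shard can ingest, so a fully loaded shard would leave all three falling behind.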

In **October 2025**, AWS also raised the maximum record size to **10 MiB**, which means fewer chunking hacks for fat events.

Context — AWS CLI 2.x, read-only inventory:

```bash
# List stream names in the region (read-only)
aws kinesis list-streams --region us-east-1
# Show capacity mode (ON_DEMAND vs PROVISIONED), open shard count, and retention for one stream
aws kinesis describe-stream-summary --stream-name ORDER_EVENTS --region us-east-1
```

## Tier 3 — MSK: when Kafka protocol is non-negotiable

Choose **MSK** when you need **Kafka consumer groups**, **Kafka Connect**, **compacted topics**, or existing clients without rewrite.

| Mode                | Ceiling (documented)                                | Fit                                 |
| ------------------- | --------------------------------------------------- | ----------------------------------- |
| **MSK Serverless**  | ~**200 MBps** write, ~**400 MBps** read per cluster | Bursty Kafka without cluster sizing |
| **MSK provisioned** | Broker-type dependent                               | Sustained **&gt;500 MB/s** with RIs |

**Do not** provision three `kafka.m5.large` brokers to move **2k job messages/s** — the CSV `wrong_tier_kafka_for_jobs` row models **~$1,100/mo** vs **~$18/mo** SQS for the same shape.
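The mode table and the warning above reduce to a small check. This is a sketch only: the function name is hypothetical, the ceilings are the approximate MSK Serverless figures from the table, and the 1 KB job payload in the example is an assumption, not a number from the CSV.

```python
SERVERLESS_WRITE_MBPS = 200  # approximate per-cluster write ceiling from the table
SERVERLESS_READ_MBPS = 400   # approximate per-cluster read ceiling from the table


def msk_mode(needs_kafka_protocol: bool, write_mbps: float, read_mbps: float) -> str:
    """Pick an MSK mode, or reject MSK when nothing requires the Kafka protocol."""
    if not needs_kafka_protocol:
        return "not MSK"
    if write_mbps <= SERVERLESS_WRITE_MBPS and read_mbps <= SERVERLESS_READ_MBPS:
        return "MSK Serverless"
    return "MSK provisioned"


# 2k job messages/s at an assumed 1 KB each is roughly 2 MB/s.
print(msk_mode(needs_kafka_protocol=False, write_mbps=2, read_mbps=2))
print(msk_mode(needs_kafka_protocol=True, write_mbps=600, read_mbps=1200))
```

Throughput never triggers MSK on its own here; the Kafka requirement is checked first, and only then does MB/s decide between Serverless and provisioned.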

## Tier 4 — Flink: compute layer, not queue replacement

**Managed Service for Apache Flink** belongs when you need **session windows**, **stream joins**, or **CEP** atop Kinesis or MSK. Transport stays on the log; Flink holds state and checkpoints.
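For readers who have not used session windows, the following plain-Python sketch shows the grouping they produce. It is not Flink API code, and the function name is illustrative. It sorts a finished batch of timestamps, which a live stream cannot do; Flink's value is doing the same thing continuously, per key, with state that survives restarts.

```python
from typing import Iterable


def session_windows(timestamps: Iterable[float], gap_seconds: float) -> list[list[float]]:
    """Group event timestamps into sessions separated by more than gap_seconds of silence."""
    sessions: list[list[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)  # still inside the current session
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions


# One user's events (seconds); a 30 s inactivity gap splits them into two sessions.
print(session_windows([0, 10, 25, 100, 110], gap_seconds=30))
```

If this batch form, run on a schedule, already meets the latency target, that is a signal the workload may not need Flink.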

Skip Flink when Lambda + DynamoDB + Step Functions already meet latency — you are buying checkpoint operations and key-group skew debugging.

## Consumer parallelism cheat sheet

| Transport    | Scale reads by             | Anti-pattern                     |
| ------------ | -------------------------- | -------------------------------- |
| SQS Standard | More workers               | Assuming exactly-once            |
| SQS FIFO     | More **message group IDs** | One group for all traffic        |
| Kinesis      | Shards + EFO consumers     | 20 Lambdas on standard consumers |
| MSK          | Partitions × group members | Rebalance during peak deploy     |
| Flink        | Parallelism / slots        | Hot key in `keyBy`               |

## What to do this week

1. Write peak **TPS**, **payload KB**, and **ordering scope** on one page.
2. Run the [decision matrix](https://www.factualminds.com/examples/architecture-blog-2026/event-throughput/throughput-tier-decision-matrix.md) — if Kafka is not in the answer column, stop.
3. Plug numbers into the [cost CSV](https://www.factualminds.com/examples/architecture-blog-2026/event-throughput/throughput-cost-model.csv) including the **wrong-tier** rows.
4. For FIFO, confirm high-throughput mode and message-group spread **before** peak season.
5. For Kinesis multi-team reads, model **EFO** cost vs shared-consumer lag.

## What this post doesn't cover

- **Amazon MQ / RabbitMQ** — see [event-driven boundaries](/blog/aws-event-driven-async-messaging-boundaries/).
- **EventBridge Pipes pricing** — see [EventBridge pricing](/blog/amazon-eventbridge-pricing-events-pipes-schema-archive/).
- **IoT MQTT ingest** — see [IoT Core MQTT](/blog/aws-iot-core-mqtt-industrial-workloads/).
- **Exactly-once end-to-end proofs** — see [Kafka partition rebalancing](/blog/kafka-msk-partition-rebalancing-exactly-once-semantics/).

**Related:** [Architecture review](/services/aws-architecture-review/) · [SQS pricing calculator](/tools/amazon-sqs-pricing-calculator/) · [Kinesis vs MSK](/blog/amazon-kinesis-data-streams-vs-msk-which-streaming-platform/)

## FAQ

### When should we stay on SQS Standard instead of Kinesis or MSK?
Stay on SQS Standard when you need a job queue or work buffer without strict global ordering, sustained ingest is under roughly 50k messages per second with idempotent consumers, and you do not need stream retention or replay semantics beyond DLQ redrive. SQS bills per request with nearly unlimited horizontal throughput on Standard queues — you pay for polling discipline and 64 KB chunking, not broker hours. Escalate only when ordering per entity, stream retention, or MB/s-level fan-out forces a streaming primitive.
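The 64 KB chunking rule is simple arithmetic. A minimal sketch, with an illustrative function name:

```python
import math

CHUNK_KB = 64  # each 64 KB chunk of a payload is billed as one request


def billed_requests(payload_kb: float) -> int:
    """Number of requests SQS bills for a single API call carrying payload_kb."""
    return max(1, math.ceil(payload_kb / CHUNK_KB))


print(billed_requests(1))    # the ~1 KB payloads in the benchmark
print(billed_requests(200))  # the 200 KB bodies from the mistakes table
```

A 200 KB body bills as four requests on every send and every receive, which is why the S3 pointer pattern pays off for large payloads.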

### When should we NOT choose Amazon MSK for high throughput?
Skip MSK when you have no Kafka clients, Connect jobs, or compacted topics — broker hours for "future Kafka" burn budget on workloads SQS or Kinesis already solve. MSK Serverless caps at roughly 200 MBps write and 400 MBps read per cluster; above sustained hundreds of MB/s, provisioned MSK with Reserved Instances usually wins, but only if Kafka protocol is a hard requirement. If the team cannot operate consumer group rebalances and partition planning, MSK is the wrong first tier regardless of TPS.

### What breaks when FIFO throughput limits are ignored?
SQS FIFO without high-throughput mode is commonly planned around 300 transactions per second per API action. Producers and consumers appear healthy while backlog age grows, because AWS throttles silently rather than throwing obvious errors to every caller. Symptoms: ApproximateAgeOfOldestMessage climbing during peak, downstream lag measured in minutes, and finance seeing FIFO request charges without matching business throughput. Fix: enable high-throughput FIFO (up to roughly 70k TPS with batching), add message group parallelism, or move ordered streams to Kinesis shards.
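Because the failure is quiet, the backlog-age alarm is the main detector. The sketch below only builds the arguments for CloudWatch's `put_metric_alarm`; the queue name, the 300-second threshold, and the five-minute evaluation window are hypothetical values to adjust per workload.

```python
def backlog_age_alarm(queue_name: str, threshold_seconds: int = 300) -> dict:
    """Build put_metric_alarm arguments for an SQS backlog-age alarm."""
    return {
        "AlarmName": f"{queue_name}-backlog-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,              # one-minute datapoints
        "EvaluationPeriods": 5,    # must breach for five consecutive minutes
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


params = backlog_age_alarm("orders.fifo")  # hypothetical queue name
print(params["MetricName"], params["Threshold"])
```

With credentials configured, the dictionary can be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`; add `AlarmActions` to route the alarm to an SNS topic.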

### How does Kinesis enhanced fan-out change consumer scaling?
Standard Kinesis consumers share each shard's 2 MB/s read pipe, so adding consumers does not add read bandwidth. Enhanced fan-out gives each registered consumer a dedicated 2 MB/s HTTP/2 pipe per shard; AWS increased the per-stream fan-out consumer limit to 50 in November 2025. Use fan-out when many independent services read the same stream without competing for shard read capacity. You pay per consumer-shard hour for fan-out; model it in the cost CSV before enabling on every microservice.

### Where does Managed Service for Apache Flink fit?
Flink is compute for stateful windows, joins, and complex event processing — not a transport replacement for SQS. Put Kinesis or MSK underneath as the durable log; run Flink when aggregations, sessionization, or late-arriving event handling exceed Lambda plus DynamoDB patterns. Do not adopt Flink for simple map-and-write pipelines — operational surface (checkpoints, state backends, parallelism tuning) is justified only when business logic needs continuous state.

### What could go wrong after tier escalation?
Hot keys: a single Kinesis partition key or SQS FIFO message group becomes one lane at peak. Consumer rebalance: MSK deploys stall consumption during partition reassignment. Cost cliff: Kinesis on-demand is cheap at low MB/s but expensive at sustained high MB/s versus right-sized provisioned shards or MSK with RIs. Always load-test the new tier with production key distribution, not uniform random keys.
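One way to check for a hot key before escalating is to measure how much traffic the busiest key carries in a production sample. A minimal sketch; the function name and the sample data are illustrative.

```python
from collections import Counter
from typing import Iterable


def hottest_key_share(keys: Iterable[str]) -> tuple[str, float]:
    """Return the most frequent key and its fraction of all sampled events."""
    counts = Counter(keys)
    if not counts:
        raise ValueError("no keys sampled")
    key, hits = counts.most_common(1)[0]
    return key, hits / sum(counts.values())


# Illustrative sample: one customer produces 60% of events.
sample = ["cust-a"] * 6 + ["cust-b"] * 2 + ["cust-c"] * 2
key, share = hottest_key_share(sample)
peak_tps = 8_000
print(key, share, share * peak_tps)  # TPS that would land on a single lane at peak
```

If the hottest key's share of peak TPS exceeds what one shard or one message group can carry, adding shards or groups will not help until the key itself is split.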

