How to Build Reliable Queue Systems on AWS (SQS, Kafka, Redis)
Quick summary: SQS, MSK Kafka, and Redis queues are not interchangeable. Each has different cost models, ordering guarantees, and failure modes. This guide covers when to use each, how to autoscale workers on queue depth, and how to build idempotent consumers.

Queue systems are one of those infrastructure decisions that seems simple until your production incident postmortem traces a $40,000 billing spike back to a misconfigured visibility timeout. SQS, MSK Kafka, and Redis each solve the same surface problem — move work from a producer to one or more consumers asynchronously — but they do it with different delivery guarantees, cost structures, and operational demands. Picking the wrong one creates problems that are expensive and painful to fix under load.
This guide covers the real decision criteria, not benchmarks run on EC2 instances that bear no resemblance to your actual workload.
The Decision Guide: SQS vs Kafka vs Redis
Before examining each system, here is a comparison table with the numbers that actually matter for production decisions:
| Dimension | SQS Standard | SQS FIFO | MSK Kafka | Redis LIST | Redis Streams |
|---|---|---|---|---|---|
| Ordering | Best-effort | Strict per message group | Per partition | LIFO/FIFO (depends on command) | Append-only, strict insertion order |
| Delivery | At-least-once | Exactly-once (within 5 min) | At-least-once (configurable) | At-most-once | At-least-once with consumer groups |
| Replay | No | No | Yes (configurable retention) | No | Limited (XRANGE) |
| Throughput | Unlimited (soft limits) | 300 calls/sec per action (3,000 msg/sec batched) | Millions/sec | ~100K ops/sec per node | ~100K ops/sec per node |
| Latency | 1–10ms typical | 1–10ms typical | 5–15ms typical | <1ms | <1ms |
| Base cost (AWS) | $0.40/M requests | $0.50/M requests | ~$200/month minimum | ElastiCache node cost | ElastiCache node cost |
| Durability | Multi-AZ, 99.999999% | Multi-AZ, 99.999999% | Configurable replication | RDB/AOF snapshots | RDB/AOF snapshots |
| Consumer groups | No (competing consumers) | No | Yes, independent | No | Yes (XREADGROUP) |
| Operational overhead | Minimal | Minimal | High | Low–Medium | Low–Medium |
Use SQS When
SQS is the right default for the majority of async processing workloads on AWS. Use it when:
- You want managed infrastructure with no brokers to operate. SQS has no servers, no clusters, no partition rebalancing events.
- Your message volume is below approximately 100 million messages per day. Above that, MSK Kafka’s flat compute cost may undercut SQS per-request pricing.
- You do not need replay. SQS is a queue, not a log — once a message is consumed and deleted, it is gone.
- Multiple systems need to consume the same events. Use SQS with SNS fan-out: SNS delivers to multiple SQS queues, each with independent consumers.
- You need Lambda integration. SQS triggers Lambda natively with configurable batch sizes and concurrency controls.
Cost reality: At $0.40 per million requests for standard queues, processing 10 million messages per day costs $4/day in API requests. Factor in that each message typically requires two API calls (ReceiveMessage + DeleteMessage) plus any ChangeMessageVisibility calls, so the real cost is closer to $8–12/day for 10 million messages with typical retry patterns.
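The two-calls-per-message cost model falls directly out of the consume loop. A minimal boto3 sketch (the queue URL and `handle` callback are placeholders) that long-polls and batch-deletes to keep request counts down:

```python
import json

def delete_batch_entries(messages):
    """Build the DeleteMessageBatch payload from a ReceiveMessage response."""
    return [
        {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
        for i, m in enumerate(messages)
    ]

def consume_forever(queue_url, handle):
    """Long-poll the queue; each message costs roughly one Receive + one Delete."""
    import boto3  # imported lazily so the helper above stays dependency-free
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # batch up to 10 to amortize request cost
            WaitTimeSeconds=20,      # long polling: fewer empty receives
        )
        messages = resp.get("Messages", [])
        if not messages:
            continue
        for m in messages:
            handle(json.loads(m["Body"]))
        # Delete only after successful processing (at-least-once semantics)
        sqs.delete_message_batch(
            QueueUrl=queue_url,
            Entries=delete_batch_entries(messages),
        )
```

Batching both the receive and the delete cuts the per-message request count, which is why real-world costs land between the 1x and 2x estimates.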
Use MSK Kafka When
Kafka’s value proposition is the log abstraction: messages are retained and can be re-read by consumers at any offset, at any time, within the retention window.
Use MSK Kafka when:
- Multiple independent consumer groups need to process the same events at their own pace. Your analytics pipeline and your notification service can both consume the same order events without coordinating.
- You need replay for recovery or backfill. When your new recommendation engine deploys, it can replay 30 days of purchase events to build its initial model.
- Your throughput exceeds SQS’s cost-effective range. MSK kafka.t3.small clusters start at approximately $200/month regardless of message volume. At 100 million messages per day, that is cheaper than SQS at $80/day.
- You need strict ordering across millions of messages. Kafka partitions provide strict ordering within a partition with a single consumer per partition in a consumer group.
Cost reality: MSK costs are driven by broker instance type, storage, and data transfer — not message count. A production-grade 3-broker kafka.m5.large cluster with 1TB storage runs approximately $800–1,200/month. Add MSK Connect, Schema Registry, or Cruise Control and costs rise further. If you are currently at 1 million messages/day and considering MSK for its features, make sure those features justify $800/month compared to SQS at $0.80/day.
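The break-even arithmetic is worth making explicit. A rough model (assuming two API calls per message and a flat monthly cluster cost; both are simplifications of your real bill):

```python
def sqs_daily_cost(messages_per_day: int, calls_per_message: float = 2.0,
                   price_per_million: float = 0.40) -> float:
    """Daily SQS API cost: each message needs roughly Receive + Delete."""
    return messages_per_day * calls_per_message / 1_000_000 * price_per_million

def msk_daily_cost(monthly_cluster_cost: float) -> float:
    """MSK is flat-rate: you pay for broker hours, not message count."""
    return monthly_cluster_cost / 30

# sqs_daily_cost(10_000_000)  -> 8.0   ($8/day, matching the estimate above)
# sqs_daily_cost(100_000_000) -> 80.0  (where MSK's flat rate starts to win)
```

The crossover point moves with batching, retries, and the MSK instance type you actually need, so treat the model as a starting point rather than a verdict.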
Use Redis Queues When
Redis queues occupy a specific niche: ultra-low latency and ephemeral workloads.
Use Redis LIST (LPUSH/BRPOP) when:
- You need sub-millisecond enqueue and dequeue latency. Redis operates in memory; network latency is the dominant cost.
- Durability is secondary to speed. If the Redis node crashes and you lose pending messages, your application can recover. Job scheduling (cron-triggered work), cache warming tasks, and low-value notification sends are examples.
- You want simple implementation. LPUSH and BRPOP are four lines of code with no SDK ceremony.
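That four-line implementation can be sketched against any redis-py-compatible client (the client and queue name are up to you); pushing to the head and popping from the tail yields oldest-first order:

```python
import json

def enqueue(client, queue: str, job: dict) -> None:
    """Push to the head; BRPOP pops from the tail, so the oldest job comes out first."""
    client.lpush(queue, json.dumps(job))

def dequeue(client, queue: str, timeout: int = 5):
    """Block up to `timeout` seconds; returns None if the queue stays empty."""
    item = client.brpop(queue, timeout=timeout)
    if item is None:
        return None
    _key, payload = item
    return json.loads(payload)
```

Usage with a real connection would be `client = redis.Redis(host=..., port=6379)`; the functions take the client as a parameter so they work against anything exposing `lpush`/`brpop`.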
Use Redis Streams (XADD/XREADGROUP) when:
- You want consumer groups (multiple independent consumer groups reading the same stream) without Kafka’s operational complexity.
- You want message acknowledgment — consumers claim messages, process them, and acknowledge with XACK. Unacknowledged messages in the Pending Entries List (PEL) can be reclaimed by other consumers.
- You need retention you can bound. Streams grow indefinitely unless trimmed with MAXLEN or MINID. In production, always configure MAXLEN.
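A sketch of the Streams loop using redis-py-style calls (stream, group, and handler names are illustrative, and the group must first be created with `xgroup_create`); the parse helper flattens XREADGROUP's nested reply:

```python
def parse_stream_entries(response):
    """Flatten an XREADGROUP reply [[stream, [(id, fields), ...]]] into pairs."""
    pairs = []
    for _stream, entries in response or []:
        for entry_id, fields in entries:
            pairs.append((entry_id, fields))
    return pairs

def produce(client, stream: str, fields: dict, maxlen: int = 100_000):
    # MAXLEN (approximate) caps the stream so it cannot grow without bound
    return client.xadd(stream, fields, maxlen=maxlen, approximate=True)

def consume_once(client, stream: str, group: str, consumer: str, handle) -> int:
    """One cycle: read up to 10 new entries, process each, then XACK it."""
    resp = client.xreadgroup(group, consumer, {stream: ">"}, count=10, block=5000)
    count = 0
    for entry_id, fields in parse_stream_entries(resp):
        handle(fields)                        # business logic
        client.xack(stream, group, entry_id)  # removes the entry from the PEL
        count += 1
    return count
```

Acknowledging only after `handle` succeeds means a crash leaves the entry in the PEL, where another consumer can later claim it.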
Ordering Guarantees in Practice
Ordering is where teams get burned. The systems guarantee ordering at different granularities with different tradeoffs.
SQS Standard: Best-effort ordering means messages are usually delivered close to the order they were sent, but there are no guarantees. A message sent after another may be delivered first. Do not build business logic that depends on SQS Standard ordering.
SQS FIFO: Messages are delivered in strict order within a message group. The message group ID is a string you attach to each message. All messages with the same group ID are delivered in order, one at a time (if using a single consumer). Multiple message groups can be processed in parallel. Throughput cap: 300 API calls/sec per action, or 3,000 messages/sec with maximum batching; high-throughput mode (enabled separately) raises the limit substantially. FIFO queues cost $0.50/million requests vs $0.40/million for standard.
Kafka: Strict ordering within a partition. Messages with the same partition key always go to the same partition and are consumed in order by the partition’s assigned consumer. Ordering across partitions is not guaranteed. For total ordering across all messages, use a single partition — but this limits your consumer group to one active consumer for that topic.
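The key-to-partition invariant can be sketched as follows. The stand-in hash below is not Kafka's actual murmur2 partitioner, but the property it illustrates is identical: one key always lands on one partition. The producer call uses confluent_kafka's `produce` API:

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Illustrative stand-in: real Kafka clients hash the key with murmur2,
    # but the invariant is the same -- same key, same partition, every time.
    return zlib.crc32(key) % num_partitions

def send_order_event(producer, topic: str, order_id: str, payload: bytes) -> None:
    """With a confluent_kafka.Producer: keying by order_id keeps each order's
    events in a single partition, hence strictly ordered for its consumer."""
    producer.produce(topic, key=order_id.encode(), value=payload)
    producer.poll(0)  # serve delivery callbacks without blocking
```

The corollary: events for different orders may interleave freely, which is exactly what lets consumer groups parallelize.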
Redis LIST: LPUSH + BRPOP (push to the head, pop from the tail) gives you FIFO. LPUSH + BLPOP gives LIFO (last in, first out). Strict ordering within a single Redis node, but only one consumer can pop a given message (competing consumers model).
Redis Streams: Strict insertion order by stream entry ID (millisecond timestamp + sequence number). Consumer groups allow multiple consumers to process different messages concurrently, but the PEL tracks which messages are in-flight.
Worker Autoscaling on Queue Depth
Idle workers waste money. Insufficient workers cause latency spikes. The goal is workers that scale with actual queue depth.
SQS + ECS Step Scaling
ECS does not natively understand SQS queue depth, but CloudWatch does. The pattern:
- SQS publishes ApproximateNumberOfMessagesVisible to CloudWatch automatically (no configuration needed).
- Create a Step Scaling policy on your ECS Service tied to this metric.
- Define scale-out steps: +2 tasks when queue depth > 100, +5 tasks when > 500, +10 tasks when > 2000.
- Define scale-in steps with a longer cooldown (5–10 minutes) to avoid thrashing.
The target-tracking equivalent: use a custom metric that represents messages-per-worker. Publish this computed metric (queue depth / current task count) via a Lambda that runs on a 1-minute schedule. Target-track that metric to a value of 100 (100 messages per worker). ECS will scale to maintain the ratio.
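A sketch of that scheduled Lambda (the queue URL, cluster, and service names are placeholders; the SQS, ECS, and CloudWatch calls are standard boto3 APIs):

```python
# Placeholder configuration -- replace with your own resources
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/orders"
CLUSTER = "workers"
SERVICE = "order-worker"

def messages_per_worker(queue_depth: int, task_count: int) -> float:
    """Backlog per worker; guards against division by zero when scaled to zero."""
    return queue_depth / max(task_count, 1)

def publish_backlog_metric(event, context):
    """Lambda on a 1-minute schedule: read depth and task count, publish ratio."""
    import boto3  # lazy import so the pure helper above stays testable
    sqs = boto3.client("sqs")
    ecs = boto3.client("ecs")
    cloudwatch = boto3.client("cloudwatch")
    depth = int(sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]["ApproximateNumberOfMessages"])
    service = ecs.describe_services(
        cluster=CLUSTER, services=[SERVICE])["services"][0]
    ratio = messages_per_worker(depth, service["runningCount"])
    cloudwatch.put_metric_data(
        Namespace="Custom/Queues",
        MetricData=[{"MetricName": "BacklogPerWorker", "Value": ratio}],
    )
```

Target-tracking on BacklogPerWorker = 100 then maintains the desired ratio without hand-tuned step thresholds.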
For Lambda consumers, SQS event source mapping handles scaling automatically: Lambda scales the number of concurrent executions based on the number of available messages, up to your account’s concurrency limit (default 1,000, configurable).
KEDA for Kubernetes (EKS)
KEDA monitors external metrics and adjusts Kubernetes Deployment or Job replicas accordingly. The SQS scaler:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker-scaler
spec:
  scaleTargetRef:
    name: order-worker
  minReplicaCount: 0
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "10"
        awsRegion: us-east-1
        identityOwner: operator
```

queueLength: "10" means KEDA targets 10 messages per replica. With 100 messages in the queue, it scales to 10 replicas. With 0 messages, it scales to 0 (minReplicaCount: 0). Scale-to-zero is one of KEDA's most valuable features: workers cost nothing when there is no work.
For Kafka consumer group lag, the KEDA Kafka scaler monitors the difference between the latest offset and the committed consumer group offset:
```yaml
triggers:
  - type: kafka
    metadata:
      bootstrapServers: my-msk-cluster.kafka.us-east-1.amazonaws.com:9092
      consumerGroup: order-processor
      topic: orders
      lagThreshold: "50"
      offsetResetPolicy: latest
```

Idempotency Patterns
SQS Standard delivers at-least-once. This is not a bug — it is a documented guarantee. Your consumers must be idempotent.
The Deduplication Table Pattern
The most robust pattern for SQS idempotency:
- Before processing a message, attempt to insert the SQS Message ID into a deduplication table with a TTL of the message retention period (default 4 days, max 14 days).
- Use a conditional write: DynamoDB ConditionExpression: "attribute_not_exists(messageId)", or a database unique constraint.
- If the write succeeds: the message has not been processed before. Process it, then delete the SQS message.
- If the write fails (condition failed / unique constraint violation): the message was already processed. Delete the SQS message without reprocessing.
- If processing fails after a successful deduplication write: the message returns to the queue (visibility timeout expires), is redelivered, but the deduplication write will fail — so it will not be reprocessed. This is a problem for jobs where partial processing has side effects.
For jobs with side effects (charging a payment, sending an email), the deduplication table must be written in the same transaction as the business logic side effect. For database-backed operations:
```sql
BEGIN;
INSERT INTO processed_messages (message_id, processed_at)
VALUES ($1, NOW())
ON CONFLICT (message_id) DO NOTHING;
-- Check rows affected; if 0, message already processed
-- If 1, proceed with business logic
INSERT INTO orders (id, customer_id, amount) VALUES (...);
COMMIT;
```

Laravel Queue Worker with SQS and Idempotency
```php
<?php

namespace App\Jobs;

use App\Models\ProcessedMessage;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\DB;
use Throwable;

class ProcessOrderJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 5;
    public int $backoff = 30;

    public function __construct(
        public readonly string $orderId,
        public readonly array $orderData,
    ) {}

    public function handle(): void
    {
        $sqsMessageId = $this->job->getJobId();

        DB::transaction(function () use ($sqsMessageId) {
            // Attempt idempotency insert — unique constraint on message_id
            $inserted = DB::table('processed_messages')->insertOrIgnore([
                'message_id' => $sqsMessageId,
                'job_class' => static::class,
                'processed_at' => now(),
            ]);

            if ($inserted === 0) {
                // Message already processed; nothing to do
                return;
            }

            // Safe to process: a failure here rolls back the idempotency
            // insert with it, so a redelivered message can be retried
            $this->processOrder($this->orderId, $this->orderData);
        });
    }

    private function processOrder(string $orderId, array $data): void
    {
        // Business logic here
    }

    public function failed(Throwable $exception): void
    {
        // Clean up the idempotency record so the message can retry.
        // Only if you want retry semantics after full failure.
        // Note: $this->job may no longer be set when failed() runs, so guard it.
        $jobId = $this->job?->getJobId();

        if ($jobId !== null) {
            DB::table('processed_messages')
                ->where('message_id', $jobId)
                ->delete();
        }
    }
}
```

The processed_messages table needs a unique index on message_id:
```sql
CREATE TABLE processed_messages (
    id BIGSERIAL PRIMARY KEY,
    message_id VARCHAR(255) NOT NULL,
    job_class VARCHAR(255) NOT NULL,
    processed_at TIMESTAMP NOT NULL,
    CONSTRAINT processed_messages_message_id_unique UNIQUE (message_id)
);

-- Index for cleanup: delete records older than the SQS retention period
CREATE INDEX processed_messages_processed_at_idx
    ON processed_messages (processed_at);
```

Python Kafka Consumer with Manual Offset Commit
Manual offset commits give you control over exactly-once processing semantics: commit the offset only after the message has been successfully processed and any side effects persisted.
```python
import json
import logging

from confluent_kafka import Consumer, KafkaError, KafkaException

logger = logging.getLogger(__name__)


class ProcessingError(Exception):
    """Raised when a message cannot be processed successfully."""


def create_consumer(bootstrap_servers: str, group_id: str) -> Consumer:
    return Consumer({
        'bootstrap.servers': bootstrap_servers,
        'group.id': group_id,
        'auto.offset.reset': 'earliest',
        'enable.auto.commit': False,        # Manual commit only
        'enable.auto.offset.store': False,  # Store only offsets we choose
        'max.poll.interval.ms': 300000,     # 5 minutes max processing time
        'session.timeout.ms': 45000,
        'heartbeat.interval.ms': 15000,
    })


def process_orders(bootstrap_servers: str, group_id: str, topic: str) -> None:
    consumer = create_consumer(bootstrap_servers, group_id)
    consumer.subscribe([topic])
    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                if msg.error().code() == KafkaError._PARTITION_EOF:
                    logger.debug(
                        'Reached end of partition %s [%d] at offset %d',
                        msg.topic(), msg.partition(), msg.offset()
                    )
                else:
                    raise KafkaException(msg.error())
                continue
            try:
                process_message(msg.value(), msg.key())
                # Commit only after successful processing:
                # store_offsets marks this message's offset, commit persists it
                consumer.store_offsets(msg)
                consumer.commit(asynchronous=False)
                logger.info(
                    'Processed message at partition=%d offset=%d',
                    msg.partition(), msg.offset()
                )
            except ProcessingError as exc:
                logger.error(
                    'Failed to process message at partition=%d offset=%d: %s',
                    msg.partition(), msg.offset(), exc
                )
                # Do NOT commit; the message will be reprocessed after restart.
                # For poison messages, track a max-retry counter in a
                # separate store.
    except KeyboardInterrupt:
        logger.info('Shutting down consumer')
    finally:
        # Offsets were committed synchronously per message; close() performs
        # a clean group leave so partitions are reassigned immediately
        consumer.close()


def process_message(value: bytes, key: bytes | None) -> None:
    """Process a single Kafka message.

    Must be idempotent: it may be called more than once for the same message
    on consumer restart or rebalance.
    """
    order = json.loads(value)
    # Business logic here
    _ = order  # suppress unused warning in example
```

The critical detail: enable.auto.commit: False, paired with enable.auto.offset.store: False so that only offsets you explicitly store are ever committed. With auto-commit enabled (the default), the Kafka client commits offsets on a timer regardless of whether processing succeeded. If your consumer crashes between the auto-commit and your side effect completing, the message appears processed (offset committed) but the business effect never happened.
Dead-Letter Queue Configuration
Every SQS queue in production should have a DLQ. Messages land in the DLQ when they have been received (and failed) more than maxReceiveCount times.
Choosing maxReceiveCount
- Set it to the number of retries that makes sense for your workload.
- For transient failures (database unavailable, external API timeout): 5–10 retries is appropriate. These failures resolve themselves.
- For poison messages (malformed JSON, missing required fields, logic bugs): any number of retries is wasteful. The message will never succeed. Consider lower values (3–5) and invest in robust DLQ monitoring and replay tooling.
- For Lambda consumers with SQS trigger: Lambda deletes a message from the source queue after successful processing. Failed invocations cause SQS to redeliver the message up to maxReceiveCount times before it routes to the DLQ.
DLQ Replay
When you fix the bug that caused messages to fail, you need to replay them from the DLQ back to the source queue. AWS Console has a “Redrive” button since 2021. Via CLI:
```shell
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789:orders-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789:orders \
  --max-number-of-messages-per-second 5
```

Rate-limit the redrive (--max-number-of-messages-per-second 5) to avoid flooding your consumers with backlogged messages on top of live traffic.
Terraform: SQS Queue + DLQ + CloudWatch Alarm
```hcl
resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-dlq"
  message_retention_seconds = 1209600 # 14 days — maximum

  tags = {
    Environment = var.environment
    Service     = "order-processor"
  }
}

resource "aws_sqs_queue" "orders" {
  name                       = "orders"
  visibility_timeout_seconds = 60
  message_retention_seconds  = 345600 # 4 days
  receive_wait_time_seconds  = 20     # Long polling — reduces empty ReceiveMessage calls

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5
  })

  tags = {
    Environment = var.environment
    Service     = "order-processor"
  }
}

# DLQ policy: the redrive itself is authorized by redrive_policy above;
# this grants the worker role permission to read and delete from the DLQ
resource "aws_sqs_queue_policy" "orders_dlq_policy" {
  queue_url = aws_sqs_queue.orders_dlq.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = { AWS = var.worker_role_arn }
        Action    = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource  = aws_sqs_queue.orders_dlq.arn
      }
    ]
  })
}

# CloudWatch alarm: any messages in DLQ = something is wrong
resource "aws_cloudwatch_metric_alarm" "orders_dlq_not_empty" {
  alarm_name          = "orders-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Sum"
  threshold           = 0
  alarm_description   = "Messages in orders DLQ — investigate immediately"
  alarm_actions       = [var.sns_alert_topic_arn]
  ok_actions          = [var.sns_alert_topic_arn]
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = aws_sqs_queue.orders_dlq.name
  }
}

# Optional: alarm on high queue depth indicating consumer lag
resource "aws_cloudwatch_metric_alarm" "orders_queue_depth_high" {
  alarm_name          = "orders-queue-depth-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Average"
  threshold           = 1000
  alarm_description   = "Order queue depth above 1000 — consumers may be lagging"
  alarm_actions       = [var.sns_alert_topic_arn]
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = aws_sqs_queue.orders.name
  }
}
```

Edge Cases and Production Gotchas
The Poison Message Problem
A poison message is one that your code can never successfully process — a malformed payload, an invalid foreign key reference, a message that triggers a bug in your code. Every time it is delivered, your worker crashes or returns an error, the visibility timeout expires, and it re-enters the queue. With maxReceiveCount: 10, a single poison message consumes 10× processing attempts before landing in the DLQ.
At scale, a batch of 1,000 poison messages (a common occurrence after a bad deployment ships malformed events) generates 10,000 failed processing attempts before they all land in the DLQ. On a 10-worker pool, this can occupy all workers for minutes.
Defenses:
- Validate message schema before processing. Reject (and immediately send to DLQ equivalent) messages that fail schema validation. This requires a custom DLQ routing mechanism if you want to separate schema errors from processing errors.
- Set maxReceiveCount lower (3–5) for queues with strict schema requirements.
- Implement a fast-fail circuit: if a worker receives the same message (by message ID) more than twice in the same process lifetime, skip it immediately rather than retrying.
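The first defense can be sketched as validate-then-quarantine (the required fields and the quarantine callback are assumptions for illustration; quarantine would typically send to a dedicated schema-errors queue):

```python
REQUIRED_FIELDS = {"order_id", "customer_id", "amount"}  # assumed schema

def validate_order(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the message is processable."""
    errors = [f"missing field: {f}"
              for f in sorted(REQUIRED_FIELDS - payload.keys())]
    if "amount" in payload and not isinstance(payload["amount"], (int, float)):
        errors.append("amount must be numeric")
    return errors

def handle_message(message: dict, process, quarantine) -> None:
    """Fast-fail: quarantine schema failures instead of burning retries."""
    errors = validate_order(message)
    if errors:
        quarantine(message, errors)  # e.g. send to a schema-errors queue
        return
    process(message)
```

A message that can never validate goes through exactly one delivery attempt instead of maxReceiveCount of them.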
Backlog Explosions
A consumer outage while producers continue at full rate creates a backlog. When consumers come back online, they process the backlog at maximum speed, often causing downstream systems (databases, APIs, payment processors) to be overwhelmed by the burst.
Prevention:
- Implement rate limiting in consumers independent of worker count. A semaphore or token bucket that limits processing to N operations per second regardless of how many workers are running.
- Use SQS message delay (DelaySeconds) during recovery to spread backlog processing across time, though this only applies to new messages — existing backlog messages are not delayed.
- Scale consumers gradually during recovery rather than launching the full worker fleet immediately. Step scaling policies with conservative scale-out steps help here.
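The in-consumer rate limit can be sketched as a token bucket. This version is per-process (for a fleet-wide cap you would back it with shared state such as Redis); the injectable clock exists only to make it deterministic:

```python
import time

class TokenBucket:
    """Caps processing to `rate` operations/sec regardless of worker count
    within this process; refills continuously up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, never beyond capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A worker calls `try_acquire()` before each message and sleeps briefly on False, so a recovered backlog drains at a controlled rate instead of all at once.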
Kafka Consumer Rebalances
When a Kafka consumer joins or leaves a consumer group, the group coordinator triggers a rebalance — partitions are redistributed among active consumers. During a rebalance, no consumer in the group processes messages. Rebalance duration depends on consumer count, partition count, and the max.poll.interval.ms configuration.
Common causes of unexpected rebalances:
- Consumer processing takes longer than max.poll.interval.ms (default 5 minutes). Kafka interprets this as a dead consumer and removes it from the group.
- Container restarts or deployments. Graceful shutdown should call consumer.close() to trigger a clean leave rather than waiting for session timeout.
- Slow group coordinator elections in MSK during broker rolling updates.
For ECS deployments with Kafka consumers, implement graceful shutdown: catch SIGTERM, stop the poll loop, commit current offsets, and call consumer.close() before exiting. This allows Kafka to reassign partitions immediately rather than waiting for session timeout (45 seconds by default).
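That shutdown sequence can be sketched with a confluent_kafka-style consumer (the process callback is a placeholder):

```python
import signal

class ShutdownFlag:
    """Set by the SIGTERM handler so the poll loop can exit cleanly."""
    def __init__(self):
        self.triggered = False
    def __call__(self, signum=None, frame=None):
        self.triggered = True

def run_consumer(consumer, topic: str, process) -> None:
    """ECS sends SIGTERM before SIGKILL: stop polling, commit, then close()
    so the group rebalances immediately instead of waiting ~45s."""
    stop = ShutdownFlag()
    signal.signal(signal.SIGTERM, stop)
    consumer.subscribe([topic])
    try:
        while not stop.triggered:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            process(msg)
            consumer.commit(message=msg, asynchronous=False)
    finally:
        consumer.close()  # clean leave: partitions reassigned right away
```

Make sure your ECS task's stopTimeout exceeds the time one message takes to process, or SIGKILL will interrupt the loop anyway.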
Redis List Durability
Redis LIST queues (LPUSH/BRPOP) are only as durable as your Redis persistence configuration. With AOF disabled and RDB snapshots every 15 minutes, a Redis crash loses up to 15 minutes of queued work. For workloads where message loss is acceptable (cache warming, low-priority notifications), this is fine. For workloads where message loss is not acceptable, use SQS or Kafka instead of Redis LIST.
Redis Streams with consumer groups have the PEL (Pending Entries List) that tracks unacknowledged messages. Unacknowledged messages survive consumer crashes — they remain in the PEL and can be claimed by another consumer using XCLAIM. But they still depend on Redis durability for surviving broker restarts.
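Reclaiming stale PEL entries can be sketched with redis-py-style xpending_range/xclaim calls (the field names follow redis-py 4.x responses, an assumption worth verifying against your client version):

```python
def stale_entry_ids(pending, min_idle_ms: int) -> list:
    """Pick PEL entries idle long enough to be presumed orphaned."""
    return [p["message_id"] for p in pending
            if p["time_since_delivered"] >= min_idle_ms]

def reclaim(client, stream: str, group: str, consumer: str,
            min_idle_ms: int = 60_000):
    """Take over messages whose original consumer died before XACK."""
    pending = client.xpending_range(stream, group, min="-", max="+", count=100)
    ids = stale_entry_ids(pending, min_idle_ms)
    if not ids:
        return []
    # XCLAIM transfers ownership to `consumer` and resets the idle timer
    return client.xclaim(stream, group, consumer, min_idle_ms, ids)
```

Run this periodically (or use XAUTOCLAIM on Redis 6.2+, which combines the scan and claim into one command).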
Putting It Together: Queue Selection Flowchart
Start with these questions in order:
1. Do you need replay? If yes, use Kafka. Proceed to #2 otherwise.
2. Do you need sub-millisecond latency and can accept message loss? If yes, use Redis LIST. Proceed to #3 otherwise.
3. Do you need multiple independent consumer groups on the same stream? If yes, use Redis Streams (low volume, operational simplicity) or Kafka (high volume, replay needed). Proceed to #4 otherwise.
4. Is your throughput above 100 million messages/day sustained? If yes, model MSK Kafka costs vs SQS at $0.40/million. Proceed to #5 otherwise.
5. Use SQS. It is managed, durable, cheap at typical workloads, and integrates natively with Lambda, ECS, and the rest of AWS.
For most teams building on AWS, SQS covers 80% of queue use cases. The operational overhead of running MSK or managing Redis persistence is a real cost that does not show up in the benchmark blog posts. Start with SQS, instrument it, and move to Kafka only when you hit a concrete limitation that SQS cannot address.
Related reading: AWS SQS Reliable Messaging Patterns for Production and AWS Auto Scaling Strategies for EC2, ECS, and Lambda.