How to Build Reliable Queue Systems on AWS (SQS, Kafka, Redis)

Cloud Architecture · Palaniappan P · 16 min read

Quick summary: SQS, MSK Kafka, and Redis queues are not interchangeable. Each has different cost models, ordering guarantees, and failure modes. This guide covers when to use each, how to autoscale workers on queue depth, and how to build idempotent consumers.


Queue systems are one of those infrastructure decisions that seems simple until your production incident postmortem traces a $40,000 billing spike back to a misconfigured visibility timeout. SQS, MSK Kafka, and Redis each solve the same surface problem — move work from a producer to one or more consumers asynchronously — but they do it with different delivery guarantees, cost structures, and operational demands. Picking the wrong one creates problems that are expensive and painful to fix under load.

This guide covers the real decision criteria, not benchmarks run on EC2 instances that bear no resemblance to your actual workload.

The Decision Guide: SQS vs Kafka vs Redis

Before examining each system, here is a comparison table with the numbers that actually matter for production decisions:

| Dimension | SQS Standard | SQS FIFO | MSK Kafka | Redis LIST | Redis Streams |
|---|---|---|---|---|---|
| Ordering | Best-effort | Strict per message group | Per partition | LIFO/FIFO (depends on command) | Append-only, strict insertion order |
| Delivery | At-least-once | Exactly-once (within 5 min) | At-least-once (configurable) | At-most-once | At-least-once with consumer groups |
| Replay | No | No | Yes (configurable retention) | No | Limited (XRANGE) |
| Throughput | Unlimited (soft limits) | 3,000 msg/sec per API action | Millions/sec | ~100K ops/sec per node | ~100K ops/sec per node |
| Latency | 1–10ms typical | 1–10ms typical | 5–15ms typical | <1ms | <1ms |
| Base cost (AWS) | $0.40/M requests | $0.50/M requests | ~$200/month minimum | ElastiCache node cost | ElastiCache node cost |
| Durability | Multi-AZ, 99.999999% | Multi-AZ, 99.999999% | Configurable replication | RDB/AOF snapshots | RDB/AOF snapshots |
| Consumer groups | No (competing consumers) | No | Yes, independent | No | Yes (XREADGROUP) |
| Operational overhead | Minimal | Minimal | High | Low–Medium | Low–Medium |

Use SQS When

SQS is the right default for the majority of async processing workloads on AWS. Use it when:

  • You want managed infrastructure with no brokers to operate. SQS has no servers, no clusters, no partition rebalancing events.
  • Your message volume is below approximately 100 million messages per day. Above that, MSK Kafka’s flat compute cost may undercut SQS per-request pricing.
  • You do not need replay. SQS is a queue, not a log — once a message is consumed and deleted, it is gone.
  • Multiple systems need to consume the same events. Use SQS with SNS fan-out: SNS delivers to multiple SQS queues, each with independent consumers.
  • You need Lambda integration. SQS triggers Lambda natively with configurable batch sizes and concurrency controls.

Cost reality: At $0.40 per million requests for standard queues, processing 10 million messages per day costs $4/day in API requests. Factor in that each message typically requires two API calls (ReceiveMessage + DeleteMessage) plus any ChangeMessageVisibility calls, so the real cost is closer to $8–12/day for 10 million messages with typical retry patterns.
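That cost arithmetic is worth making explicit. A minimal sketch, using the prices and call counts quoted above (not live AWS pricing):

```python
def sqs_daily_cost(messages_per_day: int,
                   api_calls_per_message: float = 2.0,
                   price_per_million: float = 0.40) -> float:
    """Estimate daily SQS request cost in USD.

    Assumes each message needs ReceiveMessage + DeleteMessage (2 calls);
    raise api_calls_per_message to model ChangeMessageVisibility retries.
    """
    requests = messages_per_day * api_calls_per_message
    return requests / 1_000_000 * price_per_million

# 10M messages/day with the typical 2 calls per message:
print(f"${sqs_daily_cost(10_000_000):.2f}/day")  # → $8.00/day
```

With three calls per message (retry-heavy workloads), the same volume lands at $12/day, matching the $8–12 range above.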

Use MSK Kafka When

Kafka’s value proposition is the log abstraction: messages are retained and can be re-read by consumers at any offset, at any time, within the retention window.

Use MSK Kafka when:

  • Multiple independent consumer groups need to process the same events at their own pace. Your analytics pipeline and your notification service can both consume the same order events without coordinating.
  • You need replay for recovery or backfill. When your new recommendation engine deploys, it can replay 30 days of purchase events to build its initial model.
  • Your throughput exceeds SQS’s cost-effective range. MSK kafka.t3.small clusters start at approximately $200/month regardless of message volume. At 100 million messages per day, SQS request charges alone run roughly $80/day (counting two API calls per message), so the flat MSK cost wins.
  • You need strict ordering across millions of messages. Kafka partitions provide strict ordering within a partition with a single consumer per partition in a consumer group.

Cost reality: MSK costs are driven by broker instance type, storage, and data transfer — not message count. A production-grade 3-broker kafka.m5.large cluster with 1TB storage runs approximately $800–1,200/month. Add MSK Connect, Schema Registry, or Cruise Control and costs rise further. If you are currently at 1 million messages/day and considering MSK for its features, make sure those features justify $800/month compared to SQS at $0.80/day.
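Under the same assumptions (two API calls per message, standard SQS pricing), the break-even volume against a flat monthly MSK bill can be estimated. Note this compares raw request cost only and ignores MSK's operational overhead, which pushes the practical crossover much higher:

```python
def msk_breakeven_messages_per_day(msk_monthly_usd: float,
                                   sqs_price_per_million: float = 0.40,
                                   api_calls_per_message: float = 2.0,
                                   days_per_month: int = 30) -> float:
    """Daily message volume at which SQS request spend matches a flat MSK bill."""
    cost_per_message = api_calls_per_message * sqs_price_per_million / 1_000_000
    return msk_monthly_usd / (cost_per_message * days_per_month)

# A minimal $200/month cluster vs a production $800/month cluster:
print(f"{msk_breakeven_messages_per_day(200):,.0f} messages/day")  # → 8,333,333 messages/day
print(f"{msk_breakeven_messages_per_day(800):,.0f} messages/day")  # → 33,333,333 messages/day
```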

Use Redis Queues When

Redis queues occupy a specific niche: ultra-low latency and ephemeral workloads.

Use Redis LIST (LPUSH/BRPOP) when:

  • You need sub-millisecond enqueue and dequeue latency. Redis operates in memory; network latency is the dominant cost.
  • Durability is secondary to speed. If the Redis node crashes and you lose pending messages, your application can recover. Job scheduling (cron-triggered work), cache warming tasks, and low-value notification sends are examples.
  • You want simple implementation. LPUSH and BRPOP are four lines of code with no SDK ceremony.

Use Redis Streams (XADD/XREADGROUP) when:

  • You want consumer groups (multiple independent consumer groups reading the same stream) without Kafka’s operational complexity.
  • You want message acknowledgment — consumers claim messages, process them, and acknowledge with XACK. Unacknowledged messages in the Pending Entries List (PEL) can be reclaimed by other consumers.
  • Retention is bounded. Streams grow indefinitely unless trimmed with MAXLEN or MINID. In production, always configure MAXLEN.

Ordering Guarantees in Practice

Ordering is where teams get burned. The systems guarantee ordering at different granularities with different tradeoffs.

SQS Standard: Best-effort ordering means messages are usually delivered close to the order they were sent, but there are no guarantees. A message sent after another may be delivered first. Do not build business logic that depends on SQS Standard ordering.

SQS FIFO: Messages are delivered in strict order within a message group. The message group ID is a string you attach to each message. All messages with the same group ID are delivered in order, one at a time (if using a single consumer). Multiple message groups can be processed in parallel. Throughput cap: 300 messages/sec per API action, or 3,000/sec with maximum batching; high-throughput mode (enabled separately) raises the limit well beyond that. FIFO queues cost $0.50/million requests vs $0.40/million for standard.

Kafka: Strict ordering within a partition. Messages with the same partition key always go to the same partition and are consumed in order by the partition’s assigned consumer. Ordering across partitions is not guaranteed. For total ordering across all messages, use a single partition — but this limits your consumer group to one active consumer for that topic.

Redis LIST: LPUSH + BRPOP (push to the head, pop from the tail) gives you FIFO. LPUSH + BLPOP (push and pop the same end) gives LIFO. Strict ordering within a single Redis node, but only one consumer can pop a given message (competing consumers model).

Redis Streams: Strict insertion order by stream entry ID (millisecond timestamp + sequence number). Consumer groups allow multiple consumers to process different messages concurrently, but the PEL tracks which messages are in-flight.
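The LIST semantics are easy to misremember, so here they are modeled with a plain collections.deque — LPUSH is appendleft, BRPOP pops the right end, BLPOP pops the left (a stdlib sketch, not a Redis client):

```python
from collections import deque

# Model a Redis list: index 0 is the head (left), -1 is the tail (right).
queue: deque[str] = deque()

# LPUSH pushes to the head...
for msg in ["a", "b", "c"]:
    queue.appendleft(msg)          # LPUSH mylist a / b / c

# ...so BRPOP (pop from the tail) yields FIFO order:
fifo_order = [queue.pop() for _ in range(3)]
print(fifo_order)  # → ['a', 'b', 'c']

# Pushing and popping the same end (LPUSH + BLPOP) yields LIFO instead:
stack: deque[str] = deque()
for msg in ["a", "b", "c"]:
    stack.appendleft(msg)
lifo_order = [stack.popleft() for _ in range(3)]
print(lifo_order)  # → ['c', 'b', 'a']
```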

Worker Autoscaling on Queue Depth

Idle workers waste money. Insufficient workers cause latency spikes. The goal is workers that scale with actual queue depth.

SQS + ECS Step Scaling

ECS does not natively understand SQS queue depth, but CloudWatch does. The pattern:

  1. SQS publishes ApproximateNumberOfMessagesVisible to CloudWatch automatically (no configuration needed).
  2. Create a Step Scaling policy on your ECS Service tied to this metric.
  3. Define scale-out steps: +2 tasks when queue depth > 100, +5 tasks when > 500, +10 tasks when > 2000.
  4. Define scale-in steps with a longer cooldown (5–10 minutes) to avoid thrashing.

The target-tracking equivalent: use a custom metric that represents messages-per-worker. Publish this computed metric (queue depth / current task count) via a Lambda that runs on a 1-minute schedule. Target-track that metric to a value of 100 (100 messages per worker). ECS will scale to maintain the ratio.
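The computed metric is just queue depth divided by task count, with a guard for zero tasks. A sketch of the calculation the scheduled Lambda would perform (publishing it would go through CloudWatch PutMetricData, omitted here):

```python
def backlog_per_worker(queue_depth: int, running_tasks: int) -> float:
    """Messages-per-worker metric for ECS target tracking.

    With zero running tasks, report the raw depth so a non-empty queue
    still drives a scale-out from zero.
    """
    if running_tasks <= 0:
        return float(queue_depth)
    return queue_depth / running_tasks

# Target-tracking this metric at 100 keeps ~100 messages per task:
print(backlog_per_worker(1200, 4))  # → 300.0 (above target: scale out)
print(backlog_per_worker(200, 4))   # → 50.0 (below target: scale in)
```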

For Lambda consumers, SQS event source mapping handles scaling automatically: Lambda scales the number of concurrent executions based on the number of available messages, up to your account’s concurrency limit (default 1,000, configurable).

KEDA for Kubernetes (EKS)

KEDA monitors external metrics and adjusts Kubernetes Deployment or Job replicas accordingly. The SQS scaler:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker-scaler
spec:
  scaleTargetRef:
    name: order-worker
  minReplicaCount: 0
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "10"
        awsRegion: us-east-1
        identityOwner: operator

queueLength: "10" means KEDA targets 10 messages per replica. With 100 messages in the queue, it scales to 10 replicas. With 0 messages, it scales to 0 (minReplicaCount: 0). Scale-to-zero is one of KEDA’s most valuable features: workers cost nothing when there is no work.

For Kafka consumer group lag, the KEDA Kafka scaler monitors the difference between the latest offset and the committed consumer group offset:

triggers:
  - type: kafka
    metadata:
      bootstrapServers: my-msk-cluster.kafka.us-east-1.amazonaws.com:9092
      consumerGroup: order-processor
      topic: orders
      lagThreshold: "50"
      offsetResetPolicy: latest

Idempotency Patterns

SQS Standard delivers at-least-once. This is not a bug — it is a documented guarantee. Your consumers must be idempotent.

The Deduplication Table Pattern

The most robust pattern for SQS idempotency:

  1. Before processing a message, attempt to insert the SQS Message ID into a deduplication table with a TTL of the message retention period (default 4 days, max 14 days).
  2. Use a conditional write: DynamoDB ConditionExpression: "attribute_not_exists(messageId)", or a database unique constraint.
  3. If the write succeeds: the message has not been processed before. Process it, then delete the SQS message.
  4. If the write fails (condition failed / unique constraint violation): the message was already processed. Delete the SQS message without reprocessing.
  5. If processing fails after a successful deduplication write: the message returns to the queue (visibility timeout expires), is redelivered, but the deduplication write will fail — so it will not be reprocessed. This is a problem for jobs where partial processing has side effects.

For jobs with side effects (charging a payment, sending an email), the deduplication table must be written in the same transaction as the business logic side effect. For database-backed operations:

BEGIN;
INSERT INTO processed_messages (message_id, processed_at)
VALUES ($1, NOW())
ON CONFLICT (message_id) DO NOTHING;

-- Check rows affected; if 0, message already processed
-- If 1, proceed with business logic
INSERT INTO orders (id, customer_id, amount) VALUES (...);
COMMIT;
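The same transaction-scoped pattern, sketched with stdlib sqlite3 for illustration (table names mirror the SQL above; a real consumer would use your production database driver):

```python
import sqlite3

def process_once(conn: sqlite3.Connection, message_id: str, apply_side_effect) -> bool:
    """Run apply_side_effect at most once per message_id.

    The dedup insert and the business write commit or roll back together,
    so a crash mid-processing leaves no orphaned dedup record behind.
    """
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_messages (message_id) VALUES (?)",
            (message_id,),
        )
        if cur.rowcount == 0:
            return False         # already processed — skip
        apply_side_effect(conn)  # business logic inside the same transaction
        return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (id TEXT)")

insert_order = lambda c: c.execute("INSERT INTO orders (id) VALUES ('o-1')")
print(process_once(conn, "msg-1", insert_order))  # → True  (processed)
print(process_once(conn, "msg-1", insert_order))  # → False (duplicate skipped)
```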

Laravel Queue Worker with SQS and Idempotency

<?php

namespace App\Jobs;

use App\Models\ProcessedMessage;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\DB;
use Throwable;

class ProcessOrderJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 5;
    public int $backoff = 30;

    public function __construct(
        public readonly string $orderId,
        public readonly array $orderData,
    ) {}

    public function handle(): void
    {
        $sqsMessageId = $this->job->getJobId();

        DB::transaction(function () use ($sqsMessageId) {
            // Attempt idempotency insert — unique constraint on message_id
            $inserted = DB::table('processed_messages')->insertOrIgnore([
                'message_id' => $sqsMessageId,
                'job_class' => static::class,
                'processed_at' => now(),
            ]);

            if ($inserted === 0) {
                // Message already processed; nothing to do
                return;
            }

            // Safe to process — this branch executes exactly once per message ID
            $this->processOrder($this->orderId, $this->orderData);
        });
    }

    private function processOrder(string $orderId, array $data): void
    {
        // Business logic here
    }

    public function failed(Throwable $exception): void
    {
        // Clean up the idempotency record so a redelivered message can retry.
        // Only do this if you want retry semantics after full failure.
        // Note: depending on the Laravel version, $this->job may be null here,
        // so guard before reading the job ID.
        $jobId = $this->job?->getJobId();

        if ($jobId !== null) {
            DB::table('processed_messages')
                ->where('message_id', $jobId)
                ->delete();
        }
    }
}

The processed_messages table needs a unique index on message_id:

CREATE TABLE processed_messages (
    id BIGSERIAL PRIMARY KEY,
    message_id VARCHAR(255) NOT NULL,
    job_class VARCHAR(255) NOT NULL,
    processed_at TIMESTAMP NOT NULL,
    CONSTRAINT processed_messages_message_id_unique UNIQUE (message_id)
);

-- Partial index for cleanup: delete records older than SQS retention period
CREATE INDEX processed_messages_processed_at_idx ON processed_messages (processed_at);

Python Kafka Consumer with Manual Offset Commit

Manual offset commits give you control over exactly-once processing semantics: commit the offset only after the message has been successfully processed and any side effects persisted.

import logging
from confluent_kafka import Consumer, KafkaError, KafkaException

logger = logging.getLogger(__name__)


class ProcessingError(Exception):
    """Raised when a message cannot be processed; caught in the poll loop below."""

def create_consumer(bootstrap_servers: str, group_id: str) -> Consumer:
    return Consumer({
        'bootstrap.servers': bootstrap_servers,
        'group.id': group_id,
        'auto.offset.reset': 'earliest',
        'enable.auto.commit': False,       # Manual commit only
        'enable.auto.offset.store': False, # Required to use store_offsets()
        'max.poll.interval.ms': 300000,    # 5 minutes max processing time
        'session.timeout.ms': 45000,
        'heartbeat.interval.ms': 15000,
    })


def process_orders(bootstrap_servers: str, group_id: str, topic: str) -> None:
    consumer = create_consumer(bootstrap_servers, group_id)
    consumer.subscribe([topic])

    try:
        while True:
            msg = consumer.poll(timeout=1.0)

            if msg is None:
                continue

            if msg.error():
                if msg.error().code() == KafkaError._PARTITION_EOF:
                    logger.debug(
                        'Reached end of partition %s [%d] at offset %d',
                        msg.topic(), msg.partition(), msg.offset()
                    )
                else:
                    raise KafkaException(msg.error())
                continue

            try:
                process_message(msg.value(), msg.key())

                # Commit only after successful processing
                # store_offsets commits the offset for this specific message
                consumer.store_offsets(msg)
                consumer.commit(asynchronous=False)

                logger.info(
                    'Processed message at partition=%d offset=%d',
                    msg.partition(), msg.offset()
                )

            except ProcessingError as exc:
                logger.error(
                    'Failed to process message at partition=%d offset=%d: %s',
                    msg.partition(), msg.offset(), exc
                )
                # Do NOT commit — message will be reprocessed after consumer restart
                # For poison messages, implement a max-retry counter in a separate store

    except KeyboardInterrupt:
        logger.info('Shutting down consumer')
    finally:
        # Commit final offsets on clean shutdown
        consumer.commit(asynchronous=False)
        consumer.close()


def process_message(value: bytes, key: bytes | None) -> None:
    """
    Process a single Kafka message.
    Must be idempotent — may be called more than once for the same message
    on consumer restart or rebalance.
    """
    import json
    order = json.loads(value)
    # Business logic here
    _ = order  # suppress unused warning in example

The critical detail: enable.auto.commit: False. With auto-commit enabled (the default), the Kafka client commits offsets on a timer regardless of whether processing succeeded. If your consumer crashes between the auto-commit and your side effect completing, the message appears processed (offset committed) but the business effect never happened.

Dead-Letter Queue Configuration

Every SQS queue in production should have a DLQ. Messages land in the DLQ when they have been received (and failed) more than maxReceiveCount times.

Choosing maxReceiveCount

  • Set it to the number of retries that makes sense for your workload.
  • For transient failures (database unavailable, external API timeout): 5–10 retries is appropriate. These failures resolve themselves.
  • For poison messages (malformed JSON, missing required fields, logic bugs): any number of retries is wasteful. The message will never succeed. Consider lower values (3–5) and invest in robust DLQ monitoring and replay tooling.
  • For Lambda consumers with SQS trigger: Lambda deletes a message from the source queue after successful processing. Failed invocations cause SQS to redeliver the message up to maxReceiveCount times before it routes to the DLQ.

DLQ Replay

When you fix the bug that caused messages to fail, you need to replay them from the DLQ back to the source queue. The AWS Console has offered a “Redrive” button since 2021. Via the CLI:

aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789:orders-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789:orders \
  --max-number-of-messages-per-second 5

Rate-limit the redrive (--max-number-of-messages-per-second 5) to avoid flooding your consumers with backlogged messages on top of live traffic.

Terraform: SQS Queue + DLQ + CloudWatch Alarm

resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-dlq"
  message_retention_seconds = 1209600 # 14 days — maximum

  tags = {
    Environment = var.environment
    Service     = "order-processor"
  }
}

resource "aws_sqs_queue" "orders" {
  name                       = "orders"
  visibility_timeout_seconds = 60
  message_retention_seconds  = 345600 # 4 days
  receive_wait_time_seconds  = 20     # Long polling — reduces empty ReceiveMessage calls

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5
  })

  tags = {
    Environment = var.environment
    Service     = "order-processor"
  }
}

# DLQ policy: allow the source queue to send to DLQ (handled by redrive_policy above)
# but also allow SNS or other producers if needed
resource "aws_sqs_queue_policy" "orders_dlq_policy" {
  queue_url = aws_sqs_queue.orders_dlq.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = { AWS = var.worker_role_arn }
        Action    = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource  = aws_sqs_queue.orders_dlq.arn
      }
    ]
  })
}

# CloudWatch alarm: any messages in DLQ = something is wrong
resource "aws_cloudwatch_metric_alarm" "orders_dlq_not_empty" {
  alarm_name          = "orders-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Sum"
  threshold           = 0
  alarm_description   = "Messages in orders DLQ — investigate immediately"
  alarm_actions       = [var.sns_alert_topic_arn]
  ok_actions          = [var.sns_alert_topic_arn]
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = aws_sqs_queue.orders_dlq.name
  }
}

# Optional: alarm on high queue depth indicating consumer lag
resource "aws_cloudwatch_metric_alarm" "orders_queue_depth_high" {
  alarm_name          = "orders-queue-depth-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Average"
  threshold           = 1000
  alarm_description   = "Order queue depth above 1000 — consumers may be lagging"
  alarm_actions       = [var.sns_alert_topic_arn]
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = aws_sqs_queue.orders.name
  }
}

Edge Cases and Production Gotchas

The Poison Message Problem

A poison message is one that your code can never successfully process — a malformed payload, an invalid foreign key reference, a message that triggers a bug in your code. Every time it is delivered, your worker crashes or returns an error, the visibility timeout expires, and it re-enters the queue. With maxReceiveCount: 10, a single poison message consumes 10× processing attempts before landing in the DLQ.

At scale, a batch of 1,000 poison messages (a common occurrence after a bad deployment ships malformed events) generates 10,000 failed processing attempts before they all land in the DLQ. On a 10-worker pool, this can occupy all workers for minutes.

Defenses:

  • Validate message schema before processing. Reject (and immediately send to DLQ equivalent) messages that fail schema validation. This requires a custom DLQ routing mechanism if you want to separate schema errors from processing errors.
  • Set maxReceiveCount lower (3–5) for queues with strict schema requirements.
  • Implement a fast-fail circuit: if a worker receives the same message (by message ID) more than twice in the same process lifetime, skip it immediately rather than retrying.
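The fast-fail circuit in the last bullet is a small in-process counter. A sketch (real workers would key on the SQS message ID from the receive call):

```python
from collections import Counter

class FastFailCircuit:
    """Skip messages seen too many times in this process lifetime."""

    def __init__(self, max_attempts: int = 2) -> None:
        self.max_attempts = max_attempts
        self.seen: Counter[str] = Counter()

    def should_process(self, message_id: str) -> bool:
        """Record an attempt; return False once the message exceeds the cap."""
        self.seen[message_id] += 1
        return self.seen[message_id] <= self.max_attempts

circuit = FastFailCircuit(max_attempts=2)
print(circuit.should_process("msg-1"))  # → True  (first attempt)
print(circuit.should_process("msg-1"))  # → True  (second attempt)
print(circuit.should_process("msg-1"))  # → False (fast-fail: skip it)
```

Skipped messages still exhaust their maxReceiveCount via the visibility timeout and land in the DLQ, but they stop burning worker time in the meantime.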

Backlog Explosions

A consumer outage while producers continue at full rate creates a backlog. When consumers come back online, they process the backlog at maximum speed, often causing downstream systems (databases, APIs, payment processors) to be overwhelmed by the burst.

Prevention:

  • Implement rate limiting in consumers independent of worker count. A semaphore or token bucket that limits processing to N operations per second regardless of how many workers are running.
  • Use SQS message delay (DelaySeconds) during recovery to spread backlog processing across time, though this only applies to new messages — existing backlog messages are not delayed.
  • Scale consumers gradually during recovery rather than launching the full worker fleet immediately. Step scaling policies with conservative scale-out steps help here.
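The token bucket from the first bullet can be sketched with an injectable clock so it is testable (stdlib only; production code would wrap the receive loop with it):

```python
import time

class TokenBucket:
    """Cap processing at `rate` operations/sec regardless of worker count."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Refill based on elapsed time, then try to take one token."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller backs off; message stays in flight

# Deterministic demo with a fake clock:
t = [0.0]
bucket = TokenBucket(rate=10.0, capacity=5.0, clock=lambda: t[0])
burst = sum(bucket.try_acquire() for _ in range(20))
print(burst)                 # → 5 (burst capped at capacity)
t[0] += 1.0                  # one second later, tokens refill (capped at 5)
print(bucket.try_acquire())  # → True
```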

Kafka Consumer Rebalances

When a Kafka consumer joins or leaves a consumer group, the group coordinator triggers a rebalance — partitions are redistributed among active consumers. During a rebalance, no consumer in the group processes messages. Rebalance duration depends on consumer count, partition count, and the max.poll.interval.ms configuration.

Common causes of unexpected rebalances:

  • Consumer processing takes longer than max.poll.interval.ms (default 5 minutes). Kafka interprets this as a dead consumer and removes it from the group.
  • Container restarts or deployments. Graceful shutdown should call consumer.close() to trigger a clean leave rather than waiting for session timeout.
  • Slow group coordinator elections in MSK during broker rolling updates.

For ECS deployments with Kafka consumers, implement graceful shutdown: catch SIGTERM, stop the poll loop, commit current offsets, and call consumer.close() before exiting. This allows Kafka to reassign partitions immediately rather than waiting for session timeout (45 seconds by default).
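A generic shape for that shutdown sequence, using stdlib signal handling; `consumer` stands in for a confluent-kafka Consumer, and the poll/commit/close calls mirror the consumer example earlier in this guide:

```python
import signal

class GracefulShutdown:
    """Flip a flag on SIGTERM so the poll loop can exit cleanly."""

    def __init__(self) -> None:
        self.stop = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self.stop = True

def run_consumer(consumer, shutdown: GracefulShutdown) -> None:
    while not shutdown.stop:
        msg = consumer.poll(1.0)
        if msg is not None:
            ...  # process + store offsets, as in the Kafka example above
    # SIGTERM received: commit offsets and leave the group immediately,
    # so Kafka reassigns partitions without waiting for session timeout.
    consumer.commit(asynchronous=False)
    consumer.close()
```

ECS sends SIGTERM on task stop and waits `stopTimeout` seconds before SIGKILL, so keep the final commit-and-close path well under that window.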

Redis List Durability

Redis LIST queues (LPUSH/BRPOP) are only as durable as your Redis persistence configuration. With AOF disabled and RDB snapshots every 15 minutes, a Redis crash loses up to 15 minutes of queued work. For workloads where message loss is acceptable (cache warming, low-priority notifications), this is fine. For workloads where message loss is not acceptable, use SQS or Kafka instead of Redis LIST.

Redis Streams with consumer groups have the PEL (Pending Entries List) that tracks unacknowledged messages. Unacknowledged messages survive consumer crashes — they remain in the PEL and can be claimed by another consumer using XCLAIM. But they still depend on Redis durability for surviving broker restarts.

Putting It Together: Queue Selection Flowchart

Start with these questions in order:

  1. Do you need replay? If yes, use Kafka. Proceed to #2 otherwise.
  2. Do you need sub-millisecond latency and can accept message loss? If yes, use Redis LIST. Proceed to #3 otherwise.
  3. Do you need multiple independent consumer groups on the same stream? If yes, use Redis Streams (low volume, operational simplicity) or Kafka (high volume, replay needed). Proceed to #4 otherwise.
  4. Is your throughput above 100 million messages/day sustained? If yes, model MSK Kafka costs vs SQS at $0.40/million. Proceed to #5 otherwise.
  5. Use SQS. It is managed, durable, cheap at typical workloads, and integrates natively with Lambda, ECS, and the rest of AWS.
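The five questions above, encoded directly (a sketch; the 100M/day threshold is this guide's rule of thumb, not an AWS limit):

```python
def choose_queue(needs_replay: bool,
                 sub_ms_latency_and_loss_ok: bool,
                 multiple_consumer_groups: bool,
                 messages_per_day: int) -> str:
    """Walk the selection flowchart in order and return a recommendation."""
    if needs_replay:
        return "Kafka"
    if sub_ms_latency_and_loss_ok:
        return "Redis LIST"
    if multiple_consumer_groups:
        return "Redis Streams (low volume) or Kafka (high volume)"
    if messages_per_day > 100_000_000:
        return "Model MSK vs SQS costs"
    return "SQS"

print(choose_queue(False, False, False, 5_000_000))  # → SQS
```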

For most teams building on AWS, SQS covers 80% of queue use cases. The operational overhead of running MSK or managing Redis persistence is a real cost that does not show up in the benchmark blog posts. Start with SQS, instrument it, and move to Kafka only when you hit a concrete limitation that SQS cannot address.

Related reading: AWS SQS Reliable Messaging Patterns for Production and AWS Auto Scaling Strategies for EC2, ECS, and Lambda.
