How Startups Accidentally Burn $100k/month
Quick summary: The most expensive AWS bills do not come from large-scale systems under heavy load. They come from small systems with invisible failure modes: infinite retry loops, misconfigured queues, forgotten resources, and traffic patterns nobody anticipated.
Key Takeaways
- The most expensive AWS bills come not from heavy load but from invisible failure modes: retry loops, misconfigured endpoints, duplicated pipelines, and forgotten resources
- Layered detection reduces mean time to detection from weeks to hours, and mean time to remediation from days to minutes

Part 7 of 8: The AWS Cost Trap — Why Your Bill Keeps Surprising You
Nobody sets out to generate a $100,000 AWS bill. The costs that reach that level at unexpected speed come from a small set of failure patterns, each of which is independently innocuous-looking but devastating in combination with scale and time.
These are not hypothetical scenarios. They are architectural failure modes that recur across startups, scale-ups, and mature engineering organizations. Understanding the mechanism of each one is the prerequisite to building the guards that prevent them.
Pattern 1: The Infinite Retry Loop
The setup: a Lambda function processes messages from an SQS queue. The Lambda function calls a downstream API. The downstream API is intermittently unavailable. The Lambda function throws an exception when the API call fails.
SQS behavior on Lambda exception: the message becomes visible again in the queue after the visibility timeout expires. Lambda picks it up and retries. The downstream API is still unavailable. Another exception. The message returns to the queue. Lambda picks it up again.
If the visibility timeout is 30 seconds and the Lambda function runs for 10 seconds before failing, the message cycles through Lambda every 40 seconds. In one hour, that message generates 90 Lambda invocations, each failing, each billed. If 10,000 messages are stuck in this failure mode simultaneously, you have 900,000 failed Lambda invocations per hour.
Lambda billing is per invocation and per GB-second of execution duration. 900,000 invocations per hour at 10 seconds each, with 512 MB memory, is a significant charge — continuous, for as long as the downstream service is unavailable. If the downstream service is unavailable for 12 hours, the Lambda invocation cost from that incident alone can be substantial.
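To make the order of magnitude concrete, here is a back-of-the-envelope calculation in Python. The per-GB-second and per-request prices are assumptions (approximate us-east-1 Lambda rates at the time of writing) and vary by region and over time.

```python
# Rough cost of the retry storm described above. Prices are ASSUMPTIONS
# (approximate us-east-1 Lambda rates); check current pricing before
# relying on these numbers.
PRICE_PER_GB_SECOND = 0.0000166667  # USD, assumed
PRICE_PER_MILLION_REQUESTS = 0.20   # USD, assumed

def retry_storm_cost(invocations_per_hour, duration_s, memory_gb, hours):
    """Estimate Lambda cost of a sustained retry storm."""
    gb_seconds = invocations_per_hour * duration_s * memory_gb * hours
    compute = gb_seconds * PRICE_PER_GB_SECOND
    requests = invocations_per_hour * hours / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute + requests

# 10,000 stuck messages x 90 cycles/hour = 900,000 invocations/hour,
# 10 s each at 512 MB, sustained for a 12-hour downstream outage:
cost = retry_storm_cost(900_000, 10, 0.5, 12)
print(f"~${cost:,.0f}")  # → ~$902 of Lambda charges from one incident
```

And that figure excludes the SQS request charges the storm generates on top.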
The compounding factor: the SQS queue is also retaining those messages. Each failed message is re-received multiple times before hitting the maximum receive count and landing in the dead letter queue (if one is configured). SQS charges per million API requests, including receives. High-volume retry storms generate their own SQS request charges on top of the Lambda compute charges.
The configuration errors that create this pattern:
- No dead letter queue configured, so messages never expire out of the retry loop
- Maximum receive count set too high (often 10 or more “for safety”), multiplying the invocations each stuck message generates before it reaches the DLQ
- Visibility timeout set too low, causing messages to become visible again before Lambda timeout
- No circuit breaker in the Lambda function code to stop retrying a consistently failing downstream
The fixes:
- Every SQS queue with a Lambda trigger must have a DLQ configured. Non-negotiable.
- Set maximum receive count based on your actual retry strategy, not as a large safety buffer. Three to five retries is usually sufficient.
- Implement exponential backoff in your Lambda function code for retryable errors, and fail fast (without retry) for non-retryable errors.
- Monitor NumberOfMessagesSent vs. NumberOfMessagesDeleted on your SQS queues. A widening gap indicates messages are accumulating — the precursor to a retry storm.
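The backoff-and-fail-fast fix can be sketched in a few lines. This is a minimal illustration, not a specific library's API; the exception class names and delay parameters are assumptions you would adapt to your own error taxonomy.

```python
import random
import time

class NonRetryableError(Exception):
    """Errors that should fail immediately (bad input, auth failure)."""

class RetryableError(Exception):
    """Transient errors (throttling, timeouts) worth retrying."""

def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Retry fn with capped exponential backoff and jitter; fail fast otherwise."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except NonRetryableError:
            raise  # no point retrying: let the message go to the DLQ
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            # full jitter: sleep a random time up to the capped exponential delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The key property is the asymmetry: transient failures get a bounded number of spaced-out retries, while permanent failures take exactly one invocation before the message heads toward the DLQ.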
Pattern 2: The Misconfigured Public Endpoint
A startup is building a product. The API is public, behind an Application Load Balancer. The product has not yet launched, but the ALB DNS name is resolvable.
A security researcher discovers the endpoint. A bot scanner discovers the endpoint. A competitor’s scraper discovers the endpoint. None of these are your users. All of them generate valid HTTP requests that your infrastructure processes.
For web applications, the cost of serving unwanted traffic is bounded by the compute cost of your auto-scaled backend — which scales to serve the unexpected load — plus the data transfer cost of all responses sent to sources you did not intend to serve.
For AI inference endpoints, the cost profile is different and far more severe. A single inference request to a large language model endpoint can consume multiple seconds of GPU time. An endpoint that is publicly accessible and not authenticated, hit by automated scanners sending large prompts, can generate GPU compute costs that reach thousands of dollars in hours. The scanners pay nothing. You pay for every inference.
This pattern is not about sophisticated attacks. Standard internet scanner traffic — automated tools probing for exposed APIs — is sufficient to generate material costs on unprotected AI inference endpoints. The security failure (unauthenticated public endpoint) is also a cost failure.
The specific failure modes:
- AI inference endpoints deployed without authentication for “quick testing” that never get auth added
- Development/staging environments with public endpoints that receive production-scale bot traffic
- S3 buckets configured for static website hosting with no access controls, receiving listing requests from crawlers that enumerate bucket contents at scale
- Lambda function URLs (a feature that provides a direct HTTPS endpoint for Lambda without API Gateway) deployed without auth, discoverable via certificate transparency logs
The fixes:
- All AI inference endpoints require authentication before public DNS resolution is configured. There is no valid reason to have an unauthenticated AI inference endpoint.
- Use AWS WAF on all public-facing ALBs and API Gateways. A basic rate-limiting rule that blocks sources making more than a configured number of requests per minute prevents both scraping and retry-storm traffic.
- Configure AWS Shield Standard (free, automatic for ALB and CloudFront) and understand what it does and does not protect.
- Enable access logging on all public-facing services. The log data volume is small. The ability to identify unexpected traffic sources is essential for both security and cost attribution.
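The rate-limiting rule above is configured in AWS WAF itself, but the underlying logic is easy to reason about. Here is a minimal in-process sketch of the same sliding-window idea (the threshold and window values are illustrative); in production the point of WAF is that this check happens before traffic reaches, and bills, your backend.

```python
from collections import defaultdict, deque

class RateLimiter:
    """Block sources exceeding max_requests per window_s seconds.

    Illustrates the logic of a WAF rate-based rule; a real deployment
    uses AWS WAF so unwanted traffic is dropped at the edge, not in
    application code.
    """
    def __init__(self, max_requests=100, window_s=60):
        self.max_requests = max_requests
        self.window_s = window_s
        self.hits = defaultdict(deque)  # source IP -> request timestamps

    def allow(self, source_ip, now):
        window = self.hits[source_ip]
        while window and now - window[0] >= self.window_s:
            window.popleft()  # drop timestamps outside the window
        if len(window) >= self.max_requests:
            return False  # over the limit: block this request
        window.append(now)
        return True
```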
Pattern 3: Data Pipeline Duplication
Data pipelines are one of the most common sources of unexpected costs because they operate asynchronously, often at night, and their resource consumption is not directly visible in the application monitoring that engineers watch during the day.
The duplication failure pattern: a scheduled Glue job or Step Functions workflow is triggered multiple times due to a configuration error, an EventBridge rule misconfiguration, or a manual re-run. Each trigger processes the same data. Each trigger reads from S3, processes in Glue or Lambda, and writes results back to S3. Multiple triggers create multiple copies of the result, generating:
- S3 GET request costs for each full read of the source data
- Glue DPU-hours for each job run
- S3 PUT request costs for each result write
- Storage costs for duplicate result sets
A Glue job that reads 1 TB of source data and runs for two hours, triggered ten times due to a misconfigured EventBridge rule, costs ten times what one run would cost. The source data is the same each time. The output is ten copies of the same result.
Step Functions has a different duplication risk: Standard Workflows charge per state transition, so a workflow with many states, triggered at high frequency due to an event source misconfiguration, generates state transition charges that accumulate rapidly. Express Workflows charge per execution and per duration, which can be cost-efficient for high-frequency short workflows — but a recursive workflow, or an execution that loops due to a bug in the state machine definition, generates continuous charges until a human intervenes.
The detection mechanism for pipeline duplication is simple but rarely implemented: track job run counts per time window as an operational metric. If a nightly Glue job runs twice in one night, that is an anomaly worth alerting on. CloudWatch Events and Step Functions both emit execution-level metrics that can be used as the basis for this alert.
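The run-count check can be a few lines around whatever metric store you already query. A sketch, assuming you can list execution start times per job (function and field names are illustrative):

```python
from datetime import datetime, timedelta

def duplicate_runs(run_starts, expected_runs=1, window=timedelta(hours=24)):
    """Return True if a job ran more often than expected in any window.

    run_starts: sorted list of datetimes, one per job execution.
    For a nightly Glue job, expected_runs=1 per 24-hour window: two
    runs in one night is exactly the anomaly worth alerting on.
    """
    for i, start in enumerate(run_starts):
        in_window = [t for t in run_starts[i:] if t - start < window]
        if len(in_window) > expected_runs:
            return True
    return False
```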
Pattern 4: Zombie Resources
Zombie resources are AWS resources that are provisioned, accruing charges, and no longer serving their intended purpose. They accumulate over time as:
- Development environments that were never torn down
- EC2 instances left running after a proof-of-concept ended
- EBS volumes and snapshots from terminated instances where “delete on termination” was not set
- RDS instances from long-completed migrations that were never decommissioned
- ElastiCache clusters provisioned for load testing that was completed months ago
- EKS clusters from “temporary” experiments that have been forgotten
Each individual zombie resource is not expensive. A single unused t3.medium EC2 instance costs less than $40 per month. An unused db.t3.medium RDS instance costs less than $50 per month. But organizations that create resources frequently and have no deprovisioning discipline accumulate dozens or hundreds of zombie resources. A mature startup with two years of infrastructure history can easily have $10,000–$20,000 per month of zombie resource cost.
Zombie resources are invisible in operational monitoring because they are not serving traffic. There are no alarms firing, no users complaining, no dashboards showing anomalies. They exist in a billing dead zone: running, costing money, and never observed.
The detection approach:
For EC2: aws ec2 describe-instances --filters Name=instance-state-name,Values=running combined with CPU utilization data from CloudWatch. Instances with CPU utilization consistently below 5% for 7 days are candidates for termination review.
For RDS: aws rds describe-db-instances combined with DatabaseConnections metric. RDS instances with zero database connections for 7 days are likely unused.
For EBS: aws ec2 describe-volumes --filters Name=status,Values=available returns all unattached volumes. Unattached EBS volumes are paying for storage with no associated compute consuming them.
For EKS/ECS clusters: check whether the cluster has running tasks or nodes. A cluster with no running workloads is still billing for control plane (EKS charges per hour for the cluster control plane regardless of node count).
AWS Trusted Advisor flags low-utilization EC2 instances and idle RDS instances automatically (available to Business and Enterprise Support tier accounts). For accounts without Business Support, the same analysis can be performed manually using the AWS CLI queries above.
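The same thresholds can drive an automated report. Here is a sketch of the classification step, assuming you have already pulled seven days of utilization data per resource from CloudWatch (the dict layout and field names are illustrative):

```python
def zombie_candidates(resources, cpu_threshold=5.0, days_required=7):
    """Flag resources whose utilization suggests they are unused.

    resources: list of dicts with 'id', 'type', and 'daily_metric',
    a list of daily values (avg CPU % for EC2, connection count for
    RDS), newest last. Candidates go to a human for termination review.
    """
    candidates = []
    for r in resources:
        recent = r["daily_metric"][-days_required:]
        if len(recent) < days_required:
            continue  # not enough history to judge
        if r["type"] == "ec2" and max(recent) < cpu_threshold:
            candidates.append(r["id"])   # CPU below 5% for 7 days
        elif r["type"] == "rds" and max(recent) == 0:
            candidates.append(r["id"])   # zero connections for 7 days
    return candidates
```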
The organizational fix is a deprovisioning process that runs on a schedule. Monthly: review all resources created in the previous 90 days and confirm they have an active owner and purpose. Quarterly: review all resources older than 6 months for continued necessity. The process does not need to be complex. A spreadsheet of resource IDs, owners, purposes, and last-verified dates, reviewed monthly, catches zombie resources before they accumulate to significant cost.
Pattern 5: The Misconfigured CloudWatch Logs Subscription
CloudWatch Logs Subscription Filters allow you to stream log data from CloudWatch Logs to Lambda, Kinesis, or other destinations for real-time processing. Each subscription filter is charged based on the volume of data filtered.
The failure pattern: a subscription filter is configured to stream all log data (no filter pattern) from a high-volume log group to a Lambda function for processing. The Lambda function processes the log data and writes results to S3. The log group receives several GB per hour. The subscription filter streams several GB per hour to Lambda. Lambda processes several GB per hour and writes to S3.
Billing impact: CloudWatch Logs data scanned by subscription filters (charged per GB), Lambda invocations and duration (at several GB per hour of data processed), S3 PUT requests and storage for results.
If the Lambda function that receives the streamed data is not working correctly — due to a bug, permission issue, or downstream dependency failure — it may fail silently. The subscription filter continues streaming. CloudWatch Logs continues charging for data scanned. Lambda continues invoking and failing. The data is lost (not written to S3). The charges accumulate.
The fix: subscription filters should always have:
- A specific filter pattern that matches only the events you need, not an empty pattern matching everything
- A DLQ on the Lambda function receiving the stream
- An alarm on Lambda errors for the receiving function
- A confirmation that data is actually arriving in the destination (S3 object count, Kinesis record count)
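The last item, confirming data actually arrives, is the check most often skipped, and it is the one that catches the silent-failure mode. A minimal sketch of the comparison, assuming you can read both counts for the same time window:

```python
def destination_healthy(events_streamed, objects_written):
    """Silent-failure check for a subscription filter pipeline.

    events_streamed: events the filter forwarded during the window.
    objects_written: S3 objects (or Kinesis records) that landed in
    the same window. If events are flowing in but nothing lands, the
    receiving Lambda is likely failing silently: charges accumulate
    with no data to show for them.
    """
    return events_streamed == 0 or objects_written > 0
```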
Pattern 6: Accidental Multi-Region Deployment
CloudFormation StackSets and CDK Apps can deploy to multiple regions. Terraform modules can be applied to multiple regions by looping over provider configurations. These are legitimate multi-region deployment patterns.
The failure mode: a stack or module is deployed to multiple regions accidentally — a configuration change that expands region scope, a wildcard in a StackSet deployment target, or a loop variable that includes more regions than intended. Resources are now running in regions the team does not monitor, does not use, and cannot see in the console views they check regularly.
Costs in unexpected regions appear in Cost Explorer under the correct region but are not visible in region-specific console views that are commonly used for operational review. An engineer checking the EC2 console in us-east-1 does not see the EKS cluster running in ap-southeast-1 from an accidental deployment. The costs appear in billing, not in operational tooling.
Detection: Cost Explorer with grouping by Region, filtered to the current month, reveals spending in regions you did not intend to use. A Budget Alert per region for regions that should be inactive (with a $0 threshold) will trigger immediately if any resource is created in those regions.
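This check is scriptable once you have spend per region exported (for example from the Cost Explorer API). A sketch of the comparison against an allowlist; the region names and zero threshold are illustrative and mirror the $0 budget described above:

```python
ALLOWED_REGIONS = {"us-east-1", "eu-west-1"}  # regions you intend to use

def unexpected_region_spend(spend_by_region, allowed=ALLOWED_REGIONS,
                            threshold=0.0):
    """Return {region: spend} for regions that should be inactive.

    spend_by_region: dict of region name -> month-to-date USD spend.
    Any spend above threshold in a region outside the allowlist is
    flagged, e.g. an accidental StackSet or Terraform deployment.
    """
    return {region: cost for region, cost in spend_by_region.items()
            if region not in allowed and cost > threshold}
```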
Building the Failure-Pattern Radar
No single monitoring configuration catches all of these patterns. The effective approach is a layered radar:
Tier 1 — Operational (same-hour response):
- CloudWatch alarms on SQS queue depth vs. messages deleted ratio
- Lambda error rate above 5% on any function
- NAT Gateway bytes processed per hour above baseline
- Custom metric: Lambda invocation count per hour above 10× baseline
Tier 2 — Cost proxy (daily review):
- Cost Explorer daily spend by service, compared to 7-day rolling average
- CloudWatch log ingestion volume per day per log group
- S3 request count per day per bucket
Tier 3 — Resource audit (monthly):
- Unattached EBS volumes
- EC2 instances with low CPU utilization
- RDS instances with zero connections
- EKS clusters with no running workloads
- IAM roles created for specific temporary purposes that are still active
Tier 4 — Architecture review (quarterly):
- All SQS queues: DLQ configured? Retry count appropriate? Message retention appropriate?
- All Lambda functions: error rate trends? Invocation count trends? Timeout configuration?
- All public-facing endpoints: authentication required? WAF configured? Access logging enabled?
The goal of this radar is not to eliminate all failures — some cost events are unavoidable. The goal is to reduce mean time to detection from weeks to hours, and mean time to remediation from days to minutes.
Related reading: AWS SQS: Reliable Messaging Patterns for Production covers dead letter queue configuration, visibility timeout tuning, and message lifecycle management — the operational fixes for the SQS retry patterns described in this post. AWS Step Functions: Workflow Orchestration Patterns covers state machine design that avoids the recursive loop failures described in Pattern 3.
Next in the series: Part 8 — Cost Control Is Architecture, Not Discounts. The actionable playbook: architectural patterns that reduce costs structurally, rightsizing and reserved capacity strategy, and the principles for designing cost-aware systems from the ground up.
The AWS Cost Trap — Full Series
Part 1 — Billing Complexity as a System Problem · Part 2 — Data Transfer Costs · Part 3 — Autoscaling + AI Workloads · Part 4 — Observability & Logging Costs · Part 5 — S3 Storage Cost Traps · Part 6 — The FinOps Gap · Part 7 — Real Failure Patterns · Part 8 — Optimization Playbook
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.




