Cost Control Is Architecture, Not Discounts

Quick summary: Savings Plans and Reserved Instances reduce the rate you pay. Architecture determines the volume you pay at. The most durable cost reductions in AWS come from designing systems that structurally generate less spend — not from negotiating a lower price for the same behavior.

Key Takeaways

  • Discounts reduce the rate you pay; architecture determines the volume you pay at
  • Locality-aware routing, VPC Endpoints, caching, and batch processing structurally reduce resource consumption
  • Right-size against 95th-percentile utilization before purchasing commitments
  • Cover 70–80% of a stable, already-optimized baseline with Savings Plans; leave the rest on-demand

Part 8 of 8: The AWS Cost Trap — Why Your Bill Keeps Surprising You


When AWS bills exceed expectations, the most common first response is to look for discounts: purchase Reserved Instances, sign up for Savings Plans, negotiate an Enterprise Discount Program. These are valid tools. A one-year Compute Savings Plan can reduce EC2 costs by 30–40% compared to on-demand pricing. Reserved Instances for predictable workloads reduce costs further.

But discounts reduce the rate you pay. They do not reduce the volume of resources you consume. A system that generates $100,000 per month in on-demand compute costs will cost $60,000–$70,000 per month with a Savings Plan — still expensive, and still growing linearly with scale. The architectural patterns that generated $100,000 in compute costs are unchanged. As you grow, you will return to $100,000 and beyond, but now carrying a committed spend that must be paid whether or not your usage still matches it.
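To make the rate-vs-volume distinction concrete, here is a small calculation using the figures from the paragraph above (the 35% discount is an assumed midpoint of the 30–40% range):

```python
# Illustrative only: the $100k/month figure comes from the text above;
# the 35% Savings Plan discount is an assumed midpoint of the 30-40% range.
monthly_on_demand = 100_000          # $/month at on-demand rates
savings_plan_discount = 0.35         # rate reduction only

discounted = monthly_on_demand * (1 - savings_plan_discount)

# An architectural change that removes 40% of the *volume* compounds
# with the rate discount instead of replacing it.
volume_reduction = 0.40
architected = monthly_on_demand * (1 - volume_reduction) * (1 - savings_plan_discount)

print(f"Rate discount only:         ${discounted:,.0f}/month")
print(f"Volume cut + rate discount: ${architected:,.0f}/month")
```

The point of the arithmetic: the discount is a one-time multiplier, while the volume reduction keeps paying off as the system grows.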

Durable cost reduction requires changing the architecture: reducing the volume of resources consumed, not just the price per unit. The playbook for this is a set of structural patterns — each one targeting a specific cost driver identified in the previous posts in this series.

Reduce Cross-AZ Chatter

Inter-Availability Zone data transfer is charged per GB in both directions. In microservices architectures, east-west traffic between services in different AZs generates continuous transfer charges. The fix is architectural: locality-aware routing.

AZ affinity in ECS. When ECS tasks call other ECS services, configure AWS Cloud Map service discovery or the Application Load Balancer to prefer targets in the same AZ as the calling task. ECS Service Connect supports AZ-aware traffic routing. The goal is that a task in us-east-1a calls other tasks in us-east-1a rather than tasks in us-east-1b or us-east-1c.

Topology-aware routing in Kubernetes. EKS supports topology-aware routing via the service.kubernetes.io/topology-mode: Auto annotation on Services (Kubernetes 1.27+; earlier versions expose the same feature through the service.kubernetes.io/topology-aware-hints annotation). When enabled, kube-proxy prefers endpoints in the same zone as the calling pod. This reduces cross-AZ service-to-service traffic without changing application code.
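A minimal Service manifest with the annotation looks like the sketch below; the service name "orders" and ports are placeholders, and the exact annotation key depends on your Kubernetes version as noted above:

```yaml
# Sketch only: "orders" is a placeholder service name.
apiVersion: v1
kind: Service
metadata:
  name: orders
  annotations:
    # Kubernetes 1.27+. On 1.23-1.26 use:
    #   service.kubernetes.io/topology-aware-hints: "auto"
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: orders
  ports:
    - port: 80
      targetPort: 8080
```

Routing falls back to all endpoints when a zone lacks capacity, which is the soft-preference behavior the resilience tradeoff below calls for.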

Placement groups for tightly coupled EC2. A cluster placement group ensures that EC2 instances are placed on hardware within a single Availability Zone, as close together as possible. For workloads with high-bandwidth, low-latency requirements between a fixed set of instances — HPC, distributed databases, large in-memory caches — a cluster placement group eliminates cross-AZ transfer charges for that traffic and improves performance.

The tradeoff: AZ affinity reduces resilience. A system that strictly routes traffic within AZ will experience degraded capacity if that AZ has an outage. The correct design is a soft preference — prefer same-AZ, but fall back to cross-AZ — rather than strict locality that creates a single-AZ dependency. All the mechanisms above support soft preferences; use them that way.

Eliminate Unnecessary NAT Gateway Traffic

As discussed in Part 2, NAT Gateway charges per GB processed. Any private subnet resource calling an AWS service through NAT Gateway is generating avoidable costs. VPC Endpoints route that traffic directly through the AWS network without NAT processing charges.

Gateway Endpoints (free):

  • Amazon S3
  • Amazon DynamoDB

Interface Endpoints (hourly + per-GB charge, but lower than NAT):

  • AWS Secrets Manager
  • AWS Systems Manager (SSM)
  • Amazon ECR (container image pulls)
  • Amazon CloudWatch Logs
  • AWS STS
  • Amazon SQS
  • Amazon SNS

The decision rule: for any AWS service your private-subnet resources call more than a few times per day, evaluate whether a VPC Endpoint reduces costs compared to NAT Gateway processing charges. For S3 and DynamoDB specifically, Gateway Endpoints are always the right choice — they are free, they improve performance, and they eliminate NAT Gateway processing on what are typically the highest-volume inter-service calls.
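The decision rule for Interface Endpoints is a break-even calculation between the endpoint's fixed hourly cost and the NAT processing charge it avoids. A sketch, using illustrative us-east-1 prices (check current pricing before deciding; the figures here are assumptions):

```python
# Break-even: one interface endpoint vs routing the same traffic
# through NAT Gateway. Prices are illustrative us-east-1 figures.
HOURS_PER_MONTH = 730

nat_per_gb = 0.045            # NAT Gateway data-processing, $/GB
endpoint_hourly = 0.01        # interface endpoint, $/hr per AZ
endpoint_per_gb = 0.01        # interface endpoint processing, $/GB
azs = 2                       # one endpoint ENI per AZ

def monthly_cost_nat(gb):
    return gb * nat_per_gb

def monthly_cost_endpoint(gb):
    return azs * endpoint_hourly * HOURS_PER_MONTH + gb * endpoint_per_gb

for gb in (100, 500, 1000):
    print(f"{gb:>5} GB/month  NAT ${monthly_cost_nat(gb):.2f}"
          f"  endpoint ${monthly_cost_endpoint(gb):.2f}")
```

At these assumed prices the endpoint wins above roughly 420 GB/month; below that, the fixed hourly charge dominates and NAT is cheaper.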

Audit your NAT Gateway traffic by monitoring the BytesOutToDestination CloudWatch metric on your NAT Gateway (total volume processed) and by sampling VPC Flow Logs for a 24-hour period to break that volume down by destination. The top destination addresses reached from private-subnet sources will tell you which services are generating the most NAT traffic. Create VPC Endpoints for the AWS services in that top list.
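The flow-log tally is a few lines of parsing. This sketch uses fabricated records in the default VPC Flow Log format; in practice you would feed it lines exported from CloudWatch Logs or S3:

```python
# Tally bytes by destination from VPC Flow Log records (default format,
# space-separated). The sample records below are fabricated.
from collections import Counter

# Default format fields: version account-id interface-id srcaddr dstaddr
# srcport dstport protocol packets bytes start end action log-status
records = [
    "2 123456789012 eni-0a1 10.0.1.15 52.216.0.10 44321 443 6 1200 840000 0 0 ACCEPT OK",
    "2 123456789012 eni-0a1 10.0.1.15 3.5.20.4 44322 443 6 300 90000 0 0 ACCEPT OK",
    "2 123456789012 eni-0a1 10.0.2.9 52.216.0.10 51515 443 6 900 610000 0 0 ACCEPT OK",
]

bytes_by_dst = Counter()
for line in records:
    fields = line.split()
    bytes_by_dst[fields[4]] += int(fields[9])   # dstaddr, bytes

# Top destinations by volume: candidates for VPC Endpoints if they
# resolve to AWS service ranges (S3, DynamoDB, ECR, ...).
for dst, total in bytes_by_dst.most_common(3):
    print(dst, total)
```

Resolve the top addresses against the published AWS IP ranges to see which services they belong to before creating endpoints.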

Caching: The Structural Cost Reducer

Every cache hit is a request that did not reach the origin. Every request that does not reach the origin does not generate:

  • Origin compute cost (Lambda invocation, EC2 CPU)
  • Database read cost (DynamoDB RCU, RDS query)
  • S3 GET request cost
  • Network transfer from origin to cache to caller

Caching reduces costs at every layer simultaneously, not just at the cached layer. This multiplicative cost reduction is why caching is the highest-return optimization in most architectures — not because each individual hit is valuable, but because a 90% cache hit rate eliminates 90% of origin resource consumption at all levels of the stack.

ElastiCache for database offload. The most common and highest-ROI caching pattern: place ElastiCache (Redis or Memcached) in front of your primary database. Cache the results of expensive queries, the records of frequently accessed entities, and the session data for authenticated users. A database query that runs 10,000 times per hour and takes 10 ms each time can be reduced to 1,000 database queries and 9,000 ElastiCache cache hits — a 90% reduction in database load at a fraction of the database cost.
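The pattern behind that 10,000-to-1,000 reduction is cache-aside. A self-contained sketch follows; in production the cache would be ElastiCache accessed via redis-py (get/setex), but a dict stands in here, and load_user_from_db is a hypothetical loader:

```python
# Cache-aside sketch. A dict stands in for ElastiCache so the example is
# self-contained; "load_user_from_db" is a hypothetical loader.
import json
import time

cache = {}           # {key: (expires_at, serialized_value)}
TTL_SECONDS = 300
db_queries = 0       # counts how often the "database" is actually hit

def load_user_from_db(user_id):
    global db_queries
    db_queries += 1
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    hit = cache.get(key)
    if hit and hit[0] > time.time():
        return json.loads(hit[1])            # cache hit: no DB query
    value = load_user_from_db(user_id)       # cache miss: query once,
    cache[key] = (time.time() + TTL_SECONDS, json.dumps(value))
    return value                             # then serve from cache

for _ in range(10_000):
    get_user(42)
print(db_queries)    # far fewer DB queries than calls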

ElastiCache itself has a cost: instance hours plus data transfer. For workloads where the database is a cost driver, ElastiCache almost always reduces total cost because ElastiCache instance costs are substantially lower than the equivalent RDS capacity required to handle peak load without caching.

Lambda response caching with API Gateway. API Gateway supports response caching at the gateway level, with configurable TTLs per resource and method. For API endpoints that return data that changes infrequently (catalog data, configuration, reference data), gateway-level caching eliminates Lambda invocations for cached responses. The API Gateway cache has a per-hour cost based on cache size, but at moderate-to-high request rates, the Lambda cost savings exceed the cache cost within days.

CloudFront caching for everything edge-deliverable. CloudFront should cache not just static assets but any API response that can be shared across users with the same request parameters. Product listings, category pages, search results for common queries, and pricing data are candidates for edge caching. Each cache hit from CloudFront edge does not reach your origin, does not invoke Lambda, does not consume DynamoDB reads, and does not traverse NAT Gateway.

The cache-control headers on your origin responses determine whether CloudFront caches a response. An origin that returns Cache-Control: no-cache on all responses provides no benefit from being behind CloudFront (beyond DDoS protection and geographic distribution). Auditing your cache-control headers and maximizing cacheable TTLs for non-personalized content is one of the highest-leverage configuration changes for cost reduction.
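An audit of cache-control headers can be scripted. The sketch below classifies headers collected per route (the routes and header values here are fabricated; in practice gather them with curl -sI or an HTTP client), using a simplified cacheability rule rather than full RFC 9111 semantics:

```python
# Flag routes whose Cache-Control headers defeat CloudFront caching.
# Simplified rule: no-cache/no-store/private block caching; otherwise
# require a positive max-age. Real CDN behavior has more nuance.
observed = {
    "/products": "public, max-age=300",
    "/config": "no-cache",
    "/search?q=shoes": "private, max-age=0",
    "/static/app.js": "public, max-age=31536000, immutable",
}

def is_edge_cacheable(cache_control):
    directives = cache_control.lower().split(",")
    names = {d.strip().split("=")[0] for d in directives}
    if {"no-cache", "no-store", "private"} & names:
        return False
    return any(d.strip().startswith("max-age=") and int(d.split("=")[1]) > 0
               for d in directives)

uncacheable = sorted(r for r, cc in observed.items() if not is_edge_cacheable(cc))
print(uncacheable)   # routes worth a cache-control review
```

Routes that show up as uncacheable but serve non-personalized content are the candidates for longer TTLs.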

Batch vs. Real-Time: The Architecture Decision That Drives Costs

Many workloads process data in real time that does not actually require real-time processing. The default architecture for data processing has shifted to streaming (Kinesis, Kafka, Lambda) because it is feasible — but feasibility does not imply cost efficiency.

A real-time Lambda function triggered on every S3 upload, running for 5 seconds per invocation, has a different cost profile from a batch Lambda function that processes 1,000 uploads together every 5 minutes. Both architectures process the same data. The real-time architecture generates 1,000 invocations per processing window; the batch architecture generates one. At the same compute cost per second, the batch architecture is not necessarily cheaper (it processes the same total data volume), but it:

  • Reduces per-invocation overhead (start, end, initialization) by 1,000×
  • Reduces CloudWatch log lines by 1,000×
  • Reduces SQS or EventBridge event costs if those services trigger the Lambda
  • Enables larger-batch processing optimizations (file compaction, bulk database writes)
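The overhead arithmetic from the list above can be sketched numerically; the 200 ms per-invocation overhead and 10 log lines per invocation are assumptions for illustration, not measured figures:

```python
# Batch vs real-time invocation overhead, using the text's 1,000 uploads
# per 5-minute window. Overhead and log figures are assumed.
uploads_per_window = 1_000
overhead_ms_per_invocation = 200     # init + start/stop, assumed
log_lines_per_invocation = 10        # assumed

realtime_invocations = uploads_per_window    # one invocation per upload
batch_invocations = 1                        # one invocation per window

realtime_overhead_ms = realtime_invocations * overhead_ms_per_invocation
batch_overhead_ms = batch_invocations * overhead_ms_per_invocation
log_lines_saved = (realtime_invocations - batch_invocations) * log_lines_per_invocation

print(f"Invocations:  {realtime_invocations} vs {batch_invocations}")
print(f"Overhead cut: {realtime_overhead_ms - batch_overhead_ms} ms per window")
print(f"Log lines cut: {log_lines_saved} per window")
```

The compute time spent on the data itself is unchanged; what batching removes is everything charged per invocation.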

The batch vs. real-time decision framework:

  • What is the maximum acceptable latency between data arrival and processing completion?
  • If the answer is “seconds,” real-time processing is justified.
  • If the answer is “minutes” or “hours,” batch processing is almost always cheaper and often simpler.

For data analytics pipelines, ETL workloads, report generation, and notification systems with non-urgent delivery requirements, batch processing reduces cost substantially without degrading user experience.

Rightsizing: What It Actually Means

Rightsizing is not “use smaller instances.” It is “use instances that are appropriately sized for your actual workload characteristics.”

An oversized instance wastes money. An undersized instance creates performance problems that engineers respond to by scaling out (more instances) rather than scaling up (correctly sized instances), which often costs more than a single correctly sized instance would.

The rightsizing process:

  1. Collect CloudWatch utilization metrics for CPU, memory (requires CloudWatch Agent for EC2), network, and disk IO over a minimum 14-day period covering peak and off-peak patterns.
  2. Identify the utilization at the 95th percentile — not the average, not the maximum. The 95th percentile captures your sustained peak without being distorted by one-time spikes.
  3. Target 60–70% utilization at the 95th percentile. This gives headroom for unexpected spikes without wasting capacity at average load.
  4. Select the instance type that achieves 60–70% utilization at 95th percentile load.
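Steps 2–4 reduce to a percentile computation and a target band. The utilization samples below are fabricated; in practice you would pull them from CloudWatch (GetMetricStatistics supports ExtendedStatistics such as "p95" directly):

```python
# Rightsizing check: 95th-percentile utilization against a 60-70% target.
# Samples are fabricated stand-ins for 14 days of CloudWatch data.
import statistics

samples = [18, 22, 25, 31, 35, 38, 41, 45, 52, 55,
           58, 61, 64, 67, 71, 74, 78, 82, 90, 95]

p95 = statistics.quantiles(samples, n=100)[94]   # 95th percentile

def sizing_verdict(p95_util, target=(60, 70)):
    if p95_util < target[0]:
        return "over-provisioned: consider a smaller instance"
    if p95_util > target[1]:
        return "under-provisioned: consider a larger instance"
    return "right-sized"

print(round(p95, 1), "->", sizing_verdict(p95))
```

Note that the verdict here uses sustained p95, not the average: the same samples average about 50%, which would wrongly suggest over-provisioning.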

AWS Compute Optimizer performs this analysis automatically and provides recommendations with projected cost impact. It uses actual CloudWatch metric data from your running instances, not generic benchmarks. The recommendations are not always correct — they cannot account for application-specific behavior — but they are a useful starting point that surfaces clear over-provisioning.

Memory-optimized vs. compute-optimized vs. general-purpose: The instance family matters as much as the size. A workload that is memory-constrained running on a compute-optimized instance is over-provisioned on CPU and under-provisioned on memory simultaneously. Matching the instance family to the workload bottleneck (memory, CPU, network, storage) is the first step in rightsizing, not the final one.

Savings Plans and Reserved Instances: When to Use Them

Savings Plans and Reserved Instances should be purchased after architectural optimization, not before. Purchasing a commitment for a system that will be significantly changed by cost optimization work locks you into spending on capacity that no longer matches your architecture.

The sequence:

  1. Architect to reduce volume (cross-AZ reduction, caching, batch/real-time trade-offs)
  2. Right-size to reduce over-provisioning
  3. Measure stable baseline compute after steps 1 and 2
  4. Purchase Savings Plans or Reserved Instances to cover that stable baseline at a discount

Compute Savings Plans are the most flexible commitment: they apply to any EC2 instance family, size, region, and OS, as well as Lambda and Fargate. They are the right starting point for most organizations because they provide flexibility if instance types change.

EC2 Instance Savings Plans commit to a specific instance family in a specific region and provide a deeper discount than Compute Savings Plans. Use these when your instance type and region are stable and unlikely to change within the commitment period.

The coverage target: aim for Savings Plans or Reserved Instances to cover 70–80% of your stable baseline compute. Leave 20–30% on on-demand pricing to absorb spikes and workload changes without forfeiting committed spend. An account with 100% committed spend has no flexibility for growth or architectural changes.
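The coverage rule is simple arithmetic; this sketch uses an illustrative $40,000 baseline and the midpoint of the 70–80% band:

```python
# Sizing a commitment against a measured stable baseline. Figures are
# illustrative; the baseline must come from after rightsizing (step 3).
stable_baseline = 40_000      # $/month compute after optimization
coverage_target = 0.75        # midpoint of the 70-80% band

committed = stable_baseline * coverage_target
on_demand_buffer = stable_baseline - committed

print(f"Commit ~${committed:,.0f}/month; keep ${on_demand_buffer:,.0f} on-demand")
```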

AWS Trusted Advisor as a Starting Point

AWS Trusted Advisor (Business and Enterprise Support tiers) provides automated checks across cost, performance, security, and fault tolerance. The cost checks that provide the most value:

  • Idle EC2 instances: instances with less than 10% average daily CPU and minimal network activity over 14 days
  • Underutilized EBS volumes: volumes with less than 1 IOPS average over 7 days
  • Idle RDS DB instances: RDS instances with no connections over the past 7 days
  • Savings Plans and Reserved Instance coverage: what fraction of your usage is covered by commitments

Trusted Advisor is not a comprehensive FinOps solution — it surfaces obvious inefficiencies, not architectural patterns. But it provides a regular automated scan that catches the clearest zombie resources and over-provisioning without requiring manual audit work.

For accounts without Business Support, AWS Cost Explorer Rightsizing Recommendations (available to all accounts) and AWS Compute Optimizer (free) provide similar functionality for EC2 and ECS without the Trusted Advisor subscription requirement.

The Principle: Design for Cost From Day One

The themes across all eight posts in this series converge on a single principle: cost is an emergent property of your architecture, not a billing artifact you optimize after the fact.

Every architectural decision has a cost dimension:

  • Synchronous vs. asynchronous communication → latency vs. cost trade-off
  • Microservices vs. modular monolith → operational flexibility vs. inter-service data transfer cost
  • Multi-AZ distribution → resilience vs. cross-AZ transfer cost
  • Real-time vs. batch processing → latency vs. invocation overhead
  • High-cardinality metrics vs. structured logs → observability granularity vs. CloudWatch cost

None of these trade-offs has a universally correct answer. The right answer depends on your workload, your scale, your latency requirements, and your cost targets. What matters is that the trade-off is explicit — made with awareness of the cost dimension — rather than implicit, where the cost dimension is discovered only when the bill arrives.

The organizations that manage AWS costs effectively are not the ones with the best Savings Plan coverage. They are the ones where cost awareness is embedded in the engineering culture: in architecture reviews, in PR checklists, in sprint retrospectives, and in the operational dashboards that engineers look at every day.

Cost control is not a FinOps function. It is an engineering function, informed by FinOps data. The distinction matters because the people who can change costs are engineers. Finance can report on costs. Engineers can design them down.


Related reading: 5 AWS Cost Optimization Strategies Most Teams Overlook is a quick tactical companion to this post — right-sizing, lifecycle policies, and anomaly detection in a faster format. AWS ElastiCache: Redis Caching Strategies for Production covers ElastiCache architecture and cache invalidation strategy in operational depth. For Savings Plans and Reserved Instance monitoring workflow, see AWS Cost Explorer and Budgets: A Cloud Cost Management Guide.


The AWS Cost Trap — Full Series

Part 1 — Billing Complexity as a System Problem · Part 2 — Data Transfer Costs · Part 3 — Autoscaling + AI Workloads · Part 4 — Observability & Logging Costs · Part 5 — S3 Storage Cost Traps · Part 6 — The FinOps Gap · Part 7 — Real Failure Patterns · Part 8 — Optimization Playbook


This concludes The AWS Cost Trap series. We covered billing complexity as a system property (Part 1), data transfer patterns that break budgets (Part 2), autoscaling feedback loops (Part 3), observability cost anti-patterns (Part 4), S3 usage traps (Part 5), the FinOps organizational gap (Part 6), real failure patterns (Part 7), and the architectural playbook for durable cost reduction (Part 8).

If you are working through these patterns in your own AWS environment and want a structured review, contact the FactualMinds team. As an AWS Select Tier Consulting Partner specializing in cloud cost optimization and architecture, we run cost audits that identify the specific patterns from this series in your account — with prioritized recommendations ranked by cost impact.

Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.

AWS Architecture · Cloud Migration · GenAI on AWS · Cost Optimization · DevOps


Recommended Reading

Designing AWS Architectures with Predictable, Stable Costs


The most expensive AWS architectures are not the ones that use the most resources — they are the ones whose costs respond unpredictably to inputs. This is the design discipline for building systems where costs are structurally bounded and forecasting is accurate.

Autoscaling Broke Your Budget (AI Made It Worse)


Autoscaling was supposed to make costs predictable by matching capacity to demand. Instead, it introduced feedback loops, burst amplification, and — with AI workloads — a new class of non-deterministic spend that no scaling policy anticipates.

Logging Yourself Into Bankruptcy


Observability is not free, and the industry has collectively underpriced it. CloudWatch log ingestion, metrics explosion, and X-Ray trace volume can together exceed your compute bill — especially once AI workloads introduce high-cardinality telemetry at scale.

AWS Cost Prediction in 2026: The Playbook for Accurate Forecasting


Most AWS cost forecasts miss by 30–50% not because engineers are careless, but because the forecasting model does not match how AWS actually charges. This is the playbook for getting forecasts right: which metrics to measure, which models to use, and where the structural gaps are.