Cost Control Is Architecture, Not Discounts
Quick summary: Savings Plans and Reserved Instances reduce the rate you pay. Architecture determines the volume you pay at. The most durable cost reductions in AWS come from designing systems that structurally generate less spend — not from negotiating a lower price for the same behavior.
Key Takeaways
- The most durable cost reductions in AWS come from designing systems that structurally generate less spend — not from negotiating a lower price for the same behavior

Part 8 of 8: The AWS Cost Trap — Why Your Bill Keeps Surprising You
When AWS bills exceed expectations, the most common first response is to look for discounts: purchase Reserved Instances, sign up for Savings Plans, negotiate an Enterprise Discount Program. These are valid tools. A one-year Compute Savings Plan can reduce EC2 costs by 30–40% compared to on-demand pricing. Reserved Instances for predictable workloads reduce costs further.
But discounts reduce the rate you pay. They do not reduce the volume of resources you consume. A system that generates $100,000 per month in on-demand compute costs will cost $60,000–$70,000 per month with a Savings Plan — still expensive, and still growing linearly with scale. The architectural patterns that generated $100,000 in compute costs are unchanged. As you grow, you will return to $100,000 and beyond, but now with a committed spend that cannot be reduced without forfeiting commitment fees.
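The rate-versus-volume distinction is plain arithmetic. A minimal sketch — the 35% discount and the growth figure are illustrative, not quoted AWS pricing:

```python
def monthly_cost(on_demand_spend: float, discount: float = 0.35) -> float:
    """Cost after a rate discount: the rate drops, the consumed volume does not."""
    return on_demand_spend * (1 - discount)

# A $100k/month on-demand workload under an illustrative 35% discount:
print(monthly_cost(100_000))          # ~ $65,000

# Volume growth erases the rate savings: at ~54% workload growth the
# discounted bill is back above the original on-demand figure.
print(monthly_cost(100_000 * 1.54))   # > $100,000
```

The second call is the trap in one line: nothing about the architecture changed, so the bill climbs back past its starting point, now with a commitment attached.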
Durable cost reduction requires changing the architecture: reducing the volume of resources consumed, not just the price per unit. The playbook for this is a set of structural patterns — each one targeting a specific cost driver identified in the previous posts in this series.
Reduce Cross-AZ Chatter
Inter-Availability Zone data transfer is charged per GB in both directions. In microservices architectures, east-west traffic between services in different AZs generates continuous transfer charges. The fix is architectural: locality-aware routing.
AZ affinity in ECS. When ECS tasks call other ECS services, configure the AWS Cloud Map service discovery or the Application Load Balancer to prefer targets in the same AZ as the calling task. ECS Service Connect supports traffic routing with AZ awareness. The goal is that a task in us-east-1a calls other tasks in us-east-1a rather than tasks in us-east-1b or us-east-1c.
Topology-aware routing in Kubernetes. EKS supports topology-aware routing on Services via the service.kubernetes.io/topology-mode: Auto annotation (Kubernetes 1.27+; versions 1.23–1.26 used the equivalent service.kubernetes.io/topology-aware-hints annotation). When enabled, kube-proxy prefers endpoints in the same zone as the calling pod. This reduces cross-AZ service-mesh traffic without changing application code.
Placement groups for tightly coupled EC2. A cluster placement group ensures that EC2 instances are placed on hardware within the same availability zone and as close together as possible. For workloads with high-bandwidth, low-latency requirements between a fixed set of instances — HPC, distributed databases, large in-memory caches — placement groups both reduce cross-AZ transfer charges and improve performance.
The tradeoff: AZ affinity reduces resilience. A system that strictly routes traffic within AZ will experience degraded capacity if that AZ has an outage. The correct design is a soft preference — prefer same-AZ, but fall back to cross-AZ — rather than strict locality that creates a single-AZ dependency. All the mechanisms above support soft preferences; use them that way.
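The soft-preference logic is simple to express. A minimal sketch of the selection rule — not any specific AWS or Kubernetes API; the endpoint list is assumed to come from your service discovery layer:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    address: str
    az: str
    healthy: bool = True

def pick_targets(endpoints: list[Endpoint], caller_az: str) -> list[Endpoint]:
    """Soft AZ affinity: prefer healthy same-AZ targets, but fall back to
    healthy cross-AZ targets so one AZ outage does not black-hole traffic."""
    healthy = [e for e in endpoints if e.healthy]
    local = [e for e in healthy if e.az == caller_az]
    return local if local else healthy
```

The key property is the fallback branch: when the local list is empty, the caller degrades to cross-AZ routing instead of failing, which is exactly the difference between a soft preference and a single-AZ dependency.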
Eliminate Unnecessary NAT Gateway Traffic
As discussed in Part 2, NAT Gateway charges per GB processed. Any private subnet resource calling an AWS service through NAT Gateway is generating avoidable costs. VPC Endpoints route that traffic directly through the AWS network without NAT processing charges.
Gateway Endpoints (free):
- Amazon S3
- Amazon DynamoDB
Interface Endpoints (hourly + per-GB charge, but lower than NAT):
- AWS Secrets Manager
- AWS Systems Manager (SSM)
- Amazon ECR (container image pulls)
- Amazon CloudWatch Logs
- AWS STS
- Amazon SQS
- Amazon SNS
The decision rule: for any AWS service your private-subnet resources call more than a few times per day, evaluate whether a VPC Endpoint reduces costs compared to NAT Gateway processing charges. For S3 and DynamoDB specifically, Gateway Endpoints are always the right choice — they are free, they improve performance, and they eliminate NAT Gateway processing on what are typically the highest-volume inter-service calls.
Audit your NAT Gateway traffic by enabling CloudWatch metrics on your NAT Gateway (BytesOutToDestination grouped by destination) or by sampling VPC Flow Logs for a 24-hour period. The top destination addresses from private subnet sources will tell you which services are generating the most NAT traffic. Create VPC Endpoints for the AWS services in that top list.
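The aggregation step of that audit is a few lines once the flow-log records are parsed. A sketch assuming the records have already been reduced to (destination, bytes) pairs — extraction from the raw VPC Flow Logs format is omitted:

```python
from collections import Counter

def top_nat_destinations(records: list[tuple[str, int]],
                         n: int = 5) -> list[tuple[str, int]]:
    """Sum bytes per destination and return the top-n talkers. Mapping the
    top addresses to AWS service endpoints (S3, ECR, CloudWatch Logs, ...)
    tells you which VPC Endpoints to create first."""
    totals: Counter = Counter()
    for dst, nbytes in records:
        totals[dst] += nbytes
    return totals.most_common(n)
```

Run it over a 24-hour sample and the highest-volume AWS service destinations surface immediately; those are your VPC Endpoint candidates.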
Caching: The Structural Cost Reducer
Every cache hit is a request that did not reach the origin. Every request that does not reach the origin does not generate:
- Origin compute cost (Lambda invocation, EC2 CPU)
- Database read cost (DynamoDB RCU, RDS query)
- S3 GET request cost
- Network transfer from origin to cache to caller
Caching reduces costs at every layer simultaneously, not just at the cached layer. This multiplicative cost reduction is why caching is the highest-return optimization in most architectures — not because each individual hit is valuable, but because a 90% cache hit rate eliminates 90% of origin resource consumption at all levels of the stack.
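The multiplicative effect is easy to quantify. A sketch with an illustrative combined per-request origin cost — the dollar figure is a placeholder, not an AWS price:

```python
def origin_cost(requests: int, hit_rate: float,
                per_request_origin_cost: float) -> float:
    """Only cache misses reach the origin, so every origin-side cost
    (compute, database reads, transfer) scales with the miss rate."""
    misses = requests * (1 - hit_rate)
    return misses * per_request_origin_cost

# 10M requests at $0.00002 of combined origin cost per request:
print(origin_cost(10_000_000, 0.0, 0.00002))  # ≈ $200, no cache
print(origin_cost(10_000_000, 0.9, 0.00002))  # ≈ $20, 90% hit rate
```

Because the miss rate multiplies every layer behind the cache at once, raising the hit rate from 80% to 90% halves origin consumption — a reason to spend tuning effort on the hit rate itself, not only on adding the cache.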
ElastiCache for database offload. The most common and highest-ROI caching pattern: place ElastiCache (Redis or Memcached) in front of your primary database. Cache the results of expensive queries, the records of frequently accessed entities, and the session data for authenticated users. A database query that runs 10,000 times per hour and takes 10 ms each time can be reduced to 1,000 database queries and 9,000 ElastiCache cache hits — a 90% reduction in database load at a fraction of the database cost.
ElastiCache itself has a cost: instance hours plus data transfer. For workloads where the database is a cost driver, ElastiCache almost always reduces total cost because ElastiCache instance costs are substantially lower than the equivalent RDS capacity required to handle peak load without caching.
Lambda response caching with API Gateway. API Gateway supports response caching at the gateway level, with configurable TTLs per resource and method. For API endpoints that return data that changes infrequently (catalog data, configuration, reference data), gateway-level caching eliminates Lambda invocations for cached responses. The API Gateway cache has a per-hour cost based on cache size, but at moderate-to-high request rates, the Lambda cost savings exceed the cache cost within days.
CloudFront caching for everything edge-deliverable. CloudFront should cache not just static assets but any API response that can be shared across users with the same request parameters. Product listings, category pages, search results for common queries, and pricing data are candidates for edge caching. Each cache hit from CloudFront edge does not reach your origin, does not invoke Lambda, does not consume DynamoDB reads, and does not traverse NAT Gateway.
The cache-control headers on your origin responses determine whether CloudFront caches a response. An origin that returns Cache-Control: no-cache on all responses provides no benefit from being behind CloudFront (beyond DDoS protection and geographic distribution). Auditing your cache-control headers and maximizing cacheable TTLs for non-personalized content is one of the highest-leverage configuration changes for cost reduction.
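That audit can start with a simple classifier over sampled origin response headers. A sketch — the parsing is deliberately simplified; the real Cache-Control grammar has more directives than checked here:

```python
def is_edge_cacheable(cache_control: str) -> bool:
    """Rough check: a response is edge-cacheable if it does not forbid
    shared caching and carries a positive max-age or s-maxage TTL."""
    directives = {d.strip().lower() for d in cache_control.split(",")}
    if {"no-cache", "no-store", "private"} & directives:
        return False
    for d in directives:
        for key in ("s-maxage=", "max-age="):
            if d.startswith(key):
                try:
                    return int(d[len(key):]) > 0
                except ValueError:
                    return False
    return False  # no TTL declared: CloudFront falls back to defaults
```

Feeding a day's sampled responses through a check like this gives you the fraction of traffic that is currently edge-cacheable — the before/after number for the header-audit work.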
Batch vs. Real-Time: The Architecture Decision That Drives Costs
Many workloads process data in real-time that does not actually require real-time processing. The default architecture for data processing has shifted to streaming (Kinesis, Kafka, Lambda) because it is feasible — but feasibility does not imply cost efficiency.
Consider a real-time Lambda function triggered on every S3 upload, running for 5 seconds per invocation, versus a batch Lambda function that processes 1,000 uploads together every 5 minutes. Both architectures process the same data, but the real-time architecture generates 1,000 invocations per processing window while the batch architecture generates one. At the same compute cost per second, the batch architecture is not necessarily cheaper (it processes the same total data volume), but it:
- Reduces per-invocation overhead (start, end, initialization) by 1,000×
- Reduces CloudWatch log lines by 1,000×
- Reduces SQS or EventBridge event costs if those services trigger the Lambda
- Enables larger-batch processing optimizations (file compaction, bulk database writes)
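The overhead arithmetic behind the list above can be sketched directly. The per-invocation overhead seconds are an assumed placeholder; the GB-second and per-request rates are illustrative defaults, not quoted Lambda pricing:

```python
import math

def processing_cost(events: int, batch_size: int, seconds_per_event: float,
                    overhead_seconds: float, memory_gb: float,
                    per_gb_second: float = 0.0000166667,
                    per_request: float = 0.0000002) -> float:
    """Total Lambda cost for `events` items processed `batch_size` at a time.
    Each invocation pays a fixed overhead (init, setup, teardown) on top of
    the per-event work, so fewer invocations means less overhead billed."""
    invocations = math.ceil(events / batch_size)
    billed_seconds = invocations * overhead_seconds + events * seconds_per_event
    return billed_seconds * memory_gb * per_gb_second + invocations * per_request

# 1,000 events, 5s of work each, 0.5s of overhead per invocation, 0.5 GB:
realtime = processing_cost(1000, 1, 5.0, 0.5, 0.5)      # 1,000 invocations
batched  = processing_cost(1000, 1000, 5.0, 0.5, 0.5)   # 1 invocation
# (A single 5,000-second invocation exceeds Lambda's 15-minute cap; in
# practice the batch would be split into chunks, but the overhead
# arithmetic is the point here.)
```

The per-event compute is identical in both calls; the entire delta is the 999 extra invocations' worth of overhead and request charges — exactly the terms the bullet list identifies.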
The batch vs. real-time decision framework:
- What is the maximum acceptable latency between data arrival and processing completion?
- If the answer is “seconds,” real-time processing is justified.
- If the answer is “minutes” or “hours,” batch processing is almost always cheaper and often simpler.
For data analytics pipelines, ETL workloads, report generation, and notification systems with non-urgent delivery requirements, batch processing reduces cost substantially without degrading user experience.
Rightsizing: What It Actually Means
Rightsizing is not “use smaller instances.” It is “use instances that are appropriately sized for your actual workload characteristics.”
An oversized instance wastes money. An undersized instance creates performance problems that engineers respond to by scaling out (more instances) rather than scaling up (correctly sized instances), which often costs more than a single correctly sized instance would.
The rightsizing process:
- Collect CloudWatch utilization metrics for CPU, memory (requires CloudWatch Agent for EC2), network, and disk IO over a minimum 14-day period covering peak and off-peak patterns.
- Identify the utilization at the 95th percentile — not the average, not the maximum. The 95th percentile captures your sustained peak without being distorted by one-time spikes.
- Target 60–70% utilization at the 95th percentile. This gives headroom for unexpected spikes without wasting capacity at average load.
- Select the instance type that achieves 60–70% utilization at 95th percentile load.
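The percentile and resize projection from the steps above can be sketched as follows — a nearest-rank p95 and a simple proportional utilization model, which assumes the workload's absolute load is unchanged by the resize:

```python
import math

def p95(samples: list[float]) -> float:
    """95th percentile by nearest rank: one-off spikes are ignored,
    sustained peaks are not."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def utilization_after_resize(p95_util_pct: float,
                             current_vcpus: int, new_vcpus: int) -> float:
    """Projected p95 CPU utilization if the same absolute load runs on a
    different vCPU count; aim for the 60-70% band."""
    return p95_util_pct * current_vcpus / new_vcpus
```

For example, an instance showing 30% p95 CPU on 8 vCPUs projects to 60% on 4 vCPUs — inside the target band, and a candidate for halving.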
AWS Compute Optimizer performs this analysis automatically and provides recommendations with projected cost impact. It uses actual CloudWatch metric data from your running instances, not generic benchmarks. The recommendations are not always correct — they cannot account for application-specific behavior — but they are a useful starting point that surfaces clear over-provisioning.
Memory-optimized vs. compute-optimized vs. general-purpose: The instance family matters as much as the size. A workload that is memory-constrained running on a compute-optimized instance is over-provisioned on CPU and under-provisioned on memory simultaneously. Matching the instance family to the workload bottleneck (memory, CPU, network, storage) is the first step in rightsizing, not the final one.
Savings Plans and Reserved Instances: When to Use Them
Savings Plans and Reserved Instances should be purchased after architectural optimization, not before. Purchasing a commitment for a system that will be significantly changed by cost optimization work locks you into spending on capacity that no longer matches your architecture.
The sequence:
- Architect to reduce volume (cross-AZ reduction, caching, batch/real-time trade-offs)
- Right-size to reduce over-provisioning
- Measure stable baseline compute after steps 1 and 2
- Purchase Savings Plans or Reserved Instances to cover that stable baseline at a discount
Compute Savings Plans are the most flexible commitment: they apply to any EC2 instance family, size, region, and OS, as well as Lambda and Fargate. They are the right starting point for most organizations because they provide flexibility if instance types change.
EC2 Instance Savings Plans commit to a specific instance family in a specific region and provide a deeper discount than Compute Savings Plans. Use these when your instance type and region are stable and unlikely to change within the commitment period.
The coverage target: aim for Savings Plans or Reserved Instances to cover 70–80% of your stable baseline compute. Leave 20–30% on on-demand pricing to absorb spikes and workload changes without forfeiting committed spend. An account with 100% committed spend has no flexibility for growth or architectural changes.
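As a sketch of the coverage rule — simplified, since real Savings Plans commitments are expressed in discounted dollars per hour rather than raw baseline spend:

```python
def commitment_target(stable_baseline_hourly: float,
                      coverage: float = 0.75) -> float:
    """Hourly commitment covering 70-80% of the stable post-optimization
    baseline, leaving the remainder on on-demand to absorb spikes and
    architectural change without forfeiting committed spend."""
    if not 0.70 <= coverage <= 0.80:
        raise ValueError("coverage outside the recommended 70-80% band")
    return stable_baseline_hourly * coverage
```

The guard clause encodes the point of the paragraph: pushing coverage toward 100% trades a marginally better rate for the flexibility you need when the architecture work in steps 1–2 changes your footprint again.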
AWS Trusted Advisor as a Starting Point
AWS Trusted Advisor (Business and Enterprise Support tiers) provides automated checks across cost, performance, security, and fault tolerance. The cost checks that provide the most value:
- Idle EC2 instances: instances with less than 10% average daily CPU and minimal network activity over 14 days
- Underutilized EBS volumes: volumes with less than 1 IOPS average over 7 days
- Idle RDS DB instances: RDS instances with no connections over the past 7 days
- Savings Plans and Reserved Instance coverage: what fraction of your usage is covered by commitments
Trusted Advisor is not a comprehensive FinOps solution — it surfaces obvious inefficiencies, not architectural patterns. But it provides a regular automated scan that catches the clearest zombie resources and over-provisioning without requiring manual audit work.
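The idle-instance heuristic above is easy to replicate against your own CloudWatch exports, with no Business Support subscription. A sketch assuming the metrics are already pulled into plain lists; the 5 MB/day network threshold is an assumed placeholder, not Trusted Advisor's exact figure:

```python
def flag_idle_ec2(avg_daily_cpu_pct: list[float],
                  network_mb_per_day: list[float]) -> bool:
    """Mirror the Trusted Advisor heuristic: under 10% average daily CPU
    and minimal network activity across a full 14-day window."""
    if len(avg_daily_cpu_pct) < 14 or len(network_mb_per_day) < 14:
        return False  # not enough history to judge
    low_cpu = all(c < 10.0 for c in avg_daily_cpu_pct[-14:])
    low_net = all(n < 5.0 for n in network_mb_per_day[-14:])
    return low_cpu and low_net
```

The insufficient-history guard matters: flagging an instance off two quiet days catches weekend lulls, not zombies, which is why the Trusted Advisor check also uses a 14-day window.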
For accounts without Business Support, AWS Cost Explorer Rightsizing Recommendations (available to all accounts) and AWS Compute Optimizer (free) provide similar functionality for EC2 and ECS without the Trusted Advisor subscription requirement.
The Principle: Design for Cost From Day One
The themes across all eight posts in this series converge on a single principle: cost is an emergent property of your architecture, not a billing artifact you optimize after the fact.
Every architectural decision has a cost dimension:
- Synchronous vs. asynchronous communication → latency vs. cost trade-off
- Microservices vs. modular monolith → operational flexibility vs. inter-service data transfer cost
- Multi-AZ distribution → resilience vs. cross-AZ transfer cost
- Real-time vs. batch processing → latency vs. invocation overhead
- High-cardinality metrics vs. structured logs → observability granularity vs. CloudWatch cost
None of these trade-offs has a universally correct answer. The right answer depends on your workload, your scale, your latency requirements, and your cost targets. What matters is that the trade-off is explicit — made with awareness of the cost dimension — rather than implicit, where the cost dimension is discovered only when the bill arrives.
The organizations that manage AWS costs effectively are not the ones with the best Savings Plan coverage. They are the ones where cost awareness is embedded in the engineering culture: in architecture reviews, in PR checklists, in sprint retrospectives, and in the operational dashboards that engineers look at every day.
Cost control is not a FinOps function. It is an engineering function, informed by FinOps data. The distinction matters because the people who can change costs are engineers. Finance can report on costs. Engineers can design them down.
Related reading: 5 AWS Cost Optimization Strategies Most Teams Overlook is a quick tactical companion to this post — right-sizing, lifecycle policies, and anomaly detection in a faster format. AWS ElastiCache: Redis Caching Strategies for Production covers ElastiCache architecture and cache invalidation strategy in operational depth. For Savings Plans and Reserved Instance monitoring workflow, see AWS Cost Explorer and Budgets: A Cloud Cost Management Guide.
The AWS Cost Trap — Full Series
Part 1 — Billing Complexity as a System Problem · Part 2 — Data Transfer Costs · Part 3 — Autoscaling + AI Workloads · Part 4 — Observability & Logging Costs · Part 5 — S3 Storage Cost Traps · Part 6 — The FinOps Gap · Part 7 — Real Failure Patterns · Part 8 — Optimization Playbook
This concludes The AWS Cost Trap series. We covered billing complexity as a system property (Part 1), data transfer patterns that break budgets (Part 2), autoscaling feedback loops (Part 3), observability cost anti-patterns (Part 4), S3 usage traps (Part 5), the FinOps organizational gap (Part 6), real failure patterns (Part 7), and the architectural playbook for durable cost reduction (Part 8).
If you are working through these patterns in your own AWS environment and want a structured review, contact the FactualMinds team. As an AWS Select Tier Consulting Partner specializing in cloud cost optimization and architecture, we run cost audits that identify the specific patterns from this series in your account — with prioritized recommendations ranked by cost impact.
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.


