---
title: What DevOps Guides Don't Tell You About Production AWS
description: Most DevOps guides teach what AWS services are. Production teaches what happens when 200 engineers use them together. Here's the gap.
url: https://www.factualminds.com/blog/devops-exercises-aws-production-reality/
datePublished: 2026-04-11T00:00:00.000Z
dateModified: 2026-04-16T00:00:00.000Z
author: Palaniappan P
category: DevOps & CI/CD
tags: aws, devops, terraform, ecs, kubernetes, production
---

# What DevOps Guides Don't Tell You About Production AWS

> Most DevOps guides teach what AWS services are. Production teaches what happens when 200 engineers use them together. Here's the gap.

Most DevOps guides—whether books, courses, certifications, or online platforms—follow the same pattern. They teach AWS concepts clearly: EC2, VPCs, Terraform, containers, IAM. The explanations are correct.

But they teach _what things are_. They don't teach _what happens when 200 engineers use them together_. They don't teach the failures you only see at scale.

This post maps the key AWS topics from common guides to production reality—the patterns, failure modes, and trade-offs that hiring an AWS consulting partner (or building a strong internal DevOps team) actually addresses.

---

## The Gap: Common Knowledge vs. Production Operations

Most DevOps guides cover AWS concepts well, often in a question-and-answer format. What happens when you use Spot Instances at scale? How do you version Terraform modules? What's the difference between ECS and EKS?

These are _correct_ questions. And the answers are _technically accurate_.

But they're not _complete_.

Here's the difference:

| Topic               | Common Knowledge                                    | Production Reality                                                                                                               |
| ------------------- | --------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| **Spot Instances**  | "Use Auto Scaling Groups to manage EC2 capacity"    | Spot fleet diversity, interruption handling, fallback to On-Demand, capacity-optimized allocation strategy                       |
| **Terraform State** | "Use remote backend instead of local state"         | State locking, S3 MFA delete, concurrent apply failures, workspaces per environment, how to recover from a corrupt state file    |
| **ECS vs EKS**      | "EKS is more powerful, ECS is simpler"              | ECS has no control-plane fee and suits teams <15 people, EKS if you need the Kubernetes ecosystem, Karpenter vs Cluster Autoscaler tradeoffs |
| **VPC Security**    | "Security groups are stateful, NACLs are stateless" | Security group rule limits at scale, transit routing complexity, why /16 blocks cause problems, NACLs rarely needed              |
| **CloudWatch**      | "Send logs and metrics, set alarms"                 | EMF vs custom metrics, X-Ray sampling overhead, alert fatigue in multi-account setups, log retention costs                       |

The common answers are _right_. Production answers are _right AND operational_.

---

## 1. AWS Compute: From Theory to Fleet Management

Common DevOps guides cover compute well. You'll see questions like:

- What's the difference between AMI, EBS, and instance store?
- When do you use Spot Instances?
- What's Lambda's cold start?

**What most guides miss:**

### EC2 Fleet Diversity

You can't just "use Auto Scaling Groups." You need **capacity-optimized allocation** across instance types.

Why? Spot Instances are interrupted. In production, you don't pick one instance type—you pick 8-12 compatible types (same vCPU/memory ratio) and let AWS spread the load.

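```hcl
# Wrong (common approach): one instance type means one Spot pool
desired_capacity = 10
instance_type    = "t3.medium"

# Right (production): goes inside an aws_autoscaling_group resource
mixed_instances_policy {
  instances_distribution {
    on_demand_percentage_above_base_capacity = 20
    spot_allocation_strategy                 = "capacity-optimized"
  }
  launch_template {
    launch_template_specification {
      # aws_launch_template.app is a placeholder for your own launch template
      launch_template_id = aws_launch_template.app.id
      version            = "$Latest"
    }
    # All 2 vCPU / 4 GiB, so a workload sized for one fits the others
    override { instance_type = "t3.medium" }
    override { instance_type = "t3a.medium" }
    override { instance_type = "c5.large" }
    # ... 5-7 more compatible types
  }
}
```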

The gotcha: **You'll still get interrupted.** Diversity doesn't buy perfection; it buys _survivability_. With 8 or more types, a reclaim in one Spot pool takes out a slice of the fleet rather than all of it, and the Auto Scaling group can replace that capacity from the remaining pools.

### Lambda Cold Starts: AWS Lambda Power Tuning

Most guides ask: "Does Lambda have cold starts?" Yes. When? Whenever no warm execution environment is available.

Production asks: "What's the financial trade-off of cold starts?"

[AWS Lambda Power Tuning](https://github.com/alexcasalboni/aws-lambda-power-tuning) is free and open source (you pay only for the test invocations). It runs your function at 128MB, 256MB, 512MB, and so on up to 10GB, then shows you cost vs. latency curves. Most teams pick memory by guesswork and end up paying for CPU they don't use. This tool finds the sweet spot.

```
128MB: 25 invocations/sec, $12/month
512MB: 75 invocations/sec, $18/month
1024MB: 100 invocations/sec, $25/month
```

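In this illustrative output, pick 512MB: it delivers three times the throughput of 128MB for 1.5 times the cost, which is the lowest cost per unit of throughput of the three. Because Lambda allocates CPU in proportion to memory, the extra memory also shortens cold-start initialization.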
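The comparison can be made explicit by dividing monthly cost by throughput. The sketch below uses only the illustrative numbers from the sample output above; `cost_per_throughput` and `best_memory` are hypothetical helper names, not part of the Power Tuning tool.

```python
# Illustrative results copied from the sample output above:
# memory (MB) -> (invocations per second, dollars per month)
RESULTS = {
    128: (25, 12.0),
    512: (75, 18.0),
    1024: (100, 25.0),
}


def cost_per_throughput(results):
    """Return {memory_mb: dollars per month per invocation/sec}."""
    return {mem: cost / rate for mem, (rate, cost) in results.items()}


def best_memory(results):
    """Pick the memory size with the lowest cost per unit of throughput."""
    ratios = cost_per_throughput(results)
    return min(ratios, key=ratios.get)


if __name__ == "__main__":
    for mem, ratio in sorted(cost_per_throughput(RESULTS).items()):
        print(f"{mem}MB: ${ratio:.2f} per invocation/sec")
    print("best:", best_memory(RESULTS))
```

With real Power Tuning output you would substitute your own measurements; the point is to compare cost per unit of work, not raw monthly cost.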

### ECS Task Placement

Most guides say: "ECS distributes tasks across container instances."

Production: "ECS placement constraints can cause deployment failures if misconfigured."

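If you constrain tasks to specific EC2 instance types but those instances don't have capacity, the deployment fails quietly: the API call succeeds, the service sits at 0 running tasks, and the only signal is an "unable to place a task" message in the service events. You need: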

1. **Spread tasks across availability zones** (not optional for HA)
2. **Monitor task placement failures** in CloudWatch
3. **Test failover scenarios** quarterly
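Monitoring placement failures can start as a simple filter over service events. This is a minimal sketch, assuming events shaped like the `events` list returned by the ECS `DescribeServices` API (dicts with a `message` field). The sample events and the `placement_failures` helper are hypothetical, and fetching real events is left out.

```python
# Substring that ECS uses in service events when it cannot place a task.
PLACEMENT_FAILURE_MARKER = "unable to place a task"

# Hypothetical sample, shaped like the "events" list of an ECS service.
SAMPLE_EVENTS = [
    {"message": "service web has reached a steady state."},
    {
        "message": "service web was unable to place a task because no "
        "container instance met all of its requirements."
    },
]


def placement_failures(events):
    """Return the messages of events that report a task placement failure."""
    return [
        event["message"]
        for event in events
        if PLACEMENT_FAILURE_MARKER in event.get("message", "")
    ]


if __name__ == "__main__":
    for message in placement_failures(SAMPLE_EVENTS):
        print("ALERT:", message)
```

In a real setup the same check would more likely live in an EventBridge rule or a metric filter than in a polling script.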

---

## 2. Networking & VPC: Where Assumptions Break

Most DevOps guides cover VPC security well. But they don't cover _scale_.

### VPC IPAM and /16 Block Fragmentation

The common answer: "Subnets are /24 blocks within a VPC."

Production reality: When you have 50 VPCs across 3 regions, you don't manually assign subnets. You use **VPC IPAM** (IP Address Manager). It prevents collisions and allocates from your /16 blocks without fragmenting them.

Without IPAM, you'll eventually:

- Double-allocate an IP range
- Run out of contiguous space for a new VPC
- Waste 10+ /24 blocks to gaps

IPAM adds 5 minutes to setup. Untangling a fragmented /16 block takes a week.
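To see what IPAM automates, here is a minimal sketch of the two checks it performs for you: detecting overlapping allocations and finding the next free block in a pool. It uses only Python's standard `ipaddress` module; the CIDR ranges and function names are illustrative assumptions, and this is not how IPAM itself is implemented.

```python
import ipaddress


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks that overlap (double allocation)."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    overlaps = []
    for i, first in enumerate(nets):
        for second in nets[i + 1:]:
            if first.overlaps(second):
                overlaps.append((str(first), str(second)))
    return overlaps


def next_free_block(pool, allocated, new_prefix):
    """Return the first block of size /new_prefix in pool that is unused."""
    taken = [ipaddress.ip_network(c) for c in allocated]
    for candidate in ipaddress.ip_network(pool).subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(t) for t in taken):
            return str(candidate)
    return None  # pool exhausted at this block size


if __name__ == "__main__":
    vpcs = ["10.0.0.0/16", "10.0.128.0/20", "10.1.0.0/16"]
    print(find_overlaps(vpcs))
    print(next_free_block("10.0.0.0/8", ["10.0.0.0/16", "10.1.0.0/16"], 16))
```

Doing this by hand in a spreadsheet works for a few VPCs; it stops working once several teams allocate ranges concurrently.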

### Transit Gateway vs. VPC Peering

You'll often see this question: "How do you connect VPCs?"

**VPC Peering** (one-to-one, point-to-point):

- Low latency
- A full mesh needs n(n-1)/2 connections, so it gets unwieldy past 5-10 VPCs
- No transitive routing

**Transit Gateway** (hub-and-spoke):

- Scales to 100+ VPCs
- Transitive routing works
- Slightly higher latency (negligible)
- Costs money (about $0.05 per attachment per hour in us-east-1, plus a per-GB data processing charge)

Most teams don't need Transit Gateway until they have 8+ VPCs. Until then, accept peering debt.
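The reason peering stops scaling is arithmetic: connecting every VPC to every other VPC takes n(n-1)/2 peering connections, while a hub needs one attachment per VPC. A small sketch (function names are illustrative):

```python
def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC can reach every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count):
    """Hub-and-spoke needs one attachment per VPC."""
    return vpc_count


if __name__ == "__main__":
    for n in (3, 5, 8, 10, 50):
        print(
            f"{n} VPCs: {full_mesh_peerings(n)} peerings "
            f"vs {transit_gateway_attachments(n)} attachments"
        )
```

At 8 VPCs the mesh already needs 28 connections, each with its own route table entries, which is where the "peering debt" starts to cost more than the Transit Gateway bill.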

### Security Groups at Scale

Common training teaches: "Security groups are stateful firewalls."

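Production reality: Security groups have rule quotas. The defaults are **60 inbound and 60 outbound rules per group** and 5 groups per network interface. Both are adjustable, but rules per group multiplied by groups per interface cannot exceed 1,000. At scale (100+ microservices), you'll hit this.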

Your options:

1. **Create service-specific security groups** (one per app) — manageable
2. **Use NACLs for coarse filtering** (rarely done, adds complexity)
3. **Refactor your network** (consolidate services, use service mesh)

Most teams pick option 1. It works, but it requires good naming and tooling.

---

## 3. Infrastructure as Code: Terraform State in the Real World

Most Terraform guides cover the basics: remote state, modules, workspaces.

They don't teach what breaks.

### State Locking and Concurrent Applies

You run a Terraform apply. Your colleague runs one at the same time. What happens?

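Without state locking: Both applies run, the last state write wins, and the state file no longer matches what was actually created.

With state locking (DynamoDB): The second apply fails immediately with a lock error, or waits for the first to finish if you pass `-lock-timeout`.

Setup is a few lines, plus a DynamoDB table whose partition key is a string named `LockID`: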

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}
```

Gotcha: If an apply crashes, the lock persists. You'll see `Error acquiring the state lock`. Once you have confirmed that no other apply is still running, recover with:

```bash
terraform force-unlock <LOCK_ID>
```

Document this. Your team _will_ need it.

### Module Versioning: The Source of Silent Bugs

Standard practice teaches: "Modules make Terraform DRY."

Production teaches: "Unversioned modules cause silent breaking changes."

```hcl
# Bad (floats to latest)
source = "git::https://github.com/my-org/terraform-modules.git//vpc"

# Good (pinned to tag)
source = "git::https://github.com/my-org/terraform-modules.git//vpc?ref=v2.1.0"
```

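If you don't pin, a colleague can merge a breaking change to the module's default branch, and the next fresh `terraform init` (a new CI runner, a new checkout, or `terraform init -upgrade`) silently pulls it in. If the change altered outputs or behavior, your apply might fail in unexpected ways.

Use semantic versioning for module tags. Git sources accept only an exact `ref` (a tag, branch, or commit), not a range like `v2.1.*`. To follow patch releases automatically, either publish modules to a registry and use a `version = "~> 2.1.0"` constraint, or maintain a moving `v2.1` tag or branch that always points at the latest patch release:

```hcl
# v2.1 must exist as a real tag or branch that you move forward on each patch release
source = "git::https://github.com/my-org/terraform-modules.git//vpc?ref=v2.1"
```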
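Pinning is easy to enforce mechanically. The following is a rough sketch of a CI check that flags Git module sources without a `ref`; the regular expression is a simplification that assumes one `source = "git::..."` assignment per line, and `unpinned_git_sources` is a hypothetical helper, not a Terraform feature.

```python
import re

# Matches lines such as:  source = "git::https://example.com/repo.git//vpc"
SOURCE_RE = re.compile(r'^\s*source\s*=\s*"(git::[^"]+)"', re.MULTILINE)

SAMPLE = '''
module "vpc_floating" {
  source = "git::https://github.com/my-org/terraform-modules.git//vpc"
}

module "vpc_pinned" {
  source = "git::https://github.com/my-org/terraform-modules.git//vpc?ref=v2.1.0"
}
'''


def unpinned_git_sources(hcl_text):
    """Return Git module sources that do not carry a ref query parameter."""
    return [
        source
        for source in SOURCE_RE.findall(hcl_text)
        if "?ref=" not in source and "&ref=" not in source
    ]


if __name__ == "__main__":
    for source in unpinned_git_sources(SAMPLE):
        print("unpinned module source:", source)
```

Run against every `.tf` file in a pull request, a check like this fails the build before a floating module reaches an apply.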

---

## 4. Containers: ECS vs. EKS, and the Operational Tax

A question that comes up in every DevOps discussion: "What's the difference between ECS and EKS?"

Both run containers. EKS is Kubernetes. ECS is AWS-native. The common answer is: "EKS is more powerful."

Here's the production decision tree:

| Factor             | ECS                             | EKS                                   |
| ------------------ | ------------------------------- | ------------------------------------- |
| **Cost**           | No control-plane charge + compute | ~$0.10/hour per cluster (control plane) + compute |
| **Overhead**       | Low (AWS manages control plane) | High (you handle version upgrades, add-ons, node patching, monitoring) |
| **Learning curve** | 1-2 weeks for AWS teams         | 2-3 months for Kubernetes learners    |
| **Ecosystem**      | AWS-specific                    | Multi-cloud, large community          |
| **Jobs market**    | Less portable                   | Very portable                         |

**Default choice: ECS.** It costs less, requires less operational knowledge, and AWS handles patching.

**Switch to EKS when:**

- You have multiple cloud providers (AWS + GCP)
- You already know Kubernetes
- You need Helm charts or Operators
- Your team is >15 people (Kubernetes is worth the investment)

### Karpenter vs. Cluster Autoscaler

If you pick EKS, you need to autoscale. Two options:

**Cluster Autoscaler** (older):

- Adjusts EC2 capacity based on pending pods
- Works well for stable workloads
- Slow scaling (1-2 minutes)

**Karpenter** (newer, purpose-built):

- Fast node scaling (tens of seconds, not minutes)
- Automatic Spot Instance diversification
- Bin-packing optimization (less waste)
- AWS-native (not multi-cloud)

In practice Karpenter scales noticeably faster and often cuts compute spend through tighter bin-packing, though the savings depend on your workload mix. Use it if you run EKS.

### Fargate Cold Starts

Most guides say: "Use Fargate to avoid managing EC2."

True. But Fargate has cold starts. New task launches take 30-60 seconds.

In production, this matters:

- Autoscaling lags the load: by the time new Fargate tasks pass health checks, the spike has already paged you at 3 AM
- Batch jobs that need tight SLAs suffer
- One-off tasks (migrations, backups) are slow

If latency is critical, use EC2. If you can tolerate 30-60s, Fargate saves headcount.

---

## 5. Observability: Beyond CloudWatch Basics

Most DevOps guides barely touch observability. "Send metrics and logs, set alarms."

That's 10% of what you need.

### CloudWatch EMF vs. Custom Metrics

The standard advice: "Use CloudWatch Metrics for visibility."

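Production has high-cardinality data (millions of unique dimension combinations, such as per-user or per-request values). CloudWatch custom metrics cost $0.30 per metric per month at the first pricing tier, and every unique combination of dimensions counts as a separate metric. At scale, that's expensive.

**EMF (Embedded Metric Format)** lets you send structured logs _and_ metrics in one call:

```json
{
  "_aws": {
    "Timestamp": 1574109732004,
    "CloudWatchMetrics": [
      {
        "Namespace": "MyApp",
        "Dimensions": [["Service"]],
        "Metrics": [{ "Name": "Latency", "Unit": "Milliseconds" }]
      }
    ]
  },
  "Service": "checkout",
  "Latency": 142,
  "userId": "12345",
  "requestId": "abc-123"
}
```

One log line carries both the metric and its context. CloudWatch extracts `Latency` as a metric, while `userId` and `requestId` stay in the log as searchable fields instead of becoming metric dimensions. Extracted metrics are still billed as custom metrics, so the savings come from skipping `PutMetricData` calls and keeping high-cardinality values out of dimensions.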
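Producing such a record takes only a few lines. This is a minimal sketch that follows the `_aws.CloudWatchMetrics` layout of the Embedded Metric Format specification; the `emf_record` helper and its arguments are hypothetical, and in practice you would more likely use one of the AWS-provided EMF client libraries. It assumes the output reaches CloudWatch Logs, for example via stdout in Lambda.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions,
               properties=None, timestamp_ms=None):
    """Build one EMF log line: a metric plus free-form context fields.

    dimensions: dict of low-cardinality keys that become metric dimensions.
    properties: dict of high-cardinality fields that stay log-only.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        metric_name: value,
    }
    record.update(dimensions)        # dimension values live at the top level
    record.update(properties or {})  # searchable in logs, not billed as metrics
    return json.dumps(record)


if __name__ == "__main__":
    print(emf_record(
        "MyApp", "Latency", 142, "Milliseconds",
        dimensions={"Service": "checkout"},
        properties={"userId": "12345", "requestId": "abc-123"},
    ))
```

The design choice that matters is which fields go into `dimensions`: anything with unbounded values, such as a user ID, belongs in `properties`.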

### X-Ray Sampling: The Cost Killer

X-Ray traces are deep. Detailed. And expensive.

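The default sampling rule records the first request each second plus 5% of any additional requests. That's usually fine. But at 10,000 requests/sec, that is roughly 500 traces/sec, or about 1.3 billion traces in a 30-day month. At $5 per million traces recorded, the bill is around $6,500/month.

If you're in the top 5% of latency-sensitive services, X-Ray is worth it. Otherwise:

- Sample 1 in 1000 requests
- Favor errors over successes (for example 100% of errors, 0.1% of successes); X-Ray sampling rules decide when a request starts, so this needs tail-based sampling in your collector or instrumentation
- Use CloudWatch Logs Insights for aggregation instead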
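The tracing bill is easy to estimate before you change a sampling rate. The sketch below assumes $5 per million traces recorded (the figure used above), a 30-day month, and ignores the free tier and the small per-second reservoir; `monthly_trace_cost` is an illustrative helper.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month
PRICE_PER_MILLION_TRACES = 5.00        # USD, assumed from the text above


def monthly_trace_cost(requests_per_sec, sample_rate):
    """Estimate monthly cost of recorded traces at a fixed sampling rate."""
    traces = requests_per_sec * sample_rate * SECONDS_PER_MONTH
    return traces / 1_000_000 * PRICE_PER_MILLION_TRACES


if __name__ == "__main__":
    for rate in (0.05, 0.01, 0.001):
        cost = monthly_trace_cost(10_000, rate)
        print(f"sample rate {rate:.1%}: about ${cost:,.0f}/month")
```

Dropping from 5% to 0.1% sampling cuts the estimate fifty-fold, which is why sampling strategy is a cost decision as much as an observability one.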

### Alert Fatigue in Multi-Account Setups

Standard training teaches: "Set CloudWatch alarms for critical metrics."

Production reality: One CloudWatch alarm per account × 10 accounts × 50 apps = 500 alarms. If alert fatigue is high, oncall starts ignoring pages.

Solution: **Aggregate alarms at organization level.**

Use **Amazon CloudWatch Synthetics** (active monitoring) + **SNS topics** (aggregation) to answer a single "is the app down?" question. Oncall gets one page, not 20.

---

## 6. IAM: Least Privilege at Scale

Standard DevOps training covers: "IAM users, roles, policies. Use least privilege."

It doesn't cover what least privilege looks like with 200 engineers.

### Service Control Policies (SCPs)

SCPs are **organization-wide IAM boundaries.** They say "no EC2 in eu-west-1" across all 50 accounts.

If you don't use SCPs by your 5th AWS account, you'll eventually:

- Have unencrypted S3 buckets in a regulated account
- Run EC2 in the wrong region
- Leak secrets from a dev account

SCPs are guardrails, not permissions. Combined with role-based access (who can use the role), they prevent mistakes.
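A region guardrail like the one described above can be expressed as a short policy document. The sketch below builds it in Python so it can be generated per service; the `aws:RequestedRegion` condition key is a real global condition key, while `region_deny_scp` and the `Sid` value are illustrative assumptions. Attaching the policy through AWS Organizations is not shown.

```python
import json


def region_deny_scp(service_prefix, regions):
    """Build an SCP document that denies one service in the given regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyServiceInRegions",
                "Effect": "Deny",
                "Action": f"{service_prefix}:*",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"aws:RequestedRegion": list(regions)}
                },
            }
        ],
    }


if __name__ == "__main__":
    # "No EC2 in eu-west-1", as in the example above.
    print(json.dumps(region_deny_scp("ec2", ["eu-west-1"]), indent=2))
```

Because an SCP only ever removes permissions, a statement like this cannot grant access by accident; it can only narrow what account-level IAM policies allow.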

### Resource-Based Policies

Standard training teaches: "Attach policies to users."

Production: "Attach policies to roles, because users are temporary. But also control _resources_ with resource-based policies."

Example: S3 bucket should only accept encrypted uploads. That's a bucket policy, not a user policy.

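```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
```

A user can call `PutObject`, but S3 rejects the request unless it asks for SSE-S3 (`AES256`) encryption. Note that this also rejects SSE-KMS uploads, so adjust the condition if you use KMS keys. The mistake is prevented at the resource level.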

---

## 7. The Real Gap: Multi-Account Governance and When to Hire Help

Here's what standard DevOps training can't teach—because it's not a technical question:

### Multi-Account Strategy

Single AWS account works for 1-5 services. Beyond that, you need 3-8 accounts:

- **Prod** (production workloads)
- **Staging** (pre-production, prod-like)
- **Dev** (sandboxes, experiments)
- **Security** (centralized logging, IAM)
- **Network** (hub VPCs, DNS, transitive routing)

Managing 8 accounts requires:

- AWS Organizations setup
- SCPs and permission boundaries
- Cross-account role assumptions
- Centralized logging and auditing
- Backup and disaster recovery strategies

This is 2-3 weeks of engineering work.

### Blast Radius Isolation

In one account, one developer can accidentally delete the production database.

With multi-account architecture, the dev account can't touch prod. Blast radius = one account, one service, not entire infrastructure.

Setting this up requires:

- Cross-account IAM roles (not long-lived access keys)
- S3 bucket policies preventing cross-account access
- VPC isolation and transit gateway design
- Audit logging in a separate security account

This is 4-6 weeks of engineering work.

### When to Hire an AWS Consulting Partner

Here's the honest truth: Everything in this post can be learned. But time has a cost.

Hiring an AWS consulting partner (like FactualMinds) is the answer to these questions:

- "Should we set up 5 accounts now or wait?" (Answer: Now. Cost of fixing later = 10x)
- "Is our Terraform state strategy secure?" (Answer: Depends. Let's audit.)
- "How do we scale observability from 10 to 100 microservices?" (Answer: EMF, aggregation, and sampling strategy)
- "What's our disaster recovery plan?" (Answer: Let's design it.)
- "Are we paying too much for compute?" (Answer: Probably. Let's optimize.)

Guides teach you to think like an engineer. A consulting partner teaches you to _build like a team_, with governance, security, cost discipline, and playbooks.

---

## Key Takeaways

1. **Common knowledge teaches concepts. Production teaches patterns.**
   Standard DevOps training is excellent for building concepts. But production is different. Spot diversity, state locking, multi-account isolation, alert aggregation—these aren't in textbooks. They're operational requirements.

2. **Default choices matter.**
   ECS over EKS (until you need K8s). Fargate for simplicity, EC2 for latency. Terraform for infrastructure, CloudFormation for quick stacks. Small choices compound into architectural decisions.

3. **Scaling ops is invisible until it breaks.**
   5 engineers with Terraform and CloudWatch? You're fine. 50 engineers? You need state locking, EMF, SCPs, and playbooks. Build it before you need it.

4. **Fundamentals are the foundation, not the answer.**
   Engineers who succeed in production combine foundational knowledge with years of operational patterns. Learning the fundamentals is step one.

---

## What's Next?

If you're building DevOps or cloud architecture skills:

1. Study the fundamentals (certifications, guides — you need the concepts)
2. **Then** study production patterns—state management, multi-account design, observability at scale
3. Build it. Deploy it. Break it. Fix it. That's the education that matters.

If you're building AWS infrastructure and hitting these questions:

- Multi-account strategy unclear?
- Terraform state management fragile?
- Alert fatigue from CloudWatch alarms?
- Want a disaster recovery plan that actually works?

We've helped 50+ companies move from foundational knowledge to operational excellence. [Let's talk about your infrastructure](/services/devops-pipeline-setup/).

## FAQ

### What are the most common mistakes DevOps teams make on AWS?
Treating AWS like a single account (before scaling to 5+), not locking Terraform state, choosing EKS when ECS is cheaper, ignoring alert fatigue in multi-account setups, and missing blast radius isolation. Most happen because foundational guides don't address scale.

### When should you choose ECS over EKS on AWS?
ECS is AWS-native, simpler, cheaper, and good for most container workloads. Choose ECS by default. Switch to EKS only if you need Kubernetes ecosystem features, multi-cloud, or have a large team (>15 people). ECS has no control-plane fee and requires less operational overhead.

### How do you manage Terraform state in a production team?
Use Terraform Cloud or a remote backend (S3 + DynamoDB) with state locking enabled. Never commit state files to Git. Enable versioning and MFA delete on the S3 bucket. Use workspaces or separate state files per environment. Enforce locking to prevent concurrent applies.

### Why do production AWS environments fail differently than expected?
Most guides cover theory (definitions, trade-offs). Production fails on: Spot interruptions, blast radius at scale, state conflicts, alert fatigue, multi-account permission boundaries, and cost surprises. These aren't in textbooks—they're operational patterns.

### What is AWS VPC IPAM and why does it matter?
VPC IPAM (IP Address Manager) automatically manages IP address allocation across multiple VPCs and regions. Without it, you'll eventually fragment your IP space, double-allocate ranges, or run out of contiguous blocks. Critical for 10+ VPCs. Takes 5 minutes to set up.

### How do you reduce CloudWatch alert fatigue in multi-account AWS?
Aggregate alarms at organization level using Amazon CloudWatch Synthetics (active monitoring) + SNS topic aggregation. One "is the app down?" question per app, not 20 alarms per account. Reduces oncall noise, improves response quality.

---

*Source: https://www.factualminds.com/blog/devops-exercises-aws-production-reality/*