Engineering Guides

Systems Engineering Guides for AWS Architects

From TCP and transaction isolation to Kafka and Kubernetes—each guide connects the mechanism to the AWS service decision you'd make in a design review.

9 learning tracks · 44 guides in library · Service mappings: AWS · Partner reviewed: Select

Nine tracks for architects who need the why before the console click. Pick a track below or start with high-traffic topics: CAP on AWS, Kafka exactly-once, HTTP/3, connection pool exhaustion, and Prometheus cardinality.

Networking & Protocol Engineering

TCP through QUIC, TLS termination, and server I/O primitives—mapped to ALB, CloudFront, and EC2 tuning decisions.

4 guides · ~5 min total read

  1.

    Modern Web Transport on AWS: TCP Congestion, HTTP/2, HTTP/3, and QUIC

    Packet loss on mobile networks still punishes HTTP/1.1 head-of-line blocking—but HTTP/3 only helps if CloudFront terminates QUIC and your origin connection pools are sized for multiplexed streams. This guide connects Reno, Cubic, BBR, HPACK, and QUIC to ALB and CloudFront decisions.

    2 min
  2.

    TLS 1.3 Handshake Internals on AWS: ALB, CloudFront, and ACM

    A full TLS handshake on every API call adds RTTs your p99 cannot afford. This guide walks through the TLS 1.3 1-RTT handshake, session resumption, ACM certificate rotation, and security policies on ALB and CloudFront.

    1 min
  3.

    High-Concurrency Server I/O: epoll, Syscalls, and Zero-Copy on AWS EC2

    C10k is solved until syscall overhead and context switches eat your Graviton cores. epoll, sendfile, and SO_REUSEPORT behaviors on EC2—and why Lambda caps concurrency differently.

    1 min
  4.

    CPU Cache Coherence and False Sharing for Cloud Backend Engineers

    Two goroutines updating adjacent counters can saturate the memory bus on a c7g.8xlarge. This guide covers memory barriers, cache lines, and false sharing, and explains why placement groups do not fix application-level contention.

    1 min

Database Internals & Performance

Isolation levels, storage engines, connection pools, and sharding—wired to RDS, Aurora, and DynamoDB.

6 guides · ~30 min total read

  1.

    PostgreSQL Transaction Isolation and ACID vs BASE on AWS RDS and Aurora

    Serializable sounds safest until your checkout times out under row locks. This guide maps READ COMMITTED, REPEATABLE READ, and SERIALIZABLE to RDS/Aurora defaults—and when DynamoDB conditional writes are the BASE alternative.

    2 min
  2.

    B-Tree vs LSM and Query Planner Internals on AWS Databases

    Why Aurora PostgreSQL loves B-tree indexes on OLTP but DynamoDB feels like an LSM—and how cost-based optimization surprises you when statistics go stale on RDS.

    1 min
  3.

    Database Deadlocks, Connection Pool Exhaustion, and Prepared Statements on RDS

    Too many "too many connections" pages are fixed by raising max_connections—which trades one outage for OOM on the writer. This guide traces deadlocks, pool sizing, RDS Proxy, and prepared statement caching on Aurora.

    2 min
  4.

    PostgreSQL Vacuum, Index Bloat, and Sharding Hot Partitions on AWS

    Autovacuum cannot keep up after Black Friday bulk deletes—and your BRIN index is not helping point lookups. Vacuum strategy on Aurora, plus Aurora Limitless and DynamoDB hot key mitigation.

    1 min
  5.

    RDS vs Aurora: Read Replicas, Failover, and When to Switch

    Guide compare
  6.

    When to Use RDS vs Aurora (Production Decision Guide)

    Guide

Distributed Systems Architecture

CAP, coordination, consensus, and event-sourced patterns on AWS multi-Region workloads.

7 guides · ~30 min total read

  1.

    CAP Theorem in Practice on AWS: What Architects Actually Need for Multi-Region

    CAP is not a trivia question—it is the reason your global DynamoDB table shows stale inventory or why Aurora Global reads lag 80 ms behind the writer. This guide maps partition tolerance, consistency, and availability trade-offs to concrete AWS controls.

    2 min
  2.

    CRDTs and Eventual Consistency Anti-Patterns on AWS

    Last-write-wins is not a CRDT—it is how Global Tables lose cart merges. When to use counters, OR-Sets, and conflict-free merges vs when to keep a single Aurora writer.

    1 min
  3.

    Distributed Locking, Redlock, and Consistent Hashing on AWS

    Redlock debates matter because ElastiCache is not a consensus system. Consistent hashing for sharding workers and ALB target stickiness—with DynamoDB conditional writes as the boring alternative.

    1 min
  4.

    Paxos, Raft, and Byzantine Fault Tolerance: What Cloud Architects Need

    You rarely implement Raft on EC2; you buy consensus as part of Aurora, DynamoDB, and EKS etcd. This guide explains quorum math so you trust managed services and avoid rolling your own coordinator.

    1 min
  5.

    Exactly-Once, CQRS, and Event Sourcing Replay on AWS

    Exactly-once is a myth end-to-end, but idempotent consumers plus event stores get you close. This guide covers CQRS read models on DynamoDB Streams and Kinesis, plus EventBridge replay semantics.

    1 min
  6.

    Microservices Design Patterns on AWS (2026 Production Guide)

    Guide
  7.

    Event-Driven Microservices Reference Pattern

    Guide pattern
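
The quorum math mentioned in the Paxos and Raft guide above can be sketched in a few lines. This is a minimal illustration, assuming a majority-quorum protocol such as Raft; the function names are hypothetical and do not come from any AWS API.

```python
def majority_quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """Voters that can be lost while a majority is still reachable."""
    return voters - majority_quorum(voters)


for size in (3, 4, 5):
    print(size, majority_quorum(size), tolerated_failures(size))
# prints:
# 3 2 1
# 4 3 1
# 5 3 2
```

Note that four voters tolerate no more failures than three, which is one reason odd-sized clusters are the usual choice.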

Messaging, Streaming & Event-Driven Systems

Kafka, ordering, backpressure, and the AWS async stack (SQS, SNS, EventBridge).

5 guides · ~39 min total read

  1.

    Kafka on MSK: Partition Rebalancing and Exactly-Once Semantics

    Consumer group rebalance storms stall processing longer than broker outages. This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK—with when SQS FIFO is simpler.

    2 min
  2.

    Message Ordering, Backpressure, and RabbitMQ DLQs on AWS

    FIFO guarantees shrink throughput—and unbounded queues only move backpressure to your AWS bill. Ordering, flow control, and Amazon MQ dead-letter patterns vs Kinesis resharding.

    1 min
  3.

    Reliable Queue Systems: SQS, Kafka, and Redis on AWS

    Guide
  4.

    SQS Reliable Messaging Patterns for Production

    Guide
  5.

    EventBridge Event-Driven Architecture Patterns

    Guide
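
Because queues with at-least-once delivery can hand the same message to a consumer twice, the exactly-once discussion in this track usually comes down to idempotent handling. The sketch below is a minimal illustration under stated assumptions: the `IdempotentConsumer` class and the message IDs are hypothetical, and the in-memory set stands in for a durable store.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentConsumer:
    """Applies each message at most once, keyed by its message ID."""
    balance: int = 0
    seen_ids: set = field(default_factory=set)

    def handle(self, message_id: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.balance += amount
        self.seen_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]  # m1 is redelivered
results = [consumer.handle(mid, amt) for mid, amt in deliveries]
print(results)           # [True, True, False]
print(consumer.balance)  # 15
```

In a real system the deduplication record and the side effect would need to be committed atomically in a durable store; this sketch deliberately skips that step.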

API & Application Architecture

Auth, rate limiting, and modern API protocols on API Gateway, Cognito, and AppSync.

4 guides · ~15 min total read

  1.

    OAuth2 Token Introspection vs JWT Validation on Cognito and API Gateway

    Local JWT validation is fast until revocation lags bite you. When to introspect at Cognito, use API Gateway JWT authorizers, and add Verified Permissions for fine-grained authz.

    1 min
  2.

    Rate Limiting: Token Bucket vs Leaky Bucket on AWS WAF and API Gateway

    Token buckets allow bursts; leaky buckets smooth traffic—WAF rate rules and API Gateway usage plans implement neither perfectly but both matter for layered defense.

    1 min
  3.

    gRPC, GraphQL, Protobuf, and API Contracts on AWS

    Protobuf on the wire saves bytes; GraphQL saves round trips until resolvers N+1 your Aurora cluster. ALB gRPC, AppSync, and consumer-driven contracts with Pact.

    1 min
  4.

    API Gateway Patterns: REST, HTTP, and WebSocket on AWS

    Guide
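
The burst-versus-smoothing contrast in the rate limiting guide above can be made concrete with a small simulation. This is a sketch under stated assumptions, not a model of how AWS WAF or API Gateway behave internally: the class names are hypothetical, time is passed in explicitly, and the leaky bucket is the queue variant that releases requests at a fixed interval.

```python
class TokenBucket:
    """Admits bursts up to `capacity`; refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class LeakyBucketQueue:
    """Queues up to `capacity` requests and releases one every 1/rate seconds."""

    def __init__(self, rate: float, capacity: int):
        self.interval = 1.0 / rate
        self.capacity = capacity
        self.next_release = 0.0

    def schedule(self, now: float):
        """Return the release time for a request, or None if the queue is full."""
        start = max(now, self.next_release)
        queued = (start - now) / self.interval  # requests already waiting
        if queued >= self.capacity:
            return None
        self.next_release = start + self.interval
        return start


tb = TokenBucket(rate=1.0, capacity=3.0)
lb = LeakyBucketQueue(rate=1.0, capacity=3)
print([tb.allow(0.0) for _ in range(5)])     # [True, True, True, False, False]
print([lb.schedule(0.0) for _ in range(5)])  # [0.0, 1.0, 2.0, None, None]
```

Given the same burst of five requests at time zero, the token bucket admits three immediately, while the leaky bucket queue spreads the same three over two seconds.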

Reliability Engineering & Observability

Tracing, metrics cardinality, logs, SLOs, and chaos—beyond default CloudWatch.

6 guides · ~51 min total read

  1.

    Observability Beyond CloudWatch: OTel, Prometheus, and Grafana on AWS

    Guide
  2.

    Prometheus Cardinality Explosion on AWS: AMP, EMF, and Cost-Aware Metrics

    That `user_id` label on every HTTP metric turns Amazon Managed Service for Prometheus into a five-figure line item. This guide explains cardinality mechanics, EMF vs remote write, and Application Signals defaults worth disabling.

    2 min
  3.

    Log Aggregation and Intelligent Sampling with CloudWatch and OpenTelemetry

    Ingesting every debug log to CloudWatch is how observability becomes a FinOps incident. Tail sampling with ADOT, Logs Insights, and Firehose to S3 for the long tail.

    1 min
  4.

    Resilience: Retries, Circuit Breakers, and Graceful Shutdown

    Guide
  5.

    Customer-Facing SLA and SLO Design on AWS

    Guide
  6.

    Chaos Engineering and Resilience Program with FIS (2026)

    Guide
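
The cardinality problem described in the Prometheus guide above is multiplicative: the worst-case number of time series for one metric is the product of the distinct values of each label. A rough sketch follows; the label names and counts are made-up illustrative values.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict) -> int:
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


base = {"method": 5, "status": 6, "route": 40}
with_user = {**base, "user_id": 10_000}

print(series_upper_bound(base))       # 1200
print(series_upper_bound(with_user))  # 12000000
```

This is an upper bound, since not every label combination occurs in practice, but it shows why a single unbounded label dominates the total.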

Kubernetes, Cloud Native & AWS

Deployments, PDBs, service mesh, container security, and multi-Region EKS patterns.

7 guides · ~51 min total read

  1.

    Blue-Green vs Canary Deployment Decision Guide (2026)

    Guide
  2.

    Kubernetes Pod Disruption Budgets on EKS: Zero-Downtime Upgrades

    Cluster upgrades and Karpenter consolidation look healthy in the console while PDB-blocked evictions freeze your node drain for 45 minutes. This guide wires minAvailable, maxUnavailable, and EKS managed node group semantics.

    1 min
  3.

    Service Mesh Traffic Shifting: VPC Lattice, Istio on EKS, and App Mesh EOL

    App Mesh is a legacy path: new meshes should start with VPC Lattice for AWS-native east-west traffic, or Istio on EKS when you need full L7 policy. Traffic shifting without duplicating load balancers per service.

    1 min
  4.

    Container Runtime Security: seccomp, AppArmor, and EKS Pod Security

    Default Docker seccomp is not enough for regulated workloads. EKS Pod Security Standards, seccomp profiles, and Fargate platform version constraints.

    1 min
  5.

    EKS + Karpenter Cost-Optimized Autoscaling (How-To)

    Guide
  6.

    Serverless Cold Starts and Ingress Scale on AWS

    Guide
  7.

    Multi-Region AWS Without Doubling Costs

    Guide

Concurrency, Runtime & Performance Engineering

JVM GC, virtual threads, and low-level concurrency choices for AWS runtimes.

2 guides · ~2 min total read

  1.

    JVM G1 and ZGC Tuning on AWS Corretto for ECS and EKS

    A heap that is too small triggers G1 humongous allocations; one that is too large balloons pause times on Graviton. Corretto on ECS, EKS, and Lambda Java, and when generational ZGC beats G1 for API heaps.

    1 min
  2.

    Virtual Threads, Lock-Free Structures, and High-Throughput Runtimes on AWS

    Project Loom virtual threads help I/O-bound Java on ECS—not CPU-bound aggregation. Compare actor models, lock-free queues, and when Lambda concurrency beats pinning threads on EC2.

    1 min

Caching & Performance Optimization

Cache layers, invalidation, and probabilistic structures on ElastiCache and CloudFront.

3 guides · ~15 min total read

  1.

    ElastiCache Redis Caching Strategies for Production

    Guide
  2.

    Distributed Cache Invalidation and Multi-Level Caching on AWS

    Cache-aside without an invalidation story ships stale pricing to 2% of users—the hardest 2% to debug. This guide layers CloudFront, ElastiCache, and DAX with TTL, event-driven purge, and when write-through beats cache-aside.

    2 min
  3.

    Bloom Filters and HyperLogLog in Production on ElastiCache Redis

    Bloom filters shave 90% of negative lookups; HyperLogLog estimates cardinality without storing every user ID. Redis modules on ElastiCache for abuse detection and feed deduplication.

    1 min
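
To make the Bloom filter idea from the last guide concrete, here is a minimal pure-Python sketch. It is a stand-in for illustration only and does not use any Redis or ElastiCache API; the class name, the bit-array size, and the SHA-256 based hashing scheme are all assumptions.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int, num_hashes: int):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter(size_bits=8192, num_hashes=3)
for user in ("user:1", "user:2", "user:42"):
    seen.add(user)

print(seen.might_contain("user:42"))  # True
```

A `False` answer lets the caller skip the backing-store lookup entirely, which is where the savings on negative lookups come from; a `True` answer still needs to be confirmed against the real data.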

Engineering Guides FAQ

Who are these engineering guides for?
Senior engineers, platform teams, SREs, and architects who already know AWS service names but want the systems fundamentals behind design reviews—connection pooling, consensus, cardinality, transport protocols, and how they map to RDS, EKS, MSK, and API Gateway.
How is this different from How-To Guides?
How-To Guides are step-by-step implementation walkthroughs (Bedrock, Karpenter, compliance setup). Engineering Guides explain mechanisms first—then show which AWS control implements them. Start here for theory; jump to How-To Guides when you are ready to ship configuration.
Do I need to read the tracks in order?
Each track is ordered from foundations to production trade-offs. You can enter at any guide that matches your current incident or design question, but reading a track top-to-bottom builds a coherent mental model.
Are these guides kept current with AWS changes?
Yes. Each guide pins AWS features and versions in the opening section and carries an updateDate. Service mesh content reflects App Mesh deprecation in favor of VPC Lattice; observability guides reference Amazon Managed Service for Prometheus and Application Signals as of 2026.

Stuck Between Theory and Production?

Our AWS architects run design reviews where these trade-offs become concrete—RDS Proxy sizing, MSK consumer groups, mesh vs Lattice, cardinality budgets.