# Monolith scale-in-place checklist (before decomposition)

Run these **in order**. Do not split services until step 6 fails a written capacity
test and the step 7 gate passes: microservices add network tax, not free scale.

> Reflects **June 2026** patterns: ECS Express Mode (Nov 2025 GA, GovCloud Jun 2026),
> RDS Proxy multiplexing, ElastiCache Serverless for cache bursts.

## Step 0 — Baseline (required)

- [ ] Capture p50/p99 API latency, DB connections in use vs. `max_connections`, app-tier CPU, and the top 5 slow queries
- [ ] Tag deploy events in metrics — rollouts often masquerade as "mysterious slowness"
- [ ] Document peak RPS and peak concurrent users (true peaks, not dashboard averages)

**Rollback trigger:** No baseline → stop; you cannot prove the lever worked.
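
The baseline numbers above can be derived from raw request logs. The following is a minimal sketch, assuming you can export per-request latencies in milliseconds and request timestamps in epoch seconds; the function names `percentile`, `baseline`, and `peak_rps` are hypothetical helpers, not part of any AWS tooling.

```python
import math
from collections import Counter


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def baseline(latencies_ms):
    """Summarize latency samples into the figures the checklist asks for."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
    }


def peak_rps(request_timestamps):
    """Highest request count observed in any one-second bucket (0 if no requests)."""
    per_second = Counter(int(ts) for ts in request_timestamps)
    return max(per_second.values(), default=0)
```

Storing the output next to the deploy tags gives each later step a before/after comparison against the same definitions of p99 and peak.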

## Step 1 — Vertical scale + Graviton canary

- [ ] Right-size EC2/ECS task CPU and memory (not the smallest size that "worked in staging")
- [ ] Canary one AZ/task set on Graviton if the app is CPU-bound on x86 (skip if it relies on heavy native dependencies)
- [ ] Enable ALB target connection draining on deploy

**Rollback trigger:** p99 worse after vertical bump → profile first; more CPU won't fix lock contention.

## Step 2 — Connection pool + RDS Proxy

- [ ] Cap per-task pool size (typical **10–20**, not 100)
- [ ] Put **RDS Proxy** in front of Aurora/RDS when task count × pool size > 40% of `max_connections`
- [ ] Kill `idle in transaction` sessions (ORM transaction leaks are the #1 cause of a false "need more replicas" signal)

**Rollback trigger:** Proxy connection borrow timeouts → fix query duration before raising pool.
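
The 40% rule above is simple arithmetic on three numbers. A minimal sketch, assuming every task opens its full pool at peak; `needs_rds_proxy` and `max_tasks_without_proxy` are hypothetical helper names, and the threshold is passed as an integer percentage to keep the comparison exact.

```python
def needs_rds_proxy(task_count, pool_size_per_task, max_connections, threshold_pct=40):
    """True when worst-case client connections exceed threshold_pct of max_connections."""
    demand = task_count * pool_size_per_task
    return demand * 100 > threshold_pct * max_connections


def max_tasks_without_proxy(pool_size_per_task, max_connections, threshold_pct=40):
    """Largest task count that stays at or below the threshold for a given pool size."""
    return (threshold_pct * max_connections) // (100 * pool_size_per_task)
```

For example, 30 tasks with a pool of 20 against `max_connections = 1000` demand 600 connections, which is over the 400-connection threshold, while 10 tasks (200 connections) are under it.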

## Step 3 — Read path split (replicas + routing)

- [ ] Route read-only queries to Aurora reader endpoint or RDS read replica
- [ ] Measure replica lag: if p99 lag seen by reads exceeds **500 ms** during writes, tighten routing rules
- [ ] Cache **read models** that tolerate seconds of staleness

**Rollback trigger:** Users see stale inventory/pricing → shrink cache TTL or route critical reads to writer.
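
One possible shape for the routing rule is sketched below. It assumes the application can label each query as read-only and as staleness-critical, and that a recent replica-lag measurement is available; the `Query` type, `choose_endpoint`, and the `"writer"`/`"reader"` labels are hypothetical names for illustration.

```python
from dataclasses import dataclass

LAG_BUDGET_MS = 500  # threshold taken from the checklist above


@dataclass
class Query:
    read_only: bool
    critical: bool = False  # e.g. inventory or pricing reads that must not be stale


def choose_endpoint(query, replica_lag_ms, lag_budget_ms=LAG_BUDGET_MS):
    """Return "writer" or "reader" for a query given the current replica lag."""
    if not query.read_only or query.critical:
        return "writer"
    if replica_lag_ms > lag_budget_ms:
        # Replica is too far behind: fall back to the writer for this read.
        return "writer"
    return "reader"
```

Falling back to the writer under lag shifts load onto it, so this rule suits short write bursts; sustained lag is a sign that the routing or the write path needs attention instead.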

## Step 4 — Cache hot keys (ElastiCache / in-process)

- [ ] Cache idempotent GETs with explicit TTL per entity type
- [ ] Stampede protection (single-flight / jittered TTL) on hot keys
- [ ] Do **not** cache personalized auth/session payloads in shared Redis without key isolation

**Rollback trigger:** Cache hit ratio < 60% after 48h → the wrong keys are cached; fix queries before buying a bigger Redis.
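
Stampede protection can be illustrated in-process. The sketch below combines single-flight loading (one loader call per key while others wait) with a jittered TTL so hot keys do not all expire together. `SingleFlightCache` is a hypothetical class written for this example, not an ElastiCache or Redis client API; a shared cache would need a distributed equivalent of the per-key lock.

```python
import random
import threading
import time


class SingleFlightCache:
    def __init__(self, base_ttl_s, jitter_fraction=0.1, clock=time.monotonic):
        self._base_ttl_s = base_ttl_s
        self._jitter_fraction = jitter_fraction
        self._clock = clock
        self._values = {}     # key -> (value, expires_at)
        self._key_locks = {}  # key -> lock; grows with key count in this sketch
        self._guard = threading.Lock()

    def _lookup(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[1] > self._clock():
            return True, entry[0]
        return False, None

    def get(self, key, loader):
        """Return the cached value, calling loader() at most once per expiry."""
        hit, value = self._lookup(key)
        if hit:
            return value
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the key while we waited.
            hit, value = self._lookup(key)
            if hit:
                return value
            value = loader()
            jitter = random.uniform(-self._jitter_fraction, self._jitter_fraction)
            ttl = self._base_ttl_s * (1 + jitter)
            self._values[key] = (value, self._clock() + ttl)
            return value
```

With eight concurrent callers for the same cold key, the loader (the database query in practice) runs once and the other seven callers receive the cached result.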

## Step 5 — Async offload (SQS / EventBridge)

- [ ] Move email, webhooks, PDF gen, search index updates off request path
- [ ] Idempotent workers + DLQ + `maxReceiveCount` tuned
- [ ] See [throughput tier matrix](../event-throughput/throughput-tier-decision-matrix.md) if required message throughput exceeds SQS FIFO quotas

**Rollback trigger:** User-facing flow still synchronous because UX wasn't updated — fix API contract, not queue size.
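
The worker requirements above can be modelled without any AWS calls. This is a minimal sketch of the outcomes an idempotent handler should produce, assuming each message carries a stable `id` and a `receive_count`; in SQS itself the move to the DLQ is performed by the queue's redrive policy once `maxReceiveCount` is exceeded, not by the worker. `handle_message` and the message dictionary shape are hypothetical, and a real deployment would keep `processed_ids` in a durable store rather than in memory.

```python
def handle_message(message, processed_ids, side_effect, max_receive_count=5):
    """Process one message and return "done", "duplicate", "retry", or "dead-letter"."""
    msg_id = message["id"]
    if msg_id in processed_ids:
        # Redelivery of work that already succeeded: acknowledge, do nothing.
        return "duplicate"
    try:
        side_effect(message["body"])
    except Exception:
        if message["receive_count"] >= max_receive_count:
            # Models the point where the redrive policy would move it to the DLQ.
            return "dead-letter"
        return "retry"
    processed_ids.add(msg_id)
    return "done"
```

The duplicate check is what makes retries safe: a webhook or email that was already sent is not sent again when the queue redelivers the message.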

## Step 6 — Horizontal scale (ECS Express Mode or standard ECS)

- [ ] **ECS Express Mode** (three inputs → HTTPS on Fargate) for stateless web tier without platform team
- [ ] Autoscale on CPU **and** ALB request count; min tasks ≥ 2 for AZ redundancy
- [ ] Shared ALB across ≤25 Express services — isolate blast radius with host rules

**Rollback trigger:** 5xx during scale-out → health check grace period is likely too short for task startup; lengthen it before adding capacity.
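
Scaling on CPU and request count together means the tighter of the two signals decides the task count. The sketch below approximates that with a proportional rule per metric and takes the larger result, floored at two tasks. It is an illustration only: the target values, `max_tasks`, and the function name `desired_tasks` are assumptions, and real target tracking also applies alarms and cooldowns that are not modelled here.

```python
import math


def desired_tasks(current_tasks, cpu_pct, requests_per_target,
                  cpu_target_pct=60, requests_target=1000,
                  min_tasks=2, max_tasks=20):
    """Task count needed to bring both metrics back to their targets."""
    by_cpu = math.ceil(current_tasks * cpu_pct / cpu_target_pct)
    by_requests = math.ceil(current_tasks * requests_per_target / requests_target)
    wanted = max(by_cpu, by_requests)  # the tighter constraint wins
    return max(min_tasks, min(max_tasks, wanted))
```

With 4 tasks at 90% CPU and 500 requests per target, CPU drives the result to 6 tasks; with 30% CPU and 2000 requests per target, request count drives it to 8.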

## Step 7 — Decomposition decision gate

Only proceed to microservices when **all** are true:

- [ ] Team can operate ≥3 independently deployable services (CI/CD + on-call)
- [ ] Domain boundaries are stable (not "split by controller folder")
- [ ] Cross-service transactions replaced with sagas/outbox — **written**, not assumed

Otherwise stay modular monolith.

## Related posts

- [Scale monolith before decomposition](/blog/aws-legacy-monolith-scale-in-place-before-decomposition-2026/)
- [Connection pools on RDS](/blog/database-deadlocks-connection-pools-prepared-statements-rds/)
- [ECS Express Mode](/blog/amazon-ecs-express-mode/)
- [Microservices vs monolith decision](/blog/microservices-vs-monolith-on-aws-architecture-decision-guide/)
