# Wave cutover runbook — skeleton

A phase-gated runbook for a single wave in a data centre exit programme. Copy
this file per wave, fill in the **Wave**, **Workload(s)**, and **Cutover window**
fields in the header, and replace every bracketed `[…]` placeholder with the
wave's specifics.

The skeleton is opinionated: every phase has a **go/no-go gate** and a
**rollback layer**. Skip neither.

---

## Header

| Field | Value |
|-------|-------|
| Wave | `[Wave N]` |
| Workload(s) | `[W?-01, W?-02 …]` |
| Criticality | `[low / medium / high / critical]` |
| Cutover window (UTC) | `[Sat 02:00 – 06:00 UTC, 2026-MM-DD]` |
| Cutover IC | `[name]` |
| Communications IC | `[name]` |
| Rollback decision authority | `[name + alternate]` |
| Vendor on-site (if any) | `[name + contact]` |
| Master comms channel | `[#dc-exit-wave-N]` |

---

## T-7 days — Final readiness gate

**Goal:** confirm the wave is ready to cut. Do not cancel the window unless
this gate fails — the next available window is usually 2+ weeks out and the
landlord clock does not stop.

- [ ] Source-side replication lag has stayed < 60s in every 30-min rolling window over the last 24h
- [ ] Target-side replication monitor independently confirms < 60s lag
- [ ] All wave-internal dependencies have completed their pre-cutover checks
- [ ] Out-of-wave dependencies are confirmed unchanged (no unannounced changes)
- [ ] Landing-zone account/OU/SCP baseline is in place
- [ ] Network capacity (DX/VPN) stress-tested at ≥ 2.5× projected dual-run load
- [ ] Comms to customers/partners sent (where applicable)
- [ ] Cutover command-room booked and on-call rota confirmed
- [ ] Rollback layer documented below is intact and tested
- [ ] Vendor on-site (if any) confirmed in writing
- [ ] Sign-off in writing from: Cutover IC, Workload owner, Security, Network

**Gate outcome:** `[GO / NO-GO]` — recorded by `[Cutover IC]` at `[timestamp]`.
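
The lag criterion can be checked mechanically rather than by eye. Below is a minimal sketch, assuming lag samples are available as `(timestamp, lag_seconds)` pairs exported from the replication monitor. The function name `lag_gate`, the sample format, and the 5-minute `max_gap` are illustrative assumptions, not part of any specific tool. It takes the strict reading of the criterion: if no sample in the last 24h reaches 60s, then every 30-minute rolling window is clean. It also treats missing samples as a failure, since a silent monitor is not evidence of low lag.

```python
from datetime import datetime, timedelta, timezone


def lag_gate(samples, threshold_s=60.0, lookback=timedelta(hours=24),
             max_gap=timedelta(minutes=5), now=None):
    """Return (passed, reasons) for (timestamp, lag_seconds) samples.

    Assumption: timestamps are timezone-aware UTC datetimes.
    """
    now = now or datetime.now(timezone.utc)
    start = now - lookback
    recent = sorted((t, lag) for t, lag in samples if t >= start)
    if not recent:
        return False, ["no samples in lookback period"]

    reasons = []
    # Any single breach breaks at least one rolling window that contains it.
    for t, lag in recent:
        if lag >= threshold_s:
            reasons.append(f"lag {lag:.0f}s at {t.isoformat()}")

    # Flag coverage gaps, including before the first and after the last sample.
    edges = [start] + [t for t, _ in recent] + [now]
    for a, b in zip(edges, edges[1:]):
        if b - a > max_gap:
            reasons.append(f"no samples between {a.isoformat()} and {b.isoformat()}")

    return not reasons, reasons
```

Running the same function against the source-side and the target-side exports keeps the two checks independent, as the checklist requires.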

---

## T-1 day — Pre-cutover checklist

- [ ] Snapshot all stateful systems on source (backup retention ≥ rollback window)
- [ ] Freeze any non-emergency changes on source workload(s)
- [ ] Re-verify replication lag (< 60s for 30 min) and record screenshot
- [ ] Confirm DNS TTLs have been reduced to ≤ 60 seconds, 24h ahead of cutover
- [ ] Validate target services can absorb production traffic via shadow / synthetic
- [ ] Comms team posts T-24h status to master channel
- [ ] Cutover IC publishes the planned timeline by hour

---

## T-0 — Cutover window

### Phase 1 — Final replication catch-up (T+0 to T+30 min)

- [ ] Quiesce writes on source (application maintenance page on, or write traffic paused)
- [ ] Confirm replication queue drains to 0
- [ ] Confirm target side is at last source LSN / sequence
- [ ] **Gate:** replication caught up & verified by two engineers. `[GO / NO-GO]`
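
The "target is at last source LSN" check is easy to get wrong when positions are compared as strings. The sketch below assumes PostgreSQL-style LSNs of the form `high/low` in hexadecimal (for example `16/B374D848`); this runbook does not name a database engine, so treat the format and the function names as assumptions and adapt them to the sequence type your replication tool reports.

```python
def parse_pg_lsn(lsn: str) -> int:
    """Convert a PostgreSQL-style LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_gap_bytes(source_lsn: str, target_lsn: str) -> int:
    """Bytes the target still has to replay; 0 or negative means caught up."""
    return parse_pg_lsn(source_lsn) - parse_pg_lsn(target_lsn)


def replication_caught_up(source_lsn: str, target_lsn: str) -> bool:
    """True when the target has replayed at least up to the source's final LSN."""
    return replication_gap_bytes(source_lsn, target_lsn) <= 0
```

Numeric comparison matters here: as plain strings, `"9/0"` sorts after `"10/0"`, which would report a lagging target as caught up. Capture the source LSN only after writes are quiesced, otherwise the value is a moving target.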

### Phase 2 — DNS / endpoint swap (T+30 to T+60 min)

- [ ] Update DNS / ALB target / load-balancer config to point to the AWS target
- [ ] Verify DNS propagation in ≥ 3 different resolvers
- [ ] Run smoke tests on target (synthetic transactions + golden-path checks)
- [ ] **Gate:** smoke tests pass; no error rate spike on target. `[GO / NO-GO]`
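
The propagation check can be scripted so the result is recorded rather than eyeballed. This is a sketch of the decision logic only: the actual DNS query is passed in as a `lookup(resolver, name)` callable, which is a hypothetical hook you would implement with whatever DNS client you use. The resolver addresses in the usage note are well-known public resolvers chosen as examples; the record name and addresses are placeholders.

```python
def check_propagation(name, expected, resolvers, lookup, min_resolvers=3):
    """Return (passed, detail) where detail maps resolver -> status string.

    lookup(resolver, name) must return an iterable of answers (hypothetical hook).
    A resolver is "ok" when it returns a non-empty subset of the expected answers.
    """
    expected = set(expected)
    detail = {}
    for resolver in resolvers:
        try:
            answers = set(lookup(resolver, name))
        except Exception as exc:  # a failed query is not proof of propagation
            detail[resolver] = f"error: {exc}"
            continue
        if answers and answers <= expected:
            detail[resolver] = "ok"
        else:
            detail[resolver] = f"stale: {sorted(answers)}"
    ok_count = sum(1 for status in detail.values() if status == "ok")
    # Require enough resolvers AND no dissenting resolver among those asked.
    passed = ok_count >= min_resolvers and ok_count == len(detail)
    return passed, detail
```

A call might look like `check_propagation("app.example.com", {"198.51.100.5"}, ["1.1.1.1", "8.8.8.8", "9.9.9.9"], lookup)`. The subset test is deliberate: load balancers often rotate which addresses they return, so each resolver may see only part of the expected set.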

### Phase 3 — Real-traffic verification (T+60 min to T+3h)

- [ ] Monitor: error rate, p95 latency, queue depth, replication-back-to-source (if dual-run)
- [ ] Acceptance criteria met for **30 consecutive minutes** at production load
- [ ] No customer-visible incident reported via support channel
- [ ] **Gate:** wave is **provisionally green**; source remains warm in standby. `[GO / NO-GO]`
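
"30 consecutive minutes" is best read as the streak ending now, not the longest healthy stretch seen at any point in the window. The sketch below illustrates that reading. The metric names and the thresholds `MAX_ERROR_RATE` and `MAX_P95_MS` are placeholders for the wave's real acceptance criteria, and samples are assumed to arrive as time-ordered `(timestamp, metrics)` pairs.

```python
from datetime import datetime, timedelta

# Placeholder acceptance thresholds; substitute the wave's agreed values.
MAX_ERROR_RATE = 0.01
MAX_P95_MS = 400


def meets_acceptance(metrics) -> bool:
    """True when one sample satisfies every acceptance criterion."""
    return metrics["error_rate"] <= MAX_ERROR_RATE and metrics["p95_ms"] <= MAX_P95_MS


def current_healthy_streak(samples, is_healthy) -> timedelta:
    """Duration of the unbroken healthy run that ends at the latest sample."""
    streak_start = None
    last_ts = None
    for ts, metrics in samples:
        if is_healthy(metrics):
            if streak_start is None:
                streak_start = ts
        else:
            streak_start = None  # any unhealthy sample resets the clock
        last_ts = ts
    if streak_start is None:
        return timedelta(0)
    return last_ts - streak_start


def gate_passes(samples, required=timedelta(minutes=30)) -> bool:
    return current_healthy_streak(samples, meets_acceptance) >= required
```

With this reading, a clean first hour followed by a degraded sample five minutes ago does not pass the gate; the clock restarts after the last unhealthy sample.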

### Phase 4 — Comms (T+3h)

- [ ] Cutover IC posts completion summary to master channel
- [ ] Customer-facing comms (if any) updated to "completed"
- [ ] On-call coverage reduced to standard (from cutover surge)

---

## T+1 day — First-day-of-life checks

- [ ] Daily batch jobs completed on target (compare runtimes vs source baseline)
- [ ] Reconciliation reports (financial / inventory / audit) match source baseline within tolerance
- [ ] No regression in p95 latency, error rate, or cost vs target baseline
- [ ] On-call shift report does not contain wave-related incidents above SEV-3
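
"Within tolerance" is easier to sign off when the tolerance is written down and applied uniformly. A minimal sketch follows, assuming each reconciliation report can be reduced to a mapping of named totals. The function name, the report keys, and the tolerance values are illustrative; the real tolerances belong to the workload owner.

```python
import math


def reconcile(source, target, rel_tol=0.0, abs_tol=0.0):
    """Return {key: (source_value, target_value)} for every total that differs.

    With both tolerances at 0.0 the comparison is exact. A key missing on
    either side always counts as a mismatch.
    """
    mismatches = {}
    for key in sorted(set(source) | set(target)):
        s, t = source.get(key), target.get(key)
        if s is None or t is None or not math.isclose(s, t, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches[key] = (s, t)
    return mismatches
```

An empty result means the check passes. For financial totals, an exact match (both tolerances left at zero) is usually the safer default; relative tolerance suits metrics such as batch runtimes.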

---

## T+7 days — Steady-state gate

- [ ] No incidents above SEV-3 attributed to the wave in the last 7 days
- [ ] Replication-back-to-source (if running) reports green
- [ ] Cost-monitoring on AWS target is within projected band
- [ ] FinOps tagging confirmed on all wave resources
- [ ] **Gate:** wave is **green**. Begin decommission countdown for source. `[GO / NO-GO]`

---

## T+30 days (or wave-specific decom date) — Decommission

- [ ] Source workload(s) moved to read-only / stopped
- [ ] No traffic to source for 14 consecutive days
- [ ] **Gate:** approval to decom from workload owner + Security. `[GO / NO-GO]`
- [ ] Decom executed; certificate of destruction / archival recorded
- [ ] Source artefacts archived to S3 Glacier Deep Archive (where retention applies)

---

## Rollback layer

Every wave must have a documented rollback layer. Write it during planning, not
during the cutover.

**Rollback trigger conditions** (any one suffices):

- Error rate on target > `[X]%` of source baseline for ≥ 15 minutes
- p95 latency on target > `[Y]ms` (vs `[Z]ms` source baseline) for ≥ 15 minutes
- Replication-back-to-source fails or lag exceeds `[N] minutes`
- Customer-visible incident at or above SEV-2
- Rollback decision authority calls it for any other reason
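
Because any one condition suffices, the triggers can be evaluated as a simple list of independent checks. The sketch below assumes the monitoring stack already reports how long each threshold has been breached; the `Observation` fields and function name are hypothetical. Note the severity comparison: "at or above SEV-2" means a SEV number of 2 or lower.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

SUSTAIN = timedelta(minutes=15)  # the ">= 15 minutes" in the trigger list


@dataclass
class Observation:
    error_breach: timedelta        # time the target error rate has exceeded [X]%
    latency_breach: timedelta      # time the target p95 has exceeded [Y]ms
    reverse_replication_ok: bool   # False if replication-back-to-source has failed
    reverse_lag: timedelta         # current replication-back-to-source lag
    worst_open_sev: Optional[int]  # e.g. 2 for SEV-2; None if no customer-visible incident
    authority_called: bool         # rollback decision authority has called it


def fired_triggers(obs: Observation, max_reverse_lag: timedelta) -> List[str]:
    """Return the name of every rollback trigger currently met."""
    fired = []
    if obs.error_breach >= SUSTAIN:
        fired.append("error rate")
    if obs.latency_breach >= SUSTAIN:
        fired.append("p95 latency")
    if not obs.reverse_replication_ok or obs.reverse_lag > max_reverse_lag:
        fired.append("replication-back-to-source")
    if obs.worst_open_sev is not None and obs.worst_open_sev <= 2:
        fired.append("customer-visible incident")
    if obs.authority_called:
        fired.append("decision authority")
    return fired
```

A non-empty list means rollback conditions are met. Returning the names rather than a bare boolean gives the incident record the reason for the call.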

**Rollback steps** (specific to this wave — write them out):

1. [Step 1 — e.g. revert DNS / ALB target]
2. [Step 2 — e.g. stop writes on target, replay queued writes to source]
3. [Step 3 — re-enable writes on source]
4. [Step 4 — confirm source health; post incident comms]
5. [Step 5 — schedule retrospective and re-plan the wave]

**Rollback window:** source is kept warm for `[N]` days post-cutover. After
that window, rollback requires a restore from backup, not a DNS flip.

---

## Post-cutover retrospective (run within 7 days)

- What went well?
- What broke (specifically) — at what time, detected how, fixed how?
- What did the runbook miss?
- Update the next wave's runbook with the lessons before that wave's T-7 gate.

A wave is not closed until the retrospective is done and the next wave's
runbook is updated.
