# GameDay runbook template

Copy this per experiment. A GameDay is a *controlled* failure rehearsal: the
goal is to learn and to confirm fixes, not to break prod for sport. Fill every
section before you run; an empty steady-state hypothesis (section 2) or an empty
"Stop conditions" field (section 3) is a hard stop.
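
The hard-stop rule can be checked mechanically before a run. The following is a minimal sketch, assuming the filled-in runbook keeps this template's `- **Label:** value` bullet format; the function name and the list of required labels are illustrative, and treating a value that is entirely parenthesized as leftover hint text is an assumption.

```python
import re

# Matches template bullets of the form "- **Label:** value".
FIELD = re.compile(r"^- \*\*(?P<label>.+?):\*\*\s*(?P<value>.*)$")

# Illustrative choice of fields whose absence blocks the run.
REQUIRED_FIELDS = [
    "Steady-state metric(s)",
    "Healthy bound",
    "Stop conditions (CloudWatch alarms)",
]


def _is_blank(value):
    # Assumption: a value that is entirely "(...)" is the template's hint text.
    return not value or (value.startswith("(") and value.endswith(")"))


def unfilled_required_fields(runbook_text):
    """Return required labels that are missing, empty, or still hold only the hint."""
    values = {}
    for line in runbook_text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            values[match.group("label")] = match.group("value").strip()
    return [label for label in REQUIRED_FIELDS if _is_blank(values.get(label, ""))]
```

A non-empty return value means the experiment should not start. A real value that happens to begin and end with parentheses would be flagged as unfilled, so adjust the hint check if your team writes values that way.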

---

## 1. Experiment

- **Title:**
- **Date / window:** (low-traffic window; announced to on-call + stakeholders)
- **Owner / facilitator:**
- **Participants (roles):** on-call SRE, service owner, observer/scribe
- **FIS experiment template ID / scenario:** (e.g. AZ Availability: Power Interruption)

## 2. Hypothesis (steady state)

> "We believe that when **\<fault\>** occurs, **\<system\>** will **\<expected
> behavior\>**, and the steady-state metric **\<metric\>** will stay within
> **\<bound\>**."

- **Steady-state metric(s):** (p99 latency, success rate, orders/min)
- **Healthy bound:**
- **Where it's measured:** (CloudWatch dashboard link)
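
Writing the healthy bound as an explicit numeric range makes the "Steady state OK?" column in the run log a yes/no check rather than a judgment call. This is a minimal sketch; the `SteadyState` class, the metric name, and the sample values are hypothetical, and fetching samples from your dashboard is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    metric: str
    lower: float  # inclusive lower bound
    upper: float  # inclusive upper bound

    def holds(self, samples):
        """True only if there is data and every sample lies within [lower, upper]."""
        return bool(samples) and all(self.lower <= s <= self.upper for s in samples)


# Hypothetical example: success rate must stay between 99.5 % and 100 %.
success_rate = SteadyState(metric="success_rate_pct", lower=99.5, upper=100.0)
```

An empty sample list is treated as a violation here, on the assumption that missing telemetry during a fault injection is itself a reason to stop and investigate.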

## 3. Blast radius & safety

- **Scope (tags / accounts / AZ):**
- **Target selection:** tag-based; skip the action (rather than fail) if no valid target resolves
- **Stop conditions (CloudWatch alarms):** (alarm ARNs that auto-halt)
- **Manual abort owner + command:**
- **Rollback plan (tested? Y/N):**
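
The scope, target selection, and stop-condition fields map onto parts of an FIS experiment template. The sketch below builds that fragment as a plain dictionary; the key names follow the FIS `CreateExperimentTemplate` request shape as I understand it, so verify them against the current API reference before use. The function name, the target key `in-scope`, and the default resource type are illustrative assumptions.

```python
def build_safety_config(alarm_arns, scope_tags, resource_type="aws:ec2:instance"):
    """Build the safety-related fragment of an FIS experiment template.

    Refuses to build a template without stop conditions, mirroring the
    runbook's hard-stop rule.
    """
    if not alarm_arns:
        raise ValueError("at least one stop-condition alarm ARN is required")
    if not scope_tags:
        raise ValueError("tag-based scope is required to bound the blast radius")
    return {
        # Each CloudWatch alarm halts the experiment when it enters ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": arn} for arn in alarm_arns
        ],
        # Tag-based selection; "in-scope" is an arbitrary target name.
        "targets": {
            "in-scope": {
                "resourceType": resource_type,
                "resourceTags": dict(scope_tags),
                "selectionMode": "ALL",
            }
        },
        # Skip instead of failing when no resource matches the tags.
        "experimentOptions": {"emptyTargetResolutionMode": "skip"},
    }
```

The returned fragment would be merged with the actions, role ARN, and description before being passed to the FIS API. For the manual abort field, the corresponding FIS operation is `StopExperiment` (for example `aws fis stop-experiment --id <experiment-id>`).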

## 4. Pre-checks (T-minus)

- [ ] On-call notified; stakeholders aware
- [ ] Dashboards open; baseline steady-state captured
- [ ] Stop-condition alarms test-fired (forced into ALARM, e.g. with CloudWatch `set-alarm-state`) and confirmed wired to the experiment
- [ ] Backups / DR posture confirmed for in-scope resources
- [ ] Rollback path confirmed

## 5. Run log (fill live)

| Time | Action | Observed | Steady state OK? |
|------|--------|----------|------------------|
| | inject | | |
| | | | |
| | recover | | |

## 6. Result

- **Hypothesis held? (Y/N):**
- **What actually happened:**
- **What surprised us:**
- **Did anything fail to recover? (Y/N, what):**

## 7. Findings → backlog (the part that makes it a program)

| Finding | Severity | Owner | Ticket | Re-test date |
|---------|----------|-------|--------|--------------|
| | | | | |

> A GameDay without a tracked finding and a **re-test to confirm the fix** is
> theater. The loop is: experiment → finding → fix → re-run the same experiment
> → confirm the hypothesis now holds.
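
The loop can be audited from the findings table. This is a minimal sketch with a hypothetical `Finding` record whose fields mirror the table columns, plus a `retest_passed` flag that the template does not have; it reports any finding that is missing a ticket, missing a re-test date, or past its re-test date without a confirmed pass.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Finding:
    summary: str
    ticket: Optional[str] = None
    retest_date: Optional[date] = None
    retest_passed: bool = False  # assumption: recorded after re-running the experiment


def open_loops(findings, today):
    """Return (summary, reason) pairs for findings whose loop is not closed."""
    problems = []
    for finding in findings:
        if not finding.ticket:
            problems.append((finding.summary, "no ticket"))
        elif finding.retest_date is None:
            problems.append((finding.summary, "no re-test date"))
        elif not finding.retest_passed and finding.retest_date < today:
            problems.append((finding.summary, "re-test overdue"))
    return problems
```

An empty result means every finding is tracked and either re-tested successfully or still scheduled for a future re-test.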
