# Resilience program maturity matrix (chaos engineering)

A *tutorial* teaches you to run one AWS Fault Injection Service (FIS)
experiment. A *program* makes resilience testing routine, safe, and tied to
real reliability targets. Use this matrix to decide where you are and what the
next concrete step is, not to award yourself top marks and stop.

> The FIS scenarios and capabilities described here reflect the service as of
> mid-2026. The AZ Availability: Power Interruption and Cross-Region:
> Connectivity scenarios are in the FIS Scenario Library; confirm current
> scenarios and supported actions in the FIS docs.

| Dimension | L0 — Ad hoc | L1 — Starting | L2 — Routine | L3 — Program |
|-----------|-------------|---------------|--------------|--------------|
| **Hypothesis** | "Let's see what breaks" | Written steady-state metric per experiment | Hypothesis tied to an SLO/error budget | Experiments derived from the risk register |
| **Blast radius** | Prod, no limits (don't) | Non-prod only | Prod with tag-scoped targets + stop conditions | Prod by default, automated stop on SLO burn |
| **Safety** | None | Manual abort | **CloudWatch alarm stop conditions** wired | Stop conditions + auto-rollback + on-call paged |
| **Scope** | Single resource | Single action (CPU stress or instance termination) | Scenario Library (AZ power interruption) | Multi-account / multi-AZ, cross-Region scenarios |
| **Cadence** | Never / once | One GameDay | Scheduled (EventBridge Scheduler) | In CI/CD pre-prod gates + recurring GameDays |
| **Ownership** | Nobody | One champion | A team | Org program w/ exec sponsor + Resilience Hub targets |
| **Learning loop** | Findings lost | Notes in a doc | Action items tracked | Findings → backlog → re-tested to confirm fix |
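
The "automated stop on SLO burn" cell can be read in terms of error-budget burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is only an illustration of that arithmetic; the function names and the default threshold are assumptions, not values from this matrix. In practice the same check would be expressed as a CloudWatch alarm that FIS references as a stop condition.

```python
def burn_rate(failed, total, slo_target):
    """Return how fast the error budget is being consumed.

    1.0 means errors arrive exactly at the rate the SLO allows;
    values above 1.0 mean the budget is burning faster than planned.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic observed, so nothing was burned
    error_budget = 1.0 - slo_target      # allowed error rate
    observed_error_rate = failed / total
    return observed_error_rate / error_budget


def should_stop(rate, threshold=10.0):
    """Hypothetical stop rule: halt once the burn rate exceeds the threshold.

    The default of 10.0 is an arbitrary placeholder; pick a value that
    matches your own alerting policy.
    """
    return rate > threshold


if __name__ == "__main__":
    # 50 failed requests out of 10,000 against a 99.9% SLO: roughly 5x burn.
    rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
    print(round(rate, 2), should_stop(rate))
```

A burn rate well above 1.0 during an experiment suggests the injected fault is costing more reliability than the hypothesis predicted, which is the signal an automated stop is meant to act on.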

## How to read it

- Most teams that "do chaos engineering" are at **L1**: they ran one FIS
  experiment in a demo account and never made it routine. The jump that matters
  is **L1 → L2**: a *steady-state hypothesis*, *stop conditions*, and a
  *schedule*.
- You do **not** need L3 everywhere. Tier-1 revenue paths deserve L2–L3;
  internal tools can sit at L1.

## The non-negotiables before you inject anything in prod

1. **Steady-state definition:** the metric that says "healthy" (p99 latency,
   success rate, orders/min). If you can't name it, you can't run the experiment.
2. **Stop conditions:** CloudWatch alarms that automatically halt the
   experiment when the steady-state metric breaches its threshold. FIS supports
   alarm-based stop conditions; use them.
3. **Tag-scoped targets:** never target "all instances." Scope targets to a
   resource tag, and configure the experiment to skip actions when no valid
   target is found (the `emptyTargetResolutionMode` experiment option).
4. **Blast-radius cap:** start with one AZ, one service, or a percentage of
   targets, not the whole fleet.
5. **A rollback you've tested:** the experiment should confirm that recovery
   works; it shouldn't be the first time you exercise the rollback.
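
As a rough illustration of items 2 to 4, the sketch below builds an experiment template as a plain dictionary shaped like the FIS `CreateExperimentTemplate` request. The helper name, the target and action names (`taggedInstances`, `stopInstances`), the `chaos-ready` tag, the ARNs, and the 25% default are all hypothetical; check field names and action IDs against the current FIS API reference before relying on them.

```python
import json


def build_experiment_template(alarm_arn, role_arn, tag_key, tag_value, percent=25):
    """Build keyword arguments shaped for FIS CreateExperimentTemplate.

    Every ARN and tag passed in is a caller-supplied placeholder.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "description": "Stop a slice of tagged instances; halt on alarm",
        "roleArn": role_arn,
        # Stop condition: FIS halts the experiment when this alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn},
        ],
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                # Tag-scoped: only instances carrying this tag are eligible.
                "resourceTags": {tag_key: tag_value},
                # Blast-radius cap: act on a percentage, not the whole fleet.
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the stopped instances after five minutes.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        "experimentOptions": {
            "accountTargeting": "single-account",
            # Skip the action instead of failing when no target resolves.
            "emptyTargetResolutionMode": "skip",
        },
    }


if __name__ == "__main__":
    template = build_experiment_template(
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:checkout-p99",
        role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
        tag_key="chaos-ready",
        tag_value="true",
    )
    print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could presumably be passed as `boto3.client("fis").create_experiment_template(**template)`; that call is left out here so the sketch runs without AWS access.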

## Next concrete step by level

- **L0 → L1:** run the FIS **AZ Availability: Power Interruption** scenario in a
  non-prod account against a tagged test workload. Watch what fails over and what
  doesn't.
- **L1 → L2:** add a CloudWatch-alarm stop condition, write the steady-state
  hypothesis, and put the experiment on an EventBridge Scheduler schedule.
- **L2 → L3:** move to multi-account targeting, add a cross-Region connectivity
  experiment for your DR story, and gate a pre-prod pipeline stage on a small FIS
  run. Track resilience targets (such as RTO and RPO) in AWS Resilience Hub.
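
One way to picture the pre-prod gate in the L2 → L3 step is a small function that turns an FIS experiment state into a pipeline decision. This is a sketch under assumptions: it expects a dictionary shaped like the `GetExperiment` response (`experiment.state.status`), the function name and the `wait`/`pass`/`fail` labels are invented for illustration, and any status not listed as in progress is treated as final.

```python
# Statuses that mean the experiment has not finished yet.
IN_PROGRESS = {"pending", "initiating", "running", "stopping"}


def gate_decision(response):
    """Map an FIS experiment state to a pipeline decision.

    Returns "wait" while the experiment is in progress, "pass" only when it
    completed, and "fail" for anything else (for example a run that was
    stopped by a stop condition, or one that failed to start).
    """
    status = response["experiment"]["state"]["status"]
    if status in IN_PROGRESS:
        return "wait"
    if status == "completed":
        return "pass"
    return "fail"


if __name__ == "__main__":
    # Hand-written sample responses, not real FIS output.
    samples = [
        {"experiment": {"state": {"status": "running"}}},
        {"experiment": {"state": {"status": "completed"}}},
        {"experiment": {"state": {"status": "stopped",
                                  "reason": "Stop condition triggered"}}},
    ]
    for sample in samples:
        print(sample["experiment"]["state"]["status"], "->", gate_decision(sample))
```

A pipeline stage could poll `get_experiment` until the decision is no longer `wait`, then fail the stage on `fail`. Treating `stopped` as a failure is deliberate here: a triggered stop condition means the steady state was breached, which is exactly what the gate should catch.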
