From One FIS Experiment to a Resilience Program (2026): AWS Fault Injection Service, Stop Conditions, and GameDays That Actually Change Behavior
Quick summary: Running one AWS FIS experiment in a demo account is not chaos engineering — it is a screenshot. A program ties experiments to SLOs, scopes blast radius with tags, halts on CloudWatch alarm stop conditions, schedules via EventBridge, and closes the loop by re-testing the fix. FIS now ships AZ Power Interruption and cross-Region connectivity scenarios in its Scenario Library. Here is the L0→L3 maturity matrix, a GameDay runbook, and a stop-condition-wired experiment skeleton.
Running one AWS FIS experiment in a sandbox account is not chaos engineering; it’s a screenshot for the slide deck. The gap between teams that “tried chaos engineering” and teams that get fewer 3 a.m. pages is not the tooling. It’s whether experiments are tied to SLOs, scoped safely, scheduled, and (the part almost everyone skips) re-run after the fix to prove the fix worked. As of mid-2026, AWS Fault Injection Service (FIS) ships a Scenario Library with pre-built AZ Availability: Power Interruption and cross-Region connectivity scenarios. Alongside it come CloudWatch alarm stop conditions, tag-scoped targeting, and EventBridge Scheduler integration: the safety rails that turn fault injection from reckless to routine. This post is about the program, not a single experiment.
This is for SRE, platform, and reliability owners who may have run a hands-on chaos tutorial and now want to make the practice organizational. We ship a resilience-program maturity matrix, a GameDay runbook template, and a stop-condition-wired FIS experiment skeleton.
Benchmark pattern (not a cited client): A composite multi-AZ platform believed it was AZ-resilient because the architecture diagram had three AZs. The first FIS AZ Power Interruption run in a non-prod clone (30-min interruption + 30-min recovery, tag-scoped) showed otherwise: the stateless tier rode through, but a singleton background worker pinned to one AZ stalled the job queue, and a self-managed cache had no cross-AZ failover. Neither was visible on the diagram. No dollar figure is attached; the value was finding two single-AZ dependencies before a real AZ event, turning a hypothetical “we’re multi-AZ” into a tested, then re-tested, claim.
The maturity ladder: where teams actually are
Score yourself honestly against the maturity matrix:
- L0 — Ad hoc: “let’s see what breaks.” No hypothesis, no stop conditions. (Most teams overestimate where they are.)
- L1 — Starting: one experiment, non-prod, a written steady-state metric. The screenshot stage.
- L2 — Routine: hypothesis tied to an SLO, CloudWatch stop conditions, tag-scoped prod targets, scheduled via EventBridge.
- L3 — Program: experiments derived from a risk register, multi-account/cross-Region scenarios, pre-prod pipeline gates, findings re-tested, executive sponsor, Resilience Hub targets.
Opinionated take: the only jump worth obsessing over is L1 → L2. A steady-state hypothesis, a stop condition, and a schedule are what convert a demo into a practice. Chasing L3 tooling while still at L1 discipline is how you get an impressive runbook nobody runs.
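To make the self-scoring less hand-wavy, the ladder can be sketched as a small checklist function. This is an illustration, not an official rubric: the field names are hypothetical, and the L3 check uses only a subset of the L3 criteria listed above. The one rule it encodes is that each level requires the one below it, so L3 tooling without L2 discipline still scores as L1.

```python
from dataclasses import dataclass


@dataclass
class Practices:
    """What one system actually does today. Field names are illustrative."""
    written_steady_state_metric: bool = False
    hypothesis_tied_to_slo: bool = False
    cloudwatch_stop_conditions: bool = False
    tag_scoped_targets: bool = False
    scheduled: bool = False
    risk_register_driven: bool = False
    findings_retested: bool = False
    pipeline_gates: bool = False


def maturity_level(p: Practices) -> str:
    """Return L0..L3; a level only counts if every lower level is satisfied."""
    l1 = p.written_steady_state_metric
    l2 = l1 and all([
        p.hypothesis_tied_to_slo,
        p.cloudwatch_stop_conditions,
        p.tag_scoped_targets,
        p.scheduled,
    ])
    l3 = l2 and all([
        p.risk_register_driven,
        p.findings_retested,
        p.pipeline_gates,
    ])
    if l3:
        return "L3"
    if l2:
        return "L2"
    if l1:
        return "L1"
    return "L0"


# L3-style tooling on top of L1 discipline still scores L1.
shiny_but_unscheduled = Practices(
    written_steady_state_metric=True,
    risk_register_driven=True,
    pipeline_gates=True,
)
print(maturity_level(shiny_but_unscheduled))
```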
The five non-negotiables before injecting in prod
You earn production chaos; you don’t start there.
- Steady-state definition: the metric that means healthy. Can’t name it? Don’t run it. (This is why observability is a prerequisite, not a nice-to-have.)
- Stop conditions: CloudWatch alarms wired into the FIS template that auto-halt the experiment on breach.
- Tag-scoped targets: never “all instances”. Scope to a tag, and set the template’s empty-target resolution mode to skip so the experiment does nothing when no valid target resolves.
- Blast-radius cap: one AZ, one service, or a percentage of targets, not the fleet.
- A tested rollback: the experiment validates recovery; it shouldn’t be recovery’s first rehearsal.
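Three of the five items above (stop conditions, tag scope, blast-radius cap) live in the experiment template itself, so they can be checked before anything runs. Below is a minimal sketch of a stop-condition-wired template expressed as a Python dict in the shape of the FIS `CreateExperimentTemplate` request, plus a small lint function. The tag key `chaos-ready`, the 50% selection, the 5-minute duration, and the function names are all assumptions for illustration. Steady state and rollback cannot be linted from a template; those remain human checks.

```python
def build_template(alarm_arn: str, role_arn: str, az: str) -> dict:
    """Sketch of an FIS template: stop half the tagged instances in one AZ."""
    return {
        "description": "Stop tagged instances in one AZ; hypothesis: SLO holds",
        "roleArn": role_arn,
        # Stop condition: FIS halts the experiment if this alarm goes to ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},  # hypothetical tag
                "filters": [
                    {"path": "Placement.AvailabilityZone", "values": [az]}
                ],
                "selectionMode": "PERCENT(50)",  # blast-radius cap
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        # Do nothing (rather than fail) when no tagged target resolves.
        "experimentOptions": {"emptyTargetResolutionMode": "skip"},
    }


def lint_template(template: dict) -> list[str]:
    """Return a list of violated non-negotiables; empty means it passed."""
    problems = []
    alarms = [
        s for s in template.get("stopConditions", [])
        if s.get("source") == "aws:cloudwatch:alarm" and s.get("value")
    ]
    if not alarms:
        problems.append("no CloudWatch alarm stop condition")
    for name, target in template.get("targets", {}).items():
        if not target.get("resourceTags"):
            problems.append(f"target {name!r} is not tag-scoped")
        if target.get("selectionMode", "ALL") == "ALL":
            problems.append(f"target {name!r} has no blast-radius cap")
    options = template.get("experimentOptions", {})
    if options.get("emptyTargetResolutionMode") != "skip":
        problems.append("empty target resolution is not set to skip")
    return problems
```

If the lint passes, a dict like this could be handed to boto3 as `boto3.client("fis").create_experiment_template(**template)`; confirm the current action IDs and parameters in the FIS docs before relying on it.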
What broke: A team scheduled a recurring FIS experiment via EventBridge Scheduler but reused the non-prod experiment template, whose CloudWatch stop condition pointed at a non-prod alarm that didn’t exist in the prod account. The schedule fired against prod, the steady-state metric breached, and because the stop-condition alarm reference was invalid, the experiment didn’t auto-halt as expected; the on-call aborted it manually ~6 minutes in. Blast radius was contained (tag scope held), but the lesson was sharp: a stop condition you haven’t verified resolves to a real, firing alarm in the target account is not a stop condition. They added a pre-check (now step 4 of the GameDay runbook) that ALARM-tests the stop condition before every run.
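A cheaper first line of defense against that failure is to confirm, before each run, that the stop-condition alarm ARN resolves in the account and Region the experiment targets. The sketch below is not the team's actual pre-check. It assumes a single-account experiment, takes the CloudWatch lookup as an injected callable so it can be exercised without AWS access, and treats any alarm state other than `OK` as a reason to pause, which is a policy choice rather than an FIS rule. The function name `check_stop_alarm` is hypothetical.

```python
from typing import Callable


def check_stop_alarm(
    alarm_arn: str,
    target_account: str,
    target_region: str,
    describe_alarms: Callable[..., dict],
) -> list[str]:
    """Return reasons not to run; an empty list means the alarm looks usable."""
    # arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    parts = alarm_arn.split(":", 6)
    if len(parts) != 7 or parts[2] != "cloudwatch" or parts[5] != "alarm":
        return [f"not a CloudWatch alarm ARN: {alarm_arn}"]
    region, account, name = parts[3], parts[4], parts[6]

    if account != target_account:
        return [f"alarm is in account {account}, not {target_account}"]
    if region != target_region:
        return [f"alarm is in region {region}, not {target_region}"]

    # Ask for both alarm types; by default only metric alarms are returned.
    resp = describe_alarms(
        AlarmNames=[name],
        AlarmTypes=["MetricAlarm", "CompositeAlarm"],
    )
    found = resp.get("MetricAlarms", []) + resp.get("CompositeAlarms", [])
    if not found:
        return [f"alarm {name!r} does not exist in the target account"]

    state = found[0].get("StateValue")
    if state != "OK":
        return [f"alarm {name!r} is in state {state}, expected OK before start"]
    return []
```

With boto3, the callable would be `boto3.client("cloudwatch", region_name=target_region).describe_alarms`. This only proves the alarm exists and is healthy; proving it actually fires still needs the ALARM test described above.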
What to do this week
- Score your top tier-1 system on the maturity matrix. Be honest about L0 vs L1.
- Run the AZ Power Interruption scenario in a non-prod clone against a tagged workload. Watch what doesn’t fail over.
- For one experiment, write the steady-state hypothesis and wire a CloudWatch stop condition — that’s your L1→L2 move.
- Schedule your first GameDay using the runbook template, and commit to a re-test date for every finding.
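The re-test commitment in the last step is the easiest one to let slip, so it helps to track it as data rather than as a line in meeting notes. Here is a minimal sketch of a findings log that flags anything past its re-test date. The `Finding` structure and the dates are hypothetical; the two example findings are borrowed from the benchmark pattern earlier in this post.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Finding:
    """One GameDay finding; it is closed only when a re-test has passed."""
    title: str
    found_on: date
    retest_due: date
    retest_passed_on: Optional[date] = None

    def is_closed(self) -> bool:
        return self.retest_passed_on is not None


def overdue(findings: list[Finding], today: date) -> list[Finding]:
    """Findings that are still open after their committed re-test date."""
    return [f for f in findings if not f.is_closed() and f.retest_due < today]


# Illustrative dates only.
findings = [
    Finding(
        title="Singleton worker pinned to one AZ stalls the job queue",
        found_on=date(2026, 6, 1),
        retest_due=date(2026, 6, 15),
        retest_passed_on=date(2026, 6, 12),
    ),
    Finding(
        title="Self-managed cache has no cross-AZ failover",
        found_on=date(2026, 6, 1),
        retest_due=date(2026, 6, 15),
    ),
]

for f in overdue(findings, today=date(2026, 7, 1)):
    print(f"OVERDUE re-test: {f.title} (due {f.retest_due})")
```

A fix that has shipped but has not been re-tested stays open here on purpose: the finding closes when the experiment passes, not when the ticket does.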
What this post doesn’t cover
- A hands-on FIS + OpenTelemetry tutorial — see the OTel demo game post.
- DR architecture choices (pilot light / warm standby / multi-site) — see disaster recovery strategies.
- Application-level resilience patterns (retries, circuit breakers, graceful shutdown) — see resilience: retries, circuits, graceful shutdown.
- Exact FIS actions, scenario contents, and pricing — confirm in the FIS docs; scenario behavior here is the mid-2026 model.
Related: OTel + chaos engineering tutorial · Disaster recovery strategies · Resilience patterns · SLA/SLO design · AWS managed services
If you only do one thing: Take one tier-1 service, write its steady-state hypothesis, wire a CloudWatch stop condition into an FIS experiment, and schedule it. That single L1→L2 step does more for reliability than any amount of unscheduled, hypothesis-free fault injection.
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.