What is the difference between running a chaos experiment and having a resilience program?

A single experiment proves one thing once — that a given fault did or did not break a given system on the day you ran it. A program makes resilience testing routine, safe, and tied to reliability targets: every experiment starts from a written steady-state hypothesis (the metric that defines healthy), runs with CloudWatch alarm stop conditions and tag-scoped blast radius, is scheduled rather than ad hoc, and — the part most teams skip — produces a tracked finding that gets fixed and then re-tested to confirm the fix holds. The maturity matrix in this post frames it as L0 (ad hoc) through L3 (program with executive sponsorship and Resilience Hub targets). Most teams that say they "do chaos engineering" are at L1: one experiment in a sandbox that never became routine. The jump that matters is L1 to L2.

How do I run a chaos experiment in production without causing an outage?

Never inject anything in production without five non-negotiables in place. First, a steady-state definition — the metric that says healthy (p99 latency, success rate, orders per minute); if you cannot name it you cannot run the experiment. Second, CloudWatch alarm stop conditions wired into the FIS experiment template so it auto-halts the moment steady state breaches. Third, tag-scoped targets — never target "all instances"; scope to a tag and configure the experiment to skip when no valid target is found. Fourth, a blast-radius cap — start with one AZ, one service, or a percentage of targets, not the whole fleet. Fifth, a rollback you have already tested. AWS FIS provides the stop-condition and tag-targeting mechanisms specifically so production experiments are controlled rather than reckless.
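As a rough sketch of how the stop condition, tag scope, skip-on-empty behavior, and percentage cap could come together, the function below builds a request body shaped for the FIS `create_experiment_template` API. The function name, the target and action labels (`taggedInstances`, `stopInstances`), the 25% default, and the five-minute restart delay are all illustrative assumptions, not values from this post.

```python
import uuid


def build_experiment_template(role_arn, stop_alarm_arn, tag_key, tag_value, percent=25):
    """Return a dict shaped like an FIS create_experiment_template request."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Stop a capped share of tagged EC2 instances",
        "roleArn": role_arn,
        # Stop condition: FIS halts the experiment when this alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                # Tag scope: only instances carrying this tag are eligible.
                "resourceTags": {tag_key: tag_value},
                # Blast-radius cap: a percentage of matches, not the fleet.
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                # Assumed value: restart the stopped instances after 5 minutes.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        "experimentOptions": {
            "accountTargeting": "single-account",
            # Skip instead of failing when no tagged target resolves.
            "emptyTargetResolutionMode": "skip",
        },
    }
```

With credentials in place, the result could be passed to `boto3.client("fis").create_experiment_template(**template)`. Building the dict separately keeps the safety settings reviewable in a pull request before anything touches an account.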

What is the AWS FIS AZ Availability: Power Interruption scenario?

It is a pre-built scenario in the AWS Fault Injection Service Scenario Library that simulates a complete power outage in a single Availability Zone so you can validate that a multi-AZ application rides through it. It replicates the expected symptoms of a zonal power loss — EC2/EKS/ECS compute loss, blocked instance provisioning, subnet connectivity loss, RDS and ElastiCache failovers, impaired S3 Express One Zone access, and unresponsive EBS volumes — and by default injects those symptoms for 30 minutes followed by a 30-minute recovery phase, targeting resources by tag and skipping where no valid target exists. There is also a cross-Region connectivity scenario for testing disaster-recovery posture. Using a managed scenario from the library is preferable to hand-building the equivalent fault set, because it covers the full symptom surface you would otherwise forget pieces of.

When should we NOT do chaos engineering?

Do not run fault-injection experiments before you have observability good enough to see the steady-state metric and detect when it breaches — injecting faults blind teaches you nothing and risks an outage you cannot even measure. Do not run in production before you have rehearsed in non-production and wired stop conditions. Do not target a tier you cannot afford to degrade during business hours without first proving the experiment in a low-traffic window with a tight blast radius. And do not apply L3 program rigor uniformly: tier-1 revenue paths justify scheduled production experiments, but an internal admin tool may reasonably sit at L1 forever. Chaos engineering is a tool for systems whose resilience claims you need to verify — not a box every workload must tick.

How does AWS FIS differ from just terminating instances with a script?

A homegrown "kill a random instance" script gives you the fault but none of the safety or breadth. AWS Fault Injection Service is a managed service that adds the things that make fault injection safe to run against real workloads: CloudWatch alarm-based stop conditions that auto-halt the experiment, tag-based targeting with skip-when-no-target behavior, a Scenario Library of pre-built real-world scenarios (AZ power interruption, cross-Region connectivity loss) that cover multi-service symptom sets, single- and multi-account targeting, IAM-scoped permissions, full experiment visibility, and EventBridge Scheduler integration for recurring runs. Many actions are agentless, though instance-level faults like CPU or memory stress require the SSM agent. The script gives you chaos; FIS gives you controlled chaos with a stop button — which is the entire point.

How do GameDays fit into a resilience program?

A GameDay is a scheduled, facilitated failure rehearsal where the team runs a controlled experiment together, watches the system respond, and captures what they learn. It is where a resilience program becomes a team habit rather than one engineer's side project: you announce the window, define the steady-state hypothesis, run the FIS experiment with stop conditions, log observations live, and — crucially — convert findings into tracked backlog items with a re-test date. A GameDay without a tracked finding and a re-test to confirm the fix is theater. The runbook template in this post structures it: experiment, hypothesis, blast radius and safety, pre-checks, live run log, result, and findings-to-backlog. Run them on a cadence (quarterly for tier-1 systems is common) and use them to validate both the technology and the human response.
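The runbook sections listed above can be modeled as a small record, which makes the "tracked finding plus re-test date" rule checkable rather than aspirational. This is a minimal sketch: the class and field names are hypothetical, and it assumes a GameDay counts as closed out once a result is recorded and every finding has both a backlog ticket and a re-test date.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Finding:
    summary: str
    ticket: Optional[str] = None       # backlog item id (format is an assumption)
    retest_on: Optional[date] = None   # date the fix will be re-tested

    def is_tracked(self) -> bool:
        return bool(self.ticket) and self.retest_on is not None


@dataclass
class GameDayRunbook:
    experiment: str
    hypothesis: str
    blast_radius_and_safety: str
    pre_checks: List[str] = field(default_factory=list)
    run_log: List[str] = field(default_factory=list)
    result: str = ""
    findings: List[Finding] = field(default_factory=list)

    def untracked_findings(self) -> List[Finding]:
        """Findings that lack a ticket or a re-test date."""
        return [f for f in self.findings if not f.is_tracked()]

    def is_closed_out(self) -> bool:
        """True once a result is recorded and every finding is tracked."""
        return bool(self.result) and not self.untracked_findings()
```

A facilitator could call `untracked_findings()` at the end of the session and refuse to close the GameDay until the list is empty.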

AWS Chaos Engineering 2026: FIS, GameDays, Stop Conditions

From One FIS Experiment to a Resilience Program (2026): AWS Fault Injection Service, Stop Conditions, and GameDays That Actually Change Behavior

Quick summary: Running one AWS FIS experiment in a demo account is not chaos engineering; it is a screenshot. A program ties experiments to SLOs, scopes blast radius with tags, halts on CloudWatch alarm stop conditions, schedules runs via EventBridge Scheduler, and closes the loop by re-testing the fix. FIS now ships AZ Availability: Power Interruption and cross-Region connectivity scenarios in its Scenario Library. Here is the L0→L3 maturity matrix, a GameDay runbook, and a stop-condition-wired experiment skeleton.


Running one AWS FIS experiment in a sandbox account is not chaos engineering — it’s a screenshot for the slide deck. The gap between teams that “tried chaos engineering” and teams that get fewer 3 a.m. pages is not the tooling; it’s whether experiments are tied to SLOs, scoped safely, scheduled, and — the part almost everyone skips — re-run after the fix to prove the fix worked. As of mid-2026, AWS Fault Injection Service (FIS) ships a Scenario Library with pre-built AZ Availability: Power Interruption and cross-Region connectivity scenarios, CloudWatch alarm stop conditions, tag-scoped targeting, and EventBridge Scheduler integration — the safety rails that turn fault injection from reckless to routine. This post is about the program, not a single experiment.

Symptom → mechanism → AWS control

Production symptom	Mechanism	AWS control
Failover SLO unknown until prod incident	No controlled failure injection	FIS AZ impairment, RDS failover experiment template
GameDay causes customer outage	Missing stop conditions	FIS CloudWatch alarm stop condition
Experiments don’t change behavior	One-off heroics, no tracking	Experiment registry, remediation tickets in backlog

Opinionated take: Run FIS experiments in production quarterly, always with CloudWatch alarm stop conditions. One-off chaos without stop conditions is negligence, not engineering.

This is for SRE, platform, and reliability owners who may have run a hands-on chaos tutorial and now want to make the practice organizational. We ship a resilience-program maturity matrix, a GameDay runbook template, and a stop-condition-wired FIS experiment skeleton.

Benchmark pattern (not a cited client) — A composite multi-AZ platform that believed it was AZ-resilient because the architecture diagram had three AZs. First FIS AZ Power Interruption run in a non-prod clone (30-min interruption + 30-min recovery, tag-scoped): the stateless tier rode through, but a singleton background worker pinned to one AZ stalled the job queue, and a self-managed cache had no cross-AZ failover. Neither was visible on the diagram. No dollar figure — the value was finding two single-AZ dependencies before a real AZ event, turning a hypothetical “we’re multi-AZ” into a tested, then re-tested, claim.
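Dependencies like these can sometimes be flagged before the experiment if you keep even a crude placement inventory. The sketch below assumes a hypothetical mapping from component name to the set of AZs it runs in; how you build that mapping (tags, an asset database, infrastructure code) is outside this post.

```python
from typing import Dict, List, Set


def single_az_dependencies(placement: Dict[str, Set[str]]) -> List[str]:
    """Return components that run in fewer than two Availability Zones."""
    return sorted(name for name, azs in placement.items() if len(azs) < 2)
```

An inventory check is only a hint. The experiment remains the proof, because a component can span three AZs on paper and still fail to take over when one of them goes dark.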

The maturity ladder: where teams actually are

Score yourself honestly against the maturity matrix:

L0 — Ad hoc: “let’s see what breaks.” No hypothesis, no stop conditions. (Most teams overestimate where they are.)
L1 — Starting: one experiment, non-prod, a written steady-state metric. The screenshot stage.
L2 — Routine: hypothesis tied to an SLO, CloudWatch stop conditions, tag-scoped prod targets, scheduled via EventBridge.
L3 — Program: experiments derived from a risk register, multi-account/cross-Region scenarios, pre-prod pipeline gates, findings re-tested, executive sponsor, Resilience Hub targets.
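To make the self-scoring less subjective, the ladder can be reduced to a checklist. The sketch below is a simplification under stated assumptions: the `Practices` fields are hypothetical names, a level is only granted when every practice below it is in place, and the L3 check omits multi-account scenarios, pipeline gates, and Resilience Hub targets for brevity.

```python
from dataclasses import dataclass


@dataclass
class Practices:
    written_steady_state: bool = False
    slo_tied_hypothesis: bool = False
    alarm_stop_conditions: bool = False
    tag_scoped_prod_targets: bool = False
    scheduled: bool = False
    risk_register_driven: bool = False
    findings_retested: bool = False
    executive_sponsor: bool = False


def maturity_level(p: Practices) -> str:
    """Map a set of practices to L0..L3, requiring each lower level first."""
    if not p.written_steady_state:
        return "L0"
    routine = (
        p.slo_tied_hypothesis
        and p.alarm_stop_conditions
        and p.tag_scoped_prod_targets
        and p.scheduled
    )
    if not routine:
        return "L1"
    program = p.risk_register_driven and p.findings_retested and p.executive_sponsor
    return "L3" if program else "L2"
```

Because levels are cumulative here, a team with an executive sponsor but no stop conditions still scores L1, which matches the point that L3 tooling does not substitute for L2 discipline.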

Opinionated take: the only jump worth obsessing over is L1 → L2. A steady-state hypothesis, a stop condition, and a schedule are what convert a demo into a practice. Chasing L3 tooling while still at L1 discipline is how you get an impressive runbook nobody runs.

The five non-negotiables before injecting in prod

You earn production chaos; you don’t start there.

Steady-state definition — the metric that means healthy. Can’t name it? Don’t run it. (This is why observability is a prerequisite, not a nice-to-have.)
Stop conditions — CloudWatch alarms wired into the FIS template that auto-halt on breach.
Tag-scoped targets — never “all instances”; scope to a tag, skip when no valid target.
Blast-radius cap — one AZ / one service / a percentage, not the fleet.
A tested rollback — the experiment validates recovery; it shouldn’t be recovery’s first rehearsal.
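The five items above can be enforced as a gate that runs before any production experiment starts. This is a minimal sketch with hypothetical names: `ExperimentPlan` is an illustrative record rather than an FIS type, and the checks only confirm that each item has been filled in, not that it is correct.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExperimentPlan:
    steady_state_metric: Optional[str] = None     # e.g. "checkout success rate"
    stop_alarm_arns: List[str] = field(default_factory=list)
    target_tags: Dict[str, str] = field(default_factory=dict)
    blast_radius_cap: Optional[str] = None        # e.g. "one AZ", "25% of targets"
    rollback_tested: bool = False


def preflight_failures(plan: ExperimentPlan) -> List[str]:
    """Return the non-negotiables this plan is still missing."""
    checks = [
        (bool(plan.steady_state_metric), "steady-state metric not named"),
        (bool(plan.stop_alarm_arns), "no CloudWatch alarm stop condition"),
        (bool(plan.target_tags), "targets are not tag-scoped"),
        (bool(plan.blast_radius_cap), "blast radius is not capped"),
        (plan.rollback_tested, "rollback has not been tested"),
    ]
    return [message for ok, message in checks if not ok]
```

An empty list means the plan may proceed; anything else is a reason to stay in non-prod.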

What broke: A team scheduled a recurring FIS experiment via EventBridge Scheduler but reused the non-prod experiment template, whose CloudWatch stop condition pointed at a non-prod alarm that didn’t exist in the prod account. The schedule fired against prod, steady state breached, and because the stop-condition alarm reference was invalid, the experiment didn’t auto-halt as expected; the on-call aborted it manually about 6 minutes in. Blast radius was contained (the tag scope held), but the lesson was sharp: a stop condition you haven’t verified resolves to a real alarm, one that can actually fire, in the target account is not a stop condition. They added a pre-check (now step 4 of the GameDay runbook) that confirms the stop-condition alarm exists in the target account and can reach the ALARM state before every run.
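A pre-check along these lines might look like the sketch below. It parses the stop-condition alarm ARN, compares its account and Region with the experiment's target, and then asks an injected `alarm_exists` callable whether the alarm is really there. The function names and the callable are assumptions for illustration, not the team's actual script.

```python
from typing import Callable, Dict, List


def parse_alarm_arn(arn: str) -> Dict[str, str]:
    """Split arn:aws:cloudwatch:<region>:<account>:alarm:<name> into parts."""
    parts = arn.split(":", 6)  # maxsplit keeps colons inside the alarm name
    if len(parts) != 7 or parts[2] != "cloudwatch" or parts[5] != "alarm":
        raise ValueError(f"not a CloudWatch alarm ARN: {arn}")
    return {"region": parts[3], "account": parts[4], "name": parts[6]}


def stop_condition_problems(
    alarm_arn: str,
    target_account: str,
    target_region: str,
    alarm_exists: Callable[[str], bool],
) -> List[str]:
    """Return reasons this alarm cannot act as a stop condition for the target."""
    info = parse_alarm_arn(alarm_arn)
    problems = []
    if info["account"] != target_account:
        problems.append(f"alarm is in account {info['account']}, not {target_account}")
    if info["region"] != target_region:
        problems.append(f"alarm is in region {info['region']}, not {target_region}")
    # Only look the alarm up once we know we are asking the right account and Region.
    if not problems and not alarm_exists(info["name"]):
        problems.append(f"alarm {info['name']} not found in target account")
    return problems
```

In a real run, `alarm_exists` could wrap boto3's CloudWatch `describe_alarms(AlarmNames=[name])` and check for a non-empty `MetricAlarms` list. To rehearse the halt itself, a non-prod run could temporarily force the alarm with `set_alarm_state` and confirm the experiment stops.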

AWS services map

Need	Service	Skip when
Controlled fault injection	AWS FIS	Pre-production only, no prod GameDays
Blast radius containment	FIS experiment scope + IAM	Full-region untargeted chaos
Observability during experiments	CloudWatch alarms as stop conditions, plus dashboards	No metrics on experiment path

What to do this week

Score your top tier-1 system on the maturity matrix. Be honest about L0 vs L1.
Run the AZ Power Interruption scenario in a non-prod clone against a tagged workload. Watch what doesn’t fail over.
For one experiment, write the steady-state hypothesis and wire a CloudWatch stop condition — that’s your L1→L2 move.
Schedule your first GameDay using the runbook template, and commit to a re-test date for every finding.

What this post doesn’t cover

A hands-on FIS + OpenTelemetry tutorial — see the OTel demo game post.
DR architecture choices (pilot light / warm standby / multi-site) — see disaster recovery strategies.
Application-level resilience patterns (retries, circuit breakers, graceful shutdown) — see resilience: retries, circuits, graceful shutdown.
Exact FIS actions, scenario contents, and pricing: confirm these in the FIS docs; the scenario behavior described here reflects FIS as of mid-2026.

If you only do one thing: Take one tier-1 service, write its steady-state hypothesis, wire a CloudWatch stop condition into an FIS experiment, and schedule it. That single L1→L2 step does more for reliability than any amount of unscheduled, hypothesis-free fault injection.
