---
title: From One FIS Experiment to a Resilience Program (2026): AWS Fault Injection Service, Stop Conditions, and GameDays That Actually Change Behavior
description: Running one AWS FIS experiment in a demo account is not chaos engineering — it is a screenshot. A program ties experiments to SLOs, scopes blast radius with tags, halts on CloudWatch alarm stop conditions, schedules via EventBridge, and closes the loop by re-testing the fix. FIS now ships AZ Power Interruption and cross-Region connectivity scenarios in its Scenario Library. Here is the L0→L3 maturity matrix, a GameDay runbook, and a stop-condition-wired experiment skeleton.
url: https://www.factualminds.com/blog/aws-chaos-engineering-resilience-program-fis-2026/
datePublished: 2026-06-10T00:00:00.000Z
dateModified: 2026-06-10T00:00:00.000Z
author: Palaniappan P
category: DevOps & CI/CD
tags: aws, chaos-engineering, resilience, aws-fis, reliability
---

# From One FIS Experiment to a Resilience Program (2026): AWS Fault Injection Service, Stop Conditions, and GameDays That Actually Change Behavior

> Running one AWS FIS experiment in a demo account is not chaos engineering — it is a screenshot. A program ties experiments to SLOs, scopes blast radius with tags, halts on CloudWatch alarm stop conditions, schedules via EventBridge, and closes the loop by re-testing the fix. FIS now ships AZ Power Interruption and cross-Region connectivity scenarios in its Scenario Library. Here is the L0→L3 maturity matrix, a GameDay runbook, and a stop-condition-wired experiment skeleton.

**Running one AWS FIS experiment in a sandbox account is not chaos engineering — it's a screenshot for the slide deck.** The gap between teams that "tried chaos engineering" and teams that get fewer 3 a.m. pages is not the tooling; it's whether experiments are tied to SLOs, scoped safely, scheduled, and — the part almost everyone skips — re-run after the fix to prove the fix worked. As of mid-2026, **AWS Fault Injection Service (FIS)** ships a **Scenario Library** with pre-built **AZ Availability: Power Interruption** and **cross-Region connectivity** scenarios, **CloudWatch alarm stop conditions**, tag-scoped targeting, and **EventBridge Scheduler** integration — the safety rails that turn fault injection from reckless to routine. This post is about the _program_, not a single experiment.

This is for SRE, platform, and reliability owners who've maybe run a [hands-on chaos tutorial](/blog/otel-demo-game-aws-observability-chaos-engineering/) and want to make it organizational. We ship a [resilience-program maturity matrix](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/chaos-resilience-program/resilience-program-maturity-matrix.md), a [GameDay runbook template](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/chaos-resilience-program/gameday-runbook-template.md), and a [stop-condition-wired FIS experiment skeleton](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/chaos-resilience-program/fis-az-power-interruption-experiment.json).

> **Benchmark pattern (not a cited client)** — A composite multi-AZ platform that _believed_ it was AZ-resilient because the architecture diagram had three AZs. First FIS AZ Power Interruption run in a non-prod clone (30-min interruption + 30-min recovery, tag-scoped): the stateless tier rode through, but a singleton background worker pinned to one AZ stalled the job queue, and a self-managed cache had no cross-AZ failover. Neither was visible on the diagram. No dollar figure — the value was finding two single-AZ dependencies _before_ a real AZ event, turning a hypothetical "we're multi-AZ" into a tested, then re-tested, claim.

## The maturity ladder: where teams actually are

Score yourself honestly against the [maturity matrix](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/chaos-resilience-program/resilience-program-maturity-matrix.md):

- **L0 — Ad hoc:** "let's see what breaks." No hypothesis, no stop conditions. (Most teams overestimate where they are.)
- **L1 — Starting:** one experiment, non-prod, a written steady-state metric. The screenshot stage.
- **L2 — Routine:** hypothesis tied to an SLO, **CloudWatch stop conditions**, tag-scoped prod targets, **scheduled** via EventBridge.
- **L3 — Program:** experiments derived from a risk register, multi-account/cross-Region scenarios, pre-prod pipeline gates, findings re-tested, executive sponsor, [Resilience Hub](/blog/aws-disaster-recovery-strategies-pilot-light-warm-standby-multi-site/) targets.
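
One way to keep the self-scoring honest is to write the ladder down as data and let a function do the grading. The sketch below is illustrative only: the practice names are hypothetical labels paraphrased from the list above, not fields from the linked matrix, and the rule that a level only counts once every lower level is satisfied is an assumption of this example.

```python
# Hypothetical practice labels, paraphrased from the ladder above.
# L0 has no requirements: it is where you land when L1 is not met.
LEVEL_REQUIREMENTS = {
    1: {"non_prod_experiment", "written_steady_state_metric"},
    2: {"slo_tied_hypothesis", "cloudwatch_stop_conditions",
        "tag_scoped_prod_targets", "scheduled_runs"},
    3: {"risk_register_driven", "cross_region_scenarios",
        "pipeline_gates", "findings_retested", "executive_sponsor"},
}


def maturity_level(practices: set) -> int:
    """Highest level whose requirements, and all lower levels', are met."""
    level = 0
    for candidate in sorted(LEVEL_REQUIREMENTS):
        if LEVEL_REQUIREMENTS[candidate] <= practices:
            level = candidate
        else:
            break  # a gap at this level caps the score, whatever sits above it
    return level


def missing_for_next_level(practices: set) -> list:
    """Practices still needed to reach the next level (empty at the top)."""
    next_level = maturity_level(practices) + 1
    return sorted(LEVEL_REQUIREMENTS.get(next_level, set()) - practices)


if __name__ == "__main__":
    team = {"non_prod_experiment", "written_steady_state_metric",
            "executive_sponsor", "cloudwatch_stop_conditions"}
    print(maturity_level(team))
    print(missing_for_next_level(team))
```

The early `break` is the point of the sketch: an executive sponsor or a pipeline gate does not raise the score while an L2 practice is still missing.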

**Opinionated take:** the only jump worth obsessing over is **L1 → L2**. A steady-state hypothesis, a stop condition, and a schedule are what convert a demo into a practice. Chasing L3 tooling while still at L1 discipline is how you get an impressive runbook nobody runs.

## The five non-negotiables before injecting in prod

You earn production chaos; you don't start there.

1. **Steady-state definition** — the metric that means healthy. Can't name it? Don't run it. (This is why [observability](/blog/aws-cloudwatch-observability-metrics-logs-alarms-best-practices/) is a prerequisite, not a nice-to-have.)
2. **Stop conditions** — CloudWatch alarms wired into the FIS template that auto-halt on breach.
3. **Tag-scoped targets** — never "all instances"; scope to a tag, skip when no valid target.
4. **Blast-radius cap** — one AZ / one service / a percentage, not the fleet.
5. **A tested rollback** — the experiment validates recovery; it shouldn't be recovery's first rehearsal.
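
Three of the five items above (stop conditions, tag-scoped targets, blast-radius cap) show up directly in the FIS experiment template, so they can be linted before anything runs. The sketch below builds a small template as a Python dict and checks it. It is not the linked experiment skeleton: the action (`aws:ec2:stop-instances`), the ARNs, the tag, and the 25% cap are placeholder assumptions, and flagging `selectionMode: ALL` is a deliberately strict choice of this example rather than an FIS rule.

```python
import json


def build_template(alarm_arn, role_arn, tag_key, tag_value):
    """Minimal FIS experiment template: stop a capped share of tagged instances."""
    return {
        "description": "Stop 25% of tagged instances; halt if the SLO alarm fires",
        "roleArn": role_arn,
        # Non-negotiable 2: a CloudWatch alarm that halts the experiment.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                # Non-negotiable 3: scope by tag, never by "everything".
                "resourceTags": {tag_key: tag_value},
                # Non-negotiable 4: cap the blast radius.
                "selectionMode": "PERCENT(25)",
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT10M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        "experimentOptions": {
            "accountTargeting": "single-account",
            # Skip, rather than fail, when the tag matches nothing.
            "emptyTargetResolutionMode": "skip",
        },
    }


def lint_template(template):
    """Return a list of safety problems; an empty list means the checks passed."""
    problems = []
    alarms = [c for c in template.get("stopConditions", [])
              if c.get("source") == "aws:cloudwatch:alarm" and c.get("value")]
    if not alarms:
        problems.append("no CloudWatch alarm stop condition")
    for name, target in template.get("targets", {}).items():
        if not target.get("resourceTags"):
            problems.append(f"target {name!r} is not tag-scoped")
        if target.get("selectionMode", "ALL") == "ALL":
            problems.append(f"target {name!r} has no count or percentage cap")
    options = template.get("experimentOptions", {})
    if options.get("emptyTargetResolutionMode") != "skip":
        problems.append("empty targets are not skipped")
    return problems


if __name__ == "__main__":
    template = build_template(
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:checkout-p99",
        role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
        tag_key="chaos-ready", tag_value="true",
    )
    print(json.dumps(template, indent=2))
    print(lint_template(template))
```

The steady-state definition and the tested rollback are not visible in a template, so no lint can vouch for them; those two stay human checks in the runbook.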

> **What broke** — A team scheduled a recurring FIS experiment via EventBridge Scheduler but reused the _non-prod_ experiment template, whose CloudWatch stop condition referenced a non-prod alarm that didn't exist in the prod account. The schedule fired against prod and steady state breached, but because the alarm reference was invalid the experiment didn't auto-halt as expected; the on-call aborted it manually ~6 minutes in. Blast radius was contained (tag scope held), but the lesson was sharp: a stop condition is not a stop condition until you have verified that it resolves to a real alarm, one that can actually fire, in the _target_ account. They added a pre-check (now step 4 of the GameDay runbook) that ALARM-tests the stop condition before every run.
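
A cheap first layer of that pre-check can run offline: compare the account and Region embedded in each stop-condition alarm ARN with the account and Region the experiment is about to run in. The sketch below is an assumption about how such a check might look, not the runbook's actual step 4; the function names and the example ARNs are hypothetical.

```python
def parse_alarm_arn(arn):
    """Split arn:<partition>:cloudwatch:<region>:<account>:alarm:<name>."""
    parts = arn.split(":", 6)  # the alarm name itself may contain colons
    if (len(parts) != 7 or parts[0] != "arn"
            or parts[2] != "cloudwatch" or parts[5] != "alarm"):
        raise ValueError(f"not a CloudWatch alarm ARN: {arn!r}")
    return {"region": parts[3], "account": parts[4], "name": parts[6]}


def stop_condition_mismatches(template, account_id, region):
    """Return stop-condition alarm ARNs that point outside the target account/Region."""
    mismatches = []
    for condition in template.get("stopConditions", []):
        if condition.get("source") != "aws:cloudwatch:alarm":
            continue
        alarm = parse_alarm_arn(condition["value"])
        if (alarm["account"], alarm["region"]) != (account_id, region):
            mismatches.append(condition["value"])
    return mismatches


if __name__ == "__main__":
    # A non-prod template about to be run in prod (account IDs are made up).
    reused = {"stopConditions": [{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:checkout-p99",
    }]}
    print(stop_condition_mismatches(reused, "444455556666", "us-east-1"))
```

This only inspects the ARN string. It does not prove the alarm exists or can reach the ALARM state; that still needs a call against CloudWatch in the target account (for example `DescribeAlarms`) plus the ALARM test the runbook describes.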

## What to do this week

1. Score your top tier-1 system on the [maturity matrix](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/chaos-resilience-program/resilience-program-maturity-matrix.md). Be honest about L0 vs L1.
2. Run the **AZ Power Interruption** scenario in a **non-prod** clone against a tagged workload. Watch what _doesn't_ fail over.
3. For one experiment, write the **steady-state hypothesis** and wire a **CloudWatch stop condition** — that's your L1→L2 move.
4. Schedule your first **GameDay** using the [runbook template](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/chaos-resilience-program/gameday-runbook-template.md), and commit to a **re-test date** for every finding.
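
For step 3, it helps to treat the steady-state hypothesis and the stop-condition alarm as one definition, so the sentence in the runbook and the alarm in CloudWatch cannot drift apart. The sketch below is a minimal illustration under assumptions: the `SteadyState` class, the `fis-stop-` naming prefix, the load balancer dimension value, and the 0.5 s threshold are all hypothetical. The returned dict uses CloudWatch `PutMetricAlarm` parameter names and could be passed to a client such as boto3's `put_metric_alarm(**kwargs)`; nothing here calls AWS.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SteadyState:
    name: str
    namespace: str
    metric_name: str
    statistic: str              # "p99", "Average", "Sum", ...
    threshold: float
    breach_when: str = "above"  # "above" for latency, "below" for success rate
    dimensions: dict = field(default_factory=dict)

    def hypothesis(self):
        side = "at or below" if self.breach_when == "above" else "at or above"
        return (f"During the experiment, {self.statistic} {self.metric_name} "
                f"stays {side} {self.threshold}.")


def alarm_kwargs(steady_state, period=60, evaluation_periods=3):
    """Build PutMetricAlarm parameters for the stop-condition alarm."""
    operator = ("GreaterThanThreshold" if steady_state.breach_when == "above"
                else "LessThanThreshold")
    kwargs = {
        "AlarmName": f"fis-stop-{steady_state.name}",
        "Namespace": steady_state.namespace,
        "MetricName": steady_state.metric_name,
        "Dimensions": [{"Name": k, "Value": v}
                       for k, v in sorted(steady_state.dimensions.items())],
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": steady_state.threshold,
        "ComparisonOperator": operator,
        # Design choice: if the metric goes silent mid-experiment, halt.
        "TreatMissingData": "breaching",
    }
    # Percentiles go in ExtendedStatistic; Average, Sum, etc. go in Statistic.
    if steady_state.statistic.startswith("p"):
        kwargs["ExtendedStatistic"] = steady_state.statistic
    else:
        kwargs["Statistic"] = steady_state.statistic
    return kwargs


if __name__ == "__main__":
    latency = SteadyState(
        name="checkout-p99", namespace="AWS/ApplicationELB",
        metric_name="TargetResponseTime", statistic="p99", threshold=0.5,
        dimensions={"LoadBalancer": "app/checkout/0123456789abcdef"},
    )
    print(latency.hypothesis())
    print(alarm_kwargs(latency))
```

Treating missing data as breaching is the conservative option for a stop condition: an injected fault that also silences the metric should stop the experiment rather than leave it running blind.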

## What this post doesn't cover

- **A hands-on FIS + OpenTelemetry tutorial** — see [the OTel demo game post](/blog/otel-demo-game-aws-observability-chaos-engineering/).
- **DR architecture choices** (pilot light / warm standby / multi-site) — see [disaster recovery strategies](/blog/aws-disaster-recovery-strategies-pilot-light-warm-standby-multi-site/).
- **Application-level resilience patterns** (retries, circuit breakers, graceful shutdown) — see [resilience: retries, circuits, graceful shutdown](/blog/aws-resilience-retries-circuits-graceful-shutdown/).
- **Exact FIS actions, scenario contents, and pricing** — confirm in the FIS docs; scenario behavior here is the mid-2026 model.

---

**Related:** [OTel + chaos engineering tutorial](/blog/otel-demo-game-aws-observability-chaos-engineering/) · [Disaster recovery strategies](/blog/aws-disaster-recovery-strategies-pilot-light-warm-standby-multi-site/) · [Resilience patterns](/blog/aws-resilience-retries-circuits-graceful-shutdown/) · [SLA/SLO design](/blog/customer-facing-sla-slo-design-aws/) · [AWS managed services](/services/aws-managed-services/)

**If you only do one thing:** Take one tier-1 service, write its steady-state hypothesis, wire a CloudWatch stop condition into an FIS experiment, and schedule it. That single L1→L2 step does more for reliability than any amount of unscheduled, hypothesis-free fault injection.

## FAQ

### What is the difference between running a chaos experiment and having a resilience program?
A single experiment proves one thing once — that a given fault did or did not break a given system on the day you ran it. A program makes resilience testing routine, safe, and tied to reliability targets: every experiment starts from a written steady-state hypothesis (the metric that defines healthy), runs with CloudWatch alarm stop conditions and tag-scoped blast radius, is scheduled rather than ad hoc, and — the part most teams skip — produces a tracked finding that gets fixed and then re-tested to confirm the fix holds. The maturity matrix in this post frames it as L0 (ad hoc) through L3 (program with executive sponsorship and Resilience Hub targets). Most teams that say they "do chaos engineering" are at L1: one experiment in a sandbox that never became routine. The jump that matters is L1 to L2.

### How do I run a chaos experiment in production without causing an outage?
Never inject anything in production without five non-negotiables in place. First, a steady-state definition — the metric that says healthy (p99 latency, success rate, orders per minute); if you cannot name it you cannot run the experiment. Second, CloudWatch alarm stop conditions wired into the FIS experiment template so it auto-halts the moment steady state breaches. Third, tag-scoped targets — never target "all instances"; scope to a tag and configure the experiment to skip when no valid target is found. Fourth, a blast-radius cap — start with one AZ, one service, or a percentage of targets, not the whole fleet. Fifth, a rollback you have already tested. AWS FIS provides the stop-condition and tag-targeting mechanisms specifically so production experiments are controlled rather than reckless.

### What is the AWS FIS AZ Availability: Power Interruption scenario?
It is a pre-built scenario in the AWS Fault Injection Service Scenario Library that simulates a complete power outage in a single Availability Zone so you can validate that a multi-AZ application rides through it. It replicates the expected symptoms of a zonal power loss (EC2/EKS/ECS compute loss, blocked instance provisioning, subnet connectivity loss, RDS and ElastiCache failovers, impaired S3 Express One Zone access, and unresponsive EBS volumes). By default it injects those symptoms for 30 minutes, followed by a 30-minute recovery phase, targeting resources by tag and skipping where no valid target exists. There is also a cross-Region connectivity scenario for testing disaster-recovery posture. Using a managed scenario from the library is preferable to hand-building the equivalent fault set, because it covers the full set of symptoms, and a hand-built experiment tends to miss some of them.

### When should we NOT do chaos engineering?
Do not run fault-injection experiments before you have observability good enough to see the steady-state metric and detect when it breaches — injecting faults blind teaches you nothing and risks an outage you cannot even measure. Do not run in production before you have rehearsed in non-production and wired stop conditions. Do not target a tier you cannot afford to degrade during business hours without first proving the experiment in a low-traffic window with a tight blast radius. And do not apply L3 program rigor uniformly: tier-1 revenue paths justify scheduled production experiments, but an internal admin tool may reasonably sit at L1 forever. Chaos engineering is a tool for systems whose resilience claims you need to verify — not a box every workload must tick.

### How does AWS FIS differ from just terminating instances with a script?
A homegrown "kill a random instance" script gives you the fault but none of the safety or breadth. AWS Fault Injection Service is a managed service that adds the things that make fault injection safe to run against real workloads: CloudWatch alarm-based stop conditions that auto-halt the experiment, tag-based targeting with skip-when-no-target behavior, a Scenario Library of pre-built real-world scenarios (AZ power interruption, cross-Region connectivity loss) that cover multi-service symptom sets, single- and multi-account targeting, IAM-scoped permissions, full experiment visibility, and EventBridge Scheduler integration for recurring runs. Many actions are agentless, though instance-level faults like CPU or memory stress require the SSM agent. The script gives you chaos; FIS gives you controlled chaos with a stop button — which is the entire point.

### How do GameDays fit into a resilience program?
A GameDay is a scheduled, facilitated failure rehearsal where the team runs a controlled experiment together, watches the system respond, and captures what they learn. It is where a resilience program becomes a team habit rather than one engineer's side project: you announce the window, define the steady-state hypothesis, run the FIS experiment with stop conditions, log observations live, and — crucially — convert findings into tracked backlog items with a re-test date. A GameDay without a tracked finding and a re-test to confirm the fix is theater. The runbook template in this post structures it: experiment, hypothesis, blast radius and safety, pre-checks, live run log, result, and findings-to-backlog. Run them on a cadence (quarterly for tier-1 systems is common) and use them to validate both the technology and the human response.
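
The "finding, fix, re-test" loop described above is easy to let slip, so it can help to make open loops something a script can list. The sketch below is a hypothetical tracker, not part of the runbook template: the `Finding` fields, the ticket ID format, and the rule that a finding is closed only once its re-test has passed are assumptions of this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Finding:
    title: str
    backlog_ticket: Optional[str] = None   # e.g. an issue-tracker ID
    retest_date: Optional[date] = None
    retest_passed: Optional[bool] = None   # None until the re-test has run


def open_loops(findings, today):
    """Return (title, reason) pairs for findings that have not been closed out."""
    problems = []
    for finding in findings:
        if finding.backlog_ticket is None:
            problems.append((finding.title, "no backlog ticket"))
        if finding.retest_date is None:
            problems.append((finding.title, "no re-test date"))
        elif finding.retest_passed is False:
            problems.append((finding.title, "re-test failed"))
        elif finding.retest_passed is None and finding.retest_date < today:
            problems.append((finding.title, "re-test overdue"))
    return problems


if __name__ == "__main__":
    findings = [
        Finding("Singleton worker pinned to one AZ", "OPS-101",
                date(2026, 7, 1), retest_passed=True),
        Finding("Cache has no cross-AZ failover", "OPS-102", date(2026, 6, 1)),
        Finding("Runbook link was stale"),
    ]
    for title, reason in open_loops(findings, today=date(2026, 6, 10)):
        print(f"{title}: {reason}")
```

The two example findings borrow their titles from the benchmark pattern earlier in the post; the ticket IDs and dates are made up.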

---

*Source: https://www.factualminds.com/blog/aws-chaos-engineering-resilience-program-fis-2026/*
