From One FIS Experiment to a Resilience Program (2026): AWS Fault Injection Service, Stop Conditions, and GameDays That Actually Change Behavior

DevOps & CI/CD Palaniappan P 5 min read

Quick summary: Running one AWS FIS experiment in a demo account is not chaos engineering — it is a screenshot. A program ties experiments to SLOs, scopes blast radius with tags, halts on CloudWatch alarm stop conditions, schedules via EventBridge, and closes the loop by re-testing the fix. FIS now ships AZ Power Interruption and cross-Region connectivity scenarios in its Scenario Library. Here is the L0→L3 maturity matrix, a GameDay runbook, and a stop-condition-wired experiment skeleton.

Key Takeaways

  • Running one AWS FIS experiment in a demo account is not chaos engineering — it is a screenshot
  • A program ties experiments to SLOs, scopes blast radius with tags, halts on CloudWatch alarm stop conditions, schedules via EventBridge, and closes the loop by re-testing the fix
  • Here is the L0→L3 maturity matrix, a GameDay runbook, and a stop-condition-wired experiment skeleton

Running one AWS FIS experiment in a sandbox account is not chaos engineering — it’s a screenshot for the slide deck. The gap between teams that “tried chaos engineering” and teams that get fewer 3 a.m. pages is not the tooling; it’s whether experiments are tied to SLOs, scoped safely, scheduled, and — the part almost everyone skips — re-run after the fix to prove the fix worked. As of mid-2026, AWS Fault Injection Service (FIS) ships a Scenario Library with pre-built AZ Availability: Power Interruption and cross-Region connectivity scenarios, CloudWatch alarm stop conditions, tag-scoped targeting, and EventBridge Scheduler integration — the safety rails that turn fault injection from reckless to routine. This post is about the program, not a single experiment.

This is for SRE, platform, and reliability owners who may have run a hands-on chaos tutorial and now want to make the practice organizational. We ship a resilience-program maturity matrix, a GameDay runbook template, and a stop-condition-wired FIS experiment skeleton.

Benchmark pattern (not a cited client) — A composite multi-AZ platform that believed it was AZ-resilient because the architecture diagram had three AZs. First FIS AZ Power Interruption run in a non-prod clone (30-min interruption + 30-min recovery, tag-scoped): the stateless tier rode through, but a singleton background worker pinned to one AZ stalled the job queue, and a self-managed cache had no cross-AZ failover. Neither was visible on the diagram. No dollar figure — the value was finding two single-AZ dependencies before a real AZ event, turning a hypothetical “we’re multi-AZ” into a tested, then re-tested, claim.

The maturity ladder: where teams actually are

Score yourself honestly against the maturity matrix:

  • L0 — Ad hoc: “let’s see what breaks.” No hypothesis, no stop conditions. (Most teams overestimate where they are.)
  • L1 — Starting: one experiment, non-prod, a written steady-state metric. The screenshot stage.
  • L2 — Routine: hypothesis tied to an SLO, CloudWatch stop conditions, tag-scoped prod targets, scheduled via EventBridge.
  • L3 — Program: experiments derived from a risk register, multi-account/cross-Region scenarios, pre-prod pipeline gates, findings re-tested, executive sponsor, Resilience Hub targets.

Opinionated take: the only jump worth obsessing over is L1 → L2. A steady-state hypothesis, a stop condition, and a schedule are what convert a demo into a practice. Chasing L3 tooling while still at L1 discipline is how you get an impressive runbook nobody runs.
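To make the self-scoring concrete, here is a minimal sketch of the ladder as a checklist. The `Practices` fields and the `maturity_level` function are hypothetical names invented for illustration, and the L3 test is deliberately simplified to two of the L3 criteria (risk-register-driven experiments and re-tested findings); adapt the fields to whatever your own matrix tracks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Practices:
    """Hypothetical checklist distilled from the L0-L3 ladder (not an AWS API)."""
    written_steady_state: bool = False   # L1: a written steady-state metric
    slo_hypothesis: bool = False         # L2: hypothesis tied to an SLO
    stop_conditions: bool = False        # L2: CloudWatch alarm stop conditions
    tag_scoped_targets: bool = False     # L2: tag-scoped targets
    scheduled: bool = False              # L2: runs on a schedule
    risk_register_driven: bool = False   # L3 (simplified)
    findings_retested: bool = False      # L3 (simplified)


def maturity_level(p: Practices) -> str:
    """Return the highest level whose requirements are all met."""
    if not p.written_steady_state:
        return "L0"
    l2_met = all([p.slo_hypothesis, p.stop_conditions,
                  p.tag_scoped_targets, p.scheduled])
    if not l2_met:
        return "L1"
    if p.risk_register_driven and p.findings_retested:
        return "L3"
    return "L2"
```

Under this sketch, a team with a scheduled experiment but no stop condition still scores L1, which matches the point that discipline, not tooling, defines the level.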

The five non-negotiables before injecting in prod

You earn production chaos; you don’t start there.

  1. Steady-state definition — the metric that means healthy. Can’t name it? Don’t run it. (This is why observability is a prerequisite, not a nice-to-have.)
  2. Stop conditions — CloudWatch alarms wired into the FIS template that auto-halt on breach.
  3. Tag-scoped targets — never “all instances”; scope to a tag, skip when no valid target.
  4. Blast-radius cap — one AZ / one service / a percentage, not the fleet.
  5. A tested rollback — the experiment validates recovery; it shouldn’t be recovery’s first rehearsal.
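As an illustration of how items 2 to 4 can land in a single template, the sketch below builds the keyword arguments for boto3's `fis.create_experiment_template` without calling AWS. The tag key `ChaosReady`, the target and action names, the 5-minute restart duration, and the 50% upper bound are assumptions chosen for the example, not values from this post; the alarm and role ARNs are supplied by the caller.

```python
def build_experiment_template(alarm_arn: str, role_arn: str,
                              tag_key: str = "ChaosReady",   # hypothetical tag
                              tag_value: str = "true",
                              percent: int = 25) -> dict:
    """Return kwargs shaped for boto3's fis.create_experiment_template."""
    # Blast-radius cap: the 50% ceiling is an assumption for this sketch.
    if not 1 <= percent <= 50:
        raise ValueError("percent must be between 1 and 50")
    return {
        "description": "Stop a capped share of tagged EC2 instances",
        "roleArn": role_arn,
        # Stop condition: FIS halts the experiment when this alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # Tag-scoped target: only instances carrying the tag are eligible.
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        # Skip the run instead of failing when no tagged target resolves.
        "experimentOptions": {
            "accountTargeting": "single-account",
            "emptyTargetResolutionMode": "skip",
        },
    }
```

With credentials in place, something like `boto3.client("fis").create_experiment_template(**template)` would register it. Steady-state definition and a tested rollback (items 1 and 5) stay outside the template and remain the team's job.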

What broke — A team scheduled a recurring FIS experiment via EventBridge Scheduler but reused the non-prod experiment template, which had its CloudWatch stop-condition alarm pointed at a non-prod alarm that didn’t exist in the prod account. The schedule fired against prod, the steady-state breached, and because the stop-condition alarm reference was invalid the experiment didn’t auto-halt as expected — the on-call aborted it manually ~6 minutes in. Blast radius was contained (tag scope held), but the lesson was sharp: a stop condition you haven’t verified resolves to a real, firing alarm in the target account is not a stop condition. They added a pre-check (now step 4 of the GameDay runbook) that ALARM-tests the stop condition before every run.
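A pre-check along the lines of the one described above might look like the following sketch. It only covers the "resolves to a real alarm in the target account" half; the function name, its parameters, and the idea of passing in a set of known alarm names are assumptions for illustration, and the team's actual check is not shown in this post.

```python
def stop_condition_problems(stop_conditions, target_account, existing_alarm_names):
    """Return a list of problems; an empty list means the pre-check passed.

    stop_conditions: the template's stopConditions list.
    target_account: 12-digit ID of the account the experiment will run in.
    existing_alarm_names: alarm names known to exist in that account.
    """
    problems = []
    alarms = [c for c in stop_conditions
              if c.get("source") == "aws:cloudwatch:alarm"]
    if not alarms:
        problems.append("no CloudWatch alarm stop condition configured")
    for cond in alarms:
        arn = cond.get("value", "")
        # Expected: arn:<partition>:cloudwatch:<region>:<account>:alarm:<name>
        parts = arn.split(":", 6)
        if len(parts) != 7 or parts[2] != "cloudwatch" or parts[5] != "alarm":
            problems.append(f"not a CloudWatch alarm ARN: {arn!r}")
            continue
        account, name = parts[4], parts[6]
        if account != target_account:
            problems.append(
                f"alarm {name!r} is in account {account}, not {target_account}")
        if name not in existing_alarm_names:
            problems.append(f"alarm {name!r} not found in target account")
    return problems
```

In practice the alarm names could come from CloudWatch's `describe_alarms`, and the ALARM-test half could use `set_alarm_state` to force the alarm into `ALARM` and confirm it transitions before the run. Both calls are left out so the sketch runs without AWS access.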

What to do this week

  1. Score your top tier-1 system on the maturity matrix. Be honest about L0 vs L1.
  2. Run the AZ Power Interruption scenario in a non-prod clone against a tagged workload. Watch what doesn’t fail over.
  3. For one experiment, write the steady-state hypothesis and wire a CloudWatch stop condition — that’s your L1→L2 move.
  4. Schedule your first GameDay using the runbook template, and commit to a re-test date for every finding.

What this post doesn’t cover


Related: OTel + chaos engineering tutorial · Disaster recovery strategies · Resilience patterns · SLA/SLO design · AWS managed services

If you only do one thing: Take one tier-1 service, write its steady-state hypothesis, wire a CloudWatch stop condition into an FIS experiment, and schedule it. That single L1→L2 step does more for reliability than any amount of unscheduled, hypothesis-free fault injection.

