# Containment runbook — EC2 credential compromise / compromised instance group

Phase-gated runbook for the most common confirmed-intent path: GuardDuty emits
a `critical` `AttackSequence:EC2/CompromisedInstanceGroup` finding (or a
high-confidence `UnauthorizedAccess` / `CryptoCurrency` finding plus a
correlating CloudTrail signal). Copy per incident, fill the header, work the
phases in order. **Preserve evidence before you terminate anything.**

> Assumed setup: GuardDuty + Runtime Monitoring enabled; CloudTrail org trail
> on; the responder assumes a break-glass IR role with quarantine permissions.
> This runbook reads from and changes resources in **your** account; test it
> in a non-prod account first.

---

## Header

| Field | Value |
|-------|-------|
| Incident ID | `[INC-YYYY-NNN]` |
| Severity | `[SEV-1 / SEV-2]` |
| Detected by | `[GuardDuty finding ID]` |
| Incident Commander | `[name]` |
| Communications lead | `[name]` |
| Affected resources | `[i-xxxx, role/xxxx]` |
| Started (UTC) | `[timestamp]` |

---

## Phase 1 — Confirm & scope (target: 10 min)

- [ ] Open the GuardDuty finding; read the incident summary + MITRE ATT&CK mapping
- [ ] List every instance in the attack-sequence finding; note each instance profile / IAM role
- [ ] In CloudTrail (or Detective), pull recent `AssumeRole` calls involving the role, plus every write (non-read-only) API call made with its credentials
- [ ] Decide SEV: any sign of data read/exfil or customer-data access → **SEV-1**
- [ ] **Gate:** is this a true positive? `[CONFIRMED / FALSE-POSITIVE → close]`
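
A minimal sketch of the CloudTrail step above, assuming the events were already fetched in the shape CloudTrail's `LookupEvents` API returns (each item carries the full record as a JSON string under `CloudTrailEvent`). The function name `write_calls_by_role` and the output keys are hypothetical; the record fields (`userIdentity.sessionContext.sessionIssuer.arn`, `readOnly`, `eventName`) are standard CloudTrail record fields.

```python
import json
from typing import Iterable


def write_calls_by_role(events: Iterable[dict], role_arn: str) -> list[dict]:
    """Return the non-read-only calls made with sessions issued from role_arn."""
    hits = []
    for event in events:
        # LookupEvents returns the full record as a JSON string.
        record = json.loads(event["CloudTrailEvent"])
        issuer = (
            record.get("userIdentity", {})
            .get("sessionContext", {})
            .get("sessionIssuer", {})
        )
        if issuer.get("arn") != role_arn:
            continue
        # A missing readOnly flag is treated as a write, to err toward review.
        if record.get("readOnly", False):
            continue
        hits.append(
            {
                "time": record.get("eventTime"),
                "source": record.get("eventSource"),
                "name": record.get("eventName"),
                "ip": record.get("sourceIPAddress"),
            }
        )
    # ISO 8601 timestamps sort correctly as strings.
    return sorted(hits, key=lambda h: h["time"] or "")
```

With boto3, the input could come from the `lookup_events` paginator on a `cloudtrail` client, flattening each page's `Events` list. The result is a starting timeline for the SEV decision, not a substitute for reading the raw records.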

## Phase 2 — Preserve evidence (target: +10 min) — do NOT skip

- [ ] Enable termination protection on affected instances (blocks accidental API/console termination; it does not stop an Auto Scaling group from terminating them)
- [ ] Snapshot each instance's EBS volumes (tag `incident-id`, `do-not-delete`)
- [ ] Capture instance metadata, running process list (via SSM if reachable), and memory if tooling allows
- [ ] Export the relevant CloudTrail events to the incident S3 bucket (Object Lock / WORM if available)
- [ ] **Gate:** evidence captured and tagged. `[GO]`
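
The first two steps above can be sketched as follows, assuming `ec2` is a boto3 EC2 client created by the caller under the IR role. The function name `preserve_instance` and the tag value `"true"` for `do-not-delete` are assumptions; `modify_instance_attribute`, `describe_volumes`, and `create_snapshot` are existing EC2 client calls. Pagination of `describe_volumes` is ignored here for brevity.

```python
def preserve_instance(ec2, instance_id: str, incident_id: str) -> list[str]:
    """Enable termination protection, then snapshot every attached EBS volume."""
    # Block accidental API/console termination before touching anything else.
    ec2.modify_instance_attribute(
        InstanceId=instance_id, DisableApiTermination={"Value": True}
    )

    # Tags named in the checklist; the "true" value is an assumption.
    tags = [
        {"Key": "incident-id", "Value": incident_id},
        {"Key": "do-not-delete", "Value": "true"},
    ]

    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]

    snapshot_ids = []
    for volume in volumes:
        volume_id = volume["VolumeId"]
        snapshot = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"{incident_id} evidence: {instance_id} {volume_id}",
            TagSpecifications=[{"ResourceType": "snapshot", "Tags": tags}],
        )
        snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids
```

Record the returned snapshot IDs in the incident ticket. Snapshots complete asynchronously, so the gate should wait until each one reports `completed`.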

## Phase 3 — Contain (target: +30 min)

Containment is **isolate, do not terminate**. A terminated instance is lost evidence.

- [ ] Move affected instances to a **quarantine security group** (no egress except to forensic tooling; no ingress)
- [ ] Detach the compromised IAM role / instance profile, or attach an explicit `Deny *` policy scoped to the role
- [ ] If access keys are implicated: deactivate (not delete) the keys; issue the legitimate replacement out-of-band
- [ ] Revoke active sessions for the implicated principal (`aws iam ...`): this is an IAM deny policy keyed on `aws:TokenIssueTime`; STS has no revoke call of its own
- [ ] If the workload is behind an ALB/ASG: deregister the instances so traffic stops flowing
- [ ] **Gate:** blast radius can no longer expand — no network egress, no usable credentials. `[GO]`
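
A sketch of the quarantine and session-revocation steps, assuming `ec2` and `iam` are boto3 clients supplied by the caller and that a quarantine security group already exists. The function names and the policy name `ir-revoke-older-sessions` are hypothetical. The policy shape (deny all actions when `aws:TokenIssueTime` is earlier than a cutoff) is the pattern IAM uses to revoke a role's older sessions.

```python
import json
from datetime import datetime, timezone


def revoke_sessions_policy(cutoff: datetime) -> dict:
    """Deny every action for sessions whose token was issued before cutoff."""
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
            }
        ],
    }


def contain(ec2, iam, instance_id, quarantine_sg_id, role_name, cutoff=None) -> dict:
    """Swap the instance into the quarantine group and revoke older role sessions."""
    cutoff = cutoff or datetime.now(timezone.utc)

    # Replaces the instance's security groups with the quarantine group only.
    # Additional network interfaces, if any, need their own update.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])

    # Inline policy on the role; the policy name is an assumption.
    policy = revoke_sessions_policy(cutoff)
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="ir-revoke-older-sessions",
        PolicyDocument=json.dumps(policy),
    )
    return policy
```

The deny only covers tokens issued before the cutoff. An instance that still has the profile attached can fetch fresh credentials, so this complements detaching the role rather than replacing it.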

## Phase 4 — Eradicate & recover (target: same day for SEV-2)

- [ ] Identify entry vector from the finding + CloudTrail (exposed key, vuln, SSRF, etc.)
- [ ] Patch / close the vector in the **golden image / launch template**, not just the live host
- [ ] Replace, don't clean: launch fresh instances from a known-good AMI; never return a compromised host to service
- [ ] Restore data from backup taken **before** the compromise window (this is why backups matter — AWS notes recovery is not guaranteed)
- [ ] Re-issue scoped credentials; confirm least-privilege on the new role
- [ ] **Gate:** service restored on clean infrastructure; vector closed. `[GO]`
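
For the least-privilege check on the new role, one narrow screen is to flag `Allow` statements with wildcard actions in the policy document before attaching it. This is a sketch, not a full least-privilege audit; the function names are hypothetical and the input is assumed to be a standard IAM policy document already parsed into a dict.

```python
def _as_list(value):
    """IAM allows a single value or a list for Statement, Action, and Resource."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant '*', 'service:*', or use NotAction."""
    flagged = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        # Allow + NotAction grants everything except the listed actions.
        if broad_action or "NotAction" in statement:
            flagged.append(statement)
    return flagged
```

An empty result does not prove the role is least-privilege; it only shows that the most obvious over-grants are absent.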

## Phase 5 — Close & learn (within 5 business days)

- [ ] Confirm GuardDuty shows no recurrence of the sequence for 72h
- [ ] If you use AWS Security IR, close the case and capture the engineer's notes
- [ ] Write the retrospective: timeline, detection lag, containment lag, what the runbook missed
- [ ] Convert at least one lesson into automation (a suppression rule, an EventBridge route, a tightened SCP)
- [ ] Update this runbook before the next on-call rotation
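
As one example of turning a lesson into automation, the sketch below builds an EventBridge event pattern for critical GuardDuty attack-sequence findings and routes matches to a paging target. It assumes `events` is a boto3 EventBridge client supplied by the caller; the function names, the target ID `ir-pager`, and the rule and target ARN values are hypothetical. The 9.0 threshold assumes GuardDuty's critical severity band starts at 9.0.

```python
import json


def attack_sequence_pattern(min_severity: float = 9.0) -> dict:
    """Event pattern matching GuardDuty AttackSequence findings at or above min_severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {
            "type": [{"prefix": "AttackSequence:"}],
            "severity": [{"numeric": [">=", min_severity]}],
        },
    }


def route_to_pager(events, rule_name: str, target_arn: str) -> None:
    """Create (or update) the rule and point it at the paging target."""
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(attack_sequence_pattern()),
        State="ENABLED",
    )
    # The target could be an SNS topic, a Lambda function, or similar.
    events.put_targets(Rule=rule_name, Targets=[{"Id": "ir-pager", "Arn": target_arn}])
```

The target also needs a resource policy that lets EventBridge invoke it, which this sketch leaves out.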

---

## Two failure modes this runbook is written to prevent

1. **Terminate-first.** Someone kills the instance to "stop the bleeding,"
   destroying the only forensic evidence and the root-cause trail. Isolate, snapshot, *then* replace.
2. **Clean and return.** Re-imaging in place or "removing the malware" and
   putting the host back. You cannot prove a compromised host is clean — replace it from a known-good AMI.
