---
title: AWS Infrastructure Drift Detection: How to Find and Fix Config Drift Before It Breaks Production
description: Infrastructure drift—when your actual AWS resources differ from what your IaC declares—causes silent failures and makes disaster recovery impossible. Learn how to detect drift systematically and fix it before it breaks production.
url: https://www.factualminds.com/blog/aws-infrastructure-drift-detection-terraform/
datePublished: 2026-04-04T00:00:00.000Z
dateModified: 2026-04-29T00:00:00.000Z
author: Palaniappan P
category: DevOps & CI/CD
tags: aws, terraform, drift-detection, infrastructure-as-code, devops
---

# AWS Infrastructure Drift Detection: How to Find and Fix Config Drift Before It Breaks Production

> Infrastructure drift—when your actual AWS resources differ from what your IaC declares—causes silent failures and makes disaster recovery impossible. Learn how to detect drift systematically and fix it before it breaks production.

Your infrastructure code declares that your database should have automated backups. But last week, a database engineer disabled backups to speed up a test, and nobody re-enabled them. The code says one thing. AWS does another. This is infrastructure drift.

Drift is silent. It doesn't trigger alerts. It doesn't break deployments. It just sits there until something goes wrong—a data loss, a security incident, a failed failover—and you discover the problem wasn't in your code, it was in the gap between your code and reality.

This guide covers how to detect drift systematically, what to do when you find it, and how to prevent it from happening again.

## What Is Infrastructure Drift and Why It's Dangerous

Infrastructure drift occurs when the actual state of your AWS resources diverges from what your Infrastructure as Code (IaC) declares. This can happen in multiple ways:

**Drift types:**

1. **Configuration drift** — Someone changed a setting in the AWS console (security group rule, RDS backup window, S3 bucket encryption)
2. **Structural drift** — A resource was created or deleted outside of IaC (manual resource provisioning)
3. **Compliance drift** — Resources no longer meet your security or compliance policies (encryption disabled, public access enabled, outdated OS patch)
4. **Tag drift** — Resources are missing tags required for cost allocation or compliance
5. **Version drift** — Infrastructure was created with an older version of a tool and never updated

Why drift is dangerous:

- **Disaster recovery breaks** — If you lose a resource and rebuild from IaC code, you won't get the manually-configured properties
- **Compliance violations** — You think you have encryption enabled because your code declares it, but the console shows it's disabled
- **Debugging nightmares** — Engineers spend hours investigating why production behaves differently than the staging environment, not realizing staging is drifted
- **Cost surprises** — Someone increased instance sizes or storage manually, and nobody notices until the bill arrives
- **Security gaps** — A security group rule was added manually to "temporarily" allow traffic, and it's still there a year later

## How Drift Happens

Understanding drift causes helps you prevent them.

### Manual Console Changes

The most common cause: someone needs to fix something urgently, logs into the AWS console, makes the change, and "will update the code later." They don't.

Example: A database is slow, so someone increases the instance size from `db.t3.medium` to `db.t3.large`. A week later, someone checks the code and expects the instance to be medium. It's not.

### Emergency Patches

A security vulnerability is discovered in your RDS database. You apply the patch immediately via the console, with plans to update your IaC tomorrow. Tomorrow becomes next week, and the code still declares the old version.

### Tool-Generated Resources

You use AWS SAM, CloudFormation, or a managed service that creates resources automatically. Your Terraform code doesn't know about these resources, or it's out of sync with what the tool actually created.

### Permissions and Assumptions

You assume only IaC creates resources, so you don't check. But a contractor spun up an EC2 instance for testing. A different team created an S3 bucket for backup storage. They're not declared in IaC.

### Time and Turnover

Your infrastructure code was written 6 months ago. Since then, AWS released new features. Your code still declares the old way of doing things. AWS now offers better defaults, and you're missing them.

## Tools for Drift Detection: The Terraform Approach

There are several ways to detect drift on AWS. The best approach depends on your infrastructure setup.

### 1. Terraform Plan as a Drift Detector

The simplest tool is the one you already have: `terraform plan`.

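```bash
terraform plan -refresh=true
```

Refreshing is the default behavior, so `-refresh=true` only makes the intent explicit. During a plan, Terraform:

1. Queries AWS for the current state of each resource
2. Compares the actual state to your Terraform state file
3. Compares your Terraform code to the refreshed state
4. Shows what would change if you apply

If nobody has changed the code since the last apply but `terraform plan` still shows changes, you have drift. To see only what changed outside Terraform, without mixing in pending code changes, run `terraform plan -refresh-only`.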

**Example output (abridged):**

```
Resource actions are as follows:

  ~ aws_security_group.api will be updated in-place

      ~ ingress {
            cidr_blocks = [
              - "192.168.1.0/24",
            ]
            from_port  = 443
            to_port    = 443
          }
```

The `-` marks a CIDR block that exists in AWS but not in your code, so Terraform plans to remove it. In other words, someone added a security group rule outside of Terraform.

### 2. AWS Config for Continuous Drift Monitoring

AWS Config evaluates your resources against rules continuously. It detects configuration drift at the AWS layer.

**Example rule:** Check that all EC2 instances have the required tags.

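```hcl
resource "aws_config_config_rule" "ec2_required_tags" {
  name = "ec2-required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  # Limit evaluation to EC2 instances, matching the rule's intent
  scope {
    compliance_resource_types = ["AWS::EC2::Instance"]
  }

  input_parameters = jsonencode({
    tag1Key = "Environment"
    tag2Key = "Owner"
  })
}
```

Config evaluates this rule whenever a covered resource's configuration changes and reports which resources are non-compliant. The rule only works if an AWS Config configuration recorder is enabled in the account and Region.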

**Advantages:**

- Continuous monitoring (not just on-demand)
- Works with any resource type that AWS Config supports, whether or not Terraform created it
- Integrates with Amazon EventBridge and SNS for alerting

**Disadvantages:**

- Limited to predefined rule types (though you can write custom rules)
- Doesn't know about IaC intent (it can't tell you "your code says t3.medium but you're running t3.large")
- Adds cost to your AWS bill
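
To act on a rule's findings, you can pull the non-compliant resource IDs with the AWS CLI. The function below is a minimal sketch: it assumes the AWS CLI is installed with credentials configured, and the function name `list_noncompliant` is made up for illustration.

```bash
# Print the IDs of resources that currently fail the given AWS Config rule.
# Assumption: the AWS CLI is installed and credentials are configured.
list_noncompliant() {
  local rule_name="$1"
  aws configservice get-compliance-details-by-config-rule \
    --config-rule-name "$rule_name" \
    --compliance-types NON_COMPLIANT \
    --query 'EvaluationResults[].EvaluationResultIdentifier.EvaluationResultQualifier.ResourceId' \
    --output text
}
```

Calling `list_noncompliant ec2-required-tags` prints the IDs for the rule defined earlier, which you can feed into a ticket or a tagging script.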

### 3. CloudFormation Drift Detection

If you use CloudFormation (or AWS SAM, which generates CloudFormation), CloudFormation has built-in drift detection:

```bash
aws cloudformation detect-stack-drift --stack-name my-stack
```

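This command only starts a drift detection operation and returns a detection ID. CloudFormation then compares each supported resource in the stack to its definition in the template. You retrieve the results with `describe-stack-drift-detection-status` and `describe-stack-resource-drifts`.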
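
A minimal sketch of the start, poll, and report sequence with the AWS CLI might look like the following. The function name `detect_stack_drift` and the `DRIFT_POLL_SECONDS` variable are illustrative assumptions; the three `aws cloudformation` subcommands are the standard ones for drift detection.

```bash
# Start drift detection, wait for it to finish, then list drifted resources.
# Assumption: the AWS CLI is installed and configured for the stack's account.
detect_stack_drift() {
  local stack="$1" id status
  id="$(aws cloudformation detect-stack-drift \
    --stack-name "$stack" \
    --query StackDriftDetectionId --output text)" || return 1

  # Poll until the detection operation leaves the in-progress state
  while :; do
    status="$(aws cloudformation describe-stack-drift-detection-status \
      --stack-drift-detection-id "$id" \
      --query DetectionStatus --output text)" || return 1
    [ "$status" != "DETECTION_IN_PROGRESS" ] && break
    sleep "${DRIFT_POLL_SECONDS:-5}"
  done

  # Report only resources that were modified or deleted outside the template
  aws cloudformation describe-stack-resource-drifts \
    --stack-name "$stack" \
    --stack-resource-drift-status-filters MODIFIED DELETED \
    --query 'StackResourceDrifts[].[LogicalResourceId,ResourceType,StackResourceDriftStatus]' \
    --output table
}
```

Run it as `detect_stack_drift my-stack`. An empty table means CloudFormation found no modified or deleted resources among those it can check.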

**Works well if:** You use CloudFormation exclusively.

**Doesn't work if:** You mix CloudFormation and Terraform, or if some resources are created manually.

### 4. Automated Drift Detection in CI/CD

Many teams run `terraform plan` on a schedule (daily or hourly) to detect drift continuously.

**Example GitHub Actions workflow:**

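```yaml
name: Detect Infrastructure Drift

on:
  schedule:
    - cron: '0 2 * * *' # 2 AM UTC daily

jobs:
  drift_detection:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          # The wrapper script can mask the plan's exit code; disable it
          # so that exit code 2 fails the step
          terraform_wrapper: false

      # AWS credentials must be configured before this point (for example via OIDC)

      - name: Initialize Terraform
        run: terraform init -input=false

      - name: Detect Drift
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
        run: terraform plan -detailed-exitcode -input=false -lock=false

      - name: Report Drift
        if: failure()
        run: |
          echo "Drift detected or plan failed; check the plan output above"
          # Send to Slack, PagerDuty, or email
```

With `-detailed-exitcode`, a plan that shows changes exits with code 2, so the job fails and alerts your team. Without that flag, `terraform plan` exits 0 even when it finds changes, and the scheduled job would never fail on drift.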
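
In a pipeline it helps to tell real drift apart from a plan that simply failed (expired credentials, provider errors). `terraform plan -detailed-exitcode` returns 0 for no changes, 1 for an error, and 2 for pending changes. The wrapper below is a sketch built on that; the function names and the `plan.log` file name are illustrative assumptions.

```bash
# Map a `terraform plan -detailed-exitcode` exit code to a label.
# 0 = no changes, 2 = changes present, anything else = error.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run the plan, log its output, and return Terraform's original exit code.
check_drift() {
  local code=0
  terraform plan -detailed-exitcode -input=false -lock=false >plan.log 2>&1 || code=$?
  echo "terraform plan: $(classify_plan_exit "$code") (see plan.log)"
  return "$code"
}
```

A CI step can call `check_drift` and route the `drift` and `error` cases to different channels, so a broken pipeline does not look like an unauthorized change.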

## Triage: Which Drift Needs Immediate Fix?

Not all drift is equally urgent. Some drift is intentional. Some is benign. Some is a critical security issue.

**Severity levels:**

| Level    | Example                                                                             | Action                    |
| -------- | ----------------------------------------------------------------------------------- | ------------------------- |
| Critical | Security group rule opened to 0.0.0.0/0, encryption disabled, public access enabled | Fix immediately           |
| High     | Instance type upgraded (cost impact), backup retention reduced                      | Fix within 1 week         |
| Medium   | Tags missing, non-critical settings changed                                         | Fix within 1 sprint       |
| Low      | Non-impactful properties diverged (descriptions, comments)                          | Document and deprioritize |

**How to triage:**

1. Review the `terraform plan` output carefully
2. Check AWS console to understand what changed and why
3. Ask: "If this resource is destroyed and recreated from code, would it break anything?"
4. Ask: "Does this violate security or compliance requirements?"
5. Decide: fix the resource or update the code
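
For step 2, CloudTrail can often tell you who changed a resource and when, which the console alone may not show. The sketch below assumes CloudTrail is recording management events in the account (event history covers roughly the last 90 days) and that the AWS CLI is configured. The function name `who_changed` and the example ID are placeholders.

```bash
# List recent management events recorded for a single resource.
# Assumptions: CloudTrail event history is available and the AWS CLI is configured.
who_changed() {
  local resource_id="$1"
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=ResourceName,AttributeValue=${resource_id}" \
    --max-results 20 \
    --query 'Events[].[EventTime,Username,EventName]' \
    --output table
}
```

For example, `who_changed sg-0123456789abcdef0` lists the calls that touched that security group, which usually answers whether the change was an emergency fix or an accident.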

## Remediation Strategies: Import, Revert, or Update Code

When you find drift, you have three options.

### Option 1: Update Your Code to Match Reality

If the drifted state is better than what your code declares, update the code.

**Scenario:** Someone increased the RDS instance from `db.t3.medium` to `db.t3.large` because it needed more capacity. The drift detection found it. The t3.large is correct and should stay.

**Fix:**

```hcl
resource "aws_db_instance" "main" {
  instance_class = "db.t3.large"  # Was db.t3.medium
  # ... rest of config
}
```

Run `terraform plan` and verify zero changes. Commit and deploy.

### Option 2: Revert the Resource to Match Code

If someone made a change that shouldn't have been made, revert it.

**Scenario:** A database engineer disabled automated backups to speed up a test. The IaC declares backups as enabled. The backups should stay enabled.

**Fix:**

Option A: Revert manually in the AWS console (if it's safe).

Option B: Use Terraform to revert:

```bash
terraform apply  # This will re-enable backups
```

This works if the change is safe and non-destructive. If the change is destructive (like deleting data), you need to handle it carefully.

### Option 3: Import the Drifted State

If you want Terraform to manage something that was created manually, import it.

See our detailed guide on [Terraform state management](/blog/terraform-state-management-aws-import-move-repair/) for the import workflow.
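
As a rough sketch of what an import looks like for a manually created S3 bucket: the resource address and bucket name are placeholders, and the function assumes a matching `resource "aws_s3_bucket"` block already exists in your configuration.

```bash
# Bring a manually created S3 bucket under Terraform management.
# Assumption: the configuration already contains a resource block at the
# given address; for aws_s3_bucket the import ID is the bucket name.
import_bucket() {
  local address="$1" bucket="$2"
  terraform import "$address" "$bucket" || return 1
  # A follow-up plan shows which arguments the code still needs to match reality
  terraform plan -input=false
}
```

Usage would be `import_bucket aws_s3_bucket.backups example-backup-bucket`. Keep adjusting the resource block until the follow-up plan shows no changes.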

## Preventing Drift: Immutable Infrastructure Patterns

The best drift is drift you never create.

### Pattern 1: Destroy and Rebuild Instead of Modifying

Instead of modifying resources in place, destroy the old one and create a new one. This forces the change through code.

**Example:** Database version upgrade.

Instead of:

```
Click RDS console → Select database → Modify version → Apply
```

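Do this:

```
Update code → terraform apply → new instance is created from a snapshot on the new version → cut over → old instance is removed
```

This ensures the new version is declared in code. Note that for `aws_db_instance`, changing only `engine_version` makes Terraform upgrade the instance in place rather than replace it. Either way, the change goes through code instead of the console.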

### Pattern 2: Require Code Review Before Resource Changes

Establish a policy: no changes to resources without updating code first. Make this a code review requirement:

1. Engineer identifies needed change
2. Engineer updates IaC code
3. Code is reviewed and merged
4. Change is applied via CI/CD pipeline

This prevents the "I'll update the code later" problem.

### Pattern 3: Break-Glass Procedures for Emergencies

Sometimes you need to bypass this process. Define a break-glass procedure:

1. Emergency change is made directly in console
2. An incident ticket is created
3. Someone with authority approves the change
4. Code is updated within 24 hours (SLA)
5. Change is reapplied via IaC

This allows for true emergencies while maintaining accountability.

### Pattern 4: Immutable Infrastructure (Immutable Instances)

For EC2 instances, don't modify them. Recreate them:

- Instance needs a security patch? → Terminate and recreate from updated AMI
- Instance needs a config change? → Update the config in your deployment tool (Ansible, Chef), rebuild AMI, terminate and recreate

This prevents configuration drift entirely because instances are never modified—they're replaced.

## Drift Detection as Part of Your Disaster Recovery Process

Drift detection isn't just a nice-to-have. It's essential for disaster recovery.

When you rebuild infrastructure from code (because a region failed, or you're switching clouds), you're trusting that your code is complete and current. If your code is drifted, your rebuild will be incomplete.

**Test your recovery process:**

1. Run `terraform plan` to detect current drift
2. Fix all drift
3. (In a non-production environment) destroy all resources
4. Run `terraform apply` to rebuild
5. Verify the rebuilt infrastructure is identical to production

If this process fails, you've found a gap in your IaC. Fix it before a real disaster.
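
The rehearsal steps can be sketched as a guarded shell function. This is illustrative only: it assumes Terraform workspaces separate your environments, that production workspaces are named `prod*`, and that `-auto-approve` is acceptable in the rehearsal environment. Adapt the guard to your own naming before using anything like it.

```bash
# Rehearse a rebuild from code in a non-production Terraform workspace.
# Assumption: production workspaces are named "prod*"; adjust the guard.
dr_rehearsal() {
  local ws
  ws="$(terraform workspace show)" || return 1
  case "$ws" in
    prod*) echo "refusing to run in workspace '$ws'" >&2; return 1 ;;
  esac

  # Steps 1-2: stop if drift is still outstanding (or the plan fails)
  terraform plan -detailed-exitcode -input=false >/dev/null || {
    echo "plan reported changes or failed; fix that first" >&2
    return 1
  }

  # Steps 3-4: destroy everything, then rebuild from code alone
  terraform destroy -auto-approve -input=false || return 1
  terraform apply -auto-approve -input=false || return 1

  # Step 5 (partial): a clean plan means the rebuilt state matches the code;
  # comparing against production still needs a separate check
  terraform plan -detailed-exitcode -input=false >/dev/null
}
```

A non-zero return from `dr_rehearsal` points at the step where the rebuild diverged from what the code describes.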

## Conclusion: Drift Is a Visibility Problem

Infrastructure drift isn't a technology problem—it's a visibility problem. Teams that don't detect drift lose the ability to trust their IaC. Teams that detect drift continuously stay in control.

Start with `terraform plan` on a schedule. Add AWS Config for compliance monitoring. Establish policies around emergency changes. Test your disaster recovery process regularly.

If you're managing complex AWS infrastructure and struggling with drift—or if you're concerned your current disaster recovery wouldn't actually work—we can help. At [FactualMinds](https://www.factualminds.com), we help teams establish infrastructure governance practices that catch drift early and prevent silent failures. Whether you're building IaC from scratch or auditing an existing setup, we ensure your infrastructure is what it claims to be.

---

## Related Reading

- [Terraform State Management on AWS: Imports, State Moves, and Emergency Repairs](/blog/terraform-state-management-aws-import-move-repair/)
- [AWS Well-Architected Framework & Review Guide: The 6 Pillars Explained](/blog/aws-well-architected-framework-6-pillars-explained/)
- [How to Implement a HIPAA-Compliant Architecture on AWS](/blog/how-to-implement-hipaa-compliant-architecture-aws/)

## FAQ

### What is infrastructure drift in AWS?
Infrastructure drift is when the actual state of AWS resources diverges from what your IaC (Terraform, CloudFormation, CDK, OpenTofu) declares. It happens through manual console changes, emergency patches that never get codified, tool-generated resources, and out-of-band scripts. Drift is silent — it does not fail deployments — but it breaks disaster recovery, hides compliance violations, and creates debugging nightmares.

### How do I detect Terraform drift on AWS?
Run `terraform plan -detailed-exitcode` on a schedule (for example, a nightly CI job). Terraform refreshes by default: it queries AWS for the actual state of every resource and reports differences against the state file and your code, and exit code 2 signals pending changes. For multi-workspace setups, layer Terraform Cloud / Atlantis drift detection on top, or use `driftctl` for IaC-agnostic scanning (note that `driftctl` is now in maintenance mode). Pipe drift findings into Slack or PagerDuty so unauthorized changes are visible the next morning, not the next quarter.

### Should I use AWS Config or Terraform plan for drift detection?
Use both — they catch different things. AWS Config detects all resource and configuration changes regardless of how they were made and evaluates against managed compliance rules; it is the right tool for compliance and auditing. Terraform plan detects drift relative to your IaC and is the right tool for engineering team feedback. Together they cover the spectrum from 'is this resource compliant?' to 'why does the live state not match what we shipped?'

### How do I prevent infrastructure drift from happening in the first place?
Block console write access in production with SCPs or permission boundaries — humans can read the console, but writes go through pull requests. Enforce IAM Access Analyzer findings on root-level policies. Deploy AWS Service Catalog products for the workflows engineers most often want to do manually. Add a Config rule that flags any resource without a `managed-by: terraform` tag — anything untagged is a drift candidate.

### How do I fix drift once detected?
For configuration drift: update IaC to match the live state (or revert the live state via plan/apply if the change should not have happened). For resources created out of band: use `terraform import` (or `import` blocks in Terraform 1.5+) to bring them under management, and add the matching resource block to code. For deleted resources: if the deletion was intended, remove the resource from code (and from state with `terraform state rm` if it still appears there); otherwise re-apply to recreate it. Always commit the fix immediately so the gap closes.

---

*Source: https://www.factualminds.com/blog/aws-infrastructure-drift-detection-terraform/*
