10 AWS DevOps Practices We Actually Use in Production in 2026
Quick summary: Real AWS DevOps practices from production: GitOps on EKS, OpenTelemetry, supply chain security, chaos engineering with FIS, and AI-assisted DevOps with Amazon Q.
Key Takeaways
- Real AWS DevOps practices from production: GitOps on EKS, OpenTelemetry, supply chain security, chaos engineering with FIS, and AI-assisted DevOps with Amazon Q
Most AWS DevOps advice you read today is recycled from 2021–2023. Separate VPCs? Monitoring dashboards? Basic Terraform? That's the baseline now: infrastructure debt waiting to happen if it's all you do.
This post is different. These are the practices we actually see in production at AWS teams managing serious scale and complexity in 2026. Not “here’s what you should do,” but “here’s what happens when you don’t, and how teams avoid it.”
| Practice | Core AWS Services | Typical Complexity | Time to Deploy |
|---|---|---|---|
| Multi-Account Organizations | Organizations, Control Tower, SCPs | Medium | 2–4 weeks |
| Policy-as-Code | OPA, Checkov, Terraform CI checks | Medium | 1–2 weeks |
| GitOps on EKS | ArgoCD/Flux, EKS, ECR | High | 3–6 weeks |
| OpenTelemetry Stack | Distro for OTel, CloudWatch, X-Ray | Medium | 2–3 weeks |
| FinOps Automation | Karpenter, Graviton, Spot Fleet | Medium | 2–4 weeks |
| Progressive Delivery | CodeDeploy, Flagger, CloudWatch Alarms | Medium | 1–3 weeks |
| Supply Chain Security | ECR Scanning, Cosign, SBOM, Artifact Hub | Medium | 1–2 weeks |
| Platform Engineering | Backstage, Terraform Cloud, AWS APIs | High | 4–8 weeks |
| AI-Assisted DevOps | Amazon Q Developer, AWS Marketplace | Low | 1–2 days setup |
| Chaos Engineering | AWS FIS, CloudWatch, Runbook Automation | Medium | 1–2 weeks |
1. Multi-Account AWS Organizations with Control Tower
The Problem Single AWS accounts scale until they don’t. Blast radius grows without boundaries. One rogue IAM policy or misconfigured security group affects everything. Compliance audits become nightmares because you can’t easily isolate workloads.
What Teams Do Differently Now In 2022, “separate accounts” meant VPCs. By 2026, it’s the minimum viable structure: a dedicated management account (formerly called the master account) runs Control Tower, with organizational units (OUs) for prod, staging, security, and shared services. Each OU has different SCPs (Service Control Policies), preventing teams from accidentally creating resources in restricted regions or disabling CloudTrail.
How It Works AWS Control Tower sets up a landing zone automatically:
- Management account (central billing, organizations, SCPs)
- Log archive account (all CloudTrail logs flow here)
- Audit account (compliance and security tooling)
- Workload accounts created on-demand via account factory
SCPs are policy guardrails attached to OUs. For example, an SCP on your prod OU can deny destructive EC2 and RDS actions in any region except your approved one:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": [
"ec2:ModifyInstanceAttribute",
"ec2:TerminateInstances",
"rds:DeleteDBInstance"
],
"Resource": "*",
"Condition": {
"StringNotEquals": {
"aws:RequestedRegion": ["us-east-1"]
}
}
}
]
}
The Gotcha: SCPs apply to every principal in the member accounts under an OU, including each account's root user. If you deny iam:CreateAccessKey across prod, even administrators can’t create emergency credentials. Always have a break-glass procedure: a separate, heavily audited account (or an exempted role) with different SCPs for incident response.
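One common pattern is to carve the break-glass role out of the deny itself. A sketch, assuming a role naming convention like break-glass-* (the role name is illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessKeyCreationExceptBreakGlass",
      "Effect": "Deny",
      "Action": "iam:CreateAccessKey",
      "Resource": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/break-glass-*"
        }
      }
    }
  ]
}
```

The ArnNotLike condition makes the deny apply to everyone except principals assuming a break-glass role, so normal operations stay locked down while incident response stays possible.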
Where This Fails Forgetting to enable CloudTrail across all accounts at the organization level. Then compliance asks “who deleted that database?” and you have no audit trail. Enable CloudTrail organization-wide before you create any workload accounts.
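In Terraform, the organization-wide trail is a single flag on aws_cloudtrail in the management account. A minimal sketch (trail and bucket names are illustrative, and the S3 bucket needs the usual CloudTrail bucket policy):

```hcl
# Runs in the management account of the organization.
resource "aws_cloudtrail" "org" {
  name                       = "org-trail"
  s3_bucket_name             = "my-org-cloudtrail-logs" # illustrative bucket name
  is_organization_trail      = true                     # covers every member account
  is_multi_region_trail      = true
  enable_log_file_validation = true
}
```

With is_organization_trail set, new workload accounts created by account factory are covered automatically.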
2. Policy-as-Code: SCPs + OPA + Checkov in CI
The Problem IAM policies are written in JSON, reviewed once, and never questioned again. Six months later, someone’s production role has s3:* on * resources. Access reviews never catch it because nobody reads JSON for a living.
What Teams Do Differently Now Policy-as-code means your infrastructure scans itself automatically. Three layers:
- SCPs (AWS Organizations level) — deny dangerous actions organization-wide
- Checkov in CI — scan Terraform before it’s merged, flagging over-permissive policies and missing encryption
- OPA/Rego (Kubernetes/general) — custom policy rules, enforce your company’s standards
Checkov is the most practical layer. Run it in GitHub Actions:
# .github/workflows/terraform-check.yml
name: Terraform Policy Check
on: [pull_request]
jobs:
checkov:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Checkov scan
uses: bridgecrewio/checkov-action@master
with:
directory: infrastructure/
framework: terraform
quiet: false
soft_fail: false
output_format: sarif
output_file_path: checkov-results.sarif
- name: Upload to GitHub
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: checkov-results.sarif
Checkov flags things like:
- IAM policies with wildcards (s3:*)
- RDS databases without encryption at rest
- Security groups open to 0.0.0.0/0
- Unencrypted EBS volumes
- Secrets in code
The Gotcha: Checkov has a high false-positive rate if you don’t tune its config. A developer will see 50 warnings, assume they’re all noise, and ignore real issues. Create a .checkov.yaml in your repo and disable noisy checks specific to your architecture. Run Checkov on every PR, but enforce soft_fail: false only on the main branch; on feature branches, let failures warn without blocking.
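A tuned config might look like the following; the skipped check IDs are illustrative, so substitute whichever checks are actually noisy in your repo:

```yaml
# .checkov.yaml
directory:
  - infrastructure/
framework:
  - terraform
skip-check:
  - CKV_AWS_144  # S3 cross-region replication not required for these buckets
  - CKV_AWS_18   # access logging handled at the CDN layer instead
compact: true
```

Document the reason for every skip as a comment; an unexplained skip list rots into a blanket exemption.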
Where This Fails Treating policy-as-code as a compliance checkbox rather than a developer tool. If scanning is slow (>5 min per PR), developers will skip it or be frustrated. Keep your Terraform well-organized: scanning 500 files is slower than scanning 50 with clear boundaries. Use Terraform modules to reduce duplication.
3. GitOps with ArgoCD/Flux on EKS
The Problem Traditional CI/CD says: “Run kubectl apply -f deployment.yaml from Jenkins.” But who controls Jenkins? What if a deployment fails halfway? How do you audit what changed and when? If your cluster gets corrupted, how do you know what the source of truth is?
What Teams Do Differently Now GitOps flips the model. Your Git repository (main branch) is the source of truth for all deployment state. ArgoCD or Flux watches the repo. When code changes, the tool automatically applies it. When configuration drifts (someone manually changed a pod), ArgoCD detects drift and either alerts or auto-corrects.
This is not just “Terraform for Kubernetes.” GitOps means:
- All changes go through Git (and code review)
- Rollbacks are git revert, not manual kubectl set image
- Cluster state and Git state are automatically synchronized
A minimal ArgoCD setup:
# applications/argocd-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: my-api
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/myorg/infra-repo
targetRevision: main
path: k8s/my-api/
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true # Delete old resources
selfHeal: true # Fix drift
syncOptions:
- CreateNamespace=true
Apply this once, and ArgoCD continuously reconciles your cluster to match Git.
The Gotcha: GitOps can become a footgun if you allow manual changes. If developers can kubectl exec into pods or manually scale deployments, they will. GitOps assumes your deployments are fully declarative and immutable. If you have stateful applications that require manual tweaks, GitOps doesn’t help until you’ve fixed the application architecture.
Where This Fails Teams run ArgoCD for some services but still use traditional CI/CD for others. This creates two mental models and double the debugging. Commit fully or don’t — partial GitOps is confusion masked as modernization.
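With everything in Git, a rollback is a revert plus a sync. A sketch using the argocd CLI (the app name and commit placeholder are illustrative):

```shell
# Revert the bad change in the config repo; ArgoCD reconciles on its own,
# or force an immediate sync instead of waiting for the polling interval.
git revert <bad-commit-sha>
git push origin main
argocd app sync my-api
argocd app wait my-api --health   # block until the app reports healthy
```

Compare this with hand-rolled kubectl rollbacks: the revert commit is itself reviewable, auditable, and re-revertible.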
4. OpenTelemetry as the Observability Standard
The Problem CloudWatch Logs, X-Ray traces, Prometheus metrics, and Datadog APM—all generating separate data streams. Your latency issue is split across four tools. Correlating a specific user request across services requires manual log hunting.
What Teams Do Differently Now OpenTelemetry (OTel) is the single standard for traces, metrics, and logs. You instrument your code once, and OTel exports to whatever backend you want: CloudWatch, DataDog, New Relic, Prometheus, or all of them.
AWS Distro for OpenTelemetry is AWS’s curated, production-ready OTel distribution with pre-built Lambda layers, ECS task definitions, and EKS Helm charts. Install once, get traces + metrics + logs unified.
For a Node.js app:
// index.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { AWSXRayIdGenerator } = require('@opentelemetry/id-generator-aws-xray');
const { AWSXRayPropagator } = require('@opentelemetry/propagator-aws-xray');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const sdk = new NodeSDK({
idGenerator: new AWSXRayIdGenerator(),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
// Your app code now automatically generates traces
const express = require('express');
const app = express();
app.get('/api/users/:id', async (req, res) => {
// OTel automatically traces this request and any downstream calls
const user = await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]); // parameterized query avoids SQL injection
res.json(user);
});
app.listen(3000);
Traces flow to X-Ray (or whichever backend you configure). You now see:
- Request latency broken down by service
- Database query times
- External API calls (with errors)
- Automatic error tracking
The Gotcha: OTel has a steep learning curve if you’re not familiar with instrumentation. “Automatic” instrumentation via Node autodiscovery works for HTTP and databases, but custom business logic requires manual span creation. Teams often start with auto-instrumentation, hit limits when queries slow down (because they’re not instrumenting the slow code), and then realize they need to understand OTel deeply.
Where This Fails Shipping every trace to CloudWatch with no volume controls. If every request generates a full trace, your CloudWatch bill explodes. Use sampling: in development, keep 100% of traces; in production, sample 10–20% of traces plus 100% of errors.
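The OTel SDKs ship ratio-based samplers for this (e.g. TraceIdRatioBasedSampler in the Node SDK), and keeping all error traces is typically done with tail sampling in the collector. The decision boils down to logic like this pure-JS sketch (the function name and ratio are illustrative, not OTel API):

```javascript
// Head-sampling sketch: always keep error traces, keep ~10% of the rest.
// The trace ID's low bits are effectively uniform, so they double as the dice roll.
function shouldSample(traceIdHex, isError, ratio = 0.1) {
  if (isError) return true; // 100% of errors
  const bucket = parseInt(traceIdHex.slice(-8), 16) / 0x100000000;
  return bucket < ratio;    // ~10% of everything else
}
```

Deriving the decision from the trace ID (rather than Math.random) means every service in a request's path makes the same keep/drop choice, so sampled traces stay complete end to end.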
5. FinOps Automation: Karpenter + Graviton + Spot Fleet
The Problem You reserved instances for a predicted peak. Now you’re paying ~$50k/month for capacity you use half the time. Spot instances are cheaper but unpredictable. Graviton looks good on paper but you’re nervous about compatibility.
What Teams Do Differently Now FinOps isn’t just “set budget alerts.” Real FinOps means automatic right-sizing: Karpenter provisions nodes based on actual demand, preferring Graviton (arm64) instances and Spot fleet, with automatic fallback to on-demand if Spot is unavailable.
On EKS, replace your cluster autoscaler:
# karpenter-provisioner.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: general-purpose
spec:
template:
metadata:
labels:
workload-type: general
spec:
nodeClassRef:
name: default
requirements:
- key: kubernetes.io/arch
operator: In
values: ["arm64", "amd64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["t4g.medium", "t4g.large", "t3.medium", "t3.large"]
limits:
cpu: 1000
memory: 1000Gi
disruption:
consolidationPolicy: WhenUnderutilized
consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: AL2
role: "KarpenterNodeRole-eks-prod"
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "true"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "true"
userData: |
#!/bin/bash
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p
Karpenter watches your pod requests and right-sizes nodes. When demand drops, it consolidates workloads and removes idle nodes automatically.
Graviton instances (t4g, c7g, m7g) are typically 20–30% cheaper than equivalent x86 instances and have better power efficiency. Most container workloads run fine on Graviton; only specialized workloads (GPU-dependent, or relying on x86-only native libraries) can’t.
The Gotcha: Switching to Graviton requires validating all your dependencies. A third-party library built for x86 will fail at runtime on arm64. Test in a staging cluster first. Node consolidation can evict pods aggressively if not tuned; set consolidateAfter: 30s initially and increase it if you see pod churn.
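A quick way to validate arm64 compatibility is to build and smoke-test your image for both architectures before committing to Graviton. A sketch with docker buildx (the image name is illustrative):

```shell
# Build multi-arch images; the arm64 build fails fast if a native
# dependency has no arm64 wheel or binary.
docker buildx create --use
docker buildx build --platform linux/arm64,linux/amd64 \
  -t <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-app:test --push .

# Smoke-test the arm64 variant locally via emulation
docker run --rm --platform linux/arm64 \
  <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-app:test
```

Emulated arm64 runs are slow, so use them for startup and unit smoke tests, not benchmarks.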
Where This Fails Teams enable Karpenter but keep their old Reserved Instances active. Now you’re paying for both Karpenter-provisioned nodes and unused RI commitments. If you move to Karpenter, sell or cancel unused RIs.
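You can check whether RI commitments are actually being consumed with Cost Explorer, before and after the Karpenter migration (the dates below are illustrative):

```shell
# Reservation utilization for a month; watch for utilization dropping
# as Karpenter shifts load onto Spot and Graviton capacity.
aws ce get-reservation-utilization \
  --time-period Start=2026-01-01,End=2026-02-01 \
  --granularity MONTHLY
```

If utilization falls well below 100%, the RI is now pure waste on top of your Karpenter spend.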
6. Progressive Delivery: Canary Deployments with CodeDeploy + Flagger
The Problem You deploy a new API version. 5% of requests start failing due to a database connection bug. Your error rate spikes to 15% before anyone notices. Now you’re rolling back or incident-responding at 2 AM.
What Teams Do Differently Now Progressive delivery shifts traffic gradually to new versions, automatically rolling back if error rates or latency degrade. AWS CodeDeploy supports canary (10% traffic, monitor, then 90%) and linear (traffic increases by 10% every N minutes) strategies.
With Flagger (on EKS), you define SLO thresholds for your canary:
# flagger-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-service
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-service
progressDeadlineSeconds: 600
service:
port: 8080
targetPort: 8080
analysis:
interval: 1m
threshold: 5
maxWeight: 100
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 30s
skipAnalysis: false
When you update the deployment, Flagger automatically:
- Sends 10% of traffic to the new version
- Monitors error rates and latency
- If the SLO is breached (success rate below 99% or latency above 500ms), rolls back immediately
- Otherwise, gradually shifts 20%, 30%, etc. until 100%
The Gotcha: Canary requires good observability. If your monitoring is poor, Flagger can’t detect failures and your canary doesn’t protect anything. Pair canary with OpenTelemetry (practice #4). Also, canary doesn’t work well with database migrations—a new version might expect a new schema that doesn’t exist yet. Deploy schema changes separately, before code changes.
Where This Fails Teams run canaries for weeks without automated rollback. A human operator watches metrics and decides when to move to the next step. That defeats the purpose—if you need human oversight, you don’t have confidence in your deployment. Automate the threshold logic.
7. Supply Chain Security: SBOM + ECR Scanning + Container Signing
The Problem Your production container has a zero-day vulnerability in a third-party library. You don’t know it. Neither do your security auditors. By the time Log4Shell or xz-utils exploits hit, your container is already deployed.
What Teams Do Differently Now Supply chain security means:
- SBOM (Software Bill of Materials) — list every dependency in your container
- ECR Image Scanning — flag known CVEs in your images before they’re deployed
- Container Signing — prove an image came from your CI/CD, not a compromised registry
- SLSA Level 2 compliance — attestation that your build process was secure
Syft generates SBOMs automatically:
# In your CI/CD pipeline
docker build -t my-app:v1.2.3 .
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3
# Generate SBOM with syft
syft <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3 -o json > sbom.json
# Sign the image keyless with Cosign (OIDC-based; no long-lived key to manage)
export COSIGN_EXPERIMENTAL=1
export AWS_REGION=us-east-1
cosign sign --keyless <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3
Enable ECR image scanning:
aws ecr put-image-scanning-configuration \
--repository-name my-app \
--image-scan-config scanOnPush=true \
--region us-east-1
Now every push triggers a scan. CVE results appear in the ECR console and integrate with EventBridge:
# EventBridge rule: alert when a scan finds critical CVEs
EventPattern:
source:
- aws.ecr
detail-type:
- ECR Image Scan
detail:
scan-status:
- COMPLETE
finding-severity-counts:
CRITICAL:
- numeric:
- ">"
- 0
Action:
- Publish to SNS: "CRITICAL CVE in image!"
The Gotcha: ECR scanning only reports CVEs already in public vulnerability databases. If a dependency has an unpatched or undisclosed vulnerability, ECR won’t flag it; you need runtime scanning (Wiz, Snyk) for that. Also, keyless signing (no key to manage) is newer and requires OIDC setup in your CI. If you’re not ready for that complexity, use a signing key stored in AWS KMS or Secrets Manager.
Where This Fails Teams enable ECR scanning but allow deployments to proceed regardless of CVEs. The scan becomes a checkbox. Set up an admission controller (Kyverno, OPA Gatekeeper) in Kubernetes that rejects unsigned images and images without a clean scan.
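A Kyverno policy along these lines rejects pods whose images lack a valid Cosign signature. A sketch, assuming keyless signing from GitHub Actions; the issuer/subject values and repo pattern are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-ecr-signatures
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences:
            - "<account-id>.dkr.ecr.us-east-1.amazonaws.com/*"
          attestors:
            - entries:
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "https://github.com/myorg/*"
```

Start with validationFailureAction: Audit to see what would be blocked before switching to Enforce.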
8. Platform Engineering: Internal Developer Portals with Backstage
The Problem A new engineer joins. To deploy a microservice, they need to:
- Understand your Terraform module structure
- Learn your ECS task definition conventions
- Know which security groups to attach
- Figure out which environment variables are secrets
- Navigate 4 GitHub repos to find the right template
Then they deploy it wrong, security flags it, and they spend a day fixing it.
What Teams Do Differently Now Platform engineering means building a self-service portal (Spotify Backstage is popular) where developers specify what they want (a Node.js API, a data pipeline) and the platform auto-generates Terraform, manifests, CI/CD pipelines, and monitoring—all pre-configured to your standards.
Backstage + AWS:
# templates/nodejs-api.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
name: nodejs-api
title: Create a Node.js API
description: Self-service Node.js API with ECS, ALB, and auto-scaling
spec:
owner: platform-team
type: service
parameters:
- title: Basic Info
required:
- name
- description
properties:
name:
type: string
title: Service Name
description: Kebab-case service name (e.g., user-auth-api)
description:
type: string
port:
type: number
title: Container Port
default: 3000
memoryMb:
type: number
title: ECS Task Memory (MB)
default: 512
steps:
- id: fetch-base
name: Fetch Base Template
action: fetch:template
input:
url: ./skeleton
values:
serviceName: ${{ parameters.name }}
serviceDescription: ${{ parameters.description }}
port: ${{ parameters.port }}
- id: publish
name: Publish to GitHub
action: publish:github
input:
allowedHosts: ['github.com']
description: Created from Backstage
repoUrl: github.com?owner=myorg&repo=${{ parameters.name }}
- id: register
name: Register in Backstage
action: catalog:register
input:
repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
catalogInfoPath: '/catalog-info.yaml'
A developer fills out a form, clicks “create,” and gets a GitHub repo with:
- Dockerfile pre-configured
- ecs-task-definition.json with your standard security groups, logging, monitoring
- Terraform to provision ALB, ECS service, auto-scaling
- .github/workflows/deploy.yml with your standard CI/CD steps
- Pre-wired to your observability stack (CloudWatch, Datadog, etc.)
The Gotcha: Backstage is powerful but has a steep learning curve. Every template is custom to your organization, you’ll spend weeks building them, and you’ll need a new one for each type of service (API, data pipeline, Lambda function, etc.). Start small: build a template for your most common service type first.
Where This Fails Backstage becomes outdated. A template was built for Terraform v1.2, but your org upgraded to v1.5 with breaking changes. Developers blindly follow the template and deploy broken infrastructure. Backstage requires an owner team (usually platform engineering) that updates templates when your standards change.
9. AI-Assisted DevOps: Amazon Q Developer + Runbook Generation
The Problem Your database is slow. Your on-call engineer needs to:
- Grep CloudWatch Logs for clues
- Check database performance metrics manually
- Look up common causes in Slack history
- Maybe write a custom query
- Hope it’s the right diagnosis
This takes 20 minutes on a simple issue.
What Teams Do Differently Now Amazon Q Developer integrates into AWS Console, GitHub, and IDEs to provide AI-assisted troubleshooting and runbook automation. Ask Q to diagnose a CloudWatch alarm, and it queries your logs, metrics, and configuration automatically.
In VS Code (with Q Developer extension):
You: "Why is my ECS task failing to start?"
Q: "I see your ECS task is exiting with code 1.
Looking at CloudWatch Logs for the task:
'[ERROR] Unable to connect to RDS database at host [rds-endpoint]'
This suggests the security group allows no inbound traffic.
Recommendation: Add an inbound rule to your RDS security group
allowing port 5432 from your ECS security group."
Q can also auto-generate runbooks. If you get paged for a Lambda timeout, Q can draft a Markdown runbook:
# Lambda Timeout Incident Runbook
## Quick Check
1. Check Lambda metrics: **Invocations vs. Duration**
2. If Duration approaches your configured timeout (15 min is the Lambda maximum), check CloudWatch Logs for slow operations
3. Look for external API calls, database queries, or S3 operations
## Immediate Actions
1. Increase the timeout toward the 15-minute maximum (temporary)
2. Add CloudWatch alarms for P99 latency
3. Profile the cold start time: AWS Lambda Insights
Q Developer also reviews Terraform in PRs:
On PR comment:
aws_security_group.prod:
- ⚠️ Allowing 0.0.0.0/0 on port 443 (HTTPS)
Recommendation: Restrict to VPN IP range [x.x.x.x/24]
- ✅ Good: RDS encryption enabled
- ✅ Good: VPC Flow Logs enabled
The Gotcha: Q is helpful but not omniscient. It works best when your infrastructure is well instrumented (CloudWatch, X-Ray, structured logs). If your diagnostics are poor, Q can’t help. Also, Q reads your AWS account data; review what it can access and audit its usage.
Where This Fails Teams use Q as a replacement for learning. A new engineer relies entirely on Q for troubleshooting instead of understanding their architecture. Document your runbooks in addition to using Q—Q generates quick answers, but human-written runbooks capture institutional knowledge.
10. Chaos Engineering with AWS Fault Injection Simulator (FIS)
The Problem Your system is “highly available.” Then one AZ goes down, and your service fails because you’ve never tested that scenario. Or a network dependency becomes unavailable, and your service hangs because it has no timeout. You don’t know what you’ve broken until production breaks.
What Teams Do Differently Now Chaos engineering is systematic resilience testing. AWS Fault Injection Simulator (FIS) lets you define experiments: kill 50% of EC2 instances, inject 1000ms of latency on API calls, disable a Lambda function—and measure your service’s response. Experiments run scheduled (weekly game days) or on-demand.
FIS experiment to test RDS failover:
# fis-rds-failover.json
{
"description": "Test RDS multi-AZ failover",
"targets": {
"RDSCluster": {
"resourceType": "aws:rds:cluster",
"selectionMode": "ALL",
"resourceTags": {
"Environment": "production"
}
}
},
"actions": {
"FailoverCluster": {
"actionId": "aws:rds:failover-db-cluster",
"targets": {
"Clusters": "RDSCluster"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch",
"value": "arn:aws:cloudwatch:us-east-1::alarm:prod-api-errors-high"
}
]
}
This experiment:
- Triggers RDS failover
- Monitors your CloudWatch alarm (error rate > 5%)
- If alarm breached, stops the experiment immediately (rollback)
- Gives you measured results: “failover took 45 seconds, error rate spiked to 3% for 20 seconds”
You now know:
- Your failover works
- Your client retry logic works
- Your monitoring alerts in 30 seconds
The Gotcha: Chaos experiments can cause real incidents if not properly scoped. Kill 50% of prod instances and your service might not recover gracefully. Start with staging. Define clear stop conditions (error rate, latency threshold). Have an on-call engineer present during experiments.
Where This Fails Teams run chaos experiments but don’t act on findings. A latency experiment shows your service times out at 500ms, but you don’t increase timeout. Then production sees a timeout and your pager goes off. Chaos findings are debt—pay them before they become incidents.
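Once the experiment template exists, running a game day is a couple of CLI calls (the IDs are illustrative, and the template JSON also needs a roleArn granting FIS permission to act):

```shell
# Register the template, then kick off and monitor the experiment.
aws fis create-experiment-template --cli-input-json file://fis-rds-failover.json
aws fis start-experiment --experiment-template-id <template-id>
aws fis get-experiment --id <experiment-id>   # status and stop-condition hits
```

Scheduling the start-experiment call weekly (EventBridge Scheduler, a cron job) turns this into a standing game day rather than a one-off.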
Where to Start: Maturity Ladder
Not all practices are equally urgent. This ladder is based on impact and dependencies:
Month 1: Foundation (Start Here)
- ✅ Multi-Account Organizations (practice #1)
- ✅ Policy-as-Code with Checkov (practice #2)
These prevent class-of-bugs incidents (rogue IAM policies, unencrypted databases).
Month 2–3: Observability & Resilience
- ✅ OpenTelemetry stack (practice #4)
- ✅ Progressive delivery (practice #6)
You now see what’s failing and roll back safely.
Month 4–5: Modern Deployment
- ✅ GitOps on EKS (practice #3) — if you’re on Kubernetes
Month 6: Cost & Compliance
- ✅ FinOps with Karpenter (practice #5)
- ✅ Supply chain security (practice #7)
Month 7+: Advanced Optimization
- ✅ Platform engineering with Backstage (practice #8)
- ✅ Chaos engineering with FIS (practice #10)
- ✅ AI-assisted DevOps (practice #9)
The exact order depends on your current pain. If you’re bleeding money on over-provisioned infrastructure, do FinOps first. If you’re shipping vulnerabilities, do supply chain security first.
The Pattern
Notice a common thread across these 10 practices? They all follow the same pattern:
- Automate what humans do manually (SCPs instead of policy reviews, Karpenter instead of manual scaling, FIS instead of manual chaos testing)
- Make state observable (Git is source of truth for GitOps, CloudWatch for observability, SBOM for supply chain)
- Shift detection left (Checkov in CI, ECR scanning, Flagger canaries—catch problems before they hit users)
That’s the 2026 DevOps formula. Not new tools for new tools’ sake, but tools that automate your toil, give you visibility, and let you move fast without breaking things.
The next major incident your team faces will teach you one of these practices the hard way. Or you can read about it here first.
Need Help?
If your team is struggling with production reliability, incident response, or scaling AWS infrastructure, our AWS DevOps consulting practice can deep-dive into these patterns for your specific architecture.
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.



