Container Orchestration
Kubernetes on AWS (EKS)
Managed Kubernetes on AWS with Auto Mode, Hybrid Nodes, Karpenter 1.0, and Graviton-first node pools.
Last updated: April 29, 2026 | Author: FactualMinds Cloud Integration Team | Reviewed by: FactualMinds AWS-certified architects (Solutions Architect – Professional)
## Amazon EKS overview
Amazon EKS is AWS-managed Kubernetes. The control plane (API server, scheduler, etcd) is operated by AWS, patched automatically, and deployed across at least three availability zones. You own the data plane, or, on **EKS Auto Mode** (GA December 2024), you delegate the data plane to AWS as well and consume Kubernetes as an almost-serverless service.
FactualMinds deploys EKS for teams that need Kubernetes portability (multi-cloud, on-prem via **EKS Hybrid Nodes**, or open-source ecosystem alignment) and for mid-market AWS-only teams that have outgrown ECS or plain Fargate. We default new 2026 clusters to **Auto Mode on Kubernetes 1.32 with Graviton-first node pools** unless a specific workload says otherwise.
## What's new on EKS in 2026
- **EKS Auto Mode GA** — fully managed data plane, managed add-ons (VPC CNI, kube-proxy, CoreDNS, EBS CSI, AWS Load Balancer Controller), and a managed Karpenter that provisions nodes within seconds of scheduling pressure.
- **EKS Hybrid Nodes** (GA December 2024) — register Linux hosts running on-prem or at the edge as EKS worker nodes governed by an AWS-hosted control plane. One `kubectl` surface for cloud and hybrid.
- **Karpenter 1.0** (2024) — stable NodeClass/NodePool CRDs, disruption budgets, and consolidation-policy modes. Karpenter is the default on Auto Mode.
- **Pod Identity** — the ergonomic replacement for IRSA. No OIDC provider, no ServiceAccount annotation, no trust-policy gymnastics.
- **Kubernetes 1.31 / 1.32** — typical supported minor versions on EKS in 2026; upstream releases every ~4 months, EKS supports the current plus the previous three.
- **ECR enhanced scanning** — Amazon Inspector scans images for OS and language-package CVEs and reports Exploit Prediction Scoring System (EPSS) scores; integrates with Security Hub.
- **AWS Load Balancer Controller** — managed install on Auto Mode; supports Gateway API, ALB and NLB target-group binding, and cross-zone health checks.
- **Amazon EBS CSI driver** managed add-on — Auto Mode handles install and upgrade; gp3 volumes by default.
- **Cilium + Hubble / eBPF observability** — supported via add-ons for teams that need deep network visibility without full-fat service mesh.
## Why EKS
**Kubernetes standard**
- Standard `kubectl`, Helm, Kustomize, and standard manifests.
- Portable: workloads run on other clouds, on-premises (EKS Hybrid Nodes or EKS Anywhere), or upstream Kubernetes.
- Massive ecosystem (Prometheus, OpenTelemetry, Argo CD, Flux, Karpenter, Cilium, Istio, Linkerd).
**AWS integration**
- VPC CNI for pod networking with real AWS IP addresses.
- Pod Identity for pod-level IAM permissions without OIDC acrobatics.
- Native integrations with ALB/NLB, EFS, EBS, S3, RDS, DynamoDB, SQS, Kinesis, Bedrock.
- AWS Security Hub / GuardDuty EKS Protection for runtime threat detection.
**Managed control plane**
- Multi-AZ control plane included in the $0.10/hour price.
- AWS patches the control plane on a published minor-version cadence.
- SLA covers control-plane availability; you are responsible for workload availability.
## EKS Architecture
**Control plane** (AWS managed)
- API server, scheduler, controller managers, etcd.
- Audit logs can be shipped to CloudWatch Logs; control-plane endpoints can be private, public, or public+private with IP allow-list.
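As a rough sketch of the "public+private with IP allow-list" option, the endpoint settings can be changed with `aws eks update-cluster-config`. The cluster name, region, and CIDR (`203.0.113.0/24`, a documentation range) are placeholders, and the script only prints the command so it can be reviewed before anything changes.

```bash
# Placeholders; override through the environment.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
REGION="${REGION:-eu-west-1}"
ALLOWED_CIDRS="${ALLOWED_CIDRS:-203.0.113.0/24}"   # office/VPN range allowed to reach the public endpoint

# Print the update command: public endpoint restricted to ALLOWED_CIDRS,
# private endpoint enabled for traffic from inside the VPC.
endpoint_update_cmd() {
  printf 'aws eks update-cluster-config --region %s --name %s --resources-vpc-config %s\n' \
    "$REGION" "$CLUSTER_NAME" \
    "endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=${ALLOWED_CIDRS}"
}

endpoint_update_cmd
# After review, run it with: endpoint_update_cmd | sh
```

Setting `endpointPublicAccess=false` instead gives the fully private variant.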
**Data plane** (your choice)
- **EKS Auto Mode** — fully managed nodes, add-ons, networking, load balancing, and storage controllers.
- **Managed node groups** — EC2 instances you provision; AWS manages OS patching, drain, and replacement.
- **Karpenter on self-managed nodes** — for teams that want fine control over instance-type selection and disruption policy.
- **AWS Fargate** — serverless pods with no node management; higher per-pod price, best for bursty or sandbox workloads.
**Networking**
- AWS VPC CNI: each pod gets a real VPC IP (prefix delegation supported for IP density).
- Security groups for pods (SGFP) for per-pod network security.
- Cilium eBPF or Calico for network policy and observability.
- AWS Load Balancer Controller for ALB/NLB ingress.
## EKS Auto Mode in practice
- AWS provisions, scales, and replaces nodes automatically based on pending pods.
- Managed Karpenter bin-packs across instance types, including Graviton by default.
- OS patching via node replacement on a rolling schedule; no in-place kernel updates.
- AWS manages the core add-ons (VPC CNI, kube-proxy, CoreDNS, AWS Load Balancer Controller, EBS CSI).
- Billed as EC2 plus an EKS Auto Mode management fee that scales with the instance type and runtime of each managed node; typically net-neutral or cheaper versus self-managed node groups when labor is priced in.
**Use Auto Mode when**
- You want Kubernetes without node operations.
- Your security team can live with AWS-managed, regularly replaced AMIs.
- Your workloads do not require custom kernel modules or niche runtime options.
**Prefer managed node groups when**
- You need a regulated/approved AMI (e.g., STIG-hardened) maintained by your security team.
- You run custom kernel modules (BPF/eBPF extensions beyond what's supported, niche drivers).
- You want fine-grained Spot pool control that the managed NodePool does not expose.
## EKS Hybrid Nodes
- Register on-prem Linux hosts as EKS workers against an AWS-hosted control plane.
- Supports x86 and ARM; the on-prem host obtains AWS credentials through AWS Systems Manager hybrid activations or IAM Roles Anywhere.
- Use for edge compute, data-gravity on-prem workloads, or manufacturing floor nodes that must stay physically on site but should be governed from AWS.
- Compare to EKS Anywhere: Hybrid Nodes share one control plane with AWS; EKS Anywhere runs its own on-site control plane.
## Pod Identity vs IRSA
**Pod Identity (2026 default)**
```bash
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace production \
  --service-account my-app \
  --role-arn arn:aws:iam::123456789012:role/my-app-role
```
Pods using the `my-app` ServiceAccount in the `production` namespace automatically receive temporary credentials via the Pod Identity Agent. No annotation, no OIDC provider, no trust-policy StringEquals dance.
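The role still needs a trust policy, but it is the same generic document for every association rather than a per-cluster OIDC condition. A minimal sketch, assuming the role is called `my-app-role` as in the command above and that the file name is arbitrary:

```bash
# Write the generic trust policy used by EKS Pod Identity.
cat > pod-identity-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}
EOF

# Create the role from it (not executed here):
#   aws iam create-role --role-name my-app-role \
#     --assume-role-policy-document file://pod-identity-trust-policy.json
```

Permissions for the workload itself are attached to the role as usual; the trust policy only controls who may assume it.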
**IRSA (legacy / niche)**
- Still required for workloads that only accept token-file authentication, clusters older than 1.24, or EC2 workloads outside EKS.
- OIDC provider + annotated ServiceAccount + trust-policy condition on the OIDC subject.
## Karpenter 1.0 patterns we deploy
- **Graviton-first NodePool** — allow `arm64` architectures, prefer on-demand for baseline and Spot for scale-out.
- **Consolidation policy** — `WhenEmptyOrUnderutilized` (named `WhenUnderutilized` before Karpenter 1.0) for dev/staging, `WhenEmpty` for production to avoid disruption of long-running pods.
- **Disruption budgets** — cap how many nodes Karpenter can disrupt at the same time (optionally within a scheduled window), aligned with PDBs.
- **Per-namespace NodePool selection** — heavy GPU workloads go to a dedicated NodePool with `nvidia.com/gpu` taints.
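The first three patterns can be sketched as a single Karpenter `v1` NodePool. This is an illustration under assumptions, not our production manifest: the NodePool name, the `default` EC2NodeClass, and the 10% budget are placeholders, and the `nodeClassRef` shown is the self-managed Karpenter form (Auto Mode references its own NodeClass kind).

```bash
# Write a Karpenter v1 NodePool manifest; every name below is a placeholder.
cat > graviton-first-nodepool.yaml <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: graviton-first              # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:                 # self-managed Karpenter form
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default               # assumes an EC2NodeClass named "default" exists
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]         # add "amd64" to keep an x86 fallback
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
  disruption:
    consolidationPolicy: WhenEmpty  # production setting; dev/staging can consolidate more eagerly
    consolidateAfter: 5m
    budgets:
      - nodes: "10%"                # at most 10% of this pool's nodes disrupted at once
EOF

# Apply once reviewed (not executed here):
#   kubectl apply -f graviton-first-nodepool.yaml
```

A GPU pool would be a second NodePool with its own taints and instance requirements, so that only pods with matching tolerations land there.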
## Observability stack
- **CloudWatch Container Insights (enhanced observability)** for the cluster, nodes, pods, and control plane.
- **ADOT Collector DaemonSet** forwarding traces, metrics, and logs to Managed Prometheus + Managed Grafana, or to Datadog / New Relic / Honeycomb.
- **Cilium Hubble** (or Pixie) for eBPF-level network visibility without a service mesh.
- **EKS audit logs** to CloudWatch Logs with 90-day retention and S3 archive behind Object Lock for SOC 2 / PCI evidence.
## Graviton cost savings
- Graviton3 (`m7g`, `c7g`, `r7g`) and Graviton4 (`m8g`, `c8g`, `r8g`) typically deliver 30–40% better price-performance than comparable x86 for stateless microservices and JVM workloads.
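- Build multi-arch images with `docker buildx build --platform linux/amd64,linux/arm64` in CI; push the resulting multi-arch manifest list to ECR.
- Karpenter on Auto Mode picks ARM when it wins on price and the pod fits, provided the NodePool allows `arm64` and the image is published for both architectures.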
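A sketch of the CI step for the multi-arch build, assuming a private ECR repository. The account ID, region, repository, and tag are placeholders; the script prints the login and build commands for review instead of running Docker.

```bash
# Placeholders; a CI system would normally inject these.
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"
REGION="${REGION:-eu-west-1}"
REPO="${REPO:-my-app}"
TAG="${TAG:-1.0.0}"

# Registry host and full image URI for a private ECR repository.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$ACCOUNT_ID" "$REGION"
}
ecr_image_uri() {
  printf '%s/%s:%s\n' "$(ecr_registry)" "$REPO" "$TAG"
}

# Print the commands: log in to ECR, then build and push both architectures
# under one tag (buildx publishes a manifest list that covers amd64 and arm64).
printf 'aws ecr get-login-password --region %s | docker login --username AWS --password-stdin %s\n' \
  "$REGION" "$(ecr_registry)"
printf 'docker buildx build --platform linux/amd64,linux/arm64 -t %s --push .\n' "$(ecr_image_uri)"
```

After the push, `docker buildx imagetools inspect <image-uri>` lists the platforms in the manifest list, which is a quick way to confirm that the `arm64` variant exists before allowing Graviton nodes.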
## Reference architecture (2026 default)
```
                 ┌──────────────────────────────────────────────┐
                 │ AWS-managed control plane (multi-AZ)         │
                 │ api / scheduler / controller-mgr / etcd      │
                 │ audit + authenticator + scheduler logs       │
                 └─────────────────┬────────────────────────────┘
                                   │ (private endpoint via PrivateLink)
                                   │
┌──────────────────────────────────┼──────────────────────────────────┐
│ Data plane (Auto Mode)           │                                  │
│ ├── managed Karpenter NodePool   │ ── Pod Identity Agent (per node) │
│ ├── Graviton-first c8g/m8g/r8g   │ ── VPC CNI (prefix delegation)   │
│ ├── consolidation policy         │ ── EBS CSI (gp3 default)         │
│ └── disruption budgets           │ ── AWS LB Controller (ALB+NLB)   │
└──────────────────────────────────┴──────────────────────────────────┘
                                   │
Workloads ── ServiceAccount → PodIdentityAssociation → IAM Role
Ingress ── ALB (alb.ingress.k8s.aws/scheme: internet-facing)
Storage ── EBS gp3 PVCs / EFS for shared / S3 for objects
Secrets ── Secrets Store CSI / HashiCorp VSO → Vault / Secrets Manager
Images ── ECR (enhanced scanning, image signing) ← CI attestation
Telemetry ─ CloudWatch Container Insights + ADOT → Datadog / AMP+AMG
Audit ── CloudWatch Logs (90d) + S3 Object Lock (compliance archive)
```
## Failure modes & resilience
**1. Karpenter consolidation evicting under-budgeted pods.** The default `consolidationPolicy: WhenEmptyOrUnderutilized` (named `WhenUnderutilized` before Karpenter 1.0) moves pods aggressively. For long-running stateful workloads, set `WhenEmpty` on the NodePool and define a PodDisruptionBudget (`minAvailable`) so consolidation cannot violate availability. Disruption budgets at the NodePool level cap how many nodes can be voluntarily disrupted at the same time.
**2. Pod Identity Agent crash-loop.** Symptom: pods using the ServiceAccount fail to obtain credentials or get `403 AccessDenied` on AWS calls. Causes: agent DaemonSet pod in CrashLoopBackOff (check `kubectl logs -n kube-system daemonset/eks-pod-identity-agent`), Pod Identity Association pointing at a non-existent IAM role, trust policy that does not allow the `pods.eks.amazonaws.com` principal to call `sts:AssumeRole` and `sts:TagSession`, or IMDS hop limit too low on the node. Auto Mode handles the agent; on managed node groups confirm the agent add-on is healthy.
**3. NodePool pinned to a single AZ.** A zonal disruption (control-plane outage in one AZ, ELB endpoint flap) takes the workload with it. Always include `topology.kubernetes.io/zone In [a, b, c]` in NodePool requirements; combine with `topologySpreadConstraints` on Deployments.
**4. gp3 volume detach during node replacement.** Auto Mode replaces nodes, so StatefulSets with `volumeClaimTemplates` should reference a StorageClass that sets `reclaimPolicy: Retain` (which becomes `persistentVolumeReclaimPolicy` on each provisioned PersistentVolume) and `volumeBindingMode: WaitForFirstConsumer`. Otherwise an in-flight reschedule can race with detach and the pod stays `ContainerCreating` for several minutes.
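A StorageClass along those lines might look like the sketch below. On a StorageClass the field is spelled `reclaimPolicy`; Kubernetes copies it onto each provisioned volume as `persistentVolumeReclaimPolicy`. The class name is a placeholder, and the provisioner shown is the standard EBS CSI driver name; Auto Mode's built-in storage capability may expect a different provisioner string, so check the EKS documentation for your cluster type.

```bash
# Write a gp3 StorageClass that keeps volumes after PVC deletion and
# delays binding until a pod is scheduled (so the volume lands in the pod's AZ).
cat > gp3-retain-storageclass.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain                 # hypothetical name
provisioner: ebs.csi.aws.com       # assumption: EBS CSI driver add-on
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF

# Apply once reviewed (not executed here):
#   kubectl apply -f gp3-retain-storageclass.yaml
```

The StatefulSet then names this class in `volumeClaimTemplates[].spec.storageClassName`.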
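**5. `--max-unavailable` vs PDB collisions.** A Deployment's RollingUpdate strategy combined with a strict PDB (`minAvailable: 100%`) leaves zero allowed disruptions. The rollout itself is not gated by the PDB, but every node drain (node-group update, Karpenter consolidation, Auto Mode node replacement) stalls on eviction. Always set PDB `minAvailable` such that `replicas - minAvailable >= maxUnavailable`.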
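The inequality is easy to check in CI before a manifest ships. A small sketch, where `pdb_allows_drain` is a hypothetical helper name and all three arguments are absolute pod counts (percentages would need converting first):

```bash
# pdb_allows_drain REPLICAS MIN_AVAILABLE MAX_UNAVAILABLE
# Succeeds (exit 0) when replicas - minAvailable >= maxUnavailable.
pdb_allows_drain() {
  [ $(( $1 - $2 )) -ge "$3" ]
}

# 4 replicas, minAvailable 3, maxUnavailable 1: one pod of headroom.
if pdb_allows_drain 4 3 1; then
  echo "ok: headroom covers maxUnavailable"
fi

# 3 replicas, minAvailable 3 (the 100% case): no headroom at all.
if ! pdb_allows_drain 3 3 1; then
  echo "blocked: minAvailable equals replicas"
fi
```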
**6. Cluster Autoscaler vs Karpenter coexistence.** Running both in the same cluster causes thrash. Pick one. Karpenter for new clusters; Cluster Autoscaler only if a vendor product hard-requires it.
**7. EKS minor-version upgrade window.** AWS supports current + 3 prior minors (~14 months). Letting a cluster slip to N-4 forces emergency upgrade across multiple breaking changes. Schedule quarterly minor upgrades; test in a staging cluster first.
## Observability runbook
**Enable control-plane logs on an existing cluster:**
```bash
aws eks update-cluster-config \
  --region eu-west-1 \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
```
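Once logging is on, the 90-day retention mentioned earlier has to be set on the log group, because a new CloudWatch Logs group keeps data indefinitely by default. A sketch that derives the group name EKS uses and prints the retention command; `my-cluster` is the same placeholder as above.

```bash
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"   # placeholder

# EKS writes control-plane logs to /aws/eks/<cluster-name>/cluster.
eks_log_group() {
  printf '/aws/eks/%s/cluster\n' "$1"
}

# Print the retention command for review; pipe the output to sh to run it.
printf 'aws logs put-retention-policy --log-group-name %s --retention-in-days 90\n' \
  "$(eks_log_group "$CLUSTER_NAME")"
```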
**Alarms we ship:**
| Alarm | First action |
| ----------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| `cluster_failed_request_count > 0` (control plane) | Check audit logs for `Forbidden` / `Unauthorized` patterns; review IAM Identity mappings |
| `node_status_condition` Ready=false on any node | `kubectl describe node`; check kubelet, CNI, and SSM agent health |
| Karpenter `nodeclaim_disruption_total` spike | Inspect NodePool consolidation events; verify PDBs are honored |
| `pod_pending_count > 0` for `> 5 min` | `kubectl describe pod` → events; NodePool requirements vs pod tolerations / arch mismatch |
| ECR image-pull error rate | VPC endpoint health for `com.amazonaws.<region>.ecr.dkr`; IAM role `ecr:GetAuthorizationToken` |
| ADOT Collector `otelcol_exporter_send_failed_metric_points` | Backend (AMP / Datadog) reachability; collector resource limits |
**Debug path: "Pod stuck Pending":**
1. `kubectl describe pod <name>` → Events. Most common: `0/N nodes are available: insufficient memory` or `node(s) didn't match Pod's node affinity`.
2. If insufficient resources: confirm Karpenter is provisioning (`kubectl get nodeclaims`); check NodePool `requirements` allow the pod's architecture and instance family.
3. If affinity mismatch: check NodePool labels match pod's `nodeSelector` / `affinity`.
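4. If the pod uses a Pod Identity ServiceAccount: a missing association does not cause `FailedScheduling`; it surfaces as credential errors after the pod starts. Confirm `PodIdentityAssociation` exists for `(cluster, namespace, serviceAccount)`.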
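The first three steps amount to a lookup from scheduler message to next check. A sketch of that mapping as a shell helper; `next_step_for_event` is a hypothetical name, and the patterns are matched against the message text shown under Events in `kubectl describe pod`.

```bash
# next_step_for_event MESSAGE
# Maps a FailedScheduling message to the next check in the debug path.
next_step_for_event() {
  case "$1" in
    *Insufficient*|*insufficient*)
      echo "kubectl get nodeclaims; review NodePool requirements (arch, instance family)" ;;
    *"didn't match Pod's node affinity"*)
      echo "compare NodePool labels with the pod's nodeSelector/affinity" ;;
    *"untolerated taint"*)
      echo "compare NodePool taints with the pod's tolerations" ;;
    *)
      echo "read the full Events section: kubectl describe pod <name>" ;;
  esac
}

next_step_for_event "0/3 nodes are available: 3 Insufficient memory."
next_step_for_event "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector."
```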
**Debug path: "Node not ready":**
1. `kubectl describe node <node>` → Conditions section. `MemoryPressure`, `DiskPressure`, `PIDPressure` are first signals.
2. CloudWatch Container Insights → node detail → kubelet logs.
3. VPC CNI: `kubectl logs -n kube-system -l k8s-app=aws-node` for IP exhaustion or ENI attach failures.
4. If on Auto Mode, the node will be replaced automatically — confirm replacement is in progress before manual intervention.
## When EKS is NOT the right call
- Small, simple container workload with 1–3 services and a team unfamiliar with Kubernetes — **Amazon ECS on Fargate** has a fraction of the operational surface and is often the better first step.
- Entirely event-driven or short-lived workload — **AWS Lambda** or ECS Fargate spot often costs less and simplifies ops.
- You have no plans to leverage Kubernetes portability or ecosystem — the $73/month per-cluster plus learning-curve tax is real.
- You need air-gapped operation with no AWS dependency — evaluate **EKS Anywhere** or upstream Kubernetes on bare metal.
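The $73/month figure is the $0.10/hour control-plane price multiplied by an average month of 730 hours. A one-line check, where `monthly_control_plane_cost` is a hypothetical helper name:

```bash
# monthly_control_plane_cost HOURLY_RATE [HOURS_PER_MONTH]
# Defaults to 730 hours, the usual average-month convention.
monthly_control_plane_cost() {
  awk -v rate="$1" -v hours="${2:-730}" 'BEGIN { printf "%.2f\n", rate * hours }'
}

monthly_control_plane_cost 0.10   # prints 73.00
```

The figure is per cluster, so separate dev, staging, and production clusters pay it three times before any node cost.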
## EKS best practices
**Resource management**
- Always set `requests` and `limits`. Use Vertical Pod Autoscaler recommendations to size requests.
- Use pod disruption budgets; align with Karpenter disruption budgets to avoid correlated voluntary disruptions.
**Auto-scaling**
- Karpenter (or Auto Mode's managed Karpenter) preferred over Cluster Autoscaler for new clusters.
- HPA on CPU/memory/custom metrics (Prometheus) for pod-level scaling.
- KEDA for event-driven autoscaling (SQS, Kinesis, Kafka lag).
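For the HPA case, a minimal `autoscaling/v2` manifest scaling on CPU utilization could look like this. The Deployment name `my-app`, the replica bounds, and the 70% target are illustrative assumptions rather than recommended values.

```bash
# Write an HPA that keeps average CPU utilization near 70% of requests.
cat > my-app-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                     # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                   # assumes a Deployment with this name
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the pods' CPU requests
EOF

# Apply once reviewed (not executed here):
#   kubectl apply -f my-app-hpa.yaml
```

Utilization targets are computed against `requests`, which is one more reason to set them on every container.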
**Security**
- Pod Identity for pod-level IAM.
- Network policies via Cilium/Calico; restrict egress by default.
- Kubernetes secrets encrypted with a customer-managed KMS key.
- Pair with **HashiCorp Vault Secrets Operator** or AWS Secrets Manager + Secrets Store CSI driver for application secrets.
- ECR enhanced scanning + image signing verified at admission.
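"Restrict egress by default" can be sketched with a standard Kubernetes NetworkPolicy that denies all egress from a namespace except DNS. The `production` namespace is a placeholder, and the DNS rule assumes cluster DNS pods run in `kube-system`; clusters where DNS is served differently need a different allow rule. The policy only takes effect when a network-policy engine is enabled.

```bash
# Write a default-deny egress policy with a DNS exception.
cat > default-deny-egress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production            # placeholder namespace
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF

# Apply once reviewed (not executed here):
#   kubectl apply -f default-deny-egress.yaml
```

Each workload then gets its own additive policy for the destinations it actually needs.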
**Reliability**
- Multi-AZ NodePools; never pin a production NodePool to a single AZ.
- Backups of cluster state (Velero) for stateful apps or CRD-heavy control-plane configuration.
- Routine disaster-recovery tests of cluster re-creation from IaC.
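Multi-AZ NodePools only help if the pods themselves are spread across zones. A sketch of a Deployment that does so with `topologySpreadConstraints`; the name, replica count, and container image are placeholders.

```bash
# Write a Deployment whose replicas are spread evenly across availability zones.
cat > my-app-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                     # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1               # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: nginx:stable      # placeholder image
EOF

# Apply once reviewed (not executed here):
#   kubectl apply -f my-app-deployment.yaml
```

With `DoNotSchedule`, a pod stays Pending rather than piling into one zone; `ScheduleAnyway` trades that guarantee for availability during a zonal shortage.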
## Related reading
- [ECS vs EKS: container orchestration decision guide](/blog/aws-ecs-vs-eks-container-orchestration-decision-guide/)
- [Karpenter vs Cluster Autoscaler on EKS: cost optimization](/blog/karpenter-vs-cluster-autoscaler-eks-cost-optimization/)
- [How to deploy EKS with Karpenter for cost-optimized autoscaling](/blog/how-to-deploy-eks-karpenter-cost-optimized-autoscaling/)
## Related services
- [AWS Application Modernization](/services/aws-application-modernization/)
- [DevOps Pipeline Setup](/services/devops-pipeline-setup/)
- [Hire a Dedicated AWS Expert](/services/hire-a-dedicated-aws-expert/)