How to Optimize EC2 for High-Performance APIs

Cloud Architecture · Palaniappan P · 14 min read

Quick summary: A technical deep dive into EC2 performance optimization for API workloads — covering instance family selection, Graviton vs x86 economics, network tuning, EBS configuration, and Linux kernel parameters that directly impact throughput and tail latency.


Most EC2 performance problems are not instance size problems. They are configuration problems — kernel defaults tuned for general-purpose workloads, instance families chosen by familiarity rather than workload fit, and network topology that adds avoidable latency. A misconfigured c5.2xlarge will consistently underperform a correctly tuned t3.large on API workloads that are connection-rate-bound rather than compute-bound.

This guide covers the full stack of EC2 optimization for API servers: instance selection, network configuration, EBS tuning, OS-level kernel parameters, and the failure modes that appear under sustained production load.

Instance Family Selection: Graviton3/4 vs x86

The Graviton Economics Case

AWS Graviton processors are AWS-designed ARM64 chips. Graviton3 (launched 2022) powers the m7g, c7g, and r7g families. Graviton4 (launched 2024) powers m8g, c8g, and r8g. For API workloads, the comparison against equivalent x86 instances (Intel Ice Lake, AMD Genoa) consistently favors Graviton on price/performance.

Actual benchmark data for a stateless JSON API (NestJS, 1000 concurrent connections):

| Instance | vCPU | Memory | On-Demand $/hr | Req/sec | Cost per million reqs |
|----------|------|--------|----------------|---------|-----------------------|
| c5.xlarge (Intel) | 4 | 8 GB | $0.170 | 42,000 | $1.12 |
| c6g.xlarge (Graviton2) | 4 | 8 GB | $0.136 | 48,000 | $0.79 |
| c7g.xlarge (Graviton3) | 4 | 8 GB | $0.1448 | 58,000 | $0.69 |
| c7i.xlarge (Intel Sapphire Rapids) | 4 | 8 GB | $0.1785 | 52,000 | $0.95 |
| c8g.xlarge (Graviton4) | 4 | 8 GB | $0.1448 | 63,000 | $0.64 |

Graviton4 (c8g) delivers ~50% more requests per dollar than the equivalent Intel instance for this workload. The gap varies by workload:

  • I/O-bound APIs (waiting on database, Redis, external HTTP calls): 30–45% better price/performance on Graviton
  • CPU-bound APIs (heavy JSON serialization, cryptography, compression): 20–35% better price/performance
  • Memory-bound workloads (large in-process caches): ~25% better price/performance
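These ratios are straightforward to recompute from the benchmark table. A quick sketch of requests per dollar-hour for the two newest instances, using the table's numbers:

```shell
# Requests per dollar-hour, from the benchmark table's figures
awk 'BEGIN {
  c7i = 52000 / 0.1785   # c7i.xlarge: req/sec divided by on-demand $/hr
  c8g = 63000 / 0.1448   # c8g.xlarge
  printf "c7i: %.0f  c8g: %.0f  ratio: %.2f\n", c7i, c8g, c8g / c7i
}'
```

The 1.49 ratio is where the ~50% price/performance gap quoted above comes from.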

When x86 Wins

Graviton is not universally superior. Specific cases where x86 remains the right choice:

Software constraints: Some compiled libraries do not publish ARM64 builds. This is increasingly rare in 2026 — most major open-source projects support ARM64 — but custom compiled dependencies (proprietary SDKs, some ML inference libraries) may still be x86-only.

Instruction-set-specific workloads: AVX-512 instructions on Intel Ice Lake outperform Graviton for specific numerical computation patterns (FFT, matrix operations) that can be expressed as AVX-512 vectorized operations. If your API includes heavy numerical processing, benchmark both architectures.

Existing Reserved Instance commitments: If your team purchased 1-year or 3-year Reserved Instances for x86 families, switching to Graviton immediately forfeits that commitment. Evaluate Graviton adoption timing against Reserved Instance expiry dates.

How to Benchmark Your Workload

Do not assume Graviton will improve your specific API without testing. The methodology:

  1. Build a multi-arch Docker image (--platform linux/amd64,linux/arm64 with Docker Buildx)
  2. Deploy identical application versions to same-size c7i and c8g instances
  3. Run load tests with realistic traffic patterns (not synthetic max-throughput benchmarks)
  4. Measure p50, p95, p99 latency and cost per 1000 requests
  5. Account for Reserved Instance pricing, not just On-Demand
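Step 4's percentile math is worth being explicit about, since p99 drives the decision more than the mean. A toy nearest-rank percentile sketch (the latency samples are made up; real load tools like wrk2 or k6 report percentiles directly):

```shell
# Nearest-rank percentile over a made-up latency sample (milliseconds)
printf '%s\n' 12 15 11 14 90 13 12 45 13 14 | sort -n | awk '
  { v[NR] = $0 }
  # idx = ceil(NR * p / 100), the classic nearest-rank definition
  function pct(p,   i) { i = int((NR * p + 99) / 100); return v[i] }
  END { printf "p50=%sms p95=%sms p99=%sms\n", pct(50), pct(95), pct(99) }'
```

Note how one 90 ms outlier in ten samples dominates both p95 and p99 while leaving p50 untouched.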

For most teams, the decision is straightforward: default to Graviton for new workloads, migrate existing workloads at the next Reserved Instance renewal.

CPU and Compute Optimization

T3/T4g: Burstable Instances Under Production Load

T-series instances (T3 on x86, T4g on Graviton) use a credit-based CPU model. Each instance accumulates credits at a baseline rate proportional to its size and spends them when CPU utilization exceeds the baseline.

| Instance | Baseline CPU | Credit earn rate | Max burst duration |
|----------|--------------|------------------|--------------------|
| t3.small | 20% of 2 vCPU | 12 credits/hr | ~2.5 hrs at 100% |
| t3.medium | 20% of 2 vCPU | 24 credits/hr | ~5 hrs at 100% |
| t3.large | 30% of 2 vCPU | 36 credits/hr | ~6 hrs at 100% |
| t4g.medium | 20% of 2 vCPU | 24 credits/hr | ~5 hrs at 100% |

The failure mode: An API server on a T3 instance handling a gradual traffic ramp exhausts its credit balance over 4–6 hours. Once credits are exhausted, the instance throttles to baseline — 20–30% of nominal CPU performance. API latency increases 3–5x. The CloudWatch CPUCreditBalance metric approaching zero is the signal.

T-series in production: enable Unlimited mode. T3 Unlimited and T4g Unlimited allow sustained above-baseline CPU consumption at a charge of $0.05 per vCPU-hour for T3 or $0.04 for T4g. For a t3.medium running 24 hours above baseline, that is an additional $2.40/day — still often cheaper than the next instance size for bursty workloads.

When to move to fixed performance: If your CPUCreditBalance stays near zero for more than 4 hours/day consistently, the T-series is the wrong family. Move to a c7g or m7g where performance is deterministic.
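The exhaustion timeline is predictable arithmetic: one credit is one vCPU running at 100% for one minute. A sketch for a fully charged t3.medium under sustained 60% utilization (the 576-credit cap assumes the documented 24-hour accrual limit):

```shell
# Hours until a fully charged t3.medium hits zero credits at sustained 60% CPU.
# One CPU credit = one vCPU at 100% for one minute.
awk 'BEGIN {
  vcpus = 2; earn = 24; util_pct = 60   # t3.medium: 2 vCPU, 24 credits/hr
  cap   = earn * 24                     # max balance = 24h of accrual = 576
  spend = vcpus * util_pct * 60 / 100   # credits burned per hour at 60% CPU
  drain = spend - earn                  # net drain per hour
  printf "spend=%d/hr drain=%d/hr -> empty in %.1f hours\n", spend, drain, cap / drain
}'
```

At higher sustained utilization the drain accelerates sharply, which is why a traffic ramp can empty the balance in the 4-6 hour window described above.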

Dedicated Instances vs Shared Tenancy

By default, EC2 instances run on hardware shared with other customers (shared tenancy). Dedicated instances run on hardware isolated to your AWS account. Dedicated hosts give you visibility into the physical host’s socket/core topology.

For API workloads, shared tenancy is correct. Dedicated instances cost 10–15% more and the isolation benefit is regulatory compliance, not performance. The “noisy neighbor” concern in modern AWS is largely addressed at the hypervisor level — you will not see other customers’ workloads impacting your CPU.

The exception is memory-bandwidth-intensive workloads where NUMA awareness matters — covered in the memory optimization section.

Network Optimization

Enhanced Networking and ENA

Enhanced Networking with the Elastic Network Adapter (ENA) is enabled by default on all current-generation EC2 instances. ENA provides:

  • Up to 100 Gbps network bandwidth (on compute-optimized instances)
  • Significantly lower per-packet CPU overhead vs legacy virtio drivers
  • Jumbo frame support (9001 MTU vs 1500 MTU standard)

For API workloads, ENA matters most when:

  • You have high connection rates (thousands of new connections per second)
  • You transfer large response payloads (>1MB per response)
  • You make many concurrent outbound connections to RDS, ElastiCache, or other EC2 services

Verify ENA is active:

ethtool -i eth0 | grep driver
# Should show: driver: ena

If you are on a legacy instance type still using the ixgbevf driver, migrating to a current-generation instance will improve both throughput and CPU efficiency on networking operations.
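Beyond checking the driver, recent ENA driver versions expose allowance-exceeded counters through ethtool -S that show whether the instance's own network limits (bandwidth, packets per second, conntrack) are being hit. A sketch that flags non-zero counters; the sample output below is illustrative:

```shell
# On a live instance: ethtool -S eth0 | grep allowance_exceeded
# Captured sample output, for illustration:
sample='     bw_in_allowance_exceeded: 0
     bw_out_allowance_exceeded: 1493
     pps_allowance_exceeded: 0
     conntrack_allowance_exceeded: 0'

# Print only counters that have actually fired
echo "$sample" | awk -F': ' '$2 > 0 { gsub(/^ +/, "", $1); print $1 " = " $2 }'
```

A steadily climbing counter here means the instance, not your application, is the bottleneck; the fix is a larger instance or ENA Express, not kernel tuning.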

Placement Groups

Cluster placement groups colocate instances on the same low-latency physical rack. Network latency between instances in a cluster placement group drops to 50–100 microseconds, versus 200–500 microseconds for instances in the same AZ without placement group constraints.

When this matters for APIs:

  • Synchronous calls between API tier and a Redis/Valkey cluster on EC2
  • Internal RPC between microservices on EC2 where p99 latency is critical
  • High-throughput database connections to RDS on EC2 (not RDS managed service)

Creating a cluster placement group and deploying instances into it via Terraform:

resource "aws_placement_group" "api_cluster" {
  name     = "api-cluster-pg"
  strategy = "cluster"

  tags = {
    Environment = var.environment
  }
}

resource "aws_instance" "api_server" {
  count             = var.instance_count
  ami               = data.aws_ami.amazon_linux_2023.id
  instance_type     = "c8g.2xlarge"
  placement_group   = aws_placement_group.api_cluster.id
  subnet_id         = var.private_subnet_ids[0]  # Must be same AZ

  # ENA is enabled by default on current-generation Nitro types like c8g;
  # note there is no ena_support argument on aws_instance (it exists on aws_ami)

  root_block_device {
    volume_type           = "gp3"
    volume_size           = 30
    throughput            = 125
    iops                  = 3000
    encrypted             = true
    delete_on_termination = true
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  # IMDSv2 required
    http_put_response_hop_limit = 1
  }

  user_data = base64encode(templatefile("${path.module}/user-data.sh", {
    environment = var.environment
  }))

  tags = {
    Name        = "api-server-${count.index}"
    Environment = var.environment
  }
}

Cluster placement group constraints:

  • All instances must be in the same AZ — this is a hard requirement
  • The AZ must have capacity to launch all instances simultaneously; large placement groups on popular instance types can fail with InsufficientInstanceCapacity
  • Not all instance types support cluster placement groups (verify with aws ec2 describe-instance-type-offerings)

Spread placement groups place each instance on distinct underlying hardware. A spread placement group can span multiple AZs. Use spread for stateful services (primary databases, stateful cache nodes) where hardware failure isolation matters more than network latency.

Partition placement groups divide instances into logical partitions, each on separate rack hardware. Use for distributed systems (Kafka, Cassandra, Elasticsearch) that need topology awareness for rack-aware replica placement.
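For completeness, the other two strategies use the same Terraform resource with a different strategy value. A sketch (resource names are illustrative):

```hcl
# Spread: each instance on distinct hardware; may span AZs
resource "aws_placement_group" "stateful_spread" {
  name     = "stateful-spread-pg"
  strategy = "spread"
}

# Partition: groups of instances on separate racks, for topology-aware systems
resource "aws_placement_group" "kafka_partition" {
  name            = "kafka-partition-pg"
  strategy        = "partition"
  partition_count = 3
}
```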

NIC Tuning: Receive Side Scaling

For very high-connection-rate APIs on instances with multiple vCPUs, Receive Side Scaling (RSS) distributes incoming packets across CPU cores. On ENA, this is handled automatically. However, interrupt affinity tuning can improve performance further:

# Show current interrupt CPU affinity for eth0
cat /proc/interrupts | grep eth0

# Option A: let irqbalance distribute network interrupts across cores
systemctl enable --now irqbalance

# Option B: pin network interrupts manually (e.g., to CPUs on NUMA node 0).
# Disable irqbalance first; otherwise it will overwrite manual affinity settings.
systemctl disable --now irqbalance
for irq in $(grep eth0 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
  echo 0f > /proc/irq/$irq/smp_affinity  # CPU mask 0f = first 4 CPUs
done

EBS Optimization

gp3 Throughput and IOPS Tuning

gp3 is the current-generation general-purpose SSD volume type. Unlike gp2 (where IOPS scales automatically with volume size), gp3 decouples performance from capacity:

  • Baseline: 3,000 IOPS and 125 MB/s throughput at any size, included in the base price
  • Maximum: 16,000 IOPS and 1,000 MB/s throughput for additional cost
  • Price per GB: $0.08/GB-month (same as gp2)
  • Additional IOPS: $0.005 per provisioned IOPS-month above 3,000
  • Additional throughput: $0.04 per MB/s-month above 125

For API servers, the OS root volume typically does not need additional IOPS provisioning — API servers are compute and network bound, not disk bound. Where gp3 tuning matters:

Application logs: If your application writes high-volume logs to disk (not recommended in containers, but common on EC2), provision 6,000–9,000 IOPS to prevent log write latency from adding to request processing time.

Swap space: PHP and Python applications under memory pressure will use swap. gp3 at 3,000 IOPS delivers swap I/O at roughly 12 MB/s (random 4K writes). This is slow enough that swap usage causes measurable API latency degradation. Monitor swap_used_percent via the CloudWatch agent; if your instances are hitting swap, add memory before provisioning more IOPS.
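That ceiling is worth sanity-checking: random 4K throughput is just IOPS times block size.

```shell
# Effective throughput of random 4 KiB I/O at the gp3 baseline of 3,000 IOPS
awk 'BEGIN { iops = 3000; block = 4096; printf "%.1f MB/s\n", iops * block / 1e6 }'
```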

Ephemeral data stores: If your API maintains a local SQLite database, local embedding index, or similar disk-resident data structure, provision additional IOPS on a separate data volume:

resource "aws_ebs_volume" "api_data" {
  availability_zone = "us-east-1a"
  type              = "gp3"
  size              = 100
  iops              = 6000
  throughput        = 250
  encrypted         = true

  tags = {
    Name = "api-data-volume"
  }
}
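At the list prices quoted above, this volume's monthly cost breaks down as follows (a sketch; verify current regional pricing):

```shell
# gp3 monthly cost: 100 GB, 6,000 IOPS, 250 MB/s (prices from this article)
awk 'BEGIN {
  storage = 100 * 0.08                # $0.08 per GB-month
  iops    = (6000 - 3000) * 0.005    # IOPS above the 3,000 baseline
  tput    = (250 - 125) * 0.04       # MB/s above the 125 baseline
  printf "storage=$%.2f iops=$%.2f throughput=$%.2f total=$%.2f/month\n",
         storage, iops, tput, storage + iops + tput
}'
```

Note that the provisioned IOPS dominate the bill; provision them only on volumes that demonstrably need them.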

io2 and NVMe Instance Store

io2 volumes are appropriate when you need consistent sub-millisecond IOPS latency with durability guarantees. For API servers, io2 is rarely justified — the cost ($0.125/GB-month + $0.065 per IOPS-month) is substantial, and most API disk I/O patterns do not need io2 consistency guarantees.

NVMe instance store is included with certain instance families (i4g, im4gn, is4gen) and offers extremely high IOPS at effectively zero additional cost. The critical caveat: instance store is ephemeral — data is lost when the instance stops. Use instance store for:

  • Read-through caches (warmed from a durable source on startup)
  • Temporary file processing (image resizing, document conversion)
  • Local buffer before writing to S3 or EBS

Never use instance store as a primary data store without a durability strategy.

OS-Level Linux Tuning for APIs

The Linux kernel ships with defaults tuned for general-purpose workloads and conservative resource usage. API servers under production load hit several of these defaults as bottlenecks before running out of CPU or memory.

sysctl Parameters

# /etc/sysctl.d/99-api-server.conf
# Apply with: sysctl -p /etc/sysctl.d/99-api-server.conf

# ============================================================
# TCP Connection Handling
# ============================================================

# Accept queue size per socket — default 128, causes SYN drops under burst
net.core.somaxconn = 65535

# SYN backlog — half-open connections waiting for three-way handshake
net.ipv4.tcp_max_syn_backlog = 65535

# Allow TIME_WAIT socket reuse for new outbound connections
# Eliminates "cannot assign requested address" errors under high egress connection rates
net.ipv4.tcp_tw_reuse = 1

# TIME_WAIT duration is fixed at 60 seconds (2*MSL) on Linux and is not
# tunable via sysctl; tcp_fin_timeout affects FIN_WAIT_2, not TIME_WAIT.
# tcp_tw_reuse above is the correct lever for TIME_WAIT pressure

# Ephemeral port range — default 32768-60999 (28k ports)
# At 1000 connections/second, this exhausts in 28 seconds
net.ipv4.ip_local_port_range = 1024 65535

# Keepalive tuning — detect dead connections faster
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6

# ============================================================
# Network Buffer Sizes
# ============================================================

# Default and maximum receive/send socket buffer sizes
# Default (212992) is adequate for most APIs; increase for high-bandwidth streaming
net.core.rmem_default = 212992
net.core.wmem_default = 212992
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728

# TCP auto-tuning ranges (min, default, max in bytes)
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

# Packet receive queue — prevents drops at high packet rates
net.core.netdev_max_backlog = 16384

# ============================================================
# File Descriptor Limits
# ============================================================

# System-wide maximum open files
# Default 1048576 on Amazon Linux 2023; explicit for clarity
fs.file-max = 2097152

# Inotify watches — needed for apps using filesystem event watching
fs.inotify.max_user_watches = 524288

# ============================================================
# Memory Management
# ============================================================

# Disable swap use until almost full — APIs should not hit swap
vm.swappiness = 10

# Control how aggressively the kernel writes dirty pages to disk
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5

Apply these settings and persist across reboots:

sysctl -p /etc/sysctl.d/99-api-server.conf
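The ephemeral port comment in the config deserves a number. With TIME_WAIT holding each outbound port for 60 seconds, steady-state demand at 1,000 connections/second to a single destination is 60,000 ports:

```shell
# Outbound ports needed at steady state vs. what each range provides
awk 'BEGIN {
  rate = 1000; tw = 60              # conns/sec and TIME_WAIT hold time (s)
  def  = 60999 - 32768 + 1          # default ip_local_port_range
  wide = 65535 - 1024 + 1           # widened range from this config
  printf "need %d ports; default=%d (exhausts in %.0fs), widened=%d\n",
         rate * tw, def, def / rate, wide
}'
```

Even the widened range only barely covers this rate, which is why tcp_tw_reuse is set alongside it.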

For per-process file descriptor limits, set in /etc/security/limits.d/99-api-server.conf:

* soft nofile 65535
* hard nofile 65535
* soft nproc  65535
* hard nproc  65535

And verify your application process actually has elevated limits:

# Check limits of running process (replace PID)
cat /proc/$(pgrep -f "gunicorn|node|php-fpm" | head -1)/limits | grep "Max open files"

Huge Pages for Memory-Bound Workloads

Transparent Huge Pages (THP) are enabled by default on Amazon Linux but can cause latency spikes in some workloads due to page compaction pauses. For most API servers, disable THP:

# Disable THP (survives reboot via rc.local or systemd)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Redis, MongoDB, and other databases explicitly recommend disabling THP. For Java-based APIs (Spring Boot, Quarkus), explicit huge pages (vm.nr_hugepages) can reduce GC overhead — but this is workload-specific and requires measurement.
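To persist the THP setting across reboots, one option is a small systemd unit (the unit name and path here are illustrative):

```ini
# /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
After=sysinit.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=multi-user.target
```

Activate with systemctl daemon-reload followed by systemctl enable --now disable-thp.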

Disabling IRQ Balance for NUMA Workloads

On multi-socket EC2 instances (large instances with multiple NUMA nodes, m5.metal and similar), applications that process requests on a CPU core while network interrupts land on a different NUMA node pay a cross-NUMA memory access penalty. For latency-critical APIs:

# Identify NUMA topology
numactl --hardware

# Pin application process to NUMA node 0
numactl --cpunodebind=0 --membind=0 node dist/main.js

Most EC2 instance types up to 8xlarge are single-NUMA. Beyond that, NUMA topology becomes relevant.

CloudWatch Agent for Custom CPU and Memory Metrics

EC2 does not report memory utilization to CloudWatch by default — only CPU utilization is available natively. Install the CloudWatch agent to report memory, disk, and custom API metrics.

{
  "agent": {
    "metrics_collection_interval": 30,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": [
          "mem_used_percent",
          "mem_available_percent",
          "mem_used",
          "mem_total"
        ],
        "metrics_collection_interval": 30
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "metrics_collection_interval": 60,
        "resources": ["/", "/data"]
      },
      "net": {
        "measurement": [
          "net_bytes_recv",
          "net_bytes_sent",
          "net_packets_recv",
          "net_packets_sent",
          "net_drop_in",
          "net_drop_out"
        ],
        "metrics_collection_interval": 30,
        "resources": ["eth0"]
      },
      "netstat": {
        "measurement": [
          "tcp_established",
          "tcp_time_wait",
          "tcp_close_wait"
        ],
        "metrics_collection_interval": 30
      },
      "processes": {
        "measurement": [
          "running",
          "sleeping",
          "dead"
        ]
      }
    }
  }
}

The netstat metrics are particularly valuable for API debugging. A growing tcp_time_wait count indicates high connection turnover — a candidate for HTTP keepalive tuning. A growing tcp_close_wait count indicates the application is not closing connections promptly.
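The same states can be spot-checked locally with ss before the agent is wired up. A sketch run against captured sample output (on a live host, pipe ss -ant straight in):

```shell
# Count TCP connections by state -- the local equivalent of the netstat metrics
sample='State  Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB  0      0      10.0.1.5:8080      10.0.2.9:41622
TIME-WAIT 0   0      10.0.1.5:44318     10.0.3.7:6379
TIME-WAIT 0   0      10.0.1.5:44320     10.0.3.7:6379
CLOSE-WAIT 0  0      10.0.1.5:8080      10.0.2.11:53104'

# Live version: ss -ant | awk the same way
echo "$sample" | awk 'NR > 1 { c[$1]++ } END { for (s in c) print s, c[s] }' | sort
```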

Terraform to deploy the CloudWatch agent config via SSM:

resource "aws_ssm_parameter" "cloudwatch_config" {
  name  = "/cloudwatch-agent/config/api-server"
  type  = "String"
  value = file("${path.module}/cloudwatch-agent-config.json")
}

resource "aws_iam_role_policy_attachment" "cloudwatch_agent_policy" {
  role       = aws_iam_role.ec2_api_role.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

resource "aws_iam_role_policy_attachment" "ssm_policy" {
  role       = aws_iam_role.ec2_api_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

Edge Cases Under Sustained Load

Noisy Neighbor Mitigation

In 2026, the AWS Nitro hypervisor provides strong CPU isolation between EC2 instances. True CPU-level noisy neighbor effects are rare. Network noisy neighbors can still occur on shared network infrastructure in the same AZ. Symptoms: net_drop_in metrics spike without corresponding traffic increase on your instance; p99 latency increases without CPU or memory pressure.

Mitigation options:

  • Move to dedicated hosts (guarantees isolated network infrastructure, significant cost increase)
  • Enable ENA Express (uses SRD protocol for single-digit microsecond latency and better throughput consistency) on supported instance types
  • Spread instances across multiple AZs — likely placing them on different physical infrastructure

Burst Credit Exhaustion Under Sustained Load

Symptoms on T-series instances: CPU utilization in CloudWatch shows a sudden drop to 20–30% while the application reports increasing latency. The CPUCreditBalance metric will show the credit exhaustion event.

Immediate remediation: enable T3 Unlimited via the console or CLI:

aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications \
  '[{"InstanceId":"i-xxxxx","CpuCredits":"unlimited"}]'

This does not require a reboot and takes effect within minutes.

CPU Steal

CPU steal (%st in top, cpu_usage_steal in the CloudWatch agent) represents time your vCPU waited for the hypervisor to schedule it on a physical CPU. Non-zero steal indicates the physical host is oversubscribed.

In modern AWS on Nitro, steal should be at or near zero for normal workloads. Non-zero steal on current-generation instances is unusual and warrants a support ticket. If you consistently see >2% steal:

  1. Stop and start the instance (migrates to different hardware, not a reboot)
  2. If steal persists after migration, open an AWS support case

On legacy Xen-based instances (C3, M3, older families), steal was a more common issue. Migration to Nitro-based instances eliminates this class of problem.
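Steal can also be read directly from /proc/stat, where the 9th field of the cpu line is steal jiffies. A sketch against a sample line (on a live host, read /proc/stat itself):

```shell
# Sum all jiffies on the aggregate cpu line; field 9 is steal
line='cpu  74608 2520 24433 1117073 6176 4054 0 11372 0 0'
# Live version: head -1 /proc/stat piped into the same awk
echo "$line" | awk '{ t = 0; for (i = 2; i <= NF; i++) t += $i
                      printf "steal: %.2f%%\n", 100 * $9 / t }'
```

Anything consistently above the 2% threshold mentioned above is worth the stop/start migration.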

For related EC2 and container scaling strategies, see our AWS Auto Scaling Strategies for EC2, ECS, and Lambda.

For cost optimization across your EC2 fleet, see the AWS Cost Control Architecture and Optimization Playbook.

For CloudWatch metrics and alarm configuration, see CloudWatch Observability: Metrics, Logs, and Alarms Best Practices.
