---
title: How to Optimize EC2 for High-Performance APIs
description: A technical deep dive into EC2 performance optimization for API workloads — covering instance family selection, Graviton vs x86 economics, network tuning, EBS configuration, and Linux kernel parameters that directly impact throughput and tail latency.
url: https://www.factualminds.com/blog/ec2-high-performance-api-optimization/
datePublished: 2026-03-29T00:00:00.000Z
dateModified: 2026-04-16T00:00:00.000Z
author: palaniappan-p
category: Cloud Architecture
tags: how-to-guide, ec2, aws, aws-performance-optimization, graviton, performance, api, linux-tuning, networking, ebs, placement-groups
---

# How to Optimize EC2 for High-Performance APIs

> A technical deep dive into EC2 performance optimization for API workloads — covering instance family selection, Graviton vs x86 economics, network tuning, EBS configuration, and Linux kernel parameters that directly impact throughput and tail latency.

Most EC2 performance problems are not instance size problems. They are configuration problems: kernel defaults tuned for general-purpose workloads, instance families chosen by familiarity rather than workload fit, and network topology that adds avoidable latency. A misconfigured c5.2xlarge can underperform a correctly tuned t3.large on API workloads that are connection-rate-bound rather than compute-bound.

This guide covers the full stack of EC2 optimization for API servers: instance selection, network configuration, EBS tuning, OS-level kernel parameters, and the failure modes that appear under sustained production load.

## Instance Family Selection: Graviton3/4 vs x86

### The Graviton Economics Case

**AWS Graviton** processors are AWS-designed ARM64 chips. Graviton3 (launched 2022) powers the `m7g`, `c7g`, and `r7g` families. Graviton4 (launched 2024) powers `m8g`, `c8g`, and `r8g`. For API workloads, the comparison against equivalent x86 instances (Intel Ice Lake, AMD Genoa) consistently favors Graviton on price/performance.

**Actual benchmark data for a stateless JSON API (NestJS, 1000 concurrent connections):**

| Instance                           | vCPU | Memory | On-Demand $/hr | Req/sec | Cost per billion reqs |
| ---------------------------------- | ---- | ------ | -------------- | ------- | --------------------- |
| c5.xlarge (Intel)                  | 4    | 8GB    | $0.170         | 42,000  | $1.12                 |
| c6g.xlarge (Graviton2)             | 4    | 8GB    | $0.136         | 48,000  | $0.79                 |
| c7g.xlarge (Graviton3)             | 4    | 8GB    | $0.1448        | 58,000  | $0.69                 |
| c7i.xlarge (Intel Sapphire Rapids) | 4    | 8GB    | $0.1785        | 52,000  | $0.95                 |
| c8g.xlarge (Graviton4)             | 4    | 8GB    | $0.1448        | 63,000  | $0.64                 |

Graviton4 (`c8g`) delivers ~50% more requests per dollar than the equivalent current-generation Intel instance (`c7i`) for this workload. The gap varies by workload:

- **I/O-bound APIs** (waiting on database, Redis, external HTTP calls): 30–45% better price/performance on Graviton
- **CPU-bound APIs** (heavy JSON serialization, cryptography, compression): 20–35% better price/performance
- **Memory-bound workloads** (large in-process caches): ~25% better price/performance
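The cost column follows from the hourly price and measured throughput: cost per billion requests = price per hour / (requests per second × 3,600) × 10⁹. A minimal shell sketch of that arithmetic, using `awk` for the floating-point step (`cost_per_billion` is an illustrative name, not a standard tool):

```bash
# Cost in USD of serving one billion requests at a sustained request rate.
# Args: on-demand price in $/hr, measured requests per second.
cost_per_billion() {
  awk -v price="$1" -v rps="$2" 'BEGIN {
    reqs_per_hour = rps * 3600
    printf "%.2f\n", price / reqs_per_hour * 1e9
  }'
}

# Example with figures from the table:
# cost_per_billion 0.170 42000    # c5.xlarge
# cost_per_billion 0.1448 63000   # c8g.xlarge
```

With the table's figures, c5.xlarge comes out at 1.12 and c8g.xlarge at 0.64. Requests per dollar is the reciprocal, so the c7i-to-c8g comparison is 0.95 / 0.64 ≈ 1.48, which is where the ~50% figure comes from.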

### When x86 Wins

Graviton is not universally superior. Specific cases where x86 remains the right choice:

**Software constraints:** Some compiled libraries do not publish ARM64 builds. This is increasingly rare in 2026 — most major open-source projects support ARM64 — but custom compiled dependencies (proprietary SDKs, some ML inference libraries) may still be x86-only.

**Instruction-set-specific workloads:** AVX-512 instructions on Intel Ice Lake outperform Graviton for specific numerical computation patterns (FFT, matrix operations) that can be expressed as AVX-512 vectorized operations. If your API includes heavy numerical processing, benchmark both architectures.

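**Existing Reserved Instance commitments:** Standard Reserved Instances for x86 families are billed for the full term whether or not the capacity is used, so moving those workloads to Graviton early means paying for both. Convertible Reserved Instances can be exchanged for Graviton families, and Compute Savings Plans apply across instance families. Evaluate Graviton adoption timing against commitment expiry dates.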

### How to Benchmark Your Workload

Do not assume Graviton will improve your specific API without testing. The methodology:

1. Build a multi-arch Docker image (`--platform linux/amd64,linux/arm64` with Docker Buildx)
2. Deploy identical application versions to same-size `c7i` and `c8g` instances
3. Run load tests with realistic traffic patterns (not synthetic max-throughput benchmarks)
4. Measure p50, p95, p99 latency and cost per 1000 requests
5. Account for Reserved Instance pricing, not just On-Demand
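For step 4, most load-testing tools report percentiles directly. If yours can only export raw per-request latencies (assumed here to be one number per line), a nearest-rank percentile is easy to compute with standard tools. A minimal sketch; `percentile` is a hypothetical helper:

```bash
# Nearest-rank percentile of newline-separated latency samples on stdin.
# Arg: integer percentile (e.g. 50, 95, 99).
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                # no samples
      rank = int((p * NR + 99) / 100)    # ceil(p/100 * N) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# for p in 50 95 99; do echo "p$p: $(percentile "$p" < latencies_ms.txt) ms"; done
```

Running the same export from the `c7i` and `c8g` test runs through this helper keeps the percentile definition identical across both architectures.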

For most teams, the decision is straightforward: **default to Graviton for new workloads**, migrate existing workloads at the next Reserved Instance renewal.

## CPU and Compute Optimization

### T3/T4g: Burstable Instances Under Production Load

**T-series instances** (T3 on x86, T4g on Graviton) use a credit-based CPU model. Each instance accumulates credits at a baseline rate proportional to its size and spends them when CPU utilization exceeds the baseline.

| Instance   | Baseline CPU  | Credit earn rate | Burst at 100% from a full balance |
| ---------- | ------------- | ---------------- | --------------------------------- |
| t3.small   | 20% of 2 vCPU | 24 credits/hr    | ~6 hrs                            |
| t3.medium  | 20% of 2 vCPU | 24 credits/hr    | ~6 hrs                            |
| t3.large   | 30% of 2 vCPU | 36 credits/hr    | ~10 hrs                           |
| t4g.medium | 20% of 2 vCPU | 24 credits/hr    | ~6 hrs                            |
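The burst duration follows from AWS's credit accounting: one CPU credit is one vCPU at 100% for one minute, and the earned balance caps at 24 hours of earnings (576 credits for a t3.medium). A sketch of the arithmetic, assuming a full balance at the start and ignoring launch credits (`burst_minutes` is a hypothetical helper):

```bash
# Minutes an instance can run above baseline before its credit balance hits zero.
# Args: vCPUs, credits earned per hour, target utilization % (per vCPU).
# Assumes a full balance (24 h of earned credits) and no launch credits.
burst_minutes() {
  local vcpus=$1 earn_per_hr=$2 util_pct=$3
  local max_balance=$(( earn_per_hr * 24 ))
  # 1 credit = 1 vCPU-minute at 100%, so spend/hr = vCPUs * util% * 60 / 100
  local spend_per_hr=$(( vcpus * util_pct * 60 / 100 ))
  local net_drain=$(( spend_per_hr - earn_per_hr ))
  if (( net_drain <= 0 )); then
    echo "indefinite"   # at or below baseline: the balance never drains
    return
  fi
  echo $(( max_balance * 60 / net_drain ))
}

# burst_minutes 2 24 100   # t3.medium at 100% on both vCPUs
# burst_minutes 2 36 100   # t3.large
```

For a t3.medium this gives 360 minutes (6 hours) and for a t3.large about 617 minutes (roughly 10 hours). Real workloads rarely sit at exactly 100%, so treat the result as a floor rather than a forecast.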

**The failure mode:** An API server on a T3 instance handling a gradual traffic ramp exhausts its credit balance over 4–6 hours. Once credits are exhausted, the instance throttles to baseline — 20–30% of nominal CPU performance. API latency increases 3–5x. The CloudWatch `CPUCreditBalance` metric approaching zero is the signal.

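**T-series in production: keep Unlimited mode on.** T3 and T4g instances launch in Unlimited mode by default unless the launch template or the account-level default sets standard mode. Unlimited allows sustained above-baseline CPU consumption, billed as surplus credits at $0.05 per vCPU-hour for T3 or $0.04 for T4g (Linux). A t3.medium running both vCPUs at 100% for 24 hours uses 1.6 vCPU-hours per hour above its 20% baseline, an additional ~$1.92/day. For bursty workloads that exceed baseline only part of the day, this is often cheaper than the next instance size.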
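The Unlimited charge is proportional to utilization above baseline, not to the total vCPU count. A sketch of the daily cost, assuming utilization stays constant all day and the earned balance is already spent (`unlimited_daily_cost` is hypothetical; the rates are the Linux prices quoted above):

```bash
# Daily surplus-credit charge (USD) for a T-series instance in Unlimited mode.
# Args: vCPUs, baseline % per vCPU, sustained average utilization %,
#       $ per vCPU-hour (0.05 for T3 Linux, 0.04 for T4g).
unlimited_daily_cost() {
  awk -v vcpus="$1" -v base="$2" -v util="$3" -v rate="$4" 'BEGIN {
    surplus = vcpus * (util - base) / 100   # vCPU-hours per hour above baseline
    if (surplus < 0) surplus = 0            # below baseline: no surplus charge
    printf "%.2f\n", surplus * 24 * rate
  }'
}

# unlimited_daily_cost 2 20 100 0.05   # t3.medium pinned at 100%
```

A t3.medium pinned at 100% comes to $1.92/day; at a 10% average it stays under baseline and costs nothing extra.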

**When to move to fixed performance:** If your `CPUCreditBalance` stays near zero for more than 4 hours/day consistently, the T-series is the wrong family. Move to a `c7g` or `m7g` where performance is deterministic.

### Dedicated Instances vs Shared Tenancy

By default, EC2 instances run on hardware shared with other customers (shared tenancy). **Dedicated instances** run on hardware isolated to your AWS account. **Dedicated hosts** give you visibility into the physical host's socket/core topology.

For API workloads, shared tenancy is correct. Dedicated Instances cost more (a per-instance premium plus an hourly per-Region fee), and their isolation benefit is regulatory compliance, not performance. On the Nitro hypervisor the "noisy neighbor" concern is largely addressed; CPU interference from other customers' workloads is rare.

The exception is memory-bandwidth-intensive workloads where NUMA awareness matters; see the NUMA discussion under OS-level Linux tuning below.

## Network Optimization

### Enhanced Networking and ENA

**Enhanced Networking** with the **Elastic Network Adapter (ENA)** is enabled by default on all current-generation EC2 instances. ENA provides:

- Up to 100 Gbps network bandwidth (on compute-optimized instances)
- Significantly lower per-packet CPU overhead than the legacy Xen paravirtual network driver
- Jumbo frame support (9001 MTU vs 1500 MTU standard)

For API workloads, ENA matters most when:

- You have high connection rates (thousands of new connections per second)
- You transfer large response payloads (>1MB per response)
- You make many concurrent outbound connections to RDS, ElastiCache, or other AWS services

Verify ENA is active:

```bash
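ethtool -i eth0 | grep driver
# Should show: driver: ena
# On Amazon Linux 2023 the primary interface is usually ens5, not eth0
# (list interfaces with: ip -br link)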
```

If you are on a legacy instance type still using the `ixgbevf` driver, migrating to a current-generation instance will improve both throughput and CPU efficiency on networking operations.

### Placement Groups

**Cluster placement groups** pack instances close together inside a single AZ on a low-latency network segment. Network latency between instances in a cluster placement group drops to roughly 50–100 microseconds, versus 200–500 microseconds for instances in the same AZ without placement group constraints.

When this matters for APIs:

- Synchronous calls between API tier and a Redis/Valkey cluster on EC2
- Internal RPC between microservices on EC2 where p99 latency is critical
- High-throughput connections to self-managed databases running on EC2 (Amazon RDS instances cannot be placed in your placement groups)

**Creating a cluster placement group and deploying instances into it via Terraform:**

```hcl
resource "aws_placement_group" "api_cluster" {
  name     = "api-cluster-pg"
  strategy = "cluster"

  tags = {
    Environment = var.environment
  }
}

resource "aws_instance" "api_server" {
  count             = var.instance_count
  ami               = data.aws_ami.amazon_linux_2023.id  # must be an arm64 AMI for Graviton
  instance_type     = "c8g.2xlarge"
  placement_group   = aws_placement_group.api_cluster.id
  subnet_id         = var.private_subnet_ids[0]  # Must be same AZ

  # ENA is always enabled on Nitro instance types such as c8g. aws_instance has
  # no ena_support argument (that attribute belongs to aws_ami), so none is set.

  root_block_device {
    volume_type           = "gp3"
    volume_size           = 30
    throughput            = 125
    iops                  = 3000
    encrypted             = true
    delete_on_termination = true
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  # IMDSv2 required
    http_put_response_hop_limit = 1
  }

  # user_data takes plain text; the provider handles the base64 encoding
  user_data = templatefile("${path.module}/user-data.sh", {
    environment = var.environment
  })

  tags = {
    Name        = "api-server-${count.index}"
    Environment = var.environment
  }
}
```

**Cluster placement group constraints:**

- All instances must be in the same AZ — this is a hard requirement
- Launch all instances in a single request where possible; large placement groups on popular instance types can fail with `InsufficientInstanceCapacity`
- Not all instance types support cluster placement groups (check `PlacementGroupInfo.SupportedStrategies` in the output of `aws ec2 describe-instance-types --instance-types c8g.2xlarge`)

**Spread placement groups** place each instance on distinct underlying hardware. A spread placement group can span multiple AZs. Use spread for stateful services (primary databases, stateful cache nodes) where hardware failure isolation matters more than network latency.

**Partition placement groups** divide instances into logical partitions, each on separate rack hardware. Use for distributed systems (Kafka, Cassandra, Elasticsearch) that need topology awareness for rack-aware replica placement.

### NIC Tuning: Receive Side Scaling

For very high-connection-rate APIs on instances with multiple vCPUs, **Receive Side Scaling (RSS)** distributes incoming packets across CPU cores. On ENA, this is handled automatically. However, interrupt affinity tuning can improve performance further:

```bash
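# Show per-CPU interrupt counts for the ENA queues
# (on Amazon Linux 2023 the interface is usually ens5 rather than eth0)
grep eth0 /proc/interrupts

# Option A (default): let irqbalance distribute network interrupts across cores
sudo systemctl enable --now irqbalance

# Option B (NUMA-sensitive workloads): pin network interrupts manually.
# irqbalance rewrites affinity masks, so stop it first or it will undo this.
sudo systemctl disable --now irqbalance
# Mask 0f = CPUs 0-3; confirm with `lscpu` that these CPUs belong to NUMA node 0
for irq in $(grep eth0 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
  echo 0f | sudo tee "/proc/irq/$irq/smp_affinity" > /dev/null
done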
```
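The `0f` in the loop above is a hex bitmask with one bit per CPU. When the CPU set comes from the kernel (for example `/sys/devices/system/node/node0/cpulist`, which prints a list such as `0-3`), converting it by hand is error-prone. A small sketch of the conversion; `cpu_list_to_mask` is a hypothetical helper and handles CPUs 0 to 62 only:

```bash
# Convert a CPU list such as "0-3" or "0,2,4-5" into the hex bitmask format
# used by /proc/irq/<N>/smp_affinity. Limited to CPUs 0-62 (bash 64-bit ints).
cpu_list_to_mask() {
  local mask=0 part lo hi cpu
  local IFS=,                      # split the list on commas
  for part in $1; do
    if [[ $part == *-* ]]; then    # range like 4-7
      lo=${part%-*}; hi=${part#*-}
    else                           # single CPU
      lo=$part; hi=$part
    fi
    for (( cpu = lo; cpu <= hi; cpu++ )); do
      mask=$(( mask | (1 << cpu) ))
    done
  done
  printf '%x\n' "$mask"
}

# cpu_list_to_mask "$(cat /sys/devices/system/node/node0/cpulist)"
```

The kernel also accepts the list form directly through `/proc/irq/<N>/smp_affinity_list`, which avoids the conversion altogether.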

## EBS Optimization

### gp3 Throughput and IOPS Tuning

**gp3** is the current-generation general-purpose SSD volume type. Unlike gp2 (where IOPS scales automatically with volume size), gp3 decouples performance from capacity:

- **Baseline:** 3,000 IOPS and 125 MB/s throughput at any size, included in the base price
- **Maximum:** 16,000 IOPS and 1,000 MB/s throughput for additional cost
- **Price per GB:** $0.08/GB-month, 20% less than gp2 ($0.10/GB-month)
- **Additional IOPS:** $0.005 per provisioned IOPS-month above 3,000
- **Additional throughput:** $0.04 per MB/s-month above 125
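Because gp3 bills capacity, IOPS, and throughput separately, the monthly cost of a volume is a simple sum. A sketch using the us-east-1 list prices quoted above (prices differ by Region; `gp3_monthly_cost` is hypothetical):

```bash
# Monthly gp3 cost in USD: $0.08/GB-month, $0.005 per IOPS-month above 3,000,
# $0.04 per MB/s-month above 125. Args: size GB, provisioned IOPS, MB/s.
gp3_monthly_cost() {
  awk -v gb="$1" -v iops="$2" -v mbps="$3" 'BEGIN {
    extra_iops = (iops > 3000) ? iops - 3000 : 0   # baseline IOPS are free
    extra_tput = (mbps > 125) ? mbps - 125 : 0     # baseline throughput is free
    printf "%.2f\n", gb * 0.08 + extra_iops * 0.005 + extra_tput * 0.04
  }'
}

# gp3_monthly_cost 100 6000 250
```

The data volume in the Terraform example below (100 GB, 6,000 IOPS, 250 MB/s) comes to $28.00/month; the 30 GB root volume at baseline is $2.40.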

For API servers, the OS root volume typically does not need additional IOPS provisioning — API servers are compute and network bound, not disk bound. Where gp3 tuning matters:

**Application logs:** If your application writes high-volume logs synchronously to disk (not recommended in containers, but common on EC2), provisioning 6,000–9,000 IOPS can keep log write latency from adding to request processing time.

**Swap space:** PHP and Python applications under memory pressure will use swap if it is configured. At the gp3 baseline of 3,000 IOPS, random 4 KiB swap I/O tops out around 12 MB/s. This is slow enough that swap usage will cause measurable API latency degradation. Monitor `swap_used_percent` with the CloudWatch agent; if your instances are hitting swap, add memory before provisioning more IOPS.
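The swap throughput estimate is IOPS multiplied by I/O size. A one-function sketch of that arithmetic (decimal megabytes; `iops_to_mbps` is hypothetical):

```bash
# Throughput ceiling in MB/s (decimal) implied by an IOPS limit at a fixed I/O size.
# Args: IOPS, I/O size in bytes.
iops_to_mbps() {
  awk -v iops="$1" -v size="$2" 'BEGIN { printf "%.1f\n", iops * size / 1e6 }'
}

# iops_to_mbps 3000 4096   # random 4 KiB I/O at the gp3 baseline
```

At 4 KiB per operation, the 3,000 IOPS baseline yields about 12.3 MB/s.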

**Ephemeral data stores:** If your API maintains a local SQLite database, local embedding index, or similar disk-resident data structure, provision additional IOPS on a separate data volume:

```hcl
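resource "aws_ebs_volume" "api_data" {
  # Must be in the same AZ as the instance it attaches to
  availability_zone = aws_instance.api_server[0].availability_zone
  type              = "gp3"
  size              = 100
  iops              = 6000
  throughput        = 250
  encrypted         = true

  tags = {
    Name = "api-data-volume"
  }
}

# Attach the volume; on Nitro instances it appears as an NVMe device (e.g. /dev/nvme1n1)
resource "aws_volume_attachment" "api_data" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.api_data.id
  instance_id = aws_instance.api_server[0].id
}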
```

### io2 and NVMe Instance Store

**io2** volumes are appropriate when you need consistent sub-millisecond IOPS latency with durability guarantees. For API servers, io2 is rarely justified — the cost ($0.125/GB-month + $0.065 per IOPS-month) is substantial, and most API disk I/O patterns do not need io2 consistency guarantees.

**NVMe instance store** is included with certain instance families (`i4g`, `im4gn`, `is4gen`) and offers extremely high IOPS at no cost beyond the instance price. The critical caveat: **instance store is ephemeral**; data is lost when the instance stops, hibernates, or terminates. Use instance store for:

- Read-through caches (warmed from a durable source on startup)
- Temporary file processing (image resizing, document conversion)
- Local buffer before writing to S3 or EBS

Never use instance store as a primary data store without a durability strategy.
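Instance store devices come up empty, so they must be formatted and mounted on every boot. On Nitro instances, `lsblk` reports the model string `Amazon EC2 NVMe Instance Storage` for instance store and `Amazon Elastic Block Store` for EBS, which makes them easy to tell apart. A sketch that filters `lsblk` output (`instance_store_devices` and the `/mnt/scratch` mount point are hypothetical):

```bash
# List NVMe instance-store devices from `lsblk -d -n -o NAME,MODEL` output on stdin.
# EBS volumes report the model "Amazon Elastic Block Store" and are skipped.
instance_store_devices() {
  awk '/Amazon EC2 NVMe Instance Storage/ { print "/dev/" $1 }'
}

# lsblk -d -n -o NAME,MODEL | instance_store_devices
# Then, at boot (the device is empty anyway):
#   mkfs.xfs -f /dev/nvme1n1 && mount /dev/nvme1n1 /mnt/scratch
```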

## OS-Level Linux Tuning for APIs

The Linux kernel ships with defaults tuned for general-purpose workloads and conservative resource usage. API servers under production load hit several of these defaults as bottlenecks before running out of CPU or memory.

### sysctl Parameters

```bash
# /etc/sysctl.d/99-api-server.conf
# Apply with: sysctl -p /etc/sysctl.d/99-api-server.conf

# ============================================================
# TCP Connection Handling
# ============================================================

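# Cap on each socket's accept queue (listen backlog). Default is 4096 on kernels
# 5.4+ and 128 on older kernels; the kernel uses the smaller of this value and the
# backlog the application passes to listen(), so raise both
net.core.somaxconn = 65535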

# SYN backlog — half-open connections waiting for three-way handshake
net.ipv4.tcp_max_syn_backlog = 65535

# Allow TIME_WAIT socket reuse for new outbound connections
# Eliminates "cannot assign requested address" errors under high egress connection rates
net.ipv4.tcp_tw_reuse = 1

# TIME_WAIT duration is fixed at 60 seconds in the Linux kernel (TCP_TIMEWAIT_LEN).
# net.ipv4.tcp_fin_timeout controls FIN_WAIT_2, not TIME_WAIT, so tcp_tw_reuse
# above is the correct lever for TIME_WAIT pressure

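# Ephemeral port range: default 32768-60999 (28,232 ports)
# At 1,000 new outbound connections/second to a single destination IP:port, this
# exhausts in about 28 seconds while closed sockets sit in TIME_WAIT.
# A range starting at 1024 can claim ports that local services bind later;
# reserve those with net.ipv4.ip_local_reserved_ports
net.ipv4.ip_local_port_range = 1024 65535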

# Keepalive tuning — detect dead connections faster
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6

# ============================================================
# Network Buffer Sizes
# ============================================================

# Default and maximum receive/send socket buffer sizes
# Default (212992) is adequate for most APIs; increase for high-bandwidth streaming
net.core.rmem_default = 212992
net.core.wmem_default = 212992
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728

# TCP auto-tuning ranges (min, default, max in bytes)
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

# Packet receive queue — prevents drops at high packet rates
net.core.netdev_max_backlog = 16384

# ============================================================
# File Descriptor Limits
# ============================================================

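# System-wide maximum open files. Check the current value first
# (sysctl fs.file-max): on some systemd-based distributions the default is already
# far higher, and setting this line would lower it
fs.file-max = 2097152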

# Inotify watches — needed for apps using filesystem event watching
fs.inotify.max_user_watches = 524288

# ============================================================
# Memory Management
# ============================================================

# Prefer reclaiming page cache over swapping out application memory (default 60).
# APIs should not hit swap at all; this only matters if swap is configured
vm.swappiness = 10

# Control how aggressively the kernel writes dirty pages to disk
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5
```

Files in `/etc/sysctl.d/` are loaded at boot, so these settings persist across reboots. To apply them immediately:

```bash
sysctl -p /etc/sysctl.d/99-api-server.conf
```
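The ephemeral-port comment in the sysctl file compresses some arithmetic. Ports are consumed per destination IP:port, and without `tcp_tw_reuse` each closed connection holds its port in TIME_WAIT for 60 seconds, so the sustainable rate to one destination is roughly the port count divided by 60. A sketch of both figures (function names are hypothetical):

```bash
# Ephemeral-port arithmetic for outbound connections to ONE destination IP:port.
# Args: low and high end of net.ipv4.ip_local_port_range.
port_count() { echo $(( $2 - $1 + 1 )); }

# Seconds until the range is exhausted at a given new-connection rate per second
# (valid while the result is under 60 s, before the first TIME_WAIT ports free up).
seconds_to_exhaust() { echo $(( $(port_count "$1" "$2") / $3 )); }

# Highest sustainable new-connection rate per second with a 60 s TIME_WAIT.
max_sustained_rate() { echo $(( $(port_count "$1" "$2") / 60 )); }

# seconds_to_exhaust 32768 60999 1000   # default range at 1,000 conn/s
# max_sustained_rate 1024 65535         # widened range
```

The default range sustains about 470 new connections per second to a single destination, the widened range about 1,075. Connection pooling or keepalive to upstreams often helps more than widening the range.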

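Services started by systemd do not read `/etc/security/limits.d/` and need `LimitNOFILE=` in their unit. For processes started through a login session (PAM), set per-process file descriptor limits in `/etc/security/limits.d/99-api-server.conf`: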

```
* soft nofile 65535
* hard nofile 65535
* soft nproc  65535
* hard nproc  65535
```
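For a systemd-managed service, a drop-in override is the usual way to set `LimitNOFILE=`. A minimal sketch that writes one; the unit name `api.service` is hypothetical, and the directory is a parameter so the function can be tried outside `/etc`:

```bash
# Write a systemd drop-in that raises the open-file limit for one service.
# Args: drop-in directory (e.g. /etc/systemd/system/api.service.d), limit.
write_nofile_override() {
  local dropin_dir=$1 limit=$2
  mkdir -p "$dropin_dir"
  printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dropin_dir/limits.conf"
}

# As root:
#   write_nofile_override /etc/systemd/system/api.service.d 65535
#   systemctl daemon-reload && systemctl restart api.service
```

The new limit applies only after `daemon-reload` and a restart of the service.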

And verify your application process actually has elevated limits:

```bash
# Check the open-file limit of each matching process (pgrep may return several PIDs)
for pid in $(pgrep -f 'gunicorn|node|php-fpm'); do
  echo "$pid: $(grep 'Max open files' "/proc/$pid/limits")"
done
```

### Huge Pages for Memory-Bound Workloads

**Transparent Huge Pages (THP)** may be enabled by default on Amazon Linux and can cause latency spikes in some workloads due to page compaction pauses. For most API servers, disable THP:

```bash
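# Disable THP at runtime (requires root; resets at reboot)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# To persist, add transparent_hugepage=never to the kernel command line
# or run the two commands above from a systemd unit at boot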
```

Redis and several other databases recommend disabling THP; check the current guidance for your version, since MongoDB's recommendation has changed across releases. For Java-based APIs (Spring Boot, Quarkus), explicit huge pages (`vm.nr_hugepages`) can reduce GC overhead, but this is workload-specific and requires measurement.
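To confirm the change (and, after a reboot, that it persisted), read the sysfs file back: the kernel marks the active mode with brackets, as in `always madvise [never]`. A small helper for scripts or health checks (`thp_mode` is hypothetical):

```bash
# Print the active THP mode from a sysfs file on stdin, e.g. "always madvise [never]".
# The character class includes "+" because the defrag file has "defer+madvise".
thp_mode() {
  sed -n 's/.*\[\([a-z+]*\)].*/\1/p'
}

# thp_mode < /sys/kernel/mm/transparent_hugepage/enabled
# thp_mode < /sys/kernel/mm/transparent_hugepage/defrag
```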

### Pinning Applications to a NUMA Node

On multi-socket EC2 instances (large instances with multiple NUMA nodes, `m5.metal` and similar), applications that process requests on a CPU core while network interrupts land on a different NUMA node pay a cross-NUMA memory access penalty. For latency-critical APIs:

```bash
# Identify NUMA topology
numactl --hardware

# Pin application process to NUMA node 0
numactl --cpunodebind=0 --membind=0 node dist/main.js
```

Most EC2 instance types up to 8xlarge are single-NUMA. Beyond that, NUMA topology becomes relevant.
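Before pinning, it is worth confirming the instance actually exposes more than one NUMA node. A sketch that reads the first line of `numactl --hardware` output (format `available: 2 nodes (0-1)`); `numa_nodes` is hypothetical and assumes `numactl` is installed:

```bash
# Print the number of NUMA nodes from `numactl --hardware` output on stdin.
numa_nodes() {
  awk '/^available:/ { print $2; exit }'
}

# if [ "$(numactl --hardware | numa_nodes)" -gt 1 ]; then
#   echo "multiple NUMA nodes: pinning may help"
# fi
```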

## CloudWatch Agent for Custom CPU and Memory Metrics

EC2 publishes CPU, network, and disk I/O metrics to CloudWatch by default, but not memory utilization or filesystem usage, which are only visible from inside the guest OS. Install the CloudWatch agent to report memory, disk, and custom API metrics.

```json
{
  "agent": {
    "metrics_collection_interval": 30,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent", "mem_available_percent", "mem_used", "mem_total"],
        "metrics_collection_interval": 30
      },
      "disk": {
        "measurement": ["used_percent", "inodes_free"],
        "metrics_collection_interval": 60,
        "resources": ["/", "/data"]
      },
      "net": {
        "measurement": [
          "net_bytes_recv",
          "net_bytes_sent",
          "net_packets_recv",
          "net_packets_sent",
          "net_drop_in",
          "net_drop_out"
        ],
        "metrics_collection_interval": 30,
        "resources": ["eth0"]
      },
      "netstat": {
        "measurement": ["tcp_established", "tcp_time_wait", "tcp_close_wait"],
        "metrics_collection_interval": 30
      },
      "processes": {
        "measurement": ["running", "sleeping", "dead"]
      }
    }
  }
}
```

The `netstat` metrics are particularly valuable for API debugging. A growing `tcp_time_wait` count indicates high connection turnover — a candidate for HTTP keepalive tuning. A growing `tcp_close_wait` count indicates the application is not closing connections promptly.
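The same TIME_WAIT and CLOSE_WAIT signals can be checked directly on an instance while debugging. `ss -tan` from iproute2 prints each TCP socket's state in the first column; a small sketch that tallies them (`tcp_state_counts` is hypothetical):

```bash
# Count sockets per TCP state from `ss -tan` output on stdin (header line skipped).
tcp_state_counts() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# ss -tan | tcp_state_counts
```

A large `TIME-WAIT` count points at connection churn; a growing `CLOSE-WAIT` count points at the application not closing sockets.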

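On the instance, `/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c ssm:/cloudwatch-agent/config/api-server` loads the config from Parameter Store and starts the agent. **Terraform to store the config in SSM Parameter Store and grant the instance role the permissions the agent needs:**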

```hcl
resource "aws_ssm_parameter" "cloudwatch_config" {
  name  = "/cloudwatch-agent/config/api-server"
  type  = "String"
  value = file("${path.module}/cloudwatch-agent-config.json")
}

resource "aws_iam_role_policy_attachment" "cloudwatch_agent_policy" {
  role       = aws_iam_role.ec2_api_role.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

resource "aws_iam_role_policy_attachment" "ssm_policy" {
  role       = aws_iam_role.ec2_api_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
```

## Edge Cases Under Sustained Load

### Noisy Neighbor Mitigation

In 2026, the AWS Nitro hypervisor provides strong CPU isolation between EC2 instances. True CPU-level noisy neighbor effects are rare. Network noisy neighbors can still occur on shared network infrastructure in the same AZ. Symptoms: `net_drop_in` metrics spike without corresponding traffic increase on your instance; p99 latency increases without CPU or memory pressure.

Mitigation options:

- Move to Dedicated Hosts (isolates the physical server, though the wider network fabric is still shared; significant cost increase)
- Enable **ENA Express** on supported instance types; it uses the SRD protocol to raise single-flow bandwidth (up to 25 Gbps) and reduce tail latency between instances in the same AZ
- Spread instances across multiple AZs — likely placing them on different physical infrastructure
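Before attributing drops to a neighbor, rule out the instance's own network limits. The ENA driver exposes allowance counters through `ethtool -S` (`bw_in_allowance_exceeded`, `bw_out_allowance_exceeded`, `pps_allowance_exceeded`, `conntrack_allowance_exceeded`, `linklocal_allowance_exceeded`; older driver versions may lack them). A sketch that prints only the non-zero ones (`ena_allowance_exceeded` is hypothetical):

```bash
# Print non-zero ENA allowance counters from `ethtool -S <iface>` output on stdin.
ena_allowance_exceeded() {
  awk -F: '/allowance_exceeded/ {
    gsub(/ /, "", $1); gsub(/ /, "", $2)   # strip padding around name and value
    if ($2 + 0 > 0) print $1 "=" $2
  }'
}

# ethtool -S eth0 | ena_allowance_exceeded
```

Counters that keep climbing point at the instance's own bandwidth, PPS, or connection-tracking limits, where a larger instance size is the fix rather than a noisy-neighbor workaround.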

### Burst Credit Exhaustion Under Sustained Load

Symptoms on T-series instances: CPU utilization in CloudWatch shows a sudden drop to 20–30% while the application reports increasing latency. The `CPUCreditBalance` metric will show the credit exhaustion event.

Immediate remediation: enable T3 Unlimited via the console or CLI:

```bash
aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications \
  '[{"InstanceId":"i-xxxxx","CpuCredits":"unlimited"}]'
```

This does not require a reboot and takes effect within minutes.

### CPU Steal

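**CPU steal** (`%st` in `top`, `cpu_usage_steal` in the CloudWatch agent) represents time your vCPU waited for the hypervisor to schedule it on a physical CPU. On fixed-performance instances, non-zero steal indicates contention on the physical host; on T-series instances, steal also appears when the instance is throttled to its baseline.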

In modern AWS on Nitro, steal should be at or near zero for normal workloads. Non-zero steal on current-generation instances is unusual and warrants a support ticket. If you consistently see >2% steal:

1. Stop and start the instance (migrates to different hardware, not a reboot)
2. If steal persists after migration, open an AWS support case

On legacy Xen-based instances (C3, M3, older families), steal was a more common issue. Migration to Nitro-based instances eliminates this class of problem.
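`top` shows steal as an instantaneous figure. For scripts or a steadier reading, steal can be computed from two samples of the aggregate `cpu` line in `/proc/stat`, where the eighth value is cumulative steal time. A sketch (`steal_pct` is hypothetical; guest time is left out of the total because the kernel already counts it in user time):

```bash
# Steal percentage between two samples of the aggregate "cpu" line of /proc/stat.
# Fields after "cpu": user nice system idle iowait irq softirq steal guest guest_nice
steal_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    for (i = 2; i <= 9; i++) { t1 += x[i]; t2 += y[i] }   # user..steal
    if (t2 == t1) { print "0.0"; exit }                   # no time elapsed
    printf "%.1f\n", (y[9] - x[9]) * 100 / (t2 - t1)
  }'
}

# a=$(grep '^cpu ' /proc/stat); sleep 10; b=$(grep '^cpu ' /proc/stat)
# steal_pct "$a" "$b"
```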

For related EC2 and container scaling strategies, see our [AWS Auto Scaling Strategies for EC2, ECS, and Lambda](/blog/aws-auto-scaling-strategies-ec2-ecs-lambda/).

For cost optimization across your EC2 fleet, see the [AWS Cost Control Architecture and Optimization Playbook](/blog/aws-cost-control-architecture-optimization-playbook/).

For CloudWatch metrics and alarm configuration, see [CloudWatch Observability: Metrics, Logs, and Alarms Best Practices](/blog/aws-cloudwatch-observability-metrics-logs-alarms-best-practices/).

## FAQ

### When should you use Graviton over x86 for API workloads?
Use Graviton3/4 by default for any new API workload unless you have a specific constraint preventing it. Graviton3 delivers 25–40% better price/performance than equivalent x86 instances for typical API workloads — web serving, database proxies, API gateways, and stateless compute. The most common blockers are x86-only software dependencies (specific compiled binaries or libraries without ARM builds), existing on-premises hardware that must remain x86-compatible, or team unfamiliarity with ARM builds. Most modern language runtimes (Python 3.9+, Node 16+, Go 1.15+, PHP 8.x, Java 11+) have production-quality Graviton support. Test your container builds with multi-arch manifests and benchmark on your actual workload before committing — some CPU-intensive workloads (heavy cryptography, numerical computation) show larger gains than others.

### How do EC2 T3/T4g CPU credits affect API performance?
T-series instances (T3 on x86, T4g on Graviton) earn CPU credits when utilization is below the baseline threshold and spend credits when above it. Each instance size has a defined baseline: a t3.medium has a 20% baseline on 2 vCPUs. If your API sustains above-baseline CPU for more than the credit buffer allows, the instance drops to the baseline rate — typically 20–40% of nominal performance. For API servers, this manifests as sudden latency spikes under load. If your API has consistent throughput rather than burst patterns, move to a fixed-performance compute-optimized instance (c7g family for Graviton). T-series instances are appropriate for dev environments, low-traffic staging, and genuinely bursty workloads with significant idle periods. Always enable T3 Unlimited or T4g Unlimited if you run T-series in production — this prevents throttling at the cost of per-credit charges during burst exhaustion.

### What Linux kernel parameters most impact API server throughput?
The five kernel parameters with the largest direct impact on API throughput are: (1) net.core.somaxconn, which caps the accept queue for listening sockets; the old default of 128 (4096 since Linux 5.4) causes connection drops under burst traffic, so set it to 65535 and raise the application's listen backlog to match; (2) net.ipv4.tcp_tw_reuse, which allows reuse of TIME_WAIT sockets for new outbound connections and is critical for high-connection-rate APIs, set to 1; (3) net.ipv4.ip_local_port_range, the range of ephemeral ports for outbound connections, set to 1024 65535; (4) fs.file-max and the per-process nofile limit, since each TCP connection uses a file descriptor and the common default soft limit of 1024 causes EMFILE errors under load, so set it to 65535 or higher; (5) net.core.netdev_max_backlog, the packet receive queue length, increased to 16384 for high-bandwidth workloads. These five parameters address the most common bottlenecks before you need to tune deeper kernel networking stack parameters.

### How do placement groups improve EC2 network performance?
Placement groups control the physical placement of EC2 instances relative to each other. Cluster placement groups pack instances close together within a single AZ, enabling low network latency and up to 10 Gbps single-flow bandwidth between instances. This matters for API workloads that make synchronous calls to peer services or databases hosted on EC2, reducing inter-service latency from 200–500 microseconds (typical VPC) to 50–100 microseconds (cluster placement group). The tradeoff is reduced availability: all instances in a cluster placement group can fail together if the underlying hardware fails. Spread and partition placement groups trade lower network performance for higher availability by distributing instances across physical hardware. Use cluster for latency-sensitive internal APIs, spread for stateful services where individual instance failure must be isolated, and partition for distributed systems like Kafka or Cassandra that need topology awareness.

---

*Source: https://www.factualminds.com/blog/ec2-high-performance-api-optimization/*