How to Tune PHP, Node.js, Python, and Go for High Concurrency on AWS
Quick summary: PHP-FPM, Node.js, Python, and Go have fundamentally different concurrency models. Tuning each runtime for high concurrency on ECS requires understanding the model, not just copying configuration values from Stack Overflow.
Key Takeaways
- Tuning each runtime for high concurrency on ECS requires understanding the model, not just copying configuration values from Stack Overflow

Every engineering team that “tunes for performance” eventually lands in the same place: they copy configuration values from a Medium article written for a different workload on different hardware, deploy to ECS, and wonder why their p99 latency did not improve.
Effective tuning requires understanding the concurrency model of your runtime. PHP processes requests sequentially per worker. Node.js multiplexes thousands of connections on a single thread via an event loop. Python’s GIL gates CPU parallelism. Go spawns goroutines so cheaply that it can handle millions of concurrent connections on a laptop.
Each model has different bottlenecks, different failure modes, and different configuration levers. Here is what actually matters for each one on AWS ECS.
Concurrency Models Compared
Understanding the model before touching any configuration:
| Runtime | Concurrency mechanism | Parallelism | Key bottleneck |
|---|---|---|---|
| PHP-FPM | Multiple OS processes | True parallelism up to worker count | Worker pool exhaustion |
| PHP + Octane | Persistent workers, multiple processes | True parallelism up to worker count | Memory leaks, static state |
| Node.js | Single-threaded event loop | No parallelism (CPU-bound) | Event loop blocking |
| Node.js cluster | Multiple OS processes, each with event loop | True parallelism up to worker count | Memory, process coordination |
| Python asyncio | Cooperative coroutines, single-threaded | No parallelism (GIL) | GIL for CPU work, async I/O ceiling |
| Python Gunicorn | Multiple OS processes | True parallelism up to worker count | Worker pool exhaustion, GIL per worker |
| Go | Goroutines (M:N threads) | True parallelism up to GOMAXPROCS | Goroutine leaks, sync.Mutex contention |
Critical distinction: parallelism (doing two things at the same time) vs concurrency (managing many things in progress simultaneously). Node.js is highly concurrent but not parallel for CPU work. Go is both concurrent and parallel.
For ECS, this means:
- PHP-FPM and Python Gunicorn: scale by adding workers (vertical) and tasks (horizontal).
- Node.js: scale by adding cluster workers (vertical, limited to CPU count) and tasks (horizontal).
- Go: scale by adding tasks (horizontal); each task handles massive concurrency internally.
PHP and Laravel Octane
Standard PHP-FPM Tuning
The PHP-FPM bottleneck is pm.max_children. When all workers are busy, new requests queue in the pool's socket backlog (the listen.backlog directive in the FPM pool config). When that queue fills, connections are refused and Nginx returns 502 errors.
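The settings that govern this bottleneck live in the FPM pool config. A sketch with illustrative values only; derive pm.max_children from container memory divided by measured per-worker memory, not from a copied example:

```ini
; www.conf — example values, size for your container
[www]
pm = dynamic
; Hard ceiling on parallel requests per container:
; container_memory / measured_memory_per_worker
pm.max_children = 20
pm.start_servers = 5
pm.min_spare_servers = 3
pm.max_spare_servers = 8
; Recycle workers periodically to contain slow memory leaks
pm.max_requests = 500
; Connections queued when all workers are busy
listen.backlog = 511
```

Check actual per-worker memory with `ps` or cgroup stats after the pool has served traffic for a while, then resize.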
OPcache eliminates file system reads and parsing on every request — PHP bytecode is compiled once and stored in shared memory. Without OPcache, every PHP file in your application is read from disk and parsed on every request. With OPcache: read once, cached indefinitely (until invalidated).
OPcache tuning for production:
[opcache]
opcache.enable = 1
opcache.enable_cli = 1
; Shared memory for cached bytecode
; 256MB covers most applications
opcache.memory_consumption = 256
; Shared memory for interned strings (deduplicated)
opcache.interned_strings_buffer = 32
; Maximum number of files that can be cached
; Check current usage: opcache_get_status()['opcache_statistics']['num_cached_scripts']
opcache.max_accelerated_files = 20000
; In production: disable file timestamp checks (files do not change)
; Requires: opcache_reset() or container restart after deploy
opcache.validate_timestamps = 0
; JIT compilation (PHP 8+)
; opcache.jit_buffer_size = 100M
; opcache.jit = tracing
; Note: JIT helps CPU-bound code. For I/O-bound Laravel APIs, the benefit is minimal.
Set opcache.validate_timestamps = 0 in production containers. When deploying a new version, you are replacing the container image — there is no need for PHP to check whether files changed. This eliminates a file system stat call per cached file per request.
Enable the OPcache status page during debugging to verify cache hit rate:
<?php
$status = opcache_get_status();
echo "Hit rate: " . $status['opcache_statistics']['opcache_hit_rate'] . "%\n";
echo "Files cached: " . $status['opcache_statistics']['num_cached_scripts'] . "\n";
echo "Memory used: " . round($status['memory_usage']['used_memory'] / 1024 / 1024) . " MB\n";
A cache hit rate below 95% means either your file count exceeds max_accelerated_files or validate_timestamps = 1 is causing unnecessary cache invalidation.
Laravel Octane: What It Actually Does
Standard PHP-FPM bootstraps Laravel for every request:
1. Load public/index.php
2. Require vendor/autoload.php (autoloader)
3. Create Application container
4. Register all service providers (boot() and register() on 50+ providers)
5. Resolve HTTP kernel
6. Run middleware stack
7. Dispatch to router
8. Execute controller
9. Tear down request state
Steps 1–5 are bootstrap — identical for every request. For a typical Laravel application, this costs 10–40ms before your controller runs.
Laravel Octane with Swoole or RoadRunner executes steps 1–5 once and keeps the result in memory. Steps 6–9 run per request. The bootstrap work is shared across all requests for the lifetime of the worker.
<?php
// config/octane.php — production-tuned configuration
return [
'server' => env('OCTANE_SERVER', 'swoole'),
// Worker count formula: available_memory / memory_per_worker
// Measure memory_per_worker with: memory_get_usage(true) in a controller
// after a few requests have warmed the worker
'workers' => env('OCTANE_WORKERS', 8),
// Task workers for async dispatch (Octane::concurrently)
'task_workers' => env('OCTANE_TASK_WORKERS', 4),
// Restart each worker after N requests (prevents memory leak accumulation)
// Start with 500; reduce if you observe memory growth
'max_requests' => env('OCTANE_MAX_REQUESTS', 500),
'swoole' => [
'options' => [
// Maximum open connections per worker
'max_conn' => 1024,
// Heartbeat timeout for idle connections
'heartbeat_idle_time' => 60,
'heartbeat_check_interval' => 30,
// Enable coroutines for async operations within a request
'hook_flags' => SWOOLE_HOOK_ALL,
],
],
// Listeners that clean up state between requests
// Add your custom cleanup here
'listeners' => [
WorkerStarting::class => [
EnsureUploadedFilesAreValid::class,
],
RequestReceived::class => [
...Octane::prepareApplicationForNextOperation(),
...Octane::prepareApplicationForNextRequest(),
],
RequestHandled::class => [],
RequestTerminated::class => [
FlushTemporaryContainerInstances::class,
],
],
// Services that are 'warm' (pre-resolved before first request)
// Only include services that are safe to share across requests
'warm' => [
...Octane::defaultServicesToWarm(),
],
];
Memory Leak Patterns in Octane
The most common memory leak patterns:
Static properties that accumulate state:
class EventDispatcher {
// BAD: static array accumulates across requests
private static array $listeners = [];
public static function listen(string $event, callable $listener): void {
static::$listeners[$event][] = $listener;
}
}
Container bindings not scoped to request:
// BAD: If this closure captures request-specific data, it leaks across requests
$this->app->bind(OrderProcessor::class, function () use ($request) {
return new OrderProcessor($request->user());
});
Global state in singletons:
Octane’s RequestTerminated listener calls Octane::prepareApplicationForNextRequest() which flushes singletons registered with $app->instance(). Custom singletons registered outside Laravel’s container are not flushed automatically.
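Octane's config also has a flush list for exactly this case: container bindings named there are forgotten after each request and re-resolved on next use. A hedged sketch — the service names below are illustrative, not from this article's application:

```php
// config/octane.php — bindings flushed between requests
'flush' => [
    'cart.manager',
    App\Services\RequestScopedCache::class,
],
```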
Detection: log memory usage at the start and end of each request in staging:
class LogMemoryUsageMiddleware
{
public function handle(Request $request, Closure $next): Response
{
$before = memory_get_usage(true);
$response = $next($request);
$after = memory_get_usage(true);
$delta = $after - $before;
if ($delta > 1024 * 1024) { // Log if request used more than 1 MB net
logger()->warning('High memory delta', [
'path' => $request->path(),
'delta_bytes' => $delta,
'after_mb' => round($after / 1024 / 1024, 2),
]);
}
return $response;
}
}
If memory grows steadily across requests and does not return to baseline, you have a leak. max_requests is the safety valve — workers restart before leaks become critical.
Node.js: Taming the Event Loop
Event Loop Blocking Detection
The most common Node.js performance problem in production is event loop blocking that appears as high p99 latency with moderate CPU usage. The event loop is blocked, preventing I/O callbacks from running, but the CPU is not pegged — it is just waiting on the synchronous code to finish.
Production monitoring of event loop lag:
import { performance } from 'node:perf_hooks';
let eventLoopLag = 0;
const SAMPLE_INTERVAL_MS = 100;
function measureEventLoopLag() {
const start = performance.now();
setImmediate(() => {
// If the event loop is healthy, this runs almost immediately after being queued
// If blocked, it runs only after the blocking code completes
eventLoopLag = performance.now() - start;
});
}
// Sample every 100ms
setInterval(measureEventLoopLag, SAMPLE_INTERVAL_MS);
// Report the metric
setInterval(() => {
if (eventLoopLag > 50) {
console.warn(`Event loop lag: ${eventLoopLag.toFixed(1)}ms`);
// Publish to CloudWatch via AWS SDK
publishMetric('EventLoopLag', eventLoopLag);
}
}, 5000);
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';
const cloudwatch = new CloudWatchClient({ region: process.env.AWS_REGION });
function publishMetric(name, value) {
// CloudWatch custom metric publishing; the client is created once and reused
cloudwatch.send(new PutMetricDataCommand({
Namespace: 'NodeJS/Runtime',
MetricData: [{
MetricName: name,
Value: value,
Unit: 'Milliseconds',
Dimensions: [
{ Name: 'ServiceName', Value: process.env.SERVICE_NAME ?? 'unknown' },
],
}],
}));
}
V8 CPU profiling for production issues:
# Start with profiling enabled
node --prof dist/server.js
# After incident, process the isolate-*.log file
node --prof-process isolate-0x*.log > profile.txt
# Top CPU consumers appear in the "JavaScript" section
grep -A 50 "\[JavaScript\]" profile.txt
For long-running investigations in production, CPU sampling via clinic.js or 0x gives flamegraph output that makes blocking code immediately visible.
Worker Threads for CPU-Intensive Operations
Offload CPU-intensive operations to worker threads to avoid blocking the main event loop:
import { Worker, isMainThread, parentPort, workerData } from 'node:worker_threads';
import { fileURLToPath } from 'node:url';
import path from 'node:path';
const __filename = fileURLToPath(import.meta.url);
if (!isMainThread) {
// Worker thread execution path
const result = performHeavyComputation(workerData);
parentPort.postMessage(result);
}
/**
 * Run a CPU-intensive task in a worker thread.
 * The main event loop continues processing I/O while this runs.
 * (export declarations must sit at module top level, so only the
 * worker execution path is guarded above.)
 */
export function runCpuTask(data) {
return new Promise((resolve, reject) => {
const worker = new Worker(__filename, {
workerData: data,
});
worker.on('message', resolve);
worker.on('error', reject);
worker.on('exit', (code) => {
if (code !== 0) {
reject(new Error(`Worker stopped with exit code ${code}`));
}
});
});
}
function performHeavyComputation(data) {
// CPU-intensive work here — runs in a separate OS thread
// Does not block the main event loop
let sum = 0;
for (let i = 0; i < data.iterations; i++) {
sum += Math.sqrt(i);
}
return sum;
}
For production, use a worker thread pool (via piscina or workerpool) rather than spawning a new thread per request:
import Piscina from 'piscina';
import os from 'node:os';
import path from 'node:path';
import { fileURLToPath } from 'node:url';
const __dirname = path.dirname(fileURLToPath(import.meta.url));
// Thread pool: min 2, max (CPU count) worker threads
const pool = new Piscina({
filename: path.resolve(__dirname, './workers/cpu-worker.js'),
minThreads: 2,
maxThreads: os.availableParallelism(),
idleTimeout: 60000, // Retire idle threads after 60s
});
// In request handler
app.post('/process', async (req, res) => {
const result = await pool.run({ data: req.body });
res.json(result);
});
Graceful Shutdown with In-Flight Request Draining
import http from 'node:http';
const server = http.createServer(app);
let isShuttingDown = false;
server.listen(3000);
const gracefulShutdown = async (signal) => {
if (isShuttingDown) return;
isShuttingDown = true;
console.log(`Received ${signal}, shutting down gracefully`);
// Stop accepting new connections
server.close(async () => {
console.log('HTTP server closed');
// Close database connections, flush caches, etc.
await closeDbConnections();
process.exit(0);
});
// Force shutdown after 30 seconds
setTimeout(() => {
console.error('Forcing shutdown after timeout');
process.exit(1);
}, 30000);
};
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
// Refuse new requests during shutdown
// (register this middleware before your route handlers so it runs first)
app.use((req, res, next) => {
if (isShuttingDown) {
res.set('Connection', 'close');
res.status(503).json({ error: 'Server shutting down' });
return;
}
next();
});
ECS sends SIGTERM to the container task when stopping or deregistering from the load balancer. The 30-second stopTimeout in the task definition gives your application time to finish in-flight requests before ECS sends SIGKILL. Set stopTimeout to match your graceful shutdown timeout.
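To give the drain logic its full 30 seconds, stopTimeout in the container definition should match the application-side timeout. A minimal fragment (other required fields omitted):

```json
{
  "containerDefinitions": [
    {
      "name": "api",
      "stopTimeout": 30
    }
  ]
}
```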
Python: GIL, asyncio, and Gunicorn
The GIL Impact in Practice
The GIL (Global Interpreter Lock) prevents multiple Python threads from executing Python bytecode simultaneously. One thread runs at a time. For I/O operations, the GIL is released while waiting — so asyncio and multi-threaded I/O both work efficiently. For CPU operations, the GIL means multi-threaded Python is effectively single-threaded.
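The effect is easy to demonstrate: time the same pure-Python CPU task under a thread pool and a process pool. A sketch — absolute timings vary by machine; the ratio between the two runs is the point:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_task(n: int) -> int:
    # Pure-Python loop: holds the GIL the entire time it runs
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers: int, n: int = 2_000_000) -> float:
    # Run `workers` copies of cpu_task concurrently and return wall time
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(cpu_task, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Threads: roughly serial wall time — one GIL, one thread runs at a time
    print(f"threads:   {timed(ThreadPoolExecutor, 4):.2f}s")
    # Processes: near-linear speedup on 4+ cores — one GIL per process
    print(f"processes: {timed(ProcessPoolExecutor, 4):.2f}s")
```

Swap cpu_task for an I/O wait (time.sleep or a socket read) and the thread pool catches up, because the GIL is released while waiting.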
Implications for Gunicorn:
- --worker-class sync (default): each worker is a single-threaded process. The GIL is irrelevant — each process has its own interpreter. True parallelism up to worker count.
- --worker-class gthread: each worker is multi-threaded. Multiple threads share one process, one GIL. I/O can be concurrent; CPU is constrained by the GIL.
- --worker-class uvicorn.workers.UvicornWorker: each worker runs an asyncio event loop. I/O concurrency within each worker; CPU constrained by the GIL per worker.
For most FastAPI or Django applications on ECS, uvicorn.workers.UvicornWorker is the right choice: it combines Gunicorn’s multi-process stability with uvicorn’s async I/O efficiency.
uvloop for Higher Event Loop Performance
uvloop is a faster asyncio event loop implemented in Cython on top of libuv. Drop-in replacement:
import uvloop
import asyncio
# Install as default event loop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
# Or for FastAPI with uvicorn, pass --loop uvloop to uvicorn
# uvicorn app.main:app --loop uvloop
uvloop provides 2–4× throughput improvement for I/O-bound asyncio code in benchmarks. Real-world gains depend on the workload — applications spending most time in database queries see smaller improvements than those with many small I/O operations.
Production uvicorn Command
gunicorn app.main:app \
--bind 0.0.0.0:${PORT:-8000} \
--workers ${GUNICORN_WORKERS:-$(( 2 * $(nproc) + 1 ))} \
--worker-class uvicorn.workers.UvicornWorker \
--worker-connections 1000 \
--max-requests ${GUNICORN_MAX_REQUESTS:-1000} \
--max-requests-jitter ${GUNICORN_MAX_REQUESTS_JITTER:-100} \
--timeout ${GUNICORN_TIMEOUT:-30} \
--graceful-timeout ${GUNICORN_GRACEFUL_TIMEOUT:-30} \
--keep-alive ${GUNICORN_KEEPALIVE:-2} \
--log-level ${LOG_LEVEL:-info} \
--access-logfile - \
--error-logfile - \
--forwarded-allow-ips "*"
Expose the worker count and timeout as environment variables — this allows tuning per ECS task without rebuilding the image. In your ECS task definition, set GUNICORN_WORKERS based on the task’s vCPU allocation.
Celery Worker Types and GIL Impact
Celery supports multiple execution pool types. The right choice depends on your task workload:
# For CPU-bound tasks (data processing, image handling, ML inference):
# prefork — multiple OS processes, each with own GIL
# celery -A app worker --pool=prefork --concurrency=4
# For I/O-bound tasks (HTTP calls, database queries, Redis ops):
# gevent or eventlet — coroutine-based concurrency within one process
# celery -A app worker --pool=gevent --concurrency=100
# For debugging or strictly sequential processing:
# solo — runs each task inline in the main worker process, one at a time
# celery -A app worker --pool=solo
For ECS, prefork is the most reliable choice for production workloads. gevent provides more concurrency per worker for pure I/O workloads but has compatibility issues with some libraries that are not gevent-safe. Measure both in staging before committing.
Go: Goroutines, Pools, and GC Tuning
http.Server Production Configuration
The Go standard library net/http package is production-ready without additional frameworks. Every timeout matters:
package main
import (
"context"
"encoding/json"
"log"
"net/http"
"os"
"os/signal"
"syscall"
"time"
)
func main() {
mux := http.NewServeMux()
mux.HandleFunc("/health", healthHandler)
mux.HandleFunc("/api/orders", ordersHandler)
server := &http.Server{
Addr: ":8080",
Handler: mux,
// Time allowed to read the full request (headers + body)
// Prevents Slowloris attacks
ReadTimeout: 10 * time.Second,
// Time allowed to read request headers only
// More granular than ReadTimeout
ReadHeaderTimeout: 5 * time.Second,
// Time allowed to write the full response
// Includes time to send response headers and body
WriteTimeout: 30 * time.Second,
// Maximum time an idle keep-alive connection is kept open
// Set lower than ALB idle timeout (default 60s) to let the server
// close connections before the ALB does, avoiding 502 errors
IdleTimeout: 45 * time.Second,
// Maximum allowed header size in bytes
MaxHeaderBytes: 1 << 20, // 1 MB
}
// Start server in goroutine
go func() {
log.Printf("Server starting on %s", server.Addr)
if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
log.Fatalf("Server failed: %v", err)
}
}()
// Wait for OS signal
quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
<-quit
// Graceful shutdown: finish in-flight requests
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
log.Println("Shutting down server...")
if err := server.Shutdown(ctx); err != nil {
log.Fatalf("Server forced shutdown: %v", err)
}
log.Println("Server stopped")
}
func healthHandler(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("ok"))
}
func ordersHandler(w http.ResponseWriter, r *http.Request) {
// Context from request — cancelled when the client disconnects or the server shuts down
ctx := r.Context()
// Pass context to all downstream calls
result, err := fetchOrdersFromDB(ctx)
if err != nil {
if ctx.Err() != nil {
// Client disconnected or timeout — not a server error
http.Error(w, "Request cancelled", http.StatusRequestTimeout)
return
}
http.Error(w, "Internal server error", http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
// encoding/json.NewEncoder(w).Encode(result) is more efficient for large responses
// (streams directly to ResponseWriter without allocating a full buffer)
if err := json.NewEncoder(w).Encode(result); err != nil {
log.Printf("Failed to encode response: %v", err)
}
}
The IdleTimeout at 45 seconds versus ALB’s default 60-second idle timeout is an important detail. When an ALB closes an idle connection at 60 seconds, any in-flight request on that connection gets a 502 error. By closing the connection at 45 seconds (before the ALB), the server proactively removes the connection and the ALB does not have a stale connection to reuse for the next request.
sync.Pool for Allocation Reduction
Go’s garbage collector is low-latency but not zero-cost. Applications with high allocation rates (allocating new objects for each request) generate GC pressure that increases p99 latency. sync.Pool provides a pool of reusable objects with runtime-managed per-P local caches:
import (
"bytes"
"encoding/json"
"net/http"
"sync"
)
var bufferPool = sync.Pool{
New: func() any {
return new(bytes.Buffer)
},
}
func handleRequest(w http.ResponseWriter, r *http.Request) {
// Get a buffer from the pool (or allocate a new one if pool is empty)
buf := bufferPool.Get().(*bytes.Buffer)
buf.Reset() // Clear contents from previous use
defer bufferPool.Put(buf) // Return to pool when done
// Use buf for JSON encoding, template rendering, etc.
if err := json.NewEncoder(buf).Encode(responseData); err != nil {
http.Error(w, "encoding failed", http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
w.Write(buf.Bytes())
}
sync.Pool objects may be GC’d between uses — the pool is not a cache. Objects returned to the pool must be safe to reuse (always Reset() before use). Use it for short-lived allocations that are created and released within a request, not for long-lived state.
GOGC Tuning for ECS
Go’s garbage collector triggers when the heap grows by GOGC% since the last collection (default: 100%). This means GC runs when the heap doubles.
For memory-constrained ECS containers, lower GOGC to trigger GC more frequently, keeping heap size smaller at the cost of more CPU spent on GC:
# In ECS task definition environment variables:
# GOGC=50 triggers GC when heap grows 50% (more frequent, smaller heap)
# GOGC=200 triggers GC when the heap grows 200% since the last collection (less frequent, larger heap)
# Default GOGC=100 is appropriate for most workloads
# For containers under memory pressure (close to limit):
GOGC=50
# For containers with abundant memory (throughput over memory):
GOGC=200Go 1.19+ introduced GOMEMLIMIT which sets a soft memory limit for the Go runtime. When the heap approaches GOMEMLIMIT, GC runs more aggressively to stay under the limit. This is more practical for ECS than tuning GOGC:
# Set GOMEMLIMIT to 80% of container memory limit
# For a 512 MB container:
GOMEMLIMIT=409MiB
With GOMEMLIMIT, Go’s GC automatically tunes its aggressiveness to keep memory under the limit, reducing OOM kills without requiring manual GOGC tuning.
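A misspelled environment variable fails silently, so it is worth reading the effective values back at startup. A sketch using the query behavior of the runtime/debug setters (SetGCPercent returns the previous value; SetMemoryLimit with a negative argument only queries):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// gcConfig reads the effective GOGC and GOMEMLIMIT without changing them.
func gcConfig() (gogc int, memLimit int64) {
	gogc = debug.SetGCPercent(-1)       // disables GC and returns the old percent...
	debug.SetGCPercent(gogc)            // ...so restore it immediately
	memLimit = debug.SetMemoryLimit(-1) // negative input queries without adjusting
	return gogc, memLimit
}

func main() {
	gogc, limit := gcConfig()
	// Log once at startup; with no GOMEMLIMIT set, the limit is MaxInt64
	fmt.Printf("GOGC=%d GOMEMLIMIT=%d bytes\n", gogc, limit)
}
```

Logging this once per container start makes misconfigured task definitions visible in CloudWatch Logs.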
Goroutine Leak Detection
Goroutine leaks — goroutines that are spawned and never exit — accumulate memory over time and eventually crash the container. Common causes:
- Goroutines blocked on channels that are never written to.
- Goroutines blocked on mutex locks that are never released.
- HTTP client goroutines where the response body is never closed.
Detection in staging:
import (
"fmt"
"net/http"
"runtime"
)
func goroutineCountHandler(w http.ResponseWriter, r *http.Request) {
count := runtime.NumGoroutine()
fmt.Fprintf(w, "goroutines: %d\n", count)
}
In production, expose this via a /debug/vars endpoint (from the expvar package) and monitor goroutine_count via CloudWatch. A goroutine count that grows monotonically with request count (never decreasing) indicates leaks.
The goleak package detects goroutine leaks in tests:
func TestHandler(t *testing.T) {
defer goleak.VerifyNone(t)
// Test code here — goleak fails the test if goroutines are leaked
}
File Descriptors on ECS Containers
Every open network connection, file, and socket consumes a file descriptor. The OS default limit is often 1,024 per process. Under high concurrency, you exhaust this limit long before you hit CPU or memory limits.
Symptoms of file descriptor exhaustion:
- “too many open files” errors in application logs.
- New connections refused.
- Database connection pool errors.
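Before changing anything, confirm what limit the container actually received. From a shell inside the container (the main server process is usually PID 1; /proc/self refers to the shell you exec'd into):

```shell
# Effective open-file limit for this process
grep "open files" /proc/self/limits
# Descriptors currently in use by this process
ls /proc/self/fd | wc -l
# For the main server process, substitute /proc/1 for /proc/self
```

If the soft limit reads 1024, the ulimits below have not taken effect.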
Fix at the ECS task definition level:
{
"containerDefinitions": [
{
"name": "api",
"ulimits": [
{
"name": "nofile",
"softLimit": 65535,
"hardLimit": 65535
}
]
}
]
}
In Terraform:
resource "aws_ecs_task_definition" "api" {
family = "api"
container_definitions = jsonencode([
{
name = "api"
image = "${var.ecr_repository_url}:latest"
ulimits = [
{
name = "nofile"
softLimit = 65535
hardLimit = 65535
}
]
}
])
}
For Nginx containers, also configure worker_rlimit_nofile in nginx.conf:
worker_rlimit_nofile 65535;
events {
worker_connections 8192;
}
Each active connection requires two file descriptors (client socket + upstream socket for proxy). Effective max connections per worker: worker_connections / 2 = 4096. Total for 4 workers: 16,384 concurrent connections from one container.
Putting Runtime Tuning in Context
The configuration levers covered here have real impact — PHP-FPM worker counts affect throughput linearly, OPcache misses add 10–40ms to every request, event loop blocking turns p50 into p99. But tuning the runtime is the second-order optimization.
The first-order optimizations: eliminate N+1 queries, add caching for hot read paths, use connection pooling (PgBouncer for PostgreSQL, connection pool middleware for MySQL), and right-size ECS tasks based on measured resource utilization rather than guesswork.
A well-tuned runtime on an unoptimized application will plateau. An untuned runtime on a well-optimized application often performs adequately. Build in the right order.
Related reading: AWS ECS vs EKS: Container Orchestration Decision Guide and AWS Auto Scaling Strategies for EC2, ECS, and Lambda.