How to Debug Production Issues Across Distributed AWS Systems

DevOps & CI/CD · Palaniappan P · 15 min read

Quick summary: A 500ms latency spike in a distributed system could be a slow RDS query, a Lambda cold start, a downstream API timeout, or a CloudWatch Logs ingestion delay. Finding the cause requires correlated logs, traces, and metrics — not grep.

Key Takeaways

  • A 500ms latency spike in a distributed system could be a slow RDS query, a Lambda cold start, a downstream API timeout, or a CloudWatch Logs ingestion delay
  • Finding the cause requires correlated logs, traces, and metrics — not grep

You get a PagerDuty alert at 2:47 AM. P95 API latency crossed 800ms. Your SLA is 500ms. You SSH to nothing — there is no server to SSH to. Your application runs on Lambda, talks to an ECS sidecar, queries RDS Aurora, and calls two third-party APIs. The error rate is 0.3%, not high enough to trigger your error alarms but enough that customers are complaining. You have CloudWatch. You have 47 log groups.

This is the fundamental problem of distributed systems debugging: the evidence is spread across components that cannot see each other. printf debugging works when you have one process. It fails completely when a single user request touches five services in sequence, each logging to a different destination, with no common thread connecting those log lines.

The solution is not more logs. It is correlated observability: a trace ID that flows through every service boundary, structured log output that includes that trace ID on every line, and a query tool that can reconstruct a complete request timeline from those disparate log streams.

Why printf Fails at Distributed Scale

The mental model that breaks down: “if I add more logging, I can find the problem.” This works in monoliths. In a distributed system, you can have perfect logging in every individual service and still be unable to answer the question “why was this specific user’s request slow?”

Consider a request path: API Gateway → Lambda (authentication) → ECS service (business logic) → RDS Aurora (query) → SQS (event publish) → Lambda (async processor). A 500ms spike could be:

  • Lambda cold start (authentication Lambda spun down)
  • Aurora connection pool exhaustion under load
  • The SQS message enqueue when the queue is throttled
  • Network latency between the ECS task and Aurora during an Availability Zone failover
  • A third-party API timeout in the business logic service

Without distributed tracing, you are correlating by timestamp — looking at CloudWatch Logs for the authentication Lambda around 2:47 AM, looking at ECS service logs around the same time, looking at RDS Performance Insights for slow queries at that time. Timestamps are approximate, time zones are treacherous, and you have no guarantee the specific request you care about appears in the time window you think it does.

With distributed tracing, you have a single trace ID. Every log line from every service for that specific request includes the same trace ID. A single query reconstructs the complete timeline.
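As a minimal illustration of why the shared trace ID matters, here is a sketch that reconstructs one request's timeline from mixed log streams. The log lines and field names are hypothetical, following the logger conventions used later in this article:

```python
import json

# Hypothetical log lines from three log groups; the traceId field is the
# one injected by the structured loggers shown later in this article.
raw_lines = [
    '{"timestamp": "2024-01-15T02:47:03.120Z", "service": "ecs-orders", "traceId": "abc123", "message": "slow query executed", "durationMs": 490}',
    '{"timestamp": "2024-01-15T02:47:02.010Z", "service": "lambda-auth", "traceId": "abc123", "message": "token validated", "durationMs": 45}',
    '{"timestamp": "2024-01-15T02:47:02.900Z", "service": "lambda-auth", "traceId": "zzz999", "message": "token validated", "durationMs": 12}',
]

def reconstruct_timeline(lines, trace_id):
    """Keep only the entries for one trace, ordered chronologically."""
    entries = [json.loads(line) for line in lines]
    matching = [e for e in entries if e.get("traceId") == trace_id]
    # ISO-8601 timestamps sort correctly as strings
    return sorted(matching, key=lambda e: e["timestamp"])

for e in reconstruct_timeline(raw_lines, "abc123"):
    print(e["timestamp"], e["service"], e["message"])
```

This is exactly what a Logs Insights query does for you at scale; the point is that the filter key is the trace ID, not a timestamp window.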

The Three Pillars and What Each Misses

The standard observability framework is logs + metrics + traces. Each pillar is necessary; none is sufficient alone.

Metrics tell you something is wrong. A CloudWatch alarm fires when p99 latency exceeds your threshold. Metrics are cheap, fast, and excellent for alerting. What they miss: why. A latency spike in your RDS query time metric does not tell you which query, which transaction, or which code path caused it. Metrics aggregate — they lose the individual request context that makes root cause identification possible.

Logs contain individual request details. They tell you what happened. What they miss: the connection between services. A log line in Lambda saying “processed request req-abc-123 in 45ms” and a log line in ECS saying “slow query executed in 490ms” are not obviously connected — unless both lines contain the same trace ID, and you have tooling to query by that trace ID across both log groups simultaneously.

Traces connect the dots across service boundaries. They show you the exact sequence of operations, the duration of each span, and where time was spent within a request. What they miss: the full log context. A trace span saying “RDS query: 490ms” does not contain the actual SQL query, the database error message, or the application state at query time. Traces point you to the right place; logs tell you what actually happened there.

The operational model: alert on metrics, investigate with traces, diagnose with logs. Use traces to find which service and which operation is slow, then use structured log queries to read the detailed log context for that specific request.

Structured Logging with Trace ID Propagation

The foundational requirement for distributed debugging is that every log line includes the trace ID for the request being processed. This requires two things: structured logging (JSON output with a traceId field), and trace context propagation (the trace ID must be extracted from the incoming request and made available to the logger).

Node.js with OpenTelemetry and W3C Trace Context

// src/instrumentation.ts — loaded before application code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { AwsInstrumentation } from '@opentelemetry/instrumentation-aws-sdk';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
  }),
  // Send to ADOT collector sidecar, which forwards to X-Ray
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  instrumentations: [
    // Outbound HTTP calls automatically carry the traceparent/tracestate
    // headers via the textMapPropagator configured below
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new AwsInstrumentation({
      sqsExtractContextPropagationFromPayload: true,
    }),
  ],
  // W3C Trace Context — compatible with X-Ray via ADOT translation
  textMapPropagator: new W3CTraceContextPropagator(),
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0));
});

// src/logger.ts — structured logger with automatic trace ID injection
import { context, trace } from '@opentelemetry/api';

interface LogEntry {
  level: 'info' | 'warn' | 'error' | 'debug';
  message: string;
  traceId?: string;
  spanId?: string;
  service: string;
  timestamp: string;
  [key: string]: unknown;
}

function getTraceContext(): { traceId?: string; spanId?: string } {
  const span = trace.getActiveSpan();
  if (!span) {
    return {};
  }
  const ctx = span.spanContext();
  return {
    traceId: ctx.traceId,
    spanId: ctx.spanId,
  };
}

export const logger = {
  info(message: string, fields?: Record<string, unknown>) {
    const entry: LogEntry = {
      level: 'info',
      message,
      service: process.env.SERVICE_NAME || 'unknown',
      timestamp: new Date().toISOString(),
      ...getTraceContext(),
      ...fields,
    };
    console.log(JSON.stringify(entry));
  },

  warn(message: string, fields?: Record<string, unknown>) {
    const entry: LogEntry = {
      level: 'warn',
      message,
      service: process.env.SERVICE_NAME || 'unknown',
      timestamp: new Date().toISOString(),
      ...getTraceContext(),
      ...fields,
    };
    console.warn(JSON.stringify(entry));
  },

  error(message: string, error?: Error, fields?: Record<string, unknown>) {
    const entry: LogEntry = {
      level: 'error',
      message,
      service: process.env.SERVICE_NAME || 'unknown',
      timestamp: new Date().toISOString(),
      ...getTraceContext(),
      errorMessage: error?.message,
      errorStack: error?.stack,
      ...fields,
    };
    console.error(JSON.stringify(entry));
  },
};

// src/app.ts — Express application using structured logger
import './instrumentation'; // Must be the first import so auto-instrumentation patches modules before they load
import express from 'express';
import { logger } from './logger';

const app = express();
app.use(express.json());

app.get('/users/:id', async (req, res) => {
  const { id } = req.params;

  logger.info('Fetching user', { userId: id, path: req.path });

  try {
    const user = await fetchUserFromDatabase(id);

    if (!user) {
      logger.warn('User not found', { userId: id });
      return res.status(404).json({ error: 'User not found' });
    }

    logger.info('User fetched successfully', {
      userId: id,
      userEmail: user.email,
    });

    return res.json(user);
  } catch (err) {
    logger.error('Failed to fetch user', err as Error, { userId: id });
    return res.status(500).json({ error: 'Internal server error' });
  }
});

Every logger.info() call automatically includes the traceId and spanId from the active OpenTelemetry span — no manual context threading required. The HTTP instrumentation ensures the traceparent header from incoming requests creates the active span, and outbound HTTP calls automatically include the traceparent header for downstream propagation.

Python FastAPI with ADOT and X-Ray Backend

# main.py
import json
import logging
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.propagators.aws import AwsXRayPropagator
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased, ParentBased


# Structured JSON logger that injects trace context
class TraceAwareJsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        span = trace.get_current_span()
        ctx = span.get_span_context() if span else None

        log_entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "timestamp": self.formatTime(record),
            "service": os.environ.get("SERVICE_NAME", "unknown"),
        }

        if ctx and ctx.is_valid:
            # Convert OpenTelemetry trace ID to X-Ray format for AWS Console linking
            trace_id_hex = format(ctx.trace_id, '032x')
            xray_trace_id = f"1-{trace_id_hex[:8]}-{trace_id_hex[8:]}"
            log_entry["traceId"] = trace_id_hex
            log_entry["xrayTraceId"] = xray_trace_id
            log_entry["spanId"] = format(ctx.span_id, '016x')

        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)

        # Include any extra fields passed to the logger
        for key, value in record.__dict__.items():
            if key not in ('msg', 'args', 'levelname', 'levelno', 'pathname',
                          'filename', 'module', 'exc_info', 'exc_text',
                          'stack_info', 'lineno', 'funcName', 'created',
                          'msecs', 'relativeCreated', 'thread', 'threadName',
                          'processName', 'process', 'name', 'message'):
                log_entry[key] = value

        return json.dumps(log_entry)


def configure_logging():
    handler = logging.StreamHandler()
    handler.setFormatter(TraceAwareJsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])


def configure_tracing():
    # Head-based sampling: 5% of new traces, decided at the root span.
    # True tail-based sampling (e.g. keep 100% of errors) requires the
    # OTel Collector; the SDK alone cannot do it.
    sampler = ParentBased(
        root=TraceIdRatioBased(0.05),  # 5% sample rate for new traces
    )

    provider = TracerProvider(
        resource=Resource.create({
            "service.name": os.environ.get("SERVICE_NAME", "unknown"),
            "deployment.environment": os.environ.get("ENVIRONMENT", "production"),
        }),
        sampler=sampler,
    )

    # Send to ADOT collector → X-Ray
    otlp_exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(provider)

    # Use the X-Ray propagator to interoperate with native AWS services
    # (API Gateway, ALB) that speak the X-Amzn-Trace-Id header
    from opentelemetry.propagate import set_global_textmap
    set_global_textmap(AwsXRayPropagator())


@asynccontextmanager
async def lifespan(app: FastAPI):
    configure_logging()
    configure_tracing()

    # Auto-instrument libraries
    FastAPIInstrumentor.instrument_app(app)
    BotocoreInstrumentor().instrument()  # boto3/botocore AWS SDK calls
    Psycopg2Instrumentor().instrument()  # PostgreSQL queries

    yield

    trace.get_tracer_provider().shutdown()


app = FastAPI(lifespan=lifespan)
logger = logging.getLogger(__name__)


@app.get("/orders/{order_id}")
async def get_order(order_id: str, request: Request):
    logger.info("Fetching order", extra={"orderId": order_id})

    try:
        order = await fetch_order(order_id)

        if not order:
            logger.warning("Order not found", extra={"orderId": order_id})
            # FastAPI sets the status via HTTPException; returning a
            # (body, status) tuple is Flask style and will not set 404
            from fastapi import HTTPException
            raise HTTPException(status_code=404, detail="Order not found")

        logger.info(
            "Order fetched",
            extra={"orderId": order_id, "status": order["status"]}
        )
        return order

    except Exception as e:
        logger.error(
            "Failed to fetch order",
            exc_info=True,
            extra={"orderId": order_id}
        )
        raise

Go Structured Logging with zerolog

// pkg/logger/logger.go
package logger

import (
    "context"
    "os"

    "github.com/rs/zerolog"
    "go.opentelemetry.io/otel/trace"
)

// Logger wraps zerolog with automatic trace context injection
type Logger struct {
    zl      zerolog.Logger
    service string
}

func New(serviceName string) *Logger {
    zl := zerolog.New(os.Stdout).
        With().
        Timestamp().
        Str("service", serviceName).
        Logger()

    return &Logger{zl: zl, service: serviceName}
}

// withTrace returns a logger with trace context fields injected
func (l *Logger) withTrace(ctx context.Context) *zerolog.Logger {
    span := trace.SpanFromContext(ctx)

    if !span.SpanContext().IsValid() {
        return &l.zl
    }

    sc := span.SpanContext()
    enriched := l.zl.With().
        Str("traceId", sc.TraceID().String()).
        Str("spanId", sc.SpanID().String()).
        Logger()

    return &enriched
}

// Info logs at info level with trace context
func (l *Logger) Info(ctx context.Context, msg string, fields map[string]interface{}) {
    event := l.withTrace(ctx).Info()
    for k, v := range fields {
        event = event.Interface(k, v)
    }
    event.Msg(msg)
}

// Error logs at error level with trace context
func (l *Logger) Error(ctx context.Context, msg string, err error, fields map[string]interface{}) {
    event := l.withTrace(ctx).Error().Err(err)
    for k, v := range fields {
        event = event.Interface(k, v)
    }
    event.Msg(msg)
}

// cmd/api/main.go — HTTP handler using trace-aware logger
package main

import (
    "net/http"
    "os"

    "github.com/your-org/service/pkg/logger"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

var log = logger.New(os.Getenv("SERVICE_NAME"))
var tracer = otel.Tracer("api")

func getOrderHandler(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "getOrder")
    defer span.End()

    orderID := r.PathValue("orderId")
    span.SetAttributes(attribute.String("order.id", orderID))

    log.Info(ctx, "Fetching order", map[string]interface{}{
        "orderId": orderID,
        "userId":  r.Header.Get("X-User-ID"),
    })

    order, err := fetchOrderFromDB(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        log.Error(ctx, "Database query failed", err, map[string]interface{}{
            "orderId": orderID,
        })
        http.Error(w, "Internal server error", http.StatusInternalServerError)
        return
    }

    log.Info(ctx, "Order fetched successfully", map[string]interface{}{
        "orderId": orderID,
        "status":  order.Status,
    })

    // respond with order...
}

CloudWatch Logs Insights: Reconstructing a Request Timeline

With structured logging and trace ID propagation in place, CloudWatch Logs Insights becomes a powerful debugging tool. The following query reconstructs all log entries for a specific trace across multiple log groups:

# Query all log groups for a specific trace ID
# Run against: all relevant service log groups simultaneously

fields @timestamp, service, level, message, spanId, orderId, userId, errorMessage
| filter traceId = "4bf92f3577b34da6a3ce929d0e0e4736"
| sort @timestamp asc
| limit 500

Logs Insights allows querying multiple log groups in a single query — select all relevant log groups (Lambda, ECS services, API Gateway access logs) before running the query. The result is a chronological timeline of every logged event across all services for that specific request.
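The same multi-log-group query can be driven programmatically during an incident. A sketch using boto3's `start_query` and `get_query_results` (log group names, credentials, and region come from the caller's environment; the polling interval is illustrative):

```python
import time

def trace_query(trace_id: str) -> str:
    """Build the Logs Insights query string that reconstructs one
    request's timeline (same shape as the query above)."""
    return (
        "fields @timestamp, service, level, message, spanId, errorMessage\n"
        f'| filter traceId = "{trace_id}"\n'
        "| sort @timestamp asc\n"
        "| limit 500"
    )

def run_trace_query(log_groups, trace_id, start_ts, end_ts):
    """Run the query across several log groups at once and poll until
    it finishes. start_ts/end_ts are epoch seconds."""
    import boto3  # local import: only needed when actually querying AWS
    logs = boto3.client("logs")
    query_id = logs.start_query(
        logGroupNames=list(log_groups),
        startTime=start_ts,
        endTime=end_ts,
        queryString=trace_query(trace_id),
    )["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp
        time.sleep(1)
```

Passing several log groups in `logGroupNames` is what makes cross-service reconstruction a single operation rather than N separate searches.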

For finding trace IDs from a symptom (high latency, error), start with a metrics-first approach:

# Step 1: Find high-latency requests in the API Gateway access logs
fields @timestamp, traceId, requestId, status, responseLatency, path
| filter responseLatency > 800
| filter status < 400 or status >= 500
| sort @timestamp desc
| limit 20
# Step 2: Find all 5xx errors in the last hour with trace IDs
fields @timestamp, traceId, service, level, message, errorMessage
| filter level = "error" or level = "ERROR"
# time range (e.g. last hour) is set via the console's time picker
| stats count(*) as errorCount by traceId, service, message
| sort errorCount desc
| limit 50
# Step 3: Reconstruct the full timeline for a specific trace
fields @timestamp, service, level, message, spanId, durationMs, errorMessage
| filter traceId = "INSERT_TRACE_ID_FROM_STEP_2"
| sort @timestamp asc

Cost-aware querying: CloudWatch Logs Insights charges per GB scanned. Running Step 1 against all log groups scans everything. Minimize cost by:

  1. Selecting only API Gateway access logs for Step 1 (they contain latency)
  2. Using time range filters as narrow as the incident window
  3. Running exploratory queries on short time windows first, then expanding
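To make the trade-off concrete, a back-of-the-envelope estimator. The $0.005/GB default is a representative us-east-1 price and an assumption here; check current pricing for your region:

```python
def insights_scan_cost(gb_scanned: float, price_per_gb: float = 0.005) -> float:
    """Rough Logs Insights cost estimate. The default price per GB
    scanned is a representative us-east-1 figure (an assumption)."""
    return gb_scanned * price_per_gb

# Scanning every log group over 24 hours (say 800 GB) versus only the
# API Gateway access logs for a 30-minute incident window (say 2 GB):
broad = insights_scan_cost(800)
narrow = insights_scan_cost(2)
```

The narrow query is hundreds of times cheaper, which is why Step 1 should target only the log group that actually contains latency data.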

The 10-Minute Incident Response Workflow

A structured workflow gets you from alert to root cause hypothesis within 10 minutes. Longer investigation can follow, but the first 10 minutes determine whether you page more engineers or resolve it alone.

Minutes 0–2: Characterize the symptom. Open your metrics dashboard (CloudWatch or Grafana). Answer: which metric triggered? Which service owns that metric? Is it latency, error rate, or throughput? Is it trending up, spiking, or flat at a new level? A spike points to a transient event (deployment, traffic burst, dependency failure). A flat new level points to a configuration change or a code change that was deployed.

Minutes 2–4: Isolate the service boundary. If you have a service map (X-Ray Service Map or Grafana service graph), look for the red node — which service’s latency increased relative to its callers? A downstream service being slow (not the first service in the call chain) indicates the root cause is lower in the stack. Work down from the entry point to find the service where latency originates.

Minutes 4–6: Find representative trace IDs. Use the Logs Insights query from Step 1 above against the entry-point service’s log group. Grab 3–5 trace IDs from the high-latency time window. In X-Ray Console, open the trace timeline for each — the Gantt chart shows exactly which span is long.

Minutes 6–8: Read the detailed logs. Take the trace ID from the long span you found in step 3. Run the full trace reconstruction query against all relevant log groups. Read the log output chronologically. The error message, the slow query, or the timeout will be visible in the structured log fields.

Minutes 8–10: Formulate the hypothesis. You should now know: the specific operation that was slow, the service that owns it, and the log context around the failure. Common patterns: RDS slow query (look for durationMs > 1000 in your database layer logs), Lambda cold start (look for init spans in X-Ray trace), third-party API timeout (look for ETIMEDOUT or similar in your HTTP client logs), connection pool exhaustion (look for connection wait time in database layer).
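The pattern matching in minutes 8–10 can be sketched as a rough first-pass classifier over structured fields. The field names (`errorMessage`, `spanName`, `durationMs`) and thresholds follow this article's logger conventions, not any standard:

```python
def classify_failure(entry: dict) -> str:
    """Map one structured log/trace entry to a common failure pattern.
    A rough heuristic, not a substitute for reading the trace."""
    msg = (entry.get("errorMessage") or "").lower()
    span = (entry.get("spanName") or "").lower()
    if "etimedout" in msg or "timeout" in msg:
        return "third-party or network timeout"
    if "init" in span:
        return "Lambda cold start"
    if "pool" in msg or "connection wait" in msg:
        return "connection pool exhaustion"
    if entry.get("durationMs", 0) > 1000 and "query" in span:
        return "slow database query"
    return "unclassified"
```

Teams often encode exactly this kind of heuristic into runbooks or alert annotations so the first responder starts from a hypothesis, not a blank page.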

Edge Cases and Hard Problems

Missing Logs

CloudWatch Logs ingestion is asynchronous and can have delays of 5–60 seconds under normal conditions, up to several minutes under load. When debugging a real-time incident, logs from the last 2–3 minutes may not yet be queryable. Work with logs from 5+ minutes ago to avoid false negatives.

Lambda functions that crash (out of memory, timeout, initialization failure) may not flush their logs before termination. The last few log lines before a crash are frequently missing. To recover: check CloudWatch Metrics for the Lambda function — Errors, Throttles, Duration Max will tell you the type of failure even without log evidence.

Partial Failures and Opaque Dependencies

SQS, SNS, and EventBridge do not propagate OpenTelemetry trace context automatically — trace context breaks at async service boundaries unless you explicitly include it in the message attributes. The @opentelemetry/instrumentation-aws-sdk handles this for some services (SQS message attributes) but you must configure it correctly. Verify that your SQS consumer is reading the traceparent attribute from incoming messages and activating the parent context before processing.
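When instrumentation does not inject context for you, the fix is to carry the W3C `traceparent` value in a message attribute yourself. A minimal stdlib-only sketch of building and parsing that value; the attribute shape matches boto3's `send_message` API, and the IDs are the W3C specification's example values:

```python
def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """W3C traceparent value: version-traceId-spanId-flags, lowercase hex."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(value: str) -> dict:
    version, trace_id, span_id, flags = value.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError(f"malformed traceparent: {value}")
    return {"traceId": trace_id, "spanId": span_id, "sampled": flags == "01"}

# Producer side: shape matches boto3 send_message MessageAttributes.
attrs = {
    "traceparent": {
        "DataType": "String",
        "StringValue": build_traceparent(
            "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"
        ),
    }
}

# Consumer side: read the attribute and hand the value to your
# propagator so child spans attach to the producer's trace.
parsed = parse_traceparent(attrs["traceparent"]["StringValue"])
```

In practice you would let the OpenTelemetry propagator do the inject/extract, but understanding the wire format makes it easy to verify in raw SQS messages that context is actually flowing.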

For third-party APIs (Stripe, Twilio, external partners), you have no visibility inside their systems. The trace span simply shows a long duration on the HTTP call. In this case, the debugging question shifts: is this API consistently slow (baseline latency issue), intermittently slow (rate limiting or their infrastructure problem), or is it failing silently (returning 200 but with error payloads)? Log the full response body (for non-sensitive APIs), the HTTP status, and the duration for every third-party call. When you see a latency spike originating from an external API span, that is the end of your debugging — it is their problem, and your action is to implement a circuit breaker.
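A circuit breaker for a flaky third-party API can be as simple as a consecutive-failure counter with a cooldown. A minimal sketch; thresholds are illustrative, and production systems usually reach for a hardened library instead:

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive failures; refuse calls for
    cooldown_s seconds, then allow one probe (half-open)."""

    def __init__(self, max_failures: int = 5, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let one probe through (half-open state)
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

Wrap each third-party call in `if breaker.allow(): ...` and record the outcome; when the circuit is open, fail fast or serve a fallback instead of letting the external latency propagate into your p95.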

CloudWatch Logs Insights Scale Limits

Logs Insights queries time out for very large log groups (multi-TB) or very long time windows. If your query times out, split the time window. A query scanning 24 hours of a high-volume log group can be split into 4 × 6-hour queries. For organizations at this scale, consider forwarding logs to OpenSearch (formerly Elasticsearch) or a purpose-built log analytics system — CloudWatch Logs Insights is convenient but not designed for multi-TB interactive query patterns.
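Splitting the window is mechanical. A small helper, a sketch not tied to any specific API:

```python
from datetime import datetime, timedelta

def split_window(start: datetime, end: datetime, chunk: timedelta):
    """Split [start, end) into chunk-sized windows, one per query."""
    windows = []
    cursor = start
    while cursor < end:
        windows.append((cursor, min(cursor + chunk, end)))
        cursor += chunk
    return windows

# 24 hours of a high-volume log group as four 6-hour queries:
day = split_window(
    datetime(2024, 1, 15), datetime(2024, 1, 16), timedelta(hours=6)
)
```

Each window becomes one Logs Insights query; run them sequentially or in parallel and merge the results.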


The shift from “add more logging” to “correlate existing signals” changes how you staff on-call rotations. With structured logs, trace IDs, and a practiced 10-minute workflow, a single on-call engineer can debug complex distributed issues without needing to wake up domain experts at 3 AM. The instrumentation investment — a few days of work to add OpenTelemetry to your services — pays back in reduced mean time to resolution on every incident for the lifetime of the system.


Related reading: AWS CloudWatch Observability: Metrics, Logs, Alarms Best Practices covers the infrastructure that supports this debugging workflow. AWS CloudWatch Logging Costs and Observability covers managing the cost of logs at scale — the other side of the observability equation.
