---
title: How to Debug Production Issues Across Distributed AWS Systems
description: A 500ms latency spike in a distributed system could be a slow RDS query, a Lambda cold start, a downstream API timeout, or a CloudWatch Logs ingestion delay. Finding the cause requires correlated logs, traces, and metrics — not grep.
url: https://www.factualminds.com/blog/debug-production-distributed-aws-systems/
datePublished: 2026-03-29T00:00:00.000Z
dateModified: 2026-06-10T00:00:00.000Z
author: palaniappan-p
category: DevOps & CI/CD
tags: how-to-guide, observability, debugging, aws-performance-optimization, cloudwatch, xray, opentelemetry, distributed-tracing, aws, production, logs, metrics
---

# How to Debug Production Issues Across Distributed AWS Systems

> A 500ms latency spike in a distributed system could be a slow RDS query, a Lambda cold start, a downstream API timeout, or a CloudWatch Logs ingestion delay. Finding the cause requires correlated logs, traces, and metrics — not grep.

You get a PagerDuty alert at 2:47 AM. P95 API latency crossed 800ms. Your SLA is 500ms. You SSH to nothing — there is no server to SSH to. Your application runs on Lambda, talks to an ECS sidecar, queries RDS Aurora, and calls two third-party APIs. The error rate is 0.3%, not high enough to trigger your error alarms but enough that customers are complaining. You have CloudWatch. You have 47 log groups.

This is the fundamental problem of distributed systems debugging: the evidence is spread across components that cannot see each other. `printf` debugging works when you have one process. It fails completely when a single user request touches five services in sequence, each logging to a different destination, with no common thread connecting those log lines.

The solution is not more logs. It is correlated observability: a trace ID that flows through every service boundary, structured log output that includes that trace ID on every line, and a query tool that can reconstruct a complete request timeline from those disparate log streams.

## Why printf Fails at Distributed Scale

The mental model that breaks down: "if I add more logging, I can find the problem." This works in monoliths. In a distributed system, you can have perfect logging in every individual service and still be unable to answer the question "why was this specific user's request slow?"

Consider a request path: API Gateway → Lambda (authentication) → ECS service (business logic) → RDS Aurora (query) → SQS (event publish) → Lambda (async processor). A 500ms spike could be:

- Lambda cold start (authentication Lambda spun down)
- Aurora connection pool exhaustion under load
- The SQS message enqueue when requests to the queue are throttled
- Network latency between the ECS task and Aurora during an Availability Zone failover
- A third-party API timeout in the business logic service

Without distributed tracing, you are correlating by timestamp — looking at CloudWatch Logs for the authentication Lambda around 2:47 AM, looking at ECS service logs around the same time, looking at RDS Performance Insights for slow queries at that time. Timestamps are approximate, time zones are treacherous, and you have no guarantee the specific request you care about appears in the time window you think it does.

With distributed tracing, you have a single trace ID. Every log line from every service for that specific request includes the same trace ID. A single query reconstructs the complete timeline.
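
To make the contrast concrete, here is a minimal sketch of what "a single query reconstructs the complete timeline" means once every line carries a trace ID. It assumes each service emits one JSON object per line with `traceId`, `timestamp`, `service`, and `message` fields; those field names and the sample lines are illustrative, not taken from a real system.

```python
import json
from typing import Iterable


def reconstruct_timeline(log_lines: Iterable[str], trace_id: str) -> list:
    """Return every structured log entry for one trace, oldest first."""
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines such as platform START/END records
        if isinstance(entry, dict) and entry.get("traceId") == trace_id:
            entries.append(entry)
    # ISO-8601 UTC timestamps in one consistent format sort correctly as strings
    return sorted(entries, key=lambda e: e.get("timestamp", ""))


# Hypothetical lines from three services, interleaved in arrival order
sample_lines = [
    '{"timestamp": "2026-03-29T02:47:01.120Z", "service": "orders", "traceId": "abc", "message": "slow query", "durationMs": 490}',
    '{"timestamp": "2026-03-29T02:47:01.050Z", "service": "auth", "traceId": "abc", "message": "token verified"}',
    'START RequestId: 1234',
    '{"timestamp": "2026-03-29T02:47:01.060Z", "service": "auth", "traceId": "zzz", "message": "token verified"}',
    '{"timestamp": "2026-03-29T02:47:01.700Z", "service": "events", "traceId": "abc", "message": "event published"}',
]

timeline = reconstruct_timeline(sample_lines, "abc")
for entry in timeline:
    print(entry["timestamp"], entry["service"], entry["message"])
```

No timestamp window is involved: the filter is an exact match on the trace ID, so a request from another user that happened in the same millisecond never appears in the result.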

## The Three Pillars and What Each Misses

The standard observability framework is logs + metrics + traces. Each pillar is necessary; none is sufficient alone.

**Metrics** tell you something is wrong. A CloudWatch alarm fires when p99 latency exceeds your threshold. Metrics are cheap, fast, and excellent for alerting. What they miss: why. A latency spike in your RDS query time metric does not tell you which query, which transaction, or which code path caused it. Metrics aggregate — they lose the individual request context that makes root cause identification possible.

**Logs** contain individual request details. They tell you what happened. What they miss: the connection between services. A log line in Lambda saying "processed request req-abc-123 in 45ms" and a log line in ECS saying "slow query executed in 490ms" are not obviously connected — unless both lines contain the same trace ID, and you have tooling to query by that trace ID across both log groups simultaneously.

**Traces** connect the dots across service boundaries. They show you the exact sequence of operations, the duration of each span, and where time was spent within a request. What they miss: the full log context. A trace span saying "RDS query: 490ms" does not contain the actual SQL query, the database error message, or the application state at query time. Traces point you to the right place; logs tell you what actually happened there.

The operational model: alert on metrics, investigate with traces, diagnose with logs. Use traces to find which service and which operation is slow, then use structured log queries to read the detailed log context for that specific request.

## Structured Logging with Trace ID Propagation

The foundational requirement for distributed debugging is that every log line includes the trace ID for the request being processed. This requires two things: structured logging (JSON output with a `traceId` field), and trace context propagation (the trace ID must be extracted from the incoming request and made available to the logger).
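
Before looking at SDK setup, it helps to see what actually travels between services. Under W3C Trace Context, the `traceparent` header has four dash-separated hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags. The sketch below parses version `00` headers by hand purely for illustration; in a real service the OpenTelemetry propagator does this for you, and the `TraceParent` type and function name are made up for this example.

```python
import re
from typing import NamedTuple, Optional

# version "00", 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the "sampled" flag
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```

The `trace_id` field is the value that should end up as `traceId` on every log line for that request.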

### Node.js with OpenTelemetry and W3C Trace Context

```typescript
// src/instrumentation.ts — loaded before application code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { AwsInstrumentation } from '@opentelemetry/instrumentation-aws-sdk';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
  }),
  // Send to ADOT collector sidecar, which forwards to X-Ray
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  instrumentations: [
    // Outbound HTTP calls carry `traceparent`/`tracestate` automatically:
    // header injection is handled by the textMapPropagator configured below
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new AwsInstrumentation({
      sqsExtractContextPropagationFromPayload: true,
    }),
  ],
  // W3C Trace Context — compatible with X-Ray via ADOT translation
  textMapPropagator: new W3CTraceContextPropagator(),
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0));
});
```

```typescript
// src/logger.ts — structured logger with automatic trace ID injection
import { context, trace } from '@opentelemetry/api';

interface LogEntry {
  level: 'info' | 'warn' | 'error' | 'debug';
  message: string;
  traceId?: string;
  spanId?: string;
  service: string;
  timestamp: string;
  [key: string]: unknown;
}

function getTraceContext(): { traceId?: string; spanId?: string } {
  const span = trace.getActiveSpan();
  if (!span) {
    return {};
  }
  const ctx = span.spanContext();
  return {
    traceId: ctx.traceId,
    spanId: ctx.spanId,
  };
}

export const logger = {
  info(message: string, fields?: Record<string, unknown>) {
    const entry: LogEntry = {
      level: 'info',
      message,
      service: process.env.SERVICE_NAME || 'unknown',
      timestamp: new Date().toISOString(),
      ...getTraceContext(),
      ...fields,
    };
    console.log(JSON.stringify(entry));
  },

  warn(message: string, fields?: Record<string, unknown>) {
    const entry: LogEntry = {
      level: 'warn',
      message,
      service: process.env.SERVICE_NAME || 'unknown',
      timestamp: new Date().toISOString(),
      ...getTraceContext(),
      ...fields,
    };
    console.warn(JSON.stringify(entry));
  },

  error(message: string, error?: Error, fields?: Record<string, unknown>) {
    const entry: LogEntry = {
      level: 'error',
      message,
      service: process.env.SERVICE_NAME || 'unknown',
      timestamp: new Date().toISOString(),
      ...getTraceContext(),
      errorMessage: error?.message,
      errorStack: error?.stack,
      ...fields,
    };
    console.error(JSON.stringify(entry));
  },
};
```

```typescript
// src/app.ts — Express application using structured logger
import './instrumentation'; // Must be the first import so modules are patched before they load
import express from 'express';
import { logger } from './logger';

const app = express();
app.use(express.json());

app.get('/users/:id', async (req, res) => {
  const { id } = req.params;

  logger.info('Fetching user', { userId: id, path: req.path });

  try {
    const user = await fetchUserFromDatabase(id);

    if (!user) {
      logger.warn('User not found', { userId: id });
      return res.status(404).json({ error: 'User not found' });
    }

    logger.info('User fetched successfully', {
      userId: id,
      userEmail: user.email,
    });

    return res.json(user);
  } catch (err) {
    logger.error('Failed to fetch user', err as Error, { userId: id });
    return res.status(500).json({ error: 'Internal server error' });
  }
});
```

Every `logger.info()` call automatically includes the `traceId` and `spanId` from the active OpenTelemetry span — no manual context threading required. The HTTP instrumentation ensures the `traceparent` header from incoming requests creates the active span, and outbound HTTP calls automatically include the `traceparent` header for downstream propagation.

### Python FastAPI with ADOT and X-Ray Backend

```python
# main.py
import json
import logging
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.aws import AwsXRayPropagator
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased


# Structured JSON logger that injects trace context
class TraceAwareJsonFormatter(logging.Formatter):
    # Standard LogRecord attributes that should not be copied into the output
    _RESERVED = frozenset((
        'msg', 'args', 'levelname', 'levelno', 'pathname', 'filename',
        'module', 'exc_info', 'exc_text', 'stack_info', 'lineno', 'funcName',
        'created', 'msecs', 'relativeCreated', 'thread', 'threadName',
        'processName', 'process', 'name', 'message', 'asctime', 'taskName',
    ))

    def format(self, record: logging.LogRecord) -> str:
        span = trace.get_current_span()
        ctx = span.get_span_context() if span else None

        log_entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "timestamp": self.formatTime(record),
            "service": os.environ.get("SERVICE_NAME", "unknown"),
        }

        if ctx and ctx.is_valid:
            # Convert OpenTelemetry trace ID to X-Ray format for AWS Console linking
            trace_id_hex = format(ctx.trace_id, '032x')
            xray_trace_id = f"1-{trace_id_hex[:8]}-{trace_id_hex[8:]}"
            log_entry["traceId"] = trace_id_hex
            log_entry["xrayTraceId"] = xray_trace_id
            log_entry["spanId"] = format(ctx.span_id, '016x')

        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)

        # Include any extra fields passed to the logger via `extra=`
        for key, value in record.__dict__.items():
            if key not in self._RESERVED:
                log_entry[key] = value

        # default=str keeps logging from failing on non-JSON-serializable extras
        return json.dumps(log_entry, default=str)


def configure_logging():
    handler = logging.StreamHandler()
    handler.setFormatter(TraceAwareJsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])


def configure_tracing():
    # Head-based sampling: keep 5% of new traces and follow the parent's decision otherwise.
    # Keeping 100% of errors requires tail-based sampling, which is done in the
    # OTel Collector rather than in the SDK.
    sampler = ParentBased(
        root=TraceIdRatioBased(0.05),  # 5% sample rate for new traces
    )

    provider = TracerProvider(
        resource=Resource.create({
            "service.name": os.environ.get("SERVICE_NAME", "unknown"),
            "deployment.environment": os.environ.get("ENVIRONMENT", "production"),
        }),
        sampler=sampler,
    )

    # Send to ADOT collector → X-Ray
    otlp_exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(provider)

    # Use the X-Ray propagator to interoperate with native AWS services.
    # Add further propagators to this list if callers send other header formats.
    set_global_textmap(CompositePropagator([
        AwsXRayPropagator(),  # For AWS services (API Gateway, ALB)
    ]))


configure_logging()
configure_tracing()

# Auto-instrument client libraries before they are used
BotocoreInstrumentor().instrument()  # boto3/botocore AWS SDK calls
Psycopg2Instrumentor().instrument()  # PostgreSQL queries


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Flush buffered spans on shutdown
    trace.get_tracer_provider().shutdown()


app = FastAPI(lifespan=lifespan)
# Instrument this app instance; middleware must be added before the app starts serving
FastAPIInstrumentor.instrument_app(app)
logger = logging.getLogger(__name__)


@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    logger.info("Fetching order", extra={"orderId": order_id})

    try:
        # fetch_order is the application's own data-access coroutine (not shown)
        order = await fetch_order(order_id)
    except Exception:
        logger.error(
            "Failed to fetch order",
            exc_info=True,
            extra={"orderId": order_id}
        )
        raise

    if not order:
        logger.warning("Order not found", extra={"orderId": order_id})
        # FastAPI does not support Flask-style (body, status) tuples
        raise HTTPException(status_code=404, detail="Order not found")

    logger.info(
        "Order fetched",
        extra={"orderId": order_id, "status": order["status"]}
    )
    return order
```

### Go Structured Logging with zerolog

```go
// pkg/logger/logger.go
package logger

import (
    "context"
    "os"

    "github.com/rs/zerolog"
    "go.opentelemetry.io/otel/trace"
)

// Logger wraps zerolog with automatic trace context injection
type Logger struct {
    zl      zerolog.Logger
    service string
}

func New(serviceName string) *Logger {
    zl := zerolog.New(os.Stdout).
        With().
        Timestamp().
        Str("service", serviceName).
        Logger()

    return &Logger{zl: zl, service: serviceName}
}

// withTrace returns a logger with trace context fields injected
// when ctx carries a valid span; otherwise it returns the base logger.
func (l *Logger) withTrace(ctx context.Context) *zerolog.Logger {
    // Extract the active span from the request context
    span := trace.SpanFromContext(ctx)

    if !span.SpanContext().IsValid() {
        return &l.zl
    }

    sc := span.SpanContext()
    enriched := l.zl.With().
        Str("traceId", sc.TraceID().String()).
        Str("spanId", sc.SpanID().String()).
        Logger()

    return &enriched
}

// Info logs at info level with trace context
func (l *Logger) Info(ctx context.Context, msg string, fields map[string]interface{}) {
    event := l.withTrace(ctx).Info()
    for k, v := range fields {
        event = event.Interface(k, v)
    }
    event.Msg(msg)
}

// Error logs at error level with trace context
func (l *Logger) Error(ctx context.Context, msg string, err error, fields map[string]interface{}) {
    event := l.withTrace(ctx).Error().Err(err)
    for k, v := range fields {
        event = event.Interface(k, v)
    }
    event.Msg(msg)
}
```

```go
// cmd/api/main.go — HTTP handler using trace-aware logger
package main

import (
    "net/http"
    "os"

    "github.com/your-org/service/pkg/logger"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

var log = logger.New(os.Getenv("SERVICE_NAME"))
var tracer = otel.Tracer("api")

func getOrderHandler(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "getOrder")
    defer span.End()

    orderID := r.PathValue("orderId")
    span.SetAttributes(attribute.String("order.id", orderID))

    log.Info(ctx, "Fetching order", map[string]interface{}{
        "orderId": orderID,
        "userId":  r.Header.Get("X-User-ID"),
    })

    order, err := fetchOrderFromDB(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        log.Error(ctx, "Database query failed", err, map[string]interface{}{
            "orderId": orderID,
        })
        http.Error(w, "Internal server error", http.StatusInternalServerError)
        return
    }

    log.Info(ctx, "Order fetched successfully", map[string]interface{}{
        "orderId": orderID,
        "status":  order.Status,
    })

    // respond with order...
}
```

## CloudWatch Logs Insights: Reconstructing a Request Timeline

With structured logging and trace ID propagation in place, CloudWatch Logs Insights becomes a powerful debugging tool. The following query reconstructs all log entries for a specific trace across multiple log groups:

```
# Query all log groups for a specific trace ID
# Run against: all relevant service log groups simultaneously

fields @timestamp, service, level, message, spanId, orderId, userId, errorMessage
| filter traceId = "4bf92f3577b34da6a3ce929d0e0e4736"
| sort @timestamp asc
| limit 500
```

Logs Insights allows querying multiple log groups in a single query — select all relevant log groups (Lambda, ECS services, API Gateway access logs) before running the query. The result is a chronological timeline of every logged event across all services for that specific request.
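
During an incident it is often quicker to run this query from a script than from the console. The sketch below uses the boto3 CloudWatch Logs `start_query` and `get_query_results` calls. The helper names, the selected fields, and the assumption that trace IDs are 32 lowercase hex characters are choices made for this example.

```python
import re
import time
from typing import Any

_FINISHED = ("Complete", "Failed", "Cancelled", "Timeout")


def build_trace_query(trace_id: str, limit: int = 500) -> str:
    """Build the trace-reconstruction query, rejecting malformed trace IDs."""
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError(f"not a 32-character hex trace ID: {trace_id!r}")
    return (
        "fields @timestamp, service, level, message, spanId, errorMessage\n"
        f'| filter traceId = "{trace_id}"\n'
        "| sort @timestamp asc\n"
        f"| limit {limit}"
    )


def rows_to_dicts(results: list) -> list:
    """Flatten Logs Insights rows ([{field, value}, ...]) into plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def run_trace_query(logs_client: Any, log_group_names: list, trace_id: str,
                    start_epoch: int, end_epoch: int,
                    poll_seconds: float = 1.0) -> list:
    """Start the query, poll until it finishes, and return rows as dicts.

    logs_client is expected to be boto3.client("logs"); times are epoch seconds.
    """
    query_id = logs_client.start_query(
        logGroupNames=log_group_names,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=build_trace_query(trace_id),
    )["queryId"]

    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in _FINISHED:
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {response['status']}")
    return rows_to_dicts(response["results"])
```

A call might look like `run_trace_query(boto3.client("logs"), ["/aws/lambda/auth", "/ecs/orders"], trace_id, start, end)`, where the log group names are placeholders for your own.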

For finding trace IDs from a symptom (high latency, error), start with a metrics-first approach:

```
# Step 1: Find high-latency requests in the API Gateway access logs
fields @timestamp, traceId, requestId, status, responseLatency, path
| filter responseLatency > 800
| filter status < 400 or status >= 500
| sort @timestamp desc
| limit 20
```

```
# Step 2: Find error-level log entries and group them by trace ID
# Set the query time range to the last hour (console time picker, or startTime/endTime in the API)
fields @timestamp, traceId, service, level, message, errorMessage
| filter level = "error" or level = "ERROR"
| stats count(*) as errorCount by traceId, service, message
| sort errorCount desc
| limit 50
```

```
# Step 3: Reconstruct the full timeline for a specific trace
fields @timestamp, service, level, message, spanId, durationMs, errorMessage
| filter traceId = "INSERT_TRACE_ID_FROM_STEP_2"
| sort @timestamp asc
```

**Cost-aware querying**: CloudWatch Logs Insights charges per GB scanned. Running Step 1 against all log groups scans everything. Minimize cost by:

1. Selecting only API Gateway access logs for Step 1 (they contain latency)
2. Using time range filters as narrow as the incident window
3. Running exploratory queries on short time windows first, then expanding

## The 10-Minute Incident Response Workflow

A structured workflow gets you from alert to root cause hypothesis within 10 minutes. Longer investigation can follow, but the first 10 minutes determine whether you page more engineers or resolve it alone.

**Minutes 0–2: Characterize the symptom**. Open your metrics dashboard (CloudWatch or Grafana). Answer: which metric triggered? Which service owns that metric? Is it latency, error rate, or throughput? Is it trending up, spiking, or flat at a new level? A spike points to a transient event (deployment, traffic burst, dependency failure). A flat new level points to a configuration change or a code change that was deployed.

**Minutes 2–4: Isolate the service boundary**. If you have a service map (X-Ray Service Map or Grafana service graph), look for the red node — which service's latency increased relative to its callers? A downstream service being slow (not the first service in the call chain) indicates the root cause is lower in the stack. Work down from the entry point to find the service where latency originates.

**Minutes 4–6: Find representative trace IDs**. Use the Logs Insights query from Step 1 above against the entry-point service's log group. Grab 3–5 trace IDs from the high-latency time window. In X-Ray Console, open the trace timeline for each — the Gantt chart shows exactly which span is long.

**Minutes 6–8: Read the detailed logs**. Take the trace ID from the long span you found in the previous step. Run the full trace reconstruction query against all relevant log groups. Read the log output chronologically. The error message, the slow query, or the timeout will be visible in the structured log fields.

**Minutes 8–10: Formulate the hypothesis**. You should now know: the specific operation that was slow, the service that owns it, and the log context around the failure. Common patterns: RDS slow query (look for `durationMs > 1000` in your database layer logs), Lambda cold start (look for `init` spans in X-Ray trace), third-party API timeout (look for `ETIMEDOUT` or similar in your HTTP client logs), connection pool exhaustion (look for connection wait time in database layer).
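
The common patterns above can be turned into a first-pass classifier over the reconstructed timeline. This is a rough sketch, not a diagnosis tool: `durationMs` and `ETIMEDOUT` come from the patterns just described, while the `coldStart` and `poolWaitMs` fields and the 100 ms pool-wait threshold are hypothetical and would need to match whatever your own services log.

```python
from collections import Counter
from typing import Optional

SLOW_QUERY_MS = 1000   # threshold from the workflow above
POOL_WAIT_MS = 100     # assumed threshold; tune for your connection pool


def classify_entry(entry: dict) -> Optional[str]:
    """Map one structured log entry to a likely root-cause label, if any."""
    text = f"{entry.get('message', '')} {entry.get('errorMessage', '')}"
    duration = entry.get("durationMs")
    if isinstance(duration, (int, float)) and duration > SLOW_QUERY_MS:
        return "slow_query"
    if "ETIMEDOUT" in text:
        return "upstream_timeout"
    if entry.get("coldStart") is True:
        return "cold_start"
    wait = entry.get("poolWaitMs")
    if isinstance(wait, (int, float)) and wait > POOL_WAIT_MS:
        return "pool_exhaustion"
    return None


def summarize(timeline: list) -> Counter:
    """Count how often each pattern appears across a trace's log entries."""
    labels = (classify_entry(entry) for entry in timeline)
    return Counter(label for label in labels if label is not None)


# Hypothetical entries from one slow trace
example_timeline = [
    {"service": "auth", "message": "token verified", "coldStart": True},
    {"service": "orders", "message": "query executed", "durationMs": 1840},
    {"service": "orders", "message": "call failed", "errorMessage": "connect ETIMEDOUT"},
    {"service": "orders", "message": "order returned", "durationMs": 12},
]
print(summarize(example_timeline))
```

The counts are a starting point for the hypothesis, not a replacement for reading the log lines themselves.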

## Edge Cases and Hard Problems

### Missing Logs

CloudWatch Logs ingestion is asynchronous and can have delays of 5–60 seconds under normal conditions, up to several minutes under load. When debugging a real-time incident, logs from the last 2–3 minutes may not yet be queryable. Work with logs from 5+ minutes ago to avoid false negatives.

Lambda functions that crash (out of memory, timeout, initialization failure) may not flush their logs before termination. The last few log lines before a crash are frequently missing. To recover: check CloudWatch Metrics for the Lambda function — `Errors`, `Throttles`, `Duration` Max will tell you the type of failure even without log evidence.

### Partial Failures and Opaque Dependencies

SQS, SNS, and EventBridge do not propagate OpenTelemetry trace context automatically — trace context breaks at async service boundaries unless you explicitly include it in the message attributes. The `@opentelemetry/instrumentation-aws-sdk` handles this for some services (SQS message attributes) but you must configure it correctly. Verify that your SQS consumer is reading the `traceparent` attribute from incoming messages and activating the parent context before processing.
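
If you are not relying on auto-instrumentation, the manual version of this is small. The sketch below shows the shape of the message attributes on both sides, assuming the producer already has the `traceparent` header value as a string. The helper names are invented for this example, and the consumer side assumes the message shape returned by boto3 `receive_message`.

```python
from typing import Optional

TRACEPARENT_ATTRIBUTE = "traceparent"


def with_trace_context(traceparent: str,
                       message_attributes: Optional[dict] = None) -> dict:
    """Return SQS MessageAttributes with the trace context added (producer side)."""
    attributes = dict(message_attributes or {})
    attributes[TRACEPARENT_ATTRIBUTE] = {
        "DataType": "String",
        "StringValue": traceparent,
    }
    return attributes


def extract_traceparent(message: dict) -> Optional[str]:
    """Read the trace context back from a received message (consumer side)."""
    attribute = message.get("MessageAttributes", {}).get(TRACEPARENT_ATTRIBUTE)
    if not attribute:
        return None
    return attribute.get("StringValue")


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = with_trace_context(header, {"eventType": {"DataType": "String", "StringValue": "order.created"}})
# A received message carries the same attribute structure back
incoming = {"Body": "{}", "MessageAttributes": outgoing}
print(extract_traceparent(incoming))
```

On the producer, pass the result as `MessageAttributes=` to `send_message`. On the consumer, `receive_message` only returns attributes that are requested, so include `MessageAttributeNames=["traceparent"]`; a missing attribute here is a common reason traces stop at the queue. Lambda event source mappings deliver records in a slightly different shape, so the extraction would need adjusting there.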

For third-party APIs (Stripe, Twilio, external partners), you have no visibility inside their systems. The trace span simply shows a long duration on the HTTP call. In this case, the debugging question shifts: is this API consistently slow (baseline latency issue), intermittently slow (rate limiting or their infrastructure problem), or is it failing silently (returning 200 but with error payloads)? Log the full response body (for non-sensitive APIs), the HTTP status, and the duration for every third-party call. When you see a latency spike originating from an external API span, that is the end of your debugging — it is their problem, and your action is to implement a circuit breaker.
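
A circuit breaker does not need a framework to get started. The following is a deliberately small, single-threaded sketch: after a number of consecutive failures it rejects calls for a cool-down period, then lets one trial call through. The class name, thresholds, and injectable clock are assumptions for illustration; production code would also want thread safety and metrics.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call (half-open)
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each third-party call as `breaker.call(client.charge, order)` (names hypothetical) means a slow partner degrades one feature quickly instead of holding every request until its timeout.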

### CloudWatch Logs Insights Scale Limits

Logs Insights queries time out for very large log groups (multi-TB) or very long time windows. If your query times out, split the time window: a query scanning 24 hours of a high-volume log group can be split into 4 × 6-hour queries. For organizations at this scale, consider forwarding logs to Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) or a purpose-built log analytics system. CloudWatch Logs Insights is convenient but not designed for multi-TB interactive query patterns.
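
Splitting a window is simple enough to script. This sketch divides a time range into equal, contiguous sub-windows that can each be passed to a separate query; the function name is made up for this example.

```python
from datetime import datetime, timedelta, timezone


def split_window(start: datetime, end: datetime, parts: int) -> list:
    """Split [start, end) into `parts` equal, contiguous (start, end) pairs."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if end <= start:
        raise ValueError("end must be after start")
    step = (end - start) / parts
    # Use the exact `end` as the final boundary to avoid rounding drift
    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds, bounds[1:]))


window_end = datetime(2026, 3, 29, 3, 0, tzinfo=timezone.utc)
window_start = window_end - timedelta(hours=24)
for chunk_start, chunk_end in split_window(window_start, window_end, 4):
    print(chunk_start.isoformat(), "->", chunk_end.isoformat())
```

Each pair converts to the epoch seconds a query API expects with `int(chunk_start.timestamp())`.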

---

The shift from "add more logging" to "correlate existing signals" changes how you staff on-call rotations. With structured logs, trace IDs, and a practiced 10-minute workflow, a single on-call engineer can debug complex distributed issues without needing to wake up domain experts at 3 AM. The instrumentation investment — a few days of work to add OpenTelemetry to your services — pays back in reduced mean time to resolution on every incident for the lifetime of the system.

---

_Related reading: [AWS CloudWatch Observability: Metrics, Logs, Alarms Best Practices](/blog/aws-cloudwatch-observability-metrics-logs-alarms-best-practices/) covers the infrastructure that supports this debugging workflow. [Observability beyond CloudWatch (2026)](/blog/aws-observability-beyond-cloudwatch-otel-prometheus-grafana-2026/) covers ADOT collectors, AMP/AMG, and Application Signals when CloudWatch-only correlation breaks down. [AWS CloudWatch Logging Costs and Observability](/blog/aws-cloudwatch-logging-costs-observability/) covers managing the cost of logs at scale — the other side of the observability equation. For cardinality budgets, sampling rules, and FinOps fixes on CloudWatch and OpenTelemetry, see [AWS Observability Costs: Cardinality Budgets & FinOps Limits](/blog/aws-observability-finops-cardinality-cost-control/)._

## FAQ

### What is the difference between CloudWatch Logs, X-Ray, and OpenTelemetry on AWS?
CloudWatch Logs is AWS log storage and querying — it ingests structured or unstructured text logs from any AWS service and allows querying with Logs Insights. X-Ray is AWS distributed tracing — it captures trace segments from instrumented applications, stitches them into request traces across service boundaries, and visualizes service maps. OpenTelemetry is a vendor-neutral instrumentation standard (CNCF project) that generates logs, metrics, and traces in a unified format. On AWS, you can use ADOT (AWS Distro for OpenTelemetry) to collect OTel traces and send them to X-Ray, Jaeger, or Grafana Tempo — giving you portable instrumentation that is not locked to X-Ray. The practical approach: instrument with OpenTelemetry, send traces to X-Ray via ADOT for AWS console integration, send metrics to CloudWatch or a Prometheus-compatible backend.

### How do you correlate logs across Lambda, ECS, and RDS in a single request?
Correlation requires a trace ID that propagates across service boundaries in HTTP headers (X-Amzn-Trace-Id for X-Ray, traceparent for W3C Trace Context / OpenTelemetry). Each service logs the trace ID in every log line. In Lambda, X-Ray injects the trace ID automatically if active tracing is enabled. In ECS, the ADOT sidecar or X-Ray SDK extracts the trace ID from incoming requests and propagates it to outbound calls. For RDS, you cannot instrument the database itself, but you can log the query and the trace ID in your application before sending it to RDS — enabling correlation between application trace spans and RDS Performance Insights data using the shared timestamp.

### How much does CloudWatch Logs Insights cost at production scale?
CloudWatch Logs Insights charges $0.005 per GB of data scanned per query. At 100 GB of logs per day, a query scanning the last 24 hours costs $0.50. That seems low, but costs compound quickly: automated dashboard queries (running every minute = 1,440 queries/day × $0.50 = $720/day), queries run against every log group instead of the one or two that matter, and time ranges far wider than the incident window. Narrowing the `fields` list does not reduce the bill, because the charge is based on data scanned, not on fields returned. Cost optimization: use filter patterns when storing logs to reduce stored volume, use log groups per service with separate retention policies, query specific log groups rather than all logs, and use CloudWatch Metrics (cheaper) instead of Logs Insights for aggregate dashboards that do not require individual log inspection.
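
The arithmetic is worth keeping in a small helper so you can plug in your own volumes. This sketch uses the $0.005 per GB figure quoted above (check current pricing for your region) and assumes log volume is spread evenly across the day, which real traffic rarely is.

```python
PRICE_PER_GB_SCANNED = 0.005  # USD, figure quoted above; verify for your region
MINUTES_PER_DAY = 24 * 60


def query_cost(gb_scanned: float) -> float:
    """Cost in USD of one Logs Insights query scanning `gb_scanned` GB."""
    return gb_scanned * PRICE_PER_GB_SCANNED


def daily_cost(gb_scanned_per_query: float, queries_per_day: int) -> float:
    """Cost in USD per day of a query that runs on a schedule."""
    return query_cost(gb_scanned_per_query) * queries_per_day


def gb_in_window(gb_per_day: float, window_minutes: float) -> float:
    """GB covered by a time window, assuming evenly distributed log volume."""
    return gb_per_day * window_minutes / MINUTES_PER_DAY


full_day = query_cost(100)                       # one query over 24 hours of logs
dashboard = daily_cost(100, MINUTES_PER_DAY)     # the same query run every minute
incident = query_cost(gb_in_window(100, 15))     # one query over a 15-minute window

print(f"24h query: ${full_day:.2f}")
print(f"every-minute dashboard: ${dashboard:.2f}/day")
print(f"15-minute incident query: ${incident:.4f}")
```

The gap between the dashboard figure and the incident figure is the argument for narrow time ranges and for moving aggregate dashboards onto metrics.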

### What sampling rate should you use for distributed traces to balance cost and visibility?
X-Ray's default sampling rule records the first request each second plus 5% of additional requests. At 1,000 RPS, this samples roughly 1 + 50 = 51 traces/second, or about 4.4 million traces/day. X-Ray charges $5 per million traces after the first 100,000 free: 4.4 million traces/day × 30 days × $5/million ≈ $660/month for traces alone. Tail-based sampling spends that budget more usefully: keep 100% of error traces and 100% of high-latency traces (p99+), and keep 1–5% of successful fast requests. At a 1% keep rate for fast requests, this retains every debugging-relevant trace while recording roughly half the volume of uniform 5% sampling; the saving shrinks as the keep rate or the error rate rises. Implement tail-based sampling with the OpenTelemetry Collector's `tail_sampling` processor. X-Ray sampling rules make their decision when a request starts (head-based), so they can vary rates by service or URL but cannot select traces by outcome.
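
To compare sampling strategies for your own traffic, the same arithmetic can be parameterized. This sketch uses the $5 per million traces price quoted above and ignores the free tier. The tail-based function assumes error and slow traces do not overlap, and the 0.3% error rate and 1% slow fraction in the example are illustrative inputs rather than measurements.

```python
SECONDS_PER_DAY = 86_400
PRICE_PER_MILLION_TRACES = 5.0  # USD, figure quoted above


def default_rule_traces_per_second(rps: float, reservoir: float = 1.0,
                                   rate: float = 0.05) -> float:
    """Traces/second kept by a reservoir-plus-fixed-rate (head-based) rule."""
    return min(rps, reservoir) + max(rps - reservoir, 0.0) * rate


def tail_traces_per_second(rps: float, error_rate: float, slow_rate: float,
                           keep_rate: float) -> float:
    """Traces/second kept when all error/slow traces and a sample of the rest are kept."""
    interesting = min(error_rate + slow_rate, 1.0)  # assumes no overlap
    return rps * (interesting + (1.0 - interesting) * keep_rate)


def monthly_cost(traces_per_second: float, days: int = 30) -> float:
    """Approximate monthly trace cost in USD, ignoring the free tier."""
    traces = traces_per_second * SECONDS_PER_DAY * days
    return traces / 1_000_000 * PRICE_PER_MILLION_TRACES


head = default_rule_traces_per_second(1000)
tail = tail_traces_per_second(1000, error_rate=0.003, slow_rate=0.01, keep_rate=0.01)

print(f"default rule: {head:.2f} traces/s, ${monthly_cost(head):.0f}/month")
print(f"tail-based:   {tail:.2f} traces/s, ${monthly_cost(tail):.0f}/month")
```

Raising `keep_rate` toward 5% erases most of the saving, which is why the keep rate for fast successful requests is the number to tune first.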

---

*Source: https://www.factualminds.com/blog/debug-production-distributed-aws-systems/*
