---
title: Datadog with AWS
description: Datadog on AWS in 2026: unified observability for CloudWatch, EKS, Lambda, Bedrock LLM workloads, and security posture across multi-cloud estates.
url: https://www.factualminds.com/integrations/datadog-aws/
category: observability
updated: 2026-04-29
---

# Datadog with AWS

> Deep visibility into AWS infrastructure, Bedrock/SageMaker workloads, and applications — with a single tagging taxonomy across CloudWatch and Datadog.

## Datadog + AWS overview

Datadog is an enterprise observability and security platform. On AWS, it ingests CloudWatch metrics through Amazon Data Firehose-backed Metric Streams, collects host and container telemetry via Agent v7 and the Datadog Operator on EKS, and captures serverless telemetry through the Datadog Lambda Extension — all tied together by Unified Service Tagging so the same `env`/`service`/`version` tag flows from metric to trace to log to LLM call.

FactualMinds deploys Datadog on AWS for teams that have either outgrown CloudWatch's cross-service correlation or need consolidated visibility across AWS, on-prem, and a second cloud. We keep CloudWatch as the AWS-native source of truth for service quotas, AWS Health, and alarm-driven auto-recovery — Datadog becomes the investigative and SLO layer on top.

## What's new for Datadog on AWS in 2026

- **LLM Observability (GA since 2024):** prompt, completion, and tool-call capture for Bedrock (including Claude Sonnet 4, Llama 4, and Amazon Nova), SageMaker endpoints, and self-hosted models. It links to APM trace IDs, so an investigation into a slow checkout can walk from the Bedrock call back to the user request.
- **Database Monitoring for Aurora, RDS Postgres/MySQL, DynamoDB, and ElastiCache** — captures explain plans and lock/deadlock data without a proxy; the DynamoDB integration now includes PITR cost metrics and table-class usage.
- **Cloud SIEM** — detection rules over CloudTrail, GuardDuty, Security Hub, and VPC Flow Logs; pairs well with AWS Security Lake for long-term OCSF storage.
- **Watchdog Root Cause Analysis** — uses causal inference on infra, APM, and RUM signals to propose a likely root cause per alert, reducing on-call MTTI.
- **Kubernetes Monitoring UI refresh** — cluster, workload, and pod views unified for EKS, EKS Auto Mode, and EKS Hybrid Nodes; works with Pod Identity so Agents no longer need IRSA annotations.
- **Flex Logs + tier pricing** — archive-tier ingestion is ~80% cheaper than standard for logs queried infrequently (audit, compliance). Most AWS estates see 30–60% log-bill reduction after a Flex Logs pass.
- **Data Streams Monitoring for Kinesis, MSK, SNS, SQS, and EventBridge** — end-to-end flow view with lag, throughput, and producer/consumer correlation.

## How Datadog monitors AWS (implementation patterns)

**CloudWatch Metric Streams (preferred for AWS-service metrics)**

- Amazon Data Firehose → Datadog endpoint. Metrics typically arrive within about 2–3 minutes, versus roughly 10 minutes for the API-polling integration.
- Deploy via the Datadog-published CloudFormation StackSet across AWS Organizations; supports Control Tower landing zones.
- Namespace include/exclude filters on the stream limit which services are streamed, which keeps Metric Streams and Firehose costs aligned with actual needs.

**Datadog Agent v7 + Datadog Operator on EKS**

- Operator installs Agent DaemonSet, Cluster Agent, and Admission Controller with a single `DatadogAgent` CRD.
- Compatible with EKS Auto Mode; Pod Identity integration removes the IRSA dance for Agent permissions.
- APM auto-instrumentation now covers Python, Node, Java, Go, .NET, Ruby, PHP — Admission Controller injects the tracer library without rebuilding images.

**Datadog Lambda Extension** (serverless)

- Runs as a Lambda layer; no forwarder Lambda or CloudWatch log-group subscription to maintain.
- Captures traces, enhanced metrics, and logs from the function runtime directly — lower latency and cost than CloudWatch-based approaches for high-volume functions.
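- Supports ARM64 Graviton runtimes; cold-start overhead is typically single-digit ms when the API key is supplied directly, and higher when the key is fetched from Secrets Manager at init.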

**AWS PrivateLink endpoints**

- Metric/log/trace ingestion over PrivateLink for regulated workloads (HIPAA, PCI DSS 4.0.1, FedRAMP Moderate).
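- Pair with egress controls (security groups, route tables without a NAT path, or AWS Network Firewall rules) to block the public Datadog endpoints if you need egress lock-down; a VPC endpoint policy only governs traffic through the endpoint itself.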

## Key Datadog + AWS features

**Infrastructure monitoring**

- Real-time metrics across EC2, RDS, Aurora, DynamoDB, S3, Lambda, ECS, EKS, ElastiCache, SNS, SQS, Kinesis, and EventBridge, plus 700+ other integrations.
- Automatic resource discovery and relationship mapping via AWS tags and CloudFormation stack names.
- Host map and container map for visual fleet-level health.

**LLM Observability (GA in 2024)**

- Monitor Bedrock, SageMaker, and self-hosted LLMs; track prompt/completion quality with built-in evaluators and custom judges.
- Drift and regression detection across model versions, which matters when you switch between Bedrock model families (Nova, Claude, Llama) or move a workload onto provisioned throughput.
- Integrated with APM: the slow checkout trace shows the slow Bedrock call without manual instrumentation.

**Database Monitoring**

- Aurora, RDS Postgres/MySQL, DynamoDB, and MongoDB Atlas — plan capture, lock analysis, and query-level P95 latency.
- For the Postgres and MySQL engines there is no proxy and no extra network hop; the Agent reads `pg_stat_statements` or Performance Schema directly.

**Application Performance Monitoring (APM)**

- Distributed tracing across microservices, including OpenTelemetry-native ingest (OTLP).
- Database query profiling, service dependency maps, and Continuous Profiler for CPU/memory bottlenecks in production.

**Log Management + Flex Logs**

- Centralized log ingestion with parsing, enrichment, and Live Tail.
- Flex Logs for audit/compliance logs — ~80% cheaper than standard tier, same query syntax.
- Log-based metrics convert high-volume logs into cost-efficient metrics for dashboards and alerts.
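
To see how the ~80% figure translates into a bill, here is a rough sketch. It assumes a single flat per-GB price for the standard tier (`standard_price_per_gb` is a placeholder, not a Datadog list price) and that logs moved to Flex cost 80% less; real invoices have separate ingestion, indexing, and retention components.

```python
def blended_log_cost(total_gb: float, flex_fraction: float,
                     standard_price_per_gb: float,
                     flex_discount: float = 0.80) -> float:
    """Monthly log cost when `flex_fraction` of the volume moves to the cheaper tier."""
    if not 0.0 <= flex_fraction <= 1.0:
        raise ValueError("flex_fraction must be between 0 and 1")
    standard_gb = total_gb * (1.0 - flex_fraction)
    flex_gb = total_gb * flex_fraction
    flex_price = standard_price_per_gb * (1.0 - flex_discount)
    return standard_gb * standard_price_per_gb + flex_gb * flex_price


def bill_reduction(flex_fraction: float, flex_discount: float = 0.80) -> float:
    """Fractional saving versus keeping every log on the standard tier."""
    return flex_fraction * flex_discount
```

Under these simplified assumptions, a 30–60% reduction corresponds to moving roughly 37.5–75% of log volume to the cheaper tier, since `bill_reduction(0.375)` is 0.30 and `bill_reduction(0.75)` is 0.60.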

**Cloud SIEM**

- Detection rules on CloudTrail, GuardDuty, Security Hub, and VPC Flow Logs with out-of-the-box rulesets aligned to MITRE ATT&CK.
- Pairs with AWS Security Lake (OCSF) for long-term storage and with Amazon Detective for investigation pivots.

**Cost Management**

- Tracks AWS spend alongside performance metrics; correlates deploys with cost deltas.
- Plugs into AWS Cost Optimization Hub and CUR 2.0 with Split Cost Allocation Data for per-tenant attribution.

## Datadog pricing for AWS (2026)

Pricing evolves — verify at [datadoghq.com/pricing](https://www.datadoghq.com/pricing/). Current ballparks:

**Infrastructure monitoring**

- Pro: ~$15–$23/host/month. Enterprise: ~$23–$34/host/month.
- Per-container and serverless pricing available for EKS/Fargate workloads.

**APM + Continuous Profiler**

- ~$31–$40/host/month.

**Log Management**

- Standard tier: ingestion is priced per GB, and indexing is priced per million log events at the chosen retention period.
- Flex Logs: ~80% cheaper ingestion for logs queried infrequently.

**LLM Observability, Cloud SIEM, Database Monitoring**

- Sold separately; all usage-based.

**Typical totals**: small teams $400–$1,500/month, mid-market $3k–$15k/month, enterprise on annual contracts with significant discount.
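
For a first-pass budget, the per-host ballparks above can be combined in a few lines. This is a back-of-the-envelope sketch: the rates are the low-end figures quoted in this section, the `Estate` type is hypothetical, and logs, LLM Observability, Cloud SIEM, and Database Monitoring are deliberately left out because they are usage-based.

```python
from dataclasses import dataclass

# Low-end ballpark rates quoted above (USD per host per month); not a price list.
INFRA_PRO_PER_HOST = 15.0
APM_PER_HOST = 31.0


@dataclass
class Estate:
    infra_hosts: int  # hosts billed for Infrastructure Pro
    apm_hosts: int    # subset of hosts that also run APM


def monthly_estimate(estate: Estate,
                     infra_rate: float = INFRA_PRO_PER_HOST,
                     apm_rate: float = APM_PER_HOST) -> float:
    """Host-based monthly cost only; usage-based products are excluded."""
    return estate.infra_hosts * infra_rate + estate.apm_hosts * apm_rate


small_team = Estate(infra_hosts=40, apm_hosts=25)
print(monthly_estimate(small_team))  # 40 * 15 + 25 * 31
```

Swap in the rates from your own quote before using the result for planning.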

## Datadog vs CloudWatch vs open-source

**Datadog**

- Full-featured observability + security + LLM + cost in one pane.
- Best for multi-service, multi-cloud, or AI/ML workloads; best investigative experience for on-call.
- Higher sticker price — significant ROI when tied to MTTR and SLO improvements.

**CloudWatch + Application Signals + Amazon Managed Grafana**

- Free/low-cost for AWS-native telemetry; Application Signals adds service maps and SLOs.
- Native IAM model; no extra trust relationship or external vendor review.
- Weaker cross-account correlation; LLM observability is basic compared with Datadog.

**Open-source (Prometheus + Grafana + OpenTelemetry + Loki)**

- Maximum control; lowest licence cost but highest operational overhead.
- Amazon Managed Service for Prometheus + Amazon Managed Grafana + the ADOT Collector remove most of the toil.
- Good fit for teams with strong DevOps expertise who want portability.

## When Datadog is NOT the right call

- You run a single-region, AWS-only workload with fewer than ~20 services and no multi-cloud ambition — CloudWatch + Application Signals is usually enough, and significantly cheaper.
- You have strict data-residency rules that no Datadog site satisfies, for example a Germany-only or onshore-Australia processing requirement that the available sites do not meet. That pushes you to AWS-native tooling or a self-hosted stack.
- Your primary observability problem is database performance and nothing else — RDS Performance Insights + Aurora DB Activity Streams may be sufficient without adding a third-party bill.
- You have zero capacity to maintain a tagging taxonomy — Datadog's value drops sharply without Unified Service Tagging discipline.

## Implementation: multi-account onboarding via CloudFormation StackSet

Datadog publishes a CloudFormation template that creates the IAM role, event subscriptions, and CloudWatch Metric Streams per account. Deploy via StackSet across the AWS Organization:

```bash
# Excerpt: Datadog provides the canonical template via the AWS Integration page.
# Parameter keys below are illustrative; they must match the template version you deploy.
aws cloudformation create-stack-set \
  --stack-set-name datadog-aws-integration \
  --template-url https://datadog-cloudformation-template.s3.amazonaws.com/aws/main.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
      ParameterKey=DatadogApiKey,ParameterValue="<api-key-from-secrets-manager>" \
      ParameterKey=DatadogSite,ParameterValue=datadoghq.com \
      ParameterKey=ExternalId,ParameterValue="<datadog-supplied-external-id>" \
      ParameterKey=InstallDatadogPolicies,ParameterValue=true \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false

aws cloudformation create-stack-instances \
  --stack-set-name datadog-aws-integration \
  --deployment-targets OrganizationalUnitIds=ou-xxx-yyyy \
  --regions us-east-1
```

Always source the template URL and parameters from the Datadog Admin → Integrations → AWS page — Datadog publishes updated templates as the trust contract evolves.

## Implementation: Datadog Operator with Pod Identity on EKS

```yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  global:
    clusterName: prod-eks-eu-west-1
    site: datadoghq.com
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
    tags:
      - 'team:platform'
      - 'env:prod'
      - 'cost-center:eng'
  features:
    apm:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true
    orchestratorExplorer:
      enabled: true
    liveContainerCollection:
      enabled: true
    eventCollection:
      collectKubernetesEvents: true
  override:
    nodeAgent:
      serviceAccountName: datadog-agent
      # Pod Identity association created out-of-band via:
      # aws eks create-pod-identity-association \
      #   --cluster-name prod-eks-eu-west-1 \
      #   --namespace datadog \
      #   --service-account datadog-agent \
      #   --role-arn arn:aws:iam::123456789012:role/datadog-agent-pod-identity
```

Pod Identity replaces IRSA here: no OIDC provider and no ServiceAccount annotation, although the cluster needs the EKS Pod Identity Agent add-on (built into EKS Auto Mode). Cluster Agent and Admission Controller are managed by the Operator.

## Failure modes & resilience

**1. CloudWatch Metric Streams Firehose backpressure.** A spike in metric volume (sudden Lambda concurrency, EKS node-fleet replacement) can fill the Firehose buffer, and Datadog ingest then lags by minutes. Mitigation: watch the Firehose `DeliveryToHttpEndpoint.DataFreshness` and `DeliveryToHttpEndpoint.Success` metrics and compare `DeliveryToHttpEndpoint.Records` against `IncomingRecords`; if delivery falls behind, tune the Firehose buffering hints (size and interval) through a StackSet update.
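
The delivered-versus-incoming comparison can be sketched as a simple ratio over a window. This is an illustrative check, not Datadog monitor syntax: the per-minute record counts are assumed to have been fetched already, and the function names are hypothetical.

```python
from typing import Sequence


def delivery_ratio(incoming: Sequence[int], delivered: Sequence[int]) -> float:
    """Delivered records divided by incoming records over the same window."""
    total_in = sum(incoming)
    if total_in == 0:
        return 1.0  # nothing to deliver, so nothing is behind
    return sum(delivered) / total_in


def is_backlogged(incoming: Sequence[int], delivered: Sequence[int],
                  min_ratio: float = 0.99) -> bool:
    """True when delivery has fallen below `min_ratio` of what arrived."""
    return delivery_ratio(incoming, delivered) < min_ratio
```

The 0.99 default mirrors the 99% threshold used for the Firehose meta-monitor in the runbook; a 15-minute window of one-minute samples is a reasonable starting point.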

**2. Custom-metric cardinality runaway.** A single new tag key with high cardinality (`request_id`, `user_id`, raw URL path) multiplies the number of timeseries and the Datadog bill with it. Custom metrics are counted per organization against the contracted allotment. Mitigation: review the tag schema at PR time; run a periodic query on `datadog.estimated_usage.metrics.custom.by_metric` grouped by `metric_name`; drop high-cardinality tags with Metrics without Limits or convert the signal to log-based metrics.
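
A small sketch makes the multiplication concrete. It models one metric name and counts distinct tag combinations, which is roughly how custom-metric timeseries are counted; the helper names are hypothetical and the data is synthetic.

```python
from typing import Iterable, List, Mapping, Set, Tuple

TagSet = Mapping[str, str]


def count_timeseries(points: Iterable[TagSet]) -> int:
    """Number of distinct tag combinations seen for a single metric name."""
    seen: Set[Tuple[Tuple[str, str], ...]] = set()
    for tags in points:
        seen.add(tuple(sorted(tags.items())))
    return len(seen)


def drop_tags(points: Iterable[TagSet], banned: Set[str]) -> List[dict]:
    """Return the points with the banned tag keys removed."""
    return [{k: v for k, v in tags.items() if k not in banned} for tags in points]


# Two services, 100 users each: the user_id tag creates one series per user.
points = [
    {"env": "prod", "service": service, "user_id": str(user)}
    for service in ("checkout", "cart")
    for user in range(100)
]
print(count_timeseries(points))                          # with user_id
print(count_timeseries(drop_tags(points, {"user_id"})))  # without user_id
```

Running the same count in a PR check against a sample of emitted tags is one way to catch a high-cardinality key before it ships.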

**3. Exclusion-filter drift.** Filters configured to drop noisy logs are easy to forget; cost climbs silently as new services emit similar logs without matching filters. Mitigation: quarterly review of top log-cost contributors; codify exclusion filters in the Datadog Terraform provider so changes go through PR review.
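
The review can be partly automated by comparing each service's daily log volume with its trailing 7-day average, the same 2× rule used in the runbook. A minimal sketch, assuming daily GB figures per service have already been exported; the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def exceeds_baseline(daily_gb_history: Sequence[float], today_gb: float,
                     factor: float = 2.0) -> bool:
    """True when today's volume is more than `factor` times the 7-day mean."""
    if len(daily_gb_history) < 7:
        raise ValueError("need at least 7 days of history")
    baseline = mean(daily_gb_history[-7:])
    return today_gb > factor * baseline
```

A service that trips this check is a candidate for a new exclusion filter or a move to Flex Logs.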

**4. Lambda Extension cold-start.** When the API key is sourced from Secrets Manager, a cold start with the Datadog Extension layer adds 100–300 ms to init while the key is fetched and decrypted. Mitigation: use `DD_API_KEY_SECRET_ARN` where the security benefit justifies the latency; for latency-critical functions, either supply the key as a Lambda environment variable or use Provisioned Concurrency so init runs before traffic arrives.

**5. Agent host-reporting drift on Auto Mode.** Auto Mode replaces nodes, and transient reporting gaps (~30 s) appear during replacement. Mitigation: query dashboards over windows of at least 1 min, and have monitors require 2 of 3 datapoints to breach before alerting, so a single replacement-induced gap does not page anyone.
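
The 2-of-3 rule is easy to reason about in code. The sketch below is a generic M-of-N evaluator, not Datadog's monitor engine; it treats a missing datapoint (`None`) as bad, which is an assumption chosen to model a reporting gap.

```python
from typing import Optional, Sequence


def m_of_n_breach(values: Sequence[Optional[float]], threshold: float,
                  m: int = 2, n: int = 3) -> bool:
    """True when at least m of the last n datapoints are missing or above threshold."""
    if not 0 < m <= n:
        raise ValueError("require 0 < m <= n")
    window = values[-n:]
    if len(window) < n:
        return False  # not enough history to evaluate yet
    bad = sum(1 for v in window if v is None or v > threshold)
    return bad >= m
```

With these defaults a single gap inside an otherwise healthy window stays quiet, while a gap next to a genuine breach still alerts.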

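**6. Datadog API rate limits.** Limits vary by endpoint and organization; the current values are returned in the `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` response headers, and exceeding them returns HTTP 429. Bulk dashboard imports or programmatic monitor management can trip this. Mitigation: back off with jitter, and run Terraform with a reduced `-parallelism` value.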
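
Backoff with jitter can look like the following sketch. It is transport-agnostic: the caller wraps its own HTTP call and raises the hypothetical `RateLimited` exception on a 429, optionally passing a wait derived from the reset header. None of these names come from a Datadog client library.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Raised by the caller-supplied function when the API answers HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the server, if known


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                      rng: Callable[[], float] = random.random) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)), i.e. 'full jitter'."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call fn, retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = full_jitter_delay(attempt, rng=rng)
            # Wait at least as long as the server asked for, when it said so.
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Injecting `sleep` and `rng` keeps the function testable; in production the defaults are enough. Jitter matters most when several pipelines share one organization's quota and would otherwise retry in lockstep.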

**7. Datadog itself is down.** Region incidents happen. Mitigation: keep CloudWatch alarms on the truly load-bearing AWS-service metrics (RDS CPU, Lambda errors, ALB 5xx) so on-call gets paged even if Datadog is unavailable. Don't centralize EVERY alert in Datadog.

## Observability runbook (alerting on Datadog itself)

**Meta-monitors we ship:**

| Monitor                                            | Threshold                    | First action                                                     |
| -------------------------------------------------- | ---------------------------- | ---------------------------------------------------------------- |
| `datadog.agent.up` per host (no-data alert)        | no data `> 10 min`           | Confirm node still exists; check Agent status / logs             |
| Custom-metric count by service                     | `> 100k` distinct timeseries | Cardinality review; drop tags or convert to log-based metric     |
| Log ingestion volume by service                    | `> 2×` 7-day baseline        | Sudden log explosion; identify and exclude or move to Flex Logs  |
| Firehose `delivery_to_http_endpoint.success` ratio | `< 99%` for 15 min           | Datadog endpoint health; AWS Firehose error logs                 |
| `aws.integration.run_status` by AWS account        | failure                      | Datadog Admin → Integrations → AWS → check role assumption error |
| LLM Observability prompts failing eval             | spike > baseline             | Prompt regression; pair with Bedrock Guardrails findings         |
| Custom-metric usage vs contracted allotment        | `> 80%`, checked monthly     | Renegotiate or trim before overage charges accrue                |

**Debug path: "metric missing in Datadog":**

1. Confirm the metric is being emitted: on the host, run `datadog-agent status` (or `agent status` inside the Agent container) and check the list of integrations and their last collection.
2. Datadog Admin → Integrations → AWS → check that the relevant service is enabled (CloudWatch namespaces are opt-in).
3. Inspect Metric Streams: AWS console → CloudWatch → Metric Streams → status `running`; recent errors in the destination Firehose.
4. Tag filter mismatch: Datadog filters at ingest may drop the metric — review include/exclude rules.
5. Custom metric: confirm the host/container has DogStatsD enabled and the metric name is not collapsing to a quota-limited family.

## Best practices

**Tagging**

- Unified Service Tagging (`service`, `env`, `version`) on every piece of telemetry — enforce via Admission Controller and CI pipeline checks.
- Inherit AWS tags (`cost-center`, `team`, `pii-classification`) via the CloudWatch integration.
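
A CI pipeline check for Unified Service Tagging can be as small as the sketch below. It assumes tags arrive as `key:value` strings, as in the `DatadogAgent` example earlier; the function names are hypothetical.

```python
from typing import Dict, Iterable, List

REQUIRED_TAGS = ("env", "service", "version")


def parse_tags(tags: Iterable[str]) -> Dict[str, str]:
    """Turn 'key:value' strings into a dict, skipping malformed or empty entries."""
    parsed: Dict[str, str] = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and key and value:
            parsed[key] = value
    return parsed


def missing_unified_tags(tags: Iterable[str]) -> List[str]:
    """Return the required tag keys that are absent, in a stable order."""
    parsed = parse_tags(tags)
    return [key for key in REQUIRED_TAGS if key not in parsed]


print(missing_unified_tags(["env:prod", "team:platform"]))
```

Failing the build when the returned list is non-empty keeps the taxonomy from eroding one service at a time.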

**Alerts**

- Alert on business/SLO metrics (error rate, P99 latency, checkout success) first; alert on infra second.
- Composite monitors for noise reduction; dynamic baselines for seasonal workloads.

**Cost control**

- Exclusion filters on known-noisy logs; Flex Logs for audit/compliance.
- Log-based metrics for anything you alert on from logs.
- Quarterly review of custom metric cardinality — the #1 cause of runaway Datadog bills.

**Security review**

- External ID on the Datadog IAM role; scoped managed policy, no `*:*`.
- PrivateLink endpoints for regulated workloads; egress controls (security groups, AWS Network Firewall) to block the public Datadog endpoints.

## Related reading

- [AWS CloudWatch observability: metrics, logs, alarms, and best practices](/blog/aws-cloudwatch-observability-metrics-logs-alarms-best-practices/)
- [AWS CloudWatch logging costs: observability without the shock bill](/blog/aws-cloudwatch-logging-costs-observability/)
- [Amazon Bedrock AgentCore in production](/blog/amazon-bedrock-agentcore-production/)

## Related services

- [AWS Cloud Cost Optimization Services](/services/aws-cloud-cost-optimization-services/)
- [AWS Architecture Review](/services/aws-architecture-review/)
- [DevOps Pipeline Setup](/services/devops-pipeline-setup/)

## FAQ

### How does Datadog integrate with AWS in 2026?
The modern pattern uses AWS Integration via IAM role (no long-lived keys) plus CloudWatch Metric Streams over Amazon Data Firehose for near real-time metric ingest. On EC2 and EKS, install Datadog Agent v7 or the Datadog Operator; for Lambda use the Datadog Lambda Extension layer (no forwarder Lambda). For multi-account estates, onboard accounts through the AWS Integration page using the CloudFormation StackSet Datadog publishes — it creates the IAM role and event subscriptions consistently across Organizations.

### What AWS metrics and logs does Datadog collect?
Out of the box: EC2 CPU/network/disk I/O (memory and filesystem usage require the Agent), EBS IOPS, RDS Performance Insights, S3 object counts and sizes, Lambda duration and cold starts, ELB/ALB latency, DynamoDB throughput, EventBridge rule failures, and GuardDuty/Security Hub findings. Logs can arrive via a CloudWatch Logs subscription filter, S3 archive ingestion, or the Agent. Custom metrics land via DogStatsD, OTLP, or the API, and the same tags flow to metrics, traces, and logs when you use Unified Service Tagging.

### Can Datadog replace CloudWatch entirely?
Usually no, and trying to is where most teams overspend. You pay for CloudWatch metrics and logs whether you look at them or not, and several AWS features depend on CloudWatch natively: CloudWatch Alarms that drive auto-recovery and scaling, AWS Health events, and the usage metrics behind Service Quotas. The pragmatic pattern: keep CloudWatch for AWS-service-native alarms and quota dashboards; use Datadog as the single pane of glass for application traces, custom metrics, LLM observability, and cross-account correlation. Datadog ingests CloudWatch via Metric Streams, so you still see everything in one place.

### How do I correlate logs, metrics, and traces in Datadog?
Use Unified Service Tagging: `env`, `service`, and `version` tags on every piece of telemetry, propagated by the Agent/Tracer. For AWS resources, Datadog inherits tags from CloudWatch and Resource Groups so existing `cost-center`/`team` tags show up automatically. Enable Data Streams Monitoring for Kafka/Kinesis/SNS/SQS to get end-to-end flow tracking. Trace Explorer and Live Tail support identical query syntax across telemetry types.

### What does Datadog cost on a typical AWS estate in 2026?
Pricing changes, so always confirm at datadoghq.com/pricing. Current ballparks: Infrastructure Pro starts ~$15/host/month, APM adds ~$31/host/month, Log Management is priced per GB ingested plus per million events indexed (Flex Logs is the cheaper tier for audit/compliance logs queried infrequently), and LLM Observability and Cloud SIEM are sold separately. For mid-market AWS estates we typically see $3k–$15k/month, with the biggest optimization levers being exclusion filters, Flex Logs for archive-pattern logs, log-based metrics, and dropping high-cardinality custom metrics.

### Datadog LLM Observability on Bedrock vs CloudWatch GenAI observability — which do I use?
Both, for different questions. CloudWatch GenAI observability (CloudWatch Application Signals + Bedrock invocation metrics) covers the basics at little or no extra cost: token usage, invocation latency, and error rates across Bedrock models. That is sufficient if you only need operational alerts. Datadog LLM Observability (GA in 2024, matured through 2025) adds prompt/completion capture, hallucination scoring, quality evaluators, and correlation with APM traces, so a slow checkout trace can be tied to a specific Bedrock Claude call. Teams running Bedrock AgentCore or multi-step agents almost always need the prompt/trace view, which the basic CloudWatch metrics do not provide.

### How do we audit the Datadog-to-AWS trust relationship for security review?
Three checks: (1) the Datadog IAM role must require the external ID Datadog assigns to your account — prevents confused-deputy; (2) the role should be scoped with the Datadog-published managed policy, no `*`-on-`*`; (3) for regulated workloads, enable Datadog AWS PrivateLink endpoints so metric/log traffic never transits the public internet. Pair with Datadog Cloud SIEM for detection on the AWS CloudTrail feed if you want Datadog to alert on the IAM role itself being modified.
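
Checks (1) and (2) can be scripted against exported policy documents. The sketch below works on plain dictionaries in the standard IAM policy JSON shape; the function names are hypothetical, and it only covers the simple `StringEquals` form of the external-ID condition, so treat it as a starting point rather than a complete policy analyzer.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single value or a list in most fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def trust_policy_findings(trust_policy: Dict[str, Any],
                          expected_external_id: str) -> List[str]:
    """Flag AssumeRole statements that do not require the expected external ID."""
    findings: List[str] = []
    for stmt in _as_list(trust_policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(stmt.get("Action")):
            continue
        condition = stmt.get("Condition", {})
        external_ids = _as_list(condition.get("StringEquals", {}).get("sts:ExternalId"))
        if expected_external_id not in external_ids:
            findings.append("AssumeRole statement lacks the expected sts:ExternalId condition")
    return findings


def permission_policy_findings(policy: Dict[str, Any]) -> List[str]:
    """Flag Allow statements that grant every action on every resource."""
    findings: List[str] = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        if "*" in _as_list(stmt.get("Action")) and "*" in _as_list(stmt.get("Resource")):
            findings.append("statement allows every action on every resource")
    return findings
```

An empty findings list from both functions is the evidence most security reviewers ask for; check (3) is verified separately by confirming the Agent and Firehose resolve the PrivateLink endpoints.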

---

*Source: https://www.factualminds.com/integrations/datadog-aws/*
