Monitoring & Observability

Datadog with AWS

Deep visibility into AWS infrastructure, Bedrock/SageMaker workloads, and applications — with a single tagging taxonomy across CloudWatch and Datadog.

Last updated: April 29, 2026
Observability & Monitoring
Author: FactualMinds Cloud Integration Team
Reviewed by: FactualMinds AWS-certified architects (Solutions Architect – Professional)

Datadog + AWS overview

Datadog is an enterprise observability and security platform. On AWS, it ingests CloudWatch metrics through Amazon Data Firehose-backed Metric Streams, collects host and container telemetry via Agent v7 and the Datadog Operator on EKS, and captures serverless telemetry through the Datadog Lambda Extension — all tied together by Unified Service Tagging so the same env/service/version tag flows from metric to trace to log to LLM call.

FactualMinds deploys Datadog on AWS for teams that have either outgrown CloudWatch’s cross-service correlation or need consolidated visibility across AWS, on-prem, and a second cloud. We keep CloudWatch as the AWS-native source of truth for service quotas, AWS Health, and alarm-driven auto-recovery — Datadog becomes the investigative and SLO layer on top.

What’s new for Datadog on AWS in 2026

LLM Observability GA — prompt, completion, and tool-call capture for Bedrock (including Claude Sonnet 4, Llama 4, and Amazon Nova), SageMaker endpoints, and self-hosted models. Integrates with APM trace IDs so an investigative view on a slow checkout can walk from the Bedrock call back to the user request.
Database Monitoring for Aurora, RDS Postgres/MySQL, DynamoDB, and ElastiCache — captures explain plans and lock/deadlock data without a proxy; the DynamoDB integration now includes PITR cost metrics and table-class usage.
Cloud SIEM — detection rules over CloudTrail, GuardDuty, Security Hub, and VPC Flow Logs; pairs well with AWS Security Lake for long-term OCSF storage.
Watchdog Root Cause Analysis — uses causal inference on infra, APM, and RUM signals to propose a likely root cause per alert, reducing on-call MTTI.
Kubernetes Monitoring UI refresh — cluster, workload, and pod views unified for EKS, EKS Auto Mode, and EKS Hybrid Nodes; works with Pod Identity so Agents no longer need IRSA annotations.
Flex Logs + tier pricing — archive-tier ingestion is ~80% cheaper than standard for logs queried infrequently (audit, compliance). Most AWS estates see 30–60% log-bill reduction after a Flex Logs pass.
Data Streams Monitoring for Kinesis, MSK, SNS, SQS, and EventBridge — end-to-end flow view with lag, throughput, and producer/consumer correlation.

How Datadog monitors AWS (implementation patterns)

CloudWatch Metric Streams (preferred for AWS-service metrics)

Amazon Data Firehose → Datadog endpoint. Sub-minute metric freshness vs the 10–15 min delay of the legacy polling integration.
Deploy via the Datadog-published CloudFormation StackSet across AWS Organizations; supports Control Tower landing zones.
Tag filters in the StackSet limit which services stream — important for keeping costs aligned with actual needs.

Datadog Agent v7 + Datadog Operator on EKS

Operator installs Agent DaemonSet, Cluster Agent, and Admission Controller with a single DatadogAgent CRD.
Compatible with EKS Auto Mode; Pod Identity integration removes the IRSA dance for Agent permissions.
APM auto-instrumentation now covers Python, Node, Java, Go, .NET, Ruby, PHP — Admission Controller injects the tracer library without rebuilding images.

Datadog Lambda Extension (serverless)

Runs as a Lambda layer; no forwarder Lambda or CloudWatch log-group subscription to maintain.
Captures traces, enhanced metrics, and logs from the function runtime directly — lower latency and cost than CloudWatch-based approaches for high-volume functions.
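Supports ARM64 Graviton runtimes; cold-start overhead is measured in single-digit ms when the API key is supplied directly, and rises when the key is fetched from Secrets Manager at init.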

AWS PrivateLink endpoints

Metric/log/trace ingestion over PrivateLink for regulated workloads (HIPAA, PCI DSS 4.0.1, FedRAMP Moderate).
Pair with VPC endpoint policies to deny egress to the public Datadog endpoints if you need egress lock-down.

Key Datadog + AWS features

Infrastructure monitoring

Real-time metrics across EC2, RDS, Aurora, DynamoDB, S3, Lambda, ECS, EKS, ElastiCache, SNS, SQS, Kinesis, EventBridge, and 700+ SaaS integrations.
Automatic resource discovery and relationship mapping via AWS tags and CloudFormation stack names.
Host map and container map for visual fleet-level health.

LLM Observability (GA in 2024)

Monitor Bedrock, SageMaker, and self-hosted LLMs; track prompt/completion quality with built-in evaluators and custom judges.
Drift and regression detection across model versions — critical when Bedrock routes you through provisioned throughput or the Nova/Claude/Llama model families.
Integrated with APM: the slow checkout trace shows the slow Bedrock call without instrumenting manually.

Database Monitoring

Aurora, RDS Postgres/MySQL, DynamoDB, and MongoDB Atlas — plan capture, lock analysis, and query-level P95 latency.
No proxy, no extra network hop; uses pg_stat_statements / Performance Schema and the Agent.

Application Performance Monitoring (APM)

Distributed tracing across microservices, including OpenTelemetry-native ingest (OTLP).
Database query profiling, service dependency maps, and Continuous Profiler for CPU/memory bottlenecks in production.

Log Management + Flex Logs

Centralized log ingestion with parsing, enrichment, and Live Tail.
Flex Logs for audit/compliance logs — ~80% cheaper than standard tier, same query syntax.
Log-based metrics convert high-volume logs into cost-efficient metrics for dashboards and alerts.
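
As a rough illustration of how the ~80% figure translates into a bill, the sketch below estimates the reduction in ingestion spend when a share of log volume moves to Flex Logs. The function name and the simplification that spend scales linearly with ingested volume are assumptions made for this example, not Datadog's pricing model; retention and indexing charges are ignored.

```python
def flex_logs_savings(flex_share: float, flex_discount: float = 0.80) -> float:
    """Return the fractional reduction in ingestion spend.

    flex_share    -- fraction of log volume moved to Flex Logs (0.0 to 1.0)
    flex_discount -- assumed Flex discount vs standard ingestion (~80% per the text)
    """
    if not 0.0 <= flex_share <= 1.0:
        raise ValueError("flex_share must be between 0 and 1")
    if not 0.0 <= flex_discount <= 1.0:
        raise ValueError("flex_discount must be between 0 and 1")
    # Only the share that moves to Flex gets the discount.
    return flex_share * flex_discount


if __name__ == "__main__":
    for share in (0.4, 0.5, 0.75):
        saving = flex_logs_savings(share)
        print(f"{share:.0%} of volume on Flex: ~{saving:.0%} lower ingestion spend")
```

Under these assumptions, moving 40% to 75% of log volume to Flex Logs yields roughly a 32% to 60% reduction, which is in line with the 30–60% range quoted on this page.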

Cloud SIEM

Detection rules on CloudTrail, GuardDuty, Security Hub, and VPC Flow Logs with out-of-the-box rulesets aligned to MITRE ATT&CK.
Pairs with AWS Security Lake (OCSF) for long-term storage and with Amazon Detective for investigation pivots.

Cost Management

Tracks AWS spend alongside performance metrics; correlates deploys with cost deltas.
Plugs into AWS Cost Optimization Hub and CUR 2.0 with Split Cost Allocation Data for per-tenant attribution.

Datadog pricing for AWS (2026)

Pricing evolves — verify at datadoghq.com/pricing. Current ballparks:

Infrastructure monitoring

Pro: ~$15–$23/host/month. Enterprise: ~$23–$34/host/month.
Per-container and serverless pricing available for EKS/Fargate workloads.

APM + Continuous Profiler

~$31–$40/host/month.

Log Management

Standard tier: per-GB ingested + per-GB retained.
Flex Logs: ~80% cheaper ingestion for logs queried infrequently.

LLM Observability, Cloud SIEM, Database Monitoring

Sold separately; all usage-based.

Typical totals: small teams $400–$1,500/month, mid-market $3k–$15k/month, enterprise on annual contracts with significant discount.
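
To make the ballparks concrete, here is a minimal sketch that turns the per-host ranges above into a low/high monthly estimate. It covers host-based products only (infrastructure and APM); logs, LLM Observability, Cloud SIEM, and Database Monitoring are usage-based and excluded. The constant and function names are hypothetical, and the prices are the ballparks from this page, not a quote.

```python
# Ballpark per-host monthly prices in USD, copied from the ranges above.
INFRA_PRO = (15, 23)
INFRA_ENTERPRISE = (23, 34)
APM = (31, 40)


def monthly_host_cost(hosts: int, infra=INFRA_PRO, apm_hosts: int = 0):
    """Return a (low, high) monthly estimate for host-based products."""
    if hosts < 0 or apm_hosts < 0:
        raise ValueError("host counts must be non-negative")
    if apm_hosts > hosts:
        raise ValueError("apm_hosts cannot exceed hosts")
    low = hosts * infra[0] + apm_hosts * APM[0]
    high = hosts * infra[1] + apm_hosts * APM[1]
    return low, high


if __name__ == "__main__":
    # 20 hosts on Infrastructure Pro, all with APM enabled.
    print(monthly_host_cost(20, apm_hosts=20))  # (920, 1260)
```

Twenty Pro hosts with APM land at roughly $920 to $1,260 per month before logs, which sits inside the small-team range quoted above.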

Datadog vs CloudWatch vs open-source

Datadog

Full-featured observability + security + LLM + cost in one pane.
Best for multi-service, multi-cloud, or AI/ML workloads; best investigative experience for on-call.
Higher sticker price — significant ROI when tied to MTTR and SLO improvements.

CloudWatch + Application Signals + AWS Managed Grafana

Free/low-cost for AWS-native telemetry; Application Signals adds service maps and SLOs.
Native IAM model; no extra trust relationship or external vendor review.
Weaker cross-account correlation; LLM observability is basic compared with Datadog.

Open-source (Prometheus + Grafana + OpenTelemetry + Loki)

Maximum control; lowest licence cost but highest operational overhead.
AWS Managed Prometheus + Managed Grafana + ADOT Collector removes most of the toil.
Good fit for teams with strong DevOps expertise who want portability.

When Datadog is NOT the right call

You run a single-region, AWS-only workload with fewer than ~20 services and no multi-cloud ambition — CloudWatch + Application Signals is usually enough, and significantly cheaper.
You have strict data-residency rules that no Datadog region satisfies: a Germany-only or onshore-Australia requirement, for example, may force you onto AWS-native tooling or a self-hosted stack.
Your primary observability problem is database performance and nothing else — RDS Performance Insights + Aurora DB Activity Streams may be sufficient without adding a third-party bill.
You have zero capacity to maintain a tagging taxonomy — Datadog’s value drops sharply without Unified Service Tagging discipline.

Implementation: multi-account onboarding via CloudFormation StackSet

Datadog publishes a CloudFormation template that creates the IAM role, event subscriptions, and CloudWatch Metric Streams per account. Deploy via StackSet across the AWS Organization:

# Excerpt — Datadog provides the canonical template via the AWS Integration page
aws cloudformation create-stack-set \
  --stack-set-name datadog-aws-integration \
  --template-url https://datadog-cloudformation-template.s3.amazonaws.com/aws/main.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
      ParameterKey=DatadogApiKey,ParameterValue="<api-key-from-secrets-manager>" \
      ParameterKey=DatadogSite,ParameterValue=datadoghq.com \
      ParameterKey=ExternalId,ParameterValue="<datadog-supplied-external-id>" \
      ParameterKey=InstallDatadogPolicies,ParameterValue=true \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false

aws cloudformation create-stack-instances \
  --stack-set-name datadog-aws-integration \
  --deployment-targets OrganizationalUnitIds=ou-xxx-yyyy \
  --regions us-east-1

Always source the template URL and parameters from the Datadog Admin → Integrations → AWS page — Datadog publishes updated templates as the trust contract evolves.

Implementation: Datadog Operator with Pod Identity on EKS

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  global:
    clusterName: prod-eks-eu-west-1
    site: datadoghq.com
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
    tags:
      - 'team:platform'
      - 'env:prod'
      - 'cost-center:eng'
  features:
    apm:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true
    orchestratorExplorer:
      enabled: true
    liveContainerCollection:
      enabled: true
    eventCollection:
      collectKubernetesEvents: true
  override:
    nodeAgent:
      serviceAccountName: datadog-agent
      # Pod Identity association created out-of-band via:
      # aws eks create-pod-identity-association \
      #   --cluster-name prod-eks-eu-west-1 \
      #   --namespace datadog \
      #   --service-account datadog-agent \
      #   --role-arn arn:aws:iam::123456789012:role/datadog-agent-pod-identity

Pod Identity replaces IRSA — no OIDC provider, no ServiceAccount annotation. Cluster Agent and Admission Controller are managed by the Operator.

Failure modes & resilience

1. CloudWatch Metric Streams Firehose backpressure. A spike in metric volume (sudden Lambda concurrency, EKS node-fleet replacement) can fill the Firehose buffer; Datadog ingest lags by minutes. Mitigation: monitor aws.firehose.delivery_to_http_endpoint.records_delivered_count against incoming records; raise Firehose buffer size and Datadog endpoint concurrency via the StackSet update.

2. Custom-metric cardinality runaway. A single new tag key with high cardinality (request_id, user_id, raw URL path) multiplies the number of distinct timeseries and, with it, the Datadog bill. Datadog enforces a per-organization custom-metrics limit. Mitigation: review the tag schema at PR time; run a periodic query on datadog.estimated_usage.custom_metrics filtered by metric_name; drop high-cardinality tags with Metrics without Limits, or convert the signal to log-based metrics.

3. Exclusion-filter drift. Filters configured to drop noisy logs are easy to forget; cost climbs silently as new services emit similar logs without matching filters. Mitigation: quarterly review of top log-cost contributors; codify exclusion filters in the Datadog Terraform provider so changes go through PR review.

4. Lambda Extension cold-start. The first invocation of a Lambda function with the Datadog Extension layer adds 100–300 ms to init when the API key is sourced from Secrets Manager, because the key must be fetched and decrypted before the Extension can start. Mitigation: use DD_API_KEY_SECRET_ARN only in environments that justify the cost; for latency-critical functions, set the API key as a Lambda environment variable and pair it with Provisioned Concurrency to amortize the init time.

5. Agent host-reporting drift on Auto Mode. Auto Mode replaces nodes; transient reporting gaps (~30 s) appear during replacement. Mitigation: query dashboards over windows of at least 1 min, and configure alarms to require 2 of 3 breaching datapoints so node replacement does not trigger false positives.

6. Datadog API rate limits. 300 reqs/hour for most public APIs, 600 for Logs Search. Bulk dashboard imports or programmatic monitor management can trip this. Mitigation: backoff with jitter; use the Terraform provider with parallelism capped.

7. Datadog itself is down. Region incidents happen. Mitigation: keep CloudWatch alarms on the truly load-bearing AWS-service metrics (RDS CPU, Lambda errors, ALB 5xx) so on-call gets paged even if Datadog is unavailable. Don’t centralize EVERY alert in Datadog.
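
The "backoff with jitter" mitigation for API rate limits (item 6) can be sketched as follows. This is a generic exponential backoff with full jitter, not code from a Datadog client library; `RateLimited` is a hypothetical exception standing in for whatever your HTTP client raises on a 429 response, and the default delays are illustrative.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error the caller's client raises on HTTP 429."""


def backoff_delays(retries, base=1.0, cap=60.0, rng=random.random):
    """Yield full-jitter delays: uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(fn, retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited; makes up to retries + 1 attempts."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return fn()
        except RateLimited:
            sleep(delay)  # wait a randomized, growing interval before retrying
    return fn()  # final attempt; any exception propagates to the caller
```

Randomizing the delay spreads retries from parallel workers (for example, several CI jobs importing dashboards) so they do not all hit the API again at the same moment. Where the API returns rate-limit reset headers, honoring those is usually preferable to a blind backoff.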

Observability runbook (alerting on Datadog itself)

Meta-monitors we ship:

Monitor	Threshold	First action
`datadog.agent.up` per host (no-data alert)	no data `> 10 min`	Confirm node still exists; check Agent status / logs
Custom-metric count by service	`> 100k` distinct timeseries	Cardinality review; drop tags or convert to log-based metric
Log ingestion volume by service	`> 2×` 7-day baseline	Sudden log explosion; identify and exclude or move to Flex Logs
Firehose `delivery_to_http_endpoint.success` ratio	`< 99%` for 15 min	Datadog endpoint health; AWS Firehose error logs
`aws.integration.run_status` by AWS account	failure	Datadog Admin → Integrations → AWS → check role assumption error
LLM Observability prompts failing eval	spike > baseline	Prompt regression; pair with Bedrock Guardrails findings
Custom-metric usage vs contracted limit	`> 80%` (reviewed monthly)	Renegotiate or trim before hard cap
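
The log-volume row in the table above compares current volume against a 7-day baseline. A minimal sketch of that comparison, assuming the baseline is the arithmetic mean of the last seven daily totals (Datadog monitors may compute baselines differently, and the function name is hypothetical):

```python
from statistics import mean


def exceeds_baseline(daily_gb, today_gb, factor=2.0, window=7):
    """True when today's volume exceeds `factor` times the recent daily mean.

    daily_gb -- historical daily ingestion totals, oldest first
    today_gb -- the value being evaluated
    """
    if len(daily_gb) < window:
        raise ValueError("need at least `window` days of history")
    baseline = mean(daily_gb[-window:])  # mean of the most recent `window` days
    return today_gb > factor * baseline


if __name__ == "__main__":
    history = [10, 12, 11, 9, 10, 13, 12]  # mean is 11 GB/day
    print(exceeds_baseline(history, 23))  # True: 23 > 22
    print(exceeds_baseline(history, 22))  # False: 22 is not above 22
```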

Debug path: “metric missing in Datadog”:

Confirm the metric is being emitted: on the host, run `datadog-agent status` and check the list of integrations and their last collection time.
Datadog Admin → Integrations → AWS → check that the relevant service is enabled (CloudWatch namespaces are opt-in).
Inspect Metric Streams: AWS console → CloudWatch → Metric Streams → status running; recent errors in the destination Firehose.
Tag filter mismatch: Datadog filters at ingest may drop the metric — review include/exclude rules.
Custom metric: confirm the host/container has DogStatsD enabled and the metric name is not collapsing to a quota-limited family.

Best practices

Tagging

Unified Service Tagging (service, env, version) on every piece of telemetry — enforce via Admission Controller and CI pipeline checks.
Inherit AWS tags (cost-center, team, pii-classification) via the CloudWatch integration.
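
A CI pipeline check for Unified Service Tagging can be as small as the sketch below. It takes tags in the `key:value` string form used in the DatadogAgent example and reports which of `env`, `service`, and `version` are missing. The function name is hypothetical, and how you extract the tag list from your manifests is left out.

```python
REQUIRED_KEYS = ("env", "service", "version")


def missing_unified_tags(tags):
    """Return required tag keys that are absent or have an empty value.

    tags -- iterable of 'key:value' strings
    """
    present = set()
    for tag in tags:
        key, sep, value = tag.partition(":")
        # Count a key only when it has a separator and a non-empty value.
        if sep and value.strip():
            present.add(key.strip().lower())
    return [key for key in REQUIRED_KEYS if key not in present]


if __name__ == "__main__":
    print(missing_unified_tags(["team:platform", "env:prod", "cost-center:eng"]))
    # ['service', 'version']
```

Run against the global tags from the Operator example, the check reports `service` and `version` as missing. That is expected for cluster-wide tags, since those two values are normally set per workload; the check is more useful when pointed at each deployment's tags.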

Alerts

Alert on business/SLO metrics (error rate, P99 latency, checkout success) first; alert on infra second.
Composite monitors for noise reduction; dynamic baselines for seasonal workloads.
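
One common noise-reduction rule, and the one suggested for Auto Mode node replacement in the failure-modes section, is to alert only when 2 of the last 3 datapoints breach. The sketch below shows the evaluation logic in isolation; it is an illustration of the rule, not how Datadog or CloudWatch implement it internally, and the function name is hypothetical.

```python
def breaching(datapoints, threshold, m=2, n=3):
    """True when at least `m` of the last `n` datapoints exceed `threshold`."""
    if not 0 < m <= n:
        raise ValueError("require 0 < m <= n")
    recent = datapoints[-n:]
    if len(recent) < n:
        return False  # not enough data yet; treat as not breaching
    return sum(1 for value in recent if value > threshold) >= m


if __name__ == "__main__":
    print(breaching([1, 9, 1], threshold=5))  # False: a single spike
    print(breaching([1, 9, 9], threshold=5))  # True: two of three breach
```

A single gap or spike during node replacement therefore does not page anyone, while a sustained breach still does within three evaluation periods.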

Cost control

Exclusion filters on known-noisy logs; Flex Logs for audit/compliance.
Log-based metrics for anything you alert on from logs.
Quarterly review of custom metric cardinality — the #1 cause of runaway Datadog bills.
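
To see why one high-cardinality tag dominates the bill, it helps to compute the worst-case number of timeseries a single metric can produce: the product of the distinct values of each tag key. The sketch below does that and flags risky keys. The names and the limit of 100 distinct values are assumptions for illustration; real usage counts only tag combinations that are actually emitted, so this is an upper bound.

```python
from math import prod


def max_timeseries(tag_cardinalities):
    """Upper bound on distinct timeseries for one metric.

    tag_cardinalities -- mapping of tag key to number of distinct values
    """
    return prod(tag_cardinalities.values())


def risky_tags(tag_cardinalities, limit=100):
    """Return tag keys whose distinct-value count exceeds `limit`, sorted."""
    return sorted(k for k, n in tag_cardinalities.items() if n > limit)


if __name__ == "__main__":
    tags = {"env": 3, "service": 40, "version": 5}
    print(max_timeseries(tags))  # 600
    tags["user_id"] = 10_000
    print(max_timeseries(tags))  # 6000000
    print(risky_tags(tags))      # ['user_id']
```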

Security review

External ID on the Datadog IAM role; scoped managed policy, no *:*.
PrivateLink endpoints for regulated workloads; VPC endpoint policies to lock egress.
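
The first two security checks can be automated against the policy documents themselves. The sketch below inspects a trust policy and a permissions policy that have already been loaded as Python dicts (for example from exported JSON). Function names are hypothetical, the account ID is a placeholder, and the checks are deliberately narrow: they match the `sts:ExternalId` key under `StringEquals` exactly and do not evaluate other condition operators.

```python
def _as_list(value):
    """IAM allows a string or a list in several fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def trust_requires_external_id(trust_policy):
    """True when every Allow sts:AssumeRole statement requires an external ID."""
    found = False
    for stmt in _as_list(trust_policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(stmt.get("Action", [])):
            continue
        found = True
        equals = stmt.get("Condition", {}).get("StringEquals", {})
        if not equals.get("sts:ExternalId"):
            return False  # an assume-role grant without an external ID
    return found


def has_full_wildcard(policy):
    """True when any Allow statement grants Action '*' on Resource '*'."""
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if "*" in actions and "*" in resources:
            return True
    return False


if __name__ == "__main__":
    trust = {
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:root"},  # placeholder
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "example-id"}},
        }]
    }
    print(trust_requires_external_id(trust))  # True
```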

700+ AWS & SaaS integrations in Datadog

4 telemetry types unified (metrics, logs, traces, LLM)

30–60% typical log-bill reduction after a Flex Logs + exclusion-filter pass

Tools & Calculators

Self-serve calculators and assessments that pair with this integration.

AWS CloudWatch Cost Calculator

Baseline your CloudWatch + Datadog spend before you consolidate dashboards.

Related AWS Services

Consulting engagements that frequently pair with this integration.

AWS Well-Architected Review — Free Assessment

Free AWS Well-Architected Review from FactualMinds. Identify risks, compliance gaps, and optimization opportunities.

AWS Cost Optimization & FinOps Consulting

AWS cost optimization and FinOps consulting from FactualMinds — reduce spend by 20-40% with expert right-sizing and strategy.

AWS DevOps Consulting

AWS DevOps consulting — CI/CD pipeline setup, infrastructure as code (SAM/CDK), and deployment automation.

Who typically runs this integration?

The roles that most often own or review this stack.

AWS Solutions for DevOps & Platform Engineers

EKS Auto Mode, OIDC-native CI/CD, supply-chain security, CDK Toolkit v2, and eBPF observability for platform teams building the platform on AWS in 2026.

AWS Solutions for FinOps Teams

FinOps Framework 2025 rollout, AI unit economics, CUR 2.0 with Split Cost Allocation, and Bedrock cost controls for cloud finance leaders on AWS.

Related Integrations

Other AWS integration guides commonly deployed alongside this one.

Kubernetes on AWS (EKS)

Amazon EKS in 2026: Auto Mode GA, Hybrid Nodes, Karpenter 1.0, Pod Identity, Graviton-first node pools, and ECR enhanced scanning — cheaper, safer K8s.

GitHub Actions with AWS

GitHub Actions to AWS in 2026: OIDC keyless auth, Artifact Attestations, Immutable Actions, ARM runners, and reusable workflows to ECS, Lambda, EKS.

Frequently Asked Questions

How does Datadog integrate with AWS in 2026?

The modern pattern uses AWS Integration via IAM role (no long-lived keys) plus CloudWatch Metric Streams over Amazon Data Firehose for near real-time metric ingest. On EC2 and EKS, install Datadog Agent v7 or the Datadog Operator; for Lambda use the Datadog Lambda Extension layer (no forwarder Lambda). For multi-account estates, onboard accounts through the AWS Integration page using the CloudFormation StackSet Datadog publishes — it creates the IAM role and event subscriptions consistently across Organizations.

What AWS metrics and logs does Datadog collect?

Out of the box: EC2 CPU/memory/disk, EBS IOPS, RDS Performance Insights, S3 object counts and sizes, Lambda duration and cold starts, ELB/ALB latency, DynamoDB throughput, EventBridge rule failures, and GuardDuty/Security Hub findings. Logs can arrive via a CloudWatch Logs subscription filter, S3 archive ingestion, or the Agent. Custom metrics land via DogStatsD, OTLP, or the API, and the same tags flow to metrics, traces, and logs when you use Unified Service Tagging.

Can Datadog replace CloudWatch entirely?

Usually no, and trying to is where most teams overspend. CloudWatch bills per metric and per log whether or not you look at them, and a handful of AWS features emit only to CloudWatch natively (most CloudWatch Alarms, AWS Health events, Lambda CloudWatch metrics used by service quotas), so it cannot be switched off entirely. The pragmatic pattern: keep CloudWatch for AWS-service-native alarms and quota dashboards; use Datadog as the single pane of glass for application traces, custom metrics, LLM observability, and cross-account correlation. Datadog ingests CloudWatch via Metric Streams, so you still see everything in one place.

How do I correlate logs, metrics, and traces in Datadog?

Use Unified Service Tagging: `env`, `service`, and `version` tags on every piece of telemetry, propagated by the Agent/Tracer. For AWS resources, Datadog inherits tags from CloudWatch and Resource Groups so existing `cost-center`/`team` tags show up automatically. Enable Data Streams Monitoring for Kafka/Kinesis/SNS/SQS to get end-to-end flow tracking. Trace Explorer and Live Tail support identical query syntax across telemetry types.

What does Datadog cost on a typical AWS estate in 2026?

Pricing changes — always confirm at datadoghq.com/pricing. Current ballparks: Infrastructure Pro starts ~$15/host/month, APM adds ~$31/host/month, Log Management is priced per GB ingested + retained (Flex Logs is the cheaper tier for audit/compliance logs queried infrequently), LLM Observability and Cloud SIEM are sold separately. For mid-market AWS estates we typically see $3k–$15k/month, with the biggest optimization levers being exclusion filters, Flex Logs for archive-pattern logs, log-based metrics, and dropping high-cardinality custom metrics.

Datadog LLM Observability on Bedrock vs CloudWatch GenAI observability — which do I use?

Both, for different questions. CloudWatch GenAI observability (CloudWatch Application Signals + Bedrock invocation metrics) is free-tier for the basics — token usage, invocation latency, and error rates across Bedrock models — and is sufficient if you only need operational alerts. Datadog LLM Observability (GA in 2024, matured through 2025) adds prompt/completion capture, hallucination scoring, quality evaluators, and correlation with APM traces, so a slow checkout trace can be tied to a specific Bedrock Claude call. Teams running Bedrock AgentCore or multi-step agents almost always need the prompt/trace view, which CloudWatch does not provide.

How do we audit the Datadog-to-AWS trust relationship for security review?

Three checks: (1) the Datadog IAM role must require the external ID Datadog assigns to your account — prevents confused-deputy; (2) the role should be scoped with the Datadog-published managed policy, no `*`-on-`*`; (3) for regulated workloads, enable Datadog AWS PrivateLink endpoints so metric/log traffic never transits the public internet. Pair with Datadog Cloud SIEM for detection on the AWS CloudTrail feed if you want Datadog to alert on the IAM role itself being modified.

Need Help with This Integration?

Our AWS-certified engineers can design, implement, and operate this integration end-to-end — or review what you already have.

Related Reading

AWS CloudWatch Observability: Metrics, Logs, and Alarms Best Practices

Logging Yourself Into Bankruptcy

Amazon Bedrock AgentCore: Building Production-Ready AI Agents on AWS
