Monitoring & Observability
Datadog with AWS
Deep visibility into AWS infrastructure, Bedrock/SageMaker workloads, and applications — with a single tagging taxonomy across CloudWatch and Datadog.
Last updated: April 29, 2026
Author: FactualMinds Cloud Integration Team
Reviewed by: FactualMinds AWS-certified architects (Solutions Architect – Professional)
## Datadog + AWS overview
Datadog is an enterprise observability and security platform. On AWS, it ingests CloudWatch metrics through Amazon Data Firehose-backed Metric Streams, collects host and container telemetry via Agent v7 and the Datadog Operator on EKS, and captures serverless telemetry through the Datadog Lambda Extension — all tied together by Unified Service Tagging so the same `env`/`service`/`version` tag flows from metric to trace to log to LLM call.
FactualMinds deploys Datadog on AWS for teams that have either outgrown CloudWatch's cross-service correlation or need consolidated visibility across AWS, on-prem, and a second cloud. We keep CloudWatch as the AWS-native source of truth for service quotas, AWS Health, and alarm-driven auto-recovery — Datadog becomes the investigative and SLO layer on top.
## What's new for Datadog on AWS in 2026
- **LLM Observability (GA since 2024)**: prompt, completion, and tool-call capture for Bedrock (including Claude Sonnet 4, Llama 4, and Amazon Nova), SageMaker endpoints, and self-hosted models. It integrates with APM trace IDs, so an investigation into a slow checkout can walk from the Bedrock call back to the user request.
- **Database Monitoring for Aurora, RDS Postgres/MySQL, DynamoDB, and ElastiCache** — captures explain plans and lock/deadlock data without a proxy; the DynamoDB integration now includes PITR cost metrics and table-class usage.
- **Cloud SIEM** — detection rules over CloudTrail, GuardDuty, Security Hub, and VPC Flow Logs; pairs well with AWS Security Lake for long-term OCSF storage.
- **Watchdog Root Cause Analysis** — uses causal inference on infra, APM, and RUM signals to propose a likely root cause per alert, reducing on-call MTTI.
- **Kubernetes Monitoring UI refresh** — cluster, workload, and pod views unified for EKS, EKS Auto Mode, and EKS Hybrid Nodes; works with Pod Identity so Agents no longer need IRSA annotations.
- **Flex Logs + tier pricing** — archive-tier ingestion is ~80% cheaper than standard for logs queried infrequently (audit, compliance). Most AWS estates see 30–60% log-bill reduction after a Flex Logs pass.
- **Data Streams Monitoring for Kinesis, MSK, SNS, SQS, and EventBridge** — end-to-end flow view with lag, throughput, and producer/consumer correlation.
## How Datadog monitors AWS (implementation patterns)
**CloudWatch Metric Streams (preferred for AWS-service metrics)**
- Amazon Data Firehose → Datadog endpoint. Metric latency of roughly two to three minutes, versus the 10–15 min delay of the legacy polling integration.
- Deploy via the Datadog-published CloudFormation StackSet across AWS Organizations; supports Control Tower landing zones.
- Tag filters in the StackSet limit which services stream — important for keeping costs aligned with actual needs.
**Datadog Agent v7 + Datadog Operator on EKS**
- Operator installs Agent DaemonSet, Cluster Agent, and Admission Controller with a single `DatadogAgent` CRD.
- Compatible with EKS Auto Mode; Pod Identity integration removes the IRSA dance for Agent permissions.
- APM auto-instrumentation now covers Python, Node, Java, Go, .NET, Ruby, PHP — Admission Controller injects the tracer library without rebuilding images.
**Datadog Lambda Extension** (serverless)
- Runs as a Lambda layer; no forwarder Lambda or CloudWatch log-group subscription to maintain.
- Captures traces, enhanced metrics, and logs from the function runtime directly — lower latency and cost than CloudWatch-based approaches for high-volume functions.
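- Supports ARM64 Graviton runtimes; the extension's own cold-start overhead is measured in single-digit ms, although fetching the API key from Secrets Manager at init adds more (see failure mode 4 under "Failure modes & resilience").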
**AWS PrivateLink endpoints**
- Metric/log/trace ingestion over PrivateLink for regulated workloads (HIPAA, PCI DSS 4.0.1, FedRAMP Moderate).
- Pair with VPC endpoint policies to deny egress to the public Datadog endpoints if you need egress lock-down.
## Key Datadog + AWS features
**Infrastructure monitoring**
- Real-time metrics across EC2, RDS, Aurora, DynamoDB, S3, Lambda, ECS, EKS, ElastiCache, SNS, SQS, Kinesis, EventBridge, and 700+ SaaS integrations.
- Automatic resource discovery and relationship mapping via AWS tags and CloudFormation stack names.
- Host map and container map for visual fleet-level health.
**LLM Observability (GA in 2024)**
- Monitor Bedrock, SageMaker, and self-hosted LLMs; track prompt/completion quality with built-in evaluators and custom judges.
- Drift and regression detection across model versions — critical when Bedrock routes you through provisioned throughput or the Nova/Claude/Llama model families.
- Integrated with APM: the trace for a slow checkout shows the slow Bedrock call without manual instrumentation.
**Database Monitoring**
- Aurora, RDS Postgres/MySQL, DynamoDB, and MongoDB Atlas — plan capture, lock analysis, and query-level P95 latency.
- No proxy, no extra network hop; uses `pg_stat_statements` / Performance Schema and the Agent.
**Application Performance Monitoring (APM)**
- Distributed tracing across microservices, including OpenTelemetry-native ingest (OTLP).
- Database query profiling, service dependency maps, and Continuous Profiler for CPU/memory bottlenecks in production.
**Log Management + Flex Logs**
- Centralized log ingestion with parsing, enrichment, and Live Tail.
- Flex Logs for audit/compliance logs — ~80% cheaper than standard tier, same query syntax.
- Log-based metrics convert high-volume logs into cost-efficient metrics for dashboards and alerts.
**Cloud SIEM**
- Detection rules on CloudTrail, GuardDuty, Security Hub, and VPC Flow Logs with out-of-the-box rulesets aligned to MITRE ATT&CK.
- Pairs with AWS Security Lake (OCSF) for long-term storage and with Amazon Detective for investigation pivots.
**Cost Management**
- Tracks AWS spend alongside performance metrics; correlates deploys with cost deltas.
- Plugs into AWS Cost Optimization Hub and CUR 2.0 with Split Cost Allocation Data for per-tenant attribution.
## Datadog pricing for AWS (2026)
Pricing evolves — verify at [datadoghq.com/pricing](https://www.datadoghq.com/pricing/). Current ballparks:
**Infrastructure monitoring**
- Pro: ~$15–$23/host/month. Enterprise: ~$23–$34/host/month.
- Per-container and serverless pricing available for EKS/Fargate workloads.
**APM + Continuous Profiler**
- ~$31–$40/host/month.
**Log Management**
- Standard tier: per-GB ingested + per-GB retained.
- Flex Logs: ~80% cheaper ingestion for logs queried infrequently.
**LLM Observability, Cloud SIEM, Database Monitoring**
- Sold separately; all usage-based.
**Typical totals**: small teams $400–$1,500/month, mid-market $3k–$15k/month, enterprise on annual contracts with significant discount.
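To see how the per-host ballparks above combine, here is a minimal arithmetic sketch. It assumes every host carries both Infrastructure Pro and APM at the quoted list-price ranges; the `estimate_monthly` helper and the 20-host fleet are illustrative, and logs, custom metrics, and contract discounts are left out.

```bash
# Rough monthly estimate from the per-host ballparks above (ranges, not a quote).
# estimate_monthly is a hypothetical helper; it ignores logs, usage-based
# products, and negotiated discounts.
estimate_monthly() {
  local hosts=$1
  local infra_low=15 infra_high=23   # Infrastructure Pro, USD per host per month
  local apm_low=31 apm_high=40       # APM + Continuous Profiler, USD per host per month
  local low=$(( hosts * (infra_low + apm_low) ))
  local high=$(( hosts * (infra_high + apm_high) ))
  echo "${low}-${high}"
}

estimate_monthly 20   # prints 920-1260
```

That range sits inside the small-team band quoted above, before any log or usage-based charges are added.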
## Datadog vs CloudWatch vs open-source
**Datadog**
- Full-featured observability + security + LLM + cost in one pane.
- Best for multi-service, multi-cloud, or AI/ML workloads; best investigative experience for on-call.
- Higher sticker price — significant ROI when tied to MTTR and SLO improvements.
**CloudWatch + Application Signals + AWS Managed Grafana**
- Free/low-cost for AWS-native telemetry; Application Signals adds service maps and SLOs.
- Native IAM model; no extra trust relationship or external vendor review.
- Weaker cross-account correlation; LLM observability is basic compared with Datadog.
**Open-source (Prometheus + Grafana + OpenTelemetry + Loki)**
- Maximum control; lowest licence cost but highest operational overhead.
- AWS Managed Prometheus + Managed Grafana + ADOT Collector removes most of the toil.
- Good fit for teams with strong DevOps expertise who want portability.
## When Datadog is NOT the right call
- You run a single-region, AWS-only workload with fewer than ~20 services and no multi-cloud ambition — CloudWatch + Application Signals is usually enough, and significantly cheaper.
- You have strict data-residency rules that no Datadog site satisfies: a requirement such as a Germany-only enterprise site or onshore-Australia hosting may force you to AWS-native or a self-hosted stack.
- Your primary observability problem is database performance and nothing else — RDS Performance Insights + Aurora DB Activity Streams may be sufficient without adding a third-party bill.
- You have zero capacity to maintain a tagging taxonomy — Datadog's value drops sharply without Unified Service Tagging discipline.
## Implementation: multi-account onboarding via CloudFormation StackSet
Datadog publishes a CloudFormation template that creates the IAM role, event subscriptions, and CloudWatch Metric Streams per account. Deploy via StackSet across the AWS Organization:
```bash
# Excerpt: Datadog provides the canonical template via the AWS Integration page

# 1. Create the StackSet in the management (or delegated admin) account
aws cloudformation create-stack-set \
  --stack-set-name datadog-aws-integration \
  --template-url https://datadog-cloudformation-template.s3.amazonaws.com/aws/main.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=DatadogApiKey,ParameterValue="<api-key-from-secrets-manager>" \
    ParameterKey=DatadogSite,ParameterValue=datadoghq.com \
    ParameterKey=ExternalId,ParameterValue="<datadog-supplied-external-id>" \
    ParameterKey=InstallDatadogPolicies,ParameterValue=true \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false

# 2. Roll stack instances out to every account in the target OU
aws cloudformation create-stack-instances \
  --stack-set-name datadog-aws-integration \
  --deployment-targets OrganizationalUnitIds=ou-xxx-yyyy \
  --regions us-east-1
```
Always source the template URL and parameters from the Datadog Admin → Integrations → AWS page — Datadog publishes updated templates as the trust contract evolves.
## Implementation: Datadog Operator with Pod Identity on EKS
```yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  global:
    clusterName: prod-eks-eu-west-1
    site: datadoghq.com
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
    tags:
      - 'team:platform'
      - 'env:prod'
      - 'cost-center:eng'
  features:
    apm:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true
    orchestratorExplorer:
      enabled: true
    liveContainerCollection:
      enabled: true
    eventCollection:
      collectKubernetesEvents: true
  override:
    nodeAgent:
      serviceAccountName: datadog-agent
      # Pod Identity association created out-of-band via:
      # aws eks create-pod-identity-association \
      #   --cluster-name prod-eks-eu-west-1 \
      #   --namespace datadog \
      #   --service-account datadog-agent \
      #   --role-arn arn:aws:iam::123:role/datadog-agent-pod-identity
```
Pod Identity replaces IRSA — no OIDC provider, no ServiceAccount annotation. Cluster Agent and Admission Controller are managed by the Operator.
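The role named in the commented `create-pod-identity-association` command needs a trust policy that lets the EKS Pod Identity service assume it. A minimal sketch follows; the `pod_identity_trust_policy` helper and the output file name are illustrative, and the permissions policy (whatever AWS access your Agent actually needs) is deliberately left out.

```bash
# Print the trust policy an IAM role needs before EKS Pod Identity can hand
# its credentials to a pod. pod_identity_trust_policy is a hypothetical helper.
pod_identity_trust_policy() {
  cat <<'JSON'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}
JSON
}

pod_identity_trust_policy

# To create the role (role name taken from the manifest comment above):
# pod_identity_trust_policy > datadog-agent-trust.json
# aws iam create-role \
#   --role-name datadog-agent-pod-identity \
#   --assume-role-policy-document file://datadog-agent-trust.json
```

Unlike an IRSA trust policy, nothing here references a cluster-specific OIDC issuer, so the same role definition can be reused across clusters.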
## Failure modes & resilience
**1. CloudWatch Metric Streams Firehose backpressure.** A spike in metric volume (sudden Lambda concurrency, EKS node-fleet replacement) can fill the Firehose buffer; Datadog ingest lags by minutes. Mitigation: monitor `aws.firehose.delivery_to_http_endpoint.records_delivered_count` against incoming records; raise Firehose buffer size and Datadog endpoint concurrency via the StackSet update.
**2. Custom-metric cardinality runaway.** A single new tag key with high cardinality (`request_id`, `user_id`, raw URL path) multiplies the number of timeseries per metric and, with it, the Datadog bill. Datadog enforces a per-organization custom-metrics limit. Mitigation: review tag schemas at PR time; run a periodic query on `datadog.estimated_usage.metrics.custom.by_metric` grouped by `metric_name`; drop high-cardinality tags with Metrics without Limits, or convert the signal to log-based metrics.
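A quick worked example shows why one tag can do this. The number of timeseries a metric can produce is bounded by the product of the distinct values of each tag on it. The tag counts below are made-up illustrations and `series_upper_bound` is a hypothetical helper; real usage counts only the combinations actually emitted, so treat the product as a ceiling.

```bash
# Upper bound on timeseries for one metric: multiply the number of distinct
# values of every tag key attached to it.
series_upper_bound() {
  local total=1 n
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}

series_upper_bound 3 20 50          # env x service x endpoint      -> 3000
series_upper_bound 3 20 50 10000    # same metric after adding user_id -> 30000000
```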
**3. Exclusion-filter drift.** Filters configured to drop noisy logs are easy to forget; cost climbs silently as new services emit similar logs without matching filters. Mitigation: quarterly review of top log-cost contributors; codify exclusion filters in the Datadog Terraform provider so changes go through PR review.
**4. Lambda Extension cold-start.** When the API key is fetched from Secrets Manager via `DD_API_KEY_SECRET_ARN`, the first invocation of a function with the Datadog Extension layer adds 100–300 ms to init for the secret lookup and decryption. Mitigation: use `DD_API_KEY_SECRET_ARN` only in environments that justify the cost; for latency-critical functions, set `DD_API_KEY` as a Lambda environment variable and pair it with Provisioned Concurrency so the init cost is paid before requests arrive.
**5. Agent host-reporting drift on Auto Mode.** Auto Mode replaces nodes; transient reporting gaps (~30 s) appear during replacement. Mitigation: query dashboards over windows of at least 1 min, and have monitors alert on 2 of 3 datapoints to avoid replacement-induced false positives.
**6. Datadog API rate limits.** 300 reqs/hour for most public APIs, 600 for Logs Search. Bulk dashboard imports or programmatic monitor management can trip this. Mitigation: backoff with jitter; use the Terraform provider with `parallelism` capped.
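A minimal sketch of backoff with jitter for scripted API calls, using the "full jitter" variant (sleep a random time between zero and an exponentially growing ceiling). The function names, the 1-second base, and the 60-second cap are assumptions for illustration, not Datadog-recommended values; where a response carries rate-limit reset information, waiting for that instead is usually the better choice.

```bash
# Exponential backoff with full jitter (bash; relies on $RANDOM).
# All names and defaults here are illustrative.

# Upper bound for the sleep before retry number $1: base * 2^attempt, capped.
backoff_ceiling() {
  local attempt=$1 base=${2:-1} cap=${3:-60}
  local ceiling=$(( base * (1 << attempt) ))
  if [ "$ceiling" -gt "$cap" ]; then ceiling=$cap; fi
  echo "$ceiling"
}

# Random whole number of seconds in [0, ceiling].
backoff_delay() {
  local ceiling
  ceiling=$(backoff_ceiling "$@")
  echo $(( RANDOM % (ceiling + 1) ))
}

# Run a command until it succeeds or max_attempts is reached.
retry_with_backoff() {
  local max_attempts=$1
  shift
  local attempt=0
  until "$@"; do
    attempt=$(( attempt + 1 ))
    if [ "$attempt" -ge "$max_attempts" ]; then return 1; fi
    sleep "$(backoff_delay "$attempt")"
  done
}

# Usage (import_dashboard is a placeholder for your own API call):
# retry_with_backoff 5 import_dashboard dashboard.json
```

Randomising the whole delay, rather than adding a small jitter on top of a fixed one, keeps parallel jobs from retrying in lockstep after a shared rate-limit rejection.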
**7. Datadog itself is down.** Region incidents happen. Mitigation: keep CloudWatch alarms on the truly load-bearing AWS-service metrics (RDS CPU, Lambda errors, ALB 5xx) so on-call gets paged even if Datadog is unavailable. Don't centralize EVERY alert in Datadog.
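One way to keep that fallback in place is a plain CloudWatch alarm that pages through SNS with no Datadog dependency. The sketch below wraps `aws cloudwatch put-metric-alarm` for RDS CPU; the function name, alarm name, 90% threshold, and 2-of-3 one-minute evaluation are assumptions to adapt, and the SNS topic is whichever one your paging tool already subscribes to.

```bash
# Create a Datadog-independent CloudWatch alarm on RDS CPU.
# create_fallback_rds_cpu_alarm is a hypothetical helper; thresholds are examples.
create_fallback_rds_cpu_alarm() {
  local db_instance=$1 sns_topic_arn=$2
  aws cloudwatch put-metric-alarm \
    --alarm-name "fallback-rds-cpu-${db_instance}" \
    --namespace AWS/RDS \
    --metric-name CPUUtilization \
    --dimensions "Name=DBInstanceIdentifier,Value=${db_instance}" \
    --statistic Average \
    --period 60 \
    --evaluation-periods 3 \
    --datapoints-to-alarm 2 \
    --threshold 90 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$sns_topic_arn"
}

# Usage:
# create_fallback_rds_cpu_alarm <db-instance-id> <sns-topic-arn>
```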
## Observability runbook (alerting on Datadog itself)
**Meta-monitors we ship:**
| Monitor | Threshold | First action |
| -------------------------------------------------- | ---------------------------- | ---------------------------------------------------------------- |
| `datadog.agent.up` per host (no-data alert) | no data `> 10 min` | Confirm node still exists; check Agent status / logs |
| Custom-metric count by service | `> 100k` distinct timeseries | Cardinality review; drop tags or convert to log-based metric |
| Log ingestion volume by service | `> 2×` 7-day baseline | Sudden log explosion; identify and exclude or move to Flex Logs |
| Firehose `delivery_to_http_endpoint.success` ratio | `< 99%` for 15 min | Datadog endpoint health; AWS Firehose error logs |
| `aws.integration.run_status` by AWS account | failure | Datadog Admin → Integrations → AWS → check role assumption error |
| LLM Observability prompts failing eval | spike > baseline | Prompt regression; pair with Bedrock Guardrails findings |
| Custom-metric usage vs contracted limit            | `> 80%`, checked monthly     | Renegotiate or trim before hard cap                              |
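The log-volume row compares today's ingest with a rolling baseline. As a worked illustration of the arithmetic, assuming whole-GB daily figures and a hypothetical `exceeds_baseline` helper (in Datadog itself you would express this as a monitor rather than a script):

```bash
# Succeeds (exit 0) when today's volume is more than 2x the mean of the prior days.
# Usage: exceeds_baseline TODAY_GB DAY1_GB ... DAY7_GB   (integers only)
exceeds_baseline() {
  local today=$1
  shift
  local days=$# sum=0 v
  for v in "$@"; do
    sum=$(( sum + v ))
  done
  # today > 2 * (sum / days), rearranged to avoid integer division
  [ $(( today * days )) -gt $(( 2 * sum )) ]
}

# Prior 7 days sum to 400 GB (mean ~57 GB); 130 GB today is above 2x that.
if exceeds_baseline 130 50 55 60 52 58 61 64; then
  echo "log volume above 2x baseline"
fi
```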
**Debug path: "metric missing in Datadog":**
1. Confirm the metric is being emitted: on the host, run `sudo datadog-agent status` (inside the Agent container, `agent status`) and check the list of integrations and their last collection.
2. Datadog Admin → Integrations → AWS → check that the relevant service is enabled (CloudWatch namespaces are opt-in).
3. Inspect Metric Streams: AWS console → CloudWatch → Metric Streams → status `running`; recent errors in the destination Firehose.
4. Tag filter mismatch: Datadog filters at ingest may drop the metric — review include/exclude rules.
5. Custom metric: confirm the host/container has DogStatsD enabled and the metric name is not collapsing to a quota-limited family.
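Steps 1 and 3 are easy to script. A minimal sketch, assuming the AWS CLI is configured for the account that owns the stream; `metric_stream_state` is a hypothetical helper and the stream name is a placeholder.

```bash
# Print the state of a CloudWatch Metric Stream ("running" or "stopped").
metric_stream_state() {
  local stream_name=$1
  aws cloudwatch get-metric-stream \
    --name "$stream_name" \
    --query 'State' \
    --output text
}

# Step 1, on the host (inside the Agent container the binary is `agent`):
# sudo datadog-agent status
#
# Step 3:
# [ "$(metric_stream_state <stream-name>)" = "running" ] || echo "stream is not running"
```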
## Best practices
**Tagging**
- Unified Service Tagging (`service`, `env`, `version`) on every piece of telemetry — enforce via Admission Controller and CI pipeline checks.
- Inherit AWS tags (`cost-center`, `team`, `pii-classification`) via the CloudWatch integration.
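For the CI half of that enforcement, a text-level check is often enough to catch a missing label before merge. The sketch below greps a Kubernetes manifest for the three `tags.datadoghq.com/*` labels that Unified Service Tagging uses; `check_ust_labels` is a hypothetical helper, and because it does not parse YAML it cannot tell whether the labels sit on both the workload and the pod template.

```bash
# Fail when a manifest lacks any of the Unified Service Tagging labels.
check_ust_labels() {
  local manifest=$1 missing=0 key
  for key in env service version; do
    if ! grep -qF "tags.datadoghq.com/${key}:" "$manifest"; then
      echo "missing label tags.datadoghq.com/${key} in ${manifest}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage in a pipeline step (path is a placeholder):
# check_ust_labels k8s/deployment.yaml || exit 1
```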
**Alerts**
- Alert on business/SLO metrics (error rate, P99 latency, checkout success) first; alert on infra second.
- Composite monitors for noise reduction; dynamic baselines for seasonal workloads.
**Cost control**
- Exclusion filters on known-noisy logs; Flex Logs for audit/compliance.
- Log-based metrics for anything you alert on from logs.
- Quarterly review of custom metric cardinality — the #1 cause of runaway Datadog bills.
**Security review**
- External ID on the Datadog IAM role; scoped managed policy, no `*:*`.
- PrivateLink endpoints for regulated workloads; VPC endpoint policies to lock egress.
## Related reading
- [AWS CloudWatch observability: metrics, logs, alarms, and best practices](/blog/aws-cloudwatch-observability-metrics-logs-alarms-best-practices/)
- [AWS CloudWatch logging costs: observability without the shock bill](/blog/aws-cloudwatch-logging-costs-observability/)
- [Amazon Bedrock AgentCore in production](/blog/amazon-bedrock-agentcore-production/)
## Related services
- [AWS Cloud Cost Optimization Services](/services/aws-cloud-cost-optimization-services/)
- [AWS Architecture Review](/services/aws-architecture-review/)
- [DevOps Pipeline Setup](/services/devops-pipeline-setup/) Datadog + AWS overview
Datadog is an enterprise observability and security platform. On AWS, it ingests CloudWatch metrics through Amazon Data Firehose-backed Metric Streams, collects host and container telemetry via Agent v7 and the Datadog Operator on EKS, and captures serverless telemetry through the Datadog Lambda Extension — all tied together by Unified Service Tagging so the same env/service/version tag flows from metric to trace to log to LLM call.
FactualMinds deploys Datadog on AWS for teams that have either outgrown CloudWatch’s cross-service correlation or need consolidated visibility across AWS, on-prem, and a second cloud. We keep CloudWatch as the AWS-native source of truth for service quotas, AWS Health, and alarm-driven auto-recovery — Datadog becomes the investigative and SLO layer on top.
What’s new for Datadog on AWS in 2026
- LLM Observability GA — prompt, completion, and tool-call capture for Bedrock (including Claude Sonnet 4, Llama 4, and Amazon Nova), SageMaker endpoints, and self-hosted models. Integrates with APM trace IDs so an investigative view on a slow checkout can walk from the Bedrock call back to the user request.
- Database Monitoring for Aurora, RDS Postgres/MySQL, DynamoDB, and ElastiCache — captures explain plans and lock/deadlock data without a proxy; the DynamoDB integration now includes PITR cost metrics and table-class usage.
- Cloud SIEM — detection rules over CloudTrail, GuardDuty, Security Hub, and VPC Flow Logs; pairs well with AWS Security Lake for long-term OCSF storage.
- Watchdog Root Cause Analysis — uses causal inference on infra, APM, and RUM signals to propose a likely root cause per alert, reducing on-call MTTI.
- Kubernetes Monitoring UI refresh — cluster, workload, and pod views unified for EKS, EKS Auto Mode, and EKS Hybrid Nodes; works with Pod Identity so Agents no longer need IRSA annotations.
- Flex Logs + tier pricing — archive-tier ingestion is ~80% cheaper than standard for logs queried infrequently (audit, compliance). Most AWS estates see 30–60% log-bill reduction after a Flex Logs pass.
- Data Streams Monitoring for Kinesis, MSK, SNS, SQS, and EventBridge — end-to-end flow view with lag, throughput, and producer/consumer correlation.
How Datadog monitors AWS (implementation patterns)
CloudWatch Metric Streams (preferred for AWS-service metrics)
- Amazon Data Firehose → Datadog endpoint. Sub-minute metric freshness vs the 10–15 min delay of the legacy polling integration.
- Deploy via the Datadog-published CloudFormation StackSet across AWS Organizations; supports Control Tower landing zones.
- Tag filters in the StackSet limit which services stream — important for keeping costs aligned with actual needs.
Datadog Agent v7 + Datadog Operator on EKS
- Operator installs Agent DaemonSet, Cluster Agent, and Admission Controller with a single
DatadogAgentCRD. - Compatible with EKS Auto Mode; Pod Identity integration removes the IRSA dance for Agent permissions.
- APM auto-instrumentation now covers Python, Node, Java, Go, .NET, Ruby, PHP — Admission Controller injects the tracer library without rebuilding images.
Datadog Lambda Extension (serverless)
- Runs as a Lambda layer; no forwarder Lambda or CloudWatch log-group subscription to maintain.
- Captures traces, enhanced metrics, and logs from the function runtime directly — lower latency and cost than CloudWatch-based approaches for high-volume functions.
- Supports ARM64 Graviton runtimes; cold-start overhead measured in single-digit ms.
AWS PrivateLink endpoints
- Metric/log/trace ingestion over PrivateLink for regulated workloads (HIPAA, PCI DSS 4.0.1, FedRAMP Moderate).
- Pair with VPC endpoint policies to deny egress to the public Datadog endpoints if you need egress lock-down.
Key Datadog + AWS features
Infrastructure monitoring
- Real-time metrics across EC2, RDS, Aurora, DynamoDB, S3, Lambda, ECS, EKS, ElastiCache, SNS, SQS, Kinesis, EventBridge, and 700+ SaaS integrations.
- Automatic resource discovery and relationship mapping via AWS tags and CloudFormation stack names.
- Host map and container map for visual fleet-level health.
LLM Observability (GA in 2024)
- Monitor Bedrock, SageMaker, and self-hosted LLMs; track prompt/completion quality with built-in evaluators and custom judges.
- Drift and regression detection across model versions — critical when Bedrock routes you through provisioned throughput or the Nova/Claude/Llama model families.
- Integrated with APM: the slow checkout trace shows the slow Bedrock call without instrumenting manually.
Database Monitoring
- Aurora, RDS Postgres/MySQL, DynamoDB, and MongoDB Atlas — plan capture, lock analysis, and query-level P95 latency.
- No proxy, no extra network hop; uses
pg_stat_statements/ Performance Schema and the Agent.
Application Performance Monitoring (APM)
- Distributed tracing across microservices, including OpenTelemetry-native ingest (OTLP).
- Database query profiling, service dependency maps, and Continuous Profiler for CPU/memory bottlenecks in production.
Log Management + Flex Logs
- Centralized log ingestion with parsing, enrichment, and Live Tail.
- Flex Logs for audit/compliance logs — ~80% cheaper than standard tier, same query syntax.
- Log-based metrics convert high-volume logs into cost-efficient metrics for dashboards and alerts.
Cloud SIEM
- Detection rules on CloudTrail, GuardDuty, Security Hub, and VPC Flow Logs with out-of-the-box rulesets aligned to MITRE ATT&CK.
- Pairs with AWS Security Lake (OCSF) for long-term storage and with Amazon Detective for investigation pivots.
Cost Management
- Tracks AWS spend alongside performance metrics; correlates deploys with cost deltas.
- Plugs into AWS Cost Optimization Hub and CUR 2.0 with Split Cost Allocation Data for per-tenant attribution.
Datadog pricing for AWS (2026)
Pricing evolves — verify at datadoghq.com/pricing. Current ballparks:
Infrastructure monitoring
- Pro: ~$15–$23/host/month. Enterprise: ~$23–$34/host/month.
- Per-container and serverless pricing available for EKS/Fargate workloads.
APM + Continuous Profiler
- ~$31–$40/host/month.
Log Management
- Standard tier: per-GB ingested + per-GB retained.
- Flex Logs: ~80% cheaper ingestion for logs queried infrequently.
LLM Observability, Cloud SIEM, Database Monitoring
- Sold separately; all usage-based.
Typical totals: small teams $400–$1,500/month, mid-market $3k–$15k/month, enterprise on annual contracts with significant discount.
Datadog vs CloudWatch vs open-source
Datadog
- Full-featured observability + security + LLM + cost in one pane.
- Best for multi-service, multi-cloud, or AI/ML workloads; best investigative experience for on-call.
- Higher sticker price — significant ROI when tied to MTTR and SLO improvements.
CloudWatch + Application Signals + AWS Managed Grafana
- Free/low-cost for AWS-native telemetry; Application Signals adds service maps and SLOs.
- Native IAM model; no extra trust relationship or external vendor review.
- Weaker cross-account correlation; LLM observability is basic compared with Datadog.
Open-source (Prometheus + Grafana + OpenTelemetry + Loki)
- Maximum control; lowest licence cost but highest operational overhead.
- AWS Managed Prometheus + Managed Grafana + ADOT Collector removes most of the toil.
- Good fit for teams with strong DevOps expertise who want portability.
When Datadog is NOT the right call
- You run a single-region, AWS-only workload with fewer than ~20 services and no multi-cloud ambition — CloudWatch + Application Signals is usually enough, and significantly cheaper.
- You have strict data-residency rules that no Datadog region satisfies — enterprise DE site or onshore-Australia pattern may force you to AWS-native or a self-hosted stack.
- Your primary observability problem is database performance and nothing else — RDS Performance Insights + Aurora DB Activity Streams may be sufficient without adding a third-party bill.
- You have zero capacity to maintain a tagging taxonomy — Datadog’s value drops sharply without Unified Service Tagging discipline.
Implementation: multi-account onboarding via CloudFormation StackSet
Datadog publishes a CloudFormation template that creates the IAM role, event subscriptions, and CloudWatch Metric Streams per account. Deploy via StackSet across the AWS Organization:
# Excerpt — Datadog provides the canonical template via the AWS Integration page
aws cloudformation create-stack-set \
--stack-set-name datadog-aws-integration \
--template-url https://datadog-cloudformation-template.s3.amazonaws.com/aws/main.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--parameters \
ParameterKey=DatadogApiKey,ParameterValue="<api-key-from-secrets-manager>" \
ParameterKey=DatadogSite,ParameterValue=datadoghq.com \
ParameterKey=ExternalId,ParameterValue="<datadog-supplied-external-id>" \
ParameterKey=InstallDatadogPolicies,ParameterValue=true \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false
aws cloudformation create-stack-instances \
--stack-set-name datadog-aws-integration \
--deployment-targets OrganizationalUnitIds=ou-xxx-yyyy \
--regions us-east-1
Always source the template URL and parameters from the Datadog Admin → Integrations → AWS page — Datadog publishes updated templates as the trust contract evolves.
Implementation: Datadog Operator with Pod Identity on EKS
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
name: datadog
namespace: datadog
spec:
global:
clusterName: prod-eks-eu-west-1
site: datadoghq.com
credentials:
apiSecret:
secretName: datadog-secret
keyName: api-key
tags:
- 'team:platform'
- 'env:prod'
- 'cost-center:eng'
features:
apm:
enabled: true
logCollection:
enabled: true
containerCollectAll: true
orchestratorExplorer:
enabled: true
liveContainerCollection:
enabled: true
eventCollection:
collectKubernetesEvents: true
override:
nodeAgent:
serviceAccountName: datadog-agent
# Pod Identity association created out-of-band via:
# aws eks create-pod-identity-association \
# --cluster-name prod-eks-eu-west-1 \
# --namespace datadog \
# --service-account datadog-agent \
# --role-arn arn:aws:iam::123:role/datadog-agent-pod-identity
Pod Identity replaces IRSA — no OIDC provider, no ServiceAccount annotation. Cluster Agent and Admission Controller are managed by the Operator.
Failure modes & resilience
1. CloudWatch Metric Streams Firehose backpressure. A spike in metric volume (sudden Lambda concurrency, EKS node-fleet replacement) can fill the Firehose buffer; Datadog ingest lags by minutes. Mitigation: monitor aws.firehose.delivery_to_http_endpoint.records_delivered_count against incoming records; raise Firehose buffer size and Datadog endpoint concurrency via the StackSet update.
2. Custom-metric cardinality runaway. A single new tag key with high cardinality (request_id, user_id, raw URL path) explodes metric counts and Datadog bills. Datadog enforces a per-organization custom-metrics limit. Mitigation: tag schema review at PR time; a periodic query on datadog.estimated_usage.custom_metrics filtered by metric_name; drop high-cardinality tags via the metrics-without-limits feature or convert to log-based metrics.
3. Exclusion-filter drift. Filters configured to drop noisy logs are easy to forget; cost climbs silently as new services emit similar logs without matching filters. Mitigation: quarterly review of top log-cost contributors; codify exclusion filters in the Datadog Terraform provider so changes go through PR review.
4. Lambda Extension cold-start. First invocation of a Lambda function with the Datadog Extension layer adds 100–300 ms to init for DD_API_KEY decryption (when sourced from Secrets Manager). Mitigation: use DD_API_KEY_SECRET_ARN only for environments that justify the cost; for latency-critical functions, set the API key as a Lambda env var with a Provisioned Concurrency configuration to amortize.
5. Agent host-reporting drift on Auto Mode. Auto Mode replaces nodes; transient reporting gaps (~30 s) appear during replacement. Mitigation: dashboards should query over windows ≥ 1 min; alarms with 2/3 datapoints to avoid replacement-induced false positives.
6. Datadog API rate limits. 300 reqs/hour for most public APIs, 600 for Logs Search. Bulk dashboard imports or programmatic monitor management can trip this. Mitigation: backoff with jitter; use the Terraform provider with parallelism capped.
7. Datadog itself is down. Region incidents happen. Mitigation: keep CloudWatch alarms on the truly load-bearing AWS-service metrics (RDS CPU, Lambda errors, ALB 5xx) so on-call gets paged even if Datadog is unavailable. Don’t centralize EVERY alert in Datadog.
Observability runbook (alerting on Datadog itself)
Meta-monitors we ship:
| Monitor | Threshold | First action |
|---|---|---|
datadog.agent.up per host (no-data alert) | no data > 10 min | Confirm node still exists; check Agent status / logs |
| Custom-metric count by service | > 100k distinct timeseries | Cardinality review; drop tags or convert to log-based metric |
| Log ingestion volume by service | > 2× 7-day baseline | Sudden log explosion; identify and exclude or move to Flex Logs |
Firehose delivery_to_http_endpoint.success ratio | < 99% for 15 min | Datadog endpoint health; AWS Firehose error logs |
aws.integration.run_status by AWS account | failure | Datadog Admin → Integrations → AWS → check role assumption error |
| LLM Observability prompts failing eval | spike > baseline | Prompt regression; pair with Bedrock Guardrails findings |
Custom-metric usage > 80% of contracted limit | monthly | Renegotiate or trim before hard cap |
Debug path: “metric missing in Datadog”:
- Confirm the metric is being emitted: from the host, agent status → list of integrations and their last collection.
- Datadog Admin → Integrations → AWS → check that the relevant service is enabled (CloudWatch namespaces are opt-in).
- Inspect Metric Streams: AWS console → CloudWatch → Metric Streams → status running; recent errors in the destination Firehose.
- Tag filter mismatch: Datadog filters at ingest may drop the metric; review include/exclude rules.
- Custom metric: confirm the host/container has DogStatsD enabled and the metric name is not collapsing to a quota-limited family.
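For the last step, one way to check that DogStatsD is listening is to send a single test gauge by hand. The sketch below uses only the standard library and the plain DogStatsD datagram format (`name:value|type|#tags`) on the default UDP port 8125; the metric name in the usage note is a hypothetical placeholder.

```python
import socket


def format_dogstatsd(name, value, metric_type="g", tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>|#<tag1>,<tag2>."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_test_metric(name, value, tags=None, host="127.0.0.1", port=8125):
    """Send one gauge over UDP and return the bytes that were sent.

    UDP is fire-and-forget: a successful send does not prove the Agent
    received the datagram, so confirm arrival in Datadog afterwards.
    """
    payload = format_dogstatsd(name, value, "g", tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling `send_test_metric("debug.ping", 1, ["env:dev"])` from the affected host or container and then searching for the metric in Datadog narrows the problem to either the emitter or the ingest path.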
Best practices
Tagging
- Unified Service Tagging (service, env, version) on every piece of telemetry; enforce via Admission Controller and CI pipeline checks.
- Inherit AWS tags (cost-center, team, pii-classification) via the CloudWatch integration.
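A CI pipeline check for Unified Service Tagging could look roughly like this. It assumes the workload is a Kubernetes manifest already parsed into a dict of labels and checks for the standard `tags.datadoghq.com/*` labels; the function name is hypothetical, and manifest parsing is left out.

```python
REQUIRED_UST_LABELS = (
    "tags.datadoghq.com/service",
    "tags.datadoghq.com/env",
    "tags.datadoghq.com/version",
)


def missing_ust_labels(labels):
    """Return the required labels that are absent or empty, in a stable order."""
    return [key for key in REQUIRED_UST_LABELS if not labels.get(key)]
```

A pipeline step would fail the build whenever the returned list is non-empty and print the missing keys.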
Alerts
- Alert on business/SLO metrics (error rate, P99 latency, checkout success) first; alert on infra second.
- Composite monitors for noise reduction; dynamic baselines for seasonal workloads.
Cost control
- Exclusion filters on known-noisy logs; Flex Logs for audit/compliance.
- Log-based metrics for anything you alert on from logs.
- Quarterly review of custom metric cardinality — the #1 cause of runaway Datadog bills.
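For the quarterly cardinality review, a first-pass estimate can come from counting distinct tag combinations per metric name in a sample of emitted points. This is a sketch over an in-memory sample with a hypothetical function name; it approximates how cardinality grows and is not a reproduction of Datadog's billing calculation.

```python
from collections import defaultdict


def timeseries_per_metric(samples):
    """Count distinct tag combinations per metric name.

    samples: iterable of (metric_name, tags), where tags is an iterable of
    "key:value" strings. Tag order does not matter.
    """
    combos = defaultdict(set)
    for name, tags in samples:
        combos[name].add(frozenset(tags))
    return {name: len(tagsets) for name, tagsets in combos.items()}
```

Metrics whose count tracks an unbounded value (a user ID, a request ID) are the usual candidates for dropping a tag or converting to a log-based metric.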
Security review
- External ID on the Datadog IAM role; scoped managed policy, no *:*.
- PrivateLink endpoints for regulated workloads; VPC endpoint policies to lock egress.
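The first security check can be partly automated against exported policy JSON. The sketch below assumes the trust policy and permissions policy are already loaded as Python dicts; the function names are hypothetical, and it only covers the two conditions named above (External ID present, no `*` on `*` grant).

```python
def _as_list(value):
    """IAM allows a single item or a list for Statement/Action/Resource."""
    return value if isinstance(value, list) else [value]


def trust_policy_requires_external_id(trust_policy):
    """True if every Allow statement for sts:AssumeRole pins sts:ExternalId."""
    statements = [
        s for s in _as_list(trust_policy.get("Statement", []))
        if s.get("Effect") == "Allow"
        and "sts:AssumeRole" in _as_list(s.get("Action", []))
    ]
    return bool(statements) and all(
        "sts:ExternalId" in s.get("Condition", {}).get("StringEquals", {})
        for s in statements
    )


def wildcard_statements(permissions_policy):
    """Return Allow statements that grant Action "*" on Resource "*"."""
    return [
        s for s in _as_list(permissions_policy.get("Statement", []))
        if s.get("Effect") == "Allow"
        and "*" in _as_list(s.get("Action", []))
        and "*" in _as_list(s.get("Resource", []))
    ]
```

A review script would expect `trust_policy_requires_external_id` to return `True` and `wildcard_statements` to return an empty list for the Datadog role.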
Related reading
- AWS CloudWatch observability: metrics, logs, alarms, and best practices
- AWS CloudWatch logging costs: observability without the shock bill
- Amazon Bedrock AgentCore in production
Frequently Asked Questions
How does Datadog integrate with AWS in 2026?
What AWS metrics and logs does Datadog collect?
Can Datadog replace CloudWatch entirely?
How do I correlate logs, metrics, and traces in Datadog?
What does Datadog cost on a typical AWS estate in 2026?
Datadog LLM Observability on Bedrock vs CloudWatch GenAI observability — which do I use?
How do we audit the Datadog-to-AWS trust relationship for security review?
Related Reading
- AWS CloudWatch Observability: Metrics, Logs, and Alarms Best Practices
CloudWatch is the most underused service on every AWS bill — and the most overspent on the ones that take it seriously. Logs, metrics, and alarm patterns that catch real outages without burying you in noise (or in the bill).
- Logging Yourself Into Bankruptcy
Observability is not free, and the industry has collectively underpriced it. CloudWatch log ingestion, metrics explosion, and X-Ray trace volume can together exceed your compute bill — especially once AI workloads introduce high-cardinality telemetry at scale.
- Amazon Bedrock AgentCore: Building Production-Ready AI Agents on AWS
Amazon Bedrock AgentCore solves the production gaps in Bedrock Agents API: persistent memory, tool reliability, and agent observability. Here is the architecture guide.