AWS CloudWatch Observability: Metrics, Logs, and Alarms Best Practices
Quick summary: A practical guide to AWS CloudWatch for production observability — custom metrics, structured logging, alarm strategies, dashboards, and cost-effective monitoring patterns.
Key Takeaways
- Monitor the four golden signals — latency, traffic, errors, saturation — for every service that receives traffic
- Emit custom business metrics with Embedded Metric Format and log structured JSON so CloudWatch Logs Insights can query it
- Alert on symptoms and rates, not causes and counts, and tier alarms by severity
- Start with CloudWatch before adding third-party tools; control log costs with levels, sampling, and retention policies

CloudWatch is the observability foundation for every AWS workload. It collects metrics from every AWS service, stores and queries logs, fires alarms, and renders dashboards — all without deploying any monitoring infrastructure. Yet most teams use only a fraction of CloudWatch’s capabilities, missing the monitoring practices that prevent outages and accelerate debugging.
This guide covers the production observability patterns we implement for clients through our managed services and DevOps engagements.
Metrics: What to Monitor
The Four Golden Signals
For every service, monitor these four signals (from Google’s SRE book, applicable to any production system):
| Signal | What It Measures | CloudWatch Metric Example |
|---|---|---|
| Latency | How long requests take | ALB TargetResponseTime p50, p95, p99 |
| Traffic | Request volume | ALB RequestCount, API Gateway Count |
| Errors | Failure rate | ALB HTTPCode_Target_5XX_Count, Lambda Errors |
| Saturation | How full your resources are | EC2 CPUUtilization, RDS FreeStorageSpace |
If you monitor nothing else, monitor these four signals for every service that receives traffic. They tell you whether your service is working (errors), how fast (latency), how busy (traffic), and whether it is running out of capacity (saturation).
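As a concrete starting point, a latency-signal alarm can be expressed as the keyword arguments for boto3's `put_metric_alarm`. This is a sketch: the alarm name, load balancer dimension value, and SNS topic ARN are hypothetical placeholders.

```python
# A p99 latency alarm for the "latency" golden signal, expressed as the
# keyword arguments you would pass to boto3's cloudwatch.put_metric_alarm.
# Alarm name, LoadBalancer dimension, and SNS ARN are hypothetical.
latency_alarm = {
    "AlarmName": "checkout-api-p99-latency",
    "Namespace": "AWS/ApplicationELB",
    "MetricName": "TargetResponseTime",
    "Dimensions": [{"Name": "LoadBalancer", "Value": "app/checkout/abc123"}],
    "ExtendedStatistic": "p99",          # alarm on the tail, not the average
    "Period": 60,                        # evaluate every minute
    "EvaluationPeriods": 5,              # require 5 consecutive breaches
    "Threshold": 2.0,                    # seconds
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",  # quiet periods are not failures
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}

# With credentials configured, this would be applied as:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**latency_alarm)
```

Alarming on `ExtendedStatistic` percentiles rather than `Average` keeps tail latency visible, which averages hide.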
Custom Metrics
AWS built-in metrics cover infrastructure. Your application’s business logic generates custom metrics that are often more valuable:
Application metrics:
- Orders processed per minute
- Payment success/failure rate
- User registration rate
- API response times by endpoint
- Queue depth and processing lag
Using CloudWatch Embedded Metric Format (EMF):

```json
{
  "_aws": {
    "Timestamp": 1648657200000,
    "CloudWatchMetrics": [
      {
        "Namespace": "MyApp",
        "Dimensions": [["Service", "Environment"]],
        "Metrics": [{ "Name": "OrdersProcessed", "Unit": "Count" }]
      }
    ]
  },
  "Service": "OrderService",
  "Environment": "production",
  "OrdersProcessed": 42
}
```

EMF lets you emit custom metrics as structured JSON log lines. CloudWatch extracts metrics automatically — no API calls needed, no SDK dependency, and each metric costs the same as a standard CloudWatch custom metric.
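In practice you rarely hand-write this JSON. A small helper can render EMF lines from plain function calls — this is a minimal sketch (AWS also publishes `aws-embedded-metrics` client libraries that do this more robustly); the `emit_emf` name and the `Count` unit default are our own choices:

```python
import json
import time

def emit_emf(service: str, environment: str, **metrics) -> str:
    """Render one EMF log line. Printing it to stdout in Lambda (or writing it
    to any log file shipped to CloudWatch Logs) triggers metric extraction."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [["Service", "Environment"]],
                # Assumes every metric is a simple count; adjust Unit as needed.
                "Metrics": [{"Name": name, "Unit": "Count"} for name in metrics],
            }],
        },
        "Service": service,
        "Environment": environment,
        **metrics,
    }
    return json.dumps(record)

print(emit_emf("OrderService", "production", OrdersProcessed=42))
```

Because the line is also an ordinary structured log entry, the same record is queryable in Logs Insights.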
Metric Math
Combine metrics for more meaningful signals:
- Error rate: `(Errors / Invocations) * 100` — more useful than raw error count because it normalizes for traffic volume
- Availability: `((TotalRequests - 5xxErrors) / TotalRequests) * 100`
- Cache hit ratio: `CacheHits / (CacheHits + CacheMisses) * 100`
Metric Math computes derived metrics without additional cost beyond the source metrics.
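The error-rate expression above can be wired up as `MetricDataQueries`, the structure accepted by CloudWatch `GetMetricData` and by `put_metric_alarm` (via its `Metrics` parameter). A sketch — the Lambda function name `checkout-handler` is hypothetical:

```python
# Two source metrics (hidden from output) plus one Metric Math expression.
# The expression is evaluated server-side by CloudWatch at query/alarm time.
error_rate_query = [
    {
        "Id": "errors",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": "Errors",
                "Dimensions": [{"Name": "FunctionName", "Value": "checkout-handler"}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,  # intermediate series, not returned directly
    },
    {
        "Id": "invocations",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": "Invocations",
                "Dimensions": [{"Name": "FunctionName", "Value": "checkout-handler"}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    },
    {
        "Id": "errorRate",
        "Expression": "(errors / invocations) * 100",
        "Label": "Error rate (%)",
        "ReturnData": True,   # only the derived series comes back
    },
]

# boto3.client("cloudwatch").get_metric_data(
#     MetricDataQueries=error_rate_query, StartTime=..., EndTime=...)
```

The same list dropped into an alarm's `Metrics` parameter gives you a rate-based alarm with no extra custom metrics to pay for.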
Logs: Structured Logging
Why Structured Logging Matters
Unstructured logs (free-text strings) are human-readable but machine-hostile:
```
[2026-06-05 14:30:22] ERROR: Payment failed for user 12345, amount $99.99, reason: card_declined
```

Structured logs (JSON) are both human-readable and queryable:
```json
{
  "timestamp": "2026-06-05T14:30:22Z",
  "level": "ERROR",
  "message": "Payment failed",
  "userId": "12345",
  "amount": 99.99,
  "currency": "USD",
  "reason": "card_declined",
  "requestId": "abc-123",
  "traceId": "1-abc-def"
}
```

With structured logs, you can query: “Show me all payment failures over $100 in the last hour” using CloudWatch Logs Insights:
```
fields @timestamp, userId, amount, reason
| filter level = "ERROR" and message = "Payment failed" and amount > 100
| sort @timestamp desc
| limit 50
```

Logs Insights Query Patterns
Error investigation:

```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errorCount by bin(5m)
| sort errorCount desc
```

Latency analysis:

```
fields @timestamp, duration, endpoint
| filter ispresent(duration)
| stats avg(duration) as avgDuration, pct(duration, 95) as p95, pct(duration, 99) as p99 by endpoint
| sort p99 desc
```

Request tracing:

```
fields @timestamp, @message
| filter requestId = "abc-123"
| sort @timestamp asc
```

Log Cost Optimization
CloudWatch Logs can become expensive. At $0.50/GB ingested and $0.03/GB stored, a verbose application logging 100 GB/day costs $1,500/month in ingestion alone.
Cost reduction strategies:
- Log levels — Use DEBUG only in development, INFO for normal operations, WARN/ERROR for problems. Never log full request/response bodies in production.
- Sampling — Log 1 in 10 successful requests but log every error. This reduces volume by 90% while retaining all failure data.
- Retention policies — Set log group retention to 30 days for application logs, 90 days for security logs, and 1 year for audit logs. Default retention is forever.
- Log class — Use CloudWatch Logs Infrequent Access class for logs that are rarely queried but must be retained for compliance. 50% cheaper for ingestion.
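The sampling strategy above can be sketched in a few lines of application code — keep every failure, keep roughly 1 in 10 successes, drop the rest before they are ever ingested (and billed). The helper name and threshold here are illustrative, not a standard API:

```python
import logging
import random

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

SUCCESS_SAMPLE_RATE = 0.1  # keep 1 in 10 successful-request log lines

def log_request(status_code: int, message: str) -> bool:
    """Log every failure, but only a sample of successes.
    Returns True if the record was emitted (handy for testing)."""
    if status_code >= 400:
        logger.error(message)              # failures are always kept
        return True
    if random.random() < SUCCESS_SAMPLE_RATE:
        logger.info(message)               # ~10% of successes are kept
        return True
    return False                           # dropped client-side, never billed
```

Because errors bypass sampling entirely, the 90% volume reduction costs you no failure data.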
Alarms: Alert on What Matters
Alarm Design Principles
Alert on symptoms, not causes. An alarm on “API error rate > 5%” is more useful than “EC2 CPU > 80%.” High CPU might not cause user impact; high error rate definitely does.
Alert on rates, not counts. “50 errors in 5 minutes” means different things at different traffic levels. “Error rate > 2%” is meaningful regardless of scale.
Set thresholds based on data, not guesses. Use CloudWatch Anomaly Detection for metrics with variable baselines (traffic volume, latency during peak hours). Anomaly Detection learns your metric patterns and alerts on deviations.
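An anomaly-detection alarm is configured differently from a static one: instead of a `Threshold`, you pass a `Metrics` list containing the source metric and an `ANOMALY_DETECTION_BAND` expression, then alarm against the band via `ThresholdMetricId`. A sketch of the `put_metric_alarm` arguments — the alarm name and load balancer value are hypothetical:

```python
# Anomaly-detection alarm on request volume. The metric (id "m1") is compared
# against a learned band (id "ad1") two standard deviations wide, so the
# threshold follows daily/weekly traffic patterns instead of a fixed number.
anomaly_alarm = {
    "AlarmName": "requests-anomaly",
    "Metrics": [
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "RequestCount",
                    "Dimensions": [{"Name": "LoadBalancer",
                                    "Value": "app/checkout/abc123"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "ad1",
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",  # width in std devs
        },
    ],
    "ThresholdMetricId": "ad1",  # alarm against the band, not a constant
    "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
    "EvaluationPeriods": 3,
}

# boto3.client("cloudwatch").put_metric_alarm(**anomaly_alarm)
```

Widening the band (the second argument) trades sensitivity for fewer false alarms.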
Alarm Tiers
| Tier | Severity | Response | Example |
|---|---|---|---|
| P1 — Critical | Service down or data loss | Immediate (PagerDuty, phone call) | API error rate > 10%, database unreachable |
| P2 — High | Degraded performance | Within 1 hour (Slack, email) | p99 latency > 2s, disk > 85% |
| P3 — Warning | Potential issue | Next business day | Memory trending up, cost anomaly |
| P4 — Info | Informational | Review in weekly ops meeting | New deployment, scaling event |
Composite Alarms
Reduce alert noise by combining multiple alarms:
```
Composite Alarm: "Service Degraded"
  = (ErrorRateAlarm IN ALARM) AND (LatencyAlarm IN ALARM)
```

A composite alarm fires only when both error rate AND latency are problematic — reducing false positives from temporary latency spikes that do not affect error rate.
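In CloudWatch's actual rule language, that condition is written with `ALARM(...)` functions and passed to `put_composite_alarm`. A sketch — the child alarm names and SNS topic ARN are hypothetical:

```python
# Composite alarm combining two existing child alarms. Only the composite
# notifies, so the child alarms can stay action-free to cut noise.
composite = {
    "AlarmName": "ServiceDegraded",
    "AlarmRule": 'ALARM("ErrorRateAlarm") AND ALARM("LatencyAlarm")',
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}

# boto3.client("cloudwatch").put_composite_alarm(**composite)
```

The rule language also supports `OR`, `NOT`, and `OK(...)`/`INSUFFICIENT_DATA(...)` states, so suppression rules like "page only if not in maintenance" are possible.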
Alarm Actions
| Action | AWS Service | Use Case |
|---|---|---|
| SNS notification | SNS → Email/Slack/PagerDuty | Alert humans |
| Auto Scaling | Auto Scaling policy | Scale up/down based on metric |
| Lambda function | Lambda | Automated remediation |
| Systems Manager | SSM Automation | Run remediation runbook |
| EventBridge | EventBridge rule | Trigger complex workflows |
Dashboards
Dashboard Design
One dashboard per service/team. A dashboard that shows everything shows nothing. Create focused dashboards:
- Executive dashboard — Availability, error rates, costs (updated daily)
- Service dashboard — Golden signals for each service (real-time)
- Infrastructure dashboard — EC2, RDS, ElastiCache resource utilization
- Cost dashboard — Daily spend by service, anomaly indicators
Dashboard Best Practices
- Time-align all widgets — Use the dashboard time picker, not per-widget time ranges
- Red/yellow/green indicators — Use CloudWatch alarm status widgets that show health at a glance
- Include context — Add text widgets explaining what each metric means and what “normal” looks like
- Auto-refresh — Set dashboards to refresh every 1-5 minutes for operational views
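Dashboards are themselves just JSON pushed through `put_dashboard`. A minimal sketch pairing an alarm-status widget with one golden-signal graph — the alarm ARN, load balancer value, and dashboard name are hypothetical:

```python
import json

# Minimal DashboardBody: a red/yellow/green alarm widget on top, a p99
# latency graph below. Widget positions are a 24-column grid.
dashboard_body = {
    "widgets": [
        {   # health at a glance via alarm state
            "type": "alarm",
            "x": 0, "y": 0, "width": 12, "height": 4,
            "properties": {
                "title": "Service health",
                "alarms": ["arn:aws:cloudwatch:us-east-1:123456789012"
                           ":alarm:checkout-api-p99-latency"],
            },
        },
        {   # latency signal; time range comes from the dashboard picker
            "type": "metric",
            "x": 0, "y": 4, "width": 12, "height": 6,
            "properties": {
                "title": "p99 latency",
                "metrics": [["AWS/ApplicationELB", "TargetResponseTime",
                             "LoadBalancer", "app/checkout/abc123",
                             {"stat": "p99"}]],
                "period": 60,
                "region": "us-east-1",
            },
        },
    ]
}

# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="checkout-service",
#     DashboardBody=json.dumps(dashboard_body))
```

Keeping dashboard bodies in code like this makes them reviewable and reproducible across accounts.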
X-Ray: Distributed Tracing
For serverless applications and microservices, X-Ray traces requests across services:
```
Client → API Gateway → Lambda A → DynamoDB
                     → Lambda B → SQS → Lambda C → S3
```

X-Ray generates a service map showing:
- Which services your request traversed
- How long each service took
- Where errors occurred
- Which downstream call is the bottleneck
Enable X-Ray on: API Gateway, Lambda, ECS, EC2 (with X-Ray daemon), and supported AWS SDK calls. The trace propagation is automatic — no custom correlation ID management needed.
CloudWatch vs Third-Party Tools
| Factor | CloudWatch | Datadog/New Relic/Grafana Cloud |
|---|---|---|
| Cost (small env) | $50-$200/month | $200-$500/month |
| Cost (large env) | $500-$2,000/month | $5,000-$20,000/month |
| AWS integration | Native (zero config) | Agent/integration required |
| Custom dashboards | Good | Excellent |
| APM depth | X-Ray (good) | Excellent |
| Log analytics | Logs Insights (good) | Excellent (more intuitive) |
| Multi-cloud | AWS only | Multi-cloud |
Our recommendation: Start with CloudWatch. It provides 80% of the observability most teams need at a fraction of the cost. Add a third-party tool when you need deeper APM, more intuitive dashboards, or multi-cloud visibility.
For cost optimization, CloudWatch’s native integration and lower cost make it the default choice for AWS-only environments.
Getting Started
Observability is not something you add after launching — it is something you build alongside your application. Start with the four golden signals, add structured logging, and build dashboards before you need them. The time to set up monitoring is not during an outage.
For CloudWatch setup and monitoring strategy as part of our managed services, or for observability architecture design, talk to our team.


