Can I promise customers the same uptime as the AWS EC2 SLA?

No. The Amazon Compute SLA commits 99.99% at the Region level (multi-AZ) and 99.5% for a single instance—but your service is only as available as every dependency on its critical request path, composed multiplicatively. A path of ALB (99.99%) × EC2 multi-AZ (99.99%) × RDS Multi-AZ (99.95%) × S3 (99.9%) composes to roughly 99.83%, not 99.99%. Promise below your composed floor, not at any single component SLA.

What is the difference between an SLA and an SLO?

An SLA is the external promise to customers, usually carrying financial penalties; an SLO is the internal target your team operates to. Set the SLO tighter than the SLA so the gap between them acts as both your error budget and your early warning. A common pattern: SLA 99.5% (216 min/month allowed), SLO 99.9% (43.2 min/month target). As you burn the error budget between them, freeze risky releases before you come close to breaching the contract.

Do AWS service credits cover what I owe customers when I breach my SLA?

No, and assuming they do is a dangerous modeling error. AWS SLA credits (the Compute SLA pays 10%-100% of the affected service bill) reimburse a fraction of your AWS spend, not your customers' losses and not your SLA penalty payouts, which are typically orders of magnitude larger. You must request AWS credits via Support, and they apply as future bill credits. Budget your own SLA penalties separately from AWS credits.

When should we NOT offer a high uptime SLA?

Avoid promising 99.9%+ before you can measure your real availability, before your critical path is redundant enough to support the number, and before you have an on-call rotation that can actually hit it. Offering a 99.99% SLA on a single-Region, single-RDS-instance architecture is writing a check the architecture cannot cash. Lower the promise (99.5%), deliver better (99.8%), and earn the right to raise it with evidence.

How do I measure availability for SLA reporting?

Define the measurement before the contract: what counts as "down" (error rate threshold, synthetic probe failures, or successful-request ratio), the measurement window (a 30-day vs a 30.44-day month changes the minute budget: 43.2 vs about 43.8 minutes at 99.9%), and exclusions (scheduled maintenance, customer-caused errors, force majeure). Measure with synthetic canaries plus real success-ratio SLIs. Do not rely on infrastructure up/down alone, because a healthy EC2 instance serving 500s is still down to the customer.
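One way to pin the definition down is to write it as code before it goes into the contract. The sketch below is a minimal illustration, not a prescribed method: the per-minute granularity, the 5% error-ratio threshold, the `Minute` record, and the rule that a minute with no traffic counts as up are all assumptions you would replace with your own contract terms.

```python
from dataclasses import dataclass

@dataclass
class Minute:
    total: int              # requests received in this minute
    failed: int             # requests that failed from the customer's view (e.g. 5xx)
    excluded: bool = False  # inside an announced, contractually excluded window

def is_down(minute: Minute, max_error_ratio: float = 0.05) -> bool:
    """A minute is 'down' when its error ratio exceeds the threshold (assumed 5%)."""
    if minute.total == 0:
        return False  # assumption: no traffic is treated as up
    return minute.failed / minute.total > max_error_ratio

def availability(minutes: list[Minute], max_error_ratio: float = 0.05) -> float:
    """Share of counted (non-excluded) minutes that were not down."""
    counted = [m for m in minutes if not m.excluded]
    if not counted:
        return 1.0
    down = sum(is_down(m, max_error_ratio) for m in counted)
    return 1 - down / len(counted)

# 97 healthy minutes, 2 minutes at 50% errors, 1 excluded maintenance minute
sample = [Minute(100, 0)] * 97 + [Minute(100, 50)] * 2 + [Minute(100, 100, excluded=True)]
print(f"availability: {availability(sample):.4%}")
```

Because the measure is driven by failed requests rather than instance health, an instance that is up but serving 500s still counts as downtime.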

What could go wrong if maintenance windows are not excluded?

If your SLA does not carve out scheduled maintenance, every planned deploy or patch eats your error budget and can trigger penalty clauses for downtime you controlled and announced. Define a maintenance window exclusion (with advance notice requirements) in the contract, or commit to zero-downtime deploys. Teams that skip this clause routinely breach their own SLA during routine patching.

Customer-Facing SLA on AWS 2026: SLO + Error Budgets

Designing a Customer-Facing SLA on AWS (2026): SLO Error Budgets and the Composite-Availability Math Most Teams Skip

Quick summary: A stack of ALB + EC2 + RDS Multi-AZ + S3 composes to ~99.83% availability—so promising customers 99.9% is a check you cannot cash. This guide does the composition math, converts it to an error budget (99.9% = 43.2 min/month), and shows why AWS service credits never fund your SLA penalties.

Key Takeaways

A stack of ALB + EC2 + RDS Multi-AZ + S3 composes to ~99.83% availability, so promising customers 99.9% is a check you cannot cash.
This guide does the composition math, converts it to an error budget (99.9% = 43.2 min/month), and shows why AWS service credits never fund your SLA penalties.

As of May 2026, the Amazon Compute SLA commits 99.99% monthly uptime at the Region level (multi-AZ) and 99.5% for a single instance, with service credits from 10% to 100% of the affected bill. Teams read “99.99%,” put it in a customer contract, and quietly sign up to lose money—because their service is not 99.99%, even though EC2 is.

Symptom → mechanism → AWS control

Production symptom	Mechanism	AWS control
SLA missed despite green dashboards	Measuring infra not user journeys	CloudWatch Synthetics canaries, Application Signals SLOs
Error budget blown silently	No burn-rate alerting	AMP/Grafana burn-rate alerts, PagerDuty on 14.4x burn
Composite SLA math wrong	Multiplicative not additive	Document effective availability = ∏ component availabilities

Opinionated take: Publish composite SLA math in your customer contract and measure with Synthetics canaries—not ALB healthy host count.
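The 14.4x figure in the table is easier to reason about with the arithmetic written out. The sketch below assumes the common definition of burn rate (observed error ratio divided by the error ratio the SLO allows) and a 30-day SLO window; the function names are illustrative and not part of any AWS or PagerDuty API.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_ratio / (1 - slo)

def budget_consumed(rate: float, alert_window_hours: float,
                    slo_window_hours: float = 30 * 24) -> float:
    """Fraction of the whole window's error budget used at this burn rate."""
    return rate * alert_window_hours / slo_window_hours

# A 1.44% error ratio against a 99.9% SLO burns at roughly 14.4x
rate = burn_rate(error_ratio=0.0144, slo=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget used in 1 hour: {budget_consumed(rate, 1):.1%}")
```

At 14.4x, one hour of sustained errors uses about 2% of a 30-day budget, which is why that threshold is often used for paging alerts.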

This guide is for engineering and product leaders writing a customer-facing SLA on AWS. It does the composite-availability math that turns component SLAs into your real ceiling, converts that into an SLO error budget, and explains why AWS service credits never fund your SLA penalties. Use it with the SLO error-budget worksheet.

Benchmark pattern (not a cited client) — A composite B2B SaaS wanted to advertise 99.9%. Critical path: ALB → EC2 (multi-AZ) → RDS Multi-AZ → S3. Composed availability worked out to ~99.83% (~73 min/month of expected dependency downtime), below the 99.9% they planned to promise (43.2 min/month). They shipped a 99.5% SLA, ran an internal 99.9% SLO, and used the gap as their release error budget. Math is reproducible in the worksheet.

Step 1: Your SLA ceiling is the product of your dependencies

Availability composes multiplicatively along the critical request path. A request that must touch four services is only as available as all four together:

ALB        0.9999
EC2 (m-AZ) 0.9999
RDS (m-AZ) 0.9995
S3         0.9990
-----------------------------------
composite  0.99830  ≈ 99.83%

So the published 99.99% EC2 number is irrelevant to your promise once RDS (99.95%) and S3 (99.9%) sit on the same path. Opinionated take: compute your composite floor before sales writes a number on a slide. Promising 99.9% on a 99.83% stack is a guaranteed penalty, not an aspiration.
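The multiplication above is simple enough to keep as a script next to the contract template. This is a minimal sketch using the SLA figures quoted in this guide; the `CRITICAL_PATH` dictionary and function name are illustrative, and the numbers should be checked against each service's current SLA page.

```python
from math import prod

# Published SLA figures as quoted in this guide (verify before relying on them)
CRITICAL_PATH = {
    "ALB": 0.9999,
    "EC2 (multi-AZ)": 0.9999,
    "RDS (Multi-AZ)": 0.9995,
    "S3": 0.9990,
}

def composite_availability(path: dict[str, float]) -> float:
    """Multiply the availability of every serial dependency on the path."""
    return prod(path.values())

composite = composite_availability(CRITICAL_PATH)
print(f"composite floor: {composite:.4%}")  # composite floor: 99.8301%
```

Adding a dependency to the dictionary can only lower the result, which is the point: every extra hop on the critical path costs availability.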

Step 2: Convert availability to a downtime budget

Contracts are written in nines; on-call lives in minutes. Translate (30-day month):

Availability	Downtime / month	Downtime / year
99.5%	216 min (3.6 h)	1.83 days
99.9%	43.2 min	8.77 h
99.95%	21.6 min	4.38 h
99.99%	4.32 min	52.6 min

4.32 minutes per month is one bad deploy. If your release process can’t reliably beat that, do not promise 99.99%.
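The table can be regenerated for any target or window length with a few lines. This sketch assumes downtime is measured in whole-window minutes; the `window_days` parameter is there because the contract's definition of a month (30 vs 30.44 days) changes the result.

```python
MINUTES_PER_DAY = 24 * 60

def downtime_budget_minutes(availability: float, window_days: float = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability target."""
    return (1 - availability) * window_days * MINUTES_PER_DAY

for target in (0.995, 0.999, 0.9995, 0.9999):
    monthly = downtime_budget_minutes(target)
    yearly = downtime_budget_minutes(target, window_days=365.25)
    print(f"{target:.2%}: {monthly:7.2f} min/month, {yearly / 60:6.2f} h/year")
```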

Step 3: Set the SLO tighter than the SLA

SLA = external promise (with penalties). Set it below measured capability.
SLO = internal target. Set it above the SLA.
Error budget = the gap. Burn it on releases; freeze changes when it’s gone.

Example: SLA 99.5% (216 min/mo), SLO 99.9% (43.2 min/mo). The ~173-minute gap is the monthly budget you spend on shipping risk before the contract is in danger. This is standard SRE practice: operationalize it with CloudWatch metrics, logs, and alarms, plus synthetic canaries for the SLI.
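To track where a month stands against both numbers, a small report function is enough. The sketch below follows this guide's framing of the budget as the gap between the SLO and SLA allowances; the three state labels and the `budget_report` name are illustrative assumptions, and the point at which you freeze releases is a policy choice the code does not make for you.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day month

def allowed_minutes(target: float) -> float:
    """Downtime minutes a target permits over the window."""
    return (1 - target) * WINDOW_MINUTES

def budget_report(sla: float, slo: float, downtime_min: float) -> dict:
    """Compare observed downtime with the SLO allowance, the SLA allowance, and the gap."""
    slo_allow = allowed_minutes(slo)
    sla_allow = allowed_minutes(sla)
    gap = sla_allow - slo_allow
    # Downtime beyond the SLO allowance eats into the gap, capped at the gap size
    gap_used = min(max(downtime_min - slo_allow, 0.0), gap)
    if downtime_min <= slo_allow:
        state = "within SLO"
    elif downtime_min <= sla_allow:
        state = "SLO missed, SLA intact"
    else:
        state = "SLA breached"
    return {"gap_min": gap, "gap_remaining_min": gap - gap_used, "state": state}

report = budget_report(sla=0.995, slo=0.999, downtime_min=60.0)
print(report["state"], f"{report['gap_remaining_min']:.1f} min of gap left")
```

With 60 minutes of downtime, the SLO is already missed and about 156 of the ~173 gap minutes remain before the contract is breached.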

Step 4: Raise the ceiling only when the math supports it

Three real levers, in order of cost:

Remove the weak dependency from the path — make S3 reads cached/async so an S3 blip degrades instead of fails. Cheapest.
Add redundancy — RDS read replicas, multi-Region active-passive, graceful degradation. See DR strategies and resilience patterns.
Lower the promise — promise 99.5%, deliver 99.8%, keep the goodwill and the budget.
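The first two levers can be compared on paper before anyone builds anything. This is a rough sketch using the SLA figures from Step 1; the `redundant` formula assumes the copies fail independently, which real deployments only approximate, so treat its output as an optimistic bound.

```python
from math import prod

def serial(*availabilities: float) -> float:
    """Dependencies that must all work: multiply."""
    return prod(availabilities)

def redundant(availability: float, copies: int = 2) -> float:
    """Any one copy is enough. Assumes independent failures (optimistic)."""
    return 1 - (1 - availability) ** copies

baseline = serial(0.9999, 0.9999, 0.9995, 0.9990)
without_s3 = serial(0.9999, 0.9999, 0.9995)                    # S3 off the critical path
redundant_db = serial(0.9999, 0.9999, redundant(0.9995), 0.9990)  # second database copy

for label, value in [("baseline", baseline),
                     ("S3 removed from path", without_s3),
                     ("redundant database", redundant_db)]:
    print(f"{label:22s} {value:.4%}")
```

Under these assumptions, taking S3 off the path lifts the floor to about 99.93%, while doubling up the database leaves it near 99.88% because S3 is still the weakest link.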

What broke: A team advertised 99.9% on a single-Region stack with one RDS instance (not Multi-AZ). A 25-minute RDS outage during a minor-version upgrade put them at 99.94% for the month, inside their SLA by minutes, but only by luck. The next month a scheduled patch (no maintenance exclusion in the contract) burned 38 minutes and a customer claimed a penalty for announced downtime. Two fixes: added the maintenance-window exclusion clause, and moved RDS to Multi-AZ before re-confirming the 99.9%. The contract clause failure cost more than the architecture gap.
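The numbers in this incident are easy to check. The sketch below assumes a 30-day month, that the downtime quoted was the only downtime in each month, and that an exclusion clause removes maintenance minutes from the numerator only; your contract may define the calculation differently.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day month

def monthly_availability(unplanned_min: float, maintenance_min: float = 0.0,
                         maintenance_excluded: bool = False) -> float:
    """Availability for the month, optionally ignoring announced maintenance."""
    counted = unplanned_min + (0.0 if maintenance_excluded else maintenance_min)
    return 1 - counted / WINDOW_MINUTES

month_1 = monthly_availability(unplanned_min=25)
month_2_no_clause = monthly_availability(0, maintenance_min=38)
month_2_with_clause = monthly_availability(0, maintenance_min=38, maintenance_excluded=True)

print(f"month 1 (25 min incident):       {month_1:.2%}")
print(f"month 2 without exclusion:       {month_2_no_clause:.2%}")
print(f"month 2 with exclusion clause:   {month_2_with_clause:.2%}")
```

Both months sit within a few minutes of the 43.2-minute allowance for 99.9%, which is how thin the margin was.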

Step 5: Do not model AWS credits as your penalty fund

AWS SLA credits reimburse a slice of your AWS bill, not your customers’ losses or your SLA payouts. The Compute SLA pays 10%–100% of the affected service charge, requested via Support, applied as future credits. Your customer SLA penalty is a business cost you fund yourself—keep the two budgets entirely separate. Conflating them is how a “covered” outage becomes an unbudgeted loss.
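Putting the two budgets side by side makes the mismatch concrete. Every figure below is hypothetical and chosen only for illustration: the affected AWS bill, the credit percentage, the number of customers, their contract values, and the penalty percentage all come from your own bill and contracts, not from this guide.

```python
def aws_credit(affected_service_bill: float, credit_pct: float) -> float:
    """Credit applied to a future AWS bill for the affected service."""
    return affected_service_bill * credit_pct

def customer_penalties(monthly_contract_values: list[float], penalty_pct: float) -> float:
    """Total payout owed to customers under your own SLA."""
    return sum(value * penalty_pct for value in monthly_contract_values)

# Hypothetical inputs: $8,000 affected AWS charge at a 10% credit,
# 40 customers paying $5,000/month with a 10% SLA penalty each
credit = aws_credit(affected_service_bill=8_000.0, credit_pct=0.10)
penalties = customer_penalties([5_000.0] * 40, penalty_pct=0.10)

print(f"AWS credit (future bill): ${credit:,.0f}")
print(f"Customer penalties owed:  ${penalties:,.0f}")
print(f"Penalties are {penalties / credit:.0f}x the credit")
```

The credit scales with what you pay AWS, while the penalty scales with what customers pay you, so the two should be tracked as separate lines.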

AWS services map

Need	Service	Skip when
User-journey SLO measurement	CloudWatch Synthetics + Application Signals	Internal tooling with no customer SLA
Error budget tracking	AMP + Grafana SLO dashboards	Pre-PMF with no availability commitment
SLA reporting	AWS Health Dashboard + custom CUR metrics	Best-effort internal services

What to do this week

Map your critical request path and list every AWS dependency on it.
Multiply their SLAs in the worksheet to get your composite floor.
Set the customer SLA below the floor and the internal SLO above the SLA.
Add measurement definition and a maintenance-window exclusion to the contract template before the next deal.

Reproduce this — Clone the SLO error-budget worksheet. Fill in your critical-path dependencies and their current published SLAs (verify against the Amazon Compute SLA and each service’s SLA page), multiply, and read your real ceiling and error budget.

What this post doesn’t cover

AWS’s own service SLAs in detail—always read the per-service SLA pages for current terms.
DR architecture depth (pilot light / warm standby / multi-site).
Full Well-Architected Reliability pillar (six pillars explained).
Incident response runbooks and post-incident review process — see security incident response runbooks (2026) and chaos engineering with FIS for validating SLOs before customers do.
Regulated availability obligations (e.g. DORA operational-resilience testing).

If you only do one thing: Multiply your critical-path dependency SLAs this week. If the product is below the number in your contracts, you have a problem to fix before your customers find it.

Recommended Reading

The 10 AWS Announcements That Matter for Enterprise Teams (Q2 2026)

Production Resilience on AWS: Timeouts, Retries With Jitter, Circuit Limits, and Graceful Shutdown

From One FIS Experiment to a Resilience Program (2026): AWS Fault Injection Service, Stop Conditions, and GameDays That Actually Change Behavior

Observability Beyond CloudWatch (2026): When to Add Application Signals, ADOT, Managed Prometheus, and Grafana — and When Not To
