Designing a Customer-Facing SLA on AWS (2026): SLO Error Budgets and the Composite-Availability Math Most Teams Skip

Quick summary: A stack of ALB + EC2 + RDS Multi-AZ + S3 composes to ~99.83% availability—so promising customers 99.9% is a check you cannot cash. This guide does the composition math, converts it to an error budget (99.9% = 43.2 min/month), and shows why AWS service credits never fund your SLA penalties.


As of May 2026, the Amazon Compute SLA commits 99.99% monthly uptime at the Region level (multi-AZ) and 99.5% for a single instance, with service credits from 10% to 100% of the affected bill. Teams read “99.99%,” put it in a customer contract, and quietly sign up to lose money—because their service is not 99.99%, even though EC2 is.

This guide is for engineering and product leaders writing a customer-facing SLA on AWS. It does the composite-availability math that turns component SLAs into your real ceiling, converts that into an SLO error budget, and explains why AWS service credits never fund your SLA penalties. Use it with the SLO error-budget worksheet.

Benchmark pattern (not a cited client) — A composite B2B SaaS wanted to advertise 99.9%. Critical path: ALB → EC2 (multi-AZ) → RDS Multi-AZ → S3. Composed availability worked out to ~99.83% (~73 min/month of expected dependency downtime), below the 99.9% they planned to promise (43.2 min/month). They shipped a 99.5% SLA, ran an internal 99.9% SLO, and used the gap as their release error budget. Math is reproducible in the worksheet.

Step 1: Your SLA ceiling is the product of your dependencies

Availability composes multiplicatively along the critical request path. A request that must touch four services is only as available as all four together:

ALB        0.9999
EC2 (m-AZ) 0.9999
RDS (m-AZ) 0.9995
S3         0.9990
-----------------------------------
composite  0.99830  ≈ 99.83%

So the published 99.99% EC2 number is irrelevant to your promise once RDS (99.95%) and S3 (99.9%) sit on the same path. Opinionated take: compute your composite floor before sales writes a number on a slide. Promising 99.9% on a 99.83% stack is a guaranteed penalty, not an aspiration.
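The multiplication above is easy to reproduce. The following is a minimal sketch, assuming the four SLA figures from the table and treating the dependencies as strictly serial; the names `critical_path` and `composite_availability` are illustrative, not part of any AWS tooling.

```python
from math import prod

# Published SLA figures from the table above, expressed as fractions.
critical_path = {
    "ALB": 0.9999,
    "EC2 (multi-AZ)": 0.9999,
    "RDS (Multi-AZ)": 0.9995,
    "S3": 0.9990,
}

def composite_availability(availabilities):
    """Serial composition: every hop must be up for the request to succeed."""
    return prod(availabilities)

composite = composite_availability(critical_path.values())
print(f"composite floor: {composite:.4%}")             # 99.8301%
print("supports a 99.9% promise:", composite >= 0.999)  # False
```

Swap in your own dependencies and their current published SLAs; the product is the ceiling, not a forecast of what you will actually measure.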

Step 2: Convert availability to a downtime budget

Contracts are written in nines; on-call lives in minutes. Translate (30-day month):

Availability   Downtime / month   Downtime / year
99.5%          216 min (3.6 h)    1.83 days
99.9%          43.2 min           8.77 h
99.95%         21.6 min           4.38 h
99.99%         4.32 min           52.6 min

4.32 minutes per month is one bad deploy. If your release process can’t reliably beat that, do not promise 99.99%.
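The conversion is one line of arithmetic. Here is a small sketch, assuming a 30-day month as stated above and a 365.25-day year (an assumption that appears to match the yearly figures in the table); the function names are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60      # 43,200 (30-day month)
MINUTES_PER_YEAR = 365.25 * 24 * 60   # assumption: average year length

def downtime_minutes_per_month(availability_pct):
    """Allowed downtime per 30-day month at the given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_MONTH

def downtime_minutes_per_year(availability_pct):
    """Allowed downtime per year at the given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for target in (99.5, 99.9, 99.95, 99.99):
    month = downtime_minutes_per_month(target)
    year_hours = downtime_minutes_per_year(target) / 60
    print(f"{target}%: {month:.2f} min/month, {year_hours:.2f} h/year")
```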

Step 3: Set the SLO tighter than the SLA

  • SLA = external promise (with penalties). Set it below measured capability.
  • SLO = internal target. Set it above the SLA.
  • Error budget = the gap. Burn it on releases; freeze changes when it’s gone.

Example: SLA 99.5% (216 min/mo), SLO 99.9% (43.2 min/mo). The ~173-minute gap is the monthly budget you spend on shipping risk before the contract is in danger. This is standard SRE practice. Operationalize it with CloudWatch metrics, logs, and alarms, plus synthetic canaries to measure the SLI.
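The example numbers can be checked with a short script. This is a sketch under the 30-day-month assumption used earlier; `budget_report` and its field names are hypothetical, and where exactly you freeze changes is a policy choice the script does not make for you.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 30-day month

def allowed_downtime_minutes(availability_pct):
    """Minutes of downtime per month permitted at a given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_MONTH

SLA_PCT = 99.5  # external promise, with penalties
SLO_PCT = 99.9  # internal target

sla_minutes = allowed_downtime_minutes(SLA_PCT)  # about 216
slo_minutes = allowed_downtime_minutes(SLO_PCT)  # about 43.2
gap_minutes = sla_minutes - slo_minutes          # about 172.8

def budget_report(downtime_so_far):
    """Summarize where this month's measured downtime sits against both targets."""
    return {
        "slo_met": downtime_so_far <= slo_minutes,
        "minutes_over_slo": round(max(0.0, downtime_so_far - slo_minutes), 1),
        "sla_headroom_minutes": round(sla_minutes - downtime_so_far, 1),
    }

print(round(gap_minutes, 1))  # 172.8
print(budget_report(60))      # hypothetical month with 60 minutes of downtime
```

In the hypothetical 60-minute month, the SLO is already missed while 156 minutes of headroom remain before the SLA itself is breached.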

Step 4: Raise the ceiling only when the math supports it

Three real levers, in order of cost:

  1. Remove the weak dependency from the path — make S3 reads cached/async so an S3 blip degrades instead of fails. Cheapest.
  2. Add redundancy — RDS read replicas, multi-Region active-passive, graceful degradation. See DR strategies and resilience patterns.
  3. Lower the promise — promise 99.5%, deliver 99.8%, keep the goodwill and the budget.
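The first two levers can be put into numbers. The sketch below reuses the Step 1 figures. The `parallel` helper is an assumption-heavy idealization: it treats copies as failing independently with instant, perfect failover, which real multi-Region setups do not achieve, so read its result as an upper bound rather than a target.

```python
from math import prod

full_path = {
    "ALB": 0.9999,
    "EC2 (multi-AZ)": 0.9999,
    "RDS (Multi-AZ)": 0.9995,
    "S3": 0.9990,
}

def serial(availabilities):
    """All dependencies must be up at once."""
    return prod(availabilities)

def parallel(availability, copies):
    """Idealized redundancy: independent failures, instant failover (optimistic)."""
    return 1 - (1 - availability) ** copies

before = serial(full_path.values())

# Lever 1: S3 reads are cached/async, so S3 leaves the hard critical path.
without_s3 = serial(a for name, a in full_path.items() if name != "S3")

# Lever 2: two copies of the whole stack, under the idealized assumptions above.
two_copies = parallel(before, 2)

print(f"original path:     {before:.4%}")      # 99.8301%
print(f"S3 off the path:   {without_s3:.4%}")  # 99.9300%
print(f"two copies (best): {two_copies:.4%}")
```

Taking S3 off the hard path lifts the ceiling from about 99.83% to about 99.93%, which is the first point at which a 99.9% promise is even arithmetically possible.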

What broke: A team advertised 99.9% on a single-Region stack with one RDS instance (not Multi-AZ). A 25-minute RDS outage during a minor-version upgrade put them at 99.94% for the month: inside their SLA by about 18 minutes, but only by luck. The next month a scheduled patch (no maintenance exclusion in the contract) burned 38 minutes, and a customer claimed a penalty for announced downtime. Two fixes: they added the maintenance-window exclusion clause, and they moved RDS to Multi-AZ before re-confirming the 99.9%. The contract clause failure cost more than the architecture gap.
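How much the exclusion clause matters shows up directly in the measurement. This is a rough sketch using the two incidents above on a 30-day month; the `(minutes, announced)` tuple format is hypothetical, and treating the first month's upgrade outage as unannounced is an assumption.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 30-day month

def monthly_availability_pct(incidents, exclude_announced_maintenance):
    """incidents: list of (downtime_minutes, was_announced_maintenance)."""
    counted = sum(
        minutes
        for minutes, announced in incidents
        if not (announced and exclude_announced_maintenance)
    )
    return 100 * (1 - counted / MINUTES_PER_MONTH)

month_1 = [(25, False)]  # upgrade outage; assumed unannounced for illustration
month_2 = [(38, True)]   # scheduled, announced patch

print(f"{monthly_availability_pct(month_1, False):.2f}%")  # 99.94%
print(f"{monthly_availability_pct(month_2, False):.2f}%")  # 99.91%
print(f"{monthly_availability_pct(month_2, True):.2f}%")   # 100.00%
```

With the exclusion in the contract, the second month's 38 minutes never enter the measured figure at all.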

Step 5: Do not model AWS credits as your penalty fund

AWS SLA credits reimburse a slice of your AWS bill, not your customers’ losses or your SLA payouts. The Compute SLA pays 10%–100% of the affected service charge, requested via Support, applied as future credits. Your customer SLA penalty is a business cost you fund yourself—keep the two budgets entirely separate. Conflating them is how a “covered” outage becomes an unbudgeted loss.
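A back-of-the-envelope comparison makes the mismatch concrete. Every dollar figure and percentage below is hypothetical, chosen only to illustrate the shape of the problem; substitute your own affected AWS charges and contract terms.

```python
def aws_service_credit(affected_service_charge, credit_pct):
    """Credit against a future AWS bill, as a share of the affected service charge."""
    return affected_service_charge * credit_pct / 100

def customer_sla_penalty(customer_monthly_fees, penalty_pct):
    """Payout owed to customers under your own SLA."""
    return customer_monthly_fees * penalty_pct / 100

# Hypothetical inputs, not taken from any real bill or contract.
credit = aws_service_credit(affected_service_charge=8_000, credit_pct=10)
penalty = customer_sla_penalty(customer_monthly_fees=120_000, penalty_pct=10)
shortfall = penalty - credit

print(f"AWS credit:       ${credit:,.0f}")     # $800
print(f"Customer penalty: ${penalty:,.0f}")    # $12,000
print(f"Unfunded:         ${shortfall:,.0f}")  # $11,200
```

The credit scales with what you pay AWS for the affected service; the penalty scales with what customers pay you. The two bases are unrelated, which is why they belong in separate budgets.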

What to do this week

  1. Map your critical request path and list every AWS dependency on it.
  2. Multiply their SLAs in the worksheet to get your composite floor.
  3. Set the customer SLA below the floor and an internal SLO above it.
  4. Add a measurement definition and a maintenance-window exclusion to the contract template before the next deal.

Reproduce this — Clone the SLO error-budget worksheet. Fill in your critical-path dependencies and their current published SLAs (verify against the Amazon Compute SLA and each service’s SLA page), multiply, and read your real ceiling and error budget.



If you only do one thing: Multiply your critical-path dependency SLAs this week. If the product is below the number in your contracts, you have a problem to fix before your customers find it.

Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.
