---
title: Designing a Customer-Facing SLA on AWS (2026): SLO Error Budgets and the Composite-Availability Math Most Teams Skip
description: A stack of ALB + EC2 + RDS Multi-AZ + S3 composes to ~99.83% availability—so promising customers 99.9% is a check you cannot cash. This guide does the composition math, converts it to an error budget (99.9% = 43.2 min/month), and shows why AWS service credits never fund your SLA penalties.
url: https://www.factualminds.com/blog/customer-facing-sla-slo-design-aws/
datePublished: 2026-05-31T00:00:00.000Z
dateModified: 2026-05-31T00:00:00.000Z
author: Palaniappan P
category: Cloud Architecture
tags: reliability, sla, slo, well-architected, observability, aws
---

# Designing a Customer-Facing SLA on AWS (2026): SLO Error Budgets and the Composite-Availability Math Most Teams Skip

> A stack of ALB + EC2 + RDS Multi-AZ + S3 composes to ~99.83% availability—so promising customers 99.9% is a check you cannot cash. This guide does the composition math, converts it to an error budget (99.9% = 43.2 min/month), and shows why AWS service credits never fund your SLA penalties.

**As of May 2026, the [Amazon Compute SLA](https://aws.amazon.com/compute/sla/) commits 99.99% monthly uptime at the Region level (multi-AZ) and 99.5% for a single instance**, with service credits from 10% to 100% of the affected bill. Teams read "99.99%," put it in a customer contract, and quietly sign up to lose money—because their *service* is not 99.99%, even though EC2 is.

This guide is for engineering and product leaders writing a **customer-facing SLA** on AWS. It does the **composite-availability math** that turns component SLAs into your real ceiling, converts that into an **SLO error budget**, and explains why **AWS service credits never fund your SLA penalties**. Use it with the [SLO error-budget worksheet](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/sla-slo-design/slo-error-budget-worksheet.md).

> **Benchmark pattern (not a cited client)** — A composite B2B SaaS wanted to advertise **99.9%**. Critical path: ALB → EC2 (multi-AZ) → RDS Multi-AZ → S3. Composed availability worked out to **~99.83%** (~73 min/month of expected dependency downtime), *below* the 99.9% they planned to promise (43.2 min/month). They shipped a **99.5%** SLA, ran an internal **99.9%** SLO, and used the gap as their release error budget. Math is reproducible in the [worksheet](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/sla-slo-design/slo-error-budget-worksheet.md).

## Step 1: Your SLA ceiling is the product of your dependencies

Availability composes multiplicatively along the **critical request path**. A request that must touch four services is only as available as all four together, so (treating their failures as independent) the individual availabilities multiply:

```
ALB        0.9999
EC2 (m-AZ) 0.9999
RDS (m-AZ) 0.9995
S3         0.9990
-----------------------------------
composite  0.99830  ≈ 99.83%
```

So the published 99.99% EC2 number is irrelevant to your promise once RDS (99.95%) and S3 (99.9%) sit on the same path. **Opinionated take:** compute your composite floor *before* sales writes a number on a slide. That floor is the availability your dependencies' SLAs jointly commit to, which makes it the ceiling on what you can safely promise. Promising 99.9% on a 99.83% stack is a guaranteed penalty, not an aspiration.
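
The multiplication above can be checked in a few lines. This is a minimal sketch, not a modelling tool: the SLA figures are the ones used in this article, the function names are illustrative, and it assumes every dependency is strictly required on the request path and fails independently of the others.

```python
from math import prod

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200

# SLA commitments as quoted in this article; verify against the current SLA pages.
critical_path = {
    "ALB": 0.9999,
    "EC2 (multi-AZ)": 0.9999,
    "RDS (Multi-AZ)": 0.9995,
    "S3": 0.9990,
}


def composite_availability(availabilities):
    """Serial composition: every dependency must be up for a request to succeed."""
    return prod(availabilities)


def expected_downtime_minutes(availability, minutes=MINUTES_PER_30_DAY_MONTH):
    """Minutes of downtime implied by an availability fraction over one window."""
    return (1.0 - availability) * minutes


composite = composite_availability(critical_path.values())
print(f"composite: {composite:.5f} ({composite:.2%})")
print(f"implied downtime: {expected_downtime_minutes(composite):.1f} min/month")
```

Adding or removing an entry in `critical_path` shows how much each dependency moves the composite figure, which is a useful check before anyone commits to a number.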

## Step 2: Convert availability to a downtime budget

Contracts are written in nines; on-call lives in minutes. Translate (30-day month, 365.25-day year):

| Availability | Downtime / month | Downtime / year |
| ------------ | ---------------- | --------------- |
| 99.5%   | 216 min (3.6 h) | 1.83 days |
| 99.9%   | 43.2 min        | 8.77 h    |
| 99.95%  | 21.6 min        | 4.38 h    |
| 99.99%  | 4.32 min        | 52.6 min  |

4.32 minutes per month is *one* bad deploy. If your release process can't reliably beat that, do not promise 99.99%.
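
The conversion behind the table is a single multiplication. The sketch below reproduces it under the same assumptions as the table (a 30-day month and a 365.25-day year); `downtime_minutes` is an illustrative helper name, not part of any library.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_pct, days):
    """Allowed downtime in minutes for an availability percentage over `days` days."""
    return (100.0 - availability_pct) / 100.0 * days * MINUTES_PER_DAY


for pct in (99.5, 99.9, 99.95, 99.99):
    per_month = downtime_minutes(pct, days=30)
    per_year = downtime_minutes(pct, days=365.25)
    print(f"{pct:>6}%  {per_month:7.2f} min/month  {per_year / 60:6.2f} h/year")
```

Passing a different `days` value shows how the budget shifts if your contract defines the month differently.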

## Step 3: Set the SLO tighter than the SLA

- **SLA** = external promise (with penalties). Set it **below** measured capability.
- **SLO** = internal target. Set it **above** the SLA.
- **Error budget** = the gap. Burn it on releases; freeze changes when it's gone.

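Example: SLA 99.5% (216 min/mo), SLO 99.9% (43.2 min/mo). The ~173-minute gap (216 minus 43.2 is 172.8) is the monthly budget you spend on shipping risk before the contract is in danger. SRE literature usually defines the error budget as 100% minus the SLO (43.2 min here); this guide budgets against the SLO-to-SLA gap because that is the margin standing between you and a penalty. Setting the SLO tighter than the SLA is standard SRE practice. Operationalize it with [CloudWatch metrics, logs, and alarms](/blog/aws-cloudwatch-observability-metrics-logs-alarms-best-practices/) and synthetic canaries for the SLI.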
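
One way to track the gap described above is a small report function. This is a sketch under stated assumptions: `budget_report` and its field names are hypothetical, downtime inside the SLO is treated as not touching the gap, and `freeze_fraction` is an invented policy knob (the text only says to freeze once the budget is spent).

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def allowed_downtime(target_pct, minutes=MINUTES_PER_30_DAY_MONTH):
    """Downtime minutes a target percentage permits in one window."""
    return (100.0 - target_pct) / 100.0 * minutes


def budget_report(downtime_so_far, sla_pct=99.5, slo_pct=99.9, freeze_fraction=0.5):
    """Summarise how much of the SLO-to-SLA gap this month's downtime has used.

    freeze_fraction is a hypothetical policy choice: freeze releases once
    this share of the gap has been consumed.
    """
    slo_allowance = allowed_downtime(slo_pct)   # about 43.2 min at 99.9%
    sla_allowance = allowed_downtime(sla_pct)   # about 216 min at 99.5%
    gap = sla_allowance - slo_allowance         # about 172.8 min
    gap_used = min(max(downtime_so_far - slo_allowance, 0.0), gap)
    return {
        "slo_met": downtime_so_far <= slo_allowance,
        "gap_minutes": gap,
        "gap_remaining": gap - gap_used,
        "freeze_releases": gap_used >= freeze_fraction * gap,
        "sla_breached": downtime_so_far > sla_allowance,
    }


print(budget_report(downtime_so_far=30))    # inside the SLO, gap untouched
print(budget_report(downtime_so_far=150))   # SLO missed, past the freeze line
```

Setting `freeze_fraction=1.0` matches the literal rule of freezing only when the gap is gone; a lower value leaves room for an incident during the freeze itself.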

## Step 4: Raise the ceiling only when the math supports it

Three real levers, in order of cost:

1. **Remove the weak dependency from the path** — make S3 reads cached/async so an S3 blip degrades instead of fails. Cheapest.
2. **Add redundancy** — RDS read replicas, multi-Region active-passive, graceful degradation. See [DR strategies](/blog/aws-disaster-recovery-strategies-pilot-light-warm-standby-multi-site/) and [resilience patterns](/blog/aws-resilience-retries-circuits-graceful-shutdown/).
3. **Lower the promise** — promise 99.5%, deliver 99.8%, keep the goodwill and the budget.
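
The first two levers can be expressed with the same arithmetic as Step 1. The sketch below is an idealised upper bound, not a forecast: `serial` and `parallel` are illustrative helpers, the second-Region figure assumes fully independent Regions with instant, perfect failover, and the third lever is not modelled because it changes the promise rather than the availability.

```python
from math import prod


def serial(*availabilities):
    """All components required: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one copy suffices: the path fails only if every copy fails."""
    return 1.0 - prod(1.0 - a for a in availabilities)


ALB, EC2, RDS, S3 = 0.9999, 0.9999, 0.9995, 0.9990

baseline = serial(ALB, EC2, RDS, S3)
without_s3 = serial(ALB, EC2, RDS)          # lever 1: S3 off the critical path
two_regions = parallel(baseline, baseline)  # lever 2: idealised second Region

for label, value in [("baseline", baseline),
                     ("S3 off critical path", without_s3),
                     ("two independent Regions", two_regions)]:
    print(f"{label:<26} {value:.6f}")
```

Real active-passive setups land below the `parallel` figure because failover takes time and the routing layer is itself a dependency, so treat that number as a bound on what redundancy can buy.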

> **What broke:** A team advertised 99.9% on a single-Region stack with one RDS instance (not Multi-AZ). A 25-minute RDS outage during a minor-version upgrade (a single instance has no standby to fail over to) put them at 99.94% for the month: *inside* their 43.2-minute allowance by about 18 minutes, but only by luck. The next month a scheduled patch (no maintenance exclusion in the contract) burned 38 minutes and a customer claimed a penalty for *announced* downtime. Two fixes: added the maintenance-window exclusion clause, and moved RDS to Multi-AZ before re-confirming the 99.9%. **The contract clause failure cost more than the architecture gap.**

## Step 5: Do not model AWS credits as your penalty fund

AWS SLA credits reimburse a slice of **your AWS bill**, not your customers' losses or your SLA payouts. The [Compute SLA](https://aws.amazon.com/compute/sla/) pays 10%–100% of the *affected service* charge, requested via Support, applied as future credits. Your customer SLA penalty is a business cost you fund yourself—keep the two budgets entirely separate. Conflating them is how a "covered" outage becomes an unbudgeted loss.
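
A quick calculation makes the mismatch concrete. Everything in this sketch is illustrative: the dollar amounts and the customer penalty percentage are hypothetical, and the tier boundaries in `aws_region_credit_pct` (including the 30% middle tier) are an assumption to verify against the current Compute SLA, since this article only cites the 10% to 100% range.

```python
def aws_region_credit_pct(monthly_uptime_pct):
    """Service-credit percentage for a Region-level miss (ASSUMED schedule).

    Boundaries and the 30% tier are assumptions; check the current
    Amazon Compute SLA before relying on them.
    """
    if monthly_uptime_pct >= 99.99:
        return 0
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 30
    return 100


# Hypothetical figures for illustration only.
affected_ec2_charge = 4_000.00   # monthly EC2 spend in the affected Region (USD)
customer_mrr = 250_000.00        # monthly revenue covered by your SLA (USD)
customer_penalty_pct = 10        # penalty your contract owes on breach

measured_uptime_pct = 99.7       # roughly 130 minutes down in a 30-day month

aws_credit = affected_ec2_charge * aws_region_credit_pct(measured_uptime_pct) / 100
customer_penalty = customer_mrr * customer_penalty_pct / 100

print(f"AWS credit:       ${aws_credit:,.2f}")
print(f"Customer penalty: ${customer_penalty:,.2f}")
print(f"Unfunded gap:     ${customer_penalty - aws_credit:,.2f}")
```

The credit scales with what you pay AWS for one service, while the penalty scales with what customers pay you, which is why the two belong in separate budgets.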

## What to do this week

1. Map your **critical request path** and list every AWS dependency on it.
2. Multiply their SLAs in the [worksheet](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/sla-slo-design/slo-error-budget-worksheet.md) to get your **composite floor**.
3. Set the **customer SLA below** the floor and an **internal SLO above** the SLA.
4. Add **measurement definition** and a **maintenance-window exclusion** to the contract template before the next deal.

> **Reproduce this** — Clone the [SLO error-budget worksheet](https://bitbucket.org/baymail/factualminds-astro/src/main/examples/architecture-blog-2026/sla-slo-design/slo-error-budget-worksheet.md). Fill in your critical-path dependencies and their current published SLAs (verify against the [Amazon Compute SLA](https://aws.amazon.com/compute/sla/) and each service's SLA page), multiply, and read your real ceiling and error budget.

## What this post doesn't cover

- AWS's own service SLAs in detail—always read the [per-service SLA pages](https://aws.amazon.com/legal/service-level-agreements/) for current terms.
- DR architecture depth ([pilot light / warm standby / multi-site](/blog/aws-disaster-recovery-strategies-pilot-light-warm-standby-multi-site/)).
- Full **Well-Architected Reliability** pillar ([six pillars explained](/blog/aws-well-architected-framework-6-pillars-explained/)).
- Incident response runbooks and post-incident review process.
- Regulated availability obligations (e.g. [DORA](/blog/dora-compliance-aws-financial-services/) operational-resilience testing).

---

**Related:** [AWS managed services](/services/aws-managed-services/) · [24/7 managed support & monitoring](/blog/aws-24-7-managed-support-monitoring/) · [Well-Architected six pillars](/blog/aws-well-architected-framework-6-pillars-explained/)

**If you only do one thing:** Multiply your critical-path dependency SLAs this week. If the product is below the number in your contracts, you have a problem to fix before your customers find it.

## FAQ

### Can I promise customers the same uptime as the AWS EC2 SLA?
No. The Amazon Compute SLA commits 99.99% at the Region level (multi-AZ) and 99.5% for a single instance—but your service is only as available as every dependency on its critical request path, composed multiplicatively. A path of ALB (99.99%) × EC2 multi-AZ (99.99%) × RDS Multi-AZ (99.95%) × S3 (99.9%) composes to roughly 99.83%, not 99.99%. Promise below your composed floor, not at any single component SLA.

### What is the difference between an SLA and an SLO?
An SLA is the external promise to customers, usually carrying financial penalties; an SLO is the internal target your team operates to. Set the SLO tighter than the SLA so the gap between them is your error budget and early-warning track. A common pattern: SLA 99.5% (216 min/month allowed), SLO 99.9% (43.2 min/month target). When you burn the error budget between them, freeze risky releases before you risk breaching the contract.
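
The early-warning idea can be sketched as a simple projection. This is a deliberately naive example: `early_warning` and its return labels are hypothetical, and it assumes the rest of the month burns downtime at the same average pace as the days elapsed so far.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def projected_month_end_downtime(downtime_so_far, day_of_month, days_in_window=30):
    """Linear projection: the remaining days burn at the pace seen so far."""
    if not 0 < day_of_month <= days_in_window:
        raise ValueError("day_of_month must be within the window")
    return downtime_so_far * days_in_window / day_of_month


def early_warning(downtime_so_far, day_of_month, slo_pct=99.9, sla_pct=99.5):
    """Classify the month by where the projected downtime would land."""
    slo_allow = (100.0 - slo_pct) / 100.0 * MINUTES_PER_30_DAY_MONTH
    sla_allow = (100.0 - sla_pct) / 100.0 * MINUTES_PER_30_DAY_MONTH
    projected = projected_month_end_downtime(downtime_so_far, day_of_month)
    if projected > sla_allow:
        return "sla-at-risk"
    if projected > slo_allow:
        return "slo-at-risk"
    return "ok"


print(early_warning(downtime_so_far=10, day_of_month=10))  # projects to 30 min
print(early_warning(downtime_so_far=40, day_of_month=10))  # projects to 120 min
print(early_warning(downtime_so_far=90, day_of_month=10))  # projects to 270 min
```

A linear projection overreacts to a single early incident, so in practice teams often pair it with a shorter look-back window; that refinement is left out here to keep the idea visible.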

### Do AWS service credits cover what I owe customers when I breach my SLA?
No, and assuming they do is a dangerous modeling error. AWS SLA credits (the Compute SLA pays 10%-100% of the affected service bill) reimburse a fraction of your AWS spend, not your customers' losses and not your SLA penalty payouts, which are often far larger than the AWS charge involved. You must request AWS credits via Support, and they apply as future bill credits. Budget your own SLA penalties separately from AWS credits.

### When should we NOT offer a high uptime SLA?
Avoid promising 99.9%+ before you can measure your real availability, before your critical path is redundant enough to support the number, and before you have an on-call rotation that can actually hit it. Offering a 99.99% SLA on a single-Region, single-RDS-instance architecture is writing a check the architecture cannot cash. Lower the promise (99.5%), deliver better (99.8%), and earn the right to raise it with evidence.

### How do I measure availability for SLA reporting?
Define the measurement before the contract: what counts as "down" (error rate threshold, synthetic probe failures, or successful-request ratio), the measurement window (at 99.9%, a 30-day month allows 43.2 minutes while a 30.44-day average month allows about 43.8, so state which one the contract uses), and exclusions (scheduled maintenance, customer-caused errors, force majeure). Measure with synthetic canaries plus real success-ratio SLIs. Do not rely on infrastructure up/down alone, because a healthy EC2 instance serving 500s is still down to the customer.
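
A measurement definition like this can be written down as code so both parties compute the same number. The sketch below is one possible definition, not a standard: the 5% error-rate threshold, the rule that zero-traffic minutes count as up, and the function name `availability_sli` are all assumptions chosen for illustration.

```python
def availability_sli(minutes, error_rate_threshold=0.05, excluded=frozenset()):
    """Return (availability, bad_minutes) over per-minute request counts.

    minutes: iterable of (minute_index, total_requests, failed_requests).
    A minute is "down" when its error rate exceeds error_rate_threshold.
    Minutes in `excluded` (for example agreed maintenance) are skipped,
    and minutes with no traffic count as up. These are contract choices.
    """
    counted = bad = 0
    for idx, total, failed in minutes:
        if idx in excluded:
            continue
        counted += 1
        if total > 0 and failed / total > error_rate_threshold:
            bad += 1
    availability = 1.0 if counted == 0 else 1.0 - bad / counted
    return availability, bad


# Hypothetical data: a 30-day month where minutes 100-119 served mostly errors.
month = [(i, 1000, 900 if 100 <= i < 120 else 2) for i in range(30 * 24 * 60)]
availability, bad = availability_sli(month)
print(f"{bad} bad minutes, availability {availability:.4%}")
```

Because the instances stay "up" throughout this hypothetical month, an infrastructure health check would report no downtime at all, which is the gap the success-ratio SLI is meant to close.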

### What could go wrong if maintenance windows are not excluded?
If your SLA does not carve out scheduled maintenance, every planned deploy or patch eats your error budget and can trigger penalty clauses for downtime you controlled and announced. Define a maintenance window exclusion (with advance notice requirements) in the contract, or commit to zero-downtime deploys. Teams that skip this clause routinely breach their own SLA during routine patching.
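
The effect of the clause is easy to see with numbers. In this sketch the `Outage` type, the 72-hour notice requirement, and the 10-minute unplanned incident are hypothetical; only the 38-minute announced patch echoes the example earlier in this post.

```python
from dataclasses import dataclass

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


@dataclass
class Outage:
    minutes: float
    scheduled: bool = False
    notice_hours: float = 0.0  # how far ahead customers were told


def counted_downtime(outages, exclusion_clause, min_notice_hours=72.0):
    """Downtime that counts against the SLA.

    min_notice_hours is a hypothetical contract term: scheduled work is
    excluded only when it was announced at least that far ahead.
    """
    total = 0.0
    for o in outages:
        excluded = exclusion_clause and o.scheduled and o.notice_hours >= min_notice_hours
        if not excluded:
            total += o.minutes
    return total


def availability_pct(downtime, minutes=MINUTES_PER_30_DAY_MONTH):
    """Availability percentage for a given number of downtime minutes."""
    return 100.0 * (1.0 - downtime / minutes)


# Illustrative month: a 38-minute announced patch plus a 10-minute incident.
month = [Outage(38, scheduled=True, notice_hours=96), Outage(10)]

for clause in (False, True):
    down = counted_downtime(month, exclusion_clause=clause)
    print(f"exclusion clause={clause}: {down:.0f} min counted, {availability_pct(down):.3f}%")
```

With these inputs the same month falls on either side of a 99.9% commitment depending only on the contract wording, and a patch announced with too little notice would still count in full.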

