Paxos, Raft, and Byzantine Fault Tolerance: What Cloud Architects Need
Quick summary: You rarely implement consensus yourself on EC2; you buy it inside Aurora, DynamoDB, and the etcd that backs EKS. This guide explains the quorum math so you can trust managed services and avoid rolling your own coordinator.
Key Takeaways
- You rarely implement consensus yourself on EC2; you buy it inside Aurora, DynamoDB, and the etcd that backs EKS
- As of June 2026, Raft (the protocol inside etcd, which backs EKS) elects a leader by majority quorum; Paxos-family protocols underpin many storage systems
- Byzantine fault tolerance (BFT) handles malicious nodes, which is overkill inside an AWS VPC trust boundary unless you are building a blockchain or coordinating parties that do not trust each other
- Map each critical state store to its majority quorum: floor(N/2) + 1 of N voting members
As of June 2026, Raft (the protocol inside etcd, which backs the EKS control plane) elects a leader by majority quorum, and Paxos-family protocols underpin many storage systems. Byzantine fault tolerance (BFT) handles malicious nodes, which is overkill inside an AWS VPC trust boundary unless you are building a blockchain or coordinating parties that do not trust each other.
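A rough way to see why BFT is expensive is to compare minimum cluster sizes. The sketch below assumes the textbook bounds: a crash-fault-tolerant majority protocol such as Raft or Paxos needs 2f + 1 nodes to survive f failed nodes, while classic BFT protocols such as PBFT need 3f + 1 nodes to survive f malicious ones. The function names are illustrative and do not come from any library.

```python
def min_nodes_crash_fault(f: int) -> int:
    """Smallest cluster that keeps a majority after f nodes crash (Raft/Paxos style)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


def min_nodes_byzantine(f: int) -> int:
    """Smallest cluster that tolerates f malicious nodes under the classic 3f + 1 bound."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f"f={f}: crash-fault needs {min_nodes_crash_fault(f)}, "
              f"Byzantine needs {min_nodes_byzantine(f)}")
```

Tolerating a single bad node already means four nodes instead of three, plus the extra message rounds BFT protocols require, which is why it rarely pays off when every node sits inside one trust boundary.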
Where AWS already runs consensus for you
| Component | Consensus inside |
|---|---|
| Aurora storage | Quorum replicas |
| DynamoDB | Partition replication |
| EKS control plane | etcd (Raft) |
| MSK | Kafka controller election |
Do not run homegrown Raft on EC2 for application locks. Use DynamoDB conditional writes, or Step Functions with idempotent tasks.
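As a minimal sketch of the DynamoDB option, the function below takes a lease-style lock with a single conditional `put_item`. It assumes a boto3 low-level DynamoDB client and a hypothetical table whose partition key is `lock_id`; the attribute names `holder` and `expires_at` are also assumptions made for illustration, not a prescribed schema.

```python
import time


def try_acquire_lock(client, table_name, lock_id, holder, ttl_seconds, now=None):
    """Return True if this caller now holds the lock, False if someone else does.

    `client` is expected to behave like boto3.client("dynamodb").
    The write succeeds only if no lock item exists or the existing lease has expired.
    """
    now = int(time.time()) if now is None else int(now)
    try:
        client.put_item(
            TableName=table_name,
            Item={
                "lock_id": {"S": lock_id},          # assumed partition key
                "holder": {"S": holder},            # who owns the lease
                "expires_at": {"N": str(now + ttl_seconds)},
            },
            # DynamoDB evaluates the condition atomically with the write.
            ConditionExpression="attribute_not_exists(lock_id) OR expires_at < :now",
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except client.exceptions.ConditionalCheckFailedException:
        # Another holder has an unexpired lease.
        return False
```

With real AWS credentials this would be called as `try_acquire_lock(boto3.client("dynamodb"), "app-locks", "nightly-job", "worker-1", 30)`, where the table and lock names are placeholders. A lease like this depends on reasonably synchronized clocks, so the protected work should still be idempotent.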
Architect takeaway
When someone proposes “self-hosted ZooKeeper,” ask what the managed equivalent already buys you: operational quorum, fencing, and upgrades.
What to do this week
- Map each critical state store to its majority quorum: floor(N/2) + 1 of N voting members.
- Review the documentation for your EKS etcd backup and restore drill.
- Skip BFT designs unless threat model includes malicious peers.
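The quorum-mapping step can be sketched as a small worksheet script. The store names and voter counts below are hypothetical placeholders, and the calculation assumes a simple majority quorum; some managed services use different quorum rules, so check each service's documentation before relying on these numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateStore:
    name: str
    voters: int  # number of voting replicas (assumed, fill in from your own inventory)


def majority_quorum(n: int) -> int:
    """Votes needed for a majority of n voters: floor(n/2) + 1."""
    if n < 1:
        raise ValueError("a cluster needs at least one voter")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Voters that can be lost while a majority is still reachable."""
    return n - majority_quorum(n)


def quorum_report(stores):
    """Return (name, voters, quorum, tolerated failures) for each store."""
    return [
        (s.name, s.voters, majority_quorum(s.voters), tolerated_failures(s.voters))
        for s in stores
    ]


if __name__ == "__main__":
    inventory = [  # hypothetical inventory for illustration only
        StateStore("self-managed etcd", 3),
        StateStore("coordination cluster", 5),
        StateStore("legacy 4-node cluster", 4),
    ]
    for name, voters, quorum, spare in quorum_report(inventory):
        print(f"{name}: {voters} voters, quorum {quorum}, survives {spare} failure(s)")
```

Note that the four-node entry survives only one failure, the same as three nodes, which is why odd cluster sizes are the usual choice.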
What this guide doesn’t cover
Exactly-once delivery and CQRS. Those are covered in part 5 of this track.
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.