Secret Management & Encryption
HashiCorp Vault on AWS
Centralised secret management, dynamic credentials, and envelope encryption with Vault — sitting alongside AWS KMS, Secrets Manager, and IAM.
Last updated: April 29, 2026 · Author: FactualMinds Cloud Integration Team · Reviewed by: FactualMinds AWS-certified architects (Solutions Architect – Professional)
## HashiCorp Vault on AWS
Vault is an enterprise secret management and encryption-as-a-service platform. On AWS it sits alongside IAM, KMS, and Secrets Manager — owning the domains those services do not cover as well: dynamic database credentials with sub-hour TTLs, transit-engine encryption for many small items, multi-cloud policy consistency, and centralised PKI for both AWS and on-prem certificates.
> **Licensing note (2026)**: IBM closed its acquisition of HashiCorp in early 2025. Vault remains under the Business Source License 1.1 adopted in August 2023 — free for all non-competing production use, with enterprise features behind a commercial licence. OpenBao (community fork) exists as a Linux Foundation project but lacks DR replication, namespaces, and FIPS transit for regulated workloads. Always verify current terms at hashicorp.com.
## Why Vault on AWS
**Centralised secret storage**
- Passwords, API keys, TLS certificates, and tokens in one audited store.
- Encryption at rest (AES-256-GCM) with automatic key rotation; audit device captures every read, write, and token operation.
- KMS auto-unseal removes the operator burden of handling unseal keys in person.
**Dynamic credentials**
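- Generate temporary RDS/Aurora/Redshift/MongoDB/Snowflake passwords on demand, valid for the TTL you set on the role (1 hour is a common choice; if you set nothing, Vault falls back to its 32-day system default), auto-revoked at expiry.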
- Shrinks blast radius from "replay leaked creds forever" to "replay for under an hour".
- Per-app credentials and per-session audit mean you can answer "which microservice or pipeline run caused this DB lock?" without guesswork.
**Transit engine (encryption-as-a-service)**
- Send plaintext, receive ciphertext without Vault ever storing the plaintext.
- Convergent encryption, derived keys, and key rotation without re-encrypting every record.
- Throughput and cost profile that outperforms raw KMS calls for many small items.
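"Key rotation without re-encrypting every record" works because each transit ciphertext carries the key version that produced it, in the form `vault:v<version>:<base64>`. The sketch below is a minimal, hypothetical helper (the function names and sample payloads are assumptions, not part of Vault) that picks out records still encrypted under an old version, so that only those need to go through the transit rewrap endpoint.

```python
import re

# Transit ciphertext looks like "vault:v<key version>:<base64 payload>".
_CIPHERTEXT = re.compile(r"^vault:v(\d+):[A-Za-z0-9+/=]+$")


def key_version(ciphertext: str) -> int:
    """Return the transit key version that produced this ciphertext."""
    match = _CIPHERTEXT.match(ciphertext)
    if match is None:
        raise ValueError("not a Vault transit ciphertext")
    return int(match.group(1))


def needs_rewrap(ciphertext: str, min_version: int) -> bool:
    """True when the ciphertext predates the oldest key version we want to keep."""
    return key_version(ciphertext) < min_version


if __name__ == "__main__":
    stored = ["vault:v1:AAAA", "vault:v3:BBBB"]  # made-up payloads for illustration
    stale = [c for c in stored if needs_rewrap(c, min_version=3)]
    print(stale)  # candidates to send to the transit rewrap endpoint
```

Records that are never rewrapped stay readable as long as the key's minimum decryption version still covers them, which is why rotation can happen without a bulk re-encryption.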
**Multi-cloud & on-prem**
- Same policy model across AWS, Azure, GCP, and on-prem workloads — important for M&A, hybrid, and regulated environments that cannot put all secrets in one cloud.
## Vault vs AWS Secrets Manager — decision matrix
| Question | Secrets Manager | Vault |
| ---------------------------------------------------- | ------------------------ | ------------------------------------ |
| Single-cloud AWS workload? | ✅ Preferred | Overkill for most |
| Need dynamic DB creds under 60 min TTL? | ❌ | ✅ |
| Need transit engine / envelope encryption at volume? | ❌ (use KMS directly) | ✅ |
| Multi-cloud or hybrid consistency required? | ❌ | ✅ |
| Need centralised PKI for AWS + on-prem? | Partial (ACM Private CA) | ✅ |
| Need SSH CA for ephemeral server access? | ❌ | ✅ |
| AWS-native rotation + Lambda rotators is enough? | ✅ | Overkill |
| Existing Vault footprint across org? | — | ✅ |
| Simplest audit via CloudTrail? | ✅ | Vault audit device (fine, but extra) |
| Cost for small AWS-only team? | Lower | Higher (infra or HCP) |
**Default recommendation**: start with Secrets Manager for AWS-only workloads; add Vault when a specific driver above applies. Many regulated customers run both — Secrets Manager for AWS-service consumers, Vault for dynamic DB creds and transit, with Vault Secrets Sync keeping a one-way mirror to Secrets Manager for ECS/Lambda ergonomics.
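The matrix can be read as a checklist of drivers: with none ticked, Secrets Manager alone is the default; with any ticked, Vault is added alongside it. The following is only an illustrative encoding of that reading; the field names are assumptions chosen to mirror the table rows, not an official scoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """One flag per Vault-favouring row of the decision matrix (names are illustrative)."""
    dynamic_db_creds: bool = False
    transit_at_volume: bool = False
    multi_cloud: bool = False
    pki_aws_and_on_prem: bool = False
    ssh_ca: bool = False
    existing_vault: bool = False


def vault_drivers(req: Requirements) -> list[str]:
    """Return the names of the drivers that argue for adding Vault."""
    return [name for name, wanted in vars(req).items() if wanted]


def recommend(req: Requirements) -> str:
    """Default to Secrets Manager; add Vault only when a specific driver applies."""
    drivers = vault_drivers(req)
    if not drivers:
        return "Secrets Manager"
    return "Secrets Manager + Vault (" + ", ".join(drivers) + ")"


if __name__ == "__main__":
    print(recommend(Requirements()))
    print(recommend(Requirements(dynamic_db_creds=True, ssh_ca=True)))
```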
## Vault architecture on AWS
**Self-hosted (control-plane-sensitive workloads)**
- 3–5 node cluster on EC2 in an Auto Scaling Group across AZs.
- **Integrated Storage (Raft)** is now the HashiCorp-recommended backend. DynamoDB/S3 backends are still supported, but Raft is simpler and faster and removes the external storage dependency. Performance and DR replication to other regions are Vault Enterprise features layered on top.
- Network Load Balancer in front for TLS termination via ACM.
- **KMS auto-unseal** — Vault uses an AWS KMS key to unseal itself after restart; rotate the KMS key annually.
- VPC endpoints for KMS, STS, and CloudWatch to keep traffic off the internet.
**HCP Vault Dedicated** (managed cluster)
- HashiCorp runs the cluster; you consume via AWS PrivateLink.
- Dev tier starts ~$200/month; production tiers scale by node count and replication.
- Best when you want a full Vault feature set without running the cluster yourself.
**HCP Vault Secrets** (lightweight SaaS) — GA 2024
- REST API for static secrets; free tier up to 25 secrets; paid from ~$0.03/secret/month.
- Best starting point for teams that need a managed key-value store with better audit than Parameter Store but do not yet need dynamic or transit.
## Authentication methods we deploy
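- **AWS auth method**: EC2 instances authenticate with their signed instance identity document, which Vault checks against AWS's signature and the EC2 API; Lambda functions and other IAM principals authenticate with a signed `sts:GetCallerIdentity` request, which Vault replays against AWS STS.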
- **Kubernetes auth** — pods authenticate with their projected ServiceAccount token; Vault verifies against the cluster's TokenReview API. On EKS, pair with **Pod Identity** for outbound calls.
- **OIDC / JWT** — authenticate GitHub Actions, GitLab CI, and human SSO via an OIDC trust relationship; pairs with OIDC subject-claim filtering similar to the pattern we use for AWS IAM + GitHub Actions.
- **AppRole** — service-to-service authentication for on-prem or legacy workloads that cannot use IAM/OIDC.
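As a rough illustration of the Kubernetes auth flow, the sketch below builds the login request a pod would send: the role name plus the ServiceAccount JWT, posted to the auth mount's `login` endpoint. The Vault address, role name, and helper names are assumptions for the example, and nothing is sent over the network here. In practice most workloads let VSO, Vault Agent, or an SDK do this.

```python
import json
import urllib.request

# Default location of the ServiceAccount token inside a pod.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_k8s_login(vault_addr: str, role: str, jwt: str,
                    mount: str = "kubernetes") -> urllib.request.Request:
    """Build (but do not send) the POST to /v1/auth/<mount>/login."""
    body = json.dumps({"role": role, "jwt": jwt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def client_token(login_response: dict) -> str:
    """Extract the Vault token from a successful login response body."""
    return login_response["auth"]["client_token"]


if __name__ == "__main__":
    # Hypothetical address and role; inside a pod the JWT would be read from SA_TOKEN_PATH.
    req = build_k8s_login("https://vault.example.internal:8200", "orders-app", "<jwt>")
    print(req.get_method(), req.full_url)
```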
## Secret engines we deploy on AWS
- **Database** — dynamic RDS/Aurora (Postgres/MySQL), Redshift, MongoDB Atlas, and Snowflake credentials with configurable TTL and max-TTL.
- **AWS** — generate temporary IAM access keys or assume-role credentials; useful for short-lived CLI sessions or third-party tools that cannot use IAM directly.
- **Transit** — encryption-as-a-service with convergent encryption, key rotation, and datakey generation.
- **PKI** — issue X.509 TLS certs for services running on AWS and on-prem; ACME server (Vault 1.14+) means cert-manager and traditional ACME clients can pull from Vault directly.
- **SSH CA** — sign short-lived SSH certs for engineer access to EC2 bastion hosts or on-prem Linux fleets.
## Vault Secrets Operator (VSO) for EKS
The 2026 default pattern for Kubernetes workloads on EKS:
1. Install VSO via Helm; configure a `VaultConnection` and `VaultAuth` pointing at your Vault cluster with Kubernetes or JWT auth.
2. App teams declare `VaultStaticSecret` or `VaultDynamicSecret` CRDs in their namespace; VSO reconciles them into native Kubernetes Secrets that the app consumes as normal.
3. VSO handles renewal and rotation automatically; dynamic secrets flow into a rolling Deployment restart when TTL approaches expiry.
4. Pair with EKS Pod Identity for the outbound AWS calls VSO makes during auth verification.
This replaces the legacy Vault Agent sidecar + init-container pattern for most workloads. Fall back to Agent-sidecar + tmpfs when secrets must never sit in etcd.
## Vault 1.17 / 1.18 features worth enabling
- **Adaptive overload protection** — targeted request-type throttling so a misbehaving client cannot take the whole cluster down.
- **Multi-issuer PKI with ACME** — Vault can be the ACME server for cert-manager on EKS and for external workloads.
- **KV v2 transformations** — patch subkeys atomically; useful for complex configuration documents.
- **Secrets Sync GA** — one-way sync Vault → AWS Secrets Manager / GitHub / GCP / Vercel / HCP Terraform; keeps AWS-native consumers working while Vault stays the source of truth.
- **Workload identity federation** — authenticate Vault to other systems without stored IAM credentials.
## Implementation: VaultDynamicSecret with VSO
```yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: orders-db-creds
  namespace: orders
spec:
  vaultAuthRef: orders-eks-auth
  mount: database
  path: creds/orders-app
  destination:
    create: true
    name: orders-db
  refreshAfter: 30m # rotate before TTL expires
  rolloutRestartTargets:
    - kind: Deployment
      name: orders-api
```
The `rolloutRestartTargets` field restarts the Deployment on rotation so applications pick up new credentials cleanly. For applications that hot-reload secrets without a restart, omit `rolloutRestartTargets`, mount the Secret as a volume, and have the app re-read it when the database rejects its current credentials.
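For the hot-reload case, a minimal sketch of the application side might look like the following. It assumes the Kubernetes Secret is mounted as a volume (one file per key) and that the keys are named `username` and `password`; the class name and the `is_auth_error` callback are hypothetical, and a real application would plug in its own database driver and error type.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


class ReloadingCredentials:
    """Cache credentials from a mounted Secret and re-read them after an auth failure."""

    def __init__(self, secret_dir: str) -> None:
        self._dir = Path(secret_dir)
        self._cached = self._read()

    def _read(self) -> tuple[str, str]:
        # Each key of a volume-mounted Secret appears as a file named after the key.
        username = (self._dir / "username").read_text().strip()
        password = (self._dir / "password").read_text().strip()
        return username, password

    def call(self, operation: Callable[[str, str], T],
             is_auth_error: Callable[[Exception], bool]) -> T:
        """Run operation(username, password); on an auth error reload once and retry."""
        try:
            return operation(*self._cached)
        except Exception as exc:
            if not is_auth_error(exc):
                raise
        self._cached = self._read()
        return operation(*self._cached)
```

The single retry is deliberate: if the freshly read credentials also fail, the error surfaces to the caller instead of looping.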
## Failure modes & resilience
**1. Leader election timeout / quorum loss.** Raft requires a majority, so losing 2 of 3 nodes makes the cluster unavailable. While a leader still exists, remove a dead node with `vault operator raft remove-peer` and add its replacement with `vault operator raft join`. Once quorum is lost those commands can no longer commit: recover by placing a `peers.json` file that lists the surviving node(s) in that node's `raft/` data directory and restarting it, then join the replacements. Critical: never remove a healthy peer by mistake. Practice this in a non-prod cluster every quarter.
**2. KMS auto-unseal key revocation.** If the auto-unseal KMS key is deleted, scheduled for deletion, or has its key policy revoked, restarted Vault nodes cannot unseal. Mitigation: create the key with `deletion_window_in_days = 30` (Terraform `aws_kms_key`); alert on key policy changes via CloudTrail; make it a multi-Region key and replicate it to a DR region with `aws_kms_replica_key`. Test annual key rotation in staging: the auto-unseal seal handles rotation transparently if both old and new versions are accessible.
**3. Transit-engine throughput ceiling.** A single Vault node maxes around 2,000–5,000 transit ops/s depending on instance type and key type (AES-GCM faster than RSA). Symptom: `429 Too Many Requests` on transit endpoints. Mitigation: scale horizontally via performance replication standby clusters serving read-heavy transit traffic; use `derived = true` keys to amortize key derivation; consider client-side envelope encryption with KMS-wrapped DEKs for very high throughput.
**4. Dynamic-DB credential renewal collisions.** A burst of new pods all requesting credentials simultaneously can hit per-database connection limits while Vault creates short-lived users. Mitigation: cap connections per generated user in the role's creation statements (for example PostgreSQL's `CONNECTION LIMIT`) and bound Vault's own pool with `max_open_connections` on the database config; use `default_ttl ≥ 1h` so steady-state pods don't churn; alarm on the CloudWatch `DatabaseConnections` metric for the upstream RDS/Aurora.
**5. Audit device backpressure.** Vault blocks all requests if audit logging fails. Symptom: cluster appears healthy but every request 500s. Mitigation: configure two audit devices (file + syslog or file + socket); the request succeeds if at least one writes. Monitor audit-device disk usage and rotate logs aggressively.
**6. Token-policy drift.** Long-lived service tokens accumulate over years; orphaned tokens from departed engineers persist. Mitigation: enforce token max-TTL with `vault auth tune -max-lease-ttl=<duration> token/`; run `vault list auth/token/accessors` quarterly and revoke orphans; prefer auth methods (AWS, K8s, OIDC) over raw tokens.
**7. Performance replication lag.** Read-heavy secondaries can lag the primary during heavy write traffic. Stale reads on a secondary may serve outdated dynamic credentials. Mitigation: route writes and reads-after-writes to the primary; alert on `vault.replication.wal_lag` exceeding the SLA.
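The majority rule behind failure mode 1 is simple arithmetic, and Vault's documented lost-quorum recovery for Integrated Storage uses a `peers.json` file listing the nodes that should form the new cluster. The sketch below shows both. The node ID and address are placeholders, and the helper names are assumptions rather than Vault tooling.

```python
import json


def quorum(voters: int) -> int:
    """Smallest number of voters that forms a Raft majority."""
    return voters // 2 + 1


def failures_tolerated(voters: int) -> int:
    """How many voters can be lost while the cluster stays available."""
    return voters - quorum(voters)


def peers_json(survivors: dict[str, str]) -> str:
    """Render a peers.json body from {node_id: cluster_address} of surviving nodes."""
    entries = [
        {"id": node_id, "address": address, "non_voter": False}
        for node_id, address in sorted(survivors.items())
    ]
    return json.dumps(entries, indent=2)


if __name__ == "__main__":
    for size in (3, 5):
        print(size, "nodes tolerate", failures_tolerated(size), "failure(s)")
    # Placeholder node ID and cluster address (port 8201 is the default cluster port).
    print(peers_json({"vault-a": "10.0.1.10:8201"}))
```

A 3-node cluster tolerates one failure and a 5-node cluster two, which is why the 3–5 node range is the usual sizing.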
## Observability runbook
**Metrics to scrape (Vault Prometheus endpoint or Datadog Vault integration):**
| Metric | Alarm threshold | First action |
| ----------------------------------------- | --------------------- | --------------------------------------------------------------------- |
| `vault.core.unsealed` | `= 0` | Page on-call; check KMS auto-unseal key health |
| `vault.core.leader_election_count` | `> 1` per hour | Investigate node health, network partitions; review Raft quorum |
| `vault.audit.log_request_failure` | any | Rotate to backup audit device; investigate disk / sink health |
| `vault.runtime.alloc_bytes` | rising trend | Memory leak; check for client misuse (many ephemeral tokens) |
| `vault.token.creation` | spike `> 5×` baseline | Likely runaway client or auth-method abuse |
| `vault.replication.wal_lag` (HCP/PR) | `> 1 min` | Network or primary write storm; investigate |
| `vault.adaptive_overload.throttled_count` | sustained `> 0` | Client overload protection kicking in; identify and rate-limit caller |
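As a sketch of how the first rows of this table could be checked by hand, the snippet below parses Prometheus exposition text such as Vault returns from `/v1/sys/metrics?format=prometheus` and flags a sealed node or audit failures. It assumes the dotted metric names appear with underscores (for example `vault_core_unsealed`) and that lines carry no trailing timestamp; labels are ignored, so the last sample per name wins. A real deployment would express these as alert rules in the monitoring system instead.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Parse 'name{labels} value' lines into {name: value}, ignoring comments and labels."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        samples[name] = float(value)
    return samples


def alerts(samples: dict[str, float]) -> list[str]:
    """Apply the thresholds for sealed state and audit failures."""
    found = []
    if samples.get("vault_core_unsealed", 1.0) == 0.0:
        found.append("sealed")
    if samples.get("vault_audit_log_request_failure", 0.0) > 0:
        found.append("audit-failure")
    return found


if __name__ == "__main__":
    scrape = (
        "# TYPE vault_core_unsealed gauge\n"
        'vault_core_unsealed{cluster="vault-prod"} 0\n'
        "vault_audit_log_request_failure 3\n"
    )
    print(alerts(parse_samples(scrape)))
```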
**Raft snapshot + restore runbook:**
```bash
# Daily snapshot: schedule via systemd timer or EventBridge-triggered runner
set -euo pipefail
SNAP_DATE="$(date +%F)"   # capture once so the file name and S3 prefix always agree
vault operator raft snapshot save "/tmp/vault-${SNAP_DATE}.snap"
aws s3 cp "/tmp/vault-${SNAP_DATE}.snap" \
  "s3://acme-vault-snapshots/${SNAP_DATE}/" \
  --sse aws:kms --sse-kms-key-id alias/vault-snapshot

# Restore (DR, only in a fresh cluster)
vault operator raft snapshot restore -force /tmp/vault-2026-04-29.snap
# Re-init unseal: each operator unseals with their key share or KMS auto-unseal
```
Test restore quarterly to a staging cluster. Tabletop drill: "primary AZ is gone — recover in under 30 minutes". Document the actual time and improve.
**Unseal-key rotation cadence:** Rotate the KMS auto-unseal key annually by adding a new alias and updating the seal stanza:
```hcl
seal "awskms" {
  region     = "eu-west-1"
  kms_key_id = "alias/vault-unseal-2026"
}
```
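Vault has no `vault operator seal-migration` command. If you rely on KMS automatic key rotation, the key ID stays the same and Vault needs no change. Pointing the stanza at a different KMS key is a seal migration: keep the old `seal` stanza with `disabled = "true"` next to the new one, restart the nodes, and run `vault operator unseal -migrate` with the recovery keys. Old material must remain accessible for decrypt during the transition; do not delete the previous key until the migration is verified across all nodes.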
**Debug path: "client getting 403":**
1. `vault token lookup` (with the client token) — check expiry, policies, accessor.
2. `vault read sys/policies/acl/<policy>` — confirm path and capability match the failed request.
3. Audit log on the Vault node: search for the request ID; look at `error` field.
4. If using AWS or K8s auth: confirm the bound entity (instance role, ServiceAccount) still matches the auth role, and check whether the trust source has been rotated.
5. Adaptive overload: check `vault.adaptive_overload.throttled_count` — the client may be throttled, not denied.
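Step 1 of this debug path can be partly automated. The sketch below reads the JSON printed by `vault token lookup -format=json` and reports the two most common causes: a token close to expiry, or a missing policy. The function name, the 60-second threshold, and the sample policy names are assumptions for illustration.

```python
import json


def triage_token(lookup_json: str, required_policy: str,
                 min_ttl_seconds: int = 60) -> list[str]:
    """Return human-readable findings from `vault token lookup -format=json` output."""
    data = json.loads(lookup_json)["data"]
    findings = []
    # Non-expiring tokens (such as root) report expire_time = null and ttl = 0.
    ttl = data.get("ttl", 0)
    if data.get("expire_time") is not None and ttl < min_ttl_seconds:
        findings.append(f"token expires in {ttl}s")
    # Policies can come from the token itself or from its identity entity.
    policies = (data.get("policies") or []) + (data.get("identity_policies") or [])
    if required_policy not in policies:
        attached = ", ".join(policies) or "none"
        findings.append(f"policy '{required_policy}' not attached (has: {attached})")
    return findings


if __name__ == "__main__":
    sample = json.dumps({"data": {
        "ttl": 30,
        "expire_time": "2026-04-29T12:00:30Z",
        "policies": ["default"],
    }})
    for finding in triage_token(sample, required_policy="orders-db"):
        print(finding)
```

An empty result means the token itself looks fine, so the next steps (policy paths, audit log, auth role bindings) are where to look.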
## When Vault is NOT the right call
- Single-cloud AWS workload with simple static/rotated secrets — use **AWS Secrets Manager** + KMS CMKs; audit via CloudTrail; done.
- No need for dynamic credentials and no multi-cloud ambition — Vault's operational cost outweighs the benefit.
- Tiny team with no dedicated platform engineer — HCP Vault Secrets is the lightweight option, but even that is extra surface vs Secrets Manager for most use cases.
- Hard requirement for OSI-approved open source — OpenBao exists, but at the cost of enterprise features you probably need if you were considering Vault in the first place.
## Best practices
**Security**
- Enable audit device to a forward-only CloudWatch Logs stream and S3 Object Lock bucket; treat audit log integrity as part of the SOC 2 / PCI evidence pipeline.
- Human access to Vault admin functions requires MFA and a separate `admin` namespace — never the root token.
- Rotate KMS auto-unseal keys annually; tie the rotation into the control-plane pipeline.
- IP allow-list the Vault API via VPC endpoint policies or AWS Network Firewall rules for public-facing ALBs.
**Operations**
- Back up Raft snapshots to an S3 bucket with Object Lock; test restore quarterly.
- Performance replication or DR replication to a secondary region for RPO-sensitive workloads.
- Monitor with Datadog, CloudWatch, or the native Prometheus endpoint; alert on leader changes, sealed state, and adaptive-overload throttling.
**Application integration**
- Prefer Vault Secrets Operator over sidecars for EKS workloads.
- Cache secrets in-process with a TTL slightly shorter than Vault's TTL; handle 403 gracefully by re-fetching.
- Never log secret values. Vault's audit devices HMAC secret material by default, so keep the `log_raw` option disabled and secret values never reach the logs in plaintext.
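The caching advice above can be sketched as a small in-process cache that expires a little before the Vault lease does and can be invalidated when a request is rejected. Everything here is illustrative: `fetch` stands in for whatever client call returns the secret and its lease duration in seconds, and the 10% safety margin is an assumption to tune.

```python
import time
from typing import Callable, Optional


class LeaseCache:
    """Cache one secret for slightly less than its lease duration."""

    def __init__(self, fetch: Callable[[], tuple[dict, int]],
                 safety_margin: float = 0.1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch            # returns (secret_data, lease_seconds)
        self._margin = safety_margin   # fraction of the lease to give up as headroom
        self._clock = clock
        self._value: Optional[dict] = None
        self._expires_at = 0.0

    def get(self) -> dict:
        """Return the cached secret, re-fetching once the shortened TTL has passed."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value, lease_seconds = self._fetch()
            self._expires_at = now + lease_seconds * (1 - self._margin)
        return self._value

    def invalidate(self) -> None:
        """Drop the cached value, for example after a 403, so the next get() re-fetches."""
        self._value = None
```

Injecting the clock keeps the class easy to test and avoids wall-clock jumps affecting expiry.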
## Related reading
- [`AWS Secrets Manager vs Parameter Store: when to use which`](/blog/aws-secrets-manager-vs-parameter-store-when-to-use-which/)
- [`PCI DSS compliance on AWS: architecture guide for fintech`](/blog/pci-dss-compliance-aws-architecture-guide-fintech/)
- [`How to achieve SOC 2 compliance on AWS in 2026`](/blog/how-to-achieve-soc2-compliance-aws-2026/)
## Related services
- [AWS Cloud Security](/services/aws-cloud-security/)
- [Cloud Compliance Services](/services/cloud-compliance-services/)
- [DevOps Pipeline Setup](/services/devops-pipeline-setup/)