One platform, 40–70% lower spend

Lakehouse on AWS — S3 Tables, Iceberg, Athena, and Redshift Spectrum

Replace a $30K/month Redshift cluster with a lakehouse at $8–14K/month — S3 Tables (managed Iceberg) for the system of record, Athena for ad-hoc, Redshift Spectrum where you still need joins. The AWS-native default for unified analytics in 2026.

Last updated: May 1, 2026Author: FactualMinds AWS ArchitectsReviewed by: Palaniappan P · AWS Solutions Architect — Professional

Problem

Analytics teams on AWS run both a lake (raw S3 + Glue + Athena) and a warehouse (Redshift) in parallel — duplicated storage, duplicated ETL, two on-call rotations, and a constant argument over which one is the source of truth. The cost line on the duplicated stack creeps past $30K/month before anyone notices, and BI dashboards still take 12 seconds to load because nobody owns the curated layer.

Solution

Adopt the lakehouse shape with S3 Tables (managed Apache Iceberg) as the system of record. Athena queries the tables directly for ad-hoc and lightweight BI; Redshift Spectrum joins them with hot warehouse data for sub-second dashboards; Glue Data Catalog is the single metadata plane; Lake Formation enforces row- and column-level access. For streaming ingest, use Managed Service for Apache Flink (NOT the SQL-flavored Kinesis Data Analytics, which is retiring January 2026) reading from Kinesis Data Streams and writing back into S3 Tables.

AWS services in this pattern

Service	Role
Amazon S3 Tables	Managed Apache Iceberg storage tier (GA Dec 2024) — automatic compaction, snapshot expiry, and unreferenced-file cleanup; the system of record for the lakehouse
AWS Glue Data Catalog	Single metadata plane — every Athena query, Redshift Spectrum external table, and EMR job reads from the same catalog
Amazon Athena	Serverless SQL query engine for ad-hoc analytics and lightweight BI directly against S3 Tables; pay-per-query economics for unpredictable workloads
Amazon Redshift Spectrum	External-table query engine that joins Iceberg tables on S3 with hot warehouse data; the path to sub-second BI dashboards on top of the lakehouse
AWS Lake Formation	Centralized access control — row-level security, column masking, tag-based access on top of the Glue Catalog; the single permission plane for both Athena and Redshift Spectrum
AWS Glue (ETL)	Serverless Spark ETL for batch ingestion, schema evolution handling, and data-quality enforcement before writes land in S3 Tables
Amazon Kinesis Data Streams	Durable event stream for change-data-capture, application events, and IoT ingest
Amazon Managed Service for Apache Flink	Streaming transform layer — replaces the retiring Kinesis Data Analytics for SQL; writes Iceberg-compatible output back into S3 Tables
Amazon Kinesis Data Firehose	Buffered batch ingest from streams, logs, and SaaS connectors directly into Iceberg-formatted S3 with schema conversion
Amazon QuickSight	BI visualization layer — direct query against Athena for ad-hoc, SPICE caching against Redshift for dashboards that demand sub-second response

Architecture components

Bronze layer

Raw landing zone in S3 — append-only, partitioned by ingest date; minimal schema enforcement; the system-of-record audit trail.

Silver layer (S3 Tables)

Cleansed, conformed Iceberg tables — schema-evolved, deduplicated, partitioned for query performance; the canonical analytics dataset.

Gold layer

Curated business marts — aggregated, dimensionally modeled, optimized for specific BI workloads; can live as Iceberg tables or as Redshift native tables depending on latency budget.

Streaming path

Kinesis Data Streams → Managed Service for Apache Flink (or Firehose for simple buffer-and-write) → S3 Tables. End-to-end latency typically 30–120 seconds depending on the Flink checkpoint interval.

Catalog and governance

Glue Data Catalog is the single metadata plane; Lake Formation enforces row- and column-level access uniformly across Athena, Redshift Spectrum, and EMR. Tag-based access lets the data platform team grant by attribute, not by table-by-table grants.

Query routing

Ad-hoc and exploratory: Athena. BI dashboards with strict latency: Redshift Spectrum (or Gold-layer Redshift native tables). The decision is per-workload, not per-team.

Why this pattern

The lakehouse argument is over. The combination of open table formats (Iceberg) with a single metadata plane (Glue Data Catalog) and pluggable query engines (Athena, Spectrum, EMR, Trino) is now the AWS-native default for new analytics workloads — and S3 Tables, GA in December 2024, removed the last major operational reason to keep the warehouse and the lake apart.

The reference design below is what we deploy on most analytics modernization engagements in 2026 — replacing some combination of “raw lake on Glue + Athena” and “expensive Redshift cluster running everything” with a single Iceberg-native source of truth and the right query engine per workload.

Choosing the query engine

Workload	Athena	Redshift Spectrum	Redshift native	EMR / custom
Ad-hoc analyst queries	✅
Exploratory data-science notebooks	✅			maybe
BI dashboard, low concurrency	✅	✅
BI dashboard, high concurrency, sub-second		✅	✅
Financial close / regulatory reporting		✅	✅
Heavy custom Spark / ML feature engineering				✅
Cross-source joins (Iceberg + RDS + Redshift)		✅
Streaming aggregates with windowing				MSF

What “lakehouse done right” looks like

Three layers, one metadata plane, governed end-to-end:

Bronze (raw): append-only, date-partitioned, schema-on-read; the audit trail.
Silver (Iceberg): cleansed, conformed, schema-evolved S3 Tables; the canonical analytics dataset; what Athena and Redshift Spectrum query by default.
Gold (curated): business marts; either Iceberg with strict refresh SLAs or Redshift native tables when sub-second BI is the contract.
Catalog: single Glue Data Catalog; Lake Formation tag-based access for governance.
Streams: Kinesis Data Streams → Managed Service for Apache Flink → S3 Tables. End-to-end latency 30–120 seconds for most pipelines.

The discipline is owning the silver layer. Teams that skip silver and let every analyst query bronze directly burn through Athena spend, never converge on shared business definitions, and re-do the same joins fifty times a quarter.

Where this pattern shows up in our consulting

We deploy this stack most often in AWS Data Analytics modernization engagements — typically as an eight-to-twelve-week initiative that catalogs the existing analytics workloads, builds the silver layer on S3 Tables, migrates ad-hoc workloads to Athena, and right-sizes the remaining Redshift footprint. The cost-optimization story usually rides along; once the workloads are correctly placed, cost optimization work focuses on partition tuning, snapshot retention, and right-sizing rather than firefighting unpredictable spend.

Trade-offs

Pro

S3 Tables removes the Iceberg operational tax — automatic compaction, snapshot expiry, and unreferenced-file cleanup that previously consumed an FTE-quarter on self-managed Iceberg are now AWS's problem.

Con

S3 Tables is opinionated about Iceberg specifically — Delta Lake and Hudi workloads do not benefit and need a different evaluation. Teams already deep on Delta Lake should stay there; teams making a fresh choice should pick Iceberg.

Pro

Athena pay-per-query cuts analytics spend 40–70% versus provisioned Redshift on exploratory and ad-hoc workloads — the unpredictable queries that were keeping the warehouse oversized stop paying for idle compute.

Con

Athena cost gets worse than Redshift past a steady-state query volume threshold (typically around 5–10 concurrent dashboards refreshing on a minute). Identify the workloads that need provisioned compute and put them on Redshift Spectrum or native tables; leave the rest on Athena.

Pro

Single Glue catalog + Lake Formation policy plane eliminates the access-control drift that comes from running per-tool grants in Athena and Redshift independently — auditors get one report, not two.

Con

Lake Formation has a learning curve and a non-trivial migration story if you start from raw IAM/S3 ACLs. Plan a one-quarter focused effort to roll it out properly; do not retrofit it during a release week.

Cost notes

Lakehouse storage on S3 Tables is at S3 Standard pricing plus a small managed-table fee — typically pennies per GB-month all-in. Query compute is the dominant variable. Replacing a $30K/month Redshift cluster with a Lakehouse on Athena + a smaller Redshift cluster for dashboards routinely lands at $8K–14K/month at the same query volume; the savings come from the unpredictable workloads no longer paying for provisioned compute. Streaming ingest with MSF on a small parallelism profile lands around $400–900/month per pipeline; Firehose-only ingest is cheaper still. Cost discipline is partition strategy, file-size tuning, and Iceberg snapshot retention — the table maintenance that S3 Tables now automates is the single biggest cost driver below the query layer.

Related patterns

Event-driven microservices on AWS — EventBridge, Pipes, and the Outbox Pattern

Production event-driven architecture on AWS — EventBridge custom buses, EventBridge Pipes for the transactional outbox, SQS dead-letter queues, Step Functions for orchestration, and Lambda or Fargate workers. Decouple services without dual-writes.

Multi-Tenant SaaS on AWS — Pool, Silo, and Bridge

Production-ready multi-tenant architecture for SaaS on AWS. Covers tenant isolation models (pool, silo, bridge), per-tenant cost attribution, noisy-neighbor mitigation, and the trade-offs CTOs actually wrestle with at Series B and beyond.

Consulting engagements that deliver this pattern

AWS Data Analytics Services — Glue, Athena & QuickSight

AWS data analytics services — scalable data warehouse, ETL/ELT pipelines, real-time analytics, and business intelligence.

AWS Well-Architected Review — Free Assessment

Free AWS Well-Architected Review from FactualMinds. Identify risks, compliance gaps, and optimization opportunities.

AWS Cost Optimization & FinOps Consulting

AWS cost optimization and FinOps consulting from FactualMinds — reduce spend by 20-40% with expert right-sizing and strategy.

Deep dives

Building a Data Lake on AWS: S3 + Glue + Athena Architecture

S3 + Glue + Athena is the canonical AWS data-lake stack. It's also the one teams over-engineer the fastest. A reference architecture with the partitioning scheme, the Glue crawler config, and the Athena cost guardrails that keep query bills predictable as the lake grows.

Amazon Athena Cost Optimization: Partition Pruning, Compression, and Iceberg Tables

Athena charges per TB of data scanned. The right partitioning, compression, and table format can cut your Athena bill by 90%. Here is exactly how to do it.

How to Build a Serverless Data Pipeline with AWS Glue and Athena

AWS Glue automates ETL (Extract, Transform, Load) workflows while Athena provides serverless SQL queries. This guide covers building a complete data pipeline: ingesting raw data, transforming it, and querying at scale without managing servers.

Frequently asked questions

Why S3 Tables instead of self-managed Apache Iceberg on raw S3?

Self-managed Iceberg requires you to schedule and tune compaction, snapshot expiry, and orphan-file cleanup yourself — typically on Spark or EMR, with the cost and operational tax that implies. S3 Tables (GA December 2024) handles all three automatically and pricing converges with self-managed at most workload sizes. The trade-off is opinionation: S3 Tables is Iceberg-only and follows AWS-native conventions. For new lakehouses, S3 Tables is the default.

When should we still use Redshift native tables instead of querying via Spectrum?

When the workload is high-concurrency BI with strict sub-second latency on a stable schema — leaderboards, financial close dashboards, executive scorecards. Spectrum is excellent at joining Iceberg with hot Redshift data, but the ETL into native Redshift tables is what gets the dashboard refresh under one second. The pattern is hot-data-on-Redshift, cold-data-on-Iceberg, and Spectrum to bridge.

What replaces Kinesis Data Analytics for SQL?

Amazon Managed Service for Apache Flink (MSF). The SQL-flavored Kinesis Data Analytics is retiring January 27, 2026; new streaming-transform workloads should be designed on MSF (Studio for SQL-style notebook authoring, Apps for production deployments). Existing KDA-SQL applications should be migrated before the retirement date — the migration is straightforward but not trivial.

How does this pattern interact with our existing Redshift warehouse?

It complements rather than replaces. The migration we typically run is: catalog the existing Redshift workloads by query pattern; move ad-hoc and exploratory queries to Athena against the new S3 Tables silver layer; move ETL output from Redshift native tables to S3 Tables; keep the Redshift cluster sized for the BI dashboards that genuinely need provisioned compute. Most teams end up with a smaller Redshift footprint and a much cheaper analytics bill — without losing dashboard performance.

How do we handle GDPR right-to-erasure on Iceberg tables?

Iceberg's row-level delete operation makes erasure a managed operation — issue a DELETE, snapshot expiry eventually removes the underlying files. The audit trail lives in the Iceberg snapshot history; Lake Formation row filters and column masks handle access during the retention window. Compared to legacy data-lake approaches (rewriting partitions or maintaining a soft-delete column), Iceberg-native deletion is dramatically simpler.

Where does AI fit in?

Two places. First, S3 Tables is a clean source for embeddings — Bedrock Knowledge Bases or a custom pipeline can pull silver-layer Iceberg data into a vector store for RAG without bespoke ETL. Second, Bedrock + Athena lets analysts ask natural-language questions over the lakehouse with the right grounding contract. Both fit the lakehouse cleanly; neither requires a parallel AI data store.

Want this pattern deployed end-to-end?

Our team builds these patterns in production for SaaS, healthcare, fintech, and enterprise customers. Tell us your constraints and we'll scope the engagement.

Talk to AWS Experts

See more patterns