Skip to main content

One platform, 40–70% lower spend

Lakehouse on AWS — S3 Tables, Iceberg, Athena, and Redshift Spectrum

Replace a $30K/month Redshift cluster with a lakehouse at $8–14K/month — S3 Tables (managed Iceberg) for the system of record, Athena for ad-hoc, Redshift Spectrum where you still need joins. The AWS-native default for unified analytics in 2026.

Last updated: May 1, 2026Author: FactualMinds AWS ArchitectsReviewed by: Palaniappan P · AWS Solutions Architect — Professional

Problem

Analytics teams on AWS run both a lake (raw S3 + Glue + Athena) and a warehouse (Redshift) in parallel — duplicated storage, duplicated ETL, two on-call rotations, and a constant argument over which one is the source of truth. The cost line on the duplicated stack creeps past $30K/month before anyone notices, and BI dashboards still take 12 seconds to load because nobody owns the curated layer.

Solution

Adopt the lakehouse shape with S3 Tables (managed Apache Iceberg) as the system of record. Athena queries the tables directly for ad-hoc and lightweight BI; Redshift Spectrum joins them with hot warehouse data for sub-second dashboards; Glue Data Catalog is the single metadata plane; Lake Formation enforces row- and column-level access. For streaming ingest, use Managed Service for Apache Flink (NOT the SQL-flavored Kinesis Data Analytics, which is retiring January 2026) reading from Kinesis Data Streams and writing back into S3 Tables.

AWS services in this pattern

Service Role
Amazon S3 Tables Managed Apache Iceberg storage tier (GA Dec 2024) — automatic compaction, snapshot expiry, and unreferenced-file cleanup; the system of record for the lakehouse
AWS Glue Data Catalog Single metadata plane — every Athena query, Redshift Spectrum external table, and EMR job reads from the same catalog
Amazon Athena Serverless SQL query engine for ad-hoc analytics and lightweight BI directly against S3 Tables; pay-per-query economics for unpredictable workloads
Amazon Redshift Spectrum External-table query engine that joins Iceberg tables on S3 with hot warehouse data; the path to sub-second BI dashboards on top of the lakehouse
AWS Lake Formation Centralized access control — row-level security, column masking, tag-based access on top of the Glue Catalog; the single permission plane for both Athena and Redshift Spectrum
AWS Glue (ETL) Serverless Spark ETL for batch ingestion, schema evolution handling, and data-quality enforcement before writes land in S3 Tables
Amazon Kinesis Data Streams Durable event stream for change-data-capture, application events, and IoT ingest
Amazon Managed Service for Apache Flink Streaming transform layer — replaces the retiring Kinesis Data Analytics for SQL; writes Iceberg-compatible output back into S3 Tables
Amazon Kinesis Data Firehose Buffered batch ingest from streams, logs, and SaaS connectors directly into Iceberg-formatted S3 with schema conversion
Amazon QuickSight BI visualization layer — direct query against Athena for ad-hoc, SPICE caching against Redshift for dashboards that demand sub-second response

Architecture components

Bronze layer

Raw landing zone in S3 — append-only, partitioned by ingest date; minimal schema enforcement; the system-of-record audit trail.

Silver layer (S3 Tables)

Cleansed, conformed Iceberg tables — schema-evolved, deduplicated, partitioned for query performance; the canonical analytics dataset.

Gold layer

Curated business marts — aggregated, dimensionally modeled, optimized for specific BI workloads; can live as Iceberg tables or as Redshift native tables depending on latency budget.

Streaming path

Kinesis Data Streams → Managed Service for Apache Flink (or Firehose for simple buffer-and-write) → S3 Tables. End-to-end latency typically 30–120 seconds depending on the Flink checkpoint interval.

Catalog and governance

Glue Data Catalog is the single metadata plane; Lake Formation enforces row- and column-level access uniformly across Athena, Redshift Spectrum, and EMR. Tag-based access lets the data platform team grant by attribute, not by table-by-table grants.

Query routing

Ad-hoc and exploratory: Athena. BI dashboards with strict latency: Redshift Spectrum (or Gold-layer Redshift native tables). The decision is per-workload, not per-team.

Why this pattern

The lakehouse argument is over. The combination of open table formats (Iceberg) with a single metadata plane (Glue Data Catalog) and pluggable query engines (Athena, Spectrum, EMR, Trino) is now the AWS-native default for new analytics workloads — and S3 Tables, GA in December 2024, removed the last major operational reason to keep the warehouse and the lake apart.

The reference design below is what we deploy on most analytics modernization engagements in 2026 — replacing some combination of “raw lake on Glue + Athena” and “expensive Redshift cluster running everything” with a single Iceberg-native source of truth and the right query engine per workload.

Choosing the query engine

WorkloadAthenaRedshift SpectrumRedshift nativeEMR / custom
Ad-hoc analyst queries
Exploratory data-science notebooksmaybe
BI dashboard, low concurrency
BI dashboard, high concurrency, sub-second
Financial close / regulatory reporting
Heavy custom Spark / ML feature engineering
Cross-source joins (Iceberg + RDS + Redshift)
Streaming aggregates with windowingMSF

What “lakehouse done right” looks like

Three layers, one metadata plane, governed end-to-end:

The discipline is owning the silver layer. Teams that skip silver and let every analyst query bronze directly burn through Athena spend, never converge on shared business definitions, and re-do the same joins fifty times a quarter.

Where this pattern shows up in our consulting

We deploy this stack most often in AWS Data Analytics modernization engagements — typically as an eight-to-twelve-week initiative that catalogs the existing analytics workloads, builds the silver layer on S3 Tables, migrates ad-hoc workloads to Athena, and right-sizes the remaining Redshift footprint. The cost-optimization story usually rides along; once the workloads are correctly placed, cost optimization work focuses on partition tuning, snapshot retention, and right-sizing rather than firefighting unpredictable spend.

Trade-offs

Pro

S3 Tables removes the Iceberg operational tax — automatic compaction, snapshot expiry, and unreferenced-file cleanup that previously consumed an FTE-quarter on self-managed Iceberg are now AWS's problem.

Con

S3 Tables is opinionated about Iceberg specifically — Delta Lake and Hudi workloads do not benefit and need a different evaluation. Teams already deep on Delta Lake should stay there; teams making a fresh choice should pick Iceberg.

Pro

Athena pay-per-query cuts analytics spend 40–70% versus provisioned Redshift on exploratory and ad-hoc workloads — the unpredictable queries that were keeping the warehouse oversized stop paying for idle compute.

Con

Athena cost gets worse than Redshift past a steady-state query volume threshold (typically around 5–10 concurrent dashboards refreshing on a minute). Identify the workloads that need provisioned compute and put them on Redshift Spectrum or native tables; leave the rest on Athena.

Pro

Single Glue catalog + Lake Formation policy plane eliminates the access-control drift that comes from running per-tool grants in Athena and Redshift independently — auditors get one report, not two.

Con

Lake Formation has a learning curve and a non-trivial migration story if you start from raw IAM/S3 ACLs. Plan a one-quarter focused effort to roll it out properly; do not retrofit it during a release week.

Cost notes

Lakehouse storage on S3 Tables is at S3 Standard pricing plus a small managed-table fee — typically pennies per GB-month all-in. Query compute is the dominant variable. Replacing a $30K/month Redshift cluster with a Lakehouse on Athena + a smaller Redshift cluster for dashboards routinely lands at $8K–14K/month at the same query volume; the savings come from the unpredictable workloads no longer paying for provisioned compute. Streaming ingest with MSF on a small parallelism profile lands around $400–900/month per pipeline; Firehose-only ingest is cheaper still. Cost discipline is partition strategy, file-size tuning, and Iceberg snapshot retention — the table maintenance that S3 Tables now automates is the single biggest cost driver below the query layer.

Related patterns

Consulting engagements that deliver this pattern

Deep dives

Frequently asked questions

Why S3 Tables instead of self-managed Apache Iceberg on raw S3?

Self-managed Iceberg requires you to schedule and tune compaction, snapshot expiry, and orphan-file cleanup yourself — typically on Spark or EMR, with the cost and operational tax that implies. S3 Tables (GA December 2024) handles all three automatically and pricing converges with self-managed at most workload sizes. The trade-off is opinionation: S3 Tables is Iceberg-only and follows AWS-native conventions. For new lakehouses, S3 Tables is the default.

When should we still use Redshift native tables instead of querying via Spectrum?

When the workload is high-concurrency BI with strict sub-second latency on a stable schema — leaderboards, financial close dashboards, executive scorecards. Spectrum is excellent at joining Iceberg with hot Redshift data, but the ETL into native Redshift tables is what gets the dashboard refresh under one second. The pattern is hot-data-on-Redshift, cold-data-on-Iceberg, and Spectrum to bridge.

What replaces Kinesis Data Analytics for SQL?

Amazon Managed Service for Apache Flink (MSF). The SQL-flavored Kinesis Data Analytics is retiring January 27, 2026; new streaming-transform workloads should be designed on MSF (Studio for SQL-style notebook authoring, Apps for production deployments). Existing KDA-SQL applications should be migrated before the retirement date — the migration is straightforward but not trivial.

How does this pattern interact with our existing Redshift warehouse?

It complements rather than replaces. The migration we typically run is: catalog the existing Redshift workloads by query pattern; move ad-hoc and exploratory queries to Athena against the new S3 Tables silver layer; move ETL output from Redshift native tables to S3 Tables; keep the Redshift cluster sized for the BI dashboards that genuinely need provisioned compute. Most teams end up with a smaller Redshift footprint and a much cheaper analytics bill — without losing dashboard performance.

How do we handle GDPR right-to-erasure on Iceberg tables?

Iceberg's row-level delete operation makes erasure a managed operation — issue a DELETE, snapshot expiry eventually removes the underlying files. The audit trail lives in the Iceberg snapshot history; Lake Formation row filters and column masks handle access during the retention window. Compared to legacy data-lake approaches (rewriting partitions or maintaining a soft-delete column), Iceberg-native deletion is dramatically simpler.

Where does AI fit in?

Two places. First, S3 Tables is a clean source for embeddings — Bedrock Knowledge Bases or a custom pipeline can pull silver-layer Iceberg data into a vector store for RAG without bespoke ETL. Second, Bedrock + Athena lets analysts ask natural-language questions over the lakehouse with the right grounding contract. Both fit the lakehouse cleanly; neither requires a parallel AI data store.

Want this pattern deployed end-to-end?

Our team builds these patterns in production for SaaS, healthcare, fintech, and enterprise customers. Tell us your constraints and we'll scope the engagement.