AWS Glossary
Amazon Redshift
Fully managed cloud data warehouse for running fast SQL analytics on petabyte-scale datasets.
Definition
Amazon Redshift is a fully managed cloud data warehouse optimized for running complex SQL analytics on datasets ranging from gigabytes to petabytes. Redshift uses columnar storage, massively parallel processing (MPP), and automatic compression to execute analytical queries dramatically faster than row-oriented databases for typical analytical workloads. It is used for business intelligence, reporting, and large-scale data analysis on AWS.
How Redshift Works
Columnar Storage:
- Data is stored by column, not by row: a query that reads 3 of 100 columns touches only roughly 3% of the stored data
- Redshift automatically selects a compression encoding suited to each column's data type
- Dramatically reduces I/O for typical analytical queries (SUM, AVG, GROUP BY, JOINs)
Massively Parallel Processing (MPP):
- Data is distributed across multiple compute nodes
- Each node processes its slice of data in parallel
- Leader node receives queries, compiles an execution plan, and coordinates worker nodes
- Scale out by adding nodes; query throughput scales near-linearly when data is evenly distributed
Sort Keys:
- Define how data is physically sorted on disk
- Queries filtering on sort key columns skip unsorted blocks (zone maps)
- Choose sort key based on most common WHERE clause columns
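Sort and distribution keys are declared when the table is created. A minimal sketch (the `sales` table and its columns are hypothetical):

```sql
-- Hypothetical fact table: distribute on the common join key,
-- sort on the column most queries filter by.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- collocates rows sharing a customer on one node
SORTKEY (sale_date);    -- zone maps can skip blocks on date filters
```

A query such as `SELECT SUM(amount) FROM sales WHERE sale_date >= '2024-01-01'` can then skip every block whose zone map falls entirely outside the date range.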
Redshift Serverless
Redshift Serverless automatically provisions and scales capacity based on query demand:
- No cluster to manage — just connect and query
- Pay per RPU-hour (Redshift Processing Unit) for actual query time
- Scales from 8 to 512 RPUs in seconds
- Compute automatically pauses when idle, so cost drops to storage-only
- Best for: intermittent analytics, variable workloads, teams without DBA resources
RA3 Instances (provisioned alternative):
- Compute and storage scale independently; data lives in Redshift Managed Storage (RMS), backed by Amazon S3
- Best for: high-frequency queries where predictable performance and reserved-node pricing discounts matter
Redshift Spectrum
Query data directly in S3 without loading it into Redshift:
- Define external tables pointing to S3 (Parquet, ORC, CSV, JSON, Avro)
- Query S3 data and Redshift native tables in the same SQL statement
- Compute scales automatically (Spectrum nodes separate from cluster nodes)
- Used for: data lake queries, historical data not worth loading into Redshift, raw log analysis
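A Spectrum setup defines an external schema backed by the AWS Glue Data Catalog, then external tables over S3 objects. A sketch, with the bucket, IAM role ARN, database, and table names as placeholder assumptions:

```sql
-- One-time: external schema pointing at the Glue Data Catalog
-- (IAM role and database name are hypothetical).
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'analytics_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over Parquet files in S3; no data is loaded.
CREATE EXTERNAL TABLE spectrum.access_logs (
    request_time TIMESTAMP,
    status_code  INT,
    bytes_sent   BIGINT
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/access-logs/';

-- Query the S3 data like any other table.
SELECT status_code, COUNT(*) AS hits
FROM spectrum.access_logs
WHERE request_time >= '2024-01-01'
GROUP BY status_code;
```

The same statement can also join `spectrum.access_logs` against native Redshift tables, which is how data lake and warehouse data are combined in one query.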
Integrations
Amazon S3 / Data Lake:
- COPY command loads data from S3 into Redshift in parallel (fastest ingestion method)
- UNLOAD command exports query results to S3 as Parquet, CSV, etc.
- Redshift Spectrum queries S3 without loading
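The COPY/UNLOAD round trip might look like this (bucket paths and the IAM role ARN are placeholder assumptions):

```sql
-- Parallel load from S3 into a native table.
COPY sales
FROM 's3://my-example-bucket/incoming/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

-- Export query results back to S3 as Parquet.
UNLOAD ('SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id')
TO 's3://my-example-bucket/exports/sales_by_customer_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```

COPY splits the load across node slices automatically, which is why many smaller input files usually load faster than one large file.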
Amazon QuickSight:
- Direct Redshift integration for BI dashboards and reports
- SPICE in-memory caching for fast dashboard loads without hitting Redshift on every query
AWS Glue / dbt:
- AWS Glue Crawlers catalog Redshift tables in the AWS Glue Data Catalog
- dbt (data build tool) is widely used for Redshift transformation pipelines
Streaming Ingestion:
- Stream data from Amazon Kinesis Data Streams or MSK (Kafka) directly into Redshift
- Near-real-time analytics without a landing zone in S3
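Streaming ingestion is configured with an external schema over the stream plus an auto-refreshing materialized view; the stream name and IAM role below are hypothetical:

```sql
-- External schema mapped to Kinesis Data Streams.
CREATE EXTERNAL SCHEMA kinesis
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/MyStreamingRole';

-- Auto-refreshing materialized view over the stream;
-- kinesis_data holds the raw record payload.
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis."my-click-stream";
```

Queries against `clickstream_mv` then see near-real-time data with no S3 landing step in between.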
Common Mistakes
Mistake 1: Not choosing the right distribution style. If large tables are not distributed to collocate join keys, Redshift redistributes data across nodes at query time — causing significant performance overhead. Use DISTKEY on frequently joined columns.
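One way to spot this problem, assuming hypothetical `sales` and `customers` tables, is to check the join strategy in the query plan:

```sql
EXPLAIN
SELECT c.region, SUM(s.amount)
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.region;
-- In the plan output, DS_DIST_NONE means the join is collocated;
-- DS_BCAST_INNER or DS_DIST_BOTH means Redshift is broadcasting or
-- redistributing data across nodes at query time.
```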
Mistake 2: Using Redshift for OLTP workloads. Redshift is optimized for read-heavy analytics, not transactional operations. Use RDS, Aurora, or DynamoDB for application transactions; use Redshift for analytical queries over that data.
Mistake 3: Skipping VACUUM and ANALYZE. Redshift uses soft deletes (deleted rows marked, not removed). Run VACUUM to reclaim space and re-sort data; run ANALYZE to update statistics. Enable automatic VACUUM and ANALYZE to avoid manual maintenance.
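Manual maintenance, again with a hypothetical `sales` table, looks like:

```sql
VACUUM FULL sales;   -- reclaim space from deleted rows and re-sort
ANALYZE sales;       -- refresh planner statistics

-- Check how much of the table is unsorted or has stale stats.
SELECT "table", unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'sales';
```

With automatic VACUUM and ANALYZE enabled, these commands are mainly useful for ad hoc cleanup after large deletes or bulk loads.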
Related AWS Services
- Amazon S3: Primary data lake source for Redshift Spectrum and COPY ingestion
- AWS Glue: ETL service for loading and transforming data into Redshift
- Amazon QuickSight: BI and visualization layer on top of Redshift
- Amazon Kinesis Data Streams: Real-time data streaming into Redshift
Related FactualMinds Content
- AWS Data Analytics Services
- AWS Cloud Migration — Netezza, Teradata, and on-prem warehouse migrations to Redshift with parallel-run validation
- FinOps Consulting
- AWS Well-Architected Review Checklist
Need Help with This Topic?
Our AWS experts can help you implement and optimize these concepts for your organization.
