---
title: How to Build a Serverless Data Pipeline with AWS Glue and Athena
description: AWS Glue automates ETL (Extract, Transform, Load) workflows while Athena provides serverless SQL queries. This guide covers building a complete data pipeline: ingesting raw data, transforming it, and querying at scale without managing servers.
url: https://www.factualminds.com/blog/how-to-build-serverless-data-pipeline-glue-athena/
datePublished: 2026-04-03T00:00:00.000Z
dateModified: 2026-04-16T00:00:00.000Z
author: Palaniappan P
category: Data & Analytics
tags: how-to-guide, glue, athena, etl, data-pipeline, aws
---

# How to Build a Serverless Data Pipeline with AWS Glue and Athena

> AWS Glue automates ETL (Extract, Transform, Load) workflows while Athena provides serverless SQL queries. This guide covers building a complete data pipeline: ingesting raw data, transforming it, and querying at scale without managing servers.

AWS Glue automates data pipelines: extract raw data (logs, CSVs, databases), transform it (clean, enrich, aggregate), and load it into a data warehouse for analysis. Athena provides serverless SQL to query that data without spinning up databases.

Together, Glue + Athena eliminate infrastructure management. You define transformations, Glue runs them on a schedule, and Athena surfaces insights via SQL — all serverless.

This guide covers building a complete pipeline: ingesting data, transforming it with Glue jobs, and querying with Athena.

> **Building Data Pipelines on AWS?** FactualMinds helps organizations architect serverless data infrastructure for analytics at scale. [See our AWS data services](/services/aws-data-analytics/) or [talk to our team](/contact-us/).

## Step 1: Understand the Glue + Athena Architecture

```
Raw Data (S3, RDS, APIs)
  ↓
AWS Glue Job (scheduled ETL)
  → Extract (read from source)
  → Transform (clean, dedupe, enrich)
  → Load (write to S3 in Parquet format)
  ↓
Athena (SQL queries on S3 data)
  → Run analytics
  → Generate reports
```

**Key components:**

- **Glue Crawler**: Auto-discovers schema from raw data
- **Glue Job**: Executes PySpark ETL code
- **Glue Data Catalog**: Metadata about your tables
- **Athena**: SQL engine for querying S3 data
- **Partitioning**: Organize S3 data by date/region for fast queries

## Step 2: Set Up S3 and Data Catalog

Create S3 buckets for raw data, transformed data, and Athena results:

```bash
# Raw data (source)
aws s3api create-bucket --bucket raw-data-bucket --region us-east-1

# Transformed data (processed)
aws s3api create-bucket --bucket transformed-data-bucket --region us-east-1

# Athena results (query output)
aws s3api create-bucket --bucket athena-results-bucket --region us-east-1

# Enable versioning (optional but recommended)
aws s3api put-bucket-versioning --bucket raw-data-bucket --versioning-configuration Status=Enabled
```

## Step 3: Crawl Raw Data with Glue Crawler

A Glue Crawler automatically discovers schema from your raw data:

1. Go to **AWS Glue** → **Crawlers**
2. Click **Create crawler**
3. **Crawler name**: `raw-data-crawler`
4. **Data sources**:
   - Type: S3
   - S3 path: `s3://raw-data-bucket/logs/`
5. **IAM role**: Create or select a role with S3 access
6. **Output configuration**:
   - Database: Create new database `raw_data`
   - Table prefix: `raw_` (tables will be named `raw_logs`, etc.)
7. **Scheduling**:
   - Frequency: Daily at 2 AM
   - (or manual for testing)
8. Click **Create crawler**

Run the crawler:

```bash
aws glue start-crawler --name raw-data-crawler
```

Check results in **Glue** → **Databases** → `raw_data` → **Tables**. The crawler will have created a table with schema inferred from your data.

## Step 4: Create a Glue ETL Job

Create a Glue job to transform raw data:

1. Go to **AWS Glue** → **Jobs**
2. Click **Create job**
3. **Job name**: `transform-logs`
4. **Type**: Spark (for large datasets) or Python Shell (for small ones)
5. **IAM role**: Select the role with S3 access
6. **Script path**: `s3://glue-scripts/transform-logs.py`
7. Click **Create job**

Write the ETL script:

```python
# transform-logs.py
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from pyspark.sql.functions import col, to_timestamp, year, month, day

args = sys.argv[1:]
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args[0], args)

# Read raw data from Glue Data Catalog
raw_logs = glueContext.create_dynamic_frame.from_table(
    database="raw_data",
    table_name="raw_logs"
)

# Convert to DataFrame for transformations
df = raw_logs.toDF()

# Transformations
df = df.filter(col("event_type").isNotNull())  # Remove nulls
df = df.withColumn("timestamp", to_timestamp(col("timestamp")))  # Parse date
df = df.dropDuplicates()  # Dedupe
df = df.withColumn("year", year(col("timestamp")))
df = df.withColumn("month", month(col("timestamp")))
df = df.withColumn("day", day(col("timestamp")))

# Write transformed data to S3 partitioned by date
df.repartition(col("year"), col("month"), col("day")).write \
    .mode("overwrite") \
    .parquet("s3://transformed-data-bucket/logs/")

job.commit()
```

Upload the script to S3:

```bash
aws s3 cp transform-logs.py s3://glue-scripts/
```

## Step 5: Run Glue Job on Schedule

1. In the Glue job, go to **Job details**
2. **Trigger**: Set to trigger daily at 3 AM (after crawler runs at 2 AM)
3. **Max retries**: 1
4. Click **Save**

Or trigger manually:

```bash
aws glue start-job-run --job-name transform-logs
```

Monitor the job:

```bash
aws glue get-job-run --job-name transform-logs --run-id xxxxx
```

## Step 6: Create Athena Table from Transformed Data

Once Glue writes transformed data to S3, create an Athena table to query it.

Option A: Glue Crawler (auto-detect schema)

Run a second crawler on the transformed data bucket:

```bash
aws glue start-crawler --name transformed-data-crawler
```

Option B: Manual Athena SQL

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS transformed_logs (
    event_type STRING,
    user_id STRING,
    timestamp BIGINT,
    message STRING
)
PARTITIONED BY (
    year INT,
    month INT,
    day INT
)
STORED AS PARQUET
LOCATION 's3://transformed-data-bucket/logs/'
```

Add partitions:

```sql
ALTER TABLE transformed_logs ADD IF NOT EXISTS
    PARTITION (year=2026, month=4, day=3)
        LOCATION 's3://transformed-data-bucket/logs/year=2026/month=4/day=3/'
```

## Step 7: Query Data with Athena

Go to **Amazon Athena** → **Query editor**:

```sql
-- Count events by type
SELECT event_type, COUNT(*) as count
FROM transformed_logs
WHERE year = 2026 AND month = 4
GROUP BY event_type
ORDER BY count DESC
```

```sql
-- Find events for a specific user
SELECT timestamp, event_type, message
FROM transformed_logs
WHERE user_id = 'user-12345' AND year = 2026
ORDER BY timestamp DESC
```

```sql
-- Daily aggregation
SELECT year, month, day, COUNT(*) as total_events
FROM transformed_logs
WHERE year = 2026
GROUP BY year, month, day
ORDER BY year, month, day
```

## Step 8: Optimize for Cost and Performance

### Partition Strategy

Organize data in S3 with partitions so Athena only scans relevant data:

```
s3://transformed-data-bucket/logs/
├── year=2026/
│   ├── month=1/
│   │   ├── day=1/
│   │   ├── day=2/
│   │   └── ...
│   ├── month=2/
│   └── ...
└── year=2025/
```

Query with partition filters to reduce scans:

```sql
-- Scans ONLY April 2026 data
SELECT COUNT(*) FROM transformed_logs
WHERE year = 2026 AND month = 4

-- Scans ALL data (100x more expensive!)
SELECT COUNT(*) FROM transformed_logs
```

### Compress Data

Use Parquet format with compression:

```python
# In Glue job
df.write.mode("overwrite").option("compression", "snappy").parquet("s3://bucket/data/")
```

Parquet + Snappy compression = 50-80% smaller than CSV/JSON.

### Use Columnar Format

Query only required columns:

```sql
-- Efficient: scans only 3 columns
SELECT user_id, event_type, timestamp FROM logs

-- Inefficient: scans all columns
SELECT * FROM logs
```

## Step 9: Automate Reporting

Create a scheduled Athena query that runs daily and saves results:

```python
# lambda_function.py
import boto3
from datetime import datetime

athena = boto3.client('athena')

def run_daily_report():
    query = """
    SELECT event_type, COUNT(*) as count
    FROM transformed_logs
    WHERE year = 2026 AND month = 4 AND day = 3
    GROUP BY event_type
    """

    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'default'},
        ResultConfiguration={'OutputLocation': 's3://athena-results-bucket/'}
    )

    print(f"Query started: {response['QueryExecutionId']}")

def lambda_handler(event, context):
    run_daily_report()
    return {'statusCode': 200}
```

Schedule via CloudWatch Events:

```bash
aws events put-rule --name daily-athena-report --schedule-expression "cron(0 8 * * ? *)"
aws events put-targets --rule daily-athena-report --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:daily-report"
```

## Step 10: Production Patterns

### Pattern 1: Multi-Stage Pipeline

```
Raw → Staging (light transformation)
    ↓
Staging → Curated (business logic)
    ↓
Curated → Analytics (final queries)
```

Each stage has its own Glue job and S3 location. This allows:

- Reusing staging data for multiple downstream jobs
- Easy debugging (inspect data at each stage)
- Auditing transformation steps

### Pattern 2: Quality Checks

Add data quality validation in Glue jobs:

```python
# Count nulls
null_count = df.filter(col("user_id").isNull()).count()
if null_count > 0:
    print(f"WARNING: {null_count} rows with null user_id")
    # Send SNS alert

# Check for unexpected values
invalid_types = df.filter(~df.event_type.isin(['login', 'logout', 'click'])).count()
if invalid_types > 0:
    print(f"ERROR: {invalid_types} rows with invalid event_type")
    # Fail job
    sys.exit(1)

# If we reach here, data is valid
df.write.mode("overwrite").parquet("s3://bucket/data/")
```

### Pattern 3: Incremental Processing

Avoid reprocessing the same data:

```python
# Only process data from last 24 hours
from datetime import datetime, timedelta

cutoff_time = (datetime.now() - timedelta(days=1)).timestamp()
df = df.filter(col("timestamp") > cutoff_time)
```

This reduces processing time and cost on large datasets.

## Common Mistakes to Avoid

1. **Not partitioning data**
   - Unpartitioned: Every query scans all data (expensive)
   - Partitioned: Queries scan only relevant data (cheap)
   - Always partition by date, region, or other common filters

2. **Using CSV instead of Parquet**
   - CSV: Slow, uncompressed, row-oriented
   - Parquet: Fast, compressed, columnar
   - Convert to Parquet in Glue jobs

3. **Not adding partition metadata**
   - Athena needs partition info to optimize queries
   - Use `MSCK REPAIR TABLE` to add partitions:
     ```sql
     MSCK REPAIR TABLE my_table
     ```

4. **Querying without WHERE clauses**
   - `SELECT * FROM logs` scans all data
   - Always add partition filters: `WHERE year=2026 AND month=4`

5. **Ignoring job failures**
   - Set up SNS alerts for failed Glue jobs
   - Monitor job logs in CloudWatch

## Cost Estimation

For 1GB of raw data ingested daily:

| Component                                     | Cost         |
| --------------------------------------------- | ------------ |
| Glue (2 DPUs, 30 mins/day)                    | ~$8/month    |
| S3 storage (30GB/month retention)             | ~$0.69/month |
| Athena (50 queries/month, 100MB scanned each) | ~$31/month   |
| **Total**                                     | ~$40/month   |

## Next Steps

1. Create S3 buckets (5 mins)
2. Run Glue Crawler on raw data (15 mins)
3. Write a simple Glue job (1 hour)
4. Create Athena table and run a query (30 mins)
5. Set up daily job schedule (15 mins)
6. [Talk to FactualMinds](/contact-us/) if you need help scaling to production or optimizing costs for large datasets

## FAQ

### What is the difference between AWS Glue and Lambda for ETL?
Lambda is compute — you write code and it runs. Glue is a fully managed ETL service with built-in job scheduling, error handling, and data catalog features. For simple transformations (1-2 scripts), Lambda is faster to implement. For complex ETL pipelines (10+ transformations, scheduling, data lineage), Glue is easier: you define jobs in the console, Glue handles retries and logging. Glue also includes the Data Catalog (metadata about your data), which Lambda doesn't provide.

### How much does Glue cost?
Glue charges $0.44 per DPU-hour (Data Processing Unit, ~4GB RAM). A 10-minute job using 2 DPUs = 0.33 hours × 2 DPU × $0.44 = ~$0.29. For daily ETL jobs: ~$8/month. Athena charges $6.25 per TB of data scanned. A 100GB analytics query = 0.1 TB × $6.25 = $0.625. For 100 analytics queries/month = ~$60/month. Combined: ~$70/month for typical ETL + analytics workload.

### How do I partition data for better Athena performance?
Partition Athena tables by date, region, or other common filters. Instead of one table with 1M rows, partition by year/month so Athena only scans relevant partitions. Example: logs/year=2026/month=04/day=03/data.parquet. Athena scans only data matching your query (e.g., "WHERE year=2026 AND month=04"). Without partitioning, Athena must scan all 1M rows. With partitioning, it scans 10K rows. Cost savings: 100x.

### Can I use Glue with data in real-time?
No, Glue is batch ETL (scheduled jobs that run hourly, daily, etc.). For real-time data, use: (1) Kinesis (streaming), (2) Lambda triggers, (3) DMS (database migration service for CDC — change data capture). Glue is best for historical data ingestion or periodic batch transformations (daily data cleanup, weekly aggregations).

### What format should I use for Athena queries — Parquet, CSV, or JSON?
Parquet is best for analytics: (1) Columnar format (Athena only reads relevant columns), (2) Compressed (50-80% smaller than CSV), (3) Schema-aware (data types preserved). CSV/JSON are row-based and less efficient. For 1GB of data: CSV = 1GB scanned, Parquet = 200-400MB scanned. Glue can convert CSV to Parquet as a transformation step, so raw data can be CSV and Glue converts to Parquet for analytics.

---

*Source: https://www.factualminds.com/blog/how-to-build-serverless-data-pipeline-glue-athena/*