Amazon DataZone: Enterprise Data Governance and Catalog for Modern AWS Data Platforms

Analytics · Palaniappan P · 12 min read

Quick summary: Amazon DataZone adds business data catalog, project-based access, and data subscriptions to AWS data platforms. The governance layer that Glue Data Catalog was never meant to be.


Most AWS data platforms have a metadata problem that nobody talks about in architecture reviews. AWS Glue Data Catalog holds your table schemas, partition locations, and S3 paths. Lake Formation enforces who can query what. But neither answers the questions that block actual data usage in a large organization: What does this table actually represent? Who owns it? Is the data fresh? How do I get access to it? And who approved the last 12 access requests?

These are governance questions, not technical questions. And they are exactly what Amazon DataZone was built to answer.

DataZone, which reached GA in October 2023 and has expanded significantly through 2024–2025, is a data governance and business catalog service that layers on top of your existing AWS data infrastructure. It does not replace Glue Data Catalog or Lake Formation — it orchestrates and enriches them with the business context and access workflow that enterprise data teams actually need.

This post covers the DataZone architecture, the three-way comparison with Glue and Lake Formation, the producer/consumer subscription model, SageMaker Unified Studio integration, and a phased implementation roadmap for teams migrating from ad-hoc data access patterns.

DataZone vs. Glue Data Catalog vs. Lake Formation

Before going deeper, it is worth being precise about what each service is, because the three are commonly conflated.

| Dimension | Glue Data Catalog | AWS Lake Formation | Amazon DataZone |
|---|---|---|---|
| Primary purpose | Technical metadata store | Fine-grained access control | Governance workflow + business catalog |
| What it stores | Table schemas, partitions, S3 locations, Glue job definitions | IAM policies scoped to tables/columns/rows | Business descriptions, glossary terms, owners, quality scores, lineage, subscription history |
| Who manages it | Data engineers | Data platform/security teams | Data governance teams, data owners |
| Access model | IAM role-based (no workflow) | Column/row-level permissions via grants | Request → approval → automatic provisioning workflow |
| Data discovery | None (technical catalog only) | None | Full-text search across all registered assets with business context |
| Replaces the others? | No | No | No (uses both as underlying infrastructure) |

The important takeaway: DataZone sits above Glue Data Catalog and Lake Formation in the stack. When a DataZone subscription is approved, DataZone calls Lake Formation APIs automatically to provision the actual permission grant. The Glue Data Catalog remains the authoritative technical metadata store; DataZone imports from it and adds business metadata on top.

Here is what the layered stack looks like in practice:

┌─────────────────────────────────────────┐
│          Amazon DataZone                │
│   (business catalog, governance,        │
│    subscription workflow, lineage)      │
├─────────────────────────────────────────┤
│          AWS Lake Formation             │
│   (column/row-level access control,     │
│    permission grants to Glue tables)    │
├─────────────────────────────────────────┤
│       AWS Glue Data Catalog             │
│   (table schemas, S3 locations,         │
│    partitions, Glue job definitions)    │
├─────────────────────────────────────────┤
│   Storage + Compute                     │
│   (S3, Glue ETL, Athena, Redshift,      │
│    RDS, EMR)                            │
└─────────────────────────────────────────┘

Domain, Project, and Environment Model

DataZone’s organizational hierarchy has four levels. Understanding these before setting up your first domain saves significant rework later.

Domain is the top-level organizational boundary — typically mapped to a company, a major business unit, or a regulatory boundary that requires separate data governance. All assets, projects, and business glossary terms live within a domain. Domains are isolated from each other; there is no cross-domain asset discovery by default.

Project is a team or initiative workspace within a domain. Projects are the unit of data ownership and access control. A Data Engineering team might have one project for raw data ingestion pipelines; a Finance Analytics team has a separate project to consume that data. Asset access is granted between projects, not between individual users.

Environment maps a project to a specific AWS account and region. A project can have multiple environments (e.g., a dev environment in account 123456789012 us-east-1, and a prod environment in account 987654321098 us-east-1). Environments are where the actual data assets live — S3 buckets, Glue tables, Redshift schemas.

Asset is any registered data resource: an S3-backed Glue table, a Redshift schema, an RDS table, or a SageMaker ML model artifact. Assets have both technical metadata (imported from Glue Data Catalog or Redshift) and business metadata (descriptions, glossary terms, owners, quality scores) you add in DataZone.

Here is a realistic enterprise domain setup for a financial services company:

FinServCo Domain

├── Data Engineering Project
│   ├── Production Environment (AWS Account: data-platform-prod)
│   │   ├── raw_transactions (Glue table asset)
│   │   ├── enriched_customer_360 (Glue table asset)
│   │   └── settlements_daily (Redshift schema asset)
│   └── Dev Environment (AWS Account: data-platform-dev)

├── Risk Analytics Project
│   └── Production Environment (AWS Account: risk-analytics-prod)
│       └── (consumes enriched_customer_360 via subscription)

└── ML Fraud Team Project
    └── Production Environment (AWS Account: ml-platform-prod)
        └── (consumes raw_transactions via subscription for model training)

Each project gets an IAM execution role that DataZone uses to provision and manage access. The critical design decision is whether to map projects to AWS accounts (strong isolation, more overhead) or to use account-level environments within a single account (simpler, but relies on IAM boundaries). For regulated industries, one AWS account per project environment is strongly recommended.
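The hierarchy above can be bootstrapped with boto3 rather than clicked together in the console. A minimal sketch, assuming placeholder names (the FinServCo domain name, account ID, and execution role ARN are illustrative, not taken from a real account):

```python
def domain_request(name, execution_role_arn):
    """Build the CreateDomain request body for datazone.create_domain()."""
    return {
        "name": name,
        "domainExecutionRole": execution_role_arn,
        "description": f"Governance boundary for {name}",
    }


def project_request(domain_id, project_name):
    """Build the CreateProject request body for datazone.create_project()."""
    return {"domainIdentifier": domain_id, "name": project_name}


def bootstrap_finservco(region="us-east-1"):
    """Create the domain and its three projects (requires AWS credentials)."""
    import boto3  # imported here so the payload helpers stay dependency-free

    dz = boto3.client("datazone", region_name=region)
    domain = dz.create_domain(**domain_request(
        "FinServCo",
        "arn:aws:iam::111111111111:role/DataZoneDomainExecutionRole"))
    for team in ("Data Engineering", "Risk Analytics", "ML Fraud Team"):
        dz.create_project(**project_request(domain["id"], team))
```

Environments are attached afterwards via environment profiles, which is where the project-per-account decision discussed below gets encoded.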

Data Subscription Workflow

The subscription workflow is where DataZone delivers its most concrete value. Before DataZone, data access in most AWS environments involved a Jira ticket, a Slack message to the data platform team, a manual Lake Formation grant, and zero audit trail. DataZone replaces this with a structured, auditable workflow.

Here is the end-to-end flow:

Step 1: Publish an asset

A Data Engineering project member registers a Glue table as a DataZone asset:

import boto3

datazone = boto3.client('datazone', region_name='us-east-1')

# Create an asset in DataZone from a Glue table
response = datazone.create_asset(
    domainIdentifier='dzd_abc123',
    owningProjectIdentifier='proj_dataeng_001',
    name='enriched_customer_360',
    typeIdentifier='amazon.datazone.GlueTableAssetType',
    typeRevision='1',
    description='Customer 360 view combining CRM, transaction, and behavioral data. Updated daily at 02:00 UTC.',
    formsInput=[
        {
            'formName': 'GlueTableForm',
            'typeIdentifier': 'amazon.datazone.GlueTableForm',
            'content': '{"databaseName": "analytics_prod", "tableName": "enriched_customer_360"}'
        }
    ]
)

asset_id = response['id']

# Publish the asset to the domain catalog so it becomes discoverable.
# (Creating a revision alone only versions the asset; publishing is a
# separate listing change set.)
datazone.create_listing_change_set(
    domainIdentifier='dzd_abc123',
    entityIdentifier=asset_id,
    entityType='ASSET',
    action='PUBLISH'
)

Step 2: Business enrichment

Before publishing, data owners add business context in the DataZone console: glossary term assignments (e.g., “Customer”, “Transaction”), data quality score thresholds, owner contacts, and update frequency. This is what makes the asset useful for discovery.

Step 3: Consumer requests access

A Risk Analytics team member finds the asset via DataZone’s search interface, reviews the description and quality scores, and submits a subscription request with a business justification. The request appears in the owning project’s approval queue.
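The same request can be submitted programmatically via the CreateSubscriptionRequest API. A sketch, assuming placeholder listing and project IDs (listing_cust360 and proj_risk_001 are hypothetical):

```python
def subscription_request(domain_id, listing_id, consumer_project_id, reason):
    """Build the CreateSubscriptionRequest body: which listing, which project, why."""
    return {
        "domainIdentifier": domain_id,
        "subscribedListings": [{"identifier": listing_id}],
        "subscribedPrincipals": [{"project": {"identifier": consumer_project_id}}],
        "requestReason": reason,
    }


def submit_request(region="us-east-1"):
    """Submit the request so it lands in the producer's approval queue."""
    import boto3  # imported lazily so the payload helper stays dependency-free

    dz = boto3.client("datazone", region_name=region)
    return dz.create_subscription_request(**subscription_request(
        "dzd_abc123",
        "listing_cust360",   # hypothetical listing ID from the catalog search
        "proj_risk_001",     # the Risk Analytics project
        "Q2 credit-risk scoring needs the customer 360 features"))
```

Note that the subscribed principal is a project, not a user, which matches the project-level access model described earlier.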

Step 4: Producer approves

The Data Engineering project owner reviews the request and approves it. At approval time, DataZone automatically calls the Lake Formation GrantPermissions API on behalf of the Risk Analytics project’s IAM execution role:

# DataZone does this automatically on approval — shown for illustration
lakeformation = boto3.client('lakeformation', region_name='us-east-1')

lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::RISK_ACCOUNT:role/DataZone-RiskAnalytics-ExecutionRole'},
    Resource={
        'Table': {
            'DatabaseName': 'analytics_prod',
            'Name': 'enriched_customer_360'
        }
    },
    Permissions=['SELECT'],
    PermissionsWithGrantOption=[]
)

Step 5: Consumer queries the data

The Risk Analytics project’s IAM role can now query the table via Athena, Glue, or EMR — whichever is configured in their DataZone environment. No additional manual steps required.

Audit trail: every subscription request, approval, rejection, and revocation is logged in DataZone’s subscription history. This is what compliance teams need for data access audits.

To revoke access, a producer revokes the subscription in DataZone, which calls RevokePermissions in Lake Formation automatically.
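The producer side of the workflow is also scriptable. A sketch of working the approval queue, assuming the ListSubscriptionRequests and AcceptSubscriptionRequest APIs with a placeholder domain ID:

```python
def approval_queue_filter(domain_id):
    """Filter ListSubscriptionRequests down to the pending approval queue."""
    return {"domainIdentifier": domain_id, "status": "PENDING"}


def approve_all_pending(domain_id="dzd_abc123", region="us-east-1"):
    """Accept every pending request; each acceptance triggers the automatic
    Lake Formation grant described above. Revocation is the mirror image via
    dz.revoke_subscription(...)."""
    import boto3  # imported lazily so the filter helper stays dependency-free

    dz = boto3.client("datazone", region_name=region)
    pending = dz.list_subscription_requests(**approval_queue_filter(domain_id))
    for req in pending["items"]:
        dz.accept_subscription_request(
            domainIdentifier=domain_id, identifier=req["id"])
```

Approving everything unconditionally is only sensible for a demo; in practice the loop would inspect each request's justification before accepting.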

Integration with SageMaker Unified Studio

The relationship between DataZone and SageMaker Unified Studio is tighter than most teams realize: they share the same domain model. A SageMaker Unified Studio domain is a DataZone domain — the same entity, accessible from both consoles.

This means that when your ML team creates a project in SageMaker Unified Studio, that project appears in DataZone as a project with the same governance capabilities. ML engineers can discover and subscribe to governed data assets directly from within the SageMaker Unified Studio IDE without switching to a separate governance console.

The workflow looks like this in practice:

  1. A data engineer publishes a training dataset (S3/Glue table) as a DataZone asset in the “Data Engineering” project
  2. An ML engineer opens SageMaker Unified Studio, opens the Data Catalog panel in the IDE
  3. The ML engineer searches for “customer churn” and finds the registered asset
  4. If the ML project does not have an active subscription, the IDE shows a “Request Access” button
  5. The ML engineer submits a subscription request with justification (“Needed for Q2 churn prediction model training”)
  6. The data owner approves in the DataZone console (or SageMaker Unified Studio — same UI)
  7. The dataset is immediately available in the ML engineer’s notebook environment via Athena or S3 direct access

To configure this integration, the SageMaker domain and DataZone domain must be in the same AWS account and region, or connected via DataZone’s cross-account feature. The SageMaker execution role must have datazone:* permissions scoped to the domain ARN.
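As an illustration of that scoping, the policy attached to the SageMaker execution role might look like the following (the account ID and domain ID in the ARN are placeholders):

```python
import json


def datazone_scoped_policy(domain_arn):
    """Build an IAM policy document limiting DataZone actions to one domain."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "datazone:*",
            "Resource": domain_arn,
        }],
    }


# Placeholder account and domain IDs; substitute your own
policy = datazone_scoped_policy(
    "arn:aws:datazone:us-east-1:111111111111:domain/dzd_abc123")
print(json.dumps(policy, indent=2))
```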

# Verify the SageMaker domain is linked to DataZone
sagemaker = boto3.client('sagemaker')

domain = sagemaker.describe_domain(DomainId='d-abc123')
# When the domains are linked, the DataZone domain ARN appears in the
# domain's settings metadata
print(domain.get('DomainSettings'))

The practical benefit: ML teams no longer maintain their own shadow copies of datasets in S3 buckets that the governance team does not know about. DataZone subscriptions make the access visible, auditable, and revocable.

Data Quality and Lineage

Data quality integration: DataZone integrates with AWS Glue Data Quality (powered by Deequ) to surface quality scores on assets. When you configure Glue Data Quality rules on a table and publish that table as a DataZone asset, the latest quality score and rule pass/fail status appear in the asset’s DataZone catalog entry.

This is significant for consumers making access decisions: they can see that enriched_customer_360 has a 94% quality score (completeness, uniqueness, consistency) before requesting access, rather than discovering data problems after they have built a pipeline on top of it.

Lineage tracking: DataZone tracks lineage for AWS Glue ETL jobs that are registered in the same domain. If a Glue job reads from raw_transactions and writes to enriched_customer_360, DataZone surfaces this dependency chain in the asset’s lineage view. This is valuable for impact analysis (“if I change the schema of raw_transactions, which downstream assets are affected?”) and for audit requirements in regulated industries.

Lineage is automatically captured for Glue ETL jobs — no code changes required. For Spark jobs on EMR or custom Python transformations, you can emit lineage events via the DataZone API’s OpenLineage-compatible endpoint.
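For the custom-transformation case, a lineage event can be posted with the PostLineageEvent API. A sketch, assuming the OpenLineage RunEvent shape (the job namespace and producer URI below are illustrative placeholders):

```python
import datetime
import json
import uuid


def lineage_event(job_name, inputs, outputs):
    """Build a minimal OpenLineage RunEvent for a custom transformation."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "custom-etl", "name": job_name},
        "inputs": [{"namespace": "glue", "name": n} for n in inputs],
        "outputs": [{"namespace": "glue", "name": n} for n in outputs],
        "producer": "https://example.com/custom-etl",  # placeholder producer URI
    }


def emit(domain_id="dzd_abc123", region="us-east-1"):
    """Send the event to DataZone's OpenLineage-compatible endpoint."""
    import boto3  # imported lazily so the event builder stays dependency-free

    dz = boto3.client("datazone", region_name=region)
    dz.post_lineage_event(
        domainIdentifier=domain_id,
        event=json.dumps(lineage_event(
            "fraud_feature_build",
            ["analytics_prod.raw_transactions"],
            ["analytics_prod.enriched_customer_360"])))
```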

Implementation Roadmap

The teams that get the most value from DataZone treat it as a phased rollout rather than a big-bang deployment. The governance workflow is only useful once data producers are publishing assets and consumers are using it for access requests.

Phase 1: Foundation (Weeks 1–3)

  • Create the DataZone domain and map it to your organizational structure
  • Define the project hierarchy: identify your main data producer teams and consumer teams
  • Connect existing Glue Data Catalog: link your production AWS accounts as DataZone environments
  • Import existing Glue databases and tables as DataZone assets (technical metadata imports automatically)
  • Configure IAM execution roles for each project environment
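The Glue import step above maps to the CreateDataSource API. A sketch of registering one Glue database as a recurring import, with placeholder domain, project, and environment IDs:

```python
def glue_import_kwargs(domain_id, project_id, environment_id, database):
    """Build CreateDataSource kwargs that import one Glue database's tables."""
    return {
        "domainIdentifier": domain_id,
        "projectIdentifier": project_id,
        "environmentIdentifier": environment_id,
        "name": f"glue-import-{database}",
        "type": "GLUE",
        "configuration": {
            "glueRunConfiguration": {
                "relationalFilterConfigurations": [
                    {"databaseName": database},
                ],
            },
        },
        "enableSetting": "ENABLED",
    }


def register_data_source(region="us-east-1"):
    import boto3  # imported lazily so the kwargs builder stays dependency-free

    dz = boto3.client("datazone", region_name=region)
    return dz.create_data_source(**glue_import_kwargs(
        "dzd_abc123", "proj_dataeng_001", "env_prod_001", "analytics_prod"))
```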

Phase 2: Business Enrichment (Weeks 4–8)

  • Work with data owners to add business descriptions to the top 20 most-used datasets
  • Build the business glossary: define your organization’s key data terms (what does “customer” mean in your context? what counts as a “completed transaction”?)
  • Assign glossary terms to assets
  • Identify data owners for each published asset — this is often the hardest organizational step

Phase 3: Subscription Workflow Adoption (Weeks 9–14)

  • Pilot the subscription workflow with one high-traffic dataset (e.g., the analytics team’s most-requested table)
  • Migrate ad-hoc Slack/Jira data access requests to DataZone subscription requests for that dataset
  • Train both producers (approving requests, writing asset descriptions) and consumers (searching catalog, submitting requests)
  • Measure time-to-access before and after: this is your primary success metric
  • Expand to all production datasets once the pilot is stable

Phase 4: Quality and Lineage (Weeks 15–20)

  • Add Glue Data Quality rules to critical datasets and surface scores in DataZone
  • Configure lineage tracking for key Glue ETL pipelines
  • Set up EventBridge alerts for quality score degradation on governed assets
  • Review subscription history reports with your compliance team to validate the audit trail meets regulatory requirements
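The EventBridge alerting step can be sketched with boto3. The event source and detail-type strings below are my assumptions about how Glue Data Quality publishes results and should be verified against the Glue documentation; the SNS topic is a placeholder:

```python
import json


def dq_event_pattern():
    """EventBridge pattern matching Glue Data Quality evaluation results
    (source/detail-type strings are assumptions to verify)."""
    return {
        "source": ["aws.glue-dataquality"],
        "detail-type": ["Data Quality Evaluation Results Available"],
    }


def create_alert_rule(sns_topic_arn, region="us-east-1"):
    """Route matching events to an SNS topic for the governance team."""
    import boto3  # imported lazily so the pattern builder stays dependency-free

    events = boto3.client("events", region_name=region)
    events.put_rule(
        Name="datazone-dq-results",
        EventPattern=json.dumps(dq_event_pattern()),
        State="ENABLED")
    events.put_targets(
        Rule="datazone-dq-results",
        Targets=[{"Id": "notify-governance", "Arn": sns_topic_arn}])
```

In practice you would narrow the pattern with a detail filter so only failed or degraded evaluations page anyone.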

The most common failure mode in DataZone rollouts is skipping Phase 2 — importing technical metadata without adding business context. A catalog full of Glue tables with no descriptions is not more useful than just searching the Glue console directly. Business enrichment is what makes DataZone worth adopting.

Connecting DataZone to Your Existing AWS Data Lake

If you have an existing data lake built on S3, Glue, and Athena, DataZone slots in above it without requiring any changes to your storage layout or ETL pipelines. The connection is through the DataZone environment, which maps to your existing Glue Data Catalog.

The one infrastructure change you will need: ensure Lake Formation is enabled on the Glue Data Catalog in your data lake accounts. DataZone uses Lake Formation for permission provisioning, so if your data lake is still using IAM-only S3 bucket policies for access control (without Lake Formation), you will need to migrate before DataZone’s automated subscription provisioning works.

For IAM least-privilege design within the DataZone context, each project’s execution role should be scoped to only the S3 paths and Glue databases relevant to that project’s data, with Lake Formation managing the table-level grants on top.


Amazon DataZone addresses a genuine gap in the AWS data platform story. Glue Data Catalog and Lake Formation are excellent at their specific jobs, but they leave the governance workflow — the request, approval, business context, and audit trail layers — entirely to you. For small teams, that gap is manageable. For enterprises with dozens of data teams and hundreds of datasets, it creates the shadow data access patterns and undocumented dependencies that make regulated compliance audits painful.

DataZone is the governance layer worth building on. The subscription workflow alone replaces a class of manual coordination overhead that most data engineering teams have simply accepted as a cost of doing business.

Need help setting up Amazon DataZone for your AWS data platform? FactualMinds helps enterprise teams design DataZone domain hierarchies, migrate existing Glue catalogs, and build the organizational processes that make data governance stick.

Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.

Tags: AWS Architecture, Cloud Migration, GenAI on AWS, Cost Optimization, DevOps
