Data catalogs for lakehouse architectures: Iceberg catalogs (Hive Metastore, AWS Glue, REST/Tabular), using DuckDB as a lightweight multi-source catalog, and comparisons of open-source metadata tools (Amundsen, DataHub, OpenMetadata).
Guide to selecting, configuring, and using data catalog systems for data discovery, governance, and unified table access across multiple query engines.
Use this skill when:
- Selecting or configuring an Iceberg catalog (Hive Metastore, AWS Glue, REST/Tabular)
- Setting up unified table access across multiple query engines
- Comparing open-source metadata tools (Amundsen, DataHub, OpenMetadata)
- Using DuckDB as a lightweight multi-source catalog
Do not use this skill for:
- Table format internals or file format selection (see @designing-data-storage)
- Low-level object storage access (see @accessing-cloud-storage)
- ETL and pipeline orchestration (see @building-data-pipelines)
Related skills: @designing-data-storage, @accessing-cloud-storage, @building-data-pipelines

| Catalog | Backend | Managed? | Best For |
|---|---|---|---|
| Hive Metastore | RDBMS (Postgres/MySQL) | Self-hosted | Existing Hadoop, high partition counts |
| AWS Glue | AWS-managed serverless | AWS-managed | AWS-native stacks (Athena, EMR) |
| REST (Tabular, Nessie) | SaaS or self-hosted service | Vendor- or self-managed | Iceberg-native features, Git-like branching (Nessie) |
| DuckDB (embedded) | Local file/Postgres | Self-hosted | Single-user, PoC, small teams |

| Scenario | Recommended Catalog | Why |
|---|---|---|
| AWS-native (Athena, Redshift Spectrum) | AWS Glue | Serverless, IAM integration |
| Self-hosted Hadoop/Spark | Hive Metastore | Battle-tested, no vendor lock-in |
| Iceberg-first, multi-cloud | Tabular or Hive | Native Iceberg features or flexibility |
| Small team, PoC, analytics | DuckDB | Zero infrastructure, SQL-native |
| LinkedIn-scale metadata | DataHub | Enterprise lineage, scale |
| Governance-heavy workflows | OpenMetadata | Built-in workflows, data quality |
Iceberg metadata hierarchy:

    Catalog (name) → Table identifier → Metadata location → Data files
                                        (schema, snapshots, partitions)

Key insight: the catalog stores only metadata pointers. Actual data lives in object storage (S3, GCS, Azure).
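The pointer chain can be sketched as a toy Python model. This is an illustration of the concept only, not a real API; all paths and names below are made up:

```python
# Toy model of the pointer chain: the catalog maps a table identifier to the
# location of its current metadata file; everything else lives in object storage.
catalog = {
    "db.events": "s3://bucket/warehouse/db/events/metadata/v3.metadata.json",
}

# What a metadata file conceptually contains (schema, snapshots, data file list)
metadata = {
    "s3://bucket/warehouse/db/events/metadata/v3.metadata.json": {
        "schema": ["event_id", "ts", "payload"],
        "snapshots": ["snap-1001", "snap-1002"],
        "data_files": [
            "s3://bucket/warehouse/db/events/data/part-000.parquet",
            "s3://bucket/warehouse/db/events/data/part-001.parquet",
        ],
    },
}

def resolve(table_name):
    """Follow catalog pointer -> metadata file -> data file list."""
    location = catalog[table_name]           # only this pointer is in the catalog
    return metadata[location]["data_files"]  # data itself stays in object storage

files = resolve("db.events")
```

Swapping the catalog backend (Hive, Glue, REST) changes only where that first pointer lookup happens; the metadata and data layout in object storage is identical.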
The same table can be read from multiple engines through one catalog:

```python
# Same table, different engines
table = catalog.load_table("db.events")

# PyIceberg → pandas DataFrame
df = table.scan().to_pandas()

# Spark SQL (via the same catalog)
spark.table("db.events")  # same underlying data
```

```sql
-- DuckDB (via ATTACH or a REST catalog)
SELECT * FROM iceberg.db.events;
```
| Guide | Covers | When to Read |
|---|---|---|
| Hive Metastore | Docker deployment, PyIceberg integration, pros/cons | Self-hosting Hadoop/Spark |
| AWS Glue Catalog | GlueCatalog setup, crawlers, IAM, Unity Catalog federation | AWS-native stacks |
| REST Catalog & Tabular | Tabular SaaS, Nessie patterns, Git-like branching | Iceberg-first, multi-cloud |
| DuckDB Multi-Source | ATTACH patterns, unified views, limitations | Single-user/PoC catalog |
| Open Source Tools | Amundsen vs DataHub vs OpenMetadata comparison | Metadata discovery/governance |
Quickstart: connecting to each catalog type with PyIceberg:

```python
from pyiceberg.catalog import load_catalog

# Hive Metastore
catalog = load_catalog("hive", **{
    "type": "hive",
    "uri": "thrift://localhost:9083",
    "warehouse": "s3://bucket/warehouse/",
})

# AWS Glue
catalog = load_catalog("glue", **{
    "type": "glue",
    "glue.region": "us-east-1",
    "warehouse": "s3://bucket/warehouse/",
})

# REST/Tabular
catalog = load_catalog("rest", **{
    "type": "rest",
    "uri": "https://api.tabular.io/ws/...",
    "token": "tabular-token-...",
    "warehouse": "s3://bucket/warehouse/",
})

# Create and query a table (schema is a pyiceberg.schema.Schema;
# append expects a pyarrow.Table)
table = catalog.create_table("db.events", schema=schema)
table.append(data)
df = table.scan().to_pandas()
```
Related skills:
- @designing-data-storage - Delta Lake, Iceberg table formats, file format selection
- @accessing-cloud-storage - fsspec, pyarrow.fs, obstore for storage access
- @building-data-pipelines - ETL patterns using catalog-registered tables