Data catalogs for lakehouse architectures: Iceberg catalogs (Hive Metastore, AWS Glue, REST/Tabular), using DuckDB as a lightweight multi-source catalog, and comparisons of open-source metadata tools (Amundsen, DataHub, OpenMetadata).
Guide to selecting, configuring, and using data catalog systems for data discovery, governance, and unified table access across multiple query engines.
Use this skill when:
- Selecting or configuring an Iceberg catalog (Hive Metastore, AWS Glue, REST/Tabular)
- Setting up unified table access across multiple query engines
- Comparing open-source metadata tools (Amundsen, DataHub, OpenMetadata)
- Using DuckDB as a lightweight multi-source catalog
Do not use this skill for:
- Table format internals or file format selection (see @designing-data-storage)
- Low-level object storage access (see @accessing-cloud-storage)
- ETL and pipeline orchestration (see @building-data-pipelines)
Related skills: @designing-data-storage, @accessing-cloud-storage, @building-data-pipelines

| Catalog | Backend | Managed? | Best For |
|---|---|---|---|
| Hive Metastore | RDBMS (Postgres/MySQL) | Self-hosted | Existing Hadoop, high partition counts |
| AWS Glue | AWS-managed serverless | AWS-managed | AWS-native stacks (Athena, EMR) |
| REST (Tabular, Nessie) | SaaS or self-hosted service | Vendor- or self-managed | Iceberg-native features, Git-like branching (Nessie) |
| DuckDB (embedded) | Local file/Postgres | Self-hosted | Single-user, PoC, small teams |

| Scenario | Recommended Catalog | Why |
|---|---|---|
| AWS-native (Athena, Redshift Spectrum) | AWS Glue | Serverless, IAM integration |
| Self-hosted Hadoop/Spark | Hive Metastore | Battle-tested, no vendor lock-in |
| Iceberg-first, multi-cloud | Tabular or Hive | Native Iceberg features or flexibility |
| Small team, PoC, analytics | DuckDB | Zero infrastructure, SQL-native |
| LinkedIn-scale metadata | DataHub | Enterprise lineage, scale |
| Governance-heavy workflows | OpenMetadata | Built-in workflows, data quality |
Iceberg metadata hierarchy:

    Catalog (name) → Table identifier → Metadata location → Data files
                                        (schema, snapshots, partitions)

Key insight: the catalog stores only metadata pointers. Actual data lives in object storage (S3, GCS, Azure).
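The pointer chain can be sketched as a toy Python model. This is an illustration of the concept only, not a real API; all paths and names below are made up:

```python
# Toy model of the pointer chain: the catalog maps a table identifier to the
# location of its current metadata file; everything else lives in object storage.
catalog = {
    "db.events": "s3://bucket/warehouse/db/events/metadata/v3.metadata.json",
}

# What a metadata file conceptually contains (schema, snapshots, data file list)
metadata = {
    "s3://bucket/warehouse/db/events/metadata/v3.metadata.json": {
        "schema": ["event_id", "ts", "payload"],
        "snapshots": ["snap-1001", "snap-1002"],
        "data_files": [
            "s3://bucket/warehouse/db/events/data/part-000.parquet",
            "s3://bucket/warehouse/db/events/data/part-001.parquet",
        ],
    },
}

def resolve(table_name):
    """Follow catalog pointer -> metadata file -> data file list."""
    location = catalog[table_name]           # only this pointer is in the catalog
    return metadata[location]["data_files"]  # data itself stays in object storage

files = resolve("db.events")
```

Swapping the catalog backend (Hive, Glue, REST) changes only where that first pointer lookup happens; the metadata and data layout in object storage is identical.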
The same table can be read from multiple engines through one catalog:

```python
# Same table, different engines
table = catalog.load_table("db.events")

# PyIceberg → pandas DataFrame
df = table.scan().to_pandas()

# Spark SQL (via the same catalog)
spark.table("db.events")  # same underlying data
```

```sql
-- DuckDB (via ATTACH or a REST catalog)
SELECT * FROM iceberg.db.events;
```
| Guide | Covers | When to Read |
|---|---|---|
| Hive Metastore | Docker deployment, PyIceberg integration, pros/cons | Self-hosting Hadoop/Spark |
| AWS Glue Catalog | GlueCatalog setup, crawlers, IAM, Unity Catalog federation | AWS-native stacks |
| REST Catalog & Tabular | Tabular SaaS, Nessie patterns, Git-like branching | Iceberg-first, multi-cloud |
| DuckDB Multi-Source | ATTACH patterns, unified views, limitations | Single-user/PoC catalog |
| Open Source Tools | Amundsen vs DataHub vs OpenMetadata comparison | Metadata discovery/governance |
Quickstart: connecting to each catalog type with PyIceberg:

```python
from pyiceberg.catalog import load_catalog

# Hive Metastore
catalog = load_catalog("hive", **{
    "type": "hive",
    "uri": "thrift://localhost:9083",
    "warehouse": "s3://bucket/warehouse/",
})

# AWS Glue
catalog = load_catalog("glue", **{
    "type": "glue",
    "glue.region": "us-east-1",
    "warehouse": "s3://bucket/warehouse/",
})

# REST/Tabular
catalog = load_catalog("rest", **{
    "type": "rest",
    "uri": "https://api.tabular.io/ws/...",
    "token": "tabular-token-...",
    "warehouse": "s3://bucket/warehouse/",
})

# Create and query a table (schema is a pyiceberg.schema.Schema;
# append expects a pyarrow.Table)
table = catalog.create_table("db.events", schema=schema)
table.append(data)
df = table.scan().to_pandas()
```
Related skills:
- @designing-data-storage - Delta Lake, Iceberg table formats, file format selection
- @accessing-cloud-storage - fsspec, pyarrow.fs, obstore for storage access
- @building-data-pipelines - ETL patterns using catalog-registered tables