File formats and lakehouse table formats for data lakes: Parquet, Arrow, Lance, Zarr, Avro, ORC, Delta Lake, Apache Iceberg, and Apache Hudi. Covers compression, partitioning, ACID transactions, schema evolution, and format selection.
Comprehensive guide to selecting and using modern data storage formats for analytics and machine learning. Covers six file formats (Parquet, Arrow, Lance, Zarr, Avro, ORC) and three lakehouse table formats (Delta Lake, Apache Iceberg, Apache Hudi).
| Format | Type | Best For | Compression | Schema Evolution | Random Access |
|---|---|---|---|---|---|
| Parquet | Columnar | Analytics, data lakes | ✅ (Snappy, Zstd, LZ4) | ✅ (add/drop) | ✅ (row groups) |
| Arrow/Feather | Columnar | In-memory, IPC, ML | ✅ (LZ4, Zstd) | Limited | ✅ (record batches) |
| Lance | Columnar | ML pipelines, vectors | ✅ (Zstd, LZ4) | ✅ | ✅ (multi-modal) |
| Zarr | Chunked arrays | ML, geospatial, N-dim | ✅ (Blosc, gzip) | ✅ (chunks) | ✅ (chunk-level) |
| Avro | Row-based | Streaming, Kafka | ✅ (deflate, snappy) | ✅ (full schema) | ❌ (sequential) |
| ORC | Columnar | Hive, Hadoop | ✅ (ZLIB, Snappy) | Limited | ✅ (stripe-level) |
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| ACID Transactions | ✅ | ✅ | ✅ |
| Time Travel | ✅ | ✅ | ✅ |
| Schema Evolution | ✅ | Advanced (branching) | ✅ |
| Primary Ecosystem | Spark/Databricks | Engine-agnostic | Spark (CDC focus) |
| Write Optimization | Copy-on-write | CoW, Merge-on-Read | CoW, Merge-on-Read |
| Python API | deltalake (pure), PySpark | pyiceberg (pure) | PySpark only |
| Best For | Spark ecosystems | Multi-engine analytics | Change data capture |
Choose Parquet when:
- You run columnar analytics over data lakes and want broad ecosystem support
- You need strong compression (Snappy, Zstd, LZ4) with row-group-level random access

Choose Arrow/Feather when:
- You need zero-copy, in-memory interchange (IPC) between processes or ML tools

Choose Lance when:
- You build ML pipelines around vector search or multi-modal data

Choose Zarr when:
- You work with chunked, N-dimensional arrays (ML, geospatial) in cloud storage

Choose Avro when:
- You stream row-based records (e.g. through Kafka) and need full schema evolution

Choose ORC when:
- You need compatibility with Hive/Hadoop ecosystems

Choose Delta Lake when:
- You work in Spark/Databricks ecosystems and need ACID transactions and time travel
- You want pure-Python reads and writes via the `deltalake` library

Choose Apache Iceberg when:
- You need engine-agnostic, multi-engine analytics with catalog flexibility

Choose Apache Hudi when:
- You run change-data-capture (CDC) pipelines with frequent upserts
| Use Case | Recommended Format | Reason |
|---|---|---|
| Data lake analytics | Parquet or Delta Lake | Mature, partitioning, ecosystem |
| ML training data | Arrow/Feather or Lance | Zero-copy, vector support |
| Geospatial arrays | Zarr | Chunked, N-dimensional, cloud-optimized |
| Streaming/Kafka | Avro | Schema evolution, row-based |
| Legacy Hive | ORC | Compatibility |
| Feature stores | Lance or Delta Lake | Versioning, vectors |
| IPC between processes | Arrow IPC or Feather | Zero-copy, fast |
| Multi-engine analytics | Apache Iceberg | Engine-agnostic, catalog flexibility |
| CDC pipelines | Apache Hudi | Built for upserts/streaming |
| Quick exports | Parquet (Zstd) | Good compression/decompression speed |
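Hudi has no pure-Python writer, so a CDC upsert goes through PySpark. A hedged sketch follows: the option keys are standard Hudi Spark-datasource options, but the table name, key fields, and path are illustrative placeholders, and running it requires a Spark session with the Hudi jars on the classpath:

```python
# Standard Hudi Spark-datasource options for an upsert; the table name,
# record key, and precombine field below are illustrative placeholders
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "timestamp",
    "hoodie.datasource.write.operation": "upsert",
}


def upsert_to_hudi(df, path: str) -> None:
    """Upsert a Spark DataFrame into a Hudi table (needs Spark + Hudi jars)."""
    (
        df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save(path)
    )
```

The `precombine.field` decides which record wins when the same key arrives twice in one batch, which is the core of Hudi's CDC story.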
- references/parquet.md - Deep dive on Parquet layout, writer settings, partitioning patterns
- references/format-selection-guide.md - Comprehensive decision matrix and selection criteria
- references/delta-lake.md - Pure-Python API, PySpark integration, time travel, cloud storage
- references/iceberg.md - Catalog configuration, schema evolution, partition evolution
- references/hudi.md - CDC patterns, upserts, Copy-on-Write vs Merge-on-Read

Parquet with Polars and PyArrow:

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

# Write with Polars
df = pl.DataFrame({"id": [1, 2, 3], "value": [100.0, 200.0, 150.0]})
df.write_parquet("data.parquet", compression="zstd")

# Write with PyArrow (more control)
table = df.to_arrow()  # Polars DataFrame -> pyarrow.Table
pq.write_table(
    table,
    "data.parquet",
    compression="ZSTD",
    compression_level=3,
    row_group_size=100_000,
    use_dictionary=True,
)

# Read with column pruning
df = pl.read_parquet("data.parquet", columns=["id", "value"])

# Dataset scanning with predicate pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = lazy_df.filter(pl.col("value") > 100).collect()
```
Delta Lake with the pure-Python `deltalake` package:

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# Create Delta table
data = pa.table({
    "id": [1, 2, 3],
    "value": [100.0, 200.0, 150.0],
    "timestamp": ["2024-01-01", "2024-01-02", "2024-01-03"],
})
write_deltalake("data/delta-table", data, mode="overwrite")

# Read with time travel
dt = DeltaTable("data/delta-table")
dt.load_version(1)  # load a specific table version
df = dt.to_pandas()

# Upsert (merge); the aliases must be declared for the predicate to use them
new_data = pa.table({"id": [2, 4], "value": [250.0, 400.0]})
(
    dt.merge(
        source=new_data,
        predicate="target.id = source.id",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```
Apache Iceberg with `pyiceberg` (the `snapshot_id` route is how pyiceberg exposes time travel; the earlier `as_of_timestamp` keyword is not a `scan()` parameter):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# Load catalog (AWS Glue in this example)
catalog = load_catalog("glue", **{
    "type": "glue",
    "s3.region": "us-east-1",
})

# Create and write table
schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "population", LongType(), required=False),
)
table = catalog.create_table("default.events", schema=schema)
table.append(pa.table({"city": ["Tokyo"], "population": [37_000_000]}))

# Filtered scan; for time travel, pass snapshot_id=... from table.history()
df = table.scan(row_filter="population > 1000000").to_pandas()
```
LanceDB (built on the Lance format) for vector data:

```python
import lancedb
import polars as pl

# Create Lance dataset
db = lancedb.connect("./data.lance")
df = pl.DataFrame({
    "id": [1, 2, 3],
    "text": ["Hello", "World", "ML"],
    "vector": [[0.1] * 128, [0.2] * 128, [0.3] * 128],
})
table = db.create_table("my_table", df)

# Vector search
results = table.search([0.1] * 128).limit(5).to_pandas()
```
Zarr chunked arrays (written against the zarr-python 2.x API; 3.x changed the compressor argument):

```python
import numpy as np
import zarr

# Create chunked, compressed array
z = zarr.open(
    "data.zarr",
    mode="w",
    shape=(1_000_000, 1_000),
    chunks=(10_000, 1_000),
    dtype="f4",
    compressor=zarr.Blosc(cname="zstd", clevel=3),  # Blosc re-exported from numcodecs
)

# Write and read partial slices (only the needed chunks are touched)
z[:10_000, :] = np.random.rand(10_000, 1_000).astype("f4")
slice_data = z[5_000:6_000, :]
```
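Chunk shape drives both compression efficiency and partial-read cost, so it is worth checking the arithmetic: the array above uses 10,000 × 1,000 float32 chunks, about 40 MB uncompressed each, which lands inside the 1–100 MB per chunk often suggested as a rule of thumb:

```python
import numpy as np

chunk_shape = (10_000, 1_000)
itemsize = np.dtype("f4").itemsize  # 4 bytes per float32

# Uncompressed bytes per chunk = product of chunk shape x itemsize
chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize
chunk_mb = chunk_bytes / 1_000_000

# Chunks along the first axis for the (1_000_000, 1_000) array above
n_chunks = 1_000_000 // chunk_shape[0]
```

The `z[5000:6000, :]` read above touches exactly one of those chunks, which is why partial reads stay cheap.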
| Codec | Compression Ratio | Speed | Best For |
|---|---|---|---|
| Snappy | Low (~2:1) | ⚡⚡⚡ Fast | Fast analytics, default |
| Zstd | Medium-High (~4:1) | ⚡⚡ Fast | General purpose |
| LZ4 | Low-Medium (~2.5:1) | ⚡⚡⚡ Very fast | Real-time streaming |
| Gzip | High (~5:1) | ⚡ Slow | Archival, cold storage |
| Blosc (zstd) | Medium | ⚡⚡ Fast | Zarr arrays |
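The ratio-versus-speed trade-off in the table is easy to see with the stdlib `zlib` module (the deflate codec behind gzip): higher levels spend more CPU for smaller output.

```python
import zlib

# Repetitive, columnar-style data compresses well at any level
data = b"sensor_id,value\n" + b"42,3.14159\n" * 10_000

fast = zlib.compress(data, level=1)   # speed-oriented
small = zlib.compress(data, level=9)  # size-oriented

ratio_fast = len(data) / len(fast)
ratio_small = len(data) / len(small)

# Both round-trip losslessly; only size and CPU time differ
assert zlib.decompress(fast) == data
assert zlib.decompress(small) == data
```

The same pattern holds for the Parquet codecs above: Snappy and LZ4 sit near level 1's behavior, Zstd and Gzip trade CPU for density.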
@accessing-cloud-storage - S3, GCS, Azure authentication and I/O patterns
@building-data-pipelines - ETL patterns using these storage formats
@engineering-ai-pipelines - Vector storage, Lance, embeddings workflows
@managing-data-catalogs - Hive, Glue, REST catalogs for Iceberg