Scale single-node Daft workflows to distributed execution on Ray clusters. Invoke when optimizing performance or handling large datasets.
| Strategy | API | Use Case | Trade-offs |
|---|---|---|---|
| Shuffle | `repartition(N)` | Light data (e.g., file paths), joins | Global balance; high memory usage (materializes data). |
| Streaming | `into_batches(N)` | Heavy data (images, tensors) | Low memory (streaming); high scheduling overhead if batches are too small. |
Best for distributing file paths before heavy reads.
```python
import daft

# Create enough partitions to saturate workers
df = daft.read_parquet("s3://metadata").repartition(100)
df = df.with_column("data", read_heavy_data(df["path"]))
```
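As a rule of thumb, a shuffle partition count like the `100` above should be a small multiple of the cluster's total core count so every worker stays busy while stragglers finish. A minimal sketch of that heuristic, assuming a hypothetical 25-worker, 2-core cluster and a 2x oversubscription factor (the helper name and factor are illustrative, not a Daft API):

```python
def partitions_to_saturate(num_workers: int, cores_per_worker: int, factor: int = 2) -> int:
    """Rule-of-thumb partition count: a small multiple of total cores.

    Hypothetical sizing helper; the 2x oversubscription factor is an
    assumption, not a Daft default.
    """
    return num_workers * cores_per_worker * factor

# 25 workers x 2 cores x 2 = 100 partitions, matching repartition(100)
n = partitions_to_saturate(25, 2)
```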
Best for processing large partitions without OOM.
```python
import daft

# Stream each 1GB partition in 64-row chunks to control memory
df = daft.read_parquet("heavy_data").into_batches(64)
df = df.with_column("embed", model.predict(df["img"]))
```
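Choosing the chunk size comes down to how much memory one in-flight batch pins per actor. A sketch of that arithmetic (the helper and the ~16 MiB-per-image figure are illustrative assumptions; measure `bytes_per_row` on real data):

```python
def batch_memory_bytes(batch_rows: int, bytes_per_row: int) -> int:
    # Approximate peak memory one actor holds for a single in-flight batch.
    # bytes_per_row varies widely by payload; profile it before tuning.
    return batch_rows * bytes_per_row

# 64 rows of ~16 MiB images pin roughly 1 GiB per batch
per_batch = batch_memory_bytes(64, 16 * 1024 * 1024)
```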
Target: Keep all actors busy without OOM or scheduling bottlenecks.
Calculate the max partition count to ensure each task has enough data to feed its local actors:

- Min Rows Per Partition = Batch Size * (Total Concurrency / Nodes)
- Max Partition Count = Total Rows / Min Rows Per Partition

Example: 64 * (16 / 4) = 256 rows minimum per partition; 1,000,000 / 256 ≈ 3906 partitions maximum.

```python
df = df.repartition(1000)  # Balanced fan-out, safely below the maximum
```
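The worked example can be expressed as a small helper; this is a sketch of the arithmetic only, not a Daft API:

```python
def min_rows_per_partition(batch_size: int, total_concurrency: int, nodes: int) -> int:
    # Rows each partition needs so one task can feed all actors on a node.
    return batch_size * (total_concurrency // nodes)

def max_partition_count(total_rows: int, rows_per_partition: int) -> int:
    # Upper bound on partitions before tasks starve their actors.
    return total_rows // rows_per_partition

rows = min_rows_per_partition(64, 16, 4)      # 64 * (16 / 4) = 256
limit = max_partition_count(1_000_000, rows)  # 1,000,000 / 256 ≈ 3906
```

Picking a round number comfortably below the limit (e.g. 1000) leaves headroom for skewed partitions.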
Avoid creating tiny partitions. Strategy: keep partitions large (e.g. 1GB+) and use `into_batches(batch_size)` to stream data within them, controlling memory per actor.
```python
# Stream batches to control memory usage per actor
df = df.into_batches(64).with_column("preds", model(max_concurrency=16).predict(df["img"]))
```