Distributed and parallel computing for larger-than-RAM datasets. Five components: DataFrames (parallel pandas), Arrays (parallel NumPy), Bags (unstructured data), Futures (task-based parallelism), Schedulers (threads/processes/distributed). Scales from laptops to HPC clusters. For in-memory speed on a single machine, use Polars; for out-of-core analytics without a cluster, use Vaex.
Dask is a Python library for parallel and distributed computing that scales familiar pandas/NumPy APIs to larger-than-memory datasets. It provides five main components (DataFrames, Arrays, Bags, Futures, Schedulers) and scales from single-machine multi-core to multi-node HPC clusters.
pip install "dask[complete]"    # All components
pip install "dask[dataframe]"   # DataFrames only
pip install "dask[distributed]" # Distributed scheduler + dashboard
pip install dask-jobqueue       # HPC cluster integration (SLURM, PBS)
import dask.dataframe as dd
# Read multiple files as a single DataFrame
ddf = dd.read_csv('data/2024-*.csv')
ddf = dd.read_parquet('data/', columns=['id', 'value', 'category'])
# Operations are lazy until .compute()
filtered = ddf[ddf['value'] > 100]
result = filtered.groupby('category').agg({'value': ['mean', 'sum']}).compute()
print(result.shape) # (n_categories, 2)
# Custom operations via map_partitions (preferred over apply)
def normalize_partition(df):
    df['norm_value'] = (df['value'] - df['value'].mean()) / df['value'].std()
    return df
ddf = ddf.map_partitions(normalize_partition)
# Joins
ddf_merged = ddf.merge(lookup_ddf, on='category', how='left')
# Write results
ddf.to_parquet('output/', engine='pyarrow')
# Repartitioning for optimal chunk sizes
ddf = ddf.repartition(npartitions=20) # By count
ddf = ddf.repartition(partition_size='100MB') # By size
# Index management for sorted operations
ddf = ddf.set_index('timestamp', sorted=True)
# Debugging
print(f"Partitions: {ddf.npartitions}")
print(f"Dtypes: {ddf.dtypes}")
sample = ddf.get_partition(0).compute() # Inspect first partition
import dask.array as da
import numpy as np
# Create from various sources
x = da.random.random((100000, 1000), chunks=(10000, 1000))
x = da.from_array(np_array, chunks=(10000, 1000))
x = da.from_zarr('large_dataset.zarr')
# Standard operations (lazy)
y = (x - x.mean(axis=0)) / x.std(axis=0) # Normalize
z = da.dot(x.T, x) # Matrix multiply
u, s, v = da.linalg.svd(x) # SVD
# Compute and persist
result = y.mean(axis=0).compute()
print(result.shape) # (1000,)
# Custom operations with map_blocks
def custom_filter(block):
    from scipy.ndimage import gaussian_filter
    return gaussian_filter(block, sigma=2)
filtered = da.map_blocks(custom_filter, x, dtype=x.dtype)
# Rechunking for different access patterns
x_rechunked = x.rechunk({0: 5000, 1: 500})
# Save to disk
da.to_zarr(y, 'normalized.zarr')
import dask.bag as db
import json
# Read unstructured data
bag = db.read_text('logs/*.json').map(json.loads)
# Functional operations
valid = bag.filter(lambda x: x['status'] == 'success')
ids = valid.pluck('user_id')
flat = bag.map(lambda x: x['tags']).flatten()
# Aggregation — use foldby instead of groupby (much faster)
totals = bag.foldby(
    key='category',
    binop=lambda total, x: total + x['amount'],
    initial=0,
    combine=lambda a, b: a + b,
    combine_initial=0
).compute()
# Convert to DataFrame for structured analysis
ddf = valid.to_dataframe(meta={'user_id': 'str', 'amount': 'float64', 'category': 'str'})
from dask.distributed import Client
client = Client() # Local cluster with all cores
print(client.dashboard_link) # http://localhost:8787
# Submit individual tasks (executes immediately, not lazy)
def process(x, param):
    return x ** param
future = client.submit(process, 42, param=2)
print(future.result()) # 1764
# Map over many inputs
futures = client.map(process, range(100), param=2)
results = client.gather(futures)
print(len(results)) # 100
# Scatter large data to workers (avoids repeated transfers)
import numpy as np
big_data = np.random.random((10000, 1000))
data_future = client.scatter(big_data, broadcast=True)
# Submit tasks using scattered data
futures = [client.submit(process_chunk, data_future, i) for i in range(10)]
results = client.gather(futures)
# Progressive result processing
from dask.distributed import as_completed
for future in as_completed(futures):
    result = future.result()
    print(f"Completed: {result}")
# Coordination primitives
from dask.distributed import Lock, Queue, Event
lock = Lock('resource-lock')
with lock:
    # Thread-safe operation across workers
    pass
client.close()
import dask
# Global scheduler setting
dask.config.set(scheduler='threads') # Default: GIL-releasing numeric work
dask.config.set(scheduler='processes') # Pure Python, GIL-bound work
dask.config.set(scheduler='synchronous') # Debugging with pdb
# Context manager for temporary change
with dask.config.set(scheduler='synchronous'):
    result = computation.compute()  # Can use pdb here
# Per-compute override
result = ddf.mean().compute(scheduler='processes')
# Distributed scheduler with resource control
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=2, memory_limit='4GB')
print(client.dashboard_link)
# HPC cluster integration
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
cluster = SLURMCluster(
    cores=24, memory='100GB',
    walltime='02:00:00', queue='regular'
)
cluster.scale(jobs=10) # Request 10 SLURM jobs
client = Client(cluster)
# Adaptive scaling
cluster.adapt(minimum=2, maximum=20)
result = computation.compute()
client.close()
| Data Type | Component | When to Use |
|---|---|---|
| Tabular (CSV, Parquet) | DataFrames | Standard pandas-like operations at scale |
| Numeric arrays (HDF5, Zarr) | Arrays | NumPy operations, linear algebra, image processing |
| Text, JSON, logs | Bags | ETL/cleaning → convert to DataFrame for analysis |
| Custom parallel tasks | Futures | Dynamic workflows, parameter sweeps, task dependencies |
| Any of above | Schedulers | Control execution backend (threads/processes/distributed) |
Control level: DataFrames/Arrays/Bags = high-level lazy API. Futures = low-level immediate execution.
All DataFrames, Arrays, and Bags build a task graph — nothing executes until .compute() or .persist().
- `.compute()` — execute and return the result to local memory
- `.persist()` — execute and keep the result on workers (for reuse across multiple computations)
- `dask.compute(a, b, c)` — compute multiple results in a single pass (shares intermediates)

Target: ~100 MB per chunk (or 10 chunks per core in worker memory).
| Chunk Size | Effect |
|---|---|
| Too large (>1 GB) | Memory overflow, poor parallelization |
| Optimal (~100 MB) | Good parallelism, manageable memory |
| Too small (<1 MB) | Excessive scheduling overhead |
Example: 8 cores, 32 GB RAM → target ~400 MB per chunk (32 GB / 8 cores / 10).
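The target can also be handed to Dask directly: the `array.chunk-size` config key drives `chunks='auto'` when creating arrays. A small sketch (the 100 MiB value follows the general guidance above; array shape is illustrative):

```python
import dask
import dask.array as da

# Ask auto-chunking to target ~100 MiB per chunk
with dask.config.set({'array.chunk-size': '100MiB'}):
    x = da.ones((50000, 50000), chunks='auto')  # lazy; ~18.6 GB of float64 total

print(x.chunksize, x.numblocks)
```

Nothing is allocated here; the config only shapes the chunking of the lazy graph.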
| Scheduler | Overhead | Best For | GIL |
|---|---|---|---|
| `threads` (default) | ~10 µs/task | NumPy, pandas, scikit-learn | Affected |
| `processes` | ~10 ms/task | Pure Python, text processing | Not affected |
| `synchronous` | ~1 µs/task | Debugging with pdb | N/A |
| `distributed` | ~1 ms/task | Dashboard, clusters, advanced features | Configurable |
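As a small illustration of the table: per-record string work is pure Python and GIL-bound, so the processes scheduler (also the Bag default) is the better fit, while the threads default suits NumPy-style code. The toy sequence below is illustrative:

```python
import dask.bag as db

# Pure-Python, GIL-bound work: per-record string processing
lines = db.from_sequence(['alpha beta', 'gamma delta epsilon'] * 100,
                         npartitions=4)
word_count = lines.map(lambda s: len(s.split())).sum()
print(word_count.compute(scheduler='processes'))  # 500
```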
import dask.dataframe as dd
import dask
# Extract: Read all CSV files
ddf = dd.read_csv('raw_data/*.csv', dtype={'amount': 'float64'})
# Transform: Clean and process
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].fillna(0)
ddf = ddf.dropna(subset=['category'])
# Aggregate
summary = ddf.groupby('category').agg({'amount': ['sum', 'mean', 'count']})
# Load: Save as Parquet (columnar, compressed)
summary.to_parquet('output/summary.parquet')
print(f"Processed {len(ddf)} rows across {ddf.npartitions} partitions")
import dask.array as da
# Load large scientific dataset
x = da.from_zarr('experiment_data.zarr') # e.g., (50000, 50000) float64
print(f"Shape: {x.shape}, Chunks: {x.chunks}")
# Normalize per-column
x_norm = (x - x.mean(axis=0)) / x.std(axis=0)
# Compute covariance matrix
cov = da.dot(x_norm.T, x_norm) / (x_norm.shape[0] - 1)
# SVD for dimensionality reduction (top-k)
u, s, v = da.linalg.svd_compressed(x_norm, k=50)
# Save results
da.to_zarr(u, 'pca_components.zarr')
print(f"Explained variance (top 5): {(s[:5]**2 / (s**2).sum()).compute()}")
This workflow is a simple combination of Bags (Section 3) → DataFrames (Section 1): read JSON logs with Bags, filter/transform, convert to DataFrame for groupby analysis. Each step maps directly to Core API examples above.
| Parameter | Module | Default | Description |
|---|---|---|---|
| `npartitions` | DataFrame | auto | Number of partitions (controls parallelism) |
| `partition_size` | DataFrame | — | Target size per partition (e.g., '100MB') |
| `chunks` | Array | required | Chunk dimensions (e.g., (10000, 1000)) |
| `blocksize` | Bag | '128 MiB' | File read block size |
| `scheduler` | All | 'threads' | Execution backend ('threads', 'processes', 'synchronous') |
| `n_workers` | Distributed | auto | Number of worker processes |
| `threads_per_worker` | Distributed | auto | Threads per worker |
| `memory_limit` | Distributed | auto | Per-worker memory limit (e.g., '4GB') |
| `sorted` | set_index | False | Whether data is pre-sorted (enables optimizations) |
| `meta` | map_partitions | — | Output DataFrame/Series structure template |
Let Dask handle data loading — Never load data into pandas/numpy first then convert. Use dd.read_csv() / da.from_zarr() directly.
Batch compute calls — Use dask.compute(a, b, c) instead of calling .compute() in loops. Allows sharing intermediates.
Use map_partitions over apply — ddf.apply(func, axis=1) creates one task per row. ddf.map_partitions(func) creates one task per partition.
Persist reused intermediates — Call .persist() on data accessed multiple times, then del when done.
Use the dashboard — client.dashboard_link shows task progress, memory usage, worker states. Essential for diagnosing performance issues.
Anti-pattern — Excessively large task graphs: If len(ddf.__dask_graph__()) returns millions, increase chunk sizes or use map_partitions/map_blocks to fuse operations.
Anti-pattern — Wrong scheduler for workload: Using threads for pure Python text processing (GIL-bound) or processes for NumPy operations (unnecessary serialization overhead).
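A minimal sketch of the batched-compute practice: both aggregates below reuse the `normalized` intermediate in a single graph walk instead of recomputing it twice (array shape is illustrative):

```python
import dask
import dask.array as da

x = da.random.random((1000, 100), chunks=(250, 100))
normalized = (x - x.mean(axis=0)) / x.std(axis=0)

# One pass computes both results and shares the `normalized` intermediate
col_means, col_stds = dask.compute(normalized.mean(axis=0),
                                   normalized.std(axis=0))
print(col_means.shape, col_stds.shape)  # (100,) (100,)
```

Calling `.compute()` on each result separately would rebuild and re-execute the normalization graph twice.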
from dask_ml.preprocessing import StandardScaler
from dask_ml.model_selection import train_test_split
import dask.array as da
X = da.random.random((100000, 50), chunks=(10000, 50))
y = da.random.randint(0, 2, size=100000, chunks=10000)
# Preprocessing
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
from dask.distributed import Client
client = Client()
class RunningStats:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count
# Create actor on a worker
stats = client.submit(RunningStats, actor=True).result()
# Call methods (~1ms roundtrip)
for v in [10, 20, 30]:
    mean = stats.add(v).result()
    print(f"Running mean: {mean}")
client.close()
from dask_kubernetes import KubeCluster
from dask.distributed import Client
cluster = KubeCluster()
cluster.adapt(minimum=2, maximum=50)
client = Client(cluster)
# Run computation on auto-scaling cluster
result = large_computation.compute()
client.close()
| Problem | Cause | Solution |
|---|---|---|
| `MemoryError` during `.compute()` | Result too large for local memory | Use `.to_parquet()` or `.persist()` instead of `.compute()` |
| Slow computation start | Task graph has millions of tasks | Increase chunk sizes; use `map_partitions`/`map_blocks` |
| Poor parallelization | GIL contention with threads scheduler | Switch to `scheduler='processes'` for Python-heavy code |
| `TypeError` in `map_partitions` | Missing or wrong `meta` parameter | Provide `meta=pd.DataFrame({'col': pd.Series(dtype='float64')})` |
| Workers killed (OOM) | Chunks exceed worker memory | Decrease chunk size; increase `memory_limit` |
| `KilledWorker` exception | Worker process crashed | Check worker logs; reduce memory per task; increase `memory_limit` |
| Slow joins/merges | Data not pre-sorted on join key | Call `ddf.set_index('key', sorted=True)` before join |
| `NotImplementedError` | Operation not supported by Dask | Use `map_partitions` with the pandas equivalent |
| Dashboard not accessible | Distributed client not started | Use `client = Client()` to enable the distributed scheduler |
| Data type mismatch across partitions | Inconsistent CSV files | Specify `dtype` explicitly in `dd.read_csv()` |
- `references/collections_guide.md` — Detailed DataFrames, Arrays, and Bags guide with comprehensive code examples for reading, transforming, aggregating, and writing data. Covers map_partitions patterns, the meta parameter, chunking strategies, map_blocks, foldby, and collection conversion. Consolidated from the original dataframes.md, arrays.md, and bags.md.
- `references/distributed_computing.md` — Futures API, distributed coordination primitives (Locks, Queues, Events, Variables), Actors, scheduler configuration, HPC cluster setup (SLURM, Kubernetes), adaptive scaling, dashboard monitoring, and performance profiling. Consolidated from the original futures.md and schedulers.md.
- Not migrated: the original had 6 reference files. best-practices.md content was consolidated into the Best Practices section and Key Concepts inline (chunk strategy, scheduler selection); the remaining content is organized into the 2 reference files above, covering the 5 main components.