Guidance for implementing tensor parallelism in PyTorch, including ColumnParallelLinear and RowParallelLinear layers. This skill should be used when implementing distributed tensor parallel operations, sharding linear layers across multiple GPUs, or simulating collective operations like all-gather and all-reduce for parallel computation.
This skill provides guidance for implementing tensor parallelism patterns in PyTorch, specifically for ColumnParallelLinear and RowParallelLinear layers that distribute computation across multiple devices.
Tensor parallelism splits individual layers across multiple devices to parallelize computation within a single forward/backward pass. The two primary patterns are:
ColumnParallelLinear: Shards weights along the output dimension (columns). Each device computes a portion of the output features, then results are concatenated via all-gather.
RowParallelLinear: Shards weights along the input dimension (rows). Each device computes partial outputs using its shard of the input, then results are summed via all-reduce.
When implementing tensor parallelism (especially in simulation or testing contexts), the forward pass must actually perform the collective operations, not just compute local shards:
A common mistake is returning only the local shard and expecting an external framework to handle collective operations. Unless explicitly specified otherwise, the implementation should produce the final, complete output.
Before implementing, clearly identify:
For weight matrix W of shape (out_features, in_features):
ColumnParallelLinear: each rank holds a weight shard of shape (out_features / world_size, in_features), split along dim=0 (output features).
RowParallelLinear: each rank holds a weight shard of shape (out_features, in_features / world_size), split along dim=1 (input features).
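The sharding above can be sketched in a single process with torch.chunk (the sizes and world_size here are illustrative, not prescribed by any framework):

```python
import torch

torch.manual_seed(0)
out_features, in_features, world_size = 8, 4, 2

# Full (unsharded) weight of shape (out_features, in_features)
W = torch.randn(out_features, in_features)

# ColumnParallelLinear: split along dim=0 (output features)
col_shards = torch.chunk(W, world_size, dim=0)
assert col_shards[0].shape == (out_features // world_size, in_features)

# RowParallelLinear: split along dim=1 (input features)
row_shards = torch.chunk(W, world_size, dim=1)
assert row_shards[0].shape == (out_features, in_features // world_size)
```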
ColumnParallelLinear Forward:
1. Compute local output: y_local = x @ W_shard.T + bias_shard (each rank holds its own slice of the bias)
2. All-gather to concatenate: y = concat([y_0, y_1, ..., y_n], dim=-1)
3. Return complete output of shape (batch, out_features)
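The three steps above can be simulated in one process by looping over "ranks" and concatenating, standing in for a real all-gather (the function name and world_size are illustrative):

```python
import torch

def column_parallel_forward(x, W, b, world_size):
    """Simulate ColumnParallelLinear: shard W and b along the output
    dimension, compute each rank's local output, then 'all-gather'
    by concatenating along the feature dimension."""
    W_shards = torch.chunk(W, world_size, dim=0)
    b_shards = torch.chunk(b, world_size, dim=0)
    # Each rank computes its slice of the output features
    local_outputs = [x @ W_s.T + b_s for W_s, b_s in zip(W_shards, b_shards)]
    # All-gather: concatenate the per-rank slices
    return torch.cat(local_outputs, dim=-1)

torch.manual_seed(0)
x = torch.randn(3, 4)   # (batch, in_features)
W = torch.randn(8, 4)   # (out_features, in_features)
b = torch.randn(8)
y = column_parallel_forward(x, W, b, world_size=2)
assert y.shape == (3, 8)                              # full, not sharded
assert torch.allclose(y, x @ W.T + b, atol=1e-5)      # matches unsharded layer
```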
RowParallelLinear Forward:
1. Get input shard: x_shard = x[..., start:end] for this rank
2. Compute partial output: y_partial = x_shard @ W_shard.T
3. All-reduce to sum: y = sum([y_0, y_1, ..., y_n])
4. Add bias (only once, not per-rank): y = y + bias
5. Return complete output of shape (batch, out_features)
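The five steps above can likewise be simulated in one process, with a sum over partial outputs standing in for the all-reduce (names and sizes are illustrative):

```python
import torch

def row_parallel_forward(x, W, b, world_size):
    """Simulate RowParallelLinear: shard W along the input dimension,
    shard x to match, sum the partial outputs ('all-reduce'), then
    add the bias exactly once."""
    W_shards = torch.chunk(W, world_size, dim=1)
    x_shards = torch.chunk(x, world_size, dim=-1)
    # Each rank computes a partial output covering the full out_features
    partials = [x_s @ W_s.T for x_s, W_s in zip(x_shards, W_shards)]
    # All-reduce: sum the partial outputs across ranks
    y = torch.stack(partials).sum(dim=0)
    # Bias is added once, after the reduction, never per-rank
    return y + b

torch.manual_seed(0)
x = torch.randn(3, 4)
W = torch.randn(8, 4)
b = torch.randn(8)
y = row_parallel_forward(x, W, b, world_size=2)
assert y.shape == (3, 8)
assert torch.allclose(y, x @ W.T + b, atol=1e-4)
```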
ColumnParallelLinear: the bias is sharded along the output dimension; each rank adds its own bias slice before the all-gather.
RowParallelLinear: the bias is not sharded; it is added exactly once, after the all-reduce.
When local testing is unavailable, verify implementation correctness through mathematical analysis:
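When local execution is possible, the same equivalence can be checked directly: concatenating column-shard outputs, or summing row-shard partial outputs, must reproduce the unsharded matmul (sizes below are arbitrary):

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 6)
W = torch.randn(10, 6)
reference = x @ W.T  # unsharded result

# Column-parallel identity: concatenated per-shard outputs == full matmul
col = torch.cat([x @ W_s.T for W_s in torch.chunk(W, 2, dim=0)], dim=-1)
assert torch.allclose(col, reference, atol=1e-5)

# Row-parallel identity: summed per-shard partial outputs == full matmul
row = sum(x_s @ W_s.T for x_s, W_s in
          zip(torch.chunk(x, 2, dim=-1), torch.chunk(W, 2, dim=1)))
assert torch.allclose(row, reference, atol=1e-4)
```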
Symptom: Output tensor size is (out_features / world_size) instead of (out_features)
Cause: Implementation computes local shard but doesn't perform all-gather
Fix: Implement the collective operation to combine results from all ranks
Symptom: Output values are N times larger than expected (where N is world_size)
Cause: Each rank adds the full bias, then values are summed
Fix: Add bias only once after all-reduce, not per-rank
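This double-bias bug is easy to demonstrate in simulation: adding the bias inside each rank's partial output inflates it by (world_size - 1) copies of the bias (sizes here are illustrative):

```python
import torch

torch.manual_seed(0)
world_size = 2
x = torch.randn(3, 4)
W = torch.randn(8, 4)
b = torch.randn(8)
x_shards = torch.chunk(x, world_size, dim=-1)
W_shards = torch.chunk(W, world_size, dim=1)

# Buggy: each rank adds the full bias, so the sum counts it world_size times
buggy = sum(x_s @ W_s.T + b for x_s, W_s in zip(x_shards, W_shards))
# Correct: add the bias once, after the all-reduce
correct = sum(x_s @ W_s.T for x_s, W_s in zip(x_shards, W_shards)) + b

# The error is exactly (world_size - 1) extra copies of the bias
assert torch.allclose(buggy - correct, (world_size - 1) * b.expand(3, 8), atol=1e-4)
```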
Symptom: Implementation works for world_size=1 but fails for larger world sizes
Cause: Assuming external framework handles collective operations
Fix: Read requirements carefully - "as if using all_gather" means implement the operation
Symptom: Implementation has syntax errors or missing code
Cause: File write operation was truncated
Fix: Always read back the complete file after writing to verify integrity
Symptom: Shape mismatch errors during matrix multiplication
Cause: Sharding along wrong dimension (rows vs columns confusion)
Fix: ColumnParallel shards output features (dim=0 of weight), RowParallel shards input features (dim=1 of weight)
Before writing code, confirm understanding of: which weight dimension each layer shards, whether the forward pass must perform the collective operation itself, how the bias is handled, and the expected output shape.
After writing code: read the complete file back to verify it was written without truncation, and confirm the output shape matches the unsharded layer.