Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
Stable docs: docs/parallelisms.md
Card: card.yaml (co-located)
GPU OOM failures during training often stem from memory fragmentation rather than raw capacity. PyTorch's default CUDA allocator can leave unusable gaps between allocations. The single most effective fix is:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
This tells PyTorch to use expandable (non-fixed-size) memory segments, which dramatically reduces fragmentation and often eliminates borderline OOM without any model or parallelism changes.
Beyond fragmentation, actual peak memory is determined by parameter, gradient, and optimizer-state memory (set by the parallelism layout) plus activation memory (set by batch size, sequence length, and recompute settings).
When a training run OOMs or is close to the memory limit:

1. Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first. This fixes fragmentation-induced OOM with zero performance cost. Most Slurm launch templates already include it.
2. Enable selective activation recompute (`recompute_modules=["core_attn"]`) if not already enabled. See skills/perf-techniques/activation-recompute/SKILL.md.
3. Add `mlp` recompute if still OOM. Saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).

Set in the job's environment before launching:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
In Slurm scripts this is typically placed alongside other env vars:
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
No model config changes needed. Zero throughput cost.
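For Python entry points, the same setting can be applied programmatically, provided it happens before the first CUDA allocation. A minimal sketch (the `setdefault` guard, which avoids clobbering a value already exported by the launch script, is our own convention):

```python
import os

# Must run before torch (or any CUDA-using library) makes its first
# allocation; the caching allocator reads PYTORCH_CUDA_ALLOC_CONF once,
# at initialization, and ignores later changes.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```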
If the model genuinely does not fit (not fragmentation), adjust parallelism:
| Strategy | Memory effect | Throughput cost | Notes |
|---|---|---|---|
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See skills/perf-techniques/megatron-fsdp/ |
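The memory effects in the table above can be sanity-checked with back-of-envelope arithmetic. A rough sketch, assuming even sharding, BF16 params/grads, and FP32 Adam state (master weights plus two moments, 12 bytes per param); real Megatron layouts differ in detail and activations are excluded:

```python
def param_state_gib(total_params, tp, pp, dp, distributed_optimizer=False):
    """Rough per-GPU memory (GiB) for params + grads + optimizer state.

    Assumes BF16 params and grads (2 bytes each) and FP32 Adam state
    (master weights + exp_avg + exp_avg_sq = 12 bytes per param).
    """
    shard = total_params / (tp * pp)   # params held by one GPU
    opt_bytes = 12 * shard
    if distributed_optimizer:
        opt_bytes /= dp                # optimizer state sharded over DP ranks
    return (2 * shard + 2 * shard + opt_bytes) / 2**30

# Llama3-70B-scale example at TP=4, PP=4, DP=2: the distributed
# optimizer divides optimizer-state memory by the DP degree.
base = param_state_gib(70e9, tp=4, pp=4, dp=2)
dist = param_state_gib(70e9, tp=4, pp=4, dp=2, distributed_optimizer=True)
print(f"baseline: {base:.1f} GiB, distributed optimizer: {dist:.1f} GiB")
```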
See skills/perf-techniques/activation-recompute/SKILL.md for full details.
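A hedged sketch of enabling selective recompute, following the `recompute_modules` field used above; `recompute_granularity` follows Megatron's selective-recompute convention, and exact field names may vary by Megatron Bridge version. A `SimpleNamespace` stands in for the real config object:

```python
from types import SimpleNamespace

# Stand-in for the model config; illustrative only.
cfg = SimpleNamespace(model=SimpleNamespace())

# Step 2 of the OOM checklist: recompute core attention only.
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn"]

# Step 3, if still OOM (~3 GB saved, ~16% utilization cost on 70B dense):
# cfg.model.recompute_modules = ["core_attn", "mlp"]
```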
CPU offloading is enabled on the model config:

cfg.model.cpu_offloading = True

It is incompatible with PP > 1 and only usable when `pipeline_model_parallel_size = 1`.
Virtual pipeline parallelism (VPP) is primarily a throughput optimization that reduces pipeline bubble overhead by interleaving smaller model chunks. Its effect on peak memory is minimal — changing VPP does not meaningfully change the total activation, parameter, or optimizer memory on a GPU.
In earlier experiments we incorrectly attributed an OOM fix to VPP tuning
(VPP 5→10). The actual fix was PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
which eliminated memory fragmentation. The VPP=10 run actually used slightly
more peak memory (60.2 GB vs 58.8 GB) but did not OOM because expandable
segments prevented fragmentation.
VPP should be tuned for pipeline bubble reduction (see docs/parallelisms.md),
not as a memory fix.
Known constraints:

- `expandable_segments:True` is incompatible with `--use-nccl-ub` (NCCL user-buffer registration). See Megatron-FSDP docs.
- With `expandable_segments:True`, set `NCCL_GRAPH_REGISTER=0` (required on pre-Blackwell GPUs, enforced by MCore CudaGraphManager).
- CPU offloading requires `pipeline_model_parallel_size = 1`.
- The distributed optimizer requires `use_distributed_optimizer = True` in the optimizer config.

Benchmark: Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):
| Experiment | TP | PP | VPP | DP | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|---|---|---|
| Baseline | 4 | 4 | 5 | 2 | ~704 | -0.8% | 58.8 | OOM (fragmentation) |
| More PP | 4 | 8 | 5 | 1 | 668.0 | -5.9% | 53.2 | Borderline perf |
| More TP | 8 | 4 | 5 | 1 | 508.7 | -28.4% | 50.2 | Severe regression |
| Baseline + expandable_segments | 4 | 4 | 5 | 2 | ~704 | -0.8% | ~59 | Passed |
Key takeaways:

expandable_segments:True is the winner. The baseline OOM was caused by memory fragmentation, not insufficient capacity. Setting this env var eliminated the OOM with zero throughput cost and no parallelism changes.

CPU offloading experiments (all blocked by the PP > 1 restriction):

| Experiment | offload_layers | Result |
|---|---|---|
| Exp 4 | 2 | Incompatible (PP > 1) |
| Exp 5 | 4 | Incompatible (PP > 1) |
| Exp 6 | 6 | Incompatible (PP > 1) |
ValueError: Currently there is no support for Pipeline parallelism with CPU offloading. This approach is blocked for any model using PP > 1.
Selective activation recompute with mlp saved ~3 GB peak memory but cost
~16% GPU utilization on this workload. See
skills/perf-techniques/activation-recompute/SKILL.md for full results.
The PP + CPU offloading restriction is enforced in the MCore config validation:

if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
A related validation requires the VPP degree to divide the per-rank layer count:

if pipeline_parallel_size and self.virtual_pipeline_model_parallel_size is not None:
    num_layers_per_middle_pipeline_rank = num_layers // pipeline_parallel_size
    if (
        num_layers_per_middle_pipeline_rank
        % self.virtual_pipeline_model_parallel_size
        != 0
    ):
        raise ValueError(
            f"number of layers on each middle pipeline rank: "
            f"{num_layers_per_middle_pipeline_rank} must be divisible by virtual "
            f"pipeline parallel degree {self.virtual_pipeline_model_parallel_size}"
        )
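This divisibility constraint can be checked ahead of time. A small helper (our own, not a Megatron API) that lists the VPP degrees valid for a given layer count and PP degree:

```python
def valid_vpp_sizes(num_layers, pipeline_parallel_size):
    """VPP degrees satisfying Megatron's check: the per-pipeline-rank
    layer count must be divisible by the VPP degree."""
    if num_layers % pipeline_parallel_size != 0:
        raise ValueError("num_layers must be divisible by pipeline_parallel_size")
    per_rank = num_layers // pipeline_parallel_size
    return [v for v in range(1, per_rank + 1) if per_rank % v == 0]

# Llama3 70B has 80 layers; with PP=4 each rank holds 20 layers.
print(valid_vpp_sizes(80, 4))  # [1, 2, 4, 5, 10, 20]
```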
To minimize the pipeline bubble, the computation on each GPU can be divided into multiple subsets of layers (referred to as model chunks), rather than a single contiguous block. Enable this by setting `virtual_pipeline_model_parallel_size`:
model_config = GPTModelProvider(
pipeline_model_parallel_size=4,
virtual_pipeline_model_parallel_size=2, # 2 model chunks per pipeline stage
# ... other model parameters
)
| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| OOM on a single rank despite headroom on others | Memory fragmentation | check if expandable_segments:True is set | set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| OOM with expandable_segments already set | Genuine capacity limit | check nvidia-smi for param/optimizer memory | increase PP, use distributed optimizer, or add recompute |
| ValueError: PP + CPU offloading | using cpu_offloading with PP > 1 | check PP config | disable CPU offloading or set PP=1 |
| RuntimeError with --use-nccl-ub + expandable segments | NCCL UB incompatible with expandable allocator | check env vars | remove expandable_segments:True or disable --use-nccl-ub |
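To distinguish fragmentation from a genuine capacity limit at runtime, compare allocated vs. reserved allocator memory: a large reserved-but-unallocated gap at OOM time points to fragmentation. A sketch using our own helper, meant to be fed values from `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`:

```python
def fragmentation_ratio(allocated_bytes, reserved_bytes):
    """Fraction of reserved CUDA memory not backing live tensors.

    High values (e.g. > 0.2) near an OOM suggest fragmentation, which
    expandable_segments:True typically fixes; low values suggest a
    genuine capacity limit requiring parallelism or recompute changes.
    """
    if reserved_bytes == 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes

# In a training loop, pass torch.cuda.memory_allocated() and
# torch.cuda.memory_reserved() for the current device.
print(fragmentation_ratio(60 * 2**30, 80 * 2**30))  # 0.25
```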
Quick check that expandable_segments:True is active:
import os
assert "expandable_segments:True" in os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
For Slurm jobs, verify the env var is exported before the training command in the launch script.