Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
Stable docs: docs/parallelisms.md
Card: card.yaml (co-located)
GPU OOM failures during training often stem from memory fragmentation rather than raw capacity. PyTorch's default CUDA allocator can leave unusable gaps between allocations. The single most effective fix is:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
This tells PyTorch to use expandable (non-fixed-size) memory segments, which dramatically reduces fragmentation and often eliminates borderline OOM without any model or parallelism changes.
Beyond fragmentation, actual peak memory is determined by parameter, gradient, and optimizer-state memory (set by the parallelism layout) plus activation memory (set by batch size, sequence length, and recompute settings).
When a training run OOMs or is close to the memory limit:

1. Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first. This fixes fragmentation-induced OOM with zero performance cost. Most Slurm launch templates already include it.
2. Enable selective activation recompute (`recompute_modules=["core_attn"]`) if not already enabled. See skills/perf-techniques/activation-recompute/SKILL.md.
3. Add `mlp` recompute if still OOM. Saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).

Set in the job's environment before launching:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
In Slurm scripts this is typically placed alongside other env vars:
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
No model config changes needed. Zero throughput cost.
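For Python entry points, the same setting can be applied programmatically, provided it happens before the first CUDA allocation. A minimal sketch (the `setdefault` guard, which avoids clobbering a value already exported by the launch script, is our own convention):

```python
import os

# Must run before torch (or any CUDA-using library) makes its first
# allocation; the caching allocator reads PYTORCH_CUDA_ALLOC_CONF once,
# at initialization, and ignores later changes.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```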
If the model genuinely does not fit (not fragmentation), adjust parallelism:
| Strategy | Memory effect | Throughput cost | Notes |
|---|---|---|---|
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See skills/perf-techniques/megatron-fsdp/ |
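The memory effects in the table above can be sanity-checked with back-of-envelope arithmetic. A rough sketch, assuming even sharding, BF16 params/grads, and FP32 Adam state (master weights plus two moments, 12 bytes per param); real Megatron layouts differ in detail and activations are excluded:

```python
def param_state_gib(total_params, tp, pp, dp, distributed_optimizer=False):
    """Rough per-GPU memory (GiB) for params + grads + optimizer state.

    Assumes BF16 params and grads (2 bytes each) and FP32 Adam state
    (master weights + exp_avg + exp_avg_sq = 12 bytes per param).
    """
    shard = total_params / (tp * pp)   # params held by one GPU
    opt_bytes = 12 * shard
    if distributed_optimizer:
        opt_bytes /= dp                # optimizer state sharded over DP ranks
    return (2 * shard + 2 * shard + opt_bytes) / 2**30

# Llama3-70B-scale example at TP=4, PP=4, DP=2: the distributed
# optimizer divides optimizer-state memory by the DP degree.
base = param_state_gib(70e9, tp=4, pp=4, dp=2)
dist = param_state_gib(70e9, tp=4, pp=4, dp=2, distributed_optimizer=True)
print(f"baseline: {base:.1f} GiB, distributed optimizer: {dist:.1f} GiB")
```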
See skills/perf-techniques/activation-recompute/SKILL.md for full details.
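A hedged sketch of enabling selective recompute, following the `recompute_modules` field used above; `recompute_granularity` follows Megatron's selective-recompute convention, and exact field names may vary by Megatron Bridge version. A `SimpleNamespace` stands in for the real config object:

```python
from types import SimpleNamespace

# Stand-in for the model config; illustrative only.
cfg = SimpleNamespace(model=SimpleNamespace())

# Step 2 of the OOM checklist: recompute core attention only.
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn"]

# Step 3, if still OOM (~3 GB saved, ~16% utilization cost on 70B dense):
# cfg.model.recompute_modules = ["core_attn", "mlp"]
```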
CPU offloading is enabled on the model config:

cfg.model.cpu_offloading = True

It is incompatible with PP > 1 and only usable when `pipeline_model_parallel_size = 1`.
Virtual pipeline parallelism (VPP) is primarily a throughput optimization that reduces pipeline bubble overhead by interleaving smaller model chunks. Its effect on peak memory is minimal — changing VPP does not meaningfully change the total activation, parameter, or optimizer memory on a GPU.
In earlier experiments we incorrectly attributed an OOM fix to VPP tuning
(VPP 5→10). The actual fix was PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
which eliminated memory fragmentation. The VPP=10 run actually used slightly
more peak memory (60.2 GB vs 58.8 GB) but did not OOM because expandable
segments prevented fragmentation.
VPP should be tuned for pipeline bubble reduction (see docs/parallelisms.md),
not as a memory fix.
Known constraints:

- `expandable_segments:True` is incompatible with `--use-nccl-ub` (NCCL user-buffer registration). See Megatron-FSDP docs.
- With `expandable_segments:True`, set `NCCL_GRAPH_REGISTER=0` (required on pre-Blackwell GPUs, enforced by MCore CudaGraphManager).
- CPU offloading requires `pipeline_model_parallel_size = 1`.
- The distributed optimizer requires `use_distributed_optimizer = True` in the optimizer config.

Benchmark: Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):
| Experiment | TP | PP | VPP | DP | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|---|---|---|
| Baseline | 4 | 4 | 5 | 2 | ~704 | -0.8% | 58.8 | OOM (fragmentation) |
| More PP | 4 | 8 | 5 | 1 | 668.0 | -5.9% | 53.2 | Borderline perf |
| More TP | 8 | 4 | 5 | 1 | 508.7 | -28.4% | 50.2 | Severe regression |
| Baseline + expandable_segments | 4 | 4 | 5 | 2 | ~704 | -0.8% | ~59 | Passed |
Key takeaways:

expandable_segments:True is the winner. The baseline OOM was caused by memory fragmentation, not insufficient capacity. Setting this env var eliminated the OOM with zero throughput cost and no parallelism changes.

CPU offloading experiments (all blocked by the PP > 1 restriction):

| Experiment | offload_layers | Result |
|---|---|---|
| Exp 4 | 2 | Incompatible (PP > 1) |
| Exp 5 | 4 | Incompatible (PP > 1) |
| Exp 6 | 6 | Incompatible (PP > 1) |
ValueError: Currently there is no support for Pipeline parallelism with CPU offloading. This approach is blocked for any model using PP > 1.
Selective activation recompute with mlp saved ~3 GB peak memory but cost
~16% GPU utilization on this workload. See
skills/perf-techniques/activation-recompute/SKILL.md for full results.
The PP + CPU offloading restriction is enforced in the MCore config validation:

if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
A related validation requires the VPP degree to divide the per-rank layer count:

if pipeline_parallel_size and self.virtual_pipeline_model_parallel_size is not None:
    num_layers_per_middle_pipeline_rank = num_layers // pipeline_parallel_size
    if (
        num_layers_per_middle_pipeline_rank
        % self.virtual_pipeline_model_parallel_size
        != 0
    ):
        raise ValueError(
            f"number of layers on each middle pipeline rank: "
            f"{num_layers_per_middle_pipeline_rank} must be divisible by virtual "
            f"pipeline parallel degree {self.virtual_pipeline_model_parallel_size}"
        )
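This divisibility constraint can be checked ahead of time. A small helper (our own, not a Megatron API) that lists the VPP degrees valid for a given layer count and PP degree:

```python
def valid_vpp_sizes(num_layers, pipeline_parallel_size):
    """VPP degrees satisfying Megatron's check: the per-pipeline-rank
    layer count must be divisible by the VPP degree."""
    if num_layers % pipeline_parallel_size != 0:
        raise ValueError("num_layers must be divisible by pipeline_parallel_size")
    per_rank = num_layers // pipeline_parallel_size
    return [v for v in range(1, per_rank + 1) if per_rank % v == 0]

# Llama3 70B has 80 layers; with PP=4 each rank holds 20 layers.
print(valid_vpp_sizes(80, 4))  # [1, 2, 4, 5, 10, 20]
```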
To minimize the pipeline bubble, the computation on each GPU can be divided into multiple subsets of layers (referred to as model chunks), rather than a single contiguous block. Enable this by setting `virtual_pipeline_model_parallel_size`:
model_config = GPTModelProvider(
pipeline_model_parallel_size=4,
virtual_pipeline_model_parallel_size=2, # 2 model chunks per pipeline stage
# ... other model parameters
)
| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| OOM on a single rank despite headroom on others | Memory fragmentation | check if expandable_segments:True is set | set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| OOM with expandable_segments already set | Genuine capacity limit | check nvidia-smi for param/optimizer memory | increase PP, use distributed optimizer, or add recompute |
| ValueError: PP + CPU offloading | using cpu_offloading with PP > 1 | check PP config | disable CPU offloading or set PP=1 |
| RuntimeError with --use-nccl-ub + expandable segments | NCCL UB incompatible with expandable allocator | check env vars | remove expandable_segments:True or disable --use-nccl-ub |
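To distinguish fragmentation from a genuine capacity limit at runtime, compare allocated vs. reserved allocator memory: a large reserved-but-unallocated gap at OOM time points to fragmentation. A sketch using our own helper, meant to be fed values from `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`:

```python
def fragmentation_ratio(allocated_bytes, reserved_bytes):
    """Fraction of reserved CUDA memory not backing live tensors.

    High values (e.g. > 0.2) near an OOM suggest fragmentation, which
    expandable_segments:True typically fixes; low values suggest a
    genuine capacity limit requiring parallelism or recompute changes.
    """
    if reserved_bytes == 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes

# In a training loop, pass torch.cuda.memory_allocated() and
# torch.cuda.memory_reserved() for the current device.
print(fragmentation_ratio(60 * 2**30, 80 * 2**30))  # 0.25
```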
Quick check that expandable_segments:True is active:
import os
assert "expandable_segments:True" in os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
For Slurm jobs, verify the env var is exported before the training command in the launch script.