Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.
Stable docs: docs/training/communication-overlap.md
Card: card.yaml (co-located)

Expert-parallel (EP) overlap hides the cost of token dispatch/combine all-to-all
communication by running it concurrently with expert FFN compute. Optionally,
delayed expert weight-gradient computation (delay_wgrad_compute) provides
additional overlap by deferring wgrad to overlap with the next layer's forward.
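The timing benefit can be illustrated with a toy per-layer cost model (hypothetical numbers, not measurements): without overlap, dispatch, expert FFN, and combine run serially; with overlap, the all-to-all hides behind expert compute and the layer is bound by whichever side is longer.

```python
# Toy cost model for EP overlap (illustrative numbers, not measurements).
def layer_time_ms(dispatch, ffn, combine, overlapped):
    """Per-layer MoE time: serial sum vs. comm hidden behind compute."""
    if not overlapped:
        return dispatch + ffn + combine
    # With overlap, the dispatch/combine all-to-all runs concurrently with
    # expert FFN compute, so the layer is bound by the longer of the two.
    return max(dispatch + combine, ffn)

serial = layer_time_ms(2.0, 5.0, 2.0, overlapped=False)     # 9.0 ms
overlapped = layer_time_ms(2.0, 5.0, 2.0, overlapped=True)  # 5.0 ms
```

Delayed wgrad extends the same idea across layer boundaries by deferring expert weight-gradient compute so it can overlap with the next layer's forward.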
Bridge supports two dispatcher paths:
| Dispatcher | Backend | When to use |
|---|---|---|
| `alltoall` | Standard MoE all-to-all | Default, broadest compatibility |
| `flex` | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |
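A small helper (hypothetical, not part of Bridge) that maps a CUDA compute capability to whether the flex path is worth trying, following the table above:

```python
def flex_supported(compute_capability):
    """Return True for Ampere (8.x), Hopper (9.x), or Blackwell
    (10.x datacenter, 12.x consumer) compute capabilities.

    `compute_capability` is a (major, minor) tuple, e.g. from
    torch.cuda.get_device_capability(); this is a sketch, not the
    actual gating logic inside apply_flex_dispatcher_backend.
    """
    major, _ = compute_capability
    return major in (8, 9, 10, 12)
```

For example, `flex_supported((9, 0))` (H100) is True, while `flex_supported((7, 0))` (V100) is False, in which case stick with `alltoall`.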
Use EP overlap when:

- EP > 1

Prefer:

- `alltoall` dispatcher for the first rollout (broader compatibility)
- `flex` + DeepEP/HybridEP when running on supported GPUs and seeking additional gains

Avoid EP overlap when:

- `moe_shared_expert_overlap` is enabled

Expected outcome:
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False
```
To enable the flex dispatcher path instead:

```python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
```
Requirements and constraints:

- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- `moe_shared_expert_overlap = False`
- BF16 or FP16 precision
- PyTorch >= 2.6.0
- If PP > 1, `virtual_pipeline_model_parallel_size` must be set
- `recompute_granularity != "full"`, `recompute_method = None`, `recompute_num_layers = None`
- `mtp_num_layers` must be None or 1
- `delay_wgrad_compute` requires `overlap_moe_expert_parallel_comm` as a prerequisite
- `delay_wgrad_compute` with `overlap_grad_reduce` requires TE >= 2.7.0
- `delay_wgrad_compute` with `gradient_accumulation_fusion` requires TE >= 2.7.0
- `attn` CUDA graph scope + `delay_wgrad_compute` requires TE >= 2.12.0, `gradient_accumulation_fusion = True`, and no attention bias

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True
```
Use this as the correctness-first starting point. Add delayed wgrad, flex dispatch, and CUDA-graph interactions only after the plain overlap path is known to work.
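One way to stage that enablement is a small helper (illustrative, not a Bridge API) that layers delayed wgrad on top of the plain-overlap baseline only once you ask for it:

```python
from types import SimpleNamespace

def enable_ep_overlap(cfg, stage=1):
    """Stage 1: plain EP overlap only. Stage 2: also delayed wgrad.

    Illustrative helper, not a Bridge API; `cfg` is assumed to expose
    the .comm_overlap and .model namespaces used in the examples above.
    """
    cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
    cfg.model.moe_shared_expert_overlap = False  # incompatible with EP overlap
    # delay_wgrad_compute requires EP overlap, so gate it behind stage 2.
    cfg.comm_overlap.delay_wgrad_compute = stage >= 2
    return cfg

# Demo on a dummy config object (SimpleNamespace stands in for the real cfg).
cfg = SimpleNamespace(comm_overlap=SimpleNamespace(), model=SimpleNamespace())
enable_ep_overlap(cfg, stage=1)
```

Run a few steps at stage 1, confirm loss and throughput look sane, then move to stage 2.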
Performance harness example:
```bash
python scripts/performance/setup_experiment.py \
    --model qwen3-30b-a3b \
    --moe_a2a_overlap \
    --num_nodes 2 \
    --gpus_per_node 8 \
    --max_steps 20
```
Unit test verification:
```bash
# Focused MoE overlap tests
uv run python -m pytest \
    tests/unit_tests/training/test_comm_overlap.py -k "moe" \
    tests/unit_tests/training/test_deepep.py -q

# Full suites
uv run python -m pytest \
    tests/unit_tests/training/test_comm_overlap.py \
    tests/unit_tests/training/test_deepep.py -q
```
After a successful run with EP overlap:

- `CommOverlapConfig` finalization passes
- `overlap_moe_expert_parallel_comm` appears as `True` in the logged config
- With the flex path: `moe_token_dispatcher_type = "flex"` and the correct backend in logs

The relevant validation logic in `CommOverlapConfig` (excerpt):

```python
if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...

if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...
```
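Before launching a long job, you can mirror the core preconditions in a small pre-flight function and fail with one readable message instead of the first assertion hit. This is a sketch, not the actual `CommOverlapConfig` code:

```python
from types import SimpleNamespace

def preflight_ep_overlap(model_cfg):
    """Raise early if EP overlap would fail CommOverlapConfig finalization.

    Sketch mirroring the asserts above, not the real implementation;
    only the model-config checks are covered here.
    """
    problems = []
    if model_cfg.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if model_cfg.num_moe_experts is None or model_cfg.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if model_cfg.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append("dispatcher must be 'alltoall' or 'flex'")
    if model_cfg.moe_shared_expert_overlap:
        problems.append("moe_shared_expert_overlap must be False")
    if not (model_cfg.bf16 or model_cfg.fp16):
        problems.append("BF16 or FP16 precision is required")
    if problems:
        raise ValueError("EP overlap preconditions failed: " + "; ".join(problems))

# Demo on a dummy config (SimpleNamespace stands in for the real model cfg).
good = SimpleNamespace(
    expert_model_parallel_size=8, num_moe_experts=64,
    moe_token_dispatcher_type="alltoall", moe_shared_expert_overlap=False,
    bf16=True, fp16=False,
)
preflight_ep_overlap(good)  # passes silently
```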
The flex dispatcher helper (excerpt):

```python
def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False
```
The performance harness override (excerpt):

```python
def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False
```
| File | Coverage |
|---|---|
| `tests/unit_tests/training/test_comm_overlap.py` | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| `tests/unit_tests/training/test_deepep.py` | DeepEP/HybridEP helper activation and GPU gating |
| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| `assert expert_model_parallel_size > 1` | EP not configured | Check `expert_model_parallel_size` | Set EP > 1 |
| `assert moe_token_dispatcher_type` | Wrong dispatcher | Check dispatcher type | Use `"alltoall"` or `"flex"` |
| assert on BF16/FP16 | Wrong precision | Check `bf16` and `fp16` | Set `bf16 = True` |
| Hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| `assert virtual_pipeline_model_parallel_size` | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| `assert recompute_granularity` | Full recompute enabled | Check recompute settings | Disable full recompute |
| `assert overlap_moe_expert_parallel_comm` required | Delayed wgrad without EP overlap | Check `delay_wgrad_compute` without overlap | Enable EP overlap first |
| `assert gradient_accumulation_fusion` | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Enable `gradient_accumulation_fusion` |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check `add_bias_linear` / `add_qkv_bias` | Disable attention bias |
| No throughput gain from flex dispatcher | `apply_flex_dispatcher_backend` not called | Check `moe_token_dispatcher_type` in logs | Call `apply_flex_dispatcher_backend(...)` |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |
Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch; you must call `apply_flex_dispatcher_backend(...)`.
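A cheap post-configuration guard catches both the forgot-to-call case and the silent-skip-on-unsupported-GPU case before training starts. This is a sketch; the field names follow the config examples above:

```python
from types import SimpleNamespace

def assert_flex_active(model_cfg, expected_backend="deepep"):
    """Fail fast if the flex dispatcher did not actually activate,
    e.g. because apply_flex_dispatcher_backend skipped on an
    unsupported GPU. Sketch only, not a Bridge API."""
    assert model_cfg.moe_token_dispatcher_type == "flex", (
        "flex dispatcher inactive; was apply_flex_dispatcher_backend called?"
    )
    assert model_cfg.moe_flex_dispatcher_backend == expected_backend

# Demo on a dummy config (SimpleNamespace stands in for the real model cfg).
model_cfg = SimpleNamespace(
    moe_token_dispatcher_type="flex", moe_flex_dispatcher_backend="deepep"
)
assert_flex_active(model_cfg)  # passes; raises AssertionError otherwise
```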