Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.
Stable docs: docs/training/communication-overlap.md
Card: card.yaml (co-located)

Expert-parallel (EP) overlap hides the cost of token dispatch/combine all-to-all
communication by running it concurrently with expert FFN compute. Optionally,
delayed expert weight-gradient computation (delay_wgrad_compute) provides
additional overlap by deferring wgrad to overlap with the next layer's forward.
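The timing benefit can be illustrated with a toy per-layer cost model (hypothetical numbers, not measurements): without overlap, dispatch, expert FFN, and combine run serially; with overlap, the all-to-all hides behind expert compute and the layer is bound by whichever side is longer.

```python
# Toy cost model for EP overlap (illustrative numbers, not measurements).
def layer_time_ms(dispatch, ffn, combine, overlapped):
    """Per-layer MoE time: serial sum vs. comm hidden behind compute."""
    if not overlapped:
        return dispatch + ffn + combine
    # With overlap, the dispatch/combine all-to-all runs concurrently with
    # expert FFN compute, so the layer is bound by the longer of the two.
    return max(dispatch + combine, ffn)

serial = layer_time_ms(2.0, 5.0, 2.0, overlapped=False)     # 9.0 ms
overlapped = layer_time_ms(2.0, 5.0, 2.0, overlapped=True)  # 5.0 ms
```

Delayed wgrad extends the same idea across layer boundaries by deferring expert weight-gradient compute so it can overlap with the next layer's forward.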
Bridge supports two dispatcher paths:
| Dispatcher | Backend | When to use |
|---|---|---|
| `alltoall` | Standard MoE all-to-all | Default, broadest compatibility |
| `flex` | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |
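A small helper (hypothetical, not part of Bridge) that maps a CUDA compute capability to whether the flex path is worth trying, following the table above:

```python
def flex_supported(compute_capability):
    """Return True for Ampere (8.x), Hopper (9.x), or Blackwell
    (10.x datacenter, 12.x consumer) compute capabilities.

    `compute_capability` is a (major, minor) tuple, e.g. from
    torch.cuda.get_device_capability(); this is a sketch, not the
    actual gating logic inside apply_flex_dispatcher_backend.
    """
    major, _ = compute_capability
    return major in (8, 9, 10, 12)
```

For example, `flex_supported((9, 0))` (H100) is True, while `flex_supported((7, 0))` (V100) is False, in which case stick with `alltoall`.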
Use EP overlap when:

- EP > 1

Prefer:

- `alltoall` dispatcher for the first rollout (broader compatibility)
- `flex` + DeepEP/HybridEP when running on supported GPUs and seeking additional gains

Avoid EP overlap when:

- `moe_shared_expert_overlap` is enabled

Expected outcome:
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False
```
To enable the flex dispatcher path instead:

```python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
```
Requirements and constraints:

- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- `moe_shared_expert_overlap = False`
- BF16 or FP16 precision
- PyTorch >= 2.6.0
- If PP > 1, `virtual_pipeline_model_parallel_size` must be set
- `recompute_granularity != "full"`, `recompute_method = None`, `recompute_num_layers = None`
- `mtp_num_layers` must be None or 1
- `delay_wgrad_compute` requires `overlap_moe_expert_parallel_comm` as a prerequisite
- `delay_wgrad_compute` with `overlap_grad_reduce` requires TE >= 2.7.0
- `delay_wgrad_compute` with `gradient_accumulation_fusion` requires TE >= 2.7.0
- `attn` CUDA graph scope + `delay_wgrad_compute` requires TE >= 2.12.0, `gradient_accumulation_fusion = True`, and no attention bias

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True
```
Use this as the correctness-first starting point. Add delayed wgrad, flex dispatch, and CUDA-graph interactions only after the plain overlap path is known to work.
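One way to stage that enablement is a small helper (illustrative, not a Bridge API) that layers delayed wgrad on top of the plain-overlap baseline only once you ask for it:

```python
from types import SimpleNamespace

def enable_ep_overlap(cfg, stage=1):
    """Stage 1: plain EP overlap only. Stage 2: also delayed wgrad.

    Illustrative helper, not a Bridge API; `cfg` is assumed to expose
    the .comm_overlap and .model namespaces used in the examples above.
    """
    cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
    cfg.model.moe_shared_expert_overlap = False  # incompatible with EP overlap
    # delay_wgrad_compute requires EP overlap, so gate it behind stage 2.
    cfg.comm_overlap.delay_wgrad_compute = stage >= 2
    return cfg

# Demo on a dummy config object (SimpleNamespace stands in for the real cfg).
cfg = SimpleNamespace(comm_overlap=SimpleNamespace(), model=SimpleNamespace())
enable_ep_overlap(cfg, stage=1)
```

Run a few steps at stage 1, confirm loss and throughput look sane, then move to stage 2.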
Performance harness example:
```bash
python scripts/performance/setup_experiment.py \
    --model qwen3-30b-a3b \
    --moe_a2a_overlap \
    --num_nodes 2 \
    --gpus_per_node 8 \
    --max_steps 20
```
Unit test verification:
```bash
# Focused MoE overlap tests
uv run python -m pytest \
    tests/unit_tests/training/test_comm_overlap.py -k "moe" \
    tests/unit_tests/training/test_deepep.py -q

# Full suites
uv run python -m pytest \
    tests/unit_tests/training/test_comm_overlap.py \
    tests/unit_tests/training/test_deepep.py -q
```
After a successful run with EP overlap:

- `CommOverlapConfig` finalization passes
- `overlap_moe_expert_parallel_comm` appears as `True` in the logged config
- With the flex path: `moe_token_dispatcher_type = "flex"` and the correct backend in logs

The relevant validation logic in `CommOverlapConfig` (excerpt):

```python
if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...

if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...
```
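Before launching a long job, you can mirror the core preconditions in a small pre-flight function and fail with one readable message instead of the first assertion hit. This is a sketch, not the actual `CommOverlapConfig` code:

```python
from types import SimpleNamespace

def preflight_ep_overlap(model_cfg):
    """Raise early if EP overlap would fail CommOverlapConfig finalization.

    Sketch mirroring the asserts above, not the real implementation;
    only the model-config checks are covered here.
    """
    problems = []
    if model_cfg.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if model_cfg.num_moe_experts is None or model_cfg.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if model_cfg.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append("dispatcher must be 'alltoall' or 'flex'")
    if model_cfg.moe_shared_expert_overlap:
        problems.append("moe_shared_expert_overlap must be False")
    if not (model_cfg.bf16 or model_cfg.fp16):
        problems.append("BF16 or FP16 precision is required")
    if problems:
        raise ValueError("EP overlap preconditions failed: " + "; ".join(problems))

# Demo on a dummy config (SimpleNamespace stands in for the real model cfg).
good = SimpleNamespace(
    expert_model_parallel_size=8, num_moe_experts=64,
    moe_token_dispatcher_type="alltoall", moe_shared_expert_overlap=False,
    bf16=True, fp16=False,
)
preflight_ep_overlap(good)  # passes silently
```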
The flex dispatcher helper (excerpt):

```python
def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False
```
The performance harness override (excerpt):

```python
def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False
```
| File | Coverage |
|---|---|
| `tests/unit_tests/training/test_comm_overlap.py` | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| `tests/unit_tests/training/test_deepep.py` | DeepEP/HybridEP helper activation and GPU gating |
| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| `assert expert_model_parallel_size > 1` | EP not configured | Check `expert_model_parallel_size` | Set EP > 1 |
| `assert moe_token_dispatcher_type` | Wrong dispatcher | Check dispatcher type | Use `"alltoall"` or `"flex"` |
| assert on BF16/FP16 | Wrong precision | Check `bf16` and `fp16` | Set `bf16 = True` |
| Hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| `assert virtual_pipeline_model_parallel_size` | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| `assert recompute_granularity` | Full recompute enabled | Check recompute settings | Disable full recompute |
| `assert overlap_moe_expert_parallel_comm` required | Delayed wgrad without EP overlap | Check `delay_wgrad_compute` without overlap | Enable EP overlap first |
| `assert gradient_accumulation_fusion` | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Enable `gradient_accumulation_fusion` |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check `add_bias_linear` / `add_qkv_bias` | Disable attention bias |
| No throughput gain from flex dispatcher | `apply_flex_dispatcher_backend` not called | Check `moe_token_dispatcher_type` in logs | Call `apply_flex_dispatcher_backend(...)` |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |
Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch; you must call `apply_flex_dispatcher_backend(...)`.
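A cheap post-configuration guard catches both the forgot-to-call case and the silent-skip-on-unsupported-GPU case before training starts. This is a sketch; the field names follow the config examples above:

```python
from types import SimpleNamespace

def assert_flex_active(model_cfg, expected_backend="deepep"):
    """Fail fast if the flex dispatcher did not actually activate,
    e.g. because apply_flex_dispatcher_backend skipped on an
    unsupported GPU. Sketch only, not a Bridge API."""
    assert model_cfg.moe_token_dispatcher_type == "flex", (
        "flex dispatcher inactive; was apply_flex_dispatcher_backend called?"
    )
    assert model_cfg.moe_flex_dispatcher_backend == expected_backend

# Demo on a dummy config (SimpleNamespace stands in for the real model cfg).
model_cfg = SimpleNamespace(
    moe_token_dispatcher_type="flex", moe_flex_dispatcher_backend="deepep"
)
assert_flex_active(model_cfg)  # passes; raises AssertionError otherwise
```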