Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage. Summarizes patterns from DSV3, Qwen3, Qwen3-Next, and VLM bring-up work.
Stable docs: docs/training/moe-optimization.md
Card: card.yaml (co-located)
| Hardware | First choice | Why |
|---|---|---|
| H100 | DeepEP | Strong default for cross-node EP on Hopper |
| B200 | DeepEP | Good first choice unless a platform-specific HybridEP path is available |
| GB200 / GB300 NVL72 | HybridEP | Best fit for NVLink-domain-aware dispatch and lower memory pressure |
| Unknown or first bring-up | alltoall | Easiest path for correctness and debugging |
| EP size | Guidance |
|---|---|
| Small EP | Dispatcher choice is usually second-order; start with alltoall or DeepEP |
| Medium EP | DeepEP often becomes worthwhile |
| Large EP | HybridEP is usually the best target on NVL72 systems |
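The hardware and EP-size guidance in the two tables above can be sketched as a small selection helper. The function name, hardware strings, and the EP threshold are illustrative placeholders, not an API from any dispatcher library.

```python
def pick_dispatcher(hardware: str, ep_size: int) -> str:
    """Illustrative first-choice dispatcher, following the tables above.

    The ep_size threshold (16) is a placeholder for "medium EP"; at small
    EP the dispatcher choice is second-order, so the conservative baseline
    is returned.
    """
    if hardware in ("GB200", "GB300"):
        return "hybridep"  # NVLink-domain-aware dispatch on NVL72 systems
    if hardware in ("H100", "B200"):
        # Small EP: keep bring-up and debugging simple with alltoall.
        return "deepep" if ep_size >= 16 else "alltoall"
    return "alltoall"  # unknown hardware or first bring-up
```

The fallback mirrors the last table row: when in doubt, start with alltoall for correctness, then promote.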
| Workload | Common best path | Notes |
|---|---|---|
| DSV3 at large scale | HybridEP on GB200 or GB300, DeepEP on H100 | Dispatcher choice matters more as EP and PP both grow |
| Qwen3 235B | DeepEP on H100, HybridEP on GB200 | HybridEP usually wins on GB200 and often uses less memory |
| Qwen3 30B | DeepEP | Smaller models still benefit, but the absolute gap is smaller |
| Qwen3-Next | Close race in BF16, HybridEP stronger in FP8 or memory-tight runs | Good reminder to test, not assume |
| MoE VLMs | Start simple, then test HybridEP on GB200-class systems | Vision workloads are sensitive to both memory and host overhead |
The broad trend is more important than any single row in the tracker:
alltoall is usually the conservative baseline. In practice, the stack often moves from roughly "low-teens MFU" territory with an untuned baseline into "high-teens to low-20s MFU" territory after the full dispatcher and kernel stack is tuned.
For Qwen3 235B, the practical ordering is usually:
alltoall for initial bring-up. HybridEP is usually modestly faster than alltoall on this workload and often has noticeably better memory headroom.
This family is a good reminder that dispatcher wins are workload-dependent:
alltoall and HybridEP can be close.

DeepEP is selected by setting `moe_token_dispatcher_type="flex"` and `moe_flex_dispatcher_backend="deepep"`.
`--moe-deepep-num-sms 20`

Tune the SM count allocated to DeepEP communication kernels (default 20). The optimal value depends on the workload and EP degree.
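Put together, a DeepEP configuration can be sketched as a plain dict. The keys are exactly the settings named above; the dict container itself is a stand-in for whatever config object the training stack actually uses.

```python
# DeepEP dispatcher settings from this section; the dict is illustrative,
# not a real Megatron-style config object.
deepep_config = {
    "moe_token_dispatcher_type": "flex",
    "moe_flex_dispatcher_backend": "deepep",
    "moe_deepep_num_sms": 20,  # default; sweep per workload and EP degree
}
```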
HybridEP is selected by setting `moe_token_dispatcher_type="flex"` and `moe_flex_dispatcher_backend="hybridep"`.
`--moe-hybridep-num-sms 16`

Tune the SM count allocated to HybridEP communication (default 16). The performance harness uses 32 for HybridEP workloads. Sweep between 16 and 32 for the target hardware. Set `NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN` to match the NVLink domain size of the deployment. If it does not match the actual topology, performance and sometimes correctness will suffer.
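The HybridEP settings above can be sketched the same way. As before, the dict container is illustrative; the NVLink-domain value shown (72, one full NVL72 domain) is an example and must be replaced with the deployment's real topology.

```python
import os

# HybridEP dispatcher settings from this section; dict is illustrative.
hybridep_config = {
    "moe_token_dispatcher_type": "flex",
    "moe_flex_dispatcher_backend": "hybridep",
    "moe_hybridep_num_sms": 32,  # default is 16; harness uses 32, sweep 16-32
}

# Must match the actual NVLink domain size, or performance (and sometimes
# correctness) suffers. 72 here is illustrative, not a recommendation.
os.environ["NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN"] = "72"
```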
`--moe-router-force-load-balancing`
For performance benchmarking, force-balance routing is the safer default. It usually outperforms dropless routing in large-scale benchmarks and makes results more comparable across dispatcher backends.
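A dispatcher A/B sweep that follows this advice might look like the sketch below: everything except the backend is held fixed, including the routing mode. `run_benchmark` is a hypothetical harness entry point, not a real API.

```python
# Keep all settings except the backend fixed, including routing mode,
# so dispatcher results stay comparable across backends.
base_config = {
    "moe_token_dispatcher_type": "flex",
    "moe_router_force_load_balancing": True,  # same routing mode for all runs
}

configs = []
for backend in ("deepep", "hybridep"):
    cfg = {**base_config, "moe_flex_dispatcher_backend": backend}
    configs.append(cfg)
    # run_benchmark(cfg)  # hypothetical harness entry point
```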
| Feature | Interaction |
|---|---|
| CUDA graphs | Best paired with the `attn`, `moe_router`, and `moe_preprocess` scopes on dropless MoE |
| EP overlap | Helps when dispatcher time is still visible after backend tuning |
| FP8 | Often increases the relative importance of communication and host overhead |
| CPU affinity | Can matter as much as dispatcher choice on GB200 or GB300 |
| Pipeline layout | Poor PP or VPP layout can erase dispatcher gains |
Do not compare dispatchers on different stacks: container, routing mode, PP layout, and CUDA-graph scope can move the result as much as the dispatcher.
HybridEP is topology-sensitive: it is not a universal win outside the hardware it was designed for.
Both dispatchers need SM tuning: the defaults for `moe_deepep_num_sms` (20) and `moe_hybridep_num_sms` (16) are reasonable starting points but rarely optimal.
Force-balance and dropless are not interchangeable baselines: keep the routing mode fixed when comparing dispatcher backends.
Memory and throughput can trade off differently by model: Qwen3-style runs may show a smaller speed delta than DSV3, but still justify HybridEP for memory headroom.