Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage. Summarizes patterns from DSV3, Qwen3, Qwen3-Next, and VLM bring-up work.
Stable docs: docs/training/moe-optimization.md
Card: card.yaml (co-located)
| Hardware | First choice | Why |
|---|---|---|
| H100 | DeepEP | Strong default for cross-node EP on Hopper |
| B200 | DeepEP | Good first choice unless a platform-specific HybridEP path is available |
| GB200 / GB300 NVL72 | HybridEP | Best fit for NVLink-domain-aware dispatch and lower memory pressure |
| Unknown or first bring-up | alltoall | Easiest path for correctness and debugging |
| EP size | Guidance |
|---|---|
| Small EP | Dispatcher choice is usually second-order; start with alltoall or DeepEP |
| Medium EP | DeepEP often becomes worthwhile |
| Large EP | HybridEP is usually the best target on NVL72 systems |
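The hardware and EP-size guidance in the two tables above can be sketched as a small selection helper. The function name, hardware strings, and the EP threshold are illustrative placeholders, not an API from any dispatcher library.

```python
def pick_dispatcher(hardware: str, ep_size: int) -> str:
    """Illustrative first-choice dispatcher, following the tables above.

    The ep_size threshold (16) is a placeholder for "medium EP"; at small
    EP the dispatcher choice is second-order, so the conservative baseline
    is returned.
    """
    if hardware in ("GB200", "GB300"):
        return "hybridep"  # NVLink-domain-aware dispatch on NVL72 systems
    if hardware in ("H100", "B200"):
        # Small EP: keep bring-up and debugging simple with alltoall.
        return "deepep" if ep_size >= 16 else "alltoall"
    return "alltoall"  # unknown hardware or first bring-up
```

The fallback mirrors the last table row: when in doubt, start with alltoall for correctness, then promote.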
| Workload | Common best path | Notes |
|---|---|---|
| DSV3 at large scale | HybridEP on GB200 or GB300, DeepEP on H100 | Dispatcher choice matters more as EP and PP both grow |
| Qwen3 235B | DeepEP on H100, HybridEP on GB200 | HybridEP usually wins on GB200 and often uses less memory |
| Qwen3 30B | DeepEP | Smaller models still benefit, but the absolute gap is smaller |
| Qwen3-Next | Close race in BF16, HybridEP stronger in FP8 or memory-tight runs | Good reminder to test, not assume |
| MoE VLMs | Start simple, then test HybridEP on GB200-class systems | Vision workloads are sensitive to both memory and host overhead |
The broad trend is more important than any single row in the tracker:
alltoall is usually the conservative baseline. In practice, the stack often moves from roughly "low-teens MFU" territory with an untuned baseline into "high-teens to low-20s MFU" territory after the full dispatcher and kernel stack is tuned.
For Qwen3 235B, the practical ordering is usually:
alltoall for initial bring-up. HybridEP is usually modestly faster than alltoall on this workload and often has noticeably better memory headroom.
This family is a good reminder that dispatcher wins are workload-dependent:
alltoall and HybridEP can be close.

DeepEP is selected by setting `moe_token_dispatcher_type="flex"` and `moe_flex_dispatcher_backend="deepep"`.
`--moe-deepep-num-sms 20`

Tune the SM count allocated to DeepEP communication kernels (default 20). The optimal value depends on the workload and EP degree.
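Put together, a DeepEP configuration can be sketched as a plain dict. The keys are exactly the settings named above; the dict container itself is a stand-in for whatever config object the training stack actually uses.

```python
# DeepEP dispatcher settings from this section; the dict is illustrative,
# not a real Megatron-style config object.
deepep_config = {
    "moe_token_dispatcher_type": "flex",
    "moe_flex_dispatcher_backend": "deepep",
    "moe_deepep_num_sms": 20,  # default; sweep per workload and EP degree
}
```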
HybridEP is selected by setting `moe_token_dispatcher_type="flex"` and `moe_flex_dispatcher_backend="hybridep"`.
`--moe-hybridep-num-sms 16`

Tune the SM count allocated to HybridEP communication (default 16). The performance harness uses 32 for HybridEP workloads. Sweep between 16 and 32 for the target hardware. Set `NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN` to match the NVLink domain size of the deployment. If it does not match the actual topology, performance and sometimes correctness will suffer.
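The HybridEP settings above can be sketched the same way. As before, the dict container is illustrative; the NVLink-domain value shown (72, one full NVL72 domain) is an example and must be replaced with the deployment's real topology.

```python
import os

# HybridEP dispatcher settings from this section; dict is illustrative.
hybridep_config = {
    "moe_token_dispatcher_type": "flex",
    "moe_flex_dispatcher_backend": "hybridep",
    "moe_hybridep_num_sms": 32,  # default is 16; harness uses 32, sweep 16-32
}

# Must match the actual NVLink domain size, or performance (and sometimes
# correctness) suffers. 72 here is illustrative, not a recommendation.
os.environ["NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN"] = "72"
```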
`--moe-router-force-load-balancing`
For performance benchmarking, force-balance routing is the safer default. It usually outperforms dropless routing in large-scale benchmarks and makes results more comparable across dispatcher backends.
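A dispatcher A/B sweep that follows this advice might look like the sketch below: everything except the backend is held fixed, including the routing mode. `run_benchmark` is a hypothetical harness entry point, not a real API.

```python
# Keep all settings except the backend fixed, including routing mode,
# so dispatcher results stay comparable across backends.
base_config = {
    "moe_token_dispatcher_type": "flex",
    "moe_router_force_load_balancing": True,  # same routing mode for all runs
}

configs = []
for backend in ("deepep", "hybridep"):
    cfg = {**base_config, "moe_flex_dispatcher_backend": backend}
    configs.append(cfg)
    # run_benchmark(cfg)  # hypothetical harness entry point
```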
| Feature | Interaction |
|---|---|
| CUDA graphs | Best paired with the `attn`, `moe_router`, and `moe_preprocess` scopes on dropless MoE |
| EP overlap | Helps when dispatcher time is still visible after backend tuning |
| FP8 | Often increases the relative importance of communication and host overhead |
| CPU affinity | Can matter as much as dispatcher choice on GB200 or GB300 |
| Pipeline layout | Poor PP or VPP layout can erase dispatcher gains |
Do not compare dispatchers on different stacks: container, routing mode, PP layout, and CUDA-graph scope can move the result as much as the dispatcher.
HybridEP is topology-sensitive: it is not a universal win outside the hardware it was designed for.
Both dispatchers need SM tuning: the defaults for `moe_deepep_num_sms` (20) and `moe_hybridep_num_sms` (16) are reasonable starting points but rarely optimal.
Force-balance and dropless are not interchangeable baselines: keep the routing mode fixed when comparing dispatcher backends.
Memory and throughput can trade off differently by model: Qwen3-style runs may show a smaller speed delta than DSV3, but still justify HybridEP for memory headroom.