Use when quantizing a diffusion DiT with NVIDIA ModelOpt and making the resulting FP8 or NVFP4 checkpoint loadable, verifiable, and benchmarkable in SGLang Diffusion.
Use this skill when the task is to take a diffusion transformer through the full ModelOpt workflow:
This skill owns the ModelOpt-to-SGLang bridge. It is not a generic kernel-tuning skill.
- Treat `quantize.py` as the PTQ source of truth.
- Run ModelOpt FP8 checkpoints with `dit_cpu_offload=false`. `dit_layerwise_offload=true` is valid on the fixed path when you want lower DiT residency.
- `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn`
- Use `python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py`, `python/sglang/multimodal_gen/tools/build_modelopt_nvfp4_transformer.py`, and `python/sglang/multimodal_gen/tools/compare_diffusion_trajectory_similarity.py` instead of inventing one-off scripts elsewhere.
- Update `docs/diffusion/quantization.md` before closing the task.

Read these sources before changing code:
- `examples/diffusers/README.md`
- `examples/diffusers/quantization/quantize.py`
- `examples/diffusers/quantization/config.py`
- `python/sglang/multimodal_gen/runtime/layers/quantization/modelopt_quant.py`
- `python/sglang/multimodal_gen/runtime/utils/quantization_utils.py`
- `python/sglang/multimodal_gen/runtime/loader/transformer_load_utils.py`

If you are working on a new model family, inspect the transformer's config and tensor naming before changing the generic converter.
This repo now contains:
- `quant_method=modelopt` plus `quant_algo=FP8/NVFP4` resolution
- `python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py`
- `python/sglang/multimodal_gen/tools/build_modelopt_nvfp4_transformer.py`
- `python/sglang/multimodal_gen/tools/compare_diffusion_trajectory_similarity.py`

Validated documentation and CI coverage currently center on six ModelOpt diffusion transformer override families:
Treat a new family, a new precision, or a new checkpoint layout as unsupported until it has a documented matrix row and a matching validation story.
Before writing CLI examples, re-read the active branch's `docs/diffusion/quantization.md`: FLUX.2 NVFP4 is an official `black-forest-labs/*` repo rather than a `BBuf/*` converted repo, and its preferred flag depends on the current documented loader flow. Use `--transformer-path` for a component override directory with `config.json`; use `--transformer-weights-path` when the repo or path should be probed as raw weights.
B200 CI coverage can include loose BF16-vs-quantized quality smoke checks. Inspect the active branch's run_suite.py before assuming they are part of the suite; mainline and feature branches may differ. Those checks are intended to catch blank, corrupted, or obviously divergent images, not exact image parity.
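The spirit of those smoke checks can be sketched in a few lines: flag an output whose pixel variance is near zero (blank or constant) or whose mean diverges wildly from the BF16 reference. This is a hypothetical illustration only, not the actual CI check; the function and threshold names below are made up.

```python
# Hypothetical sketch of a loose quality smoke check: it catches blank or
# obviously divergent images, not exact image parity. Thresholds are
# illustrative, not the values used in CI.
from statistics import mean, pvariance

def smoke_check(candidate: list[float], reference: list[float],
                min_variance: float = 1e-4, max_mean_drift: float = 0.25) -> bool:
    """Return True when the candidate image passes the loose check."""
    if pvariance(candidate) < min_variance:  # blank or constant output
        return False
    if abs(mean(candidate) - mean(reference)) > max_mean_drift:
        return False                         # obviously divergent output
    return True

blank = [0.0] * 64
reference = [(i % 16) / 16 for i in range(64)]
print(smoke_check(blank, reference))      # a blank image fails
print(smoke_check(reference, reference))  # the reference trivially passes
```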
- `docs/diffusion/quantization.md`
- Mark a checkpoint `unpublished` explicitly instead of leaving the field blank.

FP8 and NVFP4 are not wired into SGLang in exactly the same way.
FP8:
- `weight_scale` and `input_scale`
- `float8_e4m3fn` weights from `backbone.pt`

NVFP4:
Important caveat:
Before quantizing anything:
- `perf.json`

Do not start quantization work until the BF16 path is already healthy.
Use ModelOpt's official script. Generic template:
```bash
python quantize.py \
  --model <model-name> \
  --override-model-path <hf-repo-or-local-model> \
  --model-dtype <Half|BFloat16> \
  --format <fp8|fp4> \
  --batch-size 1 \
  --calib-size <calib-size> \
  --n-steps <calib-steps> \
  --quantize-mha \
  --prompts-file <prompt-file> \
  --quantized-torch-ckpt-save-path <out>/ckpt \
  --hf-ckpt-dir <out>/hf
```
For current ModelOpt diffusion examples, use `--format fp4` for NVFP4 exports.
Do not assume the checked-out ModelOpt version accepts a literal `nvfp4` format string unless you verified it locally.
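One reason the literal string can fail: if the script validates `--format` with argparse `choices`, anything outside the declared set exits immediately. The parser below is a stand-in mirroring the documented `fp8|fp4` choices, not ModelOpt's actual code; verify against the checked-out `quantize.py`.

```python
# Stand-in parser mirroring the documented <fp8|fp4> choices. ModelOpt's
# real quantize.py may differ, so this only illustrates the failure mode.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--format", choices=["fp8", "fp4"], required=True)

print(parser.parse_args(["--format", "fp4"]).format)  # accepted

try:
    parser.parse_args(["--format", "nvfp4"])          # not in choices
except SystemExit:
    print("nvfp4 rejected by this parser")
```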
For multi-transformer models:
- `backbone.pt` and the matching `hf/<component>` export

FP8 requires an extra conversion step:
```bash
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.build_modelopt_fp8_transformer \
  --modelopt-hf-dir <out>/hf \
  --modelopt-backbone-ckpt <out>/ckpt/backbone.pt \
  --base-transformer-dir <base-model-transformer-dir> \
  --output-dir <out>/sglang_transformer \
  --overwrite
```
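The scale arithmetic at the heart of that conversion is small: an absolute-max calibration value becomes a per-tensor scale by dividing by the `float8_e4m3fn` finite max (448). This is a minimal sketch assuming that standard convention; the converter's actual handling lives in `build_modelopt_fp8_transformer.py`.

```python
# Minimal sketch of amax -> FP8 scale, assuming the usual convention
# scale = amax / FP8_E4M3_MAX so that (weight / scale) fits the FP8 range.
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def amax_to_scale(amax: float) -> float:
    return amax / FP8_E4M3_MAX

# An amax equal to the FP8 max maps to a unit scale.
print(amax_to_scale(448.0))  # 1.0
# e.g. a weight_quantizer._amax of 2.24 becomes a weight_scale of 0.005
print(amax_to_scale(2.24))
```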
What the converter does:
- Extracts `weight_quantizer._amax` and `input_quantizer._amax` from `backbone.pt`
- Converts them into `weight_scale` and `input_scale`
- Casts quantized weights to `float8_e4m3fn`
- Keeps `ignore` layers as BF16
- Drops `_quantizer.*` tensors and fallback-layer scales that should not survive into the SGLang-native checkpoint

For FLUX.1-dev, the validated fallback set currently keeps these modules in BF16:
- `transformer_blocks.*.norm1.linear`
- `transformer_blocks.*.norm1_context.linear`
- `transformer_blocks.*.ff.net.0.proj`
- `transformer_blocks.*.ff.net.2`
- `transformer_blocks.*.ff_context.net.0.proj`
- `transformer_blocks.*.ff_context.net.2`
- `single_transformer_blocks.*.norm.linear`
- `single_transformer_blocks.*.proj_mlp`

Use `--model-type flux1` to force that profile, or rely on `--model-type auto` when the export config identifies `FluxTransformer2DModel`.
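The fallback set behaves like glob patterns over module names. A hypothetical sketch of how a converter could decide which modules stay BF16 (the real selection logic lives in the builder tools; only the patterns themselves come from this document):

```python
# Hypothetical BF16-fallback selection using glob-style module patterns;
# the patterns are the validated FLUX.1-dev set listed above.
from fnmatch import fnmatch

FLUX1_BF16_FALLBACK = [
    "transformer_blocks.*.norm1.linear",
    "transformer_blocks.*.norm1_context.linear",
    "transformer_blocks.*.ff.net.0.proj",
    "transformer_blocks.*.ff.net.2",
    "transformer_blocks.*.ff_context.net.0.proj",
    "transformer_blocks.*.ff_context.net.2",
    "single_transformer_blocks.*.norm.linear",
    "single_transformer_blocks.*.proj_mlp",
]

def keep_bf16(module_name: str) -> bool:
    """Return True when a module should stay BF16 instead of being quantized."""
    return any(fnmatch(module_name, pattern) for pattern in FLUX1_BF16_FALLBACK)

print(keep_bf16("transformer_blocks.3.ff.net.2"))   # stays BF16
print(keep_bf16("transformer_blocks.3.attn.to_q"))  # gets quantized
```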
For FLUX.1-dev NVFP4 model families that need a mixed BF16+NVFP4 checkpoint, build the merged transformer explicitly:
```bash
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.build_modelopt_nvfp4_transformer \
  --base-transformer-dir <base-model-transformer-dir> \
  --modelopt-hf-dir <out>/hf/transformer \
  --output-dir <out>/transformer-mixed \
  --pattern-preset flux1-nvfp4
```
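The `swap_weight_nibbles` setting concerns how two 4-bit codes are packed into each byte. A hypothetical sketch of that packing, to show what swapping the nibble order changes; the loader's actual layout is defined by the runtime, and this helper name is made up:

```python
# Hypothetical packing of two 4-bit NVFP4 codes per byte. `swap` flips
# which code lands in the low vs high nibble, the kind of layout choice
# a swap_weight_nibbles flag controls.
def pack_nibbles(codes: list[int], swap: bool = False) -> bytes:
    assert len(codes) % 2 == 0 and all(0 <= c < 16 for c in codes)
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        if swap:
            lo, hi = hi, lo
        out.append((hi << 4) | lo)
    return bytes(out)

codes = [0x1, 0x2, 0x3, 0x4]
print(pack_nibbles(codes).hex())             # "2143"
print(pack_nibbles(codes, swap=True).hex())  # "1234"
```

A checkpoint written with one nibble order and read with the other decodes every weight pair wrong, which is why the flag must match the validated export.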
The validated FLUX.1-dev mixed builder also needs to preserve:
- `quant_type: NVFP4` in `config.json`
- `swap_weight_nibbles: false` for the validated diffusers export

Single-transformer example:
```bash
sglang generate \
  --model-path <base-model> \
  --transformer-path <quantized-transformer> \
  --prompt "<prompt>" \
  --seed <seed> \
  --save-output
```
Multi-transformer example:
```bash
sglang generate \
  --model-path <base-model> \
  --transformer-path <quantized-transformer> \
  --transformer-2-path <another-transformer-or-bf16-override> \
  --prompt "<prompt>" \
  --seed <seed> \
  --save-output
```
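Both override spellings ultimately feed one component-to-path map. A hypothetical sketch of that resolution (the real parsing lives in the runtime CLI; the helper and key names here are illustrative only):

```python
# Hypothetical resolution of transformer override flags into one
# component -> path map; helper and key names are illustrative only.
def resolve_overrides(args: dict) -> dict:
    overrides = {}
    if args.get("transformer_path"):
        overrides["transformer"] = args["transformer_path"]
    if args.get("transformer_2_path"):
        overrides["transformer_2"] = args["transformer_2_path"]
    # dotted --component_paths.<name>=<path> entries land in the same map
    for component, path in args.get("component_paths", {}).items():
        overrides[component] = path
    return overrides

print(resolve_overrides({
    "transformer_path": "/ckpts/fp8-transformer",
    "component_paths": {"transformer_2": "/ckpts/bf16-transformer-2"},
}))
```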
Guideline:
- Use `--transformer-path` only when the model effectively has one transformer override to apply
- `--<component>-path` flags and `--component_paths.transformer_2=...` also resolve to the same internal override map

Use two levels of validation.
Reduced deterministic validation:
Tool:
```bash
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.compare_diffusion_trajectory_similarity \
  --model-path <base-model> \
  --model-id <optional-native-model-id> \
  --prompt "<prompt>" \
  --width <w> \
  --height <h> \
  --num-inference-steps <steps> \
  --guidance-scale <cfg> \
  --seed <seed> \
  --candidate-transformer-path <quantized-transformer> \
  --output-json <report.json>
```
Use `--model-id FLUX.1-dev` when `--model-path` points to a local directory but the runtime still needs the native FLUX.1 model registration.
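Trajectory similarity of this kind is commonly computed as per-step cosine similarity between flattened BF16 and candidate latents. The sketch below is a generic illustration of that metric, not the tool's exact definition; the sample latents are made up.

```python
# Generic per-step cosine similarity between two latent trajectories;
# the actual metric is defined in compare_diffusion_trajectory_similarity.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

reference = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # BF16 latents per step
candidate = [[0.1, 0.2, 0.31], [0.39, 0.5, 0.6]]  # quantized latents per step
per_step = [cosine(r, c) for r, c in zip(reference, candidate)]
print(min(per_step))  # the worst step drives the pass/fail decision
```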
Full-output validation:
Benchmark only when these match between BF16 and quantized:
Only the quantized checkpoint path should differ.
Interpretation rule:
If the generic FP8 path fails on a new model family:
Do not turn one validated model quirk into a generic rule unless another family also needs it.
Current diffusion ModelOpt FP8 support requires:
- `dit_cpu_offload=false`
- `dit_layerwise_offload` may be enabled when you want lower DiT residency

Reason:
- `dit_cpu_offload` is still treated conservatively

Runtime behavior:
- The loader guards against `dit_cpu_offload` when it detects `modelopt_fp8`

When documenting results:
| File | Role |
|---|---|
runtime/layers/quantization/__init__.py | registers diffusion quant methods |
runtime/layers/quantization/modelopt_quant.py | ModelOpt FP8 and NVFP4 runtime loading |
runtime/utils/quantization_utils.py | resolves flat ModelOpt configs and reconstructs NVFP4 config from metadata |
runtime/loader/transformer_load_utils.py | guards incompatible FP8 offload modes |
| runtime/models/dits/flux_2.py | packed-QKV handling for the FLUX.2 NVFP4 family |
| tools/build_modelopt_fp8_transformer.py | builds an SGLang-loadable FP8 transformer from a ModelOpt export |
| tools/build_modelopt_nvfp4_transformer.py | builds mixed BF16+NVFP4 transformer directories when a family needs preserved BF16 layers |
tools/compare_diffusion_trajectory_similarity.py | reduced deterministic BF16-vs-quantized validation |
docs/diffusion/quantization.md | public ModelOpt support matrix and CLI examples |
test/server/testcase_configs.py | reusable ModelOpt testcase constants, thresholds, and helpers |
test/server/gpu_cases.py | concrete GPU and B200 ModelOpt CI case lists |