Use when quantizing a diffusion DiT with NVIDIA ModelOpt and making the resulting FP8 or NVFP4 checkpoint loadable, verifiable, and benchmarkable in SGLang Diffusion.
Use this skill when the task is to take a diffusion transformer through the full ModelOpt workflow:
This skill owns the ModelOpt-to-SGLang bridge. It is not a generic kernel-tuning skill.
- Treat `quantize.py` as the PTQ source of truth.
- Run ModelOpt FP8 checkpoints with `dit_cpu_offload=false`. `dit_layerwise_offload=true` is valid on the fixed path when you want lower DiT residency.
- `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn`
- Use `python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py`, `python/sglang/multimodal_gen/tools/build_modelopt_nvfp4_transformer.py`, and `python/sglang/multimodal_gen/tools/compare_diffusion_trajectory_similarity.py` instead of inventing one-off scripts elsewhere.
- Update `docs/diffusion/quantization.md` before closing the task.

Read these sources before changing code:
- `examples/diffusers/README.md`
- `examples/diffusers/quantization/quantize.py`
- `examples/diffusers/quantization/config.py`
- `python/sglang/multimodal_gen/runtime/layers/quantization/modelopt_quant.py`
- `python/sglang/multimodal_gen/runtime/utils/quantization_utils.py`
- `python/sglang/multimodal_gen/runtime/loader/transformer_load_utils.py`

If you are working on a new model family, inspect the transformer's config and tensor naming before changing the generic converter.
This repo now contains:
- `quant_method=modelopt` plus `quant_algo=FP8/NVFP4` resolution
- `python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py`
- `python/sglang/multimodal_gen/tools/build_modelopt_nvfp4_transformer.py`
- `python/sglang/multimodal_gen/tools/compare_diffusion_trajectory_similarity.py`

Validated documentation and CI coverage currently center on six ModelOpt diffusion transformer override families:
Treat a new family, a new precision, or a new checkpoint layout as unsupported until it has a documented matrix row and a matching validation story.
Before writing CLI examples, re-read the active branch's `docs/diffusion/quantization.md`: FLUX.2 NVFP4 is an official `black-forest-labs/*` repo rather than a `BBuf/*` converted repo, and its preferred flag depends on the current documented loader flow. Use `--transformer-path` for a component override directory with `config.json`; use `--transformer-weights-path` when the repo or path should be probed as raw weights.
B200 CI coverage can include loose BF16-vs-quantized quality smoke checks. Inspect the active branch's run_suite.py before assuming they are part of the suite; mainline and feature branches may differ. Those checks are intended to catch blank, corrupted, or obviously divergent images, not exact image parity.
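The spirit of those smoke checks can be sketched in a few lines: flag an output whose pixel variance is near zero (blank or constant) or whose mean diverges wildly from the BF16 reference. This is a hypothetical illustration only, not the actual CI check; the function and threshold names below are made up.

```python
# Hypothetical sketch of a loose quality smoke check: it catches blank or
# obviously divergent images, not exact image parity. Thresholds are
# illustrative, not the values used in CI.
from statistics import mean, pvariance

def smoke_check(candidate: list[float], reference: list[float],
                min_variance: float = 1e-4, max_mean_drift: float = 0.25) -> bool:
    """Return True when the candidate image passes the loose check."""
    if pvariance(candidate) < min_variance:  # blank or constant output
        return False
    if abs(mean(candidate) - mean(reference)) > max_mean_drift:
        return False                         # obviously divergent output
    return True

blank = [0.0] * 64
reference = [(i % 16) / 16 for i in range(64)]
print(smoke_check(blank, reference))      # a blank image fails
print(smoke_check(reference, reference))  # the reference trivially passes
```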
- `docs/diffusion/quantization.md`
- Mark a checkpoint `unpublished` explicitly instead of leaving the field blank.

FP8 and NVFP4 are not wired into SGLang in exactly the same way.
FP8:
- `weight_scale` and `input_scale`
- `float8_e4m3fn` weights from `backbone.pt`

NVFP4:
Important caveat:
Before quantizing anything:
- `perf.json`

Do not start quantization work until the BF16 path is already healthy.
Use ModelOpt's official script. Generic template:
```bash
python quantize.py \
  --model <model-name> \
  --override-model-path <hf-repo-or-local-model> \
  --model-dtype <Half|BFloat16> \
  --format <fp8|fp4> \
  --batch-size 1 \
  --calib-size <calib-size> \
  --n-steps <calib-steps> \
  --quantize-mha \
  --prompts-file <prompt-file> \
  --quantized-torch-ckpt-save-path <out>/ckpt \
  --hf-ckpt-dir <out>/hf
```
For current ModelOpt diffusion examples, use `--format fp4` for NVFP4 exports.
Do not assume the checked-out ModelOpt version accepts a literal `nvfp4` format string unless you verified it locally.
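One reason the literal string can fail: if the script validates `--format` with argparse `choices`, anything outside the declared set exits immediately. The parser below is a stand-in mirroring the documented `fp8|fp4` choices, not ModelOpt's actual code; verify against the checked-out `quantize.py`.

```python
# Stand-in parser mirroring the documented <fp8|fp4> choices. ModelOpt's
# real quantize.py may differ, so this only illustrates the failure mode.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--format", choices=["fp8", "fp4"], required=True)

print(parser.parse_args(["--format", "fp4"]).format)  # accepted

try:
    parser.parse_args(["--format", "nvfp4"])          # not in choices
except SystemExit:
    print("nvfp4 rejected by this parser")
```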
For multi-transformer models:
- `backbone.pt` and the matching `hf/<component>` export

FP8 requires an extra conversion step:
```bash
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.build_modelopt_fp8_transformer \
  --modelopt-hf-dir <out>/hf \
  --modelopt-backbone-ckpt <out>/ckpt/backbone.pt \
  --base-transformer-dir <base-model-transformer-dir> \
  --output-dir <out>/sglang_transformer \
  --overwrite
```
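The scale arithmetic at the heart of that conversion is small: an absolute-max calibration value becomes a per-tensor scale by dividing by the `float8_e4m3fn` finite max (448). This is a minimal sketch assuming that standard convention; the converter's actual handling lives in `build_modelopt_fp8_transformer.py`.

```python
# Minimal sketch of amax -> FP8 scale, assuming the usual convention
# scale = amax / FP8_E4M3_MAX so that (weight / scale) fits the FP8 range.
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def amax_to_scale(amax: float) -> float:
    return amax / FP8_E4M3_MAX

# An amax equal to the FP8 max maps to a unit scale.
print(amax_to_scale(448.0))  # 1.0
# e.g. a weight_quantizer._amax of 2.24 becomes a weight_scale of 0.005
print(amax_to_scale(2.24))
```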
What the converter does:
- Extracts `weight_quantizer._amax` and `input_quantizer._amax` from `backbone.pt`
- Converts them into `weight_scale` and `input_scale`
- Casts quantized weights to `float8_e4m3fn`
- Keeps `ignore` layers as BF16
- Drops `_quantizer.*` tensors and fallback-layer scales that should not survive into the SGLang-native checkpoint

For FLUX.1-dev, the validated fallback set currently keeps these modules in BF16:
- `transformer_blocks.*.norm1.linear`
- `transformer_blocks.*.norm1_context.linear`
- `transformer_blocks.*.ff.net.0.proj`
- `transformer_blocks.*.ff.net.2`
- `transformer_blocks.*.ff_context.net.0.proj`
- `transformer_blocks.*.ff_context.net.2`
- `single_transformer_blocks.*.norm.linear`
- `single_transformer_blocks.*.proj_mlp`

Use `--model-type flux1` to force that profile, or rely on `--model-type auto` when the export config identifies `FluxTransformer2DModel`.
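The fallback set behaves like glob patterns over module names. A hypothetical sketch of how a converter could decide which modules stay BF16 (the real selection logic lives in the builder tools; only the patterns themselves come from this document):

```python
# Hypothetical BF16-fallback selection using glob-style module patterns;
# the patterns are the validated FLUX.1-dev set listed above.
from fnmatch import fnmatch

FLUX1_BF16_FALLBACK = [
    "transformer_blocks.*.norm1.linear",
    "transformer_blocks.*.norm1_context.linear",
    "transformer_blocks.*.ff.net.0.proj",
    "transformer_blocks.*.ff.net.2",
    "transformer_blocks.*.ff_context.net.0.proj",
    "transformer_blocks.*.ff_context.net.2",
    "single_transformer_blocks.*.norm.linear",
    "single_transformer_blocks.*.proj_mlp",
]

def keep_bf16(module_name: str) -> bool:
    """Return True when a module should stay BF16 instead of being quantized."""
    return any(fnmatch(module_name, pattern) for pattern in FLUX1_BF16_FALLBACK)

print(keep_bf16("transformer_blocks.3.ff.net.2"))   # stays BF16
print(keep_bf16("transformer_blocks.3.attn.to_q"))  # gets quantized
```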
For FLUX.1-dev NVFP4 model families that need a mixed BF16+NVFP4 checkpoint, build the merged transformer explicitly:
```bash
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.build_modelopt_nvfp4_transformer \
  --base-transformer-dir <base-model-transformer-dir> \
  --modelopt-hf-dir <out>/hf/transformer \
  --output-dir <out>/transformer-mixed \
  --pattern-preset flux1-nvfp4
```
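The `swap_weight_nibbles` setting concerns how two 4-bit codes are packed into each byte. A hypothetical sketch of that packing, to show what swapping the nibble order changes; the loader's actual layout is defined by the runtime, and this helper name is made up:

```python
# Hypothetical packing of two 4-bit NVFP4 codes per byte. `swap` flips
# which code lands in the low vs high nibble, the kind of layout choice
# a swap_weight_nibbles flag controls.
def pack_nibbles(codes: list[int], swap: bool = False) -> bytes:
    assert len(codes) % 2 == 0 and all(0 <= c < 16 for c in codes)
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        if swap:
            lo, hi = hi, lo
        out.append((hi << 4) | lo)
    return bytes(out)

codes = [0x1, 0x2, 0x3, 0x4]
print(pack_nibbles(codes).hex())             # "2143"
print(pack_nibbles(codes, swap=True).hex())  # "1234"
```

A checkpoint written with one nibble order and read with the other decodes every weight pair wrong, which is why the flag must match the validated export.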
The validated FLUX.1-dev mixed builder also needs to preserve:
- `quant_type: NVFP4` in `config.json`
- `swap_weight_nibbles: false` for the validated diffusers export

Single-transformer example:
```bash
sglang generate \
  --model-path <base-model> \
  --transformer-path <quantized-transformer> \
  --prompt "<prompt>" \
  --seed <seed> \
  --save-output
```
Multi-transformer example:
```bash
sglang generate \
  --model-path <base-model> \
  --transformer-path <quantized-transformer> \
  --transformer-2-path <another-transformer-or-bf16-override> \
  --prompt "<prompt>" \
  --seed <seed> \
  --save-output
```
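Both override spellings ultimately feed one component-to-path map. A hypothetical sketch of that resolution (the real parsing lives in the runtime CLI; the helper and key names here are illustrative only):

```python
# Hypothetical resolution of transformer override flags into one
# component -> path map; helper and key names are illustrative only.
def resolve_overrides(args: dict) -> dict:
    overrides = {}
    if args.get("transformer_path"):
        overrides["transformer"] = args["transformer_path"]
    if args.get("transformer_2_path"):
        overrides["transformer_2"] = args["transformer_2_path"]
    # dotted --component_paths.<name>=<path> entries land in the same map
    for component, path in args.get("component_paths", {}).items():
        overrides[component] = path
    return overrides

print(resolve_overrides({
    "transformer_path": "/ckpts/fp8-transformer",
    "component_paths": {"transformer_2": "/ckpts/bf16-transformer-2"},
}))
```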
Guideline:
- Use `--transformer-path` only when the model effectively has one transformer override to apply
- `--<component>-path` flags and `--component_paths.transformer_2=...` also resolve to the same internal override map

Use two levels of validation.
Reduced deterministic validation:
Tool:
```bash
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.compare_diffusion_trajectory_similarity \
  --model-path <base-model> \
  --model-id <optional-native-model-id> \
  --prompt "<prompt>" \
  --width <w> \
  --height <h> \
  --num-inference-steps <steps> \
  --guidance-scale <cfg> \
  --seed <seed> \
  --candidate-transformer-path <quantized-transformer> \
  --output-json <report.json>
```
Use `--model-id FLUX.1-dev` when `--model-path` points to a local directory but the runtime still needs the native FLUX.1 model registration.
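Trajectory similarity of this kind is commonly computed as per-step cosine similarity between flattened BF16 and candidate latents. The sketch below is a generic illustration of that metric, not the tool's exact definition; the sample latents are made up.

```python
# Generic per-step cosine similarity between two latent trajectories;
# the actual metric is defined in compare_diffusion_trajectory_similarity.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

reference = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # BF16 latents per step
candidate = [[0.1, 0.2, 0.31], [0.39, 0.5, 0.6]]  # quantized latents per step
per_step = [cosine(r, c) for r, c in zip(reference, candidate)]
print(min(per_step))  # the worst step drives the pass/fail decision
```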
Full-output validation:
Benchmark only when these match between BF16 and quantized:
Only the quantized checkpoint path should differ.
Interpretation rule:
If the generic FP8 path fails on a new model family:
Do not turn one validated model quirk into a generic rule unless another family also needs it.
Current diffusion ModelOpt FP8 support requires:
- `dit_cpu_offload=false`
- `dit_layerwise_offload` may be enabled when you want lower DiT residency

Reason:
- `dit_cpu_offload` is still treated conservatively

Runtime behavior:
- The loader guards against `dit_cpu_offload` when it detects `modelopt_fp8`

When documenting results:
| File | Role |
|---|---|
runtime/layers/quantization/__init__.py | registers diffusion quant methods |
runtime/layers/quantization/modelopt_quant.py | ModelOpt FP8 and NVFP4 runtime loading |
runtime/utils/quantization_utils.py | resolves flat ModelOpt configs and reconstructs NVFP4 config from metadata |
runtime/loader/transformer_load_utils.py | guards incompatible FP8 offload modes |
| runtime/models/dits/flux_2.py | packed-QKV handling for the FLUX.2 NVFP4 family |
| tools/build_modelopt_fp8_transformer.py | builds an SGLang-loadable FP8 transformer from a ModelOpt export |
| tools/build_modelopt_nvfp4_transformer.py | builds mixed BF16+NVFP4 transformer directories when a family needs preserved BF16 layers |
tools/compare_diffusion_trajectory_similarity.py | reduced deterministic BF16-vs-quantized validation |
docs/diffusion/quantization.md | public ModelOpt support matrix and CLI examples |
test/server/testcase_configs.py | reusable ModelOpt testcase constants, thresholds, and helpers |
test/server/gpu_cases.py | concrete GPU and B200 ModelOpt CI case lists |