Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.
Use this skill when measuring denoise performance, finding the slow op, checking whether an existing fast path can solve it, or verifying that a hotspot is real before any kernel work in sglang.multimodal_gen.
This skill is diagnosis-first. It owns:
- torch.profiler trace capture and quick hotspot ranking

This skill does not own low-level kernel authoring or standalone Nsight workflows.
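The capture-and-rank step can be sketched as follows. This is a minimal illustration, not one of the skill's checked-in scripts: `step_fn` is a stand-in for whatever callable drives one denoise step.

```python
import torch
from torch.profiler import ProfilerActivity, profile

def rank_hotspots(step_fn, top_k: int = 10) -> str:
    """Profile one call of step_fn and return a table of the top ops.

    On a GPU run you would sort by self CUDA time instead; this sketch
    sorts by self CPU time so it runs anywhere.
    """
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True) as prof:
        step_fn()
    # key_averages() aggregates events by op name; table() gives a quick ranking
    return prof.key_averages().table(sort_by="self_cpu_time_total",
                                     row_limit=top_k)
```

The returned table is the "quick hotspot ranking": the top rows name the ops to investigate before any kernel work.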
Before running any benchmark, profiler, or kernel-validation command:
- Run `scripts/diffusion_skill_env.py` to derive the repo root from `sglang.__file__`.
- Set `HF_TOKEN` before using gated Hugging Face models such as `black-forest-labs/FLUX.*`.
- Set `FLASHINFER_DISABLE_VERSION_CHECK=1`.

All diffusion benchmark and profiling results owned by this skill must come from the native SGLang diffusion backend.
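A sketch of what that preflight amounts to. The real logic lives in `scripts/diffusion_skill_env.py`; the directory layout assumed here (package two levels below the repo root) is a guess, as is the helper's shape.

```python
import importlib.util
import os
from pathlib import Path

def derive_repo_root(module: str = "sglang") -> Path:
    """Derive a repo root from an importable module's location.

    Assumes an editable-install layout like <repo>/python/<module>/__init__.py;
    adjust parents[...] for your checkout.
    """
    spec = importlib.util.find_spec(module)
    if spec is None or not spec.origin:
        raise RuntimeError(f"{module} is not importable; activate the repo environment")
    return Path(spec.origin).resolve().parents[2]

def check_env() -> None:
    """Warn about missing HF_TOKEN and pin the FlashInfer version-check flag."""
    if "HF_TOKEN" not in os.environ:
        print("warning: HF_TOKEN unset; gated models such as "
              "black-forest-labs/FLUX.* will fail to download")
    os.environ.setdefault("FLASHINFER_DISABLE_VERSION_CHECK", "1")
```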
Treat any of the following as a hard stop condition:
- "Falling back to diffusers backend"
- "Using diffusers backend"
- "Loaded diffusers pipeline"

If any benchmark, perf-dump, or torch.profiler command prints one of those signals, stop: the run did not use the native SGLang backend, so its numbers do not count.
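Checking captured command output for those signals can be as simple as the sketch below; the signal strings are exactly the three listed above.

```python
DIFFUSERS_SIGNALS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)

def scan_for_fallback(log_text: str):
    """Return the first hard-stop signal found in a benchmark/profiler log,
    or None if the run stayed on the native SGLang backend."""
    for line in log_text.splitlines():
        for signal in DIFFUSERS_SIGNALS:
            if signal in line:
                return signal
    return None
```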
- torch.profiler workflow; uses the checked-in nightly-aligned presets, plus LTX-2, LTX-2.3 one-stage, and LTX-2.3 two-stage benchmark recipes
- QK norm + RoPE, and distributed overlap patterns before proposing new code
- sglang.__file__, write-access probe, benchmark/profile output directories, idle GPU selection
- sglang generate; pins --backend=sglang, supports --no-torch-compile, and saves perf dumps by label for compare_perf.py

Before calling a diffusion hotspot "new", first classify it with existing-fast-paths.md.
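A toy version of the labeled before/after comparison. The real tool is compare_perf.py; the flat `{op_name: milliseconds}` layout here is purely an assumption for illustration.

```python
def compare_perf(before: dict, after: dict, top_k: int = 5):
    """Rank per-op time deltas between two labeled perf dumps.

    A positive delta means the op got slower in the "after" dump.
    Ops missing from one dump are treated as 0 ms there.
    """
    deltas = {op: after.get(op, 0.0) - before.get(op, 0.0)
              for op in set(before) | set(after)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```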
Always rule out these existing families first:
- QK norm + RoPE
- torch.compile compute / communication reorder

If the user explicitly requires torch.compile to stay off, do not use the default benchmark preset invocation unchanged. Either pass the checked-in benchmark helper its no-compile switch or run the equivalent manual command without `--enable-torch-compile`.
For FLUX-family manual profiling runs with a quantized transformer override:
- Call `sglang generate` directly.
- Pass `--transformer-path <dir>`.
- Use `--prompt-path <file>` when also fixing `--output-file-name`.
- Use `--model-path` plus `HF_HUB_OFFLINE=1`.
- `--profile` changes latency substantially; use the non-profile perf dump for the real before/after benchmark claim.
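The flag combination above can be assembled like this. The model name is one concrete example from the gated FLUX family, and both paths are placeholders you must supply; the helper itself is illustrative, not part of the skill.

```python
import os
import shlex

def build_flux_cmd(transformer_dir: str, prompt_file: str, out_name: str) -> str:
    """Assemble a manual FLUX profiling invocation from the documented flags."""
    os.environ["HF_HUB_OFFLINE"] = "1"  # serve weights from the local HF cache
    argv = [
        "sglang", "generate",
        "--backend=sglang",
        "--model-path", "black-forest-labs/FLUX.1-dev",  # example gated model
        "--transformer-path", transformer_dir,
        "--prompt-path", prompt_file,
        "--output-file-name", out_name,
    ]
    return shlex.join(argv)
```

Run the same command twice, with and without `--profile`, and keep only the non-profile run's perf dump for the benchmark claim.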