Compact SGLang torch-profiler triage skill. Use when Codex should inspect an existing `trace.json(.gz)` or profile directory, trigger `sglang.profiler` against a live server, and return one compact report with kernel, overlap-opportunity, and fuse-pattern tables. Single-trace triage is enough for quick diagnosis; mapping+formal two-trace triage gives stronger overlap conclusions.
Use this skill for SGLang torch.profiler analysis.
There is only one public workflow: `triage`. Use the unified entrypoint:
`triage` always prints the same three tables:

- kernel table
- overlap-opportunity table
- fuse-pattern table
By default, all three tables only render rows at or above 1.0% cumulative GPU-time share.
Treat anything below that as noise unless the user explicitly asks for a lower cutoff.
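As a reference for the cutoff semantics, here is a minimal sketch of the share filter. The function name and the `(name, gpu_time_us)` row shape are hypothetical, not the script's actual data model:

```python
def filter_rows(rows, cutoff=0.01):
    """Keep rows whose share of total GPU time is at or above the cutoff.

    rows: list of (kernel_name, gpu_time_us) tuples -- a hypothetical shape
    used only to illustrate the 1.0% default cutoff.
    Returns (name, gpu_time_us, share) tuples for the surviving rows.
    """
    total = sum(dur for _, dur in rows) or 1.0
    return [(name, dur, dur / total) for name, dur in rows if dur / total >= cutoff]
```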
The script-level fuse-pattern table should stay source-backed and deterministic. Do not build a fuzzy string-matching engine into the script for typo-tolerance.
If exact/source-backed matching is weak but the agent judges that a cluster of kernels still looks semantically close to a known pattern, add a short AI note after the table with one of these labels:
- `high`: very likely the same pattern family; naming drift or minor implementation reshaping is the main uncertainty
- `medium`: several signals line up, but one important piece is still ambiguous
- `low`: weak resemblance only; mention it only if it is still worth a human follow-up

For diffusion benchmark or profiling work, only analyze traces produced by the native SGLang diffusion backend.
If the run that generated the trace logs any of:

- `Falling back to diffusers backend`
- `Using diffusers backend`
- `Loaded diffusers pipeline`

stop the workflow instead of analyzing the trace. Treat it as a backend-selection issue, not as valid SGLang diffusion profiler evidence.
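A minimal pre-check along these lines is enough; the marker strings are exactly the ones listed above, while the function name is a hypothetical illustration:

```python
# Fallback markers that invalidate a trace as SGLang diffusion evidence.
DIFFUSERS_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)

def trace_is_valid_evidence(server_log: str) -> bool:
    """Return False when the run fell back to the diffusers backend."""
    return not any(marker in server_log for marker in DIFFUSERS_MARKERS)
```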
```shell
python3 scripts/analyze_sglang_torch_profile.py \
  --input /path/to/profile_dir_or_trace.json.gz
```
Use this when you want the fastest read on kernel share and likely fused-kernel pattern matches. The overlap table stays conservative in single-trace mode and will tell you when a mapping/formal pair is needed.
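For orientation, extracting per-kernel GPU-time shares from a single trace can be sketched as below. It assumes torch.profiler's usual chrome-trace layout: a top-level `traceEvents` list with GPU kernels tagged `cat == "kernel"`. Category naming has varied across torch versions, so treat the filter as an assumption, not the script's exact logic:

```python
import gzip
import json
from collections import defaultdict

def kernel_shares(path):
    """Sum GPU kernel durations by name and return each kernel's share.

    Handles both trace.json and trace.json.gz. Assumes chrome-trace
    events where GPU kernels carry cat == "kernel" and a "dur" field
    in microseconds (adjust the category filter if nothing matches).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        events = json.load(f).get("traceEvents", [])
    totals = defaultdict(float)
    for ev in events:
        if ev.get("cat") == "kernel":
            totals[ev.get("name", "?")] += ev.get("dur", 0.0)
    total = sum(totals.values()) or 1.0
    # Largest share first, matching the report's table ordering.
    return {name: dur / total
            for name, dur in sorted(totals.items(), key=lambda kv: -kv[1])}
```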
```shell
python3 scripts/analyze_sglang_torch_profile.py \
  --url http://127.0.0.1:30000 \
  --num-steps 5 \
  --profile-by-stage
```
```shell
python3 scripts/analyze_sglang_torch_profile.py triage \
  --mapping-input /path/to/graph_off_profile_dir \
  --formal-input /path/to/graph_on_profile_dir
```
Use this when you need stronger overlap conclusions and cleaner kernel-to-source attribution.
```shell
python3 scripts/analyze_sglang_torch_profile.py triage \
  --mapping-url http://127.0.0.1:31025 \
  --formal-url http://127.0.0.1:31026 \
  --num-steps 5 \
  --profile-by-stage
```
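One way to read a mapping/formal pair is a side-by-side share diff. This helper is hypothetical (not part of the script) and assumes per-kernel share dicts, e.g. `{"kernel_name": share}` for each trace:

```python
def compare_shares(mapping_shares, formal_shares, cutoff=0.01):
    """Diff two per-kernel share dicts (graph-off vs graph-on).

    Returns (name, mapping_share, formal_share, delta) rows for kernels
    that reach the cutoff in at least one trace, sorted by |delta| so the
    biggest movers (e.g. kernels absorbed by fusion) surface first.
    """
    rows = []
    for name in set(mapping_shares) | set(formal_shares):
        m = mapping_shares.get(name, 0.0)
        f = formal_shares.get(name, 0.0)
        if max(m, f) >= cutoff:
            rows.append((name, m, f, f - m))
    rows.sort(key=lambda r: -abs(r[3]))
    return rows
```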
`profile_by_stage` is not only for PD disaggregation.

Use it when you want the lowest-friction report.
This is the recommended default.
Use when you need:

- `--disable-cuda-graph --disable-piecewise-cuda-graph`

Do not call the mapping pass a "fast profile". It exists to recover `kernel -> cpu_op -> python` scope.
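The recovery join behind the mapping pass can be sketched as follows. It assumes chrome-trace events where a launching `cpu_op` and its kernel share an `args["External id"]` value; torch.profiler exports have used this field, but exact field names vary by version, so this is an illustration rather than the script's implementation:

```python
def kernel_to_cpu_op(events):
    """Join GPU kernels to the cpu_op that launched them via External id."""
    # Index cpu_ops by their External id.
    cpu_by_ext = {
        ev["args"]["External id"]: ev["name"]
        for ev in events
        if ev.get("cat") == "cpu_op" and "External id" in ev.get("args", {})
    }
    # Look each kernel's External id back up; None when no cpu_op matches.
    return {
        ev["name"]: cpu_by_ext.get(ev.get("args", {}).get("External id"))
        for ev in events
        if ev.get("cat") == "kernel"
    }
```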
Operational guidance:

- Prefer TP-0 traces over merged traces.
- URL mode triggers `sglang.profiler` and automatically sends a small probe request.
- Use `--profile-by-stage` even on standard serving unless the user explicitly wants an all-stage mixed trace.
- Use `triage` for the compact three-table report.
- Cover PR-backed / in-flight sections too.

Prefer reporting:

- an AI similarity judgment note after the tables
Use high, medium, or low only.
Base that note on the full pattern shape, not on one kernel name alone.
Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure.
Do not rewrite the script table itself to include these heuristic judgments.

Load these only when needed:
Return:

- an AI similarity judgment note with `high` / `medium` / `low` when exact matching is inconclusive