Use when optimizing an existing SGLang diffusion kernel with AKO4ALL, including AKO4ALL repo hygiene, custom microbench setup, ncu-guided iteration, and end-to-end denoise validation. Also use when a sibling AKO4ALL repo must be cloned or refreshed before starting kernel tuning work.
Use this skill to run the full AKO4ALL-based optimization loop for an existing SGLang diffusion kernel.
It is the default implementation path once the benchmark/profile skill has already shown that a hotspot is real and not covered by an existing fast path. This workflow bootstraps a custom AKO harness, benchmarks and profiles the kernel, iterates with ncu, ports the best version back to sglang, then validates with targeted tests and model-level denoise runs.
This skill assumes a sibling repo layout like:
```
<base-dir>/
├── sglang/
└── AKO4ALL/
```
If AKO4ALL/ is missing under the current base directory, clone it first.
Do not start here when the bottleneck has not been proven yet. First use ../sglang-diffusion-benchmark-profile/SKILL.md to:

- capture a torch.profiler trace that confirms the hotspot
- rule out an existing in-repo fast path or overlap family

The deliverables of this workflow are ncu before/after data and proof image outputs.

If a future specialized optimization skill matches the kernel family better than AKO4ALL, hand off there instead. The diagnosis contract stays the same.
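As a rough illustration of the kind of torch.profiler trace the profiling skill collects — `dummy_denoise_step` is a hypothetical stand-in for the real model call, and file names are arbitrary:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def dummy_denoise_step(x):
    # Stand-in for a real denoise step; the actual call is model-specific.
    return torch.nn.functional.gelu(x @ x.T)

x = torch.randn(64, 64)
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(3):
        dummy_denoise_step(x)

# Export a Chrome trace and print the hottest ops by self time.
prof.export_chrome_trace("denoise_trace.json")
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

The exported JSON opens in chrome://tracing or Perfetto; the table alone is usually enough to decide whether a kernel is worth an AKO4ALL loop.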
Before any AKO work:
- Run scripts/ensure_ako4all_clean.sh [base-dir]. If <base-dir>/AKO4ALL does not exist, the script clones it. Otherwise it checks that AKO4ALL is:
  - on main
  - synced with upstream/<default-branch>
- The script creates an upstream remote automatically when missing. By default it uses the existing origin URL, or AKO4ALL_URL if you need to override the clone source.
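The real hygiene step is the shell script above; as a rough Python sketch of the intended clone-or-verify behavior (the branch/remote syncing is omitted, and all details here are assumptions, not the actual script):

```python
import os
import subprocess

def ensure_ako4all_clean(base_dir, url=None):
    """Sketch: clone AKO4ALL if missing, otherwise require a clean
    working tree. Hypothetical stand-in for ensure_ako4all_clean.sh."""
    repo = os.path.join(base_dir, "AKO4ALL")
    if not os.path.isdir(repo):
        clone_url = url or os.environ.get("AKO4ALL_URL")
        if not clone_url:
            raise RuntimeError("no AKO4ALL checkout and no clone URL given")
        subprocess.run(["git", "clone", clone_url, repo], check=True)
        return repo
    # `git status --porcelain` prints nothing when the tree is clean.
    status = subprocess.run(
        ["git", "-C", repo, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    if status.stdout.strip():
        raise RuntimeError("AKO4ALL working tree is dirty; commit or stash first")
    return repo
```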
A clean checkout matters because the winning kernel is later ported back into sglang. Inside the clean AKO4ALL repo:
- Write TASK.md and HINTS.md.
- Create:
  - input/reference.py
  - input/<kernel>.py
  - solution/<kernel>.py
  - bench/bench_<kernel>.py
  - context/, when the kernel has model-specific shape assumptions or perf conclusions

The custom benchmark should check the candidate against input/reference.py for correctness and time the shapes that actually occur in the model.
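A minimal shape for such a bench file might look like the following. The real harness times GPU kernels; here the kernels are pure-Python stand-ins, and every name (`reference_kernel`, `candidate_kernel`, the tolerance) is an assumption for illustration:

```python
import time

def bench(fn, args, warmup=10, iters=100):
    """Time fn(*args), returning mean seconds per call after warmup."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical stand-ins for input/reference.py and solution/<kernel>.py.
def reference_kernel(xs):
    return [x * x for x in xs]

def candidate_kernel(xs):
    return [x ** 2 for x in xs]

xs = [float(i) for i in range(1024)]
# Correctness gate first, then timing.
assert max_abs_diff(reference_kernel(xs), candidate_kernel(xs)) < 1e-6
ref_t = bench(reference_kernel, (xs,))
cand_t = bench(candidate_kernel, (xs,))
print(f"speedup: {ref_t / cand_t:.2f}x")
```

For real GPU kernels, the timing loop would additionally need device synchronization around the measured region, and the tolerance should be chosen per dtype.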
Then run the optimization loop:

- Capture an ncu baseline on the hottest meaningful shape.
- Record every iteration in ITERATIONS.md with hypothesis, result, and next step.

After 3 consecutive no-improvement or regression iterations, stop and reassess instead of continuing to guess.
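An ITERATIONS.md entry could follow a simple per-iteration shape — the numbers and kernel details below are invented purely for illustration:

```markdown
## Iteration 3 — vectorize epilogue loads
- Hypothesis: epilogue is bandwidth-bound; wider loads should cut DRAM transactions.
- Result: 1.07x on the largest shape, neutral elsewhere (ncu shows DRAM bytes down).
- Next step: try shared-memory staging for the scale factors.
```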
When a variant wins, port it back:

- Re-profile with ncu to confirm the gain and close out ITERATIONS.md.
- Edit the sglang kernel file, keeping the solution/ version aligned with the main-tree version you actually want to keep.
- Validate with compare_perf.py.

At minimum, keep:
- an ncu before/after pair on the most representative kernel shape

See references/ako-loop.md for the checklist and common stop rules.
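In spirit, the compare_perf.py validation step might resemble the sketch below. The real script's input format is unknown; this assumes simple per-shape timing dictionaries, and the shape names and tolerance are made up:

```python
def compare(before, after, regression_tol=0.02):
    """Report per-shape speedups; flag shapes slower than (1 - tol)x baseline.

    before/after map shape name -> mean seconds per call.
    Returns shape -> (speedup, regressed?).
    """
    report = {}
    for shape, t_before in before.items():
        t_after = after.get(shape)
        if t_after is None:
            continue  # shape not covered by the new run
        speedup = t_before / t_after
        report[shape] = (speedup, speedup < 1.0 - regression_tol)
    return report

# Invented example timings for two hypothetical shapes.
before = {"b1_s4096": 1.20e-3, "b4_s1024": 0.80e-3}
after = {"b1_s4096": 0.90e-3, "b4_s1024": 0.82e-3}
for shape, (speedup, regressed) in compare(before, after).items():
    flag = " REGRESSION" if regressed else ""
    print(f"{shape}: {speedup:.2f}x{flag}")
```

The point of the per-shape regression flag is the stop rule above: a win on the hero shape does not justify a port-back if a shape the model actually hits gets slower.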