Optimizes the performance of existing Liger Kernel Triton kernels. Profiles kernels, diagnoses bottlenecks (memory-bound vs compute-bound), generates multiple optimization variants with benchmarking, and applies the best variant while maintaining correctness. Supports GPU architecture-specific optimization (Ampere, Hopper, Blackwell). Use when a user asks to optimize, speed up, tune, profile, or reduce memory of an existing Liger kernel.
Optimizes existing Liger Kernel Triton kernels through a 3-stage pipeline: Profile, Optimize, Finalize. Supports interactive mode (human checkpoints between stages) and autonomous mode (runs end-to-end). NVIDIA GPUs only.
Extract from the user's request:
| Field | Description | Default |
|---|---|---|
| target_kernel | Which kernel to optimize (e.g., "rms_norm", "cross_entropy") | Required |
| optimization_goal | speed / memory / balanced | balanced |
| scope | Specific pass (forward/backward), input regime, or general | general |
| target_gpu | Ampere / Hopper / Blackwell / auto-detect | auto-detect |
| autonomy | interactive / autonomous | interactive |
| max_variants | Max optimization variants to try | 8 |
| target_metric | Optional concrete target (e.g., "forward under 0.3ms at hidden_size=4096") | none |
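The target_gpu auto-detect can be sketched as a mapping from the CUDA compute capability reported by PyTorch to an architecture name. This is an illustrative helper, not part of the skill's actual code; the SM numbers for consumer cards vary, so the mapping below only hedges around the major versions.

```python
def arch_from_capability(capability):
    """Map a (major, minor) CUDA compute capability to an architecture name.

    Illustrative mapping: sm_8x -> Ampere, sm_90 -> Hopper,
    sm_100 and newer -> Blackwell.
    """
    major, _ = capability
    if major == 8:
        return "Ampere"
    if major == 9:
        return "Hopper"
    if major >= 10:
        return "Blackwell"
    raise ValueError(f"Unsupported compute capability: {capability}")

def detect_target_gpu():
    # NVIDIA GPUs only, per the skill description.
    import torch
    return arch_from_capability(torch.cuda.get_device_capability())
```

If the user names an architecture explicitly, that value takes precedence over detection.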
Before starting the pipeline, validate:
- The kernel source exists: `src/liger_kernel/ops/{kernel}.py`
- The benchmark script exists: `benchmark/scripts/benchmark_{kernel}.py`
- The test file exists: `test/transformers/test_{kernel}.py`
- Dev dependencies are installed (`pip install -e ".[dev]"`)

If any validation fails, report clearly and stop.
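The file-existence part of this validation can be sketched as a small helper; the function name and `repo_root` parameter are illustrative, not part of the repository:

```python
from pathlib import Path

# Required files, with {kernel} substituted per request.
REQUIRED_FILES = [
    "src/liger_kernel/ops/{kernel}.py",
    "benchmark/scripts/benchmark_{kernel}.py",
    "test/transformers/test_{kernel}.py",
]

def validate_inputs(kernel, repo_root="."):
    """Return the list of required files that are missing for this kernel."""
    root = Path(repo_root)
    return [
        template.format(kernel=kernel)
        for template in REQUIRED_FILES
        if not (root / template.format(kernel=kernel)).exists()
    ]
```

A non-empty return value means the pipeline should report the missing paths and stop.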
Spawn a Profiler agent (read `profiler.md`).
The agent:
- Sets up the workspace directory `optimization/{kernel}/`
- Profiles the kernel (with NVIDIA Nsight Compute, if `ncu` is available)
- Writes the optimization profile → `optimization/{kernel}/profile.md`

**Human checkpoint (interactive mode):** Present the optimization profile with bottleneck diagnosis and proposed strategy order. Confirm before proceeding.
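The memory-bound vs compute-bound diagnosis can be sketched with a simple roofline test: compare the kernel's arithmetic intensity against the machine balance point. The function and the peak numbers in the test are illustrative, not measured values from this repository.

```python
def diagnose_bound(flops, bytes_moved, peak_tflops, peak_bw_gbs):
    """Classify a kernel as memory- or compute-bound via a roofline test.

    Arithmetic intensity (FLOPs per byte) below the machine balance point
    means the kernel cannot saturate the ALUs and is memory-bound.
    """
    intensity = flops / bytes_moved                     # FLOP/byte
    ridge = (peak_tflops * 1e12) / (peak_bw_gbs * 1e9)  # machine balance, FLOP/byte
    return "memory-bound" if intensity < ridge else "compute-bound"
```

Elementwise and normalization kernels typically land well below the ridge point, which is why memory-oriented strategies (vectorized loads, fewer passes over the data) usually lead for them.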
Spawn an Optimizer agent (read `optimizer.md`).
The agent runs an autonomous optimization loop:
a. Write the variant kernel → `optimization/{kernel}/{kernel}_vN.py`
b. Write the variant lab notebook → `optimization/{kernel}/{kernel}_vN_notes.md`
c. Run a quick smoke test (single shape, float32, forward+backward) → discard on failure
d. Run the full existing benchmark script → `optimization/{kernel}/benchmarks/vN_results.csv`
e. Check guardrails (no catastrophic regressions)
f. Update the variant notes with actual results

**Human checkpoint (interactive mode):** Present the comparison table across all variants. The user approves the winner (or the skill picks the best variant if autonomous).
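The optimization loop above can be sketched as follows. The `smoke_test`, `run_benchmark`, and `passes_guardrails` callables are placeholders standing in for the real scripts the skill invokes; "lower is better" timing in milliseconds is assumed.

```python
def optimization_loop(variants, smoke_test, run_benchmark, passes_guardrails,
                      max_variants=8):
    """Try up to max_variants, keep survivors, return the fastest one (or None)."""
    results = {}
    for name, variant in list(variants.items())[:max_variants]:
        if not smoke_test(variant):          # step c: discard on smoke failure
            continue
        timing_ms = run_benchmark(variant)   # step d: full benchmark
        if not passes_guardrails(timing_ms): # step e: reject regressions
            continue
        results[name] = timing_ms            # step f: record actual results
    return min(results, key=results.get) if results else None
```

In autonomous mode the return value is applied directly; in interactive mode it is only the proposed winner presented at the checkpoint.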
Spawn a Finalizer agent (read `finalizer.md`).
The agent:
- Applies the winning variant to `src/liger_kernel/ops/{kernel}.py`
- Runs `python -m pytest test/transformers/test_{kernel}.py -xvs` (hard gate)
- Runs `make checkstyle` (auto-fix with `ruff check . --fix && ruff format .`)
- Regenerates comparison plots with `benchmarks_visualizer.py`
- Writes the final report → `optimization/{kernel}/report.md`

**Human checkpoint (interactive mode):** Present the final report with before/after numbers, comparison plots, and test results.
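The hard gates above reduce to "run a command, proceed only on exit code 0", which can be sketched with a minimal helper (the function name is illustrative):

```python
import subprocess
import sys

def run_gate(cmd, cwd=None):
    """Run one hard gate command; return True only on exit code 0."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode == 0

# Hypothetical usage for the pytest gate (kernel name filled in by the skill):
# run_gate([sys.executable, "-m", "pytest",
#           "test/transformers/test_rms_norm.py", "-xvs"])
```

If the pytest gate returns False, the winner must not be applied; the checkstyle gate instead triggers the ruff auto-fix and one retry.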
These apply to EVERY variant, regardless of mode:
| Guardrail | Threshold | Action |
|---|---|---|
| Non-target metric regression | >5% worse | Reject variant |
| Cross-pass regression | >10% worse on one pass to marginally improve the other | Reject variant |
| Smoke test failure | Any correctness failure | Discard variant immediately |
| Full test suite failure | Any | Do NOT apply winner, report failure, stop |
| Checkstyle failure | Any | Auto-fix with ruff, retry once |
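The per-metric regression guardrails can be sketched as a check over baseline-vs-variant measurements. The metric names and the dict-based interface are illustrative; lower is assumed better for every metric, and the cross-pass 10% rule is omitted here since it needs pass-pairing context.

```python
def check_guardrails(baseline, variant, target_metric):
    """Apply the >5% non-target regression guardrail.

    `baseline` and `variant` map metric names (e.g., "forward_ms",
    "backward_ms", "peak_mem_mb") to measured values where lower is better.
    Returns (ok, reasons); a non-empty reasons list means "reject variant".
    """
    reasons = []
    for metric, base in baseline.items():
        ratio = variant[metric] / base
        if metric != target_metric and ratio > 1.05:
            reasons.append(f"{metric} regressed {100 * (ratio - 1):.1f}%")
    return (not reasons, reasons)
```

Because these checks run on every variant in every mode, a variant that wins its target metric can still be rejected for collateral damage elsewhere.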