Run SGLang auto benchmark searches with tiered server-flag sweeps, canonical dataset preparation, ShareGPT auto-download, custom-data conversion/validation, SLA or fixed-QPS benchmarking, CSV export, and optional second-stage speculative/EAGLE tuning. Use when the user wants an AI-operated benchmark workflow rather than a one-off bench_serving command.
This skill is for repeatable, AI-driven SGLang performance tuning.
The preferred workflow is:
- `python -m sglang.auto_benchmark` driving a tiered `search_space`, rather than hand-run one-off `python -m sglang.bench_serving --dataset-name autobench` commands
- SLA targets expressed as `max_ttft_ms` / `max_tpot_ms`

The implementation lives in:
- `python -m sglang.auto_benchmark`
- `.claude/skills/sglang-auto-benchmark/references/cookbook-llm/`

If those pieces are not in place yet, fix them before running a large search.
Environment consistency check:
- Before launching a long run, verify that the remote `python/sglang/bench_serving.py` matches the feature level needed by auto benchmark.
- Run `PYTHONPATH=<repo>/python python3 -m sglang.bench_serving --help` and confirm that the dataset choices include `autobench`.
- If `autobench` is missing remotely, do not start the benchmark; sync `python/sglang/bench_serving.py` and any required dataset modules first.

Scope note:
If the benchmark is executed on a remote machine, the progress bar output must be mirrored back to a local file for humans to watch.
Required behavior:
- Capture remote runs with `script -q -f <log> -c "<cmd>"`; on Linux containers that use util-linux `script`, prefer the explicit `-c` form instead of BSD-style positional command arguments.
- Mirror the captured output back to a local `progress.log`; this local `progress.log` should already have terminal control sequences removed, because `script` + tqdm progress bars will otherwise leave ANSI cursor-control bytes and carriage-return redraws that look like garbled text.
- `progress.log` must be refreshed automatically at least once every 30 seconds while the run is active; do not rely on one-off manual polling.
- Avoid `nohup zsh -lc '...'` command strings with heavy nested quoting.
- Prefer a tmux pane, `screen`, or the agent's own persistent PTY session; detached child processes started from short-lived command runners can be reaped unexpectedly, so plain `nohup ... &` is not the most stable default.
- Verify that `progress.log` is actually updating by checking its timestamp or size twice across a short wait; if it is not changing, treat that as a broken sync setup and fix it before telling the user that live log mirroring is working.
- Copy `summary.md` / `SUMMARY.md` files back locally as first-class result artifacts rather than leaving them only on the remote machine.

This is important because long searches can run for hours, and people need a stable local file they can tail without logging into the remote box. The final local run folder should also be self-contained enough for someone to review the benchmark outcome without re-entering the remote environment.
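The first requirement above can be made concrete as follows. This is a minimal sketch of the util-linux `script` capture form, with `echo` standing in for the real auto-benchmark command and `remote_progress.log` as a placeholder log name:

```sh
# -c passes the command explicitly (util-linux form), -f flushes after each
# write so the log can be tailed live, -q suppresses the session header.
# `echo benchmark-line` is a placeholder for the real benchmark command.
script -q -f -c "echo benchmark-line" remote_progress.log
grep 'benchmark-line' remote_progress.log
```

The positional log file and explicit `-c` make the intent unambiguous to util-linux `script`; BSD/macOS `script` parses its arguments differently.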
Recommended cleanup pipeline for the local mirrored log:

```sh
perl -pe 's/\e\[[0-9;?]*[ -\/]*[@-~]//g; s/\r/\n/g; s/\x08//g;' raw_progress.log \
  > progress.log
```
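To see what the pipeline does, here is a self-contained demo on synthetic tqdm-style output; the sample bytes are made up, but the perl command is the one above:

```sh
# Fake a captured progress log: a partial bar, a carriage-return redraw, and an
# ANSI erase-line escape, as script + tqdm typically produce.
printf 'tuning: 10%%|#         \r\x1b[2Ktuning: 100%%|##########| done\r\n' > raw_progress.log

# Strip ANSI control sequences, turn \r redraws into newlines, drop backspaces.
perl -pe 's/\e\[[0-9;?]*[ -\/]*[@-~]//g; s/\r/\n/g; s/\x08//g;' raw_progress.log \
  > progress.log

# progress.log now contains plain text lines: no ESC bytes, no carriage returns.
cat progress.log
```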
Recommended remote-container sync pattern:

```sh
cat > sync_progress.sh <<'EOF'
#!/bin/zsh
set -euo pipefail
while true; do
  ssh <remote-host> "tail -n 200 <remote-progress-log>" > raw_progress.log
  perl -pe 's/\e\[[0-9;?]*[ -\/]*[@-~]//g; s/\r/\n/g; s/\x08//g;' raw_progress.log \
    > progress.log
  sleep 15
done
EOF
chmod +x sync_progress.sh
```
Run that script from a long-lived local session, for example:

```sh
tmux new-session -d -s autobench-sync './sync_progress.sh'
```
Use a persistent local background job, tmux pane, `screen`, or equivalent
long-lived sync process so that humans can watch the cleaned local log in real
time. Use `sleep 15` by default for long runs unless there is a specific need
for tighter polling, and keep the cleaned local `progress.log` within the
required 30-second refresh window while the run is active.
At the end of the run, make sure the local artifact set includes any generated:
- `results.jsonl`
- `results.csv`
- `summary.md` / `SUMMARY.md`
- `scenario_summary.jsonl`
- `scenario_summary.csv`

Required health check after starting the sync script:
```sh
# BSD/macOS stat shown; on GNU/Linux use: stat -c '%Y %s' progress.log
stat -f '%m %z' progress.log
sleep 5
stat -f '%m %z' progress.log
```
If the timestamp and size both stay unchanged while the remote benchmark is known to be producing new output, the sync loop is broken. Fix the script before continuing.
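The two-sample check can be wrapped in a small helper that handles both GNU and BSD `stat`; the function names here are illustrative, not part of the skill:

```sh
# Hypothetical helper: fingerprint = mtime + size, GNU stat with BSD fallback.
file_fingerprint() {
  stat -c '%Y %s' "$1" 2>/dev/null || stat -f '%m %z' "$1"
}

# Sample the fingerprint twice; an unchanged fingerprint while the remote run
# is known to be producing output means the sync loop is broken.
assert_progress_fresh() {
  local before after
  before=$(file_fingerprint "$1")
  sleep "${2:-5}"
  after=$(file_fingerprint "$1")
  if [ "$before" = "$after" ]; then
    echo "sync loop looks broken: $1 unchanged" >&2
    return 1
  fi
  echo "sync loop healthy: $1 is updating"
}
```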
Do not make the cleaned log optional. The default local progress artifact should
be the cleaned progress.log that humans actually read.
If the user wants the best command for a real production or real workload scenario, the benchmark must use their real request distribution.
That means:
`sharegpt`, `random`, and `generated-shared-prefix` are useful for sanity checks and broad tuning, but they are not a substitute for the user's real traffic.
The cookbook reference configs now default to `random` because it is portable and immediately runnable, but that should still be treated as a fallback benchmark shape rather than the final answer for a real deployment.
The current implementation intentionally keeps the dataset surface small:
- `sharegpt`
- `custom` (bench_serving custom conversation JSONL)
- `random` (`input_len` and `output_len` can be lists of equal length)
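The paired-list rule can be sanity-checked with a short bash sketch; the variable names mirror the config keys above, and the real config format may differ:

```sh
# Hypothetical example: three paired length buckets for the random dataset.
input_len=(1024 4096 8192)
output_len=(128 512 1024)

# The two lists must be the same length, since entries are paired positionally.
if [ "${#input_len[@]}" -ne "${#output_len[@]}" ]; then
  echo "input_len and output_len must have equal length" >&2
  exit 1
fi
echo "ok: ${#input_len[@]} paired length buckets"
```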