Analyze CI performance comparison reports for a ClickHouse PR: list all regressions and improvements, and cross-reference them with master history to distinguish real changes from flaky tests.
$0 (required): PR number (e.g. 99474) or a CI report URL

Use the fetch_perf_report.py tool to get TSV data for both architectures:
python3 .claude/tools/fetch_perf_report.py "https://github.com/ClickHouse/ClickHouse/pull/$PR" --arch amd --tsv
python3 .claude/tools/fetch_perf_report.py "https://github.com/ClickHouse/ClickHouse/pull/$PR" --arch arm --tsv
If a direct CI report URL is given instead of a PR number, use it directly.
Extract all changed queries (column 10 == 1 means the change exceeds the threshold):
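The filtering step can be sketched with awk. The sample rows below are synthetic and their column layout is an assumption; only "change flag in column 10" comes from the report format described above.

```shell
mkdir -p tmp
# Synthetic rows mimicking the --tsv output (columns 4-7 are placeholders);
# the real layout may differ -- adjust the indices to match.
printf '0.80\t1.00\t1.25\tx\tx\tx\tx\tmath\t2\t1\n'  > tmp/sample.tsv
printf '0.98\t1.00\t1.02\tx\tx\tx\tx\tarith\t1\t0\n' >> tmp/sample.tsv
# Keep only rows whose change exceeded the threshold (column 10 == 1)
awk -F'\t' '$10 == 1' tmp/sample.tsv
```

Sorting the surviving rows by the ratio column (column 3 in this sketch) then gives the magnitude-ordered list for the tables.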
Present the complete, unfiltered list of all changes above 1.10x, sorted by magnitude, for both architectures separately. Use tables with columns: Magnitude, Direction, Test, Query#, Batch.
Do NOT summarize, collapse, or hide entries. Do NOT dismiss anything as "noise" or "not actionable" without evidence. Every entry must be visible.
For every test that shows as slower or faster above 1.10x, check the public CI database to determine if this is a known flaky test or a genuine change introduced by the PR.
Query the public CI database:
clickhouse client --host play.clickhouse.com --user explorer --secure -q "
SELECT
    replaceRegexpOne(test_name, '::(new|old)$', '') AS test,
    countIf(test_status = 'slower') AS slower_count,
    countIf(test_status = 'faster') AS faster_count,
    countIf(test_status = 'unstable') AS unstable_count,
    count() AS total_runs
FROM default.checks
WHERE pull_request_number = 0
  AND check_name LIKE '%Performance%amd%' -- or arm
  AND check_start_time >= now() - INTERVAL 30 DAY
  AND test_name IN (
      'test_name #N::new',
      ...
  )
GROUP BY test
ORDER BY slower_count DESC, test
"
Important:
- pull_request_number = 0 means master commits
- ::new and ::old suffixes — query with ::new
- %amd% for AMD results and %arm% for ARM results

For each test, classify based on the master history:
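The classification can be sketched as a rule of thumb over the query results. The 10% cutoff below is an assumption for illustration, not a CI policy: a test that was already flagged slower on many master runs is probably flaky, while one that never was is more likely a genuine change.

```shell
mkdir -p tmp
# Synthetic master-history rows: test, slower, faster, unstable, total_runs
printf 'hash_functions\t12\t3\t5\t40\n'  > tmp/master_history.tsv
printf 'math\t0\t0\t1\t40\n'            >> tmp/master_history.tsv
# Flag tests that were 'slower' on >= 10% of master runs as likely flaky
awk -F'\t' '{ v = ($2 / $5 >= 0.10) ? "likely flaky" : "likely genuine";
              print $1 "\t" v }' tmp/master_history.tsv
```

Borderline cases (a handful of slower runs on master) still deserve a look at the magnitudes before assigning a verdict.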
Present the final classification in a table per architecture:
| Magnitude | Test | Master slower/total (30d) | Verdict |
|---|---|---|---|
Classify as:
After the tables, provide a brief summary:
When the user asks to investigate a specific regression further, download and analyze the CI artifacts.
Get artifact links using the fetch_ci_report.js tool with --links:
node .claude/tools/fetch_ci_report.js "<CI-report-URL-for-specific-batch>" --links
This will show logs.tar.zst, job.log.zst, all-query-metrics.tsv, report.html, etc.
Download and extract server logs:
curl -sS "<logs.tar.zst-URL>" -o tmp/perf_logs.tar.zst
tar -I zstd -tf tmp/perf_logs.tar.zst # list contents
tar -I zstd -xf tmp/perf_logs.tar.zst -C tmp/ ./right/server.log # PR binary
tar -I zstd -xf tmp/perf_logs.tar.zst -C tmp/ ./left/server.log # master binary
- right/server.log = PR binary (the "new" version)
- left/server.log = master binary (the "old" version)

Analyze the query execution by finding the query ID in the server log:
# Find the query and its timing
grep "math.query2" tmp/right/server.log | grep -E "Aggregated|Read.*rows.*sec"
The perf framework uses query IDs like {test_name.query{N}.run{M}} (e.g. math.query2.run0). Look at:
- Aggregated ... in X sec — actual compute time
- executeQuery: Read N rows ... in X sec — total query time
- TCPHandler: Processed in X sec — wall clock including network

Compare both servers during the same time window to see if the machine was under load or if only the PR binary was slow. Check for:
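The side-by-side comparison can be sketched with synthetic logs; the real files come from the tar extraction above, and the log line format here is illustrative, not verbatim.

```shell
# Fake one timing line per server for the same query run
mkdir -p tmp/left tmp/right
echo '{math.query2.run0} TCPHandler: Processed in 0.51 sec.' > tmp/left/server.log
echo '{math.query2.run0} TCPHandler: Processed in 1.73 sec.' > tmp/right/server.log
# Print the wall-clock timing for the query from both servers
for side in left right; do
  printf '%s: ' "$side"
  grep 'math.query2' "tmp/$side/server.log" | grep -oE 'Processed in [0-9.]+ sec'
done
```

If both sides slowed down in the same window, suspect machine load rather than the PR.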
Check the git hash of the binary that actually ran:
grep "Starting ClickHouse" tmp/right/server.log | head -1
This shows the exact revision, build ID, and PID. Compare with what you expect — the CI perf test may use a different binary than the latest commit if the build was cached.
The logs.tar.zst archive contains right-trace-log.tsv and left-trace-log.tsv — these are exports of system.trace_log from each server, containing CPU and real-time stack samples for every query.
Extract trace logs:
tar -I zstd -xf tmp/perf_logs.tar.zst -C tmp/ ./right-trace-log.tsv ./left-trace-log.tsv
Key columns (TSV format, header in row 1, types in row 2, data from row 3):
- trace_type — CPU (sampled CPU time), Real (wall clock), Memory, etc.
- query_id — matches the {test.queryN.runM} pattern from server logs
- symbols — comma-separated list of function names (leaf first), wrapped in ['...']

Find the hotspot for a specific query — extract CPU traces and count leaf functions:
grep "math.query2" tmp/right-trace-log.tsv | awk -F'\t' '$7 == "CPU" {print $19}' | \
sed "s/\[//g; s/\]//g; s/'//g" | \
awk -F',' '{print $1}' | \
sort | uniq -c | sort -rn | head -20
This gives a flat profile of where CPU time is spent, similar to perf report. Compare left (master) vs right (PR) to see what changed.
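The left-vs-right comparison can be done mechanically by joining the two flat profiles on the function name. The sample counts below are synthetic; real input comes from the uniq -c pipelines above.

```shell
mkdir -p tmp
# Synthetic flat profiles in "count function" form, as produced by sort | uniq -c
printf '577 modf\n310 __expf\n' > tmp/right_flat.txt
printf '28 modf\n290 __expf\n'  > tmp/left_flat.txt
# Join on the function name (field 2) to compare left vs right sample counts
sort -k2 tmp/left_flat.txt  > tmp/left_sorted.txt
sort -k2 tmp/right_flat.txt > tmp/right_sorted.txt
join -j 2 tmp/left_sorted.txt tmp/right_sorted.txt \
  | awk '{ printf "%-10s left=%-5s right=%s\n", $1, $2, $3 }'
```

Functions whose right-side count ballooned relative to the left are the first suspects.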
Build collapsed stacks for flamegraph visualization:
grep "math.query2" tmp/right-trace-log.tsv | awk -F'\t' '$7 == "CPU" {print $19}' | \
sed "s/\[//g; s/\]//g; s/','/;/g; s/'//g" | \
sort | uniq -c | awk '{print $2, $1}' > tmp/math2_right.collapsed
The resulting .collapsed file can be processed with flamegraph.pl or the analyze-assembly.py --perf-map tool.
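For reference, the collapsed format is one semicolon-joined stack per line followed by a space and the sample count. The stack below is made up; note that system.trace_log symbols are leaf-first (see above), so if the rendered flamegraph looks inverted, reverse the frame order before rendering.

```shell
# Made-up example of the collapsed format consumed by flamegraph.pl
# (clone https://github.com/brendangregg/FlameGraph to get the script):
#   ./FlameGraph/flamegraph.pl tmp/math2_right.collapsed > tmp/math2_right.svg
mkdir -p tmp
printf 'main;executeQuery;exp10;modf 577\n' > tmp/sample.collapsed
# Validate the "frame(;frame)* count" shape of each line
grep -Eq '^[^ ]+ [0-9]+$' tmp/sample.collapsed && echo format-ok
```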
This is the fastest way to identify the root cause of a regression. Example: for a 31x exp10 regression, the trace log immediately showed modf consuming 577/1200 CPU samples on the PR binary vs 28/60 on master — pinpointing the exact function responsible without needing to reproduce locally.
The archive also contains pre-built SVG flamegraphs for queries the framework selected for detailed analysis:
tar -I zstd -tf tmp/perf_logs.tar.zst | grep "\.svg"
These come in .left.svg (master), .right.svg (PR), and .diff.svg (differential) variants, for both CPU and Real time. Not all queries get flamegraphs — only those the framework considers interesting.
The archive contains per-query raw metric data in analyze/tmp/{test}_{queryN}.tsv. Each row is one run, with an array of all ProfileEvents (counters like UserTimeMicroseconds, OSCPUVirtualTimeMicroseconds, RealTimeMicroseconds, etc.).
tar -I zstd -xf tmp/perf_logs.tar.zst -C tmp/ ./analyze/tmp/math_2.tsv
The all-query-metrics.tsv file (linked from --links output) contains the processed comparison data with old/new values, ratios, and thresholds for every metric of every query. The fetch_perf_report.py tool already parses this, but the raw file has all metrics, not just client_time.
When analyzing multiple batches or PRs, use unique directory names to avoid overwriting:
mkdir -p tmp/batch1_amd tmp/batch5_amd
curl -sS "<batch1-logs-url>" -o tmp/batch1_amd/logs.tar.zst
curl -sS "<batch5-logs-url>" -o tmp/batch5_amd/logs.tar.zst
tar -I zstd -xf tmp/batch1_amd/logs.tar.zst -C tmp/batch1_amd/