Profile GPU kernels using rocprofv3 to collect ATT instruction-level traces, then analyze the trace data using hotspot_analyzer.py to identify the top-K stall hotspots (VMEM-load, VMEM-wait, LDS/SMEM-wait, barrier, and MFMA stalls) mapped back to source lines, and produce an actionable optimization plan.

Usage: /kernel-trace-analysis <cmd>

An existing dispatch directory can also be analyzed directly: /kernel-trace-analysis --dir <path>
Profile and analyze GPU kernel ATT traces to identify stall hotspots and produce an optimization plan.
| Argument | Description |
|---|---|
| `<CMD>` | Command to profile. Example: `python bench_pa.py --batch 32` |
| `--dir <path>` | Skip collection; analyze an existing `ui_output_agent_*_dispatch_*` directory |
| `--topk N` | Show top-N hotspots (default: 15) |
The hotspot analyzer is located at scripts/hotspot_analyzer.py.
It reads a ui_output_agent_*_dispatch_* directory and reports top-K stall hotspots.
If the user provides --dir <path> or already has a ui_output_agent_*_dispatch_* directory:
# Copy scripts/hotspot_analyzer.py to /tmp, then:
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 15 --mode both
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 5 --mode src --detail --context 4
Skip to Step 4: Interpret Results.
touch /tmp/trace_ts
rocprofv3 --stats --kernel-trace -f csv -- <CMD> 2>&1
find . -maxdepth 3 -name "*stats*" -newer /tmp/trace_ts -type f 2>/dev/null
Parse the stats CSV and present a kernel table:
| Rank | Kernel Name | Calls | Total (us) | Avg (us) | % GPU Time |
|---|---|---|---|---|---|
Ask the user which kernel to trace if not obvious.
Prefer results.db if available — use sqlite3 for structured queries:
sqlite3 results.db "
SELECT ks.KernelName, COUNT(*) calls,
ROUND(AVG(kd.end-kd.start)/1000.0,1) avg_us
FROM rocpd_kernel_dispatch kd
JOIN rocpd_info_kernel_symbol ks ON kd.kernel_symbol_id=ks.id
GROUP BY ks.KernelName ORDER BY avg_us DESC LIMIT 20;"
cp ~/Documents/input.yaml /tmp/trace_input.yaml
Edit /tmp/trace_input.yaml: