Analyze ncu (NVIDIA Nsight Compute) profiling output: SOL% bottleneck classification, roofline analysis, occupancy diagnosis, memory hierarchy analysis, warp stall analysis, metric interpretation, and programmatic .ncu-rep report analysis. NOT for kernel writing or code generation, Nsight Systems (nsys), host-side profiling, or system-level profiling.
NVIDIA Nsight Compute (ncu) profiles individual CUDA kernels to determine
why they are slow and what to optimize. It measures GPU throughput as a
percentage of theoretical peak (Speed of Light / SOL%), enabling systematic
bottleneck classification and targeted optimization.
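Reach for this skill when you need to analyze ncu output, interpret .ncu-rep reports, or optimize GPU kernel performance.
Do NOT use this skill for:
- Kernel writing or code generation
- System-level profiling (use nsys instead)
- Nsight Systems (nsys) analysis or host-side profiling
- General GPU monitoring (nvidia-smi)
| Dependency | Version | Notes |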
|---|---|---|
| CUDA Toolkit | >=11.0 | Includes ncu |
| ncu binary | Match CUDA version | Or set $NCU env var |
| NVIDIA GPU | Volta+ (compute capability 7.0+) | Older architectures such as Kepler need nvprof instead |
Permissions: ncu may require sudo, CAP_SYS_ADMIN, or --privileged
in containers. Check with ncu -v first.
This is a data-driven analysis system. Every number you present must have an authoritative source. Follow these rules without exception:
Speed of Light (SOL%) measures how close a kernel runs to the GPU's theoretical peak on two axes: Compute (SM) Throughput and Memory Throughput.
A kernel rarely saturates both at once, so the higher metric reveals the bottleneck type. Use this as the primary classification signal.
| Compute % | Memory % | Bottleneck | Next Step |
|---|---|---|---|
| >60 | <40 | Compute-bound | ComputeWorkloadAnalysis section |
| <40 | >60 | Memory-bound | MemoryWorkloadAnalysis section |
| <40 | <40 | Latency-bound | LaunchStats + Occupancy sections |
| 40-60 | 40-60 | Balanced | Profile deeper with detailed sections |
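The classification table can be sketched as a small function. This is a minimal illustration, not part of ncu: the function name `classify_bottleneck` and the two "leaning" fallback labels are assumptions, added for combinations the table does not list (for example 70% compute with 50% memory). In those cases it follows the rule above that the higher metric indicates the bottleneck type.

```python
def classify_bottleneck(compute_pct: float, memory_pct: float) -> str:
    """Map SOL% throughput values to a bottleneck label using the table's thresholds."""
    if compute_pct > 60 and memory_pct < 40:
        return "compute-bound"
    if memory_pct > 60 and compute_pct < 40:
        return "memory-bound"
    if compute_pct < 40 and memory_pct < 40:
        return "latency-bound"
    if 40 <= compute_pct <= 60 and 40 <= memory_pct <= 60:
        return "balanced"
    # Not covered by the table (assumption): fall back to the higher metric.
    return "compute-leaning" if compute_pct >= memory_pct else "memory-leaning"


if __name__ == "__main__":
    print(classify_bottleneck(78.5, 35.2))
```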
Additional signals:
| SOL% | Level | Action |
|---|---|---|
| >80% | Excellent | Minor tuning only |
| 60-80% | Good | Targeted optimization |
| 40-60% | Fair | Significant optimization needed |
| <40% | Poor | Major rework needed |
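The rating table can be expressed as a lookup. This is a sketch only: `rate_sol` is a hypothetical helper name, and treating the boundary values 60 and 40 as belonging to the higher band is an assumption, since the table does not say which side they fall on.

```python
def rate_sol(sol_pct: float) -> tuple[str, str]:
    """Return (level, action) for a SOL% value, following the rating table."""
    if sol_pct > 80:
        return ("Excellent", "Minor tuning only")
    if sol_pct >= 60:  # assumption: 60 and 80 count as "Good"
        return ("Good", "Targeted optimization")
    if sol_pct >= 40:  # assumption: 40 counts as "Fair"
        return ("Fair", "Significant optimization needed")
    return ("Poor", "Major rework needed")


if __name__ == "__main__":
    level, action = rate_sol(78.5)
    print(f"{level}: {action}")
```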
Prefer targeted --section flags over bulk --set collection: individual sections collect faster and keep the output focused on one question. Escalate to --set basic or --set detailed only when broad exploration is needed.
| Tool | Scope | Overhead | Purpose |
|---|---|---|---|
| nsys | System-level | 5-10% | Find which kernels to optimize |
| ncu | Kernel-level | 10-100x slower | Understand why a kernel is slow |
Use nsys first to identify top kernels by GPU time, then ncu for deep analysis of those specific kernels.
Choose your path based on the request:
ncu -v
# Or: $NCU -v
If not found, ensure CUDA toolkit is installed or set NCU env var to the binary path.
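A script that wraps ncu could resolve the binary the same way. The sketch below assumes the convention described above (an NCU environment variable that overrides the PATH lookup); `find_ncu` is a hypothetical helper name.

```python
import os
import shutil


def find_ncu(env=None):
    """Return the ncu path: $NCU if set, otherwise the first 'ncu' on PATH, else None."""
    env = os.environ if env is None else env
    override = env.get("NCU")
    if override:
        return override
    return shutil.which("ncu", path=env.get("PATH"))


if __name__ == "__main__":
    path = find_ncu()
    print(path if path else "ncu not found: install the CUDA toolkit or set NCU")
```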
Always start with SpeedOfLight to classify the bottleneck:
ncu --section SpeedOfLight --csv \
--kernel-name regex:"KERNEL" \
--launch-skip 5 --launch-count 3 \
-- COMMAND
Read Compute (SM) Throughput and Memory Throughput from the output. Classify using the thresholds above.
Based on Step 1 classification, add sections:
| Classification | Sections to Add |
|---|---|
| Compute-bound | ComputeWorkloadAnalysis |
| Memory-bound | MemoryWorkloadAnalysis |
| Latency-bound | LaunchStats, Occupancy |
| Warp stalls | WarpStateStats, SchedulerStats |
| Need instruction breakdown | InstructionStats |
Always include LaunchStats and Occupancy when diagnosing latency-bound kernels. These reveal register pressure, shared memory limits, and block size issues.
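The table can drive command construction in a wrapper script. The following sketch uses only the flags and section names that appear in this document. The dictionary keys and the helper name `build_ncu_command` are illustrative assumptions.

```python
import shlex

# Sections to add on top of SpeedOfLight, keyed by the Step 1 result.
SECTIONS_BY_CLASS = {
    "compute-bound": ["ComputeWorkloadAnalysis"],
    "memory-bound": ["MemoryWorkloadAnalysis"],
    "latency-bound": ["LaunchStats", "Occupancy"],
    "warp-stalls": ["WarpStateStats", "SchedulerStats"],
    "instruction-breakdown": ["InstructionStats"],
}


def build_ncu_command(classification, kernel_regex, command,
                      launch_skip=0, launch_count=3, ncu="ncu"):
    """Build an argv list for a targeted ncu run (no shell quoting needed)."""
    args = [ncu, "--section", "SpeedOfLight"]
    for section in SECTIONS_BY_CLASS[classification]:
        args += ["--section", section]
    args += ["--csv", "--kernel-name", f"regex:{kernel_regex}"]
    if launch_skip:
        args += ["--launch-skip", str(launch_skip)]
    args += ["--launch-count", str(launch_count), "--", *command]
    return args


if __name__ == "__main__":
    argv = build_ncu_command("memory-bound", "embedding_lookup", ["python", "script.py"])
    print(shlex.join(argv))  # pass argv to subprocess.run(argv) to execute
```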
Example -- memory-bound deep dive:
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
--kernel-name regex:"embedding_lookup" \
--launch-count 3 \
-- python script.py
Example -- compute-bound deep dive:
ncu --section SpeedOfLight --section ComputeWorkloadAnalysis --csv \
--kernel-name regex:"gemm" \
--launch-count 3 -- python script.py
Example -- occupancy investigation:
ncu --section SpeedOfLight --section LaunchStats --section Occupancy --csv \
--kernel-name regex:"small_kernel" \
-- python script.py
For visual understanding of compute vs memory balance:
ncu --section SpeedOfLight_RooflineChart \
--kernel-name regex:"KERNEL" -- COMMAND
For precision-specific hierarchical roofline:
# FP16 kernels
ncu --section SpeedOfLight_HierarchicalHalfRooflineChart \
--kernel-name regex:"KERNEL" -- COMMAND
# Tensor core kernels
ncu --section SpeedOfLight_HierarchicalTensorRooflineChart \
--kernel-name regex:"KERNEL" -- COMMAND
Interpretation: kernel left of ridge point = memory-bound; right = compute-bound;
far below both roofs = latency/occupancy issue. See references/roofline-analysis.md.
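The ridge-point rule follows from the standard roofline model, where attainable throughput is the minimum of peak compute and arithmetic intensity times peak bandwidth. The sketch below works through it by hand. The peak numbers describe a hypothetical GPU, not any real part, and the function names are illustrative.

```python
def ridge_point(peak_flops_per_s, peak_bytes_per_s):
    """Arithmetic intensity (FLOP/byte) at which the memory and compute roofs meet."""
    return peak_flops_per_s / peak_bytes_per_s


def attainable_flops(intensity, peak_flops_per_s, peak_bytes_per_s):
    """Roofline ceiling for a kernel with the given arithmetic intensity."""
    return min(peak_flops_per_s, intensity * peak_bytes_per_s)


def roofline_side(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Left of the ridge point means memory-bound; at or right of it, compute-bound."""
    intensity = flops / bytes_moved
    if intensity < ridge_point(peak_flops_per_s, peak_bytes_per_s):
        return "memory-bound"
    return "compute-bound"


# Hypothetical GPU (assumption): 10 TFLOP/s peak compute, 1 TB/s peak DRAM bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 1e12

if __name__ == "__main__":
    # float32 vector add: 1 FLOP per element, 12 bytes moved (two loads, one store).
    print(ridge_point(PEAK_FLOPS, PEAK_BW))
    print(roofline_side(1, 12, PEAK_FLOPS, PEAK_BW))
```

A kernel whose measured throughput sits far below `attainable_flops` for its intensity is the "far below both roofs" case, which points to latency or occupancy rather than either roof.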
For per-bottleneck root causes and optimizations, see references/bottleneck-guide.md.
Re-profile the same kernel after optimization:
ncu --section SpeedOfLight --csv \
--kernel-name regex:"optimized_kernel" \
--launch-count 3 \
-- python optimized_script.py
Compare: Did throughput % increase? Did duration decrease? Did the bottleneck type shift?
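The three comparison questions can be answered mechanically once both runs are reduced to a few numbers. This sketch assumes each run is summarized as a dict with the hypothetical keys `compute_pct`, `memory_pct`, and `duration_ns`. "Bottleneck type" is approximated here as whichever throughput is higher.

```python
def dominant(metrics):
    """Return which throughput is higher, as a rough stand-in for bottleneck type."""
    return "compute" if metrics["compute_pct"] >= metrics["memory_pct"] else "memory"


def compare_runs(before, after):
    """Summarize the change between two SpeedOfLight measurements of one kernel."""
    return {
        "speedup": before["duration_ns"] / after["duration_ns"],
        "compute_delta": after["compute_pct"] - before["compute_pct"],
        "memory_delta": after["memory_pct"] - before["memory_pct"],
        "bottleneck_shifted": dominant(before) != dominant(after),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    before = {"compute_pct": 40.0, "memory_pct": 70.0, "duration_ns": 1_250_000}
    after = {"compute_pct": 55.0, "memory_pct": 50.0, "duration_ns": 1_000_000}
    print(compare_runs(before, after))
```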
JIT-compiled kernels trigger autotuning on first invocation. Isolate the actual execution:
1. Run warmup iterations, then call torch.cuda.synchronize().
2. Mark the steady-state region with cudaProfilerStart()/cudaProfilerStop().
3. Pass --profile-from-start off so ncu only captures the marked region:
import torch
# kernel and inputs are placeholders for your JIT-compiled callable and its arguments
# Warmup (JIT + autotuning)
for _ in range(5):
    result = kernel(inputs)
torch.cuda.synchronize()
# Profile only steady-state
torch.cuda.cudart().cudaProfilerStart()
for _ in range(3):
    result = kernel(inputs)
torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()
ncu --profile-from-start off --section SpeedOfLight --csv \
--kernel-name regex:"target_kernel" \
--launch-count 3 -- python script.py
Alternative: use --launch-skip N to skip autotuning launches. See
references/advanced-profiling.md for NVTX range and replay mode alternatives.
Extract metrics from .ncu-rep files using the ncu_report Python module
(in extras/python/ of the Nsight Compute installation):
import ncu_report
ctx = ncu_report.load_report("report.ncu-rep")
for rng in ctx:
    for action in rng:
        name = action.name()
        compute = action["sm__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        memory = action["dram__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        duration = action["gpu__time_duration.sum"].as_uint64()
        # Simplified thresholds: anything not above 60% falls through to latency-bound
        if compute > 60:
            classification = "compute-bound"
        elif memory > 60:
            classification = "memory-bound"
        else:
            classification = "latency-bound"
        print(f"{name}: {classification} (compute={compute:.1f}%, mem={memory:.1f}%, {duration}ns)")
See references/python-report-api.md for the full API (IContext, IRange, IAction, IMetric classes).
CSV output (for scripting and automated analysis):
ncu --csv --section SpeedOfLight --kernel-name regex:"KERNEL" -- COMMAND
ncu --csv --page raw --section SpeedOfLight -- COMMAND # All metrics flat
Report files (for later analysis):
ncu -o report --section SpeedOfLight -- COMMAND
ncu --import report.ncu-rep --csv --page raw # Export to CSV
Key CSV columns:
| Column | Meaning |
|---|---|
| Kernel Name | CUDA kernel function name |
| Duration | Execution time (nanoseconds) |
| Compute (SM) Throughput | % of peak compute |
| Memory Throughput | % of peak memory bandwidth |
| Achieved Occupancy | Active warps / max warps (%) |
Success indicators:
- Results stay consistent across launches when using --launch-count > 1
ncu --section SpeedOfLight --csv \
--kernel-name regex:"gemm" \
--launch-skip 5 --launch-count 3 \
-- python train.py
Output:
"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
Interpretation: compute-bound (78.5% compute, 35.2% memory). Next step:
check tensor core usage with --section ComputeWorkloadAnalysis.
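For automated analysis, output like the example above can be loaded into a dictionary per kernel. This sketch makes two assumptions that should be checked against your ncu version. First, it accepts either the simplified one-column-per-metric layout shown above or a one-metric-per-row layout with `Metric Name` and `Metric Value` columns. Second, it treats any line starting with `==` as a profiler status line rather than CSV. `parse_sol_csv` is a hypothetical helper name.

```python
import csv
import io


def _to_number(value):
    """Convert '1,250,000' or '78.5' to float; leave non-numeric text unchanged."""
    try:
        return float(value.replace(",", ""))
    except (ValueError, AttributeError):
        return value


def parse_sol_csv(text):
    """Return {kernel name: {metric name: value}} from ncu --csv text.

    Repeated launches of the same kernel overwrite each other (last one wins).
    """
    # Assumption: profiler status lines start with "==" and are not CSV rows.
    csv_lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("==")]
    kernels = {}
    for row in csv.DictReader(io.StringIO("\n".join(csv_lines))):
        metrics = kernels.setdefault(row["Kernel Name"], {})
        if "Metric Name" in row:
            # Long layout: one metric per row.
            metrics[row["Metric Name"]] = _to_number(row["Metric Value"])
        else:
            # Wide layout: one column per metric, as in the example above.
            for column, value in row.items():
                if column != "Kernel Name":
                    metrics[column] = _to_number(value)
    return kernels


if __name__ == "__main__":
    sample = ('"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"\n'
              '"ampere_fp16_gemm",1250000,78.5,35.2\n')
    print(parse_sol_csv(sample))
```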
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
--kernel-name regex:"embedding" \
--launch-count 3 -- python train.py
Check L1/L2 cache hit rates and coalescing efficiency in output. Low hit rates suggest poor data locality; low coalescing efficiency suggests scattered access.
| Error | Cause | Fix |
|---|---|---|
| ncu: command not found | Not in PATH | export PATH=$PATH:/usr/local/cuda/bin or set $NCU |
| Permission denied | Needs elevated privileges | sudo ncu ... or --cap-add=SYS_ADMIN in containers |
| No kernels captured | Name regex doesn't match | Run without --kernel-name first to see actual names |
| Profiling extremely slow | Using --set full or many sections | Use --section SpeedOfLight only; reduce --launch-count |
| Autotuning pollutes results | JIT kernel warmup captured | Use --profile-from-start off with profiler markers |
| Metrics show 0% tensor cores | Kernel doesn't use tensor cores | Check with --section InstructionStats; verify dimensions align to 8/16 |
| Report file too large | --set full with many kernels | Use targeted sections; limit with --kernel-name and --launch-count |
| Out-of-range metric values | Async GPU activity or short kernels | Profile on isolated GPU; increase workload size |
| ncu hangs on MPI app | Dependent kernels across ranks | Use --communicator=tcp --lockstep-kernel-launch |
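For the "No kernels captured" row, a pattern can be checked offline against the kernel names from an unfiltered run before re-profiling. This sketch uses Python's `re` module with search semantics. Whether ncu's `regex:` filter behaves identically is an assumption to verify, and the kernel names below are illustrative.

```python
import re


def matching_kernels(pattern, kernel_names):
    """Return the kernel names that a --kernel-name regex would plausibly select."""
    rx = re.compile(pattern)
    return [name for name in kernel_names if rx.search(name)]


if __name__ == "__main__":
    # Names as they might appear in an unfiltered ncu run (illustrative).
    names = ["ampere_fp16_gemm", "embedding_lookup", "small_kernel"]
    print(matching_kernels("gemm", names))
    print(matching_kernels("matmul", names))  # empty list: the filter would capture nothing
```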
You are reading it now. The section-first workflow and error table above cover the most common profiling tasks. Search this file first.
Grep for keywords across references/ -- headers are grep-friendly:
- references/cli-reference.md -- Complete CLI options, filtering, output formats
- references/metrics-guide.md -- Hardware model, metric naming, key metrics
- references/sections-guide.md -- All --section names, when to use each
- references/bottleneck-guide.md -- Per-bottleneck root causes and optimization
- references/memory-analysis.md -- Memory hierarchy, cache analysis, coalescing
- references/roofline-analysis.md -- Roofline charts and interpretation
- references/advanced-profiling.md -- Replay modes, MPI, CUDA graphs, PM sampling, customization
- references/python-report-api.md -- ncu_report Python module API
How to search:
1. Grep for your keyword across references/
2. Read only the file that Grep points to
If Tiers 1-2 don't answer:
- ncu_report API
WebFetch or WebSearch these URLs for the latest content. Consider distilling new findings back into references/.