Name: Perf Nsight Compute Analysis
Author: NVIDIA

Perf Nsight Compute Analysis | Skills Pool

nsys

Compute %	Memory %	Bottleneck	Next Step
>60	<40	Compute-bound	ComputeWorkloadAnalysis section
<40	>60	Memory-bound	MemoryWorkloadAnalysis section
<40	<40	Latency-bound	LaunchStats + Occupancy sections
40-60	40-60	Balanced	Profile deeper with detailed sections

ncu -v
# Or: $NCU -v

ncu --section SpeedOfLight --csv \
    --kernel-name regex:"KERNEL" \
    --launch-skip 5 --launch-count 3 \
    -- COMMAND

Classification	Sections to Add
Compute-bound	`ComputeWorkloadAnalysis`
Memory-bound	`MemoryWorkloadAnalysis`
Latency-bound	`LaunchStats`, `Occupancy`
Warp stalls	`WarpStateStats`, `SchedulerStats`
Need instruction breakdown	`InstructionStats`

ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
    --kernel-name regex:"embedding_lookup" \
    --launch-count 3 \
    -- python script.py

ncu --section SpeedOfLight --section ComputeWorkloadAnalysis --csv \
    --kernel-name regex:"gemm" \
    --launch-count 3 -- python script.py

ncu --section SpeedOfLight --section LaunchStats --section Occupancy --csv \
    --kernel-name regex:"small_kernel" \
    -- python script.py

ncu --section SpeedOfLight_RooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND

# FP16 kernels
ncu --section SpeedOfLight_HierarchicalHalfRooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND

# Tensor core kernels
ncu --section SpeedOfLight_HierarchicalTensorRooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND

ncu --section SpeedOfLight --csv \
    --kernel-name regex:"optimized_kernel" \
    --launch-count 3 \
    -- python optimized_script.py

# Warmup (JIT + autotuning)
for _ in range(5):
    result = kernel(inputs)
torch.cuda.synchronize()

# Profile only steady-state
torch.cuda.cudart().cudaProfilerStart()
for _ in range(3):
    result = kernel(inputs)
    torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()

ncu --profile-from-start off --section SpeedOfLight --csv \
    --kernel-name regex:"target_kernel" \
    --launch-count 3 -- python script.py

import ncu_report

ctx = ncu_report.load_report("report.ncu-rep")
for rng in ctx:
    for action in rng:
        name = action.name()
        compute = action["sm__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        memory = action["dram__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        duration = action["gpu__time_duration.sum"].as_uint64()

        if compute > 60:
            classification = "compute-bound"
        elif memory > 60:
            classification = "memory-bound"
        else:
            classification = "latency-bound"

        print(f"{name}: {classification} (compute={compute:.1f}%, mem={memory:.1f}%, {duration}ns)")

ncu --csv --section SpeedOfLight --kernel-name regex:"KERNEL" -- COMMAND
ncu --csv --page raw --section SpeedOfLight -- COMMAND   # All metrics flat

ncu -o report --section SpeedOfLight -- COMMAND
ncu --import report.ncu-rep --csv --page raw            # Export to CSV

Column	Meaning
`Kernel Name`	CUDA kernel function name
`Duration`	Execution time (nanoseconds)
`Compute (SM) Throughput`	% of peak compute
`Memory Throughput`	% of peak memory bandwidth
`Achieved Occupancy`	Active warps / max warps (%)

ncu --section SpeedOfLight --csv \
    --kernel-name regex:"gemm" \
    --launch-skip 5 --launch-count 3 \
    -- python train.py

"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2

ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
    --kernel-name regex:"embedding" \
    --launch-count 3 -- python train.py

Error	Cause	Fix
`ncu: command not found`	Not in PATH	`export PATH=$PATH:/usr/local/cuda/bin` or set `$NCU`
`Permission denied`	Needs elevated privileges	`sudo ncu ...` or `--cap-add=SYS_ADMIN` in containers
No kernels captured	Name regex doesn't match	Run without `--kernel-name` first to see actual names
Profiling extremely slow	Using `--set full` or many sections	Use `--section SpeedOfLight` only; reduce `--launch-count`
Autotuning pollutes results	JIT kernel warmup captured	Use `--profile-from-start off` with profiler markers
Metrics show 0% tensor cores	Kernel doesn't use tensor cores	Check with `--section InstructionStats`; verify dimensions align to 8/16
Report file too large	`--set full` with many kernels	Use targeted sections; limit with `--kernel-name` and `--launch-count`
Out-of-range metric values	Async GPU activity or short kernels	Profile on isolated GPU; increase workload size
`ncu` hangs on MPI app	Dependent kernels across ranks	Use `--communicator=tcp --lockstep-kernel-launch`

Dependency	Version	Notes
CUDA Toolkit	>=11.0	Includes `ncu`
`ncu` binary	Match CUDA version	Or set `$NCU` env var
NVIDIA GPU	Kepler+	Volta+ recommended

SOL%	Level	Action
>80%	Excellent	Minor tuning only
60-80%	Good	Targeted optimization
40-60%	Fair	Significant optimization needed
<40%	Poor	Major rework needed

Tool	Scope	Overhead	Purpose
nsys	System-level	5-10%	Find which kernels to optimize
ncu	Kernel-level	10-100x slower	Understand why a kernel is slow

Perf Nsight Compute Analysis

Nsight Compute Analysis

When to Use

Perf Nsight Compute Analysis

Nsight Compute Analysis

When to Use

Requirements

Principles

Data Integrity

SOL% Mental Model

Classification Thresholds

SOL% Performance Levels

Section-First Profiling

ncu vs nsys

Workflow

Step 0: Verify ncu

Step 1: SOL% Diagnosis

Step 2: Escalate with Targeted Sections

Step 3: Roofline Analysis (Optional)

Step 4: Interpret and Optimize

Step 5: Validate

Profiling JIT-Compiled Kernels (Triton/cuTile/CuTeDSL)

Programmatic Report Analysis

Output Formats

Examples

Example: Classify a GEMM Kernel

Example: Diagnose a Memory-Bound Embedding Kernel

Error Handling

Finding More Information

Tier 1: This File (SKILL.md)

Tier 2: references/ Directory

Tier 3: Official Documentation

Session Logs

OpenClaw Test Heap Leaks

Node Connect

Openclaw Qa Testing

Openclaw Secret Scanning Maintainer

Flags