Skill File

Vgpr Pressure Analysis

Name: Vgpr Pressure Analysis
Author: zhiding512

Analyze VGPR register pressure in FlyDSL GPU kernels by combining ISA metadata (spill count, scratch size) with source-level liveness analysis. Identifies which variable groups cause peak VGPR pressure, locates the spill root cause, and suggests optimization directions. Works with ISA files from /dump-ir or saved ISA snapshots. Usage: /vgpr-pressure <isa_file> [kernel_source]

zhiding512 · 0 stars · 09.04.2026

Categories: Debugging

Skill Content

Analyze VGPR register pressure in a FlyDSL GPU kernel by combining ISA metadata with source-level variable liveness analysis. Identify the root cause of VGPR spill and produce actionable optimization suggestions.

Arguments

Argument	Required	Description
`<ISA_FILE>`	Yes	Path to the final ISA assembly file (e.g., `15_final_isa.s` from `/dump-ir`)
`[KERNEL_SOURCE]`	No	Path to the kernel Python source file. If omitted, attempts auto-detection.

If no ISA file is provided, ask the user. If no kernel source is provided, try to locate it from the dump directory structure or ask the user.

Step 1: Extract ISA Hardware Metrics

1.1 Read Metadata

From the ISA file's .amdgpu_metadata YAML section, extract:

grep -E '\.(vgpr_count|vgpr_spill_count|agpr_count|sgpr_count|sgpr_spill_count|private_segment_fixed_size):' <ISA_FILE>
grep 'amdhsa.target:' <ISA_FILE>

Field	What it means
`.vgpr_count`	Allocated VGPRs (max 256 on gfx950/gfx942)
`.vgpr_spill_count`	Number of VGPR values spilled to scratch memory
`.agpr_count`	AccVGPRs used (0 on gfx950 for non-AGPR kernels)
`.sgpr_count`	Allocated SGPRs
`.private_segment_fixed_size`	Scratch memory bytes per thread
`amdhsa.target`	GPU architecture (gfx942, gfx950, etc.)
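
The same fields can be pulled out programmatically. The following is a minimal sketch, assuming the metadata appears as plain `key: value` lines as the `grep` commands above expect; the function name `read_isa_metadata` and the sample values are made up for illustration. If the file contains several kernels, only the first match per key is returned.

```python
import re

# Keys from the .amdgpu_metadata section that Step 1.1 asks for.
METADATA_KEYS = (
    "vgpr_count", "vgpr_spill_count", "agpr_count",
    "sgpr_count", "sgpr_spill_count", "private_segment_fixed_size",
)

def read_isa_metadata(isa_text):
    """Return the Step 1.1 fields found in the ISA text as a dict."""
    fields = {}
    for key in METADATA_KEYS:
        # Matches lines such as "    .vgpr_count:     256"
        match = re.search(r"\.%s:\s*(\d+)" % re.escape(key), isa_text)
        if match:
            fields[key] = int(match.group(1))
    # The target string may or may not be quoted in the YAML.
    target = re.search(r"amdhsa\.target:\s*'?([^'\s]+)", isa_text)
    if target:
        fields["target"] = target.group(1)
    return fields

# Illustrative metadata excerpt; the numbers are invented.
sample = """
amdhsa.kernels:
  - .agpr_count:     0
    .private_segment_fixed_size: 144
    .sgpr_count:     60
    .sgpr_spill_count: 0
    .vgpr_count:     256
    .vgpr_spill_count: 36
amdhsa.target:   amdgcn-amd-amdhsa--gfx950
"""
metadata = read_isa_metadata(sample)
has_vgpr_spill = metadata.get("vgpr_spill_count", 0) > 0
```

A non-zero `vgpr_spill_count` together with a non-zero `private_segment_fixed_size` is the signal that the rest of the analysis is worth running.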

1.2 Hot Loop Instruction Statistics

# Find hot loop boundaries
grep -n 'LBB0_' <ISA_FILE> | head -10
# Example: .LBB0_2 (line 450) to s_cbranch_vccnz .LBB0_2 (line 973)

# Replace START and END with actual line numbers
sed -n 'START,ENDp' <ISA_FILE> | grep -c 'scratch_load'
sed -n 'START,ENDp' <ISA_FILE> | grep -c 'scratch_store'
sed -n 'START,ENDp' <ISA_FILE> | grep -c 'buffer_load'
sed -n 'START,ENDp' <ISA_FILE> | grep -c 'v_mfma'
sed -n 'START,ENDp' <ISA_FILE> | grep -c 'v_cndmask'
sed -n 'START,ENDp' <ISA_FILE> | grep -c 's_nop'
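
The six `sed | grep -c` calls can be folded into one pass. This is a rough Python equivalent, assuming the ISA file has been read into a list of lines and that `start` and `end` are the 1-based loop boundaries found above; `count_hot_loop` is a hypothetical helper and the sample lines are simplified placeholders, not real compiler output.

```python
# Substrings counted in the hot loop, matching the grep patterns above.
PATTERNS = ("scratch_load", "scratch_store", "buffer_load",
            "v_mfma", "v_cndmask", "s_nop")

def count_hot_loop(isa_lines, start, end):
    """Count lines in [start, end] (1-based, inclusive) containing each pattern.

    Like `grep -c`, this counts matching lines, not individual occurrences.
    """
    counts = {pattern: 0 for pattern in PATTERNS}
    for line in isa_lines[start - 1:end]:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts

# Simplified placeholder lines for illustration only.
lines = [
    ".LBB0_2:",
    "\tscratch_load_dword v12, off, s32",
    "\tbuffer_load_dwordx4 v[0:3], v4, s[8:11], 0 offen",
    "\tv_mfma_f32_16x16x16_f16 v[0:3], v[4:5], v[6:7], v[0:3]",
    "\tscratch_store_dword off, v12, s32",
    "\ts_cbranch_vccnz .LBB0_2",
]
counts = count_hot_loop(lines, 1, len(lines))
```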

Step 2: Source Code Variable Grouping

2.1 Extract Tile Constants

# Search for these assignments in the kernel source:
m_repeat = tile_m // 16           # M-dimension repeat count
num_acc_n = n_per_wave // 16      # N-dimension accumulator count
k_unroll = tile_k_bytes // 64     # K-dimension unroll factor (K64 micro-steps)
sb_per_tile = tile_k // scale_block_k  # Scale blocks per tile
ku_per_sb = scale_block_k // 64   # K64 steps per scale block
num_x_loads = bytes_per_thread_x // x_load_bytes  # X tile load count

Grep(pattern="m_repeat|num_acc_n|k_unroll|sb_per_tile|ku_per_sb|num_x_loads", path=<KERNEL_SOURCE>)

2.2 Variable Categories

Category A: Loop-Carried State

Grep(pattern="init_state|init=|yield ", path=<KERNEL_SOURCE>)
Grep(pattern="_pack_loop_state|_unpack_loop_state", path=<KERNEL_SOURCE>)

Category B: Scale Values

Grep(pattern="load_scales|combined_gate|combined_up|s_a_vecs|s_w_gate|s_w_up", path=<KERNEL_SOURCE>)

Category C: X Tile Regs

Grep(pattern="load_x_tile|x_regs|num_x_loads", path=<KERNEL_SOURCE>)

Category D: B Tile Regs

Grep(pattern="load_b_tile|b_gate_tile|b_up_tile|b_tile", path=<KERNEL_SOURCE>)

Category E: MFMA Block Accumulators

Grep(pattern="block_gate_accs|block_up_accs|block_accs.*acc_init", path=<KERNEL_SOURCE>)

Category F: MFMA Temporaries

Grep(pattern="_pack128|a128|bg128|bu128|vec8_i32|vec4_i64", path=<KERNEL_SOURCE>)

Category G: LDS Load Temps

Grep(pattern="lds_load_packs|a0_prefetch", path=<KERNEL_SOURCE>)

Step 3: VGPR Cost Model

MLIR / FlyDSL Type	VGPRs	Common Usage
`vec4_f32`	4	Accumulator, scale FMA result
`i64`	2	B tile pack, a0_prefetch, LDS load pack
`f32` / `i32`	1	Scale scalar, index
`vec4_i32`	4	X tile load (buffer_load_dwordx4)
`vec8_i32`	8	gfx950 MFMA 128-bit packed input
`vec4_i64`	8	gfx950 _pack128 intermediate
`T.index`	1	Address calculation (lowered to i32/i64)
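
The VGPR column follows from bit width: one VGPR holds 32 bits per lane, so a value costs its total width divided by 32, rounded up. A small sketch of that rule, assuming type names of the form `vecN_fK` / `vecN_iK` or a bare `fK` / `iK`; the helper `vgprs_for_type` is hypothetical, and rounding sub-32-bit types up to one register is an assumption rather than something the table states.

```python
import re

VGPR_BITS = 32  # one VGPR holds 32 bits per lane

def vgprs_for_type(type_name):
    """Estimate the VGPRs one value of this type occupies."""
    if type_name == "T.index":
        return 1  # the table above counts an index as one VGPR
    match = re.fullmatch(r"(?:vec(\d+)_)?[fi](\d+)", type_name)
    if match is None:
        raise ValueError("unrecognized type: %s" % type_name)
    elements = int(match.group(1) or 1)
    total_bits = elements * int(match.group(2))
    return -(-total_bits // VGPR_BITS)  # ceiling division

costs = {name: vgprs_for_type(name)
         for name in ("vec4_f32", "i64", "f32", "i32",
                      "vec4_i32", "vec8_i32", "vec4_i64", "T.index")}
```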

Category	Variable	Count	Type	VGPRs	Reducible?
A. Loop acc	acc_gate	m_repeat × num_acc_n	vec4_f32	...	No (output)
A. Loop B tile	b_gate_flat	k_unroll × 2 × num_acc_n	i64	...	Yes (lifecycle)
A. Loop a0_pf	a0_prefetch	2	i64	...	Yes (remove)
B. Scales	combined	m_repeat × num_acc_n × 4 × 2	f32	...	Yes (lazy)
C. X tile	x_regs	num_x_loads	vec4_i32	...	No (min load)
D. B tile cur	b_gate_cur	k_unroll × 2 × num_acc_n	i64	...	Yes (per-ku)
E. Block accs	block_g/u_accs	m_repeat × num_acc_n × 2	vec4_f32	...	Yes (reuse)
F. MFMA temps	a128/bg128/bu128	3	vec8_i32	...	No (transient)
G. LDS temps	a0-a3 pre-pack	4	i64	...	No (transient)
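
To fill in the VGPRs column, multiply each count formula by the per-type cost. The sketch below does this for the rows as written in the table; the function name `group_vgprs` and the tile constants in the usage line are placeholders, not values from any particular kernel. The accumulator and B tile rows cover only the gate side, so a kernel that also carries `acc_up` or `b_up_flat` would double those entries.

```python
def group_vgprs(m_repeat, num_acc_n, k_unroll, num_x_loads):
    """VGPR cost per variable group: count formula times VGPRs per value."""
    # Per-value costs taken from the type table (VGPRs per value).
    vec4_f32, vec4_i32, vec8_i32, i64, f32 = 4, 4, 8, 2, 1
    return {
        "A. acc_gate": m_repeat * num_acc_n * vec4_f32,
        "A. b_gate_flat": k_unroll * 2 * num_acc_n * i64,
        "A. a0_prefetch": 2 * i64,
        "B. combined": m_repeat * num_acc_n * 4 * 2 * f32,
        "C. x_regs": num_x_loads * vec4_i32,
        "D. b_gate_cur": k_unroll * 2 * num_acc_n * i64,
        "E. block_g/u_accs": m_repeat * num_acc_n * 2 * vec4_f32,
        "F. a128/bg128/bu128": 3 * vec8_i32,
        "G. a0-a3 pre-pack": 4 * i64,
    }

# Placeholder tile constants for illustration.
example = group_vgprs(m_repeat=2, num_acc_n=4, k_unroll=1, num_x_loads=2)
```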

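Step 4: Liveness Analysis

Critical Points

CP1: Stage 0 Compute (while loads for Stage 1 are queued)

Alive Variable Group	Reason	VGPRs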
Loop-carried state (full)	Will become yield output	...
B tile for next stage	Just loaded, waiting for stage 1	...
X regs for next stage	Just loaded, waiting for LDS store	...
Combined scales	Needed by compute	...
Block accs	Inside compute_tile	...
MFMA temps	Inside MFMA loop	...
TOTAL		???

Finding the Peak

Peak VGPR pressure: CP1 (Stage 0 Compute) = XXX VGPRs
Hardware limit: 256 VGPRs
Overflow: XXX - 256 = YY VGPRs → causes ZZZ spills
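
Finding the peak amounts to summing the live groups at each critical point and taking the maximum. A minimal sketch, with invented per-group numbers and a hypothetical `find_peak` helper; it reports the overflow against the 256-VGPR limit and does not try to predict the resulting spill count.

```python
VGPR_LIMIT = 256

def find_peak(critical_points, limit=VGPR_LIMIT):
    """critical_points maps CP name -> {group: vgprs}.

    Returns (peak CP name, peak VGPRs, overflow above the limit).
    """
    totals = {name: sum(groups.values())
              for name, groups in critical_points.items()}
    peak_name = max(totals, key=totals.get)
    peak = totals[peak_name]
    return peak_name, peak, max(0, peak - limit)

# Invented liveness numbers for illustration only.
critical_points = {
    "CP1": {"loop-carried state": 96, "B tile next": 32, "X regs next": 8,
            "combined scales": 64, "block accs": 64, "MFMA temps": 24},
    "CP2": {"accumulators": 64, "B tile next": 32, "a0_prefetch": 4},
    "CP3": {"current accs": 64, "block accs": 64, "scales": 32,
            "MFMA temps": 24, "B tile remaining": 16},
}
peak_name, peak, overflow = find_peak(critical_points)
```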

Step 6: Report Output

# VGPR Pressure Analysis Report

## Kernel: <kernel_name>
## Target: <gpu_arch>
## Source: <kernel_source_path>
## ISA: <isa_file_path>

## 1. ISA Metrics
| Metric | Value | Status |
|--------|-------|--------|
| vgpr_count | | ⚠️/✅ |
| vgpr_spill_count | | ❌/✅ |
| agpr_count | | |
| sgpr_count | | |
| scratch_bytes | | |

### Hot Loop Instruction Distribution
| Instruction Type | Count | % of VMEM |
|-----------------|-------|-----------|
| scratch_load | | |
| scratch_store | | |
| buffer_load | | |
| v_mfma | | |
| Total VMEM ops | | |

## 2. Variable Groups & VGPR Cost
| Category | Variables | Count | Type | VGPRs | Reducible |
|----------|-----------|-------|------|-------|-----------|
| ... | ... | ... | ... | ... | Yes/No |

## 3. Peak Liveness Analysis

### Critical Point 1: <name>
| Alive Group | VGPRs |
|-------------|-------|
| ... | ... |
| **TOTAL** | **XXX** |

### Critical Point 2: <name>
...

### Peak: <CP name> = <N> VGPRs (overflow: <M>)

## 4. Optimization Suggestions (sorted by impact)

### [HIGH] OPT-A: <title> — saves ~XX VGPRs
- **What**: <description>
- **Pattern found**: <code pattern location>
- **Trade-off**: <cost>

### [MEDIUM] OPT-C: <title> — saves ~XX VGPRs
...

### Non-Reducible
- <category>: <reason>

## 5. Projected Peak After Optimization
| Scenario | Peak VGPRs | vs 256 Limit | Spill Expected |
|----------|-----------|-------------|----------------|
| Current | XXX | +YY | Yes (severe) |
| +OPT-A | ... | ... | ... |
| +OPT-B | ... | ... | ... |
| +OPT-A+B | ... | ... | No |
| +All | ... | ... | No |
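
The rows of the projection table can be generated by subtracting each optimization's estimated saving from the current peak. This sketch assumes the savings are additive and all apply at the peak critical point, which will not always hold; `project_peaks` and the numbers in the usage lines are illustrative only.

```python
def project_peaks(current_peak, savings, limit=256):
    """Apply optimizations cumulatively.

    savings is an ordered list of (name, vgprs_saved).
    Returns rows of (scenario, peak, delta vs limit, spill expected).
    """
    rows = [("Current", current_peak, current_peak - limit,
             current_peak > limit)]
    peak = current_peak
    applied = []
    for name, saved in savings:
        peak -= saved
        applied.append(name)
        rows.append(("+" + "+".join(applied), peak, peak - limit,
                     peak > limit))
    return rows

# Invented peak and savings for illustration.
rows = project_peaks(292, [("OPT-A", 36), ("OPT-B", 32)])
```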

Worked Example: MoE Blockscale Stage1 Kernel

Category	Count × Type	VGPRs	Reducible
acc_gate + acc_up	16 × vec4_f32 (8 each)	64	No
b_gate_flat + b_up_flat (loop)	16 × i64	32	Yes → OPT-A
a0_prefetch (loop)	2 × i64	4	Yes → OPT-E
combined scales	64 × f32	64	Yes → OPT-C
x_regs	2 × vec4_i32	8	No
B tile next stage	16 × i64	32	Lifecycle
block_gate/up_accs	16 × vec4_f32	64	Yes → OPT-B
MFMA temps	mixed	24	No (transient)
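
Summing the VGPRs column shows how far the kernel would be over budget if every group were live at once, and how much of that sits in groups marked reducible. The sketch below copies the numbers from the table; treating the "Lifecycle" row as not directly reducible is an assumption made here.

```python
VGPR_LIMIT = 256

# (group, VGPRs, marked reducible) copied from the table above.
groups = [
    ("acc_gate + acc_up", 64, False),
    ("b_gate_flat + b_up_flat (loop)", 32, True),
    ("a0_prefetch (loop)", 4, True),
    ("combined scales", 64, True),
    ("x_regs", 8, False),
    ("B tile next stage", 32, False),  # "Lifecycle": assumed not reducible
    ("block_gate/up_accs", 64, True),
    ("MFMA temps", 24, False),
]

total = sum(vgprs for _, vgprs, _ in groups)
reducible = sum(vgprs for _, vgprs, flag in groups if flag)
overflow = total - VGPR_LIMIT
# Largest reducible groups first: the candidates worth examining first.
ranked = sorted((g for g in groups if g[2]), key=lambda g: g[1], reverse=True)
```

With these numbers the groups add up to 292 VGPRs, 36 over the limit, and 164 of them are in reducible groups, so there is room to get under 256 without touching the output accumulators.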

Hot loop statistics table (Step 1.2)

Hot Loop Metric	Count	Note
Total lines
scratch_load		spill reads
scratch_store		spill writes
buffer_load		VMEM data loads
v_mfma		MFMA compute
v_cndmask		conditional select
s_nop		pipeline nops
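
Once the counts are filled in, the share of memory traffic caused by spilling is a useful single number. The sketch below follows the report template's "% of VMEM" column and assumes the VMEM total is the sum of scratch and buffer operations; `vmem_shares` is a hypothetical helper and the counts are invented.

```python
def vmem_shares(scratch_load, scratch_store, buffer_load):
    """Return total VMEM ops and each type's percentage of that total."""
    total = scratch_load + scratch_store + buffer_load
    if total == 0:
        return {"total": 0}

    def percent(count):
        return round(100.0 * count / total, 1)

    return {
        "total": total,
        "scratch_load": percent(scratch_load),
        "scratch_store": percent(scratch_store),
        "buffer_load": percent(buffer_load),
        "spill": percent(scratch_load + scratch_store),
    }

# Invented counts for illustration.
shares = vmem_shares(scratch_load=120, scratch_store=40, buffer_load=40)
```

A high spill share means most of the loop's memory operations move spilled registers rather than tile data.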

CP2: Yield Point

Alive Variable Group	Reason	VGPRs
acc_gate + acc_up	Output accumulators	...
B tile next (if loop-carried)	For next iteration	...
a0_prefetch (if loop-carried)	For next iteration	...
TOTAL		???

CP3: Inside MFMA + Scale FMA

Alive Variable Group	VGPRs
current_gate + current_up	...
block_gate_accs + block_up_accs	...
combined scales (current sb)	...
a128 + bg128 + bu128	...
B tile (remaining ku packs)	...
TOTAL	???

Vgpr Pressure Analysis

Arguments

Step 1: Extract ISA Hardware Metrics

1.1 Read Metadata

1.2 Hot Loop Instruction Statistics

Step 2: Source Code Variable Grouping

2.1 Extract Tile Constants

2.2 Variable Categories

Category A: Loop-Carried State

Category B: Scale Values

Category C: X Tile Regs

Category D: B Tile Regs

Category E: MFMA Block Accumulators

Category F: MFMA Temporaries

Category G: LDS Load Temps

Step 3: VGPR Cost Model

Step 4: Liveness Analysis

Critical Points

CP1: Stage 0 Compute (while loads for Stage 1 are queued)

CP2: Yield Point

CP3: Inside MFMA + Scale FMA

Finding the Peak

Step 5: Diagnosis and Optimization Suggestions

Optimization Catalog

OPT-A: Remove Non-Acc Values from Loop State (High Impact)

OPT-B: Reuse acc_init for Block Accumulators (High Impact)

OPT-C: Lazy Scale Computation (Medium Impact)

OPT-D: Per-KU B Tile Loading (Low Impact)

OPT-E: Remove Cross-Stage LDS Prefetch (Low Impact)

Generating Suggestions

Step 6: Report Output

Worked Example: MoE Blockscale Stage1 Kernel

Error Handling

Next Steps
