Two progressive pipeline stages that overlap global memory loads, LDS reads, and MFMA compute to hide memory latency. Apply Stage 1 first; verify it is effective before proceeding to Stage 2.
Both stages support CDNA3 and CDNA4. Each stage has one unified code template; the only arch-specific difference is the global load mechanism (a few marked lines to swap).
# Identify architecture
python3 -c "import torch; props = torch.cuda.get_device_properties(0); print(props.gcnArchName)"
# Find most recent compiled kernel ISA
find ~/.triton/cache -name "*.amdgcn" | xargs ls -lt | head -5
# Count stall signals in the hot loop
grep -c "s_waitcnt vmcnt(0)" <path>.amdgcn # Stage 1 signal: HBM/DMA latency exposed
grep -c "s_waitcnt lgkmcnt(0)" <path>.amdgcn # Stage 2 signal: LDS read latency exposed
| gcnArchName | Architecture | async_copy | Global load mechanism |
|---|---|---|---|
| gfx942 | CDNA3 | No | buffer_load → VGPR → ds_write |
| gfx950 | CDNA4 | Yes | async_copy.buffer_load_to_shared |
MI300X, MI308X, and MI325X are all gfx942 (CDNA3). MI350 is gfx950 (CDNA4).
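The table above can be encoded as a small dispatch helper. This is a sketch, not part of the kernel: `has_async_copy` is a hypothetical name, and the `gcnArchName` string is assumed to carry optional feature flags (e.g. `gfx942:sramecc+:xnack-`) that must be stripped before comparing.

```python
def has_async_copy(gcn_arch_name: str) -> bool:
    """True when the arch supports async_copy (CDNA4 / gfx950)."""
    # gcnArchName may include feature flags, e.g. "gfx942:sramecc+:xnack-"
    base = gcn_arch_name.split(":")[0]
    if base == "gfx950":   # CDNA4: MI350
        return True
    if base == "gfx942":   # CDNA3: MI300X / MI308X / MI325X
        return False
    raise ValueError(f"unsupported arch for this guide: {base}")

print(has_async_copy("gfx942:sramecc+:xnack-"))  # → False
```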
| Signal in amdgcn | Root cause | Fix |
|---|---|---|
| s_waitcnt vmcnt(0) before ds_write (CDNA3) | HBM load latency | Stage 1 |
| s_waitcnt vmcnt(0) before MFMA (CDNA4) | DMA latency | Stage 1 |
| s_waitcnt lgkmcnt(0) before MFMA | LDS read latency | Stage 2 |
If MFMA utilization is already > 85%, the kernel is compute-bound — skip both stages.
After applying bank-conflict-free LDS layouts, two independent latency sources remain:

1. Global load latency: a buffer_load/async_copy issues, and the data lands in LDS hundreds of cycles later.
2. LDS read latency: a ds_read issues, and the data lands in VGPRs tens of cycles later.

Stage 1 hides (1). Stage 2 hides (2). When both are applied:
Time → [global load for k+2] ────────────────────────────────────────▶
[ds_read for k+1] ──────────────▶
[MFMA for k] ──────────▶
On CDNA3, Stage 1 holds two full tiles in VGPRs simultaneously. This can reduce occupancy. After Stage 1, check:
# Inspect the compiled ISA
grep "NumVgprs:" <path>.amdgcn
# occupancy (waves/SIMD) = 512 // (ceil(NumVgprs / 8) * 8)   [gfx942: 512 VGPRs/SIMD, allocation granularity 8]
If VGPRs increased so much that occupancy dropped from 2→1 waves/SIMD and the kernel is slower, revert Stage 1 and document. Stage 2 can still be attempted if lgkmcnt stalls dominate without Stage 1 in place.
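The occupancy formula above, as a runnable sketch (`waves_per_simd` is a hypothetical helper; 512 VGPRs/SIMD and a granularity of 8 are the gfx942 values quoted in the comment):

```python
import math

def waves_per_simd(num_vgprs: int, vgprs_per_simd: int = 512, granularity: int = 8) -> int:
    # VGPR allocation is rounded up to the granularity before dividing.
    allocated = math.ceil(num_vgprs / granularity) * granularity
    return vgprs_per_simd // allocated

# Doubling tile residency from 128 to 256 VGPRs halves occupancy from 4 to 2;
# one VGPR past 256 drops it to 1.
print(waves_per_simd(128), waves_per_simd(256), waves_per_simd(257))  # → 4 2 1
```

This is why Stage 1 on CDNA3 can backfire: holding two tiles in VGPRs can push the allocation across one of these cliffs.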
Allocate two LDS buffers and pipeline the global load for tile k+1 so it runs
concurrently with the MFMA for tile k. The key difference from the baseline: each
iteration issues the next tile's global load before computing on the current tile,
instead of loading and computing serially.
The template is identical for CDNA3 and CDNA4; only the marked blocks differ.
nBuffers: gl.constexpr = 2
smemA = gl.allocate_shared_memory(
    a_ptr.type.element_ty, [nBuffers, BLOCK_M, BLOCK_K], layout=sharedLayoutA
)
smemB = gl.allocate_shared_memory(
    b_ptr.type.element_ty, [nBuffers, BLOCK_K, BLOCK_N], layout=sharedLayoutB
)
iterMax = gl.cdiv(K, BLOCK_K)
gl.assume(iterMax > 0)
g_idx = 0
# ── CDNA4 ──────────────────────────────────────────────────────────────────
gl.amd.cdna4.async_copy.buffer_load_to_shared(smemA.index(g_idx), a_base, a_offsets)
gl.amd.cdna4.async_copy.buffer_load_to_shared(smemB.index(g_idx), b_base, b_offsets)
gl.amd.cdna4.async_copy.commit_group()
# ── CDNA3 (replace the three lines above with these four) ──────────────────
vgpr_a = gl.amd.cdna3.buffer_load(ptr=a_ptr, offsets=a_offsets)
vgpr_b = gl.amd.cdna3.buffer_load(ptr=b_ptr, offsets=b_offsets)
smemA.index(g_idx).store(vgpr_a) # ds_write; s_waitcnt vmcnt(0) fires here
smemB.index(g_idx).store(vgpr_b)
# ───────────────────────────────────────────────────────────────────────────
a_base += BLOCK_K * stride_ak
b_base += BLOCK_K * stride_bk
for k in range(0, iterMax - 1):
    l_idx = k % 2      # LDS slot holding the tile to compute NOW
    g_idx = 1 - l_idx  # LDS slot to load the NEXT tile into
    # Issue global load for tile k+1 (non-blocking)
    # ── CDNA4 ──────────────────────────────────────────────────────────────
    gl.amd.cdna4.async_copy.buffer_load_to_shared(smemA.index(g_idx), a_base, a_offsets)
    gl.amd.cdna4.async_copy.buffer_load_to_shared(smemB.index(g_idx), b_base, b_offsets)
    gl.amd.cdna4.async_copy.commit_group()
    gl.amd.cdna4.async_copy.wait_group(1)  # allow 1 DMA in-flight while we compute
    # ── CDNA3 (replace the four lines above with these two) ────────────────
    vgpr_a = gl.amd.cdna3.buffer_load(ptr=a_ptr, offsets=a_offsets)
    vgpr_b = gl.amd.cdna3.buffer_load(ptr=b_ptr, offsets=b_offsets)
    # ───────────────────────────────────────────────────────────────────────
    # LDS read + MFMA for tile k (overlaps with in-flight global load above)
    a = smemA.index(l_idx).load(layout=dotOpLayoutA)
    b = smemB.index(l_idx).load(layout=dotOpLayoutB)
    acc = gl.amd.cdna3.mfma(a, b, acc)
    # Write tile k+1 into LDS (vmcnt stall hidden behind MFMA above)
    # ── CDNA4: no ds_write needed — async_copy already landed in LDS ───────
    # ── CDNA3 (add these two lines after MFMA) ─────────────────────────────
    smemA.index(g_idx).store(vgpr_a)
    smemB.index(g_idx).store(vgpr_b)
    # ───────────────────────────────────────────────────────────────────────
    a_base += BLOCK_K * stride_ak
    b_base += BLOCK_K * stride_bk
# ── CDNA4 ──────────────────────────────────────────────────────────────────
gl.amd.cdna4.async_copy.wait_group(0)
# ── CDNA3 (no extra wait needed — vmcnt(0) already fired in last loop iter) ──
# ───────────────────────────────────────────────────────────────────────────
l_idx = (iterMax - 1) % 2
a = smemA.index(l_idx).load(layout=dotOpLayoutA)
b = smemB.index(l_idx).load(layout=dotOpLayoutB)
acc = gl.amd.cdna3.mfma(a, b, acc)
Run this after implementing Stage 1. Do not proceed to Stage 2 unless Stage 1 passes both checks.
import torch, importlib.util
def load_kernel(path, name):
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod
baseline = load_kernel("kernel_baseline.py", "baseline")
stage1 = load_kernel("kernel_stage1.py", "stage1")
# Use the actual input shapes from the target workload (define B, M, N, K first)
x = torch.randn(B, M, K, dtype=torch.bfloat16, device="cuda")
w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")
c_ref = baseline.launcher(x, w)
c_new = stage1.launcher(x, w)
assert torch.allclose(c_ref, c_new, atol=1.0, rtol=0), \
    f"Stage 1 FAILED: max diff = {(c_ref - c_new).abs().max().item()}"
print("Stage 1 correctness OK")
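Alongside the correctness check, a quick wall-clock comparison feeds the outcomes table. A minimal sketch, assuming the launcher signatures from the script above; `bench` is a hypothetical helper, and for GPU launchers you must pass `sync=torch.cuda.synchronize` (CUDA events or Triton's `do_bench` give tighter numbers):

```python
import time

def bench(fn, *args, warmup=10, iters=50, sync=lambda: None):
    # Warm up (absorbs JIT compilation on the first call), then time `iters` calls.
    for _ in range(warmup):
        fn(*args)
    sync()                          # drain pending GPU work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    sync()                          # wait for the last launch to finish
    return (time.perf_counter() - t0) * 1e3 / iters  # mean ms per call

# e.g.: bench(stage1.launcher, x, w, sync=torch.cuda.synchronize)
```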
Use /kernel-perf-analysis in Mode 3 after Stage 1 to measure MFMA efficiency
and confirm that vmcnt stalls are hidden. Mode 3 is triggered by mentioning "ATT",
"trace", "MFMA efficiency", "vmcnt", or "bottleneck":
/kernel-perf-analysis
Kernel file: <absolute path to stage1_kernel.py>
Mode hint: ATT trace MFMA efficiency vmcnt bottleneck
Label: stage1_global_prefetch
The skill collects ATT traces and prints:
MFMA efficiency : 72.4% (target > 80%; was ~57% before Stage 1)
Avg iteration cycles : 210.3
Time distribution : prologue=1.8%, loop=97.1%, epilogue=1.1%
Also check ISA stall counts:
stage1_isa=$(find ~/.triton/cache -name "*.amdgcn" | xargs ls -lt | head -1 | awk '{print $NF}')
echo "VGPRs:"; grep "NumVgprs:" "$stage1_isa"
echo "vmcnt(0) count:"; grep -c "s_waitcnt vmcnt(0)" "$stage1_isa"
echo "lgkmcnt(0) count:"; grep -c "s_waitcnt lgkmcnt(0)" "$stage1_isa"
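The two stall greps recur at every checkpoint, so a tiny helper keeps the comparisons uniform (a sketch; `stall_counts` is a hypothetical name):

```shell
stall_counts() {
  # Print both stall-signal counts for one .amdgcn file
  printf 'vmcnt(0):   %s\n' "$(grep -c 's_waitcnt vmcnt(0)' "$1")"
  printf 'lgkmcnt(0): %s\n' "$(grep -c 's_waitcnt lgkmcnt(0)' "$1")"
}
# Usage: stall_counts "$stage1_isa"
```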
| Outcome | Action |
|---|---|
| Faster and vmcnt(0) count dropped | ✅ Stage 1 succeeded — proceed to Stage 2 |
| Slower, CDNA3, VGPRs increased significantly | ❌ Revert. Check if occupancy dropped (waves/SIMD fell). Document and stop. |
| Slower, CDNA4 | ❌ Revert. Kernel is likely compute-bound or already has good HW prefetch. Stop. |
| Same speed, lgkmcnt(0) count is high | ⚠️ Stage 1 neutral — LDS latency dominates. Proceed to Stage 2 anyway. |
Architecture-independent: Stage 2 is purely a register-scheduling change on top of
Stage 1; the code it adds has no async_copy requirement of its own.
After Stage 1, ds_read may still stall before MFMA. Fix: issue ds_read for tile
k+1 at the end of iteration k, so by the time iteration k+1 reaches MFMA
the data is already in registers.
Stage 1 only: wait(1) → ds_read k → lgkmcnt stall → MFMA k
With Stage 2: MFMA k uses registers pre-loaded at end of iteration k-1 — no stall

This requires extending the prologue to load two tiles and pre-read the first into registers before the loop begins.
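The register rotation can be illustrated with a plain-Python analogue. No GPU semantics here: `tiles`, `compute`, and `pipelined` are stand-ins for LDS slots, MFMA, and the loop structure.

```python
def pipelined(tiles, compute):
    # Software-pipelined consume: pre-read tile 0 in the prologue, then each
    # iteration computes on the previously read value while reading the next
    # tile (on hardware, the read overlaps the compute).
    out = []
    a = tiles[0]                  # prologue: pre-read before the loop
    for k in range(len(tiles) - 1):
        out.append(compute(a))    # uses data read in the previous iteration
        a = tiles[k + 1]          # read tile k+1 for the next iteration
    out.append(compute(a))        # epilogue: final tile
    return out
```

The shape mirrors the template below: prologue pre-read, a loop whose compute never waits on the read it just issued, and an epilogue MFMA.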
The template is identical for CDNA3 and CDNA4; only the marked blocks differ (same mechanism as Stage 1).
iterMax = gl.cdiv(K, BLOCK_K)
gl.assume(iterMax > 1)
## --- Tile 0 → LDS[0] ---
g_idx = 0
# ── CDNA4 ──────────────────────────────────────────────────────────────────
gl.amd.cdna4.async_copy.buffer_load_to_shared(smemA.index(g_idx), a_base, a_offsets)
gl.amd.cdna4.async_copy.buffer_load_to_shared(smemB.index(g_idx), b_base, b_offsets)
gl.amd.cdna4.async_copy.commit_group()
# ── CDNA3 ──────────────────────────────────────────────────────────────────
vgpr_a = gl.amd.cdna3.buffer_load(ptr=a_ptr, offsets=a_offsets)
vgpr_b = gl.amd.cdna3.buffer_load(ptr=b_ptr, offsets=b_offsets)
smemA.index(g_idx).store(vgpr_a)
smemB.index(g_idx).store(vgpr_b)
# ───────────────────────────────────────────────────────────────────────────
a_base += BLOCK_K * stride_ak
b_base += BLOCK_K * stride_bk
## --- Tile 1 → LDS[1] ---
g_idx = 1
# ── CDNA4 ──────────────────────────────────────────────────────────────────
gl.amd.cdna4.async_copy.buffer_load_to_shared(smemA.index(g_idx), a_base, a_offsets)
gl.amd.cdna4.async_copy.buffer_load_to_shared(smemB.index(g_idx), b_base, b_offsets)
gl.amd.cdna4.async_copy.commit_group()
gl.amd.cdna4.async_copy.wait_group(1) # wait for tile 0 only; tile 1 still in-flight
# ── CDNA3 ──────────────────────────────────────────────────────────────────
vgpr_a = gl.amd.cdna3.buffer_load(ptr=a_ptr, offsets=a_offsets)
vgpr_b = gl.amd.cdna3.buffer_load(ptr=b_ptr, offsets=b_offsets)
smemA.index(g_idx).store(vgpr_a) # vmcnt stall fires here for tile 1
smemB.index(g_idx).store(vgpr_b)
# ───────────────────────────────────────────────────────────────────────────
a_base += BLOCK_K * stride_ak
b_base += BLOCK_K * stride_bk
## --- Pre-read tile 0 from LDS[0] into registers ---
# (tile 0 is guaranteed ready; tile 1 DMA/vmcnt may still be in-flight — that's fine)
a = smemA.index(0).load(layout=dotOpLayoutA)
b = smemB.index(0).load(layout=dotOpLayoutB)
for k in range(0, iterMax - 1):
    g_idx = k % 2      # LDS slot to write tile k+2 into (was consumed at iter k-1)
    l_idx = 1 - g_idx  # LDS slot holding tile k+1 (ready to read)
    ## MFMA on pre-loaded registers — no lgkmcnt stall
    acc = gl.amd.cdna3.mfma(a, b, acc)
    ## Issue global load for tile k+2 (non-blocking; skipped on the last loop iteration)
    if k < iterMax - 2:
        # ── CDNA4 ──────────────────────────────────────────────────────────
        gl.amd.cdna4.async_copy.buffer_load_to_shared(smemA.index(g_idx), a_base, a_offsets)
        gl.amd.cdna4.async_copy.buffer_load_to_shared(smemB.index(g_idx), b_base, b_offsets)
        gl.amd.cdna4.async_copy.commit_group()
        gl.amd.cdna4.async_copy.wait_group(1)  # tile k+1 has landed; tile k+2 stays in-flight
        # ── CDNA3 (replace the four lines above with these two) ────────────
        vgpr_a = gl.amd.cdna3.buffer_load(ptr=a_ptr, offsets=a_offsets)
        vgpr_b = gl.amd.cdna3.buffer_load(ptr=b_ptr, offsets=b_offsets)
        # ───────────────────────────────────────────────────────────────────
        a_base += BLOCK_K * stride_ak
        b_base += BLOCK_K * stride_bk
    else:
        # ── CDNA4 only (CDNA3: delete this else branch) ────────────────────
        gl.amd.cdna4.async_copy.wait_group(0)  # final tile must land before its ds_read
    ## Write tile k+2 into LDS (vmcnt stall hidden behind MFMA above)
    if k < iterMax - 2:
        # ── CDNA3 only ─────────────────────────────────────────────────────
        smemA.index(g_idx).store(vgpr_a)
        smemB.index(g_idx).store(vgpr_b)
        # ── CDNA4: async_copy already landed tile k+2 in LDS — nothing to do
    ## ds_read tile k+1 into registers for next MFMA (overlaps with ds_write above)
    a = smemA.index(l_idx).load(layout=dotOpLayoutA)
    b = smemB.index(l_idx).load(layout=dotOpLayoutB)
## Final MFMA — a, b were pre-loaded at the end of the last loop iteration
acc = gl.amd.cdna3.mfma(a, b, acc)
Run this after implementing Stage 2. Compare against the Stage 1 kernel (not the original baseline) to isolate Stage 2's contribution.
stage2 = load_kernel("kernel_stage2.py", "stage2")  # reuse load_kernel, x, w from the Stage 1 script
c_s1 = stage1.launcher(x, w)
c_s2 = stage2.launcher(x, w)
assert torch.allclose(c_s1, c_s2, atol=1.0, rtol=0), \
    f"Stage 2 FAILED: max diff = {(c_s1 - c_s2).abs().max().item()}"
print("Stage 2 correctness OK")
Use /kernel-perf-analysis in Mode 3 after Stage 2 to confirm that lgkmcnt
stalls are now hidden and MFMA efficiency improved further. Compare against the
Stage 1 ATT result:
/kernel-perf-analysis
Kernel file: <absolute path to stage2_kernel.py>
Mode hint: ATT trace MFMA efficiency lgkmcnt bottleneck
Label: stage2_local_prefetch
Expected output showing improvement over Stage 1:
MFMA efficiency : 84.1% (was ~72% after Stage 1; target > 80%)
Avg iteration cycles : 178.6
Time distribution : prologue=1.5%, loop=97.9%, epilogue=0.6%
Also check ISA stall counts to confirm lgkmcnt(0) dropped:
stage2_isa=$(find ~/.triton/cache -name "*.amdgcn" | xargs ls -lt | head -1 | awk '{print $NF}')
echo "VGPRs:"; grep "NumVgprs:" "$stage2_isa"
echo "vmcnt(0) count:"; grep -c "s_waitcnt vmcnt(0)" "$stage2_isa"
echo "lgkmcnt(0) count:"; grep -c "s_waitcnt lgkmcnt(0)" "$stage2_isa"
# Confirm lgkmcnt(0) count dropped compared to Stage 1
| Outcome | Action |
|---|---|
| MFMA efficiency > 80% and lgkmcnt(0) count dropped | ✅ Stage 2 succeeded — keep it |
| Slower or same speed | ❌ Revert to Stage 1. Compiler already schedules ds_read well, or kernel is MFMA-bound. Document. |
| VGPRs increased, CDNA3 only | ⚠️ Check occupancy. If waves/SIMD dropped, revert. |
| Stage | Hides | Mechanism | Expected Speedup |
|---|---|---|---|
| Stage 1 (CDNA4) | DMA latency (~200–800 cy) | wait_group(1) overlaps DMA with MFMA | 15–40% |
| Stage 1 (CDNA3) | HBM latency (~200–800 cy) | buffer_load into VGPR, vmcnt hidden behind MFMA | 5–25% (VGPR-dependent) |
| Stage 2 (both) | LDS read latency (~40–100 cy) | ds_read one iter ahead; MFMA uses pre-loaded registers | 5–20% additional |
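If the two stages compose roughly multiplicatively (an assumption; in practice they overlap inside the same loop body, so the true gain can be less), the table's ranges bound the combined CDNA4 speedup:

```python
# Ranges taken from the summary table above
s1 = (1.15, 1.40)   # Stage 1 on CDNA4
s2 = (1.05, 1.20)   # Stage 2 additional
print(f"combined: {s1[0] * s2[0]:.2f}x – {s1[1] * s2[1]:.2f}x")  # → combined: 1.21x – 1.68x
```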