技能档案

Xe2 ESIMD GEMV Kernels — W4A16, W8A16 & FP16

Name: Xe2 ESIMD GEMV Kernels — W4A16, W8A16 & FP16
Author: ModelTC

Use this skill when writing, optimizing, benchmarking, or debugging W4A16 or W8A16 GEMV kernels targeting Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) GPU using SYCL ESIMD. Xe2 is the GPU architecture; LNL and BMG are product names. Also covers general FP16 GEMV patterns. Covers quantized weight dequantization, SIMD vs scalar interleaving, K-split SLM reduction, VL/ROWS tuning, workgroup decomposition, uint4 unpacking, FP32 accumulation, SLM barriers, performance methodology, and all hardware constraints.

ModelTC2,191 星标2026年4月17日

职业
分类: 框架内部

技能内容

Specialized knowledge for memory-bandwidth-bound quantized and FP16 GEMV (M=1 matrix-vector multiply) on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG). Reference files hold detail; this file holds critical rules and workflow.

Platform: Xe2 (BMG) — 520 GB/s DRAM, 32 XE cores x 8 EUs x 8 threads = 2048 HW threads

Achieved results: W4A16 571 GB/s (110% roofline), W8A16 552 GB/s (106% roofline)

Quick-Reference Rules (must follow every time)

Hardware limits

2048 hardware threads total (32 XE × 8 EU × 8 threads). Design WG sizes so num_groups × local_size fills this.
No doubleGRF required for GEMV — kernels are memory-bound, GRF pressure is low.
SLM: 64 KB per XE core. For GEMV with K-split: SLM_SIZE = ROWS × K_SPLIT × sizeof(float). This is tiny; no pressure.
Barriers: every thread in a WG must execute the same number of barrier() calls. Unequal counts cause GPU hang.

相关技能

Xe2 ESIMD GEMV Kernels — W4A16, W8A16 & FP16 | Skills Pool

NEVER use scalar loops to interleave lo/hi nibbles into a SIMD register. This is the single most impactful mistake.

ALWAYS use simd::template select<COUNT, STRIDE>(OFFSET) for strided writes:

// BAD  — scalar loop, 289 GB/s
for (int i = 0; i < 64; i++) {
    weight_f[base + i * 2]     = lo[i];
    weight_f[base + i * 2 + 1] = hi[i];
}

// GOOD — SIMD strided select, 571 GB/s
weight_f.template select<64, 2>(base + 0) = lo;   // even positions
weight_f.template select<64, 2>(base + 1) = hi;   // odd positions

All nibble extraction must also stay SIMD: use p & 0x0F and (p >> 4) & 0x0F directly on simd<uint8_t,64>.
Dequant arithmetic (subtract zero-point, multiply scale) must operate on full simd<float,64>.

Weight layout: [N, K/2] uint8 — two uint4 nibbles packed per byte, lo nibble = even k, hi nibble = odd k.
Scale layout: [N, K/BLOCK_SIZE] fp16 — one scale per 128-element block (BLOCK_SIZE=128).
Formula: weight_fp = (uint4_val - 8) × scale (symmetric, zero_point=8)

Block loop (unrolled, NUM_BLOCKS = VL/128):

auto p = weight_packed.template select<64, 1>(blk * 64);  // 64 bytes = 128 nibbles
simd<float, 64> lo = p & 0x0F;
simd<float, 64> hi = (p >> 4) & 0x0F;
lo = (lo - 8.0f) * sc;
hi = (hi - 8.0f) * sc;
weight_f.template select<64, 2>(blk * 128 + 0) = lo;
weight_f.template select<64, 2>(blk * 128 + 1) = hi;

slm_init(SLM_SIZE) must be the very first statement in the kernel — before any other code.
SLM layout: [ROWS][K_SPLIT] floats. Offset = (row_thread_id × K_SPLIT + k_thread_id) × sizeof(float).

Use vectorized SLM load for reduction — never scalar loop:

// K_SPLIT == 2
simd<float, 2> r = slm_block_load<float, 2>(slm_base);
final_sum = r[0] + r[1];
// K_SPLIT == 4 or 8
simd<float, K_SPLIT> r = slm_block_load<float, K_SPLIT>(slm_base);
final_sum = reduce<float>(r, std::plus<>());

Only k_thread_id == 0 threads write the final output after reduction.

icpx <src>.cpp -o <out>.exe \
  -fsycl -fsycl-targets=spir64_gen \
  -Xs "-device bmg -options -doubleGRF"

icpx <file>.cpp -o <file>.exe -fsycl -fsycl-targets=spir64_gen -Xs "-device bmg -options -doubleGRF"
powershell.exe -Command "& './<file>.exe'"

Asset	BW	Purpose
`assets/w4a16_simd_optimized.cpp`	571 GB/s	Production W4A16 — SIMD select dequant, K-split=2, ROWS=4, VL=1024; sweeps multiple configs
`assets/w8a16_nocache.cpp`	552 GB/s	Production W8A16 — simple row-parallel, VL=1024, 32 weight copies; sweeps VL

File	Contents
`references/hardware-constraints.md`	BMG thread count, SLM limits, VL limits, memory bandwidth
`references/kernel-patterns.md`	SIMD dequant patterns, K-split layout, SLM reduction code, multiple-accumulator pattern, bandwidth formula
`references/perf-testing.md`	Cache-bust boilerplate, timing harness, random init, bandwidth formula
`references/correctness-testing.md`	CPU reference pattern, thresholds, NaN check, corner cases
`references/optimization-history.md`	Full journey: scalar loop (289 GB/s) → SIMD select (571 GB/s), every experiment with results
`references/code-index.md`	Per-file annotations, parameter summary, performance ladder

Xe2 ESIMD GEMV Kernels — W4A16, W8A16 & FP16

Quick-Reference Rules (must follow every time)

Hardware limits

Xe2 ESIMD GEMV Kernels — W4A16, W8A16 & FP16

Quick-Reference Rules (must follow every time)

Hardware limits

SIMD dequant — the #1 rule (1.98× speedup)

W4A16 dequantization pattern

W8A16 dequantization pattern

K-split strategy

SLM reduction

VL (vector length) tuning

ROWS tuning

Multiple accumulators (latency hiding)

API namespace

Performance testing

Correctness testing

Compile command

Workflow

Assets (ready to compile)

Reference files

Pytorch Patterns

Regex Vs Llm Structured Text

Effect

Flags

WPF to WinUI 3 Migration Skill

At Dispatch V2