Use this skill when writing, optimizing, benchmarking, or debugging W4A16 or W8A16 GEMV kernels targeting Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) GPUs using SYCL ESIMD. Xe2 is the GPU architecture; LNL and BMG are product names. Also covers general FP16 GEMV patterns. Covers quantized weight dequantization, SIMD vs scalar interleaving, K-split SLM reduction, VL/ROWS tuning, workgroup decomposition, uint4 unpacking, FP32 accumulation, SLM barriers, performance methodology, and all hardware constraints.
Specialized knowledge for memory-bandwidth-bound quantized and FP16 GEMV (M=1 matrix-vector multiply) on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG). Reference files hold detail; this file holds critical rules and workflow.
Platform: Xe2 (BMG) — 520 GB/s DRAM, 32 Xe cores × 8 EUs × 8 threads = 2048 HW threads
Achieved results: W4A16 571 GB/s (110% roofline), W8A16 552 GB/s (106% roofline)
## Critical rules

- Occupancy: BMG has 2048 HW threads; size `num_groups × local_size` to fill this.
- GRF: GEMV kernels are memory-bound and GRF pressure is low, so `-doubleGRF` is not strictly required.
- SLM: `SLM_SIZE = ROWS × K_SPLIT × sizeof(float)`. This is tiny; no pressure.
- Barriers: every thread in a workgroup must execute the same number of `barrier()` calls. Unequal counts cause a GPU hang.
- Use `simd::template select<COUNT, STRIDE>(OFFSET)` for strided writes:
```cpp
// BAD — scalar loop, 289 GB/s
for (int i = 0; i < 64; i++) {
    weight_f[base + i * 2] = lo[i];
    weight_f[base + i * 2 + 1] = hi[i];
}
// GOOD — SIMD strided select, 571 GB/s
weight_f.template select<64, 2>(base + 0) = lo;  // even positions
weight_f.template select<64, 2>(base + 1) = hi;  // odd positions
```
## W4A16 dequantization

- Weights: `[N, K/2]` uint8 — two uint4 nibbles packed per byte; lo nibble = even k, hi nibble = odd k.
- Scales: `[N, K/BLOCK_SIZE]` fp16 — one scale per 128-element block (BLOCK_SIZE=128).
- Dequant: `weight_fp = (uint4_val - 8) × scale` (symmetric, zero_point=8).
- Compute `p & 0x0F` and `(p >> 4) & 0x0F` directly on `simd<uint8_t,64>`; assign into `simd<float,64>` to convert.
- Per-block loop (NUM_BLOCKS = VL/128):
```cpp
auto p = weight_packed.template select<64, 1>(blk * 64);  // 64 bytes = 128 nibbles
simd<float, 64> lo = p & 0x0F;
simd<float, 64> hi = (p >> 4) & 0x0F;
lo = (lo - 8.0f) * sc;
hi = (hi - 8.0f) * sc;
weight_f.template select<64, 2>(blk * 128 + 0) = lo;
weight_f.template select<64, 2>(blk * 128 + 1) = hi;
```
## W8A16 dequantization

- Weights: `[N, K]` int8.
- Scales: `[N]` fp16 — one scale per row (no blocking needed).
- Dequant: `weight_fp = int8_val × scale`.
- Load as `simd<int8_t, VL>`, convert to float, multiply by the scalar scale. No interleaving needed.

## K-split SLM reduction

- Each thread accumulates over `[k_start, k_start + K/K_SPLIT)`, stores its partial sum to SLM, then thread 0 of each slice reduces.
- `local_id = row_thread_id × K_SPLIT + k_thread_id`, `local_size = ROWS × K_SPLIT`, `num_groups = ceil(N / ROWS)`.
- `slm_init(SLM_SIZE)` must be the very first statement in the kernel — before any other code.
- SLM layout: `[ROWS][K_SPLIT]` floats. Offset = `(row_thread_id × K_SPLIT + k_thread_id) × sizeof(float)`.

```cpp
// K_SPLIT == 2
simd<float, 2> r = slm_block_load<float, 2>(slm_base);
final_sum = r[0] + r[1];

// K_SPLIT == 4 or 8
simd<float, K_SPLIT> r = slm_block_load<float, K_SPLIT>(slm_base);
final_sum = reduce<float>(r, std::plus<>());
```
- Only threads with `k_thread_id == 0` write the final output after the reduction.
- Choose ROWS so `ceil(N/ROWS) × local_size ≈ 2048` to fill all HW threads.
- Multiple accumulators: keep `simd<float, 8> partial_sums = 0.0f` and rotate the index (`acc_idx = (acc_idx + 1) & 0x7`) across K iterations; finish with `float s = reduce<float>(partial_sums, std::plus<>())`.
- ESIMD API lives in `sycl::ext::intel::esimd`: `block_load`, `block_store`, `slm_init`, `slm_block_load`, `slm_block_store`, `barrier`, `reduce`.
- Include `<sycl/sycl.hpp>` and `<sycl/ext/intel/esimd.hpp>`; `using namespace sycl::ext::intel::esimd;` simplifies calls.
- Cache busting: keep multiple weight copies and rotate with `weight_idx = i % num_copies` each iteration.
- Use GPU event profiling (`command_start`/`command_end`), not wall-clock time, for per-kernel timing.
- W4A16 traffic: `bytes = K×2 + N×(K/2) + N×(K/128)×2 + N×2` (input fp16 + weight uint8 + scale fp16 + output fp16).
- W8A16 traffic: `bytes = K×2 + N×K×1 + N×2 + N×2` (input fp16 + weight int8 + scale fp16 + output fp16).
- See `references/perf-testing.md` for harness boilerplate.
- Correctness: a value fails when `abs_diff > 1.0 && rel_error > 2%`. See `references/correctness-testing.md`.

## Build

```sh
icpx <src>.cpp -o <out>.exe \
  -fsycl -fsycl-targets=spir64_gen \
  -Xs "-device bmg -options -doubleGRF"
```
- `-O3` is optional; it does not significantly change GEMV performance (memory-bound).
- Alternate AOT target: `-fsycl-targets=intel_gpu_bmg_g21`.
- A compiler note that the kernel uses N bytes of scratch space indicates GRF pressure — reduce VL if seen.
- Build-and-run one-liner:

```sh
icpx <file>.cpp -o <file>.exe -fsycl -fsycl-targets=spir64_gen -Xs "-device bmg -options -doubleGRF"
powershell.exe -Command "& './<file>.exe'"
```
## Assets

| Asset | BW | Purpose |
|---|---|---|
| assets/w4a16_simd_optimized.cpp | 571 GB/s | Production W4A16 — SIMD select dequant, K-split=2, ROWS=4, VL=1024; sweeps multiple configs |
| assets/w8a16_nocache.cpp | 552 GB/s | Production W8A16 — simple row-parallel, VL=1024, 32 weight copies; sweeps VL |
Expected on BMG (N=8192–16384, K=4096–8192):
- w4a16_simd_optimized.exe → 571 GB/s (110% of 520 GB/s roofline)
- w8a16_nocache.exe → 552 GB/s (106% of 520 GB/s roofline)

## References

| File | Contents |
|---|---|
| references/hardware-constraints.md | BMG thread count, SLM limits, VL limits, memory bandwidth |
| references/kernel-patterns.md | SIMD dequant patterns, K-split layout, SLM reduction code, multiple-accumulator pattern, bandwidth formula |
| references/perf-testing.md | Cache-bust boilerplate, timing harness, random init, bandwidth formula |
| references/correctness-testing.md | CPU reference pattern, thresholds, NaN check, corner cases |
| references/optimization-history.md | Full journey: scalar loop (289 GB/s) → SIMD select (571 GB/s), every experiment with results |
| references/code-index.md | Per-file annotations, parameter summary, performance ladder |