This skill should be used when writing, optimizing, benchmarking, or debugging high-performance FP16 GEMM kernels (or similar dense-compute SYCL ESIMD kernels) targeting Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) GPU. Xe2 is the GPU architecture; LNL and BMG are product names. Covers ESIMD API, XMX DPAS, hardware constraints, performance methodology, optimization patterns, and known pitfalls.
Specialized knowledge for authoring and optimizing SYCL ESIMD matrix-multiply kernels on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) architecture. Reference files hold detail; this file holds the critical rules and workflow.
doubleGRF is on (256 × 64-byte GRF = 16 KB/thread). Never set nd_range work-group size > 32. See references/hardware-constraints.md.doubleGRF — it is mandatory for large tile kernels. Do not remove it.barrier() calls. Unequal counts cause GPU hang.if, no ?: inside the K-loop body. Move all runtime conditionals to the host or split the loop into phases.index += incrementssycl::ext::intel::experimental::esimd (xesimd) for lsc_load_2d, lsc_prefetch_2d, config_2d_mem_access, xmx::dpas.lsc_prefetch_2d does not accept a payload object (config_2d_mem_access) — use the inline 6-argument form: lsc_prefetch_2d<T,BW,BH,N,L1H,L2H>(ptr, surfW, surfH, surfPitch, x, y).lsc_load_2d supports the payload object API (config_2d_mem_access) with set_x() / set_y().config_2d_mem_access once outside the K-loop with static fields (ptr, surfW, surfH, surfPitch). Only call set_x() / set_y() inside the loop. This eliminates per-call descriptor rebuilds and dramatically reduces XVE activity.references/kernel-patterns.md for tile sizes, VNNI packing, double-buffer pattern, and dpas call signature.references/perf-testing.md for boilerplate.rel_rms < 0.5% as pass criterion.wg_m = z_deinterleave_even(wg_id), wg_n = z_deinterleave_odd(wg_id). Keep index compute light (bit ops only).icpx <src>.cpp -o <out>.exe \
-fsycl -fsycl-targets=spir64_gen \
-Xs "-device bmg -options -doubleGRF"
Do not use -fsycl-targets=intel_gpu_bmg_g21 (wrong target string).
Do not use -O3 or -doubleGRF as top-level flags — they are ignored or warn.
if) whose condition can be evaluated on the host. Move the branch outside the SYCL kernel (template or host-side dispatch) to eliminate the SPIR-V link-time issue.warning: ... spilled ... bytes). Spill > 0 usually means GRF budget exceeded — reduce tile size or payload count.--collect gpu-hotspots to measure XVE ALU2 %. Target < 5% XVE for DPAS-bound kernel.Copy assets to a working directory, then compile and run:
icpx <file>.cpp -o <file>.exe -fsycl -fsycl-targets=spir64_gen -Xs "-device bmg -options -doubleGRF"
powershell.exe -Command "& './<file>.exe'"
| Asset | TFLOPS | Purpose |
|---|---|---|
assets/fp16_gemm_nopf_v2.cpp | ~117T | Best kernel (current) — B_T[K,N] layout, correct a_tile/b_tile naming, payload CSE |
assets/fp16_gemm_gather_v2.cpp | ~114T | Gather variant (current) — B[N,K] layout (no transpose), lsc_gather<u32,8,N=16> for b_tile |
assets/fp16_gemm_nopf.cpp | 117.10 | Original nopf — old aa/bb naming (see _v2 for corrected names) |
assets/fp16_gemm_nopf3.cpp | 117.44 | Highest measured — induction-var XVE reduction; also tests L1UC (43.9T) |
assets/fp16_gemm_nopf_verify.cpp | — | Correctness checker (M=N=K=256, CPU ref). Run before benchmarking. |
assets/fp16_gemm_noif.cpp | 109.55 | Pre-optimization baseline showing 40% XVE problem (inline descriptor rebuild) |
To reproduce all results in order, see references/code-index.md.
| File | Contents |
|---|---|
references/code-index.md | Per-file annotations, key patterns, performance ladder, failed experiments |
references/hardware-constraints.md | Xe2/BMG GRF, L1, SLM, WG, barrier limits |
references/kernel-patterns.md | DPAS tile layout, VNNI packing, double-buffer pattern, payload CSE code |
references/lsc-memory-ops.md | Full LSC API: lsc_load_2d, lsc_store_2d, lsc_prefetch_2d, lsc_gather, lsc_scatter, config_2d_mem_access, cache hints |
references/perf-testing.md | Cache-bust boilerplate, timing harness, random init, NaN check |
references/optimization-history.md | Exhaustive record of every optimization tried on this GEMM with TFLOPS results |