Name: Xe2 (Lunar Lake/LNL, Battlemage/BMG) ESIMD GEMM Skill
Author: ModelTC

Specialized knowledge for authoring and optimizing SYCL ESIMD matrix-multiply kernels on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) architecture. Reference files hold detail; this file holds the critical rules and workflow.

Quick-Reference Rules (must follow every time)

Hardware limits

Max WG threads = 32 when doubleGRF is on (256 × 64-byte GRF = 16 KB/thread). Never set nd_range work-group size > 32. See references/hardware-constraints.md.
Always compile with doubleGRF — it is mandatory for large tile kernels. Do not remove it.
Barriers: every thread in a WG must execute the same number of barrier() calls. Unequal counts cause a GPU hang.
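
The limits above come down to simple arithmetic that can be checked on the host before a kernel is launched. The following is a minimal sketch in plain C++ with no SYCL headers; the constant and function names are hypothetical and not part of any Intel API, and the numbers are the ones stated in the rules above.

```cpp
#include <cstddef>

// Values from the hardware-limit rules; the names are illustrative only.
constexpr std::size_t kGrfRegsDoubleGrf      = 256;  // registers per thread with doubleGRF
constexpr std::size_t kGrfRegBytes           = 64;   // bytes per register
constexpr std::size_t kMaxWgThreadsDoubleGrf = 32;   // work-group thread limit with doubleGRF

// Register-file budget available to one thread.
constexpr std::size_t grf_bytes_per_thread() {
    return kGrfRegsDoubleGrf * kGrfRegBytes;
}
static_assert(grf_bytes_per_thread() == 16 * 1024, "256 x 64 B should equal 16 KB");

// Hypothetical guard to call before constructing the nd_range.
constexpr bool wg_size_ok(std::size_t wg_threads) {
    return wg_threads >= 1 && wg_threads <= kMaxWgThreadsDoubleGrf;
}
```

A launcher could reject a configuration when `wg_size_ok(local_size)` is false instead of letting an oversized work-group reach the driver.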

Inner-loop rules

No if, no ?: inside the K-loop body. Move all runtime conditionals to the host or split the loop into phases.
Use induction variables instead of recomputing full expressions each iteration. This saves XVE ALU ops.
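
The two inner-loop rules can be illustrated with a scalar host-side model. This is a sketch only: the loops stand in for a K-loop, the function names are hypothetical, and a real ESIMD kernel would issue block loads and DPAS where this code reads an array element.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Branch inside the loop body: the condition is re-evaluated every iteration.
inline std::uint64_t sum_branchy(const std::vector<std::uint32_t>& a,
                                 std::uint32_t k_iters, std::uint32_t split) {
    std::uint64_t acc = 0;
    for (std::uint32_t k = 0; k < k_iters; ++k) {
        if (k < split) acc += a[k];
        else           acc += 2ull * a[k];
    }
    return acc;
}

// Same result with the loop split into two branch-free phases.
inline std::uint64_t sum_phased(const std::vector<std::uint32_t>& a,
                                std::uint32_t k_iters, std::uint32_t split) {
    const std::uint32_t phase1_end = std::min(split, k_iters);
    std::uint64_t acc = 0;
    for (std::uint32_t k = 0; k < phase1_end; ++k)       acc += a[k];
    for (std::uint32_t k = phase1_end; k < k_iters; ++k) acc += 2ull * a[k];
    return acc;
}

// Full offset expression recomputed each iteration (one multiply and one add).
inline std::uint64_t offset_recompute(const std::vector<std::uint32_t>& a,
                                      std::uint32_t base, std::uint32_t k_step,
                                      std::uint32_t k_iters) {
    std::uint64_t acc = 0;
    for (std::uint32_t k = 0; k < k_iters; ++k) {
        const std::uint32_t off = base + k * k_step;
        acc += a[off];
    }
    return acc;
}

// Induction variable: the offset is carried across iterations (one add).
inline std::uint64_t offset_induction(const std::vector<std::uint32_t>& a,
                                      std::uint32_t base, std::uint32_t k_step,
                                      std::uint32_t k_iters) {
    std::uint64_t acc = 0;
    std::uint32_t off = base;
    for (std::uint32_t k = 0; k < k_iters; ++k) {
        acc += a[off];
        off += k_step;
    }
    return acc;
}
```

Each pair computes the same value; the second form of each pair does less work per iteration, which is the property the rules aim for.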

Sample code (ready to compile)

Asset	TFLOPS	Purpose
`assets/fp16_gemm_nopf_v2.cpp`	~117	Best kernel (current): B_T[K,N] layout, correct `a_tile`/`b_tile` naming, payload CSE
`assets/fp16_gemm_gather_v2.cpp`	~114	Gather variant (current): B[N,K] layout (no transpose), `lsc_gather<u32,8,N=16>` for b_tile
`assets/fp16_gemm_nopf.cpp`	117.10	Original nopf: old `aa`/`bb` naming (see _v2 for corrected names)
`assets/fp16_gemm_nopf3.cpp`	117.44	Highest measured: induction-variable XVE reduction; also tests L1UC (43.9 TFLOPS)
`assets/fp16_gemm_nopf_verify.cpp`	n/a	Correctness checker (M=N=K=256, CPU ref). Run before benchmarking.
`assets/fp16_gemm_noif.cpp`	109.55	Pre-optimization baseline showing 40% XVE problem (inline descriptor rebuild)
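
The table describes the correctness checker as a comparison against a CPU reference. As a rough illustration of what such a check might look like (not the contents of `assets/fp16_gemm_nopf_verify.cpp`), here is a sketch in standard C++. It assumes row-major A[M,K] and B_T[K,N], uses `float` in place of fp16 because standard C++ has no half type, and the tolerance is an arbitrary placeholder; all function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference C[M,N] = A[M,K] * B_T[K,N], all row-major, accumulated in float.
inline std::vector<float> gemm_ref(const std::vector<float>& a,
                                   const std::vector<float>& b_t,
                                   std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t k = 0; k < K; ++k) {
            const float a_mk = a[m * K + k];
            for (std::size_t n = 0; n < N; ++n)
                c[m * N + n] += a_mk * b_t[k * N + n];
        }
    }
    return c;
}

// True when every element is finite and within a relative tolerance of the reference.
inline bool matches_ref(const std::vector<float>& got,
                        const std::vector<float>& ref,
                        float rel_tol = 1e-2f) {  // placeholder tolerance, an assumption
    if (got.size() != ref.size()) return false;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (!std::isfinite(got[i])) return false;  // rejects NaN and Inf
        const float scale = std::fmax(1.0f, std::fabs(ref[i]));
        if (std::fabs(got[i] - ref[i]) > rel_tol * scale) return false;
    }
    return true;
}
```

In use, the GPU result would be copied back to the host and passed as `got`, with `gemm_ref` supplying `ref` for the same inputs.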

Reference files

File	Contents
`references/code-index.md`	Per-file annotations, key patterns, performance ladder, failed experiments
`references/hardware-constraints.md`	Xe2/BMG GRF, L1, SLM, WG, barrier limits
`references/kernel-patterns.md`	DPAS tile layout, VNNI packing, double-buffer pattern, payload CSE code
`references/lsc-memory-ops.md`	Full LSC API: `lsc_load_2d`, `lsc_store_2d`, `lsc_prefetch_2d`, `lsc_gather`, `lsc_scatter`, `config_2d_mem_access`, cache hints
`references/perf-testing.md`	Cache-bust boilerplate, timing harness, random init, NaN check
`references/optimization-history.md`	Exhaustive record of every optimization tried on this GEMM with TFLOPS results
