Use this skill when writing XMX DPAS instructions, loading their operands, or storing their results on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) GPUs using SYCL ESIMD. Xe2 is the GPU architecture; LNL and BMG are product names. Covers all four DPAS operand load patterns (lsc_load_2d, lsc_gather, VNNI packing), scatter/store write-back, Usage 1 vs Usage 2 orientation, and the SOA property of lsc_gather. Applicable to any kernel using DPAS: GEMM, attention, convolution, etc.
Reference for loading and storing DPAS operands on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) via SYCL ESIMD.
All patterns are validated in assets/fp16_dpas_ult.cpp (4 test cases, all PASS).
// xmx::dpas<SD, RC, Tacc, Tc, Tb, Ta>(acc, b_tile, a_tile)
// Template order: systolic depth (SD) first, then repeat count (RC).
// XE2: RC=8 fixed, SD=8 fixed (32-bit systolic depth)
// FP16: SD=8 systolic steps × 2 fp16/step = 16 K-elements per call
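The K-per-call arithmetic above can be sketched as plain host-side C++. This is a minimal sketch, not part of the ESIMD API: `k_elements_per_dpas` and `dpas_calls_for_k` are hypothetical helper names, and it assumes that each systolic step consumes 32 bits per channel for every element type (the source states this only for FP16).

```cpp
#include <cassert>

// Bits consumed per systolic step ("32-bit systolic depth" in the notes above).
constexpr int kBitsPerStep = 32;

// Hypothetical helper (not an ESIMD function): number of K-elements
// consumed by a single DPAS call for a given element width.
constexpr int k_elements_per_dpas(int systolic_depth, int elem_bits) {
    return systolic_depth * (kBitsPerStep / elem_bits);
}

// FP16 case from the notes: 8 steps x 2 fp16 per step = 16 K-elements.
static_assert(k_elements_per_dpas(8, 16) == 16, "fp16 K per DPAS call");

// Hypothetical helper: how many DPAS calls are needed to cover a K dimension.
// Assumes K is already padded to a multiple of the per-call K.
inline int dpas_calls_for_k(int k_total, int systolic_depth, int elem_bits) {
    const int k_per_call = k_elements_per_dpas(systolic_depth, elem_bits);
    assert(k_total % k_per_call == 0 && "K must be a multiple of per-call K");
    return k_total / k_per_call;
}
```

For example, an FP16 GEMM with K=64 would need `dpas_calls_for_k(64, 8, 16)` = 4 accumulating DPAS calls per output tile under these assumptions.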