Use this skill when implementing, optimizing, or debugging quantized GEMM kernels using oneDNN on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) or newer Intel XPU. Xe2 is the GPU architecture; LNL and BMG are product names. Covers FP16/BF16 x FP8_E4M3 with per-N scale, FP16 x FP8 with block-wise scale along K, FP16 x INT4 (U4) with block-wise scale + zero-point, 2D block quantization emulation via repeat_interleave, bias fusion, and the critical API differences between set_scales_mask (JIT) vs set_scales (ref fallback). Use whenever the user mentions oneDNN FP8 GEMM, quantized matmul, W8A16, W4A16, per-N scale, block-wise FP8, block-wise INT4, 2D block quantization, or dnnl matmul primitive on Intel GPU.
Specialized knowledge for implementing FP16/BF16 x FP8 and FP16 x INT4 quantized GEMM using oneDNN's matmul primitive on Intel Xe2 GPUs (Lunar Lake/LNL integrated, Battlemage/BMG discrete).
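oneDNN's grouped scales are blocked only along K, so a 2D block-quantized scale tensor of shape [K/bk, N/bn] has to be pre-expanded along N before it is handed to the primitive (the repeat_interleave trick mentioned above). A minimal sketch of that expansion in plain C++ (the helper `repeat_cols` is hypothetical, not part of oneDNN; it mirrors `torch.repeat_interleave(s, bn, dim=1)`):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expand a row-major [rows, cols] scale tensor to
// [rows, cols * bn] by repeating each element bn times along the column
// axis. Applied to a [K/bk, N/bn] 2D-blocked scale, this yields the
// [K/bk, N] K-grouped layout that oneDNN's grouped scales accept.
std::vector<float> repeat_cols(const std::vector<float>& s,
                               std::size_t rows, std::size_t cols,
                               std::size_t bn) {
    std::vector<float> out(rows * cols * bn);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t j = 0; j < bn; ++j)
                out[r * cols * bn + c * bn + j] = s[r * cols + c];
    return out;
}
```

With `bn = 2`, a 2x2 scale `{1, 2, 3, 4}` expands to `{1, 1, 2, 2, 3, 3, 4, 4}`: every scale now covers a full (bk x bn) weight block through the K-grouped mechanism alone.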
| I/O dtype | Weights | Scale | oneDNN Impl | Performance |
|---|---|---|---|---|
| FP16 | FP8_E4M3 | per-N | jit:gemm:any | ~130 TFLOPS (96% of 135T peak) |
| BF16 | FP8_E4M3 | per-N | jit:gemm:any | ~96 TFLOPS (71% of peak) |
| FP16 | FP8_E4M3 | block-K per-N | jit:gemm:any | ~88-110 TFLOPS |
| BF16 | FP8_E4M3 | block-K per-N | jit:gemm:any | BROKEN (wrong results on v3.7) |
| FP16 | U4 (INT4) | block-wise | jit:gemm:any | ~130 TFLOPS |
| FP32 | FP8_E4M3 | per-N | ocl:ref:any | Very slow (reference) |
KEY DIFFERENCE FROM PTL: On BMG with oneDNN 2025.2, both FP16xFP8 and BF16xFP8 have
optimized JIT kernels. On PTL, only FP16xFP8 had JIT; BF16xFP8 fell back to ocl:ref:any.
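The per-N case (scales mask = 2 on a 2D matmul) means one f32 scale per output column: the primitive computes C[m][n] = (sum_k A[m][k] * dequant(W[k][n])) * scale[n]. A reference sketch of that semantics in plain C++ (`matmul_per_n_scale` is a hypothetical checker, not oneDNN API; plain floats stand in for decoded FP8_E4M3 codes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: [M,K] x [K,N] matmul with a per-N f32 weight
// scale, matching what set_scales_mask(DNNL_ARG_WEIGHTS, 2) requests.
// All tensors are dense row-major.
std::vector<float> matmul_per_n_scale(const std::vector<float>& A,  // M*K
                                      const std::vector<float>& W,  // K*N
                                      const std::vector<float>& s,  // N
                                      std::size_t M, std::size_t K,
                                      std::size_t N) {
    std::vector<float> C(M * N, 0.f);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * W[k * N + n];
            C[m * N + n] = acc * s[n];  // one scale per output column
        }
    return C;
}
```

Such a reference is handy for validating the JIT path against small shapes before trusting it at size, which is exactly how the broken BF16 block-K case above shows up.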
Memory layout:
- A (activations): [M, K] row-major (format_tag::ab) -- FP16 or BF16
- B (weights): logical [K, N], physical [N, K] (format_tag::ba) -- FP8_E4M3 or U4
- Scales: [N] FP32 for FP8; [n_groups, N] FP16 for INT4
- C (output): [M, N] row-major (format_tag::ab) -- same dtype as A

| API | Use Case | JIT? | Notes |
|---|---|---|---|
| `set_scales_mask(DNNL_ARG_WEIGHTS, 2)` | FP8 per-N | YES | Implicit f32 dtype, required for JIT |
| `set_scales(DNNL_ARG_WEIGHTS, 2, {}, dt::f32)` | FP8 per-N | NO | Forces ocl:ref:any fallback! |
| `set_scales(DNNL_ARG_WEIGHTS, 3, {blk,1}, dt::f16)` | INT4 block-wise | YES | Explicit dtype required for groups |
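For the grouped INT4 case, the `{blk, 1}` groups mean scale and zero-point are indexed by (k / blk, n): one value is shared by `blk` consecutive K positions in each column. A minimal sketch of that dequantization rule in plain C++ (`dequant_u4` is a hypothetical helper, not oneDNN API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: dequantize a single U4 weight code under a
// block-wise scale + zero-point with group size `blk` along K.
// Both scale and zp are row-major [K/blk, N] tensors.
float dequant_u4(unsigned code,                     // 4-bit code, 0..15
                 const std::vector<float>& scale,   // [K/blk * N]
                 const std::vector<unsigned>& zp,   // [K/blk * N]
                 std::size_t k, std::size_t n, std::size_t N,
                 std::size_t blk) {
    std::size_t g = k / blk;  // group index along K
    return (float(code) - float(zp[g * N + n])) * scale[g * N + n];
}
```

Note the asymmetry with FP8: the grouped overload requires an explicit scale dtype (f16 here), whereas the per-N mask-only call must leave the dtype implicit to stay on the JIT path.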
Always check implementation string after creating primitive_desc:
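One way to sketch the check: query `pd.impl_info_str()` on the freshly built primitive_desc and look for the JIT marker. The classifier below is a hypothetical helper (only the `impl_info_str()` query is oneDNN API):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: classify the string returned by
// dnnl::matmul::primitive_desc::impl_info_str(). The optimized path
// reports e.g. "jit:gemm:any"; the slow fallback reports "ocl:ref:any".
bool is_jit_impl(const std::string& impl_info) {
    return impl_info.find("jit") != std::string::npos;
}

// Against a real primitive_desc this would look like:
//   dnnl::matmul::primitive_desc pd(eng, a_md, w_md, c_md, attr);
//   if (!is_jit_impl(pd.impl_info_str())) { /* fell back to reference */ }
```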