Use this skill when implementing, optimizing, or debugging quantized GEMM kernels using oneDNN on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) or newer Intel XPU. Xe2 is the GPU architecture; LNL and BMG are product names. Covers FP16/BF16 x FP8_E4M3 with per-N scale, FP16 x FP8 with block-wise scale along K, FP16 x INT4 (U4) with block-wise scale + zero-point, 2D block quantization emulation via repeat_interleave, bias fusion, and the critical API differences between set_scales_mask (JIT) vs set_scales (ref fallback). Use whenever the user mentions oneDNN FP8 GEMM, quantized matmul, W8A16, W4A16, per-N scale, block-wise FP8, block-wise INT4, 2D block quantization, or dnnl matmul primitive on Intel GPU.
Specialized knowledge for implementing FP16/BF16 x FP8 and FP16 x INT4 quantized GEMM using oneDNN's matmul primitive on Intel Xe2 GPUs (Lunar Lake/LNL integrated, Battlemage/BMG discrete).
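oneDNN's grouped scales are blocked only along K, so a 2D block-quantized scale tensor of shape [K/bk, N/bn] has to be pre-expanded along N before it is handed to the primitive (the repeat_interleave trick mentioned above). A minimal sketch of that expansion in plain C++ (the helper `repeat_cols` is hypothetical, not part of oneDNN; it mirrors `torch.repeat_interleave(s, bn, dim=1)`):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expand a row-major [rows, cols] scale tensor to
// [rows, cols * bn] by repeating each element bn times along the column
// axis. Applied to a [K/bk, N/bn] 2D-blocked scale, this yields the
// [K/bk, N] K-grouped layout that oneDNN's grouped scales accept.
std::vector<float> repeat_cols(const std::vector<float>& s,
                               std::size_t rows, std::size_t cols,
                               std::size_t bn) {
    std::vector<float> out(rows * cols * bn);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t j = 0; j < bn; ++j)
                out[r * cols * bn + c * bn + j] = s[r * cols + c];
    return out;
}
```

With `bn = 2`, a 2x2 scale `{1, 2, 3, 4}` expands to `{1, 1, 2, 2, 3, 3, 4, 4}`: every scale now covers a full (bk x bn) weight block through the K-grouped mechanism alone.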
| I/O dtype | Weights | Scale | oneDNN Impl | Performance |
|---|---|---|---|---|
| FP16 | FP8_E4M3 | per-N | jit:gemm:any | ~130 TFLOPS (96% of 135T peak) |
| BF16 | FP8_E4M3 | per-N | jit:gemm:any | ~96 TFLOPS (71% of peak) |
| FP16 | FP8_E4M3 | block-K per-N | jit:gemm:any | ~88-110 TFLOPS |
| BF16 | FP8_E4M3 | block-K per-N | jit:gemm:any | BROKEN (wrong results on v3.7) |
| FP16 | U4 (INT4) | block-wise | jit:gemm:any | ~130 TFLOPS |
| FP32 | FP8_E4M3 | per-N | ocl:ref:any | Very slow (reference) |
KEY DIFFERENCE FROM PTL: On BMG with oneDNN 2025.2, both FP16xFP8 and BF16xFP8 have
optimized JIT kernels. On PTL, only FP16xFP8 had JIT; BF16xFP8 fell back to ocl:ref:any.
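The per-N case (scales mask = 2 on a 2D matmul) means one f32 scale per output column: the primitive computes C[m][n] = (sum_k A[m][k] * dequant(W[k][n])) * scale[n]. A reference sketch of that semantics in plain C++ (`matmul_per_n_scale` is a hypothetical checker, not oneDNN API; plain floats stand in for decoded FP8_E4M3 codes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: [M,K] x [K,N] matmul with a per-N f32 weight
// scale, matching what set_scales_mask(DNNL_ARG_WEIGHTS, 2) requests.
// All tensors are dense row-major.
std::vector<float> matmul_per_n_scale(const std::vector<float>& A,  // M*K
                                      const std::vector<float>& W,  // K*N
                                      const std::vector<float>& s,  // N
                                      std::size_t M, std::size_t K,
                                      std::size_t N) {
    std::vector<float> C(M * N, 0.f);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * W[k * N + n];
            C[m * N + n] = acc * s[n];  // one scale per output column
        }
    return C;
}
```

Such a reference is handy for validating the JIT path against small shapes before trusting it at size, which is exactly how the broken BF16 block-K case above shows up.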
Memory layout:
- A (activations): [M, K] row-major (format_tag::ab) -- FP16 or BF16
- B (weights): logical [K, N], physical [N, K] (format_tag::ba) -- FP8_E4M3 or U4
- Scales: [N] FP32 for FP8; [n_groups, N] FP16 for INT4
- C (output): [M, N] row-major (format_tag::ab) -- same dtype as A

| API | Use Case | JIT? | Notes |
|---|---|---|---|
| `set_scales_mask(DNNL_ARG_WEIGHTS, 2)` | FP8 per-N | YES | Implicit f32 dtype, required for JIT |
| `set_scales(DNNL_ARG_WEIGHTS, 2, {}, dt::f32)` | FP8 per-N | NO | Forces ocl:ref:any fallback! |
| `set_scales(DNNL_ARG_WEIGHTS, 3, {blk,1}, dt::f16)` | INT4 block-wise | YES | Explicit dtype required for groups |
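For the grouped INT4 case, the `{blk, 1}` groups mean scale and zero-point are indexed by (k / blk, n): one value is shared by `blk` consecutive K positions in each column. A minimal sketch of that dequantization rule in plain C++ (`dequant_u4` is a hypothetical helper, not oneDNN API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: dequantize a single U4 weight code under a
// block-wise scale + zero-point with group size `blk` along K.
// Both scale and zp are row-major [K/blk, N] tensors.
float dequant_u4(unsigned code,                     // 4-bit code, 0..15
                 const std::vector<float>& scale,   // [K/blk * N]
                 const std::vector<unsigned>& zp,   // [K/blk * N]
                 std::size_t k, std::size_t n, std::size_t N,
                 std::size_t blk) {
    std::size_t g = k / blk;  // group index along K
    return (float(code) - float(zp[g * N + n])) * scale[g * N + n];
}
```

Note the asymmetry with FP8: the grouped overload requires an explicit scale dtype (f16 here), whereas the per-N mask-only call must leave the dtype implicit to stay on the JIT path.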
Always check implementation string after creating primitive_desc:
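One way to sketch the check: query `pd.impl_info_str()` on the freshly built primitive_desc and look for the JIT marker. The classifier below is a hypothetical helper (only the `impl_info_str()` query is oneDNN API):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: classify the string returned by
// dnnl::matmul::primitive_desc::impl_info_str(). The optimized path
// reports e.g. "jit:gemm:any"; the slow fallback reports "ocl:ref:any".
bool is_jit_impl(const std::string& impl_info) {
    return impl_info.find("jit") != std::string::npos;
}

// Against a real primitive_desc this would look like:
//   dnnl::matmul::primitive_desc pd(eng, a_md, w_md, c_md, attr);
//   if (!is_jit_impl(pd.impl_info_str())) { /* fell back to reference */ }
```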