Expert guidance for implementing fused multi-operation kernels on Intel GPUs using ESIMD. Use this skill whenever the user needs to fuse multiple operations into a single kernel pass to minimize memory traffic, such as softmax + top-K + normalize, or any pipeline that chains reduction, selection, and normalization in one kernel. Also trigger for ESIMD softmax implementation, vectorized exp on simd<float,N> for a full row, detail::sum vs reduce pitfall (reduce silently returns 0), fused attention block selection with probability normalization, or any kernel that computes softmax probabilities and immediately selects the top-K entries. The main example is the fused softmax+topk+normalize V2 variant achieving 43.2 GB/s (43% bandwidth utilization) for seq_len=32K, N=128, K=8.
This skill covers techniques for fusing multiple operations into a single ESIMD kernel pass, eliminating intermediate global memory round-trips. The primary example is the fused softmax + topK + normalize pipeline, but the patterns generalize to any chain of per-row operations that can share data through registers.
Version: 2.0.0
Intermediate results stay in simd<T, N> registers rather than SLM or global memory. For TopK only (no fusion), see intel-esimd-topk (V2 register heap, V3 runtime K).
Fuses three operations in a single kernel pass per row:

1. Softmax: hmax, subtract, exp, sum, divide
2. Top-K selection
3. Normalize: v_i / sum(v_i) over the top-K values

Achievement: 43.2 GB/s (43% of 100 GB/s peak) for seq_len=32K, N=128, K=8.
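For host-side validation, the fused pipeline can be sketched as a scalar C++ reference that mirrors the three steps. Names like `fused_row_ref` and `TopKEntry` are illustrative, not part of the kernel:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Scalar host reference for one row: stable softmax, then top-K
// selection, then renormalization over the K kept probabilities.
struct TopKEntry { float prob; int index; };

std::vector<TopKEntry> fused_row_ref(const std::vector<float>& x, int k) {
    // 1. Softmax: hmax -> subtract -> exp -> sum -> divide
    float mx = *std::max_element(x.begin(), x.end());
    std::vector<float> p(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) { p[i] = std::exp(x[i] - mx); sum += p[i]; }
    for (float& v : p) v /= sum;

    // 2. Top-K selection by probability
    std::vector<TopKEntry> all(x.size());
    for (size_t i = 0; i < x.size(); ++i) all[i] = {p[i], (int)i};
    std::partial_sort(all.begin(), all.begin() + k, all.end(),
                      [](const TopKEntry& a, const TopKEntry& b) { return a.prob > b.prob; });
    all.resize(k);

    // 3. Renormalize so the kept probabilities sum to 1
    float top_sum = 0.0f;
    for (auto& e : all) top_sum += e.prob;
    for (auto& e : all) e.prob /= top_sum;
    return all;
}
```

Comparing the kernel's per-row output against this reference is a cheap correctness gate before chasing bandwidth numbers.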
```cpp
simd<float, N> x_f = convert<float>(x_h);
// Max subtraction — hmax on the full N-element vector directly
float max_v = hmax<float>(x_f);
x_f -= max_v;
// exp on full vector (ADL inside `using namespace esimd` picks esimd::exp)
simd<float, N> exp_x = exp(x_f);
// Sum — use detail::sum (NOT reduce<float> — see Pitfalls)
float sum_e = sycl::ext::intel::esimd::detail::sum<float, float, N>(exp_x);
simd<sycl::half, N> sm = convert<sycl::half>(exp_x / sum_e);
```
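A plain-C++ aside on why the max subtraction comes first: for large logits, raw `exp` overflows float to inf and the row degenerates to NaN, while the max-shifted form is mathematically identical and stays finite. Function names here are illustrative:

```cpp
#include <cmath>

// Naive softmax of the first of two logits: exp overflows for large a,
// and inf/(inf + eb) yields NaN.
float naive_softmax0(float a, float b) {
    return std::exp(a) / (std::exp(a) + std::exp(b));
}

// Stable form: subtract the row max first (hmax in the kernel), which
// shifts both exponents without changing the ratio.
float stable_softmax0(float a, float b) {
    float mx = a > b ? a : b;
    float ea = std::exp(a - mx), eb = std::exp(b - mx);
    return ea / (ea + eb);
}
```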
Two rules that matter:

1. Call hmax<float>(x_f) and exp(x_f) directly on the full vector — do not chunk.
2. Use detail::sum, never reduce<float> — an unqualified reduce resolves to std::reduce and silently returns 0.

The top-K normalization follows the same pattern:

```cpp
simd<float, TOP_K> fheap = convert<float>(heap);
float top_sum = sycl::ext::intel::esimd::detail::sum<float, float, TOP_K>(fheap);
simd<sycl::half, TOP_K> norm_vals = convert<sycl::half>(fheap / top_sum);
```
| Version | Time | Bandwidth | Notes |
|---|---|---|---|
| Initial (chunked exp, reduce<float>) | NaN | --- | reduce<float> returns 0, div-by-zero |
| Fixed (exp on full vector, detail::sum) | 230 us | 43.2 GB/s | 43% of 100 GB/s peak |
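The 43.2 GB/s figure in the table is consistent with a simple byte count, assuming each of the 32K rows reads N half (2-byte) logits and writes K (half value, int32 index) pairs. This accounting is an assumption, not taken from the kernel source:

```cpp
// Effective bandwidth from an assumed per-row traffic model:
// reads = n * sizeof(half), writes = k * (sizeof(half) + sizeof(int32)).
double achieved_gbps(long long rows, long long n, long long k, double seconds) {
    long long bytes = rows * (n * 2 + k * (2 + 4));
    return bytes / seconds / 1e9;
}
```

With rows=32768, n=128, k=8 and the measured 230 us, this lands at roughly 43.3 GB/s, matching the table.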
```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/esimd.hpp>
constexpr int INPUT_DIM = 128; // must be a compile-time constant
constexpr int TOP_K = 8;       // must be <= 32 for single pack_mask call
```
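The K <= 32 constraint can be enforced at compile time rather than discovered at runtime; a minimal sketch (the `static_assert` guards are a suggestion, not part of the original kernel):

```cpp
// Compile-time guards for the kernel constants. pack_mask yields a
// 32-bit mask, so a single call covers at most 32 lanes.
constexpr int INPUT_DIM = 128;
constexpr int TOP_K = 8;

static_assert(TOP_K <= 32, "single pack_mask call covers at most 32 lanes");
static_assert(TOP_K <= INPUT_DIM, "cannot select more entries than the row holds");
```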