Expert guidance for implementing fused multi-operation kernels on Intel GPUs using ESIMD. Use this skill whenever the user needs to fuse multiple operations into a single kernel pass to minimize memory traffic, such as softmax + top-K + normalize, or any pipeline that chains reduction, selection, and normalization in one kernel. Also trigger for ESIMD softmax implementation, vectorized exp on simd<float,N> for a full row, detail::sum vs reduce pitfall (reduce silently returns 0), fused attention block selection with probability normalization, or any kernel that computes softmax probabilities and immediately selects the top-K entries. The main example is the fused softmax+topk+normalize V2 variant achieving 43.2 GB/s (43% bandwidth utilization) for seq_len=32K, N=128, K=8.
This skill covers techniques for fusing multiple operations into a single ESIMD kernel pass, eliminating intermediate global memory round-trips. The primary example is the fused softmax + topK + normalize pipeline, but the patterns generalize to any chain of per-row operations that can share data through registers.
Version: 2.0.0
Intermediate results stay in simd<T, N> registers rather than SLM or global memory. For TopK only (no fusion), see intel-esimd-topk (V2 register heap, V3 runtime K).
Fuses three operations in a single kernel pass per row:

1. Softmax: hmax, subtract, exp, sum, divide
2. Top-K selection
3. Normalize: v_i / sum(v_i) over the top-K values

Achievement: 43.2 GB/s (43% of 100 GB/s peak) for seq_len=32K, N=128, K=8.
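For host-side validation, the fused pipeline can be sketched as a scalar C++ reference that mirrors the three steps. Names like `fused_row_ref` and `TopKEntry` are illustrative, not part of the kernel:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Scalar host reference for one row: stable softmax, then top-K
// selection, then renormalization over the K kept probabilities.
struct TopKEntry { float prob; int index; };

std::vector<TopKEntry> fused_row_ref(const std::vector<float>& x, int k) {
    // 1. Softmax: hmax -> subtract -> exp -> sum -> divide
    float mx = *std::max_element(x.begin(), x.end());
    std::vector<float> p(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) { p[i] = std::exp(x[i] - mx); sum += p[i]; }
    for (float& v : p) v /= sum;

    // 2. Top-K selection by probability
    std::vector<TopKEntry> all(x.size());
    for (size_t i = 0; i < x.size(); ++i) all[i] = {p[i], (int)i};
    std::partial_sort(all.begin(), all.begin() + k, all.end(),
                      [](const TopKEntry& a, const TopKEntry& b) { return a.prob > b.prob; });
    all.resize(k);

    // 3. Renormalize so the kept probabilities sum to 1
    float top_sum = 0.0f;
    for (auto& e : all) top_sum += e.prob;
    for (auto& e : all) e.prob /= top_sum;
    return all;
}
```

Comparing the kernel's per-row output against this reference is a cheap correctness gate before chasing bandwidth numbers.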
```cpp
simd<float, N> x_f = convert<float>(x_h);
// Max subtraction — hmax on the full N-element vector directly
float max_v = hmax<float>(x_f);
x_f -= max_v;
// exp on full vector (ADL inside `using namespace esimd` picks esimd::exp)
simd<float, N> exp_x = exp(x_f);
// Sum — use detail::sum (NOT reduce<float> — see Pitfalls)
float sum_e = sycl::ext::intel::esimd::detail::sum<float, float, N>(exp_x);
simd<sycl::half, N> sm = convert<sycl::half>(exp_x / sum_e);
```
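A plain-C++ aside on why the max subtraction comes first: for large logits, raw `exp` overflows float to inf and the row degenerates to NaN, while the max-shifted form is mathematically identical and stays finite. Function names here are illustrative:

```cpp
#include <cmath>

// Naive softmax of the first of two logits: exp overflows for large a,
// and inf/(inf + eb) yields NaN.
float naive_softmax0(float a, float b) {
    return std::exp(a) / (std::exp(a) + std::exp(b));
}

// Stable form: subtract the row max first (hmax in the kernel), which
// shifts both exponents without changing the ratio.
float stable_softmax0(float a, float b) {
    float mx = a > b ? a : b;
    float ea = std::exp(a - mx), eb = std::exp(b - mx);
    return ea / (ea + eb);
}
```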
Two rules that matter:

1. Call hmax<float>(x_f) and exp(x_f) directly on the full vector — do not chunk.
2. Use detail::sum, never reduce<float> — an unqualified reduce resolves to std::reduce and silently returns 0.

The top-K normalization follows the same pattern:

```cpp
simd<float, TOP_K> fheap = convert<float>(heap);
float top_sum = sycl::ext::intel::esimd::detail::sum<float, float, TOP_K>(fheap);
simd<sycl::half, TOP_K> norm_vals = convert<sycl::half>(fheap / top_sum);
```
| Version | Time | Bandwidth | Notes |
|---|---|---|---|
| Initial (chunked exp, reduce<float>) | NaN | --- | reduce<float> returns 0, div-by-zero |
| Fixed (exp on full vector, detail::sum) | 230 us | 43.2 GB/s | 43% of 100 GB/s peak |
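The 43.2 GB/s figure in the table is consistent with a simple byte count, assuming each of the 32K rows reads N half (2-byte) logits and writes K (half value, int32 index) pairs. This accounting is an assumption, not taken from the kernel source:

```cpp
// Effective bandwidth from an assumed per-row traffic model:
// reads = n * sizeof(half), writes = k * (sizeof(half) + sizeof(int32)).
double achieved_gbps(long long rows, long long n, long long k, double seconds) {
    long long bytes = rows * (n * 2 + k * (2 + 4));
    return bytes / seconds / 1e9;
}
```

With rows=32768, n=128, k=8 and the measured 230 us, this lands at roughly 43.3 GB/s, matching the table.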
```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/esimd.hpp>
constexpr int INPUT_DIM = 128; // must be a compile-time constant
constexpr int TOP_K = 8;       // must be <= 32 for single pack_mask call
```
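The K <= 32 constraint can be enforced at compile time rather than discovered at runtime; a minimal sketch (the `static_assert` guards are a suggestion, not part of the original kernel):

```cpp
// Compile-time guards for the kernel constants. pack_mask yields a
// 32-bit mask, so a single call covers at most 32 lanes.
constexpr int INPUT_DIM = 128;
constexpr int TOP_K = 8;

static_assert(TOP_K <= 32, "single pack_mask call covers at most 32 lanes");
static_assert(TOP_K <= INPUT_DIM, "cannot select more entries than the row holds");
```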