Foundational Intel ESIMD GPU programming skill. Use this skill proactively whenever the user is writing, optimizing, or debugging any SYCL/ESIMD kernel for Intel GPUs — including Intel Arc, Iris Xe, or Data Center GPU Max. Covers kernel design, memory access patterns (block_load, gather, SLM), data types, vectorization, workgroup patterns, hardware characteristics, performance analysis, and troubleshooting. Trigger this even when the user does not explicitly say "ESIMD" — invoke it for any Intel GPU kernel development, performance bottleneck questions, or SYCL optimization tasks targeting Intel hardware.
Foundational guidance for writing, optimizing, and debugging Intel GPU ESIMD kernels — applicable to any kernel regardless of algorithm. Covers kernel design, all memory access patterns, data types, vectorization, workgroup patterns, hardware characteristics, performance analysis, and troubleshooting.
ESIMD (Explicit SIMD) is a C++ programming model for Intel GPU Execution Units (EUs). Each ESIMD work-item maps to one EU thread and operates on explicit simd<T, N> register vectors rather than the implicit SIMD of standard SYCL. This gives direct control over:
- Memory access patterns (block_load, gather, SLM)
- Hardware intrinsics (hmax, hmin, pack_mask, fbl, xmx::dpas)

Required headers and annotation:
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/esimd.hpp>
// Kernel lambda must be annotated:
[=](sycl::id<1> idx) SYCL_ESIMD_KERNEL { ... }
// or equivalently:
[=](sycl::nd_item<1> item) [[intel::sycl_explicit_simd]] { ... }
Namespace: bring in ESIMD intrinsics with:
using namespace sycl::ext::intel::esimd;
| Resource | Spec |
|---|---|
| EUs (Execution Units) | 128–512 (model dependent) |
| Threads per EU | 8 |
| Total threads | EUs × 8 |
| GRF registers per thread (Xe1) | 128 × 32B = 4 KB |
| GRF registers per thread (Xe2+) | 256 × 32B = 8 KB |
| SIMD width per thread (Xe1) | 256-bit → 16 FP16 elements per instruction |
| SIMD width per thread (Xe2+) | 512-bit → 32 FP16 elements per instruction |
| SLM per sub-slice | 128 KB |
| Recommended SLM | < 64 KB |
Register file budget:
| Architecture | GRF per thread | FP16 capacity | FP32 capacity |
|---|---|---|---|
| Xe1 (Arc Alchemist, iGPU Gen12) | 128 × 32B = 4 KB | up to ~2048 elements | up to ~1024 elements |
| Xe2+ (Arc Battlemage, Lunar Lake+) | 256 × 32B = 8 KB | up to ~4096 elements | up to ~2048 elements |
SIMD instruction width:
| Architecture | SIMD width | FP16 elements/instruction |
|---|---|---|
| Xe1 (Arc Alchemist, iGPU Gen12) | 256-bit | 16 FP16 per instruction |
| Xe2+ (Arc Battlemage, Lunar Lake+) | 512-bit | 32 FP16 per instruction |
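The table implies a simple split rule for wider `simd` types; a quick host-side check of the arithmetic (plain C++ — `split_count` is an illustrative helper, not an ESIMD API):

```cpp
// Illustrative helper (not an ESIMD API): how many native-width instructions
// the compiler emits for a simd<half, N> operation on each architecture.
constexpr int kXe1HalfWidth = 16;  // 256-bit / 16-bit elements
constexpr int kXe2HalfWidth = 32;  // 512-bit / 16-bit elements

constexpr int split_count(int n, int native_width) {
    return (n + native_width - 1) / native_width;  // round up for odd sizes
}

static_assert(split_count(128, kXe1HalfWidth) == 8, "simd<half,128> on Xe1");
static_assert(split_count(128, kXe2HalfWidth) == 4, "simd<half,128> on Xe2+");
```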
For simd<half, N> with N greater than the native width, the compiler automatically splits the operation into multiple native-width instructions — no manual tiling needed. Prefer simd widths that are multiples of 16 (Xe1) or 32 (Xe2+). E.g., simd<half, 128> compiles to 8 instructions on Xe1, 4 on Xe2+.
Query the thread count at runtime (instead of hard-coding):
const int ts = (int)q.get_device()
.get_info<sycl::info::device::max_compute_units>() * 8;
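One way to use that count — a host-side sketch (the `rows_per_thread` helper is illustrative) that spreads rows across all hardware threads without dropping any:

```cpp
#include <cstddef>

// Illustrative sizing helper: given the queried device thread count (EUs * 8)
// and a total row count, how many rows each thread should process so that
// every thread has work and no trailing row is lost.
size_t rows_per_thread(size_t total_rows, size_t device_threads) {
    return (total_rows + device_threads - 1) / device_threads;  // ceil-divide
}
```

A dispatch could then use `range<1>(ts)` with each thread looping `rows_per_thread(total_rows, ts)` times.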
Apply in this order — each step can dwarf the next:
| Priority | Technique | Typical Impact |
|---|---|---|
| 0 | Kernel design (range dim, work/thread, template params) | 10–35× |
| 1 | Memory access (block_load, alignment, coalescing) | 2–5× |
| 2 | Data types (FP16 > FP32, uint4/int8 for weights) | 2–8× |
| 3 | Vectorization (hmax/hmin, SIMD ops, avoid scalar loops) | 1.5–2× |
| 4 | Parallelization (workgroup size, SLM reduction) | 10–30% |
| 5 | Register file vs SLM (~1 cycle vs ~30 cycles) | up to 2× |
| 6 | Algorithm complexity (O(N) vs O(K×N), unroll, hoist) | varies |
This is the single most impactful decision. A bad design can be 10–35× slower before any micro-optimization.
Use multi-dimensional range<D> so indices map directly to data dimensions. Never decode a 1D linear index into multi-dimensional coordinates at runtime.
// BAD: 1D range + manual 4D decode → 1% bandwidth
q.parallel_for(range<1>(bsz * heads * seq * blocks),
[=](id<1> idx) SYCL_ESIMD_KERNEL {
int d4 = idx[0] % blocks;
int d3 = (idx[0] / blocks) % seq;
int d2 = (idx[0] / (blocks * seq)) % heads;
int d1 = idx[0] / (blocks * seq * heads);
output[idx[0]] = process(input[idx[0]]); // 1 element/thread
});
// GOOD: 3D range → ~73% bandwidth (21.7× faster before any other change)
q.parallel_for(range<3>(bsz, heads, seq),
[=](id<3> idx) SYCL_ESIMD_KERNEL {
int b = idx[0], h = idx[1], s = idx[2];
// each thread processes many output elements for this (b, h, s)
});
Processing 1 element per thread creates massive scheduling overhead. Target 32–128+ elements per thread (or one complete row/column/tile).
// GOOD: thread processes 128 elements
constexpr int ELEMS = 128;
const int base = idx[0] * ELEMS;
simd<sycl::half, ELEMS> data = block_load<sycl::half, ELEMS>(input + base);
// ... process data ...
block_store<sycl::half, ELEMS>(output + base, data);
Runtime variables prevent loop unrolling and constant propagation. Make all array sizes and loop bounds constexpr or template parameters.
// BAD: runtime N prevents unrolling
void kernel(int N) { for (int i = 0; i < N; i++) ... }
// GOOD: compile-time N enables full unroll and SIMD vectorization
template<int N>
void kernel() {
#pragma unroll
for (int i = 0; i < N; i++) ...
}
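A standalone C++ analogue of the pattern (`dot` is a hypothetical helper): with N as a template parameter, the trip count is a compile-time constant the optimizer can fully unroll and vectorize — the same reason ESIMD kernels should take sizes as template parameters.

```cpp
// Hypothetical helper: the loop bound N is a template parameter, so it is a
// compile-time constant and the loop can be fully unrolled and vectorized.
template <int N>
float dot(const float* a, const float* b) {
    float acc = 0.f;
#pragma unroll
    for (int i = 0; i < N; i++)
        acc += a[i] * b[i];
    return acc;
}
```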
When computing multiple dot products or reductions, issue all block_loads before any compute. This lets the hardware pipeline memory fetches:
// BAD: each load stalls its own compute
for (int i = 0; i < K; i++) {
auto v = block_load<half, D>(ptr + i * D);
results[i] = detail::sum<half, half, D>(v * w);
}
// GOOD: all loads in-flight before compute begins (+29% measured)
simd<half, D> vecs[K];
for (int i = 0; i < K; i++) vecs[i] = block_load<half, D>(ptr + i * D);
for (int i = 0; i < K; i++) results[i] = detail::sum<half, half, D>(vecs[i] * w);
The primary memory primitive. Maps to one or a few hardware memory transactions.
// Load N elements starting at ptr (byte address must be 4-byte aligned by default — see alignment rules below)
simd<T, N> data = block_load<T, N>(ptr);
simd<T, N> data = block_load<T, N>(ptr + offset); // offset in elements
// Store
block_store<T, N>(ptr + offset, data);
Efficiency rules:
- Prefer large, contiguous transactions (block_load<half, 128> = 256 B, block_load<float, 64> = 256 B)

CRITICAL — address alignment: block_load and block_store require the byte address to be a multiple of 4 by default. For 4-byte types (float, int) this is automatically satisfied. For 2-byte types (half, bfloat16) an odd element index produces a non-4-byte-aligned byte address and gives wrong results. When you cannot guarantee the address is 4-byte aligned, you must pass an alignment<> property.
// ── half / bfloat16 (2 bytes each) ───────────────────────────────────────
// Element index 3 → byte address +6 → NOT a multiple of 4 → wrong results!
simd<half, 8> data = block_load<half, 8>(ptr + 3); // WRONG
simd<half, 8> data = block_load<half, 8>(ptr + 3, properties{alignment<2>}); // OK
// Safe rule: always use alignment<2> for half* / bfloat16* block_load/store
// unless you can statically prove the element offset is even (byte offset % 4 == 0).
block_store<half, 8>(ptr + 3, data, properties{alignment<2>}); // OK
// ── float / int (4 bytes each) ────────────────────────────────────────────
// 4-byte elements: any element index gives a 4-byte-aligned byte address → safe.
simd<float, 8> data = block_load<float, 8>(ptr + 3); // OK (byte addr = +12)
// ── uint8_t (1 byte each) ────────────────────────────────────────────────
// 1-byte elements: any byte address → always need alignment<1>.
simd<uint8_t, 8> data = block_load<uint8_t, 8>(ptr + 3, properties{alignment<1>});
Summary by type:
| Type | Size | Default required alignment | When to annotate |
|---|---|---|---|
| float, int, uint32_t | 4 B | 4 B | never needed (4 B × any index = multiple of 4) |
| half, bfloat16 | 2 B | 4 B | whenever the element offset may be odd |
| uint8_t, int8_t | 1 B | 4 B | always (alignment<1>) |
Symptom of missing annotation: incorrect values loaded/stored, sporadic failures depending on offset parity, no compile error.
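The table reduces to a one-line predicate; a host-side check (`default_alignment_ok` is an illustrative helper, and it assumes the base pointer itself is at least 4-byte aligned):

```cpp
#include <cstddef>

// Illustrative predicate: does the default 4-byte address requirement of
// block_load/block_store hold for this element offset and element size?
// Assumes the base pointer itself is at least 4-byte aligned.
constexpr bool default_alignment_ok(size_t elem_offset, size_t elem_size) {
    return (elem_offset * elem_size) % 4 == 0;
}

static_assert( default_alignment_ok(3, sizeof(float)));  // float +3 elems = +12 B: fine
static_assert(!default_alignment_ok(3, 2));              // half at odd offset = +6 B: needs alignment<2>
static_assert( default_alignment_ok(2, 2));              // half at even offset = +4 B: fine
static_assert(!default_alignment_ok(3, 1));              // uint8_t: needs alignment<1>
```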
Use when elements are at irregular or strided offsets — e.g., sliding windows, strided output.
// Gather: load N elements from per-lane byte offsets
simd<T, N> data = gather<T, N>(base_ptr, simd<uint32_t, N>(byte_offsets));
// Scatter: store N elements to per-lane byte offsets
scatter<T, N>(base_ptr, simd<uint32_t, N>(byte_offsets), data);
Key rules:
- Offsets are in bytes: byte_offsets = elem_offsets * sizeof(T)
- The offset vector is simd<uint32_t, N> — unsigned 32-bit
- Negative offsets cast to uint32_t wrap to huge values → clamp before the cast
- Slower than block_load for contiguous data, much faster for strided patterns

Batch gather pattern (process 32 outputs in parallel):
constexpr int BS = 32;
simd<int, BS> rel = simd<int, BS>(0, 1) * STRIDE; // [0, S, 2S, ..., 31S]
for (int out = 0; out < num_outputs; out += BS) {
// Clamp offsets to valid range (branchless — safe for idempotent ops like max)
simd<int, BS> offs = max(out * STRIDE + rel + OFFSET, 0);
offs = min(offs, max_elem);
simd<T, BS> vals = gather<T, BS>(
input + base, simd<uint32_t, BS>(offs) * sizeof(T));
// ... process vals ...
int n = std::min(BS, num_outputs - out);
if (n == BS) block_store<T, BS>(output + out_base + out, result);
else for (int i = 0; i < n; i++) output[out_base + out + i] = result[i];
}
Performance reference: gather achieved 72.8% bandwidth on a strided sliding-window pooling kernel (82.9× total speedup over naive scalar 1D design).
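The clamp-before-cast rule as a testable host-side sketch (`safe_byte_offset` is an illustrative helper):

```cpp
#include <cstdint>
#include <algorithm>

// Illustrative helper: clamp a possibly-negative element offset in SIGNED
// arithmetic first, then convert to the unsigned byte offset gather expects.
// Casting a negative int straight to uint32_t wraps to a huge value instead.
uint32_t safe_byte_offset(int elem_offset, int max_elem, int elem_size) {
    int clamped = std::min(std::max(elem_offset, 0), max_elem);
    return static_cast<uint32_t>(clamped) * static_cast<uint32_t>(elem_size);
}
```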
Use for inter-thread communication within a workgroup. Requires nd_range and barrier().
// Initialize SLM (compile-time size, inside kernel)
constexpr int SLM_BYTES = GROUP_SIZE * sizeof(float);
slm_init<SLM_BYTES>();
// Per-thread write (byte offset)
slm_block_store<float, 1>(local_id * sizeof(float), simd<float, 1>(partial));
// Synchronize all threads in workgroup
barrier();
// Read all GROUP_SIZE values (thread 0 only, or all threads for broadcast)
simd<float, GROUP_SIZE> parts = slm_block_load<float, GROUP_SIZE>(0);
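A host-side C++ model of what that store/barrier/load sequence computes — each vector slot stands in for one thread's SLM slot (pure simulation, no SYCL required):

```cpp
#include <vector>
#include <cstddef>

// Simulates the SLM reduction above: every "thread" writes its partial to
// its own slot (the slm_block_store), the barrier orders writes before
// reads, then "thread 0" reads all GROUP_SIZE slots and sums them.
float workgroup_reduce(const std::vector<float>& per_thread_partial) {
    std::vector<float> slm(per_thread_partial.size());   // the SLM buffer
    for (size_t tid = 0; tid < slm.size(); ++tid)
        slm[tid] = per_thread_partial[tid];              // per-thread store
    // barrier(): all writes visible before any read
    float total = 0.f;
    for (float p : slm) total += p;                      // thread 0 reads all
    return total;
}
```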
SLM limits and cost:
- Capacity: 128 KB per sub-slice; keep usage under 64 KB (see hardware table above)
- Latency: ~30 cycles vs ~1 cycle for the register file
- slm_init must be called inside the kernel, unconditionally
- barrier() must be reached by all threads in the workgroup — never inside a branch that some threads skip

Use FP16 for: weights, activations, intermediate results in neural network ops, and any case where error < 0.1 is acceptable.
Use FP32 for: accumulators summing >1000 terms (use simd<float, N> acc even when inputs are FP16), numerical stability-critical paths, softmax exponents.
// Pattern: FP16 inputs → FP32 accumulation → FP16 output
simd<sycl::half, 128> a = block_load<sycl::half, 128>(ptr_a);
simd<sycl::half, 128> b = block_load<sycl::half, 128>(ptr_b);
simd<float, 128> acc(0.f);
acc += convert<float>(a) * convert<float>(b);
float result = sycl::ext::intel::esimd::detail::sum<float, float, 128>(acc);
block_store<sycl::half, 1>(out, simd<sycl::half, 1>(sycl::half(result)));
// Element-wise ops (all broadcast scalars automatically)
simd<T, N> c = a + b;
simd<T, N> c = a * b + scalar; // FMA when T=float
simd_mask<N> m = a > b;
simd<T, N> r = merge(a, b, m); // r[i] = m[i] ? a[i] : b[i]
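The accumulator-width advice can be demonstrated one precision level up, with float standing in for FP16 and double for FP32 (illustrative helpers — same failure mode: once the accumulator is large, each term rounds to a multiple of the accumulator's ulp and the narrow sum drifts):

```cpp
// Illustrative demo of narrow-accumulator drift: float plays the FP16 role,
// double plays the FP32 role. Summing 0.1f ten million times, the narrow
// accumulator drifts far from the true value while the wide one stays close.
float sum_narrow(int n, float term) {
    float acc = 0.f;                          // narrow accumulator
    for (int i = 0; i < n; i++) acc += term;
    return acc;
}

double sum_wide(int n, float term) {
    double acc = 0.0;                         // wide accumulator
    for (int i = 0; i < n; i++) acc += term;
    return acc;
}
```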
Intel GPU hardware natively supports multiplying a scalar by a simd vector — the scalar is broadcast implicitly by the FMA unit. No replicate_w or simd<T,N>(scalar) constructor needed.
// GOOD: extract scalar from simd, multiply directly — one instruction
half k_val = k_tile[tt * SUB_HD + ii]; // simd_view → half (implicit conversion)
acc[ii] += k_val * v_row; // scalar * simd<half, N> → simd<half, N>
// BAD: explicit broadcast via replicate_w — wastes a GRF shuffle instruction
simd<half, N> k_bcast = k_tile.template replicate_w<N, 1>(tt * SUB_HD + ii);
acc[ii] += k_bcast * v_row;
Why: replicate_w<N, 1>(i) emits a hardware GRF shuffle (region select) to fill an N-wide register. When used only as a multiplier, the GPU's scalar-broadcast path in the EU instruction set handles this for free inside the FMA — eliminating the shuffle saves one instruction per (tt, ii) pair.
Extraction note: simd<T,N>::operator[] returns a simd_view, which implicitly converts to T in arithmetic contexts. The result of half k_val = simd_vec[i] is a true scalar half register.
// Horizontal sum — USE detail::sum, NOT reduce<T>(vec, std::plus<T>())
// WHY: with `using namespace sycl::ext::intel::esimd`, the name `reduce` is
// ambiguous between esimd::reduce and C++17 std::reduce (iterator overload).
// The compiler silently picks std::reduce, which returns 0 — wrong results
// with no compile error.
float s = sycl::ext::intel::esimd::detail::sum<float, float, N>(vec);