Name: ESIMD LSC 2D Block Operations, Gather, and Scatter
Author: ModelTC

스킬 검색.../

ESIMD LSC 2D Block Operations, Gather, and Scatter | Skills Pool

#include <sycl/ext/intel/esimd.hpp>
using namespace sycl::ext::intel::esimd;

// Load a W x H tile of type T, NBlk blocks
auto tile = lsc_load_2d<T, W, H, NBlk,
                        Transposed,    // bool: transpose on load
                        Transformed,   // bool: VNNI transform
                        L1Hint, L3Hint>(payload);

lsc_store_2d<T, W, H, NBlk,
             L1Hint, L3Hint>(payload, data);

config_2d_mem_access<T, W, H, NBlk> payload(
    base_ptr,       // const T* — base address of the 2D surface
    surface_width,  // uint32_t — surface width in BYTES minus 1
    surface_height, // uint32_t — surface height in ROWS minus 1
    surface_pitch,  // uint32_t — surface pitch in BYTES minus 1
    x_offset,       // int — tile x offset in ELEMENTS (not bytes!)
    y_offset        // int — tile y offset in ROWS
);

// Example: load 16x8 tile of half from a [rows, cols] matrix
const int cols = 256;
const int rows = 1024;
config_2d_mem_access<half, 16, 8, 1> payload(
    ptr,
    (uint32_t)(cols * sizeof(half)) - 1u,   // width in bytes - 1
    (uint32_t)(rows) - 1u,                   // height in rows - 1
    (uint32_t)(cols * sizeof(half)) - 1u,   // pitch in bytes - 1
    col_offset,                              // x in elements
    row_offset                               // y in rows
);
auto tile = lsc_load_2d<half, 16, 8, 1,
                        false, false,
                        cache_hint::cached, cache_hint::cached>(payload);

payload.set_x(new_col_offset);  // in elements
payload.set_y(new_row_offset);  // in rows

// BAD: compile error — Transposed=true not supported for half
lsc_load_2d<half, 8, 16, 1, /*Transposed=*/true, false, ...>(payload);

// GOOD: use uint32_t, then bit_cast_view to half
config_2d_mem_access<uint32_t, K/2, N, 1> k_payload(
    reinterpret_cast<const uint32_t*>(k_ptr),
    (uint32_t)(HEAD_DIM * sizeof(half)) - 1u,
    (uint32_t)(num_blocks) - 1u,
    (uint32_t)(HEAD_DIM * sizeof(half)) - 1u,
    0, 0);

auto k_u32  = lsc_load_2d<uint32_t, K/2, N, 1,
                           /*Transposed=*/true, /*Transformed=*/false,
                           cache_hint::cached, cache_hint::cached>(k_payload);
// MUST store to named variable first (see pitfalls)
auto k_tile = k_u32.template bit_cast_view<half>().read();

lsc_prefetch_2d<T, W, H, NBlk,
                L1Hint, L3Hint>(payload);

// Prefetch next K tile during VS phase (cross-phase prefetch)
k_payload.set_y(next_k_offset);
lsc_prefetch_2d<uint32_t, K/2, N, 1,
                cache_hint::cached, cache_hint::cached>(k_payload);

// Load N contiguous elements of type T
simd<half, 128> data = block_load<half, 128>(ptr + offset);

// Store N contiguous elements
block_store<half, 128>(ptr + offset, data);

// With cache hints
simd<half, 128> data = block_load<half, 128>(ptr + offset,
    overaligned_tag<16>{},       // alignment hint
    properties{cache_hint_L1<cache_hint::streaming>,
               cache_hint_L3<cache_hint::cached>});

// Gather: load one element per lane from different offsets
simd<uint32_t, 16> offsets = ...;  // byte offsets from base
simd<half, 16> data = gather<half, 16>(base_ptr, offsets);

// Scatter: store one element per lane to different offsets
scatter<half, 16>(base_ptr, offsets, data);

// Predicated gather/scatter with mask
simd_mask<16> pred = ...;
simd<half, 16> data = gather<half, 16>(base_ptr, offsets, pred);
scatter<half, 16>(base_ptr, offsets, data, pred);

Hint	Meaning	Use Case
`cache_hint::cached`	Cache in L1/L3	Data reused soon (default for most loads)
`cache_hint::uncached`	Bypass cache	Streaming writes, one-time reads
`cache_hint::streaming`	Streaming hint	Sequential access, mild temporal locality
`cache_hint::write_back`	Write-back	Store with cache retention

// Best for block_store — use default, no explicit hints
block_store<half, N>(ptr + offset, data);

// For 2D loads that will be reused — cached/cached
lsc_load_2d<..., cache_hint::cached, cache_hint::cached>(payload);

Parameter	Constraint
Tile width (in bytes)	`W * sizeof(T) <= 64` bytes
Tile height	`H <= 32` rows
Surface pitch	Must be `>= surface_width` (both in bytes)
Alignment	Base pointer must be 64-byte aligned (surface base)
X offset	Must be element-aligned for the type

// BAD: rvalue, compiler error or undefined behavior
auto tile = lsc_load_2d<uint32_t, K/2, N, 1, true, false, ...>(payload)
                .template bit_cast_view<half>().read();

// GOOD: store to named variable first
auto k_u32  = lsc_load_2d<uint32_t, K/2, N, 1, true, false, ...>(payload);
auto k_tile = k_u32.template bit_cast_view<half>().read();

// BAD: byte offset for x — loads wrong columns
config_2d_mem_access<half, 16, 8, 1> payload(ptr, wd, ht, pt,
    (int)(col * sizeof(half)), row);  // WRONG: x in bytes

// GOOD: element offset for x
config_2d_mem_access<half, 16, 8, 1> payload(ptr, wd, ht, pt,
    (int)col, row);  // CORRECT: x in elements

// BAD: raw values
config_2d_mem_access<half, 16, 8, 1> payload(ptr,
    cols * sizeof(half),      // WRONG: not minus 1
    rows,                     // WRONG: not minus 1
    cols * sizeof(half),      // WRONG: not minus 1
    0, 0);

// GOOD: subtract 1
config_2d_mem_access<half, 16, 8, 1> payload(ptr,
    (uint32_t)(cols * sizeof(half)) - 1u,
    (uint32_t)(rows) - 1u,
    (uint32_t)(cols * sizeof(half)) - 1u,
    0, 0);

lsc_store_2d<half, W, 8, 1, ...>(payload, data_upper);
payload.set_y(payload.get_y() + 8);
lsc_store_2d<half, W, 8, 1, ...>(payload, data_lower);

Access Pattern	Operation	When to Use
`lsc_load_2d` / `lsc_store_2d`	2D tiled access	Matrix tiles for DPAS, strided 2D regions
`block_load` / `block_store`	1D contiguous	Full rows, SLM cooperative loads, simple patterns
`gather` / `scatter`	Non-contiguous	Indexed access, strided columns, irregular patterns
`lsc_prefetch_2d`	2D prefetch	Hide latency for next iteration's tiles

Skill	Relevance
`intel-esimd-base`	Foundational ESIMD programming, includes memory access overview
`esimd-lsc-slm`	SLM-specific LSC operations (slm_block_load, lsc_slm_gather/scatter)
`xe2-dpas-patterns`	DPAS tiling and VNNI layout that 2D loads produce
`intel-esimd-attn-gemm`	Attention GEMM kernels using lsc_load_2d for K tiles
`xe2-sdp-hd256`	HD=256 SDP kernel using lsc_slm_scatter for S transpose
`intel-gpu-kernel-opt`	General optimization methodology including memory access tuning

Type	Max W (elements)	Max H	Max Elements
`half` (2B)	32	32	1024
`float` (4B)	16	32	512
`uint32_t` (4B)	16	32	512
`uint8_t` (1B)	64	32	2048

ESIMD LSC 2D Block Operations, Gather, and Scatter

Table of Contents

ESIMD LSC 2D Block Operations, Gather, and Scatter

Table of Contents

2D Block Operations

lsc_load_2d

lsc_store_2d

Surface Descriptor

Updating Offsets

Transposed 2D Load (VNNI)

Constraint: Only works with uint32_t, NOT half

Transposed Load Width Constraints

2D Prefetch

1D Block Load / Store

Gather / Scatter

Cache Hints

Tile Size Constraints

2D Block Load/Store Limits

Practical Maximum Tile Sizes

2D Store Height Limit

Common Pitfalls

1. `lsc_load_2d` Returns Rvalue — Cannot Chain `.bit_cast_view()`

2. X Offset is in Elements, Not Bytes

3. Width/Height/Pitch Are Value Minus 1

4. Transposed Load Does NOT Work with `half`

5. `lsc_store_2d` Height Limit for FP16

Choosing the Right Access Pattern

Helm Chart Scaffolding

Python Observability

K8s Manifest Generator

Istio Traffic Management

Secrets Management

Gitops Workflow

ESIMD LSC 2D Block Operations, Gather, and Scatter

Table of Contents

ESIMD LSC 2D Block Operations, Gather, and Scatter

Table of Contents

2D Block Operations

lsc_load_2d

lsc_store_2d

Surface Descriptor

Updating Offsets

Transposed 2D Load (VNNI)

Constraint: Only works with uint32_t, NOT half

Transposed Load Width Constraints

2D Prefetch

1D Block Load / Store

Gather / Scatter

Cache Hints

Tile Size Constraints

2D Block Load/Store Limits

Practical Maximum Tile Sizes

2D Store Height Limit

Common Pitfalls

1. lsc_load_2d Returns Rvalue — Cannot Chain .bit_cast_view()

2. X Offset is in Elements, Not Bytes

3. Width/Height/Pitch Are Value Minus 1

4. Transposed Load Does NOT Work with half

5. lsc_store_2d Height Limit for FP16

Choosing the Right Access Pattern

Related Skills

Helm Chart Scaffolding

Python Observability

K8s Manifest Generator

Istio Traffic Management

Secrets Management

Gitops Workflow

1. `lsc_load_2d` Returns Rvalue — Cannot Chain `.bit_cast_view()`

4. Transposed Load Does NOT Work with `half`

5. `lsc_store_2d` Height Limit for FP16