Full pipeline for turning a SYCL/ESIMD GPU kernel into a Python-importable wheel package on Windows with Intel oneAPI 2025.x and conda. Covers every layer of the stack: ESIMD kernel (.cpp/.h) → Windows DLL (icpx) → PyTorch C++ extension (.pyd, CMake) → Python package → wheel (.whl, scikit-build-core). Use this skill whenever the user is working on Intel Arc GPU (Xe2 / BMG / PTL-H) SYCL or ESIMD kernels and wants to expose them to Python, package them as a wheel, set up a build script, debug build failures, or understand how the DLL + .pyd + wheel layers fit together. Also use it when they hit Windows-specific build issues like setvars.bat failing, cmake.exe producing no output, or ur_api.h not found.
```
lgrf_uni/kernels.cpp ──icpx──► my_kernel.dll + my_kernel.lib
                                    │
csrc/entry.cpp                      │ (linked via .lib)
csrc/wrapper.cpp ───cmake──► _ext.cp311-win_amd64.pyd
                                    │
python/my_package/                  │ (copied in)
  __init__.py    ───build──► dist/my_package-0.0.1-cp311-abi3-win_amd64.whl
  _ext.*.pyd
  my_kernel.dll
```
Two compilation passes, always in this order:

1. **DLL pass** — `icpx -fsycl`, AOT-compiled for the target GPU.
2. **.pyd pass** — `icx` (host-only), links against the DLL's `.lib`.

Project layout:

```
my_kernel/
├── CMakeLists.txt
├── pyproject.toml
├── build.bat                  ← all-in-one build script
├── run_build.bat              ← Claude Code launcher (sets conda env, calls build.bat)
├── lgrf_uni/
│   ├── esimd_kernel_api.h     ← dllexport / dllimport macro
│   ├── kernels.cpp            ← extern "C" dispatchers, sycl::queue interop
│   └── single_kernels/        ← header-only kernel implementations
│       └── my_kernel.h
├── csrc/
│   ├── entry.cpp              ← PYBIND11_MODULE registrations
│   ├── my_kernel_wrapper.cpp  ← thin C++ wrapper (Tensor checks → DLL call)
│   └── utils.h                ← get_queue() helper
├── python/my_package/
│   ├── __init__.py            ← DLL preload + _ext import
│   └── version.py
└── test/
    └── test_my_kernel.py
```
**`lgrf_uni/esimd_kernel_api.h`**

```cpp
#pragma once
#ifdef BUILD_ESIMD_KERNEL_LIB
#define ESIMD_KERNEL_API __declspec(dllexport)
#else
#define ESIMD_KERNEL_API __declspec(dllimport)
#endif
```
Define `-DBUILD_ESIMD_KERNEL_LIB` only when compiling the DLL itself, never when compiling code that links against it — on the consumer side the macro must resolve to `dllimport`.
**`lgrf_uni/kernels.cpp`**

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/esimd.hpp>
#include "esimd_kernel_api.h"
#include "single_kernels/my_kernel.h"  // header-only ESIMD implementation

extern "C" ESIMD_KERNEL_API void my_kernel(
    void* input, void* output, int N,
    void* sycl_queue_ptr)  // pass PyTorch's queue as void*
{
    sycl::queue& q = *reinterpret_cast<sycl::queue*>(sycl_queue_ptr);
    // launch via q.submit(...) using the header-only kernel
    q.submit([&](sycl::handler& cgh) {
        cgh.parallel_for(sycl::nd_range<1>(...), [=](sycl::nd_item<1> ndi)
            SYCL_ESIMD_KERNEL { my_kernel_impl(..., ndi); });
    }).wait();
}
```
Rules:

- `extern "C"` (no C++ name mangling) + `ESIMD_KERNEL_API` on every exported dispatcher.
- `void* sycl_queue_ptr`, cast to `sycl::queue*` — this is how you share the queue with PyTorch.
- Kernel bodies live in `single_kernels/`; `kernels.cpp` is just dispatching.

Build command:

```bat
icpx kernels.cpp -shared -o my_kernel.dll ^
    -DBUILD_ESIMD_KERNEL_LIB ^
    -fsycl -fsycl-targets=spir64_gen ^
    -Xs "-device ptl-h -options -doubleGRF" ^
    -O3
```
Output: `my_kernel.dll` (runtime) + `my_kernel.lib` (import library for the linker).

Device targets: `ptl-h` = Panther Lake, `xe2-hpg` = BMG. Use `-doubleGRF` for ESIMD kernels that need all 256 GRF registers.
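Before wiring up the .pyd layer, it can save a debugging round-trip to confirm the DLL actually exports the `extern "C"` symbol (forgetting `-DBUILD_ESIMD_KERNEL_LIB` flips the macro to `dllimport` and silently drops the export). A hedged sketch using stdlib `ctypes` — `has_export` is a helper name introduced here, not part of the pipeline, and loading the DLL this way requires the oneAPI runtime DLLs to be resolvable:

```python
# Smoke-test that a freshly built DLL exposes an extern "C" symbol.
import ctypes
import os

def has_export(dll_path: str, symbol: str) -> bool:
    """True if the shared library at dll_path loads and exposes `symbol`."""
    if not os.path.exists(dll_path):
        return False
    try:
        lib = ctypes.CDLL(dll_path)  # loads the DLL into this process
        getattr(lib, symbol)         # raises AttributeError if not exported
        return True
    except (OSError, AttributeError):
        return False

# After a successful icpx build, has_export("my_kernel.dll", "my_kernel")
# should report True from the build directory.
```

`dumpbin /exports my_kernel.dll` from a Developer Command Prompt gives the same answer without loading the DLL.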
**`csrc/utils.h`** — borrow PyTorch's XPU queue

```cpp
#pragma once
#include <torch/extension.h>
#include <c10/xpu/XPUStream.h>

namespace utils {
static inline sycl::queue& get_queue(const torch::Device& device) {
    return c10::xpu::getCurrentXPUStream(device.index()).queue();
}
}  // namespace utils
```
**`csrc/my_kernel_wrapper.cpp`**

```cpp
#include <torch/extension.h>
#include "../lgrf_uni/esimd_kernel_api.h"  // dllimport declarations
#include "utils.h"

// Forward-declare the DLL function (or include a header with the declaration)
extern "C" ESIMD_KERNEL_API void my_kernel(void*, void*, int, void*);
```