Skill File

Cuda Writing

Name: Cuda Writing
Author: jarmak-personal

PROACTIVELY USE THIS SKILL when writing, modifying, or reviewing GPU kernels, CUDA/NVRTC kernel source, CCCL primitive usage, device memory management, stream-based pipelining, or any GPU dispatch logic in src/vibespatial/. Covers kernel lifecycle, ADR-0033 tier system, stream overlap patterns, warp-level intrinsics, count-scatter patterns, precompilation (ADR-0034), precision dispatch (ADR-0002), and GPU saturation techniques.

jarmak-personal · 0 stars · Mar 20, 2026

Occupation
Categories: Machine Learning

Skill Content

GPU Kernel Development Guide — vibeSpatial

You are writing GPU code for vibeSpatial. Follow these rules strictly. All GPU primitive dispatch decisions are governed by ADR-0033. Precision dispatch is governed by ADR-0002. Precompilation is governed by ADR-0034.

1. ADR-0033 Tier Decision Tree

Before writing any GPU code, classify your operation:

Is the inner loop geometry-specific (ring traversal, winding, segment intersection)?
  -> Yes: Tier 1 (custom NVRTC kernel)
  -> No: Is it segmented (per-row/ring/group reduction, sort, scan)?
    -> Yes: Tier 3a (CCCL segmented_*)
    -> No: Is it sort/unique/search/partition/compaction/scan?
      -> Yes: Tier 3a (CCCL)
      -> No: Is it element-wise/gather/scatter/concat?
        -> Yes: Tier 2 (CuPy)
        -> No: Tier 1 (custom NVRTC kernel)

Then, for CCCL (Tier 3a), refine the choice. If the input can be expressed as a CountingIterator or TransformIterator, use a Tier 3c iterator. If the primitive is called repeatedly with the same types and operators, use a Tier 3b make_* callable.
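
The decision tree above can be sketched as a small pure-Python function. This is only an illustration: `OpTraits` and `classify_tier` are hypothetical names that do not exist in vibeSpatial, and the sketch assumes the Tier 3b and Tier 3c refinements can both apply to the same Tier 3a operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpTraits:
    """Hypothetical description of an operation (not a vibeSpatial type)."""
    geometry_specific_inner_loop: bool = False
    segmented: bool = False
    sort_search_or_scan: bool = False
    element_wise: bool = False
    iterator_friendly_input: bool = False
    repeated_same_types: bool = False


def classify_tier(op: OpTraits) -> list:
    """Walk the ADR-0033 questions top to bottom and return the matching tiers."""
    if op.geometry_specific_inner_loop:
        return ["Tier 1"]
    if op.segmented or op.sort_search_or_scan:
        tiers = ["Tier 3a"]
        # Follow-up questions that refine a CCCL choice.
        if op.iterator_friendly_input:
            tiers.append("Tier 3c")
        if op.repeated_same_types:
            tiers.append("Tier 3b")
        return tiers
    if op.element_wise:
        return ["Tier 2"]
    # Nothing matched: fall back to a custom NVRTC kernel.
    return ["Tier 1"]
```

For example, a prefix sum that runs on every call with the same dtype would classify as `["Tier 3a", "Tier 3b"]`, while a ring-winding computation goes straight to `["Tier 1"]`.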

# One-shot API (cold-call pays full JIT):
cccl_algorithms.exclusive_scan(values, out, _sum_op, init, n)

# make_* API (pre-compiled, reusable):
scanner = cccl_algorithms.make_exclusive_scan(None, values, out, _sum_op, n, init)
temp = cp.empty(scanner_temp_bytes, dtype=cp.uint8)  # scanner_temp_bytes: temp-storage size in bytes, obtained beforehand (not shown here)
scanner(temp, values, out, _sum_op, n, init)  # no JIT check

from vibespatial.cuda.cccl_primitives import counting_iterator, transform_iterator

# Instead of: indices = cp.arange(n, dtype=cp.int32)
indices = counting_iterator(0, dtype=np.int32)

# Fuse a transform with a reduction (no intermediate buffer):
squared = transform_iterator(d_values, lambda x: x * x)

# Callers use strategy enums, never raw CCCL calls:
result = compact_indices(mask, strategy=CompactionStrategy.AUTO)
offsets = exclusive_sum(counts, strategy=ScanStrategy.AUTO)

from vibespatial.cuda._runtime import (
    get_cuda_runtime, make_kernel_cache_key,
    KERNEL_PARAM_PTR, KERNEL_PARAM_I32, KERNEL_PARAM_F64,
)

# 1. Get runtime singleton
runtime = get_cuda_runtime()

# 2. Compile (cached via SHA1 of source)
cache_key = make_kernel_cache_key("my-kernel", _KERNEL_SOURCE)
kernels = runtime.compile_kernels(
    cache_key=cache_key, source=_KERNEL_SOURCE,
    kernel_names=("my_kernel",),
)

# 3. Parameters as (values_tuple, types_tuple)
ptr = runtime.pointer
params = (
    (ptr(d_input), ptr(d_output), width, height, some_float),
    (KERNEL_PARAM_PTR, KERNEL_PARAM_PTR, KERNEL_PARAM_I32, KERNEL_PARAM_I32, KERNEL_PARAM_F64),
)

# 4. Occupancy-based launch config (NEVER hardcode block=(256,1,1))
grid, block = runtime.launch_config(kernels["my_kernel"], item_count)

# 5. Launch
runtime.launch(kernels["my_kernel"], grid=grid, block=block, params=params)

KERNEL_PARAM_PTR = ctypes.c_void_p    # Device memory pointers
KERNEL_PARAM_I32 = ctypes.c_int       # 32-bit integers
KERNEL_PARAM_F64 = ctypes.c_double    # 64-bit floats

grid, block = runtime.launch_config(kernel, item_count)
# Or: block_size = runtime.optimal_block_size(kernel, shared_mem_bytes=0)

extern "C" __global__ void __launch_bounds__(256, 4)
my_kernel(const double* __restrict__ x, ...) {
    // 256 max threads/block, at least 4 blocks per SM
}
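
To see what `__launch_bounds__(256, 4)` asks of the compiler, it helps to work out the register budget it implies. The sketch below is a rough estimate only. It assumes 65,536 32-bit registers per SM and a hardware cap of 255 registers per thread, and it ignores register allocation granularity; `max_registers_per_thread` is a hypothetical helper, not part of the runtime API.

```python
def max_registers_per_thread(max_threads_per_block, min_blocks_per_sm,
                             registers_per_sm=65536):
    """Upper bound on registers per thread that still lets
    min_blocks_per_sm blocks of max_threads_per_block threads be resident."""
    resident_threads = max_threads_per_block * min_blocks_per_sm
    # 255 is the per-thread register cap (assumption stated in the lead-in).
    return min(255, registers_per_sm // resident_threads)


budget = max_registers_per_thread(256, 4)  # 65536 // 1024 = 64
```

Under these assumptions, asking for 4 resident blocks of 256 threads leaves at most 64 registers per thread, so a kernel that needs more will spill to local memory.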

# WITHOUT zero-fill (default — use when kernel writes every element)
d_output = runtime.allocate((n,), np.float64)

# WITH zero-fill (use for count arrays, sparse-update targets)
d_counts = runtime.allocate((n,), np.int32, zero=True)

d_data = runtime.from_host(host_array)              # H2D
host_result = runtime.copy_device_to_host(d_output)  # D2H
ptr = runtime.pointer(d_array)                       # device pointer (int), 0 for None

stats = runtime.memory_pool_stats()  # {"used_bytes", "total_bytes", "free_bytes"}
runtime.free_pool_memory()           # release cached memory

# Manual create/destroy
stream = runtime.create_stream()
runtime.launch(kernel, grid=g, block=b, params=p, stream=stream)
stream.synchronize()
runtime.destroy_stream(stream)

# Context manager (auto-sync + cleanup)
with runtime.stream_context() as stream:
    runtime.launch(kernel, grid=g, block=b, params=p, stream=stream)

# Pinned host memory enables true async DMA
h_buf = runtime.allocate_pinned((n,), np.int32)

# Async D2H (enqueue only — sync stream before reading)
runtime.copy_device_to_host_async(d_array, stream, h_buf)

# Async H2D
runtime.copy_host_to_device_async(h_array, d_array, stream)

from vibespatial.cuda._runtime import count_scatter_total, count_scatter_total_with_transfer

# Simple: single-sync async pinned transfer (replaces 2x .get())
total = count_scatter_total(runtime, device_counts, device_offsets)

# Advanced: also starts full counts D2H on background stream
total, xfer_stream, pinned_counts = count_scatter_total_with_transfer(
    runtime, device_counts, device_offsets,
)
# ... launch scatter kernel on null stream ...
runtime.synchronize()
xfer_stream.synchronize()  # counts already transferred
runtime.destroy_stream(xfer_stream)
host_counts = pinned_counts

stream_q = runtime.create_stream()
stream_t = runtime.create_stream()
d_query = runtime.allocate(query_bounds.shape, query_bounds.dtype)
d_tree = runtime.allocate(tree_bounds.shape, tree_bounds.dtype)
runtime.copy_host_to_device_async(query_bounds, d_query, stream_q)
runtime.copy_host_to_device_async(tree_bounds, d_tree, stream_t)
stream_q.synchronize()
stream_t.synchronize()
runtime.destroy_stream(stream_q)
runtime.destroy_stream(stream_t)

with runtime.stream_context() as s_poly:
    runtime.launch(bounds_polygon_kernel, ..., stream=s_poly)
with runtime.stream_context() as s_mpoly:
    runtime.launch(bounds_multipolygon_kernel, ..., stream=s_mpoly)

from vibespatial.cuda.cccl_primitives import (
    # Tier 3a — CCCL default (beats CuPy)
    exclusive_sum,            # Prefix sum (1.8-3.7x faster)
    compact_indices,          # Bool mask -> indices (CuPy default; CCCL via explicit strategy)
    sort_pairs,               # Radix or merge sort with values
    unique_sorted_pairs,      # Unique-by-key on sorted input
    segmented_reduce_sum,     # Per-segment sum
    segmented_reduce_min,     # Per-segment min
    segmented_reduce_max,     # Per-segment max
    segmented_sort,           # Sort within offset-delimited segments
    lower_bound,              # Binary search (first insertion point)
    upper_bound,              # Binary search (last insertion point)
    three_way_partition,      # Split into 3 groups by predicates

    # Tier 3c — Zero-allocation iterators
    counting_iterator,        # Lazy [0,1,2,...] (replaces cp.arange)
    transform_iterator,       # Fused element-wise transform

    # Tier 4 — Still CuPy-default (CCCL available but marginal win)
    reduce_sum,               # Scalar reduction (CuPy cp.sum faster at small scale)
)

offsets = exclusive_sum(counts, synchronize=False)
sorted_result = sort_pairs(keys, values, synchronize=False)
lb = lower_bound(sorted_data, queries, synchronize=False)

from vibespatial.cuda.cccl_precompile import request_warmup
request_warmup(["exclusive_scan_i32", "exclusive_scan_i64", "select_i32"])

from vibespatial.cuda.nvrtc_precompile import request_nvrtc_warmup
request_nvrtc_warmup([
    ("my-kernel", _KERNEL_SOURCE, _KERNEL_NAMES),
])

import vibespatial
status = vibespatial.precompile_status()  # dict of spec -> compiled/pending/failed

const bool valid = row < row_count;
const unsigned char is_candidate = valid ? candidate_mask[row] : 0;
if (__ballot_sync(0xFFFFFFFF, is_candidate) == 0) {
    return;  // entire warp skips all global memory reads
}
if (!valid || !is_candidate) return;

if (condition) {
    // ... divergent work ...
}
__syncwarp(0xFFFFFFFF);  // reconverge before shuffle/ballot

const unsigned int FULL_MASK = 0xFFFFFFFF;
for (int offset = 16; offset > 0; offset >>= 1) {
    my_crossings ^= __shfl_xor_sync(FULL_MASK, my_crossings, offset);
    my_boundary  |= __shfl_xor_sync(FULL_MASK, my_boundary, offset);
}
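
The butterfly exchange in the shuffle loop above can be checked on the CPU. The following sketch simulates a 32-lane warp in plain Python, where "lane i reads from lane i ^ offset" stands in for `__shfl_xor_sync`. `butterfly_xor_reduce` is a hypothetical name used only for this illustration.

```python
WARP_SIZE = 32


def butterfly_xor_reduce(lane_values):
    """Simulate the XOR butterfly: after log2(32) = 5 steps,
    every lane holds the XOR of all 32 inputs."""
    if len(lane_values) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    values = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane combines its value with the value held by lane ^ offset.
        values = [values[lane] ^ values[lane ^ offset]
                  for lane in range(WARP_SIZE)]
        offset >>= 1
    return values


# Eleven lanes report one crossing each, so the parity is odd.
crossings = [1 if lane % 3 == 0 else 0 for lane in range(WARP_SIZE)]
parity_per_lane = butterfly_xor_reduce(crossings)
```

Because every lane ends up with the same result, no broadcast step is needed afterwards. The same pattern works for the `|=` boundary flag by swapping the operator.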

__shared__ int warp_results[8];  // up to 256 threads = 8 warps
const int warp_id = threadIdx.x / 32;
const int lane_id = threadIdx.x % 32;
const int num_warps = (blockDim.x + 31) / 32;  // warps in this block (must be <= 8)

// 1. Warp-level shuffle reduction (XOR, e.g. crossing parity)
for (int offset = 16; offset > 0; offset >>= 1)
    my_value ^= __shfl_xor_sync(0xFFFFFFFF, my_value, offset);

// 2. Lane 0 writes to shared memory
if (lane_id == 0) warp_results[warp_id] = my_value;
__syncthreads();

// 3. Thread 0 reduces across warps
if (threadIdx.x == 0) {
    int total = 0;
    for (int w = 0; w < num_warps; ++w) total ^= warp_results[w];
    output[blockIdx.x] = total;
}

Pass 0 (count):   Each thread computes output size -> counts[tid]
Prefix sum:        exclusive_sum(counts) -> offsets[tid]
Get total:         count_scatter_total(runtime, counts, offsets)
Allocate output:   runtime.allocate((total,), dtype)
Pass 1 (scatter):  Each thread writes output at offsets[tid]
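
A CPU reference makes the two passes easy to follow. The sketch below mirrors the five steps in plain Python, with one loop iteration standing in for one GPU thread. `count_scatter` and the local `exclusive_sum` are hypothetical helpers written for this example, not the vibeSpatial primitives of the same name.

```python
def exclusive_sum(counts):
    """Exclusive prefix sum: offsets[i] = sum(counts[:i])."""
    running, offsets = 0, []
    for count in counts:
        offsets.append(running)
        running += count
    return offsets


def count_scatter(rows, expand):
    """Two-pass variable-length output.
    expand(row) returns the list of items produced by one row."""
    # Pass 0 (count): each "thread" reports its output size.
    counts = [len(expand(row)) for row in rows]
    # Prefix sum gives each thread its private write position.
    offsets = exclusive_sum(counts)
    # Total = last offset + last count.
    total = offsets[-1] + counts[-1] if counts else 0
    output = [None] * total
    # Pass 1 (scatter): each thread writes starting at offsets[tid].
    for tid, row in enumerate(rows):
        for k, item in enumerate(expand(row)):
            output[offsets[tid] + k] = item
    return output, offsets


output, offsets = count_scatter([3, 0, 2], lambda n: [n] * n)
```

Here row 0 writes three items at offset 0, row 1 writes nothing, and row 2 writes two items at offset 3, so no two rows ever touch the same output slot.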

for (int ring = ring_start; ring < ring_end; ++ring) {
    int edge_count = ring_offsets[ring + 1] - ring_offsets[ring] - 1;
    for (int e = threadIdx.x; e < edge_count; e += blockDim.x) {
        // ... even-odd test on edge e ...
    }
    // Warp shuffle + shared memory reduction for this ring
}

// BAD: 32-way bank conflict when reading columns
__shared__ float tile[32][32];

// GOOD: +1 padding shifts bank mapping — no conflicts
__shared__ float tile[32][33];
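
The effect of the extra column can be verified with simple index arithmetic. This sketch assumes 32 shared-memory banks that are each 4 bytes wide, so a `float` at linear index `i` lives in bank `i % 32`. `banks_hit_by_column_read` is a hypothetical helper for illustration.

```python
NUM_BANKS = 32  # assumption: 32 banks, one 4-byte word each


def banks_hit_by_column_read(row_width, column=0, rows=32):
    """Thread r of a warp reads tile[r][column].
    Return the set of distinct banks those 32 reads touch."""
    return {(r * row_width + column) % NUM_BANKS for r in range(rows)}


unpadded = banks_hit_by_column_read(32)  # tile[32][32]
padded = banks_hit_by_column_read(33)    # tile[32][33]
```

With a row width of 32, all 32 threads land in one bank and the reads serialize. With a row width of 33, each thread lands in a different bank and the column read completes in a single pass.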

runtime.launch(kernel, grid=g, block=b, params=p, shared_mem_bytes=1024)

__shared__ int scratch[256];                  // static
extern __shared__ float dynamic_smem[];       // dynamic (via shared_mem_bytes)

// Requires #include <cuda_pipeline.h> (pipeline primitives API)
__pipeline_memcpy_async(&shared_data[tid], &global_data[idx], sizeof(double));
__pipeline_commit();
__pipeline_wait_prior(0);  // wait for all committed copies
__syncthreads();

| Kernel Class | Consumer GPU (fp64:fp32 < 0.25) | Datacenter GPU (fp64:fp32 >= 0.25) |
| --- | --- | --- |
| COARSE (bounds, index, filter) | Staged fp32 with coordinate centering | Native fp64 |
| METRIC (distance, area, length) | Staged fp32 with Kahan compensation | Native fp64 |
| PREDICATE (PIP, binary preds) | Staged fp32 coarse pass + selective fp64 refinement for ambiguous rows | Native fp64 |
| CONSTRUCTIVE (clip, overlay, buffer) | Native fp64 (until robustness work proves cheaper path) | Native fp64 |
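
The policy table reduces to a lookup keyed on kernel class plus one threshold test on the device's fp64:fp32 throughput ratio. The sketch below is a minimal illustration of that lookup. `KernelClassSketch` and `precision_plan` are hypothetical stand-ins, not the real `vibespatial.runtime.precision` API.

```python
from enum import Enum


class KernelClassSketch(Enum):
    """Local stand-in for the four kernel classes in the table."""
    COARSE = "coarse"
    METRIC = "metric"
    PREDICATE = "predicate"
    CONSTRUCTIVE = "constructive"


# Consumer-GPU column of the table, keyed by kernel class.
_CONSUMER_PLAN = {
    KernelClassSketch.COARSE: "staged fp32 with coordinate centering",
    KernelClassSketch.METRIC: "staged fp32 with Kahan compensation",
    KernelClassSketch.PREDICATE: "staged fp32 coarse pass + selective fp64 refinement",
    KernelClassSketch.CONSTRUCTIVE: "native fp64",
}


def precision_plan(kernel_class, fp64_to_fp32_ratio):
    """Return the table's strategy for a given device throughput ratio."""
    if fp64_to_fp32_ratio >= 0.25:
        return "native fp64"  # datacenter column
    return _CONSUMER_PLAN[kernel_class]
```

For example, a ratio of 1/64 selects the staged fp32 paths for everything except CONSTRUCTIVE kernels, while a ratio of 0.5 selects native fp64 across the board.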

Coalesced reads: Adjacent threads read adjacent addresses. SoA layout (separate x[], y[]) is already coalesced.
Avoid AoS: Never interleave x,y in a single array. NVIDIA GTC 2024 benchmark: AoS was 5.9x slower than SoA due to strided access.
Minimize global writes: Use shared memory or registers for intermediates; write to global once.

const __restrict__: Always annotate read-only pointer parameters. On CC 3.5+, this lets the compiler route those loads through the read-only data cache (the __ldg path), increasing effective cache capacity:

extern "C" __global__ void my_kernel(
    const double* __restrict__ x,   // read-only -> __ldg cache
    const double* __restrict__ y,
    double* __restrict__ output,    // write-only
    ...

Vectorized loads for bandwidth-bound bulk I/O. Use 128-bit wide loads to reduce instruction count and increase bandwidth (1.3-1.5x speedup per NVIDIA benchmarks):
```
// Instead of scalar loads:
double val = input[idx];
// Use 128-bit vectorized loads (requires a 16-byte-aligned pointer).
// Note: idx now counts double2 pairs, i.e. elements 2*idx and 2*idx + 1.
double2 vals = reinterpret_cast<const double2*>(input)[idx];
// For fp32: float4 vals = reinterpret_cast<const float4*>(input)[idx];
```
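cudaMalloc guarantees at least 256-byte alignment, so the base pointer of an allocation is always valid for 128-bit loads. A pointer offset into the middle of an allocation must still be 16-byte aligned. Handle remainder elements with a scalar tail.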
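
The split between vectorized body and scalar tail is plain integer arithmetic. A minimal sketch follows, where `lanes` is the number of elements per 128-bit load (2 for `double2`, 4 for `float4`) and `split_vector_and_tail` is a hypothetical helper name.

```python
def split_vector_and_tail(n, lanes=2):
    """Return (number of full vector loads, indices left for the scalar tail)."""
    vector_count = n // lanes
    tail_start = vector_count * lanes
    return vector_count, list(range(tail_start, n))


vector_count, tail = split_vector_and_tail(7)  # 3 double2 loads + element 6
```

A kernel would launch `vector_count` vectorized iterations and let a few threads handle the tail indices with ordinary scalar loads.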

// Grid-stride loop with ILP (4 elements/thread)
const int stride = blockDim.x * gridDim.x;
for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
     idx < n;
     idx += stride * 4) {
    #pragma unroll
    for (int j = 0; j < 4; j++) {
        int elem = idx + j * stride;
        if (elem < n) output[elem] = compute(input[elem]);
    }
}
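
It is worth convincing yourself that the unrolled loop above still visits every element exactly once, including when `n` is not a multiple of the stride. The following CPU simulation replays the index arithmetic for every thread in the grid. `visited_elements` is a hypothetical name used only for this check.

```python
from collections import Counter


def visited_elements(n, grid_dim, block_dim, ilp=4):
    """Replay the grid-stride + ILP indexing and count visits per element."""
    stride = grid_dim * block_dim
    visits = Counter()
    for tid in range(stride):  # tid = blockIdx.x * blockDim.x + threadIdx.x
        idx = tid
        while idx < n:
            for j in range(ilp):
                elem = idx + j * stride
                if elem < n:  # same bounds guard as the kernel
                    visits[elem] += 1
            idx += stride * ilp
    return visits


visits = visited_elements(n=1000, grid_dim=3, block_dim=32)
```

Each thread handles elements `tid`, `tid + stride`, `tid + 2 * stride`, and so on, which keeps accesses coalesced across the warp while giving the compiler four independent loads to overlap per iteration.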

grid_size = min(
    (n + block_size - 1) // block_size,
    sm_count * max_blocks_per_sm  # fill GPU exactly — avoid wave waste
)

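_WORK_BINS = [64, 1024]  # simple < 64 verts, medium 64-1024, complex > 1024

def _should_bin_dispatch(work_estimates):
    # Skip binning for small batches.
    if len(work_estimates) < 1024:
        return False
    mean = work_estimates.mean()
    if mean == 0:
        return False  # no work at all; avoids division by zero
    # Bin only when work is highly skewed (coefficient of variation > 2).
    return work_estimates.std() / mean > 2.0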
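
A standard-library version of the same heuristic shows how the thresholds behave on concrete inputs. This is a sketch under stated assumptions: `bin_label` and `should_bin_dispatch` are hypothetical names, the bin boundaries follow the comment on `_WORK_BINS` (64 and 1024 both fall in "medium"), and `pstdev` is used because it matches the population standard deviation that an array `.std()` call returns by default in NumPy and CuPy.

```python
from statistics import mean, pstdev

WORK_BINS = (64, 1024)


def bin_label(vertex_count):
    """Map a per-row work estimate to its dispatch bin."""
    if vertex_count < WORK_BINS[0]:
        return "simple"
    if vertex_count <= WORK_BINS[1]:
        return "medium"
    return "complex"


def should_bin_dispatch(work_estimates):
    """Bin only for large batches whose coefficient of variation exceeds 2."""
    if len(work_estimates) < 1024:
        return False
    average = mean(work_estimates)
    if average == 0:
        return False
    return pstdev(work_estimates) / average > 2.0


# 1000 tiny polygons plus 24 huge ones: heavily skewed workload.
skewed = [10] * 1000 + [100_000] * 24
uniform = [100] * 2048
```

The skewed batch triggers binning because a handful of huge rows would otherwise stall the warps that share a kernel launch with tiny rows. The uniform batch does not.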

Reverse block traversal: Alternate block indexing direction between consecutive kernels so kernel B starts from data kernel A last touched:

// Kernel A: forward
int blockId = blockIdx.x;
// Kernel B: reverse (hits A's still-cached tail data)
int blockId = gridDim.x - blockIdx.x - 1;

Cache tiling: If intermediate data fits in L2 (40 MB A100, 72 MB RTX 4090, 50 MB H100), process data in L2-sized chunks end-to-end rather than running each stage over the full dataset. L2 cache hits can yield up to 10x bandwidth improvement over HBM misses.
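
Cache tiling comes down to picking a chunk size whose working set fits in L2 and then running every stage on one chunk before moving to the next. The sketch below plans the chunk boundaries. `l2_chunks` is a hypothetical helper, and the 50% budget (leaving room for outputs and other resident data) is an assumption for illustration, not a measured value.

```python
def l2_chunks(row_count, bytes_per_row, l2_bytes, fraction=0.5):
    """Return [start, end) row ranges whose working set fits in a share of L2."""
    rows_per_chunk = max(1, int(l2_bytes * fraction) // bytes_per_row)
    return [(start, min(start + rows_per_chunk, row_count))
            for start in range(0, row_count, rows_per_chunk)]


# 10M points, 16 bytes each (fp64 x and y), on a GPU with 40 MB of L2.
chunks = l2_chunks(10_000_000, 16, 40 * 1024 * 1024)
```

With these numbers each chunk holds 1,310,720 rows and the dataset splits into 8 chunks. Each chunk would then flow through all pipeline stages while its data is still resident in L2.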

from vibespatial.runtime.adaptive import plan_dispatch_selection
from vibespatial.runtime.precision import KernelClass
from vibespatial.runtime._runtime._runtime import ExecutionMode

selection = plan_dispatch_selection(
    kernel_name="my_kernel",
    kernel_class=KernelClass.COARSE,
    row_count=n,
    requested_mode=dispatch_mode,
)
if selection.selected is ExecutionMode.GPU:
    return _my_kernel_gpu(...)

Thread Scaling (18 CCCL specs)

| Threads | Wall Time | Speedup |
| --- | --- | --- |
| 1 | ~18-22s | 1x |
| 4 | ~4-5s | 4x |
| 8 | ~2-3s | 7x (recommended for CCCL) |
| 16 | ~1.5s | 12x (recommended for NVRTC) |

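Cost by Operation

| Operation | CCCL Specs | NVRTC Units | Est. Cold Cost |
| --- | --- | --- | --- |
| `gs.within(other)` | 2 | 1 | ~1s |
| `gs.to_wkb()` | 2 | 0 | ~1s |
| `gpd.sjoin(a, b)` | 12 | 2-3 | ~2s |
| `gpd.overlay(a, b)` | 6 | 2 | ~1.5s |
| `gs.dissolve(by=)` | 3 | 0 | ~1s |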

1. ADR-0033 Tier Decision Tree

Tier 1 — Custom NVRTC Kernels

Tier 2 — CuPy Built-Ins (Default for Element-Wise)

Tier 3a — CCCL Algorithmic Primitives

Tier 3b — CCCL make_* Reusable Callables

Tier 3c — CCCL Iterators (Zero-Allocation)

Strategy Enum Pattern (API Stability)

2. Kernel Launch Lifecycle (Tier 1)

Parameter Type Constants

Occupancy-Based Block Sizing

__launch_bounds__ Directive

3. Device Memory Management

Allocation

Transfers

Memory Pool

4. CUDA Streams — When and How

When to Use Streams

When NOT to Use Streams

Stream API

Async Transfers

Pattern: Count-Scatter Total

Pattern: Independent Uploads

Pattern: Independent Family Kernels

Synchronization Rules

5. CCCL Primitives

Available in cccl_primitives.py

Synchronize Parameter

6. Precompilation and Warmup (ADR-0034)

The Problem

Three-Level Demand-Driven Strategy

Key Properties

Thread Scaling (18 CCCL specs)

Cost by Operation

Observability

7. Warp-Level Intrinsics

Warp Ballot — Early Exit

__syncwarp() — Warp Reconvergence (Volta+)

Warp Shuffle — Intra-Warp Reduction

Block-Level Reduction (via shared memory)

8. Two-Pass Count-Scatter Pattern

Best Practices

9. Shared Memory

When to Use

Cooperative Intra-Ring PIP

Bank Conflict Avoidance

Declaring Shared Memory

Async Copy to Shared Memory (CUDA 11.0+)

10. Precision Dispatch (ADR-0002)

Policy by Kernel Class

How It Works

Implementation Checklist

11. Performance Rules

Memory Access

Kernel Launch

Grid-Stride Loops with ILP

Divergence and Load Balancing

Arithmetic Micro-Optimizations

L2 Cache-Aware Patterns

12. Dispatcher Pattern
