Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.
Convert @ct.kernel kernels to @triton.jit. API mapping: references/api-mapping.md (cuTile → Triton).
In this skill’s Markdown, the Triton launch syntax kernel[grid](…) is written with Unicode brackets so that link checkers do not parse [grid](…) as a hyperlink; use normal ASCII brackets in real Triton code.
Follow the phase-gated workflow in translations/workflow.md. Every conversion should go through analyze → convert → validate → test → benchmark, with explicit gates before moving on. Use the documents in Workflow Selection when the task matches a special case (errors, layout flags, perf).
— If the op is attention-like (e.g. Gemma gemma_attention), read references/optimization-strategy.md before converting the inner loop, then apply its §4 checklist. For other GEMM/BMM/attention-adjacent kernels, still skim §2–§3 of that file after TMA is done.
Select path — Existing TileGym op: standard mode in translations/workflow.md. If the cuTile source uses transpose / transpose_v, dual layouts, or MLA-style paths, read translations/advanced-patterns.md before writing Triton (two kernels + META grid, not one kernel + tl.trans).
Pre-flight — Run the Pre-flight Analysis grep commands on the cuTile source. Count @ct.kernel definitions; note TMA-relevant ct.load/ct.store, ct.launch, Constant, and layout flags.
Read mapping — Keep references/api-mapping.md open for cuTile → Triton API pairs. For runtime failures (illegal address, dtype, strides), use references/debugging.md.
Convert — Copy the Conversion Checklist into a todo list and execute in order. Structure and file placement: translations/file-structure.md. Mandatory: any 2D+ block-shaped tile load/store uses tl.make_tensor_descriptor (TMA), not raw tl.load(ptr+offs, mask=…) for full tiles—skipping this is the most common source of large regressions. Host side: Triton bracket launch <code>kernel[grid](args)</code> with tuple or lambda META: (…) for autotune; no ct.launch.
Validate — Syntax-check the new Triton module; run the relevant TileGym pytest targets for the op: pytest tests/ops/test_<op>.py -k "triton" -vs. Fix failures before benchmarking.
Benchmark — Compare Triton vs cuTile on perf tests. If Triton is clearly slower, follow PERFORMANCE ANALYSIS (Phase c2t-5) in translations/workflow.md and references/optimizing-reference.md for GEMM/BMM/attention; use references/optimization-strategy.md as the ordered checklist. If you see 10–50× slowdowns, read CRITICAL PERFORMANCE PATTERNS in that same workflow file first.
Execution rules (MUST):
- Runtime errors (`cudaErrorIllegalAddress`, shape mismatch, numerical mismatch) → references/debugging.md
- Advanced cuTile idioms (`transpose`, autotune + META grid, `Array.slice`, `ct.gather().item()`, `loop_unroll_factor`, occupancy autotuning, TMEM-friendly block sizes, slab allocator, dual-path kernel design) → translations/advanced-patterns.md (MLA-style two kernels; avoid the 3–15× regression on `transpose=False`)
- Perf that holds on `transpose=True` only but collapses on `transpose=False` (or the opposite) → translations/advanced-patterns.md §1 Dual layout flag: two `@triton.jit` kernels plus `grid = lambda META: (... META["BLOCK_H"] ...)`

```bash
# Count kernels (only the main kernel gets @triton.jit; helpers stay plain def)
grep "@ct\.kernel" source.py | wc -l

# Check for patterns needing special handling
grep "ct\.transpose\|ct\.permute" source.py   # → use tl.trans/tl.permute
grep "ct\.astype" source.py                   # → use .to(dtype)
grep "ct\.load\|ct\.store" source.py          # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
grep "ct\.launch" source.py                   # → bracket launch: kernel, then [grid], then (args)
grep "ct\.Constant\|ct\.ConstInt" source.py   # → tl.constexpr
grep "ct\.cdiv" source.py                     # → triton.cdiv (host) or Python (a+b-1)//b
grep "ct\.bid\|ct\.num_blocks" source.py      # → tl.program_id/tl.num_programs
grep "1 << .*\.bit_length" source.py          # → triton.next_power_of_2 if needed
grep "transpose\|transpose_v" source.py       # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)
```
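The host-side helpers several of these hits map to are plain integer arithmetic; a minimal sketch with pure-Python stand-ins (`triton.cdiv` and `triton.next_power_of_2` compute the same values):

```python
# Pure-Python equivalents of the host helpers the grep hits map to.
def cdiv(a, b):
    # Ceiling division: how many BLOCK-sized tiles cover `a` elements.
    return (a + b - 1) // b

def next_power_of_2(n):
    # Same result as the `1 << (n - 1).bit_length()` idiom grep looks for.
    return 1 << (n - 1).bit_length()

print(cdiv(1000, 128))        # 8
print(next_power_of_2(1000))  # 1024
```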
Copy this checklist and track progress:
Conversion Progress:
[ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
[ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
[ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
[ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
[ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
[ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
[ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
[ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
[ ] Step 7: Validate — run pytest or syntax check on Triton file
[ ] Step 8: Test — run pytest, verify X passed 0 failed
[ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
[ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
[ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
[ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled
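Step 6's grid lambda is worth seeing in isolation. A hypothetical host-side sketch (`make_grid` and the block names are illustrative; `cdiv` is a pure-Python stand-in for `triton.cdiv`) — with autotune enabled, Triton evaluates the lambda with the winning config's `META` dict, so the host never hard-codes block sizes:

```python
# Hypothetical host wrapper sketch; BLOCK_M/BLOCK_N names are illustrative.
def cdiv(a, b):  # stand-in for triton.cdiv (same ceiling division)
    return (a + b - 1) // b

def make_grid(M, N):
    # Triton calls this lambda with the autotuned config's META dict,
    # so block sizes come from the winning config, not the host.
    return lambda META: (cdiv(M, META["BLOCK_M"]) * cdiv(N, META["BLOCK_N"]),)

grid = make_grid(1024, 1024)
# The real launch is `kernel[grid](a_ptr, b_ptr, ...)`; here we just
# evaluate the grid as Triton would for one candidate config:
print(grid({"BLOCK_M": 128, "BLOCK_N": 64}))  # (128,)
```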
Post-conversion Verification (TMA is mandatory for 2D+ loads):
[ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5-20x regression)
[ ] Grid uses a tuple or lambda (Triton does not require the 3-tuple cuTile does)
[ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
[ ] Host grid uses triton.cdiv where appropriate (not (a+b-1)//b only)
[ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
[ ] ct.astype(x, dtype) → x.to(dtype) in Triton
[ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (no keyword in Triton)
[ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
[ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
[ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
[ ] fp32/tf32: Triton defaults allow_tf32=True; match cuTile behavior if you had explicit tf32 cast
[ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
[ ] tl.assume() alignment hints added for strides and pointers
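The reduction-divisor item above is the classic silent-accuracy bug: masked-out lanes load 0.0, so dividing by the padded block size biases the mean. A pure-Python illustration (sizes are illustrative):

```python
# Why a reduction must divide by actual_size, not BLOCK_SIZE.
BLOCK_SIZE, actual_size = 8, 5
row = [2.0] * actual_size                         # real data
block = row + [0.0] * (BLOCK_SIZE - actual_size)  # what a masked load yields

wrong = sum(block) / BLOCK_SIZE   # 1.25 — silently biased
right = sum(block) / actual_size  # 2.0  — correct mean
print(wrong, right)
```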
| Pattern | cuTile | Triton | Common Mistake |
|---|---|---|---|
| mma accumulator | ct.mma(a, b, acc=acc) | tl.dot(a, b, acc) | Using keyword acc= in Triton (positional only) |
| mma float32→tf32 | Explicit ct.astype(..., ct.tfloat32) guard before ct.mma | tl.dot(a, b, allow_tf32=True) (default) | Over-specifying; Triton auto-casts by default |
| Type cast | ct.astype(x, dtype) | x.to(dtype) | Using ct.astype in Triton |
| Grid | (n, 1, 1) tuple, ct.launch(stream, grid, kernel, args) | lambda meta: (n,) or tuple, bracket launch | Using ct.launch or 3-tuple in Triton |
| Host cdiv | (a + b - 1) // b (Python) | triton.cdiv(a, b) | Forgetting triton.cdiv in host |
| 2D+ tile load | ct.load(arr, index=(i,j), shape=(BM,BK)) (cuTile uses TMA) | tl.make_tensor_descriptor(...).load([...]) | Using raw tl.load(ptr+offs, mask=m) → 5-20x regression; always use TMA for 2D+ block loads |
| Index type | Block index in ct.load/ct.store | Element offset (ptr + offs) or TMA descriptor | Using block index as tl.load offset |
| arange | ct.arange(N, dtype=ct.int32) | tl.arange(0, N) | Triton has start param (0, N) |
| None args | Dummy tensor + flag | Allowed in kernel | Carrying over dummy+flag when not needed |
| String const | ct.Constant[int] only (no str) | tl.constexpr (any type) | Keeping int enum; Triton can use str constexpr if needed |
| Shape args | Static/constexpr in ct.full/ct.zeros | Dynamic shapes OK in Triton | Over-constraining shapes |
| Launch | ct.launch(stream, grid, kernel, args) | bracket launch (grid then args) | Leaving ct.launch in Triton host |
| Branch vars | Pre-define before if | Can define in branch | Over-defining before branch in Triton |
| Pointer table type | Typed tensor descriptor (auto) | tl.load(ptrs+idx).to(tl.pointer_type(DTYPE)) where DTYPE: tl.constexpr | Hardcoding tl.float16 → cudaErrorIllegalAddress for bfloat16/float32 inputs |
| Stride dtype | cuTile uses tensor shape (auto) | Pass strides as torch.int64, not int32 | int32 overflows → illegal address for large matrices (M×K > 2^31) |
| dtype map coverage | cuTile typed tensors (auto) | _DTYPE_MAP must cover all dtypes (incl. float8); use hasattr guards | Missing entry → ValueError: Unsupported dtype before kernel launch |
| tl.math.erf dtype | cuTile erf handles all dtypes | tl.math.erf only accepts fp32/fp64 | ValueError: Expected dtype ['fp32', 'fp64'] but got fp16 — do NOT replace with tanh approximation (mathematically wrong); let Triton auto-promote or cast input |
| tl.exp with fp16 | cuTile exp handles all dtypes | Cast to fp32 before tl.exp for precision: tl.exp(x.to(tl.float32)) | Precision loss or NaN with fp16 inputs in exp/log/sqrt |
| Math func approx | N/A | Never substitute tl.math.erf with tanh-based approximation | Using GELU tanh formula (0.044715*x³) as erf approximation is mathematically incorrect — they are different functions |
| Layout flag (transpose) | cuTile may use one path per layout | Need two Triton kernels when math differs (e.g. MLA: qk [H,N] vs [N,H], different V TMA) | Reusing transpose-only logic for transpose=False + fixed blocks → 3–15× on that mode; see advanced-patterns.md |
| Batched matmul | ct.matmul(W, X) broadcasts implicitly at tile level | tl.dot(W, X) only supports 2D operands | Using broadcast_to + tl.dot → 10-50× slower, no tensor cores (see FFT anti-pattern below) |
| Batch-per-block | cuTile processes 1 batch per block naturally | Triton temptation: process BS batches per block | Creates BS× register pressure, breaks tensor core compatibility |
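The index-type row in the table is easy to get wrong mechanically. A pure-Python sketch of the block-index → element-offset translation for a row-major contiguous 2D tensor (tile shape and sizes are illustrative):

```python
# cuTile's ct.load takes a block index (bi, bj); Triton's tl.load (and TMA
# descriptors) take element offsets. Translating for a row-major (M, N) tensor:
BM, BN = 64, 32   # tile shape (illustrative)
N = 1024          # row stride in elements for a contiguous tensor
bi, bj = 2, 3     # cuTile block index

row_off = bi * BM             # first row of the tile
col_off = bj * BN             # first column of the tile
base = row_off * N + col_off  # element offset of the tile's top-left corner
print(base)  # 131168
```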
⚠️ These cause CATASTROPHIC slowdowns. Check BEFORE benchmarking.
| Pattern | SLOW (Regression) | FAST (Optimized) | Impact |
|---|---|---|---|
| Memory access (2D+ tiles) | Raw ptr + masks: tl.load(ptr+offs, mask=m) for block-shaped 2D+ loads | TMA: tl.make_tensor_descriptor(...).load([off]) | 5-20x (500%-2000%) — most common cause of conversion regression; use TMA for every 2D+ tile load |
| Group iteration | Linear search all groups per tile | While-loop with last_problem_end tracking | 2-5x |
| Tile sizes | Fixed BLOCK_M=128, BLOCK_N=128 | @triton.autotune with GPU-specific configs | 2-3x |
| Alignment | No hints | tl.assume(stride % 8 == 0), tl.assume(ptr % 16 == 0) | 1.5-2x |
| Full-tile masks | Masks on every load/store | Remove masks, let TMA handle bounds | 1.2-1.5x |
| K-loop offsets | Recalculate full offset each iter | a_ptrs += BLOCK_K or TMA offset increment | 1.1-1.2x |
| Memory layout | 5D reshape for split dims | Transpose + contiguous first/second half | 50-150% |
| constexpr params | Dynamic dimension params | Mark bs, hd, n_h as tl.constexpr | 10-20% |
| Unnecessary clones | q.clone() before in-place op | Transpose → contiguous (natural copy) | 10-20% |
| Row stride pattern | Per-element stride calculation | Row stride with ptr + pid * row_stride | 10-30% |
| broadcast_to + tl.dot | W.broadcast_to((BS,M,K)) then tl.dot(W, X) | 1-batch-per-block, load W as 2D (M,K), use tl.dot(W, X) | 10-50× (FFT case study) |
| extract_slice chains | Chain of extract_slice + reshape (24+ calls) | Direct offset computation, load into final shape | 2-5× |
Full details: translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION).
Full API mapping: references/api-mapping.md.
Triton math dtype (erf/erfc/exp/log/sqrt) and the "don't substitute erf with tanh" pattern: references/debugging.md — section Triton Math Function Dtype Requirements (CRITICAL).
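The erf/tanh point deserves a numeric check: the GELU tanh formula tracks erf(x/√2), not erf(x), so substituting it for erf changes the function. A quick stdlib demonstration:

```python
import math

def gelu_tanh_inner(x):
    # The tanh expression from the GELU approximation (the 0.044715*x**3 term).
    # This approximates erf(x / sqrt(2)) — it is NOT an approximation of erf(x).
    return math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3))

x = 2.0
print(math.erf(x))                   # ~0.9953
print(gelu_tanh_inner(x))            # ~0.9547 — close to erf(x/sqrt(2)), not erf(x)
print(math.erf(x / math.sqrt(2.0)))  # ~0.9545
```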
references/optimization-strategy.md summarizes translations/advanced-patterns.md (layout flags, dual kernels, autotune+META, batched launch, Blackwell pointers) and references/optimizing-reference.md (post-TMA micro-opts, §9) into §1–§3, plus a mandatory §4 Gemma FMHA checklist.
Rule: For attention / FMHA / Gemma-style conversions, open optimization-strategy in the same session as workflow — do not rely on TMA alone for perf sign-off.
Read from cuTile → Triton perspective. Core files live in this skill under ``.
| Category | Document | Content |
|---|---|---|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| | translations/file-structure.md | Where to place Triton files when converting from cuTile |
| | translations/advanced-patterns.md | Dual layout flags (transpose), autotune + META grid, MLA-style two kernels |
| API | api-mapping.md | cuTile → Triton mapping |
| | optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |
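The stride-overflow entry can be reproduced without a GPU: a 32-bit signed element offset wraps past 2³¹−1, which is how a large-matrix launch turns into cudaErrorIllegalAddress. A stdlib sketch of the wraparound (sizes are illustrative):

```python
# Why strides should be passed as torch.int64: an int32 element offset
# wraps once it exceeds 2**31 - 1, producing a negative (illegal) address.
def as_int32(x):
    # Reinterpret x the way a 32-bit signed register would.
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

M, K = 50_000, 50_000    # M*K = 2.5e9 elements > 2**31 - 1
offset = M * K           # fine as a Python int / int64
print(as_int32(offset))  # -1794967296 — wrapped to a negative offset
```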
Use cutile_kernel.py as source and triton_kernel.py as target:
| Example | Directory | Complexity |
|---|---|---|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |
Read cutile_kernel.py first, then triton_kernel.py, to see the inverse mapping.
A conversion is NOT COMPLETE until ALL items are checked. Copy and complete:
MANDATORY COMPLETION GATES:
[ ] 1. CORRECTNESS: pytest passes with 0 failures
Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
Gate: "X passed, 0 failed"
[ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
Verify: grep -n "tl.load.*mask" triton_file.py | wc -l # Should be 0 for 2D+ ops
Skip = 5-20x performance regression
[ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
Command: python -m pytest {test_path} -k "test_perf" --print-record -v
OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS
[ ] 4. PERFORMANCE COMPARISON RECORDED:
Document results:
| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
|--------|-----------------|-----------------|-------|
| [fill] | [fill] | [fill] | [fill]|
CONVERSION COMPLETE: All 4 gates passed? → YES / NO
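Gate 3's 0.8 ratio is a mechanical check; a trivial sketch (the TFLOPS numbers are placeholders, not measured results):

```python
# Gate 3: Triton TFLOPS >= 0.8 * cuTile TFLOPS. Numbers are placeholders.
def perf_gate(triton_tflops, cutile_tflops, floor=0.8):
    ratio = triton_tflops / cutile_tflops
    return round(ratio, 3), ratio >= floor

print(perf_gate(41.0, 45.0))  # (0.911, True)
print(perf_gate(30.0, 45.0))  # (0.667, False)
```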
Why this matters: a conversion that passes tests but skips TMA or autotune can ship a 5-20x regression while looking "done".
If any gate fails: fix and re-verify before declaring complete.