Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.
Convert @ct.kernel kernels to @triton.jit. API mapping: references/api-mapping.md (cuTile → Triton).
In this skill’s Markdown, the Triton launch syntax kernel[grid](…) is written with Unicode brackets so that link checkers do not parse [grid](…) as a hyperlink; use normal ASCII brackets in real Triton code.
Follow the phase-gated workflow in translations/workflow.md. Every conversion should go through analyze → convert → validate → test → benchmark, with explicit gates before moving on. Use the documents in Workflow Selection when the task matches a special case (errors, layout flags, perf).
— If the op is attention-like (e.g. Gemma gemma_attention), read references/optimization-strategy.md before converting the inner loop, then apply its §4 checklist. For other GEMM/BMM/attention-adjacent kernels, still skim §2–§3 of that file after TMA is done.
Select path — Existing TileGym op: standard mode in translations/workflow.md. If the cuTile source uses transpose / transpose_v, dual layouts, or MLA-style paths, read translations/advanced-patterns.md before writing Triton (two kernels + META grid, not one kernel + tl.trans).
Pre-flight — Run the Pre-flight Analysis grep commands on the cuTile source. Count @ct.kernel definitions; note TMA-relevant ct.load/ct.store, ct.launch, Constant, and layout flags.
Read mapping — Keep references/api-mapping.md open for cuTile → Triton API pairs. For runtime failures (illegal address, dtype, strides), use references/debugging.md.
Convert — Copy the Conversion Checklist into a todo list and execute in order. Structure and file placement: translations/file-structure.md. Mandatory: any 2D+ block-shaped tile load/store uses tl.make_tensor_descriptor (TMA), not raw tl.load(ptr+offs, mask=…) for full tiles—skipping this is the most common source of large regressions. Host side: Triton bracket launch <code>kernel[grid](args)</code> with tuple or lambda META: (…) for autotune; no ct.launch.
Validate — Syntax-check the new Triton module; run the relevant TileGym pytest targets for the op: pytest tests/ops/test_<op>.py -k "triton" -vs. Fix failures before benchmarking.
Benchmark — Compare Triton vs cuTile on perf tests. If Triton is clearly slower, follow PERFORMANCE ANALYSIS (Phase c2t-5) in translations/workflow.md and references/optimizing-reference.md for GEMM/BMM/attention; use references/optimization-strategy.md as the ordered checklist. If you see 10–50× slowdowns, read CRITICAL PERFORMANCE PATTERNS in that same workflow file first.
Execution rules (MUST):
- Runtime errors (`cudaErrorIllegalAddress`, shape mismatch, numerical mismatch) → references/debugging.md
- Advanced cuTile idioms (`transpose`, autotune + META grid, `Array.slice`, `ct.gather().item()`, `loop_unroll_factor`, occupancy autotuning, TMEM-friendly block sizes, slab allocator, dual-path kernel design) → translations/advanced-patterns.md (MLA-style two kernels; avoid the 3–15× regression on `transpose=False`)
- Perf that holds on `transpose=True` only but collapses on `transpose=False` (or the opposite) → translations/advanced-patterns.md §1 Dual layout flag: two `@triton.jit` kernels plus `grid = lambda META: (... META["BLOCK_H"] ...)`

```bash
# Count kernels (only the main kernel gets @triton.jit; helpers stay plain def)
grep "@ct\.kernel" source.py | wc -l

# Check for patterns needing special handling
grep "ct\.transpose\|ct\.permute" source.py   # → use tl.trans/tl.permute
grep "ct\.astype" source.py                   # → use .to(dtype)
grep "ct\.load\|ct\.store" source.py          # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
grep "ct\.launch" source.py                   # → bracket launch: kernel, then [grid], then (args)
grep "ct\.Constant\|ct\.ConstInt" source.py   # → tl.constexpr
grep "ct\.cdiv" source.py                     # → triton.cdiv (host) or Python (a+b-1)//b
grep "ct\.bid\|ct\.num_blocks" source.py      # → tl.program_id/tl.num_programs
grep "1 << .*\.bit_length" source.py          # → triton.next_power_of_2 if needed
grep "transpose\|transpose_v" source.py       # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)
```
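The host-side helpers several of these hits map to are plain integer arithmetic; a minimal sketch with pure-Python stand-ins (`triton.cdiv` and `triton.next_power_of_2` compute the same values):

```python
# Pure-Python equivalents of the host helpers the grep hits map to.
def cdiv(a, b):
    # Ceiling division: how many BLOCK-sized tiles cover `a` elements.
    return (a + b - 1) // b

def next_power_of_2(n):
    # Same result as the `1 << (n - 1).bit_length()` idiom grep looks for.
    return 1 << (n - 1).bit_length()

print(cdiv(1000, 128))        # 8
print(next_power_of_2(1000))  # 1024
```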
Copy this checklist and track progress:
Conversion Progress:
[ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
[ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
[ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
[ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
[ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
[ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
[ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
[ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
[ ] Step 7: Validate — run pytest or syntax check on Triton file
[ ] Step 8: Test — run pytest, verify X passed 0 failed
[ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
[ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
[ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
[ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled
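Step 6's grid lambda is worth seeing in isolation. A hypothetical host-side sketch (`make_grid` and the block names are illustrative; `cdiv` is a pure-Python stand-in for `triton.cdiv`) — with autotune enabled, Triton evaluates the lambda with the winning config's `META` dict, so the host never hard-codes block sizes:

```python
# Hypothetical host wrapper sketch; BLOCK_M/BLOCK_N names are illustrative.
def cdiv(a, b):  # stand-in for triton.cdiv (same ceiling division)
    return (a + b - 1) // b

def make_grid(M, N):
    # Triton calls this lambda with the autotuned config's META dict,
    # so block sizes come from the winning config, not the host.
    return lambda META: (cdiv(M, META["BLOCK_M"]) * cdiv(N, META["BLOCK_N"]),)

grid = make_grid(1024, 1024)
# The real launch is `kernel[grid](a_ptr, b_ptr, ...)`; here we just
# evaluate the grid as Triton would for one candidate config:
print(grid({"BLOCK_M": 128, "BLOCK_N": 64}))  # (128,)
```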
Post-conversion Verification (TMA is mandatory for 2D+ loads):
[ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5-20x regression)
[ ] Grid uses a tuple or lambda (Triton does not require the 3-tuple cuTile does)
[ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
[ ] Host grid uses triton.cdiv where appropriate (not (a+b-1)//b only)
[ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
[ ] ct.astype(x, dtype) → x.to(dtype) in Triton
[ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (no keyword in Triton)
[ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
[ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
[ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
[ ] fp32/tf32: Triton defaults allow_tf32=True; match cuTile behavior if you had explicit tf32 cast
[ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
[ ] tl.assume() alignment hints added for strides and pointers
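The reduction-divisor item above is the classic silent-accuracy bug: masked-out lanes load 0.0, so dividing by the padded block size biases the mean. A pure-Python illustration (sizes are illustrative):

```python
# Why a reduction must divide by actual_size, not BLOCK_SIZE.
BLOCK_SIZE, actual_size = 8, 5
row = [2.0] * actual_size                         # real data
block = row + [0.0] * (BLOCK_SIZE - actual_size)  # what a masked load yields

wrong = sum(block) / BLOCK_SIZE   # 1.25 — silently biased
right = sum(block) / actual_size  # 2.0  — correct mean
print(wrong, right)
```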
| Pattern | cuTile | Triton | Common Mistake |
|---|---|---|---|
| mma accumulator | ct.mma(a, b, acc=acc) | tl.dot(a, b, acc) | Using keyword acc= in Triton (positional only) |
| mma float32→tf32 | Explicit ct.astype(..., ct.tfloat32) guard before ct.mma | tl.dot(a, b, allow_tf32=True) (default) | Over-specifying; Triton auto-casts by default |
| Type cast | ct.astype(x, dtype) | x.to(dtype) | Using ct.astype in Triton |
| Grid | (n, 1, 1) tuple, ct.launch(stream, grid, kernel, args) | lambda meta: (n,) or tuple, bracket launch | Using ct.launch or 3-tuple in Triton |
| Host cdiv | (a + b - 1) // b (Python) | triton.cdiv(a, b) | Forgetting triton.cdiv in host |
| 2D+ tile load | ct.load(arr, index=(i,j), shape=(BM,BK)) (cuTile uses TMA) | tl.make_tensor_descriptor(...).load([...]) | Using raw tl.load(ptr+offs, mask=m) → 5-20x regression; always use TMA for 2D+ block loads |
| Index type | Block index in ct.load/ct.store | Element offset (ptr + offs) or TMA descriptor | Using block index as tl.load offset |
| arange | ct.arange(N, dtype=ct.int32) | tl.arange(0, N) | Triton has start param (0, N) |
| None args | Dummy tensor + flag | Allowed in kernel | Carrying over dummy+flag when not needed |
| String const | ct.Constant[int] only (no str) | tl.constexpr (any type) | Keeping int enum; Triton can use str constexpr if needed |
| Shape args | Static/constexpr in ct.full/ct.zeros | Dynamic shapes OK in Triton | Over-constraining shapes |
| Launch | ct.launch(stream, grid, kernel, args) | bracket launch (grid then args) | Leaving ct.launch in Triton host |
| Branch vars | Pre-define before if | Can define in branch | Over-defining before branch in Triton |
| Pointer table type | Typed tensor descriptor (auto) | tl.load(ptrs+idx).to(tl.pointer_type(DTYPE)) where DTYPE: tl.constexpr | Hardcoding tl.float16 → cudaErrorIllegalAddress for bfloat16/float32 inputs |
| Stride dtype | cuTile uses tensor shape (auto) | Pass strides as torch.int64, not int32 | int32 overflows → illegal address for large matrices (M×K > 2^31) |
| dtype map coverage | cuTile typed tensors (auto) | _DTYPE_MAP must cover all dtypes (incl. float8); use hasattr guards | Missing entry → ValueError: Unsupported dtype before kernel launch |
| tl.math.erf dtype | cuTile erf handles all dtypes | tl.math.erf only accepts fp32/fp64 | ValueError: Expected dtype ['fp32', 'fp64'] but got fp16 — do NOT replace with tanh approximation (mathematically wrong); let Triton auto-promote or cast input |
| tl.exp with fp16 | cuTile exp handles all dtypes | Cast to fp32 before tl.exp for precision: tl.exp(x.to(tl.float32)) | Precision loss or NaN with fp16 inputs in exp/log/sqrt |
| Math func approx | N/A | Never substitute tl.math.erf with tanh-based approximation | Using GELU tanh formula (0.044715*x³) as erf approximation is mathematically incorrect — they are different functions |
| Layout flag (transpose) | cuTile may use one path per layout | Need two Triton kernels when math differs (e.g. MLA: qk [H,N] vs [N,H], different V TMA) | Reusing transpose-only logic for transpose=False + fixed blocks → 3–15× on that mode; see advanced-patterns.md |
| Batched matmul | ct.matmul(W, X) broadcasts implicitly at tile level | tl.dot(W, X) only supports 2D operands | Using broadcast_to + tl.dot → 10-50× slower, no tensor cores (see FFT anti-pattern below) |
| Batch-per-block | cuTile processes 1 batch per block naturally | Triton temptation: process BS batches per block | Creates BS× register pressure, breaks tensor core compatibility |
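The index-type row in the table is easy to get wrong mechanically. A pure-Python sketch of the block-index → element-offset translation for a row-major contiguous 2D tensor (tile shape and sizes are illustrative):

```python
# cuTile's ct.load takes a block index (bi, bj); Triton's tl.load (and TMA
# descriptors) take element offsets. Translating for a row-major (M, N) tensor:
BM, BN = 64, 32   # tile shape (illustrative)
N = 1024          # row stride in elements for a contiguous tensor
bi, bj = 2, 3     # cuTile block index

row_off = bi * BM             # first row of the tile
col_off = bj * BN             # first column of the tile
base = row_off * N + col_off  # element offset of the tile's top-left corner
print(base)  # 131168
```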
⚠️ These cause CATASTROPHIC slowdowns. Check BEFORE benchmarking.
| Pattern | SLOW (Regression) | FAST (Optimized) | Impact |
|---|---|---|---|
| Memory access (2D+ tiles) | Raw ptr + masks: tl.load(ptr+offs, mask=m) for block-shaped 2D+ loads | TMA: tl.make_tensor_descriptor(...).load([off]) | 5-20x (500%-2000%) — most common cause of conversion regression; use TMA for every 2D+ tile load |
| Group iteration | Linear search all groups per tile | While-loop with last_problem_end tracking | 2-5x |
| Tile sizes | Fixed BLOCK_M=128, BLOCK_N=128 | @triton.autotune with GPU-specific configs | 2-3x |
| Alignment | No hints | tl.assume(stride % 8 == 0), tl.assume(ptr % 16 == 0) | 1.5-2x |
| Full-tile masks | Masks on every load/store | Remove masks, let TMA handle bounds | 1.2-1.5x |
| K-loop offsets | Recalculate full offset each iter | a_ptrs += BLOCK_K or TMA offset increment | 1.1-1.2x |
| Memory layout | 5D reshape for split dims | Transpose + contiguous first/second half | 50-150% |
| constexpr params | Dynamic dimension params | Mark bs, hd, n_h as tl.constexpr | 10-20% |
| Unnecessary clones | q.clone() before in-place op | Transpose → contiguous (natural copy) | 10-20% |
| Row stride pattern | Per-element stride calculation | Row stride with ptr + pid * row_stride | 10-30% |
| broadcast_to + tl.dot | W.broadcast_to((BS,M,K)) then tl.dot(W, X) | 1-batch-per-block, load W as 2D (M,K), use tl.dot(W, X) | 10-50× (FFT case study) |
| extract_slice chains | Chain of extract_slice + reshape (24+ calls) | Direct offset computation, load into final shape | 2-5× |
Full details: translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION).
Full API mapping: references/api-mapping.md.
Triton math dtype (erf/erfc/exp/log/sqrt) and the "don't substitute erf with tanh" pattern: references/debugging.md — section Triton Math Function Dtype Requirements (CRITICAL).
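The erf/tanh point deserves a numeric check: the GELU tanh formula tracks erf(x/√2), not erf(x), so substituting it for erf changes the function. A quick stdlib demonstration:

```python
import math

def gelu_tanh_inner(x):
    # The tanh expression from the GELU approximation (the 0.044715*x**3 term).
    # This approximates erf(x / sqrt(2)) — it is NOT an approximation of erf(x).
    return math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3))

x = 2.0
print(math.erf(x))                   # ~0.9953
print(gelu_tanh_inner(x))            # ~0.9547 — close to erf(x/sqrt(2)), not erf(x)
print(math.erf(x / math.sqrt(2.0)))  # ~0.9545
```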
references/optimization-strategy.md summarizes translations/advanced-patterns.md (layout flags, dual kernels, autotune+META, batched launch, Blackwell pointers) and references/optimizing-reference.md (post-TMA micro-opts, §9) into §1–§3, plus a mandatory §4 Gemma FMHA checklist.
Rule: For attention / FMHA / Gemma-style conversions, open optimization-strategy in the same session as workflow — do not rely on TMA alone for perf sign-off.
Read from cuTile → Triton perspective. Core files live in this skill under ``.
| Category | Document | Content |
|---|---|---|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| | translations/file-structure.md | Where to place Triton files when converting from cuTile |
| | translations/advanced-patterns.md | Dual layout flags (transpose), autotune + META grid, MLA-style two kernels |
| API | api-mapping.md | cuTile → Triton mapping |
| | optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |
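The stride-overflow entry can be reproduced without a GPU: a 32-bit signed element offset wraps past 2³¹−1, which is how a large-matrix launch turns into cudaErrorIllegalAddress. A stdlib sketch of the wraparound (sizes are illustrative):

```python
# Why strides should be passed as torch.int64: an int32 element offset
# wraps once it exceeds 2**31 - 1, producing a negative (illegal) address.
def as_int32(x):
    # Reinterpret x the way a 32-bit signed register would.
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

M, K = 50_000, 50_000    # M*K = 2.5e9 elements > 2**31 - 1
offset = M * K           # fine as a Python int / int64
print(as_int32(offset))  # -1794967296 — wrapped to a negative offset
```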
Use cutile_kernel.py as source and triton_kernel.py as target:
| Example | Directory | Complexity |
|---|---|---|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |
Read cutile_kernel.py first, then triton_kernel.py, to see the inverse mapping.
A conversion is NOT COMPLETE until ALL items are checked. Copy and complete:
MANDATORY COMPLETION GATES:
[ ] 1. CORRECTNESS: pytest passes with 0 failures
Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
Gate: "X passed, 0 failed"
[ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
Verify: grep -n "tl.load.*mask" triton_file.py | wc -l # Should be 0 for 2D+ ops
Skip = 5-20x performance regression
[ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
Command: python -m pytest {test_path} -k "test_perf" --print-record -v
OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS
[ ] 4. PERFORMANCE COMPARISON RECORDED:
Document results:
| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
|--------|-----------------|-----------------|-------|
| [fill] | [fill] | [fill] | [fill]|
CONVERSION COMPLETE: All 4 gates passed? → YES / NO
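Gate 3's 0.8 ratio is a mechanical check; a trivial sketch (the TFLOPS numbers are placeholders, not measured results):

```python
# Gate 3: Triton TFLOPS >= 0.8 * cuTile TFLOPS. Numbers are placeholders.
def perf_gate(triton_tflops, cutile_tflops, floor=0.8):
    ratio = triton_tflops / cutile_tflops
    return round(ratio, 3), ratio >= floor

print(perf_gate(41.0, 45.0))  # (0.911, True)
print(perf_gate(30.0, 45.0))  # (0.667, False)
```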
Why this matters: a conversion that passes tests but skips TMA or autotune can ship a 5-20x regression while looking "done".
If any gate fails: fix and re-verify before declaring complete.