Name: CuTe DSL
Author: NVIDIA

CuTe DSL is a Python-based domain-specific language for GPU kernel development, part of CUTLASS 4.x. It provides Python abstractions over CUTLASS C++ templates with JIT compilation to optimized CUDA kernels via MLIR and ptxas.
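To make the kernel/host split concrete, here is a minimal sketch modeled on the element-wise add pattern. The function names `add_kernel` and `add` are illustrative and not taken from this document. The guarded import only lets the file load on machines without the DSL. The sketch assumes 2D tensors whose element count is a multiple of the block size, so it does no bounds predication.

```python
# Minimal sketch of the device/host split; names are illustrative assumptions.
try:
    import cutlass.cute as cute  # installed by `pip install nvidia-cutlass-dsl`
except ImportError:
    cute = None  # allows importing this file where the DSL is unavailable

if cute is not None:

    @cute.kernel
    def add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
        # Device code: each thread handles exactly one element.
        tidx, _, _ = cute.arch.thread_idx()
        bidx, _, _ = cute.arch.block_idx()
        bdim, _, _ = cute.arch.block_dim()
        idx = bidx * bdim + tidx

        # Map the flat thread index to a (row, column) coordinate.
        m, n = gA.shape
        mi = idx // n
        ni = idx % n
        gC[mi, ni] = gA[mi, ni] + gB[mi, ni]

    @cute.jit
    def add(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
        # Host code: compiled on first call; kernels are launched from here,
        # not called directly from plain Python.
        threads = 256
        m, n = mA.shape
        # Assumes m * n is a multiple of `threads` (no tail handling).
        add_kernel(mA, mB, mC).launch(
            grid=((m * n) // threads, 1, 1),
            block=(threads, 1, 1),
        )
```

Running `add` needs a supported GPU and device tensors wrapped with `from_dlpack`. Real kernels would typically add vectorized loads and predication on top of this one-element-per-thread layout.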

When to Use

Triggers:

- Writing CUDA kernels in Python (element-wise, GEMM, custom ops)
- Optimizing GPU memory access patterns (vectorized loads, TMA, shared memory)
- Building tensor core (MMA) kernels for Ampere/Hopper/Blackwell
- Integrating custom GPU kernels with PyTorch or JAX
- Prototyping high-performance kernels without C++ metaprogramming

Symptoms (and when another tool fits better):

- Need shared memory coordination or tensor core MMA → use CuTe DSL rather than Triton for complex patterns
- Need simple element-wise ops with no shared memory → either CuTe DSL or Triton works
- Need to call existing CUTLASS C++ kernels → use the CUTLASS C++ APIs instead
- Need reductions, scans, or other non-GEMM collective ops → consider CUB/Thrust

Keywords: cute, cutlass, cute.jit, cute.kernel, from_dlpack, zipped_divide, TiledMMA, TiledCopy, TMA, WGMMA, tcgen05, pipeline, mbarrier

| Requirement | Detail |
| --- | --- |
| Platform | Linux x86_64 only |
| Python | 3.10–3.13 |
| GPU | NVIDIA Ampere+ (SM80, SM90, SM100) |
| CUDA Driver | ≥ 575.51.03 (Toolkit 12.9 compat) |
| Install | `pip install nvidia-cutlass-dsl` |
| Optional | `apache-tvm-ffi`, `torch-c-dlpack-ext` |
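The requirements above can be checked before installing or debugging anything. The following is a rough preflight sketch using only the standard library. The helper names are hypothetical, and the driver version is passed in as a string rather than queried, so the script makes no assumption about how it is obtained.

```python
import importlib.util
import platform
import sys

# Minimum driver from the requirements table: >= 575.51.03
MIN_DRIVER = (575, 51, 3)


def parse_version(text):
    """Turn a dotted version such as '575.51.03' into (575, 51, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_ok(driver_version):
    """True if the dotted driver version meets the documented minimum."""
    return parse_version(driver_version) >= MIN_DRIVER


def python_ok(version_info=sys.version_info):
    """True for Python 3.10 through 3.13 inclusive."""
    return (3, 10) <= tuple(version_info[:2]) <= (3, 13)


def platform_ok(system=None, machine=None):
    """True only on Linux x86_64; arguments default to the current host."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Linux" and machine == "x86_64"


def dsl_installed():
    """Only checks that a top-level `cutlass` package is importable."""
    return importlib.util.find_spec("cutlass") is not None


def preflight(driver_version):
    """Collect all checks into one report (hypothetical helper)."""
    return {
        "platform": platform_ok(),
        "python": python_ok(),
        "driver": driver_ok(driver_version),
        "dsl_installed": dsl_installed(),
    }


if __name__ == "__main__":
    # Example driver string; substitute the value reported on your machine.
    for name, passed in preflight("575.51.03").items():
        print(f"{name}: {'ok' if passed else 'FAILED'}")
```

On a machine with an NVIDIA driver, the version string can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The sketch does not check GPU architecture.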

| Operation | Arch | Example path (append to base URL) |
| --- | --- | --- |
| Element-wise add | SM80 | `ampere/elementwise_add.py` |
| Element-wise + autotune | SM80 | `ampere/elementwise_add_autotune.py` |
| Element-wise apply | SM80 | `ampere/elementwise_apply.py` |
| SGEMM (scalar) | SM80 | `ampere/sgemm.py` |
| Tensor-core GEMM | SM80 | `ampere/tensorop_gemm.py` |
| Flash Attention v2 | SM80 | `ampere/flash_attention_v2.py` |
| HSTU Attention | SM80 | `ampere/hstu_attention.py` |
| Shared memory allocator | SM80 | `ampere/smem_allocator.py` |
| CTA norm (LayerNorm) | SM90 | `hopper/cta_norm.py` |
| Dense GEMM | SM90 | `hopper/dense_gemm.py` |
| Dense GEMM persistent | SM90 | `hopper/dense_gemm_persistent.py` |
| Flash MHA | SM90 | `hopper/fmha.py` |
| Dense GEMM | SM100 | `blackwell/dense_gemm.py` |
| Dense GEMM persistent | SM100 | `blackwell/dense_gemm_persistent.py` |
| Dense GEMM + alpha/beta | SM100 | `blackwell/dense_gemm_alpha_beta_persistent.py` |
| RMSNorm | SM100 | `blackwell/rmsnorm.py` |
| Reduce | SM100 | `blackwell/reduce.py` |
| Flash MHA | SM100 | |

| Error | Cause | Fix |
| --- | --- | --- |
| `MLIR function requires a Context` | `@cute.kernel` function called directly from Python | Launch it from a `@cute.jit` host function |
| `DSLAstPreprocessorError` on return | Early `return` inside a `@cute.kernel` function | Use `if cutlass.dynamic_expr(cond):` |
| Type mismatch on store | `a * 2` promotes FP16→FP32 | Use `a + a` or `.to(cutlass.Float16)` |
| `could not get source code` | Kernel defined in an `exec()` context | Write the kernel to a file and import it |
| Scalar loads in Nsight | Missing alignment hint | Add `assumed_align=16` to `from_dlpack` |
| `Missing required argument` | Not all `@cute.jit` parameters passed | Pass ALL declared parameters |
| `AttributeError: sigmoid` | No `cute.math.sigmoid` | Use `1.0/(1.0+cute.math.exp(-x))` |
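The sigmoid workaround in the last row is the standard logistic expression. A host-side reference in plain Python can help validate a kernel that uses it. In this sketch `math.exp` stands in for `cute.math.exp`, and both helper names are hypothetical.

```python
import math


def sigmoid_reference(x: float) -> float:
    """Same expression as the workaround, with math.exp in place of cute.math.exp.

    Note: math.exp raises OverflowError for x below roughly -709, so this
    reference is meant for typical activation ranges only.
    """
    return 1.0 / (1.0 + math.exp(-x))


def max_abs_error(inputs, kernel_outputs):
    """Largest absolute difference between kernel results and the reference."""
    return max(
        abs(sigmoid_reference(x) - y) for x, y in zip(inputs, kernel_outputs)
    )


if __name__ == "__main__":
    xs = [-4.0, -1.0, 0.0, 1.0, 4.0]
    # Stand-in for values copied back from the GPU; here they are the reference itself.
    ys = [sigmoid_reference(x) for x in xs]
    print("max abs error:", max_abs_error(xs, ys))
```

When comparing against FP16 kernel output, expect a tolerance on the order of FP16 precision rather than exact equality.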

| File | Content |
| --- | --- |
| `concepts-architecture.md` | Core abstractions, terminology, compilation pipeline |
| `concepts-layouts.md` | Layout algebra: composition, complement, divide, swizzle |
| `concepts-tensors.md` | Tensor types, partitioning, tiling, predication |
| `concepts-mma.md` | MMA atoms, TiledMMA, per-architecture tensor core ops |
| `patterns-getting-started.md` | Installation, decorators, first kernel walkthrough |
| `patterns-elementwise.md` | Invariant principles, pattern variations, reference impl |
| `patterns-gemm.md` | 3-level tiling, shared memory, pipelining, autotuning |
| `patterns-memory.md` | from_dlpack, TMA, cp.async, TMEM, copy atoms |
| `patterns-compilation.md` | Control flow, JIT caching, TVM FFI, AOT compilation |
| `patterns-pipeline.md` | Producer-consumer, pipeline classes, barriers, warp specialization |
| `api-core.md` | cute module: layouts, tensors, math, copy, gemm, printing |
| `api-arch.md` | cute.arch: thread indexing, sync, atomics, memory ops |
| `api-nvgpu.md` | cute.nvgpu: warp/warpgroup/cpasync/tcgen05 MMA and copy |
| `api-runtime-utils.md` | Runtime: from_dlpack, fake tensors, utils, schedulers |
| `troubleshooting.md` | Debugging, env vars, common errors, limitations, FAQ |
