Use when writing, modifying, porting, or optimizing CuTe DSL GPU kernels in Python; reading CuTe DSL API reference material; integrating a CuTe DSL kernel into a project; or rewriting an existing CUDA or C++ operator into CuTe DSL while preserving correctness and performance expectations.
Use the bundled CuTe DSL API snapshots in this skill and the workspace CUTLASS checkout to design, implement, debug, and integrate CuTe DSL GPU kernels in a way that is reusable across projects, including cache-dit.
Use this skill when you need to:

- Write, modify, port, or optimize CuTe DSL GPU kernels in Python
- Read or interpret CuTe DSL API reference material
- Integrate a CuTe DSL kernel into a project
- Rewrite an existing CUDA or C++ operator into CuTe DSL while preserving correctness and performance expectations
Do not use this skill for:
Related skills: `cutlass-cpp-kernel`, `cuda-cpp-kernel`, `operator-migration`.

Read the relevant API reference files before writing kernel code.
Do not guess CuTe DSL APIs or architecture helpers from memory when the bundled docs or workspace CUTLASS examples can answer the question precisely.
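As a concrete starting point, the overall shape of a CuTe DSL entry point looks roughly like the sketch below. The decorator and helper names (`cute.kernel`, `cute.jit`, `cute.arch.thread_idx`, `.launch`) follow the public CuTe DSL examples, but treat every signature here as an assumption to verify against the bundled docs; the import is guarded so the plain-Python grid helper stands on its own.

```python
# Sketch only: verify the cutlass.cute API names against the bundled
# reference files before relying on them.
try:
    import cutlass.cute as cute  # CuTe DSL, if installed
    HAVE_CUTE_DSL = True
except ImportError:
    HAVE_CUTE_DSL = False

def launch_grid(m, n, tile_m, tile_n):
    """Plain-Python ceil-division grid sizing for an (m, n) problem."""
    return ((m + tile_m - 1) // tile_m, (n + tile_n - 1) // tile_n, 1)

if HAVE_CUTE_DSL:
    @cute.kernel
    def add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
        # Per-thread work would be partitioned from the tensors here;
        # see the bundled cute.md for the real partitioning APIs.
        tidx, _, _ = cute.arch.thread_idx()

    @cute.jit
    def add_launch(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
        add_kernel(mA, mB, mC).launch(grid=[1, 1, 1], block=[128, 1, 1])
```

The guarded import keeps the skeleton importable on machines without the DSL installed, which is useful when sketching structure before touching a GPU box.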
Use Copilot-friendly sibling-file references for bundled docs in this skill, for example:
- `cute.md`
- `cute_runtime.md`
- `utils.md`
- `cute_nvgpu_tcgen05.md`
- `pipeline.md`

Use workspace-relative paths for CUTLASS sources, for example:
- `vipshop/cutlass/python/CuTeDSL/`
- `vipshop/cutlass/examples/python/CuTeDSL/`
- `vipshop/cutlass/python/pycute/`
- `vipshop/cutlass/include/cute/`
- `vipshop/cutlass/media/docs/pythonDSL/`

Do not use agent-specific skill paths or placeholder-driven argument text in the final skill content.
Core API references:
- `cute.md` — core CuTe DSL types and tensor or layout operations
- `cute_runtime.md` — runtime helpers and data interop
- `utils.md` — helper utilities and hardware info

Architecture-specific references:
- `cute_nvgpu.md` — architecture API index
- `cute_nvgpu_warp.md` — warp-level APIs for SM80 to SM89
- `cute_nvgpu_warpgroup.md` — warpgroup APIs for SM90
- `cute_nvgpu_tcgen05.md` — tcgen05 and SM100+ APIs
- `cute_nvgpu_cpasync.md` — async-copy APIs
- `cute_arch.md` — low-level architecture primitives
- `utils_sm90.md` and `utils_sm100.md` — architecture helpers

Pipeline and overview:
- `pipeline.md`
- `intro.md`

Additional workflow and concept references from the workspace CUTLASS docs:
- `vipshop/cutlass/media/docs/pythonDSL/overview.rst` — high-level positioning of CUTLASS DSLs and how CuTe DSL relates to CUTLASS C++
- `vipshop/cutlass/media/docs/pythonDSL/quick_start.rst` — environment, install, and setup assumptions
- `vipshop/cutlass/media/docs/pythonDSL/functionality.rst` — supported dtypes, architectures, and current feature scope
- `vipshop/cutlass/media/docs/pythonDSL/limitations.rst` — current CuTe DSL limitations and unsupported cases
- `vipshop/cutlass/media/docs/pythonDSL/faqs.rst` — common issues and expected behavior
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl.rst` — CuTe DSL workflow overview
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_api.rst` — API documentation entrypoint
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_introduction.rst` — DSL programming model and mental model
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_control_flow.rst` — control-flow semantics and restrictions
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_dynamic_layout.rst` — static vs dynamic layout handling
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_jit_arg_generation.rst` — JIT argument typing and signature generation
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_jit_caching.rst` — JIT cache behavior
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_jit_compilation_options.rst` — compilation flags and debugging options
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/framework_integration.rst` — framework interop patterns
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_ahead_of_time_compilation.rst` — AOT compilation and export flow
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/debugging.rst` — debugging workflow and generated-artifact inspection
- `vipshop/cutlass/media/docs/pythonDSL/cute_dsl_general/autotuning_gemm.rst` — autotuning guidance for GEMM kernels

These workspace docs are especially valuable when the bundled API snapshots are too terse for workflow, compilation, debugging, or integration questions.
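The static-vs-dynamic distinction that `dsl_dynamic_layout.rst` and `dsl_jit_caching.rst` describe can be pictured with a toy cache key: static arguments specialize the compiled kernel (one cache entry per value), while dynamic arguments are abstracted to their type. This is an illustration of the concept only, not the DSL's actual cache-key implementation.

```python
# Toy model of JIT-cache behavior: static args participate by value,
# dynamic args only by type, so shape-agnostic kernels are reused.
def cache_key(args, static_names):
    key = []
    for name, value in sorted(args.items()):
        if name in static_names:
            key.append((name, value))                 # specializes the kernel
        else:
            key.append((name, type(value).__name__))  # shape-agnostic
    return tuple(key)
```

Two calls that differ only in a dynamic argument share a cache entry in this model; changing a static argument forces a recompile, which is the trade-off the workspace docs ask you to weigh.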
CUDA architecture and profiling references bundled in this skill:
- `sm89-optimization-guide.md`
- `sm90-optimization-guide.md`
- `sm100-optimization-guide.md`
- `sm103-optimization-guide.md`
- `sm120-optimization-guide.md`
- `troubleshooting.md`

Use these files when interpreting `nsys` and `ncu` results for generated CuTe DSL kernels on different GPU families.
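Typical Nsight invocations for profiling a CuTe DSL benchmark script look like the helpers below. The `nsys profile --trace` and `ncu --set full` flags are standard Nsight Systems/Compute options, but confirm them against the installed tool versions; the script name is a placeholder.

```python
# Build typical nsys/ncu command lines for a Python benchmark script.
# Flags shown are common Nsight options; verify against your installed
# nsys and ncu versions before running.
def nsys_cmd(script, out="cute_timeline"):
    # Timeline capture: launch gaps, overlap, copy/compute imbalance.
    return ["nsys", "profile", "--trace=cuda,nvtx", "-o", out,
            "python", script]

def ncu_cmd(script, out="cute_counters"):
    # Counter capture: occupancy, throughput, stalls, register pressure.
    return ["ncu", "--set", "full", "-o", out, "python", script]
```

Run the `nsys` pass first to find where time goes end to end, then narrow to individual kernels with `ncu`, matching the recommended order below.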
Use the workspace CUTLASS checkout for source examples and implementation patterns.
Key locations:
- `vipshop/cutlass/python/CuTeDSL/` — CuTe DSL implementation sources
- `vipshop/cutlass/examples/python/CuTeDSL/` — CuTe DSL examples by architecture and topic
- `vipshop/cutlass/python/pycute/` — pycute helpers and layout utilities
- `vipshop/cutlass/include/cute/` — CuTe C++ headers for semantic grounding

Use the shell path `/workspace/dev/vipshop/cutlass` only when you need a literal command path.
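The pycute directory implements CuTe's layout algebra in pure Python, which makes it a good place to sanity-check layout reasoning without a GPU. A simplified analogue of the coordinate-to-offset map is sketched below; real CuTe layouts are hierarchical (nested shapes and strides), and `crd2idx` here is a hypothetical flat helper, not the pycute API itself.

```python
# Simplified analogue of CuTe's coordinate-to-offset map:
#   offset = sum(coord[i] * stride[i])
# Real CuTe/pycute layouts allow nested (hierarchical) shapes and
# strides; this flat version only shows the core idea.
def crd2idx(coord, shape, stride):
    assert all(0 <= c < s for c, s in zip(coord, shape))
    return sum(c * d for c, d in zip(coord, stride))

# A 4x8 column-major layout, CuTe's default convention:
# shape (4, 8) with stride (1, 4).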
CuTe DSL kernels often need architecture-aware profiling because the generated kernel structure can look similar across GPU generations while the correct bottleneck diagnosis differs by generation.
Use the bundled optimization guides as follows:
- On sm89 and sm120, prioritize memory throughput, L2 hit rate, occupancy, and fusion opportunities; these targets do not have TMA, TMEM, or cluster features.
- On sm90, inspect whether TMA-style overlap, warpgroup execution, and shared-memory staging are actually visible in the timeline and counters.
- On sm100 and sm103, inspect whether tcgen05 or WGMMA, TMEM, TMA v2, and cluster-capable execution are being used effectively.

Recommended profiling order:
1. Read the matching `smXX-optimization-guide.md` file.
2. Run `nsys` to identify launch gaps, missing overlap, copy or compute imbalance, and end-to-end bottlenecks.
3. Run `ncu` to inspect occupancy, memory throughput, L2 hit rate, register pressure, shared-memory pressure, tensor core utilization, and stall reasons.

Before writing code, answer these questions:
Then work in this order:
- Consult `vipshop/cutlass/media/docs/pythonDSL/` when the question is about control flow, JIT behavior, debugging, AOT, integration, or limitations.
- Study the architecture-matched examples under `vipshop/cutlass/examples/python/CuTeDSL/`.

When tuning the generated kernel, treat the bundled `smXX-optimization-guide.md` files as first-line references for interpreting profiling output rather than relying only on generic CUDA advice.
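The examples directory leans heavily on tiling arithmetic that CuTe expresses with operations like `local_tile` and `zipped_divide`. A plain-Python sketch of the underlying bookkeeping, mapping a CTA index to its tile's start and (possibly partial) extent, is shown below; `tile_extent` is a hypothetical helper, not a DSL API.

```python
# Plain-Python sketch of per-CTA tile bookkeeping: where a CTA's tile
# starts and how many valid elements it covers, including the ragged
# last tile. Illustration only, not a CuTe DSL API.
def tile_extent(cta, tile, total):
    """Start offset and (possibly partial) extent of one CTA's tile."""
    start = cta * tile
    return start, max(0, min(tile, total - start))
```

Keeping this arithmetic explicit while reading the examples makes it easier to see where the DSL's layout operations are doing the same work for you.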
Keep integration guidance generic unless the target repository requires a specific loader or manifest format.
For cache-dit or other repositories:
Related skill: `operator-migration`.

When rewriting an existing operator into CuTe DSL:
Use `cutlass-cpp-kernel` alongside this skill when you need C++ CUTLASS or CuTe source study to understand the original design.
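Boundary predication is a frequent divergence point between a CUDA original and its CuTe DSL port. A plain-Python model of partial-tile predicate coverage is sketched below; it mirrors the role of predicate tensors (such as `cute.elem_less`-style guards) in CuTe DSL examples, but is only an illustration with a hypothetical helper name.

```python
# Plain-Python model of partial-tile predication: each lane's element
# is guarded by a bounds check so out-of-range lanes are masked off
# on loads and stores. Illustration only, not a CuTe DSL API.
def tile_predicates(tile_start, tile_size, extent):
    return [tile_start + i < extent for i in range(tile_size)]
```

When a port only fails on shapes that are not multiples of the tile size, comparing the original kernel's guards against this kind of mask is a fast way to spot missing predicate coverage.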
- If the kernel uses `cp.async`, pipeline stages, or other asynchronous data movement, treat synchronization as a primary suspect. When only specific shapes or pipeline configurations produce bad outputs, first inspect barrier placement, shared-stage reuse, and predicate coverage on partial-tile loads or stores.

Every operator or kernel task completed under this skill must include validation.
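As one concrete shape such validation can take, a minimal tolerance-based comparison harness is sketched below in pure Python; in real use, `ref` would come from the original operator and `out` from the CuTe DSL port, and the tolerances would be chosen per dtype.

```python
# Minimal validation-harness sketch: compare kernel output against a
# reference with absolute and relative tolerances, returning the worst
# excess error. A result <= 0 means every element is within tolerance.
def max_mismatch(out, ref, atol=1e-5, rtol=1e-3):
    worst = float("-inf")
    for o, r in zip(out, ref):
        excess = abs(o - r) - (atol + rtol * abs(r))
        worst = max(worst, excess)
    return worst
```

Reporting the worst excess rather than a bare pass/fail makes tolerance regressions visible across shapes and dtypes.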
Minimum requirements:
Additional requirement for rewrites or migrations:
When you finish a task using this skill, report: