Shared kernel design workflow across all supported languages and DSLs. Provides language selection table, naming conventions, versioning rules, KernelPlan structure, composition patterns, clone workflow, implementation workflow, devlog template, and designer output contract. Use when: (1) choosing which language-specific kernel design skill to load, (2) the intended implementation language is not fixed yet, (3) you need naming or versioning guidance before selecting a DSL, (4) you are implementing any kernel regardless of DSL, (5) you are updating docs that refer to kernel design skills.
This skill contains everything that is common across all supported DSLs. Once the implementation language is known, also load the matching language-specific design skill for DSL-specific runtime patterns, pitfalls, and API guidance.
| Language key | Python package path | Design skill | API reference skill | Use when |
|---|---|---|---|---|
| cutile-dsl | cutile | /design-cutile-dsl-kernel | /cutile-dsl-ref | Block-level control, tiling, CTA remapping, and compiler hints are sufficient |
| cute-dsl | cute_python | /design-cute-dsl-kernel | /cute-dsl-ref | Explicit thread/warp scheduling, TMA pipelines, or shared memory control needed |
- The language key (e.g. `cute-dsl`) maps to a Python package path (e.g. `cute_python`).
- Kernel sources live under `src/mla_var3/kernel/<lang_pkg>/mla/<design>/...`.
- `python -m mla_var3.kernel <kernel> [<version>]` is the preferred user-facing entry point.

We use a nested structure for kernel packages inside the kernel sub-package:
- Language packages (e.g. `kernel.cutile`, `kernel.cute_python`)
- Domain package (e.g. `kernel.cutile.mla`)
- Design package (e.g. `kernel.cutile.mla.flash_mla`)
- Version package (e.g. `kernel.cutile.mla.flash_mla.flash_mla_v2`). The first version has no suffix and is called the "base version".
- The version module (e.g. `flash_mla_v2.py`) must contain the `KernelPlan` subclass.

Full module path: `kernel.<lang_pkg>.mla.<design>.<design>[_v<N>]`, implemented in `<design>[_v<N>].py`.
Directory layout inside a design package:

- `<design>/<design>/` (no suffix, aliased as `v0`)
- `<design>/<design>_vN/`

```shell
# Full path
python -m mla_var3.kernel.<lang_pkg>.mla.<design> [<version>] [args]

# Shortcut (discovers across all languages)
python -m mla_var3.kernel <design> [<version>] [args]

# Examples
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v4 --b=32 --s=16 --t=4096
python -m mla_var3.kernel mla_var6_plus v4 --b=32 --s=16 --t=4096
```
```shell
source .venv/bin/activate
python ./scripts/clone-kernel.py <kernel_full_name> <new_suffix>
```
After cloning, update all version-specific identifiers:

- Kernel entry-point decorators (e.g. `@ct.kernel`, `@cute.kernel`, `@cute.jit`)
- `KernelPlan` subclass names
- `Tiling` subclass names

Then verify the clone:

```shell
source .venv/bin/activate
python -m mla_var3.kernel.<lang_pkg>.mla.<design> <version> --prof_type=disabled --check
```
Every kernel version must implement a `KernelPlan` subclass. The `plan()` method returns a DSL-specific runtime wrapper (see the language-specific skill for the concrete type).
```python
@dataclass
class MyKernel(KernelPlan):
    b: int = 64; s: int = 1; t: int = 4096  # problem dimensions
    tiling: MyTiling = field(default_factory=MyTiling)

    def prepare_inputs(self, device) -> tuple:
        # Allocate and return input tensors
        ...

    def reference_fn(self, *inputs) -> tuple:
        # Reference implementation for --check
        ...

    def _autotune_configs(self) -> list[MyTiling]:
        # Candidate tiling configs for autotuner search
        ...

    def _algorithmic_flops_bytes(self, tiling) -> tuple[int, int]:
        # Analytical (FLOPs, bytes) for roofline
        ...

    def plan(self, *inputs) -> BenchmarkFn:
        # Build executable runtime object (DSL-specific)
        ...

    def plan_empty(self, peak_tflops, peak_gbps) -> BenchmarkFn:
        # Roofline-only prediction (no real tensors)
        ...


@dataclass
class MyTiling(Tiling):
    # DSL-specific fields — see the language-specific skill for examples

    def validate(self, pd: "MyKernel") -> bool:
        # Return True if this tiling is valid for the given problem dimensions
        ...
```
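The `(FLOPs, bytes)` pair from `_algorithmic_flops_bytes` is exactly what a roofline-only `plan_empty` needs: predicted time is the slower of the compute-bound and memory-bound limits. A minimal sketch of that arithmetic (the helper name and peak numbers are illustrative, not project APIs):

```python
def roofline_time_s(flops: int, nbytes: int, peak_tflops: float, peak_gbps: float) -> float:
    """Roofline model: a kernel can go no faster than either hardware ceiling allows."""
    compute_s = flops / (peak_tflops * 1e12)  # time if purely compute-bound
    memory_s = nbytes / (peak_gbps * 1e9)     # time if purely memory-bound
    return max(compute_s, memory_s)


# 2 TFLOP of math moving 1 GB on a 100 TFLOP/s, 2000 GB/s device:
# the 0.02 s compute limit dominates the 0.0005 s memory limit, so compute-bound.
t = roofline_time_s(2 * 10**12, 10**9, 100.0, 2000.0)
```

Comparing the two limits also tells you which knob matters: if `memory_s` dominates, tiling changes that reduce traffic help; if `compute_s` dominates, they will not.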
```python
def plan(self, *inputs) -> KernelPipeline:
    stage1 = stage1_plan.plan(...)
    stage2 = stage2_plan.plan(...)
    return KernelPipeline(_name="my_pipeline", stages=[stage1, stage2])
```
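To make the sequential contract concrete, here is a toy stand-in (this `KernelPipeline` is a hypothetical simplification, not the project class): each stage consumes the previous stage's output tuple.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class KernelPipeline:
    """Toy model: run stages in order, threading each output tuple into the next stage."""
    _name: str
    stages: list  # each stage: Callable[..., tuple]

    def __call__(self, *inputs: Any) -> tuple:
        out: tuple = inputs
        for stage in self.stages:
            out = stage(*out)
        return out


# Hypothetical stages standing in for compiled kernels
stage1 = lambda x: (x * 2,)  # e.g. a partial stage
stage2 = lambda x: (x + 1,)  # e.g. a combine stage
pipe = KernelPipeline(_name="my_pipeline", stages=[stage1, stage2])
```

Under this model `pipe(3)` runs `stage1` then `stage2`; the real class wraps benchmarkable runtime objects rather than plain callables.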
```python
def plan(self, *inputs) -> KernelPipeline:
    a = plan_a.plan(...)
    b = plan_b.plan(...)
    concurrent = ConcurrentKernels(
        _name="overlap_group", concurrent_kernels=[a, b],
        validate_joint_tiling_fn=validate_fn,
    )
    combine = combine_plan.plan(...)
    return KernelPipeline(_name="pipeline", stages=[concurrent, combine])
```
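The `validate_joint_tiling_fn` hook exists because concurrently resident kernels compete for per-SM resources, so tilings that are individually valid can be invalid together. A hypothetical validator (the budget, field names, and dict-based tilings are invented for illustration) might reject pairs whose combined shared-memory footprint overflows:

```python
SMEM_BUDGET_BYTES = 228 * 1024  # illustrative per-SM shared-memory budget (Hopper-class)


def validate_fn(tiling_a: dict, tiling_b: dict) -> bool:
    """Accept the pair only if both CTAs' shared-memory footprints fit on one SM together."""
    return tiling_a["smem_bytes"] + tiling_b["smem_bytes"] <= SMEM_BUDGET_BYTES
```

The same pattern extends to registers or CTA counts: any resource the overlapped kernels share is a candidate for a joint constraint.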
1. Clone an existing version with `python ./scripts/clone-kernel.py`.
2. Consult `docs/knowledge/` for implementation patterns.
3. Verify correctness with `--prof_type=disabled --check`.
4. Record new insights in `docs/knowledge/optimizations/` or `docs/knowledge/anti-patterns/`.
5. Record language-specific findings under `docs/knowledge/languages/<language>/...`.

Add to `docs/kernels/<kernel>.md` under `## Development log`:
```markdown
### V<N>: [Brief Description]

**Location**: `src/mla_var3/kernel/<lang_pkg>/mla/<kernel>/<kernel>_v<N>/`

**What changed**:
- [Bullet list of changes]

**High-level description of main code changes**:
- [Description of optimizations and how they relate to profiling insights]
```
Performance metrics, bottleneck analysis, issues, and insights are filled by the profiler agent after profiling.
Return results to the orchestrator in this format:
```markdown
## New Version: [kernel] [version]

### Changes Applied
1. [change + rationale]

### Files
- Created: [paths]
- Modified: [paths]

### Correctness: [PASS/FAIL]
### Devlog Entry Written: [path]
```
Related docs:

- `docs/knowledge/optimizations/`
- `docs/knowledge/anti-patterns/`
- `docs/kernels/<kernel>.md`