Structured workflow for SIPU operator development in torch_sipu. Use when implementing, modifying, fixing, or refactoring operators, for infrastructure changes that affect operator behavior, or for addressing MR review feedback from colleagues.
You are working in the torch_sipu repository — a PyTorch device backend extension for SIPU hardware. When the user asks you to add, modify, or refactor an operator or its underlying infrastructure, you MUST follow the steps below in order. Do not skip steps.
Before doing anything, classify the user's request into one of two scenarios and state it explicitly:
Scenario 1: User wants to implement, refactor, or modify an operator from scratch (or continue in-progress work).
Signals: "implement X", "refactor X", "add dtype support for X", "fix bug in X kernel", no MR mentioned.
Action: Start at Step 0 and follow every step in order.
Scenario 2: User has already submitted an MR and received review comments from a colleague that need to be addressed.
Signals: "reviewer said", "MR rejected", "MR comment", "CR feedback", "review feedback", "colleague said", paste of review comments.
Action: Skip Step 0 entirely. Jump directly to Step 8. (Branch already exists.)
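The two-way classification above can be sketched as a simple keyword check. This is illustrative only — the function name and the exact matching rule are assumptions, not part of the skill:

```python
# Illustrative sketch: classify a request as Scenario 1 (fresh/ongoing
# operator work) or Scenario 2 (MR review feedback). The signal list
# mirrors the Scenario 2 signals above; the function is hypothetical.
REVIEW_SIGNALS = (
    "reviewer said", "mr rejected", "mr comment", "cr feedback",
    "review feedback", "colleague said",
)

def classify_request(text: str) -> int:
    """Return 2 if the request looks like MR review feedback, else 1."""
    lowered = text.lower()
    if any(signal in lowered for signal in REVIEW_SIGNALS):
        return 2
    return 1

assert classify_request("implement hardtanh for bf16") == 1
assert classify_request("Reviewer said the YAML entry is wrong") == 2
```

A real classifier would also weigh context such as a pasted review thread, but a keyword pass is enough to decide whether to skip Step 0.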
Prerequisite: You MUST already be on a feature branch (enforced by Entry Branch Guard above).
Before starting development, check if a JIRA ticket exists. If not, ask the user whether to create one.
Reference: Read `.claude/skills/operator-dev/refs/jira-mr-automation.md` §1 for the JIRA ticket creation script.
| Change Type | Scope | CI Label |
|---|---|---|
| New Triton op | Python kernel + registration + tests | triton |
| New C++ op | .su/.cpp + YAML + tests | sikernel |
| New op (both backends) | Triton + C++ + tests | sikernel or triton |
| Refactor .cpp → .su | Replace .cpp with .su, update YAML | sikernel |
| Bug fix (kernel) | Modify existing kernel + add regression test | sikernel or triton |
| Bug fix (registration) | Fix YAML or dispatcher registration | aten |
| Infrastructure change | Headers (.suh), utilities, build | sikernel |
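The change-type table above can be captured as a lookup for scripting CI labels. The key names below are assumptions made for illustration; only the label values come from the table:

```python
# Illustrative mapping of change type -> CI label, mirroring the table
# above. The snake_case keys are hypothetical identifiers.
CI_LABELS = {
    "new_triton_op": "triton",
    "new_cpp_op": "sikernel",
    "refactor_cpp_to_su": "sikernel",
    "bugfix_kernel_cpp": "sikernel",
    "bugfix_kernel_triton": "triton",
    "bugfix_registration": "aten",   # YAML / dispatcher registration fixes
    "infrastructure": "sikernel",    # headers (.suh), utilities, build
}

def ci_label(change_type: str) -> str:
    """Return the CI label for a given change type."""
    return CI_LABELS[change_type]

assert ci_label("bugfix_registration") == "aten"
```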
Before writing any code:
First check whether the op is CompositeImplicitAutograd — if it auto-decomposes, do NOT register it unless you have a performance reason for a fused kernel:

```
python -c "import torch; print(torch._C._dispatch_dump('aten::<op>'))"
```
Then choose the registration mechanism:

- DispatchStub (`REGISTER_PRIVATEUSE1_DISPATCH`) — for TensorIterator-based ops
- Structured (`TORCH_SIPU_IMPL_FUNC`) — for complex ops needing custom shape setup

Search for existing implementations:

```
find torch_sipu/csrc/aten/native/sipu/ -name "*<Op>*"
find torch_sipu/backends/sipu_triton_kernels/ops/ -name "*<op>*"
grep -r "<op>" torch_sipu/csrc/aten/native/native_functions.yaml
grep -r "<op>" torch_sipu/backends/sipu_triton_kernels/__init__.py
```
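Deciding whether an op auto-decomposes amounts to scanning the dispatch dump for a CompositeImplicitAutograd entry. The helper below is a hypothetical sketch; the exact dump format is an assumption (real dumps print one `<DispatchKey>: <kernel>` entry per line):

```python
# Hypothetical helper: given the text printed by
# torch._C._dispatch_dump('aten::<op>'), decide whether the op already
# auto-decomposes via CompositeImplicitAutograd.
def auto_decomposes(dispatch_dump: str) -> bool:
    """True if a CompositeImplicitAutograd kernel is registered."""
    return any(
        line.strip().startswith("CompositeImplicitAutograd")
        for line in dispatch_dump.splitlines()
    )

sample = "name: aten::silu\nCompositeImplicitAutograd: registered at ..."
assert auto_decomposes(sample)
assert not auto_decomposes("name: aten::mm\nCPU: registered at ...")
```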
Reference: Read `.claude/skills/operator-dev/refs/dispatch-guide.md` for the complete dispatch mechanism decision tree, stub header reference, YAML entry patterns, and operator category quick reference.
Before making any edits, output a table of the files to modify and the files NOT to modify. Rule: if a file is not in this table, do not touch it.
Before writing any code, use the hardware optimization decision engine to determine the implementation strategy.
Reference: Read `.claude/skills/operator-dev/refs/hardware-optimization-guide.md` for the complete decision engine, hardware architecture summary, implementation pattern library, and similar-op reference matching.
| Category | Examples | Typical Path |
|---|---|---|
| E1: Unary element-wise | neg, sigmoid, silu, rsqrt | PATH-A: TensorIterator + Tile→RVV→Scalar |
| E2: Binary element-wise | add, mul, sub, div | PATH-A: with *_with_scalars* variants |
| C: Comparison | eq, ne, gt, ge | PATH-A: with CompareVec |
| R1: Simple reduction | sum, prod, any, all | PATH-A-REDUCE: Reduce.suh |
| R2: Compound reduction | softmax, layernorm, rmsnorm | PATH-B: parallel_for + VectorizedM1 |
| M: Matrix | mm, bmm, attention | PATH-C or Triton |
| S: Structural | cat, topk, sort | PATH-B or PATH-C |
| X: Custom SIPU | mm_t2t, flash_attention | PATH-C: sikernel library |
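The category table above reduces to a lookup from category code to typical execution path. A minimal sketch (the dict name is an assumption; the values are the table's "Typical Path" column, first option only):

```python
# Illustrative category -> typical execution path lookup, mirroring the
# table above. Where the table lists alternatives ("PATH-B or PATH-C"),
# only the first option is recorded here.
EXECUTION_PATHS = {
    "E1": "PATH-A",         # unary element-wise: TensorIterator cascade
    "E2": "PATH-A",         # binary element-wise, *_with_scalars* variants
    "C":  "PATH-A",         # comparison ops, CompareVec
    "R1": "PATH-A-REDUCE",  # simple reductions via Reduce.suh
    "R2": "PATH-B",         # compound reductions: parallel_for + VectorizedM1
    "M":  "PATH-C",         # matrix ops (or Triton)
    "S":  "PATH-B",         # structural ops (or PATH-C)
    "X":  "PATH-C",         # custom SIPU ops from the sikernel library
}

assert EXECUTION_PATHS["R1"] == "PATH-A-REDUCE"
```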
Output the selected strategy before proceeding, e.g.:

```
Op category: E1 (unary element-wise)
Execution path: PATH-A — TensorIterator + Tile→RVV→Scalar cascade
Precision: Standard M1→M2 widening for bf16/fp16
Reference impl: silu in UnaryOpsKernel.su (most similar)
```
Vec library reference: When implementing the vectorized path, consult `.claude/skills/operator-dev/refs/vec-library-guide.md` for the complete API reference.
Every new source file MUST start with the Apache v2.0 license header. Use the file's creation year and the correct comment syntax. Full template is in CLAUDE.md.
Reference: Read `.claude/skills/operator-dev/refs/triton-template.md` for the complete Triton template, decorator stack, preprocessing configs, and registration guide.
Key: Create `ops/<op>.py` → export in `ops/__init__.py` → register in `__init__.py` with `_sipu_lib_aten.impl()`.
Reference: Read `.claude/skills/operator-dev/refs/cpp-template.md` for Option A (DispatchStub `.su`), Option B (Structured kernel `.su`), and Option C (host-only `.cpp`).
Key points:

- Option A: DispatchStub — `REGISTER_PRIVATEUSE1_DISPATCH`
- Option B: Structured — `TORCH_SIPU_IMPL_FUNC` + `parallel_for` + `VectorizedM1`
- Option C: host-only `.cpp`
- Inference Backend: Do NOT implement backward. Register `AutogradPrivateUse1` ONLY for metadata ops (`to.dtype`, `type_as`) or in-place ops conflicting with autograd (`_index_put_impl_`).
Refactoring `.cpp` to `.su`: Read `.claude/skills/operator-dev/refs/cpp-to-su-migration.md` for the migration workflow and SoftMax example.
Triton-only changes (Python files) do NOT require a rebuild.
```
conda activate pytorch
source setup_sipu_sdk_env.sh
make install-dev      # Extension only
# OR
make install-all-dev  # Full rebuild (if third_party/sikernel changed)
```
Common failures: `CMAKE_SIPU_COMPILER` not found → forgot the SDK env script. `clang++` not found → forgot conda. Stale build → `rm -rf build/`.
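The failure-to-remedy mapping above can be scripted as a triage helper. The function and table below are hypothetical conveniences, not part of the build system:

```python
# Hypothetical triage helper mirroring the common build failures above:
# scan a build log for known error markers and suggest the usual fix.
FAILURE_HINTS = {
    "CMAKE_SIPU_COMPILER": "forgot `source setup_sipu_sdk_env.sh`",
    "clang++": "forgot `conda activate pytorch`",
}

def triage(build_log: str) -> str:
    for marker, hint in FAILURE_HINTS.items():
        if marker in build_log:
            return hint
    return "possibly a stale build — try `rm -rf build/` and rebuild"

assert "setup_sipu_sdk_env" in triage("CMAKE_SIPU_COMPILER not found")
```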
Reference: Read `.claude/skills/operator-dev/refs/performance-guide.md` for the performance checklist and anti-patterns. Also consult `.claude/skills/operator-dev/refs/hardware-optimization-guide.md` §5 for the deterministic validation checklist.
Key checks:

- Is `VectorizedM1` used in the hot loop, with a scalar tail?

State the performance level after review.
Reference: Read `.claude/skills/operator-dev/refs/test-template.md` for the complete test template and tolerance guidelines.
- Use `torch.testing.assert_close` — never `torch.equal` — for floating-point comparisons.
- Cover both `torch.float32` and `torch.bfloat16`.

```
CUDA_VISIBLE_DEVICES= PYTORCH_TESTING_DEVICE_ONLY_FOR=sipu pytest test/test_<op>.py -v
TRITON_KERNEL_VERIFY=1 python examples/run_<op>.py  # Triton only
```
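The per-dtype tolerance rule behind `assert_close` can be sketched in pure Python. The numbers below match `torch.testing.assert_close` defaults to my knowledge, but treat them as assumptions and take the authoritative values from the test-template reference:

```python
# Illustrative per-dtype tolerances and a pure-Python stand-in for the
# |actual - expected| <= atol + rtol * |expected| rule used by
# torch.testing.assert_close. Numbers are assumed defaults.
TOLERANCES = {
    "float32":  (1.3e-6, 1e-5),  # (rtol, atol)
    "bfloat16": (1.6e-2, 1e-5),
}

def is_close(actual: float, expected: float, dtype: str) -> bool:
    rtol, atol = TOLERANCES[dtype]
    return abs(actual - expected) <= atol + rtol * abs(expected)

assert is_close(1.0, 1.0 + 1e-7, "float32")
assert not is_close(1.0, 1.1, "float32")
assert is_close(1.0, 1.01, "bfloat16")  # bf16 tolerates coarser error
```

This is why the same kernel output can pass in bf16 yet fail in fp32: the tolerance envelope shrinks by four orders of magnitude.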
- List every file changed with a one-line description.
- Provide exact test commands (single file + full suite).
- Call out: dtype gaps, shape limitations, tolerance concerns, regression risk.
Follow the dev-workflow skill (Steps 3–6) for lint, commit, squash, push, and MR creation. Key points:
- `make lint` → `lintrunner -a --all-files` to auto-fix.
- Commit message format: `<type>(<scope>): jira#S1SW-XXXX <description>`. Ask the user for the Jira number if not provided.
- Run `/pr-review <commit_hash>` to catch issues early.
- MR automation: `.claude/skills/operator-dev/refs/jira-mr-automation.md` §2.

Reference: Read `.claude/skills/operator-dev/refs/mr-feedback-guide.md` for the complete review feedback handling workflow (comment classification, fix ordering, commit format, reviewer reply, CI re-trigger).
Quick summary: Classify comments → fix in order (bugs → tests → performance → style → design) → fixup commit → reply to reviewer → push → comment test CI.
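The fix-ordering rule in the summary above (bugs → tests → performance → style → design) can be sketched as a sort over classified comments. The category names and the pair representation are assumptions for illustration:

```python
# Sketch of the review-feedback fix ordering: classified comments are
# sorted so correctness issues are addressed before cosmetic ones.
# Category strings and the (category, text) pair shape are hypothetical.
FIX_ORDER = ["bug", "test", "performance", "style", "design"]

def order_fixes(comments: list[tuple[str, str]]) -> list[str]:
    """Return comment texts sorted into the fix order above."""
    ranked = sorted(comments, key=lambda c: FIX_ORDER.index(c[0]))
    return [text for _, text in ranked]

comments = [
    ("style", "rename kernel lambda"),
    ("bug", "bf16 path overflows in M1"),
    ("test", "add regression test for empty tensor"),
]
assert order_fixes(comments)[0] == "bf16 path overflows in M1"
```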
```
# Triton backend ops (one file per op)
torch_sipu/backends/sipu_triton_kernels/ops/*.py
torch_sipu/backends/sipu_triton_kernels/ops/__init__.py   # exports
torch_sipu/backends/sipu_triton_kernels/__init__.py       # dispatcher registration

# AI backend ops
torch_sipu/backends/AI/ops/*.py
torch_sipu/backends/AI/__init__.py                        # dispatcher registration

# C++ kernel ops — joint compilation (.su) and host-only (.cpp)
torch_sipu/csrc/aten/native/sipu/*.su                     # joint compilation kernels (modern)
torch_sipu/csrc/aten/native/sipu/*.cpp                    # host-only C++ kernels
torch_sipu/csrc/aten/native/native_functions.yaml         # C++ dispatch registration
torch_sipu/csrc/aten/native/ext_native_functions.yaml     # extension ops (custom ops not in ATen)

# C++ infrastructure headers (.suh) — shared by many ops
torch_sipu/csrc/aten/native/sipu/Loops.suh                # scalar element-wise loops (sipu_kernel)
torch_sipu/csrc/aten/native/sipu/VecLoops.suh             # vectorized loops (sipu_kernel_vec)
torch_sipu/csrc/aten/native/sipu/TileLoops.suh            # tiled loops (sipu_kernel_tile)
torch_sipu/csrc/aten/native/sipu/Reduce.suh               # reduction utilities (vectorized_reduction)
torch_sipu/csrc/aten/native/sipu/Parallel.suh             # parallel execution (parallel_for, invoke_parallel)
torch_sipu/csrc/aten/native/sipu/Vec.suh                  # vector type utilities
torch_sipu/csrc/aten/native/sipu/Tile.suh                 # tile type utilities

# Triton op utilities
torch_sipu/backends/sipu_triton_kernels/ops/utils.py      # cpu_fallback, precheck_supported_dtypes, request_fallback
torch_sipu/backends/sipu_triton_kernels/ops/verify_decorator.py         # @sipu_verify
torch_sipu/backends/sipu_triton_kernels/ops/preprocessing_framework.py  # @triton_preprocess, *_OP_CONFIG

# Test utilities
torch_sipu/testing/_internal/triton_utils.py              # skipIfUseSipuTritonKernels, onlySipuTritonKernels
torch_sipu/testing/_internal/common_utils.py

# Tests
test/test_*.py

# Examples
examples/run_*.py
```