Use this skill when adding a new optimized kernel or operator to veomni/ops/. Covers the full lifecycle: understanding VeOmni's ops architecture (monkey-patch + global function pointer pattern), implementing the kernel, registering it, adding tests, and documenting it. Trigger: 'add op', 'new kernel', 'add attention variant', 'new fused op', 'add triton kernel', 'optimize operator'.
Before starting, read:

- `.agents/knowledge/constraints.md` — especially rules about NPU guards (#19, #20).
- `docs/design/kernel_selection.md` — understand the kernel lifecycle and selection mechanisms.

VeOmni ops use a global function pointer + monkey-patch pattern:
```
veomni/ops/<op_name>/
├── __init__.py      # Public API function + apply_veomni_*_patch()
├── <impl_a>.py      # Implementation A (e.g., triton kernel)
├── <impl_b>.py      # Implementation B (e.g., eager PyTorch fallback)
└── <npu_impl>.py    # NPU variant (optional)
```
Key pattern — each op module defines:

- A global function pointer (e.g., `_fused_moe_forward = None`) — starts as `None`.
- A public API function (e.g., `fused_moe_forward()`) — dispatches through the pointer.
- A patch function (e.g., `apply_veomni_fused_moe_patch()`) — binds the pointer to a concrete implementation at runtime.

Patch functions are called from:

- `veomni/ops/__init__.py` → `apply_ops_patch()` (import time, for attention/loss/load-balancing)
- `veomni/models/auto.py` → `build_foundation_model()` (model build time, for MoE)

| Op | Directory | Patch time | Implementations |
|---|---|---|---|
| Flash Attention (FA2/3/4 + SP) | flash_attn/ | import | transformers FA, with sequence parallel wrappers |
| Cross-Entropy Loss | fused_cross_entropy/ | import | eager (PyTorch), liger_kernel (fused) |
| Load Balancing Loss | fused_load_balancing_loss/ | import | torch_native, triton_kernel |
| Fused MoE | fused_moe/ | model build | group_gemm (triton), quack_gemm (CUTLASS), npu_group_gemm |
| Group GEMM | group_gemm/ | N/A (library) | triton kernels + benchmark utils |
| Batch Invariant Ops | batch_invariant_ops/ | N/A (utility) | numerical stability helpers |
| DiT RoPE | dit/rope_wan/ | N/A (direct import) | Wan rotary embedding |
| NPU Patches | npu_patch/ | conditional | hccl_premul_sum, npu_fused_operator |
Determine op category:

- Patch timing: import time via `apply_ops_patch()`, or model build time via `build_foundation_model()`.
- Whether an `is_torch_npu_available()` guard is needed.

Decide selection mechanism: read `docs/design/kernel_selection.md` to determine if you need:

- a new field in `OpsImplementationConfig` (`veomni/arguments/arguments_types.py`)

Determine patch timing:

- Import-time ops: call the patch from `apply_ops_patch()` in `veomni/ops/__init__.py`.
- Model-build ops: call the patch from `build_foundation_model()`.

Create the op directory: `veomni/ops/<op_name>/`
Write `__init__.py` following the pattern:

```python
from transformers.utils import is_torch_npu_available  # or the project's own helper

_my_op = None  # global function pointer


def my_op(*args, **kwargs):
    """Public API — dispatches through the pointer."""
    if _my_op is None:
        raise NotImplementedError("my_op is not patched; call apply_veomni_my_op_patch() first")
    return _my_op(*args, **kwargs)


def apply_veomni_my_op_patch():
    """Bind the function pointer to a concrete implementation."""
    global _my_op
    if is_torch_npu_available():
        from .npu_impl import npu_my_op
        _my_op = npu_my_op
    else:
        from .default_impl import default_my_op
        _my_op = default_my_op
```
Write implementations in separate files (e.g., triton_impl.py, eager.py, npu_impl.py).
Register in the ops system:

- Add an `(alias, function_pointer)` tuple to `build_ALL_OPS()` in `veomni/ops/__init__.py`.
- Import-time ops: call `apply_veomni_*_patch()` from `apply_ops_patch()`.
- Model-build ops: call the patch from `build_foundation_model()` in `veomni/models/auto.py`.

NPU support:

- Guard NPU-specific code with `is_torch_npu_available()`.
- Keep NPU implementations in their own file (e.g., `npu_impl.py`).
- Put NPU patches in `veomni/ops/npu_patch/` if they are general-purpose.

Add unit tests to `tests/ops/`.
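A parity-style test is the usual shape for kernel tests: run the optimized implementation against the eager reference and compare within a tolerance. A pure-Python sketch of that structure (real tests in `tests/ops/` would compare torch tensors, e.g. with `torch.testing.assert_close`; the GELU functions here are just stand-ins):

```python
import math
import random


def eager_gelu(x):
    """Exact GELU — stands in for the eager reference implementation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def fast_gelu(x):
    """tanh-approximation GELU — stands in for an optimized kernel."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def test_parity():
    random.seed(0)
    for _ in range(200):
        x = random.uniform(-4.0, 4.0)
        # Optimized kernels are rarely bit-exact: compare within a tolerance,
        # not with strict equality.
        assert abs(eager_gelu(x) - fast_gelu(x)) < 1e-2


test_parity()
print("parity ok")
```

Cover every implementation branch (triton, eager, NPU where available) so a regression in one backend cannot hide behind another.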
Add a benchmark (optional but recommended for performance-critical ops):

- Use `veomni/ops/group_gemm/utils/benchmark_utils.py` as a reference.

Run: `pytest tests/ops/ -v`
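The basic benchmark shape is warmup-then-time. A minimal CPU-only sketch of that shape (real GPU benchmarks must synchronize the device around timing — use the project's `benchmark_utils.py` rather than this):

```python
import time


def bench(fn, *args, warmup=3, iters=20):
    """Average wall-clock seconds per call, after a warmup phase."""
    for _ in range(warmup):  # warmup excludes one-time costs (JIT, caches)
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters


data = list(range(10_000, 0, -1))
avg = bench(sorted, data)  # sorted() is a stand-in for the op under test
print(f"sorted: {avg * 1e6:.1f} us/iter")
```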
Update `docs/design/kernel_selection.md` to cover the new op and its selection mechanism.
Update .agents/knowledge/architecture.md if the op adds a new subdirectory to veomni/ops/.
Final checks:

- Run the `/veomni-review` skill.
- Run `make quality`.
- Verify `build_ALL_OPS()` and `format_kernel_functions()` include the new op.

Common pitfalls:

- Forgetting to register in `build_ALL_OPS()`: the op will work but won't appear in the ops format output, making debugging harder.
- A missing `is_torch_npu_available()` guard crashes on GPU-only environments.
- Sequence parallelism: use `get_parallel_state().sp_enabled` to check and dispatch accordingly.
- Missing `__all__` export: if the op provides a public API function, export it from `veomni/ops/__init__.py`.