Use this skill when adding a new optimized kernel or operator to veomni/ops/. Covers the full lifecycle: understanding VeOmni's ops architecture (monkey-patch + global function pointer pattern), implementing the kernel, registering it, adding tests, and documenting it. Trigger: 'add op', 'new kernel', 'add attention variant', 'new fused op', 'add triton kernel', 'optimize operator'.
Before starting, read:

- `.agents/knowledge/constraints.md` — especially rules about NPU guards (#19, #20).
- `docs/design/kernel_selection.md` — understand the kernel lifecycle and selection mechanisms.

VeOmni ops use a global function pointer + monkey-patch pattern:
```
veomni/ops/<op_name>/
├── __init__.py      # Public API function + apply_veomni_*_patch()
├── <impl_a>.py      # Implementation A (e.g., triton kernel)
├── <impl_b>.py      # Implementation B (e.g., eager PyTorch fallback)
└── <npu_impl>.py    # NPU variant (optional)
```
Key pattern — each op module defines:

- A global function pointer (e.g., `_fused_moe_forward = None`) — starts as `None`.
- A public API function (e.g., `fused_moe_forward()`) — dispatches through the pointer.
- A patch function (e.g., `apply_veomni_fused_moe_patch()`) — binds the pointer to a concrete implementation at runtime.

Patch functions are called from:

- `veomni/ops/__init__.py` → `apply_ops_patch()` (import time, for attention/loss/load-balancing)
- `veomni/models/auto.py` → `build_foundation_model()` (model build time, for MoE)

| Op | Directory | Patch time | Implementations |
|---|---|---|---|
| Flash Attention (FA2/3/4 + SP) | flash_attn/ | import | transformers FA, with sequence parallel wrappers |
| Cross-Entropy Loss | fused_cross_entropy/ | import | eager (PyTorch), liger_kernel (fused) |
| Load Balancing Loss | fused_load_balancing_loss/ | import | torch_native, triton_kernel |
| Fused MoE | fused_moe/ | model build | group_gemm (triton), quack_gemm (CUTLASS), npu_group_gemm |
| Group GEMM | group_gemm/ | N/A (library) | triton kernels + benchmark utils |
| Batch Invariant Ops | batch_invariant_ops/ | N/A (utility) | numerical stability helpers |
| DiT RoPE | dit/rope_wan/ | N/A (direct import) | Wan rotary embedding |
| NPU Patches | npu_patch/ | conditional | hccl_premul_sum, npu_fused_operator |
Determine op category:

- Patch timing: import time via `apply_ops_patch()`, or model build time via `build_foundation_model()`.
- Whether an `is_torch_npu_available()` guard is needed.

Decide selection mechanism: read `docs/design/kernel_selection.md` to determine if you need:

- a new field in `OpsImplementationConfig` (`veomni/arguments/arguments_types.py`)

Determine patch timing:

- Import-time ops: call the patch from `apply_ops_patch()` in `veomni/ops/__init__.py`.
- Model-build ops: call the patch from `build_foundation_model()`.

Create the op directory: `veomni/ops/<op_name>/`
Write `__init__.py` following the pattern:

```python
from transformers.utils import is_torch_npu_available  # or the project's own helper

_my_op = None  # global function pointer


def my_op(*args, **kwargs):
    """Public API — dispatches through the pointer."""
    if _my_op is None:
        raise NotImplementedError("my_op is not patched; call apply_veomni_my_op_patch() first")
    return _my_op(*args, **kwargs)


def apply_veomni_my_op_patch():
    """Bind the function pointer to a concrete implementation."""
    global _my_op
    if is_torch_npu_available():
        from .npu_impl import npu_my_op
        _my_op = npu_my_op
    else:
        from .default_impl import default_my_op
        _my_op = default_my_op
```
Write implementations in separate files (e.g., triton_impl.py, eager.py, npu_impl.py).
Register in the ops system:

- Add an `(alias, function_pointer)` tuple to `build_ALL_OPS()` in `veomni/ops/__init__.py`.
- Import-time ops: call `apply_veomni_*_patch()` from `apply_ops_patch()`.
- Model-build ops: call the patch from `build_foundation_model()` in `veomni/models/auto.py`.

NPU support:

- Guard NPU-specific code with `is_torch_npu_available()`.
- Keep NPU implementations in their own file (e.g., `npu_impl.py`).
- Put NPU patches in `veomni/ops/npu_patch/` if they are general-purpose.

Add unit tests to `tests/ops/`.
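A parity-style test is the usual shape for kernel tests: run the optimized implementation against the eager reference and compare within a tolerance. A pure-Python sketch of that structure (real tests in `tests/ops/` would compare torch tensors, e.g. with `torch.testing.assert_close`; the GELU functions here are just stand-ins):

```python
import math
import random


def eager_gelu(x):
    """Exact GELU — stands in for the eager reference implementation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def fast_gelu(x):
    """tanh-approximation GELU — stands in for an optimized kernel."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def test_parity():
    random.seed(0)
    for _ in range(200):
        x = random.uniform(-4.0, 4.0)
        # Optimized kernels are rarely bit-exact: compare within a tolerance,
        # not with strict equality.
        assert abs(eager_gelu(x) - fast_gelu(x)) < 1e-2


test_parity()
print("parity ok")
```

Cover every implementation branch (triton, eager, NPU where available) so a regression in one backend cannot hide behind another.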
Add a benchmark (optional but recommended for performance-critical ops):

- Use `veomni/ops/group_gemm/utils/benchmark_utils.py` as a reference.

Run: `pytest tests/ops/ -v`
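The basic benchmark shape is warmup-then-time. A minimal CPU-only sketch of that shape (real GPU benchmarks must synchronize the device around timing — use the project's `benchmark_utils.py` rather than this):

```python
import time


def bench(fn, *args, warmup=3, iters=20):
    """Average wall-clock seconds per call, after a warmup phase."""
    for _ in range(warmup):  # warmup excludes one-time costs (JIT, caches)
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters


data = list(range(10_000, 0, -1))
avg = bench(sorted, data)  # sorted() is a stand-in for the op under test
print(f"sorted: {avg * 1e6:.1f} us/iter")
```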
Update `docs/design/kernel_selection.md` to cover the new op and its selection mechanism.
Update .agents/knowledge/architecture.md if the op adds a new subdirectory to veomni/ops/.
Final checks:

- Run the `/veomni-review` skill.
- Run `make quality`.
- Verify `build_ALL_OPS()` and `format_kernel_functions()` include the new op.

Common pitfalls:

- Forgetting to register in `build_ALL_OPS()`: the op will work but won't appear in the ops format output, making debugging harder.
- A missing `is_torch_npu_available()` guard crashes on GPU-only environments.
- Sequence parallelism: use `get_parallel_state().sp_enabled` to check and dispatch accordingly.
- Missing `__all__` export: if the op provides a public API function, export it from `veomni/ops/__init__.py`.