Investigate a PyTorch operator's implementation details: dispatch mechanism, structured vs unstructured, supported dtypes, CPU scalar/vector/CUDA kernels, and implementation notes for porting. Only use when the user explicitly types "/findop". Do NOT use automatically or proactively.
ONLY activate when the user explicitly types /findop <op_name>. Do NOT trigger automatically.
<op_name> is the ATen operator name (e.g. add, softmax, topk, silu, index_select).
PyTorch source: /share_data/tangcong/project/pytorch_v2.7.1
Key paths:
- aten/src/ATen/native/native_functions.yaml — op schema & dispatch table
- aten/src/ATen/native/ — high-level op implementations (.cpp)
- aten/src/ATen/native/cpu/ — CPU kernel implementations
- aten/src/ATen/native/cuda/ — CUDA kernel implementations
- aten/src/ATen/native/sparse/ — sparse implementations
- aten/src/ATen/native/quantized/ — quantized implementations
- torch/ — Python-level entry points

Search for the operator's schema entry:
grep -n "<op_name>" /share_data/tangcong/project/pytorch_v2.7.1/aten/src/ATen/native/native_functions.yaml
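`grep -n` shows only the matching line, but a yaml entry spans several lines. A sketch of pulling the whole block with awk, run here against a trimmed excerpt reconstructed from memory (field values are illustrative, not verbatim v2.7.1; substitute the real native_functions.yaml path):

```shell
# Illustrative excerpt only -- always read the real native_functions.yaml.
cat > /tmp/nf_excerpt.yaml <<'EOF'
- func: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  structured_delegate: add.out
  variants: function, method
  tags: [core, pointwise]
- func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  structured: True
  dispatch:
    CPU, CUDA: add_out
EOF
# Print everything from one '- func:' entry up to the next one:
awk '/^- func: add\.Tensor/{p=1} p && /^- func: add\.out/{p=0} p' /tmp/nf_excerpt.yaml
```

The same awk pattern works on the real file once you know which overload delimits the block you want.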
Read the matched section (typically 5-20 lines per op) to extract:
- func: the full signature, aten::<op>.<overload>(...) -> ...
- variants: function, method
- tags: pointwise, core, etc.

Present a summary table:
Op: aten::<op>
Signature: <func line>
Variants: function / method
Structured: Yes (structured_delegate: <base>) / No
Tags: [pointwise, ...]
Structured ops have one of:
- structured: True — this IS the base structured kernel
- structured_delegate: <op>.out — delegates to a structured out variant

Unstructured ops have neither field.
If structured, explain:
- meta() function computes output shape/dtype without touching the data
- impl() function runs the actual computation

Report:
Dispatch type: Structured kernel (base: <op>.out)
— meta() at: <path>
— impl() at: <path per backend>
OR
Dispatch type: Unstructured (traditional dispatch)
— Each backend registers its own full implementation
The dispatch: section shows which backends have implementations:
- CPU / CUDA / SparseCPU / SparseCUDA / QuantizedCPU etc.
- CompositeImplicitAutograd — auto-decomposes, no per-backend kernel
- CompositeExplicitAutograd — explicit composite, also no per-backend kernel

Search for dtype dispatch macros in the CPU implementation:
grep -n "AT_DISPATCH_\w*TYPES" <cpu_kernel_file>
Common patterns:
- AT_DISPATCH_ALL_TYPES — uint8..int64, float, double
- AT_DISPATCH_ALL_TYPES_AND — above + specified extras (BFloat16, Half, etc.)
- AT_DISPATCH_FLOATING_TYPES — float, double only
- AT_DISPATCH_FLOATING_TYPES_AND — float, double + extras
- AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES — float, double, cfloat, cdouble
- AT_DISPATCH_COMPLEX_TYPES — cfloat, cdouble only
- AT_DISPATCH_INTEGRAL_TYPES — uint8, int8, int16, int32, int64

Run the same macro search in the CUDA file; CUDA may support fewer or more types than CPU.
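The `_AND` / `_AND2` / `_AND3` suffix tells you how many extra dtypes are appended before the type argument. A quick way to see which macro a kernel uses is to grep out the macro name itself; shown here against a reconstructed (not verbatim) kernel excerpt:

```shell
# Hypothetical kernel excerpt, reconstructed from memory of the ATen pattern:
cat > /tmp/kernel_excerpt.cpp <<'EOF'
AT_DISPATCH_FLOATING_TYPES_AND2(kBFloat16, kHalf, iter.common_dtype(), "exp_cpu", [&]() {
    cpu_kernel_vec(iter, ...);
});
EOF
# Extract just the macro name; the AND2 suffix means 2 extra dtypes
# (here kBFloat16 and kHalf) on top of float/double:
grep -o 'AT_DISPATCH_[A-Z0-9_]*' /tmp/kernel_excerpt.cpp
```

On the real kernel file, the dtypes listed before the type argument are exactly the extras to record in the comparison table.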
Present a comparison table:
| dtype | CPU | CUDA |
|-------------|-----|------|
| float32 | Y | Y |
| float64 | Y | Y |
| bfloat16 | ? | ? |
| float16 | ? | ? |
| int8 | ? | ? |
| int16 | ? | ? |
| int32 | ? | ? |
| int64 | ? | ? |
| bool | ? | ? |
| complex64 | ? | ? |
| complex128 | ? | ? |
Show the full dispatch chain from Python call to kernel execution.
Find the Python binding (note: most native ops are exposed only through generated bindings in torch._C._VariableFunctions, so these greps can legitimately come up empty; the native_functions.yaml entry remains authoritative):
grep -rn "def <op_name>" /share_data/tangcong/project/pytorch_v2.7.1/torch/_refs/
grep -rn "<op_name>" /share_data/tangcong/project/pytorch_v2.7.1/torch/functional.py
From native_functions.yaml, determine:
- dispatch: has CPU: <func> → registered to the CPU dispatch key
- dispatch: has CUDA: <func> → registered to the CUDA dispatch key
- CompositeImplicitAutograd → decomposes into other ops

Many ops use DispatchStub (DECLARE_DISPATCH + DEFINE_DISPATCH + REGISTER_DISPATCH) for CPU/CUDA:
<Op>.cpp (declares stub) → DECLARE_DISPATCH(<fn_type>, <stub_name>)
cpu/<Op>Kernel.cpp → REGISTER_DISPATCH(<stub_name>, &<cpu_impl>)
cuda/<Op>Kernel.cu → REGISTER_DISPATCH(<stub_name>, &<cuda_impl>)
Search for the stub:
grep -rn "DECLARE_DISPATCH.*<op>" /share_data/tangcong/project/pytorch_v2.7.1/aten/src/ATen/native/
grep -rn "REGISTER_DISPATCH.*<op>" /share_data/tangcong/project/pytorch_v2.7.1/aten/src/ATen/native/
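Stub names conventionally end in `_stub`, which makes them easy to harvest from grep output. A reconstructed sketch of how the three macros line up for silu (file names and the function-pointer type are approximate, not verified against v2.7.1):

```shell
# Hypothetical excerpt showing the declare/define/register trio for one stub:
cat > /tmp/stub_excerpt.txt <<'EOF'
Activation.h:    DECLARE_DISPATCH(structured_activation_fn, silu_stub);
Activation.cpp:  DEFINE_DISPATCH(silu_stub);
cpu/Activation.cpp:   REGISTER_DISPATCH(silu_stub, &silu_kernel);
EOF
# Collect the distinct stub names mentioned in the excerpt:
grep -o '[a-z_]*_stub' /tmp/stub_excerpt.txt | sort -u
```

Running the same `grep -o ... | sort -u` over real grep results quickly confirms that the declaration, definition, and per-backend registrations all refer to the same stub.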
Present the full path:
torch.<op>(tensor)
→ aten::<op> (C++ dispatcher)
→ [CPU] <file.cpp>:<line> → DispatchStub → cpu/<file>Kernel.cpp:<kernel_func>
→ [CUDA] <file.cpp>:<line> → DispatchStub → cuda/<file>Kernel.cu:<kernel_func>
Read the CPU kernel file and identify:
- Does it use at::vec::Vectorized<scalar_t> for SIMD?
- Does it use at::vec::map, at::vec::map2, or explicit Vectorized operations (vec.exp(), vec.log(), custom formulas)?
- Does it use TensorIterator / TensorIteratorConfig?
- Which iteration pattern: unary_op, binary_op, reduce_op, nullary_op?
- Which config options: check_mem_overlap, allow_cpu_scalars, etc.?

Present:
CPU Implementation: <file>:<line_range>
TensorIterator: Yes/No (type: unary_op/binary_op/reduce_op/...)
Scalar kernel: <brief description of scalar computation>
Vectorized kernel: <brief description, which Vectorized ops used>
Special handling: <any edge cases, special dtype paths, etc.>
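Element-wise CPU kernels typically hand cpu_kernel_vec a scalar lambda plus a Vectorized lambda, so both paths are visible in one call site. A reconstructed (not verbatim) shape of that pattern, and a one-liner to confirm the SIMD path exists:

```shell
# Hypothetical excerpt of the usual cpu_kernel_vec call shape:
cat > /tmp/cpu_kernel_excerpt.cpp <<'EOF'
cpu_kernel_vec(iter,
    [](scalar_t a) -> scalar_t { return std::exp(a); },   // scalar path
    [](Vectorized<scalar_t> a) { return a.exp(); });      // SIMD path
EOF
# Count mentions of the SIMD wrapper type; 0 would mean scalar-only:
grep -c 'Vectorized' /tmp/cpu_kernel_excerpt.cpp
```

If a real kernel file shows only cpu_kernel (no _vec) or no Vectorized usage, record "Vectorized kernel: none" in the report.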
Read the CUDA kernel file and identify:
- gpu_kernel (element-wise via TensorIterator)
- gpu_reduce_kernel (reduction)
- custom kernel launch (<<<blocks, threads>>>)
- vectorized loads (float4, __half2)
- warp-level primitives (__shfl_down_sync, etc.)

Present:
CUDA Implementation: <file>:<line_range>
Launch pattern: gpu_kernel / gpu_reduce_kernel / custom<<<>>>
Kernel type: element-wise functor / block reduction / ...
Vectorized loads: Yes/No
Shared memory: Yes/No
Library calls: cuBLAS / cuDNN / none
Special: <any notable optimizations>
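Element-wise CUDA kernels usually go through gpu_kernel with a GPU_LAMBDA functor rather than a raw launch, so grepping for the launch helper names is the fastest classification step. A reconstructed (not verbatim) excerpt:

```shell
# Hypothetical excerpt of the common gpu_kernel call shape:
cat > /tmp/cuda_kernel_excerpt.cu <<'EOF'
gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t {
    return ::exp(a);
});
EOF
# Classify the launch pattern; no match here suggests a custom <<<>>> launch:
grep -o 'gpu_reduce_kernel\|gpu_kernel' /tmp/cuda_kernel_excerpt.cu
```

On the real .cu file, also grep for `<<<` and `__shfl` to catch custom launches and warp-level reductions that this helper search would miss.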
Based on the analysis above, provide concrete guidance for implementing this op in torch_sipu:
Map PyTorch's implementation pattern to SIPU's infrastructure:
| PyTorch pattern | SIPU equivalent |
|---|---|
| TensorIterator + unary/binary functor | Loops.suh / VecLoops.suh / TileLoops.suh |
| TensorIterator + reduction | Reduce.suh (vectorized_reduction) |
| Vectorized<scalar_t> | VectorizedM1 (Vec.suh) |
| Custom CUDA kernel | parallel_for + VectorizedM1 (Parallel.suh) |
| cuBLAS/cuDNN call | sikernel library or Triton backend |
State the operator category per the operator-dev skill:
Category: E1/E2/C/R1/R2/M/S/X
Recommended path: PATH-A / PATH-A-REDUCE / PATH-B / PATH-C / Triton
Check and report on each:
- Can it use the TORCH_SIPU_IMPL_FUNC pattern?
- Are there .out() or _() variants that need separate registration?

Output a concise report combining all findings:
═══════════════════════════════════════════════════════
PyTorch Operator Investigation Report: aten::<op>
═══════════════════════════════════════════════════════
1. Schema
<func signature from native_functions.yaml>
2. Dispatch Type
Structured / Unstructured
CompositeImplicitAutograd: Yes/No
3. Supported dtypes
CPU: [list]
CUDA: [list]
4. Dispatch Path
Python → C++ → [CPU] <path>
→ [CUDA] <path>
5. CPU Kernel
File: <path>
Pattern: TensorIterator + scalar/vec
Vectorized ops: <list>
6. CUDA Kernel
File: <path>
Pattern: gpu_kernel / custom
Key optimizations: <list>
7. SIPU Implementation Recommendation
Category: <E1/E2/C/R1/R2/M/S/X>
Path: <PATH-A/B/C/Triton>
Key notes:
- <note 1>
- <note 2>
- ...
═══════════════════════════════════════════════════════
Always derive dtype support from the actual AT_DISPATCH_* macros found in the source, not assumptions.