Investigate a PyTorch operator's implementation details: dispatch mechanism, structured vs unstructured, supported dtypes, CPU scalar/vector/CUDA kernels, and implementation notes for porting. Only use when the user explicitly types "/findop". Do NOT use automatically or proactively.
ONLY activate when the user explicitly types /findop <op_name>. Do NOT trigger automatically.
<op_name> is the ATen operator name (e.g. add, softmax, topk, silu, index_select).
PyTorch source: /share_data/tangcong/project/pytorch_v2.7.1
Key paths:
- aten/src/ATen/native/native_functions.yaml — op schema & dispatch table
- aten/src/ATen/native/ — high-level op implementations (.cpp)
- aten/src/ATen/native/cpu/ — CPU kernel implementations
- aten/src/ATen/native/cuda/ — CUDA kernel implementations
- aten/src/ATen/native/sparse/ — sparse implementations
- aten/src/ATen/native/quantized/ — quantized implementations
- torch/ — Python-level entry points

Search for the operator's schema entry:
grep -n "<op_name>" /share_data/tangcong/project/pytorch_v2.7.1/aten/src/ATen/native/native_functions.yaml
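`grep -n` shows only the matching line, but a yaml entry spans several lines. A sketch of pulling the whole block with awk, run here against a trimmed excerpt reconstructed from memory (field values are illustrative, not verbatim v2.7.1; substitute the real native_functions.yaml path):

```shell
# Illustrative excerpt only -- always read the real native_functions.yaml.
cat > /tmp/nf_excerpt.yaml <<'EOF'
- func: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  structured_delegate: add.out
  variants: function, method
  tags: [core, pointwise]
- func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  structured: True
  dispatch:
    CPU, CUDA: add_out
EOF
# Print everything from one '- func:' entry up to the next one:
awk '/^- func: add\.Tensor/{p=1} p && /^- func: add\.out/{p=0} p' /tmp/nf_excerpt.yaml
```

The same awk pattern works on the real file once you know which overload delimits the block you want.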
Read the matched section (typically 5-20 lines per op) to extract:
- func: the full signature, aten::<op>.<overload>(...) -> ...
- variants: function, method
- tags: pointwise, core, etc.

Present a summary table:
Op: aten::<op>
Signature: <func line>
Variants: function / method
Structured: Yes (structured_delegate: <base>) / No
Tags: [pointwise, ...]
Structured ops have one of:
- structured: True — this IS the base structured kernel
- structured_delegate: <op>.out — delegates to a structured out variant

Unstructured ops have neither field.
If structured, explain:
- meta() function computes output shape/dtype without touching the data
- impl() function runs the actual computation

Report:
Dispatch type: Structured kernel (base: <op>.out)
— meta() at: <path>
— impl() at: <path per backend>
OR
Dispatch type: Unstructured (traditional dispatch)
— Each backend registers its own full implementation
The dispatch: section shows which backends have implementations:
- CPU / CUDA / SparseCPU / SparseCUDA / QuantizedCPU etc.
- CompositeImplicitAutograd — auto-decomposes, no per-backend kernel
- CompositeExplicitAutograd — explicit composite, also no per-backend kernel

Search for dtype dispatch macros in the CPU implementation:
grep -n "AT_DISPATCH_\w*TYPES" <cpu_kernel_file>
Common patterns:
- AT_DISPATCH_ALL_TYPES — uint8..int64, float, double
- AT_DISPATCH_ALL_TYPES_AND — above + specified extras (BFloat16, Half, etc.)
- AT_DISPATCH_FLOATING_TYPES — float, double only
- AT_DISPATCH_FLOATING_TYPES_AND — float, double + extras
- AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES — float, double, cfloat, cdouble
- AT_DISPATCH_COMPLEX_TYPES — cfloat, cdouble only
- AT_DISPATCH_INTEGRAL_TYPES — uint8, int8, int16, int32, int64

Run the same macro search in the CUDA file; CUDA may support fewer or more types than CPU.
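The `_AND` / `_AND2` / `_AND3` suffix tells you how many extra dtypes are appended before the type argument. A quick way to see which macro a kernel uses is to grep out the macro name itself; shown here against a reconstructed (not verbatim) kernel excerpt:

```shell
# Hypothetical kernel excerpt, reconstructed from memory of the ATen pattern:
cat > /tmp/kernel_excerpt.cpp <<'EOF'
AT_DISPATCH_FLOATING_TYPES_AND2(kBFloat16, kHalf, iter.common_dtype(), "exp_cpu", [&]() {
    cpu_kernel_vec(iter, ...);
});
EOF
# Extract just the macro name; the AND2 suffix means 2 extra dtypes
# (here kBFloat16 and kHalf) on top of float/double:
grep -o 'AT_DISPATCH_[A-Z0-9_]*' /tmp/kernel_excerpt.cpp
```

On the real kernel file, the dtypes listed before the type argument are exactly the extras to record in the comparison table.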
Present a comparison table:
| dtype | CPU | CUDA |
|-------------|-----|------|
| float32 | Y | Y |
| float64 | Y | Y |
| bfloat16 | ? | ? |
| float16 | ? | ? |
| int8 | ? | ? |
| int16 | ? | ? |
| int32 | ? | ? |
| int64 | ? | ? |
| bool | ? | ? |
| complex64 | ? | ? |
| complex128 | ? | ? |
Show the full dispatch chain from Python call to kernel execution.
Find the Python binding (note: most native ops are exposed only through generated bindings in torch._C._VariableFunctions, so these greps can legitimately come up empty; the native_functions.yaml entry remains authoritative):
grep -rn "def <op_name>" /share_data/tangcong/project/pytorch_v2.7.1/torch/_refs/
grep -rn "<op_name>" /share_data/tangcong/project/pytorch_v2.7.1/torch/functional.py
From native_functions.yaml, determine:
- dispatch: has CPU: <func> → registered to the CPU dispatch key
- dispatch: has CUDA: <func> → registered to the CUDA dispatch key
- CompositeImplicitAutograd → decomposes into other ops

Many ops use DispatchStub (DECLARE_DISPATCH + DEFINE_DISPATCH + REGISTER_DISPATCH) for CPU/CUDA:
<Op>.cpp (declares stub) → DECLARE_DISPATCH(<fn_type>, <stub_name>)
cpu/<Op>Kernel.cpp → REGISTER_DISPATCH(<stub_name>, &<cpu_impl>)
cuda/<Op>Kernel.cu → REGISTER_DISPATCH(<stub_name>, &<cuda_impl>)
Search for the stub:
grep -rn "DECLARE_DISPATCH.*<op>" /share_data/tangcong/project/pytorch_v2.7.1/aten/src/ATen/native/
grep -rn "REGISTER_DISPATCH.*<op>" /share_data/tangcong/project/pytorch_v2.7.1/aten/src/ATen/native/
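Stub names conventionally end in `_stub`, which makes them easy to harvest from grep output. A reconstructed sketch of how the three macros line up for silu (file names and the function-pointer type are approximate, not verified against v2.7.1):

```shell
# Hypothetical excerpt showing the declare/define/register trio for one stub:
cat > /tmp/stub_excerpt.txt <<'EOF'
Activation.h:    DECLARE_DISPATCH(structured_activation_fn, silu_stub);
Activation.cpp:  DEFINE_DISPATCH(silu_stub);
cpu/Activation.cpp:   REGISTER_DISPATCH(silu_stub, &silu_kernel);
EOF
# Collect the distinct stub names mentioned in the excerpt:
grep -o '[a-z_]*_stub' /tmp/stub_excerpt.txt | sort -u
```

Running the same `grep -o ... | sort -u` over real grep results quickly confirms that the declaration, definition, and per-backend registrations all refer to the same stub.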
Present the full path:
torch.<op>(tensor)
→ aten::<op> (C++ dispatcher)
→ [CPU] <file.cpp>:<line> → DispatchStub → cpu/<file>Kernel.cpp:<kernel_func>
→ [CUDA] <file.cpp>:<line> → DispatchStub → cuda/<file>Kernel.cu:<kernel_func>
Read the CPU kernel file and identify:
- Does it use at::vec::Vectorized<scalar_t> for SIMD?
- Does it use at::vec::map, at::vec::map2, or explicit Vectorized operations (vec.exp(), vec.log(), custom formulas)?
- Does it use TensorIterator / TensorIteratorConfig?
- Which iteration pattern: unary_op, binary_op, reduce_op, nullary_op?
- Which config options: check_mem_overlap, allow_cpu_scalars, etc.?

Present:
CPU Implementation: <file>:<line_range>
TensorIterator: Yes/No (type: unary_op/binary_op/reduce_op/...)
Scalar kernel: <brief description of scalar computation>
Vectorized kernel: <brief description, which Vectorized ops used>
Special handling: <any edge cases, special dtype paths, etc.>
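Element-wise CPU kernels typically hand cpu_kernel_vec a scalar lambda plus a Vectorized lambda, so both paths are visible in one call site. A reconstructed (not verbatim) shape of that pattern, and a one-liner to confirm the SIMD path exists:

```shell
# Hypothetical excerpt of the usual cpu_kernel_vec call shape:
cat > /tmp/cpu_kernel_excerpt.cpp <<'EOF'
cpu_kernel_vec(iter,
    [](scalar_t a) -> scalar_t { return std::exp(a); },   // scalar path
    [](Vectorized<scalar_t> a) { return a.exp(); });      // SIMD path
EOF
# Count mentions of the SIMD wrapper type; 0 would mean scalar-only:
grep -c 'Vectorized' /tmp/cpu_kernel_excerpt.cpp
```

If a real kernel file shows only cpu_kernel (no _vec) or no Vectorized usage, record "Vectorized kernel: none" in the report.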
Read the CUDA kernel file and identify:
- gpu_kernel (element-wise via TensorIterator)
- gpu_reduce_kernel (reduction)
- custom kernel launch (<<<blocks, threads>>>)
- vectorized loads (float4, __half2)
- warp-level primitives (__shfl_down_sync, etc.)

Present:
CUDA Implementation: <file>:<line_range>
Launch pattern: gpu_kernel / gpu_reduce_kernel / custom<<<>>>
Kernel type: element-wise functor / block reduction / ...
Vectorized loads: Yes/No
Shared memory: Yes/No
Library calls: cuBLAS / cuDNN / none
Special: <any notable optimizations>
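Element-wise CUDA kernels usually go through gpu_kernel with a GPU_LAMBDA functor rather than a raw launch, so grepping for the launch helper names is the fastest classification step. A reconstructed (not verbatim) excerpt:

```shell
# Hypothetical excerpt of the common gpu_kernel call shape:
cat > /tmp/cuda_kernel_excerpt.cu <<'EOF'
gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t {
    return ::exp(a);
});
EOF
# Classify the launch pattern; no match here suggests a custom <<<>>> launch:
grep -o 'gpu_reduce_kernel\|gpu_kernel' /tmp/cuda_kernel_excerpt.cu
```

On the real .cu file, also grep for `<<<` and `__shfl` to catch custom launches and warp-level reductions that this helper search would miss.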
Based on the analysis above, provide concrete guidance for implementing this op in torch_sipu:
Map PyTorch's implementation pattern to SIPU's infrastructure:
| PyTorch pattern | SIPU equivalent |
|---|---|
| TensorIterator + unary/binary functor | Loops.suh / VecLoops.suh / TileLoops.suh |
| TensorIterator + reduction | Reduce.suh (vectorized_reduction) |
| Vectorized<scalar_t> | VectorizedM1 (Vec.suh) |
| Custom CUDA kernel | parallel_for + VectorizedM1 (Parallel.suh) |
| cuBLAS/cuDNN call | sikernel library or Triton backend |
State the operator category per the operator-dev skill:
Category: E1/E2/C/R1/R2/M/S/X
Recommended path: PATH-A / PATH-A-REDUCE / PATH-B / PATH-C / Triton
Check and report on each:
- Can it use the TORCH_SIPU_IMPL_FUNC pattern?
- Are there .out() or _() variants that need separate registration?

Output a concise report combining all findings:
═══════════════════════════════════════════════════════
PyTorch Operator Investigation Report: aten::<op>
═══════════════════════════════════════════════════════
1. Schema
<func signature from native_functions.yaml>
2. Dispatch Type
Structured / Unstructured
CompositeImplicitAutograd: Yes/No
3. Supported dtypes
CPU: [list]
CUDA: [list]
4. Dispatch Path
Python → C++ → [CPU] <path>
→ [CUDA] <path>
5. CPU Kernel
File: <path>
Pattern: TensorIterator + scalar/vec
Vectorized ops: <list>
6. CUDA Kernel
File: <path>
Pattern: gpu_kernel / custom
Key optimizations: <list>
7. SIPU Implementation Recommendation
Category: <E1/E2/C/R1/R2/M/S/X>
Path: <PATH-A/B/C/Triton>
Key notes:
- <note 1>
- <note 2>
- ...
═══════════════════════════════════════════════════════
Always derive dtype support from the actual AT_DISPATCH_* macros found in the source, not assumptions.