Use this skill when the user wants to generate or optimize high-performance GPU kernels using languages like Triton or native C++/CUDA. This is applicable for requests like 'write a fast Triton kernel for matrix operations', 'optimize my GPU code for throughput', 'create a custom operator that outperforms the standard library', or 'debug a memory error in a parallel kernel'. It is specifically triggered when performance bottlenecks (latency/bandwidth) or hardware-specific optimizations (register pressure/memory coalescing) are the primary concerns rather than general Python logic.
[Case 1]
The source index for each output element is computed as X + rows * M + (M - 1 - cols). By reading from the flipped source positions and writing directly to the destination, the kernel avoids intermediate register storage and reduces memory-bandwidth requirements. Finally, it uses tiled masks to handle boundary conditions for arbitrary tensor sizes, ensuring the kernel is robust as well as fast.

[Case 2]
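The index arithmetic from Case 1 can be checked on the host side. Below is a minimal NumPy sketch (the function name is illustrative, not from the source; the real kernel performs this gather on-device in Triton or CUDA rather than through NumPy fancy indexing):

```python
import numpy as np

def fused_fliplr(x):
    """Flip the columns of a 2D array by gathering from flipped source
    positions, mirroring the kernel's index formula:
    src = base + row * M + (M - 1 - col)."""
    n, m = x.shape
    flat = x.ravel()                 # row-major view; base offset X is 0 here
    rows = np.arange(n)[:, None]     # shape (N, 1), broadcasts over columns
    cols = np.arange(m)[None, :]     # shape (1, M), broadcasts over rows
    src = rows * m + (m - 1 - cols)  # flipped source linear indices
    return flat[src]                 # gather straight into the output layout
```

For example, `fused_fliplr(np.arange(6).reshape(2, 3))` produces the same result as `np.fliplr` on that input; a device kernel would additionally mask the tail tiles so out-of-range rows and columns are never loaded or stored.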
To synthesize data for this capability, strictly follow the three-phase pipeline below. Do not invent or skip steps. Read the corresponding reference file for each phase in order:
Phase 1: Environment Exploration
Read the exploration guidelines to discover raw knowledge seeds:
references/EXPLORATION.md
Phase 2: Trajectory Selection
Once Phase 1 is complete, read the selection criteria to evaluate the trajectory:
references/SELECTION.md
Phase 3: Data Synthesis
Once a trajectory passes Phase 2, read the synthesis instructions to generate the final data:
references/SYNTHESIS.md