Use this skill when the user wants to generate or optimize high-performance GPU kernels using languages like Triton or native C++/CUDA. This is applicable for requests like 'write a fast Triton kernel for matrix operations', 'optimize my GPU code for throughput', 'create a custom operator that outperforms the standard library', or 'debug a memory error in a parallel kernel'. It is specifically triggered when performance bottlenecks (latency/bandwidth) or hardware-specific optimizations (register pressure/memory coalescing) are the primary concerns rather than general Python logic.
[Case 1]
The source index for each output element is computed as X + rows * M + (M - 1 - cols). By reading from the flipped source positions and writing directly to the destination, the kernel avoids intermediate register storage and reduces memory-bandwidth requirements. Finally, it uses tiled masks to handle boundary conditions for arbitrary tensor sizes, ensuring the kernel is robust as well as fast.

[Case 2]
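The index arithmetic from Case 1 can be checked on the host side. Below is a minimal NumPy sketch (the function name is illustrative, not from the source; the real kernel performs this gather on-device in Triton or CUDA rather than through NumPy fancy indexing):

```python
import numpy as np

def fused_fliplr(x):
    """Flip the columns of a 2D array by gathering from flipped source
    positions, mirroring the kernel's index formula:
    src = base + row * M + (M - 1 - col)."""
    n, m = x.shape
    flat = x.ravel()                 # row-major view; base offset X is 0 here
    rows = np.arange(n)[:, None]     # shape (N, 1), broadcasts over columns
    cols = np.arange(m)[None, :]     # shape (1, M), broadcasts over rows
    src = rows * m + (m - 1 - cols)  # flipped source linear indices
    return flat[src]                 # gather straight into the output layout
```

For example, `fused_fliplr(np.arange(6).reshape(2, 3))` produces the same result as `np.fliplr` on that input; a device kernel would additionally mask the tail tiles so out-of-range rows and columns are never loaded or stored.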
To synthesize data for this capability, strictly follow the three-phase pipeline below. Do not invent or skip steps. Read the corresponding reference file for each phase in order:
Phase 1: Environment Exploration
Read the exploration guidelines to discover raw knowledge seeds:
references/EXPLORATION.md
Phase 2: Trajectory Selection
Once Phase 1 is complete, read the selection criteria to evaluate the trajectory:
references/SELECTION.md
Phase 3: Data Synthesis
Once a trajectory passes Phase 2, read the synthesis instructions to generate the final data:
references/SYNTHESIS.md