Use when optimizing CUDA kernels: GPU memory hierarchy, shared memory usage, memory coalescing, launch configuration, occupancy, warp divergence, host-device transfers, and end-to-end GPU pipelines.
Optimize CUDA code for throughput, latency, memory efficiency, and predictable scaling while preserving correctness.
Provide: