Use when optimizing CUDA kernels: GPU memory hierarchy, shared memory usage, memory coalescing, launch configuration, occupancy, warp divergence, host-device transfers, and end-to-end GPU pipelines.
Optimize CUDA code for throughput, latency, memory efficiency, and predictable scaling while preserving correctness.
Provide: