NVIDIA CUDA parallel computing platform — use when writing .cu kernels, using cuBLAS/cuDNN/cuFFT/cuSPARSE/cuRAND/cuSolver, Thrust, or Cooperative Groups for GPU-accelerated computing
CUDA is NVIDIA's parallel computing platform and programming model for GPU-accelerated applications. It provides direct access to the GPU's virtual instruction set and parallel compute elements for executing kernels in C, C++, and Fortran.
cuda-samples version: v13.2 (CUDA Toolkit 13.2)
CUDALibrarySamples: main (April 2026)
Language: C/C++ (.cu files)
Licenses: BSD-3-Clause (cuda-samples), Apache-2.0 (CUDALibrarySamples)
// Minimal kernel + launch
#include <cuda_runtime.h>

__global__ void addVectors(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may have extra threads
}

int main() {
    int n = 1 << 20;
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up so every element is covered
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();  // wait for the kernel before using results on the host

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
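The minimal launch above skips host data and error checking for brevity. A fuller sketch of the same vector add (the host buffer names `h_a`/`h_b`/`h_c` are illustrative, not from the samples) copies inputs to the device with `cudaMemcpy`, copies the result back, and checks for launch errors:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host buffers (hypothetical names for this sketch)
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Host -> device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Device -> host; this blocks until the kernel on the same stream finishes
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("c[0] = %f\n", h_c[0]);  // 1.0 + 2.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```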
Key concepts:
- `__global__`: marks a function executed on the GPU by many parallel threads
- `<<<gridDim, blockDim>>>`: launch configuration that sets the degree of parallelism
- Device memory: allocated with `cudaMalloc` and freed with `cudaFree`
- Unified memory (`cudaMallocManaged`): automatically migrates data between CPU and GPU; check support with `device_prop.managedMemory`

| Domain | File | Description |
|---|---|---|
| CUDA Runtime | api-runtime.md | Device mgmt, memory, streams, events, kernel launch |
| cuBLAS | api-cublas.md | Dense linear algebra: GEMM, GEMV, TRSM, grouped batched ops |
| cuFFT | api-cufft.md | 1D/2D/3D FFT and batched transforms |
| cuSPARSE | api-cusparse.md | Sparse matrix ops: SpMM, SpMV, format conversions |
| cuRAND | api-curand.md | Random number generation on GPU |
| cuSolver | api-cusolver.md | Dense/sparse solvers: QR, LU, eigenvalue, SVD |
| Thrust | api-thrust.md | STL-like GPU algorithms: sort, reduce, transform, scan |
| Cooperative Groups | api-cooperative-groups.md | Flexible thread synchronization beyond blocks |
| Workflows | workflows.md | Complete working examples |
See references/workflows.md for complete examples.
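As a taste of the library layer, this hedged Thrust sketch (using the documented `thrust::device_vector` and `thrust::reduce`) sums a vector on the GPU without writing a kernel:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    thrust::device_vector<int> v(1000, 1);            // 1000 ones in device memory
    int sum = thrust::reduce(v.begin(), v.end(), 0);  // parallel reduction on the GPU
    std::printf("sum = %d\n", sum);                   // 1000 * 1
    return 0;
}
```

Thrust handles allocation, the host/device transfer on construction, and the reduction kernel itself; see api-thrust.md for sort, transform, and scan.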
Quick reference:
- Wrap API calls with a `CUDA_CHECK(err)` macro pattern
- Call `cudaDeviceSynchronize()` or a stream synchronize before reading results on the host
- Use `cudaOccupancyMaxPotentialBlockSize` to tune block dimensions; use `uint32_t` for array indexing to avoid overflow
- Pass `cudaStreamNonBlocking` when creating non-default streams to avoid implicit synchronization with the null stream
- Check `device_prop.managedMemory` before using `cudaMallocManaged`; use `cudaMemPrefetchAsync` to avoid page faults
- Call `cuModuleUnload` before `cuCtxDestroy` when using the CUDA Driver API
- Compile with `--gpu-architecture=sm_90a` for Hopper (H100), `sm_80` for Ampere (A100), `sm_70` for Volta (V100)
- Use `cuda::std::tuple`, `cuda::std::make_tuple`, and `cuda::std::get` (the `thrust::tuple` variants are replaced)
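The `CUDA_CHECK(err)` pattern is a project convention rather than a toolkit API; one common sketch wraps each runtime call and aborts with file/line context on failure:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Convention, not a CUDA API: evaluate the call once, report and exit on error.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

int main() {
    float *d_buf;
    CUDA_CHECK(cudaMalloc(&d_buf, 1024 * sizeof(float)));
    // ... launch kernels, then surface asynchronous errors:
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

The `do { } while (0)` wrapper makes the macro behave like a single statement, so it composes safely with `if`/`else`.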