NVIDIA CUDA parallel computing platform — use when writing .cu kernels, using cuBLAS/cuDNN/cuFFT/cuSPARSE/cuRAND/cuSolver, Thrust, or Cooperative Groups for GPU-accelerated computing
CUDA is NVIDIA's parallel computing platform and programming model for GPU-accelerated applications. It provides direct access to the GPU's virtual instruction set and parallel compute elements for executing kernels in C, C++, and Fortran.
cuda-samples version: v13.2 (CUDA Toolkit 13.2)
CUDALibrarySamples: main (April 2026)
Language: C/C++ (.cu files)
Licenses: BSD-3-Clause (cuda-samples), Apache-2.0 (CUDALibrarySamples)
// Minimal kernel + launch
#include <cuda_runtime.h>

__global__ void addVectors(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may have extra threads
}

int main() {
    int n = 1 << 20;
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up so every element is covered
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();  // wait for the kernel before using results on the host

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
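The minimal launch above skips host data and error checking for brevity. A fuller sketch of the same vector add (the host buffer names `h_a`/`h_b`/`h_c` are illustrative, not from the samples) copies inputs to the device with `cudaMemcpy`, copies the result back, and checks for launch errors:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host buffers (hypothetical names for this sketch)
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Host -> device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Device -> host; this blocks until the kernel on the same stream finishes
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("c[0] = %f\n", h_c[0]);  // 1.0 + 2.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```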
Key concepts:
- `__global__`: marks a function executed on the GPU by many parallel threads
- `<<<gridDim, blockDim>>>`: launch configuration that sets the degree of parallelism
- Device memory: allocated with `cudaMalloc` and freed with `cudaFree`
- Unified memory (`cudaMallocManaged`): automatically migrates data between CPU and GPU; check support with `device_prop.managedMemory`

| Domain | File | Description |
|---|---|---|
| CUDA Runtime | api-runtime.md | Device mgmt, memory, streams, events, kernel launch |
| cuBLAS | api-cublas.md | Dense linear algebra: GEMM, GEMV, TRSM, grouped batched ops |
| cuFFT | api-cufft.md | 1D/2D/3D FFT and batched transforms |
| cuSPARSE | api-cusparse.md | Sparse matrix ops: SpMM, SpMV, format conversions |
| cuRAND | api-curand.md | Random number generation on GPU |
| cuSolver | api-cusolver.md | Dense/sparse solvers: QR, LU, eigenvalue, SVD |
| Thrust | api-thrust.md | STL-like GPU algorithms: sort, reduce, transform, scan |
| Cooperative Groups | api-cooperative-groups.md | Flexible thread synchronization beyond blocks |
| Workflows | workflows.md | Complete working examples |
See references/workflows.md for complete examples.
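As a taste of the library layer, this hedged Thrust sketch (using the documented `thrust::device_vector` and `thrust::reduce`) sums a vector on the GPU without writing a kernel:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    thrust::device_vector<int> v(1000, 1);            // 1000 ones in device memory
    int sum = thrust::reduce(v.begin(), v.end(), 0);  // parallel reduction on the GPU
    std::printf("sum = %d\n", sum);                   // 1000 * 1
    return 0;
}
```

Thrust handles allocation, the host/device transfer on construction, and the reduction kernel itself; see api-thrust.md for sort, transform, and scan.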
Quick reference:
- Wrap API calls with a `CUDA_CHECK(err)` macro pattern
- Call `cudaDeviceSynchronize()` or a stream synchronize before reading results on the host
- Use `cudaOccupancyMaxPotentialBlockSize` to tune block dimensions; use `uint32_t` for array indexing to avoid overflow
- Pass `cudaStreamNonBlocking` when creating non-default streams to avoid implicit synchronization with the null stream
- Check `device_prop.managedMemory` before using `cudaMallocManaged`; use `cudaMemPrefetchAsync` to avoid page faults
- Call `cuModuleUnload` before `cuCtxDestroy` when using the CUDA Driver API
- Compile with `--gpu-architecture=sm_90a` for Hopper (H100), `sm_80` for Ampere (A100), `sm_70` for Volta (V100)
- Use `cuda::std::tuple`, `cuda::std::make_tuple`, and `cuda::std::get` (the `thrust::tuple` variants are replaced)
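The `CUDA_CHECK(err)` pattern is a project convention rather than a toolkit API; one common sketch wraps each runtime call and aborts with file/line context on failure:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Convention, not a CUDA API: evaluate the call once, report and exit on error.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

int main() {
    float *d_buf;
    CUDA_CHECK(cudaMalloc(&d_buf, 1024 * sizeof(float)));
    // ... launch kernels, then surface asynchronous errors:
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

The `do { } while (0)` wrapper makes the macro behave like a single statement, so it composes safely with `if`/`else`.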