Optimizes LLM inference kernels using FlashAttention, KV cache paging, Triton/CUDA kernel development, operator fusion, quantized GEMM, speculative decoding, and TensorRT-LLM. Use when reducing latency, memory, or throughput bottlenecks in model serving.
Optimize low-level inference kernels for LLM serving — FlashAttention, KV cache management, operator fusion, custom CUDA/Triton kernels, quantized GEMM, and speculative decoding — to reduce latency, memory footprint, and cost per token.
Use this skill when:
- the bottleneck is serving-time latency, memory, or throughput at the kernel level, not model training or adaptation (see fine-tuning or llm-creation)
- the work is kernel engineering rather than quantization algorithm design (see quantization-research) or model compression (see distillation-compression)

**Profile first.** Use `torch.profiler`, `nsys profile`, or `py-spy` to identify whether the workload is memory-bound (attention, KV cache) or compute-bound (GEMM, MLP). Measure time-to-first-token (TTFT) and inter-token latency separately.

**Attention.** Use FlashAttention (`from flash_attn import flash_attn_func`). Key idea: tile Q/K/V into SRAM and compute softmax in a single pass without materializing the N×N attention matrix, reducing memory from O(N²) to O(N). For serving, implement paged attention (vLLM-style): allocate the KV cache in fixed-size blocks and map logical token positions to physical blocks via a block table.

**Operator fusion.** Try `torch.compile(mode="max-autotune")` to auto-fuse where possible. When that is not enough, write a custom Triton kernel, for example a fused row-wise softmax:

```python
import triton
import triton.language as tl

@triton.jit
def fused_softmax_kernel(output_ptr, input_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one row of the (n_rows, n_cols) matrix.
    row_idx = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    # Masked lanes load -inf so they contribute exp(-inf) = 0 to the sum.
    row = tl.load(input_ptr + row_idx * n_cols + col_offsets, mask=mask, other=-float('inf'))
    # Subtract the row max before exponentiating for numerical stability.
    row_max = tl.max(row, axis=0)
    numerator = tl.exp(row - row_max)
    denominator = tl.sum(numerator, axis=0)
    tl.store(output_ptr + row_idx * n_cols + col_offsets, numerator / denominator, mask=mask)
```
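The max-subtraction trick in the kernel above can be checked in pure Python. This is an illustrative sketch, not part of the skill's tooling; `stable_softmax` is a hypothetical helper name:

```python
import math

def stable_softmax(row):
    # Mirror the kernel's algorithm: subtract the row max before
    # exponentiating so exp() cannot overflow, then normalize.
    row_max = max(row)
    numerator = [math.exp(x - row_max) for x in row]
    denominator = sum(numerator)
    return [n / denominator for n in numerator]

# A naive exp(1002.0) would overflow; the shifted form stays finite.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
```

Because softmax is shift-invariant, subtracting the max changes nothing mathematically but keeps every `exp` argument ≤ 0, which is why both the Triton kernel and FlashAttention rely on it.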
**Verify correctness.** Test attention kernels against `torch.nn.functional.scaled_dot_product_attention` on random inputs (atol=1e-2 for fp16).

**Benchmark.** Measure with `torch.cuda.Event` timers. Tip: try `torch.compile` fusion before manual kernel writing; it handles many common patterns.

**Deliverables**

- Profile report — bottleneck identification with `nsys`/`torch.profiler` traces
- Kernel implementation — Triton/CUDA source with correctness tests against a reference
- Benchmark results — tokens/sec, TTFT, memory, p99 latency before and after optimization
- Integration plan — how the kernel plugs into the serving framework (vLLM, TensorRT-LLM, etc.)

**Related skills**: quantization-research, distillation-compression, benchmark-design, llm-creation
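The vLLM-style paged KV cache mentioned earlier can be sketched as a minimal block table in plain Python. This is a toy model of the bookkeeping only; `BlockTable`, `BLOCK_SIZE`, and the method names are hypothetical, not the vLLM API:

```python
BLOCK_SIZE = 16  # tokens per physical KV cache block (assumed for illustration)

class BlockTable:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # free physical block ids
        self.table = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id, pos):
        # Map logical token position `pos` to a physical cache slot,
        # allocating a new fixed-size block only when the last one fills up.
        blocks = self.table.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(blocks):
            blocks.append(self.free.pop())
        block = blocks[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

    def free_sequence(self, seq_id):
        # Return a finished sequence's blocks to the free pool.
        self.free.extend(self.table.pop(seq_id, []))
```

The design point: because allocation is per fixed-size block rather than per maximum sequence length, internal fragmentation is bounded to at most one partially filled block per sequence, and the attention kernel resolves physical addresses through the block table at lookup time.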