Optimizes LLM inference kernels using FlashAttention, KV cache paging, Triton/CUDA kernel development, operator fusion, quantized GEMM, speculative decoding, and TensorRT-LLM. Use when reducing latency, memory, or throughput bottlenecks in model serving.
Optimize low-level inference kernels for LLM serving — FlashAttention, KV cache management, operator fusion, custom CUDA/Triton kernels, quantized GEMM, and speculative decoding — to reduce latency, memory footprint, and cost per token.
Use this skill when:
- the bottleneck is serving-time latency, memory, or throughput at the kernel level, not model training or adaptation (see fine-tuning or llm-creation)
- the work is kernel engineering rather than quantization algorithm design (see quantization-research) or model compression (see distillation-compression)

**Profile first.** Use `torch.profiler`, `nsys profile`, or `py-spy` to identify whether the workload is memory-bound (attention, KV cache) or compute-bound (GEMM, MLP). Measure time-to-first-token (TTFT) and inter-token latency separately.

**Attention.** Use FlashAttention (`from flash_attn import flash_attn_func`). Key idea: tile Q/K/V into SRAM and compute softmax in a single pass without materializing the N×N attention matrix, reducing memory from O(N²) to O(N). For serving, implement paged attention (vLLM-style): allocate the KV cache in fixed-size blocks and map logical token positions to physical blocks via a block table.

**Operator fusion.** Try `torch.compile(mode="max-autotune")` to auto-fuse where possible. When that is not enough, write a custom Triton kernel, for example a fused row-wise softmax:

```python
import triton
import triton.language as tl

@triton.jit
def fused_softmax_kernel(output_ptr, input_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one row of the (n_rows, n_cols) matrix.
    row_idx = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    # Masked lanes load -inf so they contribute exp(-inf) = 0 to the sum.
    row = tl.load(input_ptr + row_idx * n_cols + col_offsets, mask=mask, other=-float('inf'))
    # Subtract the row max before exponentiating for numerical stability.
    row_max = tl.max(row, axis=0)
    numerator = tl.exp(row - row_max)
    denominator = tl.sum(numerator, axis=0)
    tl.store(output_ptr + row_idx * n_cols + col_offsets, numerator / denominator, mask=mask)
```
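The max-subtraction trick in the kernel above can be checked in pure Python. This is an illustrative sketch, not part of the skill's tooling; `stable_softmax` is a hypothetical helper name:

```python
import math

def stable_softmax(row):
    # Mirror the kernel's algorithm: subtract the row max before
    # exponentiating so exp() cannot overflow, then normalize.
    row_max = max(row)
    numerator = [math.exp(x - row_max) for x in row]
    denominator = sum(numerator)
    return [n / denominator for n in numerator]

# A naive exp(1002.0) would overflow; the shifted form stays finite.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
```

Because softmax is shift-invariant, subtracting the max changes nothing mathematically but keeps every `exp` argument ≤ 0, which is why both the Triton kernel and FlashAttention rely on it.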
**Verify correctness.** Test attention kernels against `torch.nn.functional.scaled_dot_product_attention` on random inputs (atol=1e-2 for fp16).

**Benchmark.** Measure with `torch.cuda.Event` timers. Tip: try `torch.compile` fusion before manual kernel writing; it handles many common patterns.

**Deliverables**

- Profile report — bottleneck identification with `nsys`/`torch.profiler` traces
- Kernel implementation — Triton/CUDA source with correctness tests against a reference
- Benchmark results — tokens/sec, TTFT, memory, p99 latency before and after optimization
- Integration plan — how the kernel plugs into the serving framework (vLLM, TensorRT-LLM, etc.)

**Related skills**: quantization-research, distillation-compression, benchmark-design, llm-creation
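The vLLM-style paged KV cache mentioned earlier can be sketched as a minimal block table in plain Python. This is a toy model of the bookkeeping only; `BlockTable`, `BLOCK_SIZE`, and the method names are hypothetical, not the vLLM API:

```python
BLOCK_SIZE = 16  # tokens per physical KV cache block (assumed for illustration)

class BlockTable:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # free physical block ids
        self.table = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id, pos):
        # Map logical token position `pos` to a physical cache slot,
        # allocating a new fixed-size block only when the last one fills up.
        blocks = self.table.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(blocks):
            blocks.append(self.free.pop())
        block = blocks[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

    def free_sequence(self, seq_id):
        # Return a finished sequence's blocks to the free pool.
        self.free.extend(self.table.pop(seq_id, []))
```

The design point: because allocation is per fixed-size block rather than per maximum sequence length, internal fragmentation is bounded to at most one partially filled block per sequence, and the attention kernel resolves physical addresses through the block table at lookup time.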