Optimize existing Triton kernels for NVIDIA TileIR backend on Blackwell GPUs (sm_100+). Adds TileIR-specific autotune configs: occupancy, num_ctas, TMA descriptors. Covers kernel classification (dot-related, norm-like, elementwise, reduction), type-specific transformations, and PTX-vs-TileIR benchmarking. Triggered by: "optimize for TileIR", "add TileIR configs", "Blackwell optimization", "TMA descriptors", "2CTA mode", "occupancy tuning". Kernels use standard `import triton`; TileIR activates via ENABLE_TILE=1 when nvtriton is installed.
Optimize EXISTING Triton kernels for NVIDIA's TileIR backend on Blackwell GPUs. This skill does NOT write kernels from scratch -- that is the Triton Specialist's job.
TileIR is NVIDIA's compiler backend for Triton that generates optimized CUDA code using CGA-level (Cooperative Grid Array) tile representations. Critical differences:
| Parameter | PTX Backend | TileIR Backend |
|---|---|---|
num_warps | Strict directive | Ignored (compiler decides) |
num_stages | Strict directive | Cost hint (compiler optimizes) |
occupancy | Not available | Critical tuning param (1-32) |
num_ctas | Limited | 2CTA mode for Blackwell |
| Block sizes | Smaller often better | Larger often better |
| TMA | Not available | Required for dot kernels |
Key implication: Do not tune num_warps for TileIR -- focus on occupancy instead.
Three packages share import triton:
| Package | Source | Use Case |
|---|---|---|
pytorch-triton | PyTorch wheel | torch.compile, standard kernels |
triton | OpenAI PyPI | Official Triton from triton-lang.org |
| nvtriton | Triton-to-tile-IR | TileIR backend for Blackwell |
Only one triton package should be installed at a time. "Converting to TileIR" means
adding TileIR-specific configs, NOT changing imports. TileIR activates via ENABLE_TILE=1.
TileIR targets Blackwell (sm_100+). Without nvtriton or Blackwell hardware, the specialist still adds TileIR-optimized configs that standard triton safely ignores, enabling future deployment.
Expected speedups (with nvtriton on Blackwell):
| Kernel Type | Speedup | Key Lever |
|---|---|---|
| Dot-Related (GEMM, Attention) | 1.2-2.0x | TMA + 2CTA |
| Norm-Like (LayerNorm, Softmax) | 2.0-5.0x | High occupancy |
| Element-Wise (ReLU, Add, Exp) | 1.5-3.0x | Occupancy + num_stages |
| Reduction (Sum, Mean, Max) | 1.8-4.0x | High occupancy |
Five-phase workflow: compatibility, classify, transform, validate, benchmark.
Verify the kernel works in PTX mode before applying TileIR optimizations.
python scripts/tileir_check.py
Then use the kernel-triton-writing skill's verify_kernel.py to verify with ENABLE_TILE=0:
python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
Determine kernel type to select the optimization strategy.
python scripts/classify_kernel.py --file kernel.py
Classification decision tree:
Contains tl.dot()?
YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
NO --> Has reduction + normalization?
YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
NO --> Point-wise only?
YES --> element-wise: occupancy (1-16) + num_stages (2-4)
NO --> reduction: high occupancy + num_warps
Classify and apply optimizations in one step:
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
Output JSON includes optimized_code and changes_applied fields.
Type-specific transformations:
Dot-related (highest priority):
tl.load/tl.store to TMA descriptors (MANDATORY). See references/tma-conversion.md.num_ctas=2) with SM oversubscription guard in pre-hook.Norm-like (LayerNorm, Softmax, RMSNorm):
Element-wise (ReLU, GELU, Add, Mul, Exp):
Reduction (Sum, Mean, Max):
Gate TileIR-specific configs for sm_100+:
import torch
def get_configs_with_gating(pre_hook=None):
configs = get_baseline_configs()
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
configs.extend(get_tileir_specific_configs(pre_hook))
return configs
See references/config-templates.md for complete config templates per kernel type.
Use the kernel-triton-writing skill's verify_kernel.py to verify the optimized kernel with TileIR backend:
python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
Set ENABLE_TILE=1 before running. Check: numerical correctness, no compilation errors,
TMA/2CTA patterns compile successfully.
Use triton.testing.do_bench() (as documented in the perf-workload-profiling skill) to compare PTX (ENABLE_TILE=0) vs TileIR (ENABLE_TILE=1).
Benchmark across multiple input sizes (128, 1024, 8192) -- performance varies by size.
Check TileIR availability (nvtriton, ENABLE_TILE, Blackwell GPU):
python scripts/tileir_check.py
Returns JSON: nvtriton_installed, tileir_active, blackwell_gpu, gpu_capability, recommendation.
Classify kernel type and optionally apply TileIR optimizations:
# Classify only
python scripts/classify_kernel.py --file kernel.py
# Classify + apply optimizations
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
# From inline code
python scripts/classify_kernel.py --code '<kernel_code>'
Returns JSON: classification, confidence, indicators, recommendations.
With --apply-optimizations: adds optimized_code and changes_applied.
TMA descriptor errors (dot-related kernels):
pre_hook=tma_set_block_size_hook to config generation -- without it,
TMA descriptors keep dummy block sizes, causing runtime errors or wrong results.b.T.contiguous() in wrapper and use tl.dot(a, b.T, accumulator)
in kernel. Transposition mismatch produces incorrect results silently.2CTA oversubscription:
num_ctas=2:
if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
Config function signatures:
pre_hook=None, even if unused.
Without it: TypeError: get_autotune_configs() takes 0 positional arguments.Hardware gating:
torch.cuda.get_device_capability()[0] >= 10.
TMA/2CTA on pre-Blackwell GPUs causes runtime crashes.API availability:
1.0 / (1.0 + tl.exp(-x)) instead of tl.sigmoid(x) -- not available in
all Triton versions including some nvtriton builds.Performance tuning:
num_warps -- TileIR ignores it. Focus on occupancy.export TILEIR_ENABLE_APPROX=1
export TILEIR_ENABLE_FTZ=1
Stop and report if:
After optimization, return:
## TileIR Optimization: kernel_name
### Classification
- Kernel type: [dot-related | norm-like | element-wise | reduction]
- Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]
### Compatibility Check (ENABLE_TILE=0)
[PASSED | FAILED] — Max difference: X.Xe-Y
### Transformations Applied
- [List of transformations]
### TileIR Validation (ENABLE_TILE=1)
[PASSED | FAILED] — Max difference: X.Xe-Y
### Benchmark Comparison
| Backend | Time (ms) | Speedup |
|---------|-----------|---------|
| PTX (ENABLE_TILE=0) | X.XXX | 1.0x |
| TileIR (ENABLE_TILE=1) | X.XXX | Y.Yx |
### Output
File: kernel_name_tileir.py