Name: Triton TileIR Optimization
Author: NVIDIA

| Kernel Type | Speedup | Key Lever |
|-------------|---------|-----------|
| Dot-Related (GEMM, Attention) | 1.2-2.0x | TMA + 2CTA |
| Norm-Like (LayerNorm, Softmax) | 2.0-5.0x | High occupancy |
| Element-Wise (ReLU, Add, Exp) | 1.5-3.0x | Occupancy + num_stages |
| Reduction (Sum, Mean, Max) | 1.8-4.0x | High occupancy |

python scripts/tileir_check.py

python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
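The JSON arguments in the command above are easy to mis-quote when the call is generated from a script. The following is a minimal sketch of building the same argument list programmatically. It uses only the flags that appear in the command above; the helper name `build_verify_command` is hypothetical and is not part of the skill's scripts.

```python
import json
import shlex


def build_verify_command(kernel_path, reference, shapes, dtypes):
    """Return an argv list for scripts/verify_kernel.py.

    Hypothetical helper: it only assembles the flags shown in the command
    above. Shapes and dtypes are serialized with json.dumps so the script
    receives valid JSON regardless of shell quoting.
    """
    return [
        "python", "scripts/verify_kernel.py",
        "--kernel", kernel_path,
        "--reference", reference,
        "--shapes", json.dumps(shapes),
        "--dtypes", json.dumps(dtypes),
    ]


cmd = build_verify_command(
    "path/to/kernel.py",
    "torch reference",
    {"x": [32, 512, 4096]},
    {"x": "bfloat16"},
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids the shell entirely, so the JSON needs no extra quoting.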

python scripts/classify_kernel.py --file kernel.py

Contains tl.dot()?
  YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
  NO  --> Has reduction + normalization?
            YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
            NO  --> Point-wise only?
                      YES --> element-wise: occupancy (1-16) + num_stages (2-4)
                      NO  --> reduction: high occupancy + num_warps
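The decision tree above can be sketched as a small text-based classifier. This is an illustration only and is not the logic of `classify_kernel.py`, which the page does not show. The regular expressions are assumptions: a reduction is detected by calls such as `tl.sum` or `tl.max`, and "normalization" is approximated by the presence of `tl.exp`, `tl.sqrt`, or `tl.rsqrt` alongside a reduction.

```python
import re

# Heuristic patterns (assumptions, not taken from classify_kernel.py).
DOT_RE = re.compile(r"\btl\.dot\s*\(")
REDUCTION_RE = re.compile(r"\btl\.(sum|max|min|argmax|argmin|reduce)\s*\(")
NORMALIZE_RE = re.compile(r"\btl\.(?:math\.)?(exp|sqrt|rsqrt)\s*\(")


def classify_kernel_source(source: str) -> str:
    """Walk the decision tree over raw kernel source text."""
    if DOT_RE.search(source):
        return "dot-related"
    has_reduction = REDUCTION_RE.search(source) is not None
    if has_reduction and NORMALIZE_RE.search(source):
        return "norm-like"
    if not has_reduction:
        # No tl.dot and no reduction: treat the kernel as point-wise only.
        return "element-wise"
    return "reduction"


print(classify_kernel_source("acc += tl.dot(a, b)"))
print(classify_kernel_source("y = tl.maximum(x, 0.0)"))
```

A text match of this kind cannot tell whether a reduced value is actually reused element-wise, so borderline kernels still need a manual check.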

python scripts/classify_kernel.py --file kernel.py --apply-optimizations

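import torch


def get_configs_with_gating(pre_hook=None):
    """Return baseline configs, plus TileIR-specific ones on capable GPUs."""
    # get_baseline_configs() and get_tileir_specific_configs() are assumed
    # to be defined elsewhere in the kernel module.
    configs = get_baseline_configs()
    # Compute capability major version 10 or higher gates the TileIR configs.
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
        configs.extend(get_tileir_specific_configs(pre_hook))
    return configs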
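The gating function above relies on two helpers that the page does not define. The sketch below shows one possible shape for them, using plain dictionaries rather than `triton.Config`, because the source does not show how `occupancy` is passed to the autotuner. All values are illustrative: the large block sizes and the 2CTA setting follow the recommendations on this page, while the baseline entries and the occupancy values tried are assumptions. The capability is passed in as an argument so the logic can be exercised without a GPU.

```python
import itertools


def get_baseline_configs():
    """Hypothetical baseline search space used on every GPU."""
    return [
        {"BLOCK_M": 128, "BLOCK_N": 128, "num_warps": 4, "num_stages": 3},
        {"BLOCK_M": 64, "BLOCK_N": 128, "num_warps": 4, "num_stages": 4},
    ]


def get_tileir_specific_configs(pre_hook=None):
    """Hypothetical TileIR additions: larger blocks, 2CTA, explicit occupancy."""
    blocks = [(256, 256), (256, 128)]
    occupancies = [1, 2]  # assumption: a small slice of the 1-32 range
    return [
        {
            "BLOCK_M": block_m,
            "BLOCK_N": block_n,
            "num_ctas": 2,
            "occupancy": occupancy,
            "pre_hook": pre_hook,
        }
        for (block_m, block_n), occupancy in itertools.product(blocks, occupancies)
    ]


def select_configs(capability_major, pre_hook=None):
    """Same gating rule as above, with the capability passed in explicitly."""
    configs = get_baseline_configs()
    if capability_major >= 10:
        configs.extend(get_tileir_specific_configs(pre_hook))
    return configs


print(len(select_configs(9)), len(select_configs(10)))
```

Keeping the baseline configs in both branches means the same kernel file still autotunes on GPUs where the TileIR-specific entries are skipped.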

python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'

python scripts/tileir_check.py

# Classify only
python scripts/classify_kernel.py --file kernel.py

# Classify + apply optimizations
python scripts/classify_kernel.py --file kernel.py --apply-optimizations

# From inline code
python scripts/classify_kernel.py --code '<kernel_code>'

Do not over-tune num_warps -- TileIR ignores it. Focus on occupancy.
Use larger block sizes (256x256, 256x128) for TileIR, not PTX-tuned small blocks.
Benchmark across small/medium/large inputs; one-size configs underperform.
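The last point can be sketched as a small harness that times one callable over several input sizes. This is a generic illustration, not one of the skill's scripts: `benchmark_across_sizes`, `make_input`, and the `sync` parameter are hypothetical names, and the stand-in workload at the bottom runs on the CPU. For a real kernel, `sync` would typically be `torch.cuda.synchronize`, since GPU launches are asynchronous.

```python
import statistics
import time


def benchmark_across_sizes(fn, make_input, sizes, repeats=20, sync=None):
    """Return the median wall-clock time in ms of fn for each labeled size."""
    results = {}
    for label, size in sizes.items():
        x = make_input(size)
        fn(x)  # warm-up; on a GPU this also absorbs compilation and autotuning
        if sync is not None:
            sync()
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(x)
            if sync is not None:
                sync()  # wait for asynchronous work before stopping the clock
            samples.append((time.perf_counter() - start) * 1e3)
        results[label] = statistics.median(samples)
    return results


# Stand-in workload: summing a Python list of the given length.
timings = benchmark_across_sizes(
    sum,
    lambda n: list(range(n)),
    {"small": 1_000, "medium": 100_000, "large": 1_000_000},
    repeats=5,
)
for label, ms in timings.items():
    print(f"{label}: {ms:.3f} ms")
```

Comparing the per-size medians for each candidate config shows whether a single config is adequate or whether the autotuner should key on input shape.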

For exp/log heavy kernels, enable approximate math:

export TILEIR_ENABLE_APPROX=1
export TILEIR_ENABLE_FTZ=1
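When the kernels are launched from Python, the same two variables can be set for a limited scope instead of being exported for the whole shell session. The context manager below is a sketch; the name `approx_math_env` is hypothetical. The page does not say when the backend reads these variables, so this assumes they are consulted at compile time; kernels that were already compiled and cached may be unaffected.

```python
import contextlib
import os

APPROX_VARS = ("TILEIR_ENABLE_APPROX", "TILEIR_ENABLE_FTZ")


@contextlib.contextmanager
def approx_math_env(enable=True):
    """Temporarily set (or clear) the approximate-math variables."""
    saved = {name: os.environ.get(name) for name in APPROX_VARS}
    try:
        for name in APPROX_VARS:
            if enable:
                os.environ[name] = "1"
            else:
                os.environ.pop(name, None)
        yield
    finally:
        # Restore whatever was set before entering the block.
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value


with approx_math_env():
    print({name: os.environ[name] for name in APPROX_VARS})
```

Scoping the flags this way makes it easier to validate an exp/log heavy kernel with and without approximate math in the same run and compare the maximum difference.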

## TileIR Optimization: kernel_name

### Classification
- Kernel type: [dot-related | norm-like | element-wise | reduction]
- Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]

### Compatibility Check (ENABLE_TILE=0)
[PASSED | FAILED] — Max difference: X.Xe-Y

### Transformations Applied
- [List of transformations]

### TileIR Validation (ENABLE_TILE=1)
[PASSED | FAILED] — Max difference: X.Xe-Y

### Benchmark Comparison
| Backend | Time (ms) | Speedup |
|---------|-----------|---------|
| PTX (ENABLE_TILE=0) | X.XXX | 1.0x |
| TileIR (ENABLE_TILE=1) | X.XXX | Y.Yx |

### Output
File: kernel_name_tileir.py
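The Benchmark Comparison table in the template above can be filled in from two measured timings. The helper below is a sketch with a hypothetical name, `benchmark_table`; it assumes the speedup is defined as PTX time divided by TileIR time, which matches the 1.0x baseline shown in the PTX row.

```python
def benchmark_table(ptx_ms: float, tileir_ms: float) -> str:
    """Render the Benchmark Comparison table from two timings in ms."""
    if ptx_ms <= 0 or tileir_ms <= 0:
        raise ValueError("timings must be positive")
    speedup = ptx_ms / tileir_ms  # assumption: speedup is relative to PTX
    rows = [
        "| Backend | Time (ms) | Speedup |",
        "|---------|-----------|---------|",
        f"| PTX (ENABLE_TILE=0) | {ptx_ms:.3f} | 1.0x |",
        f"| TileIR (ENABLE_TILE=1) | {tileir_ms:.3f} | {speedup:.1f}x |",
    ]
    return "\n".join(rows)


print(benchmark_table(1.0, 0.5))
```

A speedup below 1.0x in the last row indicates that the TileIR variant is slower than the PTX baseline for that input.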

| Parameter | PTX Backend | TileIR Backend |
|-----------|-------------|----------------|
| `num_warps` | Strict directive | Ignored (compiler decides) |
| `num_stages` | Strict directive | Cost hint (compiler optimizes) |
| `occupancy` | Not available | Critical tuning param (1-32) |
| `num_ctas` | Limited | 2CTA mode for Blackwell |
| Block sizes | Smaller often better | Larger often better |

| Package | Source | Use Case |
|---------|--------|----------|
| `pytorch-triton` | PyTorch wheel | `torch.compile`, standard kernels |
| `triton` | OpenAI PyPI | Official Triton from triton-lang.org |
| `nvtriton` | Triton-to-tile-IR | TileIR backend for Blackwell |

Triton TileIR Optimization

Principles

TileIR vs PTX Backend

Triton Package Landscape

When TileIR Applies

Workflow

Phase 1: Compatibility Test (ENABLE_TILE=0)

Phase 2: Classify Kernel

Phase 3: Apply Transformations

Phase 4: TileIR Validation (ENABLE_TILE=1)

Phase 5: Benchmark

Scripts

tileir_check.py

classify_kernel.py

Error Handling

Common Pitfalls

When to Abort

Output Format
