Name: TensorRT-LLM Codebase Exploration Guide
Author: NVIDIA

TensorRT-LLM Codebase Exploration Guide

Systematic approach to exploring the TensorRT-LLM codebase before implementing new features or optimizations. Teaches how to discover existing infrastructure, trace code paths, and avoid reimplementing what already exists. Derived from real mistakes where ~250 lines of code were written and deleted because existing forward methods weren't discovered upfront. Use when starting any new feature, optimization, or code modification in TRT-LLM.

NVIDIA · 13,410 stars · April 8, 2026

Occupation
Category: Machine Learning

Why This Matters

TRT-LLM is a large codebase (~500K lines) with many reusable abstractions. The most common source of wasted effort is reimplementing something that already exists. On the short-seq MHA branch, ~250 lines were written across 4 iterations before discovering that a 10-line dispatch to an existing method (forward_context_default) was the right solution.

Rule of thumb: Spend 30 minutes reading existing code before writing 1 line of new code.

MANDATORY: Ignore the TensorRT backend and focus on the PyTorch backend.

Step-by-Step Exploration Workflow

Step 1: Map the Class You're Modifying

Before adding code to a class, understand its full structure:

# List all methods (not just forward*)
grep -n "def " tensorrt_llm/_torch/modules/attention.py | head -50

# List attributes set in __init__ (scan the 200 lines after each "def __init__", then keep the self.* lines)
grep -n -A 200 "def __init__" tensorrt_llm/_torch/modules/attention.py | grep "self\." | head -80

# Find the class hierarchy
grep -n "class MLA\|class Attention\|class TrtllmAttention" tensorrt_llm/_torch/modules/attention.py
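
The grep commands above scan the whole file, so their output mixes methods and attributes from every class in it. As a complement, here is a minimal sketch that uses Python's standard `ast` module to scope the same questions to one class. The helper name `map_class` and the toy class `ToyAttention` are hypothetical and exist only for illustration; in practice you would pass the text of `tensorrt_llm/_torch/modules/attention.py` and a real class name.

```python
import ast


def map_class(source: str, class_name: str) -> dict:
    """Return base classes, method names, and attributes assigned on self in __init__."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            funcs = [n for n in node.body
                     if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            attrs = set()
            for fn in funcs:
                if fn.name != "__init__":
                    continue
                # Collect every `self.<name>` that appears as an assignment target.
                for sub in ast.walk(fn):
                    if (isinstance(sub, ast.Attribute)
                            and isinstance(sub.ctx, ast.Store)
                            and isinstance(sub.value, ast.Name)
                            and sub.value.id == "self"):
                        attrs.add(sub.attr)
            return {
                "bases": [ast.unparse(b) for b in node.bases],
                "methods": [fn.name for fn in funcs],
                "init_attrs": sorted(attrs),
            }
    raise ValueError(f"class {class_name!r} not found")


# Toy input (hypothetical class, not taken from TRT-LLM).
TOY_SOURCE = '''
class Base:
    pass

class ToyAttention(Base):
    def __init__(self, hidden_size):
        self.hidden_size = hidden_size
        self.rope_fusion = None

    def forward(self, x):
        return x

    def forward_context_default(self, x):
        return x
'''

info = map_class(TOY_SOURCE, "ToyAttention")
print(info)
```

To run it against a checkout, read the file with `pathlib.Path(...).read_text()` and pass the result as `source`. The sketch only sees attributes assigned directly in `__init__`; attributes set by helper methods or parent classes still need a manual read.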

| What you need | Search for | Common hits |
| --- | --- | --- |
| Attention computation | `TrtllmAttention`, `create_attention`, `FlashInferAttention` | Handles packed seqs, variable lengths, KV cache natively |
| Compiled fusion | `maybe_compile`, `maybe_compiled_cat`, `maybe_compiled_copy_` | Already in `tensorrt_llm/_torch/utils.py` |
| RoPE application | `RotaryEmbedding`, `apply_rotary_pos_emb`, `rope_fusion` | Multiple implementations exist; check which one the current code path uses |
| KV cache management | `mla_rope_append_paged_kv`, `append_paged_kv`, `latent_cache` | Fused RoPE + cache operations in C++ kernels |
| Sparse attention | `DSATrtllmAttention`, `indexer`, `topk_indices` | DSA-specific backend with sparse routing |
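
Searching for several of these terms at once gives a quick picture of where a concept lives. The following is a rough sketch, not part of TRT-LLM: the helper `concept_hits` is a hypothetical name, and the demo runs on a throwaway directory with made-up files so that it is self-contained.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def concept_hits(root: Path, terms: list[str]) -> dict[str, Counter]:
    """Count case-insensitive occurrences of each term per .py file under root."""
    hits = {term: Counter() for term in terms}
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for term in terms:
            count = len(re.findall(re.escape(term), text, flags=re.IGNORECASE))
            if count:
                hits[term][str(path.relative_to(root))] = count
    return hits


# Demo on a temporary tree (file contents are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "utils.py").write_text("def maybe_compile(fn):\n    return fn\n")
    (root / "rope.py").write_text("class RotaryEmbedding:\n    pass\n# rope helper\n")
    demo = concept_hits(root, ["maybe_compile", "rope", "expand kv"])

for term, per_file in demo.items():
    print(term, dict(per_file))
```

Against a real checkout, a call such as `concept_hits(Path("tensorrt_llm/_torch"), ["TrtllmAttention", "maybe_compile", "rope_fusion"])` would rank files by hit count. A term with zero hits is a prompt to try a synonym before concluding that nothing exists.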

Common Exploration Mistakes

| Mistake | Consequence | Prevention |
| --- | --- | --- |
| Reading only the method you're modifying | Miss that another method does what you need | Read ALL methods in the class |
| Searching only for the exact function name | Miss equivalent implementations | Search for the concept (e.g., "attention", "rope", "expand kv") |
| Assuming assertions are immutable | Work around them with hacks (separate attributes) | Question whether the assertion's intent still applies |
| Not reading the fused kernel's capabilities | Reimplement what it already does | Check what `latent_cache`, `rope_fusion` etc. control |
| Only reading Python code | Miss C++ implementations called via bindings | Check `tensorrt_llm/_torch/attention_backend/` for native kernels |
| Calling a method directly instead of through its dispatcher | Miss edge cases (cached KV, chunked prefill, SM-version gating) | Search for callers of the method to find the dispatch chain |
| Assuming hardware-uniform numerical behavior | Silent accuracy degradation on specific SM versions | Check for `get_sm_version()` guards near the call site; test on multiple hardware |
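
The "search for callers" prevention can be partly automated. Below is a minimal sketch, assuming the file parses as plain Python: `find_callers` is a hypothetical helper, and the toy class `ToyMLA`, its dispatcher `forward_context`, and `forward_context_with_cached_kv` are invented for the demo. Only the name `forward_context_default` comes from this guide.

```python
import ast


def find_callers(source: str, method_name: str) -> list[tuple[str, int]]:
    """Return (enclosing function name, line number) for each call to method_name."""
    tree = ast.parse(source)
    callers = set()
    for fn in ast.walk(tree):
        if not isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(fn):
            if not isinstance(node, ast.Call):
                continue
            func = node.func
            # Handle both `obj.method(...)` and bare `method(...)` calls.
            if isinstance(func, ast.Attribute):
                name = func.attr
            elif isinstance(func, ast.Name):
                name = func.id
            else:
                name = None
            if name == method_name:
                callers.add((fn.name, node.lineno))
    return sorted(callers, key=lambda item: item[1])


# Toy dispatcher (hypothetical; not the real TRT-LLM control flow).
TOY = '''
class ToyMLA:
    def forward_context(self, q, cached):
        if cached:
            return self.forward_context_with_cached_kv(q)
        return self.forward_context_default(q)

    def forward_context_default(self, q):
        return q

    def forward_context_with_cached_kv(self, q):
        return q
'''

callers = find_callers(TOY, "forward_context_default")
print(callers)
```

Each reported function is a candidate dispatcher: read its branches to see which conditions route to the method before calling the method directly. Calls inside nested functions are reported under both the inner and the outer function name.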

File Reference for Exploration

| Area | Key files to read |
| --- | --- |
| Attention modules | `tensorrt_llm/_torch/modules/attention.py` |
| Attention backends | `tensorrt_llm/_torch/attention_backend/` (trtllm_attention.py, sparse/) |
| Model definitions | `tensorrt_llm/_torch/models/modeling_*.py` |
| Utilities | `tensorrt_llm/_torch/utils.py` |
| RoPE | `tensorrt_llm/_torch/modules/rotary_embedding.py` |
| Test fixtures | `tests/unittest/_torch/attention/` |
| Weight loading | `tensorrt_llm/_torch/models/modeling_deepseekv3.py` (search `load_`) |
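
Paths in a reference list like this can drift between releases, so it may be worth confirming them against your checkout before relying on them. The sketch below is illustrative only: `missing_paths` is a hypothetical helper, and the demo uses a temporary directory standing in for a repository root.

```python
import tempfile
from pathlib import Path

# Paths copied from the file reference above.
REFERENCE_PATHS = [
    "tensorrt_llm/_torch/modules/attention.py",
    "tensorrt_llm/_torch/attention_backend/",
    "tensorrt_llm/_torch/utils.py",
    "tensorrt_llm/_torch/modules/rotary_embedding.py",
    "tests/unittest/_torch/attention/",
]


def missing_paths(repo_root: Path, paths: list[str]) -> list[str]:
    """Return the entries that do not exist under repo_root."""
    return [p for p in paths if not (repo_root / p).exists()]


# Demo: a fake repository root that contains only utils.py.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tensorrt_llm/_torch").mkdir(parents=True)
    (root / "tensorrt_llm/_torch/utils.py").write_text("")
    demo_missing = missing_paths(root, REFERENCE_PATHS)

print(demo_missing)
```

In a real checkout, `missing_paths(Path("."), REFERENCE_PATHS)` run from the repository root should return an empty list; any entry it returns points to a file that has moved and needs to be located again by searching.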

TensorRT-LLM Codebase Exploration Guide

Why This Matters

MANDATORY: Ignore the TensorRT backend, focus on the PyTorch backend

Step-by-Step Exploration Workflow

Step 1: Map the Class You're Modifying

Step 2: Trace Existing Forward Methods

Step 3: Search for Existing Backends and Utilities

Step 4: Check What the Fused Kernels Handle

Step 5: Check Assertions and Invariants

Step 6: Understand Weight Layouts

Step 7: Trace Method Limitations

Key Discovery Patterns

Pattern: "Can I Reuse an Existing Forward Method?"

Pattern: "Is This Already Handled by a Fused Kernel?"

Pattern: "Am I Calling the Right Abstraction Level?"

Pattern: "Does a Utility Already Exist?"

Common Exploration Mistakes

File Reference for Exploration