Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops, and validates it with hierarchical equivalence tests.
Input: HuggingFace model ID. Output: prefill-only custom model file + hierarchical tests + summary report.
Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.
Before anything else, check whether the model can fit on the current system.
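The fit check can be sketched with a small stdlib helper that sums the per-GPU values from the nvidia-smi query below (helper names are hypothetical; the fit threshold itself is model-dependent and not fixed here):

```python
import subprocess

def parse_total_vram_mib(csv_text: str) -> int:
    """Sum the CSV output of the nvidia-smi query: one MiB value per GPU line."""
    return sum(int(line.strip()) for line in csv_text.splitlines() if line.strip())

def query_total_vram_mib() -> int:
    """Run nvidia-smi and return total VRAM (MiB) across all GPUs."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_total_vram_mib(out)
```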
Run nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits to get the total VRAM (in MiB) across all GPUs on the system.

Step 1 — Check local transformers install first:
python -c "import transformers; print(transformers.__file__)"
Look for models/{model_type}/modeling_*.py under that path. If found, use it directly — no network needed.
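The local lookup can be sketched with a stdlib glob (the models/{model_type} layout matches current transformers releases, but treat the exact path as an assumption; pass os.path.dirname(transformers.__file__) as the root):

```python
import glob
import os

def find_local_modeling_files(transformers_dir: str, model_type: str) -> list[str]:
    """Look for modeling_*.py under <transformers_pkg>/models/<model_type>/.

    Returns an empty list when the architecture is not bundled locally,
    meaning the repo code must be downloaded in Step 2.
    """
    pattern = os.path.join(transformers_dir, "models", model_type, "modeling_*.py")
    return sorted(glob.glob(pattern))
```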
Step 2 — If not found, download the HF repo (code only, skip weights):
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
This downloads config, code, and tokenizer files into the standard HF cache ($HF_HOME or ~/.cache/huggingface/) while skipping large weight files. Files cached here are automatically found by transformers.AutoConfig.from_pretrained and similar APIs — no extra path wiring needed. Once downloaded you can work fully offline — read config.json and modeling_*.py from the cache snapshot directory printed by the command.
Before writing anything, check if an AD custom model already covers this architecture:
- Read config.json to find its model_type and architectures fields.
- Check tensorrt_llm/_torch/auto_deploy/models/custom/ for existing modeling_*.py files that register the same config class name (grep for the architectures value or model_type).
- Check tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py for existing registrations.

If existing code is found: reuse it instead of writing a new file.
If no existing code is found: proceed to write a new model file in Phase 2.
Check examples/auto_deploy/model_registry/models.yaml for other models from the same family (e.g., if asked to onboard Qwen/Qwen3-8B, look for Qwen/Qwen3-0.6B, Qwen/Qwen3-32B, Qwen/Qwen3-235B-A22B, etc.). Also check HuggingFace for the full set of model sizes/variants in the family.
Family members that share the same model_type / architectures in their config can all use a single modeling file.

Study the locally-available config.json and modeling_*.py (NOT from tensorrt_llm/_torch/models/). Identify attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break torch.export (e.g. torch.nonzero, data-conditioned if).
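The attention-type triage can be sketched from config.json fields (a heuristic only: the field names follow common HF conventions, and detecting MLA via kv_lora_rank is a DeepSeek-style assumption, not a universal rule):

```python
def classify_attention(config: dict) -> str:
    """Rough MHA/GQA/MLA triage from standard HF config.json fields."""
    if "kv_lora_rank" in config:  # DeepSeek-style latent attention marker
        return "MLA"
    n_heads = config["num_attention_heads"]
    n_kv = config.get("num_key_value_heads", n_heads)  # absent => MHA
    return "MHA" if n_kv == n_heads else "GQA"
```

Usage: `classify_attention(json.load(open(snapshot_dir / "config.json")))`.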
Create tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py. Use modeling_glm4_moe_lite.py as a structural template only (class layout, dataclass outputs, forward signature).
The goal is a minimal prefill-only model for torch.export with AD canonical IR ops. Keep the code as lean as possible — every line should serve the export path. Do not port HF features that AD doesn't need.
Strip: KV cache, training paths, dropout, flash attention variants, repeat_interleave/repeat_kv for GQA (AD attention ops handle this natively), fallback logic for generating position_ids (assert instead), optional code paths gated on config flags irrelevant to prefill export.
Keep: PreTrainedModel hierarchy, ModelOutput dataclass, minimal forward (input_ids, position_ids, inputs_embeds=None, **kwargs).
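The keep-list above can be sketched as a framework-free skeleton of the forward contract (class and field names are placeholders; the real file subclasses PreTrainedModel, uses a ModelOutput dataclass, and implements the body with canonical ops):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class CausalLMOutput:  # stands in for the ModelOutput dataclass
    logits: Any

class PrefillOnlyForCausalLM:  # stands in for the PreTrainedModel subclass
    """Skeleton of the export-facing forward contract; framework details omitted."""

    def forward(self, input_ids, position_ids, inputs_embeds=None, **kwargs):
        # position_ids is required: no HF-style fallback generation from input_ids.
        assert position_ids is not None, "position_ids must be provided"
        # No attention_mask / past_key_values / use_cache args: AD manages
        # masking and caching via its own transforms and runtime.
        hidden = inputs_embeds if inputs_embeds is not None else input_ids
        return CausalLMOutput(logits=hidden)  # real model: embed -> layers -> lm_head
```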
Critical: Make sure the custom modeling code's nn.Module hierarchy matches what the checkpoint's safetensors index JSON expects.
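A quick stdlib sanity check against the sharded checkpoint's model.safetensors.index.json (the weight_map field maps parameter names to shard files; the helper name is hypothetical):

```python
import json

def missing_checkpoint_keys(index_json_path: str,
                            model_param_names: set[str]) -> set[str]:
    """Return checkpoint keys the custom model's state_dict does not expose.

    A non-empty result usually means the nn.Module hierarchy (attribute
    names, layer nesting) diverges from what the checkpoint expects.
    """
    with open(index_json_path) as f:
        ckpt_keys = set(json.load(f)["weight_map"])
    return ckpt_keys - model_param_names
```

Compare against `set(model.state_dict())` from the custom model.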
Critical rule: Do NOT import or reuse existing AD custom model code (e.g. from .modeling_deepseek import ...). Every modeling_{name}.py must be self-contained. Use the HF source ($CLONE_DIR/modeling_*.py) as the source of truth for the model's logic and translate it fresh — even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.
Use torch.ops.auto_deploy.torch_* canonical ops WHENEVER POSSIBLE. These are the IR nodes that AD transforms later replace with optimized backends (triton, flashinfer, trtllm) at deployment time. If a canonical op exists for an operation, you MUST use it — do not reimplement the logic in plain PyTorch.
Available canonical ops (see tensorrt_llm/_torch/auto_deploy/custom_ops/README.md for full list):
- Attention: torch_attention, torch_attention_sdpa, torch_attention_repeat_kv
- MLA: torch_mla
- RoPE: torch_rope_with_explicit_cos_sin, torch_rope_with_complex_freqs, torch_rope_with_qk_interleaving
- MoE: torch_moe, torch_moe_fused, torch_moe_router, torch_moe_dense_mlp
- Normalization: torch_rmsnorm, torch_rmsnorm_gated, torch_l2norm
- Linear: torch_linear_simple
- SSM/conv: torch_ssm, torch_causal_conv1d
- Delta rule: torch_gated_delta_rule
- Quantized linear: torch_quant_fp8_linear, torch_quant_nvfp4_linear, etc.

Never use triton_*/flashinfer_*/trtllm_* — backend selection happens later in AD transforms. Plain PyTorch is acceptable ONLY for operations where no canonical op exists (e.g., simple activation functions, embedding lookups, basic tensor arithmetic). If you find yourself writing manual attention, MoE routing, RoPE, or normalization in plain PyTorch, stop and use the canonical op instead.
Do NOT use repeat_interleave or repeat_kv for GQA. HF reference code often repeats K/V heads to match the Q head count before attention. The AD canonical attention ops (torch_attention, torch_attention_sdpa) handle GQA natively — they accept Q, K, V with different head counts and do the right thing internally. Manually repeating K/V heads is unnecessary bloat and prevents AD from optimizing the attention path.
Register the model: call AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", <Name>ForCausalLM) and add an __all__ entry in models/custom/__init__.py.

If the config class is loadable via AutoConfig.from_pretrained(model_id) (either from the installed transformers or from files in the HF cache downloaded in Phase 0), import it from transformers and use it directly. Do NOT recreate or copy the config class into the modeling file when it is already available. Note: AD's factory already calls AutoConfig.from_pretrained(model_id, trust_remote_code=True) and passes the result to your model, so you rarely need to import the config at all — if you find yourself doing so, sanity-check that it's genuinely needed.

If the config class is NOT available (i.e., not in transformers and not bundled with the checkpoint), define a minimal config class in the modeling file and AutoConfig.register(model_type, ConfigCls, exist_ok=True). A good sanity check: if the E2E test passes without a custom config class, you don't need one — AutoConfig.from_pretrained already picked up the right class.

The custom model's forward signature must follow these rules:
- input_ids — The top-level model always receives input_ids. A submodule graph may internally receive inputs_embeds (e.g., after the embedding layer), but the exported entry point takes token IDs.
- position_ids — Vanilla sequential position_ids are always provided. Assert position_ids is not None at the top of the forward method — it is a required input, never optional. Do not include fallback logic to generate position_ids from input_ids (HF models often do this; strip it). If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of the provided vanilla position_ids.
- No inputs beyond input_ids and position_ids — no attention_mask, past_key_values, use_cache, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.

Create tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py. Use test_glm4_moe_lite_modeling.py as template. No smoke tests. Small config (hidden=64, layers=2-3, vocab=1000). Use pytest.skip if HF class unavailable.
HF Reference Strategy: Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs. Use actual HF classes if they exist — prefer importing directly over standalone HF-like implementations for unit tests. Standalone "reference" implementations are effectively alternative AD IR models and defeat the purpose of the reference test; they also tend to silently agree with whatever bugs exist in the custom model.
- If the HF classes are available in transformers: import them directly (e.g., from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM). Wrap imports in _get_hf_*_class() try/except helpers that return None on ImportError, and use pytest.skip when None.
- If the HF classes are NOT available in transformers: copy the minimal module definitions from the HF modeling_*.py source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific transformers version or HF cache at test time. Important: make sure the copy is minimal and strictly faithful to the HF implementation only. Do NOT tweak the functionality of the reference. The same applies to config classes that use trust_remote_code (i.e., not available in transformers): copy a minimal faithful version into the test file. The modeling file should NOT import the config class — AD loads it at runtime via AutoConfig.from_pretrained(..., trust_remote_code=True). The test-only config copy lets you verify config-wrapping behavior (e.g., structure of state_dict).
- Weight loading: go through the load_state_dict pre-hooks already registered on the custom model.

Numerical comparison: For equivalence tests comparing custom ops against HF reference, use the shared assert_rmse_close utility from _model_test_utils:
from _model_test_utils import assert_rmse_close
This computes rmse(actual - expected) / rmse(expected) — more robust than per-element torch.testing.assert_close since a few outlier elements won't fail the test. Use torch.testing.assert_close only for blocks with identical math (e.g., plain MLP with no custom ops).
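The metric can be sketched in pure Python (only the math; the real assert_rmse_close lives in _model_test_utils and operates on tensors):

```python
import math

def rmse(xs) -> float:
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def rmse_ratio(actual, expected) -> float:
    """rmse(actual - expected) / rmse(expected): the relative error behind
    assert_rmse_close. A few outlier elements barely move it, unlike
    per-element rtol/atol checks."""
    diff = [a - e for a, e in zip(actual, expected)]
    return rmse(diff) / rmse(expected)
```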
Recommended rmse_ratio_tol values for bfloat16:
- Blocks with identical math: torch.testing.assert_close with tight rtol/atol (1e-3)
- Otherwise: rmse_ratio_tol of 0.02, 0.05, or 0.10, depending on the test level

Bottom-up levels (each must pass before next):
- Compare against the HF reference with assert_rmse_close (or torch.testing.assert_close for identical-math blocks).
- Export via torch_export_to_gm with Dim.DYNAMIC for batch+seq, verify finite output, test a second shape.

Invoke the ad-onboard-reviewer subagent with ONLY the following information:
Do NOT include your own assessment of correctness. Do NOT summarize what you did. Let the reviewer read the files and judge independently.
If the reviewer returns FAIL on any item:
Do NOT proceed to Phase 8 until the reviewer returns PASS.
Before running the model end-to-end, ensure it and all identified family members from Phase 1 have valid entries in the AutoDeploy model registry at examples/auto_deploy/model_registry/.
For each model (the requested model + any family members identified in Phase 1 Step 2):
- Check examples/auto_deploy/model_registry/models.yaml for an existing entry matching the model's HF id. If none exists, add one with a yaml_extra list:
  - dashboard_default.yaml first.
  - world_size_N.yaml based on model size (1 for <2B, 2 for 2-15B, 4 for 20-80B, 8 for 80B+). The world_size determines how many GPUs are needed for the run.
  - Any model-specific settings (e.g. model_kwargs, non-default transforms) go in a config YAML under examples/auto_deploy/model_registry/configs/. See existing configs for format examples.

Family members that share the same architecture should all use the same modeling code. Different sizes only need different world_size_N.yaml entries and maybe different sharding configurations.
See examples/auto_deploy/model_registry/README.md for full documentation on the registry format and best practices.
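A hypothetical shape for a models.yaml entry, assuming the yaml_extra pattern described above (field names are illustrative guesses; the README is authoritative):

```yaml
# Illustrative sketch only: verify field names against the registry README.
- model: Qwen/Qwen3-8B
  yaml_extra:
    - dashboard_default.yaml
    - world_size_2.yaml   # 8B falls in the 2-15B bucket
```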
⚠️ Run build_and_run_ad.py --use-registry EXACTLY AS-IS ⚠️

You MUST run the model using the model registry YAML configs. No exceptions. No workarounds. No manual --args.yaml-extra overrides. The command is:
CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry
The --use-registry flag resolves ALL configuration from the model's entry in examples/auto_deploy/model_registry/models.yaml and its referenced YAML files under examples/auto_deploy/model_registry/configs/. This is the production path. You MUST validate the model works through it.
If the run FAILS with --use-registry:
- Do NOT fall back to manual --args.yaml-extra flags.
- Fix the registry instead: update models.yaml, modify or create config YAMLs under configs/, and re-run with --use-registry again.
- The model must run successfully with --use-registry before you are done.

Invoke the ad-run-agent subagent to run the model through AutoDeploy on GPU. Pass it:
Step 1 (reduced num layers): Run with a reduced number of layers to flush out e2e-flow issues and iterate faster. Generation will be bad in step 1 because not all layers are loaded.
Step 2 (full layers): Run with the full number of layers. Generation should be coherent in step 2.
The model is run via:
CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry
The ad-run-agent will determine the required world_size from the registry, check GPU availability via nvidia-smi, select free GPUs, and wait if not enough are available.
The ad-run-agent will build+run the model, check generation quality, archive logs, and update its worklog.
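The GPU-selection step can be sketched with stdlib parsing of nvidia-smi output (the helper name and the 1 GiB used threshold for "free" are assumptions):

```python
def pick_free_gpus(csv_text: str, world_size: int,
                   max_used_mib: int = 1024) -> list[int]:
    """Pick `world_size` GPU indices whose memory.used is below the threshold.

    `csv_text` is the output of:
      nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
    Returns [] when not enough GPUs are free (caller should wait and retry).
    """
    free = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        idx, used = (int(v) for v in line.split(","))
        if used < max_used_mib:
            free.append(idx)
    return free[:world_size] if len(free) >= world_size else []
```

The selected indices feed CUDA_VISIBLE_DEVICES, e.g. `",".join(map(str, gpus))`.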
If the run fails or produces bad generation:
- Fix the issue and re-run with --use-registry. Never bypass the registry.

Do NOT proceed to Phase 10 until step 2 with full layers reports a successful run with coherent generation.
⚠️ Full log of the build_and_run_ad.py run ⚠️

Print (not file) after completion:
- Changes to models.yaml and any new config YAMLs created
- Generation output from the build_and_run_ad.py --use-registry run. Copy-paste the COMPLETE prompt→output pairs verbatim from the run log. Do NOT summarize, truncate, or paraphrase them. The user needs to see exactly what the model generated to judge quality.

GitHub CLI config: Before running any gh command, confirm which GH_CONFIG_DIR to use. The default is ~/.config/gh, but a different directory may be needed when targeting a fork (e.g., nv-auto-deploy/TensorRT-LLM vs NVIDIA/TensorRT-LLM). Check if the user has specified a custom GH_CONFIG_DIR (e.g., in CLAUDE.local.md or environment). If not, ask the user before proceeding. Prefix all gh commands with: GH_CONFIG_DIR=<path> gh ...
Prepare a pull request against upstream (https://github.com/NVIDIA/TensorRT-LLM) targeting branch main. Then ask the user to provide feedback on the PR and wait for the user to get back to you when the feedback has been posted. Then continue iterating according to the user's feedback. For any comment or other post, prepend your message with "[AGENT]" so that it is clear that a coding agent posted it.
When you post a PR, you MUST include:
- Generation output from the build_and_run_ad.py --use-registry run. Copy-paste the COMPLETE prompt→output pairs verbatim — do NOT summarize, truncate, or paraphrase. The reviewer needs to see exactly what the model generated.
- The exact command used: python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry
Every single time you push changes to the PR — whether it is a new commit, a rebase, an amendment, a fixup, or any other update — you MUST:
- Re-run build_and_run_ad.py --use-registry using the ad-run-agent subagent, exactly as in Phase 9. The code has changed, so previous run results are stale and invalid.
- Re-run the unit tests (pytest <test_file> -v) for the model's test file created in Phase 6. Previous test results are stale and invalid after any code change.
- Post the full raw logs from both pytest and build_and_run_ad.py verbatim — do NOT summarize, truncate, or paraphrase.

This is not optional. There are no exceptions. Even if the change seems trivial (a typo fix, a comment edit, a formatting change), both runs must be re-executed and the full raw logs must be posted. The reviewer cannot verify correctness without seeing generation output AND test results from the exact code that is currently on the branch.
Workflow for every PR update cycle:
1. Rebase: git fetch upstream && git rebase upstream/main. If there are conflicts, resolve them before proceeding. Do NOT push without rebasing first — the branch must be up-to-date with the target branch.
2. Invoke the ad-run-agent to run build_and_run_ad.py --model <MODEL-ID> --use-registry on the updated code.
3. Run pytest <test_file> -v.
4. Post the full raw logs from the pytest and build_and_run_ad.py run.

After opening the PR and after every PR update you post, you MUST set up a polling loop that checks for new PR comments every 5 minutes. Do not simply post and walk away — actively monitor the PR for reviewer feedback.
How to poll:
# Fetch all PR comments, sorted newest-first, and check for any posted after your last comment
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/pulls/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
# Also check issue-level comments (top-level PR comments, not inline review comments)
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/issues/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
# Also check the PR's review status
GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state
Polling loop behavior:
Do NOT stop polling prematurely. The loop must continue until the PR is approved or a clear termination signal is received. If polling has been running for an extended period (e.g., >2 hours) with no new activity, inform the user that you are still monitoring and ask if they want you to continue or stop.
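The polling behavior can be sketched with the gh calls injected as a callable, so the control flow is testable (poll_pr and fetch_feedback are hypothetical names; fetch_feedback wraps the gh api commands shown above and returns True when a new comment or review appeared since the last post):

```python
import time
from typing import Callable

def poll_pr(fetch_feedback: Callable[[], bool],
            interval_s: int = 300,        # 5-minute cadence from above
            max_idle_polls: int = 24,     # ~2 h idle budget at defaults
            sleep=time.sleep) -> str:
    """Poll until feedback arrives or the idle budget runs out.

    Returns "feedback" when fetch_feedback reports new activity, or
    "idle_timeout" after max_idle_polls quiet polls, at which point the
    agent should check in with the user rather than stop silently.
    """
    for _ in range(max_idle_polls):
        if fetch_feedback():
            return "feedback"
        sleep(interval_s)
    return "idle_timeout"
```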
- Use torch.ops.auto_deploy.torch_* canonical ops whenever one exists for the operation. This is how AD knows what to optimize. Writing manual attention, MoE, RoPE, or normalization in plain PyTorch instead of using the canonical op will prevent AD transforms from working.
- No repeat_interleave: AD attention ops handle GQA natively. Never repeat K/V heads manually.
- Import config classes from transformers or load from checkpoint whenever possible. Only bundle a config class if it truly doesn't exist anywhere.
- position_ids: Always assert position_ids is not None — it is a required input, never optional.
- Every modeling_{name}.py is a standalone translation from HF source.
- Use the _ad_ prefix for RoPE buffers. RotaryEmbedding.forward(x, position_ids) MUST slice by position_ids once and return pre-sliced (cos, sin). Pass those tensors to all layers. NEVER pass position_ids through to each layer/attention forward to re-index — that is redundant compute that bloats the exported graph. See Phase 2 for the full pattern.
- MoE experts: keep nn.ModuleList per-expert for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
- noaux_tc routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace with fused trtllm kernels at deployment time.
- Use torch_* prefixed reference ops in AutoDeploy — never triton_*, flashinfer_*, or trtllm_*.