Generate model-derived operator shapes from real model architectures. Use when: populating model shapes, extracting operator shapes from models, generating model_shape/<op>_shape.py, LLM operator shapes, vLLM model shapes, batch seq_len shape generation.
Given an operator name, a model name, and lists of batch sizes and sequence lengths, derive the concrete tensor shapes that the operator sees during model inference. Write those shapes into model_shape/<op>_shape.py.
Default generation values in this skill:
- batch_sizes = [32, 64, 128]
- seq_lens = [1024, 2048, 4096]
Always use scripts/generate_model_shapes.py to generate shapes. This script loads model configs from HuggingFace transformers / vLLM, maps operator → shape formulas, deduplicates, and writes the shape file automatically. No 40 GB size filter is applied — all shapes are written; shapes that exceed GPU memory are handled by OOM guards at benchmark runtime.
source /root/wch/Hopper_benchmark/.venv/bin/activate
cd /root/wch/Hopper_benchmark
# Single operator, single model
python scripts/generate_model_shapes.py \
--op rms_norm --models llama3-8b \
--batch-sizes 32 64 128 --seq-lens 1024 2048 4096
# Multiple operators, multiple models
python scripts/generate_model_shapes.py \
--op softmax add rms_norm \
--models llama3-8b deepseek-v3 qwen3-8b \
--batch-sizes 32 64 128 --seq-lens 1024 2048 4096
# All operators for a model set
python scripts/generate_model_shapes.py \
--op all --models llama3-8b qwen2-72b \
--batch-sizes 32 64 128 --seq-lens 1024 2048 4096
# Dry-run (print shapes without writing)
python scripts/generate_model_shapes.py \
--op embedding --models deepseek-v3 \
--batch-sizes 32 64 128 --seq-lens 1024 2048 4096 \
--dry-run
# Append to existing shapes (don't overwrite)
python scripts/generate_model_shapes.py \
--op rms_norm --models qwen3-8b \
--batch-sizes 32 64 --seq-lens 1024 4096 \
--append
# List available models / operators
python scripts/generate_model_shapes.py --list-models
python scripts/generate_model_shapes.py --list-ops
The script handles all steps automatically:
- Loads model configs (HuggingFace transformers config classes)
- Maps operator → shape formula (T=B*S, batched (B,NH,S,S), special encoding, etc.)
- Generates all batch × seq_len combinations
- Deduplicates and writes model_shape/<op>_shape.py with docstring and per-model grouping
- Output is directly loadable by load_model_shapes()

To add a new model, add a loader function and register it in _MODEL_LOADERS in scripts/generate_model_shapes.py:
def _load_new_model() -> ModelConfig:
    from transformers import SomeConfig
    c = SomeConfig(hidden_size=..., ...)
    return ModelConfig(
        name="NewModel-XB", hf_id="org/NewModel-XB",
        hidden_size=c.hidden_size, num_attention_heads=c.num_attention_heads,
        num_kv_heads=c.num_key_value_heads,
        head_dim=c.hidden_size // c.num_attention_heads,
        intermediate_size=c.intermediate_size, vocab_size=c.vocab_size,
    )
_MODEL_LOADERS["newmodel-xb"] = _load_new_model
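For orientation, `ModelConfig` presumably carries at least the fields the loader above fills in. A minimal pure-Python sketch (field names mirror standard HuggingFace config attributes; the real class lives in `scripts/generate_model_shapes.py` and may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    # Illustrative sketch, not the project's actual class definition.
    name: str
    hf_id: str
    hidden_size: int
    num_attention_heads: int
    num_kv_heads: int
    head_dim: int
    intermediate_size: int
    vocab_size: int

# LLaMA-3-8B values from the supported-models table in this document.
llama3_8b = ModelConfig(
    name="LLaMA-3-8B", hf_id="meta-llama/Llama-3.1-8B",
    hidden_size=4096, num_attention_heads=32, num_kv_heads=8,
    head_dim=4096 // 32, intermediate_size=14336, vocab_size=128256,
)
print(llama3_8b.head_dim)  # 128
```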
Output: model_shape/<op>_shape.py with real model-derived shapes (currently all empty).

| Parameter | Description | Example |
|---|---|---|
| op_name | Operator name (matches benchmarks/<op>/) | softmax, rms_norm, embedding |
| model_name | Model identifier (see supported models below) | llama3-8b, deepseek-v3, qwen2-72b |
| batch_sizes | List of batch sizes to generate shapes for | [32, 64, 128] |
| seq_lens | List of sequence lengths to generate shapes for | [1024, 2048, 4096] |
Retrieve the model's architecture config. Primary source: vLLM model configs (installed in the project venv at /root/wch/Hopper_benchmark/.venv/lib/python3.10/site-packages/vllm) or HuggingFace transformers.
vLLM 0.17.1 is installed in the project venv (.venv). Use it to load model configs directly — no internet or weights download required. Always activate the venv first:
source /root/wch/Hopper_benchmark/.venv/bin/activate
# or use the venv python directly:
/root/wch/Hopper_benchmark/.venv/bin/python
For non-standard models (DeepSeek-V3, ChatGLM, Mistral-vLLM, Falcon, etc.) that have custom config classes in vLLM:
# List available custom vLLM config files
ls /root/wch/Hopper_benchmark/.venv/lib/python3.10/site-packages/vllm/transformers_utils/configs/
# Includes: deepseek_v3.py, chatglm.py, mistral.py, falcon.py, nemotron.py, ...
# DeepSeek-V3 example (vLLM custom config)
from vllm.transformers_utils.configs.deepseek_v3 import DeepseekV3Config
c = DeepseekV3Config()
print(c.hidden_size, c.num_attention_heads, c.n_routed_experts,
      c.num_experts_per_tok, c.n_group, c.topk_group)
# → 7168, 128, 256, 8, 8, 4
For standard models (LLaMA, Qwen2, GPT-2, Mistral-HF) that use HuggingFace transformers config classes:
# LLaMA example (HuggingFace config — no weights needed)
from transformers import LlamaConfig
c = LlamaConfig() # defaults are LLaMA-2-7B
# Override with specific model parameters:
c = LlamaConfig(hidden_size=4096, num_attention_heads=32,
                num_key_value_heads=8, intermediate_size=14336, vocab_size=128256)
# Qwen2 example
from transformers import Qwen2Config
c = Qwen2Config(hidden_size=3584, num_attention_heads=28,
                num_key_value_heads=4, intermediate_size=18944, vocab_size=152064)
# Key field names (standard HuggingFace config):
# c.hidden_size, c.num_attention_heads, c.num_key_value_heads,
# c.intermediate_size, c.vocab_size
# c.head_dim (= hidden_size // num_attention_heads if not explicit)
Auto-detection script — if you have the model ID but are unsure which class to use:
from transformers import AutoConfig
# Only works if model is already cached locally or HF Hub is accessible:
config = AutoConfig.from_pretrained("<hf_model_id>", trust_remote_code=True)
print(type(config).__name__, vars(config))
Use the table below for well-known models. If the model is not listed, look it up from HuggingFace/vLLM first.
| Model | HF ID | hidden_size | num_heads | num_kv_heads | head_dim | intermediate_size | vocab_size | Notes |
|---|---|---|---|---|---|---|---|---|
| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | 4096 | 32 | 32 | 128 | 11008 | 32000 | MHA, SwiGLU |
| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | 5120 | 40 | 40 | 128 | 13824 | 32000 | MHA, SwiGLU |
| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | 8192 | 64 | 8 | 128 | 28672 | 32000 | GQA, SwiGLU |
| LLaMA-3-8B | meta-llama/Llama-3.1-8B | 4096 | 32 | 8 | 128 | 14336 | 128256 | GQA, SwiGLU |
| LLaMA-3-70B | meta-llama/Llama-3.1-70B | 8192 | 64 | 8 | 128 | 28672 | 128256 | GQA, SwiGLU |
| LLaMA-3-405B | meta-llama/Llama-3.1-405B | 16384 | 128 | 8 | 128 | 53248 | 128256 | GQA, SwiGLU |
| Qwen2-7B | Qwen/Qwen2-7B | 3584 | 28 | 4 | 128 | 18944 | 152064 | GQA, SwiGLU |
| Qwen2-72B | Qwen/Qwen2-72B | 8192 | 64 | 8 | 128 | 29568 | 152064 | GQA, SwiGLU |
| Mistral-7B | mistralai/Mistral-7B-v0.1 | 4096 | 32 | 8 | 128 | 14336 | 32000 | GQA, SwiGLU |
| DeepSeek-V2 | deepseek-ai/DeepSeek-V2 | 5120 | 128 | 128 | 128 | 12288 | 102400 | MoE: 160 experts, top-6, 8 groups, topk_group=2, MLA |
| DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | 7168 | 128 | 128 | 128 | 18432 | 129280 | MoE: 256 experts, top-8, 8 groups, topk_group=4, MLA |
| GPT-2 | openai-community/gpt2 | 768 | 12 | 12 | 64 | 3072 | 50257 | MHA, GELU |
| GPT-2-XL | openai-community/gpt2-xl | 1600 | 25 | 25 | 64 | 6400 | 50257 | MHA, GELU |
Record all extracted parameters before proceeding to Step 2.
Each operator appears at specific locations in the transformer forward pass. Use this mapping to determine the shape formula for the given operator.
Convention:
B = batch size, S = sequence length, H = hidden_size, NH = num_heads, NKV = num_kv_heads, HD = head_dim, V = vocab_size, I = intermediate_size, T = num_tokens (used only for ops with inherently flattened token semantics: embedding, argmax, sort, topk, grouped_topk).
| Operator | Location in Model | Shape Formula | Shape Tuple Convention | Notes |
|---|---|---|---|---|
| embedding | Input embedding | (T, V, H) | (num_tokens, vocab_size, embed_dim) | 3-tuple; T = B*S (flattened) — no batch dim |
| rms_norm | Pre-attention, pre-MLP | (B, S, H) | 3D: batch × seq × hidden | Normalizes over last dim |
| rms_norm_without_weight | Variant without learnt weight | (B, S, H) | 3D: batch × seq × hidden | Same location as rms_norm |
| fused_add_rms_norm | Residual + norm (post-attn, post-MLP) | (B, S, H) | 3D: batch × seq × hidden | Fused residual add |
| rotary_embedding | Q/K in attention | (B, S, NH, HD) | 4D standard | Last dim must be even |
| softmax | Attention scores | (B, NH, S, S) | 4D standard | Softmax over last dim; for GQA use NKV or NH depending on impl |
| add | Residual connections | (B, S, H) | 3D: batch × seq × hidden | |
| mul | Various scaling | (B, S, H) | 3D: batch × seq × hidden | |
| abs | Activation or loss | (B, S, H) | 3D: batch × seq × hidden | |
| max / reduce_max | Reduction ops | (B, S, H) | 3D: batch × seq × hidden | |
| sum / reduce_sum | Reduction ops | (B, S, H) | 3D: batch × seq × hidden | |
| argmax | Decoding / sampling | (T, V) | 2D: num_logits_tokens × vocab | T ≈ batch_size (last-token logits per request) |
| swi_glu | MLP activation (LLaMA/Mistral/Qwen) | (B, S, 2*I) | 3D — last dim must be even | Input is gate+value concat; output is half |
| hardswish | MLP activation (alternative) | (B, S, I) | 3D: batch × seq × intermediate | |
| sort | Sampling / top-p | (T, V) | 2D: num_tokens × vocab | T = B*S flattened |
| topk | Sampling / MoE routing | (T, V, K) | 3-tuple: (num_tokens, V, K) | K = number of top values |
| grouped_topk | MoE expert routing | (T, E, K, G, TG) | 5-tuple: (num_tokens, num_experts, k, num_expert_group, topk_group) | MoE-specific |
| ge | Mask generation | (B, S, H) | 3D: batch × seq × hidden | |
| where | Conditional selection | (B, S, H) | 3D: batch × seq × hidden | |
| masked_fill | Attention masking | (B, NH, S, S) | 4D standard | Causal mask in attention |
| fill | Initialization | (B, S, H) | 3D: batch × seq × hidden | |
| arange | Position IDs | (S,) | 1D | |
| expand | Broadcasting | (B, S, H) | 3D: batch × seq × hidden | |
| repeat | Tiling | (B, S, H) | 3D: batch × seq × hidden | |
| index_select | Gather | (n_indices, src_rows, n_cols) | 3-tuple: gather convention | See config.py for encoding |
| l2_norm | Normalization | (B, S, H) | 3D: batch × seq × hidden | Normalize over last dim |
| concat | KV-cache or multi-head concat | (N, B, S, H) | 4-tuple: num_tensors × (B, S, H) | First dim is tensor count |
| stack | Multi-head stacking | (N, B, S, H) | 4-tuple: num_tensors × (B, S, H) | First dim is tensor count |
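As an illustration, the table's most common formulas can be written as a small dispatch function. This is a sketch covering only a representative subset; the function name and defaults are assumptions, not the script's actual code:

```python
# Map a few representative operators to the shape formulas from the
# table above. B/S are the sweep variables; H/NH/V come from the model
# config (defaults here are LLaMA-3-8B values).
def op_shape(op: str, B: int, S: int,
             H: int = 4096, NH: int = 32, V: int = 128256) -> tuple[int, ...]:
    if op in ("rms_norm", "add", "mul", "ge", "where", "fill"):
        return (B, S, H)              # 3D: batch x seq x hidden
    if op in ("softmax", "masked_fill"):
        return (B, NH, S, S)          # 4D attention-score shape
    if op == "embedding":
        return (B * S, V, H)          # T = B*S flattened, no batch dim
    if op == "sort":
        return (B * S, V)             # flattened tokens x vocab
    raise ValueError(f"no formula for {op}")

print(op_shape("rms_norm", 32, 1024))  # (32, 1024, 4096)
print(op_shape("softmax", 32, 1024))   # (32, 32, 1024, 1024)
```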
Critical: Not all operators use simple tensor shapes. Read benchmarks/<op>/config.py to confirm the shape tuple convention:
# Check how the operator's config.py interprets the shape tuple
cat benchmarks/<op>/config.py
Special-encoding operators:
- embedding: (num_tokens, vocab_size, embedding_dim) — 3-tuple, vocab_size is metadata; T = B*S (flattened, no batch dim)
- topk: (num_tokens, V, K) — 3-tuple, K is the k parameter; num_tokens = B*S
- grouped_topk: (num_tokens, num_experts, k, num_expert_group, topk_group) — 5-tuple with MoE routing params
- concat / stack: (num_tensors, B, S, H) — 4-tuple, first dim is tensor count, remaining are per-tensor 3D shape

For each (batch, seq_len) pair, compute the operator's input tensor shape using the formula from Step 2.
Guidelines:
- Generate all batch_sizes × seq_lens combinations
- Use (B, S, dim) for all standard operators — preserve batch and sequence separately
- Use T = B * S only for operators with inherently token-indexed semantics (embedding, argmax, sort, topk, grouped_topk)
- Skip shapes where product(shape) * element_size(fp32) > 40 GB. For attention-pattern operators like softmax with (B, NH, S, S), the S*S dimension grows quadratically — watch for large S values.
- For grouped_topk, num_tokens = B * S, and the expert/group params come from the model config.

Example — rms_norm for LLaMA-3-8B:
Model: hidden_size=4096
batch_sizes=[32, 64, 128], seq_lens=[1024, 2048, 4096]
Shapes (B, S, H) — batch and sequence kept separate:
(32, 1024, 4096) # B=32, S=1024
(32, 2048, 4096) # B=32, S=2048
(32, 4096, 4096) # B=32, S=4096
(64, 1024, 4096) # B=64, S=1024
...deduplicate if shapes repeat across (B,S) combinations
Example — softmax for LLaMA-3-8B (attention scores):
Model: num_heads=32
batch_sizes=[32, 64, 128], seq_lens=[1024, 2048, 4096]
Shapes (B, NH, S, S):
(32, 32, 1024, 1024) # ~4 GB FP32 — OK
(32, 32, 2048, 2048) # ~16 GB FP32 — OK
(64, 32, 1024, 1024) # ~8 GB FP32 — OK
(64, 32, 2048, 2048) # ~32 GB FP32 — OK
(128, 32, 1024, 1024) # ~16 GB FP32 — OK
# (32, 32, 4096, 4096) # ~64 GB FP32 — skip (>40 GB)
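The GB figures in the comments above follow directly from element count × 4 bytes per FP32 value; a quick helper (hypothetical name, shown for illustration):

```python
import math

def fp32_gib(shape: tuple[int, ...]) -> float:
    # Total FP32 tensor size in GiB: element count x 4 bytes.
    return math.prod(shape) * 4 / 2**30

print(fp32_gib((32, 32, 1024, 1024)))  # 4.0
print(fp32_gib((32, 32, 4096, 4096)))  # 64.0 -> over the 40 GB cutoff
```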
Example — grouped_topk for DeepSeek-V3 (MoE routing):
Model: num_experts=256, k=8, num_expert_group=8, topk_group=4
batch_sizes=[32, 64, 128], seq_lens=[1024, 2048, 4096]
Shapes (T, num_experts, k, num_expert_group, topk_group):
(32768, 256, 8, 8, 4) # B=32, S=1024
(65536, 256, 8, 8, 4) # B=32, S=2048 (or B=64, S=1024)
(131072, 256, 8, 8, 4) # B=32, S=4096 (or B=64, S=2048, or B=128, S=1024)
(262144, 256, 8, 8, 4) # B=64, S=4096 (or B=128, S=2048)
(524288, 256, 8, 8, 4) # B=128, S=4096
...
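Putting the guidelines together, the per-operator generation loop can be sketched as follows. This is illustrative only — `scripts/generate_model_shapes.py` is the authoritative implementation, and `gen_shapes` is a hypothetical name:

```python
import math
from itertools import product

def gen_shapes(formula, batch_sizes, seq_lens, limit_gb=40):
    # Sweep all (B, S) combinations, apply the operator's shape formula,
    # drop shapes over the FP32 size limit, deduplicate preserving order.
    seen, out = set(), []
    for B, S in product(batch_sizes, seq_lens):
        shape = formula(B, S)
        if math.prod(shape) * 4 > limit_gb * 2**30:
            continue  # would exceed the FP32 size cutoff
        if shape not in seen:
            seen.add(shape)
            out.append(shape)
    return out

# softmax for LLaMA-3-8B (NH=32): the (B, NH, S, S) sweep keeps exactly
# the five shapes listed in the softmax example above.
shapes = gen_shapes(lambda B, S: (B, 32, S, S), [32, 64, 128], [1024, 2048, 4096])
print(len(shapes))  # 5
```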
Assemble the deduplicated shapes into a Python list. Add a comment header indicating the model and parameters used.
"""Model shapes for <op> operator.
Shapes derived from <model_name> (<hf_model_id>):
hidden_size={H}, num_heads={NH}, num_kv_heads={NKV},
head_dim={HD}, intermediate_size={I}, vocab_size={V}
Batch sizes: {batch_sizes}
Sequence lengths: {seq_lens}
"""
SHAPES: list[tuple[int, ...]] = [
    # B=32, S=1024
    (32, 1024, 4096),
    # B=32, S=2048
    (32, 2048, 4096),
    # ... etc
]
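The write step amounts to rendering that template. A hypothetical, simplified helper (not the script's actual code — it collapses the parameter lines into one string):

```python
def render_shape_file(op, model, params, shapes):
    # Render the module docstring plus the SHAPES list, roughly
    # matching the template above.
    lines = [
        f'"""Model shapes for {op} operator.',
        f"Shapes derived from {model}:",
        f"    {params}",
        '"""',
        "",
        "SHAPES: list[tuple[int, ...]] = [",
    ]
    lines += [f"    {s!r}," for s in shapes]
    lines.append("]")
    return "\n".join(lines) + "\n"

text = render_shape_file("rms_norm", "LLaMA-3-8B",
                         "hidden_size=4096", [(32, 1024, 4096)])
print(text)
```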
Validation checklist before writing:
- Shape tuple convention matches the operator's config (benchmarks/<op>/config.py)
- Memory estimated as product(shape) * 4 (FP32 element size)
- Target file is model_shape/<op>_shape.py

Replace the contents of the existing shape file:
# The file already exists (created when the operator was added)
cat model_shape/<op>_shape.py
Write the generated SHAPES list. Preserve the module-level docstring format.
Run a quick sanity check:
python -c "
from model_shape import load_model_shapes
import torch
cases = load_model_shapes('<op>', torch.float32)
print(f'Loaded {len(cases)} model shapes for <op>:')
for c in cases:
    print(f' {c.label}: shape={c.shape}, dtype={c.dtype}, category={c.workload_category}')
"
Confirm:
- workload_category is "Model Shape" for all cases
- Case labels follow the model_<dim1>x<dim2>x... naming

Then run the operator benchmark on a free GPU:

nvidia-smi # find a free GPU
CUDA_VISIBLE_DEVICES=<free_gpu> python scripts/run_bench.py --op <op> --dtype fp16 fp32
The model shapes will appear in the Model Shape section of the report.
When generating shapes for multiple models, combine shapes from all models into a single SHAPES list:
"""Model shapes for <op> operator.
Combined shapes from multiple models:
- LLaMA-3-8B: hidden_size=4096, num_heads=32, ...
- LLaMA-3-70B: hidden_size=8192, num_heads=64, ...
- Qwen2-72B: hidden_size=8192, num_heads=64, ...
Batch sizes: [32, 64, 128]
Sequence lengths: [1024, 2048, 4096]
"""
SHAPES: list[tuple[int, ...]] = [
# --- LLaMA-3-8B ---
(32, 1024, 4096),
(32, 2048, 4096),
...
# --- LLaMA-3-70B ---
(32, 1024, 8192),
(32, 2048, 8192),
...
]
Deduplicate across models — if two models produce the same shape for an operator, keep only one entry. Add a comment noting which models share that shape.
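Cross-model deduplication that also records which models share a shape (so the comment can note the sharing) can be sketched as:

```python
def merge_model_shapes(per_model: dict[str, list[tuple[int, ...]]]):
    # Keep the first occurrence of each shape, in insertion order, and
    # track every model that produces it.
    owners: dict[tuple[int, ...], list[str]] = {}
    order: list[tuple[int, ...]] = []
    for model, shapes in per_model.items():
        for s in shapes:
            if s not in owners:
                owners[s] = []
                order.append(s)
            owners[s].append(model)
    return [(s, owners[s]) for s in order]

# LLaMA-3-70B and Qwen2-72B share hidden_size=8192, so a 3D op like
# rms_norm yields identical shapes for both.
merged = merge_model_shapes({
    "llama3-70b": [(32, 1024, 8192)],
    "qwen2-72b": [(32, 1024, 8192)],
})
print(merged)  # [((32, 1024, 8192), ['llama3-70b', 'qwen2-72b'])]
```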
Not all operators appear in all models. Use this quick reference:
| Operator | LLaMA-2/3 | Qwen2 | Mistral | DeepSeek-V2/V3 | GPT-2 |
|---|---|---|---|---|---|
| embedding | ✓ | ✓ | ✓ | ✓ | ✓ |
| rms_norm | ✓ | ✓ | ✓ | ✓ | — (uses LayerNorm) |
| fused_add_rms_norm | ✓ | ✓ | ✓ | ✓ | — |
| rotary_embedding | ✓ | ✓ | ✓ | ✓ | — (uses learned pos) |
| softmax | ✓ | ✓ | ✓ | ✓ | ✓ |
| swi_glu | ✓ | ✓ | ✓ | ✓ | — (uses GELU) |
| grouped_topk | — | — | — | ✓ (MoE) | — |
| topk | ✓ | ✓ | ✓ | ✓ | ✓ |
| add | ✓ | ✓ | ✓ | ✓ | ✓ |
| mul | ✓ | ✓ | ✓ | ✓ | ✓ |
| sort | ✓ | ✓ | ✓ | ✓ | ✓ |
| argmax | ✓ | ✓ | ✓ | ✓ | ✓ |
| masked_fill | ✓ | ✓ | ✓ | ✓ | ✓ |
| where | ✓ | ✓ | ✓ | ✓ | ✓ |
If the operator is not applicable to the requested model, inform the user and suggest relevant models instead.
vLLM uses flattened token representations during inference for most tensor ops. The benchmark convention is:
- Standard tensor ops: (B, S, H) — preserve batch and sequence separately, even though vLLM flattens at runtime.
- Token-indexed ops (embedding, argmax, sort, topk, grouped_topk): T = B * S as first dim, matching vLLM semantics. These ops index into vocab or expert tables and don't have a meaningful (B, S) decomposition in context.

PagedAttention in vLLM changes the attention shape:
- Query becomes (T, NH, HD), with K/V paged in blocks
- The benchmark keeps the standard (B, NH, S, S) attention score shape — this represents the per-head attention matrix regardless of paging.

Related resources:
- load_model_shapes() loader
- BenchmarkCase and classify_workload
- Project venv (/root/wch/Hopper_benchmark/.venv/)
- /root/wch/Hopper_benchmark/.venv/lib/python3.10/site-packages/vllm/transformers_utils/configs/ — Custom model config classes for non-standard models
- deepseek_v3.py → DeepseekV3Config (MoE fields: n_routed_experts, num_experts_per_tok, n_group, topk_group)
- chatglm.py → ChatGLMConfig
- mistral.py → vLLM Mistral variant
- falcon.py → RWConfig (original Falcon-7B/40B)
- nemotron.py → NemotronConfig
- Standard models use transformers config classes directly — no vLLM-specific file needed
- Venv Python interpreter: /root/wch/Hopper_benchmark/.venv/bin/python