Generate model-derived operator shapes from real model architectures. Use when: populating model shapes, extracting operator shapes from models, generating model_shape/<op>_shape.py, LLM operator shapes, vLLM model shapes, batch seq_len shape generation.
Given an operator name, a model name, and lists of batch sizes and sequence lengths, derive the concrete tensor shapes that the operator sees during model inference. Write those shapes into model_shape/<op>_shape.py.
Default generation values in this skill:
- batch_sizes = [32, 64, 128]
- seq_lens = [1024, 2048, 4096]
Always use scripts/generate_model_shapes.py to generate shapes. This script loads model configs from HuggingFace transformers / vLLM, maps operator → shape formulas, deduplicates, and writes the shape file automatically. No 40 GB size filter is applied — all shapes are written; shapes that exceed GPU memory are handled by OOM guards at benchmark runtime.
source /root/wch/Hopper_benchmark/.venv/bin/activate
cd /root/wch/Hopper_benchmark
# Single operator, single model
python scripts/generate_model_shapes.py \
--op rms_norm --models llama3-8b \
--batch-sizes 32 64 128 --seq-lens 1024 2048 4096
# Multiple operators, multiple models
python scripts/generate_model_shapes.py \
--op softmax add rms_norm \
--models llama3-8b deepseek-v3 qwen3-8b \
--batch-sizes 32 64 128 --seq-lens 1024 2048 4096
# All operators for a model set
python scripts/generate_model_shapes.py \
--op all --models llama3-8b qwen2-72b \
--batch-sizes 32 64 128 --seq-lens 1024 2048 4096
# Dry-run (print shapes without writing)
python scripts/generate_model_shapes.py \
--op embedding --models deepseek-v3 \
--batch-sizes 32 64 128 --seq-lens 1024 2048 4096 \
--dry-run
# Append to existing shapes (don't overwrite)
python scripts/generate_model_shapes.py \
--op rms_norm --models qwen3-8b \
--batch-sizes 32 64 --seq-lens 1024 4096 \
--append
# List available models / operators
python scripts/generate_model_shapes.py --list-models
python scripts/generate_model_shapes.py --list-ops
The script handles all steps automatically:
- Loads model configs (HuggingFace transformers config classes)
- Maps operator → shape formula (T=B*S, batched (B,NH,S,S), special encoding, etc.)
- Generates all batch × seq_len combinations
- Deduplicates and writes model_shape/<op>_shape.py with docstring and per-model grouping
- Output is directly loadable by load_model_shapes()

To add a new model, add a loader function and register it in _MODEL_LOADERS in scripts/generate_model_shapes.py:
def _load_new_model() -> ModelConfig:
    from transformers import SomeConfig
    c = SomeConfig(hidden_size=..., ...)
    return ModelConfig(
        name="NewModel-XB", hf_id="org/NewModel-XB",
        hidden_size=c.hidden_size, num_attention_heads=c.num_attention_heads,
        num_kv_heads=c.num_key_value_heads,
        head_dim=c.hidden_size // c.num_attention_heads,
        intermediate_size=c.intermediate_size, vocab_size=c.vocab_size,
    )
_MODEL_LOADERS["newmodel-xb"] = _load_new_model
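For orientation, `ModelConfig` presumably carries at least the fields the loader above fills in. A minimal pure-Python sketch (field names mirror standard HuggingFace config attributes; the real class lives in `scripts/generate_model_shapes.py` and may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    # Illustrative sketch, not the project's actual class definition.
    name: str
    hf_id: str
    hidden_size: int
    num_attention_heads: int
    num_kv_heads: int
    head_dim: int
    intermediate_size: int
    vocab_size: int

# LLaMA-3-8B values from the supported-models table in this document.
llama3_8b = ModelConfig(
    name="LLaMA-3-8B", hf_id="meta-llama/Llama-3.1-8B",
    hidden_size=4096, num_attention_heads=32, num_kv_heads=8,
    head_dim=4096 // 32, intermediate_size=14336, vocab_size=128256,
)
print(llama3_8b.head_dim)  # 128
```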
Output: model_shape/<op>_shape.py with real model-derived shapes (currently all empty).

| Parameter | Description | Example |
|---|---|---|
| op_name | Operator name (matches benchmarks/<op>/) | softmax, rms_norm, embedding |
| model_name | Model identifier (see supported models below) | llama3-8b, deepseek-v3, qwen2-72b |
| batch_sizes | List of batch sizes to generate shapes for | [32, 64, 128] |
| seq_lens | List of sequence lengths to generate shapes for | [1024, 2048, 4096] |
Retrieve the model's architecture config. Primary source: vLLM model configs (installed in the project venv at /root/wch/Hopper_benchmark/.venv/lib/python3.10/site-packages/vllm) or HuggingFace transformers.
vLLM 0.17.1 is installed in the project venv (.venv). Use it to load model configs directly — no internet or weights download required. Always activate the venv first:
source /root/wch/Hopper_benchmark/.venv/bin/activate
# or use the venv python directly:
/root/wch/Hopper_benchmark/.venv/bin/python
For non-standard models (DeepSeek-V3, ChatGLM, Mistral-vLLM, Falcon, etc.) that have custom config classes in vLLM:
# List available custom vLLM config files
ls /root/wch/Hopper_benchmark/.venv/lib/python3.10/site-packages/vllm/transformers_utils/configs/
# Includes: deepseek_v3.py, chatglm.py, mistral.py, falcon.py, nemotron.py, ...
# DeepSeek-V3 example (vLLM custom config)
from vllm.transformers_utils.configs.deepseek_v3 import DeepseekV3Config
c = DeepseekV3Config()
print(c.hidden_size, c.num_attention_heads, c.n_routed_experts,
      c.num_experts_per_tok, c.n_group, c.topk_group)
# → 7168, 128, 256, 8, 8, 4
For standard models (LLaMA, Qwen2, GPT-2, Mistral-HF) that use HuggingFace transformers config classes:
# LLaMA example (HuggingFace config — no weights needed)
from transformers import LlamaConfig
c = LlamaConfig() # defaults are LLaMA-2-7B
# Override with specific model parameters:
c = LlamaConfig(hidden_size=4096, num_attention_heads=32,
                num_key_value_heads=8, intermediate_size=14336, vocab_size=128256)
# Qwen2 example
from transformers import Qwen2Config
c = Qwen2Config(hidden_size=3584, num_attention_heads=28,
                num_key_value_heads=4, intermediate_size=18944, vocab_size=152064)
# Key field names (standard HuggingFace config):
# c.hidden_size, c.num_attention_heads, c.num_key_value_heads,
# c.intermediate_size, c.vocab_size
# c.head_dim (= hidden_size // num_attention_heads if not explicit)
Auto-detection script — if you have the model ID but are unsure which class to use:
from transformers import AutoConfig
# Only works if model is already cached locally or HF Hub is accessible:
config = AutoConfig.from_pretrained("<hf_model_id>", trust_remote_code=True)
print(type(config).__name__, vars(config))
Use the table below for well-known models. If the model is not listed, look it up from HuggingFace/vLLM first.
| Model | HF ID | hidden_size | num_heads | num_kv_heads | head_dim | intermediate_size | vocab_size | Notes |
|---|---|---|---|---|---|---|---|---|
| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | 4096 | 32 | 32 | 128 | 11008 | 32000 | MHA, SwiGLU |
| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | 5120 | 40 | 40 | 128 | 13824 | 32000 | MHA, SwiGLU |
| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | 8192 | 64 | 8 | 128 | 28672 | 32000 | GQA, SwiGLU |
| LLaMA-3-8B | meta-llama/Llama-3.1-8B | 4096 | 32 | 8 | 128 | 14336 | 128256 | GQA, SwiGLU |
| LLaMA-3-70B | meta-llama/Llama-3.1-70B | 8192 | 64 | 8 | 128 | 28672 | 128256 | GQA, SwiGLU |
| LLaMA-3-405B | meta-llama/Llama-3.1-405B | 16384 | 128 | 8 | 128 | 53248 | 128256 | GQA, SwiGLU |
| Qwen2-7B | Qwen/Qwen2-7B | 3584 | 28 | 4 | 128 | 18944 | 152064 | GQA, SwiGLU |
| Qwen2-72B | Qwen/Qwen2-72B | 8192 | 64 | 8 | 128 | 29568 | 152064 | GQA, SwiGLU |
| Mistral-7B | mistralai/Mistral-7B-v0.1 | 4096 | 32 | 8 | 128 | 14336 | 32000 | GQA, SwiGLU |
| DeepSeek-V2 | deepseek-ai/DeepSeek-V2 | 5120 | 128 | 128 | 128 | 12288 | 102400 | MoE: 160 experts, top-6, 8 groups, topk_group=2, MLA |
| DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | 7168 | 128 | 128 | 128 | 18432 | 129280 | MoE: 256 experts, top-8, 8 groups, topk_group=4, MLA |
| GPT-2 | openai-community/gpt2 | 768 | 12 | 12 | 64 | 3072 | 50257 | MHA, GELU |
| GPT-2-XL | openai-community/gpt2-xl | 1600 | 25 | 25 | 64 | 6400 | 50257 | MHA, GELU |
Record all extracted parameters before proceeding to Step 2.
Each operator appears at specific locations in the transformer forward pass. Use this mapping to determine the shape formula for the given operator.
Convention:
B = batch size, S = sequence length, H = hidden_size, NH = num_heads, NKV = num_kv_heads, HD = head_dim, V = vocab_size, I = intermediate_size, T = num_tokens (used only for ops with inherently flattened token semantics: embedding, argmax, sort, topk, grouped_topk).
| Operator | Location in Model | Shape Formula | Shape Tuple Convention | Notes |
|---|---|---|---|---|
| embedding | Input embedding | (T, V, H) | (num_tokens, vocab_size, embed_dim) | 3-tuple; T = B*S (flattened) — no batch dim |
| rms_norm | Pre-attention, pre-MLP | (B, S, H) | 3D: batch × seq × hidden | Normalizes over last dim |
| rms_norm_without_weight | Variant without learnt weight | (B, S, H) | 3D: batch × seq × hidden | Same location as rms_norm |
| fused_add_rms_norm | Residual + norm (post-attn, post-MLP) | (B, S, H) | 3D: batch × seq × hidden | Fused residual add |
| rotary_embedding | Q/K in attention | (B, S, NH, HD) | 4D standard | Last dim must be even |
| softmax | Attention scores | (B, NH, S, S) | 4D standard | Softmax over last dim; for GQA use NKV or NH depending on impl |
| add | Residual connections | (B, S, H) | 3D: batch × seq × hidden | |
| mul | Various scaling | (B, S, H) | 3D: batch × seq × hidden | |
| abs | Activation or loss | (B, S, H) | 3D: batch × seq × hidden | |
| max / reduce_max | Reduction ops | (B, S, H) | 3D: batch × seq × hidden | |
| sum / reduce_sum | Reduction ops | (B, S, H) | 3D: batch × seq × hidden | |
| argmax | Decoding / sampling | (T, V) | 2D: num_logits_tokens × vocab | T ≈ batch_size (last-token logits per request) |
| swi_glu | MLP activation (LLaMA/Mistral/Qwen) | (B, S, 2*I) | 3D — last dim must be even | Input is gate+value concat; output is half |
| hardswish | MLP activation (alternative) | (B, S, I) | 3D: batch × seq × intermediate | |
| sort | Sampling / top-p | (T, V) | 2D: num_tokens × vocab | T = B*S flattened |
| topk | Sampling / MoE routing | (T, V, K) | 3-tuple: (num_tokens, V, K) | K = number of top values |
| grouped_topk | MoE expert routing | (T, E, K, G, TG) | 5-tuple: (num_tokens, num_experts, k, num_expert_group, topk_group) | MoE-specific |
| ge | Mask generation | (B, S, H) | 3D: batch × seq × hidden | |
| where | Conditional selection | (B, S, H) | 3D: batch × seq × hidden | |
| masked_fill | Attention masking | (B, NH, S, S) | 4D standard | Causal mask in attention |
| fill | Initialization | (B, S, H) | 3D: batch × seq × hidden | |
| arange | Position IDs | (S,) | 1D | |
| expand | Broadcasting | (B, S, H) | 3D: batch × seq × hidden | |
| repeat | Tiling | (B, S, H) | 3D: batch × seq × hidden | |
| index_select | Gather | (n_indices, src_rows, n_cols) | 3-tuple: gather convention | See config.py for encoding |
| l2_norm | Normalization | (B, S, H) | 3D: batch × seq × hidden | Normalize over last dim |
| concat | KV-cache or multi-head concat | (N, B, S, H) | 4-tuple: num_tensors × (B, S, H) | First dim is tensor count |
| stack | Multi-head stacking | (N, B, S, H) | 4-tuple: num_tensors × (B, S, H) | First dim is tensor count |
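As an illustration, the table's most common formulas can be written as a small dispatch function. This is a sketch covering only a representative subset; the function name and defaults are assumptions, not the script's actual code:

```python
# Map a few representative operators to the shape formulas from the
# table above. B/S are the sweep variables; H/NH/V come from the model
# config (defaults here are LLaMA-3-8B values).
def op_shape(op: str, B: int, S: int,
             H: int = 4096, NH: int = 32, V: int = 128256) -> tuple[int, ...]:
    if op in ("rms_norm", "add", "mul", "ge", "where", "fill"):
        return (B, S, H)              # 3D: batch x seq x hidden
    if op in ("softmax", "masked_fill"):
        return (B, NH, S, S)          # 4D attention-score shape
    if op == "embedding":
        return (B * S, V, H)          # T = B*S flattened, no batch dim
    if op == "sort":
        return (B * S, V)             # flattened tokens x vocab
    raise ValueError(f"no formula for {op}")

print(op_shape("rms_norm", 32, 1024))  # (32, 1024, 4096)
print(op_shape("softmax", 32, 1024))   # (32, 32, 1024, 1024)
```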
Critical: Not all operators use simple tensor shapes. Read benchmarks/<op>/config.py to confirm the shape tuple convention:
# Check how the operator's config.py interprets the shape tuple
cat benchmarks/<op>/config.py
Special-encoding operators:
- embedding: (num_tokens, vocab_size, embedding_dim) — 3-tuple, vocab_size is metadata; T = B*S (flattened, no batch dim)
- topk: (num_tokens, V, K) — 3-tuple, K is the k parameter; num_tokens = B*S
- grouped_topk: (num_tokens, num_experts, k, num_expert_group, topk_group) — 5-tuple with MoE routing params
- concat / stack: (num_tensors, B, S, H) — 4-tuple, first dim is tensor count, remaining are per-tensor 3D shape

For each (batch, seq_len) pair, compute the operator's input tensor shape using the formula from Step 2.
Guidelines:
- Generate all batch_sizes × seq_lens combinations
- Use (B, S, dim) for all standard operators — preserve batch and sequence separately
- Use T = B * S only for operators with inherently token-indexed semantics (embedding, argmax, sort, topk, grouped_topk)
- Skip shapes where product(shape) * element_size(fp32) > 40 GB. For attention-pattern operators like softmax with (B, NH, S, S), the S*S dimension grows quadratically — watch for large S values.
- For grouped_topk, num_tokens = B * S, and the expert/group params come from the model config.

Example — rms_norm for LLaMA-3-8B:
Model: hidden_size=4096
batch_sizes=[32, 64, 128], seq_lens=[1024, 2048, 4096]
Shapes (B, S, H) — batch and sequence kept separate:
(32, 1024, 4096) # B=32, S=1024
(32, 2048, 4096) # B=32, S=2048
(32, 4096, 4096) # B=32, S=4096
(64, 1024, 4096) # B=64, S=1024
...deduplicate if shapes repeat across (B,S) combinations
Example — softmax for LLaMA-3-8B (attention scores):
Model: num_heads=32
batch_sizes=[32, 64, 128], seq_lens=[1024, 2048, 4096]
Shapes (B, NH, S, S):
(32, 32, 1024, 1024) # ~4 GB FP32 — OK
(32, 32, 2048, 2048) # ~16 GB FP32 — OK
(64, 32, 1024, 1024) # ~8 GB FP32 — OK
(64, 32, 2048, 2048) # ~32 GB FP32 — OK
(128, 32, 1024, 1024) # ~16 GB FP32 — OK
# (32, 32, 4096, 4096) # ~64 GB FP32 — skip (>40 GB)
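The GB figures in the comments above follow directly from element count × 4 bytes per FP32 value; a quick helper (hypothetical name, shown for illustration):

```python
import math

def fp32_gib(shape: tuple[int, ...]) -> float:
    # Total FP32 tensor size in GiB: element count x 4 bytes.
    return math.prod(shape) * 4 / 2**30

print(fp32_gib((32, 32, 1024, 1024)))  # 4.0
print(fp32_gib((32, 32, 4096, 4096)))  # 64.0 -> over the 40 GB cutoff
```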
Example — grouped_topk for DeepSeek-V3 (MoE routing):
Model: num_experts=256, k=8, num_expert_group=8, topk_group=4
batch_sizes=[32, 64, 128], seq_lens=[1024, 2048, 4096]
Shapes (T, num_experts, k, num_expert_group, topk_group):
(32768, 256, 8, 8, 4) # B=32, S=1024
(65536, 256, 8, 8, 4) # B=32, S=2048 (or B=64, S=1024)
(131072, 256, 8, 8, 4) # B=32, S=4096 (or B=64, S=2048, or B=128, S=1024)
(262144, 256, 8, 8, 4) # B=64, S=4096 (or B=128, S=2048)
(524288, 256, 8, 8, 4) # B=128, S=4096
...
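Putting the guidelines together, the per-operator generation loop can be sketched as follows. This is illustrative only — `scripts/generate_model_shapes.py` is the authoritative implementation, and `gen_shapes` is a hypothetical name:

```python
import math
from itertools import product

def gen_shapes(formula, batch_sizes, seq_lens, limit_gb=40):
    # Sweep all (B, S) combinations, apply the operator's shape formula,
    # drop shapes over the FP32 size limit, deduplicate preserving order.
    seen, out = set(), []
    for B, S in product(batch_sizes, seq_lens):
        shape = formula(B, S)
        if math.prod(shape) * 4 > limit_gb * 2**30:
            continue  # would exceed the FP32 size cutoff
        if shape not in seen:
            seen.add(shape)
            out.append(shape)
    return out

# softmax for LLaMA-3-8B (NH=32): the (B, NH, S, S) sweep keeps exactly
# the five shapes listed in the softmax example above.
shapes = gen_shapes(lambda B, S: (B, 32, S, S), [32, 64, 128], [1024, 2048, 4096])
print(len(shapes))  # 5
```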
Assemble the deduplicated shapes into a Python list. Add a comment header indicating the model and parameters used.
"""Model shapes for <op> operator.
Shapes derived from <model_name> (<hf_model_id>):
hidden_size={H}, num_heads={NH}, num_kv_heads={NKV},
head_dim={HD}, intermediate_size={I}, vocab_size={V}
Batch sizes: {batch_sizes}
Sequence lengths: {seq_lens}
"""
SHAPES: list[tuple[int, ...]] = [
    # B=32, S=1024
    (32, 1024, 4096),
    # B=32, S=2048
    (32, 2048, 4096),
    # ... etc
]
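The write step amounts to rendering that template. A hypothetical, simplified helper (not the script's actual code — it collapses the parameter lines into one string):

```python
def render_shape_file(op, model, params, shapes):
    # Render the module docstring plus the SHAPES list, roughly
    # matching the template above.
    lines = [
        f'"""Model shapes for {op} operator.',
        f"Shapes derived from {model}:",
        f"    {params}",
        '"""',
        "",
        "SHAPES: list[tuple[int, ...]] = [",
    ]
    lines += [f"    {s!r}," for s in shapes]
    lines.append("]")
    return "\n".join(lines) + "\n"

text = render_shape_file("rms_norm", "LLaMA-3-8B",
                         "hidden_size=4096", [(32, 1024, 4096)])
print(text)
```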
Validation checklist before writing:
- Shape tuple convention matches the operator's config (benchmarks/<op>/config.py)
- Memory estimated as product(shape) * 4 (FP32 element size)
- Target file is model_shape/<op>_shape.py

Replace the contents of the existing shape file:
# The file already exists (created when the operator was added)
cat model_shape/<op>_shape.py
Write the generated SHAPES list. Preserve the module-level docstring format.
Run a quick sanity check:
python -c "
from model_shape import load_model_shapes
import torch
cases = load_model_shapes('<op>', torch.float32)
print(f'Loaded {len(cases)} model shapes for <op>:')
for c in cases:
    print(f' {c.label}: shape={c.shape}, dtype={c.dtype}, category={c.workload_category}')
"
Confirm:
- workload_category is "Model Shape" for all cases
- Case labels follow the model_<dim1>x<dim2>x... naming

Then run the operator benchmark on a free GPU:

nvidia-smi # find a free GPU
CUDA_VISIBLE_DEVICES=<free_gpu> python scripts/run_bench.py --op <op> --dtype fp16 fp32
The model shapes will appear in the Model Shape section of the report.
When generating shapes for multiple models, combine shapes from all models into a single SHAPES list:
"""Model shapes for <op> operator.
Combined shapes from multiple models:
- LLaMA-3-8B: hidden_size=4096, num_heads=32, ...
- LLaMA-3-70B: hidden_size=8192, num_heads=64, ...
- Qwen2-72B: hidden_size=8192, num_heads=64, ...
Batch sizes: [32, 64, 128]
Sequence lengths: [1024, 2048, 4096]
"""
SHAPES: list[tuple[int, ...]] = [
# --- LLaMA-3-8B ---
(32, 1024, 4096),
(32, 2048, 4096),
...
# --- LLaMA-3-70B ---
(32, 1024, 8192),
(32, 2048, 8192),
...
]
Deduplicate across models — if two models produce the same shape for an operator, keep only one entry. Add a comment noting which models share that shape.
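Cross-model deduplication that also records which models share a shape (so the comment can note the sharing) can be sketched as:

```python
def merge_model_shapes(per_model: dict[str, list[tuple[int, ...]]]):
    # Keep the first occurrence of each shape, in insertion order, and
    # track every model that produces it.
    owners: dict[tuple[int, ...], list[str]] = {}
    order: list[tuple[int, ...]] = []
    for model, shapes in per_model.items():
        for s in shapes:
            if s not in owners:
                owners[s] = []
                order.append(s)
            owners[s].append(model)
    return [(s, owners[s]) for s in order]

# LLaMA-3-70B and Qwen2-72B share hidden_size=8192, so a 3D op like
# rms_norm yields identical shapes for both.
merged = merge_model_shapes({
    "llama3-70b": [(32, 1024, 8192)],
    "qwen2-72b": [(32, 1024, 8192)],
})
print(merged)  # [((32, 1024, 8192), ['llama3-70b', 'qwen2-72b'])]
```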
Not all operators appear in all models. Use this quick reference:
| Operator | LLaMA-2/3 | Qwen2 | Mistral | DeepSeek-V2/V3 | GPT-2 |
|---|---|---|---|---|---|
| embedding | ✓ | ✓ | ✓ | ✓ | ✓ |
| rms_norm | ✓ | ✓ | ✓ | ✓ | — (uses LayerNorm) |
| fused_add_rms_norm | ✓ | ✓ | ✓ | ✓ | — |
| rotary_embedding | ✓ | ✓ | ✓ | ✓ | — (uses learned pos) |
| softmax | ✓ | ✓ | ✓ | ✓ | ✓ |
| swi_glu | ✓ | ✓ | ✓ | ✓ | — (uses GELU) |
| grouped_topk | — | — | — | ✓ (MoE) | — |
| topk | ✓ | ✓ | ✓ | ✓ | ✓ |
| add | ✓ | ✓ | ✓ | ✓ | ✓ |
| mul | ✓ | ✓ | ✓ | ✓ | ✓ |
| sort | ✓ | ✓ | ✓ | ✓ | ✓ |
| argmax | ✓ | ✓ | ✓ | ✓ | ✓ |
| masked_fill | ✓ | ✓ | ✓ | ✓ | ✓ |
| where | ✓ | ✓ | ✓ | ✓ | ✓ |
If the operator is not applicable to the requested model, inform the user and suggest relevant models instead.
vLLM uses flattened token representations during inference for most tensor ops. The benchmark convention is:
- Standard tensor ops: (B, S, H) — preserve batch and sequence separately, even though vLLM flattens at runtime.
- Token-indexed ops (embedding, argmax, sort, topk, grouped_topk): T = B * S as first dim, matching vLLM semantics. These ops index into vocab or expert tables and don't have a meaningful (B, S) decomposition in context.

PagedAttention in vLLM changes the attention shape:
- Query becomes (T, NH, HD), with K/V paged in blocks
- The benchmark keeps the standard (B, NH, S, S) attention score shape — this represents the per-head attention matrix regardless of paging.

Related resources:
- load_model_shapes() loader
- BenchmarkCase and classify_workload
- Project venv (/root/wch/Hopper_benchmark/.venv/)
- /root/wch/Hopper_benchmark/.venv/lib/python3.10/site-packages/vllm/transformers_utils/configs/ — Custom model config classes for non-standard models
- deepseek_v3.py → DeepseekV3Config (MoE fields: n_routed_experts, num_experts_per_tok, n_group, topk_group)
- chatglm.py → ChatGLMConfig
- mistral.py → vLLM Mistral variant
- falcon.py → RWConfig (original Falcon-7B/40B)
- nemotron.py → NemotronConfig
- Standard models use transformers config classes directly — no vLLM-specific file needed
- Venv Python interpreter: /root/wch/Hopper_benchmark/.venv/bin/python