Initialize and extend the vocabulary of language models by adding new tokens, domain-specific terminology, or language-specific characters. Handles embedding initialization strategies (random, subword-average) for new tokens. Use when: (1) Adding new language tokens to an existing model, (2) Extending tokenizer with domain-specific vocabulary, (3) Initializing embeddings for new tokens before fine-tuning, (4) Supporting multilingual vocabulary extension.
Initialize and extend language model vocabulary with new tokens and embeddings.
exec: Run Python scripts for tokenizer/model modification
read: Load model configuration, tokenizer files, and domain vocabulary
write: Save updated tokenizer and embedding weights

Collect domain-specific tokens, special characters, or language-specific terms:
new_tokens = ["[DOMAIN]", "量子", "纠缠", "[SPECIAL]"]
# Or load from file
with open("domain_vocab.txt") as f:
    new_tokens = [line.strip() for line in f]
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model_name")
num_added = tokenizer.add_tokens(new_tokens)  # skips tokens already in the vocab
print(f"Added {num_added} new tokens")
tokenizer.save_pretrained("./extended_tokenizer")
Choose initialization strategy:
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("model_name")
model.resize_token_embeddings(len(tokenizer))
# Strategy 1: Average of subword embeddings
# NB: compute the subword split with the tokenizer *before* add_tokens,
# otherwise new_token now tokenizes to itself instead of to subwords.
def init_by_subword_average(model, tokenizer, new_token):
    subwords = tokenizer.tokenize(new_token, add_special_tokens=False)
    subword_ids = tokenizer.convert_tokens_to_ids(subwords)
    avg_embedding = model.get_input_embeddings().weight[subword_ids].mean(0)
    return avg_embedding
# Strategy 2: Random initialization (default)
# Model already initialized randomly after resize_token_embeddings
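Strategy 1 computes the average but it still has to be written into the new token's row (in practice via the embedding weight matrix, under `torch.no_grad()`). A minimal stdlib sketch of the averaging-and-assignment logic on a toy embedding table; the table, tokens, and numbers here are illustrative, not real model weights:

```python
# Toy illustration of subword-average initialization (stdlib only; in a real
# model the rows live in model.get_input_embeddings().weight).
def average_rows(table, row_ids):
    """Average the given rows of a list-of-lists embedding table."""
    dim = len(table[0])
    return [sum(table[i][d] for i in row_ids) / len(row_ids) for d in range(dim)]

# Existing 4-token vocabulary with 3-dim embeddings.
table = [
    [1.0, 0.0, 0.0],  # "qu"
    [0.0, 1.0, 0.0],  # "##ant"
    [0.0, 0.0, 1.0],  # "##um"
    [0.5, 0.5, 0.5],  # "[UNK]"
]
# New token "quantum" splits into subword ids 0, 1, 2;
# its new row is initialized to their mean and appended.
new_row = average_rows(table, [0, 1, 2])
table.append(new_row)
print(new_row)  # each coordinate is 1/3
```

The same pattern generalizes: loop over all new tokens, compute each subword split with the pre-extension tokenizer, and assign the averaged vector to that token's row.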
Fine-tune only the embeddings first, then train jointly:
# Freeze everything except the embedding matrix
# (note: this still updates old embedding rows, not only the new ones)
for name, param in model.named_parameters():
    if "embed" not in name:
        param.requires_grad = False
# Train on domain data (trainer: a transformers.Trainer configured elsewhere)
trainer.train()
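The freeze above still lets gradient updates touch every embedding row, old and new. To restrict training to the new rows only, a common trick is to zero the gradient of the pre-existing rows (in PyTorch, via `weight.register_hook(...)`). A stdlib sketch of that masking idea on a toy 3-row gradient; shapes and values are made up for illustration:

```python
def mask_old_rows(grad, num_old_rows):
    """Zero gradients for pre-existing vocabulary rows so only new rows update."""
    return [[0.0] * len(row) if i < num_old_rows else row
            for i, row in enumerate(grad)]

# Toy gradient for a 3-row embedding matrix; rows 0-1 are old vocab,
# row 2 is the newly added token.
grad = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
masked = mask_old_rows(grad, num_old_rows=2)
print(masked)  # [[0.0, 0.0], [0.0, 0.0], [0.5, 0.6]]
```

In the real model the hook would apply the same row mask to the embedding weight's gradient tensor on every backward pass.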
1. Collect tokens to add: domain terms, special tokens, language characters, or symbols.
2. Verify new tokens don't overlap with existing vocabulary; check for subword coverage.
3. Add tokens to the tokenizer; resize the model embedding matrix accordingly.
4. Choose an initialization strategy: average of subword embeddings (recommended) or random initialization.
5. Verify tokenization of the new tokens; run domain fine-tuning; report vocabulary coverage improvement.
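The overlap check in step 2 can be sketched with plain sets; with transformers, `tokenizer.get_vocab()` returns the token-to-id mapping to check against. A stdlib sketch with a made-up sample vocabulary:

```python
def filter_new_tokens(candidates, vocab):
    """Keep only candidates that are not already single tokens in the vocab."""
    return [t for t in candidates if t not in vocab]

# Stand-in for tokenizer.get_vocab(); entries are illustrative.
existing_vocab = {"[UNK]": 0, "量": 1, "子": 2, "quantum": 3}
candidates = ["量子", "quantum", "[DOMAIN]"]

to_add = filter_new_tokens(candidates, existing_vocab)
print(to_add)  # ['量子', '[DOMAIN]']
```

`tokenizer.add_tokens` already skips duplicates, but filtering up front lets you report exactly which candidates were genuinely new.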
User: "Add quantum physics Chinese vocabulary to this language model"
Agent:
1. Collect Chinese quantum terms: 量子, 纠缠, 叠加态, 波函数
2. Check tokenizer: verify these are not already single tokens
3. Add tokens to tokenizer; resize model embeddings
4. Initialize via subword average for Chinese characters
5. Fine-tune on quantum physics Chinese corpus
6. Report tokenization coverage improvement
User: "Add [PROTEIN], [GENE], [DRUG] special tokens to biomedical model"
Agent:
1. Define special tokens: [PROTEIN], [GENE], [DRUG], [DISEASE]
2. Add via tokenizer.add_special_tokens (as additional_special_tokens) so they are never split
3. Resize embedding matrix (4 new rows)
4. Initialize randomly; these tokens have no subword basis
5. Fine-tune on biomedical NER dataset
6. Validate token recognition in downstream task
references/: Vocabulary extension techniques and embedding initialization guides

Related: declarative-self-improvement, espl-evolutionary-system-prompt