Initialize and extend the vocabulary of language models by adding new tokens, domain-specific terminology, or language-specific characters. Handles embedding initialization strategies (random, subword-average) for new tokens. Use when: (1) Adding new language tokens to an existing model, (2) Extending tokenizer with domain-specific vocabulary, (3) Initializing embeddings for new tokens before fine-tuning, (4) Supporting multilingual vocabulary extension.
Initialize and extend language model vocabulary with new tokens and embeddings.
exec: Run Python scripts for tokenizer/model modification
read: Load model configuration, tokenizer files, and domain vocabulary
write: Save updated tokenizer and embedding weights

Collect domain-specific tokens, special characters, or language-specific terms:
new_tokens = ["[DOMAIN]", "量子", "纠缠", "[SPECIAL]"]
# Or load from file
with open("domain_vocab.txt") as f:
    new_tokens = [line.strip() for line in f]
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model_name")
num_added = tokenizer.add_tokens(new_tokens)  # skips tokens already in the vocab
print(f"Added {num_added} new tokens")
tokenizer.save_pretrained("./extended_tokenizer")
Choose initialization strategy:
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("model_name")
model.resize_token_embeddings(len(tokenizer))
# Strategy 1: Average of subword embeddings
# NB: compute the subword split with the tokenizer *before* add_tokens,
# otherwise new_token now tokenizes to itself instead of to subwords.
def init_by_subword_average(model, tokenizer, new_token):
    subwords = tokenizer.tokenize(new_token, add_special_tokens=False)
    subword_ids = tokenizer.convert_tokens_to_ids(subwords)
    avg_embedding = model.get_input_embeddings().weight[subword_ids].mean(0)
    return avg_embedding
# Strategy 2: Random initialization (default)
# Model already initialized randomly after resize_token_embeddings
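Strategy 1 computes the average but it still has to be written into the new token's row (in practice via the embedding weight matrix, under `torch.no_grad()`). A minimal stdlib sketch of the averaging-and-assignment logic on a toy embedding table; the table, tokens, and numbers here are illustrative, not real model weights:

```python
# Toy illustration of subword-average initialization (stdlib only; in a real
# model the rows live in model.get_input_embeddings().weight).
def average_rows(table, row_ids):
    """Average the given rows of a list-of-lists embedding table."""
    dim = len(table[0])
    return [sum(table[i][d] for i in row_ids) / len(row_ids) for d in range(dim)]

# Existing 4-token vocabulary with 3-dim embeddings.
table = [
    [1.0, 0.0, 0.0],  # "qu"
    [0.0, 1.0, 0.0],  # "##ant"
    [0.0, 0.0, 1.0],  # "##um"
    [0.5, 0.5, 0.5],  # "[UNK]"
]
# New token "quantum" splits into subword ids 0, 1, 2;
# its new row is initialized to their mean and appended.
new_row = average_rows(table, [0, 1, 2])
table.append(new_row)
print(new_row)  # each coordinate is 1/3
```

The same pattern generalizes: loop over all new tokens, compute each subword split with the pre-extension tokenizer, and assign the averaged vector to that token's row.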
Fine-tune only the embeddings first, then train jointly:
# Freeze everything except the embedding matrix
# (note: this still updates old embedding rows, not only the new ones)
for name, param in model.named_parameters():
    if "embed" not in name:
        param.requires_grad = False
# Train on domain data (trainer: a transformers.Trainer configured elsewhere)
trainer.train()
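The freeze above still lets gradient updates touch every embedding row, old and new. To restrict training to the new rows only, a common trick is to zero the gradient of the pre-existing rows (in PyTorch, via `weight.register_hook(...)`). A stdlib sketch of that masking idea on a toy 3-row gradient; shapes and values are made up for illustration:

```python
def mask_old_rows(grad, num_old_rows):
    """Zero gradients for pre-existing vocabulary rows so only new rows update."""
    return [[0.0] * len(row) if i < num_old_rows else row
            for i, row in enumerate(grad)]

# Toy gradient for a 3-row embedding matrix; rows 0-1 are old vocab,
# row 2 is the newly added token.
grad = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
masked = mask_old_rows(grad, num_old_rows=2)
print(masked)  # [[0.0, 0.0], [0.0, 0.0], [0.5, 0.6]]
```

In the real model the hook would apply the same row mask to the embedding weight's gradient tensor on every backward pass.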
1. Collect tokens to add: domain terms, special tokens, language characters, or symbols.
2. Verify new tokens don't overlap with existing vocabulary; check for subword coverage.
3. Add tokens to the tokenizer; resize the model embedding matrix accordingly.
4. Choose an initialization strategy: average of subword embeddings (recommended) or random initialization.
5. Verify tokenization of the new tokens; run domain fine-tuning; report vocabulary coverage improvement.
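The overlap check in step 2 can be sketched with plain sets; with transformers, `tokenizer.get_vocab()` returns the token-to-id mapping to check against. A stdlib sketch with a made-up sample vocabulary:

```python
def filter_new_tokens(candidates, vocab):
    """Keep only candidates that are not already single tokens in the vocab."""
    return [t for t in candidates if t not in vocab]

# Stand-in for tokenizer.get_vocab(); entries are illustrative.
existing_vocab = {"[UNK]": 0, "量": 1, "子": 2, "quantum": 3}
candidates = ["量子", "quantum", "[DOMAIN]"]

to_add = filter_new_tokens(candidates, existing_vocab)
print(to_add)  # ['量子', '[DOMAIN]']
```

`tokenizer.add_tokens` already skips duplicates, but filtering up front lets you report exactly which candidates were genuinely new.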
User: "Add quantum physics Chinese vocabulary to this language model"
Agent:
1. Collect Chinese quantum terms: 量子, 纠缠, 叠加态, 波函数
2. Check tokenizer: verify these are not already single tokens
3. Add tokens to tokenizer; resize model embeddings
4. Initialize via subword average for Chinese characters
5. Fine-tune on quantum physics Chinese corpus
6. Report tokenization coverage improvement
User: "Add [PROTEIN], [GENE], [DRUG] special tokens to biomedical model"
Agent:
1. Define special tokens: [PROTEIN], [GENE], [DRUG], [DISEASE]
2. Add via tokenizer.add_special_tokens (as additional_special_tokens) so they are never split
3. Resize embedding matrix (4 new rows)
4. Initialize randomly; these tokens have no subword basis
5. Fine-tune on biomedical NER dataset
6. Validate token recognition in downstream task
references/: Vocabulary extension techniques and embedding initialization guides

Related: declarative-self-improvement, espl-evolutionary-system-prompt