Design, train, and evaluate tokenizers (BPE, SentencePiece unigram) for LLMs. Use when selecting vocabulary size, defining special tokens, training a tokenizer on a corpus, analyzing fertility/compression, or handling multilingual coverage. Covers the HuggingFace `tokenizers` library and `SentencePieceTrainer`.
Guide the design, training, and evaluation of tokenizers for language models—covering algorithm selection (BPE vs unigram), vocabulary sizing, special token definition, pre-tokenization strategy, multilingual script coverage, and compression ratio analysis.
## When to use

Use this skill when selecting a tokenization algorithm, sizing the vocabulary, defining special tokens (`<bos>`, `<eos>`, `<pad>`, `<unk>`) or chat template tokens (`<|im_start|>`, `<|im_end|>`), choosing a pre-tokenization strategy, or evaluating multilingual coverage. Related skills: model-architecture, pretraining-pipeline.

## Algorithm selection

Choose BPE (`tokenizers.models.BPE`) for merge-rule transparency, or SentencePiece unigram for probabilistic subword selection. Use `sentencepiece.SentencePieceTrainer.train()` for language-agnostic byte-fallback support.

## Vocabulary sizing

Every token added to the vocabulary adds `hidden_dim` parameters to the embedding matrix; quantify the tradeoff before committing to a size. For example, at `hidden_dim = 4096`, a 64k vocabulary spends 64,000 × 4,096 ≈ 262M parameters on embeddings (doubled if input and output embeddings are untied).

## Special tokens

Define `<bos>`, `<eos>`, `<pad>`, `<unk>` at fixed IDs. For chat models, add role delimiters such as `<|im_start|>system` and `<|im_end|>`. Reserve a contiguous block of IDs for future additions.

## Pre-tokenization

Use `pre_tokenizers.ByteLevel(add_prefix_space=False)` for GPT-style byte-level tokenization, or `pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])` for Llama-style individual-digit splitting.

## Training

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=64000,
    special_tokens=["<pad>", "<eos>", "<bos>", "<unk>"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```
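After training, it helps to sanity-check lossless round-tripping on hard inputs. A minimal sketch of such a harness; the `enc`/`dec` pair below is a toy byte-level stand-in, not the real tokenizer API — substitute the trained tokenizer's encode/decode methods in practice:

```python
# Edge cases that commonly break naive tokenizers: accents, emoji,
# code, LaTeX, and zero-width joiners (U+200D).
EDGE_CASES = [
    "café ☕ naïve",
    "def f(x):\n    return x  # code",
    r"\frac{\alpha}{\beta}",
    "family: 👩\u200d💻",
]

def round_trip_failures(encode, decode, texts):
    """Return the texts for which decode(encode(text)) != text."""
    return [t for t in texts if decode(encode(t)) != t]

# Toy byte-level stand-ins; swap in the trained tokenizer's methods.
enc = lambda s: list(s.encode("utf-8"))
dec = lambda ids: bytes(ids).decode("utf-8")

failures = round_trip_failures(enc, dec, EDGE_CASES)  # empty when all pass
```

An empty `failures` list means every edge case survived the round trip; any surviving entries point at characters the tokenizer cannot represent losslessly.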
For SentencePiece training:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tok",
    vocab_size=64000,
    character_coverage=0.9999,
    model_type="bpe",
)
```

Enable `byte_fallback=True` to avoid `<unk>` on rare characters.

## Validation

Verify `decode(encode(text)) == text` for edge cases: Unicode emoji, mixed-script text, code blocks, LaTeX math, and zero-width joiners. Confirm the `<unk>` rate is 0% on a diverse validation set.

## Artifacts

Export `tokenizer.json` (HuggingFace fast tokenizer format) or `.model` (SentencePiece). Commit the trained artifact alongside its training config for reproducibility.

## Deliverables

- Tokenizer config — algorithm, vocab size, pre-tokenizer chain, special token map with fixed IDs
- Training command — reproducible script or snippet with corpus path, coverage, and vocab size
- Fertility report — tokens-per-word and bytes-per-token per language/domain on held-out data
- Round-trip test results — pass/fail on Unicode, code, math, and multilingual edge cases

## References

- HuggingFace `tokenizers` docs: https://huggingface.co/docs/tokenizers

## Related skills

- model-architecture — embedding layer sizing depends on the vocab size chosen here
- pretraining-pipeline — tokenizer must be finalized before data preprocessing begins
- llm-creation — end-to-end model creation references tokenizer design decisions

## Troubleshooting

If a language shows poor coverage, increase `character_coverage` or add that language's data to the training corpus and retrain.
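The fertility report's metrics reduce to simple counters. A self-contained sketch, assuming an `encode(text) -> list[int]` callable from the trained tokenizer; a whitespace splitter stands in below purely for illustration:

```python
def fertility_metrics(texts, encode):
    """Compute tokens-per-word and bytes-per-token over a sample corpus."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return {
        "tokens_per_word": total_tokens / total_words,
        "bytes_per_token": total_bytes / total_tokens,
    }

# Placeholder tokenizer for illustration; in practice pass something
# like lambda t: tokenizer.encode(t).ids for a HuggingFace fast tokenizer.
toy_encode = lambda s: s.split()
stats = fertility_metrics(["hello world", "byte pair encoding"], toy_encode)
```

Run this per language or domain on held-out data: higher tokens-per-word (fertility) and lower bytes-per-token both signal weaker coverage for that slice of the corpus.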