Design, train, and evaluate tokenizers (BPE, SentencePiece unigram) for LLMs. Use when selecting vocabulary size, defining special tokens, training a tokenizer on a corpus, analyzing fertility/compression, or handling multilingual coverage. Covers the HuggingFace `tokenizers` library and `SentencePieceTrainer`.
Guide the design, training, and evaluation of tokenizers for language models—covering algorithm selection (BPE vs unigram), vocabulary sizing, special token definition, pre-tokenization strategy, multilingual script coverage, and compression ratio analysis.
## When to use

Use this skill when selecting a tokenization algorithm, sizing the vocabulary, defining special tokens (`<bos>`, `<eos>`, `<pad>`, `<unk>`) or chat template tokens (`<|im_start|>`, `<|im_end|>`), choosing a pre-tokenization strategy, or evaluating multilingual coverage. Related skills: model-architecture, pretraining-pipeline.

## Algorithm selection

Choose BPE (`tokenizers.models.BPE`) for merge-rule transparency, or SentencePiece unigram for probabilistic subword selection. Use `sentencepiece.SentencePieceTrainer.train()` for language-agnostic byte-fallback support.

## Vocabulary sizing

Every token added to the vocabulary adds `hidden_dim` parameters to the embedding matrix; quantify the tradeoff before committing to a size. For example, at `hidden_dim = 4096`, a 64k vocabulary spends 64,000 × 4,096 ≈ 262M parameters on embeddings (doubled if input and output embeddings are untied).

## Special tokens

Define `<bos>`, `<eos>`, `<pad>`, `<unk>` at fixed IDs. For chat models, add role delimiters such as `<|im_start|>system` and `<|im_end|>`. Reserve a contiguous block of IDs for future additions.

## Pre-tokenization

Use `pre_tokenizers.ByteLevel(add_prefix_space=False)` for GPT-style byte-level tokenization, or `pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])` for Llama-style individual-digit splitting.

## Training

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=64000,
    special_tokens=["<pad>", "<eos>", "<bos>", "<unk>"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```
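After training, it helps to sanity-check lossless round-tripping on hard inputs. A minimal sketch of such a harness; the `enc`/`dec` pair below is a toy byte-level stand-in, not the real tokenizer API — substitute the trained tokenizer's encode/decode methods in practice:

```python
# Edge cases that commonly break naive tokenizers: accents, emoji,
# code, LaTeX, and zero-width joiners (U+200D).
EDGE_CASES = [
    "café ☕ naïve",
    "def f(x):\n    return x  # code",
    r"\frac{\alpha}{\beta}",
    "family: 👩\u200d💻",
]

def round_trip_failures(encode, decode, texts):
    """Return the texts for which decode(encode(text)) != text."""
    return [t for t in texts if decode(encode(t)) != t]

# Toy byte-level stand-ins; swap in the trained tokenizer's methods.
enc = lambda s: list(s.encode("utf-8"))
dec = lambda ids: bytes(ids).decode("utf-8")

failures = round_trip_failures(enc, dec, EDGE_CASES)  # empty when all pass
```

An empty `failures` list means every edge case survived the round trip; any surviving entries point at characters the tokenizer cannot represent losslessly.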
For SentencePiece training:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tok",
    vocab_size=64000,
    character_coverage=0.9999,
    model_type="bpe",
)
```

Enable `byte_fallback=True` to avoid `<unk>` on rare characters.

## Validation

Verify `decode(encode(text)) == text` for edge cases: Unicode emoji, mixed-script text, code blocks, LaTeX math, and zero-width joiners. Confirm the `<unk>` rate is 0% on a diverse validation set.

## Artifacts

Export `tokenizer.json` (HuggingFace fast tokenizer format) or `.model` (SentencePiece). Commit the trained artifact alongside its training config for reproducibility.

## Deliverables

- Tokenizer config — algorithm, vocab size, pre-tokenizer chain, special token map with fixed IDs
- Training command — reproducible script or snippet with corpus path, coverage, and vocab size
- Fertility report — tokens-per-word and bytes-per-token per language/domain on held-out data
- Round-trip test results — pass/fail on Unicode, code, math, and multilingual edge cases

## References

- HuggingFace `tokenizers` docs: https://huggingface.co/docs/tokenizers

## Related skills

- model-architecture — embedding layer sizing depends on the vocab size chosen here
- pretraining-pipeline — tokenizer must be finalized before data preprocessing begins
- llm-creation — end-to-end model creation references tokenizer design decisions

## Troubleshooting

If a language shows poor coverage, increase `character_coverage` or add that language's data to the training corpus and retrain.
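The fertility report's metrics reduce to simple counters. A self-contained sketch, assuming an `encode(text) -> list[int]` callable from the trained tokenizer; a whitespace splitter stands in below purely for illustration:

```python
def fertility_metrics(texts, encode):
    """Compute tokens-per-word and bytes-per-token over a sample corpus."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return {
        "tokens_per_word": total_tokens / total_words,
        "bytes_per_token": total_bytes / total_tokens,
    }

# Placeholder tokenizer for illustration; in practice pass something
# like lambda t: tokenizer.encode(t).ids for a HuggingFace fast tokenizer.
toy_encode = lambda s: s.split()
stats = fertility_metrics(["hello world", "byte pair encoding"], toy_encode)
```

Run this per language or domain on held-out data: higher tokens-per-word (fertility) and lower bytes-per-token both signal weaker coverage for that slice of the corpus.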