Tokenization Algorithms
BPE vs Unigram comparison and subword regularization.
BPE (Byte-Pair Encoding)
Algorithm
- Initialize vocabulary with characters
- Count frequency of adjacent token pairs
- Merge most frequent pair
- Repeat until vocabulary size reached
Example
Corpus:
low: 5
lower: 2
newest: 6
widest: 3
Iteration 1:
- Most frequent pair: 'e' + 's' (9 times; tied with 's' + 't', tie broken arbitrarily)
- Merge → 'es'
- Vocabulary: [chars] + ['es']
Iteration 2:
- Most frequent: 'es' + 't' (9 times)
- Merge → 'est'
- Vocabulary: [chars] + ['es', 'est']
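Result after these two merges: newest → n|e|w|est, widest → w|i|d|est
With further merges the stems consolidate as well: newest → new|est, widest → wid|est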
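The merge loop above can be sketched in plain Python on the same toy corpus. This is a minimal illustration, not SentencePiece's implementation: the helper names (`count_pairs`, `merge_pair`, `learn_bpe`) are hypothetical, and ties are broken by whichever pair was seen first, which is an assumption rather than a rule of the algorithm.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merges from a {word: frequency} corpus."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Assumption: on a tie, the pair encountered first wins.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(corpus, num_merges=2)
print(merges)
print([" ".join(symbols) for symbols in words])
```

Running more merges on the same corpus keeps joining the most frequent remaining pairs, so the stems are built up in later iterations.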
Implementation
import sentencepiece as spm
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_type='bpe',
    vocab_size=16000
)
Advantages
- Simple algorithm
- Fast training
- Good compression ratio
Disadvantages
- Deterministic by default (sampling requires BPE-dropout)
- May split common words unexpectedly
Unigram
Algorithm
- Start with a large seed vocabulary (frequent substrings of the corpus)
- Estimate the probability of each token (EM over the corpus)
- Remove the tokens whose removal reduces corpus likelihood the least
- Repeat until vocabulary size reached
Probabilistic tokenization
Given vocabulary with probabilities:
P('low') = 0.02
P('est') = 0.03
P('l') = 0.01
P('o') = 0.015
...
Tokenize "lowest":
Option 1: ['low', 'est']
P = 0.02 × 0.03 = 0.0006 ← highest
Option 2: ['l', 'o', 'w', 'est']
P = 0.01 × 0.015 × 0.01 × 0.03 = 0.000000045
Choose option 1 (highest probability)
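Picking the highest-probability segmentation is usually done with dynamic programming (Viterbi) rather than by enumerating every option. A minimal sketch under stated assumptions: `viterbi_segment` is a hypothetical helper, P('w') = 0.01 is taken from the Option 2 product above, and the values for 'e', 's' and 't' are made up for illustration.

```python
import math


def viterbi_segment(text, probs):
    """Return the most probable segmentation of `text` and its probability."""
    # best[i] = (best log prob of text[:i], start index of the last token)
    best = [(-math.inf, 0)] * (len(text) + 1)
    best[0] = (0.0, 0)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in probs or best[start][0] == -math.inf:
                continue
            score = best[start][0] + math.log(probs[piece])
            if score > best[end][0]:
                best[end] = (score, start)
    if best[-1][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover tokens.
    tokens, end = [], len(text)
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], math.exp(best[-1][0])


probs = {
    "low": 0.02, "est": 0.03, "l": 0.01, "o": 0.015, "w": 0.01,
    "e": 0.01, "s": 0.01, "t": 0.01,  # assumed values, not from the text
}
tokens, prob = viterbi_segment("lowest", probs)
print(tokens, prob)
```

Working in log space avoids underflow when long inputs multiply many small probabilities.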
Implementation
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_type='unigram',
    vocab_size=8000
)
Advantages
- Probabilistic (can sample)
- Better for morphologically rich languages
- Supports subword regularization
Disadvantages
- Slower training
- More complex algorithm
Comparison
| Feature | BPE | Unigram |
|---|---|---|
| Training speed | Fast | Slow |
| Tokenization | Deterministic | Probabilistic |
| Sampling | Via BPE-dropout only | Yes (native) |
| Typical vocab size | 16k-32k | 8k-32k |
| Used by | LLaMA, Llama 2 | T5, ALBERT, XLNet, mBART |
Subword regularization
Sample different tokenizations during training for robustness.
Enable sampling
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file='m.model')
# Sample different tokenizations
for _ in range(5):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)
# Example output (varies per run):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
# ['▁token', 'iz', 'ation']
# ['▁to', 'ken', 'ization']
# ['▁token', 'ization']
Parameters
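alpha: Smoothing parameter for sampling; its meaning depends on the model type
- Unigram: inverse temperature on segmentation probabilities (P^alpha). Smaller values flatten the distribution and give more variation; larger values favor the most probable segmentation
- BPE: SentencePiece applies BPE-dropout, where alpha is the probability of skipping a merge (0.0 = deterministic, higher = more variation)
- 0.1 = value used in the examples here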
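For a Unigram model, the subword regularization formulation samples a segmentation x with probability proportional to P(x)^alpha. The sketch below shows how alpha reshapes that distribution, reusing the two "lowest" candidates from the Unigram example. `sampling_weights` is a hypothetical helper for illustration, not SentencePiece's internal code.

```python
def sampling_weights(candidate_probs, alpha):
    """Normalize P(x)^alpha over a list of candidate segmentations."""
    scaled = [p ** alpha for p in candidate_probs]
    total = sum(scaled)
    return [s / total for s in scaled]


# Probabilities of ['low', 'est'] and ['l', 'o', 'w', 'est'] from the example.
candidates = [0.0006, 0.000000045]

uniform = sampling_weights(candidates, alpha=0.0)  # every candidate equally likely
smoothed = sampling_weights(candidates, alpha=0.1)  # flattened, still ordered
peaked = sampling_weights(candidates, alpha=1.0)  # proportional to P(x)

for name, weights in [("alpha=0.0", uniform), ("alpha=0.1", smoothed), ("alpha=1.0", peaked)]:
    print(name, [round(w, 4) for w in weights])
```

Under this formulation, lowering alpha increases variation, so small alpha values are the aggressive end of the range for Unigram models.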
Benefits
- Robustness: Model learns multiple tokenizations
- Data augmentation: More diverse training data
- Better generalization: Less overfitting to specific tokenization
Use case
# Training loop with regularization
for batch in dataloader:
    # Sampling runs on every call, so each epoch sees a different tokenization
    tokens = sp.encode(batch['text'], enable_sampling=True, alpha=0.1)
    # Train model...
Used by: mT5, XLM-RoBERTa
NBest encoding
Get multiple tokenization candidates with scores.
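sp = spm.SentencePieceProcessor(model_file='m.model')
# Get top-5 tokenizations
nbest = sp.nbest_encode('tokenization', nbest_size=5, out_type=str)
for pieces in nbest:
    # nbest_encode returns only the pieces, so sum per-piece log probs
    # (Unigram model) to get a score for each candidate
    score = sum(sp.get_score(sp.piece_to_id(p)) for p in pieces)
    print(f"{pieces} (log prob: {score:.2f})")
# Example output (illustrative values):
# ['▁token', 'ization'] (log prob: -2.34)
# ['▁tok', 'en', 'ization'] (log prob: -2.41)
# ['▁token', 'iz', 'ation'] (log prob: -2.57)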
Use cases
- Ensemble tokenization: Average over multiple tokenizations
- Uncertainty estimation: Check variance in scores
- Debugging: Understand tokenizer behavior
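One way to turn n-best scores into an uncertainty signal is to normalize them into a distribution over the candidates and measure its entropy. A minimal sketch, assuming the scores are log probabilities like the illustrative values above; `nbest_posterior` and `entropy` are hypothetical helpers, not part of SentencePiece.

```python
import math


def nbest_posterior(log_probs):
    """Softmax over candidate log probs (shifted by the max for stability)."""
    peak = max(log_probs)
    weights = [math.exp(lp - peak) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(dist):
    """Shannon entropy in nats; higher means the candidates are closer."""
    return -sum(p * math.log(p) for p in dist if p > 0)


scores = [-2.34, -2.41, -2.57]  # illustrative log probs from the example
posterior = nbest_posterior(scores)
uncertainty = entropy(posterior)
print([round(p, 3) for p in posterior], round(uncertainty, 3))
```

An entropy near log(n) for n candidates means the tokenizer has no strong preference among them; a value near 0 means one segmentation dominates.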
Best practices
- Use Unigram for multilingual - Better for diverse languages
- Use BPE for speed - Faster training and inference
- Enable subword regularization - Improves model robustness
- Start with alpha=0.1 - A common starting value; tune it for your model type
- Use deterministic mode for inference - Consistent results