Byte Pair Encoding Merge Process
Byte Pair Encoding (BPE) is a subword tokenization algorithm used, in one variant or another, by many modern large language models. It starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols, gradually learning subwords and whole words from corpus statistics. This diagram walks through that merge loop step by step.
Interactive Demo
Hover over any node in the flowchart to read an explanation of that step in the right-hand info panel.
Overview
This MicroSim visualizes the core loop of subword tokenization. Rather than treating words as fixed units or splitting everything into characters, BPE finds a middle ground by learning which character sequences are worth keeping together.
How It Works
The diagram follows a top-to-bottom flow:
- Training Corpus - words are counted with their frequencies.
- Initial Vocabulary - every word is split into single characters; the starting vocabulary is just the distinct characters.
- Pair Frequency Analysis - BPE counts every adjacent pair of symbols across the corpus, weighting each pair by the frequency of the word it occurs in, and greedily selects the most frequent one.
- Iterations 1-N - each iteration merges one pair into a new token. Early merges produce short subwords (da, ta); later merges combine those into whole words (data, base, database).
- Learned Tokens & Final Tokenization - frequent words collapse to a single token, while rare words fall back to learned subwords or, at worst, single characters, so any word spelled with characters from the initial vocabulary can still be tokenized.
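The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the MicroSim's own code: the function names (`get_pair_counts`, `merge_pair`, `train_bpe`), the toy corpus, whitespace word splitting, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    # Training corpus: count whitespace-separated words (an assumed pre-tokenization).
    word_freqs = Counter(corpus.split())
    # Initial vocabulary: split every word into single characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = sorted({ch for symbols in words for ch in symbols})
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break  # every word is already a single token
        # Greedy choice; ties are broken by the pair itself (an arbitrary but
        # deterministic rule assumed here, not something BPE prescribes).
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return merges, vocab, words


# Hypothetical toy corpus chosen to echo the diagram's example words.
merges, vocab, words = train_bpe("data data data base base database", num_merges=7)
print(merges)
print(words)
```

On this toy corpus the early merges produce `ta`, `da`, and then `data`, and after seven merges each of the three words is a single token. The vocabulary grows from 6 characters to 13 entries, one per merge.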
The color key distinguishes character tokens, early subword merges, later subword merges, and complete words that became single tokens.
Lesson Plan
- Warm up: Ask students to tokenize "database" by hand into characters, then predict which pair appears most often across the corpus.
- Explore: Step through each iteration node and note how the vocabulary size grows while the token count per word shrinks.
- Discuss: Why does frequency-based merging make common words cheap to encode but still allow rare words to be represented?
- Extend: Have students apply two merge steps to a new word (e.g. "backups") and compare against the diagram's final tokenization rule.
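For the Explore and Extend activities, students can check their hand-worked answers with a small helper that replays a merge list on a new word. This is only a sketch: `apply_merges` is a hypothetical helper, and `HYPOTHETICAL_MERGES` is an invented merge list for illustration, not the list learned in the diagram.

```python
def apply_merges(word, merges):
    """Tokenize `word` by replaying learned merges in the order they were learned."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Invented merge list for illustration only; substitute the merges from the diagram.
HYPOTHETICAL_MERGES = [("b", "a"), ("u", "p"), ("c", "k"), ("ba", "ck")]

# Two merge steps, as in the Extend activity.
print(apply_merges("backups", HYPOTHETICAL_MERGES[:2]))  # ['ba', 'c', 'k', 'up', 's']

# Token count per word after 0, 1, 2, ... merges, as in the Explore activity.
counts = [len(apply_merges("backups", HYPOTHETICAL_MERGES[:k]))
          for k in range(len(HYPOTHETICAL_MERGES) + 1)]
print(counts)  # [7, 6, 5, 4, 3]
```

Each additional merge adds one entry to the vocabulary and, when it applies, removes one token from the word, which is the trade-off the Explore step asks students to observe.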