This tokenizer was trained from scratch on the WikiText-2 corpus.
Every token you see below comes from 29,714 learned merge rules.
Tokenization is one of the most fundamental steps in Natural Language Processing. Modern language models like GPT, BERT, and LLaMA cannot process raw text directly — text must first be converted into tokens, then mapped to numerical IDs. This project builds a complete BPE tokenizer from scratch using the WikiText-2 dataset, covering everything from raw text preprocessing to a fully serialized, Hugging Face-compatible tokenizer.
The original WikiText-2 training split had 36,718 rows. Around 35% were empty or whitespace-only. After cleaning and deduplication, the final corpus had 21,337 unique samples.
Early vocabulary entries: common character pairs like th, in, er. Mid-range: common words like state, area, million. High-range: specialized terms like potassium, Ferguson.
The tokenizer achieved an average compression ratio of about 5:1 on the validation set: each token represents roughly five characters of input text, so sequences are about one fifth the length of a character-level encoding.
Unseen words are split into reusable subword units. The tokenizer handles modern terms, email addresses, and rare proper nouns it never saw during training.
BPE breaks unseen words into known subword pieces, so words that never appeared in the training corpus are still decomposed gracefully.
BPE starts with individual characters and iteratively merges the most frequent adjacent pair until the target vocabulary size is reached. It's elegant, greedy, and surprisingly effective.
| Strategy | Example: "electromechanical" | Unknown words | Sequence length | Vocab size |
|---|---|---|---|---|
| Word-level | [UNK] | ✗ Lost as [UNK] | Short | Huge |
| Character-level | e, l, e, c, t, r, o... | ✓ No unknowns | Very long | ~100 chars |
| BPE (this project) | electro + mechanical | ✓ Subwords | Balanced | 30,000 |
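
The first two rows of the table can be reproduced with a few lines of plain Python. This is only an illustrative sketch: `word_vocab` is a made-up toy vocabulary, and `word_level` and `char_level` are hypothetical helper names, not part of the project.

```python
word = "electromechanical"

# Hypothetical word-level vocabulary that never saw the example word.
word_vocab = {"the", "motor", "is"}


def word_level(text, vocab):
    """Map each whitespace-separated word to itself, or to [UNK] if unseen."""
    return [w if w in vocab else "[UNK]" for w in text.split()]


def char_level(text):
    """One token per character: no unknowns, but long sequences."""
    return list(text)


word_tokens = word_level(word, word_vocab)
char_tokens = char_level(word)

print(word_tokens)       # the whole word collapses into a single [UNK]
print(len(char_tokens))  # one token per character of the word
```

Subword tokenization sits between these two extremes: it keeps the vocabulary bounded while still representing unseen words with a handful of pieces.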
1. Split every word in the corpus into individual characters. Each character becomes a token. The initial vocabulary is the set of all unique characters found in the corpus.
2. Scan through the entire corpus and count how often each adjacent pair of tokens appears. This frequency count drives every merge decision.
3. Take the most frequent pair and merge it everywhere it appears in the corpus. Add the merged unit as a new vocabulary token. Record the merge rule.
4. Go back to step 2 with the updated corpus. Keep merging the most frequent pair each iteration. For this project, 29,714 merge rules were learned to build a 30,000-token vocabulary.
5. To tokenize unseen text, start with characters and apply the saved merge rules in priority order (the earliest-learned rule first, which was the most frequent pair at the time it was learned). This is deterministic: the same text always produces the same tokens.
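
The training loop and the encoding step described above can be sketched in pure Python. This is a minimal, unoptimized illustration rather than the project's actual implementation (which uses the Hugging Face tokenizers library); the function names, the toy corpus, and the tie-breaking rule (first pair seen wins) are assumptions made for the example.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol tuple with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings."""
    # Split every word into characters and count word frequencies.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair seen first (an assumption).
        best = max(pairs, key=pairs.get)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
        merges.append(best)
    return merges


def encode_word(word, merges):
    """Tokenize one word by applying merge rules, earliest-learned first."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low low low lower lowest", "new newer newest", "wide wider widest"]
merges = train_bpe(corpus, num_merges=10)
print(merges[:2])
print(encode_word("lowest", merges))
print(encode_word("newly", merges))  # "newly" is not in the toy corpus
```

Recounting every pair on each iteration keeps the sketch short but is slow on a real corpus; production trainers update the pair counts incrementally.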
The sections below cover all cells from the original Jupyter notebook, formatted for readability.
Load the WikiText-2 dataset from Hugging Face and explore its structure across train, validation, and test splits.
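
A loading step along these lines might look like the sketch below. The dataset ID `Salesforce/wikitext` and the config `wikitext-2-v1` are assumptions (the non-raw config is the one that contains `<unk>` placeholders); the notebook may have used a different config. `load_wikitext2` and `summarize_splits` are hypothetical helper names.

```python
def load_wikitext2(config="wikitext-2-v1"):
    """Download WikiText-2 and return its splits (needs network access)."""
    # Imported lazily so this file still loads without `datasets` installed.
    from datasets import load_dataset

    # Assumed dataset ID and config; adjust to match the actual notebook.
    return load_dataset("Salesforce/wikitext", config)


def summarize_splits(dataset):
    """Return {split_name: row_count} for a mapping of split names to rows."""
    return {name: len(rows) for name, rows in dataset.items()}


# Usage (requires `pip install datasets` and a network connection):
# dataset = load_wikitext2()
# print(summarize_splits(dataset))
# print(dataset["train"][3]["text"])
```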
Remove empty rows, strip <unk> placeholders, normalize whitespace, and deduplicate the corpus before training.
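
The cleaning steps listed here can be sketched with the standard library alone. The function name `clean_corpus` and the sample rows are illustrative assumptions; the exact order of operations in the notebook may differ.

```python
import re


def clean_corpus(rows):
    """Strip <unk>, normalize whitespace, drop empty rows, deduplicate (order kept)."""
    seen = set()
    cleaned = []
    for row in rows:
        text = row.replace("<unk>", " ")           # remove placeholder tokens
        text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
        if not text or text in seen:               # skip empty rows and duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


rows = ["", "  \n", " The <unk> river  flows . \n", "The river flows .", " = Title = \n"]
corpus = clean_corpus(rows)
print(corpus)
```

Deduplicating after normalization means rows that differ only in spacing or `<unk>` placement count as the same sample.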
Visualize how BPE starts by splitting words into characters before any merges are applied.
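
One simple way to show this starting state is to print each word next to its character split, together with the base alphabet. The helper names and the sample sentence below are hypothetical.

```python
from collections import Counter


def initial_split(text):
    """Split each whitespace-separated word into its characters."""
    return [list(word) for word in text.split()]


def base_alphabet(corpus):
    """Initial vocabulary: every non-space character in the corpus, with counts."""
    return Counter(ch for line in corpus for ch in line if not ch.isspace())


sample = "tokenizers merge pairs"
for word, chars in zip(sample.split(), initial_split(sample)):
    print(f"{word:>12} -> {' '.join(chars)}")

print(sorted(base_alphabet([sample])))
```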
Initialize a BPE model with Hugging Face tokenizers, configure special tokens, and train on the cleaned corpus.
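
A training setup with the Hugging Face tokenizers library could look roughly like this. The `Whitespace` pre-tokenizer and the particular list of special tokens are assumptions; the source only states that special tokens were configured and that the target vocabulary was 30,000.

```python
def train_bpe_tokenizer(corpus, vocab_size=30_000):
    """Train a BPE tokenizer on an iterable of strings."""
    # Imported lazily so this file still loads without `tokenizers` installed.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    # Assumed pre-tokenizer: split on whitespace and punctuation before merging.
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(
        vocab_size=vocab_size,
        # Assumed special tokens; the notebook's actual list is not given.
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.train_from_iterator(corpus, trainer=trainer)
    return tokenizer


# Usage, given a cleaned list of strings called `corpus`:
# tokenizer = train_bpe_tokenizer(corpus)
# print(tokenizer.encode("electromechanical").tokens)
```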
Explore how the learned vocabulary progresses from simple character pairs to complex specialized words.
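
To look at this progression, it is enough to sort the vocabulary by token ID and sample a few ID ranges. The sketch below works on any `{token: id}` mapping, such as the one returned by `tokenizer.get_vocab()` in the Hugging Face tokenizers library. The toy vocabulary and its IDs are invented for illustration, and the slicing assumes IDs are contiguous from 0.

```python
def vocab_by_id(vocab):
    """Return tokens sorted by ID from a {token: id} mapping."""
    return [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


def sample_ranges(vocab, ranges, n=5):
    """Take the first `n` tokens from each (label, start, stop) ID range."""
    ordered = vocab_by_id(vocab)  # assumes IDs run 0..len(vocab)-1 without gaps
    return {label: ordered[start:stop][:n] for label, start, stop in ranges}


# Toy vocabulary with made-up IDs, using tokens mentioned in the text.
toy_vocab = {
    "t": 0, "h": 1, "th": 2, "in": 3, "er": 4,
    "state": 5, "area": 6, "million": 7, "potassium": 8, "Ferguson": 9,
}
samples = sample_ranges(toy_vocab, [("early", 2, 5), ("mid", 5, 8), ("high", 8, 10)])
for label, tokens in samples.items():
    print(f"{label:>5}: {tokens}")
```

In a BPE vocabulary, tokens created by later merges generally receive higher IDs, which is why higher ranges tend to hold longer and rarer strings.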
Measure vocab size, average tokens per sentence, and compression ratio on the validation set.
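
These metrics reduce to simple counts. In the sketch below, `encode` is any callable that maps a string to a list of tokens; with a Hugging Face tokenizer that could be `lambda s: tokenizer.encode(s).tokens`. Counting every character of the raw sentence, spaces included, is an assumption about how the ratio is defined, and `str.split` stands in for a real tokenizer in the demo.

```python
def tokenizer_metrics(sentences, encode):
    """Compute average tokens per sentence and characters per token."""
    if not sentences:
        raise ValueError("need at least one sentence")
    total_tokens = sum(len(encode(s)) for s in sentences)
    total_chars = sum(len(s) for s in sentences)  # spaces are counted (assumption)
    return {
        "avg_tokens_per_sentence": total_tokens / len(sentences),
        "chars_per_token": total_chars / total_tokens,
    }


# Demo with whitespace splitting as a stand-in tokenizer.
sentences = ["the cat sat", "a dog"]
metrics = tokenizer_metrics(sentences, str.split)
print(metrics)
```

A `chars_per_token` value near 5 corresponds to the 5:1 compression ratio quoted earlier.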
Serialize the tokenizer to a Hugging Face-compatible JSON file and verify it can be reloaded correctly.
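
A save-and-reload check might be sketched as follows, using `Tokenizer.save` and `Tokenizer.from_file` from the Hugging Face tokenizers library. The file name `tokenizer.json` and the helper names are assumptions.

```python
def save_and_reload(tokenizer, path="tokenizer.json"):
    """Write the tokenizer to a JSON file and load it back."""
    # Imported lazily so this file still loads without `tokenizers` installed.
    from tokenizers import Tokenizer

    tokenizer.save(path)
    return Tokenizer.from_file(path)


def same_encoding(a, b, samples):
    """Check that two tokenizers produce identical IDs for every sample."""
    return all(a.encode(s).ids == b.encode(s).ids for s in samples)


# Usage, given a trained `tokenizer`:
# reloaded = save_and_reload(tokenizer)
# assert same_encoding(tokenizer, reloaded, ["Tokenization is fun.", "user@example.com"])
```

Comparing token IDs rather than token strings also catches a reloaded vocabulary whose tokens are the same but numbered differently.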