Type anything.
Watch BPE tokenize it live.

This tokenizer was trained from scratch on the WikiText-2 corpus.
Every token you see below comes from 29,714 learned merge rules.

Project Overview

Building a BPE Tokenizer from Scratch

Tokenization is one of the most fundamental steps in Natural Language Processing. Modern language models like GPT, BERT, and LLaMA cannot process raw text directly — text must first be converted into tokens, then mapped to numerical IDs. This project builds a complete BPE tokenizer from scratch using the WikiText-2 dataset, covering everything from raw text preprocessing to a fully serialized, Hugging Face-compatible tokenizer.

30K vocabulary tokens learned
5:1 compression ratio achieved
21K unique training samples used

Objectives
Load and explore the WikiText-2 dataset via Hugging Face
Clean, preprocess and deduplicate the training corpus
Train a custom BPE tokenizer from scratch
Analyze the learned vocabulary hierarchy
Evaluate tokenizer performance on unseen text
Save in Hugging Face-compatible JSON format
Demonstrate encoding and decoding workflows
Visualize compression and data cleaning metrics

Key Findings

Dataset Quality

The original WikiText-2 training split had 36,718 rows. Of these, 12,951 (about 35%) were empty or whitespace-only, leaving 23,767. After cleaning and deduplication, the final corpus had 21,337 unique samples.

Vocabulary Hierarchy

Early vocabulary entries are common character pairs such as th, in, er. Mid-range entries are common words such as state, area, million. Late entries are specialized terms such as potassium and Ferguson.

Compression Efficiency

The tokenizer achieved an average compression ratio of about 5:1 (characters per token) on the validation set. Each token represents roughly five characters of input text, so an encoded sequence is about one fifth the length of its character-level equivalent.

Subword Generalization

Unseen words are split into reusable subword units. The tokenizer handles modern terms, email addresses, and rare proper nouns it never saw during training.


Subword Decomposition Examples

BPE breaks unseen words into known subword pieces. These words were not in the training corpus but are decomposed gracefully:

internationalization
transformational
someone@somemail.com
ChatGPT
electromechanical

Dataset Cleaning Summary
36,718 original rows
23,767 after empty-row removal
21,337 in the final unique corpus

Algorithm Explained

How Byte Pair Encoding Works

BPE starts with individual characters and iteratively merges the most frequent adjacent pair until the target vocabulary size is reached. It's elegant, greedy, and surprisingly effective.

Tokenization Strategy Comparison
Strategy            | Example: "electromechanical" | Unknown words    | Sequence length | Vocab size
Word-level          | [UNK]                        | ✗ Lost as [UNK]  | Short           | Huge
Character-level     | e, l, e, c, t, r, o...       | ✓ No unknowns    | Very long       | ~100 chars
BPE (this project)  | electro + mechanical         | ✓ Subwords       | Balanced        | 30,000
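
The comparison above can be made concrete with a few lines of Python. This is only an illustrative sketch: the word-level vocabulary is a hypothetical three-word set, and the BPE split is hard-coded from the table rather than produced by the trained tokenizer.

```python
word = "electromechanical"

# Hypothetical word-level vocabulary (illustration only).
word_vocab = {"electric", "mechanical", "motor"}

# Word-level: a word outside the vocabulary collapses to a single [UNK].
word_tokens = [word if word in word_vocab else "[UNK]"]

# Character-level: never unknown, but one token per character.
char_tokens = list(word)

# BPE: the split reported in the table, hard-coded here (not computed).
bpe_tokens = ["electro", "mechanical"]

for name, tokens in [("word", word_tokens), ("char", char_tokens), ("bpe", bpe_tokens)]:
    print(f"{name:5s} {len(tokens):2d} tokens: {tokens}")
```

The word-level result carries no information about the input, the character-level result is 17 tokens long, and the subword split keeps the full spelling in 2 tokens.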
Step-by-Step Algorithm

1. Initialize with characters

Split every word in the corpus into individual characters. Each character becomes a token. The initial vocabulary is the set of all unique characters found in the corpus.

"low" → l · o · w
"lower" → l · o · w · e · r
"newest" → n · e · w · e · s · t

2. Count all adjacent pairs

Scan through the entire corpus and count how often each adjacent pair of symbols appears (single characters at first, merged units in later rounds). This frequency count drives every merge decision.

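(l, o): 2    (o, w): 2    (w, e): 2    (e, s): 1    (e, r): 1
Most frequent: (l, o), (o, w), and (w, e) are tied at 2. Any fixed tie-breaking rule works; this walkthrough picks (l, o).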
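
A minimal sketch of the counting step, using the three example words from step 1. The helper name `count_pairs` is made up for this illustration, and each word is counted once, whereas a real trainer weights pairs by how often each word occurs in the corpus.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across a list of symbol sequences."""
    counts = Counter()
    for symbols in words:
        # zip pairs each symbol with its right-hand neighbour.
        counts.update(zip(symbols, symbols[1:]))
    return counts

corpus = [list(w) for w in ["low", "lower", "newest"]]
pair_counts = count_pairs(corpus)

top = max(pair_counts.values())
tied = sorted(pair for pair, count in pair_counts.items() if count == top)
print(top, tied)
```

On these three words the highest count is 2, shared by (l, o), (o, w), and (w, e), so the trainer needs some fixed rule to choose among them.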

3. Merge the most frequent pair

Take the most frequent pair and merge it everywhere it appears in the corpus. Add the merged unit as a new vocabulary token. Record the merge rule.

Merge rule: l + o → lo
"low" → lo · w
"lower" → lo · w · e · r

4. Repeat until vocab size reached

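Go back to step 2 with the updated corpus and keep merging the most frequent pair each iteration. For this project, 29,714 merge rules were learned to build a 30,000-token vocabulary; the remaining 286 entries (30,000 − 29,714) are the starting symbols, that is, the base characters plus the special tokens.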
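
Steps 1 to 4 can be sketched as a short training loop in plain Python. This is a teaching sketch, not the Hugging Face implementation: the function names are invented here, and ties are broken by taking the lexicographically smallest pair, which is an assumption that real libraries may not share.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the merged unit."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: every word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)            # step 2
        if not pairs:
            break
        # Highest count wins; ties go to the smallest pair (assumed rule).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)                   # record the merge rule
        new_words = Counter()
        for symbols, freq in words.items():   # step 3
            new_words[merge_pair(symbols, best)] += freq
        words = new_words                     # step 4: repeat
    return merges, words

merges, words = train_bpe(["low", "lower", "newest", "widest"], num_merges=4)
print(merges)
print(sorted(words))
```

With this tie-breaking rule, four merges on the toy corpus low, lower, newest, widest learn es, est, lo, and low, in that order.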


5. Apply learned merges to new text

To tokenize unseen text, start with characters and apply the saved merge rules in priority order, that is, in the order they were learned (earliest merge first). This is deterministic: the same text always produces the same tokens.
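
The encoding step can be sketched as follows. The function name `encode_word` is hypothetical. The merge list is the four toy merges from the example on this page (es, est, lo, low), written out by hand rather than loaded from the trained tokenizer.

```python
def encode_word(word, merges):
    """Tokenize one word by repeatedly applying the earliest-learned merge."""
    # Lower rank means the merge was learned earlier and has higher priority.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Toy merge rules in learned order.
toy_merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

print(encode_word("lowest", toy_merges))
print(encode_word("newer", toy_merges))
```

"lowest" never appears in the toy corpus, yet it encodes to low + est because both pieces were learned from other words. "newer" matches no merge rule and stays as single characters.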

Interactive Toy Example

// BPE merge steps on: low, lower, newest, widest

Initial state — each word split into characters. No merges yet.
l·o·w
l·o·w·e·r
n·e·w·e·s·t
w·i·d·e·s·t
Merge 1: e + s → es — pair (e,s) appears 2× in newest and widest.
l·o·w
l·o·w·e·r
n·e·w·es·t
w·i·d·es·t
Merge 2: es + t → est — pair (es,t) now appears 2×.
l·o·w
l·o·w·e·r
n·e·w·est
w·i·d·est
Merge 3: l + o → lo — pair (l,o) appears 2× in low and lower.
lo·w
lo·w·e·r
n·e·w·est
w·i·d·est
Merge 4: lo + w → low — pair (lo,w) appears 2×. Vocabulary now includes "low" as a unit.
low
low·e·r
n·e·w·est
w·i·d·est
Final vocabulary includes: individual chars + es, est, lo, low as learned subwords. Further merges continue until vocab_size=30,000.
low
low·e·r
n·e·w·est
w·i·d·est
Code Walkthrough

Notebook — Section by Section

Selected cells from the original Jupyter notebook, formatted for readability.

Section 4: Dataset Loading (5 cells)

Load the WikiText-2 dataset from Hugging Face and explore its structure across train, validation, and test splits.

    # Load the dataset
    from datasets import load_dataset

    dataset = load_dataset("Salesforce/wikitext", "wikitext-2-v1")

    # Print the structure of each split
    for split in dataset:
        print(f"\n{split.upper()}")
        print(dataset[split])

    # Convert each split to a pandas DataFrame and report its shape
    train_df = dataset["train"].to_pandas()
    valid_df = dataset["validation"].to_pandas()
    test_df = dataset["test"].to_pandas()

    print("Train Shape:", train_df.shape)
    print("Validation Shape:", valid_df.shape)
    print("Test Shape:", test_df.shape)

Section 5: Data Cleaning (6 cells)

Remove empty rows, strip <unk> placeholders, normalize whitespace, and deduplicate the corpus before training.

    # Remove empty rows (copy after filtering so later column
    # assignments do not act on a view of train_df)
    train_clean = train_df[train_df["text"].str.strip() != ""].copy()
    print("Rows after empty removal:", len(train_clean))

    # Remove <unk> tokens and normalize whitespace
    train_clean["text"] = (
        train_clean["text"]
        .str.replace("<unk>", "", regex=False)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

    # Deduplicate
    before = len(train_clean)
    train_clean = train_clean.drop_duplicates()
    after = len(train_clean)
    print("Duplicates removed:", before - after)
    print("Final rows:", after)  # → 21,337

Section 7: BPE Toy Example (1 cell)

Visualize how BPE starts by splitting words into characters before any merges are applied.

    toy_corpus = ["low", "lower", "newest", "widest"]

    # Print each word as space-separated characters
    for word in toy_corpus:
        print(" ".join(list(word)))

    # Output:
    # l o w
    # l o w e r
    # n e w e s t
    # w i d e s t

Section 8: Training the Tokenizer (4 cells)

Initialize a BPE model with Hugging Face tokenizers, configure special tokens, and train on the cleaned corpus.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # BPE model with an explicit unknown token; split on whitespace and punctuation first
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(
        vocab_size=30000,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )

    # Assumption: `corpus` is not defined in the cells shown; here it is
    # taken to be the cleaned training text from Section 5.
    corpus = train_clean["text"].tolist()

    tokenizer.train_from_iterator(corpus, trainer=trainer)  # Learns 29,714 merge rules
    print("Vocab size:", tokenizer.get_vocab_size())  # 30,000

Section 9: Vocabulary Inspection (10 cells)

Explore how the learned vocabulary progresses from simple character pairs to complex specialized words.

    # Encode a sample sentence and round-trip it back to text
    sample_text = "Machine learning is transforming artificial intelligence."
    encoding = tokenizer.encode(sample_text)
    print("Tokens:", encoding.tokens)
    print("IDs:", encoding.ids)
    print("Decoded:", tokenizer.decode(encoding.ids))

    # Show how words outside the training corpus are split into subwords
    test_words = [
        "electromechanical",
        "internationalization",
        "transformational",
        "ChatGPT",
        "someone@somemail.com",
    ]
    for word in test_words:
        encoding = tokenizer.encode(word)
        print(f"\n{word}")
        print("Tokens:", encoding.tokens)

Section 10: Evaluation Metrics (6 cells)

Measure vocab size, average tokens per sentence, and compression ratio on the validation set.

    # Assumption: `valid_texts` is not defined in the cells shown; here it is
    # taken to be the non-empty rows of the validation split.
    valid_texts = [t for t in valid_df["text"] if t.strip()]

    # Average tokens per sentence
    token_counts = [len(tokenizer.encode(t).tokens) for t in valid_texts]
    avg_tokens = sum(token_counts) / len(token_counts)
    print(f"Avg tokens/sentence: {avg_tokens:.2f}")

    # Compression ratio: chars / tokens (reuses the counts from above)
    total_chars = sum(len(t) for t in valid_texts)
    total_tokens = sum(token_counts)
    compression = total_chars / total_tokens
    print(f"Compression ratio: {compression:.2f}")  # → ~5.0 in the original notebook

Section 12–13: Save & Reload (5 cells)

Serialize the tokenizer to a Hugging Face-compatible JSON file and verify it can be reloaded correctly.

    # Save
    tokenizer.save("wikitext2_bpe_tokenizer.json")

    # Reload and verify
    from tokenizers import Tokenizer

    loaded = Tokenizer.from_file("wikitext2_bpe_tokenizer.json")
    sample = "BPE enables tokenization of unseen words."
    enc = loaded.encode(sample)
    print("Tokens:", enc.tokens)
    print("Decoded:", loaded.decode(enc.ids))