BPE Tokenizer — WikiText-2

Project Overview

Building a BPE Tokenizer from Scratch

Tokenization is one of the most fundamental steps in Natural Language Processing. Modern language models like GPT, BERT, and LLaMA cannot process raw text directly — text must first be converted into tokens, then mapped to numerical IDs. This project builds a complete BPE tokenizer from scratch using the WikiText-2 dataset, covering everything from raw text preprocessing to a fully serialized, Hugging Face-compatible tokenizer.

30K vocabulary tokens learned

5:1 compression ratio achieved

21K unique training samples used

Objectives

Load and explore the WikiText-2 dataset via Hugging Face

Clean, preprocess and deduplicate the training corpus

Train a custom BPE tokenizer from scratch

Analyze the learned vocabulary hierarchy

Evaluate tokenizer performance on unseen text

Save in Hugging Face-compatible JSON format

Demonstrate encoding and decoding workflows

Visualize compression and data cleaning metrics

Key Findings

Dataset Quality

The original WikiText-2 training split had 36,718 rows. Around 35% were empty or whitespace-only. After cleaning and deduplication, the final corpus had 21,337 unique samples.

Vocabulary Hierarchy

Early vocabulary entries are common character pairs such as th, in, and er. Mid-range entries are common words such as state, area, and million. The highest-ID entries are specialized terms such as potassium and Ferguson.

Compression Efficiency

The tokenizer achieved an average compression ratio of about 5:1 on the validation set: each token represents roughly five characters of input text, so sequences are about five times shorter than with character-level tokenization.

Subword Generalization

Unseen words are split into reusable subword units. The tokenizer handles modern terms, email addresses, and rare proper nouns it never saw during training.

Subword Decomposition Examples

BPE breaks unseen words into known subword pieces. These words were not in the training corpus but are decomposed gracefully:

internationalization →

internationalization

transformational →

transformational

someone@somemail.com →

someone@somemail.com

ChatGPT →

ChatGPT

electromechanical →

electromechanical

Dataset Cleaning Summary

Original rows: 36,718

After empty removal: 23,767

Final unique corpus: 21,337
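The arithmetic behind these three figures can be checked with a short sketch. The row counts are the ones reported above; the variable names are illustrative and not taken from the notebook.

```python
# Row counts reported in the cleaning summary
original_rows = 36_718
after_empty_removal = 23_767
final_unique = 21_337

# Rows dropped at each stage
empty_removed = original_rows - after_empty_removal
duplicates_removed = after_empty_removal - final_unique

print(f"Empty rows removed: {empty_removed:,} ({empty_removed / original_rows:.1%})")
print(f"Dropped by deduplication: {duplicates_removed:,} ({duplicates_removed / after_empty_removal:.1%})")
print(f"Share of original rows kept: {final_unique / original_rows:.1%}")
```

The first line confirms the "around 35%" figure for empty or whitespace-only rows.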

Algorithm Explained

How Byte Pair Encoding Works

BPE starts with individual characters and iteratively merges the most frequent adjacent pair until the target vocabulary size is reached. It's elegant, greedy, and surprisingly effective.

Tokenization Strategy Comparison

Strategy	Example: "electromechanical"	Unknown words	Sequence length	Vocab size
Word-level	[UNK]	✗ Lost as [UNK]	Short	Huge
Character-level	e, l, e, c, t, r, o...	✓ No unknowns	Very long	~100 chars
BPE (this project)	electro + mechanical	✓ Subwords	Balanced	30,000
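As a rough illustration of the table, the sketch below compares how many tokens each strategy produces for one word. The word-level vocabulary is a hypothetical stand-in, and the subword split is copied from the table rather than computed by a trained tokenizer.

```python
word = "electromechanical"

# Hypothetical word-level vocabulary that does not contain the word
word_vocab = {"the", "state", "area", "million"}
word_level = [word] if word in word_vocab else ["[UNK]"]

# Character-level: one token per character, never unknown
char_level = list(word)

# Subword split as listed in the table (not computed here)
subword_level = ["electro", "mechanical"]

for name, tokens in [("word", word_level), ("char", char_level), ("subword", subword_level)]:
    print(f"{name:<8} {len(tokens):>2} tokens  {tokens}")
```

The word-level result is short but loses the word entirely, the character-level result keeps everything at the cost of 17 tokens, and the subword split sits in between.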

Step-by-Step Algorithm

Initialize with characters

Split every word in the corpus into individual characters. Each character becomes a token. The initial vocabulary is the set of all unique characters found in the corpus.

"low" → l · o · w
"lower" → l · o · w · e · r
"newest" → n · e · w · e · s · t

Count all adjacent pairs

Scan through the entire corpus and count how often each adjacent character pair appears. This frequency count drives every merge decision.

(l, o): 2 (o, w): 2 (w, e): 2 (e, s): 1 (e, r): 1
Most frequent: (l, o), (o, w), or (w, e), all tied at 2
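A minimal sketch of the counting step on the three example words, assuming each word occurs once. The helper name `count_pairs` is illustrative and not taken from the notebook.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across a list of symbol sequences."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs


# Each word starts as a list of single characters
corpus = [list(word) for word in ["low", "lower", "newest"]]
pair_counts = count_pairs(corpus)

# Collect every pair that shares the highest count
top = max(pair_counts.values())
most_frequent = [pair for pair, count in pair_counts.items() if count == top]
print(most_frequent)
```

On this corpus three pairs share the top count of 2, namely (l, o), (o, w), and (w, e), so an implementation needs some tie-breaking rule to pick one.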

Merge the most frequent pair

Take the most frequent pair and merge it everywhere it appears in the corpus. Add the merged unit as a new vocabulary token. Record the merge rule.

Merge rule: l + o → lo
"low" → lo · w
"lower" → lo · w · e · r
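The merge step can be sketched as a left-to-right scan over one word's symbols. This is an illustrative helper under that assumption, not the notebook's code.

```python
def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        # Merge when the current symbol and the next one match the rule
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


low = merge_pair(list("low"), ("l", "o"))
lower = merge_pair(list("lower"), ("l", "o"))
print(low)
print(lower)
```

Scanning left to right means overlapping matches are resolved greedily: in a, a, a the rule a + a merges the first two symbols and leaves the third alone.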

Repeat until vocab size reached

Go back to step 2 with the updated corpus. Keep merging the most frequent pair each iteration. For this project: 29,714 merge rules were learned to build a 30,000-token vocabulary.
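Putting the count and merge steps in a loop gives a tiny trainer. The sketch below is a simplified illustration, not the Hugging Face implementation: it assumes each word occurs once, stops after a fixed number of merges rather than at a vocabulary size, and breaks ties in favor of the pair seen first.

```python
from collections import Counter


def _merge(symbols, pair):
    """Merge every adjacent occurrence of `pair` in one symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # every word is already a single symbol
        # Assumption: ties go to the pair seen first; real trainers may differ
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = [_merge(symbols, best) for symbols in corpus]
    return merges, corpus


merges, corpus = train_bpe(["low", "lower", "newest"], num_merges=2)
print(merges)
print(corpus)
```

With two merges on the three example words, this learns l + o → lo and then lo + w → low, matching the merge rule shown in step 3.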

Apply learned merges to new text

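To tokenize unseen text, start with characters and apply the saved merge rules in the order they were learned, earliest merge first. Earlier rules correspond to pairs that were more frequent when they were merged. The process is deterministic: the same text always produces the same tokens.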
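A minimal sketch of this encoding step, assuming the merge rules are stored as an ordered list of pairs. The rule list below is the four merges from the toy corpus low, lower, newest, widest (es, est, lo, low); the function name `encode` is illustrative.

```python
def encode(word, merges):
    """Tokenize one word by replaying merge rules in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Merge rules in learned order (taken from the toy example)
toy_merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

print(encode("lowest", toy_merges))
print(encode("newer", toy_merges))
```

The word lowest never appeared in the toy corpus, yet it encodes as low + est because both pieces were learned from other words. The word newer matches no rule and falls back to single characters.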

Interactive Toy Example

// BPE merge steps on: low, lower, newest, widest

Initial state — each word split into characters. No merges yet.

l·o·w

l·o·w·e·r

n·e·w·e·s·t

w·i·d·e·s·t

Merge 1: e + s → es — pair (e,s) appears 2× in newest and widest.

l·o·w

l·o·w·e·r

n·e·w·es·t

w·i·d·es·t

Merge 2: es + t → est — pair (es,t) now appears 2×.

l·o·w

l·o·w·e·r

n·e·w·est

w·i·d·est

Merge 3: l + o → lo — pair (l,o) appears 2× in low and lower.

lo·w

lo·w·e·r

n·e·w·est

w·i·d·est

Merge 4: lo + w → low — pair (lo,w) appears 2×. Vocabulary now includes "low" as a unit.

low

low·e·r

n·e·w·est

w·i·d·est

Final vocabulary includes: individual chars + es, est, lo, low as learned subwords. Further merges continue until vocab_size=30,000.

low

low·e·r

n·e·w·est

w·i·d·est

Code Walkthrough

Notebook — Section by Section

Cells from the original Jupyter notebook, grouped by section and formatted for readability.

Section 4: Dataset Loading (5 cells)

Load the WikiText-2 dataset from Hugging Face and explore its structure across train, validation, and test splits.

# Load the dataset
from datasets import load_dataset

dataset = load_dataset(
    "Salesforce/wikitext",
    "wikitext-2-v1"
)

for split in dataset:
    print(f"\n{split.upper()}")
    print(dataset[split])

train_df = dataset["train"].to_pandas()
valid_df = dataset["validation"].to_pandas()
test_df  = dataset["test"].to_pandas()

print("Train Shape:",      train_df.shape)
print("Validation Shape:", valid_df.shape)
print("Test Shape:",       test_df.shape)

Section 5: Data Cleaning (6 cells)

Remove empty rows, strip <unk> placeholders, normalize whitespace, and deduplicate the corpus before training.

# Remove empty rows
train_clean = train_df.copy()
train_clean = train_clean[
    train_clean["text"].str.strip() != ""
]
print("Rows after empty removal:", len(train_clean))

# Remove <unk> tokens and normalize whitespace
train_clean["text"] = (
    train_clean["text"]
    .str.replace("<unk>", "", regex=False)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Deduplicate
before = len(train_clean)
train_clean = train_clean.drop_duplicates()
after  = len(train_clean)

print("Duplicates removed:", before - after)
print("Final rows:", after)  # → 21,337

Section 7: BPE Toy Example (1 cell)

Visualize how BPE starts by splitting words into characters before any merges are applied.

toy_corpus = ["low", "lower", "newest", "widest"]

for word in toy_corpus:
    print(" ".join(list(word)))

# Output:
# l o w
# l o w e r
# n e w e s t
# w i d e s t

Section 8: Training the Tokenizer (4 cells)

Initialize a BPE model with Hugging Face tokenizers, configure special tokens, and train on the cleaned corpus.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=[
        "[PAD]", "[UNK]", "[CLS]",
        "[SEP]", "[MASK]"
    ]
)

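# Assumption: `corpus` is the cleaned training text as a list of strings
corpus = train_clean["text"].tolist()

tokenizer.train_from_iterator(corpus, trainer=trainer)
# Learns 29,714 merge rules
print("Vocab size:", tokenizer.get_vocab_size())  # 30,000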

Section 9: Vocabulary Inspection (10 cells)

Explore how the learned vocabulary progresses from simple character pairs to complex specialized words.

sample_text = "Machine learning is transforming artificial intelligence."
encoding = tokenizer.encode(sample_text)

print("Tokens:", encoding.tokens)
print("IDs:",    encoding.ids)
print("Decoded:", tokenizer.decode(encoding.ids))

test_words = [
    "electromechanical", "internationalization",
    "transformational",   "ChatGPT",
    "someone@somemail.com"
]

for word in test_words:
    encoding = tokenizer.encode(word)
    print(f"\n{word}")
    print("Tokens:", encoding.tokens)
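To look at the vocabulary hierarchy itself, one option is to sort tokens by ID and sample from the start, middle, and end. The sketch below works on any token-to-ID dictionary; the small `toy_vocab` is a hypothetical stand-in built from the examples in Key Findings, and the helper name `sample_bands` is illustrative. With the trained tokenizer, the dictionary would presumably come from `tokenizer.get_vocab()`, on the assumption that IDs follow the order in which tokens were added.

```python
def sample_bands(vocab, k=2):
    """Return the first, middle, and last `k` tokens when sorted by ID."""
    ordered = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]
    mid = len(ordered) // 2
    return {
        "early": ordered[:k],
        "mid": ordered[mid:mid + k],
        "late": ordered[-k:],
    }


# Hypothetical stand-in for a real vocabulary (token -> ID), deliberately unordered
toy_vocab = {
    "potassium": 6, "th": 0, "area": 4, "in": 1,
    "Ferguson": 7, "er": 2, "million": 5, "state": 3,
}

bands = sample_bands(toy_vocab)
for name, tokens in bands.items():
    print(f"{name:<6} {tokens}")
```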

Section 10: Evaluation Metrics (6 cells)

Measure vocab size, average tokens per sentence, and compression ratio on the validation set.

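# Assumption: `valid_texts` holds the non-empty validation rows
valid_texts = [t for t in valid_df["text"] if t.strip()]

# Average tokens per sentence
token_counts = [
    len(tokenizer.encode(t).tokens)
    for t in valid_texts
]
avg_tokens = sum(token_counts) / len(token_counts)
print(f"Avg tokens/sentence: {avg_tokens:.2f}")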

# Compression ratio: chars / tokens
total_chars  = sum(len(t) for t in valid_texts)
total_tokens = sum(len(tokenizer.encode(t).tokens) for t in valid_texts)
compression  = total_chars / total_tokens

print(f"Compression ratio: {compression:.2f}")  # → ~5.0

Sections 12–13: Save & Reload (5 cells)

Serialize the tokenizer to a Hugging Face-compatible JSON file and verify it can be reloaded correctly.

# Save
tokenizer.save("wikitext2_bpe_tokenizer.json")

# Reload and verify
from tokenizers import Tokenizer

loaded = Tokenizer.from_file("wikitext2_bpe_tokenizer.json")

sample = "BPE enables tokenization of unseen words."
enc    = loaded.encode(sample)

print("Tokens:", enc.tokens)
print("Decoded:", loaded.decode(enc.ids))
