Loading embeddings…
Word2Vec · Skip-Gram · Negative Sampling · WikiText-2

Words as Vectors in Space

Watch how embeddings evolve across training epochs — from random noise to meaningful semantic structure. Switch epochs above to compare.

Raw Text
Tokenize
Subsample
Skip-Gram
Neg. Sampling
Embeddings
Training Epoch
Epoch 10
— words loaded
🗺 Explore Space
≈ Similarity
⬡ Neighbors
⊕ Analogy
📐 How It Works
⚠ Could not load .
Make sure all three JSON files are in the same folder as this HTML file:
embeddings_epoch2.json  ·  embeddings_epoch5.json  ·  embeddings_epoch10.json

Run export_embeddings.py in your notebook to generate them. Meanwhile the demo runs with mock embeddings.
2D Embedding Space · PCA Projection
Vocabulary:
Embed dim:
PCA var:
Dataset: WikiText-2
Method: Skip-Gram + NS

💡 Switch epochs above to watch clusters form. Hover a word for its label. Early epochs look scattered — later epochs show tight semantic groupings.

Cosine Similarity

Cosine similarity measures the angle between two embedding vectors. 1.0 = same direction, 0 = orthogonal, −1 = opposite. Watch how scores change across epochs as the model learns.

Try:
Current Epoch Result

Enter two words above and click Compute.

Across All Epochs

How does similarity for the same word pair evolve as training progresses?

Compute a similarity above to see the comparison.

Score Guide
0.7 – 1.0 → Strongly related
0.3 – 0.7 → Moderately related
0.0 – 0.3 → Weakly related
−1.0 – 0.0 → Dissimilar
sim(A, B) = (A · B) / (|A| × |B|)
Nearest Neighbors

Words closest in embedding space — i.e. words that appear in similar contexts. Switch epochs above to watch neighbors stabilize as the model trains.

Try:
Vector Arithmetic · Analogy Reasoning

Solve A − B + C = ?. Word relationships are encoded as directions in embedding space. With more training epochs, the model's analogical reasoning becomes more accurate.

video
mtv
+
music
?
Try:
📖
What are Embeddings?
Each word maps to a dense vector of real numbers. Words with similar meanings or usage patterns end up with similar vectors — proximity in vector space = semantic relatedness.
word → [0.23, −0.87, ..., 0.64] ∈ ℝ¹⁰⁰
🪟
Skip-Gram Model
Given a center word, predict surrounding context words within a sliding window of size k. This forces the model to learn which words share similar neighborhoods.
center: "music" → predict:
"album", "guitar", "song", "lyrics"
🎯
Negative Sampling
Instead of softmax over the full vocabulary, update only the true context word (positive) and k randomly sampled "noise" words (negatives).
L = log σ(v_c · v_pos) + Σₖ log σ(−v_c · v_neg)
P(neg) ∝ f(w)^0.75
🔽
Frequency Subsampling
Common words like "the" are randomly dropped during training to prevent them from dominating the signal.
P_discard(w) = max(0, 1 − √(t / f(w)))
t = 10⁻⁵
📐
Cosine Similarity
The cosine of the angle between two vectors — scale-invariant, only direction matters. Range: [−1, +1].
sim(A, B) = (A · B) / (|A| |B|)
🧮
Vector Arithmetic
Relationships are encoded as directions, so you can do arithmetic on meaning. Classic example from Mikolov et al. 2013:
vec("music") − vec("video") + vec("mtv")
≈ vec("queen")
Your Training Pipeline
1. Dataset: WikiText-2 (train + val + test concatenated, ~2M tokens)
2. Cleaning: Remove headings (= markers), lowercase, replace hyphens
3. Tokenization: Whitespace split → build Counter vocabulary
4. Frequency filter: Drop words appearing < 5 times
5. Subsampling: Randomly discard frequent words with P_discard(w)
6. Pair generation: Sliding window (size=3) → (center, context) pairs
7. Negative sampling: k=5 negatives per pair, sampled ∝ f(w)^0.75
8. Model: SkipGram with separate input / output nn.Embedding tables
9. Training: Adam, lr=0.001, batch_size=256, embed_dim=100
10. Checkpoints: Model saved at epoch 2, 5, 10 → exported to JSON