Word2Vec Embeddings — Interactive Demo

⚠ Could not load .
Make sure all three JSON files are in the same folder as this HTML file:
embeddings_epoch2.json · embeddings_epoch5.json · embeddings_epoch10.json

Run export_embeddings.py in your notebook to generate them. Meanwhile the demo runs with mock embeddings.

2D Embedding Space · PCA Projection

Vocabulary: —

Embed dim: —

PCA var: —

Dataset: WikiText-2

Method: Skip-Gram + NS

💡 Switch epochs above to watch clusters form. Hover a word for its label. Early epochs look scattered — later epochs show tight semantic groupings.

Cosine Similarity

Cosine similarity measures the angle between two embedding vectors. 1.0 = same direction, 0 = orthogonal, −1 = opposite. Watch how scores change across epochs as the model learns.

Word A

Word B

Try:

Current Epoch Result

Enter two words above and click Compute.

Across All Epochs

How does similarity for the same word pair evolve as training progresses?

Compute a similarity above to see the comparison.

Score Guide

0.7 – 1.0 → Strongly related

0.3 – 0.7 → Moderately related

0.0 – 0.3 → Weakly related

−1.0 – 0.0 → Dissimilar

sim(A, B) = (A · B) / (|A| × |B|)

Nearest Neighbors

Words closest in embedding space — i.e. words that appear in similar contexts. Switch epochs above to watch neighbors stabilize as the model trains.
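A nearest-neighbor query can be sketched as a cosine-similarity ranking over the whole vocabulary. The snippet below is a minimal illustration, not the demo's actual code: the `toy` embeddings are hypothetical 2D vectors, and `nearest_neighbors` is an illustrative name.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbors(embeddings, query, top_k=5):
    """Rank every other word by cosine similarity to the query word."""
    q = embeddings[query]
    scored = [(w, cosine(q, v)) for w, v in embeddings.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy vectors, for illustration only.
toy = {
    "music": [0.9, 0.1],
    "song": [0.8, 0.2],
    "guitar": [0.7, 0.4],
    "river": [-0.2, 0.9],
}
neighbors = nearest_neighbors(toy, "music", top_k=2)
```

With real 100-dimensional embeddings the same ranking applies; only the vector length changes.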

Query Word

Top K

Try:

Vector Arithmetic · Analogy Reasoning

Solve A − B + C ≈ ? with vector arithmetic. Word relationships are encoded as directions in embedding space, so analogy results tend to become more accurate as the model trains for more epochs.

video − mtv + music ≈ ?

Word A

Word B (subtract)

Word C (add)

Try:

📖

What are Embeddings?

Each word maps to a dense vector of real numbers. Words with similar meanings or usage patterns end up with similar vectors — proximity in vector space = semantic relatedness.

word → [0.23, −0.87, ..., 0.64] ∈ ℝ¹⁰⁰

🪟

Skip-Gram Model

Given a center word, predict the surrounding context words within a sliding window of size m (the pipeline below uses m = 3). This forces the model to learn which words share similar neighborhoods.
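The windowing described above can be sketched as a function that emits (center, context) pairs. This is a minimal illustration that assumes the window size counts positions on each side of the center word and stays fixed; the original word2vec implementation shrinks the window at random, which is not shown here. The name `skipgram_pairs` is illustrative.

```python
def skipgram_pairs(tokens, window=3):
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "album", "music", "guitar", "song"]
pairs = skipgram_pairs(tokens, window=2)
```

Each pair becomes one positive training example for the model.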

center: "music" → predict:
"album", "guitar", "song", "lyrics"

🎯

Negative Sampling

Instead of softmax over the full vocabulary, update only the true context word (positive) and k randomly sampled "noise" words (negatives).

L = −[log σ(v_c · v_pos) + Σᵢ₌₁ᵏ log σ(−v_c · v_negᵢ)]
P(neg) ∝ f(w)^0.75
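A small sketch of both pieces, in plain Python: the two log-sigmoid terms (positive pair and k negatives) and the noise distribution with the 0.75 exponent. The function returns the sum of the log terms, which training pushes upward; a loss to minimize would be its negative. All names are illustrative, and this version does not exclude the positive word from the sampled negatives, which real implementations often do.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ns_objective(v_c, v_pos, v_negs):
    """log sigma(v_c . v_pos) + sum of log sigma(-v_c . v_neg) over negatives."""
    total = math.log(sigmoid(dot(v_c, v_pos)))
    for v_neg in v_negs:
        total += math.log(sigmoid(-dot(v_c, v_neg)))
    return total

def noise_distribution(counts, power=0.75):
    """Normalize count ** 0.75 so that rare words are sampled a bit more often."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}

def sample_negatives(dist, k, rng):
    """Draw k noise words (with replacement) from the noise distribution."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=k)

counts = {"the": 1000, "music": 10}  # hypothetical counts
dist = noise_distribution(counts)
negatives = sample_negatives(dist, k=5, rng=random.Random(0))
```

Raising counts to the power 0.75 flattens the distribution: here "music" gets a larger share than its raw 10 / 1010 frequency.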

🔽

Frequency Subsampling

Common words like "the" are randomly dropped during training to prevent them from dominating the signal.

P_discard(w) = max(0, 1 − √(t / f(w)))
t = 10⁻⁵
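The discard rule can be sketched in a few lines. Here f(w) is taken to be the word's relative frequency (count divided by total tokens), which is an assumption about how the notebook computes it; the function names are illustrative.

```python
import math
import random
from collections import Counter

def discard_prob(freq, t=1e-5):
    """P_discard = max(0, 1 - sqrt(t / f)), with f a relative frequency."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, t=1e-5, rng=None):
    """Keep each token with probability 1 - P_discard(token)."""
    rng = rng or random.Random(0)
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if rng.random() >= discard_prob(counts[w] / total, t)]
```

For example, a word that makes up 5% of the corpus is dropped about 98.6% of the time at t = 10⁻⁵, while any word with frequency at or below t is always kept.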

📐

Cosine Similarity

The cosine of the angle between two vectors — scale-invariant, only direction matters. Range: [−1, +1].

sim(A, B) = (A · B) / (|A| |B|)
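As a quick check of the formula, here is a minimal pure-Python version. Returning 0.0 for a zero-length vector is a convention chosen for this sketch, since the formula is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """(A . B) / (|A| |B|); returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

The first pair differs only in length, which is why the score ignores scale and still reports the same direction.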

🧮

Vector Arithmetic

Relationships are encoded as directions, so you can do arithmetic on meaning. Classic example from Mikolov et al. 2013:

vec("king") − vec("man") + vec("woman")
≈ vec("queen")
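An analogy query of the form A − B + C can be sketched as: build the target vector, then return the closest word by cosine similarity. The toy 2D vectors below are hypothetical and hand-built so that the well-known king − man + woman query resolves cleanly. Excluding the three input words from the candidates is common practice and is assumed here, not confirmed for this demo.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def analogy(embeddings, a, b, c, top_k=1):
    """Return the top_k words closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [(w, cosine(target, v)) for w, v in embeddings.items()
                  if w not in (a, b, c)]  # skip the query words themselves
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]

# Hypothetical toy space: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}
result = analogy(toy, "king", "man", "woman")
```

Learned embeddings are noisier than this toy space, so the expected word often appears among the top few results rather than exactly first.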

Your Training Pipeline

1. Dataset: WikiText-2 (train + val + test concatenated, ~2M tokens)
2. Cleaning: Remove headings (= markers), lowercase, replace hyphens
3. Tokenization: Whitespace split → build Counter vocabulary
4. Frequency filter: Drop words appearing < 5 times
5. Subsampling: Randomly discard frequent words with P_discard(w)
6. Pair generation: Sliding window (size=3) → (center, context) pairs
7. Negative sampling: k=5 negatives per pair, sampled ∝ f(w)^0.75
8. Model: SkipGram with separate input / output nn.Embedding tables
9. Training: Adam, lr=0.001, batch_size=256, embed_dim=100
10. Checkpoints: Model saved at epoch 2, 5, 10 → exported to JSON
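Steps 2 to 4 of the pipeline can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's code: heading lines are detected by a leading "=", hyphens are replaced with a space, and the function names are hypothetical.

```python
from collections import Counter

def clean_and_tokenize(raw_text):
    """Drop heading lines, lowercase, replace hyphens, split on whitespace."""
    tokens = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("="):
            continue  # blank line or WikiText-style heading
        # Assumption: hyphens become spaces, so "rock-music" yields two tokens.
        tokens.extend(stripped.lower().replace("-", " ").split())
    return tokens

def build_vocab(tokens, min_count=5):
    """Keep words seen at least min_count times; ids ordered by frequency."""
    counts = Counter(tokens)
    kept = {w: c for w, c in counts.items() if c >= min_count}
    ordered = sorted(kept, key=lambda w: (-kept[w], w))
    word_to_id = {w: i for i, w in enumerate(ordered)}
    return word_to_id, kept

raw = " = Heading = \nRock-music is loud \n"
tokens = clean_and_tokenize(raw)
```

The resulting `word_to_id` mapping is what an embedding table would be indexed by in the later steps.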

Words as Vectors in Space