Embeddings 101
Embeddings transform text into fixed-length vectors of floats where geometric proximity encodes semantic similarity. This chapter traces the evolution from sparse bag-of-words to modern dense embeddings, explaining architectures, training objectives, and practical considerations for choosing and deploying embedding models.
100K→768: from sparse TF-IDF (100K dims) to dense SBERT (768 dims), a ~130x compression with better quality.
10,000x: finding similar pairs among 10K sentences takes BERT 65 hours and SBERT 5 seconds.
110M: BERT-base has 12 layers, hidden size 768, and 110M parameters; trained for 4 days on 16 TPUs.
1. From Sparse to Dense: The Evolution
Bag-of-Words / TF-IDF (Sparse)
Traditional representations are sparse vectors with dimensions equal to the vocabulary size. A vocabulary of 100,000 words produces 100,000-dimensional vectors where each dimension corresponds to a word, and most values are zero. The word "cat" is as far from "kitten" as it is from "airplane" — every word is orthogonal to every other word. This is the fundamental limitation that dense embeddings solve.
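The orthogonality problem can be seen directly with a toy one-hot vocabulary (a minimal sketch; the words and the three-word vocabulary are invented for illustration):

```python
import numpy as np

# Toy vocabulary: each word is its own axis in a |V|-dimensional space.
vocab = {"cat": 0, "kitten": 1, "airplane": 2}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every distinct word is orthogonal to every other: similarity is always 0,
# whether the words are related ("cat"/"kitten") or not ("cat"/"airplane").
print(cosine(one_hot("cat", vocab), one_hot("kitten", vocab)))    # 0.0
print(cosine(one_hot("cat", vocab), one_hot("airplane", vocab)))  # 0.0
```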
2. Word2Vec (2013) — Static Word Embeddings
Google's Word2Vec introduced dense, low-dimensional word representations (typically 100-300 dimensions). The core idea is the distributional hypothesis: words appearing in similar contexts tend to have similar meanings. "Coffee" and "tea" both appear near "drink," "cup," "hot" — so they get similar vectors.
CBOW (Continuous Bag-of-Words): predict the center word from its surrounding context words. Faster training, better for frequent words. Window: 5-10 words.
Skip-gram: predict the context words from a center word. Slower, but better for rare words, since each sentence yields more training examples.
Negative Sampling — Making Training Feasible
The full softmax over 100K+ vocabulary is O(V) per training step — prohibitively expensive. Negative sampling turns it into a binary classification problem: for each positive (word, context) pair, sample 5-20 random "negative" words and train the model to distinguish real context from noise. This reduces complexity from O(V) to O(k) where k=5-20.
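The binary-classification objective can be sketched in a few lines of numpy (a toy forward pass only; the random vectors stand in for learned embeddings, and the gradient step is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 50, 5  # embedding size, negatives per positive pair

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context, negatives):
    """Binary classification: real (center, context) pair vs k noise words.
    Cost is O(k) dot products instead of an O(V) softmax."""
    pos = -np.log(sigmoid(center @ context))           # pull the real pair together
    neg = -np.log(sigmoid(-negatives @ center)).sum()  # push noise words away
    return pos + neg

center = rng.normal(size=dim)
context = rng.normal(size=dim)
negatives = rng.normal(size=(k, dim))  # k words drawn from a noise distribution
loss = neg_sampling_loss(center, context, negatives)
print(loss)  # finite scalar; its gradients are what Word2Vec optimizes
```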
The Famous Result: Vector Arithmetic
Word2Vec's vectors support analogies via simple arithmetic: vector("king") - vector("man") + vector("woman") lands closest to vector("queen"). Directions in the space capture relations such as gender, verb tense, and country-capital.
| Dimensions | Use Case | Quality | Memory (1M words) |
|---|---|---|---|
| 50-100 | Small datasets (<100K sentences) | Basic semantic capture | 200-400 MB |
| 200-300 | General NLP (Google's default) | Sweet spot | 800 MB - 1.2 GB |
| 768+ | Only with massive data | Diminishing returns for W2V | 3+ GB |
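The famous analogy vector("king") - vector("man") + vector("woman") ≈ vector("queen") can be sketched with hand-crafted toy vectors (real models learn such directions from data; here axis 0 plays the role of a gender direction and axis 1 a royalty direction):

```python
import numpy as np

# Toy 3-dim "embeddings" crafted so the analogy works by construction.
vectors = {
    "king":  np.array([ 1.0, 1.0, 0.1]),
    "queen": np.array([-1.0, 1.0, 0.1]),
    "man":   np.array([ 1.0, 0.0, 0.2]),
    "woman": np.array([-1.0, 0.0, 0.2]),
    "apple": np.array([ 0.0, -1.0, 1.0]),
}

def nearest(target, vectors, exclude):
    """Return the word whose vector has the highest cosine with target."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

target = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(target, vectors, exclude={"king", "man", "woman"}))  # queen
```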
Word2Vec assigns one vector per word, regardless of context. "Bank" (river) = "bank" (finance). This is the polysemy problem — and it's exactly what BERT's contextual embeddings solve.
3. GloVe and FastText — Incremental Improvements
GloVe (2014) — Stanford
Combined Word2Vec's local context with global co-occurrence statistics. Builds a word-word co-occurrence matrix from the entire corpus and factorizes it. The key insight: the ratio of co-occurrence probabilities encodes meaning.
P(solid|ice) / P(solid|steam) ≫ 1 → "solid" distinguishes ice; P(gas|ice) / P(gas|steam) ≪ 1 → "gas" distinguishes steam; P(water|ice) / P(water|steam) ≈ 1 → "water" relates equally to both.
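The ratio insight can be checked with toy co-occurrence counts (the numbers below are invented for illustration, echoing the GloVe paper's ice/steam example):

```python
import numpy as np

# Toy (word, context) co-occurrence counts harvested from a pretend corpus.
# The RATIOS, not the raw probabilities, separate relevant from irrelevant words.
contexts = ["solid", "gas", "water", "fashion"]
counts = {
    "ice":   np.array([190.0, 7.0, 300.0, 2.0]),
    "steam": np.array([2.0, 220.0, 310.0, 2.0]),
}
p_ice = counts["ice"] / counts["ice"].sum()
p_steam = counts["steam"] / counts["steam"].sum()
ratios = p_ice / p_steam
for w, r in zip(contexts, ratios):
    print(f"P({w}|ice) / P({w}|steam) = {r:.2f}")
# "solid" >> 1 (distinguishes ice), "gas" << 1 (distinguishes steam),
# "water" and "fashion" ~ 1 (relate equally to both, or to neither)
```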
FastText (2016) — Facebook
Extended Word2Vec with character n-grams. "Running" → [<ru, run, unn, nni, nin, ing, ng>]. The word vector is the sum of all n-gram vectors. Can generate embeddings for out-of-vocabulary words and handle misspellings.
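The n-gram decomposition is easy to reproduce (a sketch of the FastText scheme with boundary markers; FastText itself uses n = 3 to 6 plus a token for the whole word):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6):
    """Character n-grams with '<' and '>' boundary markers, as in FastText.
    The whole-word token <word> is appended as well."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]

print(char_ngrams("running", 3, 3))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>', '<running>']
```

The word vector is the sum of these n-gram vectors, so an unseen word like "runninng" still decomposes into mostly known n-grams, which is what gives FastText its out-of-vocabulary robustness.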
Great for morphologically rich languages (Finnish, Turkish)
4. The Transformer Revolution: BERT (2018)
BERT (Bidirectional Encoder Representations from Transformers) changed everything. Unlike Word2Vec, BERT produces contextual embeddings where the same word gets different vectors depending on context: "I sat by the river bank" produces a nature-related vector, while "I went to the bank to deposit money" produces a finance-related vector.
BERT is pre-trained on massive corpora (Wikipedia + BookCorpus, 3.3B words), bidirectional (reads text in both directions simultaneously, unlike GPT which is left-to-right), and fine-tunable for specific downstream tasks.
| Variant | Layers | Hidden Size | Parameters | Training |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 110M | 4 days, 16 TPUs |
| BERT-large | 24 | 1024 | 340M | 4 days, 64 TPUs |
BERT produces per-token embeddings (one vector per token, not per sentence). To compare two sentences you must feed the pair jointly through BERT, so finding all similar pairs among N sentences costs O(N²) forward passes: about 50 million pairs for 10K sentences, roughly 65 hours with raw BERT.
5. Sentence Transformers — SBERT (2019)
Sentence-BERT (Reimers & Gurevych, 2019) solved the per-token problem by training Siamese/Triplet networks that produce fixed-size vectors for entire sentences. The result: semantically similar sentences are close in vector space. The speedup is transformative: 10,000x faster than pure BERT for finding similar pairs in 10K sentences (65 hours → 5 seconds).
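The speedup comes from encoding each sentence once and then batching all pairwise similarities into a single matrix multiply. A numpy sketch with random stand-in embeddings (a real pipeline would obtain `emb` from e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)`; n is scaled down from the chapter's 10K example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 2000, 384  # 384 = all-MiniLM-L6-v2 output size

# Stand-ins for SBERT sentence embeddings: each sentence is encoded ONCE
# (O(n) forward passes) instead of running BERT on every pair (O(n^2)).
emb = rng.normal(size=(n, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize

# All n^2 cosine similarities in one matmul (cosine == dot product here).
sim = emb @ emb.T
np.fill_diagonal(sim, -np.inf)  # ignore trivial self-similarity
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(f"most similar pair: ({i}, {j}), cosine={sim[i, j]:.3f}")
```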
6. Training Loss Functions for Search Embeddings
The choice of loss function critically affects embedding quality. The dominant paradigm for search embeddings is contrastive learning: push the query close to its relevant document and away from all other documents in the batch.
MNR / InfoNCE
The dominant loss for search embeddings (Multiple Negatives Ranking, a variant of InfoNCE). Input: (anchor, positive) pairs only; no negatives are labeled, because every other pair's positive in the batch acts as a negative. Larger batch = more negatives = better training signal.
batch_size=1024 → 1023 negatives per query
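A minimal numpy sketch of MNR with in-batch negatives (forward pass only; the scale factor of 20 is a commonly used choice, and real implementations also compute gradients):

```python
import numpy as np

def mnr_loss(q, d, scale=20.0):
    """Multiple Negatives Ranking loss over a batch of (query, positive-doc)
    embedding pairs. Row i's positive is d[i]; every other row's document is
    an in-batch negative, giving batch_size - 1 negatives per query."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = scale * (q @ d.T)  # (B, B) scaled cosine similarity matrix
    # Cross-entropy with the diagonal (the true positive) as the target class:
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
B, dim = 8, 32
q = rng.normal(size=(B, dim))
good = mnr_loss(q, q.copy())                   # perfectly aligned pairs: low loss
bad = mnr_loss(q, rng.normal(size=(B, dim)))   # random pairings: high loss
print(good, bad)
```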
Triplet Loss
Input: (anchor, positive, negative) triplets with explicit negative mining. A margin parameter controls minimum separation. Requires curating negatives, more setup than MNR.
d(a, pos) + margin < d(a, neg)
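The margin constraint translates directly into a hinge loss (a sketch with Euclidean distance; toy 2-dim vectors and margin=0.5 are illustrative choices):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on distances: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
print(triplet_loss(a, np.array([0.9, 0.1]), np.array([-1.0, 0.0])))  # 0.0
print(triplet_loss(a, np.array([-1.0, 0.0]), np.array([0.9, 0.1])))  # > 0
```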
Cosine Similarity Loss
Input: (sentence_A, sentence_B, similarity_score) triples. Good for Semantic Textual Similarity (STS) tasks. Less effective for retrieval than MNR.
loss = MSE(cos(a, b), target_score)
Hard Negative Mining
Simple random negatives are too easy — the model doesn't learn fine distinctions. Hard negatives are documents that are topically close but not relevant: for "How to train a puppy," a hard negative is "How to train a machine learning model" — similar words, different meaning.
BM25 mining: take top BM25 results not marked relevant. They are lexically similar but semantically different, which is exactly what makes them hard.
Cross-encoder mining: use a teacher model (cross-encoder) to score candidate pairs and keep the highest-scoring non-relevant documents. Most effective, but expensive.
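The mining loop reduces to "rank candidates, drop the relevant ones, keep the top of what remains." A sketch where word overlap stands in for BM25 (the function name and toy documents are illustrative; in practice you would score with a real BM25 index or a cross-encoder):

```python
def mine_hard_negatives(query, docs, relevant_ids, k=2):
    """Return indices of the k top-scoring documents NOT marked relevant.
    Word-overlap scoring is a toy stand-in for BM25."""
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    ranked = sorted(range(len(docs)), key=lambda i: score(docs[i]), reverse=True)
    return [i for i in ranked if i not in relevant_ids][:k]

docs = [
    "how to train a puppy to sit",            # relevant
    "how to train a machine learning model",  # lexically close, wrong topic
    "best dog food brands",
    "how to train for a marathon",            # lexically close, wrong topic
]
hard = mine_hard_negatives("how to train a puppy", docs, relevant_ids={0})
print(hard)  # indices of the most confusable non-relevant documents
```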
7. Modern Embedding Models & MTEB Benchmark
The Massive Text Embedding Benchmark (MTEB) evaluates models across 8 task types and 56+ datasets. It's the standard leaderboard for comparing embedding models. Always benchmark on your own domain data too — MTEB scores don't always predict domain-specific performance.
| Model | Dims | MTEB Avg | Speed | Params |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 56.3 | Very Fast | 22M |
| all-mpnet-base-v2 | 768 | 57.8 | Medium | 109M |
| e5-large-v2 | 1024 | 62.2 | Slow | 335M |
| gte-large-en-v1.5 | 1024 | 65.4 | Slow | 434M |
| nomic-embed-text-v1.5 | 768 | 62.3 | Fast | 137M |
| text-embedding-3-small | 1536 | — | API | — |
8. Distance Metrics & Dimension Reduction
| Metric | Range | When to Use |
|---|---|---|
| Cosine Similarity | [-1, 1] | Default for text. Ignores magnitude, measures angle only. |
| Dot Product | (-∞, +∞) | Fastest; equivalent to cosine when vectors are L2-normalized. |
| Euclidean (L2) | [0, +∞) | Sometimes for images. Sensitive to magnitude. |
| Manhattan (L1) | [0, +∞) | Sparse vectors, binary features. |
Important: if vectors are L2-normalized (‖v‖ = 1), cosine similarity = dot product. Most embedding models output normalized vectors, so cosine and dot product give identical rankings. Store normalized vectors and use dot product for speed — it's a single numpy.dot() call.
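The equivalence is easy to verify (a sketch with random stand-in vectors: normalize once at indexing time, then a plain dot product at query time is the cosine similarity):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# L2-normalize once when indexing...
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# ...then the raw dot product equals the cosine similarity at query time.
print(cosine(a, b), a_n @ b_n)
```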
Matryoshka Embeddings (2022)
The modern approach to dimension reduction: train the model to be useful at any prefix length. The full 768-dim vector is most accurate, but you can truncate to 256 dims for 3x memory savings with only 2-5% quality loss. Remarkably, you just take the first N dimensions and re-normalize; no PCA or retraining is needed.
| Technique | How | Quality Loss |
|---|---|---|
| PCA | Linear projection to lower dims | 5-10% at 256-dim |
| Matryoshka | Truncate prefix from trained model | 2-5% at 256-dim |
| Random projection | Preserve distances via JL lemma | 5-15% |
| Scalar quantization | float32 → int8 (not dims) | 2-5% |
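Matryoshka truncation is a two-line operation (a sketch with random stand-in embeddings; the re-normalization step keeps cosine / dot-product search valid, and the quality claims only hold for models actually trained with a Matryoshka objective):

```python
import numpy as np

def truncate_matryoshka(emb, n_dims):
    """Keep the first n_dims coordinates of each embedding, then re-normalize.
    For non-Matryoshka models, naive truncation loses far more quality."""
    t = emb[:, :n_dims]
    return t / np.linalg.norm(t, axis=1, keepdims=True)

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

small = truncate_matryoshka(emb, 256)
print(small.shape)                 # (1000, 256)
print(small.nbytes / emb.nbytes)   # one third of the original memory
```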
Key Takeaways
Sparse → Dense Revolution
From 100K-dim sparse TF-IDF vectors (mostly zeros) to compact 300-768 dim dense vectors that encode meaning. Word2Vec (2013) proved the distributional hypothesis: words in similar contexts get similar vectors.
Context Changes Everything (BERT, 2018)
Word2Vec gives one vector per word ('bank' = same vector in all contexts). BERT produces contextual embeddings where the same word gets different vectors depending on surrounding text — solving the polysemy problem.
Contrastive Training Is the Key
Modern search embeddings use MNR/InfoNCE loss: push query close to its relevant doc, push it away from all other docs in the batch. Batch size is critical — 1024 gives 1023 negatives per query, dramatically improving training signal.
Matryoshka Embeddings Cut Costs at Source
Train the model to be useful at any prefix length: truncate 768→256 dims for 3x memory savings with only 2-5% quality loss. This compounds with product quantization (PQ) for even greater savings. The modern approach to dimension reduction.
Fine-Tune for Your Domain
General models confuse domain terminology: legal 'consideration' ≠ 'thoughtfulness'. Even 500-1000 domain-specific training pairs significantly improve retrieval quality. Use MNR loss + hard negatives mined from BM25.