Systems Atlas
Chapter 6.2: Vector & Semantic Search

Embeddings 101

Embeddings transform text into fixed-length vectors of floats where geometric proximity encodes semantic similarity. This chapter traces the evolution from sparse bag-of-words to modern dense embeddings, explaining architectures, training objectives, and practical considerations for choosing and deploying embedding models.

Dimension Evolution

100K→768

From sparse TF-IDF (100K dims) to dense SBERT (768 dims). 130x compression with better quality.

SBERT Speedup

10,000x

Finding similar pairs in 10K sentences: BERT takes 65 hours; SBERT takes 5 seconds.

BERT Params

110M

BERT-base: 12 layers, 768 hidden, 110M parameters. Trained 4 days on 16 TPUs.

Prerequisites: This chapter assumes familiarity with Chapter 6.1 (Why Keyword Search Is Not Enough). Understanding the vocabulary mismatch problem motivates why we need embeddings at all. You should also have a basic understanding of neural networks and gradient descent.

1. From Sparse to Dense: The Evolution

Bag-of-Words / TF-IDF (Sparse)

Traditional representations are sparse vectors with dimensions equal to the vocabulary size. A vocabulary of 100,000 words produces 100,000-dimensional vectors where each dimension corresponds to a word, and most values are zero. The word "cat" is as far from "kitten" as it is from "airplane" — every word is orthogonal to every other word. This is the fundamental limitation that dense embeddings solve.

sparse_vs_dense.py
# TF-IDF: "the cat sat on the mat"
# Vocab: [cat, dog, mat, on, sat, the, ...]
vector = [1.2, 0, 0.8, 0.3, 0.9, 0.1, ..., 0, 0, 0]  # 100K dims, mostly zeros
# Memory: 100K × 4 bytes = 400 KB per document
# For 1M docs: 400 GB just for vectors (mostly zeros)
# Problem: no semantic awareness
# cosine("cat", "kitten")   = 0.0  # orthogonal!
# cosine("cat", "airplane") = 0.0  # same distance!
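The orthogonality problem is easy to verify numerically. A minimal sketch with a toy one-hot vocabulary (any sparse bag-of-words representation behaves the same way):

```python
import numpy as np

# Toy 6-word vocabulary; each word is a one-hot sparse vector.
vocab = ["cat", "kitten", "airplane", "sat", "mat", "the"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every distinct word is orthogonal to every other word:
print(cosine(one_hot["cat"], one_hot["kitten"]))    # 0.0
print(cosine(one_hot["cat"], one_hot["airplane"]))  # 0.0
```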

2. Word2Vec (2013) — Static Word Embeddings

Google's Word2Vec introduced dense, low-dimensional word representations (typically 100-300 dimensions). The core idea is the distributional hypothesis: words appearing in similar contexts tend to have similar meanings. "Coffee" and "tea" both appear near "drink," "cup," "hot" — so they get similar vectors.

CBOW (Continuous Bag of Words)

Predict the center word from its surrounding context words.

Context: ["the", "cat", "on", "the"]
Predict: "sat"

Faster training, better for frequent words. Window: 5-10 words.

Skip-gram

Predict the context words from a center word.

Center: "sat"
Predict: ["the", "cat", "on", "the"]

Slower but better for rare words. Creates more training examples.
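The pair-generation step both variants rely on can be sketched in a few lines; `skipgram_pairs` here is a hypothetical helper written for illustration, not part of any library:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"])
# "sat" pairs with everything inside its window:
print([p for p in pairs if p[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

CBOW simply inverts each pair: the context words jointly predict the center.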

Negative Sampling — Making Training Feasible

The full softmax over 100K+ vocabulary is O(V) per training step — prohibitively expensive. Negative sampling turns it into a binary classification problem: for each positive (word, context) pair, sample 5-20 random "negative" words and train the model to distinguish real context from noise. This reduces complexity from O(V) to O(k) where k=5-20.

negative_sampling.py
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center, context_word, negative_words):
    """Binary classification instead of full softmax."""
    # Positive pair: center word + true context word
    pos_loss = -np.log(sigmoid(np.dot(center, context_word)))
    # Negative pairs: center word + k random words (k = 5-20)
    neg_loss = sum(-np.log(sigmoid(-np.dot(center, neg)))
                   for neg in negative_words)
    return pos_loss + neg_loss

# Reduces O(V=100,000) to O(k=5-20) per training step

The Famous Result: Vector Arithmetic

# Word2Vec learns semantic relationships as vector operations
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") - vec("France") + vec("Germany") ≈ vec("Berlin")
vec("walking") - vec("walk") + vec("swim") ≈ vec("swimming")
# Dimensions: 300 (not 100,000). Dense: most values are non-zero.
vec("cat") = [0.21, -0.04, 0.67, 0.15, ...] # 300 dimensions
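The analogy mechanism can be illustrated with hand-crafted 2-D vectors. This is a toy sketch: real Word2Vec vectors are 300-D and learned, and the axes here (gender, royalty) are chosen by hand so the arithmetic works out:

```python
import numpy as np

# Toy 2-D embeddings (axes: gender, royalty), hand-crafted for illustration
vecs = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

def analogy(a, b, c):
    """a - b + c -> nearest remaining word by cosine, excluding the inputs."""
    target = vecs[a] - vecs[b] + vecs[c]
    def cos(v):
        return np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
    candidates = {w: cos(v) for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

print(analogy("king", "man", "woman"))  # queen
```

Excluding the input words from the candidate set matters: in real embedding spaces, the nearest neighbor of `king - man + woman` is often `king` itself.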
| Dimensions | Use Case | Quality | Memory (1M words) |
|---|---|---|---|
| 50-100 | Small datasets (<100K sentences) | Basic semantic capture | 200-400 MB |
| 200-300 | General NLP (Google's default) | Sweet spot | 800 MB - 1.2 GB |
| 768+ | Only with massive data | Diminishing returns for W2V | 3+ GB |
Critical Limitation: One Vector Per Word

Word2Vec assigns one vector per word, regardless of context. "Bank" (river) = "bank" (finance). This is the polysemy problem — and it's exactly what BERT's contextual embeddings solve.


3. GloVe and FastText — Incremental Improvements

GloVe (2014) — Stanford

Combined Word2Vec's local context with global co-occurrence statistics. Builds a word-word co-occurrence matrix from the entire corpus and factorizes it. The key insight: the ratio of co-occurrence probabilities encodes meaning.

P(solid|ice) / P(solid|steam) = large → "solid" relates more to "ice"
P(water|ice) / P(water|steam) ≈ 1 → "water" relates equally

FastText (2016) — Facebook

Extended Word2Vec with character n-grams. "Running" → [<ru, run, unn, nni, nin, ing, ng>]. The word vector is the sum of all n-gram vectors. Can generate embeddings for out-of-vocabulary words and handle misspellings.

"whre" shares subwords with "where" → similar vector!
Great for morphologically rich languages (Finnish, Turkish)
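The n-gram decomposition is simple to sketch. FastText actually uses n = 3 to 6 by default; trigrams alone are enough to show the idea:

```python
def char_ngrams(word, n=3):
    """FastText-style character n-grams with boundary markers < and >."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']

# The misspelling still shares boundary trigrams with the correct word,
# so its vector (the sum of its n-gram vectors) lands nearby:
shared = set(char_ngrams("whre")) & set(char_ngrams("where"))
print(shared)
```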

4. The Transformer Revolution: BERT (2018)

BERT (Bidirectional Encoder Representations from Transformers) changed everything. Unlike Word2Vec, BERT produces contextual embeddings where the same word gets different vectors depending on context: "I sat by the river bank" produces a nature-related vector, while "I went to the bank to deposit money" produces a finance-related vector.

BERT is pre-trained on massive corpora (Wikipedia + BookCorpus, 3.3B words), bidirectional (reads text in both directions simultaneously, unlike GPT which is left-to-right), and fine-tunable for specific downstream tasks.

| Variant | Layers | Hidden Size | Parameters | Training |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 110M | 4 days, 16 TPUs |
| BERT-large | 24 | 1024 | 340M | 4 days, 64 TPUs |
Problem for Search

BERT produces per-token embeddings (one vector per word). To compare two sentences, you'd need to feed them jointly through BERT — this is O(N²) for N documents. Finding similar pairs in 10K sentences would take 65 hours with raw BERT.
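The 65-hour figure is easy to sanity-check: scoring every pair of 10K sentences with a jointly-fed BERT means roughly 50 million forward passes.

```python
# Back-of-envelope for the 65-hour figure
n = 10_000
pairs = n * (n - 1) // 2          # unique sentence pairs
print(pairs)                      # 49995000

hours = 65
per_second = pairs / (hours * 3600)
print(round(per_second))          # ~214 joint BERT inferences per second
```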


5. Sentence Transformers — SBERT (2019)

Sentence-BERT (Reimers & Gurevych, 2019) solved the per-token problem by training Siamese/Triplet networks that produce fixed-size vectors for entire sentences. The result: semantically similar sentences are close in vector space. The speedup is transformative: 10,000x faster than pure BERT for finding similar pairs in 10K sentences (65 hours → 5 seconds).

sbert_usage.py
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# Internally:
# 1. Tokenize input sentence
# 2. Pass through transformer -> per-token embeddings
# 3. Apply POOLING -> single vector:
#    - Mean pooling (average all tokens)  <- default, usually best
#    - CLS pooling (use [CLS] token vector)
#    - Max pooling (element-wise max across tokens)
query_vec = model.encode("How to fix a flat tire")   # shape: (384,)
doc_vec = model.encode("Steps for puncture repair")  # shape: (384,)
# Cosine similarity: ~0.82 (high match despite zero word overlap!)
similarity = np.dot(query_vec, doc_vec) / (
    np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
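The mean-pooling step can be sketched in plain numpy. This is a simplified version: real implementations use the tokenizer's attention mask in exactly this role, to keep padding tokens out of the average.

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Mean pooling: average token vectors, ignoring padding positions."""
    mask = attention_mask[:, None].astype(float)   # shape: [tokens, 1]
    summed = (token_embs * mask).sum(axis=0)       # sum of real-token vectors
    count = mask.sum()                             # number of real tokens
    return summed / count

# 4 token embeddings (the last one is padding), 384-dim
tokens = np.random.randn(4, 384)
mask = np.array([1, 1, 1, 0])
sentence_vec = mean_pool(tokens, mask)             # shape: (384,)
```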

6. Training Loss Functions for Search Embeddings

The choice of loss function critically affects embedding quality. The dominant paradigm for search embeddings is contrastive learning: push the query close to its relevant document and away from all other documents in the batch.

MNR / InfoNCE

THE dominant loss for search. Input: (anchor, positive) pairs only — negatives come from other pairs in the batch. All other positives act as negatives. Larger batch = more negatives = better training signal.

batch_size=1024 → 1023 negatives per query
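A minimal numpy sketch of the in-batch-negatives idea. Training frameworks compute this on GPU, often with a learned temperature; 0.05 here is just a typical value:

```python
import numpy as np

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """In-batch negatives: doc i is the positive for query i;
    every other doc in the batch acts as a negative."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = q @ d.T / temperature                 # [batch, batch] similarities
    # Softmax cross-entropy with the diagonal as the correct class
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
queries = rng.standard_normal((8, 384))
docs = queries + 0.01 * rng.standard_normal((8, 384))  # near-duplicate positives
loss = info_nce_loss(queries, docs)                    # small: positives dominate
```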

Triplet Loss

Input: (anchor, positive, negative) triplets with explicit negative mining. A margin parameter controls minimum separation. Requires curating negatives, more setup than MNR.

d(a, pos) + margin < d(a, neg)
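The margin constraint translates directly into a hinge loss; a minimal sketch using Euclidean distance:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge form of d(a, pos) + margin < d(a, neg): loss is zero once
    the negative is at least `margin` farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
# Negative is far, positive is close -> constraint satisfied, zero loss:
print(triplet_loss(a, np.array([0.9, 0.1]), np.array([-1.0, 0.0])))  # 0.0
```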

Cosine Similarity Loss

Input: (sentence_A, sentence_B, similarity_score) triples. Good for Semantic Textual Similarity (STS) tasks. Less effective for retrieval than MNR.

loss = MSE(cos(a, b), target_score)

Hard Negative Mining

Simple random negatives are too easy — the model doesn't learn fine distinctions. Hard negatives are documents that are topically close but not relevant: for "How to train a puppy," a hard negative is "How to train a machine learning model" — similar words, different meaning.

hard_negatives.py
def mine_hard_negatives(query, relevant_docs, index, model, top_k=30):
    """Use the model's own retrieval to find tricky negatives."""
    query_vec = model.encode(query)
    candidates = index.search(query_vec, k=top_k)
    # Keep candidates NOT in the relevant set
    hard_negatives = [c for c in candidates
                      if c.doc_id not in relevant_docs]
    # These are docs the model THINKS are relevant but aren't
    # -> training on these forces finer distinctions
    return hard_negatives[:5]
BM25 negatives

Top BM25 results not marked relevant — lexically similar but semantically different.

Cross-encoder negatives

Use teacher model to score all pairs, pick highest-scoring non-relevant. Most effective but expensive.


7. Modern Embedding Models & MTEB Benchmark

The Massive Text Embedding Benchmark (MTEB) evaluates models across 8 task types and 56+ datasets. It's the standard leaderboard for comparing embedding models. Always benchmark on your own domain data too — MTEB scores don't always predict domain-specific performance.

| Model | Dims | MTEB Avg | Speed | Params |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 56.3 | Very Fast | 22M |
| all-mpnet-base-v2 | 768 | 57.8 | Medium | 109M |
| e5-large-v2 | 1024 | 62.2 | Slow | 335M |
| gte-large-en-v1.5 | 1024 | 65.4 | Slow | 434M |
| nomic-embed-text-v1.5 | 768 | 62.3 | Fast | 137M |
| text-embedding-3-small | 1536 | | API | |

8. Distance Metrics & Dimension Reduction

| Metric | Range | When to Use |
|---|---|---|
| Cosine Similarity | [-1, 1] | Default for text. Ignores magnitude, measures angle only. |
| Dot Product | (-∞, +∞) | If L2-normalized, equivalent to cosine. |
| Euclidean (L2) | [0, +∞) | Sometimes for images. Sensitive to magnitude. |
| Manhattan (L1) | [0, +∞) | Sparse vectors, binary features. |

Important: if vectors are L2-normalized (‖v‖ = 1), cosine similarity = dot product. Most embedding models output normalized vectors, so cosine and dot product give identical rankings. Store normalized vectors and use dot product for speed — it's a single numpy.dot() call.
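A quick check of the equivalence (384 dims chosen to match the MiniLM model above):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(384)
b = rng.standard_normal(384)
a /= np.linalg.norm(a)   # L2-normalize once, at indexing time
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)       # no norms needed at query time
print(np.isclose(cosine, dot))  # True
```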

Matryoshka Embeddings (2022)

The modern approach to dimension reduction: train the model to be useful at any prefix length. The full 768-dim vector is most accurate, but you can truncate to 256 dims for 3x memory savings with only 2-5% quality loss. Remarkably, you just take the first N dimensions — no PCA or retraining needed.

# Matryoshka: flexible dimension usage
# Full:      [0.21, -0.04, 0.67, ..., 0.15]  # 768 dims (3072 bytes)
# Truncated: [0.21, -0.04, 0.67, ..., 0.33]  # 256 dims (1024 bytes, 3x less)
# Further:   [0.21, -0.04, 0.67, ..., 0.88]  # 64 dims  (256 bytes, 12x less)
truncated_vec = full_vec[:256]  # Just take the first 256 dims!
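One practical caveat, shown as a sketch: a prefix of a unit vector is no longer unit-length, so re-normalize after truncating before using dot product as cosine similarity.

```python
import numpy as np

def truncate(vec, dims):
    """Matryoshka truncation: keep the first `dims` entries, then
    re-normalize so dot product is again a valid cosine similarity."""
    v = vec[:dims]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
full = rng.standard_normal(768)
full /= np.linalg.norm(full)      # unit-length 768-dim embedding

small = truncate(full, 256)
print(small.shape)                # (256,)
```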
| Technique | How | Quality Loss |
|---|---|---|
| PCA | Linear projection to lower dims | 5-10% at 256-dim |
| Matryoshka | Truncate prefix from trained model | 2-5% at 256-dim |
| Random projection | Preserve distances via JL lemma | 5-15% |
| Scalar quantization | float32 → int8 (reduces bytes, not dims) | 2-5% |

Key Takeaways

01

Sparse → Dense Revolution

From 100K-dim sparse TF-IDF vectors (mostly zeros) to compact 300-768 dim dense vectors that encode meaning. Word2Vec (2013) proved the distributional hypothesis: words in similar contexts get similar vectors.

02

Context Changes Everything (BERT, 2018)

Word2Vec gives one vector per word ('bank' = same vector in all contexts). BERT produces contextual embeddings where the same word gets different vectors depending on surrounding text — solving the polysemy problem.

03

Contrastive Training Is the Key

Modern search embeddings use MNR/InfoNCE loss: push query close to its relevant doc, push it away from all other docs in the batch. Batch size is critical — 1024 gives 1023 negatives per query, dramatically improving training signal.

04

Matryoshka Embeddings Cut Costs at Source

Train the model to be useful at any prefix length: truncate 768→256 dims for 3x memory savings with only 2-5% quality loss. This compounds with product quantization (PQ) for even greater savings. The modern approach to dimension reduction.

05

Fine-Tune for Your Domain

General models confuse domain terminology: legal 'consideration' ≠ 'thoughtfulness'. Even 500-1000 domain-specific training pairs significantly improve retrieval quality. Use MNR loss + hard negatives mined from BM25.