Embeddings 101
Embeddings transform text into fixed-length vectors of floats where geometric proximity encodes semantic similarity. This chapter traces the evolution from sparse bag-of-words to modern dense embeddings, explaining architectures, training objectives, and practical considerations for choosing and deploying embedding models.
100K→768: from sparse TF-IDF (100K dims) to dense SBERT (768 dims), a ~130x compression with better quality.
10,000x: finding similar pairs among 10K sentences takes BERT 65 hours and SBERT 5 seconds.
110M: BERT-base has 12 layers, hidden size 768, and 110M parameters; trained for 4 days on 16 TPUs.
1. From Sparse to Dense: The Evolution
Bag-of-Words / TF-IDF (Sparse)
Traditional representations are sparse vectors with dimensions equal to the vocabulary size. A vocabulary of 100,000 words produces 100,000-dimensional vectors where each dimension corresponds to a word, and most values are zero. The word "cat" is as far from "kitten" as it is from "airplane" — every word is orthogonal to every other word. This is the fundamental limitation that dense embeddings solve.
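The orthogonality problem can be seen directly with a toy one-hot vocabulary (a minimal sketch; the words and the three-word vocabulary are invented for illustration):

```python
import numpy as np

# Toy vocabulary: each word is its own axis in a |V|-dimensional space.
vocab = {"cat": 0, "kitten": 1, "airplane": 2}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every distinct word is orthogonal to every other: similarity is always 0,
# whether the words are related ("cat"/"kitten") or not ("cat"/"airplane").
print(cosine(one_hot("cat", vocab), one_hot("kitten", vocab)))    # 0.0
print(cosine(one_hot("cat", vocab), one_hot("airplane", vocab)))  # 0.0
```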
2. Word2Vec (2013) — Static Word Embeddings
Google's Word2Vec introduced dense, low-dimensional word representations (typically 100-300 dimensions). The core idea is the distributional hypothesis: words appearing in similar contexts tend to have similar meanings. "Coffee" and "tea" both appear near "drink," "cup," "hot" — so they get similar vectors.
CBOW (Continuous Bag-of-Words): predict the center word from its surrounding context words. Faster training, better for frequent words. Window: 5-10 words.
Skip-gram: predict the context words from a center word. Slower, but better for rare words, since each sentence yields more training examples.
Negative Sampling — Making Training Feasible
The full softmax over 100K+ vocabulary is O(V) per training step — prohibitively expensive. Negative sampling turns it into a binary classification problem: for each positive (word, context) pair, sample 5-20 random "negative" words and train the model to distinguish real context from noise. This reduces complexity from O(V) to O(k) where k=5-20.
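The binary-classification objective can be sketched in a few lines of numpy (a toy forward pass only; the random vectors stand in for learned embeddings, and the gradient step is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 50, 5  # embedding size, negatives per positive pair

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context, negatives):
    """Binary classification: real (center, context) pair vs k noise words.
    Cost is O(k) dot products instead of an O(V) softmax."""
    pos = -np.log(sigmoid(center @ context))           # pull the real pair together
    neg = -np.log(sigmoid(-negatives @ center)).sum()  # push noise words away
    return pos + neg

center = rng.normal(size=dim)
context = rng.normal(size=dim)
negatives = rng.normal(size=(k, dim))  # k words drawn from a noise distribution
loss = neg_sampling_loss(center, context, negatives)
print(loss)  # finite scalar; its gradients are what Word2Vec optimizes
```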
The Famous Result: Vector Arithmetic
Word2Vec's vectors support analogies via simple arithmetic: vector("king") - vector("man") + vector("woman") lands closest to vector("queen"). Directions in the space capture relations such as gender, verb tense, and country-capital.
| Dimensions | Use Case | Quality | Memory (1M words) |
|---|---|---|---|
| 50-100 | Small datasets (<100K sentences) | Basic semantic capture | 200-400 MB |
| 200-300 | General NLP (Google's default) | Sweet spot | 800 MB - 1.2 GB |
| 768+ | Only with massive data | Diminishing returns for W2V | 3+ GB |
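The famous analogy vector("king") - vector("man") + vector("woman") ≈ vector("queen") can be sketched with hand-crafted toy vectors (real models learn such directions from data; here axis 0 plays the role of a gender direction and axis 1 a royalty direction):

```python
import numpy as np

# Toy 3-dim "embeddings" crafted so the analogy works by construction.
vectors = {
    "king":  np.array([ 1.0, 1.0, 0.1]),
    "queen": np.array([-1.0, 1.0, 0.1]),
    "man":   np.array([ 1.0, 0.0, 0.2]),
    "woman": np.array([-1.0, 0.0, 0.2]),
    "apple": np.array([ 0.0, -1.0, 1.0]),
}

def nearest(target, vectors, exclude):
    """Return the word whose vector has the highest cosine with target."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

target = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(target, vectors, exclude={"king", "man", "woman"}))  # queen
```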
Word2Vec assigns one vector per word, regardless of context. "Bank" (river) = "bank" (finance). This is the polysemy problem — and it's exactly what BERT's contextual embeddings solve.
3. GloVe and FastText — Incremental Improvements
GloVe (2014) — Stanford
Combined Word2Vec's local context with global co-occurrence statistics. Builds a word-word co-occurrence matrix from the entire corpus and factorizes it. The key insight: the ratio of co-occurrence probabilities encodes meaning.
P(solid|ice) / P(solid|steam) ≫ 1 → "solid" distinguishes ice; P(gas|ice) / P(gas|steam) ≪ 1 → "gas" distinguishes steam; P(water|ice) / P(water|steam) ≈ 1 → "water" relates equally to both.
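The ratio insight can be checked with toy co-occurrence counts (the numbers below are invented for illustration, echoing the GloVe paper's ice/steam example):

```python
import numpy as np

# Toy (word, context) co-occurrence counts harvested from a pretend corpus.
# The RATIOS, not the raw probabilities, separate relevant from irrelevant words.
contexts = ["solid", "gas", "water", "fashion"]
counts = {
    "ice":   np.array([190.0, 7.0, 300.0, 2.0]),
    "steam": np.array([2.0, 220.0, 310.0, 2.0]),
}
p_ice = counts["ice"] / counts["ice"].sum()
p_steam = counts["steam"] / counts["steam"].sum()
ratios = p_ice / p_steam
for w, r in zip(contexts, ratios):
    print(f"P({w}|ice) / P({w}|steam) = {r:.2f}")
# "solid" >> 1 (distinguishes ice), "gas" << 1 (distinguishes steam),
# "water" and "fashion" ~ 1 (relate equally to both, or to neither)
```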
FastText (2016) — Facebook
Extended Word2Vec with character n-grams. "Running" → [<ru, run, unn, nni, nin, ing, ng>]. The word vector is the sum of all n-gram vectors. Can generate embeddings for out-of-vocabulary words and handle misspellings.
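The n-gram decomposition is easy to reproduce (a sketch of the FastText scheme with boundary markers; FastText itself uses n = 3 to 6 plus a token for the whole word):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6):
    """Character n-grams with '<' and '>' boundary markers, as in FastText.
    The whole-word token <word> is appended as well."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]

print(char_ngrams("running", 3, 3))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>', '<running>']
```

The word vector is the sum of these n-gram vectors, so an unseen word like "runninng" still decomposes into mostly known n-grams, which is what gives FastText its out-of-vocabulary robustness.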
Great for morphologically rich languages (Finnish, Turkish)
4. The Transformer Revolution: BERT (2018)
BERT (Bidirectional Encoder Representations from Transformers) changed everything. Unlike Word2Vec, BERT produces contextual embeddings where the same word gets different vectors depending on context: "I sat by the river bank" produces a nature-related vector, while "I went to the bank to deposit money" produces a finance-related vector.
BERT is pre-trained on massive corpora (Wikipedia + BookCorpus, 3.3B words), bidirectional (reads text in both directions simultaneously, unlike GPT which is left-to-right), and fine-tunable for specific downstream tasks.
| Variant | Layers | Hidden Size | Parameters | Training |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 110M | 4 days, 16 TPUs |
| BERT-large | 24 | 1024 | 340M | 4 days, 64 TPUs |
BERT produces per-token embeddings (one vector per token, not per sentence). To compare two sentences you must feed the pair jointly through BERT, so finding all similar pairs among N sentences costs O(N²) forward passes: about 50 million pairs for 10K sentences, roughly 65 hours with raw BERT.
5. Sentence Transformers — SBERT (2019)
Sentence-BERT (Reimers & Gurevych, 2019) solved the per-token problem by training Siamese/Triplet networks that produce fixed-size vectors for entire sentences. The result: semantically similar sentences are close in vector space. The speedup is transformative: 10,000x faster than pure BERT for finding similar pairs in 10K sentences (65 hours → 5 seconds).
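The speedup comes from encoding each sentence once and then batching all pairwise similarities into a single matrix multiply. A numpy sketch with random stand-in embeddings (a real pipeline would obtain `emb` from e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)`; n is scaled down from the chapter's 10K example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 2000, 384  # 384 = all-MiniLM-L6-v2 output size

# Stand-ins for SBERT sentence embeddings: each sentence is encoded ONCE
# (O(n) forward passes) instead of running BERT on every pair (O(n^2)).
emb = rng.normal(size=(n, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize

# All n^2 cosine similarities in one matmul (cosine == dot product here).
sim = emb @ emb.T
np.fill_diagonal(sim, -np.inf)  # ignore trivial self-similarity
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(f"most similar pair: ({i}, {j}), cosine={sim[i, j]:.3f}")
```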
6. Training Loss Functions for Search Embeddings
The choice of loss function critically affects embedding quality. The dominant paradigm for search embeddings is contrastive learning: push the query close to its relevant document and away from all other documents in the batch.
MNR / InfoNCE
The dominant loss for search embeddings (Multiple Negatives Ranking, a variant of InfoNCE). Input: (anchor, positive) pairs only; no negatives are labeled, because every other pair's positive in the batch acts as a negative. Larger batch = more negatives = better training signal.
batch_size=1024 → 1023 negatives per query
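A minimal numpy sketch of MNR with in-batch negatives (forward pass only; the scale factor of 20 is a commonly used choice, and real implementations also compute gradients):

```python
import numpy as np

def mnr_loss(q, d, scale=20.0):
    """Multiple Negatives Ranking loss over a batch of (query, positive-doc)
    embedding pairs. Row i's positive is d[i]; every other row's document is
    an in-batch negative, giving batch_size - 1 negatives per query."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = scale * (q @ d.T)  # (B, B) scaled cosine similarity matrix
    # Cross-entropy with the diagonal (the true positive) as the target class:
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
B, dim = 8, 32
q = rng.normal(size=(B, dim))
good = mnr_loss(q, q.copy())                   # perfectly aligned pairs: low loss
bad = mnr_loss(q, rng.normal(size=(B, dim)))   # random pairings: high loss
print(good, bad)
```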
Triplet Loss
Input: (anchor, positive, negative) triplets with explicit negative mining. A margin parameter controls minimum separation. Requires curating negatives, more setup than MNR.
d(a, pos) + margin < d(a, neg)
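The margin constraint translates directly into a hinge loss (a sketch with Euclidean distance; toy 2-dim vectors and margin=0.5 are illustrative choices):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on distances: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
print(triplet_loss(a, np.array([0.9, 0.1]), np.array([-1.0, 0.0])))  # 0.0
print(triplet_loss(a, np.array([-1.0, 0.0]), np.array([0.9, 0.1])))  # > 0
```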
Cosine Similarity Loss
Input: (sentence_A, sentence_B, similarity_score) triples. Good for Semantic Textual Similarity (STS) tasks. Less effective for retrieval than MNR.
loss = MSE(cos(a, b), target_score)
Hard Negative Mining
Simple random negatives are too easy — the model doesn't learn fine distinctions. Hard negatives are documents that are topically close but not relevant: for "How to train a puppy," a hard negative is "How to train a machine learning model" — similar words, different meaning.
BM25 mining: take top BM25 results not marked relevant. They are lexically similar but semantically different, which is exactly what makes them hard.
Cross-encoder mining: use a teacher model (cross-encoder) to score candidate pairs and keep the highest-scoring non-relevant documents. Most effective, but expensive.
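The mining loop reduces to "rank candidates, drop the relevant ones, keep the top of what remains." A sketch where word overlap stands in for BM25 (the function name and toy documents are illustrative; in practice you would score with a real BM25 index or a cross-encoder):

```python
def mine_hard_negatives(query, docs, relevant_ids, k=2):
    """Return indices of the k top-scoring documents NOT marked relevant.
    Word-overlap scoring is a toy stand-in for BM25."""
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    ranked = sorted(range(len(docs)), key=lambda i: score(docs[i]), reverse=True)
    return [i for i in ranked if i not in relevant_ids][:k]

docs = [
    "how to train a puppy to sit",            # relevant
    "how to train a machine learning model",  # lexically close, wrong topic
    "best dog food brands",
    "how to train for a marathon",            # lexically close, wrong topic
]
hard = mine_hard_negatives("how to train a puppy", docs, relevant_ids={0})
print(hard)  # indices of the most confusable non-relevant documents
```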
7. Modern Embedding Models & MTEB Benchmark
The Massive Text Embedding Benchmark (MTEB) evaluates models across 8 task types and 56+ datasets. It's the standard leaderboard for comparing embedding models. Always benchmark on your own domain data too — MTEB scores don't always predict domain-specific performance.
| Model | Dims | MTEB Avg | Speed | Params |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 56.3 | Very Fast | 22M |
| all-mpnet-base-v2 | 768 | 57.8 | Medium | 109M |
| e5-large-v2 | 1024 | 62.2 | Slow | 335M |
| gte-large-en-v1.5 | 1024 | 65.4 | Slow | 434M |
| nomic-embed-text-v1.5 | 768 | 62.3 | Fast | 137M |
| text-embedding-3-small | 1536 | — | API | — |
8. Distance Metrics & Dimension Reduction
| Metric | Range | When to Use |
|---|---|---|
| Cosine Similarity | [-1, 1] | Default for text. Ignores magnitude, measures angle only. |
| Dot Product | (-∞, +∞) | Fastest; equivalent to cosine when vectors are L2-normalized. |
| Euclidean (L2) | [0, +∞) | Sometimes for images. Sensitive to magnitude. |
| Manhattan (L1) | [0, +∞) | Sparse vectors, binary features. |
Important: if vectors are L2-normalized (‖v‖ = 1), cosine similarity = dot product. Most embedding models output normalized vectors, so cosine and dot product give identical rankings. Store normalized vectors and use dot product for speed — it's a single numpy.dot() call.
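The equivalence is easy to verify (a sketch with random stand-in vectors: normalize once at indexing time, then a plain dot product at query time is the cosine similarity):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# L2-normalize once when indexing...
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# ...then the raw dot product equals the cosine similarity at query time.
print(cosine(a, b), a_n @ b_n)
```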
Matryoshka Embeddings (2022)
The modern approach to dimension reduction: train the model to be useful at any prefix length. The full 768-dim vector is most accurate, but you can truncate to 256 dims for 3x memory savings with only 2-5% quality loss. Remarkably, you just take the first N dimensions and re-normalize; no PCA or retraining is needed.
| Technique | How | Quality Loss |
|---|---|---|
| PCA | Linear projection to lower dims | 5-10% at 256-dim |
| Matryoshka | Truncate prefix from trained model | 2-5% at 256-dim |
| Random projection | Preserve distances via JL lemma | 5-15% |
| Scalar quantization | float32 → int8 (not dims) | 2-5% |
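Matryoshka truncation is a two-line operation (a sketch with random stand-in embeddings; the re-normalization step keeps cosine / dot-product search valid, and the quality claims only hold for models actually trained with a Matryoshka objective):

```python
import numpy as np

def truncate_matryoshka(emb, n_dims):
    """Keep the first n_dims coordinates of each embedding, then re-normalize.
    For non-Matryoshka models, naive truncation loses far more quality."""
    t = emb[:, :n_dims]
    return t / np.linalg.norm(t, axis=1, keepdims=True)

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

small = truncate_matryoshka(emb, 256)
print(small.shape)                 # (1000, 256)
print(small.nbytes / emb.nbytes)   # one third of the original memory
```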
Key Takeaways
Sparse → Dense Revolution
From 100K-dim sparse TF-IDF vectors (mostly zeros) to compact 300-768 dim dense vectors that encode meaning. Word2Vec (2013) proved the distributional hypothesis: words in similar contexts get similar vectors.
Context Changes Everything (BERT, 2018)
Word2Vec gives one vector per word ('bank' = same vector in all contexts). BERT produces contextual embeddings where the same word gets different vectors depending on surrounding text — solving the polysemy problem.
Contrastive Training Is the Key
Modern search embeddings use MNR/InfoNCE loss: push query close to its relevant doc, push it away from all other docs in the batch. Batch size is critical — 1024 gives 1023 negatives per query, dramatically improving training signal.
Matryoshka Embeddings Cut Costs at Source
Train the model to be useful at any prefix length: truncate 768→256 dims for 3x memory savings with only 2-5% quality loss. This compounds with product quantization (PQ) for even greater savings. The modern approach to dimension reduction.
Fine-Tune for Your Domain
General models confuse domain terminology: legal 'consideration' ≠ 'thoughtfulness'. Even 500-1000 domain-specific training pairs significantly improve retrieval quality. Use MNR loss + hard negatives mined from BM25.