Systems Atlas
Chapter 6.4: Vector & Semantic Search

Vector Indexing Basics

Brute-force search is O(N×d) per query — impossibly slow at scale. This chapter covers the fundamental indexing strategies: IVF (Inverted File), PQ (Product Quantization), and IVF-PQ — including Voronoi boundaries, ADC vs SDC distance computation, and FAISS configuration.

Brute Force at 1B: ~13 min
768B distance operations at 1B ops/sec per query. Completely unusable.

IVF Speedup: ~100x
Search 10 of 1024 clusters = skip 99% of the data with ~1% recall loss.

PQ Compression: 32x
768-dim float32 (3072 B) → 96-byte PQ code. 1B vectors: 3 TB → 96 GB.

1. The Brute-Force Baseline

The simplest approach: compare the query against every vector in the database. This gives exact results but costs O(N × d) per query. At scale, this quickly becomes impractical.

brute_force.py
def cosine_distance(a, b):
    """1 - cosine similarity (helper the chapter's pseudocode assumes)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_search(query_vec, all_vectors, k=10):
    """Compare query against EVERY vector. O(N × d) per query."""
    distances = []
    for i, vec in enumerate(all_vectors):
        dist = cosine_distance(query_vec, vec)
        distances.append((dist, i))
    distances.sort()
    return distances[:k]
| N (docs) | d (dims) | Distance ops | Time | Verdict |
|----------|----------|--------------|------|---------|
| 10K | 768 | 7.68M | ~8 ms | ✅ Fine |
| 100K | 768 | 76.8M | ~77 ms | ⚠️ Borderline |
| 1M | 768 | 768M | ~768 ms | ❌ Too slow |
| 100M | 768 | 76.8B | ~77 s | ❌ Unusable |
| 1B | 768 | 768B | ~768 s | ❌ Absurd |
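The table's timings follow from a single back-of-the-envelope formula. A minimal sketch (the helper name and the 1B-ops/sec throughput figure are assumptions taken from the table's own arithmetic):

```python
# Hypothetical helper: estimate brute-force query cost, assuming roughly
# 1e9 distance operations per second on a single core (as in the table).
def brute_force_cost(n_docs: int, dims: int, ops_per_sec: float = 1e9):
    """Return (total distance ops, estimated seconds per query)."""
    ops = n_docs * dims  # one op per dimension per document
    return ops, ops / ops_per_sec

ops, secs = brute_force_cost(1_000_000_000, 768)
print(f"{ops:.3g} ops, ~{secs:.0f} s per query")  # 7.68e+11 ops, ~768 s per query
```

Plugging in the other rows of the table reproduces the same ms-to-minutes progression.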
Three Strategies to Beat Brute Force

1. Partition the space → only search relevant regions (IVF). 2. Compress the vectors → smaller, faster distance computation (PQ). 3. Build a graph → navigate to neighbors efficiently (HNSW, Chapter 6.5).


2. IVF (Inverted File Index)

Partition the vector space with k-means. Each cluster centroid defines a Voronoi cell: the region of space closer to that centroid than to any other. Every vector belongs to exactly one cell. At query time, search only the closest nprobe clusters out of nlist total.

# Voronoi cells in 2D (conceptual): four cells C1, C2, C3, C4 around their centroids
ivf_search.py
def ivf_search(query_vec, centroids, inverted_lists, n_probe=10, k=10):
    """Speedup ≈ nlist / nprobe (e.g., 1024/10 = 100x)."""
    # Find nearest centroids (fast: compare query to nlist centroids only).
    # find_k_nearest / cosine_distance: exact-distance helpers assumed here.
    nearest = find_k_nearest(query_vec, centroids, k=n_probe)
    # Search only the selected clusters (skip ~99% of the data)
    candidates = []
    for cluster_id in nearest:
        for doc_id, vec in inverted_lists[cluster_id]:
            dist = cosine_distance(query_vec, vec)
            candidates.append((dist, doc_id))
    candidates.sort()
    return candidates[:k]

The Boundary Problem

The fundamental weakness of IVF: a query near a cluster boundary may have true nearest neighbors in an adjacent cell that isn't probed. With nprobe=1, you search only the nearest cluster and miss neighbors just across the boundary. Mitigations: a higher nprobe (diminishing returns above ~5% of clusters), multi-probe variants that also visit cells adjacent to the best one, or over-clustering (more clusters → smaller cells → fewer vectors lost per boundary miss).
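The boundary miss is easy to reproduce in two dimensions. A minimal NumPy sketch with hypothetical data (not FAISS): two Voronoi cells, and a query whose true nearest neighbor sits just across the boundary in the cell that nprobe=1 would skip.

```python
import numpy as np

# Toy 2D illustration of the IVF boundary problem (hypothetical data).
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])   # two Voronoi cells, boundary at x = 5
vectors = np.array([
    [1.0, 0.0],   # belongs to cell 0
    [6.0, 0.0],   # belongs to cell 1, but sits close to the boundary
])
assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids, axis=2), axis=1)

query = np.array([4.5, 0.0])                      # just inside cell 0
cell_order = np.argsort(np.linalg.norm(centroids - query, axis=1))
# nprobe = 1 searches only cell_order[0] = cell 0 -> finds vector 0 (distance 3.5),
# but the true nearest neighbor is vector 1 in cell 1 (distance 1.5). Missed.
true_nn = int(np.argmin(np.linalg.norm(vectors - query, axis=1)))
```

With nprobe=2 both cells are probed and the true neighbor is recovered, which is exactly the recall/latency dial the section describes.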

| Dataset Size | Recommended nlist | Rationale |
|--------------|-------------------|-----------|
| <1M | 4 × √N (≈ 4K) | Rule of thumb for small datasets |
| 1M - 10M | 16 × √N (≈ 16K-50K) | Finer partitioning |
| 10M - 1B | 65,536 - 262,144 | Typical production values |
| >1B | 1M+ | Very fine partitioning |
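The table's rules of thumb can be collapsed into one helper. A sketch (the function name and the exact cutoffs within each band are illustrative choices, not a FAISS API):

```python
import math

# Hypothetical helper encoding the nlist rules of thumb from the table above.
def recommended_nlist(n_vectors: int) -> int:
    if n_vectors < 1_000_000:
        return int(4 * math.sqrt(n_vectors))    # small: 4 * sqrt(N)
    if n_vectors < 10_000_000:
        return int(16 * math.sqrt(n_vectors))   # medium: 16 * sqrt(N)
    if n_vectors < 1_000_000_000:
        return 65_536                           # a typical production value in this range
    return 1_048_576                            # billion-plus: very fine partitioning

recommended_nlist(500_000)   # -> 2828
```

Remember that every extra cluster also adds one centroid comparison per query, so nlist trades coarse-pass cost against fine-pass cost.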

3. PQ (Product Quantization)

Compress vectors by splitting them into sub-vectors and quantizing each independently. A 768-dim vector (3072 bytes) is split into 96 sub-vectors of 8 dims each. For each 8-dim sub-space, learn 256 centroids via k-means. Replace each sub-vector with its nearest centroid ID (1 byte). Result: 3072 bytes → 96 bytes = 32x compression.

# PQ Compression Pipeline
1. Split: the original float32 vector (768 dims, 3072 bytes) is divided into M=96 sub-vectors of 8 dims each, e.g. sub-vec 1 = [0.21, -0.04, 0.67, ...] through sub-vec 96 = [0.15, 0.88, -0.32, ...].
2. Quantize: for each sub-space, learn 256 centroids via k-means, then replace each sub-vector with its nearest centroid's 1-byte ID (e.g. sub-vec 1 → ID 42, ..., sub-vec 96 → ID 199).
3. Code: the final PQ code is 96 bytes total, e.g. [42, 187, 5, 220, ..., 73, 18, 199].
| Scale | float32 | PQ (M=96) | Savings |
|-------|---------|-----------|---------|
| 1M vectors | 3 GB | 96 MB | 32x |
| 100M vectors | 300 GB | 9.6 GB | 32x |
| 1B vectors | 3 TB | 96 GB | 32x |
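The encoding step is a few lines of NumPy. A minimal sketch in which the codebooks are random for illustration (in practice each sub-space's 256 centroids come from k-means on training data, as described above):

```python
import numpy as np

# PQ encoding sketch. Codebooks are random here purely for illustration;
# real codebooks are learned with k-means per sub-space.
rng = np.random.default_rng(0)
d, M = 768, 96
d_sub = d // M                                  # 8 dims per sub-vector
codebooks = rng.normal(size=(M, 256, d_sub))    # 256 centroids per sub-space

def pq_encode(vec: np.ndarray) -> np.ndarray:
    """3072-byte float32 vector -> 96-byte code of centroid IDs."""
    subs = vec.reshape(M, d_sub)                # split into M sub-vectors
    # for each sub-space, pick the nearest of its 256 centroids
    dists = np.linalg.norm(codebooks - subs[:, None, :], axis=2)  # (M, 256)
    return dists.argmin(axis=1).astype(np.uint8)  # one byte per sub-vector

code = pq_encode(rng.normal(size=d).astype(np.float32))
print(code.nbytes)   # 96 bytes, vs 3072 for the original float32 vector
```

Decoding (for a reconstruction) is the reverse lookup, `codebooks[np.arange(M), code]`, which is why PQ is lossy: you get back the centroid, not the original sub-vector.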

ADC vs SDC Distance Computation

ADC (Asymmetric) — Preferred

Query stays uncompressed. Pre-compute the distance from each query sub-vector to all 256 centroids (M lookup tables). Then: per-vector distance = 96 table lookups instead of 768 multiplications. ~8x faster, with higher accuracy than SDC.

Per-query pre-comp: O(M × 256 × d_sub) — negligible
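The table-lookup trick is concrete in a few lines. A sketch with the chapter's shapes (M=96 sub-spaces, 256 centroids, 8 dims each); the codebooks are random stand-ins for learned ones:

```python
import numpy as np

# ADC sketch: one-time-per-query lookup tables, then each PQ code is scored
# with M table lookups instead of a full d-dimensional distance computation.
M, K, d_sub = 96, 256, 8
rng = np.random.default_rng(1)
codebooks = rng.normal(size=(M, K, d_sub))      # stand-in for learned codebooks

def adc_tables(query: np.ndarray) -> np.ndarray:
    """Squared distance from each query sub-vector to all 256 centroids."""
    subs = query.reshape(M, 1, d_sub)
    return ((codebooks - subs) ** 2).sum(axis=2)    # shape (M, 256)

def adc_distance(tables: np.ndarray, code: np.ndarray) -> float:
    """Per-vector distance = M table lookups, summed. No multiplications."""
    return float(tables[np.arange(M), code].sum())

query = rng.normal(size=M * d_sub)
tables = adc_tables(query)                      # O(M * 256 * d_sub), once per query
code = rng.integers(0, K, size=M).astype(np.uint8)
dist = adc_distance(tables, code)
```

Because squared L2 distance decomposes as a sum over sub-spaces, the lookup sum equals the exact distance between the full query and the reconstructed (decoded) vector.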

SDC (Symmetric) — Faster but Less Accurate

Both query AND database vectors are quantized. Distances come from pre-calculated centroid-to-centroid tables (built once at index time). Faster for streaming but loses accuracy because the query is also compressed.

Pre-comp: M × 256 × 256 distances — one-time

OPQ: Optimized Product Quantization

Standard PQ splits vectors into contiguous blocks, but correlations between dimensions may span block boundaries. OPQ learns an optimal rotation of the vector space before quantization, reducing cross-dimension correlations. Result: 5-15% recall improvement at the cost of ~2x longer training time.


4. IVF-PQ: The Billion-Scale Combination

Combine IVF's cluster pruning with PQ's compression. The key insight: PQ is applied to residual vectors (vector - centroid), not the original vectors. Residuals are smaller in magnitude and more uniformly distributed, so PQ works better on them.

IVF-PQ Pipeline

Training Phase

1. k-means → nlist centroids (Voronoi partition)

2. Compute residual = vector - nearest_centroid

3. Train PQ codebooks on residuals

4. Store PQ-encoded residuals in inverted lists

Query Phase

1. Find nearest centroids (nprobe)

2. Compute query_residual = query - centroid

3. ADC search within cluster using residual PQ codes

4. Return top-k across all probed clusters

faiss_ivfpq.py
import faiss
nlist = 4096 # Number of Voronoi cells
M = 96 # Number of sub-vectors for PQ
nbits = 8 # Bits per sub-vector (256 centroids per sub-space)
d = 768 # Vector dimension
# Create quantizer + IVF-PQ index
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)
index.train(training_vectors) # Learns centroids + PQ codebooks
index.add(all_vectors) # Assigns + PQ-encodes residuals
index.nprobe = 64 # Search 64 of 4096 clusters
distances, indices = index.search(query_vectors, k=10)
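Why quantize residuals rather than raw vectors? A NumPy sketch with hypothetical clustered data (not FAISS internals) shows the effect: within a cluster, (vector - centroid) has a much smaller norm than the vector itself, so the same PQ codebook budget covers a much tighter range of values.

```python
import numpy as np

# Hypothetical clustered data: points scattered with unit variance around a
# centroid that sits far from the origin (as real cluster members do).
rng = np.random.default_rng(2)
centroid = rng.normal(size=768) * 10.0
vectors = centroid + rng.normal(size=(1000, 768))

residuals = vectors - centroid
print(np.linalg.norm(vectors, axis=1).mean())     # large: dominated by the centroid
print(np.linalg.norm(residuals, axis=1).mean())   # small: roughly sqrt(768) ~ 28
```

The residual distribution is also roughly centered and similar across clusters, which is why a single set of PQ codebooks trained on residuals works well for all inverted lists.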

5. FAISS Index Decision Matrix

| Index | Memory/Vec (768d) | Recall@10 | Best For |
|-------|-------------------|-----------|----------|
| Flat | 3,072 B | 100% | Ground truth, <100K vecs |
| IVF-Flat | 3,072 B | 95-99% | Medium datasets, high accuracy |
| IVF-PQ | ~96 B | 85-95% | Billion-scale, memory-constrained |
| IVF-PQ+refine | ~96 B + disk | 95-98% | Best accuracy/memory tradeoff |
| HNSW (Ch 6.5) | ~3,500 B | 98-99% | High recall, RAM available |
practical_config.py
# Small (<100K): Flat — don't overthink it
index = faiss.IndexFlatL2(768)
# Medium (100K-10M): IVF-Flat for accuracy
quantizer = faiss.IndexFlatL2(768)
index = faiss.IndexIVFFlat(quantizer, 768, n_clusters)
# Large (10M-1B): IVF-PQ for memory efficiency
index = faiss.IndexIVFPQ(quantizer, 768, 65536, 96, 8)
# Billion-scale: IVF-PQ with OPQ preprocessing
index = faiss.index_factory(768, "OPQ96_768,IVF262144,PQ96")

Key Takeaways

01

Brute Force Is Exact but Unscalable

O(N×d) per query: fine below ~100K vectors, borderline around 100K, and far too slow above 1M. At 1B vectors with 768 dims, a single query takes ~768 seconds.

02

IVF: Partition → Search Subsets

K-means splits the vector space into Voronoi cells. At query time, only search nprobe cells out of nlist total → ~100x speedup. The boundary problem (nearest neighbor in adjacent cell) is mitigated by increasing nprobe.

03

PQ: 32x Compression via Subspace Quantization

Split 768-dim vector into 96 sub-vectors of 8 dims each. Learn 256 centroids per sub-space. Replace each sub-vector with a 1-byte centroid ID. Result: 3072 bytes → 96 bytes.

04

ADC Is Always Preferred over SDC

Asymmetric distance computation keeps the query at full float32 precision and only quantizes database vectors. The accuracy gain over SDC (both quantized) is significant and the per-query table computation is negligible.

05

IVF-PQ: The Billion-Scale Default

Combine IVF cluster pruning + PQ compression. Key: PQ is applied to residual vectors (vector - centroid), not originals. Residuals are smaller and more uniform → PQ works better. FAISS makes this production-ready.