Systems Atlas
Chapter 6.6: Vector & Semantic Search

Latency vs Recall Tradeoffs

Every vector search system faces the same fundamental tradeoff: better accuracy costs more time and memory. This chapter quantifies these relationships with concrete benchmarks and provides a decision framework for choosing your operating point.

- HNSW at 98% recall: 1,200 QPS (ef_search=128, 1M vectors, 768-dim; 3x faster than IVF-Flat at the same recall)
- IVF-PQ memory: 17x less (200 MB vs 3.5 GB for HNSW at 1M vectors; the deciding factor at billion scale)
- Two-phase recall: 98%+ (PQ retrieval + full-precision rescore gives PQ memory with near-exact accuracy)

1. Defining the Metrics

Before evaluating search algorithms, you need a rigorous way to measure success. A semantic search system is not a binary "working" or "broken" state; it exists on a spectrum of accuracy, speed, and cost. Engineering a production search system requires defining a Service Level Agreement (SLA) around these exact metrics.

Recall@K measures the quality of your retrieval. QPS measures your system's throughput capacity (which directly translates to infrastructure cost). Latency (P50 and P99) measures the user experience, and Memory measures the physical RAM required to hold your index.

recall_at_k.py
def recall_at_k(approx_results, exact_results, k):
    """
    Exact brute-force returns [A, B, C, D, E, F, G, H, I, J]
    ANN returns               [A, B, C, D, E, F, G, X, Y, Z]
    recall@10 = 7/10 = 0.7 (missed H, I, J)
    """
    approx_set = set(approx_results[:k])
    exact_set = set(exact_results[:k])
    return len(approx_set & exact_set) / k
| Metric      | Measures          | Target           | Why It Matters      |
|-------------|-------------------|------------------|---------------------|
| Recall@10   | Accuracy of top-10| >0.95            | Result quality      |
| QPS         | Throughput        | App-dependent    | Revenue capacity    |
| P50 Latency | Median query time | <10ms            | User experience     |
| P99 Latency | Worst-case 1%     | <50ms            | SLA compliance      |
| Memory/vec  | Index efficiency  | Budget-dependent | Infrastructure cost |
Why P99 Matters More Than P50

At 1000 QPS, P99=100ms means 10 queries per second are painfully slow. HNSW's latency distribution is tight (small P50-P99 spread). IVF can have long tails when queries fall near cluster boundaries and require probing many cells.
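The P50/P99 gap is easy to quantify from raw latency samples. A minimal nearest-rank percentile sketch in plain Python, using a hypothetical latency sample where 1.5% of queries hit a slow path:

```python
def percentile(samples, p):
    """Nearest-rank percentile: p in [0, 100]."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

# Hypothetical sample: 985 fast queries, 15 that probed too many cells
latencies_ms = [0.8] * 985 + [40.0] * 15

p50 = percentile(latencies_ms, 50)  # 0.8 ms: the median looks healthy
p99 = percentile(latencies_ms, 99)  # 40.0 ms: the tail your SLA actually sees
```

The median is blind to the tail; only the high percentiles reveal the boundary-probing queries described above.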


2. Algorithm Benchmark (1M Vectors, 768-dim)

To make the tradeoff concrete, consider a dataset of 1 million 768-dimensional vectors (a standard size for many enterprise applications). The table below compares the performance of exact search (Flat) against the two most popular Approximate Nearest Neighbor algorithms: Inverted File Index (IVF) and Hierarchical Navigable Small World (HNSW).

Notice how HNSW achieves the best combination of recall and QPS, making it the default choice for low-latency applications. However, look at the memory column: HNSW requires roughly the same RAM as exact search, while IVF paired with Product Quantization (IVF-PQ) fits the same 1 million vectors into just 200 MB, trading a small amount of recall for a massive reduction in infrastructure cost.

| Algorithm            | Recall@10 | QPS    | P50    | Memory |
|----------------------|-----------|--------|--------|--------|
| Flat (exact)         | 1.000     | ~50    | ~20ms  | 3 GB   |
| IVF-Flat (nprobe=10) | 0.92      | ~2,000 | ~0.5ms | 3 GB   |
| IVF-Flat (nprobe=64) | 0.98      | ~400   | ~2.5ms | 3 GB   |
| IVF-PQ (nprobe=32)   | 0.85      | ~5,000 | ~0.2ms | 200 MB |
| HNSW (ef=32)         | 0.92      | ~3,500 | ~0.3ms | 3.5 GB |
| HNSW (ef=128)        | 0.98      | ~1,200 | ~0.8ms | 3.5 GB |
| HNSW (ef=256)        | 0.995     | ~500   | ~2ms   | 3.5 GB |
| HNSW (ef=512)        | 0.999     | ~200   | ~5ms   | 3.5 GB |

3. Tuning ef_search (HNSW)

If you choose HNSW as your algorithm, your primary runtime parameter is ef_search (the size of the dynamic candidate list during search). This single integer controls exactly where your system sits on the Latency vs Recall curve.

A small ef_search value makes the algorithm greedy, returning results instantly but often getting stuck in local minima (low recall). As you increase ef_search, the algorithm explores a wider breadth of the graph, finding better matches at the cost of computing more distance comparisons.

# ef_search sweep: recall vs latency
ef_search = 16: Recall@10 ≈ 0.85 Latency ≈ 0.15ms (fast but misses a lot)
ef_search = 32: Recall@10 ≈ 0.92 Latency ≈ 0.3ms (acceptable for many apps)
ef_search = 64: Recall@10 ≈ 0.96 Latency ≈ 0.5ms (good for production)
ef_search = 128: Recall@10 ≈ 0.98 Latency ≈ 1ms (high quality)
ef_search = 256: Recall@10 ≈ 0.995 Latency ≈ 2ms (near-perfect)
ef_search = 512: Recall@10 ≈ 0.999 Latency ≈ 5ms (virtually exact)

Rule of thumb: Set ef_search ≥ K (results requested). For production, 4-10× K is a good balance. For top-10 search, ef_search=64-128 is common. The curve has diminishing returns: going from ef=32 to ef=128 (4x) improves recall by 6%, but ef=256 to ef=512 (2x) improves by only 0.4%.
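The rule of thumb above can be captured in a tiny helper. This is a sketch of the heuristic, not a library API; choose_ef_search and its default multiplier are illustrative:

```python
def choose_ef_search(k, multiplier=8):
    """ef_search must be >= k; 4-10x k is the production heuristic above."""
    return max(k, multiplier * k)

choose_ef_search(10)                 # 80, inside the common 64-128 band for top-10
choose_ef_search(10, multiplier=4)   # 40, cheaper but lower recall
choose_ef_search(100, multiplier=1)  # 100, never below k itself
```

In practice you would sweep the multiplier against a held-out ground-truth set and pick the smallest value that meets your recall SLA.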


4. Tuning nprobe (IVF)

If you choose an Inverted File (IVF) index to save memory, your primary runtime parameter is nprobe. IVF works by clustering your vectors into hundreds or thousands of "buckets" (Voronoi cells) during training.

At search time, a query vector is first compared to the centroids of all buckets. nprobe determines how many of the closest buckets are actually opened and searched. Setting nprobe=1 is incredibly fast but highly inaccurate: if the true nearest neighbor happens to sit just across the boundary in an adjacent cell, the algorithm will never see it.

# nprobe sweep: recall vs latency
nprobe = 1: Recall@10 ≈ 0.45 (useless — misses most neighbors)
nprobe = 4: Recall@10 ≈ 0.75 (still too low for production)
nprobe = 8: Recall@10 ≈ 0.85 (acceptable for best-effort)
nprobe = 32: Recall@10 ≈ 0.95 (good production setting)
nprobe = 64: Recall@10 ≈ 0.97 (high quality)
nprobe = 128: Recall@10 ≈ 0.99 (near-perfect)

Why nprobe=1 is terrible: Up to 50% of true nearest neighbors can lie in neighboring Voronoi cells. The nprobe/nlist interaction matters: with nlist=1024, nprobe=32 searches 3.1% of clusters; with nlist=65536, nprobe=32 searches only 0.05% — usually insufficient. Heuristic: nprobe should search at least 1% of clusters.
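The 1%-of-clusters heuristic translates directly into code. A sketch, with min_nprobe as a hypothetical helper name:

```python
import math

def min_nprobe(nlist, coverage=0.01):
    """Smallest nprobe that probes at least `coverage` of the nlist cells."""
    return max(1, math.ceil(coverage * nlist))

min_nprobe(1024)    # 11 cells: nprobe=32 comfortably clears the bar
min_nprobe(65536)   # 656 cells: nprobe=32 covers only 0.05%, far too few
```

This is why nlist and nprobe must be tuned together: doubling nlist without raising nprobe silently halves the fraction of the index you search.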


5. Memory vs Accuracy: Quantization Levels

Graph algorithms like HNSW solve the CPU bottleneck, but they do nothing to solve the memory bottleneck. A billion 768-dimensional float32 vectors require enormous amounts of RAM, making uncompressed storage prohibitively expensive for large-scale applications.

Quantization is the process of compressing these vectors by reducing precision. You can cast floats to smaller data types (float16 or int8) to halve or quarter your memory footprint, or you can use Product Quantization (PQ) to compress vectors into tiny byte arrays. Every layer of compression saves RAM and increases speed, but introduces mathematical noise that lowers maximum recall.
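For intuition, here is a minimal symmetric int8 scalar-quantization sketch. It is pure Python and illustrative only; production libraries vectorize this and often use per-block or per-dimension scales:

```python
def quantize_int8(vec):
    """Symmetric scalar quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid div-by-zero on all-zero vec
    codes = [round(x / scale) for x in vec]        # 1 byte per dim instead of 4
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats; the rounding error is the 'recall loss'."""
    return [c * scale for c in codes]

vec = [0.12, -0.5, 0.33]
codes, scale = quantize_int8(vec)   # [30, -127, 84]
approx = dequantize(codes, scale)   # close to vec, at 4x less storage
```

The rounding step is exactly where the 2-5% recall loss in the table comes from: two distinct vectors can collapse onto the same codes.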

| Method        | Bytes/vec (768d) | Recall Loss  | Memory (1B vecs) | Speed Impact |
|---------------|------------------|--------------|------------------|--------------|
| float32       | 3,072            | — (baseline) | 3,000 GB         | Baseline     |
| float16       | 1,536            | <0.5%        | 1,500 GB         | 1.5x faster  |
| int8 (scalar) | 768              | 2-5%         | 750 GB           | 3x faster    |
| PQ (M=96)     | 96               | 5-15%        | 96 GB            | 8x faster    |
| Binary quant  | 96               | 10-30%       | 96 GB            | 32x faster   |
Recall losses are additive

If HNSW achieves 98% recall with float32, the same index with int8 might achieve 95%, and with PQ it might be 88%. Planning must account for both quantization loss AND approximate search loss.
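Treating the two losses as independent multiplicative factors gives a quick planning estimate. This is back-of-envelope arithmetic, not a guarantee; the loss factors below are mid-range guesses from the table above:

```python
# ANN loss and quantization loss compound multiplicatively (rough model).
hnsw_recall = 0.98                         # approximate-search ceiling (ef=128)
int8_combined = hnsw_recall * (1 - 0.03)   # ~3% scalar-quant loss -> ~0.95
pq_combined = hnsw_recall * (1 - 0.10)     # ~10% PQ loss          -> ~0.88
```

If your SLA says recall >= 0.95, this arithmetic rules out plain PQ before you run a single benchmark, and motivates the two-phase rescore below.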

Two-Phase Search: Oversample and Rescore

two_phase_search.py
def two_phase_search(query, compressed_index, full_vectors, k=10, oversample=10):
    # Phase 1: fast search with PQ/int8 in RAM → top k×oversample candidates
    candidates = compressed_index.search(query, k=k * oversample)  # top 100
    # Phase 2: load full-precision vectors from SSD and re-score exactly
    rescored = []
    for doc_id in candidates:
        full_vec = full_vectors.load(doc_id)  # from SSD: ~0.1ms per vector
        exact_score = cosine_similarity(query, full_vec)
        rescored.append((exact_score, doc_id))
    return sorted(rescored, reverse=True)[:k]

# SSD cost: 100 vecs × 3 KB = 300 KB ≈ 0.5ms on NVMe
# Result: PQ-level memory with 98%+ recall@10

6. Decision Framework

Given the myriad combinations of algorithms (HNSW, IVF), quantization methods (int8, PQ), and architectures (Two-Phase Search), choosing the right stack can feel paralyzing. The table below provides a practical, heuristic-based framework for selecting an architecture based on your specific use case, dataset scale, and latency requirements.

The general rule is: use Flat search until latency hurts, use HNSW until RAM gets too expensive, and then switch to IVF-PQ or Two-Phase search for massive scale.
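That rule can be sketched as a decision function. pick_index and its thresholds are illustrative, not a library API; tune the cutoffs to your own hardware and SLA:

```python
def pick_index(n_vectors, ram_budget_gb, dim=768):
    """Heuristic from the rule above: flat -> HNSW -> IVF-PQ as scale grows."""
    raw_gb = n_vectors * dim * 4 / 1e9    # float32 footprint of the raw vectors
    if n_vectors < 100_000:
        return "flat"                     # exact search is still fast enough
    if raw_gb * 1.2 <= ram_budget_gb:     # ~20% overhead for the HNSW graph
        return "hnsw"
    return "ivf-pq + two-phase rescore"   # RAM-bound: compress, then rescore

pick_index(50_000, ram_budget_gb=8)           # 'flat'
pick_index(10_000_000, ram_budget_gb=64)      # 'hnsw' (~31 GB raw, ~37 GB indexed)
pick_index(1_000_000_000, ram_budget_gb=256)  # 'ivf-pq + two-phase rescore'
```

The 20% graph-overhead factor roughly matches the 3 GB vs 3.5 GB gap in the benchmark table; substitute your own measured overhead.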

| Application         | Scale    | Recall Target | Latency | Recommended            |
|---------------------|----------|---------------|---------|------------------------|
| Product search      | 1M-50M   | >0.95         | <20ms   | HNSW (ef=128)          |
| Document search     | 100K-10M | >0.90         | <50ms   | HNSW (ef=64)           |
| RAG / LLM retrieval | 10K-1M   | >0.98         | <100ms  | HNSW (ef=256) or flat  |
| Image similarity    | 10M-1B   | >0.90         | <10ms   | IVF-PQ + rescore       |
| Recommendations     | 100M-10B | >0.85         | <20ms   | IVF-PQ, sharded        |
| Duplicate detection | 1M-100M  | >0.99         | <1sec   | Flat on batches        |
Don't Over-Engineer Small Data

Below 100K vectors, flat (exact) search takes <10ms. Adding an ANN index introduces complexity without meaningful benefit. Even at 100K vectors × 768 dims, exact brute-force stays well within acceptable latency bounds.

Key Takeaways

01

The Fundamental Tradeoff

Better accuracy (higher recall) costs more time and memory. This isn't a limitation — it's a mathematical property of ANN search. The art is finding the operating point that balances your application's requirements.

02

HNSW Dominates Speed-Recall Frontier

At every recall level, HNSW achieves the best QPS. At recall=0.98, HNSW delivers ~1200 QPS vs IVF-Flat's ~400 QPS — 3x faster. But IVF-PQ uses 17x less memory, which matters at billion scale.

03

Diminishing Returns Above 95% Recall

ef_search 32→128 (4x increase) improves recall 0.92→0.98 (6% gain). ef_search 256→512 (2x increase) improves recall 0.995→0.999 (0.4% gain). Most of the easy recall is cheap; the last few percent are exponentially expensive.

04

Two-Phase Search Is the Production Standard

Retrieve top k×oversample using compressed index (PQ), then re-score those candidates with full-precision vectors from SSD. Result: PQ-level memory with 98%+ recall@10. SSD cost: 100 × 3KB = 300KB ≈ 0.5ms on NVMe.

05

P99 Matters More Than P50

At 1000 QPS, P99=100ms means 10 queries per second are painfully slow. HNSW has tight P50-P99 spread; IVF can have long tails when queries fall near cluster boundaries.