Latency vs Recall Tradeoffs
Every vector search system faces the same fundamental tradeoff: better accuracy costs more time and memory. This chapter quantifies these relationships with concrete benchmarks and provides a decision framework for choosing your operating point.
**1,200 QPS**: ef_search=128, 1M vectors, 768-dim. 3x faster than IVF-Flat at the same recall.
**17x less memory**: 200 MB vs 3.5 GB for HNSW (1M vectors). The deciding factor at billion scale.
**98%+ recall**: PQ retrieval + full-precision rescore, giving PQ-level memory with near-exact accuracy.
1. Defining the Metrics
Before evaluating search algorithms, you need a rigorous way to measure success. A semantic search system is not simply "working" or "broken"; it sits somewhere on a spectrum of accuracy, speed, and cost. Engineering a production search system means defining a Service Level Agreement (SLA) around these exact metrics.
Recall@K measures the quality of your retrieval. QPS measures your system's throughput capacity (which directly translates to infrastructure cost). Latency (P50 and P99) measures the user experience, and Memory measures the physical RAM required to hold your index.
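Recall@K in particular is worth pinning down precisely. The numpy sketch below (synthetic data, brute-force ground truth; a minimal illustration, not a production harness) computes it as the average overlap between the true and retrieved top-K ID sets:

```python
import numpy as np

def recall_at_k(ground_truth: np.ndarray, retrieved: np.ndarray, k: int = 10) -> float:
    """Fraction of true top-k neighbors present in the retrieved top-k,
    averaged over all queries. Both arrays are (n_queries, k) ID matrices."""
    hits = sum(len(np.intersect1d(gt[:k], r[:k])) for gt, r in zip(ground_truth, retrieved))
    return hits / (len(ground_truth) * k)

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 64)).astype(np.float32)  # database
xq = rng.standard_normal((5, 64)).astype(np.float32)     # queries
dists = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(-1) # squared L2, brute force
exact = np.argsort(dists, axis=1)[:, :10]                # ground-truth top-10 IDs
print(recall_at_k(exact, exact, k=10))                   # exact vs itself: 1.0
```

An ANN index's result IDs would take the place of the second argument; exact search trivially scores 1.0 against itself.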
| Metric | Measures | Target | Why It Matters |
|---|---|---|---|
| Recall@10 | Accuracy of top-10 | >0.95 | Result quality |
| QPS | Throughput | App-dependent | Revenue capacity |
| P50 Latency | Median query time | <10ms | User experience |
| P99 Latency | Worst-case 1% | <50ms | SLA compliance |
| Memory/vec | Index efficiency | Budget-dependent | Infrastructure cost |
At 1000 QPS, P99=100ms means 10 queries per second are painfully slow. HNSW's latency distribution is tight (small P50-P99 spread). IVF can have long tails when queries fall near cluster boundaries and require probing many cells.
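The P50/P99 distinction is easy to see on synthetic data. The sketch below assumes a made-up latency mixture, a tight body plus a 1% heavy tail, purely to illustrate how a tail barely moves the median but dominates the 99th percentile:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic per-query latencies (ms): tight body plus a 1% heavy tail,
# mimicking an IVF index whose boundary queries probe many cells.
body = rng.normal(2.0, 0.3, 9900)
tail = rng.normal(40.0, 10.0, 100)
latencies = np.clip(np.concatenate([body, tail]), 0.1, None)

p50, p99 = np.percentile(latencies, [50, 99])
print(f"P50={p50:.1f}ms  P99={p99:.1f}ms")
```

The same 1% of slow queries would be invisible in a P50 dashboard, which is why SLAs are written against P99.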
2. Algorithm Benchmark (1M Vectors, 768-dim)
To make the tradeoff concrete, consider a dataset of 1 million 768-dimensional vectors (a standard size for many enterprise applications). The table below compares the performance of exact search (Flat) against the two most popular Approximate Nearest Neighbor algorithms: Inverted File Index (IVF) and Hierarchical Navigable Small World (HNSW).
Notice how HNSW achieves the best combination of recall and QPS, making it the default choice for low-latency applications. However, look at the memory column: HNSW requires roughly the same RAM as exact search (plus graph overhead), while IVF paired with Product Quantization (IVF-PQ) fits the same 1 million vectors into just 200 MB, trading a small amount of recall for a massive reduction in infrastructure cost.
| Algorithm | Recall@10 | QPS | P50 | Memory |
|---|---|---|---|---|
| Flat (exact) | 1.000 | ~50 | ~20ms | 3 GB |
| IVF-Flat (nprobe=10) | 0.92 | ~2,000 | ~0.5ms | 3 GB |
| IVF-Flat (nprobe=64) | 0.98 | ~400 | ~2.5ms | 3 GB |
| IVF-PQ (nprobe=32) | 0.85 | ~5,000 | ~0.2ms | 200 MB |
| HNSW (ef=32) | 0.92 | ~3,500 | ~0.3ms | 3.5 GB |
| HNSW (ef=128) | 0.98 | ~1,200 | ~0.8ms | 3.5 GB |
| HNSW (ef=256) | 0.995 | ~500 | ~2ms | 3.5 GB |
| HNSW (ef=512) | 0.999 | ~200 | ~5ms | 3.5 GB |
3. Tuning ef_search (HNSW)
If you choose HNSW as your algorithm, your primary runtime parameter is ef_search (the size of the dynamic candidate list during search). This single integer controls exactly where your system sits on the Latency vs Recall curve.
A small ef_search value makes the algorithm greedy, returning results instantly but often getting stuck in local minima (low recall). As you increase ef_search, the algorithm explores a wider breadth of the graph, finding better matches at the cost of computing more distance comparisons.
Rule of thumb: set ef_search ≥ K (the number of results requested). For production, 4-10× K is a good balance; for top-10 search, ef_search=64-128 is common. The curve has diminishing returns: going from ef=32 to ef=128 (4x) lifts recall from 0.92 to 0.98, but going from ef=256 to ef=512 (2x) only lifts it from 0.995 to 0.999.
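The diminishing returns are easy to see by tabulating the marginal recall gain per step, using the benchmark numbers from the table above:

```python
# Recall@10 at each ef_search setting, from the 1M-vector benchmark table.
bench = {32: 0.92, 128: 0.98, 256: 0.995, 512: 0.999}
efs = sorted(bench)
for lo, hi in zip(efs, efs[1:]):
    gain = bench[hi] - bench[lo]
    print(f"ef {lo:>3} -> {hi:>3} ({hi // lo}x compute): +{gain:.3f} recall")
```

Each step costs more compute and buys less recall, which is the curve you are tuning against.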
4. Tuning nprobe (IVF)
If you choose an Inverted File (IVF) index to save memory, your primary runtime parameter is nprobe. IVF works by clustering your vectors into hundreds or thousands of "buckets" (Voronoi cells) during training.
At search time, a query vector is first compared to the centroids of all buckets. nprobe determines how many of the closest buckets are actually opened and searched. Setting nprobe=1 is incredibly fast but highly inaccurate—if the true nearest neighbor happens to sit just across the boundary in an adjacent cell, the algorithm will never see it.
Why nprobe=1 is terrible: Up to 50% of true nearest neighbors can lie in neighboring Voronoi cells. The nprobe/nlist interaction matters: with nlist=1024, nprobe=32 searches 3.1% of clusters; with nlist=65536, nprobe=32 searches only 0.05% — usually insufficient. Heuristic: nprobe should search at least 1% of clusters.
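A toy IVF index makes the nprobe behavior concrete. The sketch below builds Voronoi cells with a minimal k-means (`kmeans` and `ivf_search` are illustrative helpers written for this demo, not a library API); probing every cell degenerates to exact search, while nprobe=1 can miss neighbors across cell boundaries:

```python
import numpy as np

rng = np.random.default_rng(2)
xb = rng.standard_normal((2000, 32)).astype(np.float32)

def kmeans(x, k, iters=10, seed=0):
    # Minimal Lloyd's k-means, just enough to build Voronoi cells for the demo.
    r = np.random.default_rng(seed)
    cents = x[r.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = ((x[:, None] - cents[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                cents[j] = x[assign == j].mean(0)
    return cents, assign

nlist = 16
cents, assign = kmeans(xb, nlist)
buckets = [np.flatnonzero(assign == j) for j in range(nlist)]

def ivf_search(q, nprobe, k=1):
    # Rank cells by centroid distance, open only the nprobe closest ones.
    order = ((cents - q) ** 2).sum(-1).argsort()[:nprobe]
    cand = np.concatenate([buckets[j] for j in order])
    d = ((xb[cand] - q) ** 2).sum(-1)
    return cand[d.argsort()[:k]]

q = rng.standard_normal(32).astype(np.float32)
exact = ((xb - q) ** 2).sum(-1).argmin()
print(ivf_search(q, nprobe=1), ivf_search(q, nprobe=nlist), exact)
```

With nprobe=nlist every vector is scanned, so the result always matches brute force; the interesting regime is how quickly small nprobe values converge to it.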
5. Memory vs Accuracy: Quantization Levels
Graph algorithms like HNSW solve the CPU bottleneck, but they do nothing to solve the memory bottleneck. A billion 768-dimensional float32 vectors require enormous amounts of RAM, making uncompressed storage prohibitively expensive for large-scale applications.
Quantization is the process of compressing these vectors by reducing precision. You can cast floats to smaller data types (float16 or int8) to halve or quarter your memory footprint, or you can use Product Quantization (PQ) to compress vectors into tiny byte arrays. Every layer of compression saves RAM and increases speed, but introduces mathematical noise that lowers maximum recall.
| Method | Bytes/vec (768d) | Recall Loss | Memory (1B vecs) | Speed Impact |
|---|---|---|---|---|
| float32 | 3,072 | — | 3,000 GB | Baseline |
| float16 | 1,536 | <0.5% | 1,500 GB | 1.5x faster |
| int8 (scalar) | 768 | 2-5% | 750 GB | 3x faster |
| PQ (M=96) | 96 | 5-15% | 96 GB | 8x faster |
| Binary quant | 96 | 10-30% | 96 GB | 32x faster |
If HNSW achieves 98% recall with float32, the same index with int8 might achieve 95%, and with PQ it might be 88%. Planning must account for both quantization loss AND approximate search loss.
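Scalar int8 quantization is simple enough to sketch directly. The example below uses a single global scale for brevity (production systems often use per-dimension or per-block scales) and shows the 4x byte saving together with the bounded reconstruction error that causes the recall loss:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal((10000, 768)).astype(np.float32)

# Symmetric scalar quantization to int8 with one global scale:
# every value maps to an integer in [-127, 127].
scale = np.abs(x).max() / 127.0
q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
x_hat = q.astype(np.float32) * scale  # dequantized approximation

print("bytes/vec float32:", x.shape[1] * 4)  # 3,072
print("bytes/vec int8:   ", x.shape[1] * 1)  # 768
print("max abs error:    ", float(np.abs(x - x_hat).max()))
```

The reconstruction error is bounded by half the quantization step (scale/2); that bounded noise on every coordinate is exactly what nudges some true neighbors out of the approximate top-K.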
Two-Phase Search: Oversample and Rescore
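The idea: retrieve an oversampled candidate set with a compressed index, then re-score only those candidates at full precision. A minimal numpy sketch, using int8 as the "compressed" phase for simplicity (production systems typically use PQ for phase one and fetch full-precision vectors from SSD for phase two):

```python
import numpy as np

rng = np.random.default_rng(4)
xb = rng.standard_normal((5000, 64)).astype(np.float32)
q = rng.standard_normal(64).astype(np.float32)

# Phase 1: approximate distances on a compressed (int8) copy of the index.
scale = np.abs(xb).max() / 127.0
xb_q = np.clip(np.round(xb / scale), -127, 127).astype(np.int8)
approx_d = ((xb_q.astype(np.float32) * scale - q) ** 2).sum(-1)

k, oversample = 10, 10
cand = approx_d.argsort()[: k * oversample]  # keep top k*oversample candidates

# Phase 2: re-score only those candidates at full precision
# (in production, this is the fetch from SSD).
exact_d = ((xb[cand] - q) ** 2).sum(-1)
top_k = cand[exact_d.argsort()[:k]]
print(top_k)
```

Only k×oversample full-precision vectors are ever touched per query, so RAM holds the compressed index while accuracy is recovered by the rescore.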
6. Decision Framework
Given the myriad combinations of algorithms (HNSW, IVF), quantization methods (int8, PQ), and architectures (Two-Phase Search), choosing the right stack can feel paralyzing. The table below provides a practical, heuristic-based framework for selecting an architecture based on your specific use case, dataset scale, and latency requirements.
The general rule is: use Flat search until latency hurts, use HNSW until RAM gets too expensive, and then switch to IVF-PQ or Two-Phase search for massive scale.
| Application | Scale | Recall Target | Latency | Recommended |
|---|---|---|---|---|
| Product search | 1M-50M | >0.95 | <20ms | HNSW (ef=128) |
| Document search | 100K-10M | >0.90 | <50ms | HNSW (ef=64) |
| RAG / LLM retrieval | 10K-1M | >0.98 | <100ms | HNSW (ef=256) or flat |
| Image similarity | 10M-1B | >0.90 | <10ms | IVF-PQ + rescore |
| Recommendations | 100M-10B | >0.85 | <20ms | IVF-PQ, sharded |
| Duplicate detection | 1M-100M | >0.99 | <1sec | Flat on batches |
Below 100K vectors, flat (exact) search takes <10ms, so adding an ANN index introduces complexity without meaningful benefit. Even at 100K × 768 dims, exact brute-force stays well within acceptable latency bounds.
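The general rule can be written down as a tiny heuristic. `choose_index` and its thresholds below are illustrative assumptions, not a standard recipe:

```python
def choose_index(n_vectors: int, ram_budget_gb: float, dim: int = 768) -> str:
    """Heuristic from the framework above: flat until latency hurts,
    HNSW until RAM gets too expensive, IVF-PQ with rescore beyond that.
    The thresholds are illustrative defaults, not universal constants."""
    flat_gb = n_vectors * dim * 4 / 1e9       # float32 footprint of the raw vectors
    if n_vectors < 100_000:
        return "Flat (exact)"                 # ANN adds complexity, no benefit
    if flat_gb * 1.2 <= ram_budget_gb:        # HNSW needs graph overhead on top
        return "HNSW"
    return "IVF-PQ + full-precision rescore"

print(choose_index(50_000, 16))
print(choose_index(5_000_000, 64))
print(choose_index(1_000_000_000, 64))
```

Real decisions also weigh recall targets and latency SLAs (per the table above), but a footprint check like this is usually the first gate.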
Key Takeaways
The Fundamental Tradeoff
Better accuracy (higher recall) costs more time and memory. This isn't a limitation — it's a mathematical property of ANN search. The art is finding the operating point that balances your application's requirements.
HNSW Dominates Speed-Recall Frontier
At every recall level, HNSW achieves the best QPS. At recall=0.98, HNSW delivers ~1200 QPS vs IVF-Flat's ~400 QPS — 3x faster. But IVF-PQ uses 17x less memory, which matters at billion scale.
Diminishing Returns Above 95% Recall
ef_search 32→128 (4x increase) improves recall 0.92→0.98 (6 points). ef_search 256→512 (2x increase) improves recall 0.995→0.999 (0.4 points). Most of the recall is cheap; the last few percent are exponentially expensive.
Two-Phase Search Is the Production Standard
Retrieve the top k×oversample candidates using a compressed index (PQ), then re-score those candidates with full-precision vectors from SSD. Result: PQ-level memory with 98%+ recall@10. SSD cost for 100 candidates: 100 × 3 KB = 300 KB, roughly 0.5ms on NVMe.
P99 Matters More Than P50
At 1000 QPS, P99=100ms means 10 queries per second are painfully slow. HNSW has tight P50-P99 spread; IVF can have long tails when queries fall near cluster boundaries.