Chapter 3.4: Indexing & Infrastructure
Vector Indices (HNSW)
Moving beyond keyword matching. Vector search finds semantically similar content by navigating high-dimensional space using graph-based algorithms.
Meaning as Mathematical Coordinates
How do you explain the meaning of "Queen" to a computer? You don't. You convert it into coordinates. Modern embedding models (like OpenAI's text-embedding-3 or BERT) convert text into "vectors": fixed-length arrays of numbers that position concepts in a multi-dimensional semantic space. In this space, similarity isn't about shared letters; it's about proximity. "Cat" and "Kitten" are neighbors, even though they share no characters.
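As a toy illustration, "proximity" is usually measured with cosine similarity. The 3-dimensional vectors below are hand-made for the example, not real model output (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d "embeddings" for illustration only.
cat    = [0.90, 0.80, 0.10]
kitten = [0.85, 0.75, 0.20]
car    = [0.10, 0.20, 0.90]

print(cosine_similarity(cat, kitten))  # high: semantic neighbors
print(cosine_similarity(cat, car))     # low: unrelated concepts
```

No characters are shared between "cat" and "kitten"; only their positions in the space matter.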
The Brute Force Problem
To find the closest match in this space, the naive approach is to measure the distance from your query vector to every single document vector in your database. This is called "exact kNN". It works perfectly for small datasets, but the math is heavy. With 100 million documents and 768 dimensions per vector, a single query requires ~76 billion floating-point operations. That takes seconds, far too slow for interactive search.
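Exact kNN is trivially simple to write, which is exactly the trap: the cost is O(N × d) per query. A minimal sketch on toy 2-d data:

```python
import math

def exact_knn(query, documents, k=3):
    """Brute-force kNN: score every document against the query.
    Cost is O(N * d) -- fine for thousands of vectors, hopeless for millions."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(range(len(documents)), key=lambda i: dist(query, documents[i]))
    return ranked[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
print(exact_knn([1.0, 0.05], docs, k=2))  # indices of the 2 closest -> [0, 1]
```

Every query touches every document; there is no index structure to skip work, which is what HNSW fixes below.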
HNSW: Hierarchical Navigable Small World
To solve the scalability problem, we trade a tiny bit of accuracy for massive speed. HNSW is the state-of-the-art algorithm for this. It builds a multi-layered graph, similar to a highway system. The top layers are "expressways" with few stops (long links) that let you jump across the semantic universe quickly. As you get closer to your destination, you drop down to "local roads" (dense connections) to find the exact neighborhood.
What is Actually Stored? (HNSW Internals)
It's easy to think of an index as "just the vectors". But HNSW is a graph. On disk, it stores:
- The raw floating-point arrays (e.g., 768 floats per document). Heavy on storage.
- Adjacency lists connecting each node to its M closest neighbors. The graph approximates a Delaunay triangulation.
How is it built? One vector at a time. When you insert a document, the system "searches" for its nearest neighbors in the existing graph and draws edges to them. This greedy construction is fast but approximate.
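A heavily simplified sketch of this greedy build, on a single layer (real HNSW adds the layer hierarchy and smarter neighbor-selection heuristics; the class and parameter names here are invented for illustration):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class GreedyGraph:
    """Single-layer navigable graph: insert = greedy search + connect."""
    def __init__(self, m=4):
        self.m = m            # max connections drawn per insert
        self.vectors = []     # node id -> vector
        self.edges = []       # node id -> list of neighbor ids

    def _search(self, query):
        """Greedy walk from node 0: move to a strictly closer neighbor
        until none exists (a local minimum)."""
        current = 0
        while True:
            best = min(self.edges[current], default=current,
                       key=lambda i: dist(query, self.vectors[i]))
            if dist(query, self.vectors[best]) < dist(query, self.vectors[current]):
                current = best
            else:
                return current

    def insert(self, vector):
        node = len(self.vectors)
        self.vectors.append(vector)
        self.edges.append([])
        if node == 0:
            return
        entry = self._search(vector)
        # Connect to (up to) m nearest nodes among the entry point and its
        # neighbors -- greedy and approximate, as described above.
        candidates = sorted(set([entry] + self.edges[entry]),
                            key=lambda i: dist(vector, self.vectors[i]))[:self.m]
        for c in candidates:
            self.edges[node].append(c)
            self.edges[c].append(node)

g = GreedyGraph(m=2)
for v in [[0, 0], [1, 0], [0, 1], [5, 5], [5, 6]]:
    g.insert(v)
print(g._search([5, 5.5]))  # lands in the [5,5] / [5,6] cluster
```

Because each insert only consults the existing graph locally, construction is fast but the resulting edges are approximate, never globally optimal.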
Where It Fits in the Search System
Critical Concept: Vector Search is for Recall, not Ranking.
A common misconception is that vector search replaces the whole search engine. It doesn't. It replaces the Recall Stage (generating candidates).
Vector Search in Elasticsearch
Elasticsearch supports vector search natively through the dense_vector field type. You need to generate embeddings externally (using models like sentence-transformers or OpenAI) and store them alongside your documents. Here's how to configure the mapping and run a kNN query.
```json
PUT /products
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 768,            // Match your model output
        "index": true,          // Enable HNSW indexing
        "similarity": "cosine"  // or "dot_product", "l2_norm"
      }
    }
  }
}
```

```json
POST /products/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.12, 0.45, -0.33, ...],  // 768 dims
    "k": 10,                // Return top 10
    "num_candidates": 100   // Speed vs. accuracy
  }
}
```

Higher `num_candidates` = more accurate but slower:
- `num_candidates: 50` → fast, ~85% recall
- `num_candidates: 200` → slower, ~98% recall
HNSW Parameters (Tuning Guide)
HNSW has three knobs that control the "Triangle of Trade-offs": Speed, Accuracy (Recall), and Memory. Understanding these is key to tuning your cluster.
m (Max Connections)
Build-time parameter. Analogy: the number of friends each person keeps in their phone book.
- High M (e.g., 64): Each node connects to many neighbors, so you can reach a destination in fewer hops. Great recall, but the index consumes significantly more RAM per document.
- Low M (e.g., 16): Each node knows few others. Small index footprint and fast updates, but you might get "stuck" in a local neighborhood and miss the true nearest neighbor.
ef_construction
Build-time parameter. Analogy: how thorough you are when vetting new friends.
- High ef (e.g., 200): When adding a document, the system scans 200 candidates to find the absolute optimal `M` connections. The graph quality is pristine, but indexing CPU usage spikes.
- Low ef (e.g., 50): We just grab the first decent neighbors we find. Indexing is fast, but the graph is "messy", leading to potentially worse search results later.
ef_search / num_candidates
Query-time parameter. Analogy: how many people you ask for directions at each intersection.
- High (e.g., 100): You carefully explore many possible paths. You almost certainly find the target, but latency increases.
- Low (e.g., 10): You rush down the first promising path. Ultra-fast, but you might mistakenly arrive at a local optimum (incorrect match) rather than the global optimum.
| Parameter | What It Does | Trade-off |
|---|---|---|
| M | Max connections per node | Higher = better recall, more memory |
| ef_construction | Beam width during build | Higher = better graph, slower build |
| num_candidates (ef) | Candidates to explore at query time | Higher = better recall, slower search |
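The role of `ef` can be sketched as a beam width in best-first search over the graph. This is a toy single-layer version with an invented function name; real HNSW runs this per layer and with more refined termination rules:

```python
import heapq
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def beam_search(vectors, edges, query, entry=0, ef=3, k=1):
    """Best-first graph search keeping up to `ef` live candidates.
    Larger ef explores more of the graph: better recall, more work."""
    visited = {entry}
    candidates = [(dist(query, vectors[entry]), entry)]  # min-heap to expand
    results = list(candidates)                           # running best-ef list
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest unexpanded candidate is worse than the
        # worst kept result and the result list is already full.
        if d > max(r[0] for r in results) and len(results) >= ef:
            break
        for nb in edges[node]:
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist(query, vectors[nb])
            heapq.heappush(candidates, (nd, nb))
            results.append((nd, nb))
            results = sorted(results)[:ef]  # keep only the ef best so far
    return [n for _, n in sorted(results)[:k]]

# A small chain-shaped graph: node 0 is the entry, node 5 the true answer.
vecs = [[0, 0], [1, 0], [2, 0], [3, 0], [3, 1], [3, 2]]
adj  = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
print(beam_search(vecs, adj, query=[3, 2], ef=2, k=1))  # -> [5]
```

With `ef=1` this degenerates to the pure greedy walk, which is exactly when local-optimum traps appear.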
Hybrid Search: Keywords + Vectors
Why use both? Because they solve opposite problems:
- Keyword Search (BM25) is precise but brittle. It finds "iPhone 15" but misses "Apple Phone".
- Vector Search (HNSW) is broad but fuzzy. It finds "Apple Phone" but might return "Samsung Phone" (semantically close).
- Hybrid Search combines them: vectors for recall (finding concepts) and keywords for precision (matching exact models).
```json
{
  "query": {
    "match": {
      "title": {
        "query": "wireless keyboard",
        "boost": 0.3
      }
    }
  },
  "knn": {
    "field": "title_vector",
    "query_vector": [...],
    "k": 50,
    "boost": 0.7
  }
}
```

The final score is `0.3 × BM25 + 0.7 × vector_similarity`.

| Use Case | Keyword Weight | Vector Weight |
|---|---|---|
| E-commerce (exact models matter) | 0.5 | 0.5 |
| FAQ search (meaning matters) | 0.2 | 0.8 |
| Code search (exact tokens matter) | 0.7 | 0.3 |
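The weighted blend can be sketched in a few lines. One subtlety the sketch makes explicit: BM25 scores are unbounded, so they must be normalized before mixing with cosine similarities (the min-max scheme and the document names here are illustrative assumptions):

```python
def minmax(scores):
    """Normalize raw scores to [0, 1] so BM25 and cosine are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(bm25, vector, kw=0.3, vw=0.7):
    """Linear fusion: final = kw * normalized_BM25 + vw * vector_similarity."""
    bm25 = minmax(bm25)
    docs = set(bm25) | set(vector)
    fused = {d: kw * bm25.get(d, 0.0) + vw * vector.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical candidate scores for the query "wireless keyboard".
bm25_scores = {"doc_a": 12.1, "doc_b": 3.4, "doc_c": 0.5}
vector_sims = {"doc_a": 0.62, "doc_b": 0.91, "doc_d": 0.88}
print(hybrid_rank(bm25_scores, vector_sims))
```

Note that `doc_d` was found only by the vector stage; hybrid fusion lets either retriever contribute candidates the other missed.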
Memory & Quantization
Vectors are heavy. A single 768-dimensional vector of 32-bit floats takes 3KB. Multiply by 100 million documents, and you need 300GB of RAM just for vectors. This gets expensive fast. Quantization solves this by compressing 32-bit floats into 8-bit integers (or even 1-bit binaries), sacrificing a tiny bit of precision for massive memory savings.
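A sketch of scalar int8 quantization using per-vector min/max scaling (real implementations choose their calibration ranges more carefully, e.g. from a sample of the corpus):

```python
def quantize_int8(vector):
    """Map each float to an integer in [0, 255].
    Memory per dimension drops from 4 bytes to 1 byte (4x smaller)."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 if hi != lo else 1.0
    return [round((x - lo) / scale) for x in vector], lo, scale

def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]

vec = [0.12, 0.45, -0.33, 0.08]
codes, lo, scale = quantize_int8(vec)
restored = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(vec, restored))

print(codes)    # small integers instead of 32-bit floats
print(max_err)  # tiny reconstruction error, bounded by scale / 2
```

For the 768-dimension example above, this is 768 bytes per vector instead of 3,072, turning 300GB of float vectors into roughly 75GB.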
"type": "int8_hnsw"Common Pitfalls
Implementing vector search is deceptive. It looks easy ("just index the vectors!"), but the complexity hides in the operational details. Here are the two ways teams most often destroy their search relevance.
Pitfall: Post-Filtering
Running kNN first and applying filters afterwards silently shrinks your result set: retrieve k=10 neighbors, filter out 8 of them, and the user sees 2 results (or none). Pass the filter inside the kNN clause instead, so it is applied during graph traversal (pre-filtering):

```json
"knn": { "filter": ... }
```

Pitfall: Stale Embeddings
Embeddings are a snapshot of both your documents and your model. If documents are edited without re-embedding, or the embedding model is upgraded without reindexing everything, query vectors and document vectors drift apart and relevance quietly degrades. Re-embed on every update, and never mix vectors from different models in the same index.
When Vector Search Fails
Vector search is magical, but it's not a silver bullet. If you remove keyword search entirely, you will likely degrade the user experience in these specific scenarios:
1. Exact SKU/ID Search
Query: "iPhone 15 Pro Max 256GB"
Vectors typically treat "128GB" and "256GB" as near-synonyms. Users searching for parts or IDs need exact character matching.
2. Out-of-Vocabulary Jargon
Query: "X7000-Z Turbo"
If the model has never seen this specific model number, it will hallucinate the closest concept (e.g., "fast car"). Keywords would correctly return 0 results.
3. Short/Ambiguous Queries
Query: "GAP"
A vector model might map this to "Space", "Interval", or "Clothing". Without behavioral data, the vector space is often too broad.
4. Distribution Shift
Models are trained on Wikipedia/Internet data. Your documents are niche technical manuals. If you don't fine-tune, the "meaning" of words might not align with your domain.
Performance Benchmarks
How fast is HNSW? Incredibly fast. Because search is logarithmic `O(log N)`, doubling the data size only adds a tiny constant amount of time. You can search billions of vectors in milliseconds.
Benchmark Spec: 10M products, 768 dimensions (m=16, ef_c=100)
| Configuration | Recall@10 | Latency (p50) |
|---|---|---|
| ef=50 | 92% | 2ms |
| ef=100 | 96% | 4ms |
| ef=200 | 98% | 8ms |
| Brute force | 100% | 500ms |
10M vectors: 4ms • 100M vectors: 8ms • 1B vectors: 15ms
(Logarithmic scaling: 10x data ≠ 10x latency)
Key Takeaways
Meaning encoding
Vectors encode semantic meaning. Similar text produces nearby vectors in high-dimensional space.
HNSW
Uses hierarchical layers to navigate from coarse to fine, achieving O(log N) search.
Tuning
Tune M (recall) and ef (speed) to balance memory, build time, and search latency.
Hybrid Search
Combining BM25 (keyword precision) and vectors (semantic recall) usually yields the best results.
Quantization
Use int8 quantization to reduce memory 4x with minimal quality loss.