Systems Atlas

Chapter 3.4: Indexing & Infrastructure

Vector Indices (HNSW)

Moving beyond keyword matching. Vector search finds semantically similar content by navigating high-dimensional space using graph-based algorithms.


Meaning as Mathematical Coordinates

How do you explain the meaning of "Queen" to a computer? You don't. You convert it into coordinates. Modern embedding models (like OpenAI's text-embedding-3 or BERT) convert text into "vectors": fixed-length arrays of numbers that position concepts in a multi-dimensional semantic space. In this space, similarity isn't about shared letters; it's about proximity. "Cat" and "Kitten" are neighbors, even though they share no characters.

Text → 768-dimensional vector (simplified to 3D)

"cat"    → [0.21, 0.89, -0.52]
"kitten" → [0.23, 0.85, -0.48]  ← Close!
"laptop" → [0.91, -0.12, 0.78]  ← Far apart

[Figure: semantic space with an "Animals" cluster (cat, kitten, dog) separated by a large distance from a "Technology" cluster (laptop, computer).]
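Proximity in this space is usually measured with cosine similarity. A minimal sketch, using the illustrative 3D values above (not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat    = [0.21, 0.89, -0.52]
kitten = [0.23, 0.85, -0.48]
laptop = [0.91, -0.12, 0.78]

print(cosine_similarity(cat, kitten))  # ~0.999 — neighbors
print(cosine_similarity(cat, laptop))  # negative — far apart
```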

The Brute Force Problem

To find the closest match in this space, the naive approach is to measure the distance from your query vector to every single document vector in your database. This is called "Exact kNN". It works perfectly for small datasets, but the math is heavy. With 100 million documents and 768 dimensions per vector, a single query requires ~76 billion floating-point operations. That takes seconds, which is far too slow for search.

Exact kNN (Brute Force)
    for each vector in corpus:
        distance = cosine(query, vector)
    100M vectors × 768 dims → ~5 seconds

Approximate kNN (HNSW)
    navigate a graph of connections:
        ~150 distance calculations
    100M vectors → ~5 milliseconds
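The exact-kNN loop above is easy to write down; the cost is the full pass over the corpus. A toy version (random 8-dimensional vectors stand in for real embeddings):

```python
import heapq, math, random

def exact_knn(query, corpus, k):
    """Brute force: score every vector, keep the top k. O(N × dims)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    # nlargest scans all N scores — this full pass is the bottleneck
    return heapq.nlargest(k, range(len(corpus)),
                          key=lambda i: cosine(query, corpus[i]))

random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
query = corpus[42]
print(exact_knn(query, corpus, 3))  # the query itself (index 42) ranks first
```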

HNSW: Hierarchical Navigable Small World

To solve the scalability problem, we trade a tiny bit of accuracy for massive speed. HNSW is the state-of-the-art algorithm for this. It builds a multi-layered graph, similar to a highway system. The top layers are "expressways" with few stops (long links) that let you jump across the semantic universe quickly. As you get closer to your destination, you drop down to "local roads" (dense connections) to find the exact neighborhood.

Layer 2: Express (long jumps)
Layer 1: Regional
Layer 0: All nodes (local connections)

Search starts at the top layer (big jumps), then drills down (smaller jumps).
Result: ~150 distance comparisons instead of 100,000,000.
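The layered descent can be sketched with plain adjacency dicts. This toy version is heavily simplified (greedy walk with beam width 1, points on a number line); real HNSW keeps an ef-sized candidate heap at layer 0:

```python
import math

def hnsw_descent(query, vectors, layers, entry):
    """Simplified HNSW search: on each layer, greedily hop to whichever
    neighbor is closest to the query; when no neighbor improves, drop
    down a layer. `layers` lists adjacency dicts from top to bottom."""
    current = entry
    for graph in layers:
        improved = True
        while improved:
            improved = False
            for neighbor in graph.get(current, []):
                if math.dist(vectors[neighbor], query) < math.dist(vectors[current], query):
                    current, improved = neighbor, True
        # no closer neighbor on this layer: drill down to the next one
    return current

# Toy index: 6 points on a line; layer 1 has express links, layer 0 is dense.
vectors = {i: [float(i)] for i in range(6)}
layers = [
    {0: [3], 3: [0, 5], 5: [3]},                        # layer 1: long jumps
    {i: [max(i - 1, 0), min(i + 1, 5)] for i in range(6)},  # layer 0: local links
]
print(hnsw_descent([4.2], vectors, layers, entry=0))  # → 4, the nearest node
```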

What is Actually Stored? (HNSW Internals)

It's easy to think of an index as "just the vectors". But HNSW is a graph. On disk, it stores:

Nodes (Vectors)

The raw floating-point arrays (e.g., 768 floats per document). Heavy on storage.

Edges (Proximity)

Adjacency lists connecting each node to its M closest neighbors. The graph approximates a Delaunay Triangulation.

How is it built? One vector at a time. When you insert a document, the system "searches" for its nearest neighbors in the existing graph and draws edges to them. This greedy construction is fast but approximate.
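A minimal sketch of that insertion step, using a single layer and a brute-force neighbor scan as a stand-in for the graph walk (real HNSW reuses its own search here, and also prunes each neighbor's list back down to M):

```python
import math

def insert(graph, vectors, new_id, new_vec, M=2):
    """Greedy HNSW-style insertion: find the M closest existing nodes
    and draw bidirectional edges to them."""
    neighbors = sorted(vectors, key=lambda i: math.dist(vectors[i], new_vec))[:M]
    vectors[new_id] = new_vec
    graph[new_id] = list(neighbors)
    for n in neighbors:
        graph[n].append(new_id)  # edges are stored both ways (adjacency lists)
        # real HNSW would also prune n's list back to M here

vectors = {0: [0.0], 1: [1.0], 2: [5.0]}
graph = {0: [1], 1: [0, 2], 2: [1]}
insert(graph, vectors, new_id=3, new_vec=[0.9])
print(graph[3])  # linked to its two nearest existing nodes: 1 and 0
```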

Where It Fits in the Search System

Critical Concept: Vector Search is for Recall, not Ranking.

A common misconception is that vector search replaces the whole search engine. It doesn't. It replaces the Recall Stage (generating candidates).

1. Recall: fetch top 1,000 candidates via vectors (meaning) + keywords (exact)
2. Blend: normalize scores and merge lists (RRF or linear combination)
3. Rank: apply business rules (In Stock?) and LTR models
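The blend stage's RRF option is simple enough to show directly. A sketch of Reciprocal Rank Fusion, which merges ranked lists by rank position so BM25 and cosine scores never need to share a scale (k=60 is the conventional constant):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc;
    docs that rank well in several lists float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]  # BM25 order
vector_hits  = ["doc_b", "doc_d", "doc_a"]  # kNN order
print(rrf([keyword_hits, vector_hits]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

doc_b wins because it appears near the top of both lists, even though neither list ranked it first.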

Vector Search in Elasticsearch

Elasticsearch supports vector search natively through the dense_vector field type. You need to generate embeddings externally (using models like sentence-transformers or OpenAI) and store them alongside your documents. Here's how to configure the mapping and run a kNN query.

Mapping with dense_vector
PUT /products
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 768,           // Match your model output
        "index": true,          // Enable HNSW indexing
        "similarity": "cosine"  // or "dot_product", "l2_norm"
      }
    }
  }
}
kNN Query
POST /products/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.12, 0.45, -0.33, ...],  // 768 dims
    "k": 10,                                // Return top 10
    "num_candidates": 100                  // Speed vs accuracy
  }
}
The num_candidates Tradeoff

Higher num_candidates = more accurate but slower.
num_candidates: 50 fast, ~85% recall
num_candidates: 200 slower, ~98% recall

HNSW Parameters (Tuning Guide)

HNSW has three knobs that control the "Triangle of Trade-offs": Speed, Accuracy (Recall), and Memory. Understanding these is key to tuning your cluster.

m (Max Connections)

Build-time

Analogy: The number of friends each person keeps in their phone book.

  • High M (e.g., 64): Each node connects to many neighbors. You can jump to a destination in fewer hops. Great recall, but the index consumes significantly more RAM per document.
  • Low M (e.g., 16): Each node knows few others. Small index footprint and fast updates, but you might get "stuck" in a local neighborhood and miss the true nearest neighbor.

ef_construction

Build-time

Analogy: How thorough you are when vetting new friends.

  • High ef (e.g., 200): When adding a document, the system scans 200 candidates to find the absolute optimal `M` connections. The graph quality is pristine, but indexing CPU usage spikes.
  • Low ef (e.g., 50): We just grab the first decent neighbors we find. Indexing is fast, but the graph is "messy", leading to potentially worse search results later.

ef_search / num_candidates

Query-time

Analogy: How many people you ask for directions at each intersection.

  • High (e.g., 100): You carefully explore many possible paths. You almost certainly find the target, but latency increases.
  • Low (e.g., 10): You rush down the first promising path. Ultra-fast, but you might mistakenly arrive at a local optimum (incorrect match) rather than the global optimum.
Parameter           | What It Does                        | Trade-off
M                   | Max connections per node            | Higher = better recall, more memory
ef_construction     | Beam width during build             | Higher = better graph, slower build
num_candidates (ef) | Candidates to explore at query time | Higher = better recall, slower search

Product Search:  M = 32, ef = 200 → 99% recall, 10ms
Log Analysis:    M = 16, ef = 50  → 92% recall, 2ms
Recommendation:  M = 24, ef = 100 → 97% recall, 5ms

Hybrid Search: Keywords + Vectors

Why use both? Because they solve opposite problems:

  • Keyword Search (BM25) is precise but brittle. It finds "iPhone 15" but misses "Apple Phone".
  • Vector Search (HNSW) is broad but fuzzy. It finds "Apple Phone" but might return "Samsung Phone" (semantically close).
  • Hybrid Search combines them: vectors for recall (finding concepts) and keywords for precision (matching exact models).
Hybrid Query
{
  "query": {
    "match": { "title": "wireless keyboard", "boost": 0.3 }
  },
  "knn": {
    "field": "title_vector",
    "query_vector": [...],
    "k": 50,
    "boost": 0.7
  }
}
// final_score = 0.3 × BM25 + 0.7 × vector_similarity
Use Case                          | Keyword Weight | Vector Weight
E-commerce (exact models matter)  | 0.5            | 0.5
FAQ search (meaning matters)      | 0.2            | 0.8
Code search (exact tokens matter) | 0.7            | 0.3
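The linear-combination alternative to RRF needs score normalization first, since BM25 scores are unbounded while cosine similarity is not. A sketch with min-max normalization (the weights and scores here are illustrative):

```python
def blend(bm25_scores, vector_scores, keyword_weight=0.3, vector_weight=0.7):
    """Linear combination after min-max normalization, so neither
    signal drowns out the other."""
    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / (hi - lo) if hi > lo else 0.0
                for d, s in scores.items()}
    bm25, vec = minmax(bm25_scores), minmax(vector_scores)
    docs = set(bm25) | set(vec)
    return {d: keyword_weight * bm25.get(d, 0.0) + vector_weight * vec.get(d, 0.0)
            for d in docs}

scores = blend({"a": 12.4, "b": 3.1}, {"a": 0.62, "b": 0.91})
print(max(scores, key=scores.get))  # → 'b': its vector edge outweighs a's BM25 edge
```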

Memory & Quantization

Vectors are heavy. A single 768-dimensional vector takes ~3KB as 32-bit floats. Multiply by 100 million documents, and you need ~300GB of RAM just for vectors. This gets expensive fast. Quantization solves this by compressing 32-bit floats into 8-bit integers (or even 1-bit binaries), sacrificing a tiny bit of precision for massive memory savings.

Storage Formula: num_docs × dims × bytes_per_float
10M products × 768 dims × 4 bytes
= 30.7 GB (vectors only)
+ HNSW graph (~50% overhead)
= ~46 GB total
Quantization Reduces Memory 4x (at 100M docs)
float32: 307 GB
int8: 77 GB (4x smaller)
ES 8.11+: "type": "int8_hnsw"
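A minimal scalar-quantization sketch to make the 4x figure concrete. Real engines (including Elasticsearch's int8_hnsw) calibrate more carefully; this just maps each vector's own range onto 256 int8 buckets:

```python
def quantize_int8(vec):
    """Map each float32 dimension onto int8: store 1 byte + (lo, scale)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0
    return [round((x - lo) / scale) - 128 for x in vec], lo, scale

def dequantize(q, lo, scale):
    return [(x + 128) * scale + lo for x in q]

vec = [0.21, 0.89, -0.52, 0.05]
q, lo, scale = quantize_int8(vec)
approx = dequantize(q, lo, scale)
print(max(abs(a - b) for a, b in zip(vec, approx)))  # tiny rounding error

# Storage math from the section: 100M docs × 768 dims
print(100e6 * 768 * 4 / 1e9)  # float32: 307.2 GB
print(100e6 * 768 * 1 / 1e9)  # int8:     76.8 GB (4x smaller)
```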

Common Pitfalls

Implementing vector search is deceptive. It looks easy ("just index the vectors!"), but the complexity hides in the operational details. Here are the two mistakes that most often degrade search relevance.

Pitfall: Post-Filtering

// BAD: Find top 10, then filter
"knn": { "k": 10 }, "post_filter": ...
// Might return only 3 results!
Fix: Pre-filter in knn
"knn": { "filter": ... }

Pitfall: Stale Embeddings

Day 1: "Blue Shoes $99" indexed
Day 30: Text updated to "Red Shoes $299"
Vector still encodes "blue, shoes, affordable"!
Fix: Re-embed on content change
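One lightweight guard is to store a hash of the exact text that was embedded, and compare it on every update. A sketch (the field names and hashing scheme are illustrative, not a specific library's API):

```python
import hashlib

def needs_reembedding(doc, stored_hash):
    """True if the fields that fed the embedding model have drifted
    away from the text the stored vector was computed from."""
    text = f"{doc['title']} {doc['description']}"
    current = hashlib.sha256(text.encode()).hexdigest()
    return current != stored_hash

# Day 1: vector (and hash) computed from the old text
old_hash = hashlib.sha256(b"Blue Shoes $99 running shoes").hexdigest()
# Day 30: the document changed, but the vector did not
doc = {"title": "Red Shoes", "description": "$299 running shoes"}
print(needs_reembedding(doc, old_hash))  # → True: re-embed this document
```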

When Vector Search Fails

Vector search is magical, but it's not a silver bullet. If you remove keyword search entirely, you will likely degrade the user experience in these specific scenarios:

1. Exact SKU/ID Search

Query: "iPhone 15 Pro Max 256GB"
Embedding models typically treat "128GB" and "256GB" as near-synonyms. Users searching for parts or IDs need exact character matching.

2. Out-of-Vocabulary Jargon

Query: "X7000-Z Turbo"
If the model has never seen this specific model number, it will hallucinate the closest concept (e.g., "fast car"). Keywords would correctly return 0 results.

3. Short/Ambiguous Queries

Query: "GAP"
A vector model might map this to "Space", "Interval", or "Clothing". Without behavioral data, the vector space is often too broad.

4. Distribution Shift

Models are trained on Wikipedia/Internet data. Your documents are niche technical manuals. If you don't fine-tune, the "meaning" of words might not align with your domain.

Performance Benchmarks

How fast is HNSW? Incredibly fast. Because search is logarithmic `O(log N)`, doubling the data size only adds a tiny constant amount of time. You can search billions of vectors in milliseconds.

Benchmark Spec: 10M products, 768 dimensions (m=16, ef_c=100)

Configuration | Recall@10 | Latency (p50)
ef=50         | 92%       | 2ms
ef=100        | 96%       | 4ms
ef=200        | 98%       | 8ms
Brute force   | 100%      | 500ms
Scaling Behavior

10M vectors: 4ms • 100M vectors: 8ms • 1B vectors: 15ms
(Logarithmic scaling: 10x data ≠ 10x latency)

Key Takeaways

01

Meaning encoding

Vectors encode semantic meaning. Similar text produces nearby vectors in high-dimensional space.

02

HNSW

Uses hierarchical layers to navigate from coarse to fine, achieving O(log N) search.

03

Tuning

Tune M (recall) and ef (speed) to balance memory, build time, and search latency.

04

Hybrid Search

Combining BM25 (keyword precision) and Vectors (semantic recall) usually yields the best results.

05

Quantization

Use int8 quantization to reduce memory 4x with minimal quality loss.