Hybrid Ranking Pipelines
Neither keyword search nor semantic search alone delivers optimal results for all query types. Hybrid search combines the precision of BM25 with the conceptual understanding of embeddings. This chapter covers RRF, linear combination, score normalization, and query routing.
| Number | Meaning |
|---|---|
| k=60 | The "set it and forget it" RRF constant from Cormack et al. (2009). Stable across domains. |
| α ≈ 0.6–0.7 | Typical semantic weight, slightly favoring dense retrieval. Short queries → lower alpha; long queries → higher. |
| ~1ms | Cost of merging two ranked lists of 100 candidates each. Negligible overhead. |
1. Why Hybrid Search?
If semantic search is so powerful, why do we still need old-school keyword search like BM25? The reality of production search systems is that neither approach alone handles the full diversity of human queries. Dense retrieval (embeddings) and sparse retrieval (keywords) have largely orthogonal strengths and complementary weaknesses.
BM25 breaks on conceptual queries: "affordable noise cancelling headphones" won't match "budget over-ear ANC headphones" because "affordable" ≠ "budget" and "noise cancelling" ≠ "ANC." Embeddings break on precision queries: searching "SKU-2847-B" produces a vague vector that might loosely match hundreds of unrelated products with similar-looking codes. Most query types that trip up one system are handled well by the other.
| Query Type | BM25 | Vector | Hybrid |
|---|---|---|---|
| "iPhone 15 Pro Max" (exact product) | ✅ | ⚠️ | ✅ |
| "affordable smartphones" (conceptual) | ❌ | ✅ | ✅ |
| "SKU-2847-B" (exact code) | ✅ | ❌ | ✅ |
| "how to fix my car not starting" | ⚠️ | ✅ | ✅ |
| "error code 0x8004005" | ✅ | ❌ | ✅ |
| "alternatives to Slack for team chat" | ❌ | ✅ | ✅ |
2. Architecture: Parallel Retrieval + Fusion
A hybrid search engine executes two entirely separate searches. The user types a query; the system sends the raw text to the sparse engine (e.g., Elasticsearch or OpenSearch) for BM25 scoring, and simultaneously embeds the text and sends the resulting vector to the dense engine (e.g., Pinecone or Qdrant) for cosine-similarity scoring.
Both engines return their top candidates (e.g., top 100), yielding two separate ranked lists of documents. The final, critical step is fusion: mathematically merging these two lists into a single, cohesive top 10 to present to the user.
The parallel execution is critical: BM25 and vector search each typically return in ~5-15ms, so total retrieval latency = max(BM25_time, vector_time), not the sum. Fusion adds ~1ms.
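The parallel fan-out can be sketched in a few lines. This is a minimal illustration, not a production client: `bm25_search` and `vector_search` are hypothetical stand-ins for real engine calls, and the hard-coded document IDs exist only to make the sketch runnable.

```python
from concurrent.futures import ThreadPoolExecutor

def bm25_search(query: str, top_k: int = 100) -> list[str]:
    """Stand-in for a sparse-engine call (e.g. Elasticsearch BM25)."""
    return ["doc_a", "doc_b", "doc_c"][:top_k]

def vector_search(query: str, top_k: int = 100) -> list[str]:
    """Stand-in for embed-then-search against a dense engine."""
    return ["doc_b", "doc_d", "doc_a"][:top_k]

def retrieve(query: str, top_k: int = 100) -> tuple[list[str], list[str]]:
    # Fire both searches concurrently: wall-clock cost is
    # max(bm25_time, vector_time), not the sum.
    with ThreadPoolExecutor(max_workers=2) as pool:
        sparse = pool.submit(bm25_search, query, top_k)
        dense = pool.submit(vector_search, query, top_k)
        return sparse.result(), dense.result()
```

The two ranked lists this returns are exactly what the fusion step consumes next.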
3. Reciprocal Rank Fusion (RRF)
How do you merge a BM25 score of 24.5 with a cosine similarity score of 0.82? The scores are on completely different scales, have different distributions, and mean different things. Reciprocal Rank Fusion (RRF) bypasses this problem entirely by ignoring scores and looking only at ranks.
RRF is the most widely used fusion method in production (native to Elasticsearch and Azure AI Search) because it's elegant, robust, and effectively parameter-free. It calculates a score based purely on a document's position in the retrieved lists. A document ranking highly in multiple retrievers is probably highly relevant, regardless of what its absolute scores were.
Worked Example
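A minimal sketch of RRF in Python (the document IDs and list contents are illustrative). Each document's fused score is the sum of 1/(k + rank) over every list it appears in, with ranks starting at 1 and k = 60:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_top = ["doc_a", "doc_b", "doc_c"]     # ranks 1, 2, 3 from BM25
vector_top = ["doc_b", "doc_d", "doc_a"]   # ranks 1, 2, 3 from vectors
fused = rrf_fuse([bm25_top, vector_top])
# doc_b: 1/62 + 1/61 ≈ 0.03252; doc_a: 1/61 + 1/63 ≈ 0.03227
# doc_b edges out doc_a: appearing high in BOTH lists wins.
```

Note that doc_c and doc_d, each found by only one retriever, still get scored; they just land below the documents both retrievers agreed on.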
4. Linear Combination
While RRF is robust, discarding absolute scores means throwing away valuable information. A document with a BM25 score of 120 (a phenomenally strong exact match) gets the same rank-based treatment as a document with a BM25 score of 15 if they both happen to be rank #1 in their respective lists.
Linear combination solves this by keeping the scores, but it introduces a significant engineering headache: score normalization. You must squash the unbounded BM25 scores (0 to ∞) and the bounded vector scores (0 to 1) onto the same scale (typically via min-max scaling or z-score normalization over the retrieved set) before combining them in a weighted sum with a tunable weight alpha: score = α · dense + (1 − α) · sparse.
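A minimal sketch of min-max normalization plus the weighted sum (scores and document IDs are illustrative; the degenerate-case handling is one reasonable choice, not a standard):

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale a retriever's scores to [0, 1] over the retrieved set."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}  # all scores equal: avoid divide-by-zero
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def linear_fuse(bm25: dict[str, float], dense: dict[str, float],
                alpha: float = 0.65) -> list[tuple[str, float]]:
    """score(d) = alpha * dense_norm(d) + (1 - alpha) * bm25_norm(d).
    A document missing from one list contributes 0 from that retriever."""
    b, v = min_max(bm25), min_max(dense)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)
             for d in set(b) | set(v)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = linear_fuse(
    bm25={"doc_a": 24.5, "doc_b": 15.0, "doc_c": 7.2},
    dense={"doc_b": 0.82, "doc_a": 0.71, "doc_d": 0.66},
)
```

Note how min-max is computed per retrieved set: the same document can normalize to different values depending on which other candidates came back, which is exactly the outlier sensitivity the comparison table below flags.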
| Aspect | RRF | Linear Combination |
|---|---|---|
| Normalization | Not needed | Required (and tricky) |
| Tuning effort | Minimal (k=60 works) | Must tune α per domain + normalize |
| Score info | Discards magnitude | Preserves magnitude |
| Labeled data | No | Yes (for optimal α) |
| Robustness | Very robust to noise | Sensitive to normalization |
| Adoption | Elasticsearch, Azure AI Search | Weaviate, Pinecone |
5. Advanced: Query Routing
Even with a tuned linear combination, a single global alpha weight (e.g., 60% semantic, 40% keyword) is suboptimal. An exact SKU search doesn't need 60% semantic noise; it needs 100% keyword precision. A philosophical question doesn't need 40% keyword matching; it needs 100% semantic understanding.
State-of-the-art production systems go beyond static fusion by routing queries to different strategies dynamically. A fast Query Understanding layer (often a small classifier or rules engine) analyzes the query string before retrieval and passes specific instructions down to the fusion layer regarding how to blend the results.
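A rules-engine version of that Query Understanding layer might look like the sketch below. The regex patterns, word-count threshold, and alpha values are illustrative assumptions, not prescriptions; a real system would tune them (or replace the rules with a small classifier):

```python
import re

def route(query: str) -> float:
    """Return the semantic weight alpha for a query (rules-engine sketch).

    alpha = 0.0 means pure BM25; alpha = 1.0 means pure vector search.
    All thresholds here are illustrative.
    """
    # Exact-identifier queries (SKUs, hex error codes) -> pure keyword.
    if re.search(r"\b(?:[A-Z]{2,}-\d+|0x[0-9A-Fa-f]+)", query):
        return 0.0
    # Natural-language questions -> favor semantic.
    if query.lower().startswith(("how", "why", "what", "alternatives")):
        return 0.8
    # Short keyword-ish queries -> favor BM25.
    if len(query.split()) <= 2:
        return 0.4
    return 0.65  # default blend
```

The returned alpha feeds straight into a linear-combination fusion step; a routing layer can just as easily select RRF, or skip one retriever entirely when alpha hits 0.0 or 1.0.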
Key Takeaways
Neither Alone Is Sufficient
BM25 excels at exact matching (SKU codes, error codes, product names) but fails on conceptual queries. Embeddings excel at intent and vocabulary bridging but struggle with precise identifiers. Hybrid covers both.
RRF Is the Recommended Starting Point
Reciprocal Rank Fusion operates on ranks (not scores), needs no normalization, is effectively parameter-free (the default k=60 rarely needs tuning), requires no labeled data, and works with any number of retrievers. Used by Elasticsearch and Azure AI Search.
Linear Combination Needs Careful Normalization
BM25 scores are unbounded (often 0-25, but spiking far higher on strong exact matches); cosine similarity ranges 0-1. Without normalization, one retriever dominates. Min-max is sensitive to outliers; z-score produces negative values. This complexity is why many systems prefer RRF.
Parallel Execution Is Critical
Run BM25 + vector search in parallel. Total latency = max(BM25, vector) + fusion overhead, NOT the sum. Both typically return in 5-15ms, so total retrieval is ~15ms + ~1ms fusion.
Query Routing Beats Static Weights
Route by query type: ID queries → BM25 only; short queries → alpha=0.4 (favor BM25); questions → alpha=0.8 (favor semantic). Dramatically outperforms a single static alpha value.