Systems Atlas
Chapter 6.7: Vector & Semantic Search

Hybrid Ranking Pipelines

Neither keyword search nor semantic search alone delivers optimal results for all query types. Hybrid search combines the precision of BM25 with the conceptual understanding of embeddings. This chapter covers RRF, linear combination, score normalization, and query routing.

RRF Constant

k=60

The "set it and forget it" constant from Cormack et al. (2009). Stable across domains.

Alpha Sweet Spot

0.6-0.7

Slightly favoring semantic. Short queries → lower alpha; long queries → higher.

Fusion Latency

~1ms

Merging two ranked lists of 100 candidates each. Negligible overhead.

1. Why Hybrid Search?

If semantic search is so powerful, why do we still need old-school keyword search like BM25? The reality of production search systems is that neither approach alone can handle the full diversity of human queries. Dense retrieval (embeddings) and sparse retrieval (keywords) have orthogonal strengths and largely complementary weaknesses.

BM25 breaks on conceptual queries: "affordable noise cancelling headphones" won't match "budget over-ear ANC headphones" because "affordable" ≠ "budget" and "noise cancelling" ≠ "ANC." Embeddings break on precision queries: searching "SKU-2847-B" produces a vague vector that might loosely match hundreds of unrelated products with similar-looking codes. Nearly every query type that trips up one system is handled well by the other.

Query Type                               BM25   Vector   Hybrid
"iPhone 15 Pro Max" (exact product)       ✅      ⚠️       ✅
"affordable smartphones" (conceptual)     ❌      ✅       ✅
"SKU-2847-B" (exact code)                 ✅      ❌       ✅
"how to fix my car not starting"          ⚠️      ✅       ✅
"error code 0x8004005"                    ✅      ❌       ✅
"alternatives to Slack for team chat"     ❌      ✅       ✅
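The vocabulary-mismatch failure mode is easy to see with a toy sketch: sparse matching only counts exact token overlap, so synonyms contribute nothing. Here `token_overlap` is an illustrative stand-in for sparse retrieval, not real BM25 scoring:

```python
def token_overlap(query, doc):
    """Toy stand-in for sparse retrieval: exact-token intersection only."""
    return set(query.lower().split()) & set(doc.lower().split())

query = "affordable noise cancelling headphones"
doc = "budget over-ear anc headphones"
print(token_overlap(query, doc))  # → {'headphones'}
```

Only "headphones" survives: "affordable" ≠ "budget" and "cancelling" ≠ "anc" at the token level, which is exactly the gap embeddings close.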

2. Architecture: Parallel Retrieval + Fusion

A hybrid search engine executes two entirely separate searches. The user types a query; the system sends the raw text to the sparse engine (Elasticsearch, OpenSearch) for BM25 scoring, and simultaneously embeds the text and sends the resulting vector to the dense engine (Pinecone, Qdrant) for cosine-similarity scoring.

Both engines return their top candidates (e.g., top 100), yielding two separate ranked lists of documents. The final, critical step is fusion: mathematically merging these two lists into a single, cohesive top 10 to present to the user.

# Hybrid search pipeline

Query: "affordable noise cancelling headphones"
        |                            |
        v                            v
   BM25 (sparse)               kNN/ANN (dense)
   Inverted Index              Vector Index
   Top 100 by BM25             Top 100 by vector
        |                            |
        +-------------+--------------+
                      v
           Fusion (RRF or Linear)
                      v
               Top 10 results

The parallel execution is critical: both BM25 and vector search return in ~5-15ms each, so total retrieval = max(BM25_time, vector_time), not the sum. Fusion adds ~1ms.
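A minimal sketch of this fan-out with Python's `concurrent.futures`; `search_bm25` and `search_vector` are hypothetical client calls, stubbed here with sleeps so the latency math is visible:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def search_bm25(query, top_k=100):
    """Stub for a call to the sparse engine (e.g., Elasticsearch)."""
    time.sleep(0.010)  # pretend 10 ms of network + BM25 scoring
    return [f"doc_{i}" for i in range(top_k)]

def search_vector(query, top_k=100):
    """Stub for embed + ANN query against the dense engine (e.g., Qdrant)."""
    time.sleep(0.015)  # pretend 15 ms
    return [f"doc_{i}" for i in range(50, 50 + top_k)]

def hybrid_retrieve(query):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        bm25_future = pool.submit(search_bm25, query)
        vector_future = pool.submit(search_vector, query)
        bm25_hits = bm25_future.result()
        vector_hits = vector_future.result()
    elapsed = time.perf_counter() - start
    # elapsed ≈ max(10 ms, 15 ms), not 25 ms, because the calls overlap
    return bm25_hits, vector_hits, elapsed

bm25_hits, vector_hits, elapsed = hybrid_retrieve("affordable anc headphones")
print(f"{len(bm25_hits)} + {len(vector_hits)} candidates in {elapsed * 1000:.0f} ms")
```

Swapping the thread pool for async clients (or the engines' native hybrid endpoints) changes the mechanics but not the principle: never serialize the two retrievals.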


3. Reciprocal Rank Fusion (RRF)

How do you merge a BM25 score of 24.5 with a cosine similarity score of 0.82? The scores are on completely different scales, have different distributions, and mean different things. Reciprocal Rank Fusion (RRF) bypasses this problem entirely by ignoring scores and looking only at ranks.

RRF is the most widely used fusion logic in production (native to Elasticsearch and Azure AI Search) because it's elegant, robust, and parameter-free. It calculates a score based purely on a document's position in the retrieved lists. A document ranking highly in multiple retrievers is probably highly relevant, regardless of what its absolute scores were.

rrf.py
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """k=60 from Cormack et al., 2009. Higher k = less top-rank emphasis."""
    rrf_scores = defaultdict(float)
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            rrf_scores[doc_id] += 1.0 / (k + rank)
    return sorted(rrf_scores.items(), key=lambda x: -x[1])

Worked Example

BM25: [Doc_A (rank 1), Doc_B (rank 2), Doc_C (rank 3)]
Vector: [Doc_C (rank 1), Doc_D (rank 2), Doc_A (rank 3)]
RRF scores (k=60):
Doc_A: 1/(60+1) + 1/(60+3) = 0.01639 + 0.01587 = 0.03227
Doc_B: 1/(60+2) + 0 = 0.01613 (only in BM25)
Doc_C: 1/(60+3) + 1/(60+1) = 0.01587 + 0.01639 = 0.03227
Doc_D: 0 + 1/(60+2) = 0.01613 (only in vector)
Final: [Doc_A, Doc_C, Doc_B, Doc_D] (Doc_A/Doc_C and Doc_B/Doc_D tie exactly; order within each tie is arbitrary)
→ Docs in BOTH lists get boosted. Single-list docs rank lower.
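Running rrf.py over these two lists reproduces the hand-computed numbers (the function is repeated here so the snippet runs standalone):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists by rank position (k=60 per Cormack et al., 2009)."""
    rrf_scores = defaultdict(float)
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            rrf_scores[doc_id] += 1.0 / (k + rank)
    return sorted(rrf_scores.items(), key=lambda x: -x[1])

bm25 = ["Doc_A", "Doc_B", "Doc_C"]
vector = ["Doc_C", "Doc_D", "Doc_A"]
fused = reciprocal_rank_fusion([bm25, vector])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
# Doc_A and Doc_C tie at 0.03227; Python's stable sort keeps the first-seen doc first.
```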

4. Linear Combination

While RRF is robust, discarding absolute scores means throwing away valuable information. A document with a BM25 score of 120 (a phenomenally strong exact match) gets the same rank-based treatment as a document with a BM25 score of 15 if they both happen to be rank #1 in their respective lists.

Linear combination solves this by keeping the scores, but it introduces a major engineering headache: score normalization. You must squash the unbounded BM25 scores (0 to ∞) and the bounded vector scores (0 to 1) into the same distribution (typically via min-max scaling or z-score normalization over the retrieved set) before weighting them with a tunable mixing parameter (alpha).

linear_fusion.py
def hybrid_linear(bm25_score, vector_score, alpha=0.7):
    """
    alpha=0.0 → pure BM25       alpha=0.5 → equal weight
    alpha=1.0 → pure semantic   Typical: 0.6-0.7
    """
    return alpha * normalize(vector_score) + (1 - alpha) * normalize(bm25_score)
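One subtlety: normalize() only makes sense over the full retrieved candidate set, since a single score can't be rescaled in isolation. A sketch of the usual approach, min-max scaling each retriever's score set before the weighted sum (function names and sample scores here are illustrative):

```python
def min_max_normalize(scores):
    """Rescale {doc_id: raw_score} into [0, 1] over the retrieved set."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all candidates scored identically
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def fuse_linear(bm25_scores, vector_scores, alpha=0.7):
    """Weighted sum of normalized scores; a doc absent from a list scores 0 there."""
    bm25_n = min_max_normalize(bm25_scores)
    vec_n = min_max_normalize(vector_scores)
    fused = {
        doc_id: alpha * vec_n.get(doc_id, 0.0) + (1 - alpha) * bm25_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(vec_n)
    }
    return sorted(fused.items(), key=lambda x: -x[1])

bm25_raw = {"A": 24.5, "B": 15.0, "C": 9.2}     # unbounded BM25 scores
vector_raw = {"C": 0.82, "D": 0.71, "A": 0.64}  # bounded cosine similarities
fused = fuse_linear(bm25_raw, vector_raw)
print(fused)  # "C" ranks first (top semantic hit), then "A" (top BM25 hit)
```

Note the failure mode baked into min-max: a single outlier score stretches the range and squashes everything else toward 0, which is one reason many teams fall back to RRF.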
Aspect           RRF                              Linear Combination
Normalization    Not needed                       Required (and tricky)
Tuning effort    Minimal (k=60 works)             Must tune α per domain + normalize
Score info       Discards magnitude               Preserves magnitude
Labeled data     No                               Yes (for optimal α)
Robustness       Very robust to noise             Sensitive to normalization
Adoption         Elasticsearch, Azure AI Search   Weaviate, Pinecone

5. Advanced: Query Routing

Even with a tuned linear combination, a single global alpha weight (e.g., 60% semantic, 40% keyword) is inherently suboptimal. An exact SKU search doesn't need 60% semantic noise; it needs 100% keyword precision. A philosophical question doesn't need 40% keyword matching; it needs 100% semantic understanding.

State-of-the-art production systems go beyond static fusion by routing queries to different strategies dynamically. A fast Query Understanding layer (often a small classifier or rules engine) analyzes the query string before retrieval and passes specific instructions down to the fusion layer regarding how to blend the results.

query_routing.py
def route_query(query):
    """Choose search strategy based on query characteristics."""
    if looks_like_id(query):                    # "SKU-2847-B", "ISBN 978-3-16"
        return bm25_only(query)                 # Exact match is what's needed
    if is_short_query(query, max_words=2):      # "python", "best laptop"
        return hybrid_search(query, alpha=0.4)  # Favor BM25
    if is_question(query):                      # "how do I..."
        return hybrid_search(query, alpha=0.8)  # Favor semantic
    return hybrid_search(query, alpha=0.6)      # Default: balanced
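route_query leaves its helper predicates abstract. A rough sketch of what they might look like; the heuristics below are illustrative assumptions (production systems often use a small trained classifier instead of hand-written rules):

```python
def looks_like_id(query):
    """Heuristic: any token mixing digits with letters or dashes, e.g. 'SKU-2847-B'."""
    for token in query.split():
        has_digit = any(c.isdigit() for c in token)
        if has_digit and ("-" in token or any(c.isalpha() for c in token)):
            return True
    return False

def is_short_query(query, max_words=2):
    """Heuristic: very short queries tend to be keyword lookups."""
    return len(query.split()) <= max_words

def is_question(query):
    """Heuristic: interrogative opener or trailing question mark."""
    words = query.lower().split()
    openers = {"how", "what", "why", "when", "where", "who", "which", "can", "does", "is"}
    return query.rstrip().endswith("?") or (bool(words) and words[0] in openers)

print(looks_like_id("SKU-2847-B"), is_short_query("python"), is_question("how do I fix this"))
# → True True True
```

The false-positive cost is asymmetric: misrouting a natural-language query to bm25_only hurts far more than running an ID query through hybrid fusion, so keep the ID heuristic conservative.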

Key Takeaways

01

Neither Alone Is Sufficient

BM25 excels at exact matching (SKU codes, error codes, product names) but fails on conceptual queries. Embeddings excel at intent and vocabulary bridging but struggle with precise identifiers. Hybrid covers all query types.

02

RRF Is the Recommended Starting Point

Reciprocal Rank Fusion operates on ranks (not scores), needs no normalization, is parameter-free (k=60), requires no labeled data, and works with any number of retrievers. Used by Elasticsearch and Azure AI Search.

03

Linear Combination Needs Careful Normalization

BM25 scores are unbounded (a strong exact match might score 25 or far higher), while cosine similarity is bounded to 0-1. Without normalization, one retriever dominates. Min-max is sensitive to outliers; z-score produces negatives. This complexity is why many systems prefer RRF.

04

Parallel Execution Is Critical

Run BM25 + vector search in parallel. Total latency = max(BM25, vector) + fusion overhead, NOT the sum. Both typically return in 5-15ms, so total retrieval is ~15ms + ~1ms fusion.

05

Query Routing Beats Static Weights

Route by query type: ID queries → BM25 only; short queries → alpha=0.4 (favor BM25); questions → alpha=0.8 (favor semantic). Dramatically outperforms a single static alpha value.