Chapter 5.1: Retrieval Architecture
Why Retrieval is about Recall, Not Precision
Retrieval is the art of casting a net, not throwing a spear. In a multi-stage search architecture, the primary job of Stage 1 is to ensure the right answer is in the building. Retrieval exists to satisfy latency, memory, and CPU budgets under a recall constraint, not to optimize relevance. Everything else is secondary.
The First Law of Search
"You cannot rank what you do not find."
The Recruiter Metaphor
To understand the distinct roles of retrieval and ranking, imagine a large tech company hiring for a job.
Retrieval (Stage 1) acts as the Recruiter. Their job is to source a pool of 1,000 potential candidates from LinkedIn. The Recruiter uses broad keywords like "Engineer" or "Software". Their goal is to maximize Recall. If they are too strict ("Must have 5 years of Rust experience and live in Boston"), they might accidentally filter out the perfect candidate who has 4.5 years of experience or lives nearby. Being too precise here is dangerous.
Ranking (Stage 2) acts as the Hiring Manager. Their job is to interview the 10 candidates passed to them. Their goal is Precision. They spend hours (compute) with each candidate to ensure they are the perfect fit.
The critical lesson: If the Recruiter is "too smart" and skips a candidate, the Hiring Manager never meets them. The system fails, regardless of how good the Hiring Manager is.
Architecture of Constraints
We design the system backwards from constraints. We know the Ranker is expensive (e.g., tens of milliseconds per batch), so it can only score 500 documents. Therefore, Retrieval must filter 1,000,000,000 documents down to 500.
The Mathematics of Loss
Search is often visualized as a funnel, but mathematically it is a Chain of Independent Probabilities. A funnel implies flow; a chain implies dependency. For a user to find their answer, a sequence of events must occur successfully, and the probability of the final success is the product of the probabilities of each step.
P(Indexed)
Is the document technically ingestible? Is it in the shard we are querying? If not, P=0.
P(Retrieved)
Does the query match the document? Does it survive top-K selection under BM25 scoring?
P(Ranked)
Does the neural network think this is relevant enough to be in slot #1?
The "Unrecoverable Error" Rule
If P(Retrieved) = 0 then P(Success) = 0
It doesn't matter if your Ranker is 100% accurate (P=1.0). 1.0 × 0 = 0.
This mathematical reality forces Stage 1 to prioritize inclusion (Recall) over exclusion (Precision).
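The chain above can be sketched in a few lines. This is a minimal illustrative model (the function name and the probability values are assumptions, not measurements):

```python
# Model the pipeline as a chain of independent probabilities.
from math import prod

def p_success(p_indexed: float, p_retrieved: float, p_ranked: float) -> float:
    """Probability the user finds the answer: the product of every stage."""
    return prod([p_indexed, p_retrieved, p_ranked])

# A perfect Ranker cannot compensate for a retrieval miss:
print(p_success(1.0, 0.0, 1.0))       # 0.0 -- unrecoverable
# Small losses at every stage compound multiplicatively:
print(p_success(0.99, 0.90, 0.95))    # ~0.846, well below any single stage
```

The second call shows the subtler point: three stages that each look healthy in isolation still multiply down to a noticeably lower end-to-end success rate.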
Failure Mode A: Noise (False Positive)
The system returns "Apple Pie" when you search for "iPhone". This is a False Positive. Crucially, this is not a disaster. The subsequent Ranking stage (which is smarter) will examine "Apple Pie", realize it has low semantic similarity to "iPhone", and push it to page 10.
Failure Mode B: Silence (False Negative)
The system returns nothing when you search for "Running Kicks", even though "Nike Running Shoes" exists. The game is over. The Ranker never receives the correct document. It cannot fix what it cannot see. The user sees an empty page or irrelevant results.
The Candidate Set Contract
Retrieval is a Budget Allocator
Retrieval does not return "results" in the traditional sense. It returns a Candidate Set under a strict contract. It is not an accuracy system; it is a budget allocator.
- Must contain the true positives (Recall).
- Must be small enough to rank (Top-K).
- Must be cheap to compute (Latency).
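The three clauses of the contract can be expressed as explicit checks. This is a hypothetical sketch; the class, field names, and thresholds are illustrative, not a real API:

```python
# Minimal sketch of the Candidate Set contract as runnable checks.
from dataclasses import dataclass

@dataclass
class CandidateSet:
    doc_ids: list[int]
    latency_ms: float

    def satisfies_contract(self, truth: set[int], top_k: int, budget_ms: float) -> bool:
        found = truth & set(self.doc_ids)
        recall_ok = len(found) == len(truth)      # true positives are present
        size_ok = len(self.doc_ids) <= top_k      # small enough to rank
        cheap_ok = self.latency_ms <= budget_ms   # within the latency budget
        return recall_ok and size_ok and cheap_ok

cs = CandidateSet(doc_ids=[7, 42, 99], latency_ms=4.2)
print(cs.satisfies_contract(truth={42}, top_k=500, budget_ms=10.0))  # True
```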
The Vocabulary Gap
A major reason for low recall is the Vocabulary Gap. This is the linguistic mismatch between how users describe a problem and how your database describes the solution. Users use colloquialisms, slang, or vague concepts. Databases contain formal titles, part numbers, and marketing copy.
Consider a "Precision-First" system that fails because the tokens do not overlap: "cheap" != "budget" and "laptop" != "notebook". An exact match system (like a basic Inverted Index) returns 0 results, causing high churn.
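The failure mode can be demonstrated in a few lines. Both functions and the tiny synonym table are illustrative assumptions, standing in for an exact-match inverted index and a query-expansion step respectively:

```python
# Sketch of the Vocabulary Gap: strict token overlap vs. synonym expansion.
def exact_match(query: str, doc: str) -> bool:
    """Stand-in for an exact-match inverted index lookup."""
    return bool(set(query.lower().split()) & set(doc.lower().split()))

SYNONYMS = {"cheap": {"budget"}, "laptop": {"notebook"}}  # toy table

def expanded_match(query: str, doc: str) -> bool:
    """Same lookup, but the query terms are expanded with synonyms first."""
    terms = set(query.lower().split())
    for t in list(terms):
        terms |= SYNONYMS.get(t, set())
    return bool(terms & set(doc.lower().split()))

doc = "budget notebook 8GB RAM"
print(exact_match("cheap laptop", doc))     # False -- zero results, high churn
print(expanded_match("cheap laptop", doc))  # True  -- the gap is bridged
```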
The Cost of Intelligence
A common question is: "Why can't we just use the smart AI model for everything? Why do we need this 'dumb' retrieval stage at all?" The answer is scale. Intelligence is computationally expensive.
The table below illustrates the orders-of-magnitude difference in cost. BM25 (Inverted Index) is 100,000x cheaper than a Cross-Encoder (BERT). To search 1 billion documents in <100ms, we must use the cheap algorithm first to filter the list down.
| Algorithm | Complexity | Cost per Query | Ideal Role |
|---|---|---|---|
| Exact Match (Boolean) / Inverted Index | Sublinear | $0.000001 | Stage 1. Posting-list bounded, skip-based. Good for filtering IDs/Names. |
| Vector Search (HNSW) / Approximate Nearest Neighbor | Sublinear* | $0.001 (relative) | Stage 1. Bounded graph walk. Catches semantic misses. |
| Cross-Encoder (BERT) / Deep Neural Network | O(K × model) | $0.10 | Stage 2. Expensive per-pair scoring. Used only on top 50 results. |
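The table's dollar figures make the argument concrete. A quick back-of-the-envelope calculation, using only the illustrative per-query costs above:

```python
# Why Stage 2 cannot touch the full corpus: the Cross-Encoder cost
# from the table, applied with and without a Stage 1 filter.
N_DOCS = 1_000_000_000
COST_CROSS_ENCODER = 0.10   # $ per document pair (table above)
K = 50                       # table: "Used only on top 50 results"

print(f"Rank everything:   ${N_DOCS * COST_CROSS_ENCODER:,.0f} per query")
print(f"Rank top-{K} only:  ${K * COST_CROSS_ENCODER:,.2f} per query")
```

The first line comes out around $100,000,000 per query; the second around $5. Whatever the real numbers are in your stack, the ratio is the point: the expensive model is only affordable after a cheap filter.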
Two-Phase Retrieval (The Loop)
To balance cost and recall, many systems use a Two-Phase Retrieval process within Stage 1 itself.
Phase 1: Coarse Retrieval
Scans the entire index using the cheapest possible logic (Boolean/BM25). Returns a top-N of ~10,000 document IDs.
Cost: Microseconds

Phase 2: Refined Retrieval (Rescoring)
Takes those 10,000 IDs and rescores them using more expensive logic (e.g., phrase matching, lightweight vector distance) to pick the best 500.
Cost: Milliseconds

Multi-Channel Recall
The Union of Channels
Modern retrieval is not one algorithm. It is a Union of Recall Channels. No single algorithm catches every user intent.
Final Set = Union(Lexical, Semantic, Personalized, Curated)
- Lexical (BM25): Catches exact part numbers and names.
- Semantic (ANN): Catches concepts ("cheap" = "budget").
- Personalized: Bias towards user's gender/size/history.
When Recall Goes Too Far (Drift)
Is "More Recall" always the answer? No. There is a tipping point called Recall Drift. This happens when you widen the net so much that the valid results become statistically insignificant compared to the noise.
For example, if a user searches "Java Developer", and you aggressively expand synonyms to include "Coffee", you might retrieve 5,000 job descriptions and 5,000 coffee shop locations. If your Ranker (Stage 2) is only looking at the top 500 candidates, and the coffee shops randomly fill those slots, the valid job descriptions are pushed out.
The Signal-to-Noise Ratio (SNR)
As a rough rule, once RelCount / TotalRetrieved drops below 0.1% (let alone 0.01%), the Ranker begins to fail purely due to probability. You have buried the needle in the haystack.
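The "Java Developer" example above can be put in numbers. Assuming (for illustration) that the Ranker's 500 slots are filled roughly at random from the retrieved set, the expected number of valid results collapses as the noise grows:

```python
# Sketch of Recall Drift: expected valid results among the Ranker's slots,
# under the simplifying assumption that slots fill proportionally at random.
def expected_valid_in_slots(valid: int, noise: int, slots: int = 500) -> float:
    total = valid + noise
    return slots * valid / total

# 5,000 jobs vs 5,000 coffee shops: half the slots are still jobs.
print(expected_valid_in_slots(5_000, 5_000))       # 250.0
# 5,000 jobs buried in ~5M noisy documents: under one valid result survives.
print(expected_valid_in_slots(5_000, 4_995_000))   # 0.5
```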
Recall Gating & Budget Allocation
Because we have multiple channels, we must enforce Per-Channel Quotas. You never let one channel flood the candidate set.
Mechanisms like WAND (Weak AND) enforce the compute budget during scoring itself, not just at candidate selection.
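Per-channel quotas and the union of channels can be sketched together. This is an illustrative merge, not a production implementation; the channel names mirror the list above and the quota values are assumptions:

```python
# Multi-channel recall with per-channel quotas: no channel may flood the set.
def merge_channels(channels: dict[str, list[int]], quotas: dict[str, int]) -> list[int]:
    final: list[int] = []
    seen: set[int] = set()
    for name, doc_ids in channels.items():
        budget = quotas.get(name, 0)       # hard cap for this channel
        for doc_id in doc_ids[:budget]:
            if doc_id not in seen:         # dedupe across channels
                seen.add(doc_id)
                final.append(doc_id)
    return final

channels = {
    "lexical": [1, 2, 3, 4, 5],
    "semantic": [3, 6, 7],
    "personalized": [8],
}
print(merge_channels(channels, {"lexical": 3, "semantic": 2, "personalized": 1}))
# [1, 2, 3, 6, 8]
```

Note that document 3 appears in both the lexical and semantic channels but enters the candidate set only once; the quota is applied before the union, so a flood in one channel cannot crowd out the others.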
Query Understanding Before Retrieval
Strategies depend on Intent
Retrieval is not one fixed pipeline. The strategy changes based on Query Classification.
Head vs. Tail: How to Solve It
This brings us to the ultimate implementation strategy. How do we balance the Vocabulary Gap (Need Recall) with Drift (Need Precision)? We cannot use a single static configuration. We must use a Dynamic Strategy based on the query volume of the user's input.
The Head (Popular Queries)
The Characteristics
A "Head" query is short, popular, and unambiguous. E.g., "Nike" or "iPhone". We have massive amounts of data for these. The risk is not missing a result; the risk is performance and noise.
The Solution Logic
- Volume Analysis: Billions of results exist.
- Recall Risk: Near zero.
- Precision Risk: High (Drift).
- Action: Use operator: "AND". Tends towards strict matching.
The Tail (Rare Queries)
The Characteristics
A "Tail" query is long, specific, and rare. E.g., "red nike hiking waterproof mens 12". We have almost no exact matches. The risk is showing zero results.
The Solution Logic
- Volume Analysis: 0-5 results exist.
- Recall Risk: Extreme (The Vocabulary Gap).
- Action: Use operator: "OR". Tends towards expansion.
- Safety: Use minimum_should_match: "75%" to prevent total garbage.
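The Head/Tail switch can be sketched as a query builder in Elasticsearch-style match-query syntax. The 10,000 monthly-volume threshold is an assumed cutoff for illustration; tune it against your own query logs:

```python
# Dynamic operator selection based on query volume (Head vs. Tail).
def build_query(text: str, monthly_volume: int) -> dict:
    if monthly_volume > 10_000:
        # Head query: recall risk is near zero, precision (drift) is the danger.
        return {"match": {"title": {"query": text, "operator": "and"}}}
    # Tail query: the Vocabulary Gap is the danger, so expand with OR,
    # but keep a minimum_should_match floor to avoid total garbage.
    return {
        "match": {
            "title": {
                "query": text,
                "operator": "or",
                "minimum_should_match": "75%",
            }
        }
    }

print(build_query("nike", monthly_volume=2_000_000))
print(build_query("red nike hiking waterproof mens 12", monthly_volume=3))
```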
Diversity as a Recall Constraint
Why "Best" is Subjective
Traditional recall metrics assume there is a single "right" answer. In search, intent is often ambiguous. If a user searches for "Jaguar", what should the system return?
- A car enthusiast wants the luxury vehicle.
- A student researching biology wants the animal.
- A sports fan wants the Jacksonville Jaguars NFL team.
A Precision-First system will blindly guess the most statistically probable topic (likely Cars) and fill the page with 10 cars. This achieves 100% precision for the car lover, but 0% Recall for the other two groups. It is a failure.
A Recall-First system acknowledges the ambiguity. It purposely returns a mix: 5 Cars, 3 Animals, and 2 Teams. It delegates the final decision to the Ranker (which might know the user is a football fan) or simply presents the diversity to the user.
"It cannot sort what isn't there."
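One simple way to build such a mixed slate is round-robin interleaving across intent buckets. This is a hypothetical sketch; the bucket contents are illustrative and real systems weight buckets by estimated intent probability rather than treating them equally:

```python
# Diversity-aware slate filling for the ambiguous "Jaguar" query:
# round-robin across intent buckets so no single interpretation
# monopolizes the page.
def diversify(buckets: dict[str, list[str]], slots: int) -> list[str]:
    out: list[str] = []
    queues = {name: list(docs) for name, docs in buckets.items()}
    while len(out) < slots and any(queues.values()):
        for docs in queues.values():
            if docs and len(out) < slots:
                out.append(docs.pop(0))   # take one item per bucket per pass
    return out

buckets = {
    "cars": ["XF", "F-Type", "E-Pace", "XJ", "I-Pace"],
    "animals": ["Panthera onca", "habitat", "diet"],
    "nfl": ["Jaguars roster", "schedule"],
}
print(diversify(buckets, slots=10))
```

Every interpretation gets representation near the top of the slate; once a short bucket runs dry, its slots fall through to the remaining buckets.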
Negative Constraints (Hard Filters)
Hard Filters vs Soft Recall
There is one exception to the "Maximize Recall" rule: Hard Constraints. If a user applies a filter (e.g., "In Stock Only"), it is illegal to return out-of-stock items, even if they are relevant.
Note: These filters live inside the Retrieval Layer (as boolean clauses), not the Ranker. Why? Because you cannot rank what you are legally forbidden to show.
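As a sketch, in Elasticsearch-style bool-query syntax the distinction looks like this. The field names (`title`, `in_stock`) are illustrative assumptions:

```python
# Hard constraints as non-scoring boolean clauses in the retrieval layer.
def retrieval_query(text: str, in_stock_only: bool) -> dict:
    query: dict = {
        "bool": {
            # Soft recall clause: contributes to scoring, may be expanded.
            "should": [{"match": {"title": text}}],
            "minimum_should_match": 1,
        }
    }
    if in_stock_only:
        # Hard filter: out-of-stock documents never enter the candidate
        # set, so the Ranker never sees what it must not show.
        query["bool"]["filter"] = [{"term": {"in_stock": True}}]
    return query

print(retrieval_query("nike running shoes", in_stock_only=True))
```

Placing the constraint in a filter clause (rather than a ranking feature) makes the exclusion absolute and, as a bonus, typically cacheable, since filter clauses do not affect scores.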
Key Takeaways
The Unrecoverable Error
If Retrieval misses a document, the Ranker cannot save it. Retrieval sets the quality ceiling.
Vocabulary Gap
Users rarely search for the exact words in the document. We must bridge this gap.
Dynamic Logic
We treat Head (popular) and Tail (rare) queries completely differently to balance noise vs results.
Precision Costs Compute
We cannot run expensive models on 1B documents. We buy Recall cheaply at Stage 1.
Diversity is Safety
When intent is ambiguous, returning a diverse set protects against zero-recall failures.