Chapter 5.1: Retrieval Architecture
Why Retrieval is about Recall, Not Precision
Retrieval is the art of casting a net, not throwing a spear. In a multi-stage search architecture, the primary job of Stage 1 is to ensure the right answer is in the building. Retrieval exists to satisfy latency, memory, and CPU budgets under a recall constraint, not to optimize relevance. Everything else is secondary.
The First Law of Search
"You cannot rank what you do not find."
The Recruiter Metaphor
To understand the distinct roles of retrieval and ranking, imagine a large tech company hiring for a job.
Retrieval (Stage 1) acts as the Recruiter. Their job is to source a pool of 1,000 potential candidates from LinkedIn. The Recruiter uses broad keywords like "Engineer" or "Software". Their goal is to maximize Recall. If they are too strict ("Must have 5 years of Rust experience and live in Boston"), they might accidentally filter out the perfect candidate who has 4.5 years of experience or lives nearby. Being too precise here is dangerous.
Ranking (Stage 2) acts as the Hiring Manager. Their job is to interview the 10 candidates passed to them. Their goal is Precision. They spend hours (compute) with each candidate to ensure they are the perfect fit.
The critical lesson: If the Recruiter is "too smart" and skips a candidate, the Hiring Manager never meets them. The system fails, regardless of how good the Hiring Manager is.
Architecture of Constraints
We design the system backwards from constraints. We know the Ranker is expensive (e.g., tens of milliseconds per batch), so it can only score 500 documents. Therefore, Retrieval must filter 1,000,000,000 documents down to 500.
The Mathematics of Loss
Search is often visualized as a funnel, but mathematically it is a Chain of Independent Probabilities. A funnel implies flow; a chain implies dependency. For a user to find their answer, a sequence of events must occur successfully, and the probability of the final success is the product of the probabilities of each step.
P(Indexed)
Is the document technically ingestible? Is it in the shard we are querying? If not, P=0.
P(Retrieved)
Does the query match the document? Does it survive top-K selection under BM25 scoring?
P(Ranked)
Does the neural network think this is relevant enough to be in slot #1?
The "Unrecoverable Error" Rule
If P(Retrieved) = 0 then P(Success) = 0
It doesn't matter if your Ranker is 100% accurate (P=1.0). 1.0 × 0 = 0.
This mathematical reality forces Stage 1 to prioritize inclusion (Recall) over exclusion (Precision).
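The chain above can be sketched in a few lines. This is a minimal illustrative model (the function name and the probability values are assumptions, not measurements):

```python
# Model the pipeline as a chain of independent probabilities.
from math import prod

def p_success(p_indexed: float, p_retrieved: float, p_ranked: float) -> float:
    """Probability the user finds the answer: the product of every stage."""
    return prod([p_indexed, p_retrieved, p_ranked])

# A perfect Ranker cannot compensate for a retrieval miss:
print(p_success(1.0, 0.0, 1.0))       # 0.0 -- unrecoverable
# Small losses at every stage compound multiplicatively:
print(p_success(0.99, 0.90, 0.95))    # ~0.846, well below any single stage
```

The second call shows the subtler point: three stages that each look healthy in isolation still multiply down to a noticeably lower end-to-end success rate.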
Failure Mode A: Noise (False Positive)
The system returns "Apple Pie" when you search for "iPhone". This is a False Positive. Crucially, this is not a disaster. The subsequent Ranking stage (which is smarter) will examine "Apple Pie", realize it has low semantic similarity to "iPhone", and push it to page 10.
Failure Mode B: Silence (False Negative)
The system returns nothing when you search for "Running Kicks", even though "Nike Running Shoes" exists. The game is over. The Ranker never receives the correct document. It cannot fix what it cannot see. The user sees an empty page or irrelevant results.
The Candidate Set Contract
Retrieval is a Budget Allocator
Retrieval does not return "results" in the traditional sense. It returns a Candidate Set under a strict contract. It is not an accuracy system; it is a budget allocator.
- Must contain the true positives (Recall).
- Must be small enough to rank (Top-K).
- Must be cheap to compute (Latency).
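The three clauses of the contract can be expressed as explicit checks. This is a hypothetical sketch; the class, field names, and thresholds are illustrative, not a real API:

```python
# Minimal sketch of the Candidate Set contract as runnable checks.
from dataclasses import dataclass

@dataclass
class CandidateSet:
    doc_ids: list[int]
    latency_ms: float

    def satisfies_contract(self, truth: set[int], top_k: int, budget_ms: float) -> bool:
        found = truth & set(self.doc_ids)
        recall_ok = len(found) == len(truth)      # true positives are present
        size_ok = len(self.doc_ids) <= top_k      # small enough to rank
        cheap_ok = self.latency_ms <= budget_ms   # within the latency budget
        return recall_ok and size_ok and cheap_ok

cs = CandidateSet(doc_ids=[7, 42, 99], latency_ms=4.2)
print(cs.satisfies_contract(truth={42}, top_k=500, budget_ms=10.0))  # True
```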
The Vocabulary Gap
A major reason for low recall is the Vocabulary Gap. This is the linguistic mismatch between how users describe a problem and how your database describes the solution. Users use colloquialisms, slang, or vague concepts. Databases contain formal titles, part numbers, and marketing copy.
Consider a "Precision-First" system that fails because the tokens do not overlap: "cheap" != "budget" and "laptop" != "notebook". An exact match system (like a basic Inverted Index) returns 0 results, causing high churn.
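The failure mode can be demonstrated in a few lines. Both functions and the tiny synonym table are illustrative assumptions, standing in for an exact-match inverted index and a query-expansion step respectively:

```python
# Sketch of the Vocabulary Gap: strict token overlap vs. synonym expansion.
def exact_match(query: str, doc: str) -> bool:
    """Stand-in for an exact-match inverted index lookup."""
    return bool(set(query.lower().split()) & set(doc.lower().split()))

SYNONYMS = {"cheap": {"budget"}, "laptop": {"notebook"}}  # toy table

def expanded_match(query: str, doc: str) -> bool:
    """Same lookup, but the query terms are expanded with synonyms first."""
    terms = set(query.lower().split())
    for t in list(terms):
        terms |= SYNONYMS.get(t, set())
    return bool(terms & set(doc.lower().split()))

doc = "budget notebook 8GB RAM"
print(exact_match("cheap laptop", doc))     # False -- zero results, high churn
print(expanded_match("cheap laptop", doc))  # True  -- the gap is bridged
```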
The Cost of Intelligence
A common question is: "Why can't we just use the smart AI model for everything? Why do we need this 'dumb' retrieval stage at all?" The answer is scale. Intelligence is computationally expensive.
The table below illustrates the orders-of-magnitude difference in cost. BM25 (Inverted Index) is 100,000x cheaper than a Cross-Encoder (BERT). To search 1 billion documents in <100ms, we must use the cheap algorithm first to filter the list down.
| Algorithm | Complexity | Cost per Query | Ideal Role |
|---|---|---|---|
| Exact Match (Boolean) / Inverted Index | Sublinear | $0.000001 | Stage 1. Posting-list bounded, skip-based. Good for filtering IDs/Names. |
| Vector Search (HNSW) / Approximate Nearest Neighbor | Sublinear* | $0.001 (relative) | Stage 1. Bounded graph walk. Catches semantic misses. |
| Cross-Encoder (BERT) / Deep Neural Network | O(K × model) | $0.10 | Stage 2. Expensive per-pair scoring. Used only on top 50 results. |
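The table's dollar figures make the argument concrete. A quick back-of-the-envelope calculation, using only the illustrative per-query costs above:

```python
# Why Stage 2 cannot touch the full corpus: the Cross-Encoder cost
# from the table, applied with and without a Stage 1 filter.
N_DOCS = 1_000_000_000
COST_CROSS_ENCODER = 0.10   # $ per document pair (table above)
K = 50                       # table: "Used only on top 50 results"

print(f"Rank everything:   ${N_DOCS * COST_CROSS_ENCODER:,.0f} per query")
print(f"Rank top-{K} only:  ${K * COST_CROSS_ENCODER:,.2f} per query")
```

The first line comes out around $100,000,000 per query; the second around $5. Whatever the real numbers are in your stack, the ratio is the point: the expensive model is only affordable after a cheap filter.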
Two-Phase Retrieval (The Loop)
To balance cost and recall, many systems use a Two-Phase Retrieval process within Stage 1 itself.
Phase 1: Coarse Retrieval
Scans the entire index using the cheapest possible logic (Boolean/BM25). Returns a top-N of ~10,000 document IDs.
Cost: Microseconds

Phase 2: Refined Retrieval (Rescoring)
Takes those 10,000 IDs and rescores them using more expensive logic (e.g., phrase matching, lightweight vector distance) to pick the best 500.
Cost: Milliseconds

Multi-Channel Recall
The Union of Channels
Modern retrieval is not one algorithm. It is a Union of Recall Channels. No single algorithm catches every user intent.
Final Set = Union(Lexical, Semantic, Personalized, Curated)
- Lexical (BM25): Catches exact part numbers and names.
- Semantic (ANN): Catches concepts ("cheap" = "budget").
- Personalized: Bias towards user's gender/size/history.
When Recall Goes Too Far (Drift)
Is "More Recall" always the answer? No. There is a tipping point called Recall Drift. This happens when you widen the net so much that the valid results become statistically insignificant compared to the noise.
For example, if a user searches "Java Developer", and you aggressively expand synonyms to include "Coffee", you might retrieve 5,000 job descriptions and 5,000 coffee shop locations. If your Ranker (Stage 2) is only looking at the top 500 candidates, and the coffee shops randomly fill those slots, the valid job descriptions are pushed out.
The Signal-to-Noise Ratio (SNR)
As a rough rule, once RelCount / TotalRetrieved drops below 0.1% (let alone 0.01%), the Ranker begins to fail purely due to probability. You have buried the needle in the haystack.
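The "Java Developer" example above can be put in numbers. Assuming (for illustration) that the Ranker's 500 slots are filled roughly at random from the retrieved set, the expected number of valid results collapses as the noise grows:

```python
# Sketch of Recall Drift: expected valid results among the Ranker's slots,
# under the simplifying assumption that slots fill proportionally at random.
def expected_valid_in_slots(valid: int, noise: int, slots: int = 500) -> float:
    total = valid + noise
    return slots * valid / total

# 5,000 jobs vs 5,000 coffee shops: half the slots are still jobs.
print(expected_valid_in_slots(5_000, 5_000))       # 250.0
# 5,000 jobs buried in ~5M noisy documents: under one valid result survives.
print(expected_valid_in_slots(5_000, 4_995_000))   # 0.5
```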
Recall Gating & Budget Allocation
Because we have multiple channels, we must enforce Per-Channel Quotas. You never let one channel flood the candidate set.
Mechanisms like WAND (Weak AND) enforce the compute budget during scoring itself, not just at candidate selection.
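Per-channel quotas and the union of channels can be sketched together. This is an illustrative merge, not a production implementation; the channel names mirror the list above and the quota values are assumptions:

```python
# Multi-channel recall with per-channel quotas: no channel may flood the set.
def merge_channels(channels: dict[str, list[int]], quotas: dict[str, int]) -> list[int]:
    final: list[int] = []
    seen: set[int] = set()
    for name, doc_ids in channels.items():
        budget = quotas.get(name, 0)       # hard cap for this channel
        for doc_id in doc_ids[:budget]:
            if doc_id not in seen:         # dedupe across channels
                seen.add(doc_id)
                final.append(doc_id)
    return final

channels = {
    "lexical": [1, 2, 3, 4, 5],
    "semantic": [3, 6, 7],
    "personalized": [8],
}
print(merge_channels(channels, {"lexical": 3, "semantic": 2, "personalized": 1}))
# [1, 2, 3, 6, 8]
```

Note that document 3 appears in both the lexical and semantic channels but enters the candidate set only once; the quota is applied before the union, so a flood in one channel cannot crowd out the others.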
Query Understanding Before Retrieval
Strategies depend on Intent
Retrieval is not one fixed pipeline. The strategy changes based on Query Classification.
Head vs. Tail: How to Solve It
This brings us to the ultimate implementation strategy. How do we balance the Vocabulary Gap (Need Recall) with Drift (Need Precision)? We cannot use a single static configuration. We must use a Dynamic Strategy based on the query volume of the user's input.
The Head (Popular Queries)
The Characteristics
A "Head" query is short, popular, and unambiguous. E.g., "Nike" or "iPhone". We have massive amounts of data for these. The risk is not missing a result; the risk is performance and noise.
The Solution Logic
- Volume Analysis: Billions of results exist.
- Recall Risk: Near zero.
- Precision Risk: High (Drift).
- Action: Use operator: "AND". Tends towards strict matching.
The Tail (Rare Queries)
The Characteristics
A "Tail" query is long, specific, and rare. E.g., "red nike hiking waterproof mens 12". We have almost no exact matches. The risk is showing zero results.
The Solution Logic
- Volume Analysis: 0-5 results exist.
- Recall Risk: Extreme (The Vocabulary Gap).
- Action: Use operator: "OR". Tends towards expansion.
- Safety: Use minimum_should_match: "75%" to prevent total garbage.
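The Head/Tail switch can be sketched as a query builder in Elasticsearch-style match-query syntax. The 10,000 monthly-volume threshold is an assumed cutoff for illustration; tune it against your own query logs:

```python
# Dynamic operator selection based on query volume (Head vs. Tail).
def build_query(text: str, monthly_volume: int) -> dict:
    if monthly_volume > 10_000:
        # Head query: recall risk is near zero, precision (drift) is the danger.
        return {"match": {"title": {"query": text, "operator": "and"}}}
    # Tail query: the Vocabulary Gap is the danger, so expand with OR,
    # but keep a minimum_should_match floor to avoid total garbage.
    return {
        "match": {
            "title": {
                "query": text,
                "operator": "or",
                "minimum_should_match": "75%",
            }
        }
    }

print(build_query("nike", monthly_volume=2_000_000))
print(build_query("red nike hiking waterproof mens 12", monthly_volume=3))
```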
Diversity as a Recall Constraint
Why "Best" is Subjective
Traditional recall metrics assume there is a single "right" answer. In search, intent is often ambiguous. If a user searches for "Jaguar", what should the system return?
- A car enthusiast wants the luxury vehicle.
- A student researching biology wants the animal.
- A sports fan wants the Jacksonville Jaguars NFL team.
A Precision-First system will blindly guess the most statistically probable topic (likely Cars) and fill the page with 10 cars. This achieves 100% precision for the car lover, but 0% Recall for the other two groups. It is a failure.
A Recall-First system acknowledges the ambiguity. It purposely returns a mix: 5 Cars, 3 Animals, and 2 Teams. It delegates the final decision to the Ranker (which might know the user is a football fan) or simply presents the diversity to the user.
"It cannot sort what isn't there."
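One simple way to build such a mixed slate is round-robin interleaving across intent buckets. This is a hypothetical sketch; the bucket contents are illustrative and real systems weight buckets by estimated intent probability rather than treating them equally:

```python
# Diversity-aware slate filling for the ambiguous "Jaguar" query:
# round-robin across intent buckets so no single interpretation
# monopolizes the page.
def diversify(buckets: dict[str, list[str]], slots: int) -> list[str]:
    out: list[str] = []
    queues = {name: list(docs) for name, docs in buckets.items()}
    while len(out) < slots and any(queues.values()):
        for docs in queues.values():
            if docs and len(out) < slots:
                out.append(docs.pop(0))   # take one item per bucket per pass
    return out

buckets = {
    "cars": ["XF", "F-Type", "E-Pace", "XJ", "I-Pace"],
    "animals": ["Panthera onca", "habitat", "diet"],
    "nfl": ["Jaguars roster", "schedule"],
}
print(diversify(buckets, slots=10))
```

Every interpretation gets representation near the top of the slate; once a short bucket runs dry, its slots fall through to the remaining buckets.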
Negative Constraints (Hard Filters)
Hard Filters vs Soft Recall
There is one exception to the "Maximize Recall" rule: Hard Constraints. If a user applies a filter (e.g., "In Stock Only"), it is illegal to return out-of-stock items, even if they are relevant.
Note: These filters live inside the Retrieval Layer (as boolean clauses), not the Ranker. Why? Because you cannot rank what you are legally forbidden to show.
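As a sketch, in Elasticsearch-style bool-query syntax the distinction looks like this. The field names (`title`, `in_stock`) are illustrative assumptions:

```python
# Hard constraints as non-scoring boolean clauses in the retrieval layer.
def retrieval_query(text: str, in_stock_only: bool) -> dict:
    query: dict = {
        "bool": {
            # Soft recall clause: contributes to scoring, may be expanded.
            "should": [{"match": {"title": text}}],
            "minimum_should_match": 1,
        }
    }
    if in_stock_only:
        # Hard filter: out-of-stock documents never enter the candidate
        # set, so the Ranker never sees what it must not show.
        query["bool"]["filter"] = [{"term": {"in_stock": True}}]
    return query

print(retrieval_query("nike running shoes", in_stock_only=True))
```

Placing the constraint in a filter clause (rather than a ranking feature) makes the exclusion absolute and, as a bonus, typically cacheable, since filter clauses do not affect scores.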
Key Takeaways
The Unrecoverable Error
If Retrieval misses a document, the Ranker cannot save it. Retrieval sets the quality ceiling.
Vocabulary Gap
Users rarely search for the exact words in the document. We must bridge this gap.
Dynamic Logic
We treat Head (popular) and Tail (rare) queries completely differently to balance noise vs results.
Precision Costs Compute
We cannot run expensive models on 1B documents. We buy Recall cheaply at Stage 1.
Diversity is Safety
When intent is ambiguous, returning a diverse set protects against zero-recall failures.