Systems Atlas
Chapter 6.12: Vector & Semantic Search

Evaluating Search Quality

Building a search system without measuring its quality is like driving blindfolded. This chapter covers the standard metrics (Recall, Precision, MRR, nDCG, MAP), how to build evaluation datasets, and the offline/online pipeline production teams use to continuously measure and improve quality.

1. Recall@K

Recall measures completeness: of all the documents in your corpus that are actually relevant to the query, what fraction did your system find in its top K results? It is the most important metric for the retrieval stage of a search pipeline. Why? Because retrieval is a bottleneck. If a relevant document isn't found in the initial retrieval step, no downstream reranker or LLM can magically conjure it.

A typical architecture retrieves 1,000 documents using fast vector search, then reranks the top 100 using a slower, more accurate cross-encoder. In this setup, your system's ultimate quality ceiling is strictly bounded by the Recall@1000 of the first stage.

recall_at_k.py
def recall_at_k(retrieved, relevant, k):
    """What fraction of relevant docs did we find in top K?"""
    retrieved_at_k = set(retrieved[:k])
    found = retrieved_at_k & relevant
    return len(found) / len(relevant)

# Relevant: {A, B, C, D, E}
# Retrieved top-10: [A, X, B, Y, Z, C, W, V, U, T]
# recall@10 = |{A,B,C}| / |{A,B,C,D,E}| = 3/5 = 0.60

2. Precision@K

While recall asks "did we find everything?", precision asks "is what we found useful?" It measures noise: what fraction of the top K results returned to the user are actually relevant to their query? A system that just returns the entire database has perfect recall, but terrible precision.

Precision is what users actually experience. If they look at the top 10 results and 8 of them are junk, they perceive the engine as broken, regardless of whether the 2 good results were the only relevant documents in the entire corpus. Precision@10 is heavily correlated with user satisfaction.

precision_at_k.py
def precision_at_k(retrieved, relevant, k):
    """What fraction of top K results are relevant?"""
    found = set(retrieved[:k]) & relevant
    return len(found) / k

# precision@10 = |{A,B,C}| / 10 = 0.30
# precision@3 = |{A,B}| / 3 = 0.67

3. Mean Reciprocal Rank (MRR)

MRR cares about exactly one thing: how far down the page did the user have to scroll to find the first relevant result? It takes the reciprocal of that rank (1/rank) and averages it across all queries. If the first relevant result is at position 1, the score is 1.0. If it's at position 10, the score is 0.1.

This is the perfect metric for "known-item" or navigational searches — instances where the user is looking for a very specific fact, FAQ answer, or product page, and they stop reading as soon as they find it. It penalizes systems heavily for burying the right answer.

Query 1: first relevant at rank 1 → 1/1 = 1.0
Query 2: first relevant at rank 3 → 1/3 = 0.33
Query 3: first relevant at rank 5 → 1/5 = 0.20
MRR = (1.0 + 0.33 + 0.20) / 3 = 0.51
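The computation above is small enough to sketch directly; `relevant` is assumed to be a set of known-good doc IDs per query:

```python
def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

runs = [
    (["A", "X", "Y"], {"A"}),             # first relevant at rank 1 -> 1.0
    (["X", "Y", "A"], {"A"}),             # rank 3 -> 0.33
    (["X", "Y", "Z", "W", "A"], {"A"}),   # rank 5 -> 0.20
]
print(round(mean_reciprocal_rank(runs), 2))  # 0.51
```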

4. nDCG@K (Gold Standard)

Binary metrics (relevant vs. irrelevant) fail to capture nuance. A perfect match document is better than a tangentially related one, and putting the perfect match at rank 1 is much better than putting it at rank 10. Normalized Discounted Cumulative Gain (nDCG) handles both of these requirements.

It uses graded relevance (e.g., 0=irrelevant, 1=marginal, 2=good, 3=perfect match) and applies a logarithmic discount based on position, meaning scores at the top of the list contribute far more to the final metric than scores at the bottom. The resulting number is normalized between 0 (worst) and 1 (best possible ranking). This is the primary metric used by major search companies and academic IR competitions.

ndcg.py
import math

def dcg_at_k(relevance_scores, k):
    """Sum of relevance / log2(rank+1). Higher ranks contribute more."""
    dcg = 0.0
    for i in range(min(k, len(relevance_scores))):
        dcg += relevance_scores[i] / math.log2(i + 2)
    return dcg

def ndcg_at_k(retrieved_rel, ideal_rel, k):
    """Normalize DCG by ideal (perfect) ranking. Returns 0.0-1.0."""
    actual = dcg_at_k(retrieved_rel, k)
    ideal = dcg_at_k(sorted(ideal_rel, reverse=True), k)
    return actual / ideal if ideal > 0 else 0.0

# System returns: [3, 2, 0, 1, 0]
# Ideal: [3, 2, 1, 0, 0]
# nDCG@5 = 4.69 / 4.76 = 0.985 (nearly perfect!)

5. MAP (Mean Average Precision)

If MRR focuses only on the first relevant document, and Precision@K ignores everything past K, Mean Average Precision (MAP) tries to evaluate the entire returned list. For a single query, it calculates the precision at every rank where a relevant document is found, and averages those precisions. Then it averages that number across all queries.

Because it requires knowing the total number of relevant documents for a query (to properly penalize for missing them), MAP is very difficult to compute accurately in production systems with millions of documents. It is mostly used in academic benchmarks where datasets are fully annotated.

Relevant: {A, B, C}, Retrieved: [A, X, B, Y, C]
A at rank 1: precision = 1/1 = 1.00
B at rank 3: precision = 2/3 = 0.67
C at rank 5: precision = 3/5 = 0.60
AP = (1.00 + 0.67 + 0.60) / 3 = 0.76
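The average precision for a single query can be sketched as follows. Note the division by the total number of relevant documents, not just those found, which is how AP penalizes misses (here all three are found, so it matches the worked example):

```python
def average_precision(retrieved, relevant):
    """Average the precision at each rank where a relevant doc appears,
    dividing by the total relevant count so missed docs count as zero."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant: {A, B, C}, Retrieved: [A, X, B, Y, C]
print(round(average_precision(["A", "X", "B", "Y", "C"], {"A", "B", "C"}), 2))  # 0.76
```

MAP is then simply the mean of `average_precision` over all queries in the evaluation set.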

6. Choosing the Right Metric

Metric       | Relevance | Position-aware? | Best For
Recall@K     | Binary    | No              | Retrieval stage evaluation
Precision@K  | Binary    | No              | User-facing result quality
MRR          | Binary    | Yes (first hit) | Q&A, navigation
nDCG@K       | Graded    | Yes (log)       | Overall ranking (gold standard)
MAP          | Binary    | Yes (avg)       | Academic benchmarks
Recommendation

Use Recall@K for retrieval (did we find the right candidates?) and nDCG@K for ranking (are they in the right order?). If you can only track one metric, use nDCG@10.


7. Building Evaluation Datasets

Metrics are meaningless without a dataset to run them against. You cannot evaluate a search engine by typing in 5 queries and eyeballing the results — you will inevitably overfit to those 5 queries and introduce regressions elsewhere. You need a static dataset of queries and known-good document mappings to run CI/CD pipelines against.

Building this dataset is often the highest-ROI activity a search team can do. It doesn't need to be massive: a golden set of 50-100 representative queries (mixed head, torso, and tail) with 10-20 judged documents per query is enough to detect meaningful regressions. There are three main ways to source these judgments:
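A golden set can be as simple as a JSONL file mapping queries to judged doc IDs. A minimal loader and recall sweep might look like this (the file schema and `search_fn` interface are illustrative assumptions, not a standard format):

```python
import json

def load_golden_set(path):
    """Each line: {"query": "...", "relevant": ["doc_id", ...]}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def mean_recall_at_k(golden, search_fn, k=100):
    """Average recall@k of search_fn over the golden set.
    search_fn(query) returns a ranked list of doc IDs."""
    total = 0.0
    for row in golden:
        relevant = set(row["relevant"])
        retrieved = search_fn(row["query"])[:k]
        total += len(set(retrieved) & relevant) / len(relevant)
    return total / len(golden)
```

Run this in CI on every ranking change and fail the build if the number drops below the last release's baseline.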

Human Annotation

Gold standard. 50-100 queries, 10-20 docs per query, 2+ annotators (Cohen's κ > 0.7). 1K-4K judgments in 1-2 days.

Include queries from real logs, especially hard/ambiguous ones.

Click-Through Data

Infer from behavior: clicked + long dwell (>30s) = relevant. Scales to millions of queries automatically.

Biased by position (rank 1 gets more clicks regardless).

LLM-as-Judge

~80-90% agreement with humans at ~$0.01-0.05 per judgment. Thousands in minutes. No annotator fatigue.

Has own biases, can be fooled by superficially relevant docs.


8. Online Evaluation Metrics

Offline metrics (nDCG, Recall) tell you if the algorithm is matching your expected judgments. Online metrics tell you if real users are actually finding value. The ultimate test of a search system is how humans interact with it in production.

These metrics are gathered entirely from telemetry and clickstream data. They are noisy — users click things by accident, or abandon searches because their phone rang — but in aggregate across thousands of queries, they provide the ground truth for A/B testing.

Metric             | Measures                  | Signal
Click-through rate | % queries with clicks     | Higher = results look relevant
Zero-result rate   | % queries with no results | Lower = better coverage
Abandonment rate   | % queries without clicks  | Lower = results useful
Mean dwell time    | Time on clicked results   | Longer = content matched intent
Reformulation rate | % immediate re-searches   | Lower = first results good
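Aggregating these from a clickstream is mechanical once the log schema is fixed. A sketch, assuming each log row carries the result count, clicked ranks, per-click dwell seconds, and a reformulation flag (a hypothetical schema, not a standard one):

```python
def online_metrics(log):
    """log: list of dicts with keys 'n_results', 'clicks' (list of ranks),
    'dwell' (seconds per clicked result), 'reformulated' (bool)."""
    n = len(log)
    dwells = [d for q in log for d in q["dwell"]]
    return {
        "ctr":                sum(1 for q in log if q["clicks"]) / n,
        "zero_result_rate":   sum(1 for q in log if q["n_results"] == 0) / n,
        "abandonment_rate":   sum(1 for q in log if not q["clicks"]) / n,
        "reformulation_rate": sum(1 for q in log if q["reformulated"]) / n,
        "mean_dwell_s":       sum(dwells) / len(dwells) if dwells else 0.0,
    }
```

Individual rows are noisy, so these only become meaningful aggregated over thousands of queries per experiment arm.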

9. The Offline → Online Pipeline

Mature search teams never deploy a ranking change directly to users. They follow a rigorous, multi-stage pipeline to ensure that systemic quality improves and catastrophic failures are caught early. Skipping steps in this pipeline is the primary cause of silent search degradation.

The process begins offline against the golden dataset (fast, safe), graduates to shadow deployment to catch latency/infrastructure issues, moves to a limited A/B test to measure actual user behavior, and only then rolls out fully.

1. Develop change (new model, params, chunking)
2. Offline evaluation against eval dataset
If metrics improve → proceed. If degrade → iterate.
3. Shadow deployment — run alongside old, compare
Log both results, don't show new to users.
4. A/B test — show new results to 5-10% of users
Monitor CTR, abandonment, dwell. Run 1-2 weeks.
5. Full rollout if online metrics improve. Rollback if not.
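Step 2 is typically enforced as an automated gate rather than a manual check: compare the candidate's offline metrics against the production baseline and block the change on any regression. A minimal sketch (metric names and the regression threshold are illustrative):

```python
def offline_gate(baseline, candidate, max_regression=0.01):
    """Return (passed, failures). Blocks the change if any tracked metric
    drops more than max_regression below the production baseline.
    baseline/candidate: dicts like {"recall@1000": 0.92, "ndcg@10": 0.71}."""
    failures = {m: (baseline[m], candidate[m])
                for m in baseline
                if candidate[m] < baseline[m] - max_regression}
    return (not failures, failures)

ok, failures = offline_gate({"recall@1000": 0.92, "ndcg@10": 0.71},
                            {"recall@1000": 0.93, "ndcg@10": 0.68})
# ok is False: ndcg@10 regressed by 0.03, beyond the 0.01 tolerance
```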

Key Takeaways

01

Recall@K Is the Foundation Metric

If relevant documents aren't in the candidate set, nothing downstream can fix it. For multi-stage pipelines (retrieve→rerank), recall@K of the retrieval stage determines the ceiling for the entire system.

02

nDCG@K Is the Gold Standard for Ranking

Handles graded relevance (not just binary) and position-aware weighting. If you can only track one metric, use nDCG@10. Primary metric used by academic IR competitions and major search companies.

03

Build an Evaluation Dataset (50-100 Queries)

Human annotation is gold standard. Include head/torso/tail queries from real query logs. 2+ annotators, Cohen's kappa > 0.7. Total: 1K-4K judgments, achievable in 1-2 days.

04

LLM-as-Judge: Scalable Alternative

~80-90% agreement with human judgments at ~$0.01-0.05 per judgment. Thousands of judgments in minutes. But: has its own biases, can be fooled by superficially relevant content.

05

The Full Pipeline: Offline → Shadow → A/B → Rollout

Offline eval on static dataset → shadow deploy (run alongside, don't show) → A/B test 5-10% of users for 1-2 weeks → full rollout if online metrics improve. Skipping steps causes silent degradation.