Systems Atlas
Chapter 7.5: Training Embedding Models

Evaluation Metrics for Embedding Models

How to measure whether your embedding model actually retrieves better results — nDCG, MRR, Recall@K, building reliable benchmarks, avoiding offline pitfalls, and the gap between offline metrics and live search quality.

Benchmark Size

1K–5K queries

Good production benchmarks have 1K–5K queries with multi-document relevance judgments, not just single-positive pairs.

Query Slices

Essential

Aggregate metrics hide regressions. Head, tail, exact-match, and semantic slices each expose different model failure patterns.

Core Metrics

nDCG, Recall, MRR

Together these three capture ranking quality, coverage, and first-hit utility. None is sufficient on its own.

Key Distinction: Retrieval evaluation is not classification evaluation. You are measuring ranked lists, not binary predictions. The position of a relevant document in the result set matters. A correct answer at rank 10 is much less valuable than the same answer at rank 1.

1. What Are We Measuring?

The goal of evaluation is to measure whether the embedding model returns the right documents at the top of the ranked list, across a representative sample of queries. This sounds simple, but it involves several moving pieces that can each fail silently: the quality of the ground truth labels, the representativeness of the query sample, the presence of false negatives in the annotation, and the difference between measuring rank quality vs. coverage.

A retrieval evaluation framework has three components: a set of queries, a document corpus, and relevance judgments (which documents are relevant for each query). The metrics compute some function of where the relevant documents appear in the model's ranked output. The higher the relevant documents rank, the better the metrics.
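As a concrete sketch with illustrative data (the `qrels` name follows the TREC convention of mapping each query to its judged documents; all ids here are hypothetical):

```python
# The three components of a retrieval evaluation framework.
corpus = {
    "d1": "How to reset your password",
    "d2": "Billing FAQ",
    "d3": "Password strength requirements",
}
queries = {"q1": "forgot my password"}
# qrels: query id -> {doc id: graded relevance}; d2 is judged not relevant
qrels = {"q1": {"d1": 3, "d3": 1}}

# A model produces a ranked list per query; the metrics score where the
# judged-relevant documents land in that list.
ranked = ["d1", "d3", "d2"]  # hypothetical model output for q1
```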

What good metrics capture
  • Rank position of relevant documents
  • Fraction of all relevant docs retrieved (coverage)
  • Whether the very first result is useful
  • Graded relevance (very relevant vs marginally relevant)
What metrics cannot capture alone
  • Whether the retrieval model generalizes to new query types
  • Real user satisfaction vs annotator preference
  • Multi-turn or context-dependent relevance
  • Latency or serving cost impact

2. nDCG@K: Normalized Discounted Cumulative Gain

nDCG is the most comprehensive retrieval metric. It measures both whether a document is retrieved and where in the ranking it appears. Higher-positioned results receive more credit. Results are discounted by log2(rank + 1), so rank-1 is worth much more than rank-10. The DCG score is then normalized by the ideal DCG (the best possible ranking for that query) to produce a score in [0, 1].

ndcg.py
import math

def dcg(relevances):
    # relevances: list of graded relevance scores at each rank
    return sum(
        rel / math.log2(rank + 2)
        for rank, rel in enumerate(relevances)
    )

def ndcg_at_k(retrieved_relevances, ideal_relevances, k=10):
    actual = dcg(retrieved_relevances[:k])
    best = dcg(sorted(ideal_relevances, reverse=True)[:k])
    return actual / best if best > 0 else 0.0
When to use: nDCG is the primary metric when documents have graded relevance (very relevant, partially relevant, not relevant). It penalizes rank position continuously. nDCG@10 is the most common production metric for search ranking evaluation.

3. MRR: Mean Reciprocal Rank

Mean Reciprocal Rank measures how high up the first correct result appears in the ranked list. Each query contributes 1/rank of its first relevant document: rank 1 gives 1.0, rank 3 gives 0.33, rank 10 gives 0.1. MRR is the mean of these per-query reciprocal ranks.

mrr.py
def reciprocal_rank(ranked_docs, relevant_docs):
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(q.results, q.relevant) for q in queries) / len(queries)
When to use: MRR is best when there is a single best answer (FAQ systems, customer support). It strongly rewards getting the right answer at rank 1 and heavily penalizes it appearing lower. Less appropriate when multiple good answers exist.

4. Recall@K, Precision@K, and MAP

Recall@K

Fraction of all relevant documents that appear in the top-K results. Measures coverage rather than ordering.

Recall@K = |relevant ∩ top-K| / |relevant|

Critical for RAG pipelines where you need to ensure the right context is always retrieved.

Precision@K

Fraction of the top-K results that are relevant. High Precision@K means fewer false positives shown to the user.

Precision@K = |relevant ∩ top-K| / K

Important when the user sees all K results (not just the top 1), such as in grid-based UIs.
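Both counts come from the same intersection of the top-K with the judged-relevant set; a minimal sketch (the evaluation loop later in this chapter assumes a `recall_at_k` helper with this shape):

```python
def recall_at_k(ranked_docs, relevant_docs, k=10):
    # Fraction of all relevant documents that appear in the top-K.
    if not relevant_docs:
        return 0.0
    hits = len(set(ranked_docs[:k]) & set(relevant_docs))
    return hits / len(relevant_docs)

def precision_at_k(ranked_docs, relevant_docs, k=10):
    # Fraction of the top-K results that are relevant.
    hits = len(set(ranked_docs[:k]) & set(relevant_docs))
    return hits / k
```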

MAP

Mean Average Precision. Averages Precision@K over all ranks where a relevant document appears. Sensitive to both rank position and completeness.

Less commonly used in modern practice — nDCG handles graded relevance better.
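For completeness, a minimal sketch of MAP with binary relevance, matching the conventions of the earlier snippets (queries expose `results` and `relevant`):

```python
def average_precision(ranked_docs, relevant_docs):
    # Precision@K averaged over the ranks where a relevant doc appears,
    # divided by the total number of relevant docs (misses count as 0).
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_docs) if relevant_docs else 0.0

def mean_average_precision(queries):
    return sum(average_precision(q.results, q.relevant) for q in queries) / len(queries)
```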


5. Building a Reliable Benchmark

A benchmark is only as good as its query coverage, annotation quality, and leakage controls. Small benchmarks with biased queries give misleading feedback. The benchmark must represent the full distribution of production queries, not just the easy or common ones.

Benchmark Size | Reliability | What It Can Distinguish
< 200 queries | Unreliable | Only large improvements. High noise from query sampling variance.
200–500 queries | Minimal viable | Directionally correct for major changes. Not suitable for final rollout decisions.
1K–5K queries | Production grade | Can distinguish +0.01 nDCG improvements across query slices. Required for gating releases.
5K+ queries | High confidence | Rare to build this large. Useful for regression testing across many model versions.
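One way to check whether a benchmark of a given size can resolve a metric delta is a paired bootstrap over queries; a minimal sketch (function name and defaults are illustrative):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    # scores_a / scores_b: per-query metric values (e.g. nDCG@10) for two
    # models, aligned by query. Resample queries with replacement and count
    # how often model B's mean beats model A's.
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_b[i] - scores_a[i] for i in idx) / n
        if delta > 0:
            wins += 1
    return wins / n_resamples  # near 1.0: B reliably better; near 0.5: noise
```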

Evaluation Slices That Matter

Aggregate nDCG hides regressions on specific query types. Good evaluation always reports numbers per slice:

Head queries (top 5% by frequency)

High-volume, usually easier. The model will likely already perform well here. Watch for regression from aggressive domain adaptation.

Tail queries (bottom 50% by frequency)

Most queries in the distribution are here. Models often degrade significantly on tail queries when over-fitted to head traffic.

Exact-match intent

The user is looking for a specific named document. Benchmark the model against BM25 here; embeddings sometimes lose on exact-match queries unless trained with lexical overlap.

Semantic/paraphrase intent

The user describes a concept without knowing the vocabulary. This is where embedding models should win over the BM25 baseline.


6. Offline Evaluation Pitfalls

False Negatives in Ground Truth

If a truly relevant document is not annotated, every retrieval of that document counts as a mistake. This understates model quality and distorts the fine-tuning signal. Mitigation: use a pooled annotation strategy, judging the top-K results from multiple models rather than just one.
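A pooled-annotation pass can be sketched as follows, assuming each retriever exposes a hypothetical `retrieve(query, k)` method:

```python
def build_annotation_pool(query, models, k=20):
    # Union of the top-K results from several retrievers (dense, BM25, ...).
    # Annotating this pool instead of one model's output reduces the chance
    # that a genuinely relevant document is simply never judged.
    pool = set()
    for model in models:
        pool.update(model.retrieve(query, k=k))
    return sorted(pool)  # send to annotators in a stable order
```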

Train/Test Leakage

If the same query (or a near-identical paraphrase) appears in both training and benchmark data, metrics are inflated. Split by query family, not by random row selection. Chronological splits additionally prevent future test data from leaking into training.
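A minimal sketch of family-based splitting, using a deliberately naive normalization as the family key (real systems often cluster paraphrases by embedding similarity instead):

```python
import hashlib

def family_key(query):
    # Naive family key: lowercase and collapse whitespace so trivial
    # variants ("Reset  Password" vs "reset password") share a key.
    return " ".join(query.lower().split())

def split_by_family(queries, test_fraction=0.2):
    # Hash the family key so every member of a family lands on the same
    # side of the split, deterministically.
    train, test = [], []
    for q in queries:
        h = int(hashlib.md5(family_key(q).encode()).hexdigest(), 16)
        (test if (h % 100) < test_fraction * 100 else train).append(q)
    return train, test
```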

Wrong Unit of Measurement

If your benchmark uses passage-level relevance but your model is evaluated at document level (or vice versa), metric values are incomparable. Always align the unit of the annotation with the unit of retrieval.

ANN Index Approximation Effects

Approximate Nearest Neighbor indexes (HNSW, IVF) introduce recall loss at serving time that is not present during offline evaluation, which often uses exact search. An ANN recall of 95% means 5% of the exact nearest neighbors are never even scored in production; that 95%, not the exact-search number, is the realistic production baseline.
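Measuring this gap is straightforward once you can run both exact and approximate search over the same corpus; a sketch:

```python
def ann_recall(exact_topk, ann_topk):
    # Fraction of the exact (brute-force) nearest neighbors that the
    # ANN index also returns. Both arguments are lists of doc ids.
    return len(set(exact_topk) & set(ann_topk)) / len(exact_topk)
```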


7. From Offline Metrics to Live Search

Offline evaluation gates which models are worth testing. A model that regresses on offline metrics should not proceed to an A/B experiment — it almost certainly performs worse in production. But a model that improves offline does not automatically deliver live improvements:

Offline nDCG improvement → Experiment

If a model improves nDCG@10 by ≥ 0.01–0.02 across all query slices (without cherry-picking), it is a strong candidate for A/B testing. The improvement should hold across head and tail, and both exact-match and semantic query types.

Offline improvement ≠ guaranteed live win

Stale labels, position bias in ground truth, retraining on label distribution that does not match real users, or downstream re-rankers mitigating the embedding's weaknesses can all decouple offline and online metrics. Live A/B on CTR, session success, and zero-result rate is the final decision point.

Practical Evaluation Workflow

evaluate_model.py
from statistics import mean

def evaluate_model(model, benchmark, k=10):
    results = {}
    for slice_name, queries in benchmark.slices.items():
        ndcg_scores, mrr_scores, recall_scores = [], [], []
        for q in queries:
            ranked = model.retrieve(q.text, k=k)  # ranked list of doc ids
            # ndcg_at_k expects relevance grades, not doc ids: map each
            # retrieved doc to its judged grade (q.grades: doc id -> grade,
            # unjudged docs default to 0)
            retrieved_rels = [q.grades.get(doc, 0) for doc in ranked]
            ndcg_scores.append(ndcg_at_k(retrieved_rels, list(q.grades.values()), k))
            mrr_scores.append(reciprocal_rank(ranked, q.relevant))
            recall_scores.append(recall_at_k(ranked, q.relevant, k))
        results[slice_name] = {
            "nDCG@K": mean(ndcg_scores),
            "MRR": mean(mrr_scores),
            "Recall@K": mean(recall_scores),
        }
    return results

8. How to Read Metric Changes

Knowing that nDCG improved by 0.02 is not enough. Good evaluation analysis reads the direction of change across all slices and metrics simultaneously to understand what actually shifted.

Pattern | Likely Explanation
nDCG +0.02, Recall +0.03, MRR flat | Model retrieves more complete result sets, but first-hit position unchanged
MRR +0.05, nDCG flat, Recall flat | Model improves first-hit accuracy but not comprehensive recall
nDCG +0.03 head, nDCG -0.01 tail | Head over-fitting: training signal dominated by popular queries
All metrics improve offline, regress online | Benchmark is stale, biased, or the model succeeds on annotation artifacts

Key Takeaways

01

Retrieval evaluation is not classification evaluation

Accuracy and F1 do not apply to ranked retrieval results. The metrics that matter are nDCG@K (for graded ranking quality), Recall@K (for coverage), and MRR (for first-hit usefulness). Each measures a different property of search quality.

02

An offline benchmark that does not include query family slices is incomplete

A model can show a 3-point nDCG improvement on head queries while regressing on tail queries — and the aggregate metric will hide that regression. Good evaluation splits results by query type: head, tail, exact-match, semantic, navigational, informational.

03

Offline metrics gate experiments; online metrics decide deployment

Offline evaluation is a necessary filter — if a model fails offline, it almost certainly fails online. But it is not a sufficient predictor of online success. A model that wins offline can still lose live due to stale labels, position bias in ground truth, or real user behavior not captured in the benchmark.

04

False negatives in the benchmark understate your model's quality

If an actually-relevant document is not in your benchmark annotations, counting it as a miss means you are penalizing the model unfairly. The larger your corpus and the more queries you evaluate, the more your benchmark is contaminated with false negatives unless you take explicit mitigation steps.