Evaluation Metrics for Embedding Models
How to measure whether your embedding model actually retrieves better results — nDCG, MRR, Recall@K, building reliable benchmarks, avoiding offline pitfalls, and the gap between offline metrics and live search quality.
- **1K–5K queries:** Good production benchmarks have 1K–5K queries with multi-document relevance judgments, not just single-positive pairs.
- **Slice-level reporting:** Aggregate metrics hide regressions. Head, tail, exact-match, and semantic slices each expose different model failure patterns.
- **nDCG, Recall, MRR:** Together these three capture ranking quality, coverage, and first-hit utility. None is sufficient on its own.
1. What Are We Measuring?
The goal of evaluation is to measure whether the embedding model returns the right documents at the top of the ranked list, across a representative sample of queries. This sounds simple, but it involves several moving pieces that can each fail silently: the quality of the ground truth labels, the representativeness of the query sample, the presence of false negatives in the annotation, and the difference between measuring rank quality vs. coverage.
A retrieval evaluation framework has three components: a set of queries, a document corpus, and relevance judgments (which documents are relevant for each query). The metrics compute some function of where the relevant documents appear in the model's ranked output. The higher the relevant documents rank, the better the metrics.
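The three components above can be sketched as plain data structures. The names and example entries here are illustrative, not from any specific evaluation library:

```python
# Queries: a representative sample of what users actually search for.
queries = {
    "q1": "how to reset a password",
    "q2": "refund policy for digital goods",
}

# Document corpus: the pool the model retrieves from.
corpus = {
    "d1": "Step-by-step guide to resetting your account password.",
    "d2": "Our refund policy covers digital purchases within 14 days.",
    "d3": "Company history and mission statement.",
}

# Relevance judgments (qrels): graded relevance per (query, document) pair.
# 2 = highly relevant, 1 = marginally relevant. Pairs absent from this map
# are treated as non-relevant, which is exactly where false negatives
# creep in (see the pitfalls section below).
qrels = {
    "q1": {"d1": 2},
    "q2": {"d2": 2},
}
```

All of the metrics that follow are functions of a model's ranked output and this `qrels` map.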
What offline retrieval metrics capture:

- Rank position of relevant documents
- Fraction of all relevant docs retrieved (coverage)
- Whether the very first result is useful
- Graded relevance (very relevant vs marginally relevant)

What they do not capture:

- Whether the retrieval model generalizes to new query types
- Real user satisfaction vs annotator preference
- Multi-turn or context-dependent relevance
- Latency or serving cost impact
2. nDCG@K: Normalized Discounted Cumulative Gain
nDCG is the most comprehensive retrieval metric. It measures both whether a document is retrieved and where in the ranking it appears. Higher-positioned results receive more credit. Results are discounted by log2(rank + 1), so rank-1 is worth much more than rank-10. The DCG score is then normalized by the ideal DCG (the best possible ranking for that query) to produce a score in [0, 1].
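A minimal sketch of nDCG@K following the definition above, using the log2(rank + 1) discount and linear gains (note that some implementations instead use exponential gains, 2^rel − 1; the normalization works the same way either way):

```python
import math

def dcg_at_k(ranked_ids, relevance, k):
    """DCG@K with the log2(rank + 1) discount.
    `relevance` maps doc id -> graded gain; missing ids count as 0."""
    return sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@K: DCG normalized by the ideal DCG, i.e. the DCG of the
    best possible ranking of the judged documents. Result is in [0, 1]."""
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = sum(
        g / math.log2(rank + 1)
        for rank, g in enumerate(ideal_gains, start=1)
    )
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_ids, relevance, k) / ideal_dcg
```

A perfect ranking scores 1.0; swapping a highly relevant document below a marginal one lowers the score, which is the position sensitivity the metric is designed to capture.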
3. MRR: Mean Reciprocal Rank
Mean Reciprocal Rank measures how high up the first correct result appears in the ranked list. Each query contributes the reciprocal rank (1/rank) of its first relevant document, and MRR averages this over all queries. If the first relevant document is at rank 1, the contribution is 1.0. At rank 3, it is 0.33. At rank 10, 0.1.
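The definition translates directly into a short function. A sketch, assuming each query's judged-relevant documents are given as a set of ids:

```python
def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank over a batch of queries.
    Each query contributes 1/rank of its first relevant result,
    or 0 if nothing relevant is retrieved at all."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)
```

Note that everything after the first relevant hit is ignored, which is why MRR alone says nothing about coverage.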
4. Recall@K, Precision@K, and MAP
Recall@K
Fraction of all relevant documents that appear in the top-K results. Measures coverage rather than ordering.
Recall@K = |relevant ∩ top-K| / |relevant|

Critical for RAG pipelines where you need to ensure the right context is always retrieved.
Precision@K
Fraction of the top-K results that are relevant. High Precision@K means fewer false positives shown to the user.
Precision@K = |relevant ∩ top-K| / K

Important when the user sees all K results (not just the top 1), such as in grid-based UIs.
MAP
Mean Average Precision. Averages Precision@K over all ranks where a relevant document appears. Sensitive to both rank position and completeness.
Less commonly used in modern practice — nDCG handles graded relevance better.
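The three formulas above can be sketched together. A minimal implementation, assuming binary relevance given as a set of doc ids per query:

```python
def recall_at_k(ranked_ids, relevant, k):
    """Fraction of all relevant documents that appear in the top-K."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-K results that are relevant."""
    return len(set(ranked_ids[:k]) & relevant) / k

def average_precision(ranked_ids, relevant):
    """AP: averages Precision@rank at each rank holding a relevant doc,
    divided by the total number of relevant docs (so missed documents
    pull the score down). MAP is the mean of AP across all queries."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant)
```

The division by `|relevant|` in AP is what makes it sensitive to completeness as well as rank position.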
5. Building a Reliable Benchmark
A benchmark is only as good as its query coverage, annotation quality, and leakage controls. Small benchmarks with biased queries give misleading feedback. The benchmark must represent the full distribution of production queries, not just the easy or common ones.
| Benchmark Size | Reliability | What It Can Distinguish |
|---|---|---|
| < 200 queries | Unreliable | Only large improvements. High noise from query sampling variance. |
| 200–500 queries | Minimal viable | Directionally correct for major changes. Not suitable for final rollout decisions. |
| 1K–5K queries | Production grade | Can distinguish +0.01 nDCG improvements across query slices. Required for gating releases. |
| 5K+ | High confidence | Rare to build this large. Useful for regression testing across many model versions. |
Evaluation Slices That Matter
Aggregate nDCG hides regressions on specific query types. Good evaluation always reports numbers per slice:
Head queries
High-volume, usually easier. The model will likely already perform well here. Watch for regression from aggressive domain adaptation.
Tail queries
Most queries in the distribution are here. Models often degrade significantly on tail queries when over-fitted to head traffic.
Exact-match / navigational queries
User is looking for a specific named document. Benchmark model vs BM25 here — embeddings sometimes lose on exact-match unless trained with lexical overlap.
Semantic / informational queries
User describes a concept without knowing the vocabulary. This is where embedding models should win over BM25 baseline.
6. Offline Evaluation Pitfalls
False Negatives in Ground Truth
If a truly relevant document is not annotated, every retrieval of that document counts as a mistake. This understates model quality and distorts the fine-tuning signal. Mitigation: use pooled annotation, judging the union of top-K results from several models rather than a single model's output.
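The pooling idea is simple to sketch: for each query, collect candidates for annotation from every available retrieval system (e.g. BM25, the dense model, a hybrid), not just the one being evaluated:

```python
def pooled_candidates(per_model_rankings, k):
    """Union of top-K results from several retrieval models for one query.
    Annotating this pooled set, rather than a single model's top-K,
    reduces the false-negative rate in the ground truth.
    `per_model_rankings` is a list of ranked doc-id lists, one per model."""
    pool = set()
    for ranking in per_model_rankings:
        pool.update(ranking[:k])
    return pool
```

The pool is a set, so documents retrieved by multiple models are only annotated once.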
Train/Test Leakage
If the same query (or a near-identical paraphrase) appears in both training and benchmark data, metrics are inflated. Split by query family, not by random row selection. Chronological splits additionally prevent future test data from leaking into training.
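One way to sketch a family-level split, assuming query-to-family assignments have already been computed (e.g. by paraphrase clustering — the `family_split` helper and its parameters are illustrative, not from any specific library):

```python
import hashlib

def family_split(query_families, test_fraction=0.2):
    """Deterministic train/test assignment by query family, so that all
    paraphrases of the same underlying query land on the same side.
    `query_families` maps query id -> family id."""
    train, test = [], []
    for qid, family in query_families.items():
        # Hash the family id, not the query id: paraphrases share a bucket.
        bucket = int(hashlib.sha256(family.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(qid)
    return train, test
```

Hashing the family id (rather than sampling rows at random) guarantees that no paraphrase of a training query ever reaches the test set, and the split stays stable across reruns.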
Wrong Unit of Measurement
If your benchmark uses passage-level relevance but your model is evaluated at document level (or vice versa), metric values are incomparable. Always align the unit of the annotation with the unit of retrieval.
ANN Index Approximation Effects
Approximate Nearest Neighbor indexes (HNSW, IVF) introduce recall loss at serving time that is not present during offline evaluation, which often uses exact search. An ANN recall of 95% means 5% of the exact-search results are never even scored in production. Evaluate with the same index configuration you serve with, or the offline numbers will be optimistic relative to what users actually see.
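Measuring this gap is straightforward once you have both result lists for a query: run exact (brute-force) search and the production ANN index, then compare. A sketch of the per-query comparison:

```python
def ann_recall_at_k(exact_top_k, ann_top_k):
    """Fraction of the exact-search top-K that the ANN index also returns.
    Averaged over a query sample, this quantifies the serving-time recall
    loss that exact-search offline evaluation hides."""
    if not exact_top_k:
        return 0.0
    return len(set(exact_top_k) & set(ann_top_k)) / len(exact_top_k)
```

Averaging this over the benchmark queries gives the index's effective recall, which should be reported alongside the model's own metrics.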
7. From Offline Metrics to Live Search
Offline evaluation gates which models are worth testing. A model that regresses on offline metrics should not proceed to an A/B experiment — it almost certainly performs worse in production. But a model that improves offline does not automatically deliver live improvements:
Offline nDCG improvement → Experiment
If a model improves nDCG@10 by ≥ 0.01–0.02 across all query slices (without cherry-picking), it is a strong candidate for A/B testing. The improvement should hold across head and tail, and both exact-match and semantic query types.
Offline improvement ≠ guaranteed live win
Stale labels, position bias in ground truth, retraining on label distribution that does not match real users, or downstream re-rankers mitigating the embedding's weaknesses can all decouple offline and online metrics. Live A/B on CTR, session success, and zero-result rate is the final decision point.
8. How to Read Metric Changes
Knowing that nDCG improved by 0.02 is not enough. Good evaluation analysis reads the direction of change across all slices and metrics simultaneously to understand what actually shifted.
| Pattern | Likely Explanation |
|---|---|
| nDCG +0.02, Recall +0.03, MRR flat | Model retrieves more complete result sets, but first-hit position unchanged |
| MRR +0.05, nDCG flat, Recall flat | Model improves first-hit accuracy but not comprehensive recall |
| nDCG +0.03 head, nDCG -0.01 tail | Head over-fitting — training signal dominated by popular queries |
| All metrics improve offline, regress online | Benchmark is stale, biased, or the model succeeds on annotation artifacts |
Key Takeaways
Retrieval evaluation is not classification evaluation
Accuracy and F1 do not apply to ranked retrieval results. The metrics that matter are nDCG@K (for graded ranking quality), Recall@K (for coverage), and MRR (for first-hit usefulness). Each measures a different property of search quality.
An offline benchmark that does not include query family slices is incomplete
A model can show a 3-point nDCG improvement on head queries while regressing on tail queries — and the aggregate metric will hide that regression. Good evaluation splits results by query type: head, tail, exact-match, semantic, navigational, informational.
Offline metrics gate experiments; online metrics decide deployment
Offline evaluation is a necessary filter — if a model fails offline, it almost certainly fails online. But it is not a sufficient predictor of online success. A model that wins offline can still lose live due to stale labels, position bias in ground truth, or real user behavior not captured in the benchmark.
False negatives in the benchmark understate your model's quality
If an actually-relevant document is not in your benchmark annotations, counting it as a miss means you are penalizing the model unfairly. The larger your corpus and the more queries you evaluate, the more your benchmark is contaminated with false negatives unless you take explicit mitigation steps.