Chapter 1.6
Defining Success Metrics for Search
How do you know if your search is good? It's not as simple as "users find what they want."
Offline Metrics
Measured on a fixed dataset, before deployment.
- ✓ Used for model development
- ✓ Doesn't require production traffic
- ✗ Doesn't capture real user behavior
Online Metrics
Measured on live traffic, after deployment.
- ✓ Reflects real user satisfaction
- ✗ Requires A/B testing infrastructure
- ✗ More expensive to measure
Offline Metrics
Precision@K
Of the top K results, how many are relevant?
Example: Top 5 results: 4 relevant, 1 not. Precision@5 = 0.80
Recall@K
Of all relevant items, how many are in top K?
Example: 100 relevant items exist, 40 in top 50. Recall@50 = 0.40
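To make these two worked examples concrete, here is a minimal Python sketch of both metrics (the document IDs and relevance judgments are made up for illustration):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Toy data matching the Precision@5 example above: 4 of the top 5 are relevant.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "b", "c", "e"}
print(precision_at_k(ranked, relevant, 5))  # 0.8
print(recall_at_k(ranked, relevant, 5))     # 1.0 (all 4 relevant items retrieved)
```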
NDCG (Normalized Discounted Cumulative Gain)
Quality of ranking, giving more weight to top positions.
Intuition:
A relevant item at #1 is worth more than the same item at #10
Range: 0 to 1 (1 is perfect ranking)
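A minimal sketch of the standard DCG/NDCG computation, using the usual log2 position discount (the graded relevance labels below are illustrative):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each position's gain is divided
    by log2(rank + 1), so top positions count more."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (perfectly sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance labels for the top 5 results (3 = perfect, 0 = irrelevant).
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))  # ~0.97: good but not perfect ordering
```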
MRR (Mean Reciprocal Rank)
How high is the first relevant result?
Use: When user typically only wants one result
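A minimal MRR sketch over a couple of toy queries (the IDs and relevance judgments are illustrative):

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """Average of 1/rank of the first relevant result per query
    (a query contributes 0 if it returns no relevant result)."""
    total = 0.0
    for ranked, relevant in zip(results_per_query, relevant_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results_per_query)

# Query 1: first hit at rank 1; query 2: first hit at rank 3.
print(mean_reciprocal_rank(
    [["a", "b"], ["x", "y", "z"]],
    [{"a"}, {"z"}],
))  # (1/1 + 1/3) / 2 ≈ 0.67
```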
Online Metrics
| Metric | What It Measures | Benchmarks |
|---|---|---|
| CTR | % of searches that result in a click | E-commerce: 30-50% |
| Zero Result Rate (ZRR) | % of searches that return no results | Good: <5%, Bad: >10% |
| Reformulation Rate | % of searches followed by another search | Good: <15%, Bad: >30% |
| Time to First Click | How long until user clicks a result | Good: <5 seconds |
| Conversion Rate | % of searches that lead to purchase | Average: 2-5%, Good: 5-10% |
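These metrics are all simple ratios over search events. Here is a sketch of computing CTR and ZRR from a click log; the event schema (fields `clicked` and `num_results`) is a made-up example, not any particular logging system's format:

```python
# Compute online metrics from a list of search events. Each event is a
# dict with hypothetical fields "clicked" and "num_results".
def online_metrics(search_events):
    total = len(search_events)
    clicks = sum(1 for e in search_events if e["clicked"])
    zero_results = sum(1 for e in search_events if e["num_results"] == 0)
    return {
        "ctr": clicks / total,
        "zero_result_rate": zero_results / total,
    }

events = [
    {"clicked": True, "num_results": 12},
    {"clicked": False, "num_results": 0},
    {"clicked": True, "num_results": 5},
    {"clicked": False, "num_results": 8},
]
print(online_metrics(events))  # {'ctr': 0.5, 'zero_result_rate': 0.25}
```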
The Offline-Online Tradeoff
Offline and online metrics often disagree. Understanding why is critical.
| Scenario | Offline Result | Online Result | Why? |
|---|---|---|---|
| Better reranker model | NDCG +5% | CTR unchanged | Latency increased 200ms, users bounced |
| Add popular items boost | NDCG -2% | CTR +8% | Users prefer familiar items, labels are stale |
| Personalization | Can't measure | Conv +12% | Offline labels don't capture user preferences |
| Remove duplicates | Recall -10% | CTR +15% | Fewer total items, but users see more diverse results |
When to Trust Offline
- Comparing ranking algorithms (A vs B)
- Fast iteration during development
- When labels are high-quality and fresh
- Regression testing before deploy
When to Trust Online
- Measuring actual user satisfaction
- Personalization features
- Speed/latency tradeoffs
- Final go/no-go decision
🎯 The Golden Rule
Offline metrics gate deployment. Online metrics confirm success.
Never ship something that regresses offline AND online. But a small offline regression might be acceptable if online wins big.
Metric Traps
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure."
Vanity Metrics
Metrics that look good but don't matter, e.g., total search volume rising because users have to retry failing queries.
Missing Context
Same metric, different meaning: a high reformulation rate can signal failure in product search but normal exploration in research-style search.
Position Bias
Higher-ranked results get more clicks regardless of relevance, so raw CTR overstates the quality of top positions.
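One rough mitigation is to normalize a result's observed CTR by the average CTR historically seen at its rank. The sketch below illustrates the idea; the per-rank priors are invented, and a production system would use a proper click model rather than this simple ratio:

```python
# Rough position-bias adjustment: divide each result's observed CTR by
# the average CTR at that rank. Values > 1 mean the result out-performs
# its position; values < 1 mean it under-performs.
def position_adjusted_ctr(observed_ctr, rank, avg_ctr_by_rank):
    return observed_ctr / avg_ctr_by_rank[rank]

# Hypothetical priors: rank 1 gets clicked roughly 3-4x as often as rank 5.
avg_ctr_by_rank = {1: 0.30, 2: 0.18, 3: 0.12, 4: 0.10, 5: 0.08}
print(position_adjusted_ctr(0.15, 1, avg_ctr_by_rank))  # 0.5: weak for rank 1
print(position_adjusted_ctr(0.15, 5, avg_ctr_by_rank))  # ~1.9: strong for rank 5
```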
Recommended Dashboard
Daily Monitoring
| Metric | Target | Alert If |
|---|---|---|
| ZRR | <5% | >8% |
| CTR | >35% | <25% |
| P99 Latency | <200ms | >500ms |
| Search Volume | Near baseline | Drops >20% below baseline |
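A minimal sketch of these alert checks as code; the thresholds mirror the table above, while the metric values and field names are illustrative:

```python
# Check a day's metrics against the alert thresholds from the table.
def check_alerts(metrics):
    alerts = []
    if metrics["zrr"] > 0.08:
        alerts.append(f"ZRR {metrics['zrr']:.1%} above 8%")
    if metrics["ctr"] < 0.25:
        alerts.append(f"CTR {metrics['ctr']:.1%} below 25%")
    if metrics["p99_latency_ms"] > 500:
        alerts.append(f"P99 latency {metrics['p99_latency_ms']}ms above 500ms")
    if metrics["search_volume"] < 0.8 * metrics["baseline_volume"]:
        alerts.append("Search volume down more than 20% vs baseline")
    return alerts

print(check_alerts({
    "zrr": 0.09, "ctr": 0.31, "p99_latency_ms": 180,
    "search_volume": 70_000, "baseline_volume": 100_000,
}))  # ['ZRR 9.0% above 8%', 'Search volume down more than 20% vs baseline']
```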
Weekly Review
- Top zero-result queries → Add synonyms
- Low CTR queries → Check ranking
- Slow queries → Optimize or cache
- Conversion by query type → Business opportunities
Key Takeaways
Offline vs Online
Use offline metrics (NDCG) for development, online metrics (CTR, ZRR) for production.
Zero Result Rate
The most critical metric to fix first. Direct indicator of user failure.
CTR Trap
High CTR + Low Conversion = Clickbait. Trust conversion over clicks.
Golden Rule
Never ship something that regresses offline AND online. Use online to confirm offline wins.