Chapter 1.6
Defining Success Metrics for Search
How do you know if your search is good? It's not as simple as "users find what they want."
Offline Metrics
Measured on a fixed dataset, before deployment.
- ✓ Used for model development
- ✓ Doesn't require production traffic
- ✗ Doesn't capture real user behavior
Online Metrics
Measured on live traffic, after deployment.
- ✓ Reflects real user satisfaction
- ✗ Requires A/B testing infrastructure
- ✗ More expensive to measure
Offline Metrics
Precision@K
Of the top K results, how many are relevant?
Example: Top 5 results: 4 relevant, 1 not. Precision@5 = 0.80
Recall@K
Of all relevant items, how many are in top K?
Example: 100 relevant items exist, 40 in top 50. Recall@50 = 0.40
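To make these two worked examples concrete, here is a minimal Python sketch of both metrics (the document IDs and relevance judgments are made up for illustration):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Toy data matching the Precision@5 example above: 4 of the top 5 are relevant.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "b", "c", "e"}
print(precision_at_k(ranked, relevant, 5))  # 0.8
print(recall_at_k(ranked, relevant, 5))     # 1.0 (all 4 relevant items retrieved)
```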
NDCG (Normalized Discounted Cumulative Gain)
Quality of ranking, giving more weight to top positions.
Intuition:
A relevant item at #1 is worth more than the same item at #10
Range: 0 to 1 (1 is perfect ranking)
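A minimal sketch of the standard DCG/NDCG computation, using the usual log2 position discount (the graded relevance labels below are illustrative):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each position's gain is divided
    by log2(rank + 1), so top positions count more."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (perfectly sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance labels for the top 5 results (3 = perfect, 0 = irrelevant).
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))  # ~0.97: good but not perfect ordering
```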
MRR (Mean Reciprocal Rank)
How high is the first relevant result?
Use: When user typically only wants one result
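A minimal MRR sketch over a couple of toy queries (the IDs and relevance judgments are illustrative):

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """Average of 1/rank of the first relevant result per query
    (a query contributes 0 if it returns no relevant result)."""
    total = 0.0
    for ranked, relevant in zip(results_per_query, relevant_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results_per_query)

# Query 1: first hit at rank 1; query 2: first hit at rank 3.
print(mean_reciprocal_rank(
    [["a", "b"], ["x", "y", "z"]],
    [{"a"}, {"z"}],
))  # (1/1 + 1/3) / 2 ≈ 0.67
```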
Online Metrics
| Metric | What It Measures | Benchmarks |
|---|---|---|
| CTR | % of searches that result in a click | E-commerce: 30-50% |
| Zero Result Rate (ZRR) | % of searches that return no results | Good: <5%, Bad: >10% |
| Reformulation Rate | % of searches followed by another search | Good: <15%, Bad: >30% |
| Time to First Click | How long until user clicks a result | Good: <5 seconds |
| Conversion Rate | % of searches that lead to purchase | Average: 2-5%, Good: 5-10% |
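These metrics are all simple ratios over search events. Here is a sketch of computing CTR and ZRR from a click log; the event schema (fields `clicked` and `num_results`) is a made-up example, not any particular logging system's format:

```python
# Compute online metrics from a list of search events. Each event is a
# dict with hypothetical fields "clicked" and "num_results".
def online_metrics(search_events):
    total = len(search_events)
    clicks = sum(1 for e in search_events if e["clicked"])
    zero_results = sum(1 for e in search_events if e["num_results"] == 0)
    return {
        "ctr": clicks / total,
        "zero_result_rate": zero_results / total,
    }

events = [
    {"clicked": True, "num_results": 12},
    {"clicked": False, "num_results": 0},
    {"clicked": True, "num_results": 5},
    {"clicked": False, "num_results": 8},
]
print(online_metrics(events))  # {'ctr': 0.5, 'zero_result_rate': 0.25}
```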
The Offline-Online Tradeoff
Offline and online metrics often disagree. Understanding why is critical.
| Scenario | Offline Result | Online Result | Why? |
|---|---|---|---|
| Better reranker model | NDCG +5% | CTR unchanged | Latency increased 200ms, users bounced |
| Add popular items boost | NDCG -2% | CTR +8% | Users prefer familiar items, labels are stale |
| Personalization | Can't measure | Conv +12% | Offline labels don't capture user preferences |
| Remove duplicates | Recall -10% | CTR +15% | Fewer total items, but users see more diverse results |
When to Trust Offline
- Comparing ranking algorithms (A vs B)
- Fast iteration during development
- When labels are high-quality and fresh
- Regression testing before deploy
When to Trust Online
- Measuring actual user satisfaction
- Personalization features
- Speed/latency tradeoffs
- Final go/no-go decision
🎯 The Golden Rule
Offline metrics gate deployment. Online metrics confirm success.
Never ship something that regresses offline AND online. But a small offline regression might be acceptable if online wins big.
Metric Traps
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure."
Vanity Metrics
Metrics that look good but don't matter, e.g., total search volume rising because users have to retry failing queries.
Missing Context
Same metric, different meaning: a high reformulation rate can signal failure in product search but normal exploration in research-style search.
Position Bias
Higher-ranked results get more clicks regardless of relevance, so raw CTR overstates the quality of top positions.
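One rough mitigation is to normalize a result's observed CTR by the average CTR historically seen at its rank. The sketch below illustrates the idea; the per-rank priors are invented, and a production system would use a proper click model rather than this simple ratio:

```python
# Rough position-bias adjustment: divide each result's observed CTR by
# the average CTR at that rank. Values > 1 mean the result out-performs
# its position; values < 1 mean it under-performs.
def position_adjusted_ctr(observed_ctr, rank, avg_ctr_by_rank):
    return observed_ctr / avg_ctr_by_rank[rank]

# Hypothetical priors: rank 1 gets clicked roughly 3-4x as often as rank 5.
avg_ctr_by_rank = {1: 0.30, 2: 0.18, 3: 0.12, 4: 0.10, 5: 0.08}
print(position_adjusted_ctr(0.15, 1, avg_ctr_by_rank))  # 0.5: weak for rank 1
print(position_adjusted_ctr(0.15, 5, avg_ctr_by_rank))  # ~1.9: strong for rank 5
```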
Recommended Dashboard
Daily Monitoring
| Metric | Target | Alert If |
|---|---|---|
| ZRR | <5% | >8% |
| CTR | >35% | <25% |
| P99 Latency | <200ms | >500ms |
| Search Volume | Near baseline | Drops >20% below baseline |
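A minimal sketch of these alert checks as code; the thresholds mirror the table above, while the metric values and field names are illustrative:

```python
# Check a day's metrics against the alert thresholds from the table.
def check_alerts(metrics):
    alerts = []
    if metrics["zrr"] > 0.08:
        alerts.append(f"ZRR {metrics['zrr']:.1%} above 8%")
    if metrics["ctr"] < 0.25:
        alerts.append(f"CTR {metrics['ctr']:.1%} below 25%")
    if metrics["p99_latency_ms"] > 500:
        alerts.append(f"P99 latency {metrics['p99_latency_ms']}ms above 500ms")
    if metrics["search_volume"] < 0.8 * metrics["baseline_volume"]:
        alerts.append("Search volume down more than 20% vs baseline")
    return alerts

print(check_alerts({
    "zrr": 0.09, "ctr": 0.31, "p99_latency_ms": 180,
    "search_volume": 70_000, "baseline_volume": 100_000,
}))  # ['ZRR 9.0% above 8%', 'Search volume down more than 20% vs baseline']
```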
Weekly Review
- Top zero-result queries → Add synonyms
- Low CTR queries → Check ranking
- Slow queries → Optimize or cache
- Conversion by query type → Business opportunities
Key Takeaways
Offline vs Online
Use offline metrics (NDCG) for development, online metrics (CTR, ZRR) for production.
Zero Result Rate
The most critical metric to fix first. Direct indicator of user failure.
CTR Trap
High CTR + Low Conversion = Clickbait. Trust conversion over clicks.
Golden Rule
Never ship something that regresses offline AND online. Use online to confirm offline wins.