Systems Atlas

Chapter 1.6

Defining Success Metrics for Search

How do you know if your search is good? It's not as simple as "users find what they want."


Offline Metrics

Measured on a fixed dataset, before deployment.

  • ✓ Used for model development
  • ✓ Doesn't require production traffic
  • ✗ Doesn't capture real user behavior

Online Metrics

Measured on live traffic, after deployment.

  • ✓ Reflects real user satisfaction
  • ✗ Requires A/B testing infrastructure
  • ✗ More expensive to measure

Offline Metrics

Precision@K

Of the top K results, how many are relevant?

Precision@K = (relevant items in top K) / K

Example: Top 5 results: 4 relevant, 1 not. Precision@5 = 0.80
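
A minimal sketch in Python (the function name and document IDs are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k

# Top 5 results, 4 relevant -> 0.80
print(precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d1", "d2", "d3", "d4"}, k=5))
```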

Recall@K

Of all relevant items, how many are in top K?

Recall@K = (relevant items in top K) / (total relevant items)

Example: 100 relevant items exist, 40 in top 50. Recall@50 = 0.40
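
The same pattern, but normalized by the total number of relevant items (the synthetic IDs below reproduce the 0.40 example):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    return sum(1 for item in retrieved[:k] if item in relevant) / len(relevant)

# 100 relevant items exist; 40 of them appear in the top 50 -> 0.40
relevant = {f"r{i}" for i in range(100)}
retrieved = [f"r{i}" for i in range(40)] + [f"x{i}" for i in range(10)]
print(recall_at_k(retrieved, relevant, k=50))
```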

NDCG (Normalized Discounted Cumulative Gain)

Quality of ranking, giving more weight to top positions.

Intuition:

Relevant item at #1 is better than at #10

Range: 0 to 1 (1 is perfect ranking)
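
A sketch of one common formulation, using linear gains with a log2 position discount (the 2^rel − 1 gain variant is also widely used):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering -> range 0..1."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Same three items, different order: the most relevant item at #1 beats it at #3
print(ndcg([3, 2, 0]))  # 1.0 (already the ideal order)
print(ndcg([0, 2, 3]))  # ~0.65
```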

MRR (Mean Reciprocal Rank)

How high is the first relevant result?

MRR = avg(1 / rank of first relevant result)

Use: When the user typically wants only one result
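
A sketch over a toy batch of two queries (the result lists and relevant sets are illustrative):

```python
def mrr(result_lists, relevant_sets):
    """Mean over queries of 1 / rank of the first relevant result (rank is 1-based)."""
    total = 0.0
    for results, relevant in zip(result_lists, relevant_sets):
        for rank, item in enumerate(results, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(result_lists)

# Query 1 hits at rank 1, query 2 at rank 3 -> (1 + 1/3) / 2 ≈ 0.67
print(mrr([["a", "b", "c"], ["x", "y", "a"]], [{"a"}, {"a"}]))
```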

Online Metrics

| Metric | What It Measures | Benchmarks |
| --- | --- | --- |
| CTR | % of searches that result in a click | E-commerce: 30-50% |
| Zero Result Rate (ZRR) | % of searches with no results | Good: <5%, Bad: >10% |
| Reformulation Rate | % of searches followed by another search | Good: <15%, Bad: >30% |
| Time to First Click | How long until the user clicks a result | Good: <5 seconds |
| Conversion Rate | % of searches that lead to a purchase | Average: 2-5%, Good: 5-10% |
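
These rates are computed from search logs. A sketch assuming a hypothetical log schema with result_count, clicked, and reformulated fields:

```python
def online_metrics(log):
    """CTR, zero-result rate, and reformulation rate over a batch of search events."""
    n = len(log)
    return {
        "ctr": sum(e["clicked"] for e in log) / n,
        "zero_result_rate": sum(e["result_count"] == 0 for e in log) / n,
        "reformulation_rate": sum(e["reformulated"] for e in log) / n,
    }

# Hypothetical log entries -- the schema is an assumption, not a standard
log = [
    {"query": "running shoes", "result_count": 42, "clicked": True,  "reformulated": False},
    {"query": "runing shoes",  "result_count": 0,  "clicked": False, "reformulated": True},
    {"query": "trail shoes",   "result_count": 17, "clicked": True,  "reformulated": False},
]
print(online_metrics(log))  # ctr ≈ 0.67, zero_result_rate ≈ 0.33, reformulation_rate ≈ 0.33
```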

The Offline-Online Tradeoff

Offline and online metrics often disagree. Understanding why is critical.

| Scenario | Offline Result | Online Result | Why? |
| --- | --- | --- | --- |
| Better reranker model | NDCG +5% | CTR unchanged | Latency increased 200ms; users bounced |
| Add popular-items boost | NDCG -2% | CTR +8% | Users prefer familiar items; relevance labels are stale |
| Personalization | Can't measure | Conversion +12% | Offline labels don't capture individual preferences |
| Remove duplicates | Recall -10% | CTR +15% | Fewer items, but the user sees more diversity |

When to Trust Offline

  • Comparing ranking algorithms (A vs B)
  • Fast iteration during development
  • When labels are high-quality and fresh
  • Regression testing before deploy

When to Trust Online

  • • Measuring actual user satisfaction
  • • Personalization features
  • • Speed/latency tradeoffs
  • • Final go/no-go decision

🎯 The Golden Rule

Offline metrics gate deployment. Online metrics confirm success.
Never ship something that regresses both offline AND online. But a small offline regression can be acceptable if the online win is large.
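
One way to sketch that gate in code; the cutoffs for a "small" offline loss and a "big" online win are illustrative assumptions, not industry standards:

```python
def ship_decision(offline_delta, online_delta):
    """Gate per the rule above. Deltas are relative changes, e.g. +0.05 = +5%.
    The 'small' / 'big' cutoffs below are illustrative assumptions."""
    if offline_delta < 0 and online_delta <= 0:
        return "no-ship"      # regresses offline with no online win: never ship
    if offline_delta >= 0 and online_delta > 0:
        return "ship"         # wins on both: confirmed
    if -0.02 <= offline_delta < 0 and online_delta >= 0.05:
        return "ship"         # small offline loss, big online win: acceptable
    return "investigate"      # mixed signal: dig into why the metrics disagree

print(ship_decision(offline_delta=-0.01, online_delta=0.08))  # ship
```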

Metric Traps

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

Example: Optimize for CTR → Show clickbait titles → Users click but don't convert

Vanity Metrics

Metrics that look good but don't matter.

Example: "We have 1M searches per day!" (But most return garbage)

Missing Context

Same metric, different meaning.

Example: A 0% ZRR could mean great coverage, or it could mean the engine returns irrelevant results instead of saying "no match"

Position Bias

Higher results get more clicks regardless of relevance.

Example: A high CTR on result #1 doesn't mean #1 is the best result; users click top positions regardless

Recommended Dashboard

Daily Monitoring

| Metric | Target | Alert If |
| --- | --- | --- |
| ZRR | <5% | >8% |
| CTR | >35% | <25% |
| P99 Latency | <200ms | >500ms |
| Search Volume | Baseline | -20% vs baseline |
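
A sketch of the daily alert check; the metric keys mirror the table above, and the search-volume entry assumes a precomputed percent change versus baseline:

```python
# Threshold checks mirroring the table; values are the "Alert If" column
ALERTS = {
    "zero_result_rate":   lambda v: v > 0.08,
    "ctr":                lambda v: v < 0.25,
    "p99_latency_ms":     lambda v: v > 500,
    "volume_vs_baseline": lambda v: v < -0.20,
}

def check_alerts(daily):
    """Return the names of metrics whose daily value crossed an alert threshold."""
    return [name for name, fires in ALERTS.items() if name in daily and fires(daily[name])]

print(check_alerts({"zero_result_rate": 0.09, "ctr": 0.31,
                    "p99_latency_ms": 180, "volume_vs_baseline": -0.05}))
# ['zero_result_rate']
```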

Weekly Review

  • Top zero-result queries → Add synonyms
  • Low CTR queries → Check ranking
  • Slow queries → Optimize or cache
  • Conversion by query type → Business opportunities

Key Takeaways

01

Offline vs Online

Use offline metrics (NDCG) for development, online metrics (CTR, ZRR) for production.

02

Zero Result Rate

The most critical metric to fix first. Direct indicator of user failure.

03

CTR Trap

High CTR + Low Conversion = Clickbait. Trust conversion over clicks.

04

Golden Rule

Never ship something that regresses offline AND online. Use online to confirm offline wins.