Systems Atlas
Chapter 5.3: Scoring

Scoring: From TF-IDF to BM25

Boolean retrieval tells us what matches. Scoring tells us how well it matches.

This chapter traces the evolution from the naive TF-IDF model to the industry-standard BM25 algorithm. Understanding this progression is essential: BM25 is not magic — it is TF-IDF with two critical fixes.

Quick facts:

  • Default since 2015: the year Lucene/Elasticsearch switched from TF-IDF to BM25 as the default scorer.
  • k₁ default = 1.2: the sweet spot for saturation; lower values saturate faster.
  • b default = 0.75: length normalization strength (0 = off, 1 = full penalty).


1. From Boolean to Ranking

Boolean retrieval produces a set: {DocA, DocB, DocC}. It tells us which documents contain our query terms, but not which one is most relevant. Scoring transitions us from Sets to Lists.

We assume an additive model: if a query has 3 terms, the document's total score is the sum of contributions from each term. This is why the formula you'll see below has a "Σ" (sum) at the front: we compute a score per query term and add them all up.
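The additive model can be sketched in a few lines of Python. This is a toy illustration; the per-term scorer here is raw counting, not a real relevance function:

```python
def additive_score(query_terms, term_score, doc):
    """Total score = sum of independent per-term contributions."""
    return sum(term_score(term, doc) for term in query_terms)

# Toy per-term scorer: raw term frequency (1 point per occurrence).
doc = "blue jeans blue denim".split()
total = additive_score(["blue", "jeans"], lambda t, d: d.count(t), doc)
# "blue" contributes 2, "jeans" contributes 1, so total = 3
```

Everything that follows (TF-IDF, then BM25) only changes what `term_score` computes; the outer sum stays the same.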


2. The Ancestor: TF-IDF

Before BM25, there was TF-IDF (Term Frequency - Inverse Document Frequency). It was the first principled attempt to turn "counting words" into a mathematical relevance score. The idea is simple: words that appear often in a document should boost its score, but words that appear everywhere should be discounted.

2.1 Term Frequency (TF)

"How often does the word appear in this document?"

TF = count(term, document)

Intuition: A document mentioning "Jeans" 5 times is likely more relevant than one mentioning it once.

Every occurrence is treated as evidence of relevance. More evidence → higher score.

Problem: What if someone repeats "Jeans" 1,000 times? They get 1,000× the score. This is called keyword stuffing.

2.2 Inverse Document Frequency (IDF)

"How rare is the word across the entire corpus?"

IDF = log(N / n)

N = Total documents in corpus. n = Documents containing this term.

Intuition: Rooted in Information Theory. Rare events carry more information (Shannon Entropy).

The word "Zolpidem" appearing in 0.01% of docs is far more informative than "The" appearing in 100% of docs.

The Classic TF-IDF Formula
Score = Σ (TF × IDF)

For each query term, multiply its frequency in the document (local) by its rarity in the corpus (global), then sum.
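The classic formula translates almost directly into code. A minimal sketch, assuming tokenized documents as lists of words (a real system would use an inverted index rather than scanning the corpus per term):

```python
import math

def tf_idf_score(query: list[str], doc: list[str], corpus: list[list[str]]) -> float:
    """Classic TF-IDF: sum over query terms of TF (local) x IDF (global)."""
    N = len(corpus)
    score = 0.0
    for term in query:
        tf = doc.count(term)                     # raw count in this document
        n = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(N / n) if n else 0.0      # rarity across the corpus
        score += tf * idf
    return score

corpus = [
    "blue jeans slim fit".split(),
    "the quick brown fox".split(),
    "the blue sky".split(),
    "jeans jeans jeans sale".split(),
]
# "blue" and "jeans" each appear in 2 of 4 docs: idf = ln(2) each, tf = 1 in doc 0
print(tf_idf_score(["blue", "jeans"], corpus[0], corpus))  # ≈ 1.386
```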


3. Why TF-IDF Fails

TF-IDF works for small demos, but it breaks catastrophically in production systems. It has three fundamental flaws that allow bad actors to game the system and cause irrelevant documents to rank highly.

No Saturation

A document with 1,000 "Jeans" gets 1,000× the score. Keyword stuffing and spam win easily. There's no concept of "diminishing returns" — the 100th mention is as valuable as the 1st.

No Length Normalization

A 10,000-word essay will naturally contain more matches than a 10-word title just by random chance. Long documents dominate results even when less relevant.

Linear Growth

Going from 1→2 matches is a huge signal. Going from 100→101 is almost noise. TF-IDF treats both jumps identically (+1 to score).


4. BM25: The Industrial Fix

BM25 (Best Match 25) is not a new algorithm; it's TF-IDF with two critical patches. It keeps the good parts (Frequency + Rareness) but fixes the flaws by adding Saturation and Length Normalization.

BM25 has been the default scoring algorithm in Lucene, Elasticsearch, and Solr since 2015. When you search Google, Amazon, or any major search engine, the first-pass retrieval almost certainly uses BM25 or a close variant.

Key Insight

BM25 is a probabilistic model derived from the Robertson-Spärck Jones framework. The math is principled, not arbitrary — each component corresponds to a statistical assumption about relevance.


5. The Three BM25 Components

BM25 decomposes into three factors that are multiplied together for each query term. Understanding each component independently makes the full formula much less intimidating.

5.1 TF Saturation

Instead of using raw term count, BM25 applies a saturation curve. The curve has an asymptote: no matter how many times a word appears, its contribution approaches a maximum and never exceeds it. This prevents keyword stuffing from working.

Saturated TF = TF / (TF + k₁)

k₁ (typically 1.2) controls the curve shape. Low k₁ → saturates quickly (binary-like: term is present or not). High k₁ → saturates slowly (closer to raw TF). The default of 1.2 is a sweet spot for most corpora.

Example (k₁ = 1.2):
  TF = 1   → 1 / (1 + 1.2)     = 0.45
  TF = 5   → 5 / (5 + 1.2)     = 0.81
  TF = 100 → 100 / (100 + 1.2) = 0.99

Notice: Going from 1→5 adds 0.36. Going from 5→100 only adds 0.18. Diminishing returns!
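The saturation curve is one line of code; this sketch reproduces the table above:

```python
def saturated_tf(tf: float, k1: float = 1.2) -> float:
    """BM25 term-frequency saturation: contribution approaches 1.0 as tf grows."""
    return tf / (tf + k1)

for tf in (1, 5, 100):
    print(tf, round(saturated_tf(tf), 2))  # 0.45, 0.81, 0.99
```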

The Saturation Curve

[Figure: score contribution vs. term frequency (TF), plotted for k₁ = 0.1 (fast saturation), k₁ = 1.2 (default), and k₁ = 5.0 (slow saturation)]

What to notice: even with huge TF (e.g., 20 spammy mentions), the score never exceeds 1.0. Lower k₁ creates an L-shape (instant saturation); higher k₁ looks more linear.

5.2 IDF (Rareness)

Same core idea as TF-IDF: rare words carry more signal than common words. BM25 uses a slightly different formula with a probabilistic derivation that avoids negative weights for very common terms.

IDF = log(1 + (N - n + 0.5) / (n + 0.5))
N = total docs, n = docs containing term

The +0.5 smoothing ensures mathematical stability when a term appears in more than 50% of documents. Without it, IDF could go negative, which would mean common words actively hurt relevance: not what we want.

  • Common word ("the"): IDF ≈ 0 (no signal)
  • Rare word ("Zolpidem"): IDF ≈ high (strong signal)

Example (N = 1,000,000 documents):
  "the" in 950,000 docs:
    IDF = log(1 + (1,000,000 - 950,000 + 0.5) / (950,000 + 0.5)) ≈ 0.05
  "zolpidem" in 100 docs:
    IDF = log(1 + (1,000,000 - 100 + 0.5) / (100 + 0.5)) ≈ 9.21

The rare medical term carries ~180× more signal than the common word "the".
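A sketch of the smoothed IDF that reproduces these numbers:

```python
import math

def bm25_idf(N: int, n: int) -> float:
    """BM25 IDF with +0.5 smoothing: always non-negative."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 1_000_000
print(round(bm25_idf(N, 950_000), 2))  # 0.05  ("the": common, nearly no signal)
print(round(bm25_idf(N, 100), 2))      # 9.21  ("zolpidem": rare, strong signal)
```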

5.3 Length Normalization

Long documents naturally contain more term occurrences just by having more words. We need to penalize them to give short, dense documents a fair chance. A 2-word title saying "Blue Jeans" should beat a 10,000-word essay that happens to mention "blue" and "jeans" once each.

Length Factor = 1 - b + b × (|D| / avgdl)

|D| = document length; avgdl = average document length in the corpus. b (typically 0.75) controls normalization strength: b = 0 means no normalization (length ignored), b = 1 means full normalization (strict penalty for long docs).

Example (avgdl = 100 words, b = 0.75):
  Short doc (10 words):    1 - 0.75 + 0.75 × (10 / 100)  = 0.325 → score boosted ~3×
  Average doc (100 words): 1 - 0.75 + 0.75 × (100 / 100) = 1.0   → no change
  Long doc (500 words):    1 - 0.75 + 0.75 × (500 / 100) = 4.0   → score divided by 4

This factor goes in the denominator, so larger values = lower scores. Short docs get a boost!
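The length factor is equally simple to compute; this sketch reproduces the three cases above:

```python
def length_factor(doc_len: int, avgdl: float, b: float = 0.75) -> float:
    """BM25 length normalization; appears in the score's denominator,
    so values > 1 penalize long docs and values < 1 boost short ones."""
    return 1 - b + b * (doc_len / avgdl)

for doc_len in (10, 100, 500):
    print(doc_len, length_factor(doc_len, 100))  # 0.325, 1.0, 4.0
```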


6. The Full BM25 Formula

Now we can assemble the complete equation. For each query term, we compute an IDF weight multiplied by a saturated TF that's normalized by document length. The final score is the sum across all query terms.

Score(D, Q) = Σ IDF(qᵢ) × [TF × (k₁ + 1)] / [TF + k₁ × (1 - b + b × |D| / avgdl)]

  • IDF: global (corpus-level) rarity weight
  • Saturation: per-term diminishing returns, controlled by k₁
  • Length: per-document normalization, controlled by b
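All three components can be assembled into a small end-to-end scorer. A self-contained sketch that brute-forces a toy corpus (a real engine uses the index-driven execution model described next):

```python
import math

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Sum of per-term BM25 contributions: IDF x saturated, length-normalized TF."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    norm = 1 - b + b * (len(doc) / avgdl)        # length factor (denominator)
    score = 0.0
    for term in query:
        n = sum(1 for d in corpus if term in d)  # docs containing the term
        idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
        tf = doc.count(term)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm)
    return score
```

Note that a query term absent from the document has tf = 0 and contributes nothing, so only matching terms affect the sum.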

7. Execution Model

The search engine doesn't loop over all documents; that would be O(N) per query, far too slow. Instead, it loops over query terms and uses the inverted index to find only the documents that contain each term. Scores are accumulated in a sparse data structure.

bm25_scoring.py
def score_query(query: list[str], index: InvertedIndex) -> dict[int, float]:
    """Score all documents matching query terms using BM25."""
    scores: dict[int, float] = {}  # sparse accumulator

    for term in query:
        idf = get_idf(term)  # cached, O(1)

        for doc_id, tf in index.postings(term):
            norm = norms[doc_id]  # 1-byte lookup
            scores[doc_id] = scores.get(doc_id, 0.0) + bm25(tf, idf, norm)

    return scores
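To make the postings loop concrete, here is a toy inverted index: a plain dict standing in for the `InvertedIndex` above (real indexes store compressed postings on disk):

```python
from collections import defaultdict

def build_index(corpus: list[list[str]]) -> dict[str, list[tuple[int, int]]]:
    """Inverted index: term -> sorted postings of (doc_id, term frequency)."""
    counts: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, doc in enumerate(corpus):
        for term in doc:
            counts[term][doc_id] = counts[term].get(doc_id, 0) + 1
    return {term: sorted(p.items()) for term, p in counts.items()}

corpus = [["blue", "jeans"], ["blue", "sky"], ["quick", "fox"]]
index = build_index(corpus)
# For the query term "blue", only docs 0 and 1 are visited; doc 2 is never touched.
print(index["blue"])  # [(0, 1), (1, 1)]
```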

8. Worked Example: "Blue Jeans"

Let's compare TF-IDF and BM25 on a simple query. We have two documents: Doc A is a short product title. Doc B is a long blog post that happens to mention the keywords more often.

TF-IDF Result (naive):
  Doc A: "Blue Jeans" (2 words):                 2.0
  Doc B: Long essay (1,000 words, 10 matches):  20.0
  ⚠ Long Doc B wins unfairly due to more matches.

BM25 Result (robust):
  Doc A: "Blue Jeans" (2 words):                2.2
  Doc B: Long essay (1,000 words, 10 matches):  0.8
  ✓ Concise Doc A wins. Length normalization works.

9. Parameter Behavior

BM25 has two tunable parameters: k₁ and b. The defaults (k₁ = 1.2, b = 0.75) work well for most cases. Only tune these if you have a proper evaluation dataset and metrics pipeline; otherwise you're shooting in the dark.

  • k₁ = 0: binary mode; only term presence matters.
  • k₁ = 1.2: default; balanced saturation.
  • b = 0: no length normalization.
  • b = 0.75: default; moderate penalty for long docs.
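The edge cases can be verified numerically. A sketch of the TF factor alone (IDF is omitted since neither parameter affects it):

```python
def tf_factor(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Saturated, length-normalized TF component of BM25."""
    norm = 1 - b + b * (doc_len / avgdl)
    return (tf * (k1 + 1)) / (tf + k1 * norm)

# k1 = 0: binary mode -- any positive tf gives the same contribution (1.0).
print(tf_factor(1, 100, 100, k1=0), tf_factor(50, 100, 100, k1=0))

# b = 0: length ignored -- a 10-word and a 1,000-word doc score identically.
print(tf_factor(3, 10, 100, b=0), tf_factor(3, 1000, 100, b=0))
```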

10. Lucene Implementation

In practice, BM25 is implemented in BM25Similarity.java inside Lucene. The implementation is highly optimized for cache efficiency and minimal memory access.

BM25Similarity.java

All scores are computed as 32-bit floats. IDF is pre-computed once at query time. The saturation curve is evaluated per posting.

Norms (.nvd files)

Document lengths are not stored as raw integers. They are compressed to 1 byte per document. This trades slight precision for massive RAM savings (billions of docs × 1 byte is manageable).
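The idea behind 1-byte norms can be illustrated with a toy logarithmic quantizer. Note this is NOT Lucene's actual SmallFloat encoding, just a sketch of the same space-for-precision trade-off:

```python
import math

def encode_norm(doc_len: int) -> int:
    """Quantize a document length into one byte on a log scale (toy scheme)."""
    return min(255, round(math.log2(max(doc_len, 1)) * 16))

def decode_norm(byte: int) -> float:
    """Recover an approximate length from the stored byte."""
    return 2.0 ** (byte / 16)

b = encode_norm(1000)    # fits in one byte (0..255)
approx = decode_norm(b)  # close to 1000, with a small quantization error
```

A ~2% error in document length barely moves a BM25 score, but storing 1 byte instead of 4 per document saves gigabytes at billion-document scale.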


11. What BM25 Does NOT Do

BM25 is a lexical matcher. It counts words. It has no understanding of meaning, context, or the world. These limitations are fundamental: they're not bugs that can be patched.

  • No semantic understanding. "laptop" and "computer" are completely different words to BM25.
  • No phrase support. "New York" is just two words: "New" and "York". Word order is ignored.
  • No authority signals. A spam page and Wikipedia are treated equally if they have the same TF/IDF.
  • No freshness. A document from 1990 and one from today score identically.

12. Position in the Stack

BM25 is typically the "Stage 1.5" scorer: it runs after boolean retrieval and before expensive re-ranking. Its job is to take millions of candidates and narrow them down to thousands that can be re-ranked by slower, more accurate models.

Retrieval (Boolean / Index) → BM25 (Top 1000) → Re-Ranking (AI / LTR)

Key Takeaways

1. TF-IDF is the ancestor. It introduced the core ideas (count + rareness) but failed at scale due to its lack of controls.

2. BM25 is the fix. It adds saturation (k₁) and length normalization (b) to make TF-IDF robust against spam and length bias.

3. Saturation matters. The first mention of a word is far more valuable than the hundredth.

4. Unbeatable baseline. Despite AI advances, BM25 remains the fastest, cheapest, and most effective zero-shot retriever.