Scoring: From TF-IDF to BM25
Boolean retrieval tells us what matches. Scoring tells us how well it matches.
This chapter traces the evolution from the naive TF-IDF model to the industry-standard BM25 algorithm. Understanding this progression is essential: BM25 is not magic — it is TF-IDF with two critical fixes.
- 2015: the year Lucene/Elasticsearch switched their default scoring from TF-IDF to BM25.
- k₁ = 1.2: the sweet spot for saturation. Lower = faster saturation.
- b = 0.75: length normalization strength. 0 = off, 1 = full penalty.
1. From Boolean to Ranking
Boolean retrieval produces a set: {DocA, DocB, DocC}. It tells us which documents contain our query terms, but not which one is most relevant. Scoring moves us from sets to ranked lists.
We assume an additive model: if a query has 3 terms, the document's total score is the sum of contributions from each term. This is why the formula you'll see below has a "Σ" (sum) at the front: we compute a score per query term and add them all up.
2. The Ancestor: TF-IDF
Before BM25, there was TF-IDF (Term Frequency - Inverse Document Frequency). It was the first principled attempt to turn "counting words" into a mathematical relevance score. The idea is simple: words that appear often in a document should boost its score, but words that appear everywhere should be discounted.
2.1 Term Frequency (TF)
"How often does the word appear in this document?"
Intuition: A document mentioning "Jeans" 5 times is likely more relevant than one mentioning it once.
Every occurrence is treated as evidence of relevance. More evidence → higher score.
Problem: What if someone repeats "Jeans" 1,000 times? They get 1,000× the score. This is called keyword stuffing.
2.2 Inverse Document Frequency (IDF)
"How rare is the word across the entire corpus?"
N = Total documents in corpus. n = Documents containing this term.
Intuition: Rooted in Information Theory. Rare events carry more information (Shannon Entropy).
The word "Zolpidem" appearing in 0.01% of docs is far more informative than "The" appearing in 100% of docs.
For each query term, multiply its frequency in the document (local) by its rarity in the corpus (global), then sum.
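That local-times-global recipe can be sketched in a few lines of Python. The corpus here is a toy example invented for illustration, and the formula is the naive tf · log(N/n), not a production implementation:

```python
import math

def tf_idf_score(query_terms, doc_terms, corpus):
    """Naive TF-IDF: sum over query terms of (local frequency x global rarity)."""
    N = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)               # local evidence in this doc
        n = sum(1 for d in corpus if term in d)  # docs containing the term
        idf = math.log(N / n) if n else 0.0      # rarity across the corpus
        score += tf * idf
    return score

corpus = [
    ["blue", "jeans", "sale"],
    ["red", "shirt"],
    ["the", "history", "of", "jeans"],
]
print(tf_idf_score(["blue", "jeans"], corpus[0], corpus))  # ≈ 1.50
```

Note that "jeans" (in 2 of 3 docs) contributes less per occurrence than "blue" (in 1 of 3): rarity is doing its job.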
3. Why TF-IDF Fails
TF-IDF works for small demos, but it breaks catastrophically in production systems. It has three fundamental flaws that allow bad actors to game the system and cause irrelevant documents to rank highly.
- No saturation. A document with 1,000 "Jeans" gets 1,000× the score. Keyword stuffing and spam win easily. There's no concept of "diminishing returns": the 100th mention is as valuable as the 1st.
- Length bias. A 10,000-word essay will naturally contain more matches than a 10-word title just by random chance. Long documents dominate results even when less relevant.
- Linear increments. Going from 1→2 matches is a huge signal. Going from 100→101 is almost noise. TF-IDF treats both jumps identically (+1 to score).
4. BM25: The Industrial Fix
BM25 (Best Match 25) is not a new algorithm: it's TF-IDF with two critical patches. It keeps the good parts (Frequency + Rareness) but fixes the flaws by adding Saturation and Length Normalization.
BM25 has been the default scoring algorithm in Lucene, Elasticsearch, and Solr since 2015. When you search Google, Amazon, or any major search engine, the first-pass retrieval almost certainly uses BM25 or a close variant.
BM25 is a probabilistic model derived from the Robertson-Spärck Jones framework. The math is principled, not arbitrary — each component corresponds to a statistical assumption about relevance.
5. The Three BM25 Components
BM25 decomposes into three factors that are multiplied together for each query term. Understanding each component independently makes the full formula much less intimidating.
5.1 TF Saturation
Instead of using raw term count, BM25 applies a saturation curve. The curve has an asymptote: no matter how many times a word appears, its contribution approaches a maximum and never exceeds it. This prevents keyword stuffing from working.
k₁ (typically 1.2) controls the curve shape. Low k₁ → saturates quickly (binary-like: term is present or not). High k₁ → saturates slowly (closer to raw TF). The default of 1.2 is a sweet spot for most corpora.
Notice: Going from 1→5 adds ~0.35. Going from 5→100 only adds ~0.18. Diminishing returns!
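Those numbers fall out of the curve itself. Using the simplified form tf / (tf + k₁) with the default k₁ = 1.2 (the (k₁ + 1) factor in the full formula only rescales the curve), a quick sketch:

```python
def saturation(tf, k1=1.2):
    """BM25 term-frequency saturation: grows quickly at first, bounded above by 1.0."""
    return tf / (tf + k1)

# The marginal value of each extra occurrence shrinks toward zero.
for tf in (1, 5, 100, 1000):
    print(tf, round(saturation(tf), 3))
```

Even a million repetitions of a term can never push its contribution past the asymptote, which is exactly why stuffing stops paying off.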
The Saturation Curve
[Figure: term frequency (TF) vs. score contribution. A low k₁ creates an L-shape (near-instant saturation); a higher k₁ looks more linear.]
5.2 IDF (Rareness)
Same core idea as TF-IDF: rare words carry more signal than common words. BM25 uses a slightly different formula with a probabilistic derivation that avoids negative weights for very common terms.
The +0.5 terms smooth the probability estimates. On top of that, Lucene's variant adds 1 inside the logarithm, which keeps IDF non-negative even when a term appears in more than 50% of documents. Without that guard, IDF could go negative, which would mean common words actively hurt relevance: not what we want.
The rare medical term carries ~180× more signal than the common word "the".
5.3 Length Normalization
Long documents naturally contain more term occurrences just by having more words. We need to penalize them to give short, dense documents a fair chance. A 2-word title saying "Blue Jeans" should beat a 10,000-word essay that happens to mention "blue" and "jeans" once each.
|D| = document length. avgdl = average document length in corpus. b (typically 0.75) controls normalization strength: b = 0 means no normalization (length ignored); b = 1 means full normalization (strict penalty for long docs).
This factor goes in the denominator, so larger values = lower scores. Short docs get a boost!
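A minimal sketch of the normalization factor, assuming a hypothetical average document length of 500 terms:

```python
def length_norm(doc_len, avgdl, b=0.75):
    """BM25 length factor: 1.0 for an average-length doc, >1 for longer docs.
    It sits in the denominator, so >1 means a lower final score."""
    return 1 - b + b * (doc_len / avgdl)

avgdl = 500
for doc_len in (2, 500, 10_000):
    print(doc_len, round(length_norm(doc_len, avgdl), 3))
```

A 2-word title yields a factor below 1 (a boost), an average doc yields exactly 1, and a 10,000-word essay yields a factor far above 1 (a penalty).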
6. The Full BM25 Formula
Now we can assemble the complete equation. For each query term, we compute an IDF weight multiplied by a saturated TF that's normalized by document length. The final score is the sum across all query terms.
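In symbols: score(D, Q) = Σ over t in Q of IDF(t) · tf(t, D) · (k₁ + 1) / (tf(t, D) + k₁ · (1 − b + b · |D| / avgdl)). A self-contained sketch, using a toy corpus invented for illustration:

```python
import math

def bm25(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Full BM25: IDF weight x saturated, length-normalized TF, summed per query term."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    norm = 1 - b + b * (len(doc_terms) / avgdl)       # length penalty factor
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        n = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (N - n + 0.5) / (n + 0.5))  # Lucene-style IDF
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm)
    return score

docs = [
    ["blue", "jeans"],
    ["red", "shirt", "blue"],
    ["the", "history", "of", "jeans", "a", "long", "story"],
]
for d in docs:
    print(d, round(bm25(["blue", "jeans"], d, docs), 3))
```

The short title containing both terms outscores the longer documents that each match only one.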
7. Execution Model
The search engine doesn't loop over all documents. That would be O(N) which is far too slow. Instead, it loops over query terms and uses the inverted index to find only documents that contain each term. Scores are accumulated in a sparse data structure.
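A toy term-at-a-time scorer makes this concrete. The postings and IDF weights below are hand-built stand-ins for a real index, and length normalization is omitted to keep the accumulation logic visible:

```python
from collections import defaultdict

# Toy inverted index: term -> postings list of (doc_id, term_frequency).
index = {
    "blue":  [(0, 1), (1, 1)],
    "jeans": [(0, 1), (2, 1)],
}
idf = {"blue": 0.8, "jeans": 1.1}  # hypothetical precomputed IDF weights

def score_query(query, index, idf, k1=1.2):
    """Loop over query terms, not documents; accumulate scores sparsely."""
    scores = defaultdict(float)    # only docs that match at least one term
    for term in query:
        for doc_id, tf in index.get(term, []):
            # Saturated TF contribution (length norm omitted for brevity).
            scores[doc_id] += idf[term] * tf * (k1 + 1) / (tf + k1)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score_query(["blue", "jeans"], index, idf))
```

Documents containing neither term are never touched, which is what keeps scoring far cheaper than a full O(N) scan.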
8. Worked Example: "Blue Jeans"
Let's compare TF-IDF and BM25 on a simple query. We have two documents: Doc A is a short product title. Doc B is a long blog post that happens to mention the keywords more often.
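With hypothetical statistics (IDF weights of 1.0 and 2.0, an average document length of 200, one mention of each term in the title versus five in the blog post), a quick calculation shows the two models diverge:

```python
def tfidf(tf, idf):
    """Naive TF-IDF contribution: unbounded and length-blind."""
    return tf * idf

def bm25_term(tf, idf, dl, avgdl, k1=1.2, b=0.75):
    """BM25 contribution: saturated TF divided by a length penalty."""
    norm = 1 - b + b * dl / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

idfs = {"blue": 1.0, "jeans": 2.0}  # assumed rarity weights
avgdl = 200

# Doc A: 2-word title, each term once. Doc B: 10,000-word post, each term 5x.
A = sum(tfidf(1, w) for w in idfs.values())
B = sum(tfidf(5, w) for w in idfs.values())
print("TF-IDF:", A, B)  # the long post wins on raw counts

A25 = sum(bm25_term(1, w, 2, avgdl) for w in idfs.values())
B25 = sum(bm25_term(5, w, 10_000, avgdl) for w in idfs.values())
print("BM25:  ", round(A25, 2), round(B25, 2))  # the short title wins
```

Same inputs, opposite rankings: saturation caps Doc B's repeated mentions while length normalization rewards Doc A's density.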
TF-IDF Result: Naive. The long blog post wins purely on raw term counts.
BM25 Result: Robust. Saturation caps the repeated mentions and length normalization penalizes the long post, so the short title wins.
9. Parameter Behavior
BM25 has two tunable parameters: k₁ and b. The defaults (k₁=1.2, b=0.75) work well for most cases. Only tune these if you have a proper evaluation dataset and metrics pipeline; otherwise you're shooting in the dark.
10. Lucene Implementation
In practice, BM25 is implemented in BM25Similarity.java inside Lucene. The implementation is highly optimized for cache efficiency and minimal memory access.
BM25Similarity.java
All scores are computed as 32-bit floats. IDF is pre-computed once at query time. The saturation curve is evaluated per posting.
Norms (.nvd files)
Document lengths are not stored as raw integers. They are compressed to 1 byte per document. This trades slight precision for massive RAM savings (billions of docs × 1 byte is manageable).
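A purely illustrative quantizer (not Lucene's actual SmallFloat encoding) shows the trade-off: exact lengths are lost, but one byte per document is enough to recover a good approximation:

```python
def encode_length(doc_len):
    """Quantize a document length into one byte (0-255) on a sqrt scale."""
    return min(255, int(doc_len ** 0.5))

def decode_length(byte):
    """Recover an approximate length from the stored byte."""
    return byte * byte

for dl in (4, 100, 10_000, 1_000_000):
    b = encode_length(dl)
    print(dl, "->", b, "->", decode_length(b))
```

Short documents round-trip almost exactly, while very long ones blur together; since BM25 only needs |D| / avgdl as a rough penalty, that imprecision barely affects ranking.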
11. What BM25 Does NOT Do
BM25 is a lexical matcher. It counts words. It has no understanding of meaning, context, or the world. These limitations are fundamental: they're not bugs that can be patched.
- ✗ No semantic understanding. "laptop" and "computer" are completely different words to BM25.
- ✗ No phrase support. "New York" is just two words: "New" and "York". Word order is ignored.
- ✗ No authority signals. A spam page and Wikipedia are treated equally if they have the same TF/IDF.
- ✗ No freshness. A document from 1990 and one from today score identically.
12. Position in the Stack
BM25 is typically the "Stage 1.5" scorer: it runs after boolean retrieval and before expensive re-ranking. Its job is to take millions of candidates and narrow them down to thousands that can be re-ranked by slower, more accurate models.
Key Takeaways
TF-IDF is the Ancestor
It introduced the core ideas (Count + Rareness), but failed at scale due to lack of controls.
BM25 is the Fix
It adds Saturation (k1) and Length Normalization (b) to make TF-IDF robust against spam and length bias.
Saturation Matters
The first mention of a word is infinitely more valuable than the hundredth.
Unbeatable Baseline
Despite advances in neural retrieval, BM25 remains a fast, cheap, and remarkably hard-to-beat zero-shot baseline.