Chapter 3.2: Indexing & Infrastructure
The Inverted Index
The data structure that makes text search possible. By flipping the relationship from "document → words" to "word → documents", we achieve O(1) lookups across billions of documents.
The Inversion That Changed Everything
A traditional database stores documents and scans each one to find matches. An inverted index pre-computes the answer to "which documents contain this word?" for every word in the corpus.
Anatomy of an Inverted Index
1Term Dictionary (FST)
The term dictionary is a sorted index of every unique word. It uses a Finite State Transducer (FST) a compressed prefix tree that shares common prefixes and suffixes.
2Posting Lists (Compressed)
Each term points to a list of document IDs. These lists can contain millions of entries, so we apply heavy compression.
The term "the" appears in 50 million documents.
Uncompressed: 50M × 4 bytes = 200 MB
Compressed: ~5 MB (40x smaller!)
How a Query Executes
Let's trace exactly what happens when you search for "wireless keyboard"
Parse & Analyze Query
Look Up Each Term in Dictionary
Retrieve Posting Lists
Intersect Lists (AND Query)
Two-pointer merge algorithm. Both lists are sorted, so we walk them together.
Score & Rank (BM25)
Boolean Operations on Posting Lists
AND (Intersection)
OR (Union)
NOT (Difference)
The Analysis Pipeline
Before text enters the inverted index, it passes through an analysis pipeline. The same analysis must happen at both index time and query time otherwise queries won't match!
How the Index is Built (Segments)
The index is NOT a single updated file. It is a collection of immutable Segments.
You cannot efficiently insert a new document ID into the middle of a sorted compressed posting list on disk. Instead, search engines use an LSM-Tree (Log Structured Merge) approach:
- Buffer: New documents go into an in-memory buffer.
- Flush: When full, the buffer is written as a new mini-segment (Analysis happens here).
- Merge: Background threads pick small segments and merge them into larger ones, deleting "tombstones" (deleted docs).
This explains why "Refresh" (making new segment visible) and "Commit" (fsync to disk) are different operations.
Common Pitfalls
Pitfall: GUIDs as Text
"type": "keyword"Pitfall: Aggregating on Text
"field": "color.keyword"Performance Benchmarks
Dataset: 100 million product descriptions (~50 GB text)
| Query Type | Time | Notes |
|---|---|---|
| Single term | 3ms | Index lookup + score top 1000 |
| AND (2 terms) | 5ms | Intersection of posting lists |
| AND (5 terms) | 12ms | Multiple intersections |
| Phrase (2 words) | 8ms | Check positions |
| Wildcard lap* | 50ms | Scan term dictionary |
| Wildcard *top | 500ms | Must scan entire dictionary! |
When NOT to use an Inverted Index
Numeric Ranges
Query: price > 100
Inverted indexes must verify every unique word > 100.
Use: BKD Trees
Semantic Similarity
Query: "Similar to 'King'"
Inverted index only finds exact character matches ("Kingdom").
Use: Vectors
Sorting / Aggregations
"Group by Category"
Uninverting (Doc->Term) is slow.
Use: DocValues
Key Takeaways
The Inverted Index
Maps Term → Documents, enabling O(1) lookups for full-text search.
Intersection
AND queries are intersections. The smallest posting list is processed first for efficiency.
Compression
Techniques like delta encoding and VInt reduce storage by 40x for common terms.
Analysis Consistency
The same analysis pipeline must be used at index time and query time.
Field Types
Use 'text' for full-text search and 'keyword' for exact filtering/aggregations.