Systems Atlas

Chapter 3.2: Indexing & Infrastructure

The Inverted Index

The data structure that makes text search possible. By flipping the relationship from "document → words" to "word → documents", we achieve O(1) lookups across billions of documents.


The Inversion That Changed Everything

A traditional database stores documents and scans each one to find matches. An inverted index pre-computes the answer to "which documents contain this word?" for every word in the corpus.

Forward Index (Database)
Doc 1: "the quick brown fox"
Doc 2: "quick brown dog"
Doc 3: "lazy dog"
To find "brown": Scan Doc 1, then Doc 2, then Doc 3...
O(N) Linear scan of all documents
Inverted Index (Search Engine)
"brown" [Doc 1, Doc 2]
"dog" [Doc 2, Doc 3]
"quick" [Doc 1, Doc 2]
To find "brown": Direct lookup in dictionary.
O(1) Instant lookup regardless of corpus size
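The toy corpus above can be inverted in a few lines. This sketch uses a plain Python dict as the term dictionary (real engines use an FST, covered below):

```python
from collections import defaultdict

# Toy corpus from the example above.
docs = {
    1: "the quick brown fox",
    2: "quick brown dog",
    3: "lazy dog",
}

# Build: word -> sorted list of doc IDs (the inverted index).
index = defaultdict(list)
for doc_id in sorted(docs):
    for word in set(docs[doc_id].split()):
        index[word].append(doc_id)

print(index["brown"])  # [1, 2] -- one dict lookup, no document scan
print(index["dog"])    # [2, 3]
```

Finding "brown" is now a single dictionary lookup; the cost no longer depends on how many documents exist.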

Anatomy of an Inverted Index

1. Term Dictionary (FST)

The term dictionary is a sorted index of every unique word. It uses a Finite State Transducer (FST), a compressed prefix tree that shares common prefixes and suffixes.

Encoding "apple", "apply", "application":

  shared prefix: a → p → p → l
    ├→ e           = "apple"
    ├→ y           = "apply"
    └→ ...ication  = "application"

Shared prefixes compress storage. Lookup time is O(word length), not O(vocabulary size).
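A full FST also shares suffixes and attaches outputs to its arcs; a plain prefix trie is enough to illustrate the O(word length) lookup. A minimal sketch:

```python
# A plain prefix trie: shares prefixes (an FST additionally shares
# suffixes and attaches an output value to each path).
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word  # end-of-word marker
    return root

def lookup(trie, word):
    node = trie
    for ch in word:          # O(len(word)) steps, not O(vocabulary)
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

trie = build_trie(["apple", "apply", "application"])
print(lookup(trie, "apply"))   # 'apply'
print(lookup(trie, "appl"))    # None (a prefix, not an indexed term)
```

Adding a millionth word to the vocabulary does not slow down the lookup of "apply"; only the word's own length matters.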

2. Posting Lists (Compressed)

Each term points to a list of document IDs. These lists can contain millions of entries, so we apply heavy compression.

Raw IDs (Expensive)
[100, 101, 102, 105, 200, 300]
6 integers × 4 bytes = 24 bytes
Delta Encoded (Cheap)
[100, +1, +1, +3, +95, +100]
Small deltas = fewer bytes. ~4x compression!
Real-World Compression

The term "the" appears in 50 million documents.
Uncompressed: 50M × 4 bytes = 200 MB
Compressed: ~5 MB (40x smaller!)
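The delta-plus-varint scheme can be sketched in a few lines; `varint` here is a standard 7-bits-per-byte variable-length encoding, not any particular engine's exact wire format:

```python
def delta_encode(doc_ids):
    # Store the first ID, then the gaps between consecutive sorted IDs.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def varint(n):
    # Variable-length byte encoding: 7 data bits per byte,
    # high bit set on every byte except the last.
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

ids = [100, 101, 102, 105, 200, 300]
deltas = delta_encode(ids)            # [100, 1, 1, 3, 95, 100]
encoded = b"".join(varint(d) for d in deltas)
print(len(encoded))  # 6 bytes vs. 24 bytes for raw 4-byte ints
```

Because every delta here fits in 7 bits, each one costs a single byte: the 4x saving from the example above falls out directly.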

How a Query Executes

Let's trace exactly what happens when you search for "wireless keyboard":

Step 1: Parse & Analyze Query

"wireless keyboard" → tokens: ["wireless", "keyboard"]
Step 2: Look Up Each Term in the Dictionary

FST lookup "wireless" → pointer to List A (~100ns)
FST lookup "keyboard" → pointer to List B (~100ns)
Step 3: Retrieve Posting Lists

"wireless" [10, 45, 89, 234, 567, 890, 1234]
"keyboard" [45, 123, 567, 789, 890, 2345]
Step 4: Intersect Lists (AND Query)

Two-pointer merge algorithm. Both lists are sorted, so we walk them together.

// Find documents in BOTH lists
Result: [45, 567, 890]
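The two-pointer merge can be sketched directly on the example lists:

```python
def intersect(a, b):
    # Two-pointer walk over sorted posting lists: advance whichever
    # pointer holds the smaller doc ID; emit on a match.
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

wireless = [10, 45, 89, 234, 567, 890, 1234]
keyboard = [45, 123, 567, 789, 890, 2345]
print(intersect(wireless, keyboard))  # [45, 567, 890]
```

The walk is linear in the combined list length, and it never needs to look at a document that appears in neither list.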
Step 5: Score & Rank (BM25)

def bm25(doc, terms):
    score = 0
    for term in terms:
        tf = term_frequency(term, doc)
        idf = inverse_doc_frequency(term)
        score += idf * tf_weight(tf)
    return score
Total Time: ~5 milliseconds
Across millions of documents
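The pseudocode above leaves `tf_weight` abstract. A self-contained sketch using the standard BM25 formula (k1 = 1.2, b = 0.75, with the usual smoothed IDF) might look like this; the tiny corpus is invented for illustration:

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """BM25: sum over terms of idf(t) * tf*(k1+1) / (tf + k1*(1 - b + b*|d|/avgdl))."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        norm = 1 - b + b * len(doc_tokens) / avgdl  # length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

corpus = [doc.split() for doc in
          ["wireless keyboard sale", "wired keyboard", "wireless mouse"]]
print(bm25_score(["wireless", "keyboard"], corpus[0], corpus))
```

The document containing both query terms outscores the ones containing only one, which is exactly the ranking behavior the engine needs.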

Boolean Operations on Posting Lists

AND (Intersection)

laptop: [1, 5, 23, 100]
macbook: [5, 23, 500]
Result: [5, 23]

OR (Union)

laptop: [1, 5, 23]
tablet: [5, 50, 100]
Result: [1, 5, 23, 50, 100]

NOT (Difference)

laptop: [1, 5, 23, 100]
apple: [5, 23]
Result: [1, 100]
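Union and difference use the same sorted-list walk as intersection; a sketch of both:

```python
def union(a, b):
    # Merge two sorted lists, skipping duplicates.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def difference(a, b):
    # Keep IDs from a that never appear in b.
    out, j = [], 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

print(union([1, 5, 23], [5, 50, 100]))        # [1, 5, 23, 50, 100]
print(difference([1, 5, 23, 100], [5, 23]))   # [1, 100]
```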

The Analysis Pipeline

Before text enters the inverted index, it passes through an analysis pipeline. The same analysis must happen at both index time and query time; otherwise, queries won't match!

Raw Text: "Hello, C++ World!"
Character Filters
Strip HTML, map C++ → cpp, remove accents
Tokenizer
Split on whitespace/punctuation → ["Hello", "cpp", "World"]
Token Filters
Lowercase: "Hello" → "hello"; Stemming: "running" → "run"
Final Terms: ["hello", "cpp", "world"]
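The whole pipeline can be approximated in one function; the regex tokenizer and the hard-coded C++ mapping are simplifications of what real analyzers do:

```python
import re

def analyze(text):
    # 1. Character filters: protect "C++" before punctuation is stripped.
    text = text.replace("C++", "cpp")
    # 2. Tokenizer: split on anything that isn't a letter or digit.
    tokens = re.split(r"[^a-zA-Z0-9]+", text)
    # 3. Token filters: lowercase, drop empties (a real stemmer would go here).
    return [t.lower() for t in tokens if t]

print(analyze("Hello, C++ World!"))  # ['hello', 'cpp', 'world']
```

The critical property is that this exact function runs on documents at index time and on queries at search time, so both sides produce the same terms.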

How the Index is Built (Segments)

The index is NOT a single updated file. It is a collection of immutable Segments.

You cannot efficiently insert a new document ID into the middle of a sorted, compressed posting list on disk. Instead, search engines use a Log-Structured Merge (LSM) tree approach:

  • Buffer: New documents go into an in-memory buffer.
  • Flush: When full, the buffer is written as a new mini-segment (Analysis happens here).
  • Merge: Background threads pick small segments and merge them into larger ones, discarding documents marked as deleted by "tombstones".

This explains why "Refresh" (making a new segment visible to searches) and "Commit" (fsyncing segments to disk for durability) are different operations.
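The buffer → flush → merge lifecycle can be caricatured in a toy class. Real engines write segments to disk and merge by size tiers; everything here is simplified and in memory:

```python
class SegmentedIndex:
    """Toy sketch of the buffer -> flush -> merge lifecycle."""
    def __init__(self, buffer_limit=2):
        self.buffer = {}          # in-memory: doc_id -> raw text
        self.segments = []        # immutable flushed segments
        self.buffer_limit = buffer_limit

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.buffer_limit:
            self.refresh()

    def refresh(self):
        # "Flush": analysis happens here; the segment is then read-only.
        if self.buffer:
            segment = {doc_id: text.lower().split()
                       for doc_id, text in self.buffer.items()}
            self.segments.append(segment)
            self.buffer = {}

    def merge(self):
        # Background merge: combine small segments into one larger one.
        merged = {}
        for seg in self.segments:
            merged.update(seg)
        self.segments = [merged]

idx = SegmentedIndex()
idx.add(1, "quick fox")
idx.add(2, "lazy dog")   # buffer full -> flushed as a new segment
idx.add(3, "brown dog")
idx.refresh()            # make doc 3 searchable (another new segment)
idx.merge()              # a single segment now holds all three docs
print(len(idx.segments))  # 1
```

Note that `refresh` is what makes new documents visible, while a real commit would additionally fsync the segment files; the two are deliberately separate steps.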

Common Pitfalls

Pitfall: GUIDs as Text

// BAD: Standard analyzer splits on hyphens
"550e8400-e29b-41d4-a716-446655440000"
→ ["550e8400", "e29b", "41d4", ...]
// 5 terms instead of 1!
Fix: Use keyword type
"type": "keyword"

Pitfall: Aggregating on Text

// BAD: Aggregating on analyzed text
"Light Blue" → ["light", "blue"]
Results: "light": 5000, "blue": 4000
// Not useful!
Fix: Use keyword subfield
"field": "color.keyword"

Performance Benchmarks

Dataset: 100 million product descriptions (~50 GB text)

Query Type          Time     Notes
Single term         3 ms     Index lookup + score top 1000
AND (2 terms)       5 ms     Intersection of posting lists
AND (5 terms)       12 ms    Multiple intersections
Phrase (2 words)    8 ms     Check positions
Wildcard lap*       50 ms    Scan term dictionary
Wildcard *top       500 ms   Must scan entire dictionary!
Memory Usage (100M docs, 50 GB text)

Raw text data:           50 GB
Term dictionary (FST):   500 MB
Posting lists:           8 GB
Position data:           25 GB
Total inverted index:    ~34 GB (0.68x raw text)

When NOT to use an Inverted Index

Numeric Ranges

Query: price > 100
An inverted index must enumerate every unique term and check whether it is > 100.
Use: BKD Trees

Semantic Similarity

Query: "Similar to 'King'"
Inverted index only finds exact character matches ("Kingdom").
Use: Vectors

Sorting / Aggregations

"Group by Category"
Rebuilding the document → term mapping at query time ("uninverting") is slow.
Use: DocValues

Key Takeaways

01. The Inverted Index

Maps Term → Documents, enabling O(1) lookups for full-text search.

02. Intersection

AND queries are intersections. The smallest posting list is processed first for efficiency.

03. Compression

Techniques like delta encoding and VInt reduce storage by 40x for common terms.

04. Analysis Consistency

The same analysis pipeline must be used at index time and query time.

05. Field Types

Use 'text' for full-text search and 'keyword' for exact filtering/aggregations.