Chapter 5.2: Retrieval Architecture
Boolean Retrieval & The Inverted Index
Boolean retrieval is the physical machinery that turns a query into sets of document IDs. It is a system built from set theory, compression, cache behavior, and SIMD, not from ranking or scoring.
This chapter describes the data structures and CPU-level techniques that make large-scale set retrieval fast.
The Anatomy of an Inverted Index
A search index turns the world upside down. Instead of Doc → Terms, we store Term → Docs. This inversion is what makes search fast: rather than scanning every document to find which ones contain "python", we look up "python" directly and retrieve the list of all document IDs containing it.
Physically, the inverted index is split into two distinct data structures. The term dictionary maps every unique term to a file pointer, like the index at the back of a book that tells you which page to flip to. The postings file contains the actual lists of document IDs at those pointers. This separation is crucial: the dictionary is small enough to fit in RAM (often just a few hundred megabytes for a billion-document index), while the postings file can be massive but is accessed sequentially.
Modern implementations add a third layer: block indexes that store metadata about chunks of postings. Instead of treating each posting list as a flat array, we split it into blocks of 128 document IDs, storing the max doc ID per block. This enables block-level skipping—if block 47's max doc ID is 10,000 and we're looking for doc 50,000, we can skip directly to a later block without decompressing the intervening data.
1. The Term Dictionary (FST)
Maps "term" → "file pointer".
Analogy: T9 text input / auto-complete.
Why not a Hash Map? Hash maps are fast (O(1)) but memory hungry and don't support prefixes. We use a Finite State Transducer (FST). It compresses shared prefixes (e.g., "run", "runner", "running") into a graph, often fitting the entire vocabulary in RAM.
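A full FST implementation is involved (it shares suffixes as well as prefixes, and stores outputs along the edges), but a toy trie shows the prefix-sharing idea and why prefix queries like run* fall out naturally. This is an illustrative sketch; the function names are ours, not any real library's API:

```python
def build_trie(terms):
    """Toy prefix tree: nested dicts, '$' marks end-of-term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def prefix_terms(root, prefix):
    """Return all indexed terms starting with `prefix`, in sorted order."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results = []
    def walk(n, suffix):
        if "$" in n:
            results.append(prefix + suffix)
        for ch in sorted(k for k in n if k != "$"):
            walk(n[ch], suffix + ch)
    walk(node, "")
    return results
```

Here "run", "runner", and "running" share the nodes for r-u-n; an FST goes further by also sharing common suffixes and attaching the file-pointer outputs to the transitions.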
2. The Postings List
A sorted list of Document IDs that contain the term.
Constraint: Must be sorted. Solving complex queries depends on this sort order (like merging two sorted arrays).
Physically, each posting is laid out as [docID, freq, positions, payload]. The hot path touches docID always, freq sometimes, positions almost never. This separation is critical for memory locality.
3. Block Index / Skip Data
Large postings lists are not flat arrays. They are split into fixed-size blocks (e.g., 128 docIDs).
Each block stores: maxDocID, compressed payload, and skip offsets.
Why? This enables block-level skipping, galloping, and avoiding decompression of blocks that cannot intersect. Modern retrieval is block-oriented, not integer-oriented.
Compression: The Speed Factor
Why Compress? It's not for Disk Space.
We compress integers primarily to save Memory Bandwidth. A CPU can process data faster than RAM can deliver it. If we can squeeze 4 integers into the space of 1, the CPU stays fed.
1. Delta Encoding (The Gap)
Instead of storing large absolute IDs like [100, 101, 105], we store the differences: [100, 1, 4]. Smaller numbers require fewer bits.
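A minimal sketch of the gap transform (function names are ours, not a specific engine's):

```python
def delta_encode(doc_ids):
    """Store the first doc ID, then the gap to each previous ID."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    """Rebuild absolute doc IDs by running a prefix sum over the gaps."""
    out, cur = [], 0
    for g in gaps:
        cur += g
        out.append(cur)
    return out
```

Because postings are sorted, every gap is positive and usually small, which is exactly what the bit-packing stage below exploits.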
2. SIMD-BP128 (Bit Packing)
We pack blocks of 128 integers into the exact number of bits needed (e.g., 5 bits each). Then we use SIMD (AVX2 / AVX-512) instructions to decode an entire block in a few cycles.
The real win is fewer bytes loaded from memory, fewer cache lines touched, and sequential access patterns. The CPU is rarely the bottleneck. Memory bandwidth is.
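Real SIMD-BP128 implementations use hand-tuned vector intrinsics; a pure-Python sketch of the bit-packing idea (illustrative only, with no actual SIMD) still shows where the space savings come from:

```python
def pack(values, bits):
    """Pack each value into `bits` bits inside one integer accumulator.
    Assumes every value fits in `bits` bits (max gap determines the width)."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bits)
    nbytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(nbytes, "little")

def unpack(data, bits, count):
    """Extract `count` values of `bits` bits each from the packed bytes."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]
```

Five gaps at 3 bits each fit in 2 bytes instead of 20 bytes as 32-bit integers; that ratio, applied to blocks of 128 gaps, is what keeps the memory bus fed.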
Block-Level Skipping
Intersection does not start by decoding integers. For each block:
- Compare block.maxDocID with the other list's current docID.
- If the block cannot intersect → skip the entire block.
- Only decompress blocks that might produce matches.
This reduces decompression work, branches, and cache pollution. This is what allows postings lists to scale to billions of documents.
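A minimal sketch of block-level skipping, assuming blocks carry a (maxDocID, gap-encoded payload) pair as described above (the layout and names are illustrative):

```python
def decode_gaps(gaps):
    """Prefix-sum the gaps back into absolute doc IDs (stands in for
    real decompression of the block payload)."""
    out, cur = [], 0
    for g in gaps:
        cur += g
        out.append(cur)
    return out

def intersect_blocks(blocks, other):
    """blocks: list of (max_doc_id, gap_encoded_ids) per block.
    other: sorted list of doc IDs from the second postings list."""
    result, j = [], 0
    for max_doc, gaps in blocks:
        if j >= len(other):
            break
        if max_doc < other[j]:
            continue  # block cannot intersect: skipped, never decoded
        for doc in decode_gaps(gaps):  # decode only candidate blocks
            while j < len(other) and other[j] < doc:
                j += 1
            if j < len(other) and other[j] == doc:
                result.append(doc)
    return result
```

The key property: a block whose maxDocID falls below the other list's cursor costs one comparison, not 128 decodes.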
Intersection: The "Zipper"
When you search blue AND sky, the engine performs Set Intersection. Because the lists are sorted, we can do this in linear time O(N+M) using the "Zipper" (Merge Join) algorithm.
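The zipper itself is only a few lines; this is a direct sketch of the merge-join over two sorted ID lists:

```python
def intersect(a, b):
    """Linear-time intersection of two sorted doc ID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # match: advance both cursors
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1            # a is behind: advance it
        else:
            j += 1            # b is behind: advance it
    return out
```

Each element of either list is visited at most once, hence O(N+M).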
Optimization: Galloping Search
What if List A has 1,000,000 items and List B has 2? Iterating 1M items is wasteful.
Galloping checks indices exponentially ($1, 2, 4, 8, 16...$) to skip massive chunks of List A. Complexity becomes O(M · log N).
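A sketch of galloping, using an exponential probe followed by a binary search over the bracketed window (via Python's stdlib bisect; the bounds handling is the fiddly part):

```python
from bisect import bisect_left

def gallop_intersect(small, large):
    """Intersect a tiny sorted list against a huge sorted list in
    O(M log N) by galloping through `large` for each item of `small`."""
    out = []
    lo = 0
    for x in small:
        # Exponential probe: 1, 2, 4, 8, ... steps past `lo`.
        step, hi = 1, lo
        while hi < len(large) and large[hi] < x:
            lo = hi
            hi += step
            step *= 2
        # Binary search inside the bracketed window (lo, hi].
        pos = bisect_left(large, x, lo, min(hi + 1, len(large)))
        if pos < len(large) and large[pos] == x:
            out.append(x)
        lo = pos  # next probe resumes here; `small` is sorted too
    return out
```

For M=2 and N=1,000,000 this touches tens of entries of the big list instead of a million.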
Roaring Bitmaps: The Modern Filter Format
For heavy filtering (e.g., status=published), we don't use standard integer lists. A BitSet (001001...) is fast but wasteful if data is sparse (storing 1M zeros for one '1').
Roaring Bitmaps solve this with a hybrid design. They split the 32-bit integer space into 65,536-value chunks ("containers", keyed by the high 16 bits) and choose the best format per container dynamically:
- Sparse → Array Container: stores a sorted uint16[] of the values present. If only 3 cars are parked, just write down their spot numbers: [10, 50, 99].
- Dense → Bitmap Container: stores all 65,536 bits (~8KB). If the lot is full, it's cheaper to draw a map of the lot with X's where cars are.
- Runs → Run Container: run-length encodes long consecutive ranges (e.g., [1000..50000]).
The result shares the properties of the rest of the engine:
- Memory-bandwidth bound
- Vectorized AND / OR
- Block-wise skipping
- Branch-free inner loops
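A toy version of the per-container decision. The thresholds follow the format's economics (an array of n uint16 values costs 2n bytes versus a fixed 8KB bitmap, and a run costs two uint16 per range), but this heuristic is a simplified sketch, not the actual Roaring specification logic:

```python
def choose_container(values):
    """Pick a container type for one 64K chunk.
    values: sorted uint16 values present in the chunk."""
    n = len(values)
    # Count maximal consecutive runs, e.g. [3,4,5,9] has 2 runs.
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if b != a + 1)
    if runs * 4 < min(n * 2, 8192):
        return "run"      # few long runs: cheapest of all
    if n <= 4096:
        return "array"    # 2 bytes/value beats the 8KB bitmap
    return "bitmap"       # dense: fixed-size bitset wins
```

The real format makes the same three-way choice per container, which is why set operations stay fast whether the filter matches 12 documents or 12 million.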
The Cost of Positions (Phrases)
Two-Phase Verification
Query: "new york". Use phrase queries carefully: they require checking the position of every term match, and position files are huge (3x-4x larger than doc ID lists).
Phase 1 runs the cheap boolean filter new AND york; phase 2 loads positions only for the surviving candidates. "Don't pay the I/O cost for documents that don't even have both terms."
Additional Optimization: Rare-Term Anchoring
For "new york", if df("york") << df("new"):
- Iterate york positions.
- Check if new exists at pos - 1.
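The steps above can be sketched directly, with a set lookup standing in for a real merge over two sorted position lists (names and the offset convention are illustrative):

```python
def phrase_match(pos_rare, pos_common, offset):
    """Check whether a two-term phrase occurs in one document.
    pos_rare: positions of the rarer term (the anchor, e.g. 'york').
    pos_common: positions of the more frequent term (e.g. 'new').
    offset: where the common term must sit relative to the anchor
            (-1 means 'common term immediately before the rare term')."""
    common = set(pos_common)
    return any(p + offset in common for p in pos_rare)
```

Anchoring on the rare term means the inner check runs df("york") times, not df("new") times, which is the whole point when the frequencies are lopsided.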
Block-Max and Upper-Bound Skipping (Preview)
Some engines store, per block, the maximum possible contribution to the score (the "block-max").
During traversal:
- If a block cannot possibly beat the current score threshold, the entire block is skipped without decompression.
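A sketch of the idea in isolation (the scores and threshold are illustrative; engines that implement variants like Block-Max WAND fold this check into the traversal loop itself):

```python
def prune_blocks(blocks, threshold):
    """blocks: list of (max_score, doc_ids) per block.
    Keep only blocks whose best-case score can beat the current
    top-k threshold; count how many were skipped outright."""
    survivors, skipped = [], 0
    for max_score, doc_ids in blocks:
        if max_score <= threshold:
            skipped += 1  # whole block skipped, never decompressed
            continue
        survivors.extend(doc_ids)
    return survivors, skipped
```

As the top-k threshold rises during traversal, more and more blocks fail this test, so the query gets cheaper as it runs.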
Retrieval Architecture: A Concrete Example
Let's trace the query "blue jeans" through the system.
Step 1: Dictionary Lookup
The FST is probed for "blue" and "jeans", returning the file pointers of their postings lists.
Step 2: Access & Stream
The engine opens two streams to the Postings File. It does not load everything. It streams Block Headers first.
Step 3: Block Skip Check
The engine compares the minDoc / maxDoc headers of the current blocks to see if they overlap.
Step 4: SIMD Decompression
Surviving blocks are decoded with vectorized instructions, a full block at a time.
Step 5: Intersection & Phrase Check
Suppose Jeans: [5, 23, 104, 500...]. The zipper walks both lists until they agree, e.g., on Page 23.
For Page 23: Load "Blue" positions. Load "Jeans" positions. Check: Is Pos(Jeans) == Pos(Blue) + 1?
Final Output
Pass to Ranking Engine (BM25) or Filter logic.
One-Page Mental Picture
Block Skip → SIMD Decode → Intersect
Why This Is Fast
- ✓ Dictionary in RAM: Zero disk seeks to find "where the data is".
- ✓ Streaming IO: Postings are read sequentially, not randomly.
- ✓ Block Skipping: Massive chunks of data are ignored without ever being decompressed.
- ✓ SIMD: Vectorized instructions reduce CPU cycles per integer.
Key Takeaways
The Foundation
Boolean retrieval is the physical machinery that converts queries into sets of document IDs. It defines the 'Candidate Set' that later stages will score and rank. Every document is binary—either in the candidate set or not—making set operations the core primitive.
FST > Hash Map
We use Finite State Transducers (FSTs) for the term dictionary because they support prefix queries (run*) and compress dramatically better than hash maps. An FST shares prefixes across terms, so 'run', 'runner', 'running' share storage for 'run', often fitting entire vocabularies in RAM.
Compression is Speed
We don't compress to save disk; we compress to save RAM bandwidth. Modern CPUs can process data faster than RAM can deliver it. SIMD-BP128 packs 128 integers into minimal bits and decodes them vectorized, reaching throughputs on the order of billions of integers per second.
Roaring Bitmaps
The modern standard for filter storage. Roaring Bitmaps hybridize three container types: sorted arrays for sparse data, bitsets for dense data, and run-length encoding for consecutive sequences. This adaptive strategy delivers fast set operations regardless of data distribution.
Phrase Cost
Phrase queries like '"new york"' require verifying term positions within documents, not just term presence. Position files are 3-4x larger than doc ID lists, making phrase queries an order of magnitude slower than simple boolean queries.
Block-Oriented Execution
Modern retrieval operates on blocks of 128 document IDs, not individual integers. By comparing block metadata (min/max doc IDs) before decompression, engines skip entire blocks that cannot intersect, avoiding decompression overhead entirely.
Memory Bandwidth Is the Bottleneck
Almost every optimization in retrieval exists to move fewer bytes from RAM to CPU, not to reduce CPU cycles. Compression, block skipping, and SIMD are all about feeding the CPU faster by moving less data.