Systems Atlas

Chapter 5.2: Retrieval Architecture

Boolean Retrieval & The Inverted Index

Boolean retrieval is the physical machinery that turns a query into sets of document IDs. It is a system built from set theory, compression, cache behavior, and SIMD, not from ranking or scoring.

This chapter describes the data structures and CPU-level techniques that make large-scale set retrieval fast.

If you wanted to find every book containing the word "Love", you wouldn't read every book. You would use the index at the back. We are building that index, but for billions of pages, and making it work in milliseconds.

The Anatomy of an Inverted Index

A search index turns the world upside down. Instead of Doc → Terms, we store Term → Docs. This inversion is what makes search fast: rather than scanning every document to find which ones contain "python", we look up "python" directly and retrieve the list of all document IDs containing it.

Physically, the inverted index is split into two distinct data structures. The term dictionary maps every unique term to a file pointer, like the index at the back of a book that tells you which page to flip to. The postings file contains the actual lists of document IDs at those pointers. This separation is crucial: the dictionary is small enough to fit in RAM (often just a few hundred megabytes for a billion-document index), while the postings file can be massive but is accessed sequentially.

Modern implementations add a third layer: block indexes that store metadata about chunks of postings. Instead of treating each posting list as a flat array, we split it into blocks of 128 document IDs, storing the max doc ID per block. This enables block-level skipping—if block 47's max doc ID is 10,000 and we're looking for doc 50,000, we can skip directly to a later block without decompressing the intervening data.

1. The Term Dictionary (FST)

Maps "term" → "file pointer".
Like T9 Text / Auto-complete
Why not a Hash Map? Hash maps are fast (O(1)) but memory hungry and don't support prefixes. We use a Finite State Transducer (FST). It compresses shared prefixes (e.g., "run", "runner", "running") into a graph, often fitting the entire vocabulary in RAM.
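
To make the dictionary's contract concrete, here is a minimal sketch using a sorted term array. This is not how an FST works internally (a real FST shares prefix and suffix structure in a graph and is far more compact); the names and layout here are illustrative only, but the exact-lookup and prefix-lookup operations are the same ones the FST serves.

import java.util.*;

// Illustrative only: a sorted-array dictionary with the same lookup
// contract as an FST. A real FST is far more compact.
class TermDictionary {
    private final String[] terms;   // sorted, unique terms
    private final long[] pointers;  // pointers[i] = postings offset of terms[i]

    TermDictionary(String[] terms, long[] pointers) {
        this.terms = terms;
        this.pointers = pointers;
    }

    // Exact lookup: term -> postings file pointer, or -1 if absent.
    long lookup(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? pointers[i] : -1;
    }

    // Prefix query ("run*"): pointers for every term starting with prefix.
    List<Long> prefixLookup(String prefix) {
        int i = Arrays.binarySearch(terms, prefix);
        if (i < 0) i = -i - 1;  // first term >= prefix
        List<Long> out = new ArrayList<>();
        while (i < terms.length && terms[i].startsWith(prefix)) {
            out.add(pointers[i++]);
        }
        return out;
    }
}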

2. The Postings List

A sorted list of Document IDs that contain the term.
Constraint: Must be sorted. Solving complex queries depends on this sort order (like merging two sorted arrays).

Logically: [docID, freq, positions, payload]
Physically: The hot path touches docID always, freq sometimes, positions almost never. This separation is critical for memory locality.
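
A sketch of what that separation can look like in memory (the field names are hypothetical, not any engine's actual layout):

// Hypothetical struct-of-arrays layout for one postings list. Keeping
// parallel arrays means the intersection hot path streams only docIDs,
// so cache lines are never polluted with freq or position bytes.
class PostingsList {
    int[]  docIDs;     // hot: touched on every intersection step
    int[]  freqs;      // warm: touched only when scoring a match
    long[] posOffsets; // cold: posOffsets[i] points into the positions file
    // Positions themselves stay on disk until phrase verification needs them.
}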

3. Block Index / Skip Data

Large postings lists are not flat arrays. They are split into fixed-size blocks (e.g., 128 docIDs).
Each block stores: maxDocID, compressed payload, and skip offsets.
Why? This enables block-level skipping and galloping, and avoids decompressing blocks that cannot intersect. Modern retrieval is block-oriented, not integer-oriented.

// Physical Layout on Disk
Term Dictionary (FST)
  "apple"   → 0x00A1
  "apply"   → 0x00B4
  "apricot" → 0x00C8
Postings File (.doc)
  0x00A1: [1, 4, 10, 250...]
  0x00B4: [2, 4, 8, 9...]

Compression: The Speed Factor

Why Compress? It's not for Disk Space.

We compress integers primarily to save Memory Bandwidth. A CPU can process data faster than RAM can deliver it. If we can squeeze 4 integers into the space of 1, the CPU stays fed.

"Compression is the art of fitting more data into the CPU's mouth at once."

1. Delta Encoding (The Gap)

Instead of storing huge integers [100, 101, 105], we store the differences. Smaller numbers require fewer bits.

"Don't write the full address, just say 'Next door'."
Raw:   [15,000, 15,001]
Delta: [15,000, +1]
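
A minimal sketch of the round trip, assuming a sorted docID list (decoding is just a prefix sum):

// Delta (gap) encoding for a sorted docID list: store differences, not IDs.
static int[] deltaEncode(int[] docIDs) {
    int[] gaps = new int[docIDs.length];
    int prev = 0;
    for (int i = 0; i < docIDs.length; i++) {
        gaps[i] = docIDs[i] - prev; // small gaps need fewer bits when packed
        prev = docIDs[i];
    }
    return gaps;
}

// Decoding is a running (prefix) sum that restores the absolute IDs.
static int[] deltaDecode(int[] gaps) {
    int[] docIDs = new int[gaps.length];
    int acc = 0;
    for (int i = 0; i < gaps.length; i++) {
        acc += gaps[i];
        docIDs[i] = acc;
    }
    return docIDs;
}

For example, deltaEncode(new int[]{15000, 15001, 15007}) yields [15000, 1, 6].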

2. SIMD-BP128 (Bit Packing)

We pack blocks of 128 integers into the exact number of bits needed (e.g., 5 bits each). Then we use SIMD (AVX2 / AVX-512) instructions to decode an entire block in a few cycles.

The real win is fewer bytes loaded from memory, fewer cache lines touched, and sequential access patterns. The CPU is rarely the bottleneck. Memory bandwidth is.

Imagine a waiter carrying 8 plates at once (SIMD) vs. running back to the kitchen for every single plate (Scalar).
// SIMD bit-packing: conceptual decode speed
SIMD-BP128: ~5 billion ints/sec
VByte:      ~0.5 billion ints/sec
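
The packing itself can be sketched in scalar code. This shows only the layout transform, assuming every value already fits in `bits` bits (true for gaps after delta encoding); SIMD-BP128's speed comes from applying the same transform 128 values at a time in vector registers.

// Scalar sketch of bit packing: each value occupies exactly `bits` bits.
// Assumes values.length == 128 and every value < 2^bits.
static long[] pack(int[] values, int bits) {
    long[] packed = new long[(128 * bits + 63) / 64];
    for (int i = 0; i < 128; i++) {
        int word = (i * bits) / 64, shift = (i * bits) % 64;
        packed[word] |= (long) values[i] << shift;
        if (shift + bits > 64) { // value straddles a 64-bit word boundary
            packed[word + 1] |= ((long) values[i]) >>> (64 - shift);
        }
    }
    return packed;
}

// Random access: extract the i-th value from the packed block.
static int unpack(long[] packed, int bits, int i) {
    int word = (i * bits) / 64, shift = (i * bits) % 64;
    long v = packed[word] >>> shift;
    if (shift + bits > 64) {
        v |= packed[word + 1] << (64 - shift);
    }
    return (int) (v & ((1L << bits) - 1));
}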

Block-Level Skipping

Intersection does not start by decoding integers. For each block:

  • Compare block.maxDocID with the other list's current docID.
  • If the block cannot intersect → skip the entire block.
  • Only decompress blocks that might produce matches.

This reduces decompression work, branches, and cache pollution. This is what allows postings lists to scale to billions of documents.
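
A sketch of the skip check, with a hypothetical BlockHeader standing in for the per-block metadata described above:

// Hypothetical per-block skip metadata.
static class BlockHeader {
    int maxDocID; // largest docID inside this block
    long offset;  // where the compressed payload lives on disk
}

// Advance to the first block that could contain `target` without
// decompressing anything; every block passed over is pure savings.
static int seekBlock(BlockHeader[] blocks, int current, int target) {
    while (current < blocks.length && blocks[current].maxDocID < target) {
        current++; // whole block skipped, payload never touched
    }
    return current; // only this block is a candidate for decoding
}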

Intersection: The "Zipper"

When you search blue AND sky, the engine performs Set Intersection. Because the lists are sorted, we can do this in linear time O(N+M) using the "Zipper" (Merge Join) algorithm.

// The "Sort-vs-Sort" Algorithm
while (p1 < len1 && p2 < len2) {
if (list1[p1] == list2[p2]) {
result.add(list1[p1]); // Match!
p1++; p2++;
} else if (list1[p1] < list2[p2]) {
p1++; // Advance lagger
} else {
p2++;
}
}
In practice, engines dynamically choose between Linear merge, Galloping, and Block-skip + merge based on relative list sizes and term selectivity.

Optimization: Galloping Search

What if List A has 1,000,000 items and List B has 2? Iterating 1M items is wasteful.

Galloping probes indices exponentially (1, 2, 4, 8, 16, ...) to skip massive chunks of List A. Complexity becomes O(M · log N).
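
A sketch under those assumptions (sorted arrays, the small list driving the search):

import java.util.*;

// Galloping (exponential) search: find the first index in sorted `list`,
// at or after `lo`, whose value is >= target. Probes 1, 2, 4, 8, ...
// then binary-searches the bracketed range.
static int gallop(int[] list, int lo, int target) {
    int step = 1, hi = lo;
    while (hi < list.length && list[hi] < target) {
        lo = hi + 1;   // everything up to hi is < target
        hi += step;
        step <<= 1;    // double the probe distance
    }
    hi = Math.min(hi, list.length - 1);
    while (lo <= hi) { // standard binary search inside the bracket
        int mid = (lo + hi) >>> 1;
        if (list[mid] < target) lo = mid + 1; else hi = mid - 1;
    }
    return lo;
}

// Intersect a tiny list against a huge one in O(M log N).
static List<Integer> intersectGalloping(int[] small, int[] large) {
    List<Integer> result = new ArrayList<>();
    int pos = 0;
    for (int doc : small) {
        pos = gallop(large, pos, doc);
        if (pos == large.length) break;
        if (large[pos] == doc) result.add(doc);
    }
    return result;
}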

Roaring Bitmaps: The Modern Filter

For heavy filtering (e.g., status=published), we don't use standard integer lists. A BitSet (001001...) is fast but wasteful if data is sparse (storing 1M zeros for one '1').

Roaring Bitmaps solve this by being hybrid. They split the 32-bit integer space into chunks of 65,536 values (keyed by the high 16 bits), back each chunk with a "container", and choose the best container format dynamically:

Consider a "Parking Lot" Strategy:
  • Sparse (Array): If only 3 cars are parked, just write down their spot numbers: [10, 50, 99].
  • Dense (Bitmap): If the lot is full, it's cheaper to draw a map of the lot with X's where cars are.
Roaring does exactly this, switching strategies per "chunk" of numbers (a sketch of the container choice follows the lists below).
  • Array Container
    For very sparse data. Stores sorted uint16[].
  • Bitmap Container
    For dense data. Stores 65,536 bits (~8KB).
  • Run Container
    For long consecutive runs (e.g., [1000..50000]).
Performance Characteristics
  • Memory-bandwidth bound
  • Vectorized AND / OR
  • Block-wise skipping
  • Branch-free inner loops
Used by: Lucene, Elasticsearch, Druid, Pinot, Spark, Snowflake, ClickHouse
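
A simplified sketch of the size arithmetic behind that choice. The 4096 cutoff is where a 2-byte-per-entry sorted array stops beating the fixed 8 KB bitmap; the real library's heuristics are more involved than this.

// Per-chunk container choice. Each chunk covers the 65,536 values that
// share one high-16-bit prefix, so entries inside it fit in a uint16.
enum Container { ARRAY, BITMAP, RUN }

static Container choose(int cardinality, int runCount) {
    int arrayBytes  = cardinality * 2; // sorted uint16[] cost
    int bitmapBytes = 65536 / 8;       // fixed 8 KB bitset cost
    int runBytes    = runCount * 4;    // (start, length) pairs of uint16
    if (runBytes < Math.min(arrayBytes, bitmapBytes)) return Container.RUN;
    return cardinality <= 4096 ? Container.ARRAY : Container.BITMAP;
}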

The Cost of Positions (Phrases)

Two-Phase Verification

Query: "new york". Use phrase queries carefully. They require checking the position of every term match. Position files are huge (3x-4x larger than doc lists).

Step 1 (Boolean): Intersect new AND york. Fast (RAM).
Step 2 (Verification): Load positions only for survivors. Slow (Disk I/O).

"Don't pay the I/O cost for documents that don't even have both terms."

Additional Optimization: Rare-Term Anchoring

For "new york", if df("york") << df("new"):

  1. Iterate york positions.
  2. Check if new exists at pos - 1.
This minimizes position loads, comparisons, and Disk I/O.
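
A sketch of the anchored check for one surviving document, assuming both position lists are sorted ints (as they are after decoding):

import java.util.*;

// Rare-term anchoring for the phrase "new york": iterate the rare term's
// positions and probe the frequent term's sorted positions for pos - 1.
// This runs only for documents that survived the boolean intersection.
static boolean hasPhrase(int[] newPositions, int[] yorkPositions) {
    for (int pos : yorkPositions) {      // few iterations: york is rare
        if (Arrays.binarySearch(newPositions, pos - 1) >= 0) {
            return true;                 // "new" immediately precedes "york"
        }
    }
    return false;
}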

Block-Max and Upper-Bound Skipping (Preview)

Some engines additionally store, per block, the maximum possible contribution to the score.
During traversal:

  • If a block cannot possibly beat the current threshold...
  • The entire block is skipped.
"Only visit blocks that are even capable of producing competitive documents."

Retrieval Architecture: A Concrete Example

Let's trace the query "blue jeans" through the system.

1. RAM: Term Dictionary (FST)
// Maps Term → File Pointer
"blue" → 0xA1
"black" → 0xA8
"jeans" → 0xB2
"jacket" → 0xB9
2. DISK: Postings File
// Compressed Blocks at Addresses
@ 0xA1 (blue):
Block 1: [10, 23, 105, 500...]
Block 2: [1000, 1040...]
@ 0xB2 (jeans):
Block 1: [5, 23, 104, 500...]
Block 2: [900, 1040...]

Step 1: Dictionary Lookup

Input:"blue", "jeans"Action:Traverse FST in RAM.Result:blue → 0xA1, jeans → 0xB2

Step 2: Access & Stream

The engine opens two streams to the Postings File. It does not load everything. It streams Block Headers first.

Stream A: @0xA1
Stream B: @0xB2

Step 3: Block Skip Check

We compare the minDoc / maxDoc headers of current blocks to see if they overlap.
Blue Block [10 ... 500]    vs  Jeans Block [5 ... 104]   → OVERLAP → DECODE
Blue Block [1000 ... 2000] vs  Jeans Block [300 ... 900] → NO OVERLAP → SKIP

Step 4: SIMD Decompression

Input:  0x1F, 0xA2, 0x07... (compressed bytes)
Action: SIMD expand (AVX-512) into CPU cache.
Output: [10, 23, 105, 500...]

Step 5: Intersection & Phrase Check

The "Zipper" (Intersection)
Blue : [10, 23, 105, 500...]
Jeans: [5, 23, 104, 500...]
Phrase Verification (Optional)

For Doc 23: Load "Blue" positions. Load "Jeans" positions. Check: Is Pos(Jeans) == Pos(Blue) + 1?

Final Output

[Doc 23, Doc 500, ...]

Pass to Ranking Engine (BM25) or Filter logic.

One-Page Mental Picture

Query
RAM: FST Lookup (Get Pointers)
DISK: Open Postings Streams
Loop:
Block Skip → SIMD Decode → Intersect
Candidate DocID Set

Why This Is Fast

  • Dictionary in RAM: Zero disk seeks to find "where the data is".
  • Streaming I/O: Postings are read sequentially, not randomly.
  • Block Skipping: Massive chunks of data are ignored without ever being decompressed.
  • SIMD: Vectorized instructions reduce CPU cycles per integer.

Key Takeaways

01

The Foundation

Boolean retrieval is the physical machinery that converts queries into sets of document IDs. It defines the 'Candidate Set' that later stages will score and rank. Every document is binary—either in the candidate set or not—making set operations the core primitive.

02

FST > Hash Map

We use Finite State Transducers (FSTs) for the term dictionary because they support prefix queries (run*) and are far more compact than hash maps. An FST shares prefixes across terms, so 'run', 'runner', and 'running' share storage for 'run', often fitting entire vocabularies in RAM.

03

Compression is Speed

We don't compress to save disk; we compress to save RAM bandwidth. Modern CPUs can process data faster than RAM can deliver it. SIMD-BP128 packs 128 integers into the minimum bits needed and decodes them with vector instructions, reaching on the order of 5 billion integers per second.

04

Roaring Bitmaps

The modern standard for filter storage. Roaring Bitmaps hybridize three container types: sorted arrays for sparse data, bitsets for dense data, and run-length encoding for consecutive sequences. This adaptive strategy delivers fast set operations regardless of data distribution.

05

Phrase Cost

Phrase queries like "new york" require verifying term positions within documents, not just term presence. Position files are 3-4x larger than doc ID lists, making phrase queries an order of magnitude slower than simple boolean queries.

06

Block-Oriented Execution

Modern retrieval operates on blocks of 128 document IDs, not individual integers. By comparing block metadata (min/max doc IDs) before decompression, engines skip entire blocks that cannot intersect, avoiding decompression overhead entirely.

07

Memory Bandwidth Is the Bottleneck

Almost every optimization in retrieval exists to move fewer bytes from RAM to CPU, not to reduce CPU cycles. Compression, block skipping, and SIMD are all about feeding the CPU faster by moving less data.