Systems Atlas
Chapter 5.5: Retrieval Architecture

Execution Plans

How search engines translate "find laptops under $500" into disk reads, iterator operations, and score calculations.

This chapter reveals the internal machinery of query execution. Understanding these mechanics is essential for debugging slow queries and writing efficient search applications. Every millisecond in a search request can be traced to these operations.

Pipeline Depth

3 Layers

Query → Weight → Scorer abstraction for segment-level execution.

Iteration Cost

O(1)

Amortized cost per document via skip lists and block iteration.

Profile Overhead

~10x

Profile API adds significant overhead—debug only.

Prerequisites: This chapter builds on Boolean Retrieval (posting lists), BM25 (scoring), and Segments (index structure).

1. The Query → Weight → Scorer Pipeline

This is the fundamental architecture of Lucene query execution. Every search, from a simple term lookup to a complex boolean query with filters, passes through these three layers. Understanding this abstraction is the key to understanding query performance.

The separation exists for a reason: Query objects are reusable and stateless, Weight objects hold index-level statistics (computed once), and Scorer objects handle segment-specific iteration (created per segment). This design enables efficient execution across a segmented index.

Query

User-facing representation. Defines WHAT to search for.

Examples: TermQuery, BooleanQuery
Immutable · Thread-safe

Weight

Index-level state. Computes IDF, query boosts.

Created: query.create_weight()
Computed Once · Per-Search

Scorer

Segment-level iterator. Computes TF, length norms.

Created: weight.scorer(segment)
Per-Segment · Iterates Docs
// Execution Flow
Query (stateless)───────▶createWeight(IndexSearcher)
Weight (IDF, boosts)───────▶scorer(LeafReaderContext) × N segments
Scorer₁, Scorer₂, ... Scorerₙ───────▶iterate(), score() per segment
Key Insight: IDF vs TF

IDF (Inverse Document Frequency) is computed at the Weight level because it only depends on global term statistics. TF (Term Frequency) is computed at the Scorer level because it varies per document. This separation avoids redundant computation.

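The three-layer split can be sketched in a few lines of Python. This is a toy model, not Lucene's actual API (Lucene's classes are Java, and the class names, toy index, and simplified TF×IDF scoring here are illustrative): the point is that IDF is computed once in the Weight, while TF is read per document in the Scorer.

```python
import math

# Toy index: term -> posting list of (doc_id, term_frequency)
INDEX = {"laptop": [(0, 2), (3, 1), (7, 4)]}
NUM_DOCS = 10

class TermQuery:                        # stateless, reusable
    def __init__(self, term):
        self.term = term
    def create_weight(self, index, num_docs):
        return TermWeight(self, index, num_docs)

class TermWeight:                       # index-level: IDF computed once
    def __init__(self, query, index, num_docs):
        postings = index.get(query.term, [])
        df = len(postings)              # document frequency
        self.idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    def scorer(self, segment_postings):
        return TermScorer(self.idf, segment_postings)

class TermScorer:                       # segment-level: TF read per doc
    def __init__(self, idf, postings):
        self.idf, self.postings = idf, postings
    def iterate(self):
        for doc_id, tf in self.postings:
            yield doc_id, self.idf * tf # simplified TF×IDF, not full BM25

query = TermQuery("laptop")
weight = query.create_weight(INDEX, NUM_DOCS)  # IDF once per search
scorer = weight.scorer(INDEX["laptop"])        # one scorer per segment
hits = list(scorer.iterate())                  # (doc_id, score) pairs
```

In a real multi-segment index, the same Weight would produce one Scorer per segment, each iterating its own postings with the shared IDF.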

2. DocIdSetIterator (DISI)

The DocIdSetIterator (DISI) is the heart of Lucene's query execution. Every query, filter, and scorer ultimately reduces to iterating over document IDs. Understanding DISI patterns is essential for understanding why certain queries are fast and others are slow.

All DISI implementations return documents in increasing docID order within a segment. This monotonic ordering enables efficient set operations (intersection, union) via merge-like algorithms. The key methods are nextDoc() and advance(target).

Core DISI Methods
int nextDoc()

Advance to the next matching document. Returns NO_MORE_DOCS when exhausted.

int advance(int target)

Skip to target or beyond. Essential for efficient intersection.

int docID()

Return current document ID. Returns -1 before first nextDoc().

long cost()

Estimated number of matching docs. Used for optimization decisions.

Key DISI Implementations

Class           | Purpose             | Strategy
TermScorer      | Single term matches | Iterate posting list
ConjunctionDISI | AND queries         | Intersection via leap-frog
DisjunctionDISI | OR queries          | Union via priority queue
ReqExclScorer   | MUST_NOT handling   | Skip excluded docs
BlockMaxDISI    | Early termination   | Skip low-scoring blocks

3. Conjunction: The Leap-Frog Algorithm

For AND queries (A AND B AND C), Lucene uses the leap-frog algorithm. Instead of scanning all documents, iterators "leap" over each other to find common documents efficiently. This is why AND queries are typically faster than OR queries.

The algorithm exploits the fact that DISI returns documents in sorted order. When iterator A is at doc 5 and B is at doc 10, we can skip A directly to 10 (or beyond) without checking docs 6-9. This optimization is dramatic when intersecting large posting lists with small ones.

# Leap-Frog Algorithm for A AND B AND C
def leap_frog(iterators):
    for it in iterators:
        it.next_doc()  # position each iterator on its first doc
    while all(it.doc_id() != NO_MORE_DOCS for it in iterators):
        # Find current min/max docID
        min_doc = min(it.doc_id() for it in iterators)
        max_doc = max(it.doc_id() for it in iterators)
        if min_doc == max_doc:
            # All aligned → MATCH!
            yield min_doc
            for it in iterators:
                it.next_doc()
        else:
            # Leap lagging iterators to max_doc
            for it in iterators:
                if it.doc_id() < max_doc:
                    it.advance(max_doc)
Example: "blue" AND "shoes"
"blue":  1  3  5  7  12  15
"shoes": 2  5  8  10  12  20
Result:  5, 12
Only checked 6 positions, not 12!
Optimization: Sort by Cost

Lucene sorts sub-queries by cost() (estimated matching docs) and puts the cheapest first. This minimizes advance() calls because the smallest iterator drives the leap-frog. If "blue" matches 100 docs and "electronics" matches 1M, we iterate "blue" and check "electronics" only 100 times.
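A self-contained sketch of cost-ordered intersection, where the cheapest iterator drives the loop and the larger ones only answer advance() probes. SortedDocs and conjunction are illustrative names following the DISI contract, not Lucene's actual classes:

```python
NO_MORE_DOCS = 2**31 - 1

class SortedDocs:
    """Minimal sorted-docID iterator for the demo (not Lucene's API)."""
    def __init__(self, docs):
        self.docs, self.pos = sorted(docs), -1
    def doc_id(self):
        if self.pos < 0:
            return -1
        return self.docs[self.pos] if self.pos < len(self.docs) else NO_MORE_DOCS
    def next_doc(self):
        self.pos += 1
        return self.doc_id()
    def advance(self, target):
        while self.doc_id() < target:
            self.pos += 1
        return self.doc_id()
    def cost(self):
        return len(self.docs)

def conjunction(iterators):
    its = sorted(iterators, key=lambda it: it.cost())  # cheapest leads
    lead = its[0]
    doc = lead.next_doc()
    while doc != NO_MORE_DOCS:
        if all(it.advance(doc) == doc for it in its[1:]):
            yield doc                   # every iterator aligned -> match
            doc = lead.next_doc()
        else:
            # Some iterator overshot; leap the lead to the furthest docID
            doc = lead.advance(max(it.doc_id() for it in its[1:]))
```

Running this on the "blue" AND "shoes" posting lists from the example above yields exactly [5, 12].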


4. Disjunction: Priority Queue

For OR queries (A OR B OR C), Lucene uses a min-heap priority queue. All iterators are placed in a heap ordered by current docID. We pop the minimum, collect all iterators at that docID, compute the combined score, then advance and re-insert them.

Disjunctions are more expensive than conjunctions because we must visit ALL matching documents from any sub-query. However, optimizations like WAND can skip documents that can't make it into the top-K results.

# Priority Queue for A OR B OR C
def disjunction(iterators):
    heap = MinHeap(iterators, key=lambda it: it.doc_id())
    while heap:
        # Pop every iterator positioned at the minimum docID
        min_doc = heap.peek().doc_id()
        score = 0
        while heap and heap.peek().doc_id() == min_doc:
            it = heap.pop()
            score += it.score()  # Sum scores from matching clauses
            if it.next_doc() != NO_MORE_DOCS:
                heap.push(it)    # re-insert at its new docID
        yield (min_doc, score)
MinShouldMatch Optimization

When minimum_should_match is set (e.g., "at least 2 of 5 terms"), Lucene can skip documents that only match one clause. The heap tracks how many iterators are at each docID.

Score Combination

For disjunctions, scores from matching clauses are summed. A document matching "laptop OR notebook OR computer" gets higher score if it matches all three terms.
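Both ideas can be combined in one sketch: a min-heap union that counts how many clauses sit at each docID and skips documents below the minimum_should_match cutoff. This toy model works on raw posting lists rather than Scorer objects (disjunction_msm is an illustrative name, and real Lucene would also sum the clause scores here):

```python
import heapq

def disjunction_msm(posting_lists, min_should_match):
    """Union with a minimum_should_match cut, via a docID min-heap.

    posting_lists: sorted docID lists, one per SHOULD clause.
    Yields (doc_id, matched_clause_count) for qualifying docs.
    """
    # Heap entries: (current docID, clause index, position in its list)
    heap = [(docs[0], i, 0) for i, docs in enumerate(posting_lists) if docs]
    heapq.heapify(heap)
    while heap:
        min_doc = heap[0][0]
        matched = 0
        # Drain every clause currently at min_doc, counting them
        while heap and heap[0][0] == min_doc:
            _, i, pos = heapq.heappop(heap)
            matched += 1
            if pos + 1 < len(posting_lists[i]):
                heapq.heappush(heap, (posting_lists[i][pos + 1], i, pos + 1))
        if matched >= min_should_match:  # skip under-matching docs
            yield min_doc, matched
```

With min_should_match=2, a document matching only one clause is popped, counted, and silently dropped, so no score is ever computed for it.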


5. TwoPhaseIterator

Some queries have expensive match conditions. For example, phrase query "quick brown fox" must verify that terms appear consecutively—a position check that requires reading position data. TwoPhaseIterator separates the cheap "might match" check from the expensive "definitely matches" confirmation.

This optimization is crucial for phrase queries, span queries, and any query with a cheap approximation but expensive verification. The first phase narrows candidates using the posting list intersection; the second phase confirms matches by checking positions.

Phase 1: Approximation

Fast check using posting lists. For phrase "quick brown fox": does doc contain all three terms?

approximation().nextDoc()
Cost: O(posting list intersection)
Phase 2: Confirmation

Expensive verification. For phrase "quick brown fox": do terms appear consecutively?

matches() → true/false
Cost: O(position reading)
two_phase_iterator.py
from abc import ABC, abstractmethod

class TwoPhaseIterator(ABC):
    # Fast iterator over candidates
    @abstractmethod
    def approximation(self) -> DocIdSetIterator: ...

    # True if current doc matches (expensive)
    @abstractmethod
    def matches(self) -> bool: ...

    # Hint for optimization (higher = more expensive)
    @abstractmethod
    def match_cost(self) -> float: ...
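A concrete toy version of the two phases for the phrase "quick brown": phase 1 intersects the term posting lists to get candidates, phase 2 reads positions and confirms adjacency. The data layout and function names are invented for illustration, not Lucene's internals:

```python
POSITIONS = {
    # doc_id -> {term: [positions within the doc]}
    1: {"quick": [0], "brown": [1]},   # "quick brown ..." -> adjacent
    4: {"quick": [3], "brown": [9]},   # both terms, not adjacent
}

def approximation(terms):
    # Phase 1: cheap candidate set = docs containing every term
    return sorted(doc for doc, term_pos in POSITIONS.items()
                  if all(t in term_pos for t in terms))

def matches(doc, first, second):
    # Phase 2: expensive confirmation via position data
    pos = POSITIONS[doc]
    return any(p + 1 in pos[second] for p in pos[first])

hits = [d for d in approximation(["quick", "brown"])
        if matches(d, "quick", "brown")]   # only doc 1 survives
```

Doc 4 passes the approximation (it contains both terms) but fails the position check, which is exactly the work the two-phase split defers until a candidate is actually needed.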

6. Query Rewriting

Before execution, queries are rewritten into optimized, executable forms. This transformation expands wildcards, simplifies Boolean logic, and enables accurate cost estimation. Rewriting requires an IndexReader because the final form depends on what terms exist in the index.

Rewriting is a multi-pass process. Query.rewrite(IndexReader) is called repeatedly until the query returns itself (no more transformations possible). This is why wildcard queries can be slow: they must enumerate matching terms before execution can begin.

Common Rewrites

Original Query              | Rewritten Form                             | Why
title:lap*                  | BooleanQuery(OR: laptop, lap, lapel, ...)  | Expand wildcard to actual terms
+A +B -C (C=0 docs)         | +A +B                                      | Eliminate no-op clauses
BooleanQuery(MUST: A)       | A                                          | Unwrap single-clause BooleanQuery
ConstantScoreQuery(q)       | q (with score=1.0)                         | Remove scoring overhead
Wildcard Warning

Leading wildcards (*laptop) can be extremely slow because they match potentially millions of terms. The rewrite phase must enumerate all matching terms before execution begins. Use edge-ngram tokenization instead.
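The fixpoint loop itself is tiny. A sketch with illustrative stand-in classes (real Lucene rewriting happens in Java against an IndexReader) showing the single-clause unwrap from the table above:

```python
class Query:
    def rewrite(self, reader):
        return self                    # already in executable form

class BooleanQuery(Query):
    def __init__(self, must):
        self.must = must
    def rewrite(self, reader):
        rewritten = [q.rewrite(reader) for q in self.must]
        if len(rewritten) == 1:
            return rewritten[0]        # unwrap single-clause BooleanQuery
        return BooleanQuery(rewritten) if rewritten != self.must else self

def rewrite_to_fixpoint(query, reader):
    # Keep rewriting until the query returns itself unchanged
    while True:
        rewritten = query.rewrite(reader)
        if rewritten is query:
            return query
        query = rewritten
```

A nested BooleanQuery(MUST: BooleanQuery(MUST: A)) collapses to plain A after the loop converges, which is why cost estimates are taken on the rewritten form, not the original.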


7. Profile API

Elasticsearch's Profile API provides detailed timing breakdowns for query execution. Add "profile": true to any search request to see where time is spent. This is essential for diagnosing slow queries—but note that profiling adds ~10× overhead, so never use it in production.

The profile output is structured per-shard and shows the actual Lucene query types (which may differ from your Elasticsearch Query DSL after rewriting). Each query component shows its breakdown of low-level timing metrics.

profile_request.json
{
  "profile": true,
  "query": {
    "bool": {
      "must": [{ "match": { "title": "laptop" } }],
      "filter": [{ "range": { "price": { "lte": 1000 } } }]
    }
  }
}

Key Breakdown Metrics

create_weight

Time to create Weight object (IDF calculation, query normalization)

build_scorer

Time to create Scorer for each segment (posting list setup)

next_doc

Time iterating to next matching document (posting list traversal)

advance

Time skipping to target docID (skip list navigation)

score

Time computing relevance scores (TF-IDF/BM25 calculation)

match

Time in TwoPhaseIterator.matches() (phrase/span verification)

Diagnosing Slow Queries
High next_doc: Large posting lists being traversed. Consider more selective filters.
High build_scorer: Complex query structure. Simplify boolean combinations.
High match: Expensive TwoPhaseIterator (phrase queries). Consider shingles.
High score: Complex similarity function. Avoid scripts in scoring.

8. Distributed Execution

Elasticsearch distributes queries across multiple shards. The query-then-fetch model minimizes network traffic by first collecting just docIDs and scores, then fetching full documents only for the final top-K results.

This two-phase approach is crucial for efficiency. If we fetched full documents immediately, a query requesting top-10 results from 5 shards would transfer 50 full documents over the network, only to discard 40 of them. With query-then-fetch, we transfer only 50 lightweight (docID, score) pairs in phase 1.

Phase 1: Query
  1. Coordinator routes to relevant shards
  2. Each shard executes query locally
  3. Returns top-K (docID, score) pairs
  4. Coordinator merges and re-sorts globally
Phase 2: Fetch
  1. Coordinator identifies global top-K
  2. Requests full documents from owning shards
  3. Assembles final response
  4. Returns to client
// Query-Then-Fetch Flow
Client → Coordinator: "search for laptop, top 10"
Phase 1: Query
Coordinator → Shard₁, Shard₂, Shard₃: "execute query, return top 10"
Shard₁ → Coordinator: [(doc5, 4.2), (doc12, 3.8), ...]
Shard₂ → Coordinator: [(doc99, 5.1), (doc45, 2.9), ...]
Shard₃ → Coordinator: [(doc7, 4.0), ...]
Phase 2: Fetch
Coordinator: merge → global top 10 = [doc99, doc5, doc7, ...]
Coordinator → Shard₂: "fetch doc99"
Coordinator → Shard₁: "fetch doc5"
...
Coordinator → Client: [full results with _source]
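The phase-1 merge at the coordinator is a straightforward top-K selection over lightweight pairs. A sketch using the shard results from the flow above (the shard names and scores are the invented example data, and merge_shard_results is an illustrative name, not an Elasticsearch API):

```python
import heapq

def merge_shard_results(shard_results, k):
    """Globally re-sort per-shard top-K hits and keep the best k."""
    all_hits = [(score, shard, doc)
                for shard, hits in shard_results.items()
                for doc, score in hits]
    top = heapq.nlargest(k, all_hits)          # global re-sort by score
    return [(shard, doc, score) for score, shard, doc in top]

shard_results = {
    "shard1": [("doc5", 4.2), ("doc12", 3.8)],
    "shard2": [("doc99", 5.1), ("doc45", 2.9)],
    "shard3": [("doc7", 4.0)],
}
top3 = merge_shard_results(shard_results, k=3)
# Phase 2 would now fetch only doc99, doc5, and doc7 from their shards
```

Note that only (docID, score) pairs cross the network in phase 1; the _source of the losing documents (doc12, doc45) is never transferred.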

Key Takeaways

01

Query → Weight → Scorer

Three-layer pipeline: Query defines WHAT, Weight computes IDF, Scorer iterates and scores per segment.

02

All Matching is Iteration

DocIdSetIterator (DISI) is the heart of Lucene. AND uses leap-frog, OR uses priority queue.

03

Cost Drives Order

Lucene executes cheapest clauses first in conjunctions, using posting list length as cost estimate.

04

Profile API

Add "profile": true to see breakdown of create_weight, build_scorer, next_doc, advance, and score timings.