Systems Atlas

Chapter 3.3: Indexing & Infrastructure

BKD Trees & DocValues

The inverted index excels at text, but struggles with numbers. Range queries, sorting, and aggregations require specialized data structures: BKD Trees and DocValues.


Where BKD & DocValues Fit in the Search Pipeline

In a modern search engine, different data structures serve different stages of the pipeline:

1. Query
2. Text Recall → Inverted Index
3. Filter (Numeric/Geo) → BKD Trees
4. Sort & Aggregate → DocValues
5. Ranking
BKD answers "Which docs match this range?". DocValues answer "What value does this doc have?". They are complementary.

Why Inverted Index Fails for Range Queries

The inverted index is perfect for exact keyword matches, but it fails miserably at range queries. Imagine trying to find products priced between $100 and $200. In an inverted index, numbers are just terms. You would have to look up every single possible price point: "100.00", "100.01", "100.02"... up to "200.00". That creates thousands of independent dictionary lookups for a simple query. We need a structure that understands order.

Inverted Index for Numbers
Query: price > 100
1. Lookup "101" → [Doc1]
2. Lookup "102" → [Doc5]
3. Lookup "103" → [Doc2]
...
990,000 lookups later...
Complexity: O(Range Size) ❌

BKD Tree Approach
Query: price > 100
1. Navigate to root
2. Follow branch ">100"
3. Scan matching leaf blocks
Complexity: O(log N) ✅
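The contrast above can be sketched with a toy example. A sorted array of values (standing in for the ordered leaf layout of a BKD tree, not Lucene's actual on-disk format) lets two binary searches replace thousands of per-term dictionary lookups:

```python
import bisect

# One price per doc, kept sorted -- a stand-in for an ordered numeric index.
prices = sorted([99, 101, 150, 175, 250, 300])

def range_count_sorted(lo, hi):
    """O(log N): binary-search both ends instead of probing every term."""
    left = bisect.bisect_left(prices, lo)
    right = bisect.bisect_right(prices, hi)
    return right - left

print(range_count_sorted(100, 200))  # 3 (matches 101, 150, 175)
```

The inverted-index approach would instead probe every distinct value in [100, 200], so its cost grows with the width of the range rather than the size of the dataset.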

Deep Dive: Inside the BKD Tree

1. What it Actually Stores

A BKD tree (Block KD Tree) is a space-partitioning index. Internally:

  • Stores points grouped into fixed-size blocks (leaves).
  • Each block has a bounding box (min/max per dimension).
  • Internal nodes store only these bounding boxes to guide navigation.

2. How it is Built

Unlike databases that update rows, BKD trees are immutable per segment:

  • All points for a field are collected in an in-memory buffer.
  • The space is recursively split (like a k-d tree) until leaves are small.
  • Leaf blocks are flushed to disk sequentially.

3. How Range Queries Execute

For a query like price BETWEEN 1000 AND 2000:

  1. Start at the root node.
  2. Check node's bounding box against query range.
    • If node.max < 1000 → SKIP SUBTREE (Pruned)
    • If node.min > 2000 → SKIP SUBTREE (Pruned)
    • Otherwise → Descend into children.
  3. At leaf nodes, scan actual values and collect matching docIDs.

Result: Large parts of the tree are skipped without touching disk.
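The pruning logic above can be sketched with hypothetical `Node` objects (a toy model, not Lucene's on-disk format): inner nodes carry only min/max bounds, leaves carry blocks of (value, docID) points.

```python
class Node:
    def __init__(self, lo, hi, children=None, points=None):
        self.lo, self.hi = lo, hi        # bounding range of this subtree
        self.children = children or []   # inner node: child subtrees
        self.points = points or []       # leaf block: (value, doc_id) pairs

def range_query(node, lo, hi, out):
    if node.hi < lo or node.lo > hi:     # subtree entirely outside range
        return                           # -> pruned, never touched
    if node.points:                      # leaf: scan the block
        out.extend(doc for v, doc in node.points if lo <= v <= hi)
        return
    for child in node.children:          # inner: descend
        range_query(child, lo, hi, out)

leaf1 = Node(0, 900, points=[(500, 1), (900, 2)])
leaf2 = Node(1000, 2500, points=[(1200, 3), (2500, 4)])
root = Node(0, 2500, children=[leaf1, leaf2])

hits = []
range_query(root, 1000, 2000, hits)
print(hits)  # [3] -- leaf1 is pruned by its bounds; only doc 3 matches
```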

How BKD Trees Work

A BKD Tree (Block K-Dimensional Tree) solves this by organizing numbers into a hierarchy of blocks. It recursively cuts the number line in half. To find "Price > 200", we don't scan all numbers; we check the top-level block. If it covers $0-$200, we skip it entirely. If it covers $200-$500, we descend into it. This allows us to discard massive chunks of data instantly, achieving logarithmic O(log N) performance.

All Products ($0 - $500)
├── $0 - $200 (skipped)
│   ├── $0-$100
│   └── $100-$200
└── $200 - $500 ✓
    ├── $200-$350
    └── $350-$500

For query price > 200, the engine skips the entire left branch ($0-$200).
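The recursive halving can be sketched as a median split. This is a simplified model: `LEAF_SIZE` here is illustrative (real Lucene leaves hold hundreds of points), and the tree is built as plain dicts rather than the packed on-disk layout.

```python
LEAF_SIZE = 2  # illustrative; real BKD leaves are much larger

def build(points):
    """Recursively split sorted points at the median into bounded blocks."""
    points = sorted(points)
    if len(points) <= LEAF_SIZE:
        return {"min": points[0], "max": points[-1], "leaf": points}
    mid = len(points) // 2
    left, right = build(points[:mid]), build(points[mid:])
    # Inner node stores only the bounds needed to navigate / prune.
    return {"min": left["min"], "max": right["max"],
            "children": [left, right]}

tree = build([120, 450, 80, 300, 220, 390])
print(tree["min"], tree["max"])  # 80 450
```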

Performance: The Numbers

Dataset: 100 million products with a price field

Query                          Inverted Index   BKD Tree
price = 100                    5ms              2ms
price > 100                    5000ms (!)       10ms
price BETWEEN 100 AND 200      500ms            5ms
price BETWEEN 100 AND 100000   Timeout          15ms

Mental Model: Performance Complexity

Why BKD is faster for ranges:

  • Inverted Index: O(Scan Term List Size) ≈ O(N) for wide ranges.
  • DocValues Scan: O(N) linear scan of all docs.
  • BKD Tree Range: O(log N + matches) ≈ Tree Pruning.

DocValues: Columnar Storage for Sorting

Search engines have a split personality. For searching (filtering), they use the Inverted Index ("Which docs have term X?"). But for sorting and aggregating, they need the reverse: "What is the value of field Y for Doc Z?". Opening every JSON document to extract a price for sorting is too slow. DocValues solves this by storing field values in tight, column-oriented arrays on disk, optimized for sequential access.

Without DocValues (Slow)

// Sort by price - parse each JSON
doc1 → parse {"title":"..", "price":2499}
doc2 → parse {"title":"..", "price":1299}
doc3 → parse {"title":"..", "price":1899}
10,000 JSON parses = 500ms

With DocValues (Fast)

// Pre-computed column of prices
price_col: [2499, 1299, 1899, ...]
Just array index lookups!
10,000 lookups = 5ms
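The two strategies above can be sketched directly. `docs` simulates stored JSON `_source`, and `price_col` simulates a pre-extracted DocValues column where the array index is the docID:

```python
import json

docs = ['{"title":"Mac","price":2499}',
        '{"title":"Dell","price":1299}',
        '{"title":"HP","price":1899}']

# Slow path: parse every JSON document just to read one field.
by_parse = sorted(range(len(docs)),
                  key=lambda d: json.loads(docs[d])["price"])

# Fast path: plain array lookups into the pre-built column.
price_col = [2499, 1299, 1899]
by_column = sorted(range(len(price_col)), key=lambda d: price_col[d])

print(by_column)  # [1, 2, 0] -> Dell, HP, Mac
assert by_parse == by_column
```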

Deep Dive: Inside DocValues

DocValues are not just arrays. Internally, they are:

  • Block-compressed: Stored in 16KB blocks compressed with LZ4.
  • Delta-encoded: Stores the difference between values (smaller integers = fewer bits).
  • Sequential: Laid out by DocID order for CPU-cache friendliness.
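The delta-encoding idea can be shown in a few lines. This is a toy version: real DocValues additionally bit-pack the deltas and compress whole blocks, which is skipped here.

```python
def delta_encode(values):
    """Store the first value plus successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the originals by running-sum over the deltas."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

timestamps = [1700000000, 1700000005, 1700000012, 1700000020]
deltas = delta_encode(timestamps)   # [1700000000, 5, 7, 8]
assert delta_decode(deltas) == timestamps
```

The small deltas (5, 7, 8) need only a few bits each, which is where the storage savings come from.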

Row Store vs Column Store

Row-Oriented (JSON _source):
doc1: {title:"Mac", price:2499, stock:50}
doc2: {title:"Dell", price:1299, stock:120}

Column-Oriented (DocValues):
price_col: [2499, 1299, ...]
title_col: ["Mac", "Dell", ...]
stock_col: [50, 120, ...]

How BKD and DocValues Work Together

A typical query like "Sort electronics by price" uses both:

1. BKD Tree (Filtering)

Used first to find the candidate DocIDs.
"Find all docs where category='electronics'".

2. DocValues (Access)

Used second on the matching DocIDs to get values.
"For each match, load 'price' to sort".

Optimization:

If a text filter reduces results to a tiny set, the engine might skip BKD and just verify the numeric condition using DocValues (linear scan of small set).
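The two-step plan can be sketched with a hypothetical `bkd_filter` standing in for the BKD index (here just a linear scan over a toy column, using a numeric range filter for simplicity) and a list standing in for the DocValues column:

```python
def bkd_filter(price_col, lo, hi):
    """Stand-in for the BKD step: returns matching docIDs."""
    return [d for d, p in enumerate(price_col) if lo <= p <= hi]

price_col = [2499, 1299, 1899, 99, 1500]  # DocValues column (index = docID)

# 1. Filter: the index produces candidate docIDs.
candidates = bkd_filter(price_col, 1000, 2000)        # [1, 2, 4]
# 2. Access: DocValues supply sort keys for just those docs.
ranked = sorted(candidates, key=lambda d: price_col[d])
print(ranked)  # [1, 4, 2] -> 1299, 1500, 1899
```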

Facets & Aggregations: Analytics Engine

A facet is not just a UI feature. It is an aggregation query (GROUP BY) over the index.

  • Brand: Apple (50)
  • Price: $0-$100 (200)
  • Rating: 4+ Stars (15)

How Faceting Executes

  1. Query Phase: Inverted Index & BKD find matching DocIDs.
     "Find laptops" → [Doc 1, Doc 4, Doc 9]
  2. Collection Phase (DocValues): For each DocID, read the value from the column store.
     Doc 1 → "Apple", Doc 4 → "Dell", Doc 9 → "Apple"
  3. Aggregation Phase: Increment counters in a hash table.
     Apple: 2, Dell: 1
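The three phases above can be sketched with toy data, where `brand_col` plays the role of the DocValues column for "brand":

```python
from collections import Counter

brand_col = {1: "Apple", 4: "Dell", 9: "Apple"}  # docID -> brand value

# Phase 1: the query phase produced these matching docIDs.
matches = [1, 4, 9]

# Phases 2 + 3: read the column value per docID, then count.
facet = Counter(brand_col[d] for d in matches)
print(facet)  # Counter({'Apple': 2, 'Dell': 1})
```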

Categorical (Brand/Color)

Uses Ordinals. Strings are mapped to integers (0="Apple", 1="Dell") for fast counting arrays. Fast for low-cardinality fields.

Numeric (Price/Date)

Uses BKD + DocValues. BKD quickly identifies bucket ranges, or DocValues scans values to bin them. Variable speed.

// Example Facet Request
"aggs": {
  "by_brand": { "terms": { "field": "brand.keyword" } },
  "by_price": { "range": { "field": "price", "ranges": [...] } }
}

Why Faceting is Expensive

Faceting complexity is O(N) where N = number of query matches.
Calculating facets for "all products" (millions of docs) triggers a massive scan of DocValues.

Global Ordinals: The Memory Trap

Aggregating on text (keyword) fields comes with a hidden cost. Strings are heavy and slow to compare, so to make aggregations fast, Elasticsearch replaces each unique string with a small integer ID (an "ordinal") at query time. This mapping table must be built and held in Java heap memory. If you have a high-cardinality field (like `user_id` with 100M unique values), this table can consume gigabytes of heap, potentially causing OutOfMemory crashes.

Heap Memory Warning

// 10,000 unique brands
10,000 × 30 bytes = 300 KB (Fine)
// 100 million unique user_ids
100M × 36 bytes = 3.6 GB of HEAP!

High-cardinality string aggregations can crash your cluster. Use composite aggregation for pagination or avoid aggregating on unique fields.
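The ordinal trick itself is simple to sketch: map each unique string to an integer once, then aggregate over a plain int array instead of strings. The dict below is the part that corresponds to the heap-resident mapping table.

```python
values = ["Apple", "Dell", "Apple", "HP", "Dell", "Apple"]  # per-doc brands

ordinals = {}   # string -> ordinal; this table is what lives on the heap
ord_col = []    # per-doc ordinal column
for v in values:
    ordinals.setdefault(v, len(ordinals))
    ord_col.append(ordinals[v])

counts = [0] * len(ordinals)   # counting array indexed by ordinal
for o in ord_col:
    counts[o] += 1

terms = sorted(ordinals, key=ordinals.get)  # ordinal -> string, in order
print(dict(zip(terms, counts)))  # {'Apple': 3, 'Dell': 2, 'HP': 1}
```

With 10,000 brands the `ordinals` table is trivial; with 100M unique user IDs it is the multi-gigabyte structure the warning above describes.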

Elasticsearch Field Type Reference

Choosing the right field type is critical for performance. `integer` and `long` are standard, but specialized types like `keyword` and `scaled_float` offer significant advantages for specific use cases.

Index Mapping
{
  "mappings": {
    "properties": {
      "title":      { "type": "text", "analyzer": "english" },
      "category":   { "type": "keyword" },  // Exact match, aggs
      "price":      { "type": "scaled_float", "scaling_factor": 100 },
      "quantity":   { "type": "integer" },
      "location":   { "type": "geo_point" }  // 2D BKD tree
    }
  }
}
Type           Bytes   Use For
byte           1       Small enums (0-127)
integer        4       Counts, quantities
long           8       Timestamps, big IDs
float          4       Prices (has rounding errors!)
scaled_float   4       Money (no rounding!)

Pro Tip: Use scaled_float for Money

// BAD: Float has precision issues
// 19.99 might become 19.989999771118164
// GOOD: Scaled float
"price": { "type": "scaled_float", "scaling_factor": 100 }
// 19.99 stored as 1999 (integer), no precision loss
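The encode/decode round trip behind `scaled_float` can be sketched in a few lines, alongside a demonstration of the float32 drift it avoids:

```python
import struct

scaling_factor = 100

def encode(price):
    """19.99 -> 1999: store an exact integer instead of a float."""
    return round(price * scaling_factor)

def decode(stored):
    """1999 -> 19.99: divide back out on read."""
    return stored / scaling_factor

assert encode(19.99) == 1999
assert decode(encode(19.99)) == 19.99

# Raw float32 storage, by contrast, drifts:
as_float32 = struct.unpack("f", struct.pack("f", 19.99))[0]
print(as_float32)  # 19.989999771118164
```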

Disabling DocValues (Space Optimization)

DocValues consume significant storage. If you never sort or aggregate on a field, you can disable them to save 30-50% storage.

Disable DocValues for text you won't sort
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "doc_values": false  // Can't sort/aggregate
      }
    }
  }
}

When to Disable

  • Text fields you'll NEVER sort on
  • Fields never used in aggregations
  • Saves ~30-50% storage per field

When NOT to Disable

  • Any field used in sorting
  • Fields used in aggregations
  • Fields used in script scoring

Geo Queries (BKD in 2D)

Geographic data isn't special; it's just a multi-dimensional BKD tree.

Multi-Dimensional BKD:

  • Geo: Dim 1 = Lat, Dim 2 = Lon.
  • Time+Price: Dim 1 = Timestamp, Dim 2 = Price.
Each node stores a bounding box in N-dimensional space. Entire regions (subtrees) are pruned if they don't intersect the query shape.

Geo Point Index & Document
// Mapping
{ "location": { "type": "geo_point" } }

// Document
{
  "name": "Starbucks",
  "location": { "lat": 37.7749, "lon": -122.4194 }
}
2D BKD Tree: Lat/Lon Space Partitioning

[Diagram: regions such as San Francisco, East Bay, South Bay, and the mountains carved into nested rectangles]

Geographic space split recursively like a quadtree

Distance Query: "Within 5km"

{
  "geo_distance": {
    "distance": "5km",
    "location": { "lat": 37.77, "lon": -122.41 }
  }
}
Execution: BKD finds the bounding box, then calculates exact distances.

Bounding Box Query

{
  "geo_bounding_box": {
    "location": {
      "top_left": { "lat": 38, "lon": -123 },
      "bottom_right": { "lat": 37.5, "lon": -122 }
    }
  }
}
Execution: Direct BKD tree traversal. Very fast!
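The two-phase distance query can be sketched end to end: a coarse bounding-box prune (the part the BKD traversal provides) followed by an exact haversine check on the survivors. The point names and the 111 km-per-degree approximation are illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

center, radius_km = (37.77, -122.41), 5.0
points = {"sf_cafe": (37.78, -122.42), "oakland": (37.80, -122.27)}

# Phase 1: cheap bounding-box prune (1 degree of latitude ~ 111 km).
half_lat = radius_km / 111.0
half_lon = radius_km / (111.0 * math.cos(math.radians(center[0])))
bbox_hits = {n: p for n, p in points.items()
             if abs(p[0] - center[0]) <= half_lat
             and abs(p[1] - center[1]) <= half_lon}

# Phase 2: exact distance only on bbox survivors.
hits = [n for n, p in bbox_hits.items()
        if haversine_km(*center, *p) <= radius_km]
print(hits)  # ['sf_cafe'] -- oakland is ~12 km away and pruned early
```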

Storage & Memory Benchmarks

Numeric indexing is extremely efficient. Compared to text indices, BKD trees are tiny (often 5% of original data size) and fast. Here is a breakdown of typical performance numbers for a 100M document dataset.

Operation                          Time
BKD range query (price 100-200)    5ms
DocValues sort (by price)          50ms
Aggregation (avg price)            100ms
Geo distance (within 5km)          30ms
Geo bounding box                   10ms

Memory Usage Comparison (100M docs)

Numeric Field:
  BKD tree: ~50 MB
  DocValues: ~400 MB
  Total: ~450 MB

Text Field (for comparison):
  Inverted index: ~2 GB
  (DocValues not supported for analyzed text)

What BKD and DocValues Are Not Used For

  • Text Matching ("contains word 'foo'") → use the Inverted Index.
  • Semantic Similarity ("looks like 'foo'") → use a Vector Index (HNSW).
  • Full Table Scans (filtering without an index) → avoid (slow).

Deletes, Updates, and Segments

BKD trees and DocValues are immutable per segment. This design choice has massive implications:

1. Deletes are Soft

Deleted documents are just marked in a "bitset" file. They are physically removed during segment merging.

2. Updates are Expensive

An update is actually Delete Old + Index New. This is why heavy update workloads churn CPU/Disk.

3. Merges are Heavy

Merging segments requires rebuilding the BKD tree and rewriting DocValues columns from scratch.
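The soft-delete mechanism can be sketched with a minimal `Segment` model (a Python set stands in for the on-disk bitset): deletes only mark docIDs, and every query filters its results through the live set until a merge physically purges them.

```python
class Segment:
    def __init__(self):
        self.deleted = set()          # stand-in for the deletion "bitset"

    def delete(self, doc_id):
        self.deleted.add(doc_id)      # soft delete: just a mark, no rewrite

    def live(self, doc_ids):
        """Intersect query results with the live set."""
        return [d for d in doc_ids if d not in self.deleted]

seg = Segment()
seg.delete(2)
print(seg.live([0, 1, 2, 3]))  # [0, 1, 3] -- doc 2 is hidden, not erased
```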

When Not to Use BKD or DocValues

  • Do not use BKD for text fields: It handles numbers/geo. Text belongs in the Inverted Index.
  • Do not rely on DocValues for filtering: Scanning billions of values is O(N). Always filter with BKD/Inverted Index first.
  • Disable DocValues for fields that are never sorted, aggregated, or scripted to save disk space.

Unifying Mental Model

A modern search engine is a multi-index system. It doesn't put everything in one bucket. Instead, it maintains four distinct data structures in parallel for every document:

  • Text Space → Inverted Index: maps terms to docs.
  • Numeric/Geo Space → BKD Trees: maps multi-dimensional points to docs.
  • Vector Space → HNSW Graph: maps semantic vectors to docs.
  • Columnar Space → DocValues: maps docs to values (sorting/aggregations).

Key Takeaways

01. BKD Trees: Enable O(log N) range queries on numeric, date, and geo fields.

02. DocValues: Store data column-wise to enable fast sorting and aggregations.

03. Precision: Use `scaled_float` for money to avoid floating-point precision issues.

04. Global Ordinals: High-cardinality aggregations on text fields consume massive heap memory.

05. Space Optimization: Disable DocValues on fields you never sort or aggregate to save 30-50% storage.