Systems Atlas

Chapter 3.1: Indexing & Infrastructure

What is an Index?

Before a search engine can find anything, it must organize everything. An index is the pre-computed data structure that makes sub-millisecond lookups possible across billions of documents.


What Exactly IS an Index?

An index in a search engine is a pre-computed data structure that maps terms to the documents that contain them, enabling direct lookup and retrieval without scanning every document. It flips the relationship from "Document contains Terms" to "Term is in Documents".

Forward vs. Inverted Index

Forward Index (Document → Terms)
  Doc 1: ["apple", "banana"]
  Doc 2: ["banana", "cherry"]
  Good for: storage; retrieving full document content.
  Bad for: "Find docs with apple" (must scan all N documents, O(N)).

Inverted Index (Term → Documents)
  "apple": [Doc 1]
  "banana": [Doc 1, Doc 2]
  "cherry": [Doc 2]
  Good for: search (a single O(1) lookup).
  This is how search engines work.
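The two layouts can be sketched in a few lines of Python. This is a toy illustration of the lookup asymmetry, not how a real engine stores postings:

```python
# Toy sketch: invert a forward index, then answer "find docs with apple".
from collections import defaultdict

forward_index = {
    1: ["apple", "banana"],
    2: ["banana", "cherry"],
}

# Invert the mapping: term -> list of doc IDs containing it.
inverted_index = defaultdict(list)
for doc_id, terms in forward_index.items():
    for term in set(terms):
        inverted_index[term].append(doc_id)

# Forward index: must scan every document (O(N)).
scan_hits = [d for d, terms in forward_index.items() if "apple" in terms]

# Inverted index: one dictionary lookup (O(1)).
lookup_hits = inverted_index["apple"]

print(scan_hits, lookup_hits)  # [1] [1]
```

Both paths return the same answer; the difference is that the scan touches every document while the lookup touches one dictionary entry.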

The Card Catalog Problem

To understand the necessity of an index, imagine a library with one million books arranged purely by arrival date. A visitor asks: "Which books mention machine learning?" Without a lookup system, you would have to walk to the first shelf, open the first book, scan every page for the phrase, and then repeat this process for the remaining 999,999 books. This "linear scan" is how simple system tools like `grep` work, but it is impractical at scale.

Without an Index
  • Walk to shelf 1, open book 1
  • Scan for "machine learning"... not found
  • Repeat for 1,000,000 books
  • Time: several years of your life

With an Index
  • Open the card catalog
  • Look up "machine learning"
  • Find: Books #4521, #8903, #15234...
  • Time: 5 minutes

Inside the Black Box: Index Structure

1. Core Components

An inverted index consists of two main parts:

  • Term Dictionary: A sorted list of all unique terms (vocabulary) in the corpus.
  • Posting Lists: For each term, a list of Document IDs that contain it.

2. Positional Information

To support phrase queries ("San Francisco") and proximity search ("limit" NEAR "speed"), the index stores positions:

"limit": [
  { doc_id: 1, positions: [10, 24] },
  { doc_id: 5, positions: [2] }
]
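A phrase query then becomes a merge of positional posting lists: the phrase matches when the first term appears at position p and the second at p + 1 in the same document. A small Python sketch (the `postings` data here is hypothetical, shaped like the entry above, with a made-up "speed" term):

```python
# Hypothetical positional postings: term -> {doc_id: [positions]}.
postings = {
    "speed": {1: [9, 23], 5: [7]},
    "limit": {1: [10, 24], 5: [2]},
}

def phrase_match(first, second):
    """Return doc IDs where `second` appears directly after `first`."""
    hits = []
    for doc_id, first_positions in postings[first].items():
        second_positions = set(postings[second].get(doc_id, []))
        if any(p + 1 in second_positions for p in first_positions):
            hits.append(doc_id)
    return hits

print(phrase_match("speed", "limit"))  # [1]
```

Doc 1 matches because "speed" at position 9 is followed by "limit" at position 10; doc 5 contains both terms but not adjacently, so it is excluded.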

How an Index is Built (Construction)

1. Tokenize: break text into words. "Hello World" → ["Hello", "World"]
2. Normalize: lowercase and stem. ["hello", "world"]
3. Group: group postings by term. hello: [doc1], world: [doc1]
4. Write: flush to disk as an immutable segment.
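The four steps can be condensed into a toy Python pipeline, assuming whitespace tokenization and lowercase-only normalization (no stemming):

```python
# Toy index construction: tokenize -> normalize -> group -> freeze.
from collections import defaultdict

def tokenize(text):
    return text.split()                     # 1. Tokenize

def normalize(tokens):
    return [t.lower() for t in tokens]      # 2. Normalize (no stemming here)

def build_segment(docs):
    segment = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(normalize(tokenize(text))):
            segment[term].append(doc_id)    # 3. Group postings by term
    return dict(segment)                    # 4. "Flush": return a frozen snapshot

segment = build_segment({"doc1": "Hello World"})
print(sorted(segment.items()))  # [('hello', ['doc1']), ('world', ['doc1'])]
```

Real engines write step 4 to disk as an immutable segment file; here the "segment" is just a plain dict that is never mutated again.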

The First Law of Search

"We pay upfront with slower writes and more storage to get faster reads."

This performance gain is not a free lunch. We are explicitly trading **write speed** and **storage space** to gain **read speed**. Every time you add a document, the system must spend significant CPU time analyzing the text and updating complex data structures. It also needs to store these structures on disk, often consuming 3-5x the space of the raw data itself. Understanding this "First Law" is critical: search engines are optimized for reading, not writing.

| Aspect | Traditional Database | Search Engine |
| --- | --- | --- |
| Write speed | Fast (just append) | Slower (must analyze & index) |
| Text search | 5000ms (full table scan) | 5ms (index lookup) |
| Storage | 1x (just the data) | 3-5x (data + index structures) |
| Consistency | Immediate | Eventually consistent (~1s delay) |

Anatomy of Document Indexing

When you send a JSON document to a search engine, it doesn't simply store the file like a database or file system. Instead, it deconstructs the document field by field. It analyzes text, extracts numbers, and simultaneously routes different data types into specialized data structures. This process transforms a single "document" into entries across an Inverted Index (for text), BKD Trees (for numbers), and Columnar Stores (for sorting).

Input Document

{
  "title": "MacBook Pro M3",
  "price": 2499.99,
  "category": "laptops"
}

Analysis Pipeline

"MacBook Pro M3" → ["macbook", "pro", "m3"]

Destination Structures

  • Inverted Index → text search
  • BKD Tree → range queries
  • DocValues → sorting
  • Stored Fields → retrieval

Write Amplification

A single document insert triggers approximately 20 internal operations: term dictionary updates, posting list appends, BKD tree insertions, and translog writes. This operational overhead is the cost of sub-millisecond read performance.

The Index Hierarchy

A search "index" is not a single file; it's a distributed hierarchy designed for horizontal scaling and fault tolerance.

INDEX: "products" (logical namespace)
  Routing: hash(doc_id) % N
  ├─ Shard 0: Primary + Replica → Segments
  ├─ Shard 1: Primary + Replica → Segments
  └─ Shard 2: Primary + Replica → Segments
| Level | Purpose | Changeable? |
| --- | --- | --- |
| Index | Logical namespace for querying | Settings only |
| Shards | Horizontal partitioning | No (fixed at creation) |
| Replicas | Fault tolerance & read throughput | Yes, anytime |
| Segments | Immutable storage units | Automatic |
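The routing formula hash(doc_id) % N is also why shard count is fixed at creation: change N and almost every document maps to a different shard, forcing a full reindex. A sketch (Elasticsearch actually uses a murmur3-based hash on the routing value; MD5 is substituted here purely for a stable, stdlib-only illustration):

```python
# Routing sketch: hash(doc_id) % number_of_shards picks the owning shard.
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    # Stable hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

print(shard_for("product-42", 3))
print(shard_for("product-42", 4))  # often a different shard: resharding moves data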

Creating an Index in Elasticsearch

Now that we understand the theory, what does this look like in practice? When you create an index, you define its settings (infrastructure) and mappings (data structure). This is a one-time setup decision that is expensive to change later.

PUT /products
{
  "settings": {
    "number_of_shards": 3,         // Fixed at creation!
    "number_of_replicas": 1,       // Can change later
    "refresh_interval": "1s"       // Near-realtime latency
  },
  "mappings": {
    "properties": {
      "title":      { "type": "text", "analyzer": "english" },
      "price":      { "type": "float" },
      "category":   { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  }
}
| Setting | What It Does | Guidance |
| --- | --- | --- |
| number_of_shards: 3 | Splits data into 3 partitions | Target 20-50GB per shard |
| number_of_replicas: 1 | 1 backup copy per shard | 0 for dev, 1+ for prod |
| type: "text" | Full-text searchable | Natural language content |
| type: "keyword" | Exact matches only | IDs, categories, tags |

Capacity Planning Reference

How big will your index be? As a rule of thumb, the index takes up 3-5x the raw storage space of your data due to the multiple data structures (Inverted Index, BKD Trees, DocValues, Stored Fields) maintained for each document.

Startup: 100K documents
  Raw: 200MB → Index: ~600MB
  Shards: 1
  Nodes: single instance

Growth Stage: 10M documents
  Raw: 20GB → Index: ~60GB
  Shards: 2-3
  Nodes: 3 (for HA)

Enterprise: 500M documents
  Raw: 1TB → Index: ~3TB
  Shards: 60-100
  Nodes: 15-20 (multi-AZ)
The Sizing Formula
Optimal Shards = Total Index Size ÷ 30GB
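The formula is simple enough to run as arithmetic, assuming the ~3x index overhead from the tiers above and a 30GB per-shard target:

```python
# Shard planning sketch: raw size -> index size -> shard count.
import math

def plan_shards(raw_gb: float, overhead: float = 3.0, target_gb: float = 30.0) -> int:
    index_gb = raw_gb * overhead          # index is ~3x the raw data
    return max(1, math.ceil(index_gb / target_gb))

print(plan_shards(20))    # growth stage: 20GB raw -> ~60GB index -> 2 shards
print(plan_shards(1024))  # enterprise: ~1TB raw -> ~3TB index -> 103 shards
```

The enterprise result (103) lands at the top of the 60-100 range above; in practice the overhead multiplier varies by mapping, so treat the output as a starting point, not a guarantee.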

When to Use What: Decision Guide

Each indexing structure solves a different problem. Here's how to choose the right tool for your query patterns.

| Query Pattern | Use This | Why |
| --- | --- | --- |
| "wireless keyboard" | Inverted Index | Full-text search, tokenization, relevance |
| price > 100 AND price < 500 | BKD Tree | O(log N) range queries on numbers/dates |
| Sort by price ASC | DocValues | Column-oriented storage for fast sorting |
| Count by category | DocValues | Aggregations need doc → value lookups |
| "affordable laptop" (semantic) | Vector Index (HNSW) | Meaning-based matching, not keywords |
| Within 5km of location | BKD Tree (2D) | geo_point uses 2D BKD internally |
| category = "laptops" (exact) | keyword type | Not analyzed, exact match, aggregatable |

Use type: "text" when:

  • Users search with natural language
  • You need relevance ranking (BM25)
  • Content varies in length (descriptions, titles)
  • Stemming/synonyms are helpful

Use type: "keyword" when:

  • Exact matches only (IDs, emails, SKUs)
  • Used in filters or aggregations
  • Case-sensitive matching needed
  • Low cardinality (categories, status)

The Power of Hybrid: text + keyword subfield

// Best of both worlds
"title": {
  "type": "text",                     // For full-text search
  "fields": {
    "raw": { "type": "keyword" }      // For exact match & aggs
  }
}

Search on title, aggregate on title.raw.
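As a sketch of how the two fields divide the work, a search request shaped like the following (index and field names assume the mapping above) runs the full-text match against the analyzed title while the terms aggregation reads the exact values from title.raw:

```json
GET /products/_search
{
  "query": {
    "match": { "title": "macbook" }
  },
  "aggs": {
    "by_title": {
      "terms": { "field": "title.raw" }
    }
  }
}
```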

When to Add Vector Search

✓ Good Fit
  • Users search with synonyms ("cheap" vs "affordable")
  • Long-tail queries with zero keyword matches
  • Similar item recommendations
  • FAQ/support article search

✗ Not Needed
  • Exact product model search ("iPhone 15 Pro")
  • SKU or ID lookups
  • Structured filters only (price, color)
  • Small corpus with good keyword coverage

Indexes Beyond Search Engines

Inverted indexes aren't just for Google or Elasticsearch. They power:

Databases (Postgres GIN)
PostgreSQL uses GIN (Generalized Inverted Index) for JSONB and text search.
Log Analytics
Splunk and Datadog use inverted indexes to find error logs instantly.
E-Commerce Filters
Facet counts on sidebars are powered by inverted index aggregations.

Key Takeaways

01. Trade-offs
An index trades write performance and storage for O(1) read performance on text.

02. Hierarchy
The hierarchy is Index → Shards → Replicas → Segments. Shard count cannot change after creation.

03. Capacity Planning
Target 20-50GB per shard. Plan capacity before you create the index.