Systems Atlas

Chapter 3.1: Indexing & Infrastructure

What is an Index?

Before a search engine can find anything, it must organize everything. An index is the pre-computed data structure that makes sub-millisecond lookups possible across billions of documents.


What Exactly IS an Index?

An index in a search engine is a pre-computed data structure that maps terms to the documents that contain them, enabling direct lookup and retrieval without scanning every document. It flips the relationship from "Document contains Terms" to "Term is in Documents".

Forward vs. Inverted Index

Forward Index (Document → Terms)
  Doc 1: ["apple", "banana"]
  Doc 2: ["banana", "cherry"]
  Good for: storage; retrieving full document content.
  Bad for: "Find docs with apple" (must scan all N documents, O(N)).

Inverted Index (Term → Documents)
  "apple": [Doc 1]
  "banana": [Doc 1, Doc 2]
  "cherry": [Doc 2]
  Good for: search (a single O(1) lookup).
  This is how search engines work.
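The two layouts can be sketched in a few lines of Python. This is a toy illustration of the lookup asymmetry, not how a real engine stores postings:

```python
# Toy sketch: invert a forward index, then answer "find docs with apple".
from collections import defaultdict

forward_index = {
    1: ["apple", "banana"],
    2: ["banana", "cherry"],
}

# Invert the mapping: term -> list of doc IDs containing it.
inverted_index = defaultdict(list)
for doc_id, terms in forward_index.items():
    for term in set(terms):
        inverted_index[term].append(doc_id)

# Forward index: must scan every document (O(N)).
scan_hits = [d for d, terms in forward_index.items() if "apple" in terms]

# Inverted index: one dictionary lookup (O(1)).
lookup_hits = inverted_index["apple"]

print(scan_hits, lookup_hits)  # [1] [1]
```

Both paths return the same answer; the difference is that the scan touches every document while the lookup touches one dictionary entry.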

The Card Catalog Problem

To understand the necessity of an index, imagine a library with one million books arranged purely by arrival date. A visitor asks: "Which books mention machine learning?" Without a lookup system, you would have to walk to the first shelf, open the first book, scan every page for the phrase, and then repeat this process for the remaining 999,999 books. This "linear scan" is how simple system tools like `grep` work, but it is impractical at scale.

Without an Index
  • Walk to shelf 1, open book 1
  • Scan for "machine learning"... not found
  • Repeat for 1,000,000 books
  • Time: several years of your life

With an Index
  • Open the card catalog
  • Look up "machine learning"
  • Find: Books #4521, #8903, #15234...
  • Time: 5 minutes

Inside the Black Box: Index Structure

1. Core Components

An inverted index consists of two main parts:

  • Term Dictionary: A sorted list of all unique terms (vocabulary) in the corpus.
  • Posting Lists: For each term, a list of Document IDs that contain it.

2. Positional Information

To support phrase queries ("San Francisco") and proximity search ("limit" NEAR "speed"), the index stores positions:

"limit": [
  { doc_id: 1, positions: [10, 24] },
  { doc_id: 5, positions: [2] }
]
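A phrase query then becomes a merge of positional posting lists: the phrase matches when the first term appears at position p and the second at p + 1 in the same document. A small Python sketch (the `postings` data here is hypothetical, shaped like the entry above, with a made-up "speed" term):

```python
# Hypothetical positional postings: term -> {doc_id: [positions]}.
postings = {
    "speed": {1: [9, 23], 5: [7]},
    "limit": {1: [10, 24], 5: [2]},
}

def phrase_match(first, second):
    """Return doc IDs where `second` appears directly after `first`."""
    hits = []
    for doc_id, first_positions in postings[first].items():
        second_positions = set(postings[second].get(doc_id, []))
        if any(p + 1 in second_positions for p in first_positions):
            hits.append(doc_id)
    return hits

print(phrase_match("speed", "limit"))  # [1]
```

Doc 1 matches because "speed" at position 9 is followed by "limit" at position 10; doc 5 contains both terms but not adjacently, so it is excluded.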

How an Index is Built (Construction)

1. Tokenize: break text into words. "Hello World" → ["Hello", "World"]
2. Normalize: lowercase and stem. ["hello", "world"]
3. Group: group postings by term. hello: [doc1], world: [doc1]
4. Write: flush to disk as an immutable segment.
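The four steps can be condensed into a toy Python pipeline, assuming whitespace tokenization and lowercase-only normalization (no stemming):

```python
# Toy index construction: tokenize -> normalize -> group -> freeze.
from collections import defaultdict

def tokenize(text):
    return text.split()                     # 1. Tokenize

def normalize(tokens):
    return [t.lower() for t in tokens]      # 2. Normalize (no stemming here)

def build_segment(docs):
    segment = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(normalize(tokenize(text))):
            segment[term].append(doc_id)    # 3. Group postings by term
    return dict(segment)                    # 4. "Flush": return a frozen snapshot

segment = build_segment({"doc1": "Hello World"})
print(sorted(segment.items()))  # [('hello', ['doc1']), ('world', ['doc1'])]
```

Real engines write step 4 to disk as an immutable segment file; here the "segment" is just a plain dict that is never mutated again.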

The First Law of Search

"We pay upfront with slower writes and more storage to get faster reads."

This performance gain is not a free lunch. We are explicitly trading **write speed** and **storage space** to gain **read speed**. Every time you add a document, the system must spend significant CPU time analyzing the text and updating complex data structures. It also needs to store these structures on disk, often consuming 3-5x the space of the raw data itself. Understanding this "First Law" is critical: search engines are optimized for reading, not writing.

| Aspect | Traditional Database | Search Engine |
| --- | --- | --- |
| Write speed | Fast (just append) | Slower (must analyze & index) |
| Text search | 5000ms (full table scan) | 5ms (index lookup) |
| Storage | 1x (just the data) | 3-5x (data + index structures) |
| Consistency | Immediate | Eventually consistent (~1s delay) |

Anatomy of Document Indexing

When you send a JSON document to a search engine, it doesn't simply store the file like a database or file system. Instead, it deconstructs the document field by field. It analyzes text, extracts numbers, and simultaneously routes different data types into specialized data structures. This process transforms a single "document" into entries across an Inverted Index (for text), BKD Trees (for numbers), and Columnar Stores (for sorting).

Input Document

{
  "title": "MacBook Pro M3",
  "price": 2499.99,
  "category": "laptops"
}

Analysis Pipeline

"MacBook Pro M3" → ["macbook", "pro", "m3"]

Destination Structures

  • Inverted Index → text search
  • BKD Tree → range queries
  • DocValues → sorting
  • Stored Fields → retrieval

Write Amplification

A single document insert triggers approximately 20 internal operations: term dictionary updates, posting list appends, BKD tree insertions, and translog writes. This operational overhead is the cost of sub-millisecond read performance.

The Index Hierarchy

A search "index" is not a single file; it's a distributed hierarchy designed for horizontal scaling and fault tolerance.

INDEX: "products" (logical namespace)
  Routing: hash(doc_id) % N
  ├─ Shard 0: Primary + Replica → Segments
  ├─ Shard 1: Primary + Replica → Segments
  └─ Shard 2: Primary + Replica → Segments
| Level | Purpose | Changeable? |
| --- | --- | --- |
| Index | Logical namespace for querying | Settings only |
| Shards | Horizontal partitioning | No (fixed at creation) |
| Replicas | Fault tolerance & read throughput | Yes, anytime |
| Segments | Immutable storage units | Automatic |
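The routing formula hash(doc_id) % N is also why shard count is fixed at creation: change N and almost every document maps to a different shard, forcing a full reindex. A sketch (Elasticsearch actually uses a murmur3-based hash on the routing value; MD5 is substituted here purely for a stable, stdlib-only illustration):

```python
# Routing sketch: hash(doc_id) % number_of_shards picks the owning shard.
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    # Stable hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

print(shard_for("product-42", 3))
print(shard_for("product-42", 4))  # often a different shard: resharding moves data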

Creating an Index in Elasticsearch

Now that we understand the theory, what does this look like in practice? When you create an index, you define its settings (infrastructure) and mappings (data structure). This is a one-time setup decision that is expensive to change later.

PUT /products
{
  "settings": {
    "number_of_shards": 3,         // Fixed at creation!
    "number_of_replicas": 1,       // Can change later
    "refresh_interval": "1s"       // Near-realtime latency
  },
  "mappings": {
    "properties": {
      "title":      { "type": "text", "analyzer": "english" },
      "price":      { "type": "float" },
      "category":   { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  }
}
| Setting | What It Does | Guidance |
| --- | --- | --- |
| number_of_shards: 3 | Splits data into 3 partitions | Target 20-50GB per shard |
| number_of_replicas: 1 | 1 backup copy per shard | 0 for dev, 1+ for prod |
| type: "text" | Full-text searchable | Natural language content |
| type: "keyword" | Exact matches only | IDs, categories, tags |

Capacity Planning Reference

How big will your index be? As a rule of thumb, the index takes up 3-5x the raw storage space of your data due to the multiple data structures (Inverted Index, BKD Trees, DocValues, Stored Fields) maintained for each document.

Startup: 100K documents
  Raw: 200MB → Index: ~600MB
  Shards: 1
  Nodes: single instance

Growth Stage: 10M documents
  Raw: 20GB → Index: ~60GB
  Shards: 2-3
  Nodes: 3 (for HA)

Enterprise: 500M documents
  Raw: 1TB → Index: ~3TB
  Shards: 60-100
  Nodes: 15-20 (multi-AZ)
The Sizing Formula
Optimal Shards = Total Index Size ÷ 30GB
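The formula is simple enough to run as arithmetic, assuming the ~3x index overhead from the tiers above and a 30GB per-shard target:

```python
# Shard planning sketch: raw size -> index size -> shard count.
import math

def plan_shards(raw_gb: float, overhead: float = 3.0, target_gb: float = 30.0) -> int:
    index_gb = raw_gb * overhead          # index is ~3x the raw data
    return max(1, math.ceil(index_gb / target_gb))

print(plan_shards(20))    # growth stage: 20GB raw -> ~60GB index -> 2 shards
print(plan_shards(1024))  # enterprise: ~1TB raw -> ~3TB index -> 103 shards
```

The enterprise result (103) lands at the top of the 60-100 range above; in practice the overhead multiplier varies by mapping, so treat the output as a starting point, not a guarantee.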

When to Use What: Decision Guide

Each indexing structure solves a different problem. Here's how to choose the right tool for your query patterns.

| Query Pattern | Use This | Why |
| --- | --- | --- |
| "wireless keyboard" | Inverted Index | Full-text search, tokenization, relevance |
| price > 100 AND price < 500 | BKD Tree | O(log N) range queries on numbers/dates |
| Sort by price ASC | DocValues | Column-oriented storage for fast sorting |
| Count by category | DocValues | Aggregations need doc → value lookups |
| "affordable laptop" (semantic) | Vector Index (HNSW) | Meaning-based matching, not keywords |
| Within 5km of location | BKD Tree (2D) | geo_point uses 2D BKD internally |
| category = "laptops" (exact) | keyword type | Not analyzed, exact match, aggregatable |

Use type: "text" when:

  • Users search with natural language
  • You need relevance ranking (BM25)
  • Content varies in length (descriptions, titles)
  • Stemming/synonyms are helpful

Use type: "keyword" when:

  • Exact matches only (IDs, emails, SKUs)
  • Used in filters or aggregations
  • Case-sensitive matching needed
  • Low cardinality (categories, status)

The Power of Hybrid: text + keyword subfield

// Best of both worlds
"title": {
  "type": "text",                     // For full-text search
  "fields": {
    "raw": { "type": "keyword" }      // For exact match & aggs
  }
}

Search on title, aggregate on title.raw.
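As a sketch of how the two fields divide the work, a search request shaped like the following (index and field names assume the mapping above) runs the full-text match against the analyzed title while the terms aggregation reads the exact values from title.raw:

```json
GET /products/_search
{
  "query": {
    "match": { "title": "macbook" }
  },
  "aggs": {
    "by_title": {
      "terms": { "field": "title.raw" }
    }
  }
}
```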

When to Add Vector Search

✓ Good Fit
  • Users search with synonyms ("cheap" vs "affordable")
  • Long-tail queries with zero keyword matches
  • Similar item recommendations
  • FAQ/support article search

✗ Not Needed
  • Exact product model search ("iPhone 15 Pro")
  • SKU or ID lookups
  • Structured filters only (price, color)
  • Small corpus with good keyword coverage

Indexes Beyond Search Engines

Inverted indexes aren't just for Google or Elasticsearch. They power:

Databases (Postgres GIN)
PostgreSQL uses GIN (Generalized Inverted Index) for JSONB and text search.
Log Analytics
Splunk and Datadog use inverted indexes to find error logs instantly.
E-Commerce Filters
Facet counts on sidebars are powered by inverted index aggregations.

Key Takeaways

01. Trade-offs
An index trades write performance and storage for O(1) read performance on text.

02. Hierarchy
The hierarchy is Index → Shards → Replicas → Segments. Shard count cannot change after creation.

03. Capacity Planning
Target 20-50GB per shard. Plan capacity before you create the index.