Chapter 3.1: Indexing & Infrastructure
What is an Index?
Before a search engine can find anything, it must organize everything. An index is the pre-computed data structure that makes sub-millisecond lookups possible across billions of documents.
What Exactly IS an Index?
An index in a search engine is a pre-computed data structure that maps terms to the documents that contain them, enabling distinct lookup and retrieval without scanning every document. It flips the relationship from "Document contains Terms" to "Term is in Documents".
Forward vs. Inverted Index
Bad for: "Find docs with apple" (Must scan O(N)).
This is how search engines work.
The Card Catalog Problem
To understand the necessity of an index, imagine a library with one million books arranged purely by arrival date. A visitor asks: "Which books mention machine learning?". Without a lookup system, you would have to walk to the first shelf, open the first book, scan every page for the phrase, and then repeat this process for the remaining 999,999 books. This "linear scan" is how simple system tools like `grep` work, but it is impossible at scale.
- → Walk to shelf 1, open book 1
- → Scan for "machine learning"... not found
- → Repeat for 1,000,000 books
- Time: Several years of your life
- → Open card catalog
- → Look up "machine learning"
- → Find: Books #4521, #8903, #15234...
- Time: 5 minutes
Inside the Black Box: Index Structure
1. Core Components
An inverted index consists of two main parts:
- Term Dictionary: A sorted list of all unique terms (vocabulary) in the corpus.
- Posting Lists: For each term, a list of Document IDs that contain it.
2. Positional Information
To support phrase queries ("San Francisco") and proximity search ("limit" NEAR "speed"), the index stores positions:
{ doc_id: 1, positions: [10, 24] },
{ doc_id: 5, positions: [2] }
]
How an Index is Built (Construction)
The First Law of Search
"We pay upfront with slower writes and more storage to get faster reads."
This performance gain is not a free lunch. We are explicitly trading **write speed** and **storage space** to gain **read speed**. Every time you add a document, the system must spend significant CPU time analyzing the text and updating complex data structures. It also needs to store these structures on disk, often consuming 3-5x more space than the raw data itself. Understanding this "First Law" is critical: search engines are optimized for reading, not writing.
| Aspect | Traditional Database | Search Engine |
|---|---|---|
| Write Speed | Fast just append | Slower must analyze & index |
| Text Search | 5000ms full table scan | 5ms index lookup |
| Storage | 1x just the data | 3-5x data + index structures |
| Consistency | Immediate | Eventually consistent (~1s delay) |
Anatomy of Document Indexing
When you send a JSON document to a search engine, it doesn't simply store the file like a database or file system. Instead, it deconstructs the document field by field. It analyzes text, extracts numbers, and simultaneously routes different data types into specialized data structures. This process transforms a single "document" into entries across an Inverted Index (for text), BKD Trees (for numbers), and Columnar Stores (for sorting).
{
"title": "MacBook Pro M3",
"price": 2499.99,
"category": "laptops"
}Write Amplification
A single document insert triggers approximately 20 internal operations: term dictionary updates, posting list appends, BKD tree insertions, and translog writes. This operational overhead is the cost of sub-millisecond read performance.
The Index Hierarchy
A search "index" is not a single fileit's a distributed hierarchy designed for horizontal scaling and fault tolerance.
hash(doc_id) % N| Level | Purpose | Changeable? |
|---|---|---|
| Index | Logical namespace for querying | Settings only |
| Shards | Horizontal partitioning | NO fixed at creation |
| Replicas | Fault tolerance & read throughput | Yes, anytime |
| Segments | Immutable storage units | Automatic |
Creating an Index in Elasticsearch
Now that we understand the theory, what does this look like in practice? When you create an index, you define its settings (infrastructure) and mappings (data structure). This is a one-time setup decision that is expensive to change later.
{
"settings": {
"number_of_shards": 3, // Fixed at creation!
"number_of_replicas": 1, // Can change later
"refresh_interval": "1s" // Near-realtime latency
},
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "english" },
"price": { "type": "float" },
"category": { "type": "keyword" },
"created_at": { "type": "date" }
}
}
}| Setting | What It Does | Guidance |
|---|---|---|
| number_of_shards: 3 | Splits data into 3 partitions | Target 20-50GB per shard |
| number_of_replicas: 1 | 1 backup copy per shard | 0 for dev, 1+ for prod |
| type: "text" | Full-text searchable | Natural language content |
| type: "keyword" | Exact matches only | IDs, categories, tags |
Capacity Planning Reference
How big will your index be? As a rule of thumb, the index takes up 3-5x the raw storage space of your data due to the multiple data structures (Inverted Index, BKD Trees, DocValues, Stored Fields) maintained for each document.
When to Use What: Decision Guide
Each indexing structure solves a different problem. Here's how to choose the right tool for your query patterns.
| Query Pattern | Use This | Why |
|---|---|---|
| "wireless keyboard" | Inverted Index | Full-text search, tokenization, relevance |
| price > 100 AND price < 500 | BKD Tree | O(log N) range queries on numbers/dates |
| Sort by price ASC | DocValues | Column-oriented storage for fast sorting |
| Count by category | DocValues | Aggregations need doc → value lookups |
| "affordable laptop" (semantic) | Vector Index (HNSW) | Meaning-based matching, not keywords |
| Within 5km of location | BKD Tree (2D) | geo_point uses 2D BKD internally |
| category = "laptops" (exact) | keyword type | Not analyzed, exact match, aggregatable |
Use type: "text" when:
- • Users search with natural language
- • You need relevance ranking (BM25)
- • Content varies in length (descriptions, titles)
- • Stemming/synonyms are helpful
Use type: "keyword" when:
- • Exact matches only (IDs, emails, SKUs)
- • Used in filters or aggregations
- • Case-sensitive matching needed
- • Low cardinality (categories, status)
The Power of Hybrid: text + keyword subfield
Search on title, aggregate on title.raw.
When to Add Vector Search
- • Users search with synonyms ("cheap" vs "affordable")
- • Long-tail queries with zero keyword matches
- • Similar item recommendations
- • FAQ/support article search
- • Exact product model search ("iPhone 15 Pro")
- • SKU or ID lookups
- • Structured filters only (price, color)
- • Small corpus with good keyword coverage
Indexes Beyond Search Engines
Inverted indexes aren't just for Google or Elasticsearch. They power:
Key Takeaways
Trade-offs
An index trades write performance and storage for O(1) read performance on text.
Hierarchy
The hierarchy is Index → Shards → Replicas → Segments. Shard count cannot change.
Capacity Planning
Target 20-50GB per shard. Plan capacity before you create the index.