Chapter 3.7: Indexing & Infrastructure
The Write & Query Path
Indexing is not instantaneous, and searching is not magic. This chapter traces the lifecycle of a request millisecond by millisecond, from the client call to the disk commit, revealing where latency hides.
Architecture Overview
Understanding the "Write Path" (how data gets in) and the "Query Path" (how data gets out) is critical for performance tuning. Elasticsearch optimizes for read speed and data safety, sometimes at the expense of write latency or immediate visibility. We will walk through exactly what happens when you index a document and when you search for it.
Key Terminology
Infrastructure
- Coordinator: The node that receives the client request. It routes data to shards (write) or scatters queries (read).
- Shard: A self-contained Lucene index. Can be a Primary (writes) or Replica (reads/failover).
- Segment: An immutable file containing a portion of the inverted index. Searchable.
Operations
- Translog: "Transaction Log". A sequential append-only file on disk. Guarantees DURABILITY.
- Refresh: Process of making data SEARCHABLE. Writes buffer to a new segment.
- Flush: Process of making data PERSISTENT. Fsyncs segments and clears translog.
The Write Path (Indexing)
When you call PUT /products/_doc/1, the goal is durability (don't lose data) and eventual searchability. Data is durable almost instantly (Translog), but searchable only after a refresh (default 1s). This split allows high ingestion rates while ensuring safety.
Translog (Durability)
Writing a full Lucene segment is expensive. The Translog is a cheap, sequential file. We write here synchronously to ensure data isn't lost if the node crashes.
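The translog idea can be sketched in a few lines. This is a minimal toy model, not Elasticsearch's actual file format: an append-only file, an `fsync` after every acknowledged write, and a replay on restart.

```python
import json
import os
import tempfile

class Translog:
    """Toy append-only transaction log (illustrative sketch only)."""

    def __init__(self, path):
        # Append-only file: sequential writes are cheap compared to
        # building a full Lucene segment.
        self.f = open(path, "a+b")

    def append(self, op):
        line = (json.dumps(op) + "\n").encode()
        self.f.write(line)
        self.f.flush()
        os.fsync(self.f.fileno())  # durability: survives a process crash

    def replay(self):
        # On restart, re-read every operation that was acknowledged.
        self.f.seek(0)
        return [json.loads(line) for line in self.f]

path = os.path.join(tempfile.mkdtemp(), "translog.log")
log = Translog(path)
log.append({"op": "index", "_id": "1", "doc": {"title": "milk"}})
recovered = log.replay()
```

The `fsync` call is the expensive part: it is what turns "the OS has the bytes" into "the bytes are on disk", and it is why the Translog dominates simple write latency.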
Refresh (Searchability)
Every 1 second (default), the Memory Buffer is written to a new, read-only Segment. The buffer is cleared. ONLY NOW is the document visible to search.
Flush (Persistence)
When the Translog grows too large (default 512 MB) or after ~30 minutes, a Flush occurs. It commits segments to disk, clears the Translog, and ensures a fast, clean restart.
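The buffer/refresh/flush split can be modeled in a toy shard. This is an illustrative sketch only; real segments are Lucene files, not Python lists, and real refresh/flush are triggered by timers and size thresholds rather than explicit calls.

```python
class Shard:
    """Toy model of the indexing lifecycle (illustrative only)."""

    def __init__(self):
        self.buffer = []      # in-memory, NOT searchable
        self.segments = []    # immutable, searchable
        self.translog = []    # durable record of un-flushed ops
        self.committed = 0    # segments fsynced to disk

    def index(self, doc):
        self.translog.append(doc)   # durability first
        self.buffer.append(doc)

    def refresh(self):
        # Make buffered docs SEARCHABLE: cut a new immutable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Make segments PERSISTENT: "fsync" them, then clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self, pred):
        # Search only sees segments — never the in-memory buffer.
        return [d for seg in self.segments for d in seg if pred(d)]

s = Shard()
s.index({"id": 1, "name": "milk"})
before = s.search(lambda d: d["name"] == "milk")   # indexed, not yet visible
s.refresh()
after = s.search(lambda d: d["name"] == "milk")    # visible after refresh
s.flush()
translog_empty = (s.translog == [])
```

Note the order: `before` is empty even though the document was acknowledged, which is exactly the "durable almost instantly, searchable after refresh" behavior described above.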
What's in a Segment?
A segment isn't just one file; it's a collection of highly optimized structures rooted in the Inverted Index.
| Structure | File Ext | Purpose |
|---|---|---|
| Inverted Index | .tip, .tim | Maps terms → doc IDs. Used for full-text search. |
| Doc Values | .dvd, .dvm | Columnar storage. Used for sorting & aggregations. |
| Stored Fields | .fdt, .fdx | Original JSON (_source). Retrieved at end of search. |
| BKD Tree | .dim, .dii | Numeric & spatial points. Used for numeric range filters. |
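The two workhorse structures in the table — the inverted index and doc values — answer opposite questions, which a toy example makes concrete. This is a simplified sketch; the real files are compressed on-disk formats, not Python dicts and lists.

```python
docs = [
    {"id": 0, "title": "red shoes", "price": 40},
    {"id": 1, "title": "red hat",   "price": 15},
    {"id": 2, "title": "blue hat",  "price": 25},
]

# Inverted index (.tip/.tim): term -> sorted doc IDs.
# Answers "which docs contain this term?" — full-text search.
inverted = {}
for d in docs:
    for term in d["title"].split():
        inverted.setdefault(term, []).append(d["id"])

# Doc values (.dvd/.dvm): columnar doc ID -> value.
# Answers "what is this field's value for doc N?" — sorting/aggregations.
price_column = [d["price"] for d in docs]

hits = inverted["red"]                               # term lookup
cheapest = min(hits, key=lambda i: price_column[i])  # sort hits by a column
```

Trying to sort via the inverted index would mean scanning every term; the columnar layout makes "price of doc N" a single array read, which is why aggregations depend on doc values.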
Failure & Recovery
What happens if a node crashes mid-write?
Crash BEFORE Translog fsync: The client receives a 500 error or timeout. Data is NOT on disk. The write is lost, but the client knows it failed and can retry.
Crash AFTER Translog fsync: Client got 200 OK, but data wasn't in a segment yet. On restart, Elasticsearch replays the Translog, rebuilding the memory buffer and segments. No data is lost.
The Query Path (Searching)
Distributed search solves the Scatter-Gather problem. We don't want to send huge JSON documents across the network until we know which ones matched. Thus, search is split into two distinct phases.
Query Phase
1. Coordinator sends the query to all relevant shards.
2. Each shard scores docs locally (Inverted Index).
3. Each shard returns only doc IDs + scores.
4. Coordinator sorts & merges to find the Global Top 10.
Fetch Phase
1. Coordinator now knows exactly which docs won.
2. It requests the full _source from only the specific shards that hold them.
3. Those shards read from Stored Fields (.fdt).
4. Coordinator assembles the JSON response.
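The two phases above can be sketched with toy in-memory shards. This is an assumption-laden simplification — real coordination happens over the transport layer and scoring uses BM25 — but the shape of the data flow is the same: IDs and scores first, `_source` only for the winners.

```python
import heapq

# Toy shards: local doc ID -> relevance score.
shards = [
    {"d1": 3.2, "d2": 1.1},   # shard 0
    {"d3": 2.7, "d4": 0.4},   # shard 1
]
# Stored fields (_source), kept on the shards until the fetch phase.
stored = {"d1": {"name": "a"}, "d2": {"name": "b"},
          "d3": {"name": "c"}, "d4": {"name": "d"}}

def query_phase(top_n):
    # Each shard contributes only (score, shard_id, doc_id) — no _source yet,
    # so the network carries a few bytes per candidate, not whole documents.
    candidates = [(score, i, doc)
                  for i, shard in enumerate(shards)
                  for doc, score in shard.items()]
    # The coordinator merges local results into the global top N.
    return heapq.nlargest(top_n, candidates)

def fetch_phase(winners):
    # Only the global winners get their full _source pulled.
    return [stored[doc] for _, _, doc in winners]

top = query_phase(2)
results = fetch_phase(top)
```

Notice that `d2` and `d4` never leave their shards as full documents — that deferred fetch is the whole point of the two-phase design.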
Why Filters are Faster
Filters (e.g., term, range) run before scoring. They discard non-matching documents cheaply using bitsets. The expensive scoring algorithm (BM25) then only runs on the small subset of remaining docs.
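A minimal sketch of filter-then-score, using plain integers as bitsets (Lucene uses dedicated bitset structures, and the scoring step is BM25 rather than a list comprehension):

```python
docs = [
    {"status": "active",   "text": "red shoes"},
    {"status": "archived", "text": "red shoes"},
    {"status": "active",   "text": "blue hat"},
]

def bitset(pred):
    # One bit per document: bit i is set iff doc i matches the filter.
    bits = 0
    for i, d in enumerate(docs):
        if pred(d):
            bits |= 1 << i
    return bits

active = bitset(lambda d: d["status"] == "active")   # cacheable
has_red = bitset(lambda d: "red" in d["text"])       # cacheable
surviving = active & has_red                         # one bitwise AND

# Expensive scoring would now run only on the surviving docs.
to_score = [i for i in range(len(docs)) if surviving >> i & 1]
```

Because filters are yes/no, their bitsets are reusable across queries — that is what the node query cache stores, and why the second run of a filtered query is so much cheaper.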
Typical Latencies
| Phase | Latency Goal | Bottleneck |
|---|---|---|
| Simple Write | 1 - 5 ms | Disk I/O (Translog fsync) |
| Heavy Bulk Write | 50 - 200 ms | CPU (Tokenization) + Memory Buffer |
| Query (Cached) | < 10 ms | Network RTT |
| Query (Complex) | 50 - 500 ms | CPU (Scoring millions of docs) |
| Fetch (Big Docs) | 20 - 100 ms | Network Bandwidth (JSON size) |
Caching Layers
Why is the second run of a query often 10x faster? It's not magic; it's caching at three distinct levels. Optimizing cache usage is the cheapest way to improve performance.
The "Warm Up" Effect
Cold Run (100ms): Disk I/O to read segments + CPU to score docs + CPU to parse JSON.
Warm Run (5ms): RAM access (OS Cache) + Bitset lookup (Node Cache) + Pre-computed Result (Shard Cache).
Node Query Cache
Caches the results of filter clauses (like `status:active`) as bitsets (001010).
- Why it helps: Skipping 1M docs becomes a fast bitwise AND operation.
- Scope: Shared by all shards on a node.
- Policy: LRU. Only caches on segments holding more than 10k docs.
Shard Request Cache
Caches the full JSON response for search/aggregations (e.g., "Top 5 Categories").
- Why it helps: Returns instant results for dashboards/reports. Zero CPU usage.
- Invalidation: Invalidated instantly on Segment Refresh.
OS Page Cache
The kernel automatically keeps hot file blocks in RAM.
- Why it helps: Turns expensive random disk I/O (~10 ms) into RAM access (~100 ns).
- Tip: Never allocate more than 50% of RAM to the Java Heap. Leave the rest for this!
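In practice the heap tip translates into two lines in `jvm.options`. The sizes below are an illustrative assumption for a 64 GB machine, not a universal recommendation:

```
# jvm.options — on a 64 GB node, cap the heap well below half of RAM so
# the OS page cache can hold hot segment files (and stay under ~32 GB to
# keep compressed object pointers). Set -Xms and -Xmx equal to avoid
# heap resizing pauses.
-Xms26g
-Xmx26g
```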
Monitoring & Debugging
Inspect Write State (CLI)
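A couple of requests (shown in Console/Dev Tools syntax, assuming a hypothetical `products` index) expose the write-path state described above:

```
# Per-segment view: count, size, and whether each segment is
# committed (flushed) and/or searchable (refreshed).
GET _cat/segments/products?v

# Translog size and refresh/flush counts and timings for one index.
GET products/_stats/translog,refresh,flush
```

A translog that keeps growing between flushes, or a segment count that climbs rapidly, usually points back at the refresh interval and bulk sizing discussed earlier.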
Debug Queries (API)
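For the query path, the Profile API breaks a search down into its per-shard phases and per-clause timings. A hedged example, again against a hypothetical `products` index:

```
GET products/_search
{
  "profile": true,
  "query": {
    "match": { "title": "shoes" }
  }
}
```

The response includes a `profile` section showing, per shard, how long each query component spent in rewriting, building scorers, and scoring — handy for telling a slow filter apart from slow scoring.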
Key Takeaways
Tune Refresh Interval
The default 1s is too aggressive for heavy writes. Setting it to 30s or more produces far fewer (and larger) segments, dramatically reducing refresh and merge overhead.
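The change is a single dynamic index setting (hypothetical `products` index; `-1` disables refresh entirely during bulk loads):

```
PUT products/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```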
Monitor OS Page Cache
Elasticsearch relies on the OS for caching. If you starve the FS cache (by giving Heap > 50%), latency tanks.
Filter First
Filter clauses are cached (bitsets) and fast. Use them to narrow down the document set before expensive scoring runs.
Fetch Less
Retrieving large _source fields is a network bottleneck. Use source filtering to get only what you display.
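Source filtering is expressed directly in the search body. A sketch against a hypothetical `products` index, returning only the two fields a listing page actually displays:

```
GET products/_search
{
  "_source": ["name", "price"],
  "query": {
    "match": { "title": "shoes" }
  }
}
```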