Chapter 3.7: Indexing & Infrastructure
The Write & Query Path
Indexing is not instantaneous, and searching is not magic. This chapter traces the lifecycle of a request millisecond by millisecond, from the client call to the disk commit, revealing where latency hides.
Architecture Overview
Understanding the "Write Path" (how data gets in) and the "Query Path" (how data gets out) is critical for performance tuning. Elasticsearch optimizes for read speed and data safety, sometimes at the expense of write latency or immediate visibility. We will walk through exactly what happens when you index a document and when you search for it.
Key Terminology
Infrastructure
- Coordinator: The node that receives the client request. It routes data to shards (write) or scatters queries (read).
- Shard: A self-contained Lucene index. Can be a Primary (writes) or Replica (reads/failover).
- Segment: An immutable file containing a portion of the inverted index. Searchable.
Operations
- Translog: "Transaction Log". A sequential append-only file on disk. Guarantees DURABILITY.
- Refresh: Process of making data SEARCHABLE. Writes buffer to a new segment.
- Flush: Process of making data PERSISTENT. Fsyncs segments and clears translog.
The Write Path (Indexing)
When you call PUT /products/_doc/1, the goal is durability (don't lose data) and eventual searchability. Data is durable almost instantly (Translog), but searchable only after a refresh (default 1s). This split allows high ingestion rates while ensuring safety.
Translog (Durability)
Writing a full Lucene segment is expensive. The Translog is a cheap, sequential file. We write here synchronously to ensure data isn't lost if the node crashes.
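The translog idea can be sketched in a few lines. This is a minimal toy model, not Elasticsearch's actual file format: an append-only file, an `fsync` after every acknowledged write, and a replay on restart.

```python
import json
import os
import tempfile

class Translog:
    """Toy append-only transaction log (illustrative sketch only)."""

    def __init__(self, path):
        # Append-only file: sequential writes are cheap compared to
        # building a full Lucene segment.
        self.f = open(path, "a+b")

    def append(self, op):
        line = (json.dumps(op) + "\n").encode()
        self.f.write(line)
        self.f.flush()
        os.fsync(self.f.fileno())  # durability: survives a process crash

    def replay(self):
        # On restart, re-read every operation that was acknowledged.
        self.f.seek(0)
        return [json.loads(line) for line in self.f]

path = os.path.join(tempfile.mkdtemp(), "translog.log")
log = Translog(path)
log.append({"op": "index", "_id": "1", "doc": {"title": "milk"}})
recovered = log.replay()
```

The `fsync` call is the expensive part: it is what turns "the OS has the bytes" into "the bytes are on disk", and it is why the Translog dominates simple write latency.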
Refresh (Searchability)
Every 1 second (default), the Memory Buffer is written to a new, read-only Segment. The buffer is cleared. ONLY NOW is the document visible to search.
Flush (Persistence)
When the Translog grows too large (default 512 MB) or after ~30 minutes, a Flush occurs. It commits segments to disk, clears the Translog, and ensures a fast, clean restart.
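The buffer/refresh/flush split can be modeled in a toy shard. This is an illustrative sketch only; real segments are Lucene files, not Python lists, and real refresh/flush are triggered by timers and size thresholds rather than explicit calls.

```python
class Shard:
    """Toy model of the indexing lifecycle (illustrative only)."""

    def __init__(self):
        self.buffer = []      # in-memory, NOT searchable
        self.segments = []    # immutable, searchable
        self.translog = []    # durable record of un-flushed ops
        self.committed = 0    # segments fsynced to disk

    def index(self, doc):
        self.translog.append(doc)   # durability first
        self.buffer.append(doc)

    def refresh(self):
        # Make buffered docs SEARCHABLE: cut a new immutable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Make segments PERSISTENT: "fsync" them, then clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self, pred):
        # Search only sees segments — never the in-memory buffer.
        return [d for seg in self.segments for d in seg if pred(d)]

s = Shard()
s.index({"id": 1, "name": "milk"})
before = s.search(lambda d: d["name"] == "milk")   # indexed, not yet visible
s.refresh()
after = s.search(lambda d: d["name"] == "milk")    # visible after refresh
s.flush()
translog_empty = (s.translog == [])
```

Note the order: `before` is empty even though the document was acknowledged, which is exactly the "durable almost instantly, searchable after refresh" behavior described above.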
What's in a Segment?
A segment isn't just one file; it's a collection of highly optimized structures rooted in the Inverted Index.
| Structure | File Ext | Purpose |
|---|---|---|
| Inverted Index | .tip, .tim | Maps terms → doc IDs. Used for full-text search. |
| Doc Values | .dvd, .dvm | Columnar storage. Used for sorting & aggregations. |
| Stored Fields | .fdt, .fdx | Original JSON (_source). Retrieved at end of search. |
| BKD Tree | .dim, .dii | Numeric & spatial points. Used for numeric range filters. |
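The two workhorse structures in the table — the inverted index and doc values — answer opposite questions, which a toy example makes concrete. This is a simplified sketch; the real files are compressed on-disk formats, not Python dicts and lists.

```python
docs = [
    {"id": 0, "title": "red shoes", "price": 40},
    {"id": 1, "title": "red hat",   "price": 15},
    {"id": 2, "title": "blue hat",  "price": 25},
]

# Inverted index (.tip/.tim): term -> sorted doc IDs.
# Answers "which docs contain this term?" — full-text search.
inverted = {}
for d in docs:
    for term in d["title"].split():
        inverted.setdefault(term, []).append(d["id"])

# Doc values (.dvd/.dvm): columnar doc ID -> value.
# Answers "what is this field's value for doc N?" — sorting/aggregations.
price_column = [d["price"] for d in docs]

hits = inverted["red"]                               # term lookup
cheapest = min(hits, key=lambda i: price_column[i])  # sort hits by a column
```

Trying to sort via the inverted index would mean scanning every term; the columnar layout makes "price of doc N" a single array read, which is why aggregations depend on doc values.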
Failure & Recovery
What happens if a node crashes mid-write?
Crash BEFORE Translog fsync: The client receives a 500 error or timeout. Data is NOT on disk. The write is lost, but the client knows it failed and can retry.
Crash AFTER Translog fsync: Client got 200 OK, but data wasn't in a segment yet. On restart, Elasticsearch replays the Translog, rebuilding the memory buffer and segments. No data is lost.
The Query Path (Searching)
Distributed search solves the Scatter-Gather problem. We don't want to send huge JSON documents across the network until we know which ones matched. Thus, search is split into two distinct phases.
Query Phase
1. Coordinator sends the query to all relevant shards.
2. Each shard scores docs locally (Inverted Index).
3. Each shard returns only doc IDs + scores.
4. Coordinator sorts & merges to find the Global Top 10.
Fetch Phase
1. Coordinator now knows exactly which docs won.
2. It requests the full _source from only the specific shards that hold them.
3. Those shards read from Stored Fields (.fdt).
4. Coordinator assembles the JSON response.
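The two phases above can be sketched with toy in-memory shards. This is an assumption-laden simplification — real coordination happens over the transport layer and scoring uses BM25 — but the shape of the data flow is the same: IDs and scores first, `_source` only for the winners.

```python
import heapq

# Toy shards: local doc ID -> relevance score.
shards = [
    {"d1": 3.2, "d2": 1.1},   # shard 0
    {"d3": 2.7, "d4": 0.4},   # shard 1
]
# Stored fields (_source), kept on the shards until the fetch phase.
stored = {"d1": {"name": "a"}, "d2": {"name": "b"},
          "d3": {"name": "c"}, "d4": {"name": "d"}}

def query_phase(top_n):
    # Each shard contributes only (score, shard_id, doc_id) — no _source yet,
    # so the network carries a few bytes per candidate, not whole documents.
    candidates = [(score, i, doc)
                  for i, shard in enumerate(shards)
                  for doc, score in shard.items()]
    # The coordinator merges local results into the global top N.
    return heapq.nlargest(top_n, candidates)

def fetch_phase(winners):
    # Only the global winners get their full _source pulled.
    return [stored[doc] for _, _, doc in winners]

top = query_phase(2)
results = fetch_phase(top)
```

Notice that `d2` and `d4` never leave their shards as full documents — that deferred fetch is the whole point of the two-phase design.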
Why Filters are Faster
Filters (e.g., term, range) run before scoring. They discard non-matching documents cheaply using bitsets. The expensive scoring algorithm (BM25) then only runs on the small subset of remaining docs.
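A minimal sketch of filter-then-score, using plain integers as bitsets (Lucene uses dedicated bitset structures, and the scoring step is BM25 rather than a list comprehension):

```python
docs = [
    {"status": "active",   "text": "red shoes"},
    {"status": "archived", "text": "red shoes"},
    {"status": "active",   "text": "blue hat"},
]

def bitset(pred):
    # One bit per document: bit i is set iff doc i matches the filter.
    bits = 0
    for i, d in enumerate(docs):
        if pred(d):
            bits |= 1 << i
    return bits

active = bitset(lambda d: d["status"] == "active")   # cacheable
has_red = bitset(lambda d: "red" in d["text"])       # cacheable
surviving = active & has_red                         # one bitwise AND

# Expensive scoring would now run only on the surviving docs.
to_score = [i for i in range(len(docs)) if surviving >> i & 1]
```

Because filters are yes/no, their bitsets are reusable across queries — that is what the node query cache stores, and why the second run of a filtered query is so much cheaper.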
Typical Latencies
| Phase | Latency Goal | Bottleneck |
|---|---|---|
| Simple Write | 1 - 5 ms | Disk I/O (Translog fsync) |
| Heavy Bulk Write | 50 - 200 ms | CPU (Tokenization) + Memory Buffer |
| Query (Cached) | < 10 ms | Network RTT |
| Query (Complex) | 50 - 500 ms | CPU (Scoring millions of docs) |
| Fetch (Big Docs) | 20 - 100 ms | Network Bandwidth (JSON size) |
Caching Layers
Why is the second run of a query often 10x faster? It's not magic; it's caching at three distinct levels. Optimizing cache usage is the cheapest way to improve performance.
The "Warm Up" Effect
Cold Run (100ms): Disk I/O to read segments + CPU to score docs + CPU to parse JSON.
Warm Run (5ms): RAM access (OS Cache) + Bitset lookup (Node Cache) + Pre-computed Result (Shard Cache).
Node Query Cache
Caches the results of filter clauses (like `status:active`) as bitsets (001010).
- Why it helps: Skipping 1M docs becomes a fast bitwise AND operation.
- Scope: Shared by all shards on a node.
- Policy: LRU. Only caches on segments holding more than 10k docs.
Shard Request Cache
Caches the full JSON response for search/aggregations (e.g., "Top 5 Categories").
- Why it helps: Returns instant results for dashboards/reports. Zero CPU usage.
- Invalidation: Invalidated instantly on Segment Refresh.
OS Page Cache
The kernel automatically keeps hot file blocks in RAM.
- Why it helps: Turns expensive random disk I/O (~10 ms) into RAM access (~100 ns).
- Tip: Never allocate more than 50% of RAM to the Java Heap. Leave the rest for this!
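In practice the heap tip translates into two lines in `jvm.options`. The sizes below are an illustrative assumption for a 64 GB machine, not a universal recommendation:

```
# jvm.options — on a 64 GB node, cap the heap well below half of RAM so
# the OS page cache can hold hot segment files (and stay under ~32 GB to
# keep compressed object pointers). Set -Xms and -Xmx equal to avoid
# heap resizing pauses.
-Xms26g
-Xmx26g
```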
Monitoring & Debugging
Inspect Write State (CLI)
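A couple of requests (shown in Console/Dev Tools syntax, assuming a hypothetical `products` index) expose the write-path state described above:

```
# Per-segment view: count, size, and whether each segment is
# committed (flushed) and/or searchable (refreshed).
GET _cat/segments/products?v

# Translog size and refresh/flush counts and timings for one index.
GET products/_stats/translog,refresh,flush
```

A translog that keeps growing between flushes, or a segment count that climbs rapidly, usually points back at the refresh interval and bulk sizing discussed earlier.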
Debug Queries (API)
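For the query path, the Profile API breaks a search down into its per-shard phases and per-clause timings. A hedged example, again against a hypothetical `products` index:

```
GET products/_search
{
  "profile": true,
  "query": {
    "match": { "title": "shoes" }
  }
}
```

The response includes a `profile` section showing, per shard, how long each query component spent in rewriting, building scorers, and scoring — handy for telling a slow filter apart from slow scoring.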
Key Takeaways
Tune Refresh Interval
The default 1s is too aggressive for heavy writes. Setting it to 30s or more produces far fewer (and larger) segments, dramatically reducing refresh and merge overhead.
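The change is a single dynamic index setting (hypothetical `products` index; `-1` disables refresh entirely during bulk loads):

```
PUT products/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```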
Monitor OS Page Cache
Elasticsearch relies on the OS for caching. If you starve the FS cache (by giving Heap > 50%), latency tanks.
Filter First
Filter clauses are cached (bitsets) and fast. Use them to narrow down the document set before expensive scoring runs.
Fetch Less
Retrieving large _source fields is a network bottleneck. Use source filtering to get only what you display.
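Source filtering is expressed directly in the search body. A sketch against a hypothetical `products` index, returning only the two fields a listing page actually displays:

```
GET products/_search
{
  "_source": ["name", "price"],
  "query": {
    "match": { "title": "shoes" }
  }
}
```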