Systems Atlas

Chapter 3.5: Indexing & Infrastructure

Segments & Immutability

The golden rule of Lucene: once written, never modified. This design enables lock-free reads, perfect caching, and crash recovery.


What is a Segment?

A segment is an immutable, self-contained piece of the index. Unlike a traditional database index, which is typically a single mutable structure, Lucene splits its index into many segments that are created over time and periodically merged together. Every search query must check all segments and merge their results.

Lucene Segment

  • Immutable: Never modified after creation
  • Self-contained: Has its own term dictionary, postings, stored fields
  • Searchable independently: Each segment can answer queries
  • Created on refresh: New segment every 1 second (default)

Traditional DB Index

  • Mutable: Updated in-place (B-tree pages)
  • Single structure: One index file with locks
  • Page-level writes: Complex crash recovery
  • Immediate visibility: New writes are instantly queryable, but at the cost of locking
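The "check every segment, merge the results" query model can be sketched in a few lines of Python. The `Segment` class and its postings dict are purely illustrative, not Lucene's on-disk format:

```python
# Minimal sketch: each immutable segment answers a query independently,
# and the per-segment hits are merged into one result set.

class Segment:
    def __init__(self, postings):
        self.postings = postings  # term -> list of doc IDs in this segment

    def search(self, term):
        return self.postings.get(term, [])

def search_index(segments, term):
    # Each segment is probed on its own; hits are then concatenated.
    hits = []
    for seg in segments:
        hits.extend(seg.search(term))
    return hits

segments = [
    Segment({"lucene": [0, 3]}),
    Segment({"lucene": [7], "index": [5]}),
]
print(search_index(segments, "lucene"))  # [0, 3, 7]
```

Because each segment is self-contained, the per-segment searches need no coordination and can run lock-free, which is exactly the property the immutability rule buys.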

Index Directory Structure

/data/indices/my_index/
├── segments_5              ← Commit point (which segments are live)
├── _0.si                   ← Segment 0 info (metadata)
├── _0.cfs                  ← Compound file (all data)
├── _0.cfe                  ← Compound entries
├── _1.si                   ← Segment 1 info
├── _1_Lucene90_0.dvd       ← Doc values
├── _1_Lucene90_0.dvm       ← Doc values metadata
├── _1.fdx                  ← Stored fields index
├── _1.fdt                  ← Stored fields data
├── _1.liv                  ← Live docs bitmap (deleted = 0)
└── write.lock              ← Prevents concurrent writers

Index Approaches Compared

Different storage systems handle indexing differently. Lucene's segment-based approach is optimized for read-heavy, write-once workloads typical in search. Here's how it compares to other approaches.

| System | Index Type | Mutability | Best For |
|---|---|---|---|
| Lucene/ES | Inverted index + segments | Immutable | Full-text search, analytics |
| RocksDB | LSM-tree + SST files | Immutable (SSTs) | Key-value, high write throughput |
| PostgreSQL | B-tree pages | Mutable (in-place) | OLTP, transactions |
| Cassandra SAI | Per-SSTable index | Immutable | Secondary indexes at scale |

Why Immutability?

Most databases update data "in-place", overwriting the old record with the new one. While intuitive, this approach has real pitfalls: concurrent writers require complex locking, and a crash during a write can leave the database corrupted. Lucene takes a radically different approach: segments are immutable. Once a file is written to disk, it is never changed. "Updating" a document means writing the new version to a new segment and marking the old one as deleted.

Mutable (Database)

  Row 1: John paid $100
  Row 2: Jane paid $50
  Row 2: Jane paid $200 (overwritten)

  • Requires locking during writes
  • Risk of corruption on crash
  • Cache invalidation complexity

Immutable (Lucene Segments)

  Segment 1: Jane paid $50
  Segment 2: Jane paid $200 (new!)
  Original preserved, new segment added

  • No locks needed for reads
  • Perfect cache utilization
  • Crash recovery via translog
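The update-by-append pattern above can be modeled directly. This is a toy sketch, not Lucene's real API: the `live` dict stands in for the `.liv` live-docs bitmap, and segments are plain dicts:

```python
# Sketch of "update = write new copy + tombstone old copy".
# Old segments are never mutated; only their live-docs flags flip.

class Segment:
    def __init__(self, docs):
        self.docs = docs                     # doc_id -> content, never edited
        self.live = {d: True for d in docs}  # .liv-style live-docs flags

def update(segments, doc_id, new_content):
    for seg in segments:                     # mark every old copy as dead
        if doc_id in seg.docs:
            seg.live[doc_id] = False
    segments.append(Segment({doc_id: new_content}))  # fresh immutable segment

segments = [Segment({1: "Jane paid $50"})]
update(segments, 1, "Jane paid $200")

# A reader sees only the live copies:
visible = {d: s.docs[d] for s in segments for d in s.docs if s.live[d]}
print(visible)  # {1: 'Jane paid $200'}
```

Note that the old segment still holds "Jane paid $50" on disk; only a merge will actually reclaim that space, which is the tombstone tax discussed below.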

The Segment Lifecycle

How does a raw JSON document become a searchable immutable segment? It's a journey through memory and disk. First, documents land in an In-Memory Buffer (RAM). Every second (by default), a process called Refresh turns this buffer into a new small segment on disk. At this moment, and only at this moment, the document becomes visible to search. This "refresh" mechanism is why Elasticsearch is called "Near Real-Time" (NRT).

1. In-Memory Buffer
   Documents collected in RAM after the translog write
   NOT SEARCHABLE

2. Refresh → New Segment
   Buffer flushed to an immutable segment file (every 1s)
   SEARCHABLE ✓

3. Segments Accumulate
   Small segments pile up on disk (e.g. 50MB, 30MB, 5MB, 1MB)
   ⚠ Each query must check ALL segments

4. Merge (Background)
   Small segments combined, deleted docs removed → 86MB (merged)
   ✓ Disk space reclaimed
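The four stages above can be walked through in a toy simulation. The sizes (in MB) mirror the example in the diagram (50 + 30 + 5 + 1 = 86); the functions are illustrative stand-ins for the real refresh and merge machinery:

```python
# Toy lifecycle: buffer -> refresh -> small segments -> one merged segment.

buffer = []     # in-memory, NOT searchable
segments = []   # immutable on-disk segments, searchable

def refresh():
    # Flush the buffer into a brand-new segment; old segments stay untouched.
    if buffer:
        segments.append(list(buffer))
        buffer.clear()

def merge():
    # Background merge: combine all small segments into a single large one.
    merged = [doc for seg in segments for doc in seg]
    segments.clear()
    segments.append(merged)

for mb in (50, 30, 5, 1):   # four refresh cycles, one small segment each
    buffer.append(mb)
    refresh()

print(len(segments))        # 4 -> a query must check all four
merge()
print(sum(segments[0]))     # 86 -> one merged segment
```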

Segment Files & Formats

Each segment consists of multiple files, each storing a specific type of data. Understanding these files helps debug issues and optimize storage. The compound file format (.cfs) combines these into one file for efficiency on older filesystems.

| Extension | Name | Contents |
|---|---|---|
| segments_N | Commit Point | Lists all live segments in the index |
| .si | Segment Info | Metadata: doc count, codec, diagnostics |
| .cfs / .cfe | Compound File | Bundled segment data (reduces file handles) |
| .tim / .tip | Term Dictionary | All unique terms + pointers to postings |
| .doc / .pos | Postings | Doc IDs + positions for each term |
| .fdt / .fdx | Stored Fields | Original document content (for _source) |
| .liv | Live Docs | Bitmap of non-deleted documents |
| .dvd / .dvm | Doc Values | Columnar data for sorting/aggregations |

The Tombstone Tax (Deletes)

Because segments are immutable, deletes just mark documents as "dead" in a .liv file. The data remains on disk until merge reclaims it. This creates hidden costs that grow with your delete rate.

Deleted Docs Still Consume Resources

// Storage waste
Index: 100GB, 20% deleted → 20GB wasted

// Query overhead
0% deleted:  10ms
20% deleted: 12ms (+20%)
50% deleted: 18ms (+80%)

| Delete Ratio | Storage Overhead | Query Slowdown | Action Needed |
|---|---|---|---|
| 0-10% | Minimal | ~5% | Normal operation |
| 10-30% | Noticeable | 10-30% | Consider expunge_deletes |
| >30% | Severe | 50%+ | Force merge required |
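A rough cost model makes the tax concrete. The thresholds come from the table above; the formula itself is an illustration, not an official Elasticsearch calculation:

```python
# Sketch: estimate wasted storage and the recommended action
# for a given delete ratio (thresholds mirror the table above).

def tombstone_report(index_gb, delete_ratio):
    wasted_gb = index_gb * delete_ratio
    if delete_ratio <= 0.10:
        action = "normal operation"
    elif delete_ratio <= 0.30:
        action = "consider only_expunge_deletes"
    else:
        action = "force merge required"
    return wasted_gb, action

print(tombstone_report(100, 0.20))  # (20.0, 'consider only_expunge_deletes')
```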

The Segment Explosion Problem

If we create a new segment every second, after an hour we'll have 3,600 files. After a day, 86,400. Since every search query has to check every segment, performance degrades with the number of segments. To solve this, Lucene runs Background Merges, constantly picking small segments and merging them into larger ones (like combining tiles in the game 2048).

| Segment Count | Query Latency | Memory Overhead | File Descriptors |
|---|---|---|---|
| 1-5 | 10ms (baseline) | Low | ~50 |
| 10-50 | 15ms (+50%) | Moderate | ~500 |
| 100-500 | 40ms (+300%) | High | ~5,000 |
| 1,000+ | 100ms+ (degraded) | Critical | Risk of exhaustion |
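The 2048-style merging can be sketched with a tiny greedy loop: repeatedly merge the ten smallest segments into one. This is only in the spirit of Lucene's TieredMergePolicy; real merge selection also weighs size tiers, delete ratios, and the max segment cap:

```python
# Toy model of background merging: fold the smallest segments together
# until only a handful remain.
import heapq

def merge_until(sizes, max_segments=10, fanout=10):
    heapq.heapify(sizes)
    while len(sizes) > max_segments:
        batch = [heapq.heappop(sizes) for _ in range(fanout)]
        heapq.heappush(sizes, sum(batch))  # one merged segment replaces ten
    return sizes

# One 1 MB segment per second for an hour:
result = merge_until([1] * 3600)
print(len(result))  # 9 segments left instead of 3,600
```

Total bytes are conserved (the data is only rewritten, never lost), which is also why merging contributes to write amplification, covered later in this chapter.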

Symptoms of Too Many Segments

  • Query slowdown: Must check thousands of segments
  • File handle exhaustion: Each segment keeps files open; Elasticsearch requires a descriptor limit of at least 65,535
  • Memory pressure: Each segment has heap metadata
  • Merge storms: Catch-up merging consumes all I/O

Solutions

// Increase refresh interval
"refresh_interval": "30s" // vs 1s
// Disable during bulk load
"refresh_interval": "-1"
// Force merge (read-only indices only!)
POST /index/_forcemerge?max_num_segments=1

Shards & Segments

In a distributed Elasticsearch cluster, each shard is an independent Lucene index with its own segments. A query fans out to all shards, and each shard searches its own segments. This creates a multiplication effect: total_segments = shards × segments_per_shard.

Shard 0 (Node A): segments _0, _1, _2 → 3 segments
Shard 1 (Node B): segments _0, _1 → 2 segments
Shard 2 (Node C): segments _0, _1, _2, _3 → 4 segments

Query fan-out: 3 shards × ~3 segments each = 9 segment searches
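The fan-out arithmetic from the layout above is simply a sum across shards; the shard names here are placeholders:

```python
# Total per-query work = segment searches summed over every shard.
segments_per_shard = {"shard0": 3, "shard1": 2, "shard2": 4}

total_segment_searches = sum(segments_per_shard.values())
print(total_segment_searches)  # 9 segment searches for a single query
```

This is why segment hygiene matters more as shard counts grow: the multiplication effect turns a modest per-shard segment count into a large total.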

Impact on Query Performance

  • More shards = more network overhead (coordinator gathers all results)
  • Segment count per shard adds to per-node latency
  • Hot spots occur when one shard has many more segments than others
  • Force-merging across shards helps maintain uniform performance

Practical Tuning Tips

Segment management requires balancing search latency, indexing throughput, and resource usage. Here are production-tested guidelines for different workloads.

Refresh Interval Tuning

  • 1s (default): Real-time search, many small segments
  • 30s: Good for logs/metrics, fewer segments
  • 60s+: Near-batch workloads
  • -1: Disabled; use only during bulk indexing

Merge Policy Types

  • TieredMergePolicy: Default, balances size tiers
  • LogByteSizeMergePolicy: Older, less adaptive
  • max_merged_segment_size: Cap at 5GB (default)
  • segments_per_tier: 10 (default), lower = more merging

When to Force Merge

✓ Good Use Cases:

  • After bulk indexing is complete
  • Read-only/archived indices
  • Before taking a snapshot

✗ Avoid:

  • On actively written indices
  • During peak query times
  • On indices with ILM rollover

Real-World Use Cases

Understanding segment behavior in production helps you anticipate issues before they become problems. Here are common scenarios and their segment patterns.

🔥 High-Throughput Ingestion (10K docs/sec)

# Day 1 segment stats
Refresh: 1s → 86,400 segments created/day
After merge: ~50-100 segments (tiered policy)
⚠ If merge can't keep up:
→ Segment count grows to 1000+ → query degradation

Fix: Set refresh_interval: "30s" during ingestion, force-merge after.
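The back-of-envelope math behind that fix: segments created per day at a given refresh interval, assuming writes arrive continuously so every refresh emits a segment (the worst case):

```python
# Segments created per day as a function of refresh interval.
SECONDS_PER_DAY = 86_400

def segments_per_day(refresh_interval_s):
    return SECONDS_PER_DAY // refresh_interval_s

print(segments_per_day(1))   # 86400 -> merge must work constantly
print(segments_per_day(30))  # 2880  -> 30x less merge pressure
```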

📊 Logs Index (Time-Series)

# Typical ILM pattern
Hot phase: refresh=1s, many segments (search speed traded for freshness)
Warm phase: refresh=30s, force-merge to 1 segment
Cold phase: Read-only, 1 segment, searchable snapshot

Key: Force-merge on rollover to warm tier.

📈 Typical Production Stats

Avg segments/shard: 15
Avg segment size: 2.1GB
Deleted docs ratio: 8%

Code & API Snippets

Here are the essential Elasticsearch APIs and settings for monitoring and managing segments.

Check Segment Stats (GET)

# Per-index segment info
GET /my_index/_segments
# Human-readable summary
GET /_cat/segments/my_index?v&h=index,shard,segment,docs.count,size
# Index stats including segment count
GET /my_index/_stats/segments
Index Settings (PUT)

PUT /my_index/_settings
{
  "index.refresh_interval": "30s",
  "index.merge.policy.max_merged_segment": "5gb",
  "index.merge.policy.segments_per_tier": 10,
  "index.merge.policy.deletes_pct_allowed": 20
}
Force Merge Operations (POST)

# Merge to single segment (read-only index only!)
POST /my_index/_forcemerge?max_num_segments=1
# Just expunge deletes (safer)
POST /my_index/_forcemerge?only_expunge_deletes=true
# Background merge (non-blocking)
POST /my_index/_forcemerge?max_num_segments=5&wait_for_completion=false

Write Amplification: The Hidden Cost

Immutability has a hidden cost: write amplification. Because segments are never modified in place, the same data gets rewritten multiple times as it moves through the system: first to the translog, then to a segment, then through multiple merge passes.

You write 1 document, but it's written 5-7 times:
1. Translog (fsync)
2. In-memory buffer
3. Refresh → new segment
4. Merge level 1
5. Merge level 2
6. Merge level 3
Capacity Planning
Ingestion: 10 MB/sec (logical)
Actual I/O: 50-70 MB/sec
Plan for 7x write amplification. Use SSDs. Leave 50% headroom.
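The capacity-planning rule of thumb above reduces to simple arithmetic; the 7x factor and 50% headroom are the heuristics stated in this section, not universal constants:

```python
# Disk bandwidth to provision for a given logical ingest rate.
def required_disk_bandwidth(logical_mb_s, amplification=7, headroom=0.5):
    physical_mb_s = logical_mb_s * amplification  # translog + refresh + merges
    return physical_mb_s * (1 + headroom)         # leave 50% headroom

print(required_disk_bandwidth(10))  # 105.0 MB/sec for 10 MB/sec of ingest
```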

Key Takeaways

01

Segments are Immutable

Once written, never modified. This enables lock-free reads, perfect caching, and robust crash recovery.

02

Refresh = Visibility

Documents become searchable only after a refresh writes them to a segment (default: every 1s). Frequent refreshes are costly for indexing throughput.

03

The Merge Tax

Merging reclaims space from deleted docs but consumes I/O. Never force-merge actively written indices.

04

Write Amplification

Due to immutable segments and merging, a single document write results in 5-7x physical disk I/O.