Systems Atlas

Chapter 3.5: Indexing & Infrastructure

Segments & Immutability

The golden rule of Lucene: once written, never modified. This design enables lock-free reads, perfect caching, and crash recovery.


What is a Segment?

A segment is an immutable, self-contained piece of the index. Unlike a traditional database index, which is typically a single mutable structure, Lucene splits its index into many segments that are created over time and periodically merged together. Every search query must check all segments and merge their results.

Lucene Segment

  • Immutable: Never modified after creation
  • Self-contained: Has its own term dictionary, postings, stored fields
  • Searchable independently: Each segment can answer queries
  • Created on refresh: New segment every 1 second (default)

Traditional DB Index

  • Mutable: Updated in-place (B-tree pages)
  • Single structure: One index file with locks
  • Page-level writes: Complex crash recovery
  • Immediate visibility: New writes are instantly queryable, but at the cost of locking
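The "check every segment, merge the results" query model can be sketched in a few lines of Python. The `Segment` class and its postings dict are purely illustrative, not Lucene's on-disk format:

```python
# Minimal sketch: each immutable segment answers a query independently,
# and the per-segment hits are merged into one result set.

class Segment:
    def __init__(self, postings):
        self.postings = postings  # term -> list of doc IDs in this segment

    def search(self, term):
        return self.postings.get(term, [])

def search_index(segments, term):
    # Each segment is probed on its own; hits are then concatenated.
    hits = []
    for seg in segments:
        hits.extend(seg.search(term))
    return hits

segments = [
    Segment({"lucene": [0, 3]}),
    Segment({"lucene": [7], "index": [5]}),
]
print(search_index(segments, "lucene"))  # [0, 3, 7]
```

Because each segment is self-contained, the per-segment searches need no coordination and can run lock-free, which is exactly the property the immutability rule buys.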

Index Directory Structure

/data/indices/my_index/
├── segments_5              ← Commit point (which segments are live)
├── _0.si                   ← Segment 0 info (metadata)
├── _0.cfs                  ← Compound file (all data)
├── _0.cfe                  ← Compound entries
├── _1.si                   ← Segment 1 info
├── _1_Lucene90_0.dvd       ← Doc values
├── _1_Lucene90_0.dvm       ← Doc values metadata
├── _1.fdx                  ← Stored fields index
├── _1.fdt                  ← Stored fields data
├── _1.liv                  ← Live docs bitmap (deleted = 0)
└── write.lock              ← Prevents concurrent writers

Index Approaches Compared

Different storage systems handle indexing differently. Lucene's segment-based approach is optimized for read-heavy, write-once workloads typical in search. Here's how it compares to other approaches.

| System | Index Type | Mutability | Best For |
|---|---|---|---|
| Lucene/ES | Inverted index + segments | Immutable | Full-text search, analytics |
| RocksDB | LSM-tree + SST files | Immutable (SSTs) | Key-value, high write throughput |
| PostgreSQL | B-tree pages | Mutable (in-place) | OLTP, transactions |
| Cassandra SAI | Per-SSTable index | Immutable | Secondary indexes at scale |

Why Immutability?

Most databases update data "in-place", overwriting the old record with the new one. While intuitive, this approach has real pitfalls: concurrent writers require complex locking, and a crash during a write can leave the database corrupted. Lucene takes a radically different approach: segments are immutable. Once a file is written to disk, it is never changed. "Updating" a document means writing the new version to a new segment and marking the old one as deleted.

Mutable (Database)

  Row 1: John paid $100
  Row 2: Jane paid $50
  Row 2: Jane paid $200 (overwritten)

  • Requires locking during writes
  • Risk of corruption on crash
  • Cache invalidation complexity

Immutable (Lucene Segments)

  Segment 1: Jane paid $50
  Segment 2: Jane paid $200 (new!)
  Original preserved, new segment added

  • No locks needed for reads
  • Perfect cache utilization
  • Crash recovery via translog
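The update-by-append pattern above can be modeled directly. This is a toy sketch, not Lucene's real API: the `live` dict stands in for the `.liv` live-docs bitmap, and segments are plain dicts:

```python
# Sketch of "update = write new copy + tombstone old copy".
# Old segments are never mutated; only their live-docs flags flip.

class Segment:
    def __init__(self, docs):
        self.docs = docs                     # doc_id -> content, never edited
        self.live = {d: True for d in docs}  # .liv-style live-docs flags

def update(segments, doc_id, new_content):
    for seg in segments:                     # mark every old copy as dead
        if doc_id in seg.docs:
            seg.live[doc_id] = False
    segments.append(Segment({doc_id: new_content}))  # fresh immutable segment

segments = [Segment({1: "Jane paid $50"})]
update(segments, 1, "Jane paid $200")

# A reader sees only the live copies:
visible = {d: s.docs[d] for s in segments for d in s.docs if s.live[d]}
print(visible)  # {1: 'Jane paid $200'}
```

Note that the old segment still holds "Jane paid $50" on disk; only a merge will actually reclaim that space, which is the tombstone tax discussed below.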

The Segment Lifecycle

How does a raw JSON document become a searchable immutable segment? It's a journey through memory and disk. First, documents land in an In-Memory Buffer (RAM). Every second (by default), a process called Refresh turns this buffer into a new small segment on disk. At this moment, and only at this moment, the document becomes visible to search. This "refresh" mechanism is why Elasticsearch is called "Near Real-Time" (NRT).

1. In-Memory Buffer
   Documents collected in RAM after the translog write
   NOT SEARCHABLE

2. Refresh → New Segment
   Buffer flushed to an immutable segment file (every 1s)
   SEARCHABLE ✓

3. Segments Accumulate
   Small segments pile up on disk (e.g. 50MB, 30MB, 5MB, 1MB)
   ⚠ Each query must check ALL segments

4. Merge (Background)
   Small segments combined, deleted docs removed → 86MB (merged)
   ✓ Disk space reclaimed
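The four stages above can be walked through in a toy simulation. The sizes (in MB) mirror the example in the diagram (50 + 30 + 5 + 1 = 86); the functions are illustrative stand-ins for the real refresh and merge machinery:

```python
# Toy lifecycle: buffer -> refresh -> small segments -> one merged segment.

buffer = []     # in-memory, NOT searchable
segments = []   # immutable on-disk segments, searchable

def refresh():
    # Flush the buffer into a brand-new segment; old segments stay untouched.
    if buffer:
        segments.append(list(buffer))
        buffer.clear()

def merge():
    # Background merge: combine all small segments into a single large one.
    merged = [doc for seg in segments for doc in seg]
    segments.clear()
    segments.append(merged)

for mb in (50, 30, 5, 1):   # four refresh cycles, one small segment each
    buffer.append(mb)
    refresh()

print(len(segments))        # 4 -> a query must check all four
merge()
print(sum(segments[0]))     # 86 -> one merged segment
```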

Segment Files & Formats

Each segment consists of multiple files, each storing a specific type of data. Understanding these files helps debug issues and optimize storage. The compound file format (.cfs) combines these into one file for efficiency on older filesystems.

| Extension | Name | Contents |
|---|---|---|
| segments_N | Commit Point | Lists all live segments in the index |
| .si | Segment Info | Metadata: doc count, codec, diagnostics |
| .cfs / .cfe | Compound File | Bundled segment data (reduces file handles) |
| .tim / .tip | Term Dictionary | All unique terms + pointers to postings |
| .doc / .pos | Postings | Doc IDs + positions for each term |
| .fdt / .fdx | Stored Fields | Original document content (for _source) |
| .liv | Live Docs | Bitmap of non-deleted documents |
| .dvd / .dvm | Doc Values | Columnar data for sorting/aggregations |

The Tombstone Tax (Deletes)

Because segments are immutable, deletes just mark documents as "dead" in a .liv file. The data remains on disk until merge reclaims it. This creates hidden costs that grow with your delete rate.

Deleted Docs Still Consume Resources

// Storage waste
Index: 100GB, 20% deleted → 20GB wasted

// Query overhead
0% deleted:  10ms
20% deleted: 12ms (+20%)
50% deleted: 18ms (+80%)

| Delete Ratio | Storage Overhead | Query Slowdown | Action Needed |
|---|---|---|---|
| 0-10% | Minimal | ~5% | Normal operation |
| 10-30% | Noticeable | 10-30% | Consider expunge_deletes |
| >30% | Severe | 50%+ | Force merge required |
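A rough cost model makes the tax concrete. The thresholds come from the table above; the formula itself is an illustration, not an official Elasticsearch calculation:

```python
# Sketch: estimate wasted storage and the recommended action
# for a given delete ratio (thresholds mirror the table above).

def tombstone_report(index_gb, delete_ratio):
    wasted_gb = index_gb * delete_ratio
    if delete_ratio <= 0.10:
        action = "normal operation"
    elif delete_ratio <= 0.30:
        action = "consider only_expunge_deletes"
    else:
        action = "force merge required"
    return wasted_gb, action

print(tombstone_report(100, 0.20))  # (20.0, 'consider only_expunge_deletes')
```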

The Segment Explosion Problem

If we create a new segment every second, after an hour we'll have 3,600 files. After a day, 86,400. Since every search query has to check every segment, performance degrades with the number of segments. To solve this, Lucene runs Background Merges, constantly picking small segments and merging them into larger ones (like combining tiles in the game 2048).

| Segment Count | Query Latency | Memory Overhead | File Descriptors |
|---|---|---|---|
| 1-5 | 10ms (baseline) | Low | ~50 |
| 10-50 | 15ms (+50%) | Moderate | ~500 |
| 100-500 | 40ms (+300%) | High | ~5,000 |
| 1,000+ | 100ms+ (degraded) | Critical | Risk of exhaustion |
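The 2048-style merging can be sketched with a tiny greedy loop: repeatedly merge the ten smallest segments into one. This is only in the spirit of Lucene's TieredMergePolicy; real merge selection also weighs size tiers, delete ratios, and the max segment cap:

```python
# Toy model of background merging: fold the smallest segments together
# until only a handful remain.
import heapq

def merge_until(sizes, max_segments=10, fanout=10):
    heapq.heapify(sizes)
    while len(sizes) > max_segments:
        batch = [heapq.heappop(sizes) for _ in range(fanout)]
        heapq.heappush(sizes, sum(batch))  # one merged segment replaces ten
    return sizes

# One 1 MB segment per second for an hour:
result = merge_until([1] * 3600)
print(len(result))  # 9 segments left instead of 3,600
```

Total bytes are conserved (the data is only rewritten, never lost), which is also why merging contributes to write amplification, covered later in this chapter.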

Symptoms of Too Many Segments

  • Query slowdown: Must check thousands of segments
  • File handle exhaustion: Each segment keeps files open; Elasticsearch requires a descriptor limit of at least 65,535
  • Memory pressure: Each segment has heap metadata
  • Merge storms: Catch-up merging consumes all I/O

Solutions

// Increase refresh interval
"refresh_interval": "30s" // vs 1s
// Disable during bulk load
"refresh_interval": "-1"
// Force merge (read-only indices only!)
POST /index/_forcemerge?max_num_segments=1

Shards & Segments

In a distributed Elasticsearch cluster, each shard is an independent Lucene index with its own segments. A query fans out to all shards, and each shard searches its own segments. This creates a multiplication effect: total_segments = shards × segments_per_shard.

Shard 0 (Node A): segments _0, _1, _2 → 3 segments
Shard 1 (Node B): segments _0, _1 → 2 segments
Shard 2 (Node C): segments _0, _1, _2, _3 → 4 segments

Query fan-out: 3 shards × ~3 segments each = 9 segment searches
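The fan-out arithmetic from the layout above is simply a sum across shards; the shard names here are placeholders:

```python
# Total per-query work = segment searches summed over every shard.
segments_per_shard = {"shard0": 3, "shard1": 2, "shard2": 4}

total_segment_searches = sum(segments_per_shard.values())
print(total_segment_searches)  # 9 segment searches for a single query
```

This is why segment hygiene matters more as shard counts grow: the multiplication effect turns a modest per-shard segment count into a large total.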

Impact on Query Performance

  • More shards = more network overhead (coordinator gathers all results)
  • Segment count per shard adds to per-node latency
  • Hot spots occur when one shard has many more segments than others
  • Force-merging across shards helps maintain uniform performance

Practical Tuning Tips

Segment management requires balancing search latency, indexing throughput, and resource usage. Here are production-tested guidelines for different workloads.

Refresh Interval Tuning

  • 1s (default): Real-time search, many small segments
  • 30s: Good for logs/metrics, fewer segments
  • 60s+: Near-batch workloads
  • -1: Disabled; use only during bulk indexing

Merge Policy Types

  • TieredMergePolicy: Default, balances size tiers
  • LogByteSizeMergePolicy: Older, less adaptive
  • max_merged_segment_size: Cap at 5GB (default)
  • segments_per_tier: 10 (default), lower = more merging

When to Force Merge

✓ Good Use Cases:

  • After bulk indexing is complete
  • Read-only/archived indices
  • Before taking a snapshot

✗ Avoid:

  • On actively written indices
  • During peak query times
  • On indices with ILM rollover

Real-World Use Cases

Understanding segment behavior in production helps you anticipate issues before they become problems. Here are common scenarios and their segment patterns.

🔥 High-Throughput Ingestion (10K docs/sec)

# Day 1 segment stats
Refresh: 1s → 86,400 segments created/day
After merge: ~50-100 segments (tiered policy)
⚠ If merge can't keep up:
→ Segment count grows to 1000+ → query degradation

Fix: Set refresh_interval: "30s" during ingestion, force-merge after.
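The back-of-envelope math behind that fix: segments created per day at a given refresh interval, assuming writes arrive continuously so every refresh emits a segment (the worst case):

```python
# Segments created per day as a function of refresh interval.
SECONDS_PER_DAY = 86_400

def segments_per_day(refresh_interval_s):
    return SECONDS_PER_DAY // refresh_interval_s

print(segments_per_day(1))   # 86400 -> merge must work constantly
print(segments_per_day(30))  # 2880  -> 30x less merge pressure
```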

📊 Logs Index (Time-Series)

# Typical ILM pattern
Hot phase: refresh=1s, many segments (search speed traded for freshness)
Warm phase: refresh=30s, force-merge to 1 segment
Cold phase: Read-only, 1 segment, searchable snapshot

Key: Force-merge on rollover to warm tier.

📈 Typical Production Stats

Avg segments/shard: 15
Avg segment size: 2.1GB
Deleted docs ratio: 8%

Code & API Snippets

Here are the essential Elasticsearch APIs and settings for monitoring and managing segments.

Check Segment Stats (GET)

# Per-index segment info
GET /my_index/_segments
# Human-readable summary
GET /_cat/segments/my_index?v&h=index,shard,segment,docs.count,size
# Index stats including segment count
GET /my_index/_stats/segments
Index Settings (PUT)

PUT /my_index/_settings
{
  "index.refresh_interval": "30s",
  "index.merge.policy.max_merged_segment": "5gb",
  "index.merge.policy.segments_per_tier": 10,
  "index.merge.policy.deletes_pct_allowed": 20
}
Force Merge Operations (POST)

# Merge to single segment (read-only index only!)
POST /my_index/_forcemerge?max_num_segments=1
# Just expunge deletes (safer)
POST /my_index/_forcemerge?only_expunge_deletes=true
# Background merge (non-blocking)
POST /my_index/_forcemerge?max_num_segments=5&wait_for_completion=false

Write Amplification: The Hidden Cost

Immutability has a hidden cost: write amplification. Because segments are never modified in place, the same data gets rewritten multiple times as it moves through the system: first to the translog, then to a segment, then through multiple merge passes.

You write 1 document, but it's written 5-7 times:
1. Translog (fsync)
2. In-memory buffer
3. Refresh → new segment
4. Merge level 1
5. Merge level 2
6. Merge level 3
Capacity Planning
Ingestion: 10 MB/sec (logical)
Actual I/O: 50-70 MB/sec
Plan for 7x write amplification. Use SSDs. Leave 50% headroom.
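The capacity-planning rule of thumb above reduces to simple arithmetic; the 7x factor and 50% headroom are the heuristics stated in this section, not universal constants:

```python
# Disk bandwidth to provision for a given logical ingest rate.
def required_disk_bandwidth(logical_mb_s, amplification=7, headroom=0.5):
    physical_mb_s = logical_mb_s * amplification  # translog + refresh + merges
    return physical_mb_s * (1 + headroom)         # leave 50% headroom

print(required_disk_bandwidth(10))  # 105.0 MB/sec for 10 MB/sec of ingest
```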

Key Takeaways

01

Segments are Immutable

Once written, never modified. This enables lock-free reads, perfect caching, and robust crash recovery.

02

Refresh = Visibility

Documents become searchable only after a refresh writes them to a segment (default: every 1s). Frequent refreshes are costly for indexing throughput.

03

The Merge Tax

Merging reclaims space from deleted docs but consumes I/O. Never force-merge actively written indices.

04

Write Amplification

Due to immutable segments and merging, a single document write results in 5-7x physical disk I/O.