Systems Atlas

Chapter 3.6: Indexing & Infrastructure

Sharding & Scale

When one server isn't enough. Sharding splits an index across multiple machines for horizontal scale and fault tolerance.


What is a Shard?

A shard is a horizontal partition of your index: a self-contained slice of your data that can live on any node in the cluster. This is the same concept as horizontal partitioning in database systems, where a large table is split across multiple servers. In Elasticsearch, each shard is a complete Lucene index with its own segments, term dictionaries, and search capabilities.

Elasticsearch Shard

  • Horizontal partition of index data
  • Complete Lucene index internally
  • Can be primary or replica
  • Independently searchable and recoverable

Database Horizontal Partitioning

  • Range partitioning: by date range, ID range
  • Hash partitioning: by hash(key) % N
  • List partitioning: by category values
  • Elasticsearch uses hash partitioning

An Index is a Collection of Shards

Index: "products" splits into five shards:
Shard 0 | Shard 1 | Shard 2 | Shard 3 | Shard 4
Each shard is a complete Lucene index.

Why Shard?

A single machine has hard physical limits: RAM is finite, CPU cores are limited, and disks fill up. When your data exceeds these limits (e.g., 5TB index on a 1TB machine) or your query volume is too high for one CPU to handle, you must scale horizontally. Sharding breaks your single index into smaller, manageable chunks called shards, which can be distributed across multiple servers.

Single Node

Limited by one machine's CPU, RAM, disk

Sharded (5 Shards, 1 Replica)
Primaries: P0 P1 P2 P3 P4
Replicas:  R0 R1 R2 R3 R4

Scale out by adding machines. Replicas for availability.

Primary vs Replica Shards

Every shard has a role. Primary shards handle all write operations and are the authoritative copy of the data. Replica shards are copies of primaries that provide two critical benefits: fault tolerance (if a node dies, a replica takes over) and read throughput (search queries can be served by any copy).

Primary Shard

  • Handles all writes (index, update, delete)
  • Authoritative copy of the data
  • Replicates operations to replica shards
  • Number is fixed at index creation
# Fixed at creation time
"number_of_shards": 5

Replica Shard

  • Copy of primary shard data
  • Provides fault tolerance (failover)
  • Increases read throughput
  • Number is adjustable anytime
# Can change dynamically
"number_of_replicas": 1

Replica Benefits

Fault Tolerance: If Node A dies, replica on Node B promotes to primary. Zero data loss.
Read Scaling: Search queries distributed across all copies. 2 replicas = 3x read capacity.

Shards & Lucene Internals

Each shard is not just an abstract partition; it's a complete Lucene index. This means every shard has:

  • Its own segments (immutable data files)
  • Its own translog (write-ahead log)
  • Its own refresh cycle (1 s visibility latency by default)
  • Its own merge policy (background segment merges)
  • Its own term dictionary (inverted index)
  • Its own doc values (columnar storage)

This is why shard count affects resource usage: each shard carries its own Lucene overhead.

Routing: How Documents Find Their Home

If an index is split into 5 pieces, where does document #123 go? The system uses a deterministic formula: shard = hash(routing_key) % number_of_shards. By default, the routing key is the document's _id.

shard_number = MurmurHash3(routing_key) % number_of_shards
// Example: Document ID "product_abc123"
MurmurHash3("product_abc123") = 2847593847
2847593847 % 5 = 2
→ Document goes to Shard 2
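The hash-mod rule can be sketched in a few lines of Python. Note that MD5 stands in here for MurmurHash3 (which Elasticsearch actually uses), so the shard numbers this produces are illustrative only:

```python
import hashlib

def shard_for(routing_key: str, num_shards: int) -> int:
    """Deterministic hash-mod routing (MD5 as a stand-in for MurmurHash3)."""
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# The same key always maps to the same shard...
assert shard_for("product_abc123", 5) == shard_for("product_abc123", 5)
# ...but changing num_shards remaps almost every document,
# which is why the primary shard count is fixed at index creation.
```

Because every node computes the same function, any node can route a document or a single-shard query without consulting a central directory.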

Custom Routing Benefits

// Route all orders for a user together
POST /orders/_doc?routing=user_456
// Query only hits ONE shard!
GET /orders/_search?routing=user_456

Use for: multi-tenant, user data, parent-child

Danger: Hot Spots!

// Power user has 10M docs
Shard 0: 50 MB
Shard 1: 50 MB
Shard 2: 5 GB ← CRUSHED
Shard 3: 50 MB

Same routing key = same shard for all docs
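A quick simulation makes the skew concrete (tenant names and sizes here are hypothetical; MD5 again stands in for MurmurHash3):

```python
import hashlib
from collections import Counter

def shard_for(key: str, num_shards: int) -> int:
    # MD5 as an illustrative stand-in for MurmurHash3
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % num_shards

# 1,000 small tenants with 100 docs each, plus one power user with 1M docs
doc_counts = Counter()
for tenant in range(1000):
    doc_counts[shard_for(f"user_{tenant}", 5)] += 100
doc_counts[shard_for("power_user", 5)] += 1_000_000

# The small tenants spread evenly; the power user's shard is ~10x larger
print(sorted(doc_counts.items()))
```

Hashing spreads *keys* evenly, not *documents*: one oversized key concentrates all of its documents on a single shard.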

Shard States & Lifecycle

Shards go through various states during their lifecycle. Understanding these states helps you diagnose cluster health issues and understand what's happening during rebalancing.

INITIALIZING: shard is recovering or being created
STARTED: active and serving requests
RELOCATING: moving to another node
UNASSIGNED: no node available to host it

Shard Lifecycle Flow

UNASSIGNED → INITIALIZING → STARTED → RELOCATING → STARTED

Shard Placement & Balancing

Elasticsearch automatically distributes shards across nodes for optimal resource utilization. The key rule: a primary and its replicas never live on the same node; otherwise a single node failure would lose both copies.

Shard Distribution Across 3 Nodes

Node 1: P0 P3 R1 R4
Node 2: P1 P4 R0 R2
Node 3: P2 R3
P = Primary, R = Replica

Allocation Awareness

In production, use shard allocation awareness to spread replicas across racks, zones, or datacenters. This ensures a single rack failure doesn't take out both primary and replica.

Query Execution: Scatter-Gather

How do you search across 50 shards? You don't directly. You send your query to a Coordinator Node, which acts as a project manager. The coordinator broadcasts ("scatters") the query to every shard. Each shard executes the search locally and returns its top results. The coordinator then collects ("gathers") these partial results.

Coordinator Node
Phase 1: Query (Scatter)
  Shard 0 → returns top 10 IDs
  Shard 1 → returns top 10 IDs
  Shard 2 → returns top 10 IDs
Merge Phase: sort 30 results → global top 10
Phase 2: Fetch (Gather)
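The merge phase can be sketched as combining per-shard top-k lists into a global top-k (the scores and doc IDs below are made up):

```python
import heapq

# Phase 1 (scatter): each shard returns its local top hits as
# (score, doc_id) pairs, already sorted by score descending.
shard_results = [
    [(9.1, "doc_a"), (7.4, "doc_b"), (3.2, "doc_c")],   # shard 0
    [(8.8, "doc_d"), (8.0, "doc_e"), (1.1, "doc_f")],   # shard 1
    [(9.5, "doc_g"), (2.0, "doc_h"), (0.5, "doc_i")],   # shard 2
]

# Merge phase (gather): the coordinator merges the sorted lists
# and keeps only the global top-k; Phase 2 fetches just those docs.
k = 4
merged = list(heapq.merge(*shard_results, reverse=True))[:k]
global_top = [doc_id for _, doc_id in merged]
print(global_top)  # ['doc_g', 'doc_a', 'doc_d', 'doc_e']
```

Because each input list is already sorted, the coordinator's merge is cheap for page 1; the trouble starts when deep pages force each shard to send far more than k hits, as the next section shows.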

🚀 Optimization: Routing Key Query

When you query with a routing key, ES skips scattering to all shards; it hits only the relevant shard!

// Without routing: hits ALL 5 shards
GET /orders/_search
// With routing: hits ONLY 1 shard
GET /orders/_search?routing=user_456

The Deep Pagination Problem

The scatter-gather model has a major weakness: random access to deep pages gets more expensive the deeper you go. Requesting "Page 1,000" (results 10,001 to 10,010) forces the cluster to sort massive amounts of data in memory.

The Logical Trap: Why can't we just ask for the 10,000th doc?

Imagine searching for "fastest runners". You might ask each shard for its 10,000th fastest person. But what if Shard A holds all of the top 10,000 runners in the world? If Shard A only sends its 10,000th person, and Shard B sends its 1st, the Coordinator sees Shard B's person as the winner, which is wrong.

To guarantee accuracy, EVERY shard must return its own top 10,010 results, just in case they are the global winners.

Page 1000 = Memory Explosion

// Request: page 1000, 10 results
Each shard must return: 10,010 results
5 shards × 10,010 = 50,050 docs in memory!
The Coordinator sorts ALL OF THESE just to pick the final 10.
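The arithmetic generalizes into a simple cost function: the coordinator's buffer grows with both shard count and page depth.

```python
def coordinator_buffer(num_shards: int, from_offset: int, size: int) -> int:
    # Every shard must return its own top (from + size) hits,
    # since any of them could be among the global winners.
    return num_shards * (from_offset + size)

# Page 1000 (from=10000, size=10) across 5 shards:
assert coordinator_buffer(5, 10_000, 10) == 50_050
# Page 1 is cheap by comparison:
assert coordinator_buffer(5, 0, 10) == 50
```

This is why Elasticsearch caps `from + size` by default (the `index.max_result_window` setting, 10,000 out of the box) and pushes you toward cursors for anything deeper.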

The Solution: Cursor-Based Pagination (`search_after`)

Instead of saying "Skip 10,000 records" (which forces the DB to count them), we use a live cursor. We tell the engine: "I don't care about the first 10,000. I just know the last result I saw had a score of 1.5. Give me 10 results after 1.5."

Stateful Efficiency
  • Offset (Bad): "Go to page 1,000" → Shard must calculate 10,000 items to know where page 1,000 starts.
  • Cursor (Good): "Start after [Value X]" → Shard jumps directly to Value X in the sorted index and returns just the next 10 items.
  • Result: Each shard returns 10 docs, not 10,010. Memory usage stays flat/constant regardless of depth.
// Client sends the sort values of the LAST result from the previous page
"search_after": ["2024-01-15T10:30:00", "product_123"],
"sort": [{ "created_at": "desc" }, { "_id": "asc" }]
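A toy model of the cursor idea, assuming each shard keeps its data in sorted order: instead of skipping `from` items, the shard seeks past the cursor value and returns just the next page.

```python
import bisect

def shard_page_after(sorted_values, cursor, size):
    """Return the next `size` values strictly after `cursor`.

    bisect jumps straight to the cursor's position, so the cost is
    O(log n + size) regardless of page depth -- the search_after
    idea in miniature.
    """
    start = bisect.bisect_right(sorted_values, cursor)
    return sorted_values[start:start + size]

shard = list(range(0, 100_000, 3))          # a shard's sorted keys
page = shard_page_after(shard, 29_999, 5)   # "give me 5 after 29999"
print(page)  # [30000, 30003, 30006, 30009, 30012]
```

The trade-off: you can only walk forward page by page; there is no random access to "page 1,000" without visiting the pages before it.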

Shard Sizing Guide

Target 10-50 GB per shard. Too small creates overhead; too large slows recovery.

Size        Status          Why
< 1 GB      Over-sharded    Too much overhead per shard
1-10 GB     Acceptable      OK for small indices
10-50 GB    Ideal ✓         Optimal balance
50-100 GB   Large           Longer recovery time
> 100 GB    Under-sharded   Very slow operations

❌ Over-Sharding Problems

  • Excessive heap overhead per shard
  • Too many file descriptors
  • Cluster state bloat
  • Master node instability

❌ Under-Sharding Problems

  • Recovery takes hours (100GB+)
  • Can't spread load across nodes
  • Reindex required to add shards
  • Network saturation during moves

⚠️ Shard Count is FIXED After Creation!

You CANNOT change the number of primary shards in place. Plan carefully or expect to reindex (the _shrink and _split APIs can help, but they too produce a new index).

// Formula for shard count
shards = ceil(data_size_gb / 30)
// 200 GB → 7 shards

Cluster Rebalancing & Recovery

When nodes join or leave the cluster, Elasticsearch automatically rebalances shards. Understanding this process helps you plan maintenance windows and capacity changes.

Node Added

  1. New node joins cluster
  2. Master allocates shards to new node
  3. Shards copy data from existing nodes
  4. Once synced, shards become STARTED

Duration: depends on shard sizes, network speed

Node Removed/Failed

  1. Shards on dead node become UNASSIGNED
  2. Replica promotes to primary (if available)
  3. New replicas created on remaining nodes
  4. Rebalancing to spread load evenly

No data loss if replicas exist on other nodes

Capacity Planning Example

Let's work through a real example. How many shards and nodes do you need for 100 million products?

Requirements
  • 100 million products
  • Average doc: 2 KB
  • Traffic: 1,000 queries/sec

Calculations
  • 100M × 2 KB = 200 GB raw
  • With index overhead: ~300 GB
  • 300 GB ÷ 30 = 10 shards
  • + 1 replica = 20 shard copies
  • 20 ÷ 5 per node = 4 nodes
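The calculation above can be checked mechanically. Note the 1.5× index-overhead factor and the 5-shards-per-node budget are this worked example's assumptions, not universal constants:

```python
import math

docs = 100_000_000
doc_kb = 2
raw_gb = docs * doc_kb / 1_000_000            # 200 GB raw
with_overhead_gb = raw_gb * 1.5               # ~300 GB with index overhead (assumed factor)
primaries = math.ceil(with_overhead_gb / 30)  # target ~30 GB per shard -> 10
copies = primaries * (1 + 1)                  # +1 replica -> 20 shard copies
nodes = math.ceil(copies / 5)                 # ~5 shards per node (assumed) -> 4 nodes
print(primaries, copies, nodes)  # 10 20 4
```

Rerun with your own document count, overhead factor, and per-node budget; the shape of the calculation stays the same.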

Key Takeaways

01

Horizontal Scale

Sharding splits data across nodes to exceed single-machine limits. Shards are independent Lucene indices.

02

Deterministic Routing

hash(id) % shards guarantees we always know where a document lives. Changing shard count breaks this (reindex required).

03

Replicas = Safety + Speed

Replica shards provide HA (failover) and increase read throughput (load balancing).

04

Deep Pagination Danger

Scatter-gather requires the coordinator to sort shards × (from + size) docs in memory. Use search_after for deep scrolling.

Practical API Examples

Essential commands for monitoring and managing shards in production.

View Shard Allocation (GET)
# Detailed shard info
GET /_cat/shards?v&h=index,shard,prirep,state,docs,store,node
# Index-specific
GET /_cat/shards/my_index?v
# Unassigned shards only
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
Create Index with Shard Settings (PUT)
PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
Update Replica Count, Dynamic (PUT)
PUT /my_index/_settings
{
  "number_of_replicas": 2
}

TL;DR Summary

Why Sharding Matters

  • Enables horizontal scaling beyond single-node limits
  • Distributes data and query load across nodes
  • Replicas provide fault tolerance and read scaling

Core Rules

  • Target 10-50 GB per shard
  • Shard count is immutable; plan ahead
  • Use routing for multi-tenant optimization
  • Avoid deep pagination; use search_after

Common Pitfalls

  • Over-sharding: too many tiny shards = overhead
  • Under-sharding: huge shards = slow recovery
  • Hot spots from uneven routing keys
  • Ignoring replica placement across racks/zones

Best Practices

  • Formula: shards = ceil(data_GB / 30)
  • Always have at least 1 replica
  • Use allocation awareness for HA
  • Monitor with _cat/shards regularly