Chapter 3.6: Indexing & Infrastructure
Sharding & Scale
When one server isn't enough. Sharding splits an index across multiple machines for horizontal scale and fault tolerance.
What is a Shard?
A shard is a horizontal partition of your index: a self-contained slice of your data that can live on any node in the cluster. This is the same concept as horizontal partitioning in database systems, where a large table is split across multiple servers. In Elasticsearch, each shard is a complete Lucene index with its own segments, term dictionaries, and search capabilities.
Elasticsearch Shard
- Horizontal partition of index data
- Complete Lucene index internally
- Can be primary or replica
- Independently searchable and recoverable
Database Horizontal Partitioning
- Range partitioning: by date range, ID range
- Hash partitioning: by hash(key) % N
- List partitioning: by category values
- Elasticsearch uses hash partitioning
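The three strategies can be sketched in a few lines of Python. Note that `crc32` here is only a stand-in for the murmur3 hash Elasticsearch actually uses, and the keys, boundaries, and category table are invented for illustration.

```python
from zlib import crc32

def range_partition(key: int, upper_bounds: list[int]) -> int:
    """Range partitioning: route to the first bucket whose upper bound covers the key."""
    for shard, upper in enumerate(upper_bounds):
        if key < upper:
            return shard
    return len(upper_bounds)  # overflow bucket

def hash_partition(key: str, n_shards: int) -> int:
    """Hash partitioning (Elasticsearch's strategy): hash(key) % N."""
    return crc32(key.encode()) % n_shards

def list_partition(category: str, mapping: dict[str, int]) -> int:
    """List partitioning: an explicit category -> shard table."""
    return mapping[category]

print(range_partition(2021, [2000, 2010, 2020, 2030]))               # 3 (the 2020-2030 bucket)
print(hash_partition("doc-123", 5) == hash_partition("doc-123", 5))  # True: deterministic
print(list_partition("books", {"books": 0, "toys": 1}))              # 0
```

Hash partitioning trades range-scan locality for even distribution, which is why it suits a search engine: documents spread uniformly, and any single document is still findable from its key alone.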
An Index is a Collection of Shards
Why Shard?
A single machine has hard physical limits: RAM is finite, CPU cores are limited, and disks fill up. When your data exceeds these limits (e.g., 5TB index on a 1TB machine) or your query volume is too high for one CPU to handle, you must scale horizontally. Sharding breaks your single index into smaller, manageable chunks called shards, which can be distributed across multiple servers.
Limited by one machine's CPU, RAM, disk
Scale out by adding machines. Replicas for availability.
Primary vs Replica Shards
Every shard has a role. Primary shards handle all write operations and are the authoritative copy of the data. Replica shards are copies of primaries that provide two critical benefits: fault tolerance (if a node dies, a replica takes over) and read throughput (search queries can be served by any copy).
Primary Shard
- Handles all writes (index, update, delete)
- Authoritative copy of the data
- Replicates operations to replica shards
- Number is fixed at index creation
Replica Shard
- Copy of primary shard data
- Provides fault tolerance (failover)
- Increases read throughput
- Number is adjustable anytime
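A quick sanity check on the arithmetic: because each primary is copied once per replica, the cluster hosts primaries × (1 + replicas) physical shards. The settings dict below is illustrative, mirroring the body you would send when creating an index.

```python
# Hypothetical index settings, as you'd PUT them at index creation:
settings = {"number_of_shards": 5, "number_of_replicas": 1}

def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary is copied `replicas` times, so the cluster hosts
    primaries * (1 + replicas) physical shards in total."""
    return primaries * (1 + replicas)

print(total_shard_copies(5, 1))  # 10 physical shards
print(total_shard_copies(5, 2))  # raise replicas later: 15 shards, no reindex
```

Raising the replica count only changes the multiplier, which is why it is safe to adjust at any time; changing the primary count changes where documents live, which is not.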
Replica Benefits
Shards & Lucene Internals
Each shard is not just an abstract partition; it is a complete Lucene index in its own right, with its own segments, term dictionaries, and search capabilities.
Routing: How Documents Find Their Home
If an index is split into 5 pieces, where does document #123 go? The system uses a deterministic formula: shard = hash(routing_key) % number_of_shards. By default, the routing key is the document's _id.
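The formula is easy to model in Python. Elasticsearch really uses murmur3, so `crc32` below is only a stand-in, but the two properties that matter are the same: routing is deterministic, and it is tied to the shard count.

```python
from zlib import crc32  # stand-in for Elasticsearch's murmur3

def route(routing_key: str, number_of_shards: int) -> int:
    # shard = hash(routing_key) % number_of_shards
    return crc32(routing_key.encode()) % number_of_shards

# Deterministic: a read always finds the shard that received the write.
assert route("123", 5) == route("123", 5)

# But the mapping depends on the modulus, which is why changing the
# shard count invalidates every document's location (hence: reindex).
moved = sum(route(str(i), 5) != route(str(i), 6) for i in range(1000))
print(f"{moved} of 1000 docs would land on a different shard going 5 -> 6")
```

Running this shows the vast majority of documents change shards when the modulus changes, which is exactly why the primary shard count is fixed at index creation.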
Custom Routing Benefits
Use for: multi-tenant, user data, parent-child
Danger: Hot Spots!
Same routing key = same shard for all docs
Shard States & Lifecycle
Shards go through various states during their lifecycle: UNASSIGNED, INITIALIZING, STARTED, and RELOCATING. Understanding these states helps you diagnose cluster health issues and understand what is happening during rebalancing.
Shard Lifecycle Flow
Shard Placement & Balancing
Elasticsearch automatically distributes shards across nodes for optimal resource utilization. The key rule: a primary and its replicas never live on the same node; otherwise a single node failure would lose both copies.
Shard Distribution Across 3 Nodes
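A toy allocator makes the rule concrete. Real Elasticsearch balancing also weighs disk usage, load, and awareness attributes, but the invariant sketched below, copies of the same shard on distinct nodes, is the one that matters.

```python
def place_shards(n_primaries: int, n_replicas: int, nodes: list[str]) -> dict:
    """Toy allocator: copies of the same shard go to distinct nodes,
    spread round-robin. Real ES balancing is far more sophisticated."""
    copies_per_shard = 1 + n_replicas
    if copies_per_shard > len(nodes):
        raise ValueError("not enough nodes to separate primary and replicas")
    placement = {}
    for shard in range(n_primaries):
        for copy in range(copies_per_shard):
            # offset by shard id so shards spread across the cluster
            node = nodes[(shard + copy) % len(nodes)]
            placement[(shard, "P" if copy == 0 else f"R{copy}")] = node
    return placement

layout = place_shards(3, 1, ["node-1", "node-2", "node-3"])
# The invariant: no shard ever has two copies on the same node.
for s in range(3):
    assert layout[(s, "P")] != layout[(s, "R1")]
print(layout)
```

The `ValueError` branch mirrors a real symptom: with 1 replica and only 1 node, replicas stay UNASSIGNED and the cluster reports yellow health.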
Allocation Awareness
In production, use shard allocation awareness to spread replicas across racks, zones, or datacenters. This ensures a single rack failure doesn't take out both primary and replica.
Query Execution: Scatter-Gather
How do you search across 50 shards? You don't directly. You send your query to a Coordinator Node, which acts as a project manager. The coordinator broadcasts ("scatters") the query to every shard. Each shard executes the search locally and returns its top results. The coordinator then collects ("gathers") these partial results.
🚀 Optimization: Routing Key Query
When you query with a routing key, ES skips scattering to all shards and hits only the relevant shard!
The Deep Pagination Problem
The scatter-gather model has a major weakness: random access to deep pages is expensive, and the cost grows with depth. Requesting "page 1,000" (results 10,000 to 10,010) forces every shard to sort 10,010 documents and the coordinator to merge all of them in memory.
The Logical Trap: Why can't we just ask for the 10,000th doc?
Imagine searching for "fastest runners". You might ask each shard for its 10,000th fastest person. But what if Shard A holds all top 10,000 runners in the world? If Shard A only sends its 10,000th person, and Shard B sends its 1st, the Coordinator sees Shard B's person as the winner, which is wrong.
To guarantee accuracy, EVERY shard must return its own top 10,010 results, just in case they are the global winners.
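The per-page cost is easy to quantify: with offset pagination, every shard returns its top from + size hits, and the coordinator must collect and sort all of them.

```python
def coordinator_cost(n_shards: int, from_: int, size: int = 10) -> int:
    """Docs the coordinator must merge for offset pagination: every
    shard returns its own top (from + size) hits, just in case they
    are the global winners."""
    return n_shards * (from_ + size)

print(coordinator_cost(5, 0))       # first page:  5 * 10     = 50 docs merged
print(coordinator_cost(5, 10_000))  # "page 1000": 5 * 10,010 = 50,050 docs merged
```

The cost grows linearly with depth and with shard count, so on a 50-shard index the same deep page means half a million documents sorted for 10 returned hits.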
Page 1000 = Memory Explosion
The Solution: Cursor-Based Pagination (`search_after`)
Instead of saying "Skip 10,000 records" (which forces the DB to count them), we use a live cursor. We tell the engine: "I don't care about the first 10,000. I just know the last result I saw had a score of 1.5. Give me 10 results after 1.5."
- Offset (Bad): "Go to page 1,000" → the shard must build and sort 10,000 items just to know where page 1,000 starts.
- Cursor (Good): "Start after [Value X]" → the shard jumps directly to Value X in the sorted index and returns just the next 10 items.
- Result: each shard returns 10 docs, not 10,010. Memory usage stays constant regardless of depth.
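Here is a minimal model of the difference, using `bisect` on a sorted list to play the role of a shard's sorted index; the numbers are invented sort values.

```python
import bisect

# Two shards, each holding its results pre-sorted by a sort value.
shard_a = [1, 4, 7, 10, 13, 16]
shard_b = [2, 5, 8, 11, 14, 17]

def page_after(shard: list[int], search_after: int, size: int) -> list[int]:
    """Cursor pagination: jump past `search_after` in the sorted index
    and return only the next `size` hits. Nothing before the cursor
    is materialized, no matter how deep we are."""
    start = bisect.bisect_right(shard, search_after)
    return shard[start:start + size]

# Last value the client saw was 7; each shard returns just `size` docs.
a = page_after(shard_a, 7, 2)    # [10, 13]
b = page_after(shard_b, 7, 2)    # [8, 11]
merged = sorted(a + b)[:2]       # coordinator merges: [8, 10]
print(merged)
```

The trade-off: a cursor can only walk forward through the sort order, so `search_after` suits infinite scroll and exports, not "jump to page N" UIs.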
Shard Sizing Guide
Target 10-50 GB per shard. Too small creates overhead; too large slows recovery.
| Size | Status | Why |
|---|---|---|
| < 1 GB | Over-sharded | Too much overhead per shard |
| 1-10 GB | Acceptable | OK for small indices |
| 10-50 GB | Ideal ✓ | Optimal balance |
| 50-100 GB | Large | Longer recovery time |
| > 100 GB | Under-sharded | Very slow operations |
❌ Over-Sharding Problems
- Excessive heap overhead per shard
- Too many file descriptors
- Cluster state bloat
- Master node instability
❌ Under-Sharding Problems
- Recovery takes hours (100GB+)
- Can't spread load across nodes
- Reindex required to add shards
- Network saturation during moves
⚠️ Shard Count is FIXED After Creation!
You CANNOT change the number of primary shards in place. Plan carefully, or expect to reindex (the _split and _shrink APIs can resize, but only by producing a new index).
Cluster Rebalancing & Recovery
When nodes join or leave the cluster, Elasticsearch automatically rebalances shards. Understanding this process helps you plan maintenance windows and capacity changes.
Node Added
1. New node joins cluster
2. Master allocates shards to new node
3. Shards copy data from existing nodes
4. Once synced, shards become STARTED
Duration: depends on shard sizes, network speed
Node Removed/Failed
1. Shards on dead node become UNASSIGNED
2. Replica promotes to primary (if available)
3. New replicas created on remaining nodes
4. Rebalancing to spread load evenly
No data loss if replicas exist on other nodes
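The failover path above can be sketched with a toy cluster map of (shard, role) → node. This simplifies Elasticsearch's actual allocation logic down to its essence: promote a surviving replica, lose nothing.

```python
def fail_node(cluster: dict, dead: str) -> dict:
    """Toy failover: drop every shard copy on the dead node, then
    promote a surviving replica for each primary that was lost."""
    survivors = {}
    lost_primaries = set()
    for (shard, role), node in cluster.items():
        if node == dead:
            if role == "P":
                lost_primaries.add(shard)   # needs a promotion
        else:
            survivors[(shard, role)] = node
    for shard in lost_primaries:
        for (s, role), node in list(survivors.items()):
            if s == shard and role.startswith("R"):
                del survivors[(s, role)]
                survivors[(s, "P")] = node  # replica becomes primary
                break
    return survivors

cluster = {(0, "P"): "n1", (0, "R1"): "n2",
           (1, "P"): "n2", (1, "R1"): "n3"}
after = fail_node(cluster, "n1")
print(after[(0, "P")])  # n2 -- the replica was promoted, no data lost
```

A shard with no surviving copy would simply vanish from `survivors`, which is the toy-model equivalent of a red cluster: data loss because no replica existed elsewhere.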
Capacity Planning Example
Let's work through a real example. How many shards and nodes do you need for 100 million products?
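One way to run the numbers, with loudly assumed inputs: the ~1 KB indexed size per product, the single replica, and the 30 GB target are illustrative choices, not measurements; substitute figures from your own mappings and a test index.

```python
from math import ceil

# Illustrative assumptions -- replace with your own measurements:
docs = 100_000_000
bytes_per_doc = 1_000        # ~1 KB indexed size per product (assumed)
replicas = 1                 # minimum for fault tolerance
target_shard_gb = 30         # middle of the 10-50 GB sweet spot

data_gb = docs * bytes_per_doc / 1e9          # 100 GB of primary data
primaries = ceil(data_gb / target_shard_gb)   # ceil(100 / 30) = 4
total_shards = primaries * (1 + replicas)     # 8 physical shards

# With a couple of shards per node, 3-4 data nodes would host this
# comfortably -- leaving headroom for growth and rebalancing.
print(data_gb, primaries, total_shards)  # 100.0 4 8
```

Note how sensitive the answer is to the per-document size: at 5 KB per product the same catalog is 500 GB and needs 17 primaries, which is why measuring a sample index beats guessing.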
Key Takeaways
Horizontal Scale
Sharding splits data across nodes to exceed single-machine limits. Shards are independent Lucene indices.
Deterministic Routing
hash(id) % shards guarantees we always know where a document lives. Changing shard count breaks this (reindex required).
Replicas = Safety + Speed
Replica shards provide HA (failover) and increase read throughput (load balancing).
Deep Pagination Danger
Scatter-gather requires sorting shards × (from + size) docs in memory. Use search_after for deep scrolling.
Practical API Examples
Essential commands for monitoring and managing shards in production.
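Since exact invocations depend on your setup, here is a hedged sketch: the endpoints below are real Elasticsearch APIs, while the host `localhost:9200` and the index name `my-index` are placeholders to swap for your own.

```python
# Common shard-monitoring endpoints, shown as the URLs you would
# curl (or open in a browser) against your cluster.
ES = "http://localhost:9200"  # placeholder host

endpoints = {
    "shard overview":  f"{ES}/_cat/shards?v",                # state/size of every shard
    "cluster health":  f"{ES}/_cluster/health",              # green / yellow / red
    "disk per node":   f"{ES}/_cat/allocation?v",            # shards and disk by node
    "why unassigned":  f"{ES}/_cluster/allocation/explain",  # diagnose stuck shards
}

# Replica count is the one shard setting you can change on a live index:
update_replicas = {
    "method": "PUT",
    "url": f"{ES}/my-index/_settings",
    "body": {"index": {"number_of_replicas": 2}},
}

for name, url in endpoints.items():
    print(f"{name:15} GET {url}")
```

`_cluster/allocation/explain` is the one to reach for first when health goes yellow: it reports, per unassigned shard, exactly which allocation rule is blocking placement.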
TL;DR Summary
Why Sharding Matters
- Enables horizontal scaling beyond single-node limits
- Distributes data and query load across nodes
- Replicas provide fault tolerance and read scaling
Core Rules
- Target 10-50GB per shard
- Shard count is immutable; plan ahead
- Use routing for multi-tenant optimization
- Avoid deep pagination; use search_after
Common Pitfalls
- Over-sharding: too many tiny shards = overhead
- Under-sharding: huge shards = slow recovery
- Hot spots from uneven routing keys
- Ignoring replica placement across racks/zones
Best Practices
- Formula: shards = ceil(data_GB / 30)
- Always have at least 1 replica
- Use allocation awareness for HA
- Monitor with _cat/shards regularly