Systems Atlas
Chapter 5.7: Performance Optimization

Caching at Retrieval Layer

A well-designed caching strategy can reduce query latency by 100x and decrease backend load by 90%. This chapter reveals the multi-layer caching architecture that powers fast search.

Caching is not just an optimization—it's a necessity for production search systems. Understanding what to cache, where to cache it, and when to invalidate is the difference between sub-millisecond response times and frustrated users.

Cache Hit Latency: 0.1-1 ms
In-memory cache hit vs. 10-100 ms for full query execution.

Query Repetition: 70%
The top 10% of queries account for 70% of traffic (Zipf's Law).

Target Hit Rate: >80%
Well-tuned production systems achieve 80%+ cache hit rates.

Prerequisites: This chapter builds on Execution Plans (query pipeline), Filters & Facets (filter context), and Segments (index structure).

1. Why Caching Matters: Zipf's Law

User query behavior follows Zipf's Law: a small number of queries account for a disproportionately large share of traffic. This makes caching extraordinarily effective—cache the top 10% of queries and you serve 70% of traffic from memory.

Named after linguist George Zipf, this power-law distribution appears throughout human behavior: the most common English word ("the") occurs roughly twice as often as the second most common and about ten times as often as the tenth. Search queries exhibit the same pattern. On any e-commerce site, queries like "iphone" or "nike shoes" appear thousands of times per minute, while long-tail queries like "blue running shoes size 12 wide" might appear once a day. This skewed distribution is the fundamental reason caching works so well for search systems.

The practical implication is profound: by caching just a few thousand popular query results, you can serve the majority of your search traffic without touching your index at all. This reduces latency from 50-200ms (index search) to sub-millisecond (cache hit), while simultaneously reducing load on your search cluster. A well-tuned caching strategy can reduce your required Elasticsearch nodes by 50% or more, directly impacting infrastructure costs.
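The head-share numbers above can be sanity-checked with a small simulation. This is an illustrative sketch, not from the chapter: it models query frequency as an ideal Zipf distribution (frequency proportional to 1/rank) and computes what fraction of traffic the head queries capture.

```python
def zipf_traffic_share(num_queries: int, head_fraction: float, s: float = 1.0) -> float:
    """Fraction of total traffic served by the top `head_fraction` of queries,
    assuming query frequency follows a Zipf distribution with exponent `s`."""
    weights = [1.0 / (rank ** s) for rank in range(1, num_queries + 1)]
    total = sum(weights)
    head_count = max(1, int(num_queries * head_fraction))
    return sum(weights[:head_count]) / total

# With 100,000 distinct queries, the top 10% capture the bulk of traffic:
share = zipf_traffic_share(100_000, 0.10)
print(f"Top 10% of queries serve {share:.0%} of traffic")
```

With the exponent s = 1, the top 10% of 100,000 queries come out at roughly 80% of traffic, in the same ballpark as the 70% figure quoted above; real query logs deviate from the ideal curve, so treat this as intuition rather than a prediction.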

# Query Distribution (Zipf's Law)
Top 1% of queries  → 30% of traffic
Top 10% of queries → 70% of traffic
Top 20% of queries → 85% of traffic
Long tail          → 15% of traffic

Cache the head → serve most traffic from memory → reduce load on scoring/disk
| Operation             | Typical Latency | Comparison  |
|-----------------------|-----------------|-------------|
| Cache hit (in-memory) | 0.1-1 ms        | Baseline    |
| SSD read              | 0.1-1 ms        | ~1x         |
| Posting list scan     | 1-10 ms         | 10x slower  |
| Full BM25 scoring     | 10-100 ms       | 100x slower |
| Complex aggregation   | 50-500 ms       | 500x slower |

2. Multi-Layer Caching Architecture

Production search systems use multiple caching layers, each serving different purposes with different invalidation strategies. Understanding this stack is key to optimizing your search performance.

Think of the cache layers like a pyramid: at the top (closest to users) are CDN caches for static content like autocomplete suggestions and popular category pages. These caches are large, geographically distributed, and have long TTLs because the content rarely changes. Below that sits your application cache (Redis or Memcached), storing full query results for popular searches. The application cache is centralized and faster to invalidate when content changes.

At the Elasticsearch level, two caches work in tandem: the Request Cache stores complete response payloads (aggregations, suggestions) at the shard level, while the Query Cache stores filter clause results as bitsets at the node level. Finally, the OS filesystem cache keeps frequently accessed index segments in memory. Each layer trades off hit rate, invalidation complexity, and storage efficiency differently—the key is configuring them to complement each other.
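The layered lookup described above follows a common read-through pattern. Here is a minimal sketch with hypothetical names, using plain dicts as stand-ins for the real layers (CDN, Redis, Elasticsearch caches): check each layer fastest-first, and on a hit or recompute, backfill the layers that missed.

```python
def layered_get(key, layers, compute):
    """Check each cache layer in order (fastest first); on a hit, backfill
    the faster layers that missed; on a full miss, run the query and
    populate every layer."""
    for i, layer in enumerate(layers):
        if key in layer:
            value = layer[key]
            for faster in layers[:i]:   # promote into the layers we missed
                faster[key] = value
            return value
    value = compute(key)                # full query execution (slowest path)
    for layer in layers:
        layer[key] = value
    return value

# Usage sketch: two in-memory layers standing in for app cache and request cache
app_cache, request_cache = {}, {}
result = layered_get("q:nike shoes", [app_cache, request_cache],
                     lambda k: {"hits": 42})
assert app_cache["q:nike shoes"] == {"hits": 42}   # both layers now populated
```

Real deployments differ in one important way: each layer has its own TTL and invalidation policy (per the table in this section), so backfilling must respect each layer's freshness rules rather than blindly copying values upward.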

Caching Layer Stack (top = closest to users)

1. CDN Cache: static results, autocomplete
2. Application Cache: Redis/Memcached, full query results
3. Request Cache: shard-level, aggregations
4. Query Cache: node-level, filter bitsets
5. Filesystem Cache: OS page cache, index segments
| Layer       | What It Caches                    | Scope   | TTL Strategy             |
|-------------|-----------------------------------|---------|--------------------------|
| CDN         | Static search pages, autocomplete | Global  | Time-based (1-24h)       |
| Application | Full query results                | Cluster | Time + event-based       |
| Request     | Aggregations, suggestions         | Shard   | Auto-invalidate on write |
| Query       | Filter bitsets                    | Node    | Segment-tied + LRU       |
| Filesystem  | Index segments                    | OS      | OS managed (LRU)         |

3. Elasticsearch Query Cache

The Query Cache (also called Node Query Cache) stores filter clause results as bitsets. When the same filter is used again, Elasticsearch can skip re-evaluating documents entirely and just use the cached bitset for set operations.

A bitset is an extremely compact data structure: one bit per document in the index. If your index has 100 million documents, a filter bitset only takes ~12 MB of memory while encoding the complete set of matching document IDs. When a cached filter is combined with other filters (AND, OR, NOT), Elasticsearch performs highly optimized bitwise operations (AND = &, OR = |) that complete in microseconds rather than iterating through posting lists.

The Query Cache is particularly valuable for e-commerce faceted search where the same filters appear constantly: "category:shoes", "size:10", "color:black", "brand:nike", "in_stock:true". Each of these filters gets cached as a bitset, and subsequent queries combining them just AND the bitsets together. A query like "shoes AND size:10 AND color:black" becomes three bitwise AND operations vs. three posting list intersections—orders of magnitude faster.
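The bitwise combination described above can be demonstrated in a few lines. This is an illustrative sketch, not Elasticsearch's internal code: it packs matching document IDs into a Python integer used as a bitset, so intersecting two filters is a single `&` operation.

```python
NUM_DOCS = 20  # tiny index for illustration; real bitsets cover millions of docs

def bitset(matching_doc_ids):
    """Pack a set of matching doc IDs into an integer bitset (1 bit per doc)."""
    bits = 0
    for doc_id in matching_doc_ids:
        bits |= 1 << doc_id
    return bits

category_shoes = bitset({1, 3, 4, 7, 8, 10})   # docs matching category:shoes
size_10 = bitset({3, 4, 8, 11, 14})            # docs matching size:10

both = category_shoes & size_10                # intersection in one operation
matches = [d for d in range(NUM_DOCS) if both >> d & 1]
print(matches)  # prints [3, 4, 8]
```

The memory math from the paragraph above follows directly: one bit per document means 100 million documents need 100,000,000 / 8 bytes, about 12.5 MB per cached filter (production engines additionally compress sparse bitsets).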

# How Query Cache Works: Filter → Bitset
Filter: category = "electronics" AND price < 500

Cached as bitset (1 bit per doc):
0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 ...
(Doc0=no, Doc1=yes, Doc2=no, Doc3=yes, ...)

Next query with same filter:
1. Hash filter clause → cache key
2. Cache hit → return bitset directly
3. Instant AND/OR with other filters
Configuration
# elasticsearch.yml
indices.queries.cache.size: 10%
# Per index
index.queries.cache.enabled: true

Default: 10% of heap, shared across all shards on the node

Caching Eligibility

✓ Term, range, geo_* filters: these deterministic filters produce stable results that benefit from caching, since the same filter always matches the same documents.
✓ Segments with >10K docs: small segments change frequently from merging, so Elasticsearch only caches filters on larger, more stable segments to avoid wasted cache churn.
✗ Queries with "now": time-based queries like "last 24 hours" produce different results every second, making cached results immediately stale and useless.
✗ Script queries: custom scripts may use external state or randomness, so Elasticsearch cannot guarantee the same script will produce the same results twice.
Cache Invalidation

Segment merge: cached bitsets for merged segments are discarded.
Document update: any write invalidates the affected segment's caches.
LRU eviction: least-recently-used entries are evicted when the cache is full.


4. Elasticsearch Request Cache

The Request Cache (Shard-Level Request Cache) stores entire search responses including aggregation results, suggestions, and hit counts. It's perfect for dashboard queries that run repeatedly.

Unlike the Query Cache (which stores filter bitsets), the Request Cache stores complete serialized response JSON at the shard level. The cache key is computed from the entire request body, so even small changes to the query produce cache misses. By default, it only caches requests with size: 0 (no document hits) because caching actual document content would consume too much memory and become stale quickly as documents are updated.

The Request Cache is automatically invalidated when the underlying data changes (via refresh). This makes it safe for dynamic content but means it works best when your data has natural update boundaries. For example, if you batch-import data once per hour, queries executed during that hour get 100% cache hits. The cache is particularly valuable for aggregation-heavy dashboards where computing facet counts across billions of documents is expensive but the counts don't need to be real-time.
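Because the cache key is derived from the whole request body, textually different but semantically identical requests can miss. This sketch is not Elasticsearch's internal implementation; it illustrates the idea by canonicalizing the body (sorted keys, no whitespace) before hashing, so field ordering no longer affects the key.

```python
import hashlib
import json

def request_cache_key(index: str, body: dict) -> str:
    """Canonicalize the request body, then hash it into a cache key."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{index}:{canonical}".encode()).hexdigest()

# Same query, different key order in the JSON body → same cache key
a = request_cache_key("products", {"size": 0, "aggs": {"c": {"terms": {"field": "category"}}}})
b = request_cache_key("products", {"aggs": {"c": {"terms": {"field": "category"}}}, "size": 0})
assert a == b
```

In practice Elasticsearch hashes the raw request bytes, which is why clients should emit requests in a stable serialized form to maximize hit rate.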

Aggregations: category counts, price histograms, date ranges (perfect for faceted navigation)
Suggestions: autocomplete, "did you mean", and completion suggesters
Hit counts: total match counts when size=0 (no document retrieval)

request_cache_example.json
// Enable request cache for aggregation query
POST /products/_search?request_cache=true
{
  "size": 0,  // No hits, just aggs
  "aggs": {
    "categories": {
      "terms": { "field": "category.keyword" }
    },
    "price_ranges": {
      "histogram": { "field": "price", "interval": 100 }
    }
  }
}
| Request Type         | Cached by Default | Notes                       |
|----------------------|-------------------|-----------------------------|
| size: 0 (aggs only)  | ✓ Yes             | Aggregations without hits   |
| size > 0             | ✗ No              | Queries returning documents |
| Contains "now"       | ✗ No              | Results change with time    |
| random_score         | ✗ No              | Non-deterministic           |

5. Cache Warming Strategies

A cold cache after deployment or cluster restart can cause severe performance degradation. Cache warming proactively populates caches before traffic arrives, ensuring users never experience the "cold start" penalty.

The "thundering herd" problem illustrates why warming matters: when a cache is empty and traffic suddenly arrives, every request becomes a cache miss simultaneously, overwhelming your search cluster with 100x the normal load. A single new server added to a load balancer without warming can bring down an entire cluster if traffic is high enough. Proper warming prevents this by ensuring cache hit rates are healthy before real traffic arrives.

The key to effective warming is replaying real query patterns from your analytics. Extract the top 1,000-10,000 most frequent queries from the past week, then replay them at a controlled rate (10-50 QPS) before routing production traffic to new instances. This populates caches with the exact queries that will drive most of your traffic, maximizing hit rate from the first real request.

When to Warm Caches

After deployment: new application instances start with empty caches; warm before adding to the load balancer.
After index rebuild: reindexing invalidates all caches; warm top queries after the rebuild completes.
Before traffic spikes: Black Friday, product launches, marketing campaigns; warm beforehand.
Off-peak maintenance: nightly cache refresh to prevent stale entries and maintain hit rates.

cache_warmer.py
import time
from elasticsearch import Elasticsearch

def warm_cache(es, top_queries, rate_per_second=10):
    """Warm caches without overwhelming the cluster."""
    interval = 1.0 / rate_per_second
    for query in top_queries:
        start = time.time()
        # Execute query (populates all cache layers)
        es.search(index="products", body=query)
        # Throttle to avoid cluster overload
        elapsed = time.time() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)

# Usage: warm top 1000 queries from analytics
top_queries = analytics.get_top_queries(limit=1000)
warm_cache(es, top_queries)
Pro Tip: Gradual Warming

Never blast all warming queries at once. Use rate limiting (10-50 QPS) to warm caches gradually without impacting ongoing traffic or triggering circuit breakers.


6. Cache Invalidation Strategies

"There are only two hard things in Computer Science: cache invalidation and naming things." (Phil Karlton) Choosing the right invalidation strategy is critical for balancing freshness with performance.

The fundamental tension is between freshness and hit rate. Aggressive invalidation (expiring entries quickly) keeps data fresh but reduces cache effectiveness. Lenient invalidation maximizes hit rate but risks serving stale data. The right balance depends entirely on your use case: a product catalog might tolerate 1-hour staleness, while stock prices need real-time accuracy.

Most production systems use hybrid strategies: short TTLs for volatile data combined with event-based invalidation for critical updates. For example, set a 5-minute TTL by default, but explicitly invalidate an item's cache when it's updated through your CMS. The TTL acts as a safety net catching any invalidation failures, while event-based invalidation ensures critical changes propagate immediately.
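The hybrid strategy above can be sketched as a small class. This is a minimal in-memory illustration, not a production client: a real system would use a Redis-like store (SETEX for the TTL path, DEL for the event path), but the mechanics are the same.

```python
import time

class HybridCache:
    """TTL as a safety net plus explicit event-based invalidation."""

    def __init__(self, default_ttl=300):
        self.store = {}          # key -> (value, expires_at)
        self.default_ttl = default_ttl

    def set(self, key, value, ttl=None):
        self.store[key] = (value, time.time() + (ttl or self.default_ttl))

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:    # TTL safety net catches missed events
            del self.store[key]
            return None
        return value

    def invalidate(self, key):
        """Event-based path, e.g. called from a CMS update hook."""
        self.store.pop(key, None)

cache = HybridCache(default_ttl=300)
cache.set("product:123", {"price": 99})
cache.invalidate("product:123")          # update event propagates immediately
assert cache.get("product:123") is None
```

The division of labor matches the paragraph above: `invalidate` delivers immediate freshness for known changes, while the 5-minute default TTL bounds the damage from any invalidation event that gets lost.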

Time-Based (TTL): entries expire after a fixed duration.
    cache.setex(key, 300, value)
  + Simple, predictable
  - Stale until TTL expires

Event-Based: invalidate when data changes.
    on_update → cache.delete(key)
  + Always fresh
  - Complex dependency tracking

Hybrid (Stale-While-Revalidate): return stale, refresh asynchronously.
    if stale: return cached + async_refresh()
  + Fast and eventually fresh
  - Brief staleness window
| Content Type          | Suggested TTL | Rationale |
|-----------------------|---------------|-----------|
| Static catalog search | 1 hour        | Product catalogs typically update through scheduled batch jobs rather than in real time, so hour-long cache windows rarely serve stale data. |
| News search           | 5 minutes     | Breaking news stories need to appear quickly, but a few minutes of delay is usually acceptable for search results. |
| Real-time data        | 30 seconds    | Stock prices, social feeds, and live events demand near-instant updates—cache only to absorb traffic bursts during spikes. |
| Autocomplete          | 1 hour        | Suggestion lists change infrequently since they're based on aggregate query patterns, not individual documents. |
| Aggregations/facets   | 15 minutes    | Facet counts (e.g., "42 items in category X") can lag slightly without confusing users—exact counts rarely matter. |

7. Cache Monitoring & Metrics

You can't optimize what you don't measure. Monitor cache hit rates, eviction rates, and memory usage to ensure your caching strategy is effective.

The most important metric is cache hit rate: (hits / (hits + misses)) × 100. A healthy search cache typically achieves 70-90% hit rate. If you're below 50%, your cache is likely undersized, TTLs are too short, or your queries are too diverse (long-tail dominated). Track this metric over time to detect problems: sudden drops usually indicate configuration changes, traffic pattern shifts, or cache capacity issues.

Eviction rate is your early warning system: high evictions mean the cache is undersized and frequently discarding entries to make room for new ones. This reduces hit rate and wastes the CPU time spent computing entries that get evicted before being reused. If evictions exceed 1-2% of your total entries per minute, consider increasing cache size or reducing TTLs to allow natural expiration instead.

cache_stats_api.sh
# Get cache stats from Elasticsearch
GET /_nodes/stats/indices/query_cache,request_cache

# Response snippet:
{
  "query_cache": {
    "memory_size_in_bytes": 52428800,
    "hit_count": 12000,
    "miss_count": 3000,
    "evictions": 500
  }
}
Target Hit Rate: >80% (hit_count / (hit_count + miss_count))
Eviction Rate: <5% (evictions / total_count)
Memory Usage: stay below the configured maximum

Hit Rate Calculation Example
hit_rate = hit_count / (hit_count + miss_count)
hit_rate = 12000 / (12000 + 3000) = 0.80 = 80%

8. Common Caching Pitfalls

Even experienced teams make caching mistakes. Here are the most common issues and how to avoid them.

1. Cache Stampede

When a cached entry expires, many requests simultaneously hit the backend to refresh it.

Solution:

Use stale-while-revalidate + distributed locking. Only one request refreshes while others get stale data.
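The single-flight refresh described in the solution can be sketched as follows. This is a simplified in-process version (a real deployment would use a distributed lock, e.g. in Redis): only the caller that wins the lock recomputes, while everyone else is served the stale value instead of piling onto the backend.

```python
import threading

class StampedeGuard:
    """Serve stale data to concurrent callers while one caller refreshes."""

    def __init__(self):
        self.value = None
        self.stale = True
        self.refresh_lock = threading.Lock()

    def get(self, compute):
        if not self.stale:
            return self.value
        if self.refresh_lock.acquire(blocking=False):   # single flight
            try:
                self.value = compute()                  # only this caller refreshes
                self.stale = False
            finally:
                self.refresh_lock.release()
            return self.value
        return self.value    # losers return stale data instead of piling up

guard = StampedeGuard()
guard.value, guard.stale = "stale-result", True
fresh = guard.get(lambda: "fresh-result")
assert fresh == "fresh-result"
```

The non-blocking `acquire` is the key design choice: a blocking lock would still queue every request behind the refresh, reproducing the stampede as a latency spike instead of a load spike.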

2. Cache Pollution

Low-value long-tail queries evict high-value head queries from limited cache space.

Solution:

Segment caches by query type. Separate caches for autocomplete, search, and aggregations.

3. Ignoring Personalization

Caching personalized results can leak data or show wrong recommendations.

Solution:

Include user segment (not user ID) in cache key. Cache per-segment, not per-user.
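The segment-keying advice above amounts to one rule: the coarse segment label goes into the cache key, the user ID never does. A minimal sketch with hypothetical segment names:

```python
def cache_key(query: str, user: dict) -> str:
    """Build a cache key scoped to a user segment, never an individual user."""
    segment = user.get("segment", "default")   # e.g. "premium", "new-visitor"
    return f"search:{segment}:{query}"

alice = {"id": "u-1842", "segment": "premium"}
bob = {"id": "u-9321", "segment": "premium"}

# Same segment → shared cache entry; user IDs never leak into keys.
assert cache_key("nike shoes", alice) == cache_key("nike shoes", bob)
```

Keeping cardinality low is the point: a handful of segments keeps hit rates high, whereas per-user keys would fragment the cache into millions of rarely-reused entries and reintroduce the data-leak risk through sheer key sprawl.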

Key Takeaways

01

Multi-Layer Caching

Production search uses multiple cache layers: CDN for static results, Application (Redis/Memcached) for full query results, Request Cache for aggregations, Query Cache for filter bitsets, and Filesystem Cache for index segments. Each layer serves different data with different TTLs.

02

Query Cache = Bitsets

Elasticsearch's Query Cache stores filter clause results as compact bitsets (1 bit per document). When the same filter is used again, the engine uses the cached bitset for instant set operations instead of re-evaluating documents. Auto-invalidated when segments merge or documents are updated.

03

Request Cache = Full Results

The Shard-Level Request Cache stores entire search responses including aggregation results, suggestions, and hit counts. By default, it only caches size:0 queries since returning actual document hits would bloat the cache.

04

Cache Warming Matters

A cold cache after deployment or cluster restart can overwhelm your backend since every request becomes a cache miss. Proactively warm caches by replaying your top queries (from analytics) before sending live traffic to new nodes.