Systems Atlas

Chapter 0.2

Problems This Guide Solves

The knowledge gaps that hold engineers back from building production-grade search systems.


Problem 1: "I don't know what I don't know"

Most engineers learn search by reading the Elasticsearch docs (they know how to add a doc), copy-pasting from Stack Overflow (they know what worked for someone else), and trial and error, so no systematic mental model ever forms.

🎯 Real example:

Engineer at a marketplace spends 2 weeks debugging "why results are random." Root cause: they indexed price as a string ("$49.99") instead of a number (49.99). Sorting by string gave: $10, $100, $20, $200.

Time wasted: 2 weeks. With proper knowledge: 5 minutes.
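The bug is easy to reproduce in a few lines. Lexicographic sorting compares strings character by character, so "$100" sorts before "$20" (a minimal sketch, not the marketplace's actual code):

```python
# Price indexed as a display string instead of a number.
prices = ["$10", "$100", "$20", "$200"]

# String sort: "$100" < "$20" because '1' < '2' at the third character.
assert sorted(prices) == ["$10", "$100", "$20", "$200"]

# The fix: strip the currency symbol and index the numeric value.
numeric = sorted(float(p.lstrip("$")) for p in prices)
assert numeric == [10.0, 20.0, 100.0, 200.0]
```

The same mismatch bites range filters too: `"$99" > "$1000"` as strings, so a "price under $100" filter silently misbehaves.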

This guide provides:

A structured curriculum from business problem to production system. No more "I didn't know I needed to think about that."

Problem 2: "The tutorial worked, production didn't"

Scenario: Engineer builds search for 10K products. Works great. Company grows to 10M products. Search breaks: indexing takes 12 hours, P99 latency is 2 seconds, ranking is random garbage.

🎯 Real example:

E-commerce startup used a single-node Elasticsearch cluster for 2 years. Black Friday traffic 10x'd. The node ran out of heap memory. Search went down for 4 hours.

Revenue lost: ~$200K. Prevention cost: ~$500/month for proper sharding.

Root cause: Didn't understand sharding strategies, segment merging, or feature engineering for ranking at scale.
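The core idea behind sharding fits in a few lines: route each document to a shard by a stable hash of its id, so load and data spread across nodes (a simplified sketch; real engines like Elasticsearch have their own routing formula, and the ids here are illustrative):

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard via a stable hash, so the same id
    always lands on the same shard across processes and restarts."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Changing num_shards changes the modulus, which remaps documents --
# this is why shard count is usually fixed at index creation time.
shards = {d: shard_for(d, 4) for d in ["sku-1", "sku-2", "sku-3"]}
```

A single-node cluster is effectively `num_shards = 1`: every query and every byte of heap lands on one machine, which is exactly what failed on Black Friday.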

This guide teaches:

How to think about scale from Day 1. Chapters 4, 13, and 14 are dedicated to this.

Problem 3: "My ML model is great but search sucks"

Scenario: ML team trains a state-of-the-art reranker. Offline NDCG: 0.85 (excellent). Online CTR: No improvement.

🎯 Real example:

Team spent 3 months building a BERT-based reranker. Deployed it. Performance dashboards told the story: latency went from 50ms to 800ms, and CTR dropped.

The problem: BERT can only rerank 100 docs in the latency budget. But retrieval was returning garbage in the top 100. Model was reranking trash.

The fix: Improve retrieval first (BM25 tuning + vector hybrid). Then the reranker actually had good candidates to work with.
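The pipeline shape behind that fix can be sketched in a few lines (function and parameter names are illustrative, not any engine's API). The key structural fact: the reranker's ceiling is set by what retrieval hands it.

```python
def search(query, corpus, retrieve, rerank, n_candidates=100, k=10):
    """Two-stage search: cheap retrieval narrows the corpus, then an
    expensive reranker orders only the surviving candidates."""
    # Stage 1: retrieval. Anything it misses is gone for good -- the
    # reranker never sees it, no matter how accurate the model is.
    candidates = retrieve(query, corpus)[:n_candidates]
    # Stage 2: rerank only the candidate set. This is where the latency
    # budget for a heavy model like BERT gets spent.
    return rerank(query, candidates)[:k]

# Toy usage with stand-in stages (a real system would use BM25 / a model):
retrieve = lambda q, docs: [d for d in docs if q in d]
rerank = lambda q, docs: sorted(docs, key=len)
hits = search("cat", ["the cat sat", "dogs bark", "cat"], retrieve, rerank)
```

If `retrieve` returns garbage, a perfect `rerank` just produces well-ordered garbage.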

This guide teaches:

End-to-end pipeline thinking. You can't rank what you don't retrieve. Chapters 5, 6, 7, and 8 cover this.

Problem 4: "Search is slow and I don't know why"

Scenario: Average latency is 50ms, but P99 is 800ms.

🎯 Common causes (from real incidents):

  • Cold shards: One replica hadn't been queried in hours. JVM needed to warm up.
  • Heavy aggregation: Faceting on a field with 10M unique values (colors and sizes were never normalized).
  • Cross-cluster timeout: One datacenter had a network blip, query waited for timeout.
  • GC pause: Heavy indexing load triggered garbage collection during query serving.
  • Slow disk: One node's SSD was degraded, pulling down the whole cluster's P99.
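The average-vs-P99 gap in the scenario above is easy to reproduce: a couple of slow stragglers barely move the mean but own the tail (made-up latencies, for illustration):

```python
# 100 requests: 98 healthy, 2 hit a cold shard or a GC pause.
latencies_ms = [34] * 98 + [800] * 2

avg = sum(latencies_ms) / len(latencies_ms)   # ~49 ms: dashboard looks fine
ordered = sorted(latencies_ms)
p99 = ordered[len(ordered) * 99 // 100]       # 800 ms: the real story
```

This is why latency SLOs are stated as percentiles, not averages: 2% of users here wait 800ms while the mean still reads "healthy."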

This guide teaches:

Systematic latency debugging by understanding internal architecture. Chapters 12, 14 cover caching and distributed systems.

Problem 5: "We keep breaking search with every release"

Scenario: PM says "Add this new field to ranking." Engineer adds it. Relevance drops 20%.

🎯 Real example:

Team added "product popularity" as a ranking signal. Seemed logical. But "popularity" was defined as "total sales all-time." Result: Old bestsellers (discontinued, out of stock) ranked #1. New products (actually available) were buried.

The fix: Use "sales velocity" (last 30 days) not "total sales." But also: Set up offline evaluation to catch this BEFORE deployment.
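The before/after signal can be sketched directly (hypothetical data; the point is that a trailing window bounds how long a stale bestseller keeps its boost):

```python
from datetime import date, timedelta

def sales_velocity(sale_dates, window_days=30, today=None):
    """Count sales inside a trailing window, rather than all-time,
    so a discontinued bestseller's signal decays once sales stop."""
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    return sum(1 for d in sale_dates if d > cutoff)

# A discontinued classic vs. a new product (hypothetical sale histories):
today = date(2024, 6, 1)
old_hit = [date(2022, 1, 1)] * 5000                            # huge total
new_item = [today - timedelta(days=i) for i in range(1, 20)]   # selling now

assert sales_velocity(old_hit, today=today) == 0    # decayed out
assert sales_velocity(new_item, today=today) == 19  # ranks on merit
```

Under "total sales all-time," `old_hit` would beat `new_item` 5000 to 19, forever.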

Root cause: No offline evaluation pipeline, no A/B testing framework, no guardrail metrics.

This guide teaches:

Safe deployment practices. Chapter 15 (Evaluation) and Chapter 16 (Analytics) are dedicated to this.

Problem 6: "I can't explain search to my stakeholders"

Scenario: VP asks "Why does adding synonyms take 3 sprints?" You know it's complex but can't articulate why.

🎯 The communication gap:

Synonyms aren't just a config file. You need to:
1. Decide: query-time or index-time expansion?
2. Build: a synonym management UI (who adds them?)
3. Validate: "laptop" → "notebook" is wrong in a stationery catalog (notebooks are paper)
4. Test: Do the synonyms actually improve relevance?
5. Monitor: Did recall go up? Did precision go down?
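Step 1 above is the pivotal decision. A minimal query-time expansion looks like this (the synonym map and names are illustrative); the trade-off is that query-time expansion can be rolled back instantly, while index-time expansion requires reindexing but keeps queries cheap:

```python
# Hypothetical hand-curated map -- step 2 above: someone has to own this.
SYNONYMS = {
    "couch": ["sofa"],
    "tv": ["television"],
}

def expand_query(terms, synonyms=SYNONYMS):
    """Query-time expansion: rewrite the query, leave the index alone.
    A bad synonym is fixed by editing the map, with no reindex."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return expanded

assert expand_query(["red", "couch"]) == ["red", "couch", "sofa"]
```

Even this toy version surfaces steps 3 through 5: every entry in the map needs validation, relevance testing, and monitoring.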

Without this context, it sounds like you're sandbagging.

This guide teaches:

The vocabulary and mental models to communicate with business stakeholders. Every chapter includes the "why" not just the "how."

What This Guide Unlocks

| Before | After |
| --- | --- |
| "Search is a black box" | Understand every layer: Query → Retrieval → Ranking → Serving |
| "It works on my laptop" | Design for 100M docs, 10K QPS from day one |
| "My model is accurate" | Understand where ML fits in the full pipeline |
| "I don't know why it's slow" | Systematic latency debugging with specific patterns |
| "We break relevance every release" | Offline evaluation + A/B testing + guardrails |
| "I can't explain this to my PM" | Vocabulary and frameworks for stakeholder communication |

Time This Guide Saves You

  • 100+ hrs: debugging production issues
  • 6 months: learning through trial and error
  • $100K+: avoiding costly mistakes