Systems Atlas

Chapter 4.3: Data Foundation

Document Modeling

The "Document" is the atomic unit of search. Getting the granularity and structure wrong is a one-way door that's expensive to fix. Unlike SQL normalization, search requires careful denormalization.


What is a "Document"?

In Search, a document is what you retrieve and rank as one unit. It is not necessarily a database row, a file, or a product. The golden rule of search modeling is simple but profound: Model based on what the user sees on the result card.

Document ≠ Database Row

A document is a ranking unit, not a storage unit. One database row can become:

  • 0 documents: filtered out (soft delete, banned)
  • 1 document: 1:1 mapping (the standard case)
  • 100 documents: variants, time-series splits
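The 0/1/N expansion can be sketched as a transform at ingestion time. This is a minimal illustration; the row fields (`deleted`, `variants`) and the document shape are hypothetical, not a real pipeline's schema:

```python
# Sketch: one DB row can map to 0, 1, or N search documents.
def row_to_documents(row: dict) -> list[dict]:
    """Expand a product row into ranking units."""
    # 0 documents: filtered out entirely
    if row.get("deleted") or row.get("banned"):
        return []
    variants = row.get("variants")
    # 1 document: standard 1:1 mapping
    if not variants:
        return [{"doc_id": row["id"], "title": row["title"]}]
    # N documents: one per variant
    return [
        {"doc_id": f"{row['id']}_{v['sku']}", "title": row["title"], **v}
        for v in variants
    ]

print(row_to_documents({"id": "p1", "title": "Tee", "deleted": True}))  # []
print(len(row_to_documents({"id": "p2", "title": "Tee",
                            "variants": [{"sku": "red_s"},
                                         {"sku": "red_m"}]})))          # 2
```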

Examples:

  • Amazon: Product vs Offer vs Seller (3 different doc types)
  • Zomato: Restaurant vs Outlet vs Menu Item
  • YouTube: Video vs Video+TimeMarker (Chapter search)

Example: The E-commerce Dilemma

[Image: result card for "iPhone 15 Pro Max" with ⭐⭐⭐⭐⭐ (2,341 reviews), $999 - $1,499, 6 colors available]

Question: is the document a Product or a SKU?

If Document = Product:
  • 1 doc ("iPhone 15")
  • Clean results (1 card)
If Document = SKU:
  • 6 docs ("iPhone 15 Red", "iPhone 15 Blue", ...)
  • Cluttered results (6 cards for the same phone)

Granularity Strategy

Let's work through a concrete example: a "T-shirt" product. It has a parent entity ("Classic Crew Tagless") and 6 concrete variants (Red S, Red M, Red L, Blue S, Blue M, Blue L). There are three distinct ways to model this in Lucene, each with major trade-offs for query accuracy and result display.

The Cardinality Trap

Before choosing, you must understand cardinality. High cardinality fields (SKUs, IDs, Timestamps) are expensive in search engines.

  • Low cardinality (safe): brands, categories, cities. Cacheable; fast aggregations.
  • High cardinality (risk): SKUs, user IDs, timestamps. Explodes memory; breaks caches.
Rule: If Field Cardinality ≈ Document Count, treat it as a performance risk.
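The rule can be sketched as a pre-flight check over a sample of documents. The 0.5 threshold and field names below are illustrative assumptions, not engine defaults:

```python
# Sketch: flag fields whose distinct-value count approaches the doc count.
def cardinality_risk(docs: list[dict], field: str, ratio: float = 0.5) -> bool:
    """True if distinct values of `field` are close to the document count."""
    values = {d[field] for d in docs if field in d}
    return len(values) >= ratio * len(docs)

docs = [{"sku": f"sku_{i}", "brand": "Acme" if i % 2 else "Zen"}
        for i in range(1000)]
print(cardinality_risk(docs, "sku"))    # True: high cardinality, risky
print(cardinality_risk(docs, "brand"))  # False: low cardinality, safe
```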
Option A: Product-Level Document (The "Flat" Approach) [Cross-Match Risk ⚠]

The Concept

We treat the abstract "Product" as the document. All attributes from all variants are flattened into arrays. This creates a "Soup of Attributes". We lose the knowledge of which color belongs to which size.

// One document represents the entire product
{
  "product_id": "tshirt_001",
  "title": "Classic Cotton Crew",
  "price_range": { "min": 29.99, "max": 34.99 },
  // Flattened Soup ⚠
  "colors": ["Red", "Blue"],
  "sizes": ["S", "M", "L"],
  "skus_in_stock": ["red_s", "red_m", "blue_l"]
}
The "Cross-Match" Problem

If a user searches for "Blue Small":

  • Query checks: Does doc have "Blue"? YES.
  • Query checks: Does doc have "Small"? YES.
  • Result: Matches!

Reality: We might only have "Blue Large" in stock. You just showed a false positive result that the user cannot buy, leading to bounce/frustration.
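The false positive can be reproduced in a few lines. This mirrors the flattened document above, with independent per-field membership checks standing in for how term queries behave against parallel arrays:

```python
# Sketch: why flattened attribute arrays produce cross-matches.
doc = {
    "colors": ["Red", "Blue"],
    "sizes": ["S", "M", "L"],
    "skus_in_stock": ["red_s", "red_m", "blue_l"],  # no blue_s!
}

def flat_match(doc, color, size):
    # Each field is checked independently -- the correlation is lost.
    return color in doc["colors"] and size in doc["sizes"]

print(flat_match(doc, "Blue", "S"))      # True: a false positive
print("blue_s" in doc["skus_in_stock"])  # False: not actually sellable
```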

Option B: SKU-Level Document (Total Denormalization) [Pollution Risk ⚠]

The Concept

We go to the other extreme. Every physical SKU becomes its own completely independent document. We respect the physical reality of the inventory.

// Document 1
{ "sku": "red_s", "title": "Cotton Crew", "color": "Red", "size": "S" }
// Document 2
{ "sku": "red_m", "title": "Cotton Crew", "color": "Red", "size": "M" }
// Document 3 (The only match for Blue L)
{ "sku": "blue_l", "title": "Cotton Crew", "color": "Blue", "size": "L" }
// ... repeated 6 times for 6 variants
The Result Pollution Problem

A general search for "Cotton T-shirt" will match ALL 6 documents.

1. Cotton Crew (Red S)
2. Cotton Crew (Red M)
3. Cotton Crew (Red L)
4. Cotton Crew (Blue S)
...

This "variant explosion" pushes other relevant products to Page 2. Users feel they are seeing duplicates and assume you have no variety.

Option C: SKU-Level + Field Collapsing (The Hybrid) [Best Practice ✅]

We index Option B (SKUs) to get perfect filtering, but we ask the query engine to "collapse" results by `product_id`. This means: "Find all matching SKUs, but only show me the best-matching SKU per product."

1. The Collapsing Query

// Find "Blue Shirt"
{
"query": { "match": { "title": "blue shirt" } },
// Grouping Magic
"collapse": {
"field": "product_id",
"inner_hits": {
"name": "other_variants",
"size": 5
}
}
}

2. The Structured Result

// Returns ONE hit per product
{
  "product_id": "tshirt_001",
  "sku": "blue_l", // Best match!
  "title": "Cotton Crew",
  // Other variants tucked inside
  "inner_hits": {
    "other_variants": [
      { "sku": "blue_m" },
      { "sku": "blue_s" }
    ]
  }
}
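Semantically, collapsing amounts to grouping scored hits by a key and keeping the best hit per group. A minimal Python sketch of that behavior (not Elasticsearch's actual per-shard implementation); the scores and hits are illustrative:

```python
# Sketch: group SKU hits by product_id, keep the best-scoring hit per
# group as the representative, tuck runners-up into inner_hits.
def collapse(hits, field="product_id", inner_size=5):
    groups: dict = {}
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        key = hit[field]
        if key not in groups:
            groups[key] = {**hit, "inner_hits": []}       # best hit leads
        elif len(groups[key]["inner_hits"]) < inner_size:
            groups[key]["inner_hits"].append(hit["sku"])  # runners-up
    return list(groups.values())

hits = [
    {"product_id": "tshirt_001", "sku": "blue_l", "score": 9.1},
    {"product_id": "tshirt_001", "sku": "blue_m", "score": 7.4},
    {"product_id": "hoodie_002", "sku": "blue_s", "score": 6.0},
]
result = collapse(hits)
print(len(result))       # 2: one card per product
print(result[0]["sku"])  # blue_l: best match represents the group
```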

Performance Trade-offs

Approach | Query Time (100k docs) | Why?
No collapse (Option B) | 15 ms | Pure streaming of results; no post-processing.
Collapse (Option C) | 35 ms (+133%) | Engine must group thousands of hits by ID in memory.
Collapse + Inner Hits | 55 ms (+267%) | Fetching extra data for every group bucket.

When Collapsing Breaks

Collapsing is a UX hack, not a true data model. It works for showing "Top N", but fails at scale.

1. Pagination Hell

"Page 2" is unstable because groups are calculated dynamically. A product on Page 1 might shift to Page 2 on refresh if scores drift slightly.

2. Sorting Costs

Collapsing + Custom Sort (e.g. by Price) forces the engine to load ALL variants into heap memory to find the "min(price)" representative.

3. Recall Accuracy

The "Best SKU" chosen to represent the group might not be the best visual match for the user's aesthetic, just the highest BM25 score.

4. Distributed Cost

Every shard must perform the collapse independently, then send full groups to the coordinator node for a second merge.

Modeling Relationships

Search engines are NoSQL stores. They hate joins. When you need to model relationships (Products have Variants, Authors have Books), you must choose between three patterns, trading off between index performance, query performance, and flexibility.

1. Flat Modeling (The "NoSQL" Default) [Fastest Read/Write ⚡]

The Mechanism

Arrays of objects are "flattened" into parallel arrays of values. Internally, Lucene has no concept of "objects inside objects". It just sees a bag of values for each field.

// How you send it:
"authors": [
  { "first": "John", "last": "Smith" },
  { "first": "Alice", "last": "White" }
]
// How Lucene stores it:
"authors.first": ["John", "Alice"]
"authors.last": ["Smith", "White"]

The "Loss of Correlation" Trap

Because the connection between "Alice" and "White" is broken, a query for author="Alice Smith" will MATCH:

  • Does the doc contain "Alice"? Yes.
  • Does the doc contain "Smith"? Yes.
  • Result: false positive.
Best For: Simple tags, flags, or attributes where combination doesn't matter (e.g. `tags: ["new", "sale"]`).
2. Nested Objects (The "Hidden Docs") [Correctness ✅]

The Mechanism

Elasticsearch tricks Lucene by indexing each object as a separate, hidden "micro-document" right next to the parent. This preserves boundaries.

// Lucene Segment Layout
Doc 1: { "first": "John", "last": "Smith" } (Hidden)
Doc 2: { "first": "Alice", "last": "White" } (Hidden)
Doc 3: { "title": "Book Name", ... } (Parent)

The Update Penalty

Crucial: Because Lucene segments are immutable, you cannot update just one child. If you change "John" to "Jon", Elasticsearch must re-index the parent + ALL children.

Cost = O(N) where N is number of nested objects.
Best For: Static variants (colors/sizes) that rarely change but need strict filtering.
Nested Query Cost Model

Lucene executes these as a Block Join. It iterates the hidden children to find matches, then walks up to the parent. While reads are fast, large nested arrays (e.g. 1000s of comments) break segment merging because the entire block must move together.
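The block-join idea can be sketched with a flat list standing in for a segment: children are laid out directly before their parent, so a child match is resolved by walking forward to the next parent marker. The `_child` flag below is an illustrative stand-in for Lucene's internal parent bitset:

```python
# Sketch: a segment as a list of docs; children precede their parent.
segment = [
    {"first": "John",  "last": "Smith", "_child": True},
    {"first": "Alice", "last": "White", "_child": True},
    {"title": "Book Name", "_child": False},  # parent closes the block
]

def block_join(segment, pred):
    """Return parents that have at least one child matching pred."""
    parents, pending = [], False
    for doc in segment:
        if doc["_child"]:
            pending = pending or pred(doc)
        else:                          # parent: resolve the block
            if pending:
                parents.append(doc["title"])
            pending = False
    return parents

# Correlated query: first AND last must match on the SAME child
print(block_join(segment,
                 lambda c: c["first"] == "Alice" and c["last"] == "White"))
# ['Book Name']
print(block_join(segment,
                 lambda c: c["first"] == "Alice" and c["last"] == "Smith"))
# [] -- no single child matches both, so no cross-match
```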

3. Parent-Child (The "Join") [Update Speed 🚀]

The Mechanism

Documents are fully independent but live on the same shard. A mapped "Join Field" links them. The "Join" happens at query time, not index time.

// Independent Documents
PUT /products/_doc/1
{ "title": "MacBook", "join": "product" }
PUT /products/_doc/2?routing=1
{ "price": 2000, "join": { "name": "variant", "parent": "1" } }

The Query Latency Tax

Because the link is resolved at query time, these queries are 10x slower than nested queries. They also consume significant Heap Memory for "Global Ordinals" to track the relationships.

Best For: High-churn data. Example: "Offer Prices" that change every minute for thousands of sellers.
The Routing Constraint

Parent and Children must reside on the same shard. You MUST provide a routing key (parent ID).
Risk: Bad routing keys lead to "Hot Shards" where one shard handles 10x traffic. Rebalancing this is painful.

Performance Comparison

Operation | Flat | Nested | Parent-Child
Query (100K docs) | 15 ms | 20 ms | 150 ms
Update Parent | 5 ms | 5 ms | 5 ms
Update Child | 5 ms | 50 ms (rewrites all children) | 5 ms
Memory Overhead | Low | Medium | High (global ordinals)

Update Strategies & Reindexing Cost

1. Full Rebuild

Delete index, Create new, Ingest all.

Expensive / Simple
Use for: Schema changes, Mapper parsing exceptions.
2. Partial Update

Send only changed fields.

"Fake" In-Place
Lucene still does (Delete + Insert) under the hood. High segment churn.
3. Alias Swap

Build Index B in the background → Point Alias from A to B

Zero Downtime
The industry standard for schema migrations.

The Denormalization Tax

In SQL, we normalize to reduce redundancy (store "Brand Name" once). In Search, we denormalize for speed (copy "Brand Name" into every product). This creates a massive storage footprint.

Why is the Index 5x larger than DB?

  • Inverted Index: Terms for searching
  • Doc Values: Columns for sorting
  • Stored Fields: JSON for retrieval
  • Replicas: 1 copy = 2x storage
# Storage Breakdown (100 MB Raw Data)
Raw Data             100 MB
+ Inverted Index      80 MB
+ Doc Values          40 MB
+ Stored Fields      100 MB
Total (Primary)      320 MB
+ 1 Replica          320 MB
Cluster Footprint    640 MB (6.4x)
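The multiplier is just arithmetic over the overhead ratios above (which vary widely with mappings and analyzers); a sketch:

```python
# Sketch: the denormalization tax, with overhead expressed as ratios of
# raw size (inverted index ~0.8x, doc values ~0.4x, stored fields ~1.0x).
def cluster_footprint(raw_mb, replicas=1,
                      inverted=0.8, doc_values=0.4, stored=1.0):
    primary = raw_mb * (1 + inverted + doc_values + stored)
    return primary * (1 + replicas)

print(round(cluster_footprint(100), 1))  # 640.0 MB: a 6.4x multiplier
```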

The Hidden Cost: Consistency vs Freshness

Search engines are Eventually Consistent. They are not transactionally correct like Postgres.
Ingestion pipelines are async. There is always a lag between "User Buy" and "Index Update".

Scenario A: The "Ghost" Stock

Item is sold out in DB, but Search Index still says "In Stock". User clicks buy → Error.
Fix: Check DB availability at checkout.

Scenario B: Price Mismatch

Search shows $19.99 (old cache), Product Page shows $25.00.
Fix: Frequent small-batch updates (NRT).
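Both fixes boil down to treating the index as a hint and the primary store as the source of truth. A sketch of the checkout-time re-check; `db_in_stock` is a hypothetical lookup against the primary database:

```python
# Sketch: search said "in stock", but the index may lag the DB.
def checkout(sku, db_in_stock):
    # Re-verify against the source of truth before charging the user.
    if not db_in_stock(sku):
        return "SOLD_OUT"
    return "OK"

stock = {"blue_l": 0, "red_m": 3}  # DB state after the async lag
print(checkout("blue_l", lambda s: stock.get(s, 0) > 0))  # SOLD_OUT
print(checkout("red_m",  lambda s: stock.get(s, 0) > 0))  # OK
```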

Real Production Failure Stories

1. The Variant Explosion

An e-commerce site indexed every SKU (Option B) without collapsing.

Result: "Nike Shirt" returned 400 results (All Nike variants), burying Adidas completely. Relevance score dropped 80%.
2. The Nested Throughput Death

A log platform used Nested Objects for 'tags' on high-velocity logs.

Result: Indexing rate dropped 90% because every log update forced Lucene to rewrite the entire block of nested segments.
3. The Parent-Child Latency

A job board used Parent-Child for Company → Jobs.

Result: At 50M docs, queries took 800ms (p99) due to global ordinal loading. They had to switch to Denormalized Flat docs.

Industry Modeling Playbook

1. E-Commerce (Amazon/Shopify)
  • Model: Option C (SKU Docs + Collapse)
  • Why: Need precise filtering (Size/Color) but grouped display.
  • Trade-off: Higher query latency, better UX.
2. Content/Media (Netflix/Spotify)
  • Model: Option A (Flat Product/Show)
  • Why: "Variations" (Episodes) are rarely searched independently by attributes.
  • Trade-off: Fast queries; duplicate result cards are impossible by construction.
3. SaaS/B2B (Salesforce/Jira)
  • Model: Parent-Child (Account → Ticket)
  • Why: Access controls (Parent) apply to all tickets. Tickets change constantly.
  • Trade-off: Slow queries, instant security updates.

Decision Framework

How do you choose? Answer these five questions to pick the correct model.

Question | If YES... | If NO...
1. Are there many variations of the same item? | Consider SKU + Collapsing | Model as Product
2. Do filters need strictly correlated attributes? (e.g. Red MUST be Size L) | Need Nested or Parent-Child | Flat arrays are faster
3. Is your update rate extremely high? (e.g. real-time inventory) | Parent-Child (update child only) | Nested (updates are expensive)
4. More than 1,000 variants per product? | Parent-Child or Split Indices | Nested fits in a block
5. Need stable pagination across pages? | Avoid Collapse (use Flat/Nested) | Collapse is acceptable (top-N display)

Key Takeaways

01

Model for the Result Card

The 'Document' is what you show the user, not necessarily a database row. Often, 10 DB rows = 1 Search Document.

02

The Denormalization Tax

Search indices are 5x larger than the source DB due to inverted indices, doc values, and replicas. Plan storage accordingly.

03

Choose Your Granularity

Use Field Collapsing for 'SKU-level precision with Product-level display'. Avoid Nested Objects for high-churn data.

04

Updates are Expensive

In Lucene, every update is a Delete + Insert. Partial updates are a lie. Optimize for write throughput.

Mental Model Summary: The 3 Axes

🎯

Relevance Correctness

Does "Red Small" actually match a Red Small item? (Nested/SKU wins here)

⚡

Query Latency

How fast does the search page load? (Flat wins here)

🔄

Update Cost

How fast can inventory sync? (Parent-Child wins here)

"You can only pick 2. Choose wisely."