Chapter 4.3: Data Foundation
Document Modeling
The "Document" is the atomic unit of search. Getting the granularity and structure wrong is a one-way door that's expensive to fix. Unlike SQL normalization, search requires careful denormalization.
What is a "Document"?
In Search, a document is what you retrieve and rank as one unit. It is not necessarily a database row, a file, or a product. The golden rule of search modeling is simple but profound: Model based on what the user sees on the result card.
Document ≠ Database Row
A document is a ranking unit, not a storage unit. One database row can become many documents, and several rows can merge into one document.
Examples:
- Amazon: Product vs Offer vs Seller (3 different doc types)
- Zomato: Restaurant vs Outlet vs Menu Item
- YouTube: Video vs Video+TimeMarker (Chapter search)
Example: The E-commerce Dilemma
Granularity Strategy
Let's analyze a "T-Shirt" product from first principles. It has a parent entity ("Classic Crew Tagless") and 6 concrete variants (Red S, Red M, Red L, Blue S, Blue M, Blue L). We have three distinct ways to model this in Lucene, each with massive trade-offs for query accuracy and result display.
The Cardinality Trap
Before choosing, you must understand cardinality. High cardinality fields (SKUs, IDs, Timestamps) are expensive in search engines.
Option A: One Flattened Product Document
The Concept
We treat the abstract "Product" as the document. All attributes from all variants are flattened into arrays. This creates a "Soup of Attributes". We lose the knowledge of which color belongs to which size.
If a user searches for "Blue Small":
- Query checks: Does doc have "Blue"? YES.
- Query checks: Does doc have "Small"? YES.
- Result: Matches!
Reality: We might only have "Blue Large" in stock. You just showed a false positive result that the user cannot buy, leading to bounce/frustration.
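The false positive above can be reproduced in a few lines. A minimal sketch, with hypothetical data, of how independent checks against flattened arrays admit variant combinations that don't exist:

```python
# Option A flattens every variant's attributes into parallel arrays on one doc.
product = {
    "title": "Classic Crew Tagless",
    "colors": ["Red", "Blue"],   # union across all variants
    "sizes": ["S", "M", "L"],    # union across all variants
}

# Actual stock: only Blue L remains.
in_stock_variants = [("Blue", "L")]

def matches(doc, color, size):
    # Each filter is checked independently against the flattened arrays,
    # so cross-variant combinations pass.
    return color in doc["colors"] and size in doc["sizes"]

print(matches(product, "Blue", "S"))        # True: a false positive
print(("Blue", "S") in in_stock_variants)   # False: the user cannot buy it
```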
Option B: One Document per SKU
The Concept
We go to the other extreme. Every physical SKU becomes its own completely independent document. We respect the physical reality of the inventory.
A general search for "Cotton T-shirt" will match ALL 6 documents:
1. Cotton Crew (Red S)
2. Cotton Crew (Red M)
3. Cotton Crew (Red L)
4. Cotton Crew (Blue S)
...
This "variant explosion" pushes other relevant products to Page 2. Users feel they are seeing duplicates and assume you have no variety.
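A sketch of the explosion with hypothetical data: every SKU matches the generic query, so a single product floods the first page of results.

```python
# Option B: every SKU is its own doc, so a broad query matches all of them.
skus = [
    {"product_id": "crew-1", "title": "Cotton Crew T-Shirt", "color": c, "size": s}
    for c in ("Red", "Blue") for s in ("S", "M", "L")
]
other_products = [{"product_id": f"tee-{i}", "title": "Cotton T-Shirt"} for i in range(10)]

hits = [d for d in skus + other_products if "Cotton" in d["title"]]
page_1 = hits[:10]  # assume a page size of 10

print(len(page_1))                                            # 10
print(sum(1 for d in page_1 if d["product_id"] == "crew-1"))  # 6: the same shirt six times
```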
Option C: SKU Documents + Field Collapsing
We index Option B's SKU documents to get perfect filtering, but ask the Query Engine to "Collapse" results by `product_id`. This means: "Find all matching SKUs, but only show the best-matching SKU per product."
1. The Collapsing Query
2. The Structured Result
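The collapsing query itself is engine-specific (Elasticsearch exposes it as the `collapse` option on a search request), so here is a plain-Python sketch of the grouping semantics: keep only the best-scoring SKU per `product_id`.

```python
# Hypothetical scored hits, as a shard might return them.
hits = [
    {"product_id": "crew-1", "sku": "crew-1-blue-s", "score": 9.1},
    {"product_id": "crew-1", "sku": "crew-1-red-m",  "score": 8.7},
    {"product_id": "slim-2", "sku": "slim-2-blk-l",  "score": 8.9},
]

best = {}
for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
    best.setdefault(hit["product_id"], hit)  # first seen per group = highest score

collapsed = sorted(best.values(), key=lambda h: h["score"], reverse=True)
print([h["sku"] for h in collapsed])  # ['crew-1-blue-s', 'slim-2-blk-l']
```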
Performance Trade-offs
| Approach | Query Time (100K docs) | Why? |
|---|---|---|
| No collapse (Option B) | 15ms | Pure streaming of results. No post-processing. |
| Collapse (Option C) | 35ms (+133%) | Engine must group 1000s of hits by ID in memory. |
| Collapse + Inner Hits | 55ms (+267%) | Fetching extra data for every group bucket. |
When Collapsing Breaks
Collapsing is a UX hack, not a true data model. It works for showing "Top N", but fails at scale:
- Unstable pagination: "Page 2" is unstable because groups are calculated dynamically. A product on Page 1 might shift to Page 2 on refresh if scores drift slightly.
- Custom sorts: Collapsing + Custom Sort (e.g. by Price) forces the engine to load ALL variants into heap memory to find the "min(price)" representative.
- Wrong representative: The "Best SKU" chosen to represent the group might not be the best visual match for the user's aesthetic, just the highest BM25 score.
- Distributed merge: Every shard must perform the collapse independently, then send full groups to the coordinator node for a second merge.
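The pagination instability is easy to simulate. A sketch, with hypothetical scores, of a group sliding from Page 1 to Page 2 after a small score drift between requests:

```python
# Group order depends on each group's current best score, recomputed per request.
def paginate(groups, page_size=2):
    ordered = sorted(groups, key=lambda g: g["best_score"], reverse=True)
    return ([g["id"] for g in ordered[:page_size]],
            [g["id"] for g in ordered[page_size:]])

groups = [{"id": "A", "best_score": 9.0},
          {"id": "B", "best_score": 8.9},
          {"id": "C", "best_score": 8.8}]

print(paginate(groups))         # (['A', 'B'], ['C'])

groups[1]["best_score"] = 8.7   # slight score drift on refresh
print(paginate(groups))         # (['A', 'C'], ['B']): B slid to page 2
```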
Modeling Relationships
Search engines are NoSQL stores. They hate joins. When you need to model relationships (Products have Variants, Authors have Books), you must choose between three patterns, trading off between index performance, query performance, and flexibility.
Pattern 1: Flat Objects (the Default)
The Mechanism
Arrays of objects are "flattened" into parallel arrays of values. Internally, Lucene has no concept of "objects inside objects". It just sees a bag of values for each field.
The "Loss of Correlation" Trap
Suppose a document's authors are Alice White and John Smith. Because the connection between "Alice" and "White" is broken, a query for author="Alice Smith" will MATCH:
- Does the doc have "Alice"? Yes.
- Does the doc have "Smith"? Yes.
- Result: False Positive.
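A sketch of the flattening step itself, using hypothetical authors Alice White and John Smith to match the names in this section:

```python
# An array of objects, as the application sees it.
doc = {"authors": [
    {"first": "Alice", "last": "White"},
    {"first": "John",  "last": "Smith"},
]}

# Lucene's flat mapping turns it into parallel bags of values per field path.
flattened = {
    "authors.first": [a["first"] for a in doc["authors"]],
    "authors.last":  [a["last"]  for a in doc["authors"]],
}

# The pairing (Alice/White, John/Smith) is gone, so a cross-object query matches.
query_matches = ("Alice" in flattened["authors.first"]
                 and "Smith" in flattened["authors.last"])
print(query_matches)  # True, yet no author "Alice Smith" exists
```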
Pattern 2: Nested Objects
The Mechanism
Elasticsearch tricks Lucene by indexing each object as a separate, hidden "micro-document" right next to the parent. This preserves boundaries.
The Update Penalty
Crucial: Because Lucene segments are immutable, you cannot update just one child. If you change "John" to "Jon", Elasticsearch must re-index the parent + ALL children.
Lucene executes these as a Block Join. It iterates the hidden children to find matches, then walks up to the parent. While reads are fast, large nested arrays (e.g. 1000s of comments) break segment merging because the entire block must move together.
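By contrast, nested semantics require the query to match within a single object. A sketch of that behavior:

```python
# Each child object is evaluated as its own unit, preserving field correlation.
doc = {"authors": [
    {"first": "Alice", "last": "White"},
    {"first": "John",  "last": "Smith"},
]}

def nested_match(doc, first, last):
    # The whole predicate must hold inside ONE object, as in a nested query.
    return any(a["first"] == first and a["last"] == last for a in doc["authors"])

print(nested_match(doc, "Alice", "Smith"))  # False: correlation preserved
print(nested_match(doc, "John", "Smith"))   # True
```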
Pattern 3: Parent-Child (Join Field)
The Mechanism
Documents are fully independent but live on the same shard. A mapped "Join Field" links them. The "Join" happens at query time, not index time.
{ "title": "MacBook", "join": "product" }
{ "price": 2000, "join": { "name": "variant", "parent": "1" } }
The Query Latency Tax
Because the link is resolved at query time, these queries are 10x slower than nested queries. They also consume significant Heap Memory for "Global Ordinals" to track the relationships.
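A sketch of the query-time join, mirroring the MacBook documents above (Elasticsearch's `has_child` query works on this principle: match children, then resolve their parents):

```python
# Parents and children are fully independent docs, linked only by a parent ID.
parents = {"1": {"title": "MacBook"}}
children = [
    {"parent": "1", "price": 2000},
    {"parent": "1", "price": 2400},
]

# has_child-style query: find parents with at least one child where price < 2200.
matching_parent_ids = {c["parent"] for c in children if c["price"] < 2200}
results = [parents[pid] for pid in matching_parent_ids]
print(results)  # [{'title': 'MacBook'}]
```

Because this resolution happens per query rather than at index time, the engine pays the join cost on every request, which is the source of the latency tax described above.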
Parent and Children must reside on the same shard. You MUST provide a routing key (parent ID).
Risk: Bad routing keys lead to "Hot Shards" where one shard handles 10x traffic. Rebalancing this is painful.
Performance Comparison
| Operation | Flat | Nested | Parent-Child |
|---|---|---|---|
| Query (100K docs) | 15ms | 20ms | 150ms |
| Update Parent | 5ms | 5ms | 5ms |
| Update Child | 5ms | 50ms (all children) | 5ms |
| Memory Overhead | Low | Medium | High (global ordinals) |
Update Strategies & Reindexing Cost
- Full Reindex: Delete index, Create new, Ingest all.
- Partial Update: Send only changed fields.
- Alias Swap: Build Index B, then point Alias A → B atomically.
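The alias swap can be pictured as a pointer update. A sketch with hypothetical index names:

```python
# Readers always query the alias, never a concrete index name.
indices = {"products_v1": ["old docs"], "products_v2": ["new docs"]}
aliases = {"products": "products_v1"}

# Build products_v2 in the background, then atomically repoint the alias.
aliases["products"] = "products_v2"
print(indices[aliases["products"]])  # ['new docs']
```

Readers never see a half-built index: they query `products_v1` until the single pointer flip, then `products_v2`.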
The Denormalization Tax
In SQL, we normalize to reduce redundancy (store "Brand Name" once). In Search, we denormalize for speed (copy "Brand Name" into every product). This creates a massive storage footprint.
Why is the Index 5x larger than the DB?
- Inverted Index: Terms for searching
- Doc Values: Columns for sorting
- Stored Fields: JSON for retrieval
- Replicas: 1 copy = 2x storage
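Back-of-the-envelope arithmetic behind the 5x figure, using illustrative multipliers (not measurements, which vary by mapping and data):

```python
# Each structure stores another copy of (part of) the data.
source_gb = 10
inverted_index = 1.00 * source_gb  # searchable terms
doc_values     = 0.75 * source_gb  # columnar copies for sorting/aggregations
stored_fields  = 0.75 * source_gb  # original JSON kept for retrieval

primary_gb = inverted_index + doc_values + stored_fields
total_gb = primary_gb * 2          # 1 replica doubles everything
print(total_gb / source_gb)        # 5.0
```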
The Hidden Cost: Consistency vs Freshness
Search engines are Eventually Consistent. They are not transactionally correct like Postgres.
Ingestion pipelines are async. There is always a lag between "User Buy" and "Index Update".
Item is sold out in DB, but Search Index still says "In Stock". User clicks buy → Error.
Fix: Check DB availability at checkout.
Search shows $19.99 (old cache), Product Page shows $25.00.
Fix: Frequent small-batch updates (NRT).
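Both fixes share the same shape: treat the index as a hint and the database as the source of truth. A sketch with a hypothetical `db_lookup`:

```python
# The search hit may be stale; re-verify critical fields before committing.
search_result = {"sku": "crew-1-blue-s", "in_stock": True, "price": 19.99}

def db_lookup(sku):
    # Hypothetical source-of-truth read; here the item just sold out and was repriced.
    return {"in_stock": False, "price": 25.00}

truth = db_lookup(search_result["sku"])
stale = (not truth["in_stock"]) or (truth["price"] != search_result["price"])
print(stale)  # True: re-quote or block checkout instead of failing the order
```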
Real Production Failure Stories
- An e-commerce site indexed every SKU (Option B) without collapsing.
- A log platform used Nested Objects for 'tags' on high-velocity logs.
- A job board used Parent-Child for Company → Jobs.
Industry Modeling Playbook
E-commerce Catalog:
- Model: Option C (SKU Docs + Collapse)
- Why: Need precise filtering (Size/Color) but grouped display.
- Trade-off: Higher query latency, better UX.

Media Streaming:
- Model: Option A (Flat Product/Show)
- Why: "Variations" (Episodes) are rarely searched independently by attributes.
- Trade-off: Fast queries; no duplicate results.

Customer Support:
- Model: Parent-Child (Account → Ticket)
- Why: Access controls (Parent) apply to all tickets. Tickets change constantly.
- Trade-off: Slow queries, instant security updates.
Decision Framework
How do you choose? Answer these five questions to pick the correct model.
| Question | If YES... | If NO... |
|---|---|---|
| 1. Are there many variations of the same item? | Consider SKU + Collapsing | Model as Product |
| 2. Do filters need strictly correlated attributes? (e.g. Red MUST be Size L) | Need Nested or Parent-Child | Flat arrays are faster |
| 3. Is your update rate extremely high? (e.g. Real-time inventory) | Parent-Child (Update child only) | Nested (Updates are expensive) |
| 4. More than 1,000 variants per product? | Parent-Child or Split Indices | Nested fits in a block |
| 5. Need stable pagination across pages? | Avoid Collapse (use Flat/Nested) | Collapse is acceptable (users only need the top N) |
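The table above can be condensed into a sketch function. The option labels are this chapter's; the priority order is one reasonable reading of the table, not a hard rule:

```python
def choose_model(many_variants, correlated_filters, high_update_rate, huge_variant_count):
    # Q4: >1,000 variants per product overwhelms a nested block.
    if huge_variant_count:
        return "Parent-Child or Split Indices"
    # Q3: very high update rates favor updating children independently.
    if high_update_rate:
        return "Parent-Child"
    # Q2: strictly correlated attributes need preserved object boundaries.
    if correlated_filters:
        return "Nested"
    # Q1: many variants with loose filters: SKU docs, collapsed for display.
    if many_variants:
        return "SKU Docs + Collapse"
    return "Flat"

print(choose_model(False, True, False, False))   # Nested
print(choose_model(True, False, False, False))   # SKU Docs + Collapse
```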
Key Takeaways
Model for the Result Card
The 'Document' is what you show the user, not necessarily a database row. Often, 10 DB rows = 1 Search Document.
The Denormalization Tax
Search indices are 5x larger than the source DB due to inverted indices, doc values, and replicas. Plan storage accordingly.
Choose Your Granularity
Use Field Collapsing for 'SKU-level precision with Product-level display'. Avoid Nested Objects for high-churn data.
Updates are Expensive
In Lucene, every update is a Delete + Insert. Partial updates are a lie. Optimize for write throughput.
Mental Model Summary: The 3 Axes
Relevance Correctness
Does "Red Small" actually match a Red Small item? (Nested/SKU wins here)
Query Latency
How fast does the search page load? (Flat wins here)
Update Cost
How fast can inventory sync? (Parent-Child wins here)