Systems Atlas

Chapter 4.1: Data Foundation

Quality as a Data Problem

Most engineering teams treat search as a Ranking Problem (Algorithms, Learning to Rank, Vectors). In reality, 80% of search quality failures are Data Problems. No ranking model can fix broken data.


The Core Thesis: The Multiplication Rule

Search quality is a product of multiple factors, not a sum. When components multiply, a weakness in any single factor drags down the entire result. This explains why teams that focus exclusively on ranking algorithms often fail to improve the user experience: they're optimizing one multiplier while ignoring a factor that may be stuck at 0.5.

Final_Score = DataQuality × QueryUnderstanding × RankingModel

If DataQuality = 0.5 (only half of your data is correct), your maximum possible score is 0.5, even with perfect query understanding and a perfect ranking model. No amount of BERT fine-tuning or vector embeddings can overcome garbage data. This is the fundamental law of search quality.
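The ceiling effect is easy to demonstrate with a toy calculation (the factor values below are made up for illustration, not measured from any system):

```python
def final_score(data_quality, query_understanding, ranking_model):
    """Toy model of the multiplication rule: the weakest factor caps the result."""
    return data_quality * query_understanding * ranking_model

# Perfecting the ranking model barely helps while data quality is stuck at 0.5:
baseline = final_score(0.5, 0.9, 0.7)        # ~0.315
better_ranking = final_score(0.5, 0.9, 1.0)  # capped at 0.45
better_data = final_score(0.95, 0.9, 0.7)    # ~0.60, the bigger win
```

Fixing the data factor moves the ceiling; tuning the ranking factor only approaches it.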

The Cost of Bad Data: Real-World Examples

Data quality failures aren't abstract concerns: they have concrete, measurable costs. The following case studies come from real production systems and demonstrate how seemingly small data issues can cascade into millions of dollars in lost revenue or, in critical domains, put lives at risk.

Example 1: The $450K/day Null Price Bug

Company: Mid-size E-commerce (10M products)

Bug: 5% of products had price: null

UI Behavior: Displayed as "$0.00"

User Behavior: Clicked, saw real price in cart, abandoned

// The Math
Daily searches: 1,000,000
Searches showing null-price products: 5%
Users clicking "$0.00" items: 20%
Abandonment when real price seen: 90%
Average Order Value: $50
Lost orders = 1M × 0.05 × 0.20 × 0.90 = 9,000/day
Lost revenue = 9,000 × $50 = $450,000/day
Fix: Added ingestion gate that rejected products without valid price.
Time to fix: 2 hours of engineering.
ROI: $450K × 365 = $164M/year saved by a 2-hour fix.
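A minimal sketch of such an ingestion gate (the class and function names here are illustrative, not from the actual system):

```python
class PriceValidationError(ValueError):
    """Raised when a product fails the price gate."""

def price_gate(doc, min_price=0.01):
    """Reject documents without a valid positive price before they reach the index."""
    price = doc.get("price")
    if price is None:
        raise PriceValidationError(f"null price for product {doc.get('id')}")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise PriceValidationError(f"non-numeric price {price!r} for product {doc.get('id')}")
    if price < min_price:
        raise PriceValidationError(f"price {price} below minimum {min_price}")
    return doc
```

Rejected documents go to a dead-letter queue for seller follow-up instead of rendering as "$0.00" in search results.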

Example 2: The Pandemic Mask Crisis (Field Nulls)

Company: Healthcare marketplace | Context: March 2020, N95 mask shortage
Bug: mask_type field was optional in schema.

// 80% indexed correctly
{ "title": "3M N95 Respirator", "mask_type": "N95" }
// 20% missing field entirely
{ "title": "KN95 Professional Mask" }
// mask_type field completely absent
The Query: { "query": { "term": { "mask_type": "N95" } } }
Result: the 20% of legitimate N95/KN95 masks missing the field were invisible to the exact-match query and returned 0 results. During a pandemic. With life-or-death stakes.
# Fix: Ingestion validator
def validate_mask(doc):
    if "mask" in doc["title"].lower():
        if "mask_type" not in doc or doc["mask_type"] is None:
            raise ValidationError("mask_type required for mask products")

Example 3: The "iPhone Case" SEO Spam (Field Contamination)

Company: Electronics marketplace | Bug: Sellers stuffed keywords into product titles

{
  "title": "iPhone 15 Pro Max Case Cover Samsung Galaxy S24...",
  "actual_compatibility": ["iPhone 15 Pro Max"]
}
The Problem:
  • User searches "Samsung Galaxy S24 case"
  • Results show iPhone cases (because "Samsung Galaxy S24" is in title)
  • User loses trust, leaves site
# Detection Algorithm
def detect_title_spam(title, category):
    expected_brands = get_category_brands(category)  # set of brands valid for this category
    title_brands = extract_brand_mentions(title)     # set of brands found in the title
    # If title has > 3 brand mentions, likely spam
    if len(title_brands) > 3:
        return True
    # If title mentions brands outside the category
    foreign_brands = title_brands - expected_brands
    if len(foreign_brands) > 1:
        return True
    return False

# Action: flagged docs get quality_score: 0.1
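The detector above assumes an `extract_brand_mentions` helper. A crude dictionary-based sketch (the brand list is an illustrative subset; a real system would also map model names like "iPhone" to their brand via alias tables):

```python
KNOWN_BRANDS = {"apple", "samsung", "sony", "google", "xiaomi"}  # illustrative subset

def extract_brand_mentions(title):
    """Return the set of known brand names appearing as words in a title."""
    words = set(title.lower().replace("-", " ").split())
    return {brand for brand in KNOWN_BRANDS if brand in words}
```

A spammy title mentioning several brands produces a large set, tripping the `> 3` threshold in `detect_title_spam`.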

The Five Data Quality Failures

After analyzing hundreds of search quality incidents across different companies and domains, a clear pattern emerges: most failures fall into one of five categories. Understanding this taxonomy helps you build systematic defenses and quickly diagnose issues when they occur.

A. Field Contamination

Definition: Wrong data in the right field.

❌ Bad: HTML in Description
description: "<div class=\"product-info\"><p>Great <b>shoes</b>!</p></div>"
Tokens: ["div", "class", "product", "info", "p", "great", "b", "shoes"]
✓ Fix: HTML stripping at ingestion
from bs4 import BeautifulSoup

def clean_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=" ").strip()

B. Schema Drift

Definition: Field type changes over time without migration.

# Timeline Example
2022-01: color = "Red" # string
2022-06: color = "RED" # different case
2023-01: color = "#FF0000" # hex code
2023-06: color = {"name": "Red", "hex": "#FF0000"} # object!
Elasticsearch Behavior: With dynamic mapping, the first document sets the field's type, and later documents with conflicting types fail to index. With dynamic mapping disabled ("dynamic": false), unmapped fields are silently dropped: data loss with no error.
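One defense is a normalizer at ingestion that coerces every historical shape of the field into a single canonical form. A sketch covering the timeline above (the color lookup table is an illustrative stand-in for a real reference dataset):

```python
CSS_COLORS = {"red": "#FF0000", "blue": "#0000FF", "green": "#008000"}  # illustrative subset

def normalize_color(value):
    """Coerce every historical shape of `color` into {"name": ..., "hex": ...}."""
    if isinstance(value, dict):                # 2023-06 shape: already an object
        return {"name": value.get("name"), "hex": value.get("hex")}
    if isinstance(value, str):
        if value.startswith("#"):              # 2023-01 shape: bare hex code
            return {"name": None, "hex": value.upper()}
        name = value.capitalize()              # 2022 shapes: "Red" / "RED"
        return {"name": name, "hex": CSS_COLORS.get(name.lower())}
    return {"name": None, "hex": None}         # null or unexpected type
```

Running every document through the normalizer before indexing means the mapping only ever sees one shape, regardless of when the document was produced.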

C. The Implicit Null Problem

Definition: Missing fields treated inconsistently.

# The ranking formula
def rank_score(doc):
    relevance = calculate_bm25(query, doc)
    popularity = doc.get("click_count", ???)  # What value for new items?
    return relevance * 0.7 + popularity * 0.3
Strategy | Value for Null | Effect
Zero | 0 | New items buried at bottom
Average | 500 | Spam gets free boost
Negative | -1 | Explicitly deprioritized
Median (best) | 100 | Neutral starting point
# Best Practice
def safe_popularity(doc):
    if "click_count" not in doc or doc["click_count"] is None:
        # Use category median, not global
        return get_category_median_clicks(doc["category"])
    return doc["click_count"]

D. Semantic Duplication

Definition: Same real-world entity indexed multiple times.

// Seller A
{ "id": "seller-a-iphone15", "title": "Apple iPhone 15 128GB" }
// Seller B
{ "id": "seller-b-iphone15", "title": "iPhone 15 128 GB (Apple)" }
// Seller C
{ "id": "seller-c-iphone15", "title": "APPLE iPHONE 15 - 128GB" }

User Search: "iPhone 15" → Results Page: Same phone shown 3 times, variety destroyed

# Solution: Entity Resolution Pipeline
def deduplicate_results(results):
    seen_entities = set()
    unique_results = []
    for result in results:
        # Get canonical entity ID
        entity_id = get_entity_id(result)
        if entity_id not in seen_entities:
            seen_entities.add(entity_id)
            unique_results.append(result)
    return unique_results
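The pipeline assumes a `get_entity_id` helper. A crude first cut is to normalize the title into an order-insensitive token key (a sketch only: production entity resolution typically combines brand, model, and attribute matching, not just title normalization):

```python
import re

def get_entity_id(result):
    """Derive a canonical entity key from a normalized title token set."""
    title = result["title"].lower()
    title = re.sub(r"[^a-z0-9 ]", " ", title)                    # drop punctuation like "(" and "-"
    title = re.sub(r"(\d+)\s+(gb|tb|ml|kg)\b", r"\1\2", title)   # "128 gb" -> "128gb"
    return " ".join(sorted(set(title.split())))                  # order-insensitive key
```

All three seller listings above ("Apple iPhone 15 128GB", "iPhone 15 128 GB (Apple)", "APPLE iPHONE 15 - 128GB") collapse to the same key, so `deduplicate_results` keeps only one.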

E. Join Loss

Definition: Related entities become stale/inconsistent.

Day 1: Brand "Facebook" exists with brand_id: 123
Day 2: Brand renamed to "Meta" in Brand table
Day 3: Products still show "Facebook" in search (stale join)
# Fix: Event-Driven Reindexing
@on_event("brand.updated")
def handle_brand_update(brand_id, new_data):
    products = get_products_by_brand(brand_id)
    for product in products:
        # Reindex with fresh brand data
        reindex_product(product.id)

Engineering: Building a Data Quality System

Knowing the failure modes is only half the battle. You need automated systems that catch problems before they reach production. The following patterns show how to build quality gates into your ingestion pipeline, turning reactive firefighting into proactive prevention.

# The Quality Score Card - Ingestion Validator
class DataQualityValidator:
    RULES = {
        "title": {
            "required": True,
            "min_length": 10,
            "max_length": 200,
            "no_html": True,
        },
        "price": {
            "required": True,
            "type": "float",
            "min_value": 0.01,
            "max_value": 1000000,
        },
        "image_url": {
            "required": True,
            "must_resolve": True,  # HTTP 200
            "min_dimensions": (100, 100),
        },
    }

    def validate(self, doc):
        score = 1.0
        errors = []
        for field, rules in self.RULES.items():
            field_score, field_errors = self.validate_field(doc, field, rules)
            score *= field_score
            errors.extend(field_errors)
        return {
            "score": score,
            "passed": score >= 0.8,
            "errors": errors,
        }
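The class delegates to a `validate_field` method. One way the per-field checks might look, as a standalone sketch (the `must_resolve` and `min_dimensions` rules are omitted because they require network and image fetches; the partial-credit penalty factors are illustrative):

```python
import re

def validate_field(doc, field, rules):
    """Return (score, errors) for one field: 1.0 clean, fractional when
    partially valid, 0.0 on a hard failure such as a missing required field."""
    value = doc.get(field)
    if value is None:
        if rules.get("required"):
            return 0.0, [f"{field}: required but missing"]
        return 1.0, []
    # Numeric rules are pass/fail
    if rules.get("type") == "float":
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return 0.0, [f"{field}: expected number, got {type(value).__name__}"]
        if not rules.get("min_value", float("-inf")) <= value <= rules.get("max_value", float("inf")):
            return 0.0, [f"{field}: {value} out of range"]
        return 1.0, []
    # String rules earn partial credit
    score, errors = 1.0, []
    if "min_length" in rules and len(value) < rules["min_length"]:
        score *= 0.5
        errors.append(f"{field}: below min_length {rules['min_length']}")
    if "max_length" in rules and len(value) > rules["max_length"]:
        score *= 0.5
        errors.append(f"{field}: above max_length {rules['max_length']}")
    if rules.get("no_html") and re.search(r"<[^>]+>", value):
        score *= 0.3
        errors.append(f"{field}: contains HTML")
    return score, errors
```

Because `validate` multiplies the per-field scores, one hard failure (0.0) rejects the document while several soft failures merely drag the score toward the 0.8 threshold.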

Quality Metrics Dashboard

Every data quality system needs measurable metrics with alerting thresholds. The following five dimensions cover the essential aspects of data health. When any metric drops below its threshold, automated alerts should notify the team before bad data reaches users.

Metric | Formula | Alert Threshold
Completeness | docs_with_field / total_docs | < 99.9%
Validity | valid_values / non_null_values | < 99%
Freshness | now - last_updated | > 24 hours
Uniqueness | unique_entities / total_docs | < 95%
Consistency | docs_matching_schema / total_docs | < 99.99%
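Each metric is a simple ratio computed over the live index. A sketch of the completeness check with its alert threshold (document shape and message format are illustrative):

```python
def completeness(docs, field):
    """Fraction of documents where `field` is present and non-null."""
    if not docs:
        return 1.0
    have = sum(1 for d in docs if d.get(field) is not None)
    return have / len(docs)

def check_completeness(docs, field, threshold=0.999):
    """Return an alert message when completeness drops below the threshold, else None."""
    ratio = completeness(docs, field)
    if ratio < threshold:
        return f"ALERT: {field} completeness {ratio:.2%} < {threshold:.2%}"
    return None
```

In production this runs against index aggregations rather than in-memory documents, but the ratio and threshold logic is the same.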

The Quality Score Formula

The quality score combines all field-level checks into a single number between 0 and 1. Each field contributes based on its completeness (C), validity (V), and business weight (w). The geometric mean ensures that a zero in any critical field tanks the entire score.

QS(d) = ( ∏_{i=1}^{n} w_i · C(f_i) · V(f_i) )^{1/n}

Where: C(f_i) = completeness (1 if present, 0 if null), V(f_i) = validity (1 if valid, 0-1 if partially valid), w_i = business weight
Worked Example: Product Document
{ "title": "Nike Air Max", "price": 129.99, "image_url": null }
title (w=1.0): C=1, V=1 → 1.0 × 1 × 1 = 1.0
price (w=1.0): C=1, V=1 → 1.0 × 1 × 1 = 1.0
image_url (w=0.8): C=0, V=0 → 0.8 × 0 × 0 = 0
QS = (1.0 × 1.0 × 0)^{1/3} = 0.0
→ Missing image kills the score. Document rejected.
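The worked example translates directly into code. A sketch of the formula, with the field weights from the example and simple illustrative validators passed in:

```python
def quality_score(doc, fields):
    """Geometric mean of weighted per-field scores.

    `fields` maps field name -> (weight, validator); the validator returns
    a validity score in [0, 1] for a present value."""
    product, n = 1.0, len(fields)
    for name, (weight, validator) in fields.items():
        value = doc.get(name)
        completeness = 0.0 if value is None else 1.0
        validity = validator(value) if value is not None else 0.0
        product *= weight * completeness * validity
    return product ** (1.0 / n)

# Weights from the worked example; validators are illustrative stand-ins.
FIELDS = {
    "title": (1.0, lambda v: 1.0 if len(v) >= 10 else 0.5),
    "price": (1.0, lambda v: 1.0 if isinstance(v, (int, float)) and v > 0 else 0.0),
    "image_url": (0.8, lambda v: 1.0),
}

doc = {"title": "Nike Air Max", "price": 129.99, "image_url": None}
quality_score(doc, FIELDS)  # 0.0 -- the null image zeroes the product
```

Note how the geometric mean behaves: a single zero term forces QS to 0.0 regardless of how good the other fields are, which is exactly the rejection behavior the example shows.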

Decision Rules with Examples

QS ≥ 0.9 → Index immediately
{
  "title": "Nike Air Max 90",
  "price": 129.99,
  "image": "✓ Valid URL"
}
QS = 0.95

0.7 ≤ QS < 0.9 → Index with warning
{
  "title": "Shoe",  // too short
  "price": 129.99,
  "image": "✓ Valid URL"
}
QS = 0.78 ⚠️

QS < 0.7 → Reject document
{
  "title": "Nike Air Max",
  "price": null,
  "image": "broken.jpg"
}
QS = 0.0
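The three rules collapse into a single routing function at the end of the ingestion pipeline (the return labels are illustrative; real systems would enqueue to different destinations):

```python
def route_document(qs):
    """Map a quality score to an ingestion decision per the rules above."""
    if qs >= 0.9:
        return "index"
    if qs >= 0.7:
        return "index_with_warning"
    return "reject"
```

The three example documents above route to "index" (0.95), "index_with_warning" (0.78), and "reject" (0.0) respectively.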

Key Takeaways

01

It's a Data Problem

80% of search quality failures are data issues, not algorithm issues. Ranking models cannot fix broken data.

02

The Five Failures

Field Contamination, Schema Drift, Implicit Nulls, Semantic Duplication, and Join Loss are the most common root causes.

03

High ROI Fixes

Simple ingestion gates (like rejecting null prices) can save millions in lost revenue with minimal engineering effort.

04

Automate Quality

Build automated quality gates in your ingestion pipeline. Reject bad data before it enters the index.