Systems Atlas

Chapter 4.1: Data Foundation

Quality as a Data Problem

Most engineering teams treat search as a Ranking Problem (Algorithms, Learning to Rank, Vectors). In reality, 80% of search quality failures are Data Problems. No ranking model can fix broken data.


The Core Thesis: The Multiplication Rule

Search quality is a product of multiple factors, not a sum. When components multiply, a weakness in any single factor drags down the entire result. This explains why teams that focus exclusively on ranking algorithms often fail to improve the user experience: they're optimizing one multiplier while ignoring a factor that may be stuck at 0.5.

Final_Score = DataQuality × QueryUnderstanding × RankingModel

If DataQuality = 0.5 (only half of your data is correct), your maximum possible score is 0.5, even with perfect query understanding and a perfect ranking model. No amount of BERT fine-tuning or vector embeddings can overcome garbage data. This is the fundamental law of search quality.
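The ceiling effect is easy to demonstrate with a toy calculation (the factor values below are made up for illustration, not measured from any system):

```python
def final_score(data_quality, query_understanding, ranking_model):
    """Toy model of the multiplication rule: the weakest factor caps the result."""
    return data_quality * query_understanding * ranking_model

# Perfecting the ranking model barely helps while data quality is stuck at 0.5:
baseline = final_score(0.5, 0.9, 0.7)        # ~0.315
better_ranking = final_score(0.5, 0.9, 1.0)  # capped at 0.45
better_data = final_score(0.95, 0.9, 0.7)    # ~0.60, the bigger win
```

Fixing the data factor moves the ceiling; tuning the ranking factor only approaches it.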

The Cost of Bad Data: Real-World Examples

Data quality failures aren't abstract concerns: they have concrete, measurable costs. The following case studies come from real production systems and demonstrate how seemingly small data issues can cascade into millions of dollars in lost revenue or, in critical domains, put lives at risk.

Example 1: The $450K/day Null Price Bug

Company: Mid-size E-commerce (10M products)

Bug: 5% of products had price: null

UI Behavior: Displayed as "$0.00"

User Behavior: Clicked, saw real price in cart, abandoned

// The Math
Daily searches: 1,000,000
Searches showing null-price products: 5%
Users clicking "$0.00" items: 20%
Abandonment when real price seen: 90%
Average Order Value: $50
Lost orders = 1M × 0.05 × 0.20 × 0.90 = 9,000/day
Lost revenue = 9,000 × $50 = $450,000/day
Fix: Added ingestion gate that rejected products without valid price.
Time to fix: 2 hours of engineering.
ROI: $450K × 365 = $164M/year saved by a 2-hour fix.
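A minimal sketch of such an ingestion gate (the class and function names here are illustrative, not from the actual system):

```python
class PriceValidationError(ValueError):
    """Raised when a product fails the price gate."""

def price_gate(doc, min_price=0.01):
    """Reject documents without a valid positive price before they reach the index."""
    price = doc.get("price")
    if price is None:
        raise PriceValidationError(f"null price for product {doc.get('id')}")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise PriceValidationError(f"non-numeric price {price!r} for product {doc.get('id')}")
    if price < min_price:
        raise PriceValidationError(f"price {price} below minimum {min_price}")
    return doc
```

Rejected documents go to a dead-letter queue for seller follow-up instead of rendering as "$0.00" in search results.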

Example 2: The Pandemic Mask Crisis (Field Nulls)

Company: Healthcare marketplace | Context: March 2020, N95 mask shortage
Bug: mask_type field was optional in schema.

// 80% indexed correctly
{ "title": "3M N95 Respirator", "mask_type": "N95" }
// 20% missing field entirely
{ "title": "KN95 Professional Mask" }
// mask_type field completely absent
The Query: { "query": { "term": { "mask_type": "N95" } } }
Result: the 20% of legitimate N95/KN95 masks missing the field were invisible to the exact-match query and returned 0 results. During a pandemic. With life-or-death stakes.
# Fix: Ingestion validator
def validate_mask(doc):
    if "mask" in doc["title"].lower():
        if "mask_type" not in doc or doc["mask_type"] is None:
            raise ValidationError("mask_type required for mask products")

Example 3: The "iPhone Case" SEO Spam (Field Contamination)

Company: Electronics marketplace | Bug: Sellers stuffed keywords into product titles

{
  "title": "iPhone 15 Pro Max Case Cover Samsung Galaxy S24...",
  "actual_compatibility": ["iPhone 15 Pro Max"]
}
The Problem:
  • User searches "Samsung Galaxy S24 case"
  • Results show iPhone cases (because "Samsung Galaxy S24" is in title)
  • User loses trust, leaves site
# Detection Algorithm
def detect_title_spam(title, category):
    expected_brands = get_category_brands(category)  # set of brands valid for this category
    title_brands = extract_brand_mentions(title)     # set of brands found in the title
    # If title has > 3 brand mentions, likely spam
    if len(title_brands) > 3:
        return True
    # If title mentions brands outside the category
    foreign_brands = title_brands - expected_brands
    if len(foreign_brands) > 1:
        return True
    return False

# Action: flagged docs get quality_score: 0.1
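The detector above assumes an `extract_brand_mentions` helper. A crude dictionary-based sketch (the brand list is an illustrative subset; a real system would also map model names like "iPhone" to their brand via alias tables):

```python
KNOWN_BRANDS = {"apple", "samsung", "sony", "google", "xiaomi"}  # illustrative subset

def extract_brand_mentions(title):
    """Return the set of known brand names appearing as words in a title."""
    words = set(title.lower().replace("-", " ").split())
    return {brand for brand in KNOWN_BRANDS if brand in words}
```

A spammy title mentioning several brands produces a large set, tripping the `> 3` threshold in `detect_title_spam`.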

The Five Data Quality Failures

After analyzing hundreds of search quality incidents across different companies and domains, a clear pattern emerges: most failures fall into one of five categories. Understanding this taxonomy helps you build systematic defenses and quickly diagnose issues when they occur.

A. Field Contamination

Definition: Wrong data in the right field.

❌ Bad: HTML in Description
description: "<div class=\"product-info\"><p>Great <b>shoes</b>!</p></div>"
Tokens: ["div", "class", "product", "info", "p", "great", "b", "shoes"]
✓ Fix: HTML stripping at ingestion
from bs4 import BeautifulSoup

def clean_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=" ").strip()

B. Schema Drift

Definition: Field type changes over time without migration.

# Timeline Example
2022-01: color = "Red" # string
2022-06: color = "RED" # different case
2023-01: color = "#FF0000" # hex code
2023-06: color = {"name": "Red", "hex": "#FF0000"} # object!
Elasticsearch Behavior: With dynamic mapping, the first document sets the field's type, and later documents with conflicting types fail to index. With dynamic mapping disabled ("dynamic": false), unmapped fields are silently dropped: data loss with no error.
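One defense is a normalizer at ingestion that coerces every historical shape of the field into a single canonical form. A sketch covering the timeline above (the color lookup table is an illustrative stand-in for a real reference dataset):

```python
CSS_COLORS = {"red": "#FF0000", "blue": "#0000FF", "green": "#008000"}  # illustrative subset

def normalize_color(value):
    """Coerce every historical shape of `color` into {"name": ..., "hex": ...}."""
    if isinstance(value, dict):                # 2023-06 shape: already an object
        return {"name": value.get("name"), "hex": value.get("hex")}
    if isinstance(value, str):
        if value.startswith("#"):              # 2023-01 shape: bare hex code
            return {"name": None, "hex": value.upper()}
        name = value.capitalize()              # 2022 shapes: "Red" / "RED"
        return {"name": name, "hex": CSS_COLORS.get(name.lower())}
    return {"name": None, "hex": None}         # null or unexpected type
```

Running every document through the normalizer before indexing means the mapping only ever sees one shape, regardless of when the document was produced.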

C. The Implicit Null Problem

Definition: Missing fields treated inconsistently.

# The ranking formula
def rank_score(doc):
    relevance = calculate_bm25(query, doc)
    popularity = doc.get("click_count", ???)  # What value for new items?
    return relevance * 0.7 + popularity * 0.3
Strategy | Value for Null | Effect
Zero | 0 | New items buried at bottom
Average | 500 | Spam gets free boost
Negative | -1 | Explicitly deprioritized
Median (best) | 100 | Neutral starting point
# Best Practice
def safe_popularity(doc):
    if "click_count" not in doc or doc["click_count"] is None:
        # Use category median, not global
        return get_category_median_clicks(doc["category"])
    return doc["click_count"]

D. Semantic Duplication

Definition: Same real-world entity indexed multiple times.

// Seller A
{ "id": "seller-a-iphone15", "title": "Apple iPhone 15 128GB" }
// Seller B
{ "id": "seller-b-iphone15", "title": "iPhone 15 128 GB (Apple)" }
// Seller C
{ "id": "seller-c-iphone15", "title": "APPLE iPHONE 15 - 128GB" }

User Search: "iPhone 15" → Results Page: Same phone shown 3 times, variety destroyed

# Solution: Entity Resolution Pipeline
def deduplicate_results(results):
    seen_entities = set()
    unique_results = []
    for result in results:
        # Get canonical entity ID
        entity_id = get_entity_id(result)
        if entity_id not in seen_entities:
            seen_entities.add(entity_id)
            unique_results.append(result)
    return unique_results
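The pipeline assumes a `get_entity_id` helper. A crude first cut is to normalize the title into an order-insensitive token key (a sketch only: production entity resolution typically combines brand, model, and attribute matching, not just title normalization):

```python
import re

def get_entity_id(result):
    """Derive a canonical entity key from a normalized title token set."""
    title = result["title"].lower()
    title = re.sub(r"[^a-z0-9 ]", " ", title)                    # drop punctuation like "(" and "-"
    title = re.sub(r"(\d+)\s+(gb|tb|ml|kg)\b", r"\1\2", title)   # "128 gb" -> "128gb"
    return " ".join(sorted(set(title.split())))                  # order-insensitive key
```

All three seller listings above ("Apple iPhone 15 128GB", "iPhone 15 128 GB (Apple)", "APPLE iPHONE 15 - 128GB") collapse to the same key, so `deduplicate_results` keeps only one.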

E. Join Loss

Definition: Related entities become stale/inconsistent.

Day 1: Brand "Facebook" exists with brand_id: 123
Day 2: Brand renamed to "Meta" in Brand table
Day 3: Products still show "Facebook" in search (stale join)
# Fix: Event-Driven Reindexing
@on_event("brand.updated")
def handle_brand_update(brand_id, new_data):
    products = get_products_by_brand(brand_id)
    for product in products:
        # Reindex with fresh brand data
        reindex_product(product.id)

Engineering: Building a Data Quality System

Knowing the failure modes is only half the battle. You need automated systems that catch problems before they reach production. The following patterns show how to build quality gates into your ingestion pipeline, turning reactive firefighting into proactive prevention.

# The Quality Score Card - Ingestion Validator
class DataQualityValidator:
    RULES = {
        "title": {
            "required": True,
            "min_length": 10,
            "max_length": 200,
            "no_html": True,
        },
        "price": {
            "required": True,
            "type": "float",
            "min_value": 0.01,
            "max_value": 1000000,
        },
        "image_url": {
            "required": True,
            "must_resolve": True,  # HTTP 200
            "min_dimensions": (100, 100),
        },
    }

    def validate(self, doc):
        score = 1.0
        errors = []
        for field, rules in self.RULES.items():
            field_score, field_errors = self.validate_field(doc, field, rules)
            score *= field_score
            errors.extend(field_errors)
        return {
            "score": score,
            "passed": score >= 0.8,
            "errors": errors,
        }
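The class delegates to a `validate_field` method. One way the per-field checks might look, as a standalone sketch (the `must_resolve` and `min_dimensions` rules are omitted because they require network and image fetches; the partial-credit penalty factors are illustrative):

```python
import re

def validate_field(doc, field, rules):
    """Return (score, errors) for one field: 1.0 clean, fractional when
    partially valid, 0.0 on a hard failure such as a missing required field."""
    value = doc.get(field)
    if value is None:
        if rules.get("required"):
            return 0.0, [f"{field}: required but missing"]
        return 1.0, []
    # Numeric rules are pass/fail
    if rules.get("type") == "float":
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return 0.0, [f"{field}: expected number, got {type(value).__name__}"]
        if not rules.get("min_value", float("-inf")) <= value <= rules.get("max_value", float("inf")):
            return 0.0, [f"{field}: {value} out of range"]
        return 1.0, []
    # String rules earn partial credit
    score, errors = 1.0, []
    if "min_length" in rules and len(value) < rules["min_length"]:
        score *= 0.5
        errors.append(f"{field}: below min_length {rules['min_length']}")
    if "max_length" in rules and len(value) > rules["max_length"]:
        score *= 0.5
        errors.append(f"{field}: above max_length {rules['max_length']}")
    if rules.get("no_html") and re.search(r"<[^>]+>", value):
        score *= 0.3
        errors.append(f"{field}: contains HTML")
    return score, errors
```

Because `validate` multiplies the per-field scores, one hard failure (0.0) rejects the document while several soft failures merely drag the score toward the 0.8 threshold.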

Quality Metrics Dashboard

Every data quality system needs measurable metrics with alerting thresholds. The following five dimensions cover the essential aspects of data health. When any metric drops below its threshold, automated alerts should notify the team before bad data reaches users.

Metric | Formula | Alert Threshold
Completeness | docs_with_field / total_docs | < 99.9%
Validity | valid_values / non_null_values | < 99%
Freshness | now - last_updated | > 24 hours
Uniqueness | unique_entities / total_docs | < 95%
Consistency | docs_matching_schema / total_docs | < 99.99%
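Each metric is a simple ratio computed over the live index. A sketch of the completeness check with its alert threshold (document shape and message format are illustrative):

```python
def completeness(docs, field):
    """Fraction of documents where `field` is present and non-null."""
    if not docs:
        return 1.0
    have = sum(1 for d in docs if d.get(field) is not None)
    return have / len(docs)

def check_completeness(docs, field, threshold=0.999):
    """Return an alert message when completeness drops below the threshold, else None."""
    ratio = completeness(docs, field)
    if ratio < threshold:
        return f"ALERT: {field} completeness {ratio:.2%} < {threshold:.2%}"
    return None
```

In production this runs against index aggregations rather than in-memory documents, but the ratio and threshold logic is the same.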

The Quality Score Formula

The quality score combines all field-level checks into a single number between 0 and 1. Each field contributes based on its completeness (C), validity (V), and business weight (w). The geometric mean ensures that a zero in any critical field tanks the entire score.

QS(d) = ( ∏_{i=1}^{n} w_i · C(f_i) · V(f_i) )^{1/n}

Where: C(f_i) = completeness (1 if present, 0 if null), V(f_i) = validity (1 if valid, 0-1 if partially valid), w_i = business weight
Worked Example: Product Document
{ "title": "Nike Air Max", "price": 129.99, "image_url": null }
title (w=1.0): C=1, V=1 → 1.0 × 1 × 1 = 1.0
price (w=1.0): C=1, V=1 → 1.0 × 1 × 1 = 1.0
image_url (w=0.8): C=0, V=0 → 0.8 × 0 × 0 = 0
QS = (1.0 × 1.0 × 0)^{1/3} = 0.0
→ Missing image kills the score. Document rejected.
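The worked example translates directly into code. A sketch of the formula, with the field weights from the example and simple illustrative validators passed in:

```python
def quality_score(doc, fields):
    """Geometric mean of weighted per-field scores.

    `fields` maps field name -> (weight, validator); the validator returns
    a validity score in [0, 1] for a present value."""
    product, n = 1.0, len(fields)
    for name, (weight, validator) in fields.items():
        value = doc.get(name)
        completeness = 0.0 if value is None else 1.0
        validity = validator(value) if value is not None else 0.0
        product *= weight * completeness * validity
    return product ** (1.0 / n)

# Weights from the worked example; validators are illustrative stand-ins.
FIELDS = {
    "title": (1.0, lambda v: 1.0 if len(v) >= 10 else 0.5),
    "price": (1.0, lambda v: 1.0 if isinstance(v, (int, float)) and v > 0 else 0.0),
    "image_url": (0.8, lambda v: 1.0),
}

doc = {"title": "Nike Air Max", "price": 129.99, "image_url": None}
quality_score(doc, FIELDS)  # 0.0 -- the null image zeroes the product
```

Note how the geometric mean behaves: a single zero term forces QS to 0.0 regardless of how good the other fields are, which is exactly the rejection behavior the example shows.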

Decision Rules with Examples

QS ≥ 0.9 → Index immediately
{
  "title": "Nike Air Max 90",
  "price": 129.99,
  "image": "✓ Valid URL"
}
QS = 0.95

0.7 ≤ QS < 0.9 → Index with warning
{
  "title": "Shoe",  // too short
  "price": 129.99,
  "image": "✓ Valid URL"
}
QS = 0.78 ⚠️

QS < 0.7 → Reject document
{
  "title": "Nike Air Max",
  "price": null,
  "image": "broken.jpg"
}
QS = 0.0
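The three rules collapse into a single routing function at the end of the ingestion pipeline (the return labels are illustrative; real systems would enqueue to different destinations):

```python
def route_document(qs):
    """Map a quality score to an ingestion decision per the rules above."""
    if qs >= 0.9:
        return "index"
    if qs >= 0.7:
        return "index_with_warning"
    return "reject"
```

The three example documents above route to "index" (0.95), "index_with_warning" (0.78), and "reject" (0.0) respectively.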

Key Takeaways

01

It's a Data Problem

80% of search quality failures are data issues, not algorithm issues. Ranking models cannot fix broken data.

02

The Five Failures

Field Contamination, Schema Drift, Implicit Nulls, Semantic Duplication, and Join Loss are the most common root causes.

03

High ROI Fixes

Simple ingestion gates (like rejecting null prices) can save millions in lost revenue with minimal engineering effort.

04

Automate Quality

Build automated quality gates in your ingestion pipeline. Reject bad data before it enters the index.