Systems Atlas
Chapter 4.5: Data Foundation

Cleaning & Normalization

Search engines operate on a simple principle: Garbage In, Garbage Out. If you index "iPhone" and a user searches for "iphone" (lowercase), a naive system returns zero results.

The cleaning pipeline is the unsung hero of search relevance. It bridges the gap between human expression (messy, inconsistent, diverse) and machine indexing (rigid, binary, exact). This page walks through the critical stages of transforming raw chaos into queryable order.


The Ingestion Pipeline

Data travels through a one-way transformation tunnel. Once indexed, the original raw state is often discarded from the inverted index (though kept in storage).

  1. Raw Input: `<p>Café!</p>`
  2. Char Filter (strip HTML): `Café!`
  3. Tokenizer: `Café`
  4. Token Filter (lowercase + ASCII fold): `cafe`
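The four stages above can be sketched as a minimal analyzer chain. This is an illustrative toy, not a specific library's API; real engines like Elasticsearch define these stages declaratively.

```python
import re
import unicodedata

def char_filter(text: str) -> str:
    """Stage 2: strip HTML tags from the raw character stream."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text: str) -> list[str]:
    """Stage 3: split on whitespace, trimming surrounding punctuation."""
    return [t.strip("!?.,;:") for t in text.split() if t.strip("!?.,;:")]

def token_filter(token: str) -> str:
    """Stage 4: lowercase, then fold accents to ASCII ('Café' -> 'cafe')."""
    folded = unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode()
    return folded.lower()

def analyze(raw: str) -> list[str]:
    return [token_filter(t) for t in tokenize(char_filter(raw))]

print(analyze("<p>Café!</p>"))  # -> ['cafe']
```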

1. Character Filtering

Before we even think about "words", we must sanitize the character stream. Common sources like CMS inputs often contain hidden garbage: invisible control characters, non-breaking spaces (` `), and leftover HTML tags that clutter the index.

HTML Stripping

Raw HTML adds noise. Ranking algorithms might think the word "div" or "span" is relevant if not removed.

| Input | Output |
| --- | --- |
| `<h1>Hello <b>World</b></h1>` | Hello World |
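For stripping, a naive regex breaks on edge cases like `<` inside attributes; the standard library's `html.parser` walks the markup properly. A minimal sketch:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup: str) -> str:
    stripper = TagStripper()
    stripper.feed(markup)
    return "".join(stripper.chunks)

print(strip_html("<h1>Hello <b>World</b></h1>"))  # -> Hello World
```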

Mapping & Normalization

Normalizing non-standard characters before tokenization keeps the index clean.

| Input | Output |
| --- | --- |
| `:)` | `_happy_` |
| ` ` (non-breaking space) | ` ` (space) |
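A character-mapping filter is just a substitution table applied before tokenization. The emoticon-to-word mapping below is an illustrative choice, not a standard:

```python
# Substitutions applied to the raw character stream, pre-tokenization.
CHAR_MAP = {
    ":)": "_happy_",
    "\u00a0": " ",   # non-breaking space -> regular space
}

def map_chars(text: str) -> str:
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    return text

print(map_chars("Great\u00a0product :)"))  # -> Great product _happy_
```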

2. Compliance & PII Redaction

In the age of GDPR and CCPA, accidentally indexing PII (Personally Identifiable Information) is a disaster. Search indices are often immutable segments; "deleting" a single email address often requires expensive segment merging. The best defense is to redact at the gate.

Use regex patterns to scan every incoming document. If a pattern matches, apply a redaction policy immediately.

| Type | Pattern |
| --- | --- |
| Email | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` |
| Phone | `\d{3}-\d{3}-\d{4}` |

Redaction vs. Masking

| Policy | Result |
| --- | --- |
| Input | Call me at 555-0123 |
| Redaction (remove) | Call me at [PHONE] |
| Masking (obscure) | Call me at ***-**23 |
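Both policies can be sketched with the regexes from the table above. Note these patterns are deliberately minimal; the phone regex only matches the North American 3-3-4 style, so the example below uses a number in that shape.

```python
import re

# Patterns from the table above; real pipelines need many more formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "PHONE": re.compile(r"\d{3}-\d{3}-\d{4}"),
}

def redact(text: str) -> str:
    """Redaction policy: replace each match with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def mask_phone(text: str) -> str:
    """Masking policy: obscure all but the last two digits."""
    return PII_PATTERNS["PHONE"].sub(lambda m: "***-**" + m.group()[-2:], text)

print(redact("Call me at 555-867-5309 or bob@example.com"))
# -> Call me at [PHONE] or [EMAIL]
```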

3. Token Normalization

Once character streams are split into tokens (words), we face the challenge of variety. Users type fast and loose. "iphone", "iPhone", and "IPHONE" are distinct byte sequences but semantically identical. Normalization unifies these variations into a single canonical form.

Lowercasing & ASCII Folding

Searches are usually case-insensitive. We handle this at index time, not query time. ASCII Folding converts special characters like `é`, `ñ`, `ç` into their basic ASCII equivalents `e`, `n`, `c`.

| Raw Term | Indexed Term |
| --- | --- |
| iPhone | iphone |
| Café | cafe |
| Düsseldorf | dusseldorf |

Stemming

Reducing words to their root form. This increases Recall (finding more matches) at the slight cost of Precision.

Warning: Aggressive stemming can hurt. "Organization" → "Organ" changes meaning entirely.

| Raw Term | Root Form |
| --- | --- |
| Running | run |
| Banks | bank |
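A toy suffix-stripping stemmer illustrates the idea (and the over-stemming warning above). Production systems use a full algorithm such as Porter or Snowball; this sketch handles only a handful of suffixes:

```python
def stem(word: str) -> str:
    """Toy suffix stripper: longest-suffix-first, with a minimum stem length."""
    word = word.lower()
    for suffix in ("ization", "ing", "tion", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 3 and word[-1] == word[-2]:  # "runn" -> "run"
        word = word[:-1]
    return word

print(stem("Running"))       # -> run
print(stem("Banks"))         # -> bank
print(stem("Organization"))  # -> organ  (the over-stemming trap!)
```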

4. Phonetic Analysis

What if the user doesn't know the spelling? "Fuzzy" search isn't magic; it's often Phonetic Encoding. By reducing words to their pronunciation signature, we can match "Smith", "Smyth", and "Schmidt" because they sound the same. This is essential for names, medication brands, and user-generated tags.

The Soundex Algorithm

Invented in 1918. Keeps the first letter, drops vowels (A, E, I, O, U, H, W, Y), and maps consonants to 6 digits.

  1. B, F, P, V → 1
  2. C, G, J, K, Q, S, X, Z → 2
  3. D, T → 3
  4. L → 4
  5. M, N → 5
  6. R → 6
| Word | Soundex |
| --- | --- |
| Smith | S530 |
| Smyth | S530 |

Match found! Both resolve to S (keep 1st) - 5 (m) - 3 (t) - 0 (pad).
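The rules above translate directly into code. This is a simplified variant (it treats H and W like vowels, while the full algorithm gives them special handling between consonants):

```python
# Consonant groups from the list above, mapped to their digit.
CODES = {}
for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in group:
        CODES[ch] = digit

def soundex(word: str) -> str:
    """Keep first letter, encode the rest, collapse runs, pad to 4 chars."""
    word = word.upper()
    encoded = [word[0]]
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")          # vowels/H/W/Y have no code
        if digit and digit != prev:        # collapse adjacent duplicates
            encoded.append(digit)
        prev = digit
    return "".join(encoded)[:4].ljust(4, "0")

print(soundex("Smith"), soundex("Smyth"))  # -> S530 S530
```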

5. Semantic Normalization

Strings are easy; Values are hard. If a user filters for "Laptops under 1kg", a product listed as "900g" will fail to match if you indexed it as a raw string. Value Canonicalization converts human-readable units into machine-sortable base types.

Unit Normalization

You cannot sort products by weight if some are "1kg" and others are "500g". Normalization to Base Units is critical for range queries.

| Raw Value | Canonical |
| --- | --- |
| "1.5 kg" | 1500 (grams) |
| "500g" | 500 (grams) |
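A minimal parser for the table above, assuming only metric weight units (real catalogs need pounds, ounces, and plenty of messier formats):

```python
import re

# Conversion factors to the base unit (grams) for range queries.
UNIT_TO_GRAMS = {"kg": 1000, "g": 1, "mg": 0.001}

def to_grams(raw: str) -> float:
    # 'kg' must come before 'g' in the alternation so it wins the match.
    match = re.fullmatch(r"\s*([\d.]+)\s*(kg|mg|g)\s*", raw.lower())
    if match is None:
        raise ValueError(f"unparseable weight: {raw!r}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_GRAMS[unit]

print(to_grams("1.5 kg"))  # -> 1500.0
print(to_grams("500g"))    # -> 500.0
```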

Date Standardization

Relative dates are user-friendly but un-indexable. Parse them into ISO-8601 timestamps.

| Raw Value | Canonical |
| --- | --- |
| "Last Friday" | "2023-11-24T00:00:00Z" |
| "2 days ago" | NOW() - 48h |
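A sketch of relative-date resolution for the "N days ago" case; real pipelines use a library such as dateparser. The reference time is injected rather than read from the clock, so results are deterministic:

```python
from datetime import datetime, timedelta, timezone

def parse_relative(text: str, now: datetime) -> datetime:
    """Resolve a tiny subset of relative expressions against 'now'."""
    parts = text.lower().split()
    if len(parts) == 3 and parts[1] == "days" and parts[2] == "ago":
        return now - timedelta(days=int(parts[0]))
    raise ValueError(f"unsupported expression: {text!r}")

now = datetime(2023, 11, 26, tzinfo=timezone.utc)
print(parse_relative("2 days ago", now).isoformat())
# -> 2023-11-24T00:00:00+00:00
```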

6. Record Hygiene

The previous steps focused on cleaning *fields*. Now we must clean *records*. Duplicate records dilute relevance signals: if a user clicks Product A but Product B is an identical duplicate, your ranking algorithm receives split signals. Entity Resolution is the fix.

Entity Resolution

Ingestion often gathers data from multiple sources. You might have "IBM", "I.B.M. Corp", and "International Business Machines" as separate rows.

The Golden Record strategy merges these into a single Document ID, combining the diverse fields (e.g., Stock Symbol from source A, Headquarters from source B).

Fragments → (Jaccard Similarity > 0.85) → Golden Record (ID: ibm-123)
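The duplicate check can be sketched as token-set Jaccard similarity (the 0.85 threshold comes from the diagram above; the tokenizer is illustrative). Note that pure token overlap cannot link "IBM" to "International Business Machines", which share no tokens; real entity resolution layers on alias tables and field-level matching. Jaccard shines on near-duplicate listings:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase token sets."""
    def tokens(s: str) -> set[str]:
        return set(s.lower().replace("(", " ").replace(")", " ").split())
    sa, sb = tokens(a), tokens(b)
    return len(sa & sb) / len(sa | sb)

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    return jaccard(a, b) >= threshold

print(same_entity("Apple iPhone 15 Pro 256GB",
                  "Apple iPhone 15 Pro (256GB)"))  # -> True
```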

7. Strategy: Context Matters

There is no "correct" way to clean data. It depends entirely on your domain. Aggressive cleaning increases **Recall** (finding things) but hurts **Precision** (finding the *right* thing).

**E-Commerce: Maximum Recall**

User wants to find the product even if they type "grey" instead of "gray" or "iphon" instead of "iPhone".

  • Aggressive Stemming
  • Phonetic Matching
  • Synonyms (TV = Tele)

**Log Analytics: Exact Precision**

Searching for "Error 500" should NOT match "Error 502". Case sensitivity often matters (variable names).

  • Minimal Stemming
  • No Phonetic
  • Whitespace Only

**Legal / Docs: Hybrid Approach**

Need synonyms ("contract" = "agreement") but distinct entities ("Cancer" != "Canker").

  • Lemmatization
  • Curated Synonyms

Key Takeaways

**01. Index Meaning, Not Syntax**

Raw strings are useless. Normalize units (kg→g), dates (ISO), and entities (Golden Records) to enable powerful non-text queries.

**02. Redact Early**

Never let PII enter the inverted index. It is technically difficult and expensive to remove individual terms later.

**03. One Size Fits None**

E-commerce needs 'fuzzy' logic to convert users. Log search needs 'exact' logic to debug errors. Choose your pipeline per index.

**04. The Golden Record**

Duplicate records split ranking signals. Always resolve entities to a single ID before indexing to maximize relevance.

Last Step: The Stopwords Debate

Stopwords are extremely common words (e.g., "the", "is", "at", "which") that traditionally were filtered out to save space. However, modern search needs context. "To be or not to be" becomes nothing if you remove stopwords.

Strategy A: Remove Them

  • Pro: Smaller Index (30-40% reduction).
  • Pro: Faster queries (fewer tokens to match).
  • Con: Loss of Meaning ("The Who" becomes "Who").
  • Con: Breaks exact phrase searches.

Strategy B: Keep Them (Modern Standard)

  • Pro: Full semantic understanding (LLM ready).
  • Pro: Supports Natural Language Queries using position.
  • Con: Larger storage footprint.
  • Con: Slower common-term matching (mitigated by "Common-Grams").
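Strategy A's failure mode is easy to demonstrate (the stopword list below is a small illustrative sample):

```python
# A tiny sample stopword list; real lists run to hundreds of entries.
STOPWORDS = {"the", "is", "at", "which", "to", "be", "or", "not", "a", "who"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("to be or not to be".split()))
# -> []  (the entire query vanishes)
print(remove_stopwords("the who live at leeds".split()))
# -> ['live', 'leeds']  (the band name is gone)
```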