Chapter 4.2: Data Foundation
Types of Data in Search
Search engines don't operate on a single data type. Understanding the different data categories and their characteristics is essential for building scalable systems.
The Data Taxonomy
Every piece of data in a search system has distinct characteristics that affect how it should be stored, indexed, and queried. Misunderstanding these differences leads to performance disasters like storing prices as text (100x slower filtering) or treating behavioral signals as static documents. The following taxonomy covers the five fundamental data types you'll encounter.
A. Text Data (Unstructured)
Definition: Human-readable content with no fixed schema.
Text is the foundation of search: product titles, descriptions, reviews, articles. Unlike numbers, text has no inherent structure. Before it can be searched, it must be analyzed: broken into tokens, normalized to lowercase, and stemmed to root forms. This processing is what builds the inverted index.
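The analysis chain described above can be sketched in a few lines of plain Python. The stemmer here is a deliberately crude suffix-stripper standing in for a real algorithm such as Porter, and the document set is illustrative:

```python
import re
from collections import defaultdict

def analyze(text):
    """Tokenize, lowercase, and crudely stem a string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Naive stemming: strip a plural "s" (a real engine uses Porter/Snowball).
    return [re.sub(r"s$", "", t) for t in tokens]

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "Wireless Headphones", 2: "Wired headphone stand"}
index = build_inverted_index(docs)
```

After analysis, looking up `index["headphone"]` returns both documents even though one says "Headphones" and the other "headphone" — that normalization step is exactly why full-text search needs analysis while structured fields don't.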
| Property | Value |
|---|---|
| Entropy | High (unpredictable content) |
| Analysis Required | Yes (tokenization, stemming) |
| Query Type | Full-text search, fuzzy matching |
| Storage Cost | High (inverted index overhead) |
B. Structured Data (Exact Values)
Definition: Data with strict types and known value ranges.
Structured data includes prices, dates, ratings, stock counts, and coordinates. Unlike text, these values need no tokenization; they're stored and queried exactly as-is. The key insight: numeric ranges are served by BKD trees, not inverted indexes, which can make range filtering up to 100x faster.
| Property | Value |
|---|---|
| Entropy | Low (bounded values) |
| Analysis Required | No (stored exactly) |
| Query Type | Range, exact match, geo |
| Storage Cost | Low (BKD trees) |
A price filter belongs in the filter context, not the query context: filter clauses skip relevance scoring and can be cached across requests.
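Here's a hedged sketch of such a query as an Elasticsearch request body, built as a plain Python dict. The index layout and field names (`title`, `price`, `in_stock`) are assumptions for illustration:

```python
# Find in-stock "headphones" priced between 50 and 200.
# The range and term clauses sit in the bool query's filter context:
# they don't contribute to relevance scoring and are cacheable.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "headphones"}}  # scored full-text match
            ],
            "filter": [
                {"range": {"price": {"gte": 50, "lte": 200}}},  # BKD-tree range
                {"term": {"in_stock": True}},
            ],
        }
    }
}
```

With the official Python client, this dict would be passed to the search call against the products index. Moving the range out of `filter` and into `must` would still return the same hits, but forfeits caching and wastes time scoring a clause that is purely yes/no.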
C. Semi-Structured Data (Flexible Schema)
Definition: JSON objects with varying fields per document.
Different product categories have different attributes: laptops have RAM and screen size; t-shirts have size and material. This flexibility is powerful but dangerous: an uncontrolled schema can explode your cluster's memory, because every new attribute key adds a field mapping. The solution is the flattened type, which stores all nested key-value pairs under a single field instead of creating an individual mapping for each key.
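A sketch of such a mapping and two documents that use it, again as plain Python dicts (the field names `title` and `attributes` are illustrative):

```python
# Mapping: every key under "attributes" shares one flattened field,
# so adding new attribute keys never grows the cluster's mapping.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "attributes": {"type": "flattened"},
        }
    }
}

# Two documents with completely different attribute keys -- both valid,
# and neither adds a new field mapping:
laptop = {"title": "UltraBook 14", "attributes": {"ram_gb": 16, "screen_in": 14}}
tshirt = {"title": "Logo Tee", "attributes": {"size": "M", "material": "cotton"}}
```

The trade-off from the table applies here: flattened values are indexed as keywords, so `attributes.ram_gb` supports exact matches but not numeric range queries.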
| Approach | Pros | Cons |
|---|---|---|
| Dynamic mapping | Full query power | Mapping explosion |
| Flattened | Controlled size | No range queries on values |
| Nested object | Structured queries | Update cost |
D. Graph Data (Relationships)
Definition: Connections between entities.
Search doesn't exist in isolation: products belong to categories, categories have parents, users purchase products, brands make products. These relationships must be captured and made queryable. The challenge is that graph traversals are slow at query time, so we pre-compute the results into flat documents.
The Flattening Problem
In a graph database, entities are normalized and linked by IDs. For search, we must denormalize by embedding related data directly into each document. This trades storage for query speed.
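A sketch of this denormalization at index time, walking parent links once to embed each product's full ancestor path in its document (the data shapes here are assumptions):

```python
# Normalized source: each category references its parent by ID.
categories = {
    1: {"name": "Electronics", "parent": None},
    2: {"name": "Mobiles", "parent": 1},
    3: {"name": "Smartphones", "parent": 2},
}

def ancestor_path(category_id):
    """Walk parent pointers and return the root-to-leaf category names."""
    path = []
    while category_id is not None:
        node = categories[category_id]
        path.append(node["name"])
        category_id = node["parent"]
    return list(reversed(path))

# Denormalized search document: the graph traversal is paid once,
# at index time, instead of on every query.
doc = {"title": "Phone X", "category_path": ancestor_path(3)}
```

A query for "Electronics" now matches this document with a simple term lookup; no join or traversal happens at query time. The cost is that renaming a category forces reindexing every document under it.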
Solution: Index all ancestor categories on the product:

`category_path: ["Electronics", "Mobiles", "Smartphones"]`

E. Behavioral Data (Signals)
Definition: User interaction logs used for ranking.
Every click, scroll, and purchase generates a data point. These signals feed machine learning models that personalize and rank results. The challenge: behavioral data arrives as a firehose of raw events that must be aggregated into features before they're useful.
| Metric | E-commerce (1M DAU) | Google Scale |
|---|---|---|
| Clicks/day | 50M | 8.5B |
| Events/second | 600 | 100,000 |
| Storage/day | 50GB | 10TB |
Raw events must be aggregated into features via streaming pipelines — for example, a Flink job that keys the click stream by product ID and counts clicks per tumbling window.
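The core of that aggregation — key by product, count per tumbling window — can be sketched in plain Python. A real Flink job would express the same logic with `keyBy` and a window operator over an unbounded stream; this sketch just shows the bucketing, with illustrative event data:

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def aggregate_clicks(events):
    """Count clicks per (window_start, product_id) bucket.

    events: iterable of (timestamp_seconds, product_id) click events.
    Returns {(window_start, product_id): click_count}.
    """
    counts = defaultdict(int)
    for ts, product_id in events:
        window_start = ts - (ts % WINDOW_SECONDS)  # align to window boundary
        counts[(window_start, product_id)] += 1
    return dict(counts)

clicks = [(0, "p1"), (10, "p1"), (301, "p1"), (5, "p2")]
features = aggregate_clicks(clicks)
# p1 has 2 clicks in the [0, 300) window and 1 in [300, 600); p2 has 1 in [0, 300).
```

The resulting counts are the features that land in the feature store; the raw events themselves are too granular to rank with directly.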
The Three Data Views
The same underlying data must be represented differently depending on its purpose. Your relational database optimizes for consistency and writes; your search index optimizes for fast reads; your feature store optimizes for low-latency ML inference. These are complementary views, not competing systems.
View 1: Source of Truth (OLTP)
Purpose: Reliability, ACID compliance
Technology: PostgreSQL, MySQL, DynamoDB
Schema: Normalized (3NF)
View 2: Search Index
Purpose: Fast retrieval
Technology: Elasticsearch, Solr
Schema: Denormalized
View 3: Feature Store
Purpose: Fast ML feature lookup
Technology: Redis, Feast
Schema: Aggregated signals
Data Flow Architecture
How does data flow from your source database to all three views? The answer is event-driven architecture with Kafka at the center. Changes in your database are captured via CDC (Change Data Capture), streamed through Kafka, and processed by Flink/Spark to populate each view appropriately. This pattern ensures consistency without dangerous dual-writes.
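A minimal sketch of the fan-out step: one consumer loop reads change events (Debezium-style dicts with `c`/`u`/`d` operation codes) and updates each downstream view. In production the loop would be a Kafka consumer and the dicts would be Elasticsearch and Redis; everything here is illustrative:

```python
search_index = {}    # stands in for Elasticsearch
feature_store = {}   # stands in for Redis / Feast

def apply_change(event):
    """Apply one CDC event to every downstream view."""
    op, key, row = event["op"], event["key"], event.get("after")
    if op in ("c", "u"):          # create / update: upsert the new row image
        search_index[key] = row
        feature_store[key] = {"price": row["price"]}
    elif op == "d":               # delete: remove the row from every view
        search_index.pop(key, None)
        feature_store.pop(key, None)

for event in [
    {"op": "c", "key": "p1", "after": {"title": "Phone X", "price": 499}},
    {"op": "u", "key": "p1", "after": {"title": "Phone X", "price": 449}},
]:
    apply_change(event)
```

Because every view is derived from the same event stream, replaying the stream from the beginning rebuilds the search index and feature store from scratch — which is exactly the "rebuildable" property listed below.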
✓ Key Principles
- Never dual-write: Don't write to DB and Search simultaneously
- Single source of truth: DB is authoritative, Search is a view
- Event-driven: Changes flow through Kafka, not direct API calls
- Rebuildable: Search index can be recreated from DB + Feature Store
✗ Common Mistakes
- Writing directly to Elasticsearch from the application
- No CDC, leading to stale data
- Treating search as the source of truth
- No way to rebuild the index from scratch
Key Takeaways
The Five Data Types
Text (Unstructured), Structured (Exact), Semi-Structured (Flexible), Graph (Relationships), and Behavioral (Signals).
The Three Views
Source of Truth (OLTP), Search Index (Denormalized), and Feature Store (ML Signals). Don't try to make one DB do it all.
Event-Driven Integrity
Use CDC (Debezium) and Kafka to populate your views. Never dual-write from the application.