Systems Atlas

Chapter 4.2: Data Foundation

Types of Data in Search

Search engines don't operate on a single data type. Understanding the different data categories and their characteristics is essential for building scalable systems.


The Data Taxonomy

Every piece of data in a search system has distinct characteristics that affect how it should be stored, indexed, and queried. Misunderstanding these differences leads to performance disasters like storing prices as text (100x slower filtering) or treating behavioral signals as static documents. The following taxonomy covers the five fundamental data types you'll encounter.

A. Text Data (Unstructured)

Definition: Human-readable content with no fixed schema.

Text is the foundation of search: product titles, descriptions, reviews, articles. Unlike numbers, text has no inherent structure. Before it can be searched, it must be analyzed: broken into tokens, normalized to lowercase, and stemmed to root forms. This processing creates the inverted index.

{
  "title": "Apple MacBook Pro 16",
  "description": "The most powerful MacBook...",
  "reviews": ["Amazing laptop!", "Worth every penny"]
}
Property          | Value
Entropy           | High (unpredictable content)
Analysis Required | Yes (tokenization, stemming)
Query Type        | Full-text search, fuzzy matching
Storage Cost      | High (inverted index overhead)
Real-World Scale: Amazon's average product description is ~500 words; at 500M products, that is roughly 250 billion words to index, on the order of ~100 billion terms after tokenization and stop-word removal.
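The analysis step described above can be sketched in a few lines. This is a toy illustration, not a production analyzer like Lucene's (the regex tokenizer and the crude trailing-"s" stemmer are stand-ins):

```python
import re
from collections import defaultdict

def analyze(text):
    """Tokenize, lowercase, and crudely stem a string (toy stand-in for a real analyzer)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Strip a trailing "s" as a naive stemmer; real analyzers use Porter/Snowball stemming
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "Apple MacBook Pro 16", 2: "The most powerful MacBook ever"}
index = build_inverted_index(docs)
index["macbook"]  # both documents contain the term
```

The index maps terms to posting sets, which is exactly the structure that makes full-text lookup fast: one hash lookup instead of scanning every document.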

B. Structured Data (Exact Values)

Definition: Data with strict types and known value ranges.

Structured data includes prices, dates, ratings, stock counts, and coordinates. Unlike text, these values don't need tokenization: they're stored and queried exactly as-is. The key insight is to use BKD trees (not inverted indexes) for numeric ranges, which makes filtering roughly 100x faster.

{
  "price": 2499.99,
  "stock_count": 1523,
  "release_date": "2024-01-15",
  "rating": 4.7
}
Property          | Value
Entropy           | Low (bounded values)
Analysis Required | No (stored exactly)
Query Type        | Range, exact match, geo
Storage Cost      | Low (BKD trees)

Here's what a price filter query looks like. Notice how the range is in the filter context, not the query context:

// User query: "Laptops between $500 and $1500"
{
  "query": {
    "bool": {
      "must": { "match": { "category": "laptops" } },
      "filter": { "range": { "price": { "gte": 500, "lte": 1500 } } }
    }
  }
}
At 100M products, a price filter runs in ~5ms against a BKD tree versus ~500ms with string comparison (100x slower).
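Beyond raw speed, storing prices as text also gives wrong answers, because strings compare lexicographically. A quick demonstration:

```python
prices_as_text = ["99.50", "100.00", "2499.99"]
prices_as_numbers = [99.50, 100.00, 2499.99]

# Lexicographic order puts "99.50" last, because '9' > '2' > '1'
sorted(prices_as_text)     # ['100.00', '2499.99', '99.50']
sorted(prices_as_numbers)  # [99.5, 100.0, 2499.99]

# A "price <= 1500" filter over strings drops the $99.50 item,
# even though it is squarely in range
[p for p in prices_as_text if p <= "1500"]  # ['100.00']
```

This is why the field type matters as much as the index structure: a numeric type buys you both correct comparisons and the BKD tree.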

C. Semi-Structured Data (Flexible Schema)

Definition: JSON objects with varying fields per document.

Different product categories have different attributes: laptops have RAM and screen size, t-shirts have size and material. This flexibility is powerful but dangerous: an uncontrolled schema can explode your cluster's memory. The solution is the flattened type.

// Laptop specs
{
  "specs": {
    "screen": "16 inch",
    "ram": "32GB"
  }
}

// T-Shirt specs
{
  "specs": {
    "size": "Large",
    "color": "Navy"
  }
}
The Challenge: Without control, Elasticsearch creates a mapping entry for every unique field name it encounters. At 10,000+ fields, that becomes a mapping explosion and a slow cluster.

The solution is to use the flattened type, which stores all nested key-value pairs without creating individual mappings:

// Solution: Flattened type
{
  "mappings": {
    "properties": {
      "specs": { "type": "flattened" }
    }
  }
}
Approach        | Pros               | Cons
Dynamic mapping | Full query power   | Mapping explosion
Flattened       | Controlled size    | No range queries on values
Nested object   | Structured queries | Update cost
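The trade-offs in the table are easy to see concretely. Exact matches on a flattened field still work, because every value under `specs` is indexed as a keyword; range semantics are what you lose, since keywords compare as strings. A sketch (the request body is shown as a Python dict; field names are the ones from the example above):

```python
# Exact match works: values under a flattened field are indexed as keywords
term_query = {"query": {"term": {"specs.ram": "32GB"}}}

# Range semantics do not: keyword values compare lexicographically,
# so numeric attributes you need to range-filter belong in their own typed fields
"16GB" < "8GB"  # True as strings, which is exactly the wrong answer for RAM sizes
```

If a spec attribute ever needs range filtering or sorting, promote it out of the flattened blob into a typed field of its own.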

D. Graph Data (Relationships)

Definition: Connections between entities.

Search doesn't exist in isolation: products belong to categories, categories have parents, users purchase products, brands make products. These relationships must be captured and made queryable. The challenge is that graph traversals are slow at query time, so we pre-compute the results into flat documents.

User --[purchased]--> Product
Product --[belongs_to]--> Category
Product --[made_by]--> Brand
Brand --[headquartered_in]--> Country
User --[follows]--> Brand

The Flattening Problem

In a graph database, entities are normalized and linked by IDs. For search, we must denormalize by embedding related data directly into each document. This trades storage for query speed.

// Normalized (Graph DB)
// Product
{ "id": "prod_1", "brand_id": "brand_nike" }
// Brand
{ "id": "brand_nike", "name": "Nike" }

// Denormalized (Search Index)
{
  "id": "prod_1",
  "brand": { "name": "Nike" }
}
Flipkart Case Study: a category hierarchy 5 levels deep, with 15,000 categories.
Solution: index all ancestor categories on each product:
category_path: ["Electronics", "Mobiles", "Smartphones"]
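The ancestor-path trick is cheap to compute at index time. A minimal sketch, assuming the category tree is available as a child-to-parent map (the names are the illustrative subset from above):

```python
# Child -> parent map for the category tree (illustrative subset)
PARENT = {"Smartphones": "Mobiles", "Mobiles": "Electronics", "Electronics": None}

def category_path(leaf):
    """Walk parent pointers from a leaf category to the root, returning root-first order."""
    path = []
    while leaf is not None:
        path.append(leaf)
        leaf = PARENT[leaf]
    return path[::-1]

category_path("Smartphones")  # ['Electronics', 'Mobiles', 'Smartphones']
```

With the full path indexed on every product, "show everything under Electronics" becomes a simple term filter instead of a five-hop graph traversal at query time.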

E. Behavioral Data (Signals)

Definition: User interaction logs used for ranking.

Every click, scroll, and purchase generates a data point. These signals feed machine learning models that personalize and rank results. The challenge: behavioral data arrives as a firehose of raw events that must be aggregated into features before they're useful.

{
  "event": "click",
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "user_12345",
  "query": "running shoes",
  "product_id": "prod_nike_123",
  "position": 3,
  "dwell_time_ms": 45000
}
Metric        | E-commerce (1M DAU) | Google Scale
Clicks/day    | 50M                 | 8.5B
Events/second | 600                 | 100,000
Storage/day   | 50GB                | 10TB

Raw events must be aggregated into features via streaming pipelines. Here's a sketch of a Flink job that counts clicks per product in 15-minute windows:

# Flink streaming job (PyFlink DataStream API sketch; ClickCountAggregate is an
# AggregateFunction that emits a count per key per window, implementation elided)
from pyflink.common.time import Time
from pyflink.datastream.window import TumblingProcessingTimeWindows

def aggregate_signals(events_stream):
    return events_stream \
        .key_by(lambda e: e["product_id"]) \
        .window(TumblingProcessingTimeWindows.of(Time.minutes(15))) \
        .aggregate(ClickCountAggregate())

# Output (stored in Feature Store)
# { "product_id": "prod_nike_123", "clicks_15m": 47 }
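Once aggregated, these features are consumed at ranking time. A sketch of the lookup-and-boost step, with a plain dict standing in for the Redis/Feast store; the log scaling and the 0.1 weight are illustrative choices, not a prescribed formula:

```python
import math

# Stand-in for a feature store lookup keyed by product_id
feature_store = {"prod_nike_123": {"clicks_15m": 47}}

def rank_score(base_score, product_id):
    """Blend the retrieval score with a behavioral signal (log-scaled to damp outliers)."""
    clicks = feature_store.get(product_id, {}).get("clicks_15m", 0)
    return base_score + 0.1 * math.log1p(clicks)

rank_score(12.5, "prod_nike_123")  # boosted above the base retrieval score of 12.5
```

Note the defensive default of 0 clicks: a cold product with no behavioral data must still get a valid score, which is why feature lookups at inference time need sensible fallbacks.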

The Three Data Views

The same underlying data must be represented differently depending on its purpose. Your relational database optimizes for consistency and writes; your search index optimizes for fast reads; your feature store optimizes for low-latency ML inference. These are complementary views, not competing systems.

View 1: Source of Truth (OLTP)

Purpose: Reliability, ACID compliance

Technology: PostgreSQL, MySQL, DynamoDB

Schema: Normalized (3NF)

CREATE TABLE products (
  id UUID PRIMARY KEY,
  brand_id UUID REFERENCES brands(id)  -- illustrative FK; brand lives in its own table
);

View 2: Search Index

Purpose: Fast retrieval

Technology: Elasticsearch, Solr

Schema: Denormalized

{
"title": "Nike Air Max",
"brand": "Nike"
}

View 3: Feature Store

Purpose: Fast ML feature lookup

Technology: Redis, Feast

Schema: Aggregated signals

class ProductFeatures:
entities = ["product_id"]
features = [...]

Data Flow Architecture

How does data flow from your source database to all three views? The answer is event-driven architecture with Kafka at the center. Changes in your database are captured via CDC (Change Data Capture), streamed through Kafka, and processed by Flink/Spark to populate each view appropriately. This pattern ensures consistency without dangerous dual-writes.

Source DB (PostgreSQL) --[CDC: Debezium]--> Apache Kafka
Click Stream (frontend events) --[raw events]--> Apache Kafka
Apache Kafka --> Stream Processor (Flink / Spark Streaming)
Stream Processor --> Search Index (Elasticsearch)
Stream Processor --> Feature Store (Redis / Feast)
Stream Processor --> Data Lake (S3 / GCS)
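The consumer side of this pipeline boils down to applying one change event at a time to a downstream view. A sketch, where the envelope shape (`op`, `before`, `after`) follows Debezium's convention and a plain dict stands in for Elasticsearch:

```python
def apply_change(event, search_index):
    """Apply one CDC event to a downstream view (dict stands in for the search index)."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete (Debezium convention)
    if op in ("c", "u"):
        doc = event["after"]
        search_index[doc["id"]] = doc
    elif op == "d":
        search_index.pop(event["before"]["id"], None)

index = {}
apply_change({"op": "c", "after": {"id": "prod_1", "title": "Nike Air Max"}}, index)
apply_change({"op": "d", "before": {"id": "prod_1"}}, index)
# index is empty again: the view tracks the source, and replaying the event log
# from the beginning rebuilds it from scratch
```

Because the view is a pure function of the event stream, the "rebuildable" principle falls out for free: drop the index, replay the log, and you get the same state back.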

✓ Key Principles

  • Never dual-write: Don't write to DB and Search simultaneously
  • Single source of truth: DB is authoritative, Search is a view
  • Event-driven: Changes flow through Kafka, not direct API calls
  • Rebuildable: Search index can be recreated from DB + Feature Store

✗ Common Mistakes

  • Writing directly to Elasticsearch from application
  • No CDC, leading to stale data
  • Treating search as source of truth
  • No way to rebuild index from scratch

Key Takeaways

01. The Five Data Types
Text (Unstructured), Structured (Exact), Semi-Structured (Flexible), Graph (Relationships), and Behavioral (Signals).

02. The Three Views
Source of Truth (OLTP), Search Index (Denormalized), and Feature Store (ML Signals). Don't try to make one DB do it all.

03. Event-Driven Integrity
Use CDC (Debezium) and Kafka to populate your views. Never dual-write from the application.