Systems Atlas

Chapter 4.2: Data Foundation

Types of Data in Search

Search engines don't operate on a single data type. Understanding the different data categories and their characteristics is essential for building scalable systems.


The Data Taxonomy

Every piece of data in a search system has distinct characteristics that affect how it should be stored, indexed, and queried. Misunderstanding these differences leads to performance disasters like storing prices as text (100x slower filtering) or treating behavioral signals as static documents. The following taxonomy covers the five fundamental data types you'll encounter.

A. Text Data (Unstructured)

Definition: Human-readable content with no fixed schema.

Text is the foundation of search: product titles, descriptions, reviews, articles. Unlike numbers, text has no inherent structure. Before it can be searched, it must be analyzed: broken into tokens, normalized to lowercase, and stemmed to root forms. This processing creates the inverted index.

{
  "title": "Apple MacBook Pro 16",
  "description": "The most powerful MacBook...",
  "reviews": ["Amazing laptop!", "Worth every penny"]
}
Property          | Value
Entropy           | High (unpredictable content)
Analysis Required | Yes (tokenization, stemming)
Query Type        | Full-text search, fuzzy matching
Storage Cost      | High (inverted index overhead)
Real-World Scale: Amazon's average product description is ~500 words; at 500M products, that is roughly 250 billion words to index, on the order of ~100 billion terms after tokenization and stop-word removal.
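The analysis step described above can be sketched in a few lines. This is a toy illustration, not a production analyzer like Lucene's (the regex tokenizer and the crude trailing-"s" stemmer are stand-ins):

```python
import re
from collections import defaultdict

def analyze(text):
    """Tokenize, lowercase, and crudely stem a string (toy stand-in for a real analyzer)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Strip a trailing "s" as a naive stemmer; real analyzers use Porter/Snowball stemming
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "Apple MacBook Pro 16", 2: "The most powerful MacBook ever"}
index = build_inverted_index(docs)
index["macbook"]  # both documents contain the term
```

The index maps terms to posting sets, which is exactly the structure that makes full-text lookup fast: one hash lookup instead of scanning every document.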

B. Structured Data (Exact Values)

Definition: Data with strict types and known value ranges.

Structured data includes prices, dates, ratings, stock counts, and coordinates. Unlike text, these values don't need tokenization: they're stored and queried exactly as-is. The key insight is to use BKD trees (not inverted indexes) for numeric ranges, which makes filtering roughly 100x faster.

{
  "price": 2499.99,
  "stock_count": 1523,
  "release_date": "2024-01-15",
  "rating": 4.7
}
Property          | Value
Entropy           | Low (bounded values)
Analysis Required | No (stored exactly)
Query Type        | Range, exact match, geo
Storage Cost      | Low (BKD trees)

Here's what a price filter query looks like. Notice how the range is in the filter context, not the query context:

// User query: "Laptops between $500 and $1500"
{
  "query": {
    "bool": {
      "must": { "match": { "category": "laptops" } },
      "filter": { "range": { "price": { "gte": 500, "lte": 1500 } } }
    }
  }
}
At 100M products, a price filter runs in ~5ms against a BKD tree versus ~500ms with string comparison (100x slower).
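Beyond raw speed, storing prices as text also gives wrong answers, because strings compare lexicographically. A quick demonstration:

```python
prices_as_text = ["99.50", "100.00", "2499.99"]
prices_as_numbers = [99.50, 100.00, 2499.99]

# Lexicographic order puts "99.50" last, because '9' > '2' > '1'
sorted(prices_as_text)     # ['100.00', '2499.99', '99.50']
sorted(prices_as_numbers)  # [99.5, 100.0, 2499.99]

# A "price <= 1500" filter over strings drops the $99.50 item,
# even though it is squarely in range
[p for p in prices_as_text if p <= "1500"]  # ['100.00']
```

This is why the field type matters as much as the index structure: a numeric type buys you both correct comparisons and the BKD tree.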

C. Semi-Structured Data (Flexible Schema)

Definition: JSON objects with varying fields per document.

Different product categories have different attributes: laptops have RAM and screen size, t-shirts have size and material. This flexibility is powerful but dangerous: an uncontrolled schema can explode your cluster's memory. The solution is the flattened type.

// Laptop specs
{
  "specs": {
    "screen": "16 inch",
    "ram": "32GB"
  }
}

// T-Shirt specs
{
  "specs": {
    "size": "Large",
    "color": "Navy"
  }
}
The Challenge: Without control, Elasticsearch creates a mapping entry for every unique field name it encounters. At 10,000+ fields, that becomes a mapping explosion and a slow cluster.

The solution is to use the flattened type, which stores all nested key-value pairs without creating individual mappings:

// Solution: Flattened type
{
  "mappings": {
    "properties": {
      "specs": { "type": "flattened" }
    }
  }
}
Approach        | Pros               | Cons
Dynamic mapping | Full query power   | Mapping explosion
Flattened       | Controlled size    | No range queries on values
Nested object   | Structured queries | Update cost
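The trade-offs in the table are easy to see concretely. Exact matches on a flattened field still work, because every value under `specs` is indexed as a keyword; range semantics are what you lose, since keywords compare as strings. A sketch (the request body is shown as a Python dict; field names are the ones from the example above):

```python
# Exact match works: values under a flattened field are indexed as keywords
term_query = {"query": {"term": {"specs.ram": "32GB"}}}

# Range semantics do not: keyword values compare lexicographically,
# so numeric attributes you need to range-filter belong in their own typed fields
"16GB" < "8GB"  # True as strings, which is exactly the wrong answer for RAM sizes
```

If a spec attribute ever needs range filtering or sorting, promote it out of the flattened blob into a typed field of its own.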

D. Graph Data (Relationships)

Definition: Connections between entities.

Search doesn't exist in isolation: products belong to categories, categories have parents, users purchase products, brands make products. These relationships must be captured and made queryable. The challenge is that graph traversals are slow at query time, so we pre-compute the results into flat documents.

User --[purchased]--> Product
Product --[belongs_to]--> Category
Product --[made_by]--> Brand
Brand --[headquartered_in]--> Country
User --[follows]--> Brand

The Flattening Problem

In a graph database, entities are normalized and linked by IDs. For search, we must denormalize by embedding related data directly into each document. This trades storage for query speed.

// Normalized (Graph DB)
// Product
{ "id": "prod_1", "brand_id": "brand_nike" }
// Brand
{ "id": "brand_nike", "name": "Nike" }

// Denormalized (Search Index)
{
  "id": "prod_1",
  "brand": { "name": "Nike" }
}
Flipkart Case Study: a category hierarchy 5 levels deep, with 15,000 categories.
Solution: index all ancestor categories on each product:
category_path: ["Electronics", "Mobiles", "Smartphones"]
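The ancestor-path trick is cheap to compute at index time. A minimal sketch, assuming the category tree is available as a child-to-parent map (the names are the illustrative subset from above):

```python
# Child -> parent map for the category tree (illustrative subset)
PARENT = {"Smartphones": "Mobiles", "Mobiles": "Electronics", "Electronics": None}

def category_path(leaf):
    """Walk parent pointers from a leaf category to the root, returning root-first order."""
    path = []
    while leaf is not None:
        path.append(leaf)
        leaf = PARENT[leaf]
    return path[::-1]

category_path("Smartphones")  # ['Electronics', 'Mobiles', 'Smartphones']
```

With the full path indexed on every product, "show everything under Electronics" becomes a simple term filter instead of a five-hop graph traversal at query time.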

E. Behavioral Data (Signals)

Definition: User interaction logs used for ranking.

Every click, scroll, and purchase generates a data point. These signals feed machine learning models that personalize and rank results. The challenge: behavioral data arrives as a firehose of raw events that must be aggregated into features before they're useful.

{
  "event": "click",
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "user_12345",
  "query": "running shoes",
  "product_id": "prod_nike_123",
  "position": 3,
  "dwell_time_ms": 45000
}
Metric        | E-commerce (1M DAU) | Google Scale
Clicks/day    | 50M                 | 8.5B
Events/second | 600                 | 100,000
Storage/day   | 50GB                | 10TB

Raw events must be aggregated into features via streaming pipelines. Here's a sketch of a Flink job that counts clicks per product in 15-minute windows:

# Flink streaming job (PyFlink DataStream API sketch; ClickCountAggregate is an
# AggregateFunction that emits a count per key per window, implementation elided)
from pyflink.common.time import Time
from pyflink.datastream.window import TumblingProcessingTimeWindows

def aggregate_signals(events_stream):
    return events_stream \
        .key_by(lambda e: e["product_id"]) \
        .window(TumblingProcessingTimeWindows.of(Time.minutes(15))) \
        .aggregate(ClickCountAggregate())

# Output (stored in Feature Store)
# { "product_id": "prod_nike_123", "clicks_15m": 47 }
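Once aggregated, these features are consumed at ranking time. A sketch of the lookup-and-boost step, with a plain dict standing in for the Redis/Feast store; the log scaling and the 0.1 weight are illustrative choices, not a prescribed formula:

```python
import math

# Stand-in for a feature store lookup keyed by product_id
feature_store = {"prod_nike_123": {"clicks_15m": 47}}

def rank_score(base_score, product_id):
    """Blend the retrieval score with a behavioral signal (log-scaled to damp outliers)."""
    clicks = feature_store.get(product_id, {}).get("clicks_15m", 0)
    return base_score + 0.1 * math.log1p(clicks)

rank_score(12.5, "prod_nike_123")  # boosted above the base retrieval score of 12.5
```

Note the defensive default of 0 clicks: a cold product with no behavioral data must still get a valid score, which is why feature lookups at inference time need sensible fallbacks.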

The Three Data Views

The same underlying data must be represented differently depending on its purpose. Your relational database optimizes for consistency and writes; your search index optimizes for fast reads; your feature store optimizes for low-latency ML inference. These are complementary views, not competing systems.

View 1: Source of Truth (OLTP)

Purpose: Reliability, ACID compliance

Technology: PostgreSQL, MySQL, DynamoDB

Schema: Normalized (3NF)

CREATE TABLE products (
  id UUID PRIMARY KEY,
  brand_id UUID REFERENCES brands(id)  -- illustrative FK; brand lives in its own table
);

View 2: Search Index

Purpose: Fast retrieval

Technology: Elasticsearch, Solr

Schema: Denormalized

{
"title": "Nike Air Max",
"brand": "Nike"
}

View 3: Feature Store

Purpose: Fast ML feature lookup

Technology: Redis, Feast

Schema: Aggregated signals

class ProductFeatures:
entities = ["product_id"]
features = [...]

Data Flow Architecture

How does data flow from your source database to all three views? The answer is event-driven architecture with Kafka at the center. Changes in your database are captured via CDC (Change Data Capture), streamed through Kafka, and processed by Flink/Spark to populate each view appropriately. This pattern ensures consistency without dangerous dual-writes.

Source DB (PostgreSQL) --[CDC: Debezium]--> Apache Kafka
Click Stream (frontend events) --[raw events]--> Apache Kafka
Apache Kafka --> Stream Processor (Flink / Spark Streaming)
Stream Processor --> Search Index (Elasticsearch)
Stream Processor --> Feature Store (Redis / Feast)
Stream Processor --> Data Lake (S3 / GCS)
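The consumer side of this pipeline boils down to applying one change event at a time to a downstream view. A sketch, where the envelope shape (`op`, `before`, `after`) follows Debezium's convention and a plain dict stands in for Elasticsearch:

```python
def apply_change(event, search_index):
    """Apply one CDC event to a downstream view (dict stands in for the search index)."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete (Debezium convention)
    if op in ("c", "u"):
        doc = event["after"]
        search_index[doc["id"]] = doc
    elif op == "d":
        search_index.pop(event["before"]["id"], None)

index = {}
apply_change({"op": "c", "after": {"id": "prod_1", "title": "Nike Air Max"}}, index)
apply_change({"op": "d", "before": {"id": "prod_1"}}, index)
# index is empty again: the view tracks the source, and replaying the event log
# from the beginning rebuilds it from scratch
```

Because the view is a pure function of the event stream, the "rebuildable" principle falls out for free: drop the index, replay the log, and you get the same state back.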

✓ Key Principles

  • Never dual-write: Don't write to DB and Search simultaneously
  • Single source of truth: DB is authoritative, Search is a view
  • Event-driven: Changes flow through Kafka, not direct API calls
  • Rebuildable: Search index can be recreated from DB + Feature Store

✗ Common Mistakes

  • Writing directly to Elasticsearch from application
  • No CDC, leading to stale data
  • Treating search as source of truth
  • No way to rebuild index from scratch

Key Takeaways

01. The Five Data Types
Text (Unstructured), Structured (Exact), Semi-Structured (Flexible), Graph (Relationships), and Behavioral (Signals).

02. The Three Views
Source of Truth (OLTP), Search Index (Denormalized), and Feature Store (ML Signals). Don't try to make one DB do it all.

03. Event-Driven Integrity
Use CDC (Debezium) and Kafka to populate your views. Never dual-write from the application.