Systems Atlas

Chapter 4.4: Data Foundation

Text vs Structured Data

Understanding when to use text analysis versus exact matching is fundamental to search performance and correctness. Treating a SKU like a sentence is the most common "silent killer" in search applications.


The Fundamental Difference

Text Fields (Analyzed)

Designed for Human Language. The engine breaks strings into "tokens" to support partial matches.

// Input
"Apple MacBook Pro"
// Analyzer Output (Tokens)
["apple", "macbook", "pro"]
// Query: "macbook"
MATCH ✅

Structured Fields (Exact)

Designed for Machine Logic. The engine stores the value exactly as is.

// Input
"Apple"
// Stored Value
"Apple"
// Query: "apple"
NO MATCH ❌ (Case sensitive)
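The difference is easier to see in code. A minimal sketch (plain Python, not the real Lucene analyzer) of how an analyzed text field matches while a keyword field does not:

```python
import re

def analyze(value):
    """Standard-analyzer-style tokenization: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]

stored_text = analyze("Apple MacBook Pro")  # ["apple", "macbook", "pro"]
stored_keyword = "Apple"                    # stored byte-for-byte, untouched

# Text field: the query is analyzed too, then matched token by token.
print("macbook" in stored_text)   # True  -> MATCH
# Keyword field: exact byte comparison, so case matters.
print("apple" == stored_keyword)  # False -> NO MATCH
```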

Case Sensitivity & Normalization

keyword fields are byte-level exact. "Apple" ≠ "apple". To fix this without analyzing, use a Normalizer.

// Mapping
"brand": {
  "type": "keyword",
  "normalizer": "lowercase"
}
Result: "Apple" stores as "apple"

Analyzer vs Normalizer

  • Analyzer (Text): Tokenizes + Filters. Breaks string into multiple terms.
  • Normalizer (Keyword): Whole string transformation (lowercase, asciifolding). Keep as single term.
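The two pipelines can be sketched in a few lines of Python (an illustration of the contract, not Lucene's actual implementation):

```python
import re

def analyzer(value):
    # Analyzer (text): tokenize + filter -> MANY terms
    return [t for t in re.split(r"\W+", value.lower()) if t]

def normalizer(value):
    # Normalizer (keyword): whole-string transform -> exactly ONE term
    return value.lower()

print(analyzer("Apple MacBook"))    # ['apple', 'macbook']  (two terms)
print(normalizer("Apple MacBook"))  # 'apple macbook'       (one term)
```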

Internal Data Structures

Under the hood, Lucene uses completely different data structures for these two types. Understanding this explains why "Range Queries on Strings" are slow and why "Full Text Search on Numbers" is wrong.

Inverted Index (Text)

Term      Doc IDs
"brown"   [ 1, 2 ]
"fox"     [ 1 ]
"quick"   [ 1, 3 ]

Why it's fast: near-constant term lookup (the term dictionary is an FST). You ask for "fox", it returns Doc 1 immediately.
Cost: High storage for "Posting Lists" (50GB+ for 1B docs).
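A toy inverted index matching the table above can be built in a few lines (a sketch of the idea, not Lucene's on-disk format):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs containing it (its posting list)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "quick brown fox", 2: "brown bear", 3: "quick rabbit"}
index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["fox"])    # [1]
print(index["quick"])  # [1, 3]
```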

BKD Tree (Numeric/Geo)

        [ All Values ]
        /            \
   < 1000          >= 1000

Why it's fast: O(log N) numeric range queries. It skips entire subtrees of data.
Optimization: Hundreds of times faster than comparing string "100" vs "200".

Why Text Range Queries Are Slow

Lexicographical sort is not numeric sort.
"100" > "2" is FALSE in string world ("1" < "2").

Terms: "1", "10", "100", "2", "20"

Mechanism: To evaluate a range, Lucene must scan the Term Dictionary term by term.
BKD (Numeric): Skips entire blocks of data using the tree index.
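Both problems can be demonstrated directly (Python sketch; `bisect` over a sorted list stands in for the BKD tree's ability to skip whole blocks):

```python
import bisect

# The term dictionary is lexicographically sorted:
terms = sorted(["1", "10", "100", "2", "20"])
print(terms)  # ['1', '10', '100', '2', '20']

# String range "value > 100": every term must be tested (full scan),
# and lexicographic comparison gives the wrong answer anyway:
print([t for t in terms if t > "100"])  # ['2', '20']  -- numerically wrong!

# Numeric values in a sorted structure: one bisection jumps past
# everything <= 100, the way a BKD tree skips entire subtrees.
values = sorted([1, 10, 100, 2, 20])  # [1, 2, 10, 20, 100]
print(values[bisect.bisect_right(values, 100):])  # []  -- correct: nothing > 100
```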

Doc Values vs Inverted Index

Search uses the Inverted Index (Term → Docs). Sorting and Aggregations need the reverse (Doc → Terms), called Doc Values.

Structure        Used For                On Disk?   In Heap?
Inverted Index   Search / Filtering      Yes        No (FST only)
Doc Values       Sorting / Aggregations  Yes        Memory Mapped (OS Cache)

* text fields don't have Doc Values by default (too heavy). That's why you can't sort on them.
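A minimal sketch of the two access patterns (illustrative Python; real doc values are a compressed columnar format on disk):

```python
# Search uses the inverted index (term -> docs); sorting and
# aggregations use doc values (doc -> value), a columnar structure.
inverted = {"laptop": [1, 2, 3]}              # which docs match "laptop"?
price_doc_values = {1: 10.0, 2: 2.0, 3: 5.0}  # what is each doc's price?

hits = inverted["laptop"]

# Sorting hits by price reads the doc -> value column:
print(sorted(hits, key=lambda d: price_doc_values[d]))  # [2, 3, 1]

# An aggregation (avg price) is a scan over the same column:
print(sum(price_doc_values[d] for d in hits) / len(hits))
```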

Common Mistakes

1. The "ID as Text" Trap

BAD: SKU as Text
"sku": { "type": "text" }
// Document: "ABC-123"
// Tokens: ["abc", "123"]
Query: "ABC-999"
MATCHES! (Shares "abc")

Result: False Positives on IDs.

GOOD: SKU as Keyword
"sku": { "type": "keyword" }
// Document: "ABC-123"
// Token: "ABC-123" (Exact)
Query: "ABC-999"
NO MATCH (Different)

Result: Exact ID retrieval.

2. The "Numeric String" Trap

BAD: Price as String
// Values: "10.00", "2.00"
Sort: Ascending
1. "10.00"
2. "2.00"

Result: "1" comes before "2". Sorting broken.

GOOD: Price as Float/Long
// Values: 10.00, 2.00
Sort: Ascending
1. 2.00
2. 10.00

Result: Correct numeric sorting.
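The trap reproduces in one line of Python:

```python
prices_as_strings = ["10.00", "2.00"]
print(sorted(prices_as_strings))  # ['10.00', '2.00']  -- "1" < "2", broken

prices_as_numbers = [10.00, 2.00]
print(sorted(prices_as_numbers))  # [2.0, 10.0]        -- correct
```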

The "Everything as Text" Anti-Pattern

Teams often index everything as text "just in case" they need search.

Consequences
  • Sorting breaks (lexicographical)
  • Aggregations are slow/impossible
  • Heap explodes (Field Data)
Relevance Pollution

Matching on low-value fields (like UUIDs or status codes) dilutes the score of actual matches in Title/Description.

High Cardinality Warning (User IDs)

Aggregating on a high-cardinality keyword field (like user_id with 100M values) forces Lucene to build Global Ordinals.

Memory Cost = 100M values × 8 bytes ≈ 800MB Heap
Build Time = 30 seconds (Initial Query Latency Spike)
Fix: Use execution_hint: "map" or Composite Aggregations for high-cardinality fields.

Cardinality        Examples                   Safe Operations
Low (< 1k)         Status, Category, Country  Aggregations OK
Medium (1k - 1M)   Brand, Tags, Author        Aggs with care
High (> 1M)        User ID, Session ID, IP    Avoid Terms Aggs
Rule of thumb: If unique values ≈ document count, treat as toxic for aggregations.
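A hypothetical request body showing the fix (`user_id` is an illustrative field name; `execution_hint` is a real terms-aggregation parameter):

```json
"aggs": {
  "by_user": {
    "terms": {
      "field": "user_id",
      "execution_hint": "map"  // build a per-request hash map,
                               // skip Global Ordinals entirely
    }
  }
}
```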

Performance Benchmarks

The choice of data type impacts query speed by orders of magnitude. Data based on 1M Documents.

Query Type         Field Type   Latency   Mechanism
Exact Match        keyword      2ms       Term Dictionary (FST) Lookup
Range (> 100)      long         3ms       BKD Tree Traversal
Range (> "100")    keyword      50ms      Full Index Scan
Wildcard (*abc*)   text         500ms+    DFA Pattern Match

Filter Context vs Query Context

Filter Context (No Score)
  • Binary Yes/No
  • Cacheable (Fast)
  • Use for: Status, Brand, Price Range
"filter": { "term": { "status": "active" } }
Query Context (Scoring)
  • Calculates Relevance Score
  • Not Cacheable (Slower)
  • Use for: Full-text search
"must": { "match": { "title": "iphone" } }
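The two contexts combine in a single bool query (illustrative field names):

```json
"query": {
  "bool": {
    "must": [
      { "match": { "title": "iphone" } }       // scored, drives ranking
    ],
    "filter": [
      { "term":  { "status": "active" } },     // cached, no score
      { "range": { "price": { "lte": 500 } } }
    ]
  }
}
```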

Decision Matrix

Use TEXT (Analyzed) when:

  • Human language (Descriptions, Reviews)
  • Fuzzy matching / Spell correction needed
  • Stemming required (running → run)
  • Relevance scoring is the priority

Use KEYWORD (Exact) when:

  • IDs, SKUs, Codes, Emails
  • Enums (Status: "Active", "Pending")
  • Aggregations (Facets) needed
  • Exact filtering required

Use NUMERIC/DATE when:

  • Range queries (Price, Age, Date)
  • Sorting by value
  • Math aggregations (Sum, Avg)
🏆 Best Practice: The Multi-Field Pattern
"title": {
  "type": "text",      // 1. Search (Fuzzy)
  "fields": {
    "raw": { 
      "type": "keyword" // 2. Sort/Aggs
    }
  }
},
"brand": {
  "type": "keyword",   // 1. Filter (Exact)
  "fields": {
    "search": { 
      "type": "text"    // 2. Search
    }
  }
}

Most fields need to be capable of both. Don't choose one. Use multi-fields to get the best of both worlds.
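With the mapping above, queries can pick the right view of the same field (illustrative request fragment):

```json
// Search the analyzed field, sort on its keyword sub-field:
"query": { "match": { "title": "macbook pro" } },
"sort":  [ { "title.raw": "asc" } ],
// Facet on the exact brand value:
"aggs":  { "brands": { "terms": { "field": "brand" } } }
```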

When Multi-Fields Explode Index Size

Every multi-field creates a duplicated inverted index (and doc values) on disk. 10 text fields × 3 sub-fields each = 40 indexed fields (10 parents + 30 sub-fields).

Rule: Only multi-field what you actually sort or aggregate on.

Date & Time Gotchas

The Problem

Timezones are hard. "2024-01-01" implies UTC in Elasticsearch, but your user might be in EST. Range queries often miss the "last day" due to millisecond precision.

The Fix

  • Always store dates in UTC internally (Epoch millis or ISO-8601).
  • Normalize query timezones at the application layer.
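The normalization step, sketched with Python's standard library (`zoneinfo` requires Python 3.9+; the timezone is an example):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# User submits "2024-01-01" meaning midnight in New York (EST):
local = datetime(2024, 1, 1, tzinfo=ZoneInfo("America/New_York"))

# Normalize to UTC before the value reaches the search engine:
utc = local.astimezone(timezone.utc)
print(utc.isoformat())              # 2024-01-01T05:00:00+00:00
print(int(utc.timestamp() * 1000))  # epoch millis: 1704085200000
```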

Production Mapping Rules

NEVER
  • Use text for IDs (UUID, SKU)
  • Use keyword for full paragraphs
  • Use string for numbers or dates
ALWAYS
  • Use keyword for Filters/Aggs
  • Use numeric for Ranges
  • Use text for Relevance/Scoring

Key Takeaways

01

Text is for Search

Use 'text' fields for human language, fuzzy matching, and relevance scoring. Use 'keyword' for IDs, tags, and exact filters.

02

Structured is for Filters

Use Numeric/Date types (BKD Trees) for ranges and sorting. Never store numbers as strings.

03

The ID Trap

Never index SKUs or UUIDs as 'text'. It causes false positive matches on partial tokens.

04

Type = Performance

Wrong field types silently destroy latency and memory. Measure twice, index once.

05

No Easy Fix

Mapping mistakes usually require a full reindex. Get the types right before the data arrives.