Chapter 4.4: Data Foundation
Text vs Structured Data
Understanding when to use text analysis versus exact matching is fundamental to search performance and correctness. Treating a SKU like a sentence is the most common variety of "silent killer" in search applications.
The Fundamental Difference
- Text: Designed for human language. The engine breaks strings into "tokens" to support partial matches.
- Keyword: Designed for machine logic. The engine stores the value exactly as-is.
Case Sensitivity & Normalization
keyword fields are byte-level exact. "Apple" ≠ "apple". To fix this without analyzing, use a Normalizer.
```json
"brand": {
  "type": "keyword",
  "normalizer": "lowercase"
}
```
Analyzer vs Normalizer
- Analyzer (Text): Tokenizes + Filters. Breaks string into multiple terms.
- Normalizer (Keyword): Whole string transformation (lowercase, asciifolding). Keep as single term.
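The difference can be sketched in a few lines of Python. This is a simplified model, not Lucene's actual implementation: the analyzer emits multiple lowercase tokens, while the normalizer emits a single transformed term.

```python
# Toy models of an analyzer vs a normalizer (not Lucene's real code).

def analyze(value: str) -> list[str]:
    """Analyzer (text): tokenize on whitespace, then lowercase each token."""
    return [token.lower() for token in value.split()]

def normalize(value: str) -> str:
    """Normalizer (keyword): transform the whole string as one term."""
    return value.lower()

print(analyze("Apple MacBook Pro"))    # ['apple', 'macbook', 'pro'] -> partial matches possible
print(normalize("Apple MacBook Pro"))  # 'apple macbook pro' -> one exact term
```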
Internal Data Structures
Under the hood, Lucene uses completely different data structures for these two types. Understanding this explains why "Range Queries on Strings" are slow and why "Full Text Search on Numbers" is wrong.
Inverted Index (Text)
Why it's fast: O(1) lookup. You ask for "fox", it gives you Doc 1 immediately.
Cost: High storage for "Posting Lists" (50GB+ for 1B docs).
BKD Tree (Numeric/Geo)
Why it's fast: O(log N) numeric range queries. It skips entire subtrees of data.
Optimization: Hundreds of times faster than comparing string "100" vs "200".
Why Text Range Queries Are Slow
Lexicographic sort is not numeric sort: as strings, "100" > "2" is FALSE, because the comparison stops at the first characters and "1" < "2".
- Text mechanism: to evaluate a range, Lucene must scan the term dictionary term by term.
- BKD (numeric): skips entire blocks of data using the tree index.
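The lexicographic trap is easy to demonstrate in Python (illustrative only): a string "range" over numeric values returns wrong results.

```python
terms = ["100", "2", "30", "150"]

# Lexicographic "range" over strings: which terms are > "100"?
string_range = sorted(t for t in terms if t > "100")
print(string_range)   # ['150', '2', '30'] -- "2" and "30" are wrongly included

# Numeric range over the same values:
numeric_range = sorted(int(t) for t in terms if int(t) > 100)
print(numeric_range)  # [150]
```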
Doc Values vs Inverted Index
Search uses the Inverted Index (Term → Docs). Sorting and Aggregations need the reverse (Doc → Terms), called Doc Values.
| Structure | Used For | On Disk? | In Heap? |
|---|---|---|---|
| Inverted Index | Search / Filtering | Yes | No (FST only) |
| Doc Values | Sorting / Aggregations | Yes | Memory Mapped (OS Cache) |
* text fields don't support Doc Values (storing every analyzed token per document would be too heavy). That's why you can't sort on them without enabling heap-hungry field data.
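The two directions can be sketched as plain dictionaries. This is a toy model; Lucene's on-disk encodings are far more compact, but the access patterns are the same.

```python
from collections import Counter

# Toy model: inverted index maps term -> doc IDs (good for search),
# doc values map doc ID -> term (good for sorting/aggregations).
docs = {1: "nike", 2: "adidas", 3: "nike"}

inverted_index = {}            # term -> list of doc IDs
for doc_id, term in docs.items():
    inverted_index.setdefault(term, []).append(doc_id)

doc_values = docs              # doc ID -> term (columnar in real Lucene)

# Search: which docs contain "nike"? One lookup in the inverted index.
print(inverted_index["nike"])           # [1, 3]

# Aggregation: count docs per brand by walking doc values.
print(Counter(doc_values.values()))     # Counter({'nike': 2, 'adidas': 1})
```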
Common Mistakes
1. The "ID as Text" Trap
- Mapped as text: the analyzer splits the ID into tokens, so partial fragments match. Result: false positives on IDs.
- Mapped as keyword: the ID is stored as a single term. Result: exact ID retrieval.
2. The "Numeric String" Trap
- Mapped as keyword (strings "2.00", "10.00"): lexicographic order applies, so "10.00" sorts before "2.00" because "1" < "2". Result: sorting broken.
- Mapped as double (2.00, 10.00): Result: correct numeric sorting.
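The broken ordering is easy to reproduce; a minimal illustration in Python:

```python
prices_as_strings = ["2.00", "10.00", "1.00"]

string_sorted = sorted(prices_as_strings)                  # lexicographic
numeric_sorted = sorted(float(p) for p in prices_as_strings)

print(string_sorted)   # ['1.00', '10.00', '2.00'] -- "10.00" before "2.00": broken
print(numeric_sorted)  # [1.0, 2.0, 10.0] -- correct
```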
The "Everything as Text" Anti-Pattern
Teams often index everything as text "just in case" they need search.
- Sorting breaks (lexicographical)
- Aggregations are slow/impossible
- Heap explodes (Field Data)
Matching on low-value fields (like UUIDs or status codes) dilutes the score of actual matches in Title/Description.
High Cardinality Warning (User IDs)
Aggregating on a high-cardinality keyword field (like user_id with 100M values) forces Lucene to build Global Ordinals.
Building them for that many unique values can take on the order of 30 seconds, which shows up as a latency spike on the first aggregation query after a refresh.
| Cardinality | Examples | Safe Operations |
|---|---|---|
| Low (< 1k) | Status, Category, Country | Aggregations OK |
| Medium (1k - 1M) | Brand, Tags, Author | Aggs with care |
| High (> 1M) | User ID, Session ID, IP | Avoid Terms Aggs |
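One common mitigation for the first-query spike is Elasticsearch's `eager_global_ordinals` mapping parameter, which builds global ordinals at refresh time instead of on the first aggregation, trading refresh cost for predictable query latency. A sketch of such a mapping as a Python dict (the field name is illustrative):

```python
# Hypothetical mapping fragment: pay the global-ordinals cost at refresh,
# not on the first aggregation after a refresh.
mapping = {
    "user_id": {
        "type": "keyword",
        "eager_global_ordinals": True,  # build ordinals eagerly at refresh time
    }
}
print(mapping["user_id"])
```

Note this does not make high-cardinality terms aggregations cheap; it only moves the build cost off the query path.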
Performance Benchmarks
The choice of data type impacts query speed by orders of magnitude. Data based on 1M Documents.
| Query Type | Field Type | Latency | Mechanism |
|---|---|---|---|
| Exact Match | keyword | 2ms | Hash Lookup |
| Range (> 100) | long | 3ms | BKD Tree Traversal |
| Range (> "100") | keyword | 50ms | Full Index Scan |
| Wildcard (*abc*) | text | 500ms+ | DFA Pattern Match |
Filter Context vs Query Context
Filter Context:
- Binary yes/no match
- Cacheable (fast)
- Use for: Status, Brand, Price Range
Query Context:
- Calculates a relevance score
- Not cacheable (slower)
- Use for: Full-text search
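In query DSL terms, the split maps onto the `bool` query's `must` (query context) and `filter` (filter context) clauses; a sketch with illustrative field names:

```python
# Query context ("must") contributes to _score; filter context ("filter")
# is a cacheable yes/no check that never affects scoring.
query = {
    "bool": {
        "must": [                                   # query context: scored
            {"match": {"description": "running shoes"}}
        ],
        "filter": [                                 # filter context: cached
            {"term": {"brand": "nike"}},
            {"range": {"price": {"lte": 100}}},
        ],
    }
}
print(query["bool"]["filter"])
```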
Decision Matrix
Use TEXT (Analyzed) when:
- Human language (Descriptions, Reviews)
- Fuzzy matching / Spell correction needed
- Stemming required (running → run)
- Relevance scoring is the priority
Use KEYWORD (Exact) when:
- IDs, SKUs, Codes, Emails
- Enums (Status: "Active", "Pending")
- Aggregations (Facets) needed
- Exact filtering required
Use NUMERIC/DATE when:
- Range queries (Price, Age, Date)
- Sorting by value
- Math aggregations (Sum, Avg)
```json
"title": {
  "type": "text",                  // 1. Search (fuzzy)
  "fields": {
    "raw": { "type": "keyword" }   // 2. Sort/Aggs
  }
},
"brand": {
  "type": "keyword",               // 1. Filter (exact)
  "fields": {
    "search": { "type": "text" }   // 2. Search
  }
}
```
Most fields need to be capable of both. Don't choose one; use multi-fields to get the best of both worlds.
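With a multi-field mapping like the one above, a single request can search the analyzed field while sorting on its exact sub-field; a sketch of a request body (field names assume the `title`/`title.raw` mapping shown):

```python
# Search the analyzed "title" field, sort on the exact "title.raw" sub-field.
body = {
    "query": {"match": {"title": "wireless headphones"}},
    "sort": [{"title.raw": "asc"}],
}
print(body["sort"])
```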
When Multi-Fields Explode Index Size
Every multi-field creates a duplicated inverted index (and doc values) on disk. 10 text fields with 3 sub-fields each means 30 extra indexed structures on top of the 10 originals, i.e. 40 fields in total.
Date & Time Gotchas
The Problem
Timezones are hard. "2024-01-01" implies UTC in Elasticsearch, but your user might be in EST. Range queries often miss the "last day" due to millisecond precision.
The Fix
- Always store dates in UTC internally (Epoch millis or ISO-8601).
- Normalize query timezones at the application layer.
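A minimal sketch of that application-layer normalization, using only Python's standard library (the fixed UTC-5 offset stands in for a real timezone-database lookup):

```python
from datetime import datetime, timedelta, timezone

# User asks for "2024-01-01" in US Eastern time (EST = UTC-5 in January).
eastern = timezone(timedelta(hours=-5))
day_start = datetime(2024, 1, 1, tzinfo=eastern)
day_end = day_start + timedelta(days=1)  # exclusive end: no missed last millisecond

# Convert both boundaries to UTC before building the range query.
gte = day_start.astimezone(timezone.utc)
lt = day_end.astimezone(timezone.utc)
print(gte.isoformat())  # 2024-01-01T05:00:00+00:00
print(lt.isoformat())   # 2024-01-02T05:00:00+00:00
```

Querying with `gte`/`lt` (inclusive start, exclusive end) avoids the "last day" off-by-one entirely.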
Production Mapping Rules
- • Use
textfor IDs (UUID, SKU) - • Use
keywordfor full paragraphs - • Use
stringfor numbers or dates
- • Use
keywordfor Filters/Aggs - • Use
numericfor Ranges - • Use
textfor Relevance/Scoring
Key Takeaways
Text is for Search
Use 'text' fields for human language, fuzzy matching, and relevance scoring. Use 'keyword' for IDs, tags, and exact filters.
Structured is for Filters
Use Numeric/Date types (BKD Trees) for ranges and sorting. Never store numbers as strings.
The ID Trap
Never index SKUs or UUIDs as 'text'. It causes false positive matches on partial tokens.
Type = Performance
Wrong field types silently destroy latency and memory. Measure twice, index once.
No Easy Fix
Mapping mistakes usually require a full reindex. Measure twice, cut once.