Systems Atlas

Chapter 2.3

Intent vs Tokens

Tokens are what the user typed. Intent is what they meant. These often diverge, and understanding this gap is fundamental to building good search.


The Core Problem

Traditional search engines work by matching tokens (words) in the query against tokens in documents. This sounds reasonable until you realize that users express intent through words, but intent and words are not the same thing.

When a user types "cheap laptop", they don't want documents containing the word "cheap"; they want laptops under a certain price. When they type "laptop without touchscreen", matching "touchscreen" gives them the opposite of what they want.

Intent Categories (Google Framework)

Google famously categorizes queries into "Do, Know, Go". Here is how we handle them differently.

GO

Navigational Intent

The user wants to go to a specific website or page. They are using search as a bookmark bar.

Examples

  • "facebook login"
  • "youtube" (homepage)
  • "hbo max"
  • "united airlines support"

System Action

  • Show the single official link at #1
  • Show sitelinks (sub-pages)
  • Don't show ads if brand owner

KNOW

Informational Intent

The user wants to learn something. These queries make up 80% of web searches but monetize poorly.

Examples

  • "how to tie a tie"
  • "how to upload video to youtube"
  • "capital of france"

System Action

  • Show a Direct Answer / Snippet
  • Show "People Also Ask"
  • Rank authoritative content (Wikipedia)

DO

Transactional Intent

The user wants to buy or perform an action. This is where the money is (Ads, E-commerce).

Examples

  • "buy iphone 15"
  • "cheap flights to nyc"
  • "download spotify"

System Action

  • Show a Shopping Grid
  • Show Filters (Price, Brand)
  • Show Reviews and Ratings

Quantifying the Gap

This breakdown shows the gap between token matching and intent understanding.

Success Rate: Token vs Intent Matching

Negation Failure

Token matching fails 90% of the time on negation because it treats "without" as just another word or noise.

Synonym Gap

Token matching misses 75% of relevant results when users use synonyms (e.g., "sofa" instead of "couch") that aren't in the product text.

Deep Dive: Tokens vs Entities

analysis.py
def analyze_query(query: str):
    # 1. Naive tokenization (what Solr/ES do by default)
    tokens = query.split()
    print(tokens)  # ["iphone", "without", "camera"]
    # RESULT: returns phones that have "camera" in the description

    # 2. Entity-first strategy (what we want)
    entities = ner.extract(query)  # `ner` is a trained entity extractor
    print(entities)
    # {
    #     "product": "iphone",
    #     "exclusion": "camera"  # 'without' triggers exclusion logic
    # }
    # RESULT: apply filter `must_not: features.contains("camera")`
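The `ner.extract` call above is a placeholder for a trained model. As a minimal runnable stand-in, a single regex rule can recover the same structure for this one query shape (illustrative only; `extract_entities` is a hypothetical helper):

```python
import re

def extract_entities(query: str) -> dict:
    """Toy rule-based extractor: split '<product> without <term>'."""
    entities = {}
    m = re.search(r"\bwithout\s+(\w+)", query)
    if m:
        entities["exclusion"] = m.group(1)   # term to exclude
        query = query[: m.start()]           # everything before 'without'
    entities["product"] = query.strip()
    return entities

print(extract_entities("iphone without camera"))
# {'exclusion': 'camera', 'product': 'iphone'}
```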

Solutions for Common Failures

The "Cheap" Problem

Remove the word "cheap" from the query and apply a price filter instead.

if "cheap" in query_tokens:
    query_tokens.remove("cheap")
    filters["price"] = {"lt": median_price * 0.5}
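Wrapped as a self-contained function so it can be tested in isolation (the names and the 0.5 × median heuristic follow the snippet above):

```python
def rewrite_cheap(tokens, median_price):
    """Drop the modifier 'cheap' and turn it into a price filter."""
    filters = {}
    if "cheap" in tokens:
        tokens = [t for t in tokens if t != "cheap"]
        filters["price"] = {"lt": median_price * 0.5}
    return tokens, filters

tokens, filters = rewrite_cheap(["cheap", "laptop"], median_price=1000)
# tokens == ["laptop"], filters == {"price": {"lt": 500.0}}
```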

The Negation Problem

Convert "without X" into a hard exclusion filter.

if "without" in tokens:
    idx = tokens.index("without")
    excluded_term = tokens[idx + 1]
    must_not_terms.append(excluded_term)
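The same rewrite as a complete function, which also strips the negation from the remaining query tokens (a sketch; it only captures the single token after "without"):

```python
def rewrite_negation(tokens):
    """Convert 'without X' into a must_not exclusion list."""
    kept, must_not = [], []
    skip = False
    for i, tok in enumerate(tokens):
        if skip:              # this token was consumed as the excluded term
            skip = False
            continue
        if tok == "without" and i + 1 < len(tokens):
            must_not.append(tokens[i + 1])
            skip = True
        else:
            kept.append(tok)
    return kept, must_not

kept, must_not = rewrite_negation(["iphone", "without", "camera"])
# kept == ["iphone"], must_not == ["camera"]
```

Taking only the next token is fragile: "laptop without a touchscreen" would exclude "a". Real systems need phrase detection after the negation trigger.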

Precision vs Recall Strategy

A token-only search has high precision but low recall (misses synonyms). A semantic-only search has high recall but low precision (drifts topic). The industry standard is Dynamic Hybrid Retrieval.

def calculate_hybrid_score(bm25_score, vector_score):
    # 60% weight to exact matches (precision)
    # 40% weight to semantic matches (recall)
    return (0.6 * normalize(bm25_score)) + (0.4 * vector_score)
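The scorer calls `normalize` without defining it. BM25 scores are unbounded, so they must be squashed into [0, 1] before mixing with cosine similarities. A minimal min-max clamp, assuming a typical BM25 range of 0–25 (that range is an assumption, not a constant):

```python
def normalize(score, lo=0.0, hi=25.0):
    """Clamp an unbounded BM25 score into [0, 1]; the 0-25 range is assumed."""
    return max(0.0, min(1.0, (score - lo) / (hi - lo)))

print(normalize(12.5))   # 0.5
print(normalize(100.0))  # 1.0 (clamped)
```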

Bridging the Gap

There are three main approaches to bridging the token-intent gap, each with trade-offs:

1. Synonym Expansion

"couch" → "couch OR sofa OR loveseat"

Add known synonyms to the query. Simple and interpretable, but requires manual curation and can reduce precision if synonyms are too broad.
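A minimal sketch of this expansion, using a hand-curated table (the `SYNONYMS` entries are illustrative assumptions):

```python
# Hand-curated synonym table -- the "manual work" the trade-off refers to.
SYNONYMS = {
    "couch": ["sofa", "loveseat"],
    "sneakers": ["shoes", "trainers"],
}

def expand_query(tokens):
    """Expand each token into an OR-group of its curated synonyms."""
    clauses = []
    for tok in tokens:
        variants = [tok] + SYNONYMS.get(tok, [])
        clauses.append(" OR ".join(variants))
    # Parenthesize only the clauses that were actually expanded
    return " ".join(f"({c})" if " OR " in c else c for c in clauses)

print(expand_query(["red", "couch"]))  # red (couch OR sofa OR loveseat)
```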

✓ Interpretable · ✗ Manual work

2. Semantic Search

embed("couch") ≈ embed("sofa")

Use embeddings to find semantically similar content. Automatic and handles unseen synonyms, but can over-generalize and is less interpretable.

✓ Automatic · ✗ Black box
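The similarity test itself is just a cosine between vectors. Below, toy hand-made 3-d vectors stand in for real learned embeddings (which have hundreds of dimensions), purely to show the comparison:

```python
import math

# Toy vectors standing in for learned embeddings -- NOT real model output.
EMBED = {
    "couch":  [0.90, 0.10, 0.00],
    "sofa":   [0.85, 0.15, 0.05],
    "laptop": [0.00, 0.10, 0.95],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / mag

# "couch" sits near "sofa" in the space, far from "laptop"
assert cosine(EMBED["couch"], EMBED["sofa"]) > cosine(EMBED["couch"], EMBED["laptop"])
```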

3. Hybrid Approach

1. Token match (BM25) for precision

2. Semantic rerank for relevance

3. Fallback to semantic if needed

Combine both: use tokens for high-confidence matches, semantics for recall. This is the approach most production systems use.

✓ Best of both
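The three steps can be sketched as one pipeline. The scorers are injected so the skeleton stays library-agnostic; the toy `token_score`, `SIM` table, and `min_hits` threshold are illustrative assumptions:

```python
def hybrid_search(query, docs, token_score, semantic_score, min_hits=3):
    """1) token match for precision; 2) semantic rerank; 3) semantic fallback."""
    # Step 1: keep only documents with at least one exact token hit
    hits = [d for d in docs if token_score(query, d) > 0]
    # Step 3: if tokens found too little, fall back to the full corpus
    if len(hits) < min_hits:
        hits = list(docs)
    # Step 2: rerank whatever survived by semantic relevance
    return sorted(hits, key=lambda d: semantic_score(query, d), reverse=True)

# Toy scorers: word overlap for tokens, a hand-made similarity table for semantics.
def token_score(q, d):
    return len(set(q.split()) & set(d.split()))

SIM = {"couch": 1.0, "sofa": 0.9, "lamp": 0.1}
def semantic_score(q, d):
    return max(SIM.get(w, 0.0) for w in d.split())

docs = ["red couch", "green sofa", "desk lamp"]
print(hybrid_search("couch", docs, token_score, semantic_score, min_hits=2))
# ['red couch', 'green sofa', 'desk lamp']
```

With `min_hits=2`, the single exact match for "couch" is not enough, so the pipeline falls back to the full corpus and the semantic rerank surfaces "green sofa" second, exactly the synonym recall that pure token matching misses.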

Key Takeaways

01

Tokens vs Intent

The literal words typed are just a hint. You must infer the underlying goal.

02

Synonyms

Table stakes. You must handle "couch" vs "sofa" and "sneakers" vs "shoes".

03

Hard Problems

Negation ("without") and modifiers ("cheap", "best") require logic, not just matching.

04

Hybrid Wins

Use tokens for precision (exact match) and semantics (vectors) for recall.