Training Data: Click Pairs & Hard Negatives
Why data quality matters more than model cleverness. How to turn messy behavioral exhaust into clean training triplets, why hard negatives are essential, and what false negatives silently do to your vector space.
> 50%
Expected discard rate of raw clicks due to pogo-sticking, low dwell time, and positional bias noise.
False Negatives
The most dangerous training flaw — pushing actually relevant documents away from queries in vector space.
In-Batch + Hard
The standard mix: in-batch negatives for volume, hard negatives for fine-grained semantic boundaries.
1. Executive Summary: Data Dominates Outcomes
For embedding training, data quality matters more than model cleverness. A mediocre training recipe with high-quality domain pairs usually beats an elegant loss function trained on noisy signals. In production search, the most valuable supervision often comes from user behavior: what they searched, what they clicked, what they ignored, how long they stayed, and whether the session ended successfully.
That does not mean raw click logs are ground truth. Clicks are shaped by rank position, UI bias, freshness, brand familiarity, and even accidental taps. The central job of this chapter is to explain how to turn messy behavioral exhaust into training examples that teach the model something useful rather than simply recreating old ranking mistakes.
Team A: uses raw clicks with no filtering and random negatives. The model faithfully learns the old system's rank distribution — including all its biases.
Team B: uses cleaned sessions, dwell thresholds, hard negatives, and leakage controls. This gives the model a better map of what "good retrieval" actually means.
Team B is not using a better architecture. They are giving the model better training signal. The difference in outcomes is almost entirely determined by data quality decisions made before the first training step.
2. Types of Training Signal
Search training data comes from three main sources. Understanding the trade-offs of each helps you decide what mix to use and how much cleanup is needed before training.
Explicit Labels
Human annotation, thumbs up/down, quality reviews, or labeled duplicate questions from experts.
- High precision
- Unambiguous
- Good for eval too
- Expensive & slow
- Hard to scale
- Often too small
Implicit Feedback
Result clicks, dwell time, add-to-cart, ticket resolution, purchases, copy/download actions, session abandonment.
- Large volume
- Continuously refreshed
- Real user intent
- Biased by rank/UI
- Position effects
- Requires debiasing
Synthetic Data
LLM-generated pseudo-queries from documents, expert seeds, or distilled signals from rerankers and FAQ mappings.
- Works without clicks
- Cold start support
- Controllable quality
- Too clean / polished
- Misses real patterns
- Bootstrap only
3. Turning Logs Into Query-Positive Pairs
A single click event is too thin to be reliable. Good pipelines reconstruct full sessions: the original query, the ranked list shown, the clicked result, its position, dwell time, any follow-up queries, and the final success event if available. This lets you distinguish genuinely useful clicks from positional accidents and pogo-sticking.
Behavioral data is powerful because it links a query to a decision. A user typed something, saw results, chose one item, maybe spent time with it, maybe converted, maybe reformulated. That sequence contains much richer supervision than isolated text similarity. The user never explicitly labeled the document, but their behavior strongly implies it.
| Session Signal | Interpretation |
|---|---|
| Click + long dwell time | Strong positive — document was useful |
| Click + conversion (purchase, close ticket) | Very strong positive — document solved the need |
| Click + short dwell + immediate reformulation | Weak or noisy positive — pogo-sticking, likely discard |
| High exposure, no clicks | Potential negative evidence — shown prominently but skipped |
Session Pipeline in Code
4. The Biases Hidden in Click Data
Raw click data is biased in multiple compounding ways. Training naively on unfiltered clicks means teaching the model to reproduce the old ranking system's biases, not to learn genuine relevance. Understanding these biases is essential before deciding how much to trust any behavioral signal.
Position bias: Users examine top results more than lower results. A rank-1 result gets clicked partly because it is better, but also because it is seen first. Training naively on raw clicks teaches the model the old system's rank distribution, not actual relevance.
Presentation bias: A result with a bright thumbnail, trusted brand, or bolded snippet gets more clicks even if it is less relevant. In e-commerce, image quality and price badges distort behavior. In knowledge search, document titles dominate clicks even when content is weak.
Freshness bias: Users may prefer newer content because it appears recent, not because it is more semantically relevant. This can leak freshness preference into the embedding space even when freshness should be handled later by a separate ranking signal.
Popularity bias: Popular items absorb more clicks. A model trained only on clicks may collapse toward popular entities and under-serve the long tail — exactly the tail that training was supposed to fix.
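One common correction for position bias is inverse propensity weighting: a click is weighted by the inverse probability that the user even examined that rank. The propensity values below are made-up illustrative numbers; real systems estimate them from result randomization or a click model.

```python
# Illustrative rank -> examination propensities. Real systems estimate these
# from randomized interleaving or a click model; do not hard-code values.
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.25}

def ipw_weight(rank: int, floor: float = 0.1) -> float:
    """Weight a click by the inverse probability the user examined that rank.
    Clicks at low ranks count for more, offsetting position bias; the floor
    clips the propensity so rare deep-rank clicks cannot get huge weights."""
    p = max(PROPENSITY.get(rank, floor), floor)
    return 1.0 / p
```

In training, this weight multiplies the per-example loss, so a click at rank 3 contributes more evidence of relevance than a click at rank 1.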
5. Negative Sampling: Where Systems Win or Waste Training
If the query is "reset mac vpn" and your negative is a cooking recipe, the model learns almost nothing. That negative was already far away in semantic space before training. Easy negatives give coarse separation. They do not teach the model the fine semantic boundaries that distinguish good retrieval from mediocre retrieval in production.
| Negative Type | Description | Value |
|---|---|---|
| Random negative | Arbitrary document from corpus | Easy baseline signal only — use early, not exclusively |
| In-batch negative | Positive for another query in same batch | Highly efficient, huge volume from batch size |
| Lexical hard negative | Shares many words but is wrong factually | Strong for fine-grained semantic separation |
| Behavioral negative | Highly exposed, consistently not clicked | Strong if debiased well — removes position effects |
| Model-mined negative | Retrieved by current dense model but judged wrong | Excellent for late-stage refinement |
Hard Negatives Teach Distinctions
Hard negatives are valuable precisely because they are dangerous. They look relevant. The model must learn the subtle difference between a document that almost answers the query and one that actually does. That distinction is exactly what separates a mediocre retrieval system from a precise one.
A practical negative mining curriculum: start with in-batch negatives, add BM25-mined lexical lookalikes, then add top-k ANN results from the current model. This curriculum usually works better than jumping straight to only the hardest negatives from the start.
Hard Negatives in Action
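A minimal sketch of model-mined hard negatives: score the corpus with the current embedding model, drop the known positive, and take the nearest remaining documents. The toy cosine similarity and the `skip_top` guard (discarding the very nearest candidates, a cheap hedge against false negatives) are illustrative choices, not a fixed recipe.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; a real pipeline would use the model's own scorer."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mine_hard_negatives(query_vec: list[float], positive_id: str,
                        corpus: dict[str, list[float]],
                        skip_top: int = 0, k: int = 2) -> list[str]:
    """Rank corpus docs by similarity under the current model, drop the known
    positive, and take the nearest remaining docs as hard negatives.
    skip_top > 0 discards the very nearest candidates, a cheap guard
    against mining false negatives."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, corpus[d]), reverse=True)
    candidates = [d for d in ranked if d != positive_id]
    return candidates[skip_top:skip_top + k]
```

This slots in as the last stage of the curriculum above: in-batch negatives first, BM25 lookalikes next, then periodic re-mining with the model being trained.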
6. False Negatives: The Silent Model Killer
A false negative is a document that is labeled or treated as irrelevant even though it is actually relevant to the query. In search, this happens constantly because multiple documents can satisfy any given information need. If you train the model to push away a document that users would actually find helpful, you permanently damage the embedding space in ways that are hard to diagnose.
Query: "benefits enrollment deadline"
Positive doc from clicks: HR annual enrollment guide
False "Negative" doc: Benefits FAQ page with the same enrollment deadline
If you push the FAQ page away from this query, you damage the embedding space for anyone who would have found it useful — and those are real users.
How Teams Reduce False Negatives
- Exclude documents from the same category or entity cluster as the positive
- Remove near-duplicates and documents with very high lexical overlap with the positive
- Use teacher rerankers or LLM judges to filter documents that are plausibly relevant before treating them as negatives
- Prefer "shown above clicked result and skipped" as negatives — that is much stronger evidence than being simply unclicked
- Skip any negative whose lexical AND semantic similarity to the query are both high — ambiguity means uncertainty about its relevance
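The filters above can be sketched as a single gate applied to each candidate negative. The Jaccard overlap, the thresholds, and the record fields are all illustrative; the semantic similarity would come from the current embedding model.

```python
def token_jaccard(a: str, b: str) -> float:
    """Crude lexical overlap: Jaccard similarity of whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_safe_negative(query: str, positive: dict, candidate: dict,
                     semantic_sim_to_query: float,
                     lex_hi: float = 0.5, sem_hi: float = 0.8) -> bool:
    """Apply the false-negative filters from the list above.
    Thresholds are illustrative only, not recommended values."""
    if candidate["category"] == positive["category"]:
        return False  # same category/entity cluster as the positive
    if token_jaccard(candidate["text"], positive["text"]) >= lex_hi:
        return False  # near-duplicate / heavy lexical overlap with the positive
    if token_jaccard(candidate["text"], query) >= lex_hi and semantic_sim_to_query >= sem_hi:
        return False  # lexical AND semantic similarity to the query both high
    return True
```

On the chapter's running example, the benefits FAQ page fails the category check and is kept out of the negative pool, while a cooking recipe passes every filter.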
7. Building the Dataset: Schemas and Splits
The concrete output of a good training data pipeline is a dataset of query-document pairs or triplets, formatted consistently and split by query family rather than by random sampling.
Recommended Record Formats
Pair (query, positive): Simplest. Used with in-batch negatives from the loss function.
Triplet (query, positive, hard negative): Includes an explicit hard negative. Best for fine-grained domains.
Multi-positive (query, list of positives): Use when multiple docs are genuinely relevant. Avoids false negatives.
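The three formats rendered as JSONL-style records. The field names are a convention chosen here for illustration, not a required schema.

```python
import json

# Illustrative JSONL records for the three formats; field names are a
# convention, not a required schema.
pair = {"query": "reset okta token", "positive": "doc_123"}

triplet = {"query": "iphone 15 case",
           "positive": "doc_case_001",
           "hard_negative": "doc_charger_007"}

multi_positive = {"query": "benefits enrollment deadline",
                  "positives": ["doc_hr_guide", "doc_benefits_faq"]}

for record in (pair, triplet, multi_positive):
    print(json.dumps(record))  # one record per line -> JSONL
```

Whichever format you pick, keep it consistent across the whole dataset so the loss function always sees the same shape.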
Split by Query Family, Not Just Rows
One of the easiest ways to fool yourself is to let near-identical query variants leak across train and test splits. Random row splitting is not sufficient. Good splits require:
Train: "reset okta token"
Test: "how do i reset okta token"
These are the same intent. The model will appear to generalize when it has merely memorized the intent.
- Group by normalized query intent first
- Split chronologically when possible

Together, these ensure real generalization is measured, not intent memorization.
Additional principles: keep head and tail queries balanced in every split. Preserve domain slices so you can diagnose failures. Never let future clicks appear in the training split when splitting by time.
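A sketch of group-by-intent splitting: normalize each query to an intent key, then hash the key so every variant of one intent lands deterministically in the same split. The stop-word normalization here is a crude stand-in; real pipelines would use stemming or embedding-based clustering.

```python
import hashlib
import re

def normalize_intent(query: str) -> str:
    """Crude intent normalization: lowercase, drop filler words, sort tokens.
    A stand-in for real stemming or embedding-based intent clustering."""
    stop = {"how", "do", "i", "to", "the", "a", "my"}
    tokens = [t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in stop]
    return " ".join(sorted(tokens))

def assign_split(query: str, test_fraction: float = 0.2) -> str:
    """Hash the normalized intent so every variant of one intent falls in the
    same split, preventing near-duplicate leakage across train and test."""
    key = normalize_intent(query)
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return "test" if bucket < test_fraction * 100 else "train"
```

With this scheme, "reset okta token" and "how do i reset okta token" can never end up on opposite sides of the split.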
8. Bootstrap Strategies When You Lack Click Data
Cold start systems, private knowledge bases, new products, and low-traffic enterprise search systems often have no usable click logs. In these cases, there are three viable paths to bootstrap a training dataset before real user behavior accumulates.
Synthetic Queries from Documents
LLMs can generate plausible user queries from each document or chunk. This works well for new products, private knowledge bases, and high-value domains with no click logs. But synthetic data tends to be too clean and polished — real user queries are messy, abbreviated, and misspelled. Synthetic query generation should therefore include prompt diversity: varied personas, query lengths, phrasing styles, and deliberate typos or abbreviations.
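One way to get that diversity is to rotate prompt templates so the generator is pushed toward messy, abbreviated, and verbose queries rather than a single clean register. The templates below are illustrative, and the LLM call itself is left as a placeholder.

```python
import random

# Illustrative prompt templates to diversify synthetic queries. The goal is to
# avoid uniformly clean, polished questions; the actual LLM call is omitted.
TEMPLATES = [
    "Write a short keyword search a busy user would type to find: {passage}",
    "Write a question with a typo or abbreviation that this passage answers: {passage}",
    "Write a vague, under-specified query this passage could satisfy: {passage}",
    "Write a long, detailed natural-language question answered by: {passage}",
]

def build_prompt(passage: str, rng: random.Random) -> str:
    """Pick a random template so generated queries vary in register and length."""
    return rng.choice(TEMPLATES).format(passage=passage)
```

Generating several queries per passage, each from a different template, gives a synthetic set that looks a little more like real behavioral exhaust.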
Expert-Labeled Seeds
Even 500 to 2,000 carefully labeled query-document examples can be extremely useful for both evaluation and early fine-tuning when combined with in-batch negatives. Small expert-labeled sets are especially valuable for high-stakes domains (medical, legal, financial) where synthetic data quality cannot be trusted.
Weak Supervision from Existing Systems
Distill training signal from high-performing rerankers, curated FAQ mappings (question → article), support macros linked to specific articles, manual merchandised results, or human-curated collection pages. These are often underused data sources that give you pseudo-labels without requiring full annotation.
9. Dataset Quality Checklist
Before investing compute in training, validate your dataset against these six questions. If several answers are "no," better training code will not save the project. Data quality decisions made here determine outcomes more than any architectural choice.
Are positives actually useful, or merely clicked?
Click-through alone is not sufficient. Verify with dwell time, conversion, or explicit quality review.
Are negatives hard enough to teach real distinctions?
If all negatives are already far away from the query, the model won't learn fine-grained boundaries.
Have we filtered likely false negatives?
Documents in the same category as the positive that are unlabeled may actually be relevant.
Is the dataset balanced across head and tail queries?
If 90% of pairs are head queries, the model over-fits to popular queries and leaves the tail uncovered.
Are train/dev/test splits leakage-safe?
Split by query family or time window, not by random row selection.
Does the dataset reflect live production behavior?
If collected 18 months ago before a major product change, the signal may be stale or misleading.
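Two of the checklist items — head/tail balance and split leakage — can be checked mechanically before training. This audit sketch assumes each record carries `query`, a normalized `intent`, and a `split` field; the schema and the 10% head definition are illustrative.

```python
from collections import Counter

def audit(dataset: list[dict]) -> dict:
    """Cheap pre-training checks for two checklist items: head-query dominance
    and train/test intent leakage. Assumes records with 'query', 'intent'
    (normalized), and 'split' keys -- an illustrative schema."""
    report = {}
    counts = Counter(r["query"] for r in dataset)
    # Share of examples coming from the top ~10% most frequent queries.
    head = sum(c for _, c in counts.most_common(max(1, len(counts) // 10)))
    report["head_share"] = head / len(dataset)        # flag if this nears 0.9
    train = {r["intent"] for r in dataset if r["split"] == "train"}
    test = {r["intent"] for r in dataset if r["split"] == "test"}
    report["leaked_intents"] = sorted(train & test)   # should be empty
    return report
```

A non-empty `leaked_intents` list or a head share near 0.9 means fixing the dataset before spending any compute on training.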
Key Takeaways
Data quality beats model cleverness
High-quality training data is the main determinant of custom embedding performance. A mediocre training recipe with great domain pairs usually beats an elegant loss function trained on noisy signals. Two teams can start from the same model and get completely different results based solely on data quality.
Build positive pairs from sessions, not single events
A single click is too thin. Good positive pairs come from session-aware behavioral evidence — click + long dwell time + no immediate reformulation + successful session ending. Filtering out pogo-sticking and accidental taps is as important as collecting clicks.
Hard negatives teach the fine distinctions that matter
Random negatives are useful early but don't teach the model the fine semantic boundaries your users care about. Hard negatives — like 'iPhone 15 case' vs 'iPhone 15 charger' — force the model to learn precise distinctions that random sampling never provides.
False negatives silently damage the vector space
A false negative (treating an actually relevant document as irrelevant) is the most dangerous training flaw. It pushes good documents away from relevant queries permanently. Filtering, debiasing, and excluding near-duplicates matter as much as data volume.