Training Data: Click Pairs & Hard Negatives
Why data quality matters more than model cleverness. How to turn messy behavioral exhaust into clean training triplets, why hard negatives are essential, and what false negatives silently do to your vector space.
> 50%
Expected discard rate of raw clicks due to pogo-sticking, low dwell time, and positional bias noise.
False Negatives
The most dangerous training flaw — pushing actually relevant documents away from queries in vector space.
In-Batch + Hard
The standard mix: in-batch negatives for volume, hard negatives for fine-grained semantic boundaries.
1. Executive Summary: Data Dominates Outcomes
For embedding training, data quality matters more than model cleverness. A mediocre training recipe with high-quality domain pairs usually beats an elegant loss function trained on noisy signals. In production search, the most valuable supervision often comes from user behavior: what they searched, what they clicked, what they ignored, how long they stayed, and whether the session ended successfully.
That does not mean raw click logs are ground truth. Clicks are shaped by rank position, UI bias, freshness, brand familiarity, and even accidental taps. The central job of this chapter is to explain how to turn messy behavioral exhaust into training examples that teach the model something useful rather than simply recreating old ranking mistakes.
Team A: uses raw clicks with no filtering and random negatives. The model faithfully learns the old system's rank distribution — including all its biases.
Team B: uses cleaned sessions, dwell thresholds, hard negatives, and leakage controls. This gives the model a better map of what "good retrieval" actually means.
Team B is not using a better architecture. They are giving the model better training signal. The difference in outcomes is almost entirely determined by data quality decisions made before the first training step.
2. Types of Training Signal
Search training data comes from three main sources. Understanding the trade-offs of each helps you decide what mix to use and how much cleanup is needed before training.
Explicit Labels
Human annotation, thumbs up/down, quality reviews, or labeled duplicate questions from experts.
- High precision
- Unambiguous
- Good for eval too
- Expensive & slow
- Hard to scale
- Often too small
Implicit Feedback
Result clicks, dwell time, add-to-cart, ticket resolution, purchases, copy/download actions, session abandonment.
- Large volume
- Continuously refreshed
- Real user intent
- Biased by rank/UI
- Position effects
- Requires debiasing
Synthetic Data
LLM-generated pseudo-queries from documents, expert seeds, or distilled signals from rerankers and FAQ mappings.
- Works without clicks
- Cold start support
- Controllable quality
- Too clean / polished
- Misses real patterns
- Bootstrap only
3. Turning Logs Into Query-Positive Pairs
A single click event is too thin to be reliable. Good pipelines reconstruct full sessions: the original query, the ranked list shown, the clicked result, its position, dwell time, any follow-up queries, and the final success event if available. This lets you distinguish genuinely useful clicks from positional accidents and pogo-sticking.
Behavioral data is powerful because it links a query to a decision. A user typed something, saw results, chose one item, maybe spent time with it, maybe converted, maybe reformulated. That sequence contains much richer supervision than isolated text similarity. The user never explicitly labeled the document, but their behavior strongly implies it.
| Session Signal | Interpretation |
|---|---|
| Click + long dwell time | Strong positive — document was useful |
| Click + conversion (purchase, close ticket) | Very strong positive — document solved the need |
| Click + short dwell + immediate reformulation | Weak or noisy positive — pogo-sticking, likely discard |
| High exposure, no clicks | Potential negative evidence — shown prominently but skipped |
Session Pipeline in Code
4. The Biases Hidden in Click Data
Raw click data is biased in multiple compounding ways. Training naively on unfiltered clicks means teaching the model to reproduce the old ranking system's biases, not to learn genuine relevance. Understanding these biases is essential before deciding how much to trust any behavioral signal.
Position bias: Users examine top results more than lower results. A rank-1 result gets clicked partly because it is better, but also because it is seen first. Training naively on raw clicks teaches the model the old system's rank distribution, not actual relevance.
Presentation bias: A result with a bright thumbnail, trusted brand, or bolded snippet gets more clicks even if it is less relevant. In e-commerce, image quality and price badges distort behavior. In knowledge search, document titles dominate clicks even when content is weak.
Freshness bias: Users may prefer newer content because it appears recent, not because it is more semantically relevant. This can leak freshness preference into the embedding space even when freshness should be handled later by a separate ranking signal.
Popularity bias: Popular items absorb more clicks. A model trained only on clicks may collapse toward popular entities and under-serve the long tail — exactly the tail that training was supposed to fix.
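One common correction for position bias is inverse propensity weighting: a click is weighted by the inverse probability that the user even examined that rank. The propensity values below are made-up illustrative numbers; real systems estimate them from result randomization or a click model.

```python
# Illustrative rank -> examination propensities. Real systems estimate these
# from randomized interleaving or a click model; do not hard-code values.
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.25}

def ipw_weight(rank: int, floor: float = 0.1) -> float:
    """Weight a click by the inverse probability the user examined that rank.
    Clicks at low ranks count for more, offsetting position bias; the floor
    clips the propensity so rare deep-rank clicks cannot get huge weights."""
    p = max(PROPENSITY.get(rank, floor), floor)
    return 1.0 / p
```

In training, this weight multiplies the per-example loss, so a click at rank 3 contributes more evidence of relevance than a click at rank 1.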
5. Negative Sampling: Where Systems Win or Waste Training
If the query is "reset mac vpn" and your negative is a cooking recipe, the model learns almost nothing. That negative was already far away in semantic space before training. Easy negatives give coarse separation. They do not teach the model the fine semantic boundaries that distinguish good retrieval from mediocre retrieval in production.
| Negative Type | Description | Value |
|---|---|---|
| Random negative | Arbitrary document from corpus | Easy baseline signal only — use early, not exclusively |
| In-batch negative | Positive for another query in same batch | Highly efficient, huge volume from batch size |
| Lexical hard negative | Shares many words but is wrong factually | Strong for fine-grained semantic separation |
| Behavioral negative | Highly exposed, consistently not clicked | Strong if debiased well — removes position effects |
| Model-mined negative | Retrieved by current dense model but judged wrong | Excellent for late-stage refinement |
Hard Negatives Teach Distinctions
Hard negatives are valuable precisely because they are dangerous. They look relevant. The model must learn the subtle difference between a document that almost answers the query and one that actually does. That distinction is exactly what separates a mediocre retrieval system from a precise one.
A practical negative mining curriculum: start with in-batch negatives, add BM25-mined lexical lookalikes, then add top-k ANN results from the current model. This curriculum usually works better than jumping straight to only the hardest negatives from the start.
Hard Negatives in Action
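A minimal sketch of model-mined hard negatives: score the corpus with the current embedding model, drop the known positive, and take the nearest remaining documents. The toy cosine similarity and the `skip_top` guard (discarding the very nearest candidates, a cheap hedge against false negatives) are illustrative choices, not a fixed recipe.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; a real pipeline would use the model's own scorer."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mine_hard_negatives(query_vec: list[float], positive_id: str,
                        corpus: dict[str, list[float]],
                        skip_top: int = 0, k: int = 2) -> list[str]:
    """Rank corpus docs by similarity under the current model, drop the known
    positive, and take the nearest remaining docs as hard negatives.
    skip_top > 0 discards the very nearest candidates, a cheap guard
    against mining false negatives."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, corpus[d]), reverse=True)
    candidates = [d for d in ranked if d != positive_id]
    return candidates[skip_top:skip_top + k]
```

This slots in as the last stage of the curriculum above: in-batch negatives first, BM25 lookalikes next, then periodic re-mining with the model being trained.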
6. False Negatives: The Silent Model Killer
A false negative is a document that is labeled or treated as irrelevant even though it is actually relevant to the query. In search, this happens constantly because multiple documents can satisfy any given information need. If you train the model to push away a document that users would actually find helpful, you permanently damage the embedding space in ways that are hard to diagnose.
Query: "benefits enrollment deadline"
Positive doc from clicks: HR annual enrollment guide
False "Negative" doc: Benefits FAQ page with the same enrollment deadline
If you push the FAQ page away from this query, you damage the embedding space for anyone who would have found it useful — and those are real users.
How Teams Reduce False Negatives
- Exclude documents from the same category or entity cluster as the positive
- Remove near-duplicates and documents with very high lexical overlap with the positive
- Use teacher rerankers or LLM judges to filter documents that are plausibly relevant before treating them as negatives
- Prefer "shown above clicked result and skipped" as negatives — that is much stronger evidence than being simply unclicked
- Skip any negative whose lexical AND semantic similarity to the query are both high — ambiguity means uncertainty about its relevance
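The filters above can be sketched as a single gate applied to each candidate negative. The Jaccard overlap, the thresholds, and the record fields are all illustrative; the semantic similarity would come from the current embedding model.

```python
def token_jaccard(a: str, b: str) -> float:
    """Crude lexical overlap: Jaccard similarity of whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_safe_negative(query: str, positive: dict, candidate: dict,
                     semantic_sim_to_query: float,
                     lex_hi: float = 0.5, sem_hi: float = 0.8) -> bool:
    """Apply the false-negative filters from the list above.
    Thresholds are illustrative only, not recommended values."""
    if candidate["category"] == positive["category"]:
        return False  # same category/entity cluster as the positive
    if token_jaccard(candidate["text"], positive["text"]) >= lex_hi:
        return False  # near-duplicate / heavy lexical overlap with the positive
    if token_jaccard(candidate["text"], query) >= lex_hi and semantic_sim_to_query >= sem_hi:
        return False  # lexical AND semantic similarity to the query both high
    return True
```

On the chapter's running example, the benefits FAQ page fails the category check and is kept out of the negative pool, while a cooking recipe passes every filter.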
7. Building the Dataset: Schemas and Splits
The concrete output of a good training data pipeline is a dataset of query-document pairs or triplets, formatted consistently and split by query family rather than by random sampling.
Recommended Record Formats
Pair (query, positive): Simplest. Used with in-batch negatives from the loss function.
Triplet (query, positive, hard negative): Includes an explicit hard negative. Best for fine-grained domains.
Multi-positive (query, list of positives): Use when multiple docs are genuinely relevant. Avoids false negatives.
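The three formats rendered as JSONL-style records. The field names are a convention chosen here for illustration, not a required schema.

```python
import json

# Illustrative JSONL records for the three formats; field names are a
# convention, not a required schema.
pair = {"query": "reset okta token", "positive": "doc_123"}

triplet = {"query": "iphone 15 case",
           "positive": "doc_case_001",
           "hard_negative": "doc_charger_007"}

multi_positive = {"query": "benefits enrollment deadline",
                  "positives": ["doc_hr_guide", "doc_benefits_faq"]}

for record in (pair, triplet, multi_positive):
    print(json.dumps(record))  # one record per line -> JSONL
```

Whichever format you pick, keep it consistent across the whole dataset so the loss function always sees the same shape.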
Split by Query Family, Not Just Rows
One of the easiest ways to fool yourself is to let near-identical query variants leak across train and test splits. Random row splitting is not sufficient. Good splits require:
Train: "reset okta token"
Test: "how do i reset okta token"
These are the same intent. The model will appear to generalize when it has merely memorized the intent.
- Group by normalized query intent first
- Split chronologically when possible

Together, these ensure real generalization is measured, not intent memorization.
Additional principles: keep head and tail queries balanced in every split. Preserve domain slices so you can diagnose failures. Never let future clicks appear in the training split when splitting by time.
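A sketch of group-by-intent splitting: normalize each query to an intent key, then hash the key so every variant of one intent lands deterministically in the same split. The stop-word normalization here is a crude stand-in; real pipelines would use stemming or embedding-based clustering.

```python
import hashlib
import re

def normalize_intent(query: str) -> str:
    """Crude intent normalization: lowercase, drop filler words, sort tokens.
    A stand-in for real stemming or embedding-based intent clustering."""
    stop = {"how", "do", "i", "to", "the", "a", "my"}
    tokens = [t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in stop]
    return " ".join(sorted(tokens))

def assign_split(query: str, test_fraction: float = 0.2) -> str:
    """Hash the normalized intent so every variant of one intent falls in the
    same split, preventing near-duplicate leakage across train and test."""
    key = normalize_intent(query)
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return "test" if bucket < test_fraction * 100 else "train"
```

With this scheme, "reset okta token" and "how do i reset okta token" can never end up on opposite sides of the split.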
8. Bootstrap Strategies When You Lack Click Data
Cold start systems, private knowledge bases, new products, and low-traffic enterprise search systems often have no usable click logs. In these cases, there are three viable paths to bootstrap a training dataset before real user behavior accumulates.
Synthetic Queries from Documents
LLMs can generate plausible user queries from each document or chunk. This works well for new products, private knowledge bases, and high-value domains with no click logs. But synthetic data tends to be too clean and polished — real user queries are messy, abbreviated, and misspelled. Synthetic query generation should therefore include prompt diversity: varied personas, query lengths, phrasing styles, and deliberate typos or abbreviations.
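One way to get that diversity is to rotate prompt templates so the generator is pushed toward messy, abbreviated, and verbose queries rather than a single clean register. The templates below are illustrative, and the LLM call itself is left as a placeholder.

```python
import random

# Illustrative prompt templates to diversify synthetic queries. The goal is to
# avoid uniformly clean, polished questions; the actual LLM call is omitted.
TEMPLATES = [
    "Write a short keyword search a busy user would type to find: {passage}",
    "Write a question with a typo or abbreviation that this passage answers: {passage}",
    "Write a vague, under-specified query this passage could satisfy: {passage}",
    "Write a long, detailed natural-language question answered by: {passage}",
]

def build_prompt(passage: str, rng: random.Random) -> str:
    """Pick a random template so generated queries vary in register and length."""
    return rng.choice(TEMPLATES).format(passage=passage)
```

Generating several queries per passage, each from a different template, gives a synthetic set that looks a little more like real behavioral exhaust.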
Expert-Labeled Seeds
Even 500 to 2,000 carefully labeled query-document examples can be extremely useful for both evaluation and early fine-tuning when combined with in-batch negatives. Small expert-labeled sets are especially valuable for high-stakes domains (medical, legal, financial) where synthetic data quality cannot be trusted.
Weak Supervision from Existing Systems
Distill training signal from high-performing rerankers, curated FAQ mappings (question → article), support macros linked to specific articles, manual merchandised results, or human-curated collection pages. These are often underused data sources that give you pseudo-labels without requiring full annotation.
9. Dataset Quality Checklist
Before investing compute in training, validate your dataset against these six questions. If several answers are "no," better training code will not save the project. Data quality decisions made here determine outcomes more than any architectural choice.
Are positives actually useful, or merely clicked?
Click-through alone is not sufficient. Verify with dwell time, conversion, or explicit quality review.
Are negatives hard enough to teach real distinctions?
If all negatives are already far away from the query, the model won't learn fine-grained boundaries.
Have we filtered likely false negatives?
Documents in the same category as the positive that are unlabeled may actually be relevant.
Is the dataset balanced across head and tail queries?
If 90% of pairs are head queries, the model over-fits to popular queries and leaves the tail uncovered.
Are train/dev/test splits leakage-safe?
Split by query family or time window, not by random row selection.
Does the dataset reflect live production behavior?
If collected 18 months ago before a major product change, the signal may be stale or misleading.
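Two of the checklist items — head/tail balance and split leakage — can be checked mechanically before training. This audit sketch assumes each record carries `query`, a normalized `intent`, and a `split` field; the schema and the 10% head definition are illustrative.

```python
from collections import Counter

def audit(dataset: list[dict]) -> dict:
    """Cheap pre-training checks for two checklist items: head-query dominance
    and train/test intent leakage. Assumes records with 'query', 'intent'
    (normalized), and 'split' keys -- an illustrative schema."""
    report = {}
    counts = Counter(r["query"] for r in dataset)
    # Share of examples coming from the top ~10% most frequent queries.
    head = sum(c for _, c in counts.most_common(max(1, len(counts) // 10)))
    report["head_share"] = head / len(dataset)        # flag if this nears 0.9
    train = {r["intent"] for r in dataset if r["split"] == "train"}
    test = {r["intent"] for r in dataset if r["split"] == "test"}
    report["leaked_intents"] = sorted(train & test)   # should be empty
    return report
```

A non-empty `leaked_intents` list or a head share near 0.9 means fixing the dataset before spending any compute on training.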
Key Takeaways
Data quality beats model cleverness
High-quality training data is the main determinant of custom embedding performance. A mediocre training recipe with great domain pairs usually beats an elegant loss function trained on noisy signals. Two teams can start from the same model and get completely different results based solely on data quality.
Build positive pairs from sessions, not single events
A single click is too thin. Good positive pairs come from session-aware behavioral evidence — click + long dwell time + no immediate reformulation + successful session ending. Filtering out pogo-sticking and accidental taps is as important as collecting clicks.
Hard negatives teach the fine distinctions that matter
Random negatives are useful early but don't teach the model the fine semantic boundaries your users care about. Hard negatives — like 'iPhone 15 case' vs 'iPhone 15 charger' — force the model to learn precise distinctions that random sampling never provides.
False negatives silently damage the vector space
A false negative (treating an actually relevant document as irrelevant) is the most dangerous training flaw. It pushes good documents away from relevant queries permanently. Filtering, debiasing, and excluding near-duplicates matter as much as data volume.