Bi-Encoder vs Cross-Encoder
The critical architectural decision in neural search: process query and document independently (fast, scalable) or jointly (accurate, slow). This distinction fundamentally shapes how a search system is built, what hardware it needs, and what quality users experience.
~15ms
Query encode (10ms) + ANN search (5ms). Regardless of corpus size.
MRR +0.06
MS MARCO: bi-encoder MRR@10 ≈ 0.33-0.38, cross-encoder ≈ 0.39-0.42.
~165ms
Retrieve-then-rerank: 15ms retrieval + 150ms reranking ≈ 165ms end to end.
1. Bi-Encoder (Dual Encoder)
A bi-encoder processes two inputs independently through separate (or shared) encoder networks. The query and document never "see" each other during encoding — each is transformed into a fixed-size vector in isolation, and similarity is measured post-hoc using cosine similarity or dot product. This separation is both the defining feature and the key advantage.
Because the document encoder runs independently of any query, document embeddings can be pre-computed once and stored. At query time, only the query needs encoding (a single forward pass, ~5-10ms), then the pre-computed document vectors are searched using an ANN index. Consider a corpus of 10M documents: encode all 10M offline (takes hours on GPU), store in HNSW, and search in ~15ms per query regardless of corpus size.
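The offline-encode, online-search split can be sketched in a few lines. This is a toy illustration with random unit vectors standing in for a trained encoder and a brute-force dot product standing in for an HNSW index; in production, `encode_query` would be one forward pass through a real bi-encoder model.

```python
import numpy as np

# Toy stand-ins: a real system would use a trained bi-encoder and an ANN
# index such as HNSW. Here random unit vectors play the role of embeddings.
rng = np.random.default_rng(0)
DIM = 768

# Offline, once: encode and store every document vector.
doc_vectors = rng.standard_normal((10_000, DIM)).astype(np.float32)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def encode_query(query: str) -> np.ndarray:
    """Placeholder for the single query-time forward pass (~5-10ms in practice)."""
    v = rng.standard_normal(DIM).astype(np.float32)
    return v / np.linalg.norm(v)

def search(query: str, k: int = 5) -> np.ndarray:
    q = encode_query(query)
    scores = doc_vectors @ q          # cosine similarity (vectors are unit-norm)
    return np.argsort(-scores)[:k]    # top-k document ids

top = search("waterproof hiking boots")
print(top)
```

The key property is visible in `search`: only the query is encoded at query time; the 10,000 document vectors were paid for once, offline.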
| Feature | Benefit |
|---|---|
| Pre-computation | Document vectors computed offline once → query-time cost is minimal |
| Scalability | Search billions of documents in <10ms using ANN indexes |
| Caching | Frequently repeated queries can cache their query vectors |
| Independence | New documents indexed without re-encoding existing vectors |
The fundamental weakness of bi-encoders: the entire meaning of a query or document is compressed into a single fixed-size vector (768 float32 values ≈ 3 KB). This compression is inherently lossy. "Not available in blue" and "available in blue" produce nearly identical embeddings. A 10-page report gets the same 768 floats as a 2-sentence tweet.
2. Cross-Encoder Architecture
A cross-encoder processes query and document jointly in a single transformer pass. The two inputs are concatenated with a separator token and fed through the model together, allowing full cross-attention between all tokens from both inputs at every transformer layer (all 12 layers in a BERT-base model, for example).
Every token in the query can attend to every token in the document. The model learns complex patterns: "the query asks about price ('cheap') and the document mentions price ('budget')" or "the query mentions 'NYC' and the document says 'New York.'" These token-level interactions produce the cross-encoder's superior accuracy.
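The call pattern, and its cost, can be sketched as follows. A real reranker would run one transformer forward pass per `[CLS] query [SEP] doc [SEP]` pair (e.g. via a cross-encoder model); here a trivial token-overlap scorer stands in, purely to show that every candidate must be scored pair-at-a-time at query time.

```python
# Toy sketch of cross-encoder scoring. The scorer below is NOT a real
# cross-encoder -- it is a placeholder for one joint forward pass over the
# concatenated pair, illustrating the O(N) query-time call pattern.

def score_pair(query: str, document: str) -> float:
    """Stand-in for one joint forward pass over '[CLS] query [SEP] doc [SEP]'."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

docs = [
    "Bell patented the telephone in 1876",
    "The history of the fax machine",
]
query = "who invented the telephone"

# Every candidate must be scored at query time -- nothing is precomputable.
ranked = sorted(docs, key=lambda d: score_pair(query, d), reverse=True)
print(ranked[0])  # "Bell patented the telephone in 1876"
```

Because `score_pair` takes both inputs together, there is no document-only representation to cache: scoring N documents means N joint passes, which is why cross-encoders cannot serve as first-stage retrievers.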
Why Cross-Encoders Win on Hard Queries
Handling Negation
"Hotels with a pool" vs "Hotels without a pool" — cross-encoder sees "with" and "without" in direct attention with "pool." Bi-encoder produces nearly identical vectors.
Resolving Ambiguity
For "Paris Hilton," the cross-encoder looks at the document to determine if it's about the celebrity or a hotel in Paris. Bi-encoder commits to one interpretation.
Question-Answer Alignment
"Who invented the telephone?" + "Bell patented the telephone in 1876" — cross-encoder determines this directly answers the question. Bi-encoder knows both are about telephones.
Comparative Context
"Is Python faster than Java?" — cross-encoder distinguishes a comparative article (relevant) from one that merely mentions both languages (not relevant).
3. Head-to-Head Comparison
| Criterion | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Encoding | Independent (query and doc never interact) | Joint (full cross-attention) |
| Pre-computation | ✅ Yes (offline) | ❌ No (must process pairs at query time) |
| Query latency (1M docs) | ~10-15ms total | ~50ms per pair × 1M ≈ 14 hours (infeasible) |
| Scalability | Billions via ANN | Hundreds (practical limit) |
| MS MARCO MRR@10 | 0.33-0.38 | 0.39-0.42 |
| Negation handling | Poor | Excellent |
| Use case | First-stage retrieval | Second-stage reranking |
The accuracy gap may look small in aggregate metrics, but it's concentrated in the hardest queries — exactly the ones where user satisfaction matters most. On simple queries, both approaches perform similarly. On ambiguous, nuanced, or complex queries, cross-encoders substantially outperform.
4. The Production Pipeline: Retrieve-Then-Rerank
Understanding the tradeoffs above leads to an elegant solution: use both architectures in a cascade. The bi-encoder handles the computationally hard part (searching millions of documents), and the cross-encoder handles the quality-critical part (fine-grained ranking of the top candidates).
Two-Stage Pipeline
Search 10M docs → Top 100 candidates
Latency: ~10-15ms
Method: encode query + HNSW search
Goal: 95-98% recall@100
Re-score 100 candidates → Top 10
Latency: ~150-200ms (100 × ~1.5ms GPU)
Method: joint encoding per pair
Goal: Optimal ordering of final results
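The two stages above compose into a simple cascade. This is a minimal sketch with toy components: precomputed random vectors stand in for the bi-encoder + ANN index in stage 1, and a per-pair scoring loop stands in for the cross-encoder in stage 2. The shape of the pipeline is the point, not the scorers.

```python
import numpy as np

# Retrieve-then-rerank cascade with toy components.
rng = np.random.default_rng(1)
N_DOCS, DIM, K_RETRIEVE, K_FINAL = 10_000, 64, 100, 10

# Offline: precomputed document vectors (bi-encoder stand-in).
doc_vectors = rng.standard_normal((N_DOCS, DIM)).astype(np.float32)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def retrieve(q_vec: np.ndarray, k: int = K_RETRIEVE) -> np.ndarray:
    """Stage 1: cheap similarity search over the whole corpus."""
    scores = doc_vectors @ q_vec
    return np.argsort(-scores)[:k]

def rerank(q_vec: np.ndarray, candidate_ids: np.ndarray, k: int = K_FINAL):
    """Stage 2: expensive per-pair scoring, run on only K_RETRIEVE candidates."""
    # Toy pair score; a real cross-encoder runs one forward pass per pair.
    pair_scores = [float(doc_vectors[i] @ q_vec) for i in candidate_ids]
    order = np.argsort(pair_scores)[::-1][:k]
    return [candidate_ids[i] for i in order]

q = rng.standard_normal(DIM).astype(np.float32)
q /= np.linalg.norm(q)
final = rerank(q, retrieve(q))
print(len(final))  # 10
```

The expensive stage touches 100 documents instead of 10,000, which is exactly how the full pipeline keeps the cross-encoder's per-pair cost bounded regardless of corpus size.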
5. Knowledge Distillation
A powerful technique: train the bi-encoder to mimic the cross-encoder's judgments. Score training pairs with the cross-encoder (teacher), then train the bi-encoder (student) to produce embeddings whose cosine similarity matches the teacher's relevance scores. This produces bi-encoders that are significantly better than those trained only on labeled data, because the cross-encoder provides rich, nuanced relevance signals beyond simple binary labels.
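The objective described above can be written down concretely. This is a sketch of the distillation loss only, not a full training loop: the student bi-encoder's cosine similarity for each (query, document) pair is regressed onto the teacher cross-encoder's score (the teacher scores below are made-up toy values).

```python
import numpy as np

def distillation_loss(student_q: np.ndarray,
                      student_d: np.ndarray,
                      teacher_scores: np.ndarray) -> float:
    """MSE between student cosine similarities and teacher relevance scores."""
    q = student_q / np.linalg.norm(student_q, axis=1, keepdims=True)
    d = student_d / np.linalg.norm(student_d, axis=1, keepdims=True)
    student_scores = np.sum(q * d, axis=1)   # cosine per (query, doc) pair
    return float(np.mean((student_scores - teacher_scores) ** 2))

rng = np.random.default_rng(2)
q_emb = rng.standard_normal((4, 32))          # student query embeddings
d_emb = rng.standard_normal((4, 32))          # student document embeddings
teacher = np.array([0.9, 0.1, 0.7, 0.3])      # toy cross-encoder outputs
print(distillation_loss(q_emb, d_emb, teacher))
```

In training, this loss is minimized with respect to the student encoder's parameters, so the bi-encoder's geometry gradually absorbs the teacher's graded relevance judgments rather than just binary labels.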
6. ColBERT: The Late Interaction Middle Ground
ColBERT (Khattab & Zaharia, 2020) introduces late interaction: encode query and document independently (like a bi-encoder), but keep all token embeddings instead of compressing to a single vector. At search time, compute MaxSim: for each query token, find its maximum similarity to any document token, then sum these maxima.
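The MaxSim operator described above is a few lines of numpy. This sketch uses random unit vectors as stand-ins for ColBERT's learned, normalized token embeddings; the document token matrix is the part that can be precomputed and stored.

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """query_tokens: (Lq, dim), doc_tokens: (Ld, dim); both unit-normalized.
    For each query token, take its max similarity over all document tokens,
    then sum those maxima."""
    sim = query_tokens @ doc_tokens.T      # (Lq, Ld) cosine similarities
    return float(sim.max(axis=1).sum())    # max over doc tokens, sum over query

rng = np.random.default_rng(3)
def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

q = unit(rng.standard_normal((4, 128)))    # 4 query token embeddings
d = unit(rng.standard_normal((30, 128)))   # 30 doc token embeddings (precomputed)
print(maxsim(q, d))
```

Note the storage implication visible in the shapes: the document contributes 30 vectors instead of 1, which is where ColBERT's larger index footprint comes from.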
| Approach | Pre-compute? | Token Interaction | Accuracy | Storage |
|---|---|---|---|---|
| Bi-encoder | ✅ fully | None | Good | 1 vector/doc |
| ColBERT | ✅ tokens | Late (MaxSim) | Very good | N_tokens vectors/doc |
| Cross-encoder | ❌ none | Full attention | Best | N/A (no vectors) |
7. Choosing the Right Architecture
Bi-Encoder Alone
- Need sub-10ms latency
- Corpus >100M documents
- Moderate accuracy requirements
- No GPU budget for reranking
Bi-Encoder + Cross-Encoder
- Can tolerate ~200ms latency
- Hard queries matter (e-commerce, legal)
- Have GPU capacity for reranking
- Corpus >1M documents
Cross-Encoder Alone
- Corpus <10K documents
- Can afford O(N) scoring per query
- Maximum accuracy critical (medical)
ColBERT
- Need bi-encoder speed + better accuracy
- Have storage budget for ~32x more per doc
- Fine-grained relevance distinctions
Key Takeaways
Bi-Encoders for Speed
Encode query and documents independently → precompute all document vectors offline. Search is a single ANN lookup. Handles billions of candidates in <15ms. The entire first-stage retrieval of every major search engine.
Cross-Encoders for Accuracy
Process query and document jointly through full cross-attention. 10-20% more accurate than bi-encoders on hard queries (negation, ambiguity) but 1000x slower — only viable for reranking the top 20-100 candidates.
Retrieve-Then-Rerank Pipeline
The industry standard: bi-encoder retrieves top 100 candidates (~15ms), cross-encoder reranks those 100 (~150ms). Total ~165ms. Used by Google, Bing, Amazon, and virtually all modern search systems.
ColBERT: Late Interaction Middle Ground
Stores per-token embeddings and computes MaxSim at search time. Pre-computable like bi-encoders but retains token-level matching like cross-encoders. The tradeoff is ~32x more storage per document.
Knowledge Distillation Closes the Gap
Train a bi-encoder to mimic cross-encoder judgments. The student bi-encoder recovers ~50-90% of the accuracy gap while keeping full pre-computation benefits. A key technique for production quality.