Systems Atlas
Chapter 6.3: Vector & Semantic Search

Bi-Encoder vs Cross-Encoder

The critical architectural decision in neural search: process query and document independently (fast, scalable) or jointly (accurate, slow). This distinction fundamentally shapes how a search system is built, what hardware it needs, and what quality users experience.

Bi-Encoder Latency

~15ms

Query encode (10ms) + ANN search (5ms). Regardless of corpus size.

Cross-Encoder Gain

MRR +0.06

MS MARCO: bi-encoder MRR@10 ≈ 0.33-0.38, cross-encoder ≈ 0.39-0.42.

Pipeline Total

~165ms

Retrieve-then-rerank: ~15ms retrieval + ~150ms reranking, still within the ~200ms users perceive as instant.

1. Bi-Encoder (Dual Encoder)

A bi-encoder processes two inputs independently through separate (or shared) encoder networks. The query and document never "see" each other during encoding — each is transformed into a fixed-size vector in isolation, and similarity is measured post-hoc using cosine similarity or dot product. This separation is both the defining feature and the key advantage.

Because the document encoder runs independently of any query, document embeddings can be pre-computed once and stored. At query time, only the query needs encoding (a single forward pass, ~5-10ms), then the pre-computed document vectors are searched using an ANN index. Consider a corpus of 10M documents: encode all 10M offline (takes hours on GPU), store in HNSW, and search in ~15ms per query regardless of corpus size.
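The offline-index-then-search split can be sketched in a few lines. This is a toy illustration, not a production setup: a deterministic bag-of-words projection stands in for a trained encoder, and a brute-force dot product stands in for the HNSW/ANN search step.

```python
import zlib
import numpy as np

def encode(text, dim=768):
    # Toy stand-in for a trained sentence encoder: each token gets a
    # stable pseudo-random vector; the text embedding is their normalized sum.
    vec = np.zeros(dim)
    for token in text.lower().split():
        rng = np.random.default_rng(zlib.crc32(token.encode()))
        vec += rng.standard_normal(dim)
    return vec / (np.linalg.norm(vec) + 1e-9)

# Offline: encode the whole corpus once and store the matrix.
corpus = ["budget airline tickets new york",
          "luxury hotels in paris",
          "cheap flights to nyc this weekend"]
doc_matrix = np.stack([encode(d) for d in corpus])   # [n_docs x 768]

# Query time: one encode + one similarity search.
# Brute-force dot product here; an HNSW/FAISS index would replace this step.
query_vec = encode("cheap flights to NYC")
scores = doc_matrix @ query_vec
top = np.argsort(-scores)
```

The corpus encoding happens once; only the two query-time lines run per search, which is why latency stays flat as the corpus grows.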

# Bi-Encoder architecture
Query "cheap flights to NYC"               → [Encoder A] → query_vector [768]
Document "Budget airline tickets New York" → [Encoder B] → doc_vector [768]
                              cosine_similarity(q, d) = 0.87
bi_encoder_search.py
def bi_encoder_search(query, model, ann_index, top_k=10):
    """
    Bi-encoder: encode query and documents independently.
    Documents are pre-encoded offline — this is where the speed comes from.
    """
    # Step 1: Encode query at search time (~5-10ms)
    query_vec = model.encode(query)              # [768]
    # Step 2: ANN search against pre-encoded doc vectors (~1-5ms)
    results = ann_index.search(query_vec, top_k)
    return results  # Total: ~10-15ms for millions of documents
Feature         | Benefit
Pre-computation | Document vectors computed offline once → query-time cost is minimal
Scalability     | Search billions of documents in <10ms using ANN indexes
Caching         | Frequently repeated queries can cache their query vectors
Independence    | New documents indexed without re-encoding existing vectors
The Information Bottleneck

The fundamental weakness of bi-encoders: the entire meaning of a query or document is compressed into a single fixed-size vector (768 float32 values ≈ 3 KB). This compression is inherently lossy. "Not available in blue" and "available in blue" produce nearly identical embeddings. A 10-page report gets the same 768 floats as a 2-sentence tweet.
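The negation problem falls directly out of the pooling arithmetic, and a toy model makes it concrete. Here each token gets a stable pseudo-random vector (a stand-in for learned token embeddings) and the text embedding is their mean — one extra "not" token barely moves a mean-pooled vector.

```python
import zlib
import numpy as np

def token_vec(token, dim=768):
    # Stable pseudo-random vector per token (toy stand-in for learned embeddings)
    rng = np.random.default_rng(zlib.crc32(token.encode()))
    return rng.standard_normal(dim)

def embed(text, dim=768):
    # Mean pooling over token vectors — the single-vector compression step
    v = np.mean([token_vec(t, dim) for t in text.lower().split()], axis=0)
    return v / np.linalg.norm(v)

a = embed("available in blue")
b = embed("not available in blue")
cos = float(a @ b)
# cos lands around 0.85 even though the two meanings are opposite:
# one token out of four contributes only a fraction of the pooled vector.
```

Real trained encoders are better than this toy at weighting tokens, but the structural issue is the same: a single pooled vector dilutes any one token's contribution, and negation usually hinges on exactly one token.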


2. Cross-Encoder Architecture

A cross-encoder processes query and document jointly in a single transformer pass. The two inputs are concatenated with a separator token and fed through the model together, allowing full cross-attention between all tokens from both inputs across all 12 transformer layers.

Every token in the query can attend to every token in the document. The model learns complex patterns: "the query asks about price ('cheap') and the document mentions price ('budget')" or "the query mentions 'NYC' and the document says 'New York.'" These token-level interactions produce the cross-encoder's superior accuracy.

# Cross-Encoder architecture
[CLS] cheap flights to NYC [SEP] Budget airline tickets New York [SEP]
[Single Transformer]
(12 layers, full cross-attention between ALL tokens)
relevance_score = 0.92
cross_encoder_score.py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def cross_encoder_score(query, document):
    """Must be computed for EVERY query-document pair. No pre-computation."""
    inputs = tokenizer(query, document, return_tensors='pt',
                       truncation=True, max_length=512)
    score = model(**inputs).logits[0]
    return score.item()

# 1000 candidates × 50ms = 50 seconds ← too slow for first-stage retrieval

Why Cross-Encoders Win on Hard Queries

Handling Negation

"Hotels with a pool" vs "Hotels without a pool" — cross-encoder sees "with" and "without" in direct attention with "pool." Bi-encoder produces nearly identical vectors.

Resolving Ambiguity

For "Paris Hilton," the cross-encoder looks at the document to determine if it's about the celebrity or a hotel in Paris. Bi-encoder commits to one interpretation.

Question-Answer Alignment

"Who invented the telephone?" + "Bell patented the telephone in 1876" — cross-encoder determines this directly answers the question. Bi-encoder knows both are about telephones.

Comparative Context

"Is Python faster than Java?" — cross-encoder distinguishes a comparative article (relevant) from one that merely mentions both languages (not relevant).


3. Head-to-Head Comparison

Criterion               | Bi-Encoder                                  | Cross-Encoder
Encoding                | Independent (query and doc never interact)  | Joint (full cross-attention)
Pre-computation         | ✅ Yes (offline)                            | ❌ No (must process pairs at query time)
Query latency (1M docs) | ~10-15ms total                              | ~50ms × 1M pairs = infeasible
Scalability             | Billions via ANN                            | Hundreds (practical limit)
MS MARCO MRR@10         | 0.33-0.38                                   | 0.39-0.42
Negation handling       | Poor                                        | Excellent
Use case                | First-stage retrieval                       | Second-stage reranking

The accuracy gap may look small in aggregate metrics, but it's concentrated in the hardest queries — exactly the ones where user satisfaction matters most. On simple queries, both approaches perform similarly. On ambiguous, nuanced, or complex queries, cross-encoders substantially outperform.


4. The Production Pipeline: Retrieve-Then-Rerank

Understanding the tradeoffs above leads to an elegant solution: use both architectures in a cascade. The bi-encoder handles the computationally hard part (searching millions of documents), and the cross-encoder handles the quality-critical part (fine-grained ranking of the top candidates).

Two-Stage Pipeline

Stage 1: Bi-Encoder Retrieval

Search 10M docs → Top 100 candidates

Latency: ~10-15ms

Method: encode query + HNSW search

Goal: 95-98% recall@100

Stage 2: Cross-Encoder Reranking

Re-score 100 candidates → Top 10

Latency: ~150-200ms (100 × ~1.5ms GPU)

Method: joint encoding per pair

Goal: Optimal ordering of final results

Total: ~15ms + ~150ms = ~165ms — well within the <200ms threshold users perceive as "instant." Quality approaches what you'd get running the cross-encoder over all 10M documents.
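The cascade can be sketched end to end. Everything here is a stand-in: the bag-of-words `embed` plays the bi-encoder, a token-overlap heuristic plays the cross-encoder (a real pipeline would call a model like ms-marco-MiniLM at that step), and brute-force dot products play the ANN index.

```python
import zlib
import numpy as np

def embed(text, dim=256):
    # Toy bag-of-words embedding (stand-in for a trained bi-encoder)
    v = np.zeros(dim)
    for t in text.lower().split():
        v += np.random.default_rng(zlib.crc32(t.encode())).standard_normal(dim)
    return v / (np.linalg.norm(v) + 1e-9)

def toy_pair_score(query, doc):
    # Stand-in for a cross-encoder: fraction of query tokens found in the doc
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def retrieve_then_rerank(query, corpus, doc_matrix, k_retrieve=100, k_final=10):
    # Stage 1: cheap similarity search over precomputed document vectors
    scores = doc_matrix @ embed(query)
    candidates = np.argsort(-scores)[:k_retrieve]
    # Stage 2: expensive pairwise scoring, but only on the small candidate set
    reranked = sorted(candidates,
                      key=lambda i: toy_pair_score(query, corpus[i]),
                      reverse=True)
    return [int(i) for i in reranked[:k_final]]

corpus = ["hotels with a pool in miami",
          "hotels without a pool in miami",
          "weather forecast for miami"]
doc_matrix = np.stack([embed(d) for d in corpus])
top = retrieve_then_rerank("hotels with a pool", corpus, doc_matrix,
                           k_retrieve=3, k_final=2)
```

The structure is the point: stage 2's per-pair cost is paid only `k_retrieve` times, never corpus-size times, which is what keeps the total inside the latency budget.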

5. Knowledge Distillation

A powerful technique: train the bi-encoder to mimic the cross-encoder's judgments. Score training pairs with the cross-encoder (teacher), then train the bi-encoder (student) to produce embeddings whose cosine similarity matches the teacher's relevance scores. This produces bi-encoders that are significantly better than those trained only on labeled data, because the cross-encoder provides rich, nuanced relevance signals beyond simple binary labels.

distillation.py
# Step 1: Score training pairs with the cross-encoder (teacher)
teacher_scores = []
for query, doc in training_pairs:
    teacher_scores.append(cross_encoder_score(query, doc))

# Step 2: Train the bi-encoder (student) to match teacher scores
def distillation_loss(bi_encoder, query, doc, teacher_score):
    q_vec = bi_encoder.encode(query)
    d_vec = bi_encoder.encode(doc)
    student_score = cosine_similarity(q_vec, d_vec)
    return MSE(student_score, teacher_score)

# Result: a bi-encoder that recovers much of the cross-encoder's accuracy
# advantage while retaining full pre-computation benefits

6. ColBERT: The Late Interaction Middle Ground

ColBERT (Khattab & Zaharia, 2020) introduces late interaction: encode query and document independently (like a bi-encoder), but keep all token embeddings instead of compressing to a single vector. At search time, compute MaxSim: for each query token, find its maximum similarity to any document token, then sum these maxima.

colbert_maxsim.py
# ColBERT: keep ALL token embeddings (not just one vector)
query_tokens = model.encode_query("cheap flights NYC")   # [5 × 128]
doc_tokens = model.encode_doc("Budget airline tickets")  # [4 × 128]

def maxsim(query_tokens, doc_tokens):
    """For each query token, find the max-similarity doc token, then sum."""
    score = 0.0
    for q_tok in query_tokens:
        score += max(cosine_sim(q_tok, d_tok) for d_tok in doc_tokens)
    return score
Approach      | Pre-compute? | Token Interaction | Accuracy  | Storage
Bi-encoder    | ✅ fully     | None              | Good      | 1 vector/doc
ColBERT       | ✅ tokens    | Late (MaxSim)     | Very good | N_tokens vectors/doc
Cross-encoder | ❌ none      | Full attention    | Best      | N/A (no vectors)
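In practice MaxSim is computed as one matrix multiply rather than a Python loop. A minimal vectorized sketch, assuming L2-normalized token embeddings so that dot products are cosine similarities:

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    # query_tokens: [n_q, d], doc_tokens: [n_d, d], rows L2-normalized
    sim = query_tokens @ doc_tokens.T       # all pairwise cosines: [n_q, n_d]
    return float(sim.max(axis=1).sum())     # best doc token per query token, summed

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = l2norm(rng.standard_normal((5, 128)))   # 5 query token embeddings
d = l2norm(rng.standard_normal((8, 128)))   # 8 doc token embeddings
score = maxsim(q, d)                        # bounded above by n_q
```

Because the score is a sum of per-query-token maxima, a document scoring itself yields exactly the query token count, and no document can exceed it.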

7. Choosing the Right Architecture

Bi-Encoder Alone

  • Need sub-10ms latency
  • Corpus >100M documents
  • Moderate accuracy requirements
  • No GPU budget for reranking

Bi-Encoder + Cross-Encoder

  • Can tolerate ~200ms latency
  • Hard queries matter (e-commerce, legal)
  • Have GPU capacity for reranking
  • Corpus >1M documents

Cross-Encoder Alone

  • Corpus <10K documents
  • Can afford O(N) scoring per query
  • Maximum accuracy critical (medical)

ColBERT

  • Need bi-encoder speed + better accuracy
  • Have storage budget for ~32x more per doc
  • Fine-grained relevance distinctions
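The ColBERT storage bullet can be sanity-checked with quick arithmetic. The numbers below are one plausible set of assumptions (768-dim float32 single vectors; ~128 tokens/doc at 128 dims with 2-byte quantized values), not fixed constants: the actual multiplier depends on embedding dimensions, average document length, and quantization, which is why cited figures like ~32x vary.

```python
# Per-document storage under the stated assumptions
bi_bytes_per_doc = 768 * 4             # one float32 vector: 3,072 B ≈ 3 KB
colbert_bytes_per_doc = 128 * 128 * 2  # 128 tokens × 128 dims × 2 B ≈ 32 KB
ratio = colbert_bytes_per_doc / bi_bytes_per_doc  # ~10.7x here

# Scaled to a 10M-document corpus
n_docs = 10_000_000
bi_total_gb = bi_bytes_per_doc * n_docs / 1e9          # ≈ 30.7 GB
colbert_total_gb = colbert_bytes_per_doc * n_docs / 1e9  # ≈ 327.7 GB
```

Either way the conclusion holds: the single-vector index fits comfortably in RAM on one machine, while per-token storage pushes you toward quantization, sharding, or both.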

Key Takeaways

01

Bi-Encoders for Speed

Encode query and documents independently → precompute all document vectors offline. Search is a single ANN lookup. Handles billions of candidates in <15ms. The entire first-stage retrieval of every major search engine.

02

Cross-Encoders for Accuracy

Process query and document jointly through full cross-attention. 10-20% more accurate than bi-encoders on hard queries (negation, ambiguity) but 1000x slower — only viable for reranking the top 20-100 candidates.

03

Retrieve-Then-Rerank Pipeline

The industry standard: bi-encoder retrieves top 100 candidates (~15ms), cross-encoder reranks those 100 (~150ms). Total ~165ms. Used by Google, Bing, Amazon, and virtually all modern search systems.

04

ColBERT: Late Interaction Middle Ground

Stores per-token embeddings and computes MaxSim at search time. Pre-computable like bi-encoders but retains token-level matching like cross-encoders. The tradeoff is ~32x more storage per document.

05

Knowledge Distillation Closes the Gap

Train a bi-encoder to mimic cross-encoder judgments. The student bi-encoder recovers ~50-90% of the accuracy gap while keeping full pre-computation benefits. A key technique for production quality.