Systems Atlas
Chapter 7.4: Training Embedding Models

Fine-Tuning Strategies

Full fine-tuning, LoRA adapters, Matryoshka embeddings, and knowledge distillation. How to choose the right strategy for your constraints, and the failure modes that silently damage models in production.

Full Fine-Tune

100% VRAM

All model parameters are updated. Maximum expressive power but highest memory cost and forgetting risk.

LoRA

20–40% VRAM

Trains only small rank-decomposed matrices per layer. Minimal forgetting. Production default for most teams.

Matryoshka

3–10x Compression

Truncatable multi-scale embedding. One model serves many latency and storage operating points without retraining.

Core Strategic Question: You are not choosing how to train a model. You are choosing how to update a model that already understands language, so that it understands your domain's vocabulary, entity relationships, and relevance judgments, at the lowest cost and risk of regression.

1. The Four Main Fine-Tuning Strategies

No single fine-tuning method fits every case. The right strategy depends on dataset size, hardware constraints, how specialized the domain is, and whether you need flexible serving dimensions. These four strategies cover the practical spectrum from simple adaptation to full re-shaping of the embedding space.

🔥

Full Fine-Tuning

Risky

All parameters updated. Best for very specialized domains where the base model has little coverage. High VRAM, highest catastrophic forgetting risk.

⚡

LoRA (Low-Rank Adaptation)

Recommended Default

Freeze base weights. Add rank-decomposed adapter matrices. Only adapters are trained. Strong performance at a fraction of the VRAM cost.

🪆

Matryoshka Representation Learning

Best for multi-SLA serving

Train so that the first N dimensions are always a valid embedding. A single model supports multiple vector sizes at serving time.

🎓

Knowledge Distillation

Best for precision transfer

Match score distributions from a teacher cross-encoder or LLM judge. Transfers fine-grained ranking quality without requiring explicit triplets.
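The margin-matching idea above (usually called MarginMSE) can be sketched in a few lines. This is a hand-written numpy illustration, not a specific library's API; the function name and toy scores are ours:

```python
import numpy as np

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """MarginMSE: match the student's score *margin* (pos - neg)
    to the teacher's margin, rather than absolute scores."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return np.mean((student_margin - teacher_margin) ** 2)

# Toy scores for three (query, positive, negative) triples
s_pos = np.array([0.8, 0.6, 0.9])     # student bi-encoder scores
s_neg = np.array([0.3, 0.5, 0.2])
t_pos = np.array([0.9, 0.7, 0.95])    # teacher cross-encoder scores
t_neg = np.array([0.2, 0.4, 0.15])

loss = margin_mse_loss(s_pos, s_neg, t_pos, t_neg)
```

Matching margins rather than raw scores is what lets the student absorb the teacher's ranking preferences without needing its score scale.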


2. Full Fine-Tuning

Full fine-tuning updates all model parameters during training. This gives the model maximum flexibility to reshape the embedding space toward domain-specific concepts, including specialized vocabulary, entity types, and ranking preferences that the base model was never exposed to.

full_finetune.py
from torch.optim import AdamW

# All parameters are trainable
for param in model.parameters():
    param.requires_grad = True

# A low learning rate is critical to avoid forgetting;
# pair it with a warmup + cosine decay schedule.
optimizer = AdamW(model.parameters(), lr=2e-5)
When to Use
  • Very specialized domains (biomedical, legal)
  • Large domain-specific training sets (>100K pairs)
  • Base model has very limited coverage of your vocabulary
Risks
  • Catastrophic forgetting of general language
  • High VRAM requirement
  • Requires regularization (weight decay, EWC)
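The regularization mentioned above can be as simple as an L2 penalty that pulls trained weights back toward the pretrained ones (often called L2-SP). A minimal numpy sketch, with hypothetical weight lists standing in for model parameters:

```python
import numpy as np

def l2_sp_penalty(current_weights, base_weights, strength=0.01):
    """Penalty grows as weights drift from the pretrained base,
    anchoring the model against catastrophic forgetting."""
    return strength * sum(
        np.sum((w - w0) ** 2)
        for w, w0 in zip(current_weights, base_weights)
    )

base = [np.zeros((2, 2)), np.ones(3)]          # frozen snapshot of pretrained weights
drifted = [np.full((2, 2), 0.1), np.ones(3)]   # only the first tensor has drifted

penalty = l2_sp_penalty(drifted, base)
```

The penalty is added to the task loss; `strength` trades plasticity against retention and would be tuned per dataset.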

3. LoRA: Low-Rank Adaptation

LoRA freezes all base model weights and adds small pairs of low-rank matrices to each transformer layer. Only these adapter matrices are trained. The effective weight update is computed as the product of the two small matrices, which is added to the frozen base weight at inference time.

lora_concept.py
import torch
import torch.nn as nn

# Base weight is frozen entirely
W_frozen = pretrained_weight  # [d_out, d_in], never updated

# Two small learned adapter matrices
A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
B = nn.Parameter(torch.zeros(d_out, rank))

# Effective weight = frozen base + learned low-rank update
W_effective = W_frozen + (alpha / rank) * B @ A

# At inference: merge the adapter into the weights for zero serving overhead
Hyperparameter | Typical Value | Effect
rank | 8, 16, 32 | Controls adapter expressivity. Higher rank means more parameters but stronger adaptation.
alpha | rank to 2×rank | Scales the adapter update magnitude. Start at the rank value as a default.
dropout | 0.05 to 0.1 | Regularizes adapter matrices. Prevents overfitting on small datasets.
target_modules | q_proj, v_proj | Apply adapters to query and value projections. Adding k_proj improves quality but increases cost.
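Two properties of this setup are worth verifying: because B starts at zero, the adapter contributes nothing at initialization, so training begins exactly at the pretrained model; and the learned update can never exceed rank `rank`. A runnable numpy sketch (dimensions are illustrative):

```python
import numpy as np

d_out, d_in, rank, alpha = 8, 16, 4, 8
rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(d_out, d_in))   # stands in for the pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # small random init
B = np.zeros((d_out, rank))                 # zero init

W_effective = W_frozen + (alpha / rank) * B @ A

# At initialization the update is zero: training starts from the base model
assert np.allclose(W_effective, W_frozen)

# After training, B is nonzero, but the update's rank is still bounded by `rank`
B_trained = rng.normal(size=(d_out, rank))
update = (alpha / rank) * B_trained @ A
assert np.linalg.matrix_rank(update) <= rank
```

The rank bound is why the `rank` hyperparameter directly caps how much geometric reshaping the adapter can express.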

4. Matryoshka Representation Learning

Matryoshka Representation Learning (MRL) trains the model so that the first N dimensions are always a valid, coherent embedding on their own. This allows you to truncate the vector to any size (64d, 128d, 256d, 512d, 1024d) after training and still get useful retrieval quality, at meaningfully different latency and storage costs.

The architecture is identical to a standard bi-encoder. Only the loss function changes: instead of a single MNR loss on the full embedding, you compute MNR loss at each target dimensionality and sum them.

matryoshka_loss.py
def matryoshka_loss(q_embs, d_embs, dims=[64, 128, 256, 512, 1024]):
    # mnr_loss: standard in-batch MNR loss, defined elsewhere
    total = 0.0
    for d in dims:
        # Truncate to the first d dimensions
        q_d = q_embs[:, :d]
        d_d = d_embs[:, :d]
        # Full MNR loss at this dimensionality
        total += mnr_loss(q_d, d_d)
    return total / len(dims)
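The `mnr_loss` call above is the standard in-batch Multiple Negatives Ranking loss. For reference, a self-contained numpy sketch of it evaluated at a full and a truncated dimensionality; the temperature value is an assumption:

```python
import numpy as np

def mnr_loss(q, d, temperature=0.05):
    """In-batch MNR loss: each query's positive is its same-index
    document; all other documents in the batch act as negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = q @ d.T / temperature                      # [batch, batch]
    # Numerically stable log-softmax; the diagonal is the target class
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q_embs = rng.normal(size=(4, 1024))
d_embs = q_embs + 0.1 * rng.normal(size=(4, 1024))   # near-duplicates: easy positives

full = mnr_loss(q_embs, d_embs)                  # loss on all 1024 dims
truncated = mnr_loss(q_embs[:, :64], d_embs[:, :64])  # loss on the first 64 dims
```

Summing this loss over several truncation points, as the snippet above does, is the entire Matryoshka recipe; the architecture is untouched.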

Production Operating Points

Dimensions | Use Case | Tradeoff
64d | Autocomplete, fast type-ahead | Fastest, lowest memory, minor quality loss
256d | Main retrieval first-pass | Good balance; default for most production deployments
1024d | High-precision reranking stage | Slower, higher memory, maximum quality
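When serving a truncated vector, renormalize after the cut so that dot products remain valid cosine similarities. A small numpy sketch (the helper name is ours):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Cut a Matryoshka embedding to its first `dim` dimensions
    and renormalize so dot products stay valid cosine scores."""
    cut = emb[:dim]
    return cut / np.linalg.norm(cut)

rng = np.random.default_rng(0)
full = rng.normal(size=1024)

small = truncate_embedding(full, 256)   # 256d operating point
assert small.shape == (256,)
assert np.isclose(np.linalg.norm(small), 1.0)
```

Skipping the renormalization silently shrinks score magnitudes at smaller dims, which breaks any fixed similarity thresholds downstream.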

5. Choosing the Right Base Model

The base model you start from affects fine-tuning outcomes more than training recipe details. Models that are already trained for retrieval converge faster, need less data, and produce better results after domain adaptation. Raw language models (BERT, RoBERTa) without retrieval pre-training require much larger datasets and more epochs to rival retrieval-specialized bases.

Model Family | Params | Standard Dim | Best Use Case
BGE (BAAI) | 110M–335M | 768, 1024 | Strong MTEB scores, English/multilingual retrieval, instruction-prefix support
E5 (Microsoft) | 110M–560M | 768, 1024 | Consistent cross-domain results, straightforward fine-tuning
MiniLM (Sentence Transformers) | 22M–33M | 384 | Latency-constrained mobile/edge deployments, lightweight APIs

6. Common Strategy Combinations

These four combinations cover most real-world deployment scenarios. Each step up the list trades higher cost for more capability.

LoRA + In-Batch Negatives

Start here

The simplest and most common first step. Adapts domain vocabulary with minimal hardware requirements. Good starting point for any team.

LoRA + Hard Negatives + MNR

Production default

Adds hard negative mining to LoRA training. Teaches fine-grained boundaries. Most production deployments evolve to this.

Full Fine-Tune + Matryoshka + Distillation

High-investment

High-capability setup for specialized domains. Rewrites the base model, trains multi-scale embeddings, and distills from a cross-encoder teacher.

LoRA + Matryoshka

Efficiency optimized

Memory-efficient fine-tuning combined with multi-scale serving flexibility. Allows a single model to serve multiple SLAs at different dimensions.


7. Practical Decision Table

Situation | Recommended Strategy
Limited GPU (a single 24GB card or less), any domain | LoRA on a small base (MiniLM or BGE-small)
Specialized domain, 50K+ training pairs | Full fine-tuning with BGE or E5, low learning rate
Multiple latency SLAs from one model | Matryoshka loss, any base, multiple serving dims
Have a strong cross-encoder teacher model | Distillation (MarginMSE) to transfer precision
Cold start, no click data yet | LoRA on base model + synthetic LLM-generated pairs

8. Fine-Tuning Failure Modes

Catastrophic Forgetting

Aggressive full fine-tuning rewrites general language understanding. Out-of-domain recall collapses while in-domain metrics improve. Fix: regularize toward base weights, use LoRA, reduce learning rate.

Adapter Rank Lock-In

Using a LoRA rank that is too low (rank=2 or rank=4) for a highly specialized domain. The adapters cannot express enough domain-specific geometry shift. Fix: increase rank incrementally and monitor nDCG@10.

Dimension Truncation Cliff

Matryoshka model performs well at full dimensions but drops sharply at 64d or 128d, suggesting the early dimensions were not trained to be independently coherent. Fix: ensure all target dims are in the loss function during training.

Query-Document Serving Mismatch

Queries encoded with the new fine-tuned model, documents still indexed with the pre-fine-tune model. Scores become incoherent. Fix: always re-embed the full corpus after any weight update to the encoder.
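A cheap guard against this failure mode is to stamp the index with the encoder version that produced its vectors, and refuse to serve on mismatch. A sketch in plain Python; the class and version strings are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class IndexMetadata:
    # Version of the encoder that produced every stored document vector
    encoder_version: str

def check_serving_compatibility(index_meta: IndexMetadata,
                                query_encoder_version: str) -> None:
    """Refuse to serve when query and document encoders diverge."""
    if index_meta.encoder_version != query_encoder_version:
        raise RuntimeError(
            f"Encoder mismatch: index built with {index_meta.encoder_version}, "
            f"queries encoded with {query_encoder_version}. Re-embed the corpus."
        )

meta = IndexMetadata(encoder_version="bge-base-ft-2024-06-01")
check_serving_compatibility(meta, "bge-base-ft-2024-06-01")  # no error: versions match
```

Running this check at service startup turns a silent relevance collapse into a loud deployment failure.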

Key Takeaways

01

The strategic question is adapting, not creating

You start from a model that already understands language. Fine-tuning teaches it your domain's vocabulary, entity relationships, and relevance judgments. The goal is not to create a new language understanding from scratch but to shift the embedding geometry toward your users' intent.

02

LoRA gives 70–80% of the benefit at 20–40% of the VRAM cost

Low-Rank Adaptation freezes all base weights and trains only two small matrices per layer. This makes fine-tuning feasible on smaller hardware and reduces catastrophic forgetting. In most production deployments, LoRA is the default choice unless the domain is extremely specialized.

03

Matryoshka enables flexible quality-cost tradeoffs at serving time

Matryoshka representations are trained so the first N dimensions are always a valid embedding on their own. This means you can serve at 64d, 256d, or 1024d from a single model, tuning for latency vs. quality dynamically per use case without retraining.

04

Forgetting is the silent cost of full fine-tuning without anchors

Full fine-tuning can overwrite general language understanding in favor of domain specifics, causing catastrophic forgetting on queries outside your training distribution. LoRA inherently mitigates this. For full fine-tuning, regularization toward base model weights is essential.