Fine-Tuning Strategies
Full fine-tuning, LoRA adapters, Matryoshka embeddings, and knowledge distillation. How to choose the right strategy for your constraints, and the failure modes that silently damage models in production.
100% VRAM
All model parameters are updated. Maximum expressive power but highest memory cost and forgetting risk.
20–40% VRAM
Trains only small rank-decomposed matrices per layer. Minimal forgetting. Production default for most teams.
3–10x Compression
Truncatable multi-scale embedding. One model serves many latency and storage operating points without retraining.
1. The Four Main Fine-Tuning Strategies
There is no single right fine-tuning method. The right strategy depends on dataset size, hardware constraints, how specialized the domain is, and whether you need flexible serving dimensions. These four strategies cover the practical spectrum from simple adaptation to full re-shaping of the embedding space.
Full Fine-Tuning
Risky. All parameters are updated. Best for very specialized domains where the base model has little coverage. High VRAM cost and the highest catastrophic forgetting risk.
LoRA (Low-Rank Adaptation)
Recommended default. Freeze base weights and add rank-decomposed adapter matrices; only the adapters are trained. Strong performance at a fraction of the VRAM cost.
Matryoshka Representation Learning
Best for multi-SLA serving. Train so that the first N dimensions are always a valid embedding. A single model supports multiple vector sizes at serving time.
Knowledge Distillation
Best for precision transfer. Match score distributions from a teacher cross-encoder or LLM judge. Transfers fine-grained ranking quality without requiring explicit triplets.
2. Full Fine-Tuning
Full fine-tuning updates all model parameters during training. This gives the model maximum flexibility to reshape the embedding space toward domain-specific concepts, including specialized vocabulary, entity types, and ranking preferences that the base model was never exposed to.
When to use:
- Very specialized domains (biomedical, legal)
- Large domain-specific training sets (>100K pairs)
- Base model has very limited coverage of your vocabulary

Risks:
- Catastrophic forgetting of general language
- High VRAM requirement
- Requires regularization (weight decay, EWC)
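The "regularize toward base weights" idea can be sketched in a few lines as an L2-SP-style penalty: the task loss is augmented with the squared distance between the current weights and the frozen pre-trained weights. This is a minimal numpy sketch; `l2_sp_penalty` is an illustrative name, not a function from any particular library.

```python
import numpy as np

def l2_sp_penalty(current_weights, base_weights, strength=0.01):
    """L2-SP-style regularizer: pulls fine-tuned weights back toward the
    pre-trained starting point to limit catastrophic forgetting."""
    return strength * sum(
        np.sum((w - w0) ** 2)
        for w, w0 in zip(current_weights, base_weights)
    )

# Toy example: total loss = task loss + anchor penalty.
base = [np.ones((4, 4)), np.zeros(4)]            # frozen base weights
current = [np.ones((4, 4)) * 1.1, np.zeros(4)]   # weights after some updates
penalty = l2_sp_penalty(current, base)
```

The penalty is zero when weights have not moved, and grows quadratically with drift, so the optimizer can still adapt to the domain while being discouraged from rewriting the base model wholesale.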
3. LoRA: Low-Rank Adaptation
LoRA freezes all base model weights and adds small pairs of low-rank matrices to each transformer layer. Only these adapter matrices are trained. The effective weight update is computed as the product of the two small matrices, which is added to the frozen base weight at inference time.
| Hyperparameter | Typical Value | Effect |
|---|---|---|
| rank | 8, 16, 32 | Controls adapter expressivity. Higher rank = more parameters but stronger adaptation. |
| alpha | rank to 2Γrank | Scales the adapter update magnitude. Start at rank value as default. |
| dropout | 0.05 to 0.1 | Regularizes adapter matrices. Prevents overfit on small datasets. |
| target_modules | q_proj, v_proj | Apply adapters to query and value projections. Adding k_proj improves quality but increases cost. |
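The mechanics behind the table can be sketched directly: the two adapter matrices form a rank-r update B·A that is scaled by alpha/r and added to the frozen weight. A minimal numpy sketch, assuming a 768-dimensional projection layer (BERT-base size); real training code would typically configure this through a library such as Hugging Face PEFT rather than by hand.

```python
import numpy as np

d, r, alpha = 768, 16, 32   # hidden size, LoRA rank, scaling (values from the table)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen base projection weight (not trained)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection, small random init
B = np.zeros((d, r))                 # trainable up-projection, zero init so the
                                     # adapter starts as a no-op

# Effective weight at inference: frozen base plus scaled low-rank update.
W_eff = W + (alpha / r) * (B @ A)

# Parameter cost: 2*d*r adapter params vs d*d for the full matrix.
adapter_params = A.size + B.size
full_params = W.size
```

With rank 16 on a 768-wide layer, the adapters hold about 4% of the full matrix's parameters, which is where the VRAM savings come from.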
4. Matryoshka Representation Learning
Matryoshka Representation Learning (MRL) trains the model so that the first N dimensions are always a valid, coherent embedding on their own. This allows you to truncate the vector to any size (64d, 128d, 256d, 512d, 1024d) after training and still get useful retrieval quality, at meaningfully different latency and storage costs.
The architecture is identical to a standard bi-encoder. Only the loss function changes: instead of a single MNR loss on the full embedding, you compute MNR loss at each target dimensionality and sum them.
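That summed loss can be sketched in plain numpy: compute the in-batch MNR loss on each prefix truncation of the embedding and add them up. Real training code would typically use a framework implementation (e.g. sentence-transformers' `MatryoshkaLoss` wrapping `MultipleNegativesRankingLoss`), but the mechanics are the same.

```python
import numpy as np

def mnr_loss(q, d, temperature=0.05):
    """In-batch Multiple Negatives Ranking loss: softmax cross-entropy
    over cosine scores, with positives on the diagonal."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = (q @ d.T) / temperature
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_loss(q, d, dims=(64, 128, 256, 512)):
    """Sum the MNR loss over each prefix truncation of the embedding,
    so every prefix is trained to be a usable embedding on its own."""
    return sum(mnr_loss(q[:, :k], d[:, :k]) for k in dims)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 512))   # batch of 8 query embeddings
d = rng.normal(size=(8, 512))   # corresponding positive documents
loss = matryoshka_loss(q, d)
```

Because every target dimensionality appears in the loss, the gradient pressure that makes the full vector discriminative is also applied to each prefix.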
Production Operating Points
| Dimensions | Use Case | Tradeoff |
|---|---|---|
| 64d | Autocomplete, fast type-ahead | Fastest, lowest memory, minor quality loss |
| 256d | Main retrieval first-pass | Good balance; default for most production deployments |
| 1024d | High-precision reranking stage | Slower, higher memory, maximum quality |
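Serving at any of these operating points is then just truncation plus re-normalization. A minimal sketch, assuming embeddings are compared by cosine similarity (`truncate_embedding` is an illustrative helper, not a library function):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Serve a Matryoshka embedding at a smaller operating point:
    keep the first `dim` dimensions, then re-normalize so cosine
    similarity remains well-defined."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(1).normal(size=1024)
fast = truncate_embedding(full, 64)    # autocomplete tier
main = truncate_embedding(full, 256)   # first-pass retrieval tier
```

The same stored full-width vector can back all tiers, or each tier can store only its prefix to realize the storage savings.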
5. Choosing the Right Base Model
The base model you start from affects fine-tuning outcomes more than training recipe details. Models that are already trained for retrieval converge faster, need less data, and produce better results after domain adaptation. Raw language models (BERT, RoBERTa) without retrieval pre-training require much larger datasets and more epochs to rival retrieval-specialized bases.
| Model Family | Params | Standard Dim | Best Use Case |
|---|---|---|---|
| BGE (BAAI) | 110M–335M | 768, 1024 | Strong MTEB scores, English/multilingual retrieval, instruction-prefix support |
| E5 (Microsoft) | 110M–560M | 768, 1024 | Consistent cross-domain results, straightforward fine-tuning |
| MiniLM (Sentence Transformers) | 22M–33M | 384 | Latency-constrained mobile/edge deployments, lightweight APIs |
6. Common Strategy Combinations
These four combinations cover most real-world deployment scenarios. The later combinations trade higher cost for greater capability.
LoRA + In-Batch Negatives
Start here. The simplest and most common first step. Adapts domain vocabulary with minimal hardware requirements. Good starting point for any team.
LoRA + Hard Negatives + MNR
Production default. Adds hard negative mining to LoRA training. Teaches fine-grained boundaries. Most production deployments evolve to this.
Full Fine-Tune + Matryoshka + Distillation
High investment. High-capability setup for specialized domains. Rewrites the base model, trains multi-scale embeddings, and distills from a cross-encoder teacher.
LoRA + Matryoshka
Efficiency-optimized. Memory-efficient fine-tuning combined with multi-scale serving flexibility. Allows a single model to serve multiple SLAs at different dimensions.
7. Practical Decision Table
| Situation | Recommended Strategy |
|---|---|
| Limited GPU budget (a single 24GB card), any domain | LoRA on small base (MiniLM or BGE-small) |
| Specialized domain, 50K+ training pairs | Full fine-tuning with BGE or E5, low learning rate |
| Multiple latency SLAs from one model | Matryoshka loss, any base, multiple serving dims |
| Have a strong cross-encoder teacher model | Distillation (MarginMSE) to transfer precision |
| Cold start, no click data yet | LoRA on base model + synthetic LLM-generated pairs |
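The MarginMSE objective named in the table can be sketched in a few lines: the student bi-encoder learns to reproduce the teacher cross-encoder's score *margin* between a positive and a negative passage, rather than the teacher's absolute scores. A minimal numpy sketch with toy scores:

```python
import numpy as np

def margin_mse_loss(s_qp, s_qn, t_qp, t_qn):
    """MarginMSE distillation: match the student's score margin
    (positive minus negative) to the teacher's margin. Matching margins
    rather than raw scores tolerates the different score scales of
    bi-encoders and cross-encoders."""
    student_margin = s_qp - s_qn
    teacher_margin = t_qp - t_qn
    return np.mean((student_margin - teacher_margin) ** 2)

# Toy batch of 2 queries: student bi-encoder vs teacher cross-encoder scores.
s_pos = np.array([0.8, 0.7]); s_neg = np.array([0.3, 0.5])
t_pos = np.array([0.9, 0.8]); t_neg = np.array([0.2, 0.4])
loss = margin_mse_loss(s_pos, s_neg, t_pos, t_neg)
```

Because only margins are matched, the teacher's fine-grained ranking preferences transfer without the student needing explicit triplet labels.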
8. Fine-Tuning Failure Modes
Catastrophic Forgetting
Aggressive full fine-tuning rewrites general language understanding. Out-of-domain recall collapses while in-domain metrics improve. Fix: regularize toward base weights, use LoRA, reduce learning rate.
Adapter Rank Lock-In
Using LoRA rank too low (rank=2 or rank=4) for a highly specialized domain. The adapters can't express enough domain-specific geometry shifts. Fix: increase rank incrementally and monitor nDCG@10.
Dimension Truncation Cliff
Matryoshka model performs well at full dimensions but drops sharply at 64d or 128d, suggesting the early dimensions were not trained to be independently coherent. Fix: ensure all target dims are in the loss function during training.
Query-Document Serving Mismatch
Queries encoded with the new fine-tuned model, documents still indexed with the pre-fine-tune model. Scores become incoherent. Fix: always re-embed the full corpus after any weight update to the encoder.
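One lightweight guard against this mismatch is to store a fingerprint of the encoder alongside the index and verify it before serving queries. A minimal stdlib sketch; the names `model_fingerprint` and `check_serving_consistency` are illustrative, not from any particular library.

```python
import hashlib

def model_fingerprint(model_name, weights_bytes):
    """Hash identifying an exact encoder version; store it with the
    vector index at embedding time and check it at query time."""
    h = hashlib.sha256()
    h.update(model_name.encode())
    h.update(weights_bytes)
    return h.hexdigest()[:16]

# Recorded when the corpus was embedded (toy weight bytes for illustration).
index_meta = {"encoder_fp": model_fingerprint("bge-small-ft-v2", b"weights-v2")}

def check_serving_consistency(query_encoder_fp, index_meta):
    """Refuse to serve if the query encoder differs from the one that
    built the index: that combination produces incoherent scores."""
    if query_encoder_fp != index_meta["encoder_fp"]:
        raise RuntimeError(
            "Query encoder does not match the index encoder; "
            "re-embed the corpus before serving."
        )
```

Running this check at service startup turns a silent relevance regression into a loud deployment failure.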
Key Takeaways
The strategic question is adapting, not creating
You start from a model that already understands language. Fine-tuning teaches it your domain's vocabulary, entity relationships, and relevance judgments. The goal is not to create a new language understanding from scratch but to shift the embedding geometry toward your users' intent.
LoRA gives 70–80% of the benefit at 20–40% of the VRAM cost
Low-Rank Adaptation freezes all base weights and trains only two small matrices per layer. This makes fine-tuning feasible on smaller hardware and reduces catastrophic forgetting. In most production deployments, LoRA is the default choice unless the domain is extremely specialized.
Matryoshka enables flexible quality-cost tradeoffs at serving time
Matryoshka representations are trained so the first N dimensions are always a valid embedding on their own. This means you can serve at 64d, 256d, or 1024d from a single model, tuning for latency vs. quality dynamically per use case without retraining.
Forgetting is the silent cost of full fine-tuning without anchors
Full fine-tuning can overwrite general language understanding in favor of domain specifics, causing catastrophic forgetting on queries outside your training distribution. LoRA inherently mitigates this. For full fine-tuning, regularization toward base model weights is essential.