Systems Atlas
Chapter 7.6: Training Embedding Models

Production Deployment of Embedding Models

How to take a fine-tuned embedding model from training to production safely. ONNX export, query vs. document pipelines, blue-green rollouts, caching, dynamic batching, monitoring, and the five failure modes that hit teams after launch.

Common Mistake

Vector Mismatch

New query encoder + old document index = incoherent scores. Always re-embed the full corpus after any weight update.

Optimization

2–10x Speedup

ONNX export + quantization (FP32→INT8) typically delivers 2–10x inference speedup with under 1% quality loss.

Re-Embedding

Required

After every model update, the full document corpus must be re-embedded and the ANN index rebuilt before cutover.

Deployment Complexity: Shipping an embedding model is not like shipping a classifier. A classifier update changes predictions one request at a time; an embedding model update changes every vector in your index. Even small weight changes alter the embedding of every document, so the corpus must be re-encoded before the new model serves queries. This requires a different deployment pattern than standard ML serving.

1. Two Inference Pipelines: Query and Document

Embedding inference splits into two fundamentally different jobs. Query encoding happens on every search request — it must be fast, with p99 latency under 50ms for most search applications. Document encoding runs as a batch job — it must be throughput-efficient to finish re-embedding millions of documents before query-vs-document drift accumulates.

Online Query Pipeline

Constraint: latency. Every search request waits for this to complete. Users are present.

  • Model: ONNX-exported, INT8-quantized on GPU
  • Batch size: 1 or small batches (1–4)
  • Caching: LRU on popular query embeddings
  • Target: p99 < 20–50ms

Offline Document Pipeline

Constraint: throughput and cost. Batch job, no user waiting. Must finish before stale index causes drift.

  • Model: FP16 or INT8, maximized GPU utilization
  • Batch size: 256–2048 per GPU, async multi-worker
  • Trigger: any model weight update
  • Target: complete full corpus before cutover
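The offline pipeline above is essentially a chunked map over the corpus. A minimal sketch, assuming a hypothetical `encode_batch(texts)` model call (everything except the chunking logic is illustrative):

```python
from typing import Iterable, Iterator, List


def chunked(docs: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches of documents so the GPU stays saturated."""
    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def reembed_corpus(docs, encode_batch, batch_size=1024):
    """Re-embed the full corpus. `encode_batch` stands in for the model call
    (ONNX session, HTTP endpoint, etc. -- whatever your serving stack uses)."""
    vectors = []
    for batch in chunked(docs, batch_size):
        vectors.extend(encode_batch(batch))
    return vectors
```

In a real job the loop would run across multiple async workers and checkpoint progress, so a mid-run failure does not force re-encoding from scratch.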

2. Model Export and Optimization

Serving a raw PyTorch model in production is inefficient. ONNX export removes Python overhead and enables hardware-specific optimizations. Combined with quantization, this typically delivers 2–10x speedups with under 1% quality degradation when done correctly.

export_onnx.py
import torch
import onnxruntime as ort

# Export the PyTorch model to ONNX format
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "attention_mask": {0: "batch", 1: "seq_len"},
    },
)

# Load an optimized session for inference (falls back to CPU if no GPU)
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

Quantization Levels

Precision   Memory         Speed              Risk
FP32        Baseline       Baseline           None — reference
FP16        50% of FP32    1.5–2x             Minimal — standard for GPU inference
INT8        25% of FP32    2–4x on CPU/NPU    Must validate — embedding norms can shift
Quantization Warning: INT8 quantization is not safe to apply without validation. Embedding norms shift after quantization, especially after L2 normalization. Always benchmark quantized vs. FP32 on your nDCG evaluation benchmark before shipping INT8 to production.
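The norm-shift risk can be illustrated with a toy simulation of symmetric per-tensor INT8 quantization. This is a simplification of what real runtimes do, and the function names are illustrative; the real gate is your nDCG benchmark, not this sanity check:

```python
import math


def quantize_int8(vec):
    """Symmetric per-tensor INT8 quantization of a single vector."""
    scale = max(abs(x) for x in vec) / 127.0
    q = [max(-128, min(127, round(x / scale))) for x in vec]
    return q, scale


def dequantize(q, scale):
    return [x * scale for x in q]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


vec = [0.8, -0.3, 0.05, 0.61, -0.94]
q, scale = quantize_int8(vec)
restored = dequantize(q, scale)
# Rounding error is small but never exactly zero; queries whose scores sit
# near a similarity threshold are the ones that flip.
drift = 1.0 - cosine(vec, restored)
```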

3. Blue-Green Rollouts for Embedding Updates

Standard blue-green deployment swaps a new model version in place of the old one. For embedding models, this requires an additional step: the document index must be fully rebuilt with the new model before any query traffic uses it. Swapping model weights without rebuilding the index is the most common and most damaging embedding deployment mistake.

1
Start parallel shadow index

Begin encoding all corpus documents with the new model into a shadow ANN index. Keep the current (blue) index serving all live traffic. No user impact.

2
Validate shadow index offline

Run your full nDCG benchmark against the shadow index. Compare all query slices against the current index. Reject if any slice regresses.
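The validation gate in this step can be sketched as a per-slice comparison. Slice names, metric dictionaries, and the tolerance parameter below are all illustrative:

```python
def passes_slice_gate(baseline_ndcg, shadow_ndcg, tolerance=0.0):
    """Reject the shadow index if any query slice regresses.

    Both arguments map slice name -> nDCG score. Returns (passed, regressions)
    where regressions maps each failing slice to (shadow, baseline) scores."""
    regressions = {
        s: (shadow_ndcg.get(s, 0.0), baseline_ndcg[s])
        for s in baseline_ndcg
        if shadow_ndcg.get(s, 0.0) < baseline_ndcg[s] - tolerance
    }
    return len(regressions) == 0, regressions
```

Comparing per slice rather than on the aggregate matters: a new model can improve the mean nDCG while silently destroying one slice (e.g. long-tail or non-English queries).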

3
Canary traffic (1–5%)

Route a small fraction of live queries against the shadow index. Monitor CTR, zero-result rate, and session quality in real time. Watch ANN recall metrics.

4
Gradual rollout (10% → 50% → 100%)

Increase traffic incrementally. Hold at each step for enough time to detect regressions. Automated rollback triggers on metric degradation.

5
Decommission old index

Only after 100% traffic has been confirmed stable for 24–48 hours, delete the old document index. Retain model weights and training artifacts for rollback.
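The automated rollback trigger from step 4 could be sketched as a simple threshold check against the blue baseline. The metric names and tolerances here are illustrative, not standard:

```python
def should_rollback(baseline, canary, max_ctr_drop=0.02, max_zero_result_rise=0.01):
    """Return True if the canary's metrics degraded beyond tolerance.

    `baseline` and `canary` map metric name -> observed value over the
    same measurement window."""
    if canary["ctr"] < baseline["ctr"] - max_ctr_drop:
        return True  # users click noticeably less on shadow-index results
    if canary["zero_result_rate"] > baseline["zero_result_rate"] + max_zero_result_rise:
        return True  # shadow index is missing content the old one found
    return False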


4. Dynamic Batching and Embedding Caching

Dynamic Batching

GPU utilization collapses when you process one query at a time. Dynamic batching collects multiple concurrent requests into a single inference call, amortizing GPU overhead across all requests. Frameworks like NVIDIA Triton Inference Server support dynamic batching natively with configurable latency budgets.

Key tradeoff: batching increases throughput but adds queuing latency. Set the batch window (time budget before flushing the batch) to be less than your p99 latency target. Starting batch window: 5–15ms.
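The batch-window policy can be illustrated with a single-threaded simulation: flush when the batch is full or when the oldest queued request has waited past the window. In production a framework like Triton implements this for you; this sketch only shows the policy, and the parameter defaults are illustrative:

```python
def plan_batches(arrival_times_ms, max_batch=32, window_ms=10.0):
    """Group request arrival times (ms, ascending) into inference batches.

    A batch is flushed when it reaches `max_batch` requests, or when a new
    request arrives after the oldest queued one has waited `window_ms`."""
    batches, current, opened_at = [], [], None
    for t in arrival_times_ms:
        if current and t - opened_at >= window_ms:
            batches.append(current)        # window expired: flush what we have
            current, opened_at = [], None
        if opened_at is None:
            opened_at = t                  # first request opens the window
        current.append(t)
        if len(current) == max_batch:
            batches.append(current)        # full batch: flush immediately
            current, opened_at = [], None
    if current:
        batches.append(current)            # flush the trailing partial batch
    return batches
```

Note the worst-case added latency for any request is one full window, which is why the window must sit well under the p99 target.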

Query Embedding Cache

Popular head queries repeat constantly. Caching their embeddings avoids redundant model inference. A simple LRU cache keyed on the normalized query string (lowercased, whitespace-stripped) typically achieves 15–40% hit rate depending on query distribution.

Cache Invalidation: When the model is updated, all cached embeddings become stale (produced by the old encoder). Flush the entire cache atomically on every model deployment.
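One simple way to make the flush atomic is to fold a model version into the cache key, so bumping the version invalidates every old entry at once. A sketch, with hypothetical class and method names:

```python
from collections import OrderedDict


class QueryEmbeddingCache:
    """LRU cache keyed on (model_version, normalized query string)."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.version = 0
        self._store = OrderedDict()

    @staticmethod
    def _normalize(query):
        # Lowercase + collapse whitespace, as described above.
        return " ".join(query.lower().split())

    def get(self, query):
        key = (self.version, self._normalize(query))
        if key in self._store:
            self._store.move_to_end(key)   # mark as recently used
            return self._store[key]
        return None

    def put(self, query, embedding):
        key = (self.version, self._normalize(query))
        self._store[key] = embedding
        self._store.move_to_end(key)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

    def flush_on_deploy(self):
        self.version += 1  # old keys can never match again; LRU evicts them
```

Stale entries from the previous version age out through normal LRU eviction, so no stop-the-world delete is needed at cutover.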

5. Embedding-Specific Monitoring

Standard infrastructure monitoring (CPU, memory, latency) does not catch embedding-specific degradation. These signals require custom instrumentation but are essential for detecting model health problems before they propagate into user experience metrics.

Infrastructure Metrics

Query encoder p50/p99 latency: Track separately from total search latency to isolate encoder regressions
Throughput (QPS): Monitor per serving instance to detect overload before latency spikes
GPU utilization: Low utilization often means batch sizes are too small
Index build time: Track re-embedding job duration to catch model slowdowns early

Embedding Health Metrics

Vector norm distribution: Should be stable across model versions. Drift indicates quantization issue or bad normalization layer
ANN recall spot-checks: Periodically run exact search vs ANN on sample queries to confirm recall is within target (>95%)
Cosine similarity of near-duplicate docs: Should be very high (>0.98). Low similarity after update indicates embedding space drift
Business metrics (CTR, zero-result rate): Final arbiter. Embedding health metrics gate the diagnosis, business metrics confirm user impact

6. Five Common Production Failure Modes

Vector Space Mismatch After Update

The new fine-tuned query encoder is deployed but old document vectors remain in the index. Scores become incoherent — queries return wrong results but without obvious errors. Fix: mandatory full re-embedding pipeline before any encoder is deployed to production.

INT8 Quantization Norm Shift

After INT8 quantization, embedding norms can shift in ways that break cosine similarity. Affected models may score well on the aggregate nDCG benchmark yet fail on queries whose scores sit near a similarity threshold. Fix: always validate the quantized model against the full nDCG benchmark before shipping.

ANN Index Stale After Massive Doc Ingestion

If documents are added to the corpus but not encoded and added to the ANN index promptly, search misses newly added content. This is invisible until a user searches for something that only exists in the new documents. Fix: streaming index updates or bounded re-indexing schedules.

Cache Serving Stale Embeddings

Query embeddings cached before a model update are served post-update. The cached query vector was produced by the old encoder and is compared against document vectors from the new encoder. Fix: flush embedding cache atomically on every model deployment.

Batch Size Cap at Serving Underutilizes GPU

Serving configured with batch_size=1 leaves 95%+ of GPU capacity idle under load. When traffic spikes, latency jumps because requests queue instead of batching. Fix: configure dynamic batching with max_batch=32–64 and a latency budget of 10–20ms.


7. Production Deployment Checklist

Before shipping any embedding model update to production, verify every item in this checklist. A single missed step is enough to produce incoherent search results for all users.

Offline benchmark passes on all query slices (no regressions vs. current model) [Required]
ONNX export validated — output cosine similarities match PyTorch for N random queries [Required]
Quantization validated — nDCG delta within 0.5% of FP32 reference on benchmark [Required]
Full corpus re-embedding job queued and will complete before traffic cutover [Required]
Shadow ANN index built from new document vectors and validated offline [Required]
Query embedding cache flush scheduled to fire atomically with traffic cutover [Required]
Rollback plan documented — old index retained and rollback takes under 5 minutes
Monitoring dashboards updated — vector norm baseline set for new model
ANN recall spot-check scheduled for 1 hour post-cutover
On-call team notified of deployment window and escalation path

Key Takeaways

01

Treat query and document inference as two separate pipelines

Query encoding is latency-sensitive — it happens on every search request. Document encoding is throughput-sensitive — it runs as a batch job that must complete before stale docs pile up. Optimizing both with the same technique is a common mistake. ONNX + GPU helps queries; batching + async workers help documents.

02

A full corpus re-embedding pass must be triggered after any weight update

After fine-tuning, all previously indexed document vectors were produced by the old model. Serving new query vectors against old document vectors produces incoherent scores. This is the most common production failure and is entirely preventable with a gated deployment process.

03

Blue-green rollouts protect against catastrophic embedding failures

The shadow index pattern allows you to build and validate a new embedding index in parallel before cutting over any traffic. If validation fails, you revert instantly with zero user impact. This should be the standard deployment model for every embedding update.

04

Monitor embedding health directly, not just query latency

Embedding-specific metrics — vector norm distribution, cosine similarity of near-duplicate docs, ANN recall spot-checks — catch model degradation before it surfaces in business metrics. Standard infrastructure monitoring cannot see these signals.