Production Deployment of Embedding Models
How to take a fine-tuned embedding model from training to production safely. ONNX export, query vs. document pipelines, blue-green rollouts, caching, dynamic batching, monitoring, and the five failure modes that hit teams after launch.
Vector Mismatch
New query encoder + old document index = incoherent scores. Always re-embed the full corpus after any weight update.
2–10x Speedup
ONNX export + quantization (FP32→INT8) typically delivers 2–10x inference speedup with under 1% quality loss.
Required
After every model update, the full document corpus must be re-embedded and the ANN index rebuilt before cutover.
1. Two Inference Pipelines: Query and Document
Embedding inference splits into two fundamentally different jobs. Query encoding happens on every search request — it must be fast, with p99 latency under 50ms for most search applications. Document encoding runs as a batch job — it must be throughput-efficient to finish re-embedding millions of documents before query-vs-document drift accumulates.
Online Query Pipeline
Constraint: latency. Every search request waits for this to complete. Users are present.
- Model: ONNX-exported, INT8-quantized on GPU
- Batch size: 1 or small batches (1–4)
- Caching: LRU on popular query embeddings
- Target: p99 < 20–50ms
Offline Document Pipeline
Constraint: throughput and cost. Batch job, no user waiting. Must finish before stale index causes drift.
- Model: FP16 or INT8, maximized GPU utilization
- Batch size: 256–2048 per GPU, async multi-worker
- Trigger: any model weight update
- Target: complete full corpus before cutover
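The offline pipeline's shape can be sketched in a few lines. This is an illustrative stdlib-only sketch, not a production implementation: `encode_batch` is a hypothetical stand-in for a real model call (e.g. an ONNX session run), and the batch size and worker count are the knobs from the list above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator

def chunks(docs: list[str], size: int = 1024) -> Iterator[list[str]]:
    """Yield fixed-size batches for the offline document pipeline."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def encode_batch(batch: list[str]) -> list[list[float]]:
    # Hypothetical stand-in for the real encoder; here each
    # "embedding" is just a one-element placeholder vector.
    return [[float(len(doc))] for doc in batch]

def embed_corpus(docs: list[str], workers: int = 4,
                 batch_size: int = 1024) -> list[list[float]]:
    """Batched, multi-worker encoding; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(encode_batch, chunks(docs, batch_size))
        return [vec for batch in results for vec in batch]
```

Because `pool.map` preserves input order, the output vectors line up with the input documents, which matters when writing them into the ANN index.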
2. Model Export and Optimization
Serving a raw PyTorch model in production is inefficient. ONNX export removes Python overhead and enables hardware-specific optimizations. Combined with quantization, this typically delivers 2–10x speedups with under 1% quality degradation when done correctly.
Quantization Levels
| Precision | Memory | Speed | Risk |
|---|---|---|---|
| FP32 | Baseline | Baseline | None — reference |
| FP16 | 50% of FP32 | 1.5–2x | Minimal — standard for GPU inference |
| INT8 | 25% of FP32 | 2–4x on CPU/NPU | Must validate — embedding norms can shift |
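The INT8 risk in the table is easy to demonstrate in miniature. The sketch below implements symmetric per-vector INT8 quantization in pure Python (real toolchains such as ONNX Runtime do this per-tensor or per-channel with calibration) and measures the relative L2-norm change the round trip introduces, which is the quantity that can break cosine-similarity rankings:

```python
import math

def quantize_int8(vec: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization: q = round(x / scale), scale = max|x| / 127."""
    scale = max(abs(x) for x in vec) / 127.0
    return [round(x / scale) for x in vec], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

def norm(vec: list[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))

def norm_shift(vec: list[float]) -> float:
    """Relative L2-norm change introduced by the INT8 round trip."""
    q, scale = quantize_int8(vec)
    return abs(norm(dequantize(q, scale)) - norm(vec)) / norm(vec)
```

A validation gate can reject a quantized model when this shift, measured over a sample of real embeddings, exceeds a small threshold, in addition to the full nDCG benchmark check described later.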
3. Blue-Green Rollouts for Embedding Updates
Standard blue-green deployment swaps a new model version in place of the old one. For embedding models, this requires an additional step: the document index must be fully rebuilt with the new model before any query traffic uses it. Swapping model weights without rebuilding the index is the most common and most damaging embedding deployment mistake.
1. Shadow build: Begin encoding all corpus documents with the new model into a shadow ANN index. Keep the current (blue) index serving all live traffic. No user impact.
2. Offline validation: Run your full nDCG benchmark against the shadow index. Compare all query slices against the current index. Reject if any slice regresses.
3. Canary: Route a small fraction of live queries against the shadow index. Monitor CTR, zero-result rate, and session quality in real time. Watch ANN recall metrics.
4. Ramp: Increase traffic incrementally. Hold at each step for enough time to detect regressions. Automated rollback triggers on metric degradation.
5. Cleanup: Only after 100% traffic has been confirmed stable for 24–48 hours, delete the old document index. Retain model weights and training artifacts for rollback.
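The offline validation gate in step 2 reduces to a simple comparison. A minimal sketch, assuming per-slice nDCG scores have already been computed for both indexes (the slice names and tolerance parameter are illustrative):

```python
def passes_validation(baseline: dict[str, float],
                      shadow: dict[str, float],
                      tolerance: float = 0.0) -> bool:
    """Reject the shadow index if any query slice's nDCG falls more
    than `tolerance` below the currently serving (blue) index.
    A slice missing from the shadow results counts as a regression."""
    return all(shadow.get(slice_name, 0.0) >= score - tolerance
               for slice_name, score in baseline.items())
```

Keeping `tolerance` at zero makes the gate strict: any slice regression blocks cutover, which matches the "reject if any slice regresses" rule above.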
4. Dynamic Batching and Embedding Caching
Dynamic Batching
GPU utilization collapses when you process one query at a time. Dynamic batching collects multiple concurrent requests into a single inference call, amortizing GPU overhead across all requests. Frameworks like NVIDIA Triton Inference Server support dynamic batching natively with configurable latency budgets.
Key tradeoff: batching increases throughput but adds queuing latency. Set the batch window (time budget before flushing the batch) to be less than your p99 latency target. Starting batch window: 5–15ms.
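The flush rule (full batch, or window expired) can be shown with a deterministic simulation over request arrival times. This is a sketch of the batching policy only; a real server (e.g. Triton) applies it concurrently to a live request queue:

```python
def form_batches(arrival_ms: list[float],
                 window_ms: float = 10.0,
                 max_batch: int = 32) -> list[list[int]]:
    """Group request indices into batches: a batch is flushed when it
    is full (max_batch) or when window_ms has elapsed since its first
    request arrived. arrival_ms must be sorted ascending."""
    batches: list[list[int]] = []
    current: list[int] = []
    start = 0.0
    for i, t in enumerate(arrival_ms):
        if current and (len(current) >= max_batch or t - start > window_ms):
            batches.append(current)   # flush: full or window expired
            current = []
        if not current:
            start = t                 # window starts at first request
        current.append(i)
    if current:
        batches.append(current)
    return batches
```

With a 10ms window, requests arriving at 0, 1, and 2ms share one inference call, while a request at 20ms opens a fresh batch; this is the throughput-vs-queuing-latency tradeoff in miniature.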
Query Embedding Cache
Popular head queries repeat constantly. Caching their embeddings avoids redundant model inference. A simple LRU cache keyed on the normalized query string (lowercased, whitespace-stripped) typically achieves 15–40% hit rate depending on query distribution.
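The cache described above fits in a few lines of stdlib Python. A minimal sketch, assuming `encode` is the model call (e.g. an ONNX session wrapper) and the capacity is sized to the head of the query distribution:

```python
from collections import OrderedDict
from typing import Callable

def normalize(query: str) -> str:
    """Cache key: lowercased, whitespace-collapsed query string."""
    return " ".join(query.lower().split())

class QueryEmbeddingCache:
    def __init__(self, encode: Callable[[str], list[float]],
                 capacity: int = 10_000):
        self.encode = encode
        self.capacity = capacity
        self._cache: OrderedDict[str, list[float]] = OrderedDict()
        self.hits = self.misses = 0

    def get(self, query: str) -> list[float]:
        key = normalize(query)
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        vec = self.encode(query)
        self._cache[key] = vec
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return vec
```

Tracking `hits` and `misses` gives the hit-rate metric directly, which is worth exporting to your monitoring stack alongside the latency numbers.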
5. Embedding-Specific Monitoring
Standard infrastructure monitoring (CPU, memory, latency) does not catch embedding-specific degradation. These signals require custom instrumentation but are essential for detecting model health problems before they propagate into user experience metrics.
Infrastructure Metrics
- Request latency (p50/p99), GPU utilization, batch queue depth, cache hit rate
Embedding Health Metrics
- Vector norm distribution, cosine similarity of near-duplicate documents, ANN recall spot-checks against exact search
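Vector norm drift is one of the cheapest health metrics to instrument. A minimal sketch, assuming you capture a reference sample of embedding norms at deploy time and compare a live sample against it (the 5% threshold is an illustrative default, not a recommendation):

```python
import math

def l2_norms(vectors: list[list[float]]) -> list[float]:
    return [math.sqrt(sum(x * x for x in v)) for v in vectors]

def norm_drift_alert(reference: list[list[float]],
                     live: list[list[float]],
                     max_shift: float = 0.05) -> bool:
    """Alert when the mean embedding norm of a live sample drifts more
    than max_shift (relative) from the reference captured at deploy."""
    ref_mean = sum(l2_norms(reference)) / len(reference)
    live_mean = sum(l2_norms(live)) / len(live)
    return abs(live_mean - ref_mean) / ref_mean > max_shift
```

A sudden norm shift often points at a silently swapped model, a quantization regression, or a preprocessing change upstream of the encoder.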
6. Five Common Production Failure Modes
Vector Space Mismatch After Update
The new fine-tuned query encoder is deployed but old document vectors remain in the index. Scores become incoherent — queries return wrong results but without obvious errors. Fix: mandatory full re-embedding pipeline before any encoder is deployed to production.
INT8 Quantization Norm Shift
After INT8 quantization, embedding norms can shift in ways that break cosine similarity. An affected model may still score well on the nDCG benchmark yet fail on queries near the similarity threshold boundary. Fix: always validate the quantized model against the full nDCG benchmark before shipping.
ANN Index Stale After Massive Doc Ingestion
If documents are added to the corpus but not encoded and added to the ANN index promptly, search misses newly added content. This is invisible until a user searches for something that only exists in the new documents. Fix: streaming index updates or bounded re-indexing schedules.
Cache Serving Stale Embeddings
Query embeddings cached before a model update are served post-update. The cached query vector was produced by the old encoder and is compared against document vectors from the new encoder. Fix: flush embedding cache atomically on every model deployment.
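An alternative to an explicit atomic flush is to namespace cache keys by model version, so a deploy implicitly invalidates every entry written under the old encoder. A minimal sketch (the version string format is an assumption; any identifier that changes on deploy works):

```python
def cache_key(query: str, model_version: str) -> str:
    """Namespace cache entries by model version: after a deploy the new
    version produces different keys, so old-encoder embeddings are
    never served against new-encoder document vectors."""
    normalized = " ".join(query.lower().split())
    return f"{model_version}:{normalized}"
```

Stale entries from the previous version then age out of the LRU naturally instead of requiring a coordinated flush at cutover.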
Batch Size Cap at Serving Underutilizes GPU
Serving configured with batch_size=1 leaves 95%+ GPU capacity idle under load. When traffic spikes, latency jumps because requests queue instead of batching. Fix: configure dynamic batching with max_batch=32–64 and a latency budget of 10–20ms.
7. Production Deployment Checklist
Before shipping any embedding model update to production, verify every item in this checklist. A single missed step is enough to produce incoherent search results for all users.
- Full corpus re-embedded with the new model and the ANN index rebuilt
- Quantized model validated against the full nDCG benchmark
- Shadow index validated across all query slices, with no regressions
- Query embedding cache flushed atomically at cutover
- Canary traffic monitored (CTR, zero-result rate, ANN recall) before ramping
- Old index, model weights, and training artifacts retained for rollback
Key Takeaways
Treat query and document inference as two separate pipelines
Query encoding is latency-sensitive — it happens on every search request. Document encoding is throughput-sensitive — it runs as a batch job that must complete before stale docs pile up. Optimizing both with the same technique is a common mistake. ONNX + GPU helps queries; batching + async workers help documents.
A full corpus re-embedding pass must be triggered after any weight update
After fine-tuning, all previously indexed document vectors were produced by the old model. Serving new query vectors against old document vectors produces incoherent scores. This is the most common production failure and is entirely preventable with a gated deployment process.
Blue-green rollouts protect against catastrophic embedding failures
The shadow index pattern allows you to build and validate a new embedding index in parallel before cutting over any traffic. If validation fails, you revert instantly with zero user impact. This should be the standard deployment model for every embedding update.
Monitor embedding health directly, not just query latency
Embedding-specific metrics — vector norm distribution, cosine similarity of near-duplicate docs, ANN recall spot-checks — catch model degradation before it surfaces in business metrics. Standard infrastructure monitoring cannot see these signals.