Cost of Vector Search at Scale
HNSW indexes require terabytes of RAM. Embedding models consume GPU hours. At billion-scale, infrastructure costs can dominate a company's entire search budget. This chapter provides concrete cost analysis and optimization strategies that can reduce costs by 10-30x.
- **$25K/mo**: in-RAM HNSW at 1B vectors. ~3.2 TB RAM needed; 7× r6g.16xlarge instances at AWS on-demand pricing.
- **$800/mo**: PQ (32x compression). 3 TB → 96 GB, a ~97% cost reduction with two-phase search.
- **$600-$20K**: embedding 1B documents. Local (MiniLM on A100): ~$600. API (OpenAI): ~$20K. Re-embedding after a model upgrade repeats this cost.
1. Embedding Generation Cost
Before you can search vectors, you need to create them. Every document in your corpus must be passed through a neural network to produce its embedding. This is a one-time cost per document (plus re-embedding when documents change), but at scale it becomes substantial. The cost depends on the model size: smaller models are faster and cheaper but produce lower-quality embeddings. Larger models produce better embeddings but cost more to run. API-based models are convenient but charge per token.
At 1 billion documents, local embedding (all-MiniLM on A100) costs ~$600 in GPU hours over ~55 hours of compute. A higher-quality model like e5-large costs ~$3,750 over 14 days on a single GPU. API-based embedding (OpenAI) costs ~$20,000 with no GPU hardware needed, but rate limits mean it could take days. Re-embedding cost is critical: when you upgrade to a better model (which happens regularly as the field advances), you must re-embed your entire corpus — a full repeat of the original cost.
| Model | Dims | Speed (A100) | Cost/1M docs | MTEB |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~5,000/sec | ~$0.60 | 56.3 |
| all-mpnet-base-v2 | 768 | ~2,500/sec | ~$1.20 | 57.8 |
| e5-large-v2 | 1024 | ~800/sec | ~$3.75 | 62.2 |
| OpenAI text-embedding-3-small | 1536 | API | ~$20 | — |
| Cohere embed-v3 | 1024 | API | ~$10 | — |
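These per-document costs come straight from throughput. A back-of-envelope helper, assuming an effective rate of ~$11 per A100-hour (the rate implied by the table above; actual cloud GPU pricing varies widely):

```python
def embedding_cost(num_docs: int, docs_per_sec: float,
                   gpu_dollars_per_hour: float = 11.0) -> tuple[float, float]:
    """Return (gpu_hours, dollars) to embed num_docs at a given throughput."""
    gpu_hours = num_docs / docs_per_sec / 3600
    return gpu_hours, gpu_hours * gpu_dollars_per_hour

# 1B docs through an all-MiniLM-class model at ~5,000 docs/sec:
hours, dollars = embedding_cost(1_000_000_000, 5_000)
print(f"~{hours:.0f} GPU-hours, ~${dollars:,.0f}")
```

At ~5,000 docs/sec this lands near the chapter's ~55 hours and ~$600; swapping in e5-large's ~800 docs/sec reproduces the ~14-day, ~$3,750 estimate.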
2. Storage Cost (Raw Vectors)
The memory required to store vectors follows a simple formula: memory_per_vector = dimensions × bytes_per_float. For 768-dimensional float32 vectors, that's 768 × 4 = 3,072 bytes (3 KB) per vector. These per-vector costs seem small, but they compound brutally at scale. The table below shows how storage requirements grow as you move from millions to billions of vectors, and how reducing precision (float16, int8) provides significant savings.
| Scale | float32 (768d) | float16 | int8 |
|---|---|---|---|
| 1M | 3 GB | 1.5 GB | 0.75 GB |
| 10M | 30 GB | 15 GB | 7.5 GB |
| 100M | 300 GB | 150 GB | 75 GB |
| 1B | 3,000 GB (3 TB) | 1,500 GB | 750 GB |
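The formula is easy to sanity-check in code. This sketch reproduces the table's rows in decimal GB, ignoring index overhead; small differences are just the table's rounding (3,072 GB shown as 3,000 GB):

```python
def raw_vector_gb(num_vectors: int, dims: int = 768,
                  bytes_per_component: int = 4) -> float:
    """Raw vector storage in decimal GB: count * dims * bytes, no index overhead."""
    return num_vectors * dims * bytes_per_component / 1e9

print(raw_vector_gb(1_000_000))                             # 3.072 (float32, 1M)
print(raw_vector_gb(1_000_000_000, bytes_per_component=1))  # 768.0 (int8, 1B)
```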
To put 3 TB in context: AWS r6g.16xlarge has 512 GB RAM at ~$3,600/month. You'd need 6 instances just for raw vectors — ~$21,600/month before index overhead.
3. Infrastructure Cost (Cloud)
Here's what it actually costs to run vector search at various scales on cloud infrastructure. These numbers are based on AWS pricing (on-demand, us-east-1, 2024) and should be treated as illustrative. Costs track the RAM footprint closely: going from 100M to 1B vectors (10x data) costs roughly 7x more, and while per-GB memory pricing is fairly flat across instance sizes, running a seven-instance fleet adds sharding, replication, and operational overhead that a single box avoids.
Managed vector database costs (Pinecone, Weaviate Cloud, etc.) are typically 2-5x higher than self-managed cloud instances, but include operational overhead, scaling, backups, and monitoring. For teams without dedicated infrastructure engineers, the managed premium is often worth it.
| Vectors | HNSW RAM | Instance | Monthly |
|---|---|---|---|
| 1M | ~4 GB | r6g.large | ~$90 |
| 10M | ~35 GB | r6g.xlarge × 2 | ~$360 |
| 100M | ~350 GB | r6g.8xlarge × 2 | ~$3,600 |
| 1B | ~3.2 TB | r6g.16xlarge × 7 | ~$25,000 |
| 10B | ~32 TB | 70+ instances | ~$250,000 |
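The instance counts in the table follow from a ceiling division over per-instance RAM. A sizing sketch using the chapter's r6g.16xlarge figures (512 GB, ~$3,600/month) as illustrative inputs:

```python
import math

def cluster_cost(index_ram_gb: float, instance_ram_gb: float = 512,
                 instance_monthly_usd: float = 3600) -> tuple[int, float]:
    """Return (instance_count, monthly_usd) to hold an index fully in RAM."""
    n = math.ceil(index_ram_gb / instance_ram_gb)
    return n, n * instance_monthly_usd

print(cluster_cost(3200))  # (7, 25200): 1B vectors with HNSW overhead
```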
4. Cost Optimization Strategies
The good news is that several techniques can dramatically reduce vector search costs. The strategies below are listed in order of impact — quantization alone can save $19K/month at billion-scale, and combining multiple strategies can reduce costs by 10-30x while maintaining acceptable recall.
Strategy 1: Quantization
Quantization reduces the precision of each number in the vector, trading a small amount of accuracy for massive memory savings. This is almost always the first optimization to apply because it's simple, well-understood, and the recall impact is manageable. Scalar quantization maps each float32 dimension to an int8 value (4x savings), while product quantization (PQ) compresses the full vector into ~96 sub-vector centroid IDs (32x savings).
| Method | Compression | Recall Loss | Memory (1B) | Monthly |
|---|---|---|---|---|
| float32 | 1x | — | 3,000 GB | ~$25,000 |
| float16 | 2x | <1% | 1,500 GB | ~$12,500 |
| Scalar int8 | 4x | 2-5% | 750 GB | ~$6,200 |
| PQ (M=96) | 32x | 5-15% | 96 GB | ~$800 |
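A minimal sketch of the scalar int8 row with NumPy: min-max quantization maps each float32 dimension to a uint8 code. Libraries such as FAISS ship tuned versions of this; the sketch just shows the mechanics.

```python
import numpy as np

def scalar_quantize(vecs: np.ndarray):
    """Min-max scalar quantization: map each dimension to a uint8 code
    (4x smaller than float32). Returns (codes, lo, scale) for dequantization."""
    lo, hi = vecs.min(axis=0), vecs.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((vecs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

v = np.random.default_rng(0).normal(size=(1000, 768)).astype(np.float32)
codes, lo, scale = scalar_quantize(v)
recon = dequantize(codes, lo, scale)
print(codes.nbytes / v.nbytes)         # 0.25: 4x compression
print(float(np.abs(v - recon).max()))  # reconstruction error, ~half a quantization step
```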
Strategy 2: DiskANN
Microsoft's DiskANN stores the Vamana graph on SSD instead of RAM, keeping only compressed PQ codes in memory. The fundamental insight is that SSD storage costs ~10x less per GB than RAM, and modern NVMe SSDs can deliver the random I/O patterns needed for graph traversal with acceptable latency (5-20ms per query vs. <1ms for in-memory HNSW). For many applications, 10-20ms query latency is perfectly acceptable.
Strategy 3: Tiered Storage
Not all vectors need to be in RAM. Production systems implement tiered storage based on access patterns — recent and frequently accessed vectors stay in fast RAM (HNSW), moderately accessed vectors move to SSD (DiskANN), and rarely accessed archival vectors live on cheap object storage. The key is building a routing layer that directs queries to the appropriate tier based on recency and access frequency.
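One way to sketch that routing layer; the age thresholds and tier names here are invented for illustration, not a recommended policy:

```python
import time

def route_tier(last_access_ts: float, now: float) -> str:
    """Pick a storage tier from how recently a vector was accessed."""
    age_days = (now - last_access_ts) / 86400
    if age_days < 7:
        return "ram-hnsw"        # hot: in-memory HNSW
    if age_days < 90:
        return "ssd-diskann"     # warm: DiskANN on NVMe
    return "object-storage"      # cold: archival

now = time.time()
print(route_tier(now - 1 * 86400, now))    # ram-hnsw
print(route_tier(now - 365 * 86400, now))  # object-storage
```

A production router would also track access frequency and promote vectors back up a tier when they get hot again.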
Strategy 4: Matryoshka Embeddings
Lower dimensions mean less memory AND faster distance computation — a double win. Matryoshka embeddings (Kusupati et al., 2022) are models specifically trained to produce embeddings that are useful at any prefix length. You can take the first 256 dimensions of a 768-dim Matryoshka embedding and still get good retrieval quality — typically only 2-5% recall drop. This is different from post-hoc dimensionality reduction (PCA), which typically loses 5-15% recall at the same reduction ratio.
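Using a Matryoshka model's output at a shorter length is just a slice plus re-normalization. A sketch (the 768 → 256 sizes follow the text; any prefix length the model was trained for works):

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, k: int = 256) -> np.ndarray:
    """Keep the first k dims of a Matryoshka embedding and re-normalize,
    so cosine / inner-product search still behaves on the shorter vectors."""
    prefix = emb[..., :k]
    norms = np.linalg.norm(prefix, axis=-1, keepdims=True)
    return prefix / np.where(norms == 0, 1.0, norms)

full = np.random.default_rng(1).normal(size=(4, 768)).astype(np.float32)
small = truncate_matryoshka(full)
print(small.shape)  # (4, 256)
```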
5. Architecture Cost Comparison (100M vectors, 768d)
The right architecture depends on your latency requirements. If you can tolerate 10-15ms latency (most applications can), DiskANN or IVF-PQ with re-scoring gives you 95%+ recall at a fraction of HNSW's cost. The table below compares all major architectures at 100M vectors.
| Architecture | Memory | Monthly | Latency | Recall |
|---|---|---|---|---|
| HNSW float32 | 340 GB | ~$3,600 | <1ms | 98% |
| HNSW int8 | 90 GB | ~$1,800 | <1ms | 95% |
| HNSW int8 + rescore | 90 GB + SSD | ~$1,850 | ~5ms | 97% |
| IVF-PQ | 12 GB | ~$180 | ~2ms | 88% |
| IVF-PQ + rescore | 12 GB + SSD | ~$230 | ~10ms | 96% |
| DiskANN | 8 GB + SSD | ~$200 | ~15ms | 95% |
Key Takeaways
Memory Is the Dominant Cost
1B vectors × 768-dim float32 = 3 TB of raw vectors, ~3.2 TB with HNSW graph overhead ≈ $25K/month on AWS (7× r6g.16xlarge). Costs track the RAM footprint: 10x more data means roughly 7-10x higher spend, plus operational overhead that grows with instance count.
Quantization Is the Single Biggest Lever
Scalar int8 (4x compression) saves ~$19K/month at billion-scale with 2-5% recall loss. PQ (32x) saves ~$24K/month but needs two-phase search. This is almost always the first optimization to apply.
DiskANN: 5-8x Cost Reduction at Billion Scale
1B vectors: HNSW needs ~3.2 TB RAM ($25K/mo). DiskANN needs ~64 GB RAM + 3 TB SSD ($3-5K/mo). Latency tradeoff: <1ms → 5-20ms. Acceptable for most applications.
Two-Phase Search Is the Production Standard
Search compressed index in RAM → re-score top-100 candidates with full-precision vectors from SSD. Result: PQ memory costs with 98%+ recall. SSD I/O: 100 × 3KB = 300KB ≈ 0.5ms on NVMe.
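The two-phase pattern in miniature, with a noisy copy of the vectors standing in for lossy PQ codes (a real system decodes PQ codes in phase 1 and reads full-precision vectors from SSD in phase 2):

```python
import numpy as np

def two_phase_search(query, approx_vecs, full_vecs, shortlist=100, k=10):
    """Phase 1: rank everything with the lossy in-RAM representation.
    Phase 2: re-score only the shortlist with full-precision vectors."""
    candidates = np.argsort(-(approx_vecs @ query))[:shortlist]
    exact = full_vecs[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]

rng = np.random.default_rng(2)
full = rng.normal(size=(10_000, 64)).astype(np.float32)
approx = full + rng.normal(scale=0.1, size=full.shape).astype(np.float32)
q = rng.normal(size=64).astype(np.float32)
top = two_phase_search(q, approx, full)
print(top.shape)  # (10,)
```

Only the 100-candidate shortlist ever touches the full-precision store, which is what keeps the SSD I/O per query down to a few hundred KB.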
Matryoshka Embeddings: 3x Savings That Compound
768-dim → 256-dim prefix with only 2-5% recall loss (trained for prefix quality). For 1B vectors: $25K → $8.3K/month. Compounds with quantization: 256-dim int8 = 12x total compression.