Cost of Vector Search at Scale
HNSW indexes require terabytes of RAM. Embedding models consume GPU hours. At billion-scale, infrastructure costs can dominate a company's entire search budget. This chapter provides concrete cost analysis and optimization strategies that can reduce costs by 10-30x.
- **$25K/mo**: in-RAM HNSW at 1B vectors. ~3.2 TB RAM needed; 7× r6g.16xlarge instances at AWS on-demand pricing.
- **$800/mo**: PQ (32x compression). 3 TB → 96 GB, a ~97% cost reduction with two-phase search.
- **$600-$20K**: embedding 1B documents. Local (MiniLM on A100): ~$600. API (OpenAI): ~$20K. Re-embedding after a model upgrade repeats this cost.
1. Embedding Generation Cost
Before you can search vectors, you need to create them. Every document in your corpus must be passed through a neural network to produce its embedding. This is a one-time cost per document (plus re-embedding when documents change), but at scale it becomes substantial. The cost depends on the model size: smaller models are faster and cheaper but produce lower-quality embeddings. Larger models produce better embeddings but cost more to run. API-based models are convenient but charge per token.
At 1 billion documents, local embedding (all-MiniLM on A100) costs ~$600 in GPU hours over ~55 hours of compute. A higher-quality model like e5-large costs ~$3,750 over 14 days on a single GPU. API-based embedding (OpenAI) costs ~$20,000 with no GPU hardware needed, but rate limits mean it could take days. Re-embedding cost is critical: when you upgrade to a better model (which happens regularly as the field advances), you must re-embed your entire corpus — a full repeat of the original cost.
| Model | Dims | Speed (A100) | Cost/1M docs | MTEB |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~5,000/sec | ~$0.60 | 56.3 |
| all-mpnet-base-v2 | 768 | ~2,500/sec | ~$1.20 | 57.8 |
| e5-large-v2 | 1024 | ~800/sec | ~$3.75 | 62.2 |
| OpenAI text-embedding-3-small | 1536 | API | ~$20 | — |
| Cohere embed-v3 | 1024 | API | ~$10 | — |
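These per-document costs come straight from throughput. A back-of-envelope helper, assuming an effective rate of ~$11 per A100-hour (the rate implied by the table above; actual cloud GPU pricing varies widely):

```python
def embedding_cost(num_docs: int, docs_per_sec: float,
                   gpu_dollars_per_hour: float = 11.0) -> tuple[float, float]:
    """Return (gpu_hours, dollars) to embed num_docs at a given throughput."""
    gpu_hours = num_docs / docs_per_sec / 3600
    return gpu_hours, gpu_hours * gpu_dollars_per_hour

# 1B docs through an all-MiniLM-class model at ~5,000 docs/sec:
hours, dollars = embedding_cost(1_000_000_000, 5_000)
print(f"~{hours:.0f} GPU-hours, ~${dollars:,.0f}")
```

At ~5,000 docs/sec this lands near the chapter's ~55 hours and ~$600; swapping in e5-large's ~800 docs/sec reproduces the ~14-day, ~$3,750 estimate.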
2. Storage Cost (Raw Vectors)
The memory required to store vectors follows a simple formula: memory_per_vector = dimensions × bytes_per_float. For 768-dimensional float32 vectors, that's 768 × 4 = 3,072 bytes (3 KB) per vector. These per-vector costs seem small, but they compound brutally at scale. The table below shows how storage requirements grow as you move from millions to billions of vectors, and how reducing precision (float16, int8) provides significant savings.
| Scale | float32 (768d) | float16 | int8 |
|---|---|---|---|
| 1M | 3 GB | 1.5 GB | 0.75 GB |
| 10M | 30 GB | 15 GB | 7.5 GB |
| 100M | 300 GB | 150 GB | 75 GB |
| 1B | 3,000 GB (3 TB) | 1,500 GB | 750 GB |
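The formula is easy to sanity-check in code. This sketch reproduces the table's rows in decimal GB, ignoring index overhead; small differences are just the table's rounding (3,072 GB shown as 3,000 GB):

```python
def raw_vector_gb(num_vectors: int, dims: int = 768,
                  bytes_per_component: int = 4) -> float:
    """Raw vector storage in decimal GB: count * dims * bytes, no index overhead."""
    return num_vectors * dims * bytes_per_component / 1e9

print(raw_vector_gb(1_000_000))                             # 3.072 (float32, 1M)
print(raw_vector_gb(1_000_000_000, bytes_per_component=1))  # 768.0 (int8, 1B)
```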
To put 3 TB in context: AWS r6g.16xlarge has 512 GB RAM at ~$3,600/month. You'd need 6 instances just for raw vectors — ~$21,600/month before index overhead.
3. Infrastructure Cost (Cloud)
Here's what it actually costs to run vector search at various scales on cloud infrastructure. These numbers are based on AWS pricing (on-demand, us-east-1, 2024) and should be treated as illustrative. Costs track the RAM footprint closely: going from 100M to 1B vectors (10x data) costs roughly 7x more, and while per-GB memory pricing is fairly flat across instance sizes, running a seven-instance fleet adds sharding, replication, and operational overhead that a single box avoids.
Managed vector database costs (Pinecone, Weaviate Cloud, etc.) are typically 2-5x higher than self-managed cloud instances, but include operational overhead, scaling, backups, and monitoring. For teams without dedicated infrastructure engineers, the managed premium is often worth it.
| Vectors | HNSW RAM | Instance | Monthly |
|---|---|---|---|
| 1M | ~4 GB | r6g.large | ~$90 |
| 10M | ~35 GB | r6g.xlarge × 2 | ~$360 |
| 100M | ~350 GB | r6g.8xlarge × 2 | ~$3,600 |
| 1B | ~3.2 TB | r6g.16xlarge × 7 | ~$25,000 |
| 10B | ~32 TB | 70+ instances | ~$250,000 |
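The instance counts in the table follow from a ceiling division over per-instance RAM. A sizing sketch using the chapter's r6g.16xlarge figures (512 GB, ~$3,600/month) as illustrative inputs:

```python
import math

def cluster_cost(index_ram_gb: float, instance_ram_gb: float = 512,
                 instance_monthly_usd: float = 3600) -> tuple[int, float]:
    """Return (instance_count, monthly_usd) to hold an index fully in RAM."""
    n = math.ceil(index_ram_gb / instance_ram_gb)
    return n, n * instance_monthly_usd

print(cluster_cost(3200))  # (7, 25200): 1B vectors with HNSW overhead
```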
4. Cost Optimization Strategies
The good news is that several techniques can dramatically reduce vector search costs. The strategies below are listed in order of impact — quantization alone can save $19K/month at billion-scale, and combining multiple strategies can reduce costs by 10-30x while maintaining acceptable recall.
Strategy 1: Quantization
Quantization reduces the precision of each number in the vector, trading a small amount of accuracy for massive memory savings. This is almost always the first optimization to apply because it's simple, well-understood, and the recall impact is manageable. Scalar quantization maps each float32 dimension to an int8 value (4x savings), while product quantization (PQ) compresses the full vector into ~96 sub-vector centroid IDs (32x savings).
| Method | Compression | Recall Loss | Memory (1B) | Monthly |
|---|---|---|---|---|
| float32 | 1x | — | 3,000 GB | ~$25,000 |
| float16 | 2x | <1% | 1,500 GB | ~$12,500 |
| Scalar int8 | 4x | 2-5% | 750 GB | ~$6,200 |
| PQ (M=96) | 32x | 5-15% | 96 GB | ~$800 |
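A minimal sketch of the scalar int8 row with NumPy: min-max quantization maps each float32 dimension to a uint8 code. Libraries such as FAISS ship tuned versions of this; the sketch just shows the mechanics.

```python
import numpy as np

def scalar_quantize(vecs: np.ndarray):
    """Min-max scalar quantization: map each dimension to a uint8 code
    (4x smaller than float32). Returns (codes, lo, scale) for dequantization."""
    lo, hi = vecs.min(axis=0), vecs.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((vecs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

v = np.random.default_rng(0).normal(size=(1000, 768)).astype(np.float32)
codes, lo, scale = scalar_quantize(v)
recon = dequantize(codes, lo, scale)
print(codes.nbytes / v.nbytes)         # 0.25: 4x compression
print(float(np.abs(v - recon).max()))  # reconstruction error, ~half a quantization step
```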
Strategy 2: DiskANN
Microsoft's DiskANN stores the Vamana graph on SSD instead of RAM, keeping only compressed PQ codes in memory. The fundamental insight is that SSD storage costs ~10x less per GB than RAM, and modern NVMe SSDs can deliver the random I/O patterns needed for graph traversal with acceptable latency (5-20ms per query vs. <1ms for in-memory HNSW). For many applications, 10-20ms query latency is perfectly acceptable.
Strategy 3: Tiered Storage
Not all vectors need to be in RAM. Production systems implement tiered storage based on access patterns — recent and frequently accessed vectors stay in fast RAM (HNSW), moderately accessed vectors move to SSD (DiskANN), and rarely accessed archival vectors live on cheap object storage. The key is building a routing layer that directs queries to the appropriate tier based on recency and access frequency.
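One way to sketch that routing layer; the age thresholds and tier names here are invented for illustration, not a recommended policy:

```python
import time

def route_tier(last_access_ts: float, now: float) -> str:
    """Pick a storage tier from how recently a vector was accessed."""
    age_days = (now - last_access_ts) / 86400
    if age_days < 7:
        return "ram-hnsw"        # hot: in-memory HNSW
    if age_days < 90:
        return "ssd-diskann"     # warm: DiskANN on NVMe
    return "object-storage"      # cold: archival

now = time.time()
print(route_tier(now - 1 * 86400, now))    # ram-hnsw
print(route_tier(now - 365 * 86400, now))  # object-storage
```

A production router would also track access frequency and promote vectors back up a tier when they get hot again.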
Strategy 4: Matryoshka Embeddings
Lower dimensions mean less memory AND faster distance computation — a double win. Matryoshka embeddings (Kusupati et al., 2022) are models specifically trained to produce embeddings that are useful at any prefix length. You can take the first 256 dimensions of a 768-dim Matryoshka embedding and still get good retrieval quality — typically only 2-5% recall drop. This is different from post-hoc dimensionality reduction (PCA), which typically loses 5-15% recall at the same reduction ratio.
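Using a Matryoshka model's output at a shorter length is just a slice plus re-normalization. A sketch (the 768 → 256 sizes follow the text; any prefix length the model was trained for works):

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, k: int = 256) -> np.ndarray:
    """Keep the first k dims of a Matryoshka embedding and re-normalize,
    so cosine / inner-product search still behaves on the shorter vectors."""
    prefix = emb[..., :k]
    norms = np.linalg.norm(prefix, axis=-1, keepdims=True)
    return prefix / np.where(norms == 0, 1.0, norms)

full = np.random.default_rng(1).normal(size=(4, 768)).astype(np.float32)
small = truncate_matryoshka(full)
print(small.shape)  # (4, 256)
```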
5. Architecture Cost Comparison (100M vectors, 768d)
The right architecture depends on your latency requirements. If you can tolerate 10-15ms latency (most applications can), DiskANN or IVF-PQ with re-scoring gives you 95%+ recall at a fraction of HNSW's cost. The table below compares all major architectures at 100M vectors.
| Architecture | Memory | Monthly | Latency | Recall |
|---|---|---|---|---|
| HNSW float32 | 340 GB | ~$3,600 | <1ms | 98% |
| HNSW int8 | 90 GB | ~$1,800 | <1ms | 95% |
| HNSW int8 + rescore | 90 GB + SSD | ~$1,850 | ~5ms | 97% |
| IVF-PQ | 12 GB | ~$180 | ~2ms | 88% |
| IVF-PQ + rescore | 12 GB + SSD | ~$230 | ~10ms | 96% |
| DiskANN | 8 GB + SSD | ~$200 | ~15ms | 95% |
Key Takeaways
Memory Is the Dominant Cost
1B vectors × 768-dim float32 = 3 TB of raw vectors, ~3.2 TB with HNSW graph overhead ≈ $25K/month on AWS (7× r6g.16xlarge). Costs track the RAM footprint: 10x more data means roughly 7-10x higher spend, plus operational overhead that grows with instance count.
Quantization Is the Single Biggest Lever
Scalar int8 (4x compression) saves ~$19K/month at billion-scale with 2-5% recall loss. PQ (32x) saves ~$24K/month but needs two-phase search. This is almost always the first optimization to apply.
DiskANN: 5-8x Cost Reduction at Billion Scale
1B vectors: HNSW needs ~3.2 TB RAM ($25K/mo). DiskANN needs ~64 GB RAM + 3 TB SSD ($3-5K/mo). Latency tradeoff: <1ms → 5-20ms. Acceptable for most applications.
Two-Phase Search Is the Production Standard
Search compressed index in RAM → re-score top-100 candidates with full-precision vectors from SSD. Result: PQ memory costs with 98%+ recall. SSD I/O: 100 × 3KB = 300KB ≈ 0.5ms on NVMe.
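The two-phase pattern in miniature, with a noisy copy of the vectors standing in for lossy PQ codes (a real system decodes PQ codes in phase 1 and reads full-precision vectors from SSD in phase 2):

```python
import numpy as np

def two_phase_search(query, approx_vecs, full_vecs, shortlist=100, k=10):
    """Phase 1: rank everything with the lossy in-RAM representation.
    Phase 2: re-score only the shortlist with full-precision vectors."""
    candidates = np.argsort(-(approx_vecs @ query))[:shortlist]
    exact = full_vecs[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]

rng = np.random.default_rng(2)
full = rng.normal(size=(10_000, 64)).astype(np.float32)
approx = full + rng.normal(scale=0.1, size=full.shape).astype(np.float32)
q = rng.normal(size=64).astype(np.float32)
top = two_phase_search(q, approx, full)
print(top.shape)  # (10,)
```

Only the 100-candidate shortlist ever touches the full-precision store, which is what keeps the SSD I/O per query down to a few hundred KB.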
Matryoshka Embeddings: 3x Savings That Compound
768-dim → 256-dim prefix with only 2-5% recall loss (trained for prefix quality). For 1B vectors: $25K → $8.3K/month. Compounds with quantization: 256-dim int8 = 12x total compression.