When Semantic Search Fails
Semantic search is not a silver bullet. Its failure modes are insidious: the system always returns results, but those results may be confidently wrong. This chapter catalogues seven systematic failure patterns, explains their underlying causes, and provides concrete mitigations.
1. Exact Match Requirements
Embeddings compress meaning into fixed-dimensional vectors, optimizing for semantic similarity and losing surface-level precision in the process. Two strings that look almost identical to a human—differing by only one critical character or number—can produce effectively identical embeddings.
This is catastrophic in e-commerce, legal, and medical domains. A user searching for "iPhone 15 Pro Max 256GB Blue" does not want the 512GB Black version, even though structurally and semantically the queries are 98% identical. Product SKUs, case numbers, and model identifiers are completely opaque to embedding models unless they happened to appear frequently in the training data.
- E-commerce: SKU, model number, color, size
- Legal: Case numbers, statute references
- Engineering: Error codes, config params
- Medical: Drug names (metformin vs metoprolol)
Mitigations: use hybrid search with BM25 weighted higher for identifier-like queries, and apply metadata filters (structured fields for SKU, size, color) as hard constraints, not soft semantic matches.
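As a sketch of the first mitigation: identifier-heavy queries can be detected with a simple heuristic (tokens that mix letters and digits) and used to shift the blend toward BM25. The regex, the weights, and the assumption that both scores are already normalized to [0, 1] are illustrative choices, not a prescription.

```python
import re

# Heuristic: tokens mixing letters and digits (e.g. "SKU-2847-B", "256GB")
# usually demand exact matching, so the keyword score should dominate.
IDENTIFIER = re.compile(r"\b(?=[\w-]*\d)(?=[\w-]*[A-Za-z])[\w-]{4,}\b")

def hybrid_score(query: str, bm25: float, cosine: float) -> float:
    """Blend BM25 and cosine scores, both assumed pre-normalized to [0, 1].
    alpha is the weight on the semantic (vector) side."""
    alpha = 0.2 if IDENTIFIER.search(query) else 0.6
    return alpha * cosine + (1 - alpha) * bm25
```

With these weights, a perfect keyword match on "SKU-2847-B" outranks a merely similar vector match, while conceptual queries still lean on the embedding score.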
2. Information Loss in Compression
When you embed a paragraph using a model like all-MiniLM-L6-v2, you are compressing hundreds of words into exactly 384 floating-point numbers (about 1.5KB of data). This compression is lossy. The embedding is forced to capture the general gist or primary topic of the text, sacrificing specific details, numbers, and secondary clauses.
If a document contains a list of five specific features, the embedding might capture that it is a "feature list for product X", but it cannot mathematically retain the exact values of all five features. When queried for one specific feature value, the semantic match will be weak.
Mitigations: store numerical attributes (price, horsepower, ratings) as structured metadata; use smaller chunks to preserve per-fact detail; or use late-interaction models like ColBERT, which retain per-token embeddings instead of compressing everything to one vector.
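The metadata mitigation can be sketched as a two-stage search: filter on structured fields first (a hard constraint the embedding cannot violate), then rank the survivors semantically. The document schema with a `price` field and the `score_fn` standing in for cosine similarity are hypothetical.

```python
def filtered_search(docs, score_fn, price_max=None):
    """Apply structured metadata as a hard constraint, then rank semantically.
    `docs` is a list of dicts with a numeric "price" field (hypothetical schema);
    `score_fn(doc)` stands in for cosine similarity against the query embedding."""
    candidates = [d for d in docs if price_max is None or d["price"] <= price_max]
    return sorted(candidates, key=score_fn, reverse=True)
```

Because the price ceiling is applied before ranking, a semantically perfect but over-budget document can never leak into the results, which is exactly what a soft semantic match cannot guarantee.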
3. Domain Mismatch
Most off-the-shelf embedding models (including OpenAI's text-embedding-3) are trained on massive corpora of general web text. They learn the statistical distributions and meanings of words as they are used by the general public.
When applied to specialized domains, they systematically misinterpret domain terminology. The same word meaning entirely different things in different contexts causes the model to confidently pick the wrong meaning without flagging any uncertainty to the user. This is particularly dangerous in legal, financial, and medical search.
Mitigations: fine-tune on 1K-5K labeled domain pairs; use domain-specific models (PubMedBERT, LegalBERT, FinBERT); or lean on hybrid search, where BM25 matches domain terms literally.
4. Negation and Logic
Bi-encoder architectures (the standard for vector search) are notoriously bad at handling negation and boolean logic. When a user explicitly asks for something to be excluded, the semantic search engine will often return exactly what they asked to avoid.
This happens because embedding models are optimized to measure topical similarity, not logical truth. A query about "hotels with pools" and "hotels without pools" share 90% of their tokens and are discussing the exact same topic (hotel amenities). In the embedding space, they are nearly identical neighbors.
Mitigations:
- Query rewriting: "hotels WITHOUT pool" → query: "hotels" + filter: pool=false
- Cross-encoder reranking: joint query-document attention detects the mismatch that bi-encoders miss
- A pre-processing layer that converts negation phrases into structured filters before retrieval
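A minimal version of that pre-processing layer: strip negation phrases from the query and surface the negated terms as exclusion filters. The trigger-word list and single-token capture are deliberately naive; a production system would use dependency parsing or an LLM rewriter.

```python
import re

# Naive trigger list; captures only the single word after the negation cue.
NEGATION = re.compile(r"\b(?:without|excluding|not including|no)\s+(\w+)", re.I)

def rewrite_negation(query: str):
    """Split a query into a positive search string plus exclusion filters."""
    excluded = [m.group(1).lower() for m in NEGATION.finditer(query)]
    positive = NEGATION.sub("", query).strip()
    positive = re.sub(r"\s{2,}", " ", positive)  # collapse leftover gaps
    return positive, excluded
```

The positive string goes to the retriever; the excluded terms become hard filters (e.g. `pool=false`), so the embedding model never has to represent the negation at all.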
5. Scalability Degradation
A semantic search system that performs flawlessly on 100,000 documents might degrade significantly when scaled to 100 million documents, even if no code changes. This isn't just an infrastructure problem; it's a mathematical reality of high-dimensional geometry known as the curse of dimensionality.
As you add more vectors to a fixed-dimensional space (e.g., 768 dimensions), the space becomes crowded. The distance between the "best" match and the "100th best" match shrinks to an imperceptible margin. Approximate Nearest Neighbor (ANN) algorithms rely on clear distance gradients to navigate their graphs. When distances become uniform, ANN algorithms get stuck in local minima and fail to find the true nearest neighbors, causing recall to silently drop.
| Scale | Recall@10 (default params) | Why It Degrades |
|---|---|---|
| 1M | ~0.98 | Few candidates in the "almost as close" region |
| 100M | ~0.92 | More candidates crowd the narrow distance band |
| 1B | Requires retuning | Hub nodes + local minima + graph quality degradation |
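The "requires retuning" row can be operationalized: measure Recall@10 against a labeled query set and grow `ef_search` until the target is met. The doubling schedule and the cutoff are illustrative; `evaluate_recall` is a placeholder for whatever evaluation harness your index exposes.

```python
def tune_ef_search(evaluate_recall, target=0.95, ef=64, ef_max=4096):
    """Double ef_search until measured Recall@10 meets the target.
    `evaluate_recall(ef)` should run a labeled query set through the ANN index
    and return recall against exact (brute-force) nearest neighbors."""
    while ef <= ef_max:
        if evaluate_recall(ef) >= target:
            return ef
        ef *= 2
    raise RuntimeError(
        "target recall unreachable at this ef_search budget; "
        "rebuild the index with more links per node / higher ef_construction"
    )
```

Re-running this tuner after large ingests is what turns the silent recall drop in the table above into an observable, correctable number.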
6. Short Queries
Dense embeddings thrive on context. When a user types a full sentence, the surrounding words clarify the intent of ambiguous terms. When a user types a single word, the embedding model has no context to pull from, forcing it to average all possible meanings of that word into a single, vague vector.
If a user searches for "python", does the embedding represent the programming language, the snake, or the British comedy troupe? The resulting vector will sit halfway between all three clusters in the embedding space, retrieving a confusing mix of results from different domains. Keyword search (BM25) is often safer for 1-2 word queries because it looks for exact lexical matches rather than trying to guess the semantic center of mass.
Mitigations: expand queries ("python" → "python programming language"); fall back to BM25 for queries under 3 words (alpha = 0.3-0.4); use user context and session history for disambiguation.
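The first two mitigations can be combined in a few lines. The word-count cutoff and alpha values follow the text; the expansion table is a hypothetical, hand-curated example, not a real resource.

```python
def choose_alpha(query: str) -> float:
    """Pick the semantic weight for hybrid scoring from query length alone.
    Under 3 words there is too little context for a stable embedding,
    so lean on keywords (alpha = 0.35, within the 0.3-0.4 range above)."""
    return 0.35 if len(query.split()) < 3 else 0.7

# Hypothetical curated expansions; in practice built per domain or from logs.
EXPANSIONS = {"python": "python programming language"}

def expand(query: str) -> str:
    """Rewrite known-ambiguous short queries into disambiguated ones."""
    return EXPANSIONS.get(query.lower(), query)
```

Expansion runs before embedding, so the vector for "python" is pulled toward the programming-language cluster instead of the centroid of snake, language, and comedy troupe.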
7. Training Data Bias
Embeddings are a reflection of the data they were trained on. Because most foundational models are trained on dumps of the public internet (Common Crawl, Wikipedia, Reddit), they inherit the biases, blind spots, and temporal freezing of that data. Semantic search systems built on these models will pass these biases directly to your users.
- Language bias: training corpora over-represent English and Western web content, so underrepresented languages get lower-quality embeddings.
- Temporal staleness: models trained through 2023 don't know 2024+ events. Combine with date-based boosting.
- Social bias: "nurse" embeds closer to "she" than "he" because the training data disproportionately associates nursing with women.
- Popularity bias: Python has better embeddings than Haskell — more web presence means a stronger training signal.
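For the temporal-staleness item, date-based boosting is easy to sketch: decay each relevance score exponentially with document age. The 90-day half-life is an illustrative default, and `now` is passed explicitly so the function is deterministic.

```python
from datetime import datetime, timedelta, timezone

def recency_boost(score: float, published: datetime, now: datetime,
                  half_life_days: float = 90.0) -> float:
    """Exponentially decay a relevance score with document age.
    A doc exactly one half-life old keeps 50% of its score."""
    age_days = max((now - published).total_seconds() / 86400, 0.0)
    return score * 0.5 ** (age_days / half_life_days)
```

Applied at rerank time, this lets fresh documents outrank stale-but-semantically-similar ones that the frozen embedding model would otherwise prefer.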
When to Trust Semantic Search
Given these systematic failure modes, building a robust search engine requires knowing exactly when to rely on semantic search and when to fall back to traditional techniques. A modern search architecture treats semantic search as one tool in a larger orchestration layer, not a universal default.
The table below provides a framework for deciding how heavily to weight vector similarity versus keyword matching or structured filtering based on the type of query being processed by your query understanding layer.
| Scenario | Semantic? | Keyword? | Best Approach |
|---|---|---|---|
| Conceptual queries | ✅ High | ❌ Low | Semantic-heavy hybrid |
| Exact identifiers | ❌ Low | ✅ High | Keyword-heavy hybrid |
| Domain jargon | ⚠️ if fine-tuned | ✅ High | Fine-tuned + keyword |
| Negation queries | ❌ Low | ⚠️ | Boolean filters + cross-encoder |
| Short queries (1-2 words) | ⚠️ | ✅ High | Keyword fallback |
| Numerical comparison | ❌ Low | ❌ Low | Structured metadata filters |
| Time-sensitive queries | ⚠️ | ✅ High | Recency boosting + hybrid |
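The decision table above can be collapsed into a toy query router. The rule order and regexes are illustrative heuristics; a production query understanding layer would typically use a trained classifier instead.

```python
import re

def route(query: str) -> str:
    """Map a raw query to a retrieval strategy, mirroring the decision table.
    Rules are checked most-specific first."""
    # Exact identifiers: tokens mixing letters and digits (SKU-2847-B, 256GB).
    if re.search(r"\b(?=[\w-]*\d)(?=[\w-]*[A-Za-z])[\w-]{4,}\b", query):
        return "keyword-heavy hybrid"
    # Negation: route to boolean filters rather than the embedding.
    if re.search(r"\b(?:without|excluding|not)\b", query, re.I):
        return "boolean filters + cross-encoder"
    # Short queries: too little context for a reliable embedding.
    if len(query.split()) < 3:
        return "keyword fallback"
    # Default: conceptual query, let the vector side lead.
    return "semantic-heavy hybrid"
```

Even a crude router like this prevents the worst mismatches, such as sending a SKU lookup through pure vector similarity.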
Key Takeaways
Failures Are Silent, Not Obvious
Unlike keyword search (zero results = obvious failure), semantic search always returns something. The danger: it returns confidently wrong results that users trust. You can't detect failures without evaluation.
Exact Match Is the Achilles' Heel
SKU-2847-B ≈ SKU-2847-C in embedding space. iPhone 15 Pro Max 256GB Blue ≈ 512GB Black (cosine ~0.98). Product codes, case numbers, and error codes should always go through keyword search.
Negation Is Invisible to Bi-Encoders
'hotels with pool' ≈ 'hotels without pool' (cosine ~0.95). Mean pooling dilutes the single negation token ('without') across 10+ tokens. Cross-encoder reranking and boolean filters are the primary mitigations.
Domain Mismatch Causes Silent Misinterpretation
'Consideration' maps to 'thoughtfulness' instead of 'something exchanged in a contract' (legal). 'Positive' maps to 'upbeat' instead of 'test result' (medical). Fine-tuning even 1K-5K domain pairs helps significantly.
Quality Degrades at Scale — Same Parameters, Worse Results
Recall@10 drops from 0.98 at 1M to ~0.92 at 100M with identical HNSW parameters. The curse of dimensionality: in 768 dims, all points become roughly equidistant. Must increase ef_search 2-3x as data grows.