Systems Atlas
Chapter 6.10: Vector & Semantic Search

Document Chunking Strategies

Before you can embed a document, you must decide how to split it. This seemingly mundane preprocessing step has an outsized impact on retrieval quality — arguably more than the choice of embedding model. This chapter covers fixed-size, sentence-based, recursive, semantic, and structure-aware chunking.

Sweet Spot: 256-512 tokens per chunk. Balances precision (find the fact) with context (understand it).

Context Boost: 5-15% retrieval improvement from prepending the document/section title to every chunk.

Overlap: 10-20% of chunk size. Ensures sentences at boundaries appear in full.

1. Fixed-Size Chunking

The simplest strategy: split text every N tokens regardless of content boundaries. The tokenizer counts tokens (not characters), and you cut at every 512 (or whatever your target is). The overlap parameter slides the window back by a fixed number of tokens, so sentences at chunk boundaries appear in full in at least one chunk. This is the baseline approach that every other strategy is measured against.

Fixed-size chunking works surprisingly well for homogeneous text (news articles, Wikipedia) where paragraph boundaries occur naturally every 200-400 tokens. It fails badly on structured content (code, tables, legal documents) where a 512-token cut might split a function body, table row, or contract clause in half — producing two chunks that are each meaningless on their own.

fixed_size_chunk.py
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # any tokenizer works; tiktoken shown here

def fixed_size_chunk(text, chunk_size=512, overlap=50):
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(enc.decode(tokens[start:end]))
        start = end - overlap  # Slide the window back for overlap
    return chunks
Advantages
  • Simple and predictable — you know exactly how many chunks you'll get
  • Fast — no NLP processing needed
  • Uniform embedding quality (consistent token count)
Disadvantages
  • Splits mid-sentence, mid-paragraph, mid-table
  • Cause-and-effect separated across chunks
  • Code blocks fragmented into meaningless pieces

2. Sentence-Based Chunking

An improvement over fixed-size: first split text into individual sentences using a sentence tokenizer (NLTK's sent_tokenize or spaCy's sentence detector), then greedily accumulate sentences into chunks until reaching the target token limit. No sentence is ever split across chunks, which means every chunk contains only complete thoughts. Overlap is measured in sentences, not tokens — carry over the last N sentences from the previous chunk to the next.

This produces chunks of varying size (some may be 300 tokens, others 480) because sentence lengths vary. The inconsistency is a trade-off worth making: complete sentences generate much better embeddings than truncated fragments. The main limitation is that sentence tokenizers can struggle with abbreviations ("Dr. Smith"), code snippets, or non-standard formatting.

sentence_chunk.py
import nltk

def count_tokens(s):
    return len(s.split())  # rough whitespace count; swap in your real tokenizer

def sentence_chunk(text, max_tokens=512, overlap_sentences=2):
    sentences = nltk.sent_tokenize(text)
    chunks, current = [], []
    current_tokens = 0
    for sent in sentences:
        if current_tokens + count_tokens(sent) > max_tokens and current:
            chunks.append(' '.join(current))
            current = current[-overlap_sentences:]  # Carry over last N sentences
            current_tokens = sum(count_tokens(s) for s in current)
        current.append(sent)
        current_tokens += count_tokens(sent)
    if current:
        chunks.append(' '.join(current))  # Flush the final partial chunk
    return chunks

3. Recursive Character Splitting

The most popular strategy in practice, and LangChain's default for good reason. Instead of committing to one type of boundary (tokens or sentences), recursive splitting tries a hierarchy of separators from best to worst. It first attempts to split on paragraph boundaries (double newlines), which are the most natural semantic breaks. If a paragraph is still too long, it falls back to line breaks, then sentences, then words, and finally individual characters.

This approach is pragmatic: for most documents, paragraph-level splitting captures ~80% of the benefit of more advanced methods. It handles mixed-format documents gracefully (prose paragraphs split nicely, while a long code block falls back to line-level splitting). The hierarchy below shows the separators in order:

# Separator hierarchy (best → fallback)
1. "\n\n" Paragraph boundaries — best semantic separation
2. "\n" Line breaks — good for structured text
3. ". " Sentences — preserve complete thoughts
4. " " Words — last resort
5. "" Characters — absolute fallback
Why It's the Best Default

Paragraphs are usually the ideal chunk boundary (complete thoughts). Trying them first captures most of the benefit. When too long, falling back to sentences is a reasonable second choice. Pragmatic and works well "out of the box" for most documents.
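The separator hierarchy above can be sketched in a few lines. This is a minimal illustration, not LangChain's actual implementation: it measures length in characters rather than tokens for simplicity (production splitters accept a pluggable length function), and the name recursive_split is ours.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the best separator; recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Absolute fallback: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # Merge small pieces back together
        else:
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # This piece is still too long: fall back to the next separator
                chunks.extend(recursive_split(part, chunk_size, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks
```

Note the merge step: after splitting on a separator, adjacent small pieces are recombined up to the size limit, so a document of short paragraphs still produces reasonably full chunks.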


4. Semantic Chunking

Rather than relying on formatting cues (newlines, punctuation), semantic chunking uses the actual content to determine where to split. The idea: embed each sentence, then compare adjacent sentence embeddings using cosine similarity. When similarity drops sharply between sentence N and sentence N+1, that indicates a topic shift — and a natural place to cut.

This produces the highest-quality chunks because each chunk contains a coherent topic. However, it has a significant cost: you need to embed every sentence individually (O(N) model passes), which is expensive for large corpora. There's also a circular dependency problem — you need an embedding model to chunk, but you also need chunks to build your index. In practice, you can use a small, fast model (e.g., all-MiniLM-L6) for chunking and a different, better model for the final index embeddings.

semantic_chunk.py
import nltk
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_chunk(text, model, threshold=0.75):
    sentences = nltk.sent_tokenize(text)
    embeddings = [model.encode(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim < threshold:  # Similarity drop = topic shift -> new chunk
            chunks.append(' '.join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(' '.join(current))  # Don't drop the final chunk
    return chunks
Advantages

Topic-aware boundaries, adaptive sizing, best embedding quality per chunk

Disadvantages

Expensive (O(N) model passes), threshold sensitivity, circular dependency with model


5. Structure-Aware Chunking

Many documents aren't flat prose — they have headers, subheaders, bullet lists, code blocks, and tables. Structure-aware chunking parses this formatting and uses it to create chunks that respect the document's own organization. A Markdown parser can split on ## headers, an HTML parser on <section> tags, a PDF parser on heading fonts.

The most impactful technique in structure-aware chunking (and one that applies to ALL strategies) is context prepending: adding the document title and section header to the beginning of every chunk. This single change improves retrieval by 5-15% because it gives the embedding model crucial context about what the chunk is about. Without it, a chunk containing "pip install faiss-cpu" could be about any library installation. With it, the chunk says "FAISS Library > Installation Guide: pip install faiss-cpu" — unambiguous.

Without context:
"pip install faiss-cpu. For GPU: pip install faiss-gpu."
→ Embedding: something about pip installation (vague)
With context:
"FAISS Library > Installation Guide: pip install faiss-cpu..."
→ Embedding: installing the FAISS library (specific and accurate)
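A structure-aware splitter with context prepending can be sketched as below. This is a minimal illustration for Markdown only; markdown_chunks is a hypothetical helper name, and a real pipeline would also enforce a token limit per section.

```python
def markdown_chunks(doc_title, markdown_text):
    """Split a Markdown document on '## ' section headers, prepending the
    document title and section header to every chunk for context."""
    chunks = []
    section, lines = "Introduction", []  # default label for pre-header text
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            if lines:  # close out the previous section
                chunks.append(f"{doc_title} > {section}: " + " ".join(lines))
            section, lines = line[3:].strip(), []
        elif line.strip():
            lines.append(line.strip())
    if lines:  # flush the final section
        chunks.append(f"{doc_title} > {section}: " + " ".join(lines))
    return chunks
```

Because the title and header ride along inside the chunk text, they are embedded together with the content — which is exactly what makes the "pip install" example above unambiguous.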

6. Chunk Size Guidelines

Chunk size is the single most important parameter in your chunking pipeline. Smaller chunks (50-128 tokens) are highly precise — they contain exactly one fact and embed that fact well. But they lack surrounding context, making it hard to understand the fact in isolation. Larger chunks (512-1024 tokens) preserve context and narrative flow, but the embedding becomes a blur of multiple topics, reducing precision. The sweet spot for most applications is 256-512 tokens, which captures 1-2 paragraphs — enough for a coherent explanation with context, focused enough for a meaningful embedding.

Chunk Size      | Precision | Context   | Best For
50-100 tokens   | Very high | Very low  | FAQs, definitions, single facts
128-256 tokens  | High      | Moderate  | Q&A, customer support
256-512 tokens  | Balanced  | Good      | General purpose (recommended!)
512-1024 tokens | Moderate  | High      | Technical docs, research papers
1024+ tokens    | Low       | Very high | Long-context models, summarization

7. Decision Matrix

Choosing the right chunking strategy depends on your document types, quality requirements, and computational budget. For most teams, recursive splitting is the right default — it's fast, requires no ML model, and handles diverse document types well. Only move to semantic chunking if you have specific quality requirements that recursive doesn't meet, and the computational cost of embedding every sentence is acceptable.

Factor       | Fixed      | Sentence | Recursive  | Semantic | Structure
Complexity   | Low        | Low      | Medium     | High     | Medium
Speed        | Fastest    | Fast     | Fast       | Slow     | Medium
Quality      | Acceptable | Good     | Good       | Best     | Very Good
Code/Tables  | Poor       | Poor     | Medium     | Medium   | Best
Recommended? | ✅ Baseline | —        | ✅ Default  | —        | Structured docs

Key Takeaways

01

Chunking Has More Impact Than Model Choice

A mediocre embedding model with good chunks outperforms a great model with bad chunks. How you split documents determines what unit of information can be retrieved — and what context is lost.

02

256-512 Tokens Is the Sweet Spot

Smaller chunks improve precision (find the exact fact). Larger chunks preserve context (understand the fact). 256-512 tokens captures 1-2 paragraphs — enough for coherent explanation with context, focused enough for a meaningful embedding.

03

Recursive Splitting Is the Best Default

Try paragraph boundaries first (\n\n), fall back to sentences, then words. LangChain's default for good reason: paragraphs are usually ideal chunk boundaries. Handles most document types well out-of-the-box.

04

Always Prepend Context to Every Chunk

Adding document title + section header to every chunk improves retrieval by 5-15%. 'pip install faiss-cpu' → vague. 'FAISS Library > Installation Guide: pip install faiss-cpu' → specific and accurate.

05

Overlap of 10-20% Prevents Information Loss

Sentences split at chunk boundaries appear in full in at least one chunk. chunk_size=512 → overlap=50-100 tokens. Too much overlap (>30%) wastes storage and confuses retrieval with near-duplicate chunks.