Digitorn

Chunking

Splitting source documents into smaller pieces (paragraphs, sections) before embedding them for retrieval.

also known as: text splitting, document chunking
In depth

A retrieval system cannot usefully treat a 100-page PDF as one unit: a single embedding for the whole document would lose the granular signal. Chunking splits documents into smaller pieces, each embedded separately, so retrieval can find the specific paragraph that answers the question. Chunk size is a tuning knob: small chunks are precise but lose context, large chunks keep context but blur the signal. 200-500 tokens with a 50-token overlap is a common starting point; the overlap helps keep a sentence that straddles a chunk boundary intact in at least one chunk.
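As a rough illustration of fixed-size chunking with overlap, here is a minimal Python sketch. It is not Digitorn's implementation: the function name `chunk_text` and its parameters are hypothetical, and whitespace-separated words stand in for model tokens, so a real pipeline would likely count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into overlapping chunks of at most chunk_size tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Usage: a synthetic 1000-word document, default 300-token chunks.
doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(doc)
print(len(chunks))  # 4
```

Each chunk would then be embedded and indexed separately. Splitting on paragraph or section boundaries instead of fixed windows is an equally reasonable choice when the source has clear structure.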

More in RAG & knowledge

Embedding (/glossary/embedding)
Hybrid search (/glossary/hybrid-search)
Knowledge base (/glossary/knowledge-base)
RAG (/glossary/rag)
Re-ranking (/glossary/rerank)
Semantic search (/glossary/semantic-search)