Chunkers
Choose how documents are split before embedding — structured, plain text, parent-child, semantic, or custom.
A chunker decides where to cut a document before each piece is embedded. The choice shapes recall quality more than almost any other indexing knob: cut too small and chunks lose context, cut too large and the embedding signal gets diluted.
The default is chunker.structured(). It handles plain text as well, and it preserves provenance from @crux/ingest parts such as pages, tables, sheets, and JSON paths.
Quick rule
chunker.structured() // default for files, URLs, PDFs, DOCX, sheets, JSON
chunker.text() // plain text only
chunker.parentChild() // small searchable chunks + larger context later
chunker.semantic() // boundaries should follow meaning
custom Chunker // your domain already has explicit boundaries
Bundled chunkers
| Chunker | What it does | Best for |
|---|---|---|
| chunker.text() | Splits plain text by paragraphs, sentences, then hard character windows. | Simple prose or already-normalized text. |
| chunker.structured() | Splits by ingest part and preserves page/table/sheet/JSON provenance. | Most file and URL ingestion. This is the default. |
| chunker.parentChild() | Writes large parent records and smaller searchable child chunks with parent.key. | Long documents where citations should point to small chunks but prompts need surrounding context. |
| chunker.semantic() | Uses embedding, custom/model boundaries, or hybrid fallback to split by meaning. | Docs where size-based chunking cuts through concepts or sections. |
| Custom Chunker | Returns your own chunks and optional parent records. | Markdown sections, legal clauses, transcripts, tickets, CMS blocks, or domain-specific boundaries. |
Plain text
Use chunker.text() when you are indexing plain prose and do not need part-aware table/page/sheet handling.
indexingPipeline({
  chunker: chunker.text({
    maxChars: 1000,
    overlapChars: 120,
  }),
})
It splits on paragraph boundaries first, falls back to sentence boundaries for oversized paragraphs, and hard-splits very long sentences into character windows. Chunk IDs are stable hashes of source id, ordinal, content, and provenance.
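The paragraph, sentence, hard-window cascade can be sketched in a few lines. This is an illustrative approximation, not the chunker.text() implementation: the function name splitText is hypothetical, overlapChars and chunk ID hashing are left out, and the sentence detection is a simple punctuation heuristic.

```typescript
// Hypothetical sketch of a paragraph -> sentence -> hard-window cascade.
// Not the @crux implementation; overlap and chunk IDs are omitted.
function splitText(text: string, maxChars: number): string[] {
  const chunks: string[] = []
  let current = ''

  // Add a piece to the current chunk, flushing first if it would overflow.
  const push = (piece: string, joiner: string): void => {
    if (current && current.length + joiner.length + piece.length > maxChars) {
      chunks.push(current)
      current = ''
    }
    current = current ? current + joiner + piece : piece
  }

  const paragraphs = text.split(/\n{2,}/).map((p) => p.trim()).filter(Boolean)
  for (const paragraph of paragraphs) {
    if (paragraph.length <= maxChars) {
      push(paragraph, '\n\n')
      continue
    }
    // Oversized paragraph: fall back to sentence boundaries.
    for (const sentence of paragraph.split(/(?<=[.!?])\s+/)) {
      if (sentence.length <= maxChars) {
        push(sentence, ' ')
        continue
      }
      // Very long sentence: hard-split into fixed character windows.
      for (let i = 0; i < sentence.length; i += maxChars) {
        push(sentence.slice(i, i + maxChars), ' ')
      }
    }
  }
  if (current) chunks.push(current)
  return chunks
}
```

Small paragraphs are packed together until maxChars would be exceeded, so the hard split only ever applies to text with no usable boundary.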
Structured documents
indexingPipeline({
  chunker: chunker.structured({
    maxChars: 1200,
    overlapChars: 150,
    tableRowsPerChunk: 25,
  }),
})
chunker.structured() is the best default for @crux/ingest output. PDF pages keep page provenance, tables are chunked by row windows with headers repeated, spreadsheets preserve sheet names, and JSON parts preserve paths.
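To make "row windows with headers repeated" concrete, here is a minimal sketch. The TablePart shape and the chunkTable function are assumptions made for illustration; the real @crux/ingest part types and the output format of chunker.structured() may differ.

```typescript
// Hypothetical table shape; the real @crux/ingest table part may differ.
interface TablePart {
  header: string[]
  rows: string[][]
}

// Split a table into windows of rowsPerChunk rows, repeating the header
// in each window so every chunk is interpretable on its own.
function chunkTable(table: TablePart, rowsPerChunk: number): string[] {
  if (rowsPerChunk < 1) throw new RangeError('rowsPerChunk must be at least 1')
  const headerLine = table.header.join(' | ')
  const chunks: string[] = []
  for (let start = 0; start < table.rows.length; start += rowsPerChunk) {
    const rowWindow = table.rows.slice(start, start + rowsPerChunk)
    const lines = [headerLine, ...rowWindow.map((row) => row.join(' | '))]
    chunks.push(lines.join('\n'))
  }
  return chunks
}
```

Repeating the header costs a few characters per chunk, but without it a retrieved row window would be a list of values with no column names.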
Parent / child
Use parent-child indexing when you want small searchable chunks but larger parent records for display or later context expansion.
indexingPipeline({
  chunker: chunker.parentChild({
    parentMaxChars: 6000,
    childMaxChars: 900,
    childOverlapChars: 120,
  }),
})
The retriever searches child chunks. Each child stores parent.parentId and parent.key; parentExpand({ store: data }) can follow that key at query time and add parent content while keeping the child chunk as the citation.
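The lookup itself is simple, and a rough sketch may help when reasoning about what ends up in the prompt. Everything here is an assumption for illustration: the record shapes, the expandWithParents name, the Map standing in for the store, and the reading of maxParentChars as plain truncation. Only parent.parentId and parent.key come from the description above.

```typescript
// Hypothetical record shapes; only parent.parentId and parent.key are
// taken from the documentation, the rest is illustrative.
interface ChildChunk {
  chunkId: string
  content: string
  parent: { parentId: string; key: string }
}

interface ExpandedDoc {
  citation: string // still the child chunk id
  content: string // the child content that matched the query
  parentContent?: string // surrounding context, if the parent was found
}

function expandWithParents(
  hits: ChildChunk[],
  parents: Map<string, string>, // parent.key -> parent content
  maxParentChars: number,
): ExpandedDoc[] {
  return hits.map((hit) => {
    const parentContent = parents.get(hit.parent.key)
    return {
      citation: hit.chunkId,
      content: hit.content,
      // Assumed behavior: truncate so one large parent cannot dominate the prompt.
      parentContent: parentContent?.slice(0, maxParentChars),
    }
  })
}
```

The citation never changes during expansion, which is what lets answers point at a small, precise chunk while the model still sees the wider passage.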
import { parentExpand, retrievalPipeline } from '@crux/core/retrieval'
const expandedDocs = retrievalPipeline(docs, [
  parentExpand({ store: data, maxParentChars: 4000 }),
])
Semantic
Use semantic chunking when boundaries should follow meaning, not only size.
indexingPipeline({
  chunker: chunker.semantic({
    strategy: 'embedding',
    dense,
    minChars: 300,
    maxChars: 1200,
    similarityThreshold: 0.76,
  }),
})
For model- or domain-driven boundaries, provide a segmenter:
indexingPipeline({
  chunker: chunker.semantic({
    strategy: 'custom',
    minChars: 300,
    maxChars: 1200,
    async segment({ document, segments }, ctx) {
      return detectSectionBoundaries(document.content, segments, ctx)
    },
  }),
})
Use strategy: 'hybrid' to run your segmenter first and fall back to embedding-based boundaries when it returns none.
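One common way to derive embedding-based boundaries is to cut wherever adjacent segments are less similar than a threshold. The sketch below shows that idea plus the hybrid fallback. It is an assumption about the general technique, not a description of what chunker.semantic() does internally; the function names are hypothetical, the vectors are supplied directly rather than produced by an embedding model, and minChars/maxChars enforcement is omitted.

```typescript
type Vector = number[]

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: Vector, b: Vector): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// Returns each index i at which a new chunk starts, i.e. where segment i
// is less similar to segment i - 1 than the threshold.
function embeddingBoundaries(vectors: Vector[], threshold: number): number[] {
  const boundaries: number[] = []
  for (let i = 1; i < vectors.length; i++) {
    if (cosine(vectors[i - 1], vectors[i]) < threshold) boundaries.push(i)
  }
  return boundaries
}

// Hybrid: trust the custom segmenter's boundaries when it produced any,
// otherwise fall back to the embedding-based ones.
function hybridBoundaries(custom: number[], vectors: Vector[], threshold: number): number[] {
  return custom.length > 0 ? custom : embeddingBoundaries(vectors, threshold)
}
```

Under this scheme a higher threshold yields more, smaller chunks, because more adjacent pairs fall below it.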
Custom
You can also bring your own chunker:
const markdownSections = {
  _tag: 'Chunker' as const,
  name: 'markdown-sections',
  version: '1',
  fingerprint: () => 'markdown-sections:v1',
  chunkDocument(document) {
    return {
      chunks: splitByHeading(document.content).map((section, ordinal) => ({
        namespace: document.namespace,
        sourceId: document.sourceId,
        chunkId: section.slug,
        ordinal,
        content: section.content,
        metadata: {
          ...document.metadata,
          heading: section.heading,
        },
      })),
    }
  },
}
const pipeline = indexingPipeline({ chunker: markdownSections })
Custom chunkers are best for domains with explicit boundaries: Markdown headings, legal clauses, API reference sections, support tickets, transcript turns, or CMS blocks. Keep chunkId stable across syncs so replacement, citations, and retrieval traces stay readable.
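The custom chunker above calls splitByHeading without defining it. A minimal version might look like the following; the helper and its Section shape are hypothetical and not part of the library. Slugs are derived only from the heading text and its position among repeated headings, so they stay the same across syncs as long as the headings do.

```typescript
interface Section {
  heading: string
  slug: string
  content: string
}

// Hypothetical helper: splits Markdown on ATX headings (1 to 6 '#' characters).
// Limitations of this sketch: text before the first heading is dropped, and
// '#' lines inside fenced code blocks are treated as headings.
function splitByHeading(markdown: string): Section[] {
  const sections: Section[] = []
  const seen = new Map<string, number>()
  let current: Section | undefined

  for (const line of markdown.split('\n')) {
    const match = /^#{1,6}\s+(.+?)\s*$/.exec(line)
    if (!match) {
      if (current) current.content += line + '\n'
      continue
    }
    const heading = match[1]
    const base = heading
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, '-')
      .replace(/^-|-$/g, '')
    // Disambiguate repeated headings so slugs are unique within a document.
    const count = seen.get(base) ?? 0
    seen.set(base, count + 1)
    current = { heading, slug: count === 0 ? base : `${base}-${count}`, content: '' }
    sections.push(current)
  }
  return sections.map((s) => ({ ...s, content: s.content.trim() }))
}
```

These slugs are unique only within one document. Whether chunkId must also be unique across sources depends on the store, so prefixing the slug with the source id may be the safer choice.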
Related
- Indexing Documents (how chunkers fit into the indexing pipeline)
- Pipelines (parentExpand and other retrieval-time transforms)
- Reference: indexing (full chunker.* API)