
Chunkers

Choose how documents are split before embedding — structured, plain text, parent-child, semantic, or custom.

A chunker decides where to cut a document before each piece is embedded. The choice shapes recall quality more than almost any other indexing knob: cut too small and chunks lose context, cut too large and the embedding signal gets diluted.

The default is chunker.structured(). It handles plain text as well, and it preserves provenance from @crux/ingest parts such as pages, tables, sheets, and JSON paths.

Quick rule

chunker.structured()   // default for files, URLs, PDFs, DOCX, sheets, JSON
chunker.text()         // plain text only
chunker.parentChild()  // small searchable chunks + larger context later
chunker.semantic()     // boundaries should follow meaning
custom Chunker         // your domain already has explicit boundaries

Bundled chunkers

| Chunker | What it does | Best for |
| --- | --- | --- |
| chunker.text() | Splits plain text by paragraphs, sentences, then hard character windows. | Simple prose or already-normalized text. |
| chunker.structured() | Splits by ingest part and preserves page/table/sheet/JSON provenance. | Most file and URL ingestion. This is the default. |
| chunker.parentChild() | Writes large parent records and smaller searchable child chunks with parent.key. | Long documents where citations should point to small chunks but prompts need surrounding context. |
| chunker.semantic() | Uses embedding, custom/model boundaries, or hybrid fallback to split by meaning. | Docs where size-based chunking cuts through concepts or sections. |
| Custom Chunker | Returns your own chunks and optional parent records. | Markdown sections, legal clauses, transcripts, tickets, CMS blocks, or domain-specific boundaries. |

Plain text

Use chunker.text() when you are indexing plain prose and do not need part-aware table/page/sheet handling.

indexingPipeline({
  chunker: chunker.text({
    maxChars: 1000,
    overlapChars: 120,
  }),
})

It splits on paragraph boundaries first, falls back to sentence boundaries for oversized paragraphs, and hard-splits any sentence that is still too long. Chunk IDs are stable hashes of source id, ordinal, content, and provenance.
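To make the splitting order concrete, here is a minimal sketch of a paragraph, then sentence, then hard-window splitter. It is not the Crux implementation: splitText, splitSentences, packPieces, and hardWindows are hypothetical names, the sentence detection is deliberately naive, and overlap handling (overlapChars) is left out for brevity.

```typescript
// Hypothetical sketch, not the library's code: paragraph -> sentence -> hard window.

// Last resort: cut a string into fixed-size character windows.
function hardWindows(text: string, maxChars: number): string[] {
  const out: string[] = []
  for (let i = 0; i < text.length; i += maxChars) {
    out.push(text.slice(i, i + maxChars))
  }
  return out
}

// Naive sentence detection: a run of text ending in ., !, or ?.
function splitSentences(paragraph: string): string[] {
  const matches = paragraph.match(/[^.!?]+[.!?]*\s*/g) ?? []
  return matches.map((s) => s.trim()).filter((s) => s.length > 0)
}

// Greedily pack pieces (each already <= maxChars) into chunks <= maxChars.
function packPieces(pieces: string[], maxChars: number): string[] {
  const chunks: string[] = []
  let current = ''
  for (const piece of pieces) {
    const candidate = current.length === 0 ? piece : current + ' ' + piece
    if (candidate.length <= maxChars) {
      current = candidate
    } else {
      if (current.length > 0) chunks.push(current)
      current = piece
    }
  }
  if (current.length > 0) chunks.push(current)
  return chunks
}

function splitText(text: string, maxChars: number): string[] {
  const chunks: string[] = []
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
  for (const paragraph of paragraphs) {
    if (paragraph.length <= maxChars) {
      chunks.push(paragraph) // paragraph fits as-is
      continue
    }
    const pieces: string[] = []
    for (const sentence of splitSentences(paragraph)) {
      if (sentence.length <= maxChars) pieces.push(sentence)
      else pieces.push(...hardWindows(sentence, maxChars)) // very long sentence
    }
    chunks.push(...packPieces(pieces, maxChars))
  }
  return chunks
}
```

In this sketch no chunk exceeds maxChars, and hard windows are only used when neither paragraph nor sentence boundaries are enough.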

Structured documents

indexingPipeline({
  chunker: chunker.structured({
    maxChars: 1200,
    overlapChars: 150,
    tableRowsPerChunk: 25,
  }),
})

chunker.structured() is the best default for @crux/ingest output. PDF pages keep page provenance, tables are chunked by row windows with headers repeated, spreadsheets preserve sheet names, and JSON parts preserve paths.
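The row-window behavior for tables can be pictured with a small sketch. This only illustrates the idea of repeating the header in every window; TableChunk, chunkTableRows, and renderChunk are hypothetical names, and the real chunker's output shape may differ.

```typescript
// Hypothetical sketch: chunk a table by row windows, repeating the header.
interface TableChunk {
  header: string[]
  rows: string[][]
  rowStart: number // 0-based index of the first row in this window
}

function chunkTableRows(
  header: string[],
  rows: string[][],
  rowsPerChunk: number,
): TableChunk[] {
  if (rowsPerChunk < 1) throw new Error('rowsPerChunk must be at least 1')
  const chunks: TableChunk[] = []
  for (let start = 0; start < rows.length; start += rowsPerChunk) {
    chunks.push({
      header, // every window carries the same header
      rows: rows.slice(start, start + rowsPerChunk),
      rowStart: start,
    })
  }
  return chunks
}

// Render one window as text so it can be embedded on its own.
function renderChunk(chunk: TableChunk): string {
  return [chunk.header, ...chunk.rows]
    .map((cells) => cells.join(' | '))
    .join('\n')
}
```

Because each window starts with the header row, a retrieved chunk still tells the reader what each column means even when it comes from the middle of a long table.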

Parent / child

Use parent-child indexing when you want small searchable chunks but larger parent records for display or later context expansion.

indexingPipeline({
  chunker: chunker.parentChild({
    parentMaxChars: 6000,
    childMaxChars: 900,
    childOverlapChars: 120,
  }),
})

The retriever searches child chunks. Each child stores parent.parentId and parent.key; parentExpand({ store: data }) can follow that key at query time and add parent content while keeping the child chunk as the citation.

import { parentExpand, retrievalPipeline } from '@crux/core/retrieval'

const expandedDocs = retrievalPipeline(docs, [
  parentExpand({ store: data, maxParentChars: 4000 }),
])
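Conceptually, parent expansion is a key lookup that attaches context without replacing the citation. The sketch below assumes a plain in-memory Map from parent key to parent content and truncates from the start of the parent. ChildHit, ExpandedHit, and expandWithParents are hypothetical names, and the real parentExpand may select or truncate parent text differently.

```typescript
// Hypothetical sketch of query-time parent expansion.
interface ChildHit {
  chunkId: string // the child stays the citation target
  content: string
  parentKey: string
}

interface ExpandedHit extends ChildHit {
  context: string // parent text (possibly truncated) for the prompt
}

function expandWithParents(
  hits: ChildHit[],
  parents: Map<string, string>,
  maxParentChars: number,
): ExpandedHit[] {
  return hits.map((hit) => {
    const parent = parents.get(hit.parentKey)
    // Fall back to the child's own content when the parent is missing.
    const context =
      parent === undefined ? hit.content : parent.slice(0, maxParentChars)
    return { ...hit, context }
  })
}
```

The child's chunkId and content are untouched, so citations keep pointing at the small chunk while the prompt can use the larger context.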

Semantic

Use semantic chunking when boundaries should follow meaning, not only size.

indexingPipeline({
  chunker: chunker.semantic({
    strategy: 'embedding',
    dense,
    minChars: 300,
    maxChars: 1200,
    similarityThreshold: 0.76,
  }),
})
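One common way to turn embeddings into boundaries is to compare adjacent segments and start a new chunk wherever cosine similarity drops below the threshold. The sketch below shows that general technique only; it is an assumption about how an 'embedding' strategy might work, not a description of Crux internals. cosine and findBoundaries are hypothetical names, and minChars/maxChars enforcement is omitted.

```typescript
// Hypothetical sketch: similarity-threshold boundary detection.

function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  const n = Math.min(a.length, b.length)
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Returns each index i where segment i should start a new chunk,
// i.e. where segment i is dissimilar to segment i - 1.
function findBoundaries(embeddings: number[][], threshold: number): number[] {
  const boundaries: number[] = []
  for (let i = 1; i < embeddings.length; i++) {
    const previous = embeddings[i - 1]
    const current = embeddings[i]
    if (previous === undefined || current === undefined) continue
    if (cosine(previous, current) < threshold) boundaries.push(i)
  }
  return boundaries
}
```

Under this reading, raising similarityThreshold produces more boundaries and therefore smaller chunks, while lowering it merges more neighboring segments.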

For model- or domain-driven boundaries, provide a segmenter:

indexingPipeline({
  chunker: chunker.semantic({
    strategy: 'custom',
    minChars: 300,
    maxChars: 1200,
    async segment({ document, segments }, ctx) {
      return detectSectionBoundaries(document.content, segments, ctx)
    },
  }),
})

Use strategy: 'hybrid' to run your segmenter first and fall back to embedding-based boundaries when it returns none.

Custom

You can also bring your own chunker:

const markdownSections = {
  _tag: 'Chunker' as const,
  name: 'markdown-sections',
  version: '1',
  fingerprint: () => 'markdown-sections:v1',
  chunkDocument(document) {
    return {
      chunks: splitByHeading(document.content).map((section, ordinal) => ({
        namespace: document.namespace,
        sourceId: document.sourceId,
        chunkId: section.slug,
        ordinal,
        content: section.content,
        metadata: {
          ...document.metadata,
          heading: section.heading,
        },
      })),
    }
  },
}

const pipeline = indexingPipeline({ chunker: markdownSections })

Custom chunkers are best for domains with explicit boundaries: Markdown headings, legal clauses, API reference sections, support tickets, transcript turns, or CMS blocks. Keep chunkId stable across syncs so replacement, citations, and retrieval traces stay readable.
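The example above calls splitByHeading without defining it. One possible shape for that helper is sketched below, assuming ATX-style Markdown headings (# through ######) and the slug, heading, and content fields the example reads. Section and slugify are hypothetical names, and the sketch does not skip # lines inside fenced code blocks.

```typescript
// Hypothetical helper for the custom chunker example; not part of Crux.
interface Section {
  slug: string
  heading: string
  content: string
}

function slugify(heading: string): string {
  return heading
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
}

function splitByHeading(markdown: string): Section[] {
  const sections: Section[] = []
  const seen = new Map<string, number>() // slug -> times used so far
  let heading = ''
  let body: string[] = []

  const flush = (): void => {
    const content = body.join('\n').trim()
    if (heading.length === 0 && content.length === 0) return
    const base = slugify(heading) || 'intro' // text before the first heading
    const count = (seen.get(base) ?? 0) + 1
    seen.set(base, count)
    // Repeated headings get a numeric suffix so slugs stay unique.
    const slug = count === 1 ? base : `${base}-${count}`
    sections.push({ slug, heading, content })
  }

  for (const line of markdown.split('\n')) {
    const match = /^#{1,6}\s+(.*)$/.exec(line)
    if (match) {
      flush() // close the previous section
      heading = (match[1] ?? '').trim()
      body = []
    } else {
      body.push(line)
    }
  }
  flush()
  return sections
}
```

Slugs derived from headings stay the same across syncs as long as the headings do. The numeric suffix for repeated headings depends on their order, so inserting an earlier duplicate heading would shift later IDs. Slugs are also only unique within one document, so if chunk IDs must be unique across sources, one option is to combine the slug with sourceId.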
