Chunkers

Choose how documents are split before embedding — structured, plain text, parent-child, semantic, or custom.

A chunker decides where to cut a document before each piece is embedded. The choice shapes recall quality more than almost any other indexing knob: cut too small and chunks lose context, cut too large and the embedding signal gets diluted.

The default is chunker.structured(). It handles plain text and also preserves provenance from @crux/ingest parts such as pages, tables, sheets, and JSON paths.

Quick rule

chunker.structured()   // default for files, URLs, PDFs, DOCX, sheets, JSON
chunker.text()         // plain text only
chunker.parentChild()  // small searchable chunks + larger context later
chunker.semantic()     // boundaries should follow meaning
custom Chunker         // your domain already has explicit boundaries

Bundled chunkers

Chunker	What it does	Best for
`chunker.text()`	Splits plain text by paragraphs, sentences, then hard character windows.	Simple prose or already-normalized text.
`chunker.structured()`	Splits by ingest part and preserves page/table/sheet/JSON provenance.	Most file and URL ingestion. This is the default.
`chunker.parentChild()`	Writes large parent records and smaller searchable child chunks with `parent.key`.	Long documents where citations should point to small chunks but prompts need surrounding context.
`chunker.semantic()`	Uses embedding, custom/model boundaries, or hybrid fallback to split by meaning.	Docs where size-based chunking cuts through concepts or sections.
Custom `Chunker`	Returns your own chunks and optional parent records.	Markdown sections, legal clauses, transcripts, tickets, CMS blocks, or domain-specific boundaries.

Plain text

Use chunker.text() when you are indexing plain prose and do not need part-aware table/page/sheet handling.

indexingPipeline({
  chunker: chunker.text({
    maxChars: 1000,
    overlapChars: 120,
  }),
})

It splits by paragraph first, then by sentence for oversized paragraphs, and finally hard-splits very long sentences into character windows. Chunk IDs are stable hashes of source id, ordinal, content, and provenance.
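The fallback order can be pictured with a minimal sketch. This is not the library's implementation: splitText, hardSplit, and chunkId are hypothetical helpers, overlap and packing of short pieces are left out, provenance is not hashed, and the FNV-1a-style hash only stands in for whatever hash the library actually uses.

```typescript
// Illustrative sketch only: splitText, hardSplit, and chunkId are hypothetical
// helpers, not the library's implementation. Overlap and packing are omitted.

// Last resort: cut a string into fixed-size character windows.
function hardSplit(text: string, maxChars: number): string[] {
  const windows: string[] = []
  for (let i = 0; i < text.length; i += maxChars) {
    windows.push(text.slice(i, i + maxChars))
  }
  return windows
}

function splitText(text: string, maxChars: number): string[] {
  const pieces: string[] = []
  // 1. Paragraphs are separated by one or more blank lines.
  for (const paragraph of text.split(/\n\s*\n/)) {
    const trimmed = paragraph.trim()
    if (trimmed.length === 0) continue
    if (trimmed.length <= maxChars) {
      pieces.push(trimmed)
      continue
    }
    // 2. Oversized paragraph: split after sentence-ending punctuation.
    const sentences = trimmed.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [trimmed]
    for (const sentence of sentences) {
      const s = sentence.trim()
      if (s.length === 0) continue
      // 3. Oversized sentence: fall back to hard character windows.
      if (s.length <= maxChars) pieces.push(s)
      else pieces.push(...hardSplit(s, maxChars))
    }
  }
  return pieces
}

// FNV-1a-style 32-bit hash over UTF-16 code units. The same inputs always
// produce the same ID, which is what makes re-indexing idempotent.
function chunkId(sourceId: string, ordinal: number, content: string): string {
  const input = `${sourceId}\u0000${ordinal}\u0000${content}`
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return ('00000000' + hash.toString(16)).slice(-8)
}

const pieces = splitText('Short one.\n\nAAAA BBBB. CCCCCCCCCCCC.', 10)
const ids = pieces.map((content, ordinal) => chunkId('doc-1', ordinal, content))
```

Because the ID depends on content as well as position, editing one paragraph changes only the IDs of the chunks it touches.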

Structured documents

indexingPipeline({
  chunker: chunker.structured({
    maxChars: 1200,
    overlapChars: 150,
    tableRowsPerChunk: 25,
  }),
})

chunker.structured() is the best default for @crux/ingest output. PDF pages keep page provenance, tables are chunked by row windows with headers repeated, spreadsheets preserve sheet names, and JSON parts preserve paths.
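To illustrate what "row windows with headers repeated" means, here is a small sketch. The chunkTableRows helper and the pipe-separated output format are assumptions made for the example; the library's actual table serialization may look different.

```typescript
// Hypothetical helper: window table rows and repeat the header in every chunk,
// so each chunk is interpretable on its own after retrieval.
function chunkTableRows(
  header: string[],
  rows: string[][],
  rowsPerChunk: number,
): string[] {
  if (rowsPerChunk < 1) throw new Error('rowsPerChunk must be at least 1')
  const headerLine = header.join(' | ')
  const chunks: string[] = []
  for (let start = 0; start < rows.length; start += rowsPerChunk) {
    const rowWindow = rows.slice(start, start + rowsPerChunk)
    const lines = [headerLine, ...rowWindow.map((row) => row.join(' | '))]
    chunks.push(lines.join('\n'))
  }
  return chunks
}

const tableChunks = chunkTableRows(
  ['sku', 'price'],
  [['a1', '10'], ['b2', '12'], ['c3', '9'], ['d4', '30'], ['e5', '7']],
  2,
)
// Five rows with two rows per chunk yield three chunks, each starting with "sku | price".
```

Without the repeated header, a chunk from the middle of a table would contain bare values with no column names, which tends to embed poorly.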

Parent / child

Use parent-child indexing when you want small searchable chunks but larger parent records for display or later context expansion.

indexingPipeline({
  chunker: chunker.parentChild({
    parentMaxChars: 6000,
    childMaxChars: 900,
    childOverlapChars: 120,
  }),
})

The retriever searches child chunks. Each child stores parent.parentId and parent.key; parentExpand({ store: data }) can follow that key at query time and add parent content while keeping the child chunk as the citation.

import { parentExpand, retrievalPipeline } from '@crux/core/retrieval'

const expandedDocs = retrievalPipeline(docs, [
  parentExpand({ store: data, maxParentChars: 4000 }),
])
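Conceptually, the expansion step is a keyed lookup. The sketch below uses assumed record shapes (ChildHit, ExpandedHit) and a plain Map in place of the real store, so treat it as an illustration of the idea and not as the parentExpand implementation.

```typescript
// Illustrative shapes only; the real records and store interface may differ.
interface ChildHit {
  chunkId: string
  content: string
  parent: { parentId: string; key: string }
}

interface ExpandedHit extends ChildHit {
  // Larger surrounding text to place in the prompt.
  context: string
}

function expandWithParents(
  hits: ChildHit[],
  parents: Map<string, string>,
  maxParentChars: number,
): ExpandedHit[] {
  return hits.map((hit) => {
    const parentContent = parents.get(hit.parent.key)
    // Fall back to the child's own text when the parent record is missing.
    const context = (parentContent ?? hit.content).slice(0, maxParentChars)
    // chunkId is untouched, so the citation still points at the child chunk.
    return { ...hit, context }
  })
}

const parentStore = new Map([['doc-1:p0', 'Full section text that surrounds the match.']])
const expanded = expandWithParents(
  [{ chunkId: 'c7', content: 'the match', parent: { parentId: 'p0', key: 'doc-1:p0' } }],
  parentStore,
  4000,
)
```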

Semantic

Use semantic chunking when boundaries should follow meaning, not only size.

indexingPipeline({
  chunker: chunker.semantic({
    strategy: 'embedding',
    dense,
    minChars: 300,
    maxChars: 1200,
    similarityThreshold: 0.76,
  }),
})
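One common way to turn a similarity threshold into boundaries is to compare the embeddings of adjacent sentences and cut where similarity drops. The sketch below assumes that approach; the page does not specify how similarityThreshold is applied internally, and findBoundaries is a hypothetical helper that ignores minChars and maxChars.

```typescript
// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / Math.sqrt(normA * normB)
}

// Hypothetical helper: returns every index i at which a new chunk starts,
// i.e. where sentence i is less similar than the threshold to sentence i - 1.
function findBoundaries(embeddings: number[][], threshold: number): number[] {
  const boundaries: number[] = []
  for (let i = 1; i < embeddings.length; i++) {
    if (cosine(embeddings[i - 1], embeddings[i]) < threshold) {
      boundaries.push(i)
    }
  }
  return boundaries
}

// Toy 2-d "embeddings": the first two sentences point one way, the last two another.
const toyEmbeddings = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
const boundaries = findBoundaries(toyEmbeddings, 0.76)
```

With these toy vectors the only drop below 0.76 is between the second and third sentence, so a single boundary is found at index 2. Under this scheme, a higher threshold produces more, smaller chunks.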

For model- or domain-driven boundaries, provide a segmenter:

indexingPipeline({
  chunker: chunker.semantic({
    strategy: 'custom',
    minChars: 300,
    maxChars: 1200,
    async segment({ document, segments }, ctx) {
      return detectSectionBoundaries(document.content, segments, ctx)
    },
  }),
})

Use strategy: 'hybrid' to try your segmenter first and fall back to embedding-based boundaries when it returns none.

Custom

You can also bring your own chunker:

const markdownSections = {
  _tag: 'Chunker' as const,
  name: 'markdown-sections',
  version: '1',
  fingerprint: () => 'markdown-sections:v1',
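  chunkDocument(document) {
    // splitByHeading stands in for your own section splitter; here it is
    // assumed to return { slug, heading, content } for each section.
    return {
      chunks: splitByHeading(document.content).map((section, ordinal) => ({
        namespace: document.namespace,
        sourceId: document.sourceId,
        // Derive the ID from the section itself so it stays stable across syncs.
        chunkId: section.slug,
        ordinal,
        content: section.content,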
        metadata: {
          ...document.metadata,
          heading: section.heading,
        },
      })),
    }
  },
}

const pipeline = indexingPipeline({ chunker: markdownSections })

Custom chunkers are best for domains with explicit boundaries: Markdown headings, legal clauses, API reference sections, support tickets, transcript turns, or CMS blocks. Keep chunkId stable across syncs so replacement, citations, and retrieval traces stay readable.
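The example above calls splitByHeading without defining it. One possible version is sketched here, under the assumptions that it returns { slug, heading, content } per section and that slugs come from heading text. It is a hypothetical helper, not part of the library, and it does not skip heading-like lines inside fenced code blocks.

```typescript
interface Section {
  slug: string
  heading: string
  content: string
}

// Lowercase the heading and collapse anything non-alphanumeric into hyphens.
function slugify(heading: string): string {
  return heading
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
}

function splitByHeading(markdown: string): Section[] {
  const raw: { heading: string; lines: string[] }[] = []
  for (const line of markdown.split('\n')) {
    const match = /^#{1,6}\s+(.*)$/.exec(line)
    if (match) {
      // An ATX heading starts a new section.
      raw.push({ heading: match[1].trim(), lines: [] })
    } else if (raw.length > 0) {
      raw[raw.length - 1].lines.push(line)
    } else if (line.trim().length > 0) {
      // Text before the first heading becomes an untitled preamble section.
      raw.push({ heading: '', lines: [line] })
    }
  }

  // Repeated headings get a numeric suffix so every slug is unique.
  const seen = new Map<string, number>()
  return raw.map(({ heading, lines }) => {
    const base = slugify(heading) || 'section'
    const count = seen.get(base) ?? 0
    seen.set(base, count + 1)
    return {
      slug: count === 0 ? base : `${base}-${count}`,
      heading,
      content: lines.join('\n').trim(),
    }
  })
}

const sections = splitByHeading('# Intro\nHello\n\n## Setup\nStep 1\n## Setup\nStep 2')
```

Slugs derived from headings survive edits to the body text, which is what keeps chunkId stable across syncs. The trade-off is that renaming a heading, or inserting a duplicate heading earlier in the file, changes the affected IDs.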

Indexing Documents — how chunkers fit into the indexing pipeline
Pipelines — parentExpand and other retrieval-time transforms
Reference: indexing — full chunker.* API
