@crux/ingest

Source loaders for inline text, files, and URLs — produce IngestDocument streams for corpus sync.

import { textSource } from '@crux/ingest'
import { fileSource, filesSource } from '@crux/ingest/files'
import { urlSource, urlsSource } from '@crux/ingest/urls'
import type { IngestDocument, SourceLoader } from '@crux/ingest'

Overview

@crux/ingest is the source-loading layer for retrieval. It turns inline text, local files, folders, globs, and URLs into structured IngestDocument values that can flow into corpus.sync() or indexer.indexDocuments().

The important word is structured. Every parser produces parts first: text blocks, pages, tables, sheets, or JSON paths. Crux then derives content from those parts so the current chunking and retrieval path still works like normal text ingestion. That gives simple users a plain-text path and gives advanced users provenance for custom chunkers, UI citations, table-aware indexing, and future document-intelligence features.

If your application already has a clean loading pipeline, you do not have to use this package. Yield the same document shape from your own loader and pass it to the corpus. @crux/ingest exists to make the default retrieval stack coherent, not to become mandatory ceremony.

Subpaths

Bundled loaders

Crux ships these source loaders:

| Loader | Import | What it loads | Use it when |
| --- | --- | --- | --- |
| textSource() | @crux/ingest | One or more in-memory IngestDocument-like objects. | Your app already has text, CMS records, API results, or generated documents in memory. |
| fileSource() | @crux/ingest/files | One local file. | You want to index a known file and optionally control its sourceId. |
| filesSource() | @crux/ingest/files | An array of paths, a directory, or a glob. | You want to index a local docs folder or file set. |
| urlSource() | @crux/ingest/urls | One HTTP(S) URL. | You want to index a known web page or remote document. |
| urlsSource() | @crux/ingest/urls | A list of HTTP(S) URLs. | You have an explicit URL list and do not need crawler behavior. |
| Custom SourceLoader | any package | Any async iterable of IngestLoadResult. | You are loading from a CMS, database, queue, API, crawler, or connector. |

All bundled loaders return the same SourceLoader shape:

const loader = filesSource({ directory: './docs' }, { namespace: 'product-docs' })

await docsCorpus.sync(loader.load(), { sourceSet: 'complete' })

Use load() for product sync jobs because failures are returned as values. Use documents() for scripts and tests where the first failed source should throw.

Bundled parsers

Parsers turn bytes into structured parts. The source decides the format from file extension, URL content type, URL extension, or PDF magic bytes.
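The detection inputs listed above can be sketched as a small function. This is an illustration only, not the package's implementation: `detectFormat`, `hasPdfMagic`, the lookup tables, and especially the precedence order (magic bytes, then content type, then extension) are assumptions made for the example.

```typescript
// Hypothetical format detection. The precedence order is an assumption.
type IngestFormat = 'txt' | 'md' | 'html' | 'pdf' | 'csv' | 'json' | 'docx' | 'xlsx' | 'unknown'

const BY_EXTENSION: Record<string, IngestFormat> = {
  txt: 'txt', md: 'md', html: 'html', htm: 'html', pdf: 'pdf',
  csv: 'csv', json: 'json', docx: 'docx', xlsx: 'xlsx',
}

const BY_CONTENT_TYPE: Record<string, IngestFormat> = {
  'text/plain': 'txt',
  'text/markdown': 'md',
  'text/html': 'html',
  'application/pdf': 'pdf',
  'text/csv': 'csv',
  'application/json': 'json',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'docx',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': 'xlsx',
}

function hasPdfMagic(bytes: Uint8Array): boolean {
  // PDF files start with the ASCII bytes "%PDF-".
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]
  return bytes.length >= magic.length && magic.every((byte, i) => bytes[i] === byte)
}

function detectFormat(input: { path?: string; contentType?: string; bytes?: Uint8Array }): IngestFormat {
  // 1. Magic bytes catch PDFs served with a wrong or missing content type.
  if (input.bytes && hasPdfMagic(input.bytes)) return 'pdf'

  // 2. Content type, ignoring parameters such as "; charset=utf-8".
  const mime = (input.contentType ?? '').split(';')[0]?.trim().toLowerCase()
  const byMime = mime ? BY_CONTENT_TYPE[mime] : undefined
  if (byMime) return byMime

  // 3. File or URL extension, ignoring any query string or fragment.
  const ext = (input.path ?? '').split(/[?#]/)[0]?.split('.').pop()?.toLowerCase()
  const byExt = ext ? BY_EXTENSION[ext] : undefined
  if (byExt) return byExt

  return 'unknown'
}
```

Anything that falls through to `'unknown'` would then go to the plain-text fallback, or to a custom parser registered for `'unknown'`.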

| Parser | Formats | Parts produced | Notes |
| --- | --- | --- | --- |
| text | txt, unknown | text part | Plain text fallback. Unknown files/URLs are treated as text unless a custom parser overrides them. |
| markdown | md | text, table parts | Parses Markdown + GFM. Preserves heading paths and table rows. |
| html | html | text, table parts | Extracts title, headings, paragraphs, list items, code/pre blocks, and tables. |
| pdf | pdf | page parts | Uses pdfjs-dist; image-only pages can call an OCR hook and otherwise emit warnings. |
| csv | csv | table part | Produces one table with rows, columns, and row range metadata. |
| json | json | json parts | Produces one part per JSON path, including nested objects and arrays. |
| docx | docx | text, table parts | Converts DOCX to HTML with Mammoth, then uses the HTML parser. Mammoth warnings become ingest warnings. |
| xlsx | xlsx | sheet, table parts | Uses ExcelJS. Produces a sheet part and a table part for each non-empty worksheet. |
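To make "one part per JSON path" concrete, here is a minimal sketch of how a value could be walked into json parts. The `jsonToParts` helper, the `$.key[index]` path syntax, the `id` format, and the way `content` is serialized are all assumptions for illustration; the built-in parser may choose differently.

```typescript
// Mirrors the json variant of IngestPart from the Types section.
type JsonPart = { kind: 'json'; id: string; path: string; valueType: string; content: string }

function valueTypeOf(value: unknown): string {
  if (value === null) return 'null'
  if (Array.isArray(value)) return 'array'
  return typeof value
}

// Hypothetical walker: emits a part for the value itself, then recurses into children.
function jsonToParts(value: unknown, path = '$', parts: JsonPart[] = []): JsonPart[] {
  parts.push({
    kind: 'json',
    id: `json:${path}`,
    path,
    valueType: valueTypeOf(value),
    // Strings are kept as-is; everything else is serialized as JSON text.
    content: typeof value === 'string' ? value : JSON.stringify(value),
  })

  if (Array.isArray(value)) {
    value.forEach((item, index) => jsonToParts(item, `${path}[${index}]`, parts))
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      jsonToParts(child, `${path}.${key}`, parts)
    }
  }
  return parts
}
```

Under these assumptions, `jsonToParts({ name: 'crux', tags: ['a', 'b'] })` yields parts for `$`, `$.name`, `$.tags`, `$.tags[0]`, and `$.tags[1]`, so a chunker or citation UI can point at the exact path a value came from.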

Override parsers source-locally when a format needs domain-specific handling:

const loader = filesSource(
  { glob: '**/*.acme' },
  {
    namespace: 'docs',
    parsers: [
      {
        name: 'acme',
        formats: ['unknown'],
        parse: ({ text }) => ({
          parts: [
            {
              id: 'acme:1',
              kind: 'text',
              role: 'paragraph',
              content: text ?? '',
            },
          ],
        }),
      },
    ],
  },
)

Types

IngestDocument

type IngestPart =
  | {
      kind: 'text'
      id: string
      content: string
      role?: 'heading' | 'paragraph' | 'list' | 'code'
      headingPath?: string[]
    }
  | { kind: 'page'; id: string; pageNumber: number; content: string; ocr?: boolean }
  | { kind: 'table'; id: string; rows: string[][]; columns?: string[]; content: string }
  | { kind: 'sheet'; id: string; sheetName: string; tables: IngestPart[]; content: string }
  | { kind: 'json'; id: string; path: string; valueType: string; content: string }

interface IngestDocument {
  namespace: string
  sourceId: string
  parts: IngestPart[]
  content: string
  title?: string
  metadata?: Record<string, unknown>
  warnings?: IngestWarning[]
}

content is derived from parts when omitted. A custom loader can provide parts, content, or both. Prefer parts, because their provenance survives into indexing metadata.
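One way to picture the derivation is a helper that joins the content of each part. This is a sketch under stated assumptions: `deriveContent`, `normalizeDocument`, the `LooseDocument` input type, and the blank-line separator are hypothetical and are not taken from the package.

```typescript
// IngestPart as documented above.
type IngestPart =
  | { kind: 'text'; id: string; content: string; role?: 'heading' | 'paragraph' | 'list' | 'code'; headingPath?: string[] }
  | { kind: 'page'; id: string; pageNumber: number; content: string; ocr?: boolean }
  | { kind: 'table'; id: string; rows: string[][]; columns?: string[]; content: string }
  | { kind: 'sheet'; id: string; sheetName: string; tables: IngestPart[]; content: string }
  | { kind: 'json'; id: string; path: string; valueType: string; content: string }

// Hypothetical input shape for a custom loader: parts, content, or both.
type LooseDocument = { namespace: string; sourceId: string; parts?: IngestPart[]; content?: string }

// Assumption: non-empty part contents are joined with a blank line.
function deriveContent(parts: IngestPart[]): string {
  return parts
    .map((part) => part.content.trim())
    .filter((text) => text.length > 0)
    .join('\n\n')
}

// Fills in whichever of parts/content the loader left out.
function normalizeDocument(doc: LooseDocument): LooseDocument & { parts: IngestPart[]; content: string } {
  const parts: IngestPart[] =
    doc.parts ?? [{ kind: 'text', id: `${doc.sourceId}:1`, content: doc.content ?? '' }]
  return { ...doc, parts, content: doc.content ?? deriveContent(parts) }
}
```

A loader that supplies only parts keeps page numbers, heading paths, and table rows available downstream, while the derived `content` still feeds the plain-text chunking path.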

SourceLoader

type IngestLoadResult =
  | { ok: true; document: IngestDocument }
  | { ok: false; namespace: string; sourceId: string; error: IngestError; metadata?: Record<string, unknown> }

interface SourceLoader {
  load(): AsyncIterable<IngestLoadResult>
  documents(): AsyncIterable<IngestDocument>
}

Use load() for real sync jobs. It reports source-level failures as values, which lets corpus.sync() keep going and record failed sources in the ledger. Use documents() for scripts and tests where the first bad source should throw.
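A custom loader only needs to satisfy this interface. The sketch below wraps hypothetical CMS records: `cmsSource`, `CmsRecord`, and the `{ message }` shape given to `IngestError` are assumptions for illustration (this page does not show the fields of IngestError), and the types are redeclared locally so the example stands alone.

```typescript
type TextPart = { kind: 'text'; id: string; content: string; role?: 'heading' | 'paragraph' | 'list' | 'code' }
type IngestDocument = { namespace: string; sourceId: string; parts: TextPart[]; content: string; title?: string }
type IngestError = { message: string } // assumed shape
type IngestLoadResult =
  | { ok: true; document: IngestDocument }
  | { ok: false; namespace: string; sourceId: string; error: IngestError }

interface SourceLoader {
  load(): AsyncIterable<IngestLoadResult>
  documents(): AsyncIterable<IngestDocument>
}

// Hypothetical upstream record.
type CmsRecord = { id: string; title: string; body: string | null }

function cmsSource(records: CmsRecord[], namespace: string): SourceLoader {
  // load(): failures are yielded as values, so one bad record does not stop the run.
  async function* load(): AsyncGenerator<IngestLoadResult> {
    for (const record of records) {
      const sourceId = `cms:${record.id}`
      if (record.body === null) {
        yield { ok: false, namespace, sourceId, error: { message: 'record has no body' } }
        continue
      }
      yield {
        ok: true,
        document: {
          namespace,
          sourceId,
          title: record.title,
          content: record.body,
          parts: [{ kind: 'text', id: `${sourceId}:1`, role: 'paragraph', content: record.body }],
        },
      }
    }
  }

  // documents(): the fail-fast view over load(); throws on the first failed source.
  async function* documents(): AsyncGenerator<IngestDocument> {
    for await (const result of load()) {
      if (!result.ok) throw new Error(`${result.sourceId}: ${result.error.message}`)
      yield result.document
    }
  }

  return { load, documents }
}
```

Building `documents()` on top of `load()` keeps a single code path for fetching and parsing, with the two methods differing only in how a failure is surfaced.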

textSource(input) — inline documents

Use textSource() when you already have the text in memory:

import { textSource } from '@crux/ingest'

const single = textSource({
  namespace: 'docs',
  sourceId: 'readme',
  content: '# Hello\nThis is the readme.',
  title: 'README',
})

const many = textSource([
  { namespace: 'docs', sourceId: 'a', content: '...' },
  { namespace: 'docs', sourceId: 'b', content: '...' },
])

// Product sync path.
await docsCorpus.sync(single.load(), { sourceSet: 'partial' })

// Fail-fast script path.
for await (const document of single.documents()) {
  console.log(document.content)
}

textSource() throws if namespace or sourceId is empty.

Parser contract

type IngestParser = {
  name: string
  formats: IngestFormat[]
  parse(input: ParseInput, ctx: ParseContext): Promise<ParseResult> | ParseResult
}

Built-in parsers cover txt, md, html, pdf, csv, json, docx, and xlsx. Pass parsers to a source to override a built-in parser or add a source-local parser. Custom parsers win over built-ins for matching formats.

const loader = fileSource('./docs/manual.acme', {
  namespace: 'docs',
  parsers: [
    {
      name: 'acme',
      formats: ['unknown'],
      parse: ({ text }) => ({
        parts: [{ id: 'acme:1', kind: 'text', role: 'paragraph', content: text ?? '' }],
      }),
    },
  ],
})
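The "custom parsers win" rule can be pictured as a lookup that checks source-local parsers before the built-ins. This is a sketch of the idea, not the package's resolver: `resolveParser` and `ParserLike` are hypothetical names, and letting the first listed custom parser win when several match is an assumption.

```typescript
// Only the fields needed for resolution; parse() is omitted for brevity.
type ParserLike = { name: string; formats: string[] }

// Hypothetical resolver: custom parsers are searched first, so they shadow built-ins.
function resolveParser<P extends ParserLike>(format: string, custom: P[], builtIn: P[]): P | undefined {
  return [...custom, ...builtIn].find((parser) => parser.formats.includes(format))
}

const builtInParsers: ParserLike[] = [
  { name: 'text', formats: ['txt', 'unknown'] },
  { name: 'markdown', formats: ['md'] },
]
const customParsers: ParserLike[] = [{ name: 'acme', formats: ['unknown'] }]

const forUnknown = resolveParser('unknown', customParsers, builtInParsers)
const forTxt = resolveParser('txt', customParsers, builtInParsers)
console.log(forUnknown?.name, forTxt?.name)
```

Here the custom `acme` parser takes over `unknown` inputs, while `txt` files still resolve to the built-in text parser because no custom parser claims that format.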

PDF parsing supports an OCR hook. Crux does not ship an OCR provider, but image-only pages can call your hook and mark the page part with ocr: true.

Pattern: pipe into a corpus

import { corpus, indexer } from '@crux/core/indexing'
import { filesSource } from '@crux/ingest/files'

// `store` and `embedding` are assumed to be configured elsewhere in your app.
const docsIndexer = indexer({
  id: 'docs',
  namespace: 'product-docs',
  store,
  dense: embedding,
})

const docsCorpus = corpus({
  id: 'docs',
  namespace: 'product-docs',
  store,
  indexer: docsIndexer,
})

const source = filesSource({ directory: './docs', recursive: true }, { namespace: 'product-docs' })

await docsCorpus.sync(source.load(), {
  sourceSet: 'complete',
  stale: 'delete',
})

Use indexer.indexDocuments(source.load()) for one-off writes and tests. Use corpus.sync() for repeated ingestion jobs.

Observability

Each parser run emits ingest:parse:start and ingest:parse:end through devtools, the CLI/TUI, and @crux/otel. OTel exports a crux.ingest.parse span with parser, format, byte length, part count, warning count, and privacy-safe hashed namespace/source identifiers.

What @crux/ingest is not

It is not a crawler framework, connector marketplace, or full document-intelligence platform. Use it to normalize source material into structured documents, then let corpus sync and retrieval own the rest.

  • Reference: Indexing — indexer and corpus
  • Reference: Retrieval — query the indexed corpus
  • Guide: Retrieval — the full RAG pipeline
