@crux/ingest

Source loaders for inline text, files, and URLs — produce IngestDocument streams for corpus sync.

import { textSource } from '@crux/ingest'
import { fileSource, filesSource } from '@crux/ingest/files'
import { urlSource, urlsSource } from '@crux/ingest/urls'
import type { IngestDocument, SourceLoader } from '@crux/ingest'

Overview

@crux/ingest is the source-loading layer for retrieval. It turns inline text, local files, folders, globs, and URLs into structured IngestDocument values that can flow into corpus.sync() or indexer.indexDocuments().

The important word is structured. Every parser produces parts first: text blocks, pages, tables, sheets, or JSON paths. Crux then derives content from those parts so the current chunking and retrieval path still works like normal text ingestion. Simple use cases get a plain-text path, and advanced ones get the provenance needed for custom chunkers, UI citations, table-aware indexing, and future document-intelligence features.

If your application already has a clean loading pipeline, you do not have to use this package. Yield the same document shape from your own loader and pass it to the corpus. @crux/ingest exists to make the default retrieval stack coherent, not to become mandatory ceremony.

Subpaths

Files

fileSource, filesSource — single file, glob, directory loaders.

URLs

urlSource, urlsSource — fetch and extract HTML to plain text.

Bundled loaders

Crux ships these source loaders:

Loader	Import	What it loads	Use it when
`textSource()`	`@crux/ingest`	One or more in-memory `IngestDocument`-like objects.	Your app already has text, CMS records, API results, or generated documents in memory.
`fileSource()`	`@crux/ingest/files`	One local file.	You want to index a known file and optionally control its `sourceId`.
`filesSource()`	`@crux/ingest/files`	An array of paths, a directory, or a glob.	You want to index a local docs folder or file set.
`urlSource()`	`@crux/ingest/urls`	One HTTP(S) URL.	You want to index a known web page or remote document.
`urlsSource()`	`@crux/ingest/urls`	A list of HTTP(S) URLs.	You have an explicit URL list and do not need crawler behavior.
Custom `SourceLoader`	any package	Any async iterable of `IngestLoadResult`.	You are loading from a CMS, database, queue, API, crawler, or connector.

All bundled loaders return the same SourceLoader shape:

const loader = filesSource({ directory: './docs' }, { namespace: 'product-docs' })

await docsCorpus.sync(loader.load(), { sourceSet: 'complete' })

Use load() for product sync jobs because failures are returned as values. Use documents() for scripts and tests where the first failed source should throw.
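
On the consuming side, "failures are returned as values" means a sync job can drain the whole stream and decide afterwards what to do with the bad sources. The sketch below is one way to do that by hand. DocLike, LoadResult, and partition are local stand-ins invented for this example (the error is modelled as a plain Error), not exports of @crux/ingest.

```ts
// Minimal local stand-ins for the result shapes documented under Types.
interface DocLike {
  namespace: string
  sourceId: string
  content: string
}

type LoadResult =
  | { ok: true; document: DocLike }
  | { ok: false; namespace: string; sourceId: string; error: Error }

// Drain a load() stream, keeping successes and failures side by side.
// Nothing throws here: a failed source is just another value in the stream.
async function partition(results: AsyncIterable<LoadResult>) {
  const documents: DocLike[] = []
  const failures: { sourceId: string; message: string }[] = []
  for await (const result of results) {
    if (result.ok) documents.push(result.document)
    else failures.push({ sourceId: result.sourceId, message: result.error.message })
  }
  return { documents, failures }
}
```

A job built this way can index the documents it received and log or retry the failures, instead of aborting on the first bad file.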

Bundled parsers

Parsers turn bytes into structured parts. The source decides the format from file extension, URL content type, URL extension, or PDF magic bytes.
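
The detection signals listed above can be combined roughly as follows. This is a sketch, not the bundled implementation: the detectFormat name, the DetectInput shape, the lookup tables, and the order in which signals are checked (content type, then extension, then magic bytes) are all assumptions. The only hard fact used is that PDF files begin with the bytes "%PDF-".

```ts
type IngestFormat = 'txt' | 'md' | 'html' | 'pdf' | 'csv' | 'json' | 'docx' | 'xlsx' | 'unknown'

// Assumed lookup tables; the real mappings may cover more types.
const formatByContentType = new Map<string, IngestFormat>([
  ['text/plain', 'txt'],
  ['text/markdown', 'md'],
  ['text/html', 'html'],
  ['application/pdf', 'pdf'],
  ['text/csv', 'csv'],
  ['application/json', 'json'],
])

const formatByExtension = new Map<string, IngestFormat>([
  ['txt', 'txt'], ['md', 'md'], ['html', 'html'], ['htm', 'html'], ['pdf', 'pdf'],
  ['csv', 'csv'], ['json', 'json'], ['docx', 'docx'], ['xlsx', 'xlsx'],
])

const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d] // "%PDF-"

// Hypothetical input: a file path or URL, plus optional response metadata.
interface DetectInput {
  path?: string
  contentType?: string
  bytes?: Uint8Array
}

function detectFormat({ path, contentType, bytes }: DetectInput): IngestFormat {
  // 1. Content type, ignoring parameters such as "; charset=utf-8".
  const mime = contentType?.split(';')[0]?.trim().toLowerCase()
  const fromMime = mime ? formatByContentType.get(mime) : undefined
  if (fromMime) return fromMime

  // 2. Extension of the path or URL, ignoring any query string or fragment.
  const extension = path?.split(/[?#]/)[0]?.split('.').pop()?.toLowerCase()
  const fromExtension = extension ? formatByExtension.get(extension) : undefined
  if (fromExtension) return fromExtension

  // 3. Magic bytes as a last resort, for example a PDF served as octet-stream.
  if (bytes && PDF_MAGIC.every((byte, index) => bytes[index] === byte)) return 'pdf'

  return 'unknown'
}
```

With these tables, a local ./docs/guide.md resolves to md through its extension, and anything unrecognized falls through to unknown, which the text parser then handles.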

Parser	Formats	Parts produced	Notes
`text`	`txt`, `unknown`	`text` part	Plain text fallback. Unknown files/URLs are treated as text unless a custom parser overrides them.
`markdown`	`md`	`text`, `table` parts	Parses Markdown + GFM. Preserves heading paths and table rows.
`html`	`html`	`text`, `table` parts	Extracts title, headings, paragraphs, list items, code/pre blocks, and tables.
`pdf`	`pdf`	`page` parts	Uses `pdfjs-dist`; image-only pages can call an OCR hook and otherwise emit warnings.
`csv`	`csv`	`table` part	Produces one table with rows, columns, and row range metadata.
`json`	`json`	`json` parts	Produces one part per JSON path, including nested objects and arrays.
`docx`	`docx`	`text`, `table` parts	Converts DOCX to HTML with Mammoth, then uses the HTML parser. Mammoth warnings become ingest warnings.
`xlsx`	`xlsx`	`sheet`, `table` parts	Uses ExcelJS. Produces a sheet part and a table part for each non-empty worksheet.
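
To make "one part per JSON path" concrete, here is a minimal sketch of how a value could be flattened into json parts. The path syntax ($.tags[0]), the id scheme, the valueType labels, and the jsonParts helper itself are assumptions for illustration; the bundled parser's actual output may differ.

```ts
type JsonValue = null | boolean | number | string | JsonValue[] | { [key: string]: JsonValue }

// Mirrors the json variant of IngestPart.
interface JsonPart {
  kind: 'json'
  id: string
  path: string
  valueType: string
  content: string
}

function valueTypeOf(value: JsonValue): string {
  if (value === null) return 'null'
  if (Array.isArray(value)) return 'array'
  return typeof value // 'object', 'string', 'number', or 'boolean'
}

// Depth-first walk: emit a part for the value itself, then for each child.
function jsonParts(value: JsonValue, path = '$', out: JsonPart[] = []): JsonPart[] {
  out.push({
    kind: 'json',
    id: `json:${out.length + 1}`,
    path,
    valueType: valueTypeOf(value),
    // Strings are kept verbatim; everything else is serialized.
    content: typeof value === 'string' ? value : JSON.stringify(value),
  })
  if (Array.isArray(value)) {
    value.forEach((item, index) => jsonParts(item, `${path}[${index}]`, out))
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) jsonParts(child, `${path}.${key}`, out)
  }
  return out
}
```

Calling jsonParts({ name: 'crux', tags: ['a', 'b'] }) yields five parts in this sketch, with paths $, $.name, $.tags, $.tags[0], and $.tags[1].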

Override parsers source-locally when a format needs domain-specific handling:

const loader = filesSource(
  { glob: '**/*.acme' },
  {
    namespace: 'docs',
    parsers: [
      {
        name: 'acme',
        formats: ['unknown'],
        parse: ({ text }) => ({
          parts: [
            {
              id: 'acme:1',
              kind: 'text',
              role: 'paragraph',
              content: text ?? '',
            },
          ],
        }),
      },
    ],
  },
)

Types

`IngestDocument`

type IngestPart =
  | {
      kind: 'text'
      id: string
      content: string
      role?: 'heading' | 'paragraph' | 'list' | 'code'
      headingPath?: string[]
    }
  | { kind: 'page'; id: string; pageNumber: number; content: string; ocr?: boolean }
  | { kind: 'table'; id: string; rows: string[][]; columns?: string[]; content: string }
  | { kind: 'sheet'; id: string; sheetName: string; tables: IngestPart[]; content: string }
  | { kind: 'json'; id: string; path: string; valueType: string; content: string }

interface IngestDocument {
  namespace: string
  sourceId: string
  parts: IngestPart[]
  content: string
  title?: string
  metadata?: Record<string, unknown>
  warnings?: IngestWarning[]
}

The content field is derived from parts when you omit it. A custom loader can provide parts, content, or both. Prefer parts, because their provenance survives into indexing metadata.
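
If your own loader emits only parts, the derivation might look like the sketch below. The deriveContent and withContent helpers are hypothetical, and joining parts with a blank line is an assumption; the separator Crux actually uses is not documented here.

```ts
// Every IngestPart variant carries id, kind, and content, which is all this sketch needs.
interface PartLike {
  id: string
  kind: string
  content: string
}

// Assumption: parts are joined in order, separated by a blank line, skipping empty ones.
function deriveContent(parts: PartLike[]): string {
  return parts
    .map((part) => part.content.trim())
    .filter((text) => text.length > 0)
    .join('\n\n')
}

// Fill in content only when the loader did not provide it.
function withContent<T extends { parts: PartLike[]; content?: string }>(doc: T): T & { content: string } {
  return { ...doc, content: doc.content ?? deriveContent(doc.parts) }
}
```

An explicit content value always wins in this sketch, so a loader that already has clean text can pass it through unchanged while still attaching parts for provenance.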

`SourceLoader`

type IngestLoadResult =
  | { ok: true; document: IngestDocument }
  | { ok: false; namespace: string; sourceId: string; error: IngestError; metadata?: Record<string, unknown> }

interface SourceLoader {
  load(): AsyncIterable<IngestLoadResult>
  documents(): AsyncIterable<IngestDocument>
}

Use load() for real sync jobs. It reports source-level failures as values, which lets corpus.sync() keep going and record failed sources in the ledger. Use documents() for scripts and tests where the first bad source should throw.
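
A custom loader only has to satisfy this interface. The sketch below wraps an in-memory record list and shows how documents() can be layered on top of load(). The types are trimmed local copies of the shapes above, IngestError is modelled as a plain Error, and cmsSource and CmsRecord are hypothetical names for illustration.

```ts
type IngestPart = { kind: 'text'; id: string; content: string }

interface IngestDocument {
  namespace: string
  sourceId: string
  parts: IngestPart[]
  content: string
}

// Assumption: IngestError is modelled here as a plain Error.
type IngestLoadResult =
  | { ok: true; document: IngestDocument }
  | { ok: false; namespace: string; sourceId: string; error: Error }

interface SourceLoader {
  load(): AsyncIterable<IngestLoadResult>
  documents(): AsyncIterable<IngestDocument>
}

// Hypothetical record shape, standing in for a CMS row or API result.
interface CmsRecord {
  id: string
  body: string | null
}

function cmsSource(records: CmsRecord[], namespace: string): SourceLoader {
  async function* load(): AsyncGenerator<IngestLoadResult> {
    for (const record of records) {
      if (record.body === null) {
        // Report the failure as a value so the consumer can keep going.
        yield { ok: false, namespace, sourceId: record.id, error: new Error('empty body') }
        continue
      }
      yield {
        ok: true,
        document: {
          namespace,
          sourceId: record.id,
          parts: [{ kind: 'text', id: `${record.id}:1`, content: record.body }],
          content: record.body,
        },
      }
    }
  }

  async function* documents(): AsyncGenerator<IngestDocument> {
    for await (const result of load()) {
      // Fail fast: the first failed source aborts the iteration.
      if (!result.ok) throw result.error
      yield result.document
    }
  }

  return { load, documents }
}
```

Deriving documents() from load() keeps the two paths consistent: both see the same sources in the same order, and they differ only in how a failure surfaces.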

`textSource(input)` — inline documents

For when you already have the text in hand:

import { textSource } from '@crux/ingest'

const single = textSource({
  namespace: 'docs',
  sourceId: 'readme',
  content: '# Hello\nThis is the readme.',
  title: 'README',
})

const many = textSource([
  { namespace: 'docs', sourceId: 'a', content: '...' },
  { namespace: 'docs', sourceId: 'b', content: '...' },
])

// Product sync path.
await docsCorpus.sync(single.load(), { sourceSet: 'partial' })

// Fail-fast script path.
for await (const document of single.documents()) {
  console.log(document.content)
}

textSource() throws if namespace or sourceId is empty.

Parser contract

type IngestParser = {
  name: string
  formats: IngestFormat[]
  parse(input: ParseInput, ctx: ParseContext): Promise<ParseResult> | ParseResult
}

Built-in parsers cover txt, md, html, pdf, csv, json, docx, and xlsx. Pass parsers to a source to override a built-in parser or add a source-local parser. Custom parsers win over built-ins for matching formats.

const loader = fileSource('./docs/manual.acme', {
  namespace: 'docs',
  parsers: [
    {
      name: 'acme',
      formats: ['unknown'],
      parse: ({ text }) => ({
        parts: [{ id: 'acme:1', kind: 'text', role: 'paragraph', content: text ?? '' }],
      }),
    },
  ],
})
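
The precedence rule ("custom parsers win over built-ins for matching formats") can be pictured as a lookup that searches the source-local list first. This is a sketch of the rule only; resolveParser and ParserLike are hypothetical names, and the real resolution logic may consider more than the format.

```ts
type IngestFormat = 'txt' | 'md' | 'html' | 'pdf' | 'csv' | 'json' | 'docx' | 'xlsx' | 'unknown'

// Only the fields needed for resolution; parse() is omitted for brevity.
interface ParserLike {
  name: string
  formats: IngestFormat[]
}

function resolveParser<P extends ParserLike>(
  format: IngestFormat,
  custom: P[],
  builtIn: P[],
): P | undefined {
  // Custom parsers are searched first, so they shadow built-ins for the same format.
  return [...custom, ...builtIn].find((parser) => parser.formats.includes(format))
}

// Two of the built-ins from the table above, reduced to name and formats.
const builtIn: ParserLike[] = [
  { name: 'text', formats: ['txt', 'unknown'] },
  { name: 'markdown', formats: ['md'] },
]
const custom: ParserLike[] = [{ name: 'acme', formats: ['unknown'] }]

const forUnknown = resolveParser('unknown', custom, builtIn) // the custom 'acme' parser
const forTxt = resolveParser('txt', custom, builtIn) // the built-in 'text' parser
```

Because the acme parser claims only the unknown format, the text parser still handles txt files from the same source.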

PDF parsing supports an OCR hook. Crux does not ship an OCR provider, but image-only pages can call your hook and mark the page part with ocr: true.
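
The hook's exact option name and signature are not shown on this page, so the following is only a sketch of the behavior described: pages with extractable text pass through, image-only pages go to the hook and are marked ocr: true, and without a hook a warning is recorded. OcrHook, ExtractedPage, and toPageParts are hypothetical, as is the choice to emit no part for a page that could not be read.

```ts
// Mirrors the page variant of IngestPart.
interface PagePart {
  kind: 'page'
  id: string
  pageNumber: number
  content: string
  ocr?: boolean
}

// Hypothetical hook signature; the real option name and arguments may differ.
type OcrHook = (page: { pageNumber: number; image: Uint8Array }) => Promise<string>

// Hypothetical intermediate shape: per-page extracted text plus a rendered image.
interface ExtractedPage {
  pageNumber: number
  text: string
  image: Uint8Array
}

async function toPageParts(
  pages: ExtractedPage[],
  ocr?: OcrHook,
): Promise<{ parts: PagePart[]; warnings: string[] }> {
  const parts: PagePart[] = []
  const warnings: string[] = []
  for (const page of pages) {
    const id = `page:${page.pageNumber}`
    if (page.text.trim().length > 0) {
      // Normal case: the PDF text layer is usable as-is.
      parts.push({ kind: 'page', id, pageNumber: page.pageNumber, content: page.text })
    } else if (ocr) {
      // Image-only page: delegate to the caller's OCR provider and flag the part.
      const content = await ocr({ pageNumber: page.pageNumber, image: page.image })
      parts.push({ kind: 'page', id, pageNumber: page.pageNumber, content, ocr: true })
    } else {
      warnings.push(`page ${page.pageNumber} has no extractable text and no OCR hook was provided`)
    }
  }
  return { parts, warnings }
}
```

Keeping the ocr flag on the part lets downstream code treat recognized text with more caution than text read from the PDF's own text layer.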

Pattern: pipe into a corpus

import { corpus, indexer } from '@crux/core/indexing'
import { filesSource } from '@crux/ingest/files'

const docsIndexer = indexer({
  id: 'docs',
  namespace: 'product-docs',
  store,
  dense: embedding,
})

const docsCorpus = corpus({
  id: 'docs',
  namespace: 'product-docs',
  store,
  indexer: docsIndexer,
})

const source = filesSource({ directory: './docs', recursive: true }, { namespace: 'product-docs' })

await docsCorpus.sync(source.load(), {
  sourceSet: 'complete',
  stale: 'delete',
})

Use indexer.indexDocuments(source.load()) for one-off writes and tests. Use corpus.sync() for repeated ingestion jobs.

Observability

Each parser run emits ingest:parse:start and ingest:parse:end through devtools, the CLI/TUI, and @crux/otel. OTel exports a crux.ingest.parse span with parser, format, byte length, part count, warning count, and privacy-safe hashed namespace/source identifiers.

What `@crux/ingest` is not

It is not a crawler framework, connector marketplace, or full document-intelligence platform. Use it to normalize source material into structured documents, then let corpus sync and retrieval own the rest.

Reference: Indexing — indexer and corpus
Reference: Retrieval — query the indexed corpus
Guide: Retrieval — the full RAG pipeline
