@crux/ingest
Source loaders for inline text, files, and URLs — produce IngestDocument streams for corpus sync.
import { textSource } from '@crux/ingest'
import { fileSource, filesSource } from '@crux/ingest/files'
import { urlSource, urlsSource } from '@crux/ingest/urls'
import type { IngestDocument, SourceLoader } from '@crux/ingest'

Overview
@crux/ingest is the source-loading layer for retrieval. It turns inline text, local files, folders, globs, and URLs into structured IngestDocument values that can flow into corpus.sync() or indexer.indexDocuments().
The important word is structured. Every parser produces parts first: text blocks, pages, tables, sheets, or JSON paths. Crux then derives content from those parts so the current chunking and retrieval path still works like normal text ingestion. That gives simple users a plain-text path and gives advanced users provenance for custom chunkers, UI citations, table-aware indexing, and future document-intelligence features.
If your application already has a clean loading pipeline, you do not have to use this package. Yield the same document shape from your own loader and pass it to the corpus. @crux/ingest exists to make the default retrieval stack coherent, not to become mandatory ceremony.
Subpaths
Files
fileSource, filesSource — single file, glob, directory loaders.
URLs
urlSource, urlsSource — fetch and extract HTML to plain text.
Bundled loaders
Crux ships these source loaders:
| Loader | Import | What it loads | Use it when |
|---|---|---|---|
| textSource() | @crux/ingest | One or more in-memory IngestDocument-like objects. | Your app already has text, CMS records, API results, or generated documents in memory. |
| fileSource() | @crux/ingest/files | One local file. | You want to index a known file and optionally control its sourceId. |
| filesSource() | @crux/ingest/files | An array of paths, a directory, or a glob. | You want to index a local docs folder or file set. |
| urlSource() | @crux/ingest/urls | One HTTP(S) URL. | You want to index a known web page or remote document. |
| urlsSource() | @crux/ingest/urls | A list of HTTP(S) URLs. | You have an explicit URL list and do not need crawler behavior. |
| Custom SourceLoader | any package | Any async iterable of IngestLoadResult. | You are loading from a CMS, database, queue, API, crawler, or connector. |
All bundled loaders return the same SourceLoader shape:
const loader = filesSource({ directory: './docs' }, { namespace: 'product-docs' })
await docsCorpus.sync(loader.load(), { sourceSet: 'complete' })

Use load() for product sync jobs because failures are returned as values. Use documents() for scripts and tests where the first failed source should throw.
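
The "Custom SourceLoader" row above describes a custom loader as any async iterable of load results. A minimal sketch of that idea follows, assuming a hypothetical in-memory CMS record shape. The `Doc`, `LoadResult`, and `CmsRecord` types and the `cmsSource` name are local stand-ins invented for illustration (a plain `Error` takes the place of `IngestError`), not exports of @crux/ingest.

```typescript
// Local stand-ins for the shapes documented on this page (text parts only).
type TextPart = { kind: 'text'; id: string; content: string }
interface Doc {
  namespace: string
  sourceId: string
  parts: TextPart[]
  content: string
  title?: string
}
type LoadResult =
  | { ok: true; document: Doc }
  | { ok: false; namespace: string; sourceId: string; error: Error }

// Hypothetical CMS record; `body` is null when the upstream fetch failed.
interface CmsRecord {
  slug: string
  title: string
  body: string | null
}

// Yields one result per record. Failures are yielded as values, not thrown,
// so a consumer can keep going after a bad record.
async function* cmsSource(records: CmsRecord[], namespace: string): AsyncGenerator<LoadResult> {
  for (const record of records) {
    if (record.body === null) {
      yield {
        ok: false,
        namespace,
        sourceId: record.slug,
        error: new Error(`empty body: ${record.slug}`),
      }
      continue
    }
    yield {
      ok: true,
      document: {
        namespace,
        sourceId: record.slug,
        title: record.title,
        parts: [{ kind: 'text', id: `${record.slug}:1`, content: record.body }],
        content: record.body,
      },
    }
  }
}
```

With the real types from @crux/ingest substituted for the stand-ins, a generator like this could presumably be passed wherever loader.load() is used, but that is an assumption to verify against the package's type definitions.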
Bundled parsers
Parsers turn bytes into structured parts. The source decides the format from file extension, URL content type, URL extension, or PDF magic bytes.
| Parser | Formats | Parts produced | Notes |
|---|---|---|---|
| text | txt, unknown | text part | Plain text fallback. Unknown files/URLs are treated as text unless a custom parser overrides them. |
| markdown | md | text, table parts | Parses Markdown + GFM. Preserves heading paths and table rows. |
| html | html | text, table parts | Extracts title, headings, paragraphs, list items, code/pre blocks, and tables. |
| pdf | pdf | page parts | Uses pdfjs-dist; image-only pages can call an OCR hook and otherwise emit warnings. |
| csv | csv | table part | Produces one table with rows, columns, and row range metadata. |
| json | json | json parts | Produces one part per JSON path, including nested objects and arrays. |
| docx | docx | text, table parts | Converts DOCX to HTML with Mammoth, then uses the HTML parser. Mammoth warnings become ingest warnings. |
| xlsx | xlsx | sheet, table parts | Uses ExcelJS. Produces a sheet part and a table part for each non-empty worksheet. |
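
The format decision described before the table (file extension, URL content type, URL extension, PDF magic bytes) can be pictured with a small sketch. The `detectFormat` function, its input shape, the lookup tables, and the order in which the signals are checked are all assumptions made for illustration. The only fixed fact used here is that PDF files begin with the ASCII bytes `%PDF-`.

```typescript
type Format = 'txt' | 'md' | 'html' | 'pdf' | 'csv' | 'json' | 'docx' | 'xlsx' | 'unknown'

// Hypothetical lookup tables; the real package may recognize more variants.
const BY_EXTENSION = new Map<string, Format>([
  ['txt', 'txt'], ['md', 'md'], ['html', 'html'], ['htm', 'html'], ['pdf', 'pdf'],
  ['csv', 'csv'], ['json', 'json'], ['docx', 'docx'], ['xlsx', 'xlsx'],
])
const BY_CONTENT_TYPE = new Map<string, Format>([
  ['text/plain', 'txt'], ['text/markdown', 'md'], ['text/html', 'html'],
  ['application/pdf', 'pdf'], ['text/csv', 'csv'], ['application/json', 'json'],
])

// PDF files start with the ASCII bytes "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b)
}

// Assumed precedence: magic bytes, then content type, then extension.
// `path` is a file path or URL pathname without a query string.
function detectFormat(input: { path: string; contentType?: string; bytes?: Uint8Array }): Format {
  if (input.bytes && looksLikePdf(input.bytes)) return 'pdf'
  const mime = (input.contentType ?? '').split(';')[0]?.trim().toLowerCase() ?? ''
  const ext = input.path.split('.').pop()?.toLowerCase() ?? ''
  return BY_CONTENT_TYPE.get(mime) ?? BY_EXTENSION.get(ext) ?? 'unknown'
}

console.log(detectFormat({ path: 'guide.MD' })) // md
console.log(detectFormat({ path: 'download', contentType: 'text/html; charset=utf-8' })) // html
console.log(detectFormat({ path: 'notes.acme' })) // unknown
```

Anything that falls through to `unknown` would then reach the plain-text fallback parser listed in the table, unless a custom parser claims that format.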
Override parsers source-locally when a format needs domain-specific handling:
const loader = filesSource(
  { glob: '**/*.acme' },
  {
    namespace: 'docs',
    parsers: [
      {
        name: 'acme',
        formats: ['unknown'],
        parse: ({ text }) => ({
          parts: [
            {
              id: 'acme:1',
              kind: 'text',
              role: 'paragraph',
              content: text ?? '',
            },
          ],
        }),
      },
    ],
  },
)

Types
IngestDocument
type IngestPart =
  | {
      kind: 'text'
      id: string
      content: string
      role?: 'heading' | 'paragraph' | 'list' | 'code'
      headingPath?: string[]
    }
  | { kind: 'page'; id: string; pageNumber: number; content: string; ocr?: boolean }
  | { kind: 'table'; id: string; rows: string[][]; columns?: string[]; content: string }
  | { kind: 'sheet'; id: string; sheetName: string; tables: IngestPart[]; content: string }
  | { kind: 'json'; id: string; path: string; valueType: string; content: string }

interface IngestDocument {
  namespace: string
  sourceId: string
  parts: IngestPart[]
  content: string
  title?: string
  metadata?: Record<string, unknown>
  warnings?: IngestWarning[]
}

content is derived when omitted. If you write a custom loader, you can provide parts, content, or both. Providing parts is preferred because provenance survives into indexing metadata.
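
This page does not spell out how content is derived from parts. As a rough mental model only, the sketch below joins each part's own content with blank lines in document order. The `deriveContent` and `withContent` helpers, the simplified `Part` type, and the blank-line separator are assumptions, not the package's actual behavior.

```typescript
// Simplified stand-in for IngestPart: only the fields this sketch needs.
type Part =
  | { kind: 'text'; id: string; content: string }
  | { kind: 'page'; id: string; pageNumber: number; content: string }
  | { kind: 'table'; id: string; rows: string[][]; content: string }

interface DocInput {
  namespace: string
  sourceId: string
  parts: Part[]
  content?: string
}

// Join non-empty part contents with blank lines, preserving part order.
function deriveContent(parts: Part[]): string {
  return parts
    .map((part) => part.content.trim())
    .filter((text) => text.length > 0)
    .join('\n\n')
}

// Keep caller-provided content; derive it from parts only when omitted.
function withContent(doc: DocInput): DocInput & { content: string } {
  return { ...doc, content: doc.content ?? deriveContent(doc.parts) }
}

const doc = withContent({
  namespace: 'docs',
  sourceId: 'readme',
  parts: [
    { kind: 'text', id: 'p1', content: '# Hello' },
    { kind: 'page', id: 'p2', pageNumber: 1, content: 'This is the readme.' },
  ],
})
console.log(JSON.stringify(doc.content)) // "# Hello\n\nThis is the readme."
```

The point of the model is that the parts stay available alongside the flattened string, so a custom chunker or citation UI can still see where each piece of text came from.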
SourceLoader
type IngestLoadResult =
  | { ok: true; document: IngestDocument }
  | { ok: false; namespace: string; sourceId: string; error: IngestError; metadata?: Record<string, unknown> }

interface SourceLoader {
  load(): AsyncIterable<IngestLoadResult>
  documents(): AsyncIterable<IngestDocument>
}

Use load() for real sync jobs. It reports source-level failures as values, which lets corpus.sync() keep going and record failed sources in the ledger. Use documents() for scripts and tests where the first bad source should throw.
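
The difference between the two methods can be illustrated by consuming the same result stream in two ways. This is a sketch under assumptions: `failFast` and `partition` are hypothetical helpers, the `Doc` and `LoadResult` types are trimmed local stand-ins, and a plain `Error` replaces `IngestError`. It is not how the package implements documents() or corpus.sync().

```typescript
type Doc = { namespace: string; sourceId: string; content: string }
type LoadResult =
  | { ok: true; document: Doc }
  | { ok: false; namespace: string; sourceId: string; error: Error }

// Fail-fast view (documents()-style): yield documents until the first
// failure, then throw.
async function* failFast(results: AsyncIterable<LoadResult>): AsyncGenerator<Doc> {
  for await (const result of results) {
    if (!result.ok) {
      throw new Error(`failed to load ${result.namespace}/${result.sourceId}: ${result.error.message}`)
    }
    yield result.document
  }
}

// Tolerant view (load()-style): collect every outcome so one bad source
// does not stop the run.
async function partition(results: AsyncIterable<LoadResult>) {
  const documents: Doc[] = []
  const failures: { sourceId: string; message: string }[] = []
  for await (const result of results) {
    if (result.ok) documents.push(result.document)
    else failures.push({ sourceId: result.sourceId, message: result.error.message })
  }
  return { documents, failures }
}
```

A sync job resembles `partition`: it can record each failure and continue. A test resembles `failFast`: the first bad source surfaces immediately as an exception.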
textSource(input) — inline documents
For when you already have the text in hand:
import { textSource } from '@crux/ingest'

const single = textSource({
  namespace: 'docs',
  sourceId: 'readme',
  content: '# Hello\nThis is the readme.',
  title: 'README',
})

const many = textSource([
  { namespace: 'docs', sourceId: 'a', content: '...' },
  { namespace: 'docs', sourceId: 'b', content: '...' },
])

// Product sync path.
await docsCorpus.sync(single.load(), { sourceSet: 'partial' })

// Fail-fast script path.
for await (const document of single.documents()) {
  console.log(document.content)
}

Throws if namespace or sourceId is empty.
Parser contract
type IngestParser = {
  name: string
  formats: IngestFormat[]
  parse(input: ParseInput, ctx: ParseContext): Promise<ParseResult> | ParseResult
}

Built-in parsers cover txt, md, html, pdf, csv, json, docx, and xlsx. Pass parsers to a source to override a built-in parser or add a source-local parser. Custom parsers win over built-ins for matching formats.
const loader = fileSource('./docs/manual.acme', {
  namespace: 'docs',
  parsers: [
    {
      name: 'acme',
      formats: ['unknown'],
      parse: ({ text }) => ({
        parts: [{ id: 'acme:1', kind: 'text', role: 'paragraph', content: text ?? '' }],
      }),
    },
  ],
})

PDF parsing supports an OCR hook. Crux does not ship an OCR provider, but image-only pages can call your hook and mark the page part with ocr: true.
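
The precedence rule stated in this section (custom parsers win over built-ins for matching formats) can be pictured as an ordered lookup. The `resolveParser` helper and the trimmed `ParserEntry` shape below are hypothetical. The sketch only illustrates the rule and makes no claim about how the package resolves parsers internally.

```typescript
// Minimal stand-in for the parser shape; the real contract also has parse().
interface ParserEntry {
  name: string
  formats: string[]
}

// Search custom parsers before built-ins so a custom parser shadows a
// built-in one that claims the same format.
function resolveParser(
  format: string,
  custom: ParserEntry[],
  builtins: ParserEntry[],
): ParserEntry | undefined {
  const ordered = [...custom, ...builtins]
  return ordered.find((parser) => parser.formats.includes(format))
}

const builtinParsers: ParserEntry[] = [
  { name: 'text', formats: ['txt', 'unknown'] },
  { name: 'markdown', formats: ['md'] },
]
const customParsers: ParserEntry[] = [{ name: 'acme', formats: ['unknown'] }]

console.log(resolveParser('unknown', customParsers, builtinParsers)?.name) // acme
console.log(resolveParser('md', customParsers, builtinParsers)?.name) // markdown
console.log(resolveParser('unknown', [], builtinParsers)?.name) // text
```

This matches the examples on this page, where a parser registered for the `unknown` format takes over from the plain-text fallback for files such as `manual.acme`.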
Pattern: pipe into a corpus
import { corpus, indexer } from '@crux/core/indexing'
import { filesSource } from '@crux/ingest/files'

// `store` and `embedding` are assumed to be created elsewhere in your app.
const docsIndexer = indexer({
  id: 'docs',
  namespace: 'product-docs',
  store,
  dense: embedding,
})

const docsCorpus = corpus({
  id: 'docs',
  namespace: 'product-docs',
  store,
  indexer: docsIndexer,
})

const source = filesSource({ directory: './docs', recursive: true }, { namespace: 'product-docs' })

await docsCorpus.sync(source.load(), {
  sourceSet: 'complete',
  stale: 'delete',
})

Use indexer.indexDocuments(source.load()) for one-off writes and tests. Use corpus.sync() for repeated ingestion jobs.
Observability
Each parser run emits ingest:parse:start and ingest:parse:end through devtools, the CLI/TUI, and @crux/otel. OTel exports a crux.ingest.parse span with parser, format, byte length, part count, warning count, and privacy-safe hashed namespace/source identifiers.
What @crux/ingest is not
It is not a crawler framework, connector marketplace, or full document-intelligence platform. Use it to normalize source material into structured documents, then let corpus sync and retrieval own the rest.