File Loaders

fileSource and filesSource load IngestDocuments from individual files, globs, or directories.

import { fileSource, filesSource } from '@crux/ingest/files'
import type { FileSourceOptions, FilesSourceOptions, FilesDirectoryInput, FilesGlobInput } from '@crux/ingest/files'

Supported file types

fileSource() and filesSource() use the same bundled parser index as the rest of @crux/ingest:

| Extension | Format | Notes |
| --- | --- | --- |
| .txt | Plain text | Loaded verbatim |
| .md, .markdown | Markdown | Extracts headings, paragraphs, code/list text, and GFM tables |
| .html, .htm | HTML | Extracts title, text blocks, headings, lists, code, and tables |
| .pdf | PDF | Extracts page parts; image-only pages can use an OCR hook |
| .csv | CSV | Extracts a table part |
| .json | JSON | Extracts JSON path parts |
| .docx | DOCX | Extracts text and tables through Mammoth |
| .xlsx | XLSX | Extracts sheet and table parts through ExcelJS |

The file extension determines the parser. For other formats, pre-extract to text, provide a custom parser, or use textSource().
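To illustrate extension-based dispatch, here is a minimal sketch that mirrors the table above. It is not the library's parser index: `FORMAT_BY_EXTENSION` and `formatForPath` are hypothetical names, and the case-insensitive match is an assumption this page does not confirm.

```typescript
// Illustrative only: copies the supported-types table, not the real parser index.
const FORMAT_BY_EXTENSION: Record<string, string> = {
  '.txt': 'Plain text',
  '.md': 'Markdown',
  '.markdown': 'Markdown',
  '.html': 'HTML',
  '.htm': 'HTML',
  '.pdf': 'PDF',
  '.csv': 'CSV',
  '.json': 'JSON',
  '.docx': 'DOCX',
  '.xlsx': 'XLSX',
}

// Hypothetical helper: returns the format for a path, or undefined when the
// extension is not in the table (the case where a custom parser is needed).
function formatForPath(filePath: string): string | undefined {
  // Take the basename, accepting both POSIX and Windows separators.
  const lastSeparator = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'))
  const base = filePath.slice(lastSeparator + 1)
  const dot = base.lastIndexOf('.')
  if (dot <= 0) return undefined // no extension, or a dotfile such as ".env"
  // Assumption: extensions are matched case-insensitively.
  return FORMAT_BY_EXTENSION[base.slice(dot).toLowerCase()]
}
```

A path such as `notes.rtf` yields `undefined` here, which corresponds to the formats that need pre-extraction, a custom parser, or textSource().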

For parser output details, see the @crux/ingest overview.

fileSource(path, options)

Load a single file.

fileSource('./docs/intro.md', {
  namespace: 'product-docs',   // required, non-empty
  // Optional fields (types shown, not values):
  //   sourceId?: string          defaults to the absolute path
  //   parsers?: IngestParser[]   source-local parser overrides
  //   ocr?: OcrHook              optional PDF OCR hook
})

Returns a SourceLoader that yields exactly one IngestDocument.

| Option | Type | Description |
| --- | --- | --- |
| namespace | string | Required, non-empty. Used as the indexer namespace prefix. |
| sourceId | string? | Optional. Defaults to the file's absolute path. |
| parsers | IngestParser[]? | Optional parser overrides. Custom parsers beat built-ins for matching formats. |
| ocr | OcrHook? | Optional hook for PDF pages without extractable text. |
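The namespace and sourceId rules above can be sketched as a small resolver. This is an illustration under stated assumptions, not the library's code: `IdentityOptionsSketch` and `resolveIdentity` are hypothetical names, the caller is assumed to have resolved the absolute path already, and whether the real loader throws on an empty namespace is not stated on this page.

```typescript
// Hypothetical subset of the documented options.
interface IdentityOptionsSketch {
  namespace: string
  sourceId?: string
}

// Applies the documented rules: namespace must be non-empty,
// and sourceId falls back to the file's absolute path.
function resolveIdentity(
  absolutePath: string,
  options: IdentityOptionsSketch,
): { namespace: string; sourceId: string } {
  if (options.namespace.length === 0) {
    // Assumption: an empty namespace is rejected with an error.
    throw new Error('namespace must be a non-empty string')
  }
  return {
    namespace: options.namespace,
    sourceId: options.sourceId ?? absolutePath,
  }
}
```

An explicit sourceId is useful when the absolute path is not stable across machines, since the default would otherwise differ per checkout location.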

Document metadata

Each loaded file produces:

{
  namespace: '...',
  sourceId: '/abs/path/to/file.md',
  content: '...',
  title: 'file.md',                  // basename
  metadata: {
    sourcePath: '/abs/path/to/file.md',
    format: 'md',
    parser: 'markdown',
  },
  parts: [
    { id: 'md:text:1', kind: 'text', role: 'heading', content: 'Intro' },
    { id: 'md:text:2', kind: 'text', role: 'paragraph', content: '...' },
  ],
}

For HTML, title is overridden with the extracted <title> if present. For PDFs, title is overridden with the extracted document title when metadata exposes one.

content is derived from parts, so the default indexer still chunks text. The parts remain available for custom chunkers and UI provenance.
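As an example of what a custom chunker could do with parts, the sketch below groups text parts under the heading that precedes them. The part fields follow the sample document above; `PartSketch`, `ChunkSketch`, and `chunkByHeading` are hypothetical names, and joining text with blank lines is an assumption rather than the library's behavior.

```typescript
// Shapes modeled on the sample document above; not the library's exported types.
interface PartSketch {
  id: string
  kind: string
  role?: string
  content: string
}

interface ChunkSketch {
  heading: string | null // null for text that appears before any heading
  partIds: string[]      // kept so a UI can trace a chunk back to its parts
  text: string
}

// Starts a new chunk at every heading and appends following text parts to it.
function chunkByHeading(parts: PartSketch[]): ChunkSketch[] {
  const chunks: ChunkSketch[] = []
  for (const part of parts) {
    if (part.kind !== 'text') continue // tables and other kinds are skipped in this sketch
    const last = chunks[chunks.length - 1]
    if (part.role === 'heading' || last === undefined) {
      chunks.push({
        heading: part.role === 'heading' ? part.content : null,
        partIds: [part.id],
        text: part.content,
      })
    } else {
      last.partIds.push(part.id)
      last.text += '\n\n' + part.content
    }
  }
  return chunks
}

const sampleParts: PartSketch[] = [
  { id: 'md:text:1', kind: 'text', role: 'heading', content: 'Intro' },
  { id: 'md:text:2', kind: 'text', role: 'paragraph', content: 'Welcome.' },
  { id: 'md:text:3', kind: 'text', role: 'heading', content: 'Setup' },
  { id: 'md:text:4', kind: 'text', role: 'paragraph', content: 'Install it.' },
]
const sampleChunks = chunkByHeading(sampleParts)
```

Keeping the part ids on each chunk is what makes the provenance use case possible: a search hit can point back to the exact parts it came from.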

filesSource(input, options)

Load multiple files. The input parameter accepts three shapes:

Array of paths

filesSource(['./a.md', './b.md', './c.md'], { namespace: 'docs' })

Directory

filesSource({ directory: './docs', recursive: true }, { namespace: 'docs' })

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| directory | string | required | Directory path |
| recursive | boolean? | true | Walk subdirectories |

Glob

filesSource({ glob: '**/*.md', cwd: './docs' }, { namespace: 'docs' })

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| glob | string \| string[] | required | Glob pattern(s); matched with Node's matchesGlob |
| cwd | string? | process.cwd() | Base directory for the glob |
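The three input shapes and their defaults can be summarized as a discriminated union. This is a sketch for orientation only: `FilesInputSketch`, `NormalizedInputSketch`, and `normalizeInput` are hypothetical and stand in for the real FilesDirectoryInput and FilesGlobInput types, whose exact definitions are not shown on this page. The default working directory is passed in as a parameter so the example stays self-contained.

```typescript
// Local stand-ins for the documented input shapes.
type FilesInputSketch =
  | string[]
  | { directory: string; recursive?: boolean }
  | { glob: string | string[]; cwd?: string }

type NormalizedInputSketch =
  | { kind: 'paths'; paths: string[] }
  | { kind: 'directory'; directory: string; recursive: boolean }
  | { kind: 'glob'; patterns: string[]; cwd: string }

// Tags each shape and fills in the documented defaults:
// recursive defaults to true, cwd defaults to the process working directory.
function normalizeInput(input: FilesInputSketch, defaultCwd: string): NormalizedInputSketch {
  if (Array.isArray(input)) {
    return { kind: 'paths', paths: input }
  }
  if ('directory' in input) {
    return { kind: 'directory', directory: input.directory, recursive: input.recursive ?? true }
  }
  // A single pattern is wrapped so callers always receive an array.
  const patterns = Array.isArray(input.glob) ? input.glob : [input.glob]
  return { kind: 'glob', patterns, cwd: input.cwd ?? defaultCwd }
}
```

In real use, `defaultCwd` would be `process.cwd()`, matching the default listed in the glob table.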

Example

import { corpus, indexer } from '@crux/core/indexing'
import { filesSource } from '@crux/ingest/files'

// `store` and `embedding` are assumed to be configured elsewhere in your application.
const docsIndexer = indexer({
  id: 'docs',
  namespace: 'product-docs',
  store,
  dense: embedding,
})

const docsCorpus = corpus({
  id: 'docs',
  namespace: 'product-docs',
  store,
  indexer: docsIndexer,
})

const source = filesSource({ glob: 'docs/**/*.{md,html}', cwd: process.cwd() }, { namespace: 'product-docs' })

const result = await docsCorpus.sync(source.load(), {
  sourceSet: 'complete',
  stale: 'delete',
})

Use source.load() for corpus sync because failed files are returned as failed source results. Use source.documents() for fail-fast scripts:

for await (const document of source.documents()) {
  console.log(document.sourceId, document.parts.length)
}
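The difference between the two iteration styles can be sketched with plain async generators. Everything here is hypothetical: the real result shape returned by source.load() is not documented on this page, so `LoadResultSketch`, `loadSketch`, `documentsSketch`, and the injected `read` function only model the tolerant versus fail-fast behavior described above.

```typescript
// Hypothetical result shape; the real one is not shown on this page.
type LoadResultSketch =
  | { ok: true; sourceId: string; content: string }
  | { ok: false; sourceId: string; error: Error }

type ReadFn = (sourceId: string) => Promise<string>

// Tolerant style: every path yields a result, and failures are reported, not thrown.
async function* loadSketch(paths: string[], read: ReadFn): AsyncGenerator<LoadResultSketch> {
  for (const sourceId of paths) {
    let result: LoadResultSketch
    try {
      result = { ok: true, sourceId, content: await read(sourceId) }
    } catch (cause) {
      const error = cause instanceof Error ? cause : new Error(String(cause))
      result = { ok: false, sourceId, error }
    }
    yield result
  }
}

// Fail-fast style: the first failure rejects and ends the iteration.
async function* documentsSketch(
  paths: string[],
  read: ReadFn,
): AsyncGenerator<{ sourceId: string; content: string }> {
  for (const sourceId of paths) {
    yield { sourceId, content: await read(sourceId) }
  }
}
```

With the tolerant style, a sync can still process the files that loaded and record the ones that did not. With the fail-fast style, a script stops at the first unreadable file.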
