File Loaders

fileSource, filesSource — load IngestDocuments from individual files, globs, or directories.

import { fileSource, filesSource } from '@crux/ingest/files'
import type { FileSourceOptions, FilesSourceOptions, FilesDirectoryInput, FilesGlobInput } from '@crux/ingest/files'

Supported file types

fileSource() and filesSource() use the same bundled parser index as the rest of @crux/ingest:

Extension	Format	Notes
`.txt`	Plain text	Loaded verbatim
`.md`, `.markdown`	Markdown	Extracts headings, paragraphs, code/list text, and GFM tables
`.html`, `.htm`	HTML	Extracts title, text blocks, headings, lists, code, and tables
`.pdf`	PDF	Extracts page parts; image-only pages can use an OCR hook
`.csv`	CSV	Extracts a table part
`.json`	JSON	Extracts JSON path parts
`.docx`	DOCX	Extracts text and tables through Mammoth
`.xlsx`	XLSX	Extracts sheet and table parts through ExcelJS

The file extension determines the parser. For other formats, pre-extract to text, provide a custom parser, or use textSource().
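As a rough illustration of extension-based dispatch, the sketch below maps the extensions from the table to a parser label. It is not the library's implementation: `PARSER_BY_EXTENSION`, `parserForPath`, and the label strings are hypothetical names, and case-insensitive matching is an assumption.

```typescript
import * as path from 'node:path'

// Hypothetical parser labels; only the extensions come from the table above.
const PARSER_BY_EXTENSION: Record<string, string> = {
  '.txt': 'text',
  '.md': 'markdown',
  '.markdown': 'markdown',
  '.html': 'html',
  '.htm': 'html',
  '.pdf': 'pdf',
  '.csv': 'csv',
  '.json': 'json',
  '.docx': 'docx',
  '.xlsx': 'xlsx',
}

// Returns the parser label for a path, or undefined for unsupported extensions.
function parserForPath(filePath: string): string | undefined {
  // Assumption: matching ignores case; the page does not say either way.
  return PARSER_BY_EXTENSION[path.extname(filePath).toLowerCase()]
}
```

A path with an unsupported extension comes back as `undefined` here, which is the case where pre-extracting to text or supplying a custom parser would apply.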

For parser output details, see the @crux/ingest overview.

`fileSource(path, options)`

Load a single file.

fileSource('./docs/intro.md', {
  namespace: 'product-docs',  // required, non-empty
  // sourceId: string         // optional; defaults to the absolute path
  // parsers: IngestParser[]  // optional; source-local parser overrides
  // ocr: OcrHook             // optional; PDF OCR hook
})

Returns a SourceLoader that yields exactly one IngestDocument.

Option	Type	Description
`namespace`	`string`	Required, non-empty. Used as the indexer namespace prefix.
`sourceId`	`string?`	Optional. Defaults to the file's absolute path.
`parsers`	`IngestParser[]?`	Optional parser overrides. Custom parsers beat built-ins for matching formats.
`ocr`	`OcrHook?`	Optional hook for PDF pages without extractable text.
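The `namespace` and `sourceId` rules in the table can be sketched as a small normalization step. This is a minimal illustration under stated assumptions, not the library's code: `SketchOptions` and `resolveOptions` are hypothetical names, and the error message is invented.

```typescript
import * as path from 'node:path'

// Hypothetical, trimmed-down option shape for illustration only.
interface SketchOptions {
  namespace: string
  sourceId?: string
}

// Applies the documented rules: namespace must be non-empty,
// and sourceId falls back to the file's absolute path.
function resolveOptions(filePath: string, options: SketchOptions): Required<SketchOptions> {
  if (options.namespace.length === 0) {
    throw new Error('namespace must be a non-empty string')
  }
  return {
    namespace: options.namespace,
    sourceId: options.sourceId ?? path.resolve(filePath),
  }
}
```

Passing an explicit `sourceId` may be useful when the absolute path differs between machines, since the default would then change from one environment to another.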

Document metadata

Each loaded file produces:

{
  namespace: '...',
  sourceId: '/abs/path/to/file.md',
  content: '...',
  title: 'file.md',                  // basename
  metadata: {
    sourcePath: '/abs/path/to/file.md',
    format: 'md',
    parser: 'markdown',
  },
  parts: [
    { id: 'md:text:1', kind: 'text', role: 'heading', content: 'Intro' },
    { id: 'md:text:2', kind: 'text', role: 'paragraph', content: '...' },
  ],
}

For HTML, title is overridden with the extracted <title> if present. For PDFs, title is overridden with the extracted document title when metadata exposes one.

content is derived from parts, so the default indexer still chunks text. The parts remain available for custom chunkers and UI provenance.
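One way to picture the relationship between `parts` and `content` is a simple join over the part contents. The sketch below is an assumption about the general idea only: `SketchPart` and `contentFromParts` are hypothetical, and the blank-line separator is a guess rather than documented behavior.

```typescript
// Hypothetical part shape, reduced to the fields shown in the example above.
interface SketchPart {
  id: string
  kind: string
  role?: string
  content: string
}

// Joins non-empty part contents with blank lines to form a single text body.
// Assumption: the real separator and filtering rules may differ.
function contentFromParts(parts: SketchPart[]): string {
  return parts
    .map((part) => part.content)
    .filter((text) => text.length > 0)
    .join('\n\n')
}

const exampleParts: SketchPart[] = [
  { id: 'md:text:1', kind: 'text', role: 'heading', content: 'Intro' },
  { id: 'md:text:2', kind: 'text', role: 'paragraph', content: 'Welcome to the docs.' },
]

const exampleContent = contentFromParts(exampleParts)
```

A custom chunker could work from the parts directly instead, for example to keep a heading together with the paragraph that follows it.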

`filesSource(input, options)`

Load multiple files. The input parameter accepts three shapes:

Array of paths

filesSource(['./a.md', './b.md', './c.md'], { namespace: 'docs' })

Directory

filesSource({ directory: './docs', recursive: true }, { namespace: 'docs' })

Field	Type	Default	Description
`directory`	`string`	required	Directory path
`recursive`	`boolean?`	`true`	Walk subdirectories
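The effect of the `recursive` flag can be sketched with a plain directory walk. This is an illustration built on Node's `fs.readdirSync`, not the library's implementation; `listFiles` is a hypothetical helper, and whether the loader filters by extension during the walk is not covered here.

```typescript
import * as fs from 'node:fs'
import * as path from 'node:path'

// Collects file paths under `directory`.
// Descends into subdirectories only when `recursive` is true.
function listFiles(directory: string, recursive = true): string[] {
  const found: string[] = []
  for (const entry of fs.readdirSync(directory, { withFileTypes: true })) {
    const fullPath = path.join(directory, entry.name)
    if (entry.isDirectory()) {
      if (recursive) found.push(...listFiles(fullPath, true))
    } else if (entry.isFile()) {
      found.push(fullPath)
    }
  }
  // Sort for a stable order regardless of filesystem listing order.
  return found.sort()
}
```

With `recursive` set to `false`, only files directly inside the directory are returned.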

Glob

filesSource({ glob: '**/*.md', cwd: './docs' }, { namespace: 'docs' })

Field	Type	Default	Description
`glob`	`string \| string[]`	required	Glob pattern(s), matched with Node's `path.matchesGlob`
`cwd`	`string?`	`process.cwd()`	Base directory for the glob

Example

import { corpus, indexer } from '@crux/core/indexing'
import { filesSource } from '@crux/ingest/files'

// `store` and `embedding` are assumed to be configured elsewhere in your application.
const docsIndexer = indexer({
  id: 'docs',
  namespace: 'product-docs',
  store,
  dense: embedding,
})

const docsCorpus = corpus({
  id: 'docs',
  namespace: 'product-docs',
  store,
  indexer: docsIndexer,
})

const source = filesSource({ glob: 'docs/**/*.{md,html}', cwd: process.cwd() }, { namespace: 'product-docs' })

const result = await docsCorpus.sync(source.load(), {
  sourceSet: 'complete',
  stale: 'delete',
})

Use source.load() for corpus sync: a file that fails to load is returned as a failed source result, so the sync can account for it alongside the files that succeeded. Use source.documents() for fail-fast scripts:

for await (const document of source.documents()) {
  console.log(document.sourceId, document.parts.length)
}
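The difference between the two iteration styles can be sketched with plain async generators. Everything below is hypothetical and independent of the library: `SketchResult`, `readOne`, `loadAll`, and `documentsOf` are invented names, and the actual shape of a failed source result is not shown on this page.

```typescript
// Hypothetical result shape, for illustration only.
type SketchResult =
  | { ok: true; sourceId: string; content: string }
  | { ok: false; sourceId: string; error: Error }

// Hypothetical stand-in for reading and parsing one file.
async function readOne(sourceId: string): Promise<string> {
  if (sourceId.endsWith('.bad')) throw new Error(`cannot parse ${sourceId}`)
  return `content of ${sourceId}`
}

// load()-style: every input yields a result, failures included.
async function* loadAll(ids: string[]): AsyncGenerator<SketchResult> {
  for (const sourceId of ids) {
    let result: SketchResult
    try {
      result = { ok: true, sourceId, content: await readOne(sourceId) }
    } catch (error) {
      result = { ok: false, sourceId, error: error as Error }
    }
    yield result
  }
}

// documents()-style: the first failure propagates and ends iteration.
async function* documentsOf(
  ids: string[],
): AsyncGenerator<{ sourceId: string; content: string }> {
  for (const sourceId of ids) {
    yield { sourceId, content: await readOne(sourceId) }
  }
}
```

In this sketch, a consumer of `loadAll` sees one entry per input and can decide what to do with failures, while a consumer of `documentsOf` stops at the first file that cannot be read.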

Reference: @crux/ingest overview
Reference: URLs — load HTTP-fetched documents
Reference: Indexing — indexer
