File Loaders
fileSource, filesSource — load IngestDocuments from individual files, globs, or directories.
import { fileSource, filesSource } from '@crux/ingest/files'
import type { FileSourceOptions, FilesSourceOptions, FilesDirectoryInput, FilesGlobInput } from '@crux/ingest/files'
Supported file types
fileSource() and filesSource() use the same bundled parser index as the rest of @crux/ingest:
| Extension | Format | Notes |
|---|---|---|
| .txt | Plain text | Loaded verbatim |
| .md, .markdown | Markdown | Extracts headings, paragraphs, code/list text, and GFM tables |
| .html, .htm | HTML | Extracts title, text blocks, headings, lists, code, and tables |
| .pdf | PDF | Extracts page parts; image-only pages can use an OCR hook |
| .csv | CSV | Extracts a table part |
| .json | JSON | Extracts JSON path parts |
| .docx | DOCX | Extracts text and tables through Mammoth |
| .xlsx | XLSX | Extracts sheet and table parts through ExcelJS |
The file extension determines the parser. For other formats, pre-extract to text, provide a custom parser, or use textSource().
For parser output details, see the @crux/ingest overview.
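Because the parser is chosen from the extension alone, it can help to check a list of paths before handing it to filesSource(). The sketch below is not part of @crux/ingest: SUPPORTED_EXTENSIONS and partitionBySupport are hypothetical names, the extension list is copied by hand from the table above, and lowercasing the extension is an assumption rather than documented behavior.

```typescript
import { extname } from 'node:path'

// Extensions from the table above, copied by hand for illustration.
// This set is NOT exported by @crux/ingest.
const SUPPORTED_EXTENSIONS = new Set([
  '.txt', '.md', '.markdown', '.html', '.htm',
  '.pdf', '.csv', '.json', '.docx', '.xlsx',
])

// Hypothetical helper: split paths into those with a listed extension and
// those that would need pre-extraction, a custom parser, or textSource().
function partitionBySupport(paths: string[]): { supported: string[]; unsupported: string[] } {
  const supported: string[] = []
  const unsupported: string[] = []
  for (const p of paths) {
    // Assumption: extensions are compared case-insensitively.
    const ext = extname(p).toLowerCase()
    if (SUPPORTED_EXTENSIONS.has(ext)) {
      supported.push(p)
    } else {
      unsupported.push(p)
    }
  }
  return { supported, unsupported }
}

const checked = partitionBySupport(['docs/intro.md', 'notes.rtf', 'data.csv'])
console.log(checked.supported)   // paths whose extension appears in the table
console.log(checked.unsupported) // candidates for pre-extraction or a custom parser
```

Running a check like this up front makes unsupported files visible before a sync starts, instead of discovering them in the sync results.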
fileSource(path, options)
Load a single file.
fileSource('./docs/intro.md', {
  namespace: 'product-docs',
  // sourceId?: string          defaults to the absolute path
  // parsers?: IngestParser[]   source-local parser overrides
  // ocr?: OcrHook              optional PDF OCR hook
})
Returns a SourceLoader that yields exactly one IngestDocument.
| Option | Type | Description |
|---|---|---|
| namespace | string | Required, non-empty. Used as the indexer namespace prefix. |
| sourceId | string? | Optional. Defaults to the file's absolute path. |
| parsers | IngestParser[]? | Optional parser overrides. Custom parsers beat built-ins for matching formats. |
| ocr | OcrHook? | Optional hook for PDF pages without extractable text. |
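Since sourceId defaults to the absolute path, the same file gets a different ID when the project is checked out in a different location. If that matters for your setup, one option is to pass an explicit sourceId relative to a project root. A minimal sketch, assuming a hypothetical helper named relativeSourceId (not part of @crux/ingest):

```typescript
import { relative, resolve, sep } from 'node:path'

// Hypothetical helper: derive a sourceId relative to a project root, using
// forward slashes so the ID does not depend on the platform's separator.
function relativeSourceId(root: string, file: string): string {
  const rel = relative(resolve(root), resolve(file))
  return rel.split(sep).join('/')
}

const sourceId = relativeSourceId('/repo', '/repo/docs/intro.md')
console.log(sourceId) // docs/intro.md

// The value could then be passed as the sourceId option, for example:
// fileSource('./docs/intro.md', { namespace: 'product-docs', sourceId })
```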
Document metadata
Each loaded file produces:
{
  namespace: '...',
  sourceId: '/abs/path/to/file.md',
  content: '...',
  title: 'file.md', // basename
  metadata: {
    sourcePath: '/abs/path/to/file.md',
    format: 'md',
    parser: 'markdown',
  },
  parts: [
    { id: 'md:text:1', kind: 'text', role: 'heading', content: 'Intro' },
    { id: 'md:text:2', kind: 'text', role: 'paragraph', content: '...' },
  ],
}
For HTML, title is overridden with the extracted <title> if present. For PDFs, title is overridden with the extracted document title when metadata exposes one.
content is derived from parts, so the default indexer still chunks text. The parts remain available for custom chunkers and UI provenance.
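As an illustration of what a custom chunker could do with parts, the sketch below groups each heading with the text parts that follow it and keeps the part IDs for provenance. The Part type mirrors only the fields shown in the example document above and may be incomplete; Section and groupByHeading are hypothetical names, not part of @crux/ingest.

```typescript
// Local type covering only the fields shown in the example document.
interface Part {
  id: string
  kind: string
  role?: string
  content: string
}

// Hypothetical output shape for this sketch.
interface Section {
  heading: string | null // null for text that precedes the first heading
  partIds: string[]      // IDs of every part in the section, for provenance
  text: string           // non-heading content joined by blank lines
}

function groupByHeading(parts: Part[]): Section[] {
  const sections: Section[] = []
  let current: Section | null = null
  for (const part of parts) {
    // Start a new section at each heading, or at the very first part.
    if (part.role === 'heading' || current === null) {
      current = {
        heading: part.role === 'heading' ? part.content : null,
        partIds: [],
        text: '',
      }
      sections.push(current)
    }
    current.partIds.push(part.id)
    if (part.role !== 'heading') {
      current.text += (current.text ? '\n\n' : '') + part.content
    }
  }
  return sections
}

const sections = groupByHeading([
  { id: 'md:text:1', kind: 'text', role: 'heading', content: 'Intro' },
  { id: 'md:text:2', kind: 'text', role: 'paragraph', content: 'First paragraph.' },
])
console.log(sections)
```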
filesSource(input, options)
Load multiple files. The input parameter accepts three shapes:
Array of paths
filesSource(['./a.md', './b.md', './c.md'], { namespace: 'docs' })
Directory
filesSource({ directory: './docs', recursive: true }, { namespace: 'docs' })
| Field | Type | Default | Description |
|---|---|---|---|
| directory | string | required | Directory path |
| recursive | boolean? | true | Walk subdirectories |
Glob
filesSource({ glob: '**/*.md', cwd: './docs' }, { namespace: 'docs' })
| Field | Type | Default | Description |
|---|---|---|---|
| glob | string \| string[] | required | Glob pattern(s); uses Node's matchesGlob |
| cwd | string? | process.cwd() | Base directory for the glob |
Example
import { corpus, indexer } from '@crux/core/indexing'
import { filesSource } from '@crux/ingest/files'

// store and embedding are assumed to be configured elsewhere.
const docsIndexer = indexer({
  id: 'docs',
  namespace: 'product-docs',
  store,
  dense: embedding,
})

const docsCorpus = corpus({
  id: 'docs',
  namespace: 'product-docs',
  store,
  indexer: docsIndexer,
})

const source = filesSource({ glob: 'docs/**/*.{md,html}', cwd: process.cwd() }, { namespace: 'product-docs' })

const result = await docsCorpus.sync(source.load(), {
  sourceSet: 'complete',
  stale: 'delete',
})
Use source.load() for corpus sync because failed files are returned as failed source results. Use source.documents() for fail-fast scripts:
for await (const document of source.documents()) {
  console.log(document.sourceId, document.parts.length)
}
Related
- Reference: @crux/ingest overview
- Reference: URLs — load HTTP-fetched documents
- Reference: Indexing (indexer)