
URL Loaders

urlSource, urlsSource — fetch HTTP(S) URLs and load them as IngestDocuments with HTML extraction.

import { urlSource, urlsSource } from '@crux/ingest/urls'
import type { UrlSourceOptions, UrlsSourceOptions } from '@crux/ingest/urls'

What it does

Fetches the URL with fetch, inspects the Content-Type header, and:

  • HTML (text/html, or content that looks like HTML) → extracts title, text blocks, headings, lists, code, and tables
  • Plain text / Markdown / CSV / JSON → parses into structured parts
  • PDF (application/pdf, .pdf, or %PDF bytes) → extracts page parts, with optional OCR hook support
  • DOCX / XLSX (.docx, .xlsx) → parses remote Office files when the URL extension identifies the format

For the full parser inventory and output shapes, see the @crux/ingest overview.
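
The dispatch described above can be pictured with a small sketch. This is not the library's implementation: the function name detectFormat, the Format type, and the order of the checks are all assumptions made for illustration. It only shows how a Content-Type header, a URL extension, and leading bytes might be combined.

```typescript
// Illustrative only: names and check order are assumptions, not @crux/ingest internals.
type Format = 'html' | 'txt' | 'md' | 'csv' | 'json' | 'pdf' | 'docx' | 'xlsx' | 'unknown'

function detectFormat(contentType: string | null, url: string, head: Uint8Array): Format {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const mime = ((contentType ?? '').split(';', 1)[0] ?? '').trim().toLowerCase()
  const path = new URL(url).pathname.toLowerCase()

  // "%PDF" magic bytes: 0x25 0x50 0x44 0x46.
  const hasPdfMagic =
    head.length >= 4 && head[0] === 0x25 && head[1] === 0x50 && head[2] === 0x44 && head[3] === 0x46

  if (mime === 'application/pdf' || path.endsWith('.pdf') || hasPdfMagic) return 'pdf'
  if (path.endsWith('.docx')) return 'docx'
  if (path.endsWith('.xlsx')) return 'xlsx'
  if (mime === 'text/html') return 'html'
  if (mime === 'text/markdown') return 'md'
  if (mime === 'text/csv') return 'csv'
  if (mime === 'application/json') return 'json'
  if (mime === 'text/plain') return 'txt'

  // Fallback: content that looks like HTML even without a text/html header.
  const text = new TextDecoder().decode(head).trimStart().toLowerCase()
  if (text.startsWith('<!doctype html') || text.startsWith('<html')) return 'html'
  return 'unknown'
}
```

A real loader would likely sniff more cases than this; the sketch covers only the signals named in the list above.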

urlSource(url, options)

Load a single URL.

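urlSource('https://example.com/article', {
  namespace: 'web',            // required, non-empty
  // Optional fields, shown with their types:
  // sourceId: string          defaults to the URL
  // fetch: typeof fetch       for testing / custom HTTP clients
  // parsers: IngestParser[]   parser overrides
  // ocr: OcrHook              hook for PDF pages without extractable text
})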

Option      Type              Description
namespace   string            Required, non-empty
sourceId    string?           Defaults to the URL
fetch       typeof fetch?     Inject a custom fetch (e.g. for testing or proxies)
parsers     IngestParser[]?   Optional parser overrides. Custom parsers beat built-ins for matching formats.
ocr         OcrHook?          Optional hook for PDF pages without extractable text.
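
As a sketch of what the fetch option makes possible, the stub below serves canned responses without touching the network. The name stubFetch, the pages map, and the sample HTML are hypothetical, and the code assumes a runtime with the standard fetch and Response globals (for example Node 18+).

```typescript
// Canned pages keyed by URL. The URL and body are made up for illustration.
const pages: Record<string, { body: string; contentType: string }> = {
  'https://example.com/article': {
    body: '<html><head><title>Hello</title></head><body><p>Hi</p></body></html>',
    contentType: 'text/html; charset=utf-8',
  },
}

// A drop-in replacement for fetch that never performs real HTTP requests.
const stubFetch: typeof fetch = async (input) => {
  // fetch accepts a string, a URL, or a Request; normalize to a URL string.
  const url = typeof input === 'string' ? input : input instanceof URL ? input.href : input.url
  const page = pages[url]
  if (!page) return new Response('not found', { status: 404 })
  return new Response(page.body, {
    status: 200,
    headers: { 'content-type': page.contentType },
  })
}
```

Passing it as fetch: stubFetch in the options should let a test exercise the HTML path deterministically, assuming the loader uses only the injected function for HTTP access.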

Document metadata

For HTML responses:

{
  namespace: '...',
  sourceId: 'https://example.com/article',
  content: '...',                 // extracted text
  title: '...',                   // from <title> tag
  metadata: {
    sourceUrl: 'https://example.com/article',
    format: 'html',
    parser: 'html',
  },
  parts: [
    { id: 'html:text:1', kind: 'text', role: 'paragraph', content: '...' },
  ],
}

For non-HTML responses (text/plain, text/markdown, text/csv, application/json):

{
  namespace: '...',
  sourceId: '...',
  content: rawText,
  metadata: {
    sourceUrl: '...',
    format: 'txt' | 'md' | 'csv' | 'json',
    parser: 'text' | 'markdown' | 'csv' | 'json',
  },
}

urlsSource(urls, options)

Load many URLs:

urlsSource(['https://example.com/a', 'https://example.com/b', 'https://example.com/c'], { namespace: 'web' })

Option      Type              Description
namespace   string            Required, non-empty
fetch       typeof fetch?     Custom fetch shared across all URLs
parsers     IngestParser[]?   Parser overrides shared across all URLs
ocr         OcrHook?          OCR hook shared across PDF URLs

URLs are loaded sequentially, one at a time. For high-volume ingestion, manage concurrency yourself above this layer.
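
One way to add concurrency above the loader is a small bounded worker pool. The helper below is a generic sketch, not part of @crux/ingest; the name mapWithConcurrency and its signature are assumptions. It runs at most limit workers at a time and preserves input order in the results.

```typescript
// Generic bounded-concurrency map. Hypothetical helper, not a @crux/ingest export.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let next = 0 // index of the next unclaimed item, shared by all runners

  // Each runner claims items one by one until none are left.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index] as T, index)
    }
  }

  // Start at most `limit` runners (and at least one).
  const runnerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: runnerCount }, () => run()))
  return results
}
```

The worker would typically wrap a per-URL load with urlSource. Whether several sources can safely share one store or corpus at the same time is not stated here, so treat that wiring as something to verify for your setup.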

Errors

load() yields a failed source result for fetch and parse failures instead of throwing. That is the right mode for corpus sync, where one bad URL should not necessarily block the whole job. documents() throws on the first failure, which suits scripts and tests.
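
The difference between the two modes can be modeled without the library. In the sketch below, SourceResult is a hypothetical stand-in for whatever load() yields (the real result type is not shown in this reference and may differ), and collectTolerant and collectStrict are illustrative names.

```typescript
// Hypothetical result shape, for illustration only. The real type yielded by
// load() in @crux/ingest is not documented here and may differ.
type SourceResult =
  | { ok: true; sourceId: string }
  | { ok: false; sourceId: string; error: Error }

// Tolerant mode: record failures and keep going, similar in spirit to load().
async function collectTolerant(results: AsyncIterable<SourceResult>) {
  const loaded: string[] = []
  const failed: string[] = []
  for await (const result of results) {
    if (result.ok) loaded.push(result.sourceId)
    else failed.push(result.sourceId)
  }
  return { loaded, failed }
}

// Fail-fast mode: stop at the first failure, similar in spirit to documents().
async function collectStrict(results: AsyncIterable<SourceResult>): Promise<string[]> {
  const loaded: string[] = []
  for await (const result of results) {
    if (!result.ok) throw result.error
    loaded.push(result.sourceId)
  }
  return loaded
}
```

With the tolerant collector, a failed URL ends up in failed while later URLs are still processed. With the strict one, the first failure aborts the loop.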

Example

import { corpus, indexer } from '@crux/core/indexing'
import { urlsSource } from '@crux/ingest/urls'

// `store` and `embedding` are assumed to be created elsewhere in your app.
const docsIndexer = indexer({
  id: 'web',
  namespace: 'kb',
  store,
  dense: embedding,
})

const docsCorpus = corpus({
  id: 'web',
  namespace: 'kb',
  store,
  indexer: docsIndexer,
})

const source = urlsSource(
  ['https://docs.example.com/intro', 'https://docs.example.com/setup', 'https://docs.example.com/api'],
  { namespace: 'kb' },
)

await docsCorpus.sync(source.load(), {
  sourceSet: 'complete',
  stale: 'delete',
})
