URL Loaders

urlSource, urlsSource — fetch HTTP(S) URLs and load them as IngestDocuments with HTML extraction.

import { urlSource, urlsSource } from '@crux/ingest/urls'
import type { UrlSourceOptions, UrlsSourceOptions } from '@crux/ingest/urls'

What it does

Fetches the URL with fetch, inspects the Content-Type header, and:

HTML (text/html, or content that looks like HTML) → extracts title, text blocks, headings, lists, code, and tables
Plain text / Markdown / CSV / JSON → parses into structured parts
PDF (application/pdf, .pdf, or %PDF bytes) → extracts page parts, with optional OCR hook support
DOCX / XLSX (.docx, .xlsx) → parses remote Office files when the URL extension identifies the format

For the full parser inventory and output shapes, see the @crux/ingest overview.
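The detection rules listed above can be sketched as a small standalone function. This is not the library's implementation: `detectFormat`, the `Format` type, the order of the checks, and the fall-back to plain text are all assumptions made for illustration.

```typescript
// Hypothetical format names; the real library may use a different set.
type Format = 'html' | 'txt' | 'md' | 'csv' | 'json' | 'pdf' | 'docx' | 'xlsx'

// Guess a format from the URL, the Content-Type header, and the first
// characters of the body. The precedence of the checks is an assumption.
function detectFormat(url: string, contentType: string | null, head: string): Format {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mime = ((contentType ?? '').split(';')[0] ?? '').trim().toLowerCase()
  const path = new URL(url).pathname.toLowerCase()

  // PDF: header, extension, or magic bytes.
  if (mime === 'application/pdf' || path.endsWith('.pdf') || head.startsWith('%PDF')) return 'pdf'

  // Office files: identified by URL extension only.
  if (path.endsWith('.docx')) return 'docx'
  if (path.endsWith('.xlsx')) return 'xlsx'

  // HTML: header, or content that looks like HTML.
  if (mime === 'text/html' || /^\s*(<!doctype html|<html)/i.test(head)) return 'html'

  if (mime === 'application/json') return 'json'
  if (mime === 'text/csv') return 'csv'
  if (mime === 'text/markdown') return 'md'

  // Assumed fall-back: treat anything else as plain text.
  return 'txt'
}
```

Checking the extension and leading bytes as well as the header matters because servers often send PDFs as `application/octet-stream`.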

`urlSource(url, options)`

Load a single URL.

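urlSource('https://example.com/article', {
  namespace: 'web',   // required, non-empty string
  // Optional fields, shown with their types:
  // sourceId: string          defaults to the URL
  // fetch: typeof fetch       for testing / custom HTTP clients
  // parsers: IngestParser[]   parser overrides
  // ocr: OcrHook              hook for PDF pages without extractable text
})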

Option	Type	Description
`namespace`	`string`	Required, non-empty
`sourceId`	`string?`	Defaults to the URL
`fetch`	`typeof fetch?`	Inject a custom fetch (e.g. for testing or proxies)
`parsers`	`IngestParser[]?`	Optional parser overrides. Custom parsers beat built-ins for matching formats.
`ocr`	`OcrHook?`	Optional hook for PDF pages without extractable text.
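As a sketch of the `fetch` option in a test, the stub below serves canned HTML without touching the network. It assumes a runtime with the standard `Response` global (for example Node 18+); `pages` and `stubFetch` are hypothetical names, and the commented-out call at the end shows where the stub would be passed in.

```typescript
// Hypothetical canned pages, keyed by URL.
const pages: Record<string, string> = {
  'https://example.com/article':
    '<html><head><title>Hello</title></head><body><p>Hi</p></body></html>',
}

// A fetch-compatible stub: returns the canned page or a 404.
const stubFetch: typeof fetch = async (input) => {
  // fetch accepts a string, a URL, or a Request; normalise to a string.
  const url =
    typeof input === 'string' ? input : input instanceof URL ? input.href : input.url

  const body = pages[url]
  if (body === undefined) {
    return new Response('not found', { status: 404 })
  }
  return new Response(body, {
    status: 200,
    headers: { 'content-type': 'text/html; charset=utf-8' },
  })
}

// Pass the stub through the documented `fetch` option:
// urlSource('https://example.com/article', { namespace: 'web', fetch: stubFetch })
```

Setting the `content-type` header in the stub matters, since the loader inspects it to pick a parser.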

Document metadata

For HTML responses:

{
  namespace: '...',
  sourceId: 'https://example.com/article',
  content: '...',                 // extracted text
  title: '...',                   // from <title> tag
  metadata: {
    sourceUrl: 'https://example.com/article',
    format: 'html',
    parser: 'html',
  },
  parts: [
    { id: 'html:text:1', kind: 'text', role: 'paragraph', content: '...' },
  ],
}

For plain text, Markdown, CSV, and JSON responses (text/plain, text/markdown, text/csv, application/json):

{
  namespace: '...',
  sourceId: '...',
  content: rawText,
  metadata: {
    sourceUrl: '...',
    format: 'txt' | 'md' | 'csv' | 'json',
    parser: 'text' | 'markdown' | 'csv' | 'json',
  },
}

`urlsSource(urls, options)`

Load many URLs:

urlsSource(['https://example.com/a', 'https://example.com/b', 'https://example.com/c'], { namespace: 'web' })

Option	Type	Description
`namespace`	`string`	Required, non-empty
`fetch`	`typeof fetch?`	Custom fetch shared across all URLs
`parsers`	`IngestParser[]?`	Parser overrides shared across all URLs
`ocr`	`OcrHook?`	OCR hook shared across PDF URLs

URLs are loaded sequentially (one at a time). For high-volume ingestion, batch your own concurrency above this layer.
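One way to add your own concurrency above this layer is a small worker pool that runs a bounded number of loads at once. The helper below is a generic sketch and not part of the library; `mapWithConcurrency` and the `worker` callback are hypothetical, and the worker stands in for whatever loads a single URL in your setup.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in the same order as `items`.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0 // index of the next unclaimed item

  // Each runner claims the next index until none are left.
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++
      results[i] = await worker(items[i] as T)
    }
  }

  // Start at least one runner and at most `limit` of them.
  const runnerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: runnerCount }, () => run()))
  return results
}
```

If any worker call rejects, the returned promise rejects as well, so catch errors inside the worker if one bad URL should not stop the batch.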

Errors

load() yields a failed source result for fetch and parse failures instead of throwing. This suits corpus sync, where one bad URL should not block the whole job. documents() throws on the first failure, which suits scripts and tests.

Example

import { corpus, indexer } from '@crux/core/indexing'
import { urlsSource } from '@crux/ingest/urls'

// `store` and `embedding` are assumed to be defined elsewhere in your setup.
const docsIndexer = indexer({
  id: 'web',
  namespace: 'kb',
  store,
  dense: embedding,
})

const docsCorpus = corpus({
  id: 'web',
  namespace: 'kb',
  store,
  indexer: docsIndexer,
})

const source = urlsSource(
  ['https://docs.example.com/intro', 'https://docs.example.com/setup', 'https://docs.example.com/api'],
  { namespace: 'kb' },
)

await docsCorpus.sync(source.load(), {
  sourceSet: 'complete',
  stale: 'delete',
})

Related

Reference: @crux/ingest overview
Reference: Files — load from disk instead
Reference: Indexing — indexer