Crawler-Style Custom Loader
Crawl a small site in userland and feed the pages into Crux indexing without making crawling built-in.
Crux does not ship a crawler in @crux/ingest, and that is intentional. Crawling has different tradeoffs around robots rules, concurrency, deduplication, link filtering, canonical URLs, and failure handling. Those concerns are real, but they do not belong in the minimal built-in loading layer.
What Crux does give you is the contract you need to build a crawler in userland: a loader that yields IngestDocument values.
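To make that contract concrete, here is a minimal sketch. The two interfaces are local stand-ins that mirror only the fields used on this page, not the actual definitions exported by @crux/ingest, and staticLoader and collect are hypothetical helper names introduced for illustration.

```typescript
// Local stand-ins for the types this recipe imports from '@crux/ingest'.
// Only the fields used on this page are modeled; the real types may differ.
interface IngestDocument {
  namespace: string
  sourceId: string
  content: string
  title?: string
  metadata?: Record<string, unknown>
}

interface SourceLoader {
  load(): AsyncIterable<IngestDocument>
}

// Hypothetical helper: a loader over documents that are already in memory.
function staticLoader(docs: IngestDocument[]): SourceLoader {
  return {
    async *load(): AsyncIterable<IngestDocument> {
      for (const doc of docs) yield doc
    },
  }
}

// Hypothetical helper: drain a loader into an array, e.g. in a unit test.
async function collect(loader: SourceLoader): Promise<IngestDocument[]> {
  const out: IngestDocument[] = []
  for await (const doc of loader.load()) out.push(doc)
  return out
}
```

Anything that can produce an async stream of this shape, whether from memory, disk, or the network, can sit behind the same load() method.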
The idea
We are going to build a tiny same-origin crawler that:
- starts from one URL
- follows links within the same host
- fetches pages
- strips them down to text
- yields IngestDocument records into an indexer
This is not meant to be a production crawler. It is meant to show the extension point clearly.
Full code
lib/rag/site-loader.ts
import { htmlToText } from '@crux/ingest/html'
import type { IngestDocument, SourceLoader } from '@crux/ingest'

export function crawlSite(options: {
  namespace: string
  startUrl: string
  maxPages?: number
  fetch?: typeof fetch
}): SourceLoader {
  const fetchImpl = options.fetch ?? fetch

  return {
    async *load(): AsyncIterable<IngestDocument> {
      // Normalize the start URL so it matches the form extractLinks produces.
      const start = new URL(options.startUrl)
      const queue = [start.toString()]
      const seen = new Set<string>()
      const origin = start.origin
      const maxPages = options.maxPages ?? 20

      // Breadth-first crawl. maxPages caps attempted fetches, including failed ones.
      while (queue.length > 0 && seen.size < maxPages) {
        const url = queue.shift()!
        if (seen.has(url)) continue
        seen.add(url)

        const response = await fetchImpl(url)
        if (!response.ok) continue

        const html = await response.text()
        const extracted = htmlToText(html)

        yield {
          namespace: options.namespace,
          sourceId: url,
          content: extracted.content,
          title: extracted.title,
          metadata: {
            sourceUrl: url,
            format: 'html',
          },
        }

        for (const nextUrl of extractLinks(html, url, origin)) {
          if (!seen.has(nextUrl)) queue.push(nextUrl)
        }
      }
    },
  }
}
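function extractLinks(html: string, baseUrl: string, origin: string): string[] {
  const urls = new Set<string>()
  for (const match of html.matchAll(/href=["']([^"']+)["']/gi)) {
    let url: URL
    try {
      url = new URL(match[1], baseUrl)
    } catch {
      continue // skip hrefs that do not parse as URLs
    }
    // Compare parsed origins: a prefix check would also accept https://example.com.evil.test
    if (url.origin !== origin) continue
    url.hash = '' // fragments point at the same page
    urls.add(url.toString())
  }
  return [...urls]
}
lib/rag/index-site.ts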
import { embedding } from '@crux/core/embedding'
import { corpus, indexer } from '@crux/core/indexing'
import { inMemoryDataStore, inMemoryVectorStore } from '@crux/core/storage'
import { crawlSite } from './site-loader'

const dense = embedding({ kind: 'dense', ... })
const data = inMemoryDataStore()
const vectors = inMemoryVectorStore()

const docsIndexer = indexer({
  id: 'site-docs',
  namespace: 'site-docs',
  data,
  vectors,
  dense,
})

const docsCorpus = corpus({
  id: 'site-docs',
  namespace: 'site-docs',
  store: data,
  indexer: docsIndexer,
})

export async function syncSite(startUrl: string) {
  const loader = crawlSite({
    namespace: 'site-docs',
    startUrl,
    maxPages: 25,
  })

  return docsCorpus.sync(loader.load(), {
    sourceSet: 'complete',
    stale: 'delete',
  })
}
Why this is a cookbook and not a built-in
The important part here is not the toy crawler itself. It is the architectural boundary.
Crux should not pretend crawling is "just another file loader." The right built-in abstraction is the loader contract. That lets users build a crawler that matches their own needs while still benefiting from the same indexing and retrieval layers afterward.
Once the crawler yields IngestDocument values, the rest of the stack does not care whether the source was a file, a URL list, or a custom crawl. The corpus sync handles changed, unchanged, and stale pages the same way in every case.
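As an illustration of a different source producing the same shape, a loader over a fixed URL list might look like the sketch below. The interfaces are local stand-ins for the @crux/ingest types, urlListLoader and FetchLike are hypothetical names, and the raw response body is used as content where the crawler above runs htmlToText.

```typescript
// Local stand-ins for the '@crux/ingest' types; the real definitions may differ.
interface IngestDocument {
  namespace: string
  sourceId: string
  content: string
  title?: string
  metadata?: Record<string, unknown>
}

interface SourceLoader {
  load(): AsyncIterable<IngestDocument>
}

// Hypothetical minimal fetch shape, so a test double can be injected.
// The global fetch is structurally compatible with it.
type FetchLike = (url: string) => Promise<{ ok: boolean; text(): Promise<string> }>

// Hypothetical loader: fetch a fixed list of URLs without following links.
function urlListLoader(options: {
  namespace: string
  urls: string[]
  fetch: FetchLike
}): SourceLoader {
  return {
    async *load(): AsyncIterable<IngestDocument> {
      for (const url of options.urls) {
        const response = await options.fetch(url)
        if (!response.ok) continue // skip pages that fail to load

        yield {
          namespace: options.namespace,
          sourceId: url,
          content: await response.text(), // raw body; no HTML stripping in this sketch
          metadata: { sourceUrl: url, format: 'html' },
        }
      }
    },
  }
}
```

Assuming the real types line up with these stand-ins, the output of load() here could be passed to docsCorpus.sync in place of the crawler's, with no change to the indexing code.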
Where to take it next
In a real crawler you would usually add:
- robots handling
- canonical URL normalization
- content-type checks
- rate limiting
- failure retries
- deduplication by content hash
Those are crawler concerns, not retrieval concerns.
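Three of the items above are small enough to sketch. The helpers below are hypothetical and not part of Crux: canonicalizeUrl assumes that fragments, query-parameter order, and trailing slashes do not distinguish pages (not true of every site), isHtml assumes you pass it the value of response.headers.get('content-type'), and dedupeByContent assumes Node's built-in crypto module is available.

```typescript
import { createHash } from 'node:crypto'

// Assumed canonical form: no fragment, sorted query params, no trailing slash
// except on the root path. Adjust to the rules of the site being crawled.
function canonicalizeUrl(raw: string): string {
  const url = new URL(raw)
  url.hash = ''
  url.searchParams.sort()
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1)
  }
  return url.toString()
}

// Accept only HTML responses; a missing Content-Type header is treated as not HTML.
function isHtml(contentType: string | null): boolean {
  if (contentType === null) return false
  return /^(text\/html|application\/xhtml\+xml)\b/i.test(contentType.trim())
}

// Wrap any async stream of documents and drop later items whose content
// is byte-for-byte identical to one already seen (compared by SHA-256).
async function* dedupeByContent<T extends { content: string }>(
  docs: AsyncIterable<T>,
): AsyncIterable<T> {
  const seenHashes = new Set<string>()
  for await (const doc of docs) {
    const hash = createHash('sha256').update(doc.content).digest('hex')
    if (seenHashes.has(hash)) continue
    seenHashes.add(hash)
    yield doc
  }
}
```

Because dedupeByContent takes and returns an async iterable, it can wrap the output of any loader's load() before the stream reaches the indexer, which keeps the crawler-specific logic on the crawler side of the boundary.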