Quality
Evaluate any Crux primitive — typed evaluations, trace-backed assertions, scorers, variants, baselines, and deterministic replay.
Quality is Crux's one evaluation system. You author an Evaluation with
evaluate(), point its task at any Crux primitive (prompt, flow, agent,
retriever) or a plain function, and give it data (cases). Running it
produces an Experiment — a matrix of variants × cases × trials with
statistically honest aggregates — and gates drive CI exit codes. A
Baseline is an explicitly promoted experiment committed to your repo, so
every later run auto-compares against it. Cassettes replay model calls
deterministically at the executor boundary.
The ladder
Every feature is reachable from the previous level by adding keys to the same object — never by restructuring what you already wrote:
| Level | You want | You write | New concept |
|---|---|---|---|
| 0 | Smoke-check a prompt | tests: cases on the prompt definition | none |
| 1 | A real eval | one *.eval.ts file: evaluate({ task, data }) | Evaluation |
| 2 | Deep, granular checks | add expect:; wrap the task in target.* for defaults | Assertions |
| 3 | Measured quality | add scorers: and gates: | Scorers & gates |
| 4 | Compare & protect | add variants:, baseline:, trials:; promote | Variants & baselines |
| 5 | Determinism & cost control | add replay: | Replay |
Nothing on a level requires understanding the next one. This guide teaches in that order; Recipes shows the pattern for each primitive (flow, agent, retriever, RAG, context bakeoffs).
Level 0 — colocated prompt tests
The zero-ceremony entry point: declare data-only cases on the prompt itself.
The runner lowers them into a prompt:<id> evaluation — output-schema
validation gates, expected is reported alongside results.
const triagePrompt = prompt({
id: 'support-triage',
input: z.object({ message: z.string() }),
output: z.object({ queue: z.enum(['billing', 'technical', 'other']) }),
system: 'Route the message to the right support queue.',
tests: [
{ input: { message: 'My invoice is wrong' }, expected: { queue: 'billing' } },
{ name: 'vague message', input: { message: 'It is broken' } },
],
})Level 1 — your first evaluation
Create a file matched by the default Quality discovery globs:
evals/**/*.eval.ts and **/*.eval.ts. A conventional project does not need
crux.config.ts for file-defined evaluations. Any async function is a valid
task; data is an array of cases:
import { evaluate } from '@crux/core/quality'
export default evaluate({
task: (input: { word: string }) => input.word.toUpperCase(),
data: [
{ input: { word: 'crux' }, expected: 'CRUX' },
{ input: { word: 'quality' }, expected: 'QUALITY' },
],
expect: (ctx) => {
if (ctx.expected !== undefined) ctx.expect(ctx.output).toBe(ctx.expected)
},
})Two things to know about the shape:
expectedis opaque data. Nothing matches it implicitly — it is delivered to yourexpectcallback and to scorers, and you decide what it means. Declarative matching is an explicit scorer (scorers.exact()).- Types flow from the task. Case inputs and
ctx.outputare inferred from the task — the function signature here, or a prompt's zod schemas:
import { evaluate } from '@crux/core/quality'
import { supportPrompt } from '../prompts/support'
export default evaluate('support.refunds', {
task: supportPrompt,
data: [
{ name: 'simple refund', input: { question: 'How do refunds work?', locale: 'en' } },
{ name: 'dutch refund', input: { question: 'Hoe werkt een refund?', locale: 'nl' } },
],
})The id ('support.refunds') is optional — without it the runner derives one
from the file path (evals/support.eval.ts → evals.support). Explicit ids
are required once you promote baselines.
If the eval task is a deterministic stand-in for a production primitive, add
covers with Project Index definition ids so Devtools and Index Lint can link
the eval to the thing it protects:
export default evaluate('support.refunds.contract', {
covers: ['prompt:support'],
task: deterministicRefundAnswer,
data,
})Running evaluations
crux quality list # discovered evaluations, no execution
crux quality run # everything discovered in the project
crux quality run support.refunds # one evaluation by id
crux quality run --ci # CI reporter + exit codes
crux quality watch support.refunds # rerun changed cells on save
crux quality show <experiment-id> --json
crux quality progress support.refunds --limit 10 --json
crux quality cell-evidence <experiment-id> --case simple-refund --variant default --trial 0 --json
crux quality promote <experiment-id>Quality is the evaluation system, and crux quality ... is the CLI namespace
for running, watching, listing, inspecting, and promoting evaluations.
Exit codes are CI-ready: 0 all blocking gates passed, 1 a gate/assertion
failed or a cell errored, 2 definition error (nothing executed). See the
CLI reference.
Or embed a run in any test — the Vitest bridge is one line:
import supportEval from '../evals/support.eval'
it('support quality', async () => {
const experiment = await supportEval.run()
expect(experiment.passed).toBe(true)
})Every run persists an Experiment record to
.crux/quality/experiments/<ulid>.json (gitignored) with per-cell outputs,
assertions, scores, and devtools trace links.
Discovery and model setup
Quality discovery starts from source. With no crux.config.ts, the runner
uses an empty quality block, discovers file-defined evaluations by
convention, imports source-discovered prompt exports for colocated
prompt({ tests }), and persists records under .crux/quality. Prompt tests
fail closed with a Project Model diagnostic when a prompt uses a context or
injectable dependency that source discovery cannot prove.
Model-backed tasks still need an adapter generate and a model. Prefer
importing that runtime next to the eval so the model choice is visible in
source:
import { openai } from '@ai-sdk/openai'
import { generate } from '@crux/ai'
import type { GenerateFn, ModelRef } from '@crux/core/quality'
export interface QualityRuntime {
readonly generate: GenerateFn
readonly model: ModelRef
}
export function qualityRuntime(): QualityRuntime {
return {
generate,
model: openai('gpt-4o-mini'),
} satisfies QualityRuntime
}import { evaluate, target } from '@crux/core/quality'
import { supportPrompt } from '../prompts/support'
import { qualityRuntime } from './quality-runtime'
const runtime = qualityRuntime()
export default evaluate('support.model-backed', {
task: target.prompt(supportPrompt, {
generate: runtime.generate,
model: runtime.model,
}),
data: [{ input: { question: 'How do refunds work?', locale: 'en' } }],
})Quality config stays focused on discovery, persistence, redaction, and run
defaults. Do not put model/provider defaults in crux.config.ts; share those
with ordinary TypeScript helpers imported by eval files:
import { config } from '@crux/core'
export default config({
quality: {
defaults: { trials: 1, timeoutMs: 60_000 },
redact: ['customer.email'],
},
})For evaluation cells, redact dot-paths are relative to each persisted
snapshot value: customer.email scrubs that field from input, output, expected
values, assertion values, and cassettes whenever that shape appears. Feedback
payloads are root-qualified, such as metadata.customer.email,
expected.answer.privateNote, or proposal.statement. Authorization and
API-key-style fields are always redacted at every depth, so you do not need to
configure paths such as headers.authorization.
defaults fills trials/concurrency/timeoutMs/replay for evaluations
that don't declare them.
Next
- Assertions —
expect, trace-backed signals, honest failure semantics - Scorers & gates — the built-in library, judges, pass policies
- Variants & baselines — bakeoffs, paired statistics, promotion
- Replay — cassettes and deterministic CI
- Recipes — flows, agents, retrieval, RAG, context bakeoffs
- API reference