Quality

Evaluate any Crux primitive with typed evaluations, trace-backed assertions, scorers, variants, baselines, and deterministic replay.

Quality is Crux's one evaluation system. You author an Evaluation with evaluate(), point its task at any Crux primitive (prompt, flow, agent, retriever) or a plain function, and give it data (cases). Running it produces an Experiment — a matrix of variants × cases × trials with statistically honest aggregates — and gates drive CI exit codes. A Baseline is an explicitly promoted experiment committed to your repo, so every later run auto-compares against it. Cassettes replay model calls deterministically at the executor boundary.
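
To make the experiment matrix concrete, the sketch below enumerates the cells that a run with a given set of variants, cases, and trials would contain. It is an illustration only: `Cell`, `expandCells`, and the `concise` variant name are hypothetical and not part of the Crux API, and the runner's real record shape may differ.

```typescript
// Hypothetical illustration of the variants × cases × trials matrix; not the Crux API.
interface Cell {
  readonly variant: string
  readonly caseName: string
  readonly trial: number // zero-based trial index
}

// Enumerate every (variant, case, trial) combination in a stable order.
function expandCells(
  variants: readonly string[],
  caseNames: readonly string[],
  trials: number,
): Cell[] {
  const cells: Cell[] = []
  for (const variant of variants) {
    for (const caseName of caseNames) {
      for (let trial = 0; trial < trials; trial++) {
        cells.push({ variant, caseName, trial })
      }
    }
  }
  return cells
}

const cells = expandCells(['default', 'concise'], ['simple-refund', 'dutch-refund'], 3)
console.log(cells.length) // 2 variants × 2 cases × 3 trials = 12
```

Each cell is the unit that produces an output, assertions, and scores; aggregates are computed across the cells that share a variant.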

The ladder

Every feature is reachable from the previous level by adding keys to the same object — never by restructuring what you already wrote:

| Level | You want | You write | New concept |
| --- | --- | --- | --- |
| 0 | Smoke-check a prompt | `tests:` cases on the prompt definition | none |
| 1 | A real eval | one `*.eval.ts` file: `evaluate({ task, data })` | Evaluation |
| 2 | Deep, granular checks | add `expect:`; wrap the task in `target.*` for defaults | Assertions |
| 3 | Measured quality | add `scorers:` and `gates:` | Scorers & gates |
| 4 | Compare & protect | add `variants:`, `baseline:`, `trials:`; promote | Variants & baselines |
| 5 | Determinism & cost control | add `replay:` | Replay |

Nothing on a level requires understanding the next one. This guide teaches in that order; Recipes shows the pattern for each primitive (flow, agent, retriever, RAG, context bakeoffs).

Level 0 — colocated prompt tests

The zero-ceremony entry point: declare data-only cases on the prompt itself. The runner lowers them into a prompt:<id> evaluation in which output-schema validation is the gate and expected is reported alongside the results.

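import { z } from 'zod'
// `prompt` is Crux's prompt factory; its import path is not shown in this guide.

const triagePrompt = prompt({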
  id: 'support-triage',
  input: z.object({ message: z.string() }),
  output: z.object({ queue: z.enum(['billing', 'technical', 'other']) }),
  system: 'Route the message to the right support queue.',
  tests: [
    { input: { message: 'My invoice is wrong' }, expected: { queue: 'billing' } },
    { name: 'vague message', input: { message: 'It is broken' } },
  ],
})
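
The gating rule described above can be sketched in plain TypeScript. This is a reading of the sentence, not the runner's implementation: `checkPromptTest`, `PromptTestResult`, and `isTriage` are hypothetical names, and the hand-written type guard stands in for the prompt's zod output schema.

```typescript
// Hypothetical sketch: schema validity decides pass/fail; `expected` is only carried along.
interface PromptTestResult<O> {
  readonly passed: boolean
  readonly output: O | undefined
  readonly expected: O | undefined
}

function checkPromptTest<O>(
  rawOutput: unknown,
  isValidOutput: (value: unknown) => value is O,
  expected?: O,
): PromptTestResult<O> {
  if (isValidOutput(rawOutput)) {
    return { passed: true, output: rawOutput, expected }
  }
  return { passed: false, output: undefined, expected }
}

// Stand-in for the `output` schema of the triage prompt.
type Triage = { queue: 'billing' | 'technical' | 'other' }
const QUEUES: readonly string[] = ['billing', 'technical', 'other']

function isTriage(value: unknown): value is Triage {
  if (typeof value !== 'object' || value === null) return false
  const queue = (value as { queue?: unknown }).queue
  return typeof queue === 'string' && QUEUES.includes(queue)
}

const result = checkPromptTest({ queue: 'technical' }, isTriage, { queue: 'billing' })
console.log(result.passed) // true: schema-valid, even though it differs from expected
```

Under this reading, a case that must match its expected value belongs in a Level 1 evaluation with an explicit assertion or scorer.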

Level 1 — your first evaluation

Create a file matched by the default Quality discovery globs: evals/**/*.eval.ts and **/*.eval.ts. A conventional project does not need crux.config.ts for file-defined evaluations. Any function, sync or async, is a valid task; data is an array of cases:

evals/uppercase.eval.ts

import { evaluate } from '@crux/core/quality'

export default evaluate({
  task: (input: { word: string }) => input.word.toUpperCase(),
  data: [
    { input: { word: 'crux' }, expected: 'CRUX' },
    { input: { word: 'quality' }, expected: 'QUALITY' },
  ],
  expect: (ctx) => {
    if (ctx.expected !== undefined) ctx.expect(ctx.output).toBe(ctx.expected)
  },
})

Two things to know about the shape:

expected is opaque data. Nothing matches it implicitly — it is delivered to your expect callback and to scorers, and you decide what it means. Declarative matching is an explicit scorer (scorers.exact()).
Types flow from the task. Case inputs and ctx.output are inferred from the task — the function signature here, or a prompt's zod schemas:

evals/support.eval.ts

import { evaluate } from '@crux/core/quality'
import { supportPrompt } from '../prompts/support'

export default evaluate('support.refunds', {
  task: supportPrompt,
  data: [
    { name: 'simple refund', input: { question: 'How do refunds work?', locale: 'en' } },
    { name: 'dutch refund', input: { question: 'Hoe werkt een refund?', locale: 'nl' } },
  ],
})

The id ('support.refunds') is optional — without it the runner derives one from the file path (evals/support.eval.ts → evals.support). Explicit ids are required once you promote baselines.
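
A minimal sketch of that derivation follows. The rule is an assumption inferred from the single example above (drop the `.eval.ts` suffix, join path segments with dots); `deriveEvaluationId` is a hypothetical helper, and the runner may treat nested directories or unusual characters differently.

```typescript
// Assumed rule, inferred from `evals/support.eval.ts → evals.support`; not the Crux API.
function deriveEvaluationId(filePath: string): string {
  return filePath
    .replace(/\\/g, '/') // normalise Windows separators
    .replace(/^\.\//, '') // drop a leading "./"
    .replace(/\.eval\.ts$/, '') // drop the eval suffix
    .split('/')
    .filter((segment) => segment.length > 0)
    .join('.')
}

console.log(deriveEvaluationId('evals/support.eval.ts')) // evals.support
```

Because a derived id changes whenever the file moves, an explicit id is the stable choice for anything you intend to promote.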

If the eval task is a deterministic stand-in for a production primitive, add covers with Project Index definition ids so Devtools and Index Lint can link the eval to the thing it protects:

export default evaluate('support.refunds.contract', {
  covers: ['prompt:support'],
  task: deterministicRefundAnswer,
  data,
})

Running evaluations

crux quality list                 # discovered evaluations, no execution
crux quality run                  # everything discovered in the project
crux quality run support.refunds  # one evaluation by id
crux quality run --ci             # CI reporter + exit codes
crux quality watch support.refunds # rerun changed cells on save
crux quality show <experiment-id> --json
crux quality progress support.refunds --limit 10 --json
crux quality cell-evidence <experiment-id> --case simple-refund --variant default --trial 0 --json
crux quality promote <experiment-id>

Quality is the evaluation system, and crux quality ... is the CLI namespace for running, watching, listing, inspecting, and promoting evaluations.

Exit codes are CI-ready: 0 means all blocking gates passed, 1 means a gate or assertion failed or a cell errored, and 2 means a definition error (nothing executed). See the CLI reference.
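
The mapping can be written down as a small function. This is a sketch of the rule as stated, not the CLI's source: `RunSummary` and its field names are hypothetical.

```typescript
// Hypothetical run summary; the field names are assumptions, not the Crux API.
interface RunSummary {
  readonly definitionError: boolean // the definition was invalid, so nothing executed
  readonly failedBlockingGates: number
  readonly failedAssertions: number
  readonly erroredCells: number
}

type ExitCode = 0 | 1 | 2

function exitCodeFor(summary: RunSummary): ExitCode {
  // A definition error takes precedence because no cell ran at all.
  if (summary.definitionError) return 2
  const failures =
    summary.failedBlockingGates + summary.failedAssertions + summary.erroredCells
  return failures > 0 ? 1 : 0
}

const clean: RunSummary = {
  definitionError: false,
  failedBlockingGates: 0,
  failedAssertions: 0,
  erroredCells: 0,
}
console.log(exitCodeFor(clean)) // 0
console.log(exitCodeFor({ ...clean, erroredCells: 1 })) // 1
```

Separating 2 from 1 lets a CI job distinguish a broken evaluation definition from a genuine quality regression.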

Or embed a run in any test — the Vitest bridge is one line:

import { expect, it } from 'vitest'
import supportEval from '../evals/support.eval'

it('support quality', async () => {
  const experiment = await supportEval.run()
  expect(experiment.passed).toBe(true)
})

Every run persists an Experiment record to .crux/quality/experiments/<ulid>.json (gitignored) with per-cell outputs, assertions, scores, and devtools trace links.

Discovery and model setup

Quality discovery starts from source. With no crux.config.ts, the runner uses an empty quality block, discovers file-defined evaluations by convention, imports source-discovered prompt exports for colocated prompt({ tests }), and persists records under .crux/quality. Prompt tests fail closed with a Project Model diagnostic when a prompt uses a context or injectable dependency that source discovery cannot prove.
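
As a rough illustration of convention-based file discovery, the sketch below checks a relative path against the two default globs named earlier (evals/**/*.eval.ts and **/*.eval.ts). `matchesDefaultGlobs` is a hypothetical, hand-rolled matcher; it assumes standard glob semantics and does not model any ignore rules the real runner may apply.

```typescript
// Hypothetical matcher for the default discovery globs; not the Crux implementation.
function matchesDefaultGlobs(relativePath: string): boolean {
  const path = relativePath.replace(/\\/g, '/').replace(/^\.\//, '')
  const fileName = path.split('/').pop() ?? ''
  // `**/*.eval.ts` matches at any depth, so it also covers `evals/**/*.eval.ts`.
  // Require at least one character before the suffix, as `*` does not match a bare dotfile.
  return fileName.length > '.eval.ts'.length && fileName.endsWith('.eval.ts')
}

console.log(matchesDefaultGlobs('evals/support.eval.ts')) // true
console.log(matchesDefaultGlobs('evals/quality-runtime.ts')) // false
```

Helper modules such as evals/quality-runtime.ts sit next to eval files without being picked up, because only the `.eval.ts` suffix is matched.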

Model-backed tasks still need an adapter's generate function and a model. Prefer importing that runtime next to the eval so the model choice is visible in source:

evals/quality-runtime.ts

import { openai } from '@ai-sdk/openai'
import { generate } from '@crux/ai'
import type { GenerateFn, ModelRef } from '@crux/core/quality'

export interface QualityRuntime {
  readonly generate: GenerateFn
  readonly model: ModelRef
}

export function qualityRuntime(): QualityRuntime {
  return {
    generate,
    model: openai('gpt-4o-mini'),
  } satisfies QualityRuntime
}

evals/support-model.eval.ts

import { evaluate, target } from '@crux/core/quality'
import { supportPrompt } from '../prompts/support'
import { qualityRuntime } from './quality-runtime'

const runtime = qualityRuntime()

export default evaluate('support.model-backed', {
  task: target.prompt(supportPrompt, {
    generate: runtime.generate,
    model: runtime.model,
  }),
  data: [{ input: { question: 'How do refunds work?', locale: 'en' } }],
})

Quality config stays focused on discovery, persistence, redaction, and run defaults. Do not put model/provider defaults in crux.config.ts; share those with ordinary TypeScript helpers imported by eval files:

crux.config.ts

import { config } from '@crux/core'

export default config({
  quality: {
    defaults: { trials: 1, timeoutMs: 60_000 },
    redact: ['customer.email'],
  },
})

For evaluation cells, redact dot-paths are relative to each persisted snapshot value: customer.email scrubs that field from input, output, expected values, assertion values, and cassettes whenever that shape appears. Feedback payloads are root-qualified, such as metadata.customer.email, expected.answer.privateNote, or proposal.statement. Authorization and API-key-style fields are always redacted at every depth, so you do not need to configure paths such as headers.authorization.
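
The two redaction rules for evaluation cells can be sketched as follows. Everything here is an assumption made for illustration: `redactSnapshot`, the `[REDACTED]` placeholder, and the key pattern used for credential-style fields are hypothetical, and the real redactor may match more keys or use a different marker.

```typescript
// Hypothetical sketch of snapshot redaction; not the Crux implementation.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json }

const REDACTED = '[REDACTED]' // placeholder text is an assumption
const ALWAYS_REDACTED = /^(authorization|api[-_]?key)$/i // assumed key pattern

function isObject(value: Json): value is { [key: string]: Json } {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Rule 1: credential-style keys are scrubbed at every depth, with no configuration.
function scrubSecrets(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => scrubSecrets(item))
  if (!isObject(value)) return value
  const out: { [key: string]: Json } = {}
  for (const [key, child] of Object.entries(value)) {
    out[key] = ALWAYS_REDACTED.test(key) ? REDACTED : scrubSecrets(child)
  }
  return out
}

// Rule 2: a configured dot-path is resolved from the root of each snapshot value.
function redactPath(value: Json, segments: readonly string[]): Json {
  const [head, ...rest] = segments
  if (head === undefined || !isObject(value)) return value
  const child = value[head]
  if (child === undefined) return value // the shape does not appear in this snapshot
  return { ...value, [head]: rest.length === 0 ? REDACTED : redactPath(child, rest) }
}

function redactSnapshot(value: Json, paths: readonly string[]): Json {
  let result = scrubSecrets(value)
  for (const path of paths) result = redactPath(result, path.split('.'))
  return result
}

const input: Json = {
  customer: { email: 'a@example.com', plan: 'pro' },
  headers: { authorization: 'Bearer abc' },
}
const redacted = redactSnapshot(input, ['customer.email'])
console.log(JSON.stringify(redacted, null, 2))
```

Applying the same function to input, output, and expected values is what makes one `customer.email` entry cover every snapshot in which that shape occurs.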

The defaults block fills in trials, concurrency, timeoutMs, and replay for evaluations that do not declare them.
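
The fill-in behaviour amounts to "declared value wins, otherwise use the default". A minimal sketch, with a hypothetical `applyDefaults` helper and a `RunSettings` type that covers only the numeric settings (replay is omitted because this guide does not show its value type):

```typescript
// Hypothetical sketch of default resolution; not the Crux implementation.
interface RunSettings {
  readonly trials?: number
  readonly concurrency?: number
  readonly timeoutMs?: number
}

// A value declared on the evaluation always wins over the project default.
function applyDefaults(declared: RunSettings, defaults: RunSettings): RunSettings {
  return {
    trials: declared.trials ?? defaults.trials,
    concurrency: declared.concurrency ?? defaults.concurrency,
    timeoutMs: declared.timeoutMs ?? defaults.timeoutMs,
  }
}

const projectDefaults: RunSettings = { trials: 1, timeoutMs: 60_000 }
const resolved = applyDefaults({ trials: 5 }, projectDefaults)
console.log(resolved.trials, resolved.timeoutMs) // 5 60000
```

Using `??` rather than `||` means an explicitly declared `0` would be kept instead of being replaced by the default.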

Next

Assertions: expect, trace-backed signals, honest failure semantics
Scorers & gates: the built-in library, judges, pass policies
Variants & baselines: bakeoffs, paired statistics, promotion
Replay: cassettes and deterministic CI
Recipes: flows, agents, retrieval, RAG, context bakeoffs
API reference
