Quality
@crux/core/quality — evaluate(), target, scorers, dataset(), cassette(), and the Experiment/Manifest/Baseline records.
@crux/core/quality is the Quality system's public surface: five values
plus types. Authoring lives in *.eval.ts files (or colocated
prompt({ tests })), execution in evaluation.run() and the
crux quality CLI.
import { evaluate, target, scorers, dataset, cassette } from '@crux/core/quality'
Type exports: Evaluation, EvaluateOptions, Case, CaseContext,
AssertContext, BoundExpect, ScoreMap, Matchers, Target,
Capability, TaskLike, Scorer, ScorerArgs, Score, ScorerFactory,
Gates, ScoreGate, GateResult, Dataset, Cassette, ReplayMode,
Experiment, ExperimentCell, CellAssertionFailure,
CellAssertionOutcome, CellAssertionPhase, CellAssertionStatus,
CellAssertionValue, CellAssertionExpression,
CellAssertionExpressionOperator, QualitySourceFrame,
QualitySourceFrameSnapshot, QualitySourceUnavailable, RunOverrides,
EvaluationManifest, EvaluationCoverageTargetId, QualityConfig, and the
inference utilities InputOf, OutputOf, ParamsOf, CapsOf, CaseOf,
ExpectedOf.
evaluate(idOrOptions, options?)
Defines a runnable Evaluation. Two call forms — options only (id derived from the file path at discovery) or explicit id (required for baseline promotion):
evaluate({ task, data /* … */ })
evaluate('support.refunds', { task, data /* … */ })
evaluate.only / evaluate.skip mirror Vitest focus/exclusion semantics
(only narrows the run and demotes gates to informational; skip takes an
optional reason via the case-level skip: string).
Options
| Key | Type / behavior |
|---|---|
| task | The thing under test; the sole inference anchor for input/output types. Any Prompt, FlowHandle, agent, retriever, plain (input, params) => output function, or target.*() wrapper |
| data | Case[], a dataset(), or a mixed array of both (concatenated) |
| expect | Evaluation-level assertion callback over CaseContext; runs for every case, the primary assertion home |
| assert | Evaluation-level post-score assertion callback over AssertContext; runs after scorers and can assert typed score values |
| scorers | Scorer array (importable/shareable) or factory lambda (s) => [...] receiving the library pre-bound to the evaluation's types |
| params | Execution defaults ("variant zero"); same shape variants override |
| variants | Named overrides of the task's parameter surface; entries may also swap the whole task (same I/O types required) |
| baseline | Reference variant name for paired comparison (keyof variants, so it autocompletes) |
| trials | Executions per cell (default 1; per-case trials wins; collapses to 1 under replay-strict) |
| gates | Declarative pass policy → exit code (see Gates) |
| replay | ReplayMode or { mode, cassette? } |
| concurrency / timeoutMs | Per-evaluation execution bounds (defaults: config, else 5 / 60 000 ms) |
| covers | Optional Project Index definition ids, such as prompt:support or flow:research, that this eval protects when task is a deterministic stand-in |
| tags / description | Metadata, surfaced in the manifest and reporter |
Inference contract: data, scorers, and expect are NoInfer positions —
a typo'd case key errors on the case property, never by widening the
task's input type. Gate keys under gates.scores are literal-typed from an
array spelling of scorers; with the factory spelling they are validated at
definition time instead (unknown key → immediate TypeError).
Cases
{
  name?: string // stable identity; defaults to a content hash of input
  input: TInput // typed from the task
  expected?: TExpected // opaque data; delivered to scorers/expect, never matched implicitly
  expect?: (ctx) => void // case-SPECIFIC assertions only (not portable to datasets)
  assert?: (ctx) => void // case-SPECIFIC post-score assertions
  trials?: number // wins over evaluation-level trials
  tags?: string[]
  only?: boolean
  skip?: boolean | string // reason string shown in the reporter
}
Tasks that capture steps (flows/agents) additionally accept multi-turn
cases: { turns: [{ user: '…' }, …] }.
Case identity is the key for watch-mode caching, history, and --case filtering:
explicit name (slugified), else sha256(canonicalJson(input)).slice(0, 12).
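The identity rule above can be sketched as follows. This is a minimal illustration, not the library's implementation: canonicalJson is assumed here to mean JSON with recursively sorted object keys, and slugify is a hypothetical helper, so the ids the runner actually produces may differ.

```typescript
import { createHash } from 'node:crypto'

// Assumed canonical form: object keys sorted recursively, so key order in the
// authored input does not change the hash. undefined is serialized as null.
function canonicalJson(value: unknown): string {
  if (value === undefined) return 'null'
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(',')}]`
  }
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>
    const body = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`)
      .join(',')
    return `{${body}}`
  }
  return JSON.stringify(value)
}

// Hypothetical slugifier for explicit case names.
function slugify(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
}

// Explicit name wins; otherwise the first 12 hex characters of the input hash.
function caseId(c: { name?: string; input: unknown }): string {
  if (c.name !== undefined) return slugify(c.name)
  return createHash('sha256').update(canonicalJson(c.input)).digest('hex').slice(0, 12)
}
```

Under this reading, two unnamed cases whose inputs differ only in key order share an id, which is why a stable explicit name is the safer choice for cases whose input is expected to change.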
target
A Target is a parameterized, signal-capturing task. Bare primitives passed to
task: behave exactly as target.<kind>(primitive) with no defaults — use
target.* only to set execution defaults or unlock variant parameters. Users
never call a target; only the runner executes it.
target.prompt(p, { prompt?, model?, settings?, generate? })
target.flow(f, { model?, settings?, steps?, generate? }) // steps: per-step model/settings, keyed by step name
target.agent(a, { tools?, maxToolSteps?, model?, settings?, steps?, generate? })
target.retriever(r, { id?, query?, options? }) // query: maps case input → query string when input ≠ { query }
target({ id?, run: (input, params) => output }) // custom task with a name and/or typed params
| Task kind | Capabilities (→ which expect.* namespaces exist) |
|---|---|
| prompt | modelCalls, citations, safety |
| flow | modelCalls, steps, toolCalls, routing, safety, memory |
| agent | modelCalls, toolCalls, steps, handoffs, retrieval, citations, safety, memory, routing |
| retriever | retrieval |
| fn | (none; value matchers + always-on only) |
Agent tools: mocks are keyed by tool name; values are a static result or
(args) => result. maxToolSteps bounds the loop (default 15).
generate/model resolution order: variant override → evaluation params
→ target defaults. Bind live model work in eval code or eval-local helper
modules. A model-backed task with no generate available is a definition error
(exit 2).
CaseContext
The argument to expect callbacks:
{
  input: TInput
  output: TOutput
  expected: TExpected | undefined
  expect: BoundExpect // see below
  variant: { name: string; params: Record<string, unknown> }
  trial: number // 0-based
  score(name, score, metadata?) // ad-hoc per-case score; joins the scorer score model
  step(name, schema?) // flow/agent step access; schema (Standard Schema) narrows output
  trace: { id?: string; url?: string } // devtools deep link
  meta: { durationMs: number; costUsd?: number; usage?: TokenUsage }
}
AssertContext
The argument to post-score assert callbacks:
{
  input: TInput
  output: TOutput
  expected: TExpected | undefined
  expect: BoundExpect // same matcher surface, records phase: "assert"
  score: ScoreMap<TScoreName> // statically named scorer outputs
  scores: readonly CellScore[] // all scores, including dynamic/ad-hoc and skipped scores
  variant: { name: string; params: Record<string, unknown> }
  trial: number
  step(name, schema?)
  trace: { id?: string; url?: string }
  meta: { durationMs: number; costUsd?: number; usage?: TokenUsage }
}
ctx.score is intentionally static-only: a scorer with literal name
citation_valid appears as ctx.score.citation_valid, while dynamic
runtime-named scores stay in ctx.scores. Numeric matchers record a
structured CellAssertionExpression, so later evidence read models can show
thresholds such as 0.58 >= 0.7 => false without parsing messages.
BoundExpect
- ctx.expect(value) → Vitest-compatible matchers (toBe, toEqual, toStrictEqual, toMatch, toMatchObject, toContain, toContainEqual, toHaveLength, toHaveProperty, toBeGreaterThan(OrEqual), toBeLessThan(OrEqual), toBeCloseTo, toBeDefined, toBeUndefined, toBeNull, toBeTruthy, toBeFalsy, toBeOneOf, toBeInstanceOf, toBeTypeOf, toSatisfy, .not). Matchers throw on failure (assertion-function semantics).
- ctx.expect.soft(value) → same matchers, collect-don't-throw.
- Always-on namespaces: latency (toBeUnderMs, p95()), cost (toBeUnderUsd, tokens(), toHaveModel, toHaveNoFallback), errors (toHaveNone, toHaveRetriedAtMost).
- Capability namespaces (exist only when the task captures that signal):
  - toolCalls: toHaveCalled(tool, args?), toHaveCalledAll, not.toHaveCalled, toMatchTrajectory('strict' | 'unordered' | 'subset' | 'superset', […]), toHaveCalledBefore(first, second), toHaveAllSucceeded, count()
  - steps: toHaveRun, toHaveSucceeded, toHaveOrder(...names) (subsequence), count()
  - handoffs: toHaveHandedOffTo, toHavePath(...agents), count()
  - retrieval: toContainHit({ sourceId?, chunkId?, namespace? }), toHaveTopHit, hits(), count()
  - citations: toCite, toAllResolve, toHaveNoDangling, toQuoteOutput, count()
  - safety: toHavePassedGuardrails, toHaveBlocked(id), toHavePassedConstraint(id), toHaveAllConstraintsPassed
  - memory: toHaveRead(key?), toHaveWritten(key?), toHaveValue(key, value)
  - routing: toHaveSelected(route), toHaveClassifiedAs(label), toHaveSelectedModel(id)
  - modelCalls: count(), toHaveUsedModel(id), toHaveNoFallback()
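The list marks steps.toHaveOrder(...names) as a subsequence check. A plausible reading, sketched below with a hypothetical isSubsequence helper, is that the named steps must appear in that relative order while other steps may run in between. The matcher's real behavior on repeated or failed steps is not specified here.

```typescript
// True when every name in `expected` appears in `actual` in the same relative
// order; unrelated steps may be interleaved.
function isSubsequence(expected: readonly string[], actual: readonly string[]): boolean {
  let next = 0 // index of the next expected name still to be found
  for (const name of actual) {
    if (next < expected.length && name === expected[next]) next += 1
  }
  return next === expected.length
}
```

For example, the order classify, answer is satisfied by a run of classify, retrieve, answer, but not by a run of answer, classify.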
Signals are read from the observability trace. Asserting on a signal that was
not captured in this execution throws UncapturedSignalError (naming the
signal and the task kinds that capture it) — never a vacuous pass. All
assertion outcomes lower into a per-cell pass score (1/0). New experiment
records expose ExperimentCell.assertions.outcomes, an ordered ledger for
passed, failed, uncaptured, and not-evaluated assertions. failures remains
the failed-outcome compatibility projection for older consumers. Outcomes
retain matcher messages when the runtime exposes them. They may also include
sourceFrame: either a narrow authored-source snapshot with line roles and
contentHash, or { kind: 'unavailable', reason } when the runner cannot
prove authored source. When that snapshot contains the assertion call, outcomes
may include subjectExpr, the authored argument passed to ctx.expect(...) or
ctx.expect.soft(...). Plain errors thrown from expect/assert callbacks are
stored as errored cells with a best-effort error.sourceRef, and the local
evidence API can resolve that source ref into the runtime-error check's source
frame. Outcomes from post-score assert callbacks use
phase: "assert"; outcomes from pre-score callbacks use phase: "expect".
Scorers still run on expect-failed cells.
Trace-backed signal outcomes may also include spanIds, the concrete
observability spans inspected by the matcher. Local cell evidence uses those
IDs as exact trace hotspots; score-threshold failures without exact span IDs
are marked with heuristic scorer/root-span hotspots instead.
Local Evidence Read Models
The local backend derives server-owned read models from experiment, baseline, source, and observability records:
- QualityCellEvidence for one experiment cell (caseId, variantName, trial), including the ordered assertion ledger, normalized checks, authored sourceFrame, values at check, baseline output/deltas, trace hotspots, repro command, and provenance. Score floors and ceilings from gates or assertion expressions are exposed as score-threshold checks with synthesized messages such as 0.58 is below the 0.70 floor.
- QualityEvaluationProgress for one evaluation, including recent run verdicts, pass rates, cost/duration, score mean/SEM/n series, and current baseline overlays.
- Evaluation experiment relation reads for UI list/detail screens: /api/quality/evaluations/{evaluationId}/experiments returns latest experiment summaries for one evaluation, while /api/quality/evaluations/experiment-groups returns buckets grouped by evaluation and sorted by each bucket's latest experiment.
They are available through the local CLI/dev-server API and are intentionally not reconstructed by UI clients.
Source frames are captured by the first-party runner and by direct
evaluation.run() calls. Stack refs that already point at authored source
files are read directly from disk; bundled runtime locations still require
source maps. Generated code without an authored source mapping is reported as
an explicit unavailable frame instead of being shown as source.
Connected CLI runs also retain full trace detail. When crux quality run has
a devtools server URL, the first-party runner installs the project's
observability HTTP transport and flushes graph records before exit, so
ExperimentCell.traceIds can resolve through /api/observability/runs/{runId}.
Direct evaluation.run() uses the same HTTP transport when CRUX_DEVTOOLS_URL,
DEVTOOLS_URL, or a reachable local localhost:4400 devtools server is
available. Offline runs may keep traceIds without a retained graph; clients
should render that as unavailable trace detail rather than a failed evaluation.
A retained root trace can still have an empty compact waterfall for a
callback-only cell that emitted no child spans; clients should expose the full
trace link when the backend returns retainedTraceIds.
scorers
Any autoevals-compatible ({ input, output, expected }) => Score function
works unmodified, where Score is { name, score: number | null, label?, metadata? } (null = skipped/not-applicable). Built-ins carry scorerName
(compile-time gate-key linkage) and costClass: 'code' | 'model'.
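As an illustration of the scorer shape described above, here is a hand-written scorer that needs nothing from the library. The Score type mirrors the fields named in the text (the exact types of label and metadata are assumptions), and keywordCoverage with its Args type is a hypothetical example, not a built-in.

```typescript
// Mirrors the Score shape described above; label/metadata types are assumed.
type Score = {
  name: string
  score: number | null
  label?: string
  metadata?: Record<string, unknown>
}

type Args = { input: unknown; output: string; expected?: string }

// Hypothetical scorer: fraction of whitespace-separated expected keywords
// that occur in the output (case-insensitive).
function keywordCoverage({ output, expected }: Args): Score {
  if (expected === undefined || expected.trim() === '') {
    // Nothing to compare against: report skipped rather than 0.
    return { name: 'keyword_coverage', score: null }
  }
  const keywords = expected.trim().toLowerCase().split(/\s+/)
  const haystack = output.toLowerCase()
  const missing = keywords.filter((keyword) => !haystack.includes(keyword))
  return {
    name: 'keyword_coverage',
    score: (keywords.length - missing.length) / keywords.length,
    metadata: { missing },
  }
}
```

Returning score: null for the not-applicable case keeps the cell out of the mean instead of dragging it toward zero.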
Code-class: exact(), contains(), regex({ pattern }), levenshtein(),
jsonValid(), jsonDiff(), retrieval.hitRateAtK(k), retrieval.recallAtK(k),
retrieval.precisionAtK(k), retrieval.mrr(), retrieval.ndcg(k?) (the
retrieval.* family reads expected: { sources: [{ sourceId, chunkId? }] }).
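For orientation, the retrieval metrics can be sketched with their textbook definitions over a ranked hit list and the expected sources. The matching rule (chunkId only compared when the expectation names one), the empty-expectation handling, and all function names below are assumptions; the built-ins may differ in edge cases.

```typescript
type SourceRef = { sourceId: string; chunkId?: string }

// Assumption: a hit is relevant when sourceId agrees and, if the expectation
// names a chunk, chunkId agrees as well.
function isRelevant(hit: SourceRef, expected: readonly SourceRef[]): boolean {
  return expected.some(
    (want) =>
      want.sourceId === hit.sourceId &&
      (want.chunkId === undefined || want.chunkId === hit.chunkId),
  )
}

// 1 when any of the top-k hits is relevant, else 0.
function hitRateAtK(hits: readonly SourceRef[], expected: readonly SourceRef[], k: number): number {
  return hits.slice(0, k).some((hit) => isRelevant(hit, expected)) ? 1 : 0
}

// Fraction of expected sources found in the top-k hits; null when nothing is expected.
function recallAtK(hits: readonly SourceRef[], expected: readonly SourceRef[], k: number): number | null {
  if (expected.length === 0) return null
  const top = hits.slice(0, k)
  const found = expected.filter((want) => top.some((hit) => isRelevant(hit, [want])))
  return found.length / expected.length
}

// Fraction of the k slots filled by a relevant hit (divides by k, not by hits returned).
function precisionAtK(hits: readonly SourceRef[], expected: readonly SourceRef[], k: number): number {
  if (k <= 0) return 0
  return hits.slice(0, k).filter((hit) => isRelevant(hit, expected)).length / k
}

// Reciprocal rank of the first relevant hit; 0 when none is relevant.
// MRR is the mean of this value across cases.
function reciprocalRank(hits: readonly SourceRef[], expected: readonly SourceRef[]): number {
  const index = hits.findIndex((hit) => isRelevant(hit, expected))
  return index === -1 ? 0 : 1 / (index + 1)
}
```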
Model-backed scorers require explicit eval-local bindings when they spend tokens:
scorers.judge({
  name: 'helpful', // required, literal-typed → gate keys
  rubric: '…', // free-rubric grading 0–1, or:
  choiceScores: { good: 1, bad: 0 }, // classification with mapped scores
  select: (o) => o.answer, // REQUIRED when output isn't a string
  generate: qualityRuntime.generate,
  model: qualityRuntime.judgeModel,
  useCoT: true, // default true; rationale → metadata.rationale
})
scorers.embeddingSimilarity({ embed: embedText }) // cosine vs expected
scorers.rag.faithfulness({ generate: qualityRuntime.generate, model: qualityRuntime.judgeModel })
scorers.rag.answerRelevancy({ generate: qualityRuntime.generate, model: qualityRuntime.judgeModel })
scorers.rag.contextPrecision({ generate: qualityRuntime.generate, model: qualityRuntime.judgeModel })
scorers.rag.contextRecall({ generate: qualityRuntime.generate, model: qualityRuntime.judgeModel })
Via the factory spelling (scorers: (s) => […]) the same library arrives
pre-bound to the evaluation's generics — judge.select is contextually typed
against the output. The rag.* scorers read retrieved context from captured
retrieval signals (fallback: input.context) and return score: null when
none exists.
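The embeddingSimilarity comment above describes its score as cosine similarity against expected. The arithmetic on two already-computed embedding vectors looks like the sketch below; cosineSimilarity is a hypothetical standalone helper, and how the scorer maps the result into a 0 to 1 score is not specified in this reference.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  // A zero vector has no direction; treat the similarity as 0 in this sketch.
  if (normA === 0 || normB === 0) return 0
  return dot / Math.sqrt(normA * normB)
}
```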
Gates
gates: {
  passRate?: { min: number } // over the per-cell `pass` score
  scores?: { [scorerName]: { min?, max?, minDeltaVsBaseline? } }
  latency?: { p95Ms?, meanMs? }
  cost?: { maxPerCaseUsd?, maxTotalUsd? }
  consistency?: { passAtK?, passAllTrials? } // with trials > 1
}
Zero-config default (no gates): assertions gate, scores inform; exit
non-zero iff any cell errored or any expect failed. Declaring any gate
replaces the default policy. Gates are false-safe (errored cells always
fail); they evaluate per non-baseline variant; filtered runs demote them to
informational. A minDeltaVsBaseline gate with no reference yet reports
passed: false, informational: true — visible, never blocking.
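A score gate's decision can be pictured as below. This is a sketch under assumptions: ScoreBounds and GateOutcome are illustrative local types (not the exported ScoreGate/GateResult), the gate is applied to a single variant's mean score, and only the no-reference behavior (passed: false, informational: true) is taken directly from the text.

```typescript
type ScoreBounds = { min?: number; max?: number; minDeltaVsBaseline?: number }
type GateOutcome = { passed: boolean; informational: boolean; reason: string }

// Checks one variant's mean score against the declared bounds.
function checkScoreGate(bounds: ScoreBounds, mean: number, baselineMean?: number): GateOutcome {
  if (bounds.min !== undefined && mean < bounds.min) {
    return { passed: false, informational: false, reason: `${mean} is below the ${bounds.min} floor` }
  }
  if (bounds.max !== undefined && mean > bounds.max) {
    return { passed: false, informational: false, reason: `${mean} is above the ${bounds.max} ceiling` }
  }
  if (bounds.minDeltaVsBaseline !== undefined) {
    if (baselineMean === undefined) {
      // No reference yet: visible in the report, never blocking.
      return { passed: false, informational: true, reason: 'no baseline reference yet' }
    }
    const delta = mean - baselineMean
    if (delta < bounds.minDeltaVsBaseline) {
      return { passed: false, informational: false, reason: `delta ${delta} is below ${bounds.minDeltaVsBaseline}` }
    }
  }
  return { passed: true, informational: false, reason: 'within bounds' }
}
```

A blocking failure is any outcome with passed: false and informational: false; the exit code would follow from whether at least one such outcome exists.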
dataset(path, { input, expected? })
Schema-validated file dataset — JSON, JSONL, or CSV by extension, validated
with any Standard Schema library (zod, valibot, arktype). Resolved lazily by
the runner; validation failure is a definition error. Rows are pure data (no
expect callbacks).
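To make the "rows are pure data, validated up front" idea concrete, here is a standalone sketch of loading a JSONL dataset. It does not use dataset() or a Standard Schema library: parseRow is a hand-written stand-in validator, and the Row shape (input.question, expected.answer) is a hypothetical example.

```typescript
type Row = { input: { question: string }; expected?: { answer: string } }

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Hand-written validator standing in for a schema; throws with the line number.
function parseRow(raw: unknown, line: number): Row {
  if (!isRecord(raw)) throw new Error(`line ${line}: row must be an object`)
  const { input, expected } = raw
  if (!isRecord(input) || typeof input.question !== 'string') {
    throw new Error(`line ${line}: input.question must be a string`)
  }
  const row: Row = { input: { question: input.question } }
  if (expected !== undefined) {
    if (!isRecord(expected) || typeof expected.answer !== 'string') {
      throw new Error(`line ${line}: expected.answer must be a string`)
    }
    row.expected = { answer: expected.answer }
  }
  return row
}

// One JSON document per non-empty line; any invalid row fails the whole load.
function loadJsonl(text: string): Row[] {
  return text
    .split('\n')
    .map((content, index) => ({ content: content.trim(), line: index + 1 }))
    .filter(({ content }) => content !== '')
    .map(({ content, line }) => parseRow(JSON.parse(content), line))
}
```

Failing the whole load on the first bad row matches the spirit of treating validation failure as a definition error rather than a per-case failure.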
cassette(name, opts?) and replay
type ReplayMode = 'live' | 'record-new' | 'replay-strict' | 'refresh'
cassette('support-agent', { mode?, match? }) // match: override the normalized key (advanced)
Interception at the ExecutorSpec/SdkGateway boundary; entry key = hash of
(call kind, target id, prompt hash, model, canonicalized settings,
tool-schema hash, input) — volatile fields excluded, so one cassette covers
all variants and judge calls. Storage:
.crux/quality/cassettes/<name>.json with { recordedAt, sdkVersion, models } metadata and a staleness warning past 90 days. Redaction applies at
write time, always. Mode resolution: --replay → cassette() mode →
evaluation declaration → quality.defaults.replay → live. Trials collapse
to 1 under replay-strict; a strict miss fails the cell closed with the
missing key. See the Replay guide.
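The mode resolution order reads as a first-defined-wins chain. The sketch below restates it with a hypothetical ModeSources record (one optional field per source) and a helper for the trial collapse; it is not the runner's code.

```typescript
type ReplayMode = 'live' | 'record-new' | 'replay-strict' | 'refresh'

// One optional field per source, listed in precedence order.
type ModeSources = {
  cliFlag?: ReplayMode // --replay
  cassette?: ReplayMode // cassette() mode
  evaluation?: ReplayMode // evaluation declaration
  configDefault?: ReplayMode // quality.defaults.replay
}

// The first source that is set wins; with none set the run is live.
function resolveReplayMode(sources: ModeSources): ReplayMode {
  return sources.cliFlag ?? sources.cassette ?? sources.evaluation ?? sources.configDefault ?? 'live'
}

// Trials collapse to 1 under replay-strict: replayed calls are deterministic,
// so repeating them adds no information.
function effectiveTrials(mode: ReplayMode, requested: number): number {
  return mode === 'replay-strict' ? 1 : requested
}
```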
Evaluation and Experiment
interface Evaluation {
  readonly _tag: 'CruxEvaluation' // discovery discriminant
  readonly id: string | undefined // explicit, or undefined → path-derived at collect
  readonly manifest: EvaluationManifest // structural facts, computed WITHOUT executing the task
  run(overrides?: RunOverrides): Promise<Experiment>
}
interface RunOverrides {
  variants?: string[] // typed to the declared variant names
  cases?: string[] // names/ids, `*` glob
  trials?: number
  replayMode?: ReplayMode
  reuseOutputs?: boolean // re-score cached outputs without executing (judge iteration)
  signal?: AbortSignal
  concurrency?: number
}
Experiment is the persisted result: cases (one ExperimentCell per case
× variant × trial, each linking devtools traceIds), per-variant
aggregates (mean ± SEM per score, pass@k/pass^k), comparison (paired
deltas vs the baseline), gates, passed, and fingerprints
(configFingerprint, taskFingerprint).
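The aggregate statistics named above are standard. The sketch below assumes SEM means sample standard deviation divided by the square root of n, and reads pass@k as "at least one of k trials passed" and pass^k as "all k trials passed" for a single case; the function names are hypothetical and the record's exact estimators may differ.

```typescript
// Mean and standard error of the mean for one score across cells.
function meanAndSem(values: readonly number[]): { mean: number; sem: number; n: number } {
  const n = values.length
  if (n === 0) return { mean: NaN, sem: NaN, n }
  const mean = values.reduce((sum, value) => sum + value, 0) / n
  // With one sample no spread can be estimated; this sketch reports 0.
  if (n === 1) return { mean, sem: 0, n }
  const variance = values.reduce((sum, value) => sum + (value - mean) ** 2, 0) / (n - 1)
  return { mean, sem: Math.sqrt(variance / n), n }
}

// pass@k for one case: at least one of its k trials passed.
function passAtK(trialPassed: readonly boolean[]): boolean {
  return trialPassed.some((passed) => passed)
}

// pass^k for one case: every one of its k trials passed.
function passAllK(trialPassed: readonly boolean[]): boolean {
  return trialPassed.length > 0 && trialPassed.every((passed) => passed)
}
```

pass@k answers "can the task succeed at all", while pass^k answers "does it succeed reliably", which is why the two can diverge sharply on flaky cases.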
Experiment.promote({ id?, variant? }) → { baselineId, path } writes the
committed Baseline record (explicit id required; filtered runs refused).
Records are versioned JSON contracts (schemaVersion, additive-only within a
major): the Experiment record
(.crux/quality/experiments/<ulid>.json, gitignored, also --json), the
Evaluation manifest (crux quality list --json, devtools pre-run
rendering; includes covers when explicit Project Index coverage targets are
declared), and the Baseline record
(.crux/quality/baselines/<evaluationId>.json, committed — how CI knows the
reference; configFingerprint drift demotes comparisons to informational).
Manifest + Experiment record are the complete agent-facing contract: list
evaluations, run one, read the assertion outcome ledger or legacy failures
with sourceRef + traceIds, edit, rerun — no parsing of human-oriented
output.
Identity and discovery
- Derived id = POSIX relative path from the quality root, eval suffix stripped, / → . (evals/support/refunds.eval.ts → evals.support.refunds); multi-export files append #<exportName> (default export appends nothing).
- Duplicate ids across the project are a collect-time definition error.
- Promotion requires an explicit id: the CLI prints the one-line pin (--pin-id in CI) and never rewrites source; the drift guard errors a later run whose id no longer matches a promoted baseline.
- Colocated prompt({ tests }) lowers to an evaluation with id prompt:<promptId> (source 'prompt-tests' in the manifest; output-schema validation gates).
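The derivation rule in the first bullet can be sketched as a small pure function. deriveEvaluationId is a hypothetical name, the eval suffix is assumed to be exactly .eval.ts, and the caller is assumed to pass exportName only for multi-export files.

```typescript
// relativePath is relative to the quality root; exportName is passed only
// when the file has several exported evaluations.
function deriveEvaluationId(relativePath: string, exportName?: string): string {
  const posix = relativePath.replace(/\\/g, '/') // normalize Windows separators
  const stem = posix.replace(/\.eval\.ts$/, '') // assumed eval suffix
  const base = stem
    .split('/')
    .filter((segment) => segment !== '')
    .join('.')
  // The default export appends nothing.
  return exportName === undefined || exportName === 'default' ? base : `${base}#${exportName}`
}
```

Because the id follows the path, moving or renaming the file changes a derived id, which is the reason promotion insists on an explicit one.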
QualityConfig (crux.config.ts)
quality: {
  id?: string // workbench id; default: package name
  dir?: string // persistence root; default '.crux/quality'
  include?: string | string[] // discovery globs; default ['evals/**/*.eval.ts', '**/*.eval.ts']
  exclude?: string | string[]
  redact?: string[] // value-relative cell paths; feedback paths are root-qualified
  defaults?: { trials?, concurrency?, timeoutMs?, replay? }
}
For evaluation cells, redact paths are relative to each snapshot value:
customer.email redacts that field from persisted input, output, expected
values, assertion values, and cassettes when present. Feedback records contain
separate payload roots, so use root-qualified paths such as
metadata.customer.email, expected.answer.privateNote, or
proposal.statement. Authorization and API-key-style fields are redacted at
every depth even when redact is empty.
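A value-relative redaction pass can be pictured as below. This is a sketch only: the '[REDACTED]' placeholder, the key pattern for "API-key-style" fields, and the function names are assumptions, and the real redactor may handle arrays or wildcards differently.

```typescript
const REDACTED = '[REDACTED]' // placeholder text is an assumption
const ALWAYS_REDACTED_KEY = /^(authorization|api[-_]?key)$/i // assumed pattern

// Deep-copies the value, redacting always-sensitive keys at every depth.
function scrubAlways(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(scrubAlways)
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {}
    for (const [key, child] of Object.entries(value)) {
      out[key] = ALWAYS_REDACTED_KEY.test(key) ? REDACTED : scrubAlways(child)
    }
    return out
  }
  return value
}

// Redacts one dotted path in place, if the path exists under root.
function redactPath(root: unknown, segments: readonly string[]): void {
  let cursor: unknown = root
  for (let i = 0; i < segments.length; i++) {
    const segment = segments[i]
    if (segment === undefined) return
    if (cursor === null || typeof cursor !== 'object' || Array.isArray(cursor)) return
    const record = cursor as Record<string, unknown>
    if (!(segment in record)) return
    if (i === segments.length - 1) record[segment] = REDACTED
    else cursor = record[segment]
  }
}

// Applies the configured paths to one snapshot value (input, output, expected, ...).
function redact(value: unknown, paths: readonly string[]): unknown {
  const copy = scrubAlways(value) // never mutate the caller's value
  for (const path of paths) redactPath(copy, path.split('.'))
  return copy
}
```

Applying the same paths to each snapshot value separately is what makes them value-relative: customer.email matches inside input, output, and expected without a root prefix.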
Inference utilities
InputOf<T> / OutputOf<T> / ParamsOf<T> / CapsOf<T> / ExpectedOf<T>
extract an evaluation-relevant type from any task. CaseOf<T, TExpected?> is
the documented one-annotation escape hatch for extracted case arrays:
export const cases = [
/* … */
] satisfies CaseOf<typeof supportPrompt>[]
See also
- Guide: Quality (ladder-ordered), Recipes
- CLI: crux quality
- Scoring internals: Scoring