Quality

@crux/core/quality — evaluate(), target, scorers, dataset(), cassette(), and the Experiment/Manifest/Baseline records.

@crux/core/quality is the Quality system's public surface: five values plus types. Authoring lives in *.eval.ts files (or colocated prompt({ tests })), execution in evaluation.run() and the crux quality CLI.

import { evaluate, target, scorers, dataset, cassette } from '@crux/core/quality'

Type exports: Evaluation, EvaluateOptions, Case, CaseContext, AssertContext, BoundExpect, ScoreMap, Matchers, Target, Capability, TaskLike, Scorer, ScorerArgs, Score, ScorerFactory, Gates, ScoreGate, GateResult, Dataset, Cassette, ReplayMode, Experiment, ExperimentCell, CellAssertionFailure, CellAssertionOutcome, CellAssertionPhase, CellAssertionStatus, CellAssertionValue, CellAssertionExpression, CellAssertionExpressionOperator, QualitySourceFrame, QualitySourceFrameSnapshot, QualitySourceUnavailable, RunOverrides, EvaluationManifest, EvaluationCoverageTargetId, QualityConfig, and the inference utilities InputOf, OutputOf, ParamsOf, CapsOf, CaseOf, ExpectedOf.

`evaluate(idOrOptions, options?)`

Defines a runnable Evaluation. Two call forms — options only (id derived from the file path at discovery) or explicit id (required for baseline promotion):

evaluate({ task, data /* … */ })
evaluate('support.refunds', { task, data /* … */ })

evaluate.only / evaluate.skip mirror Vitest focus/exclusion semantics (only narrows the run and demotes gates to informational; skip takes an optional reason via the case-level skip: string).

Options

Key	Type / behavior
`task`	The thing under test — the sole inference anchor for input/output types. Any `Prompt`, `FlowHandle`, agent, retriever, plain `(input, params) => output` function, or `target.*()` wrapper
`data`	`Case[]`, a `dataset()`, or a mixed array of both (concatenated)
`expect`	Evaluation-level assertion callback over `CaseContext` — runs for every case, the primary assertion home
`assert`	Evaluation-level post-score assertion callback over `AssertContext` — runs after scorers and can assert typed score values
`scorers`	Scorer array (importable/shareable) or factory lambda `(s) => [...]` receiving the library pre-bound to the evaluation's types
`params`	Execution defaults ("variant zero"); the same shape that variants override
`variants`	Named overrides of the task's parameter surface; entries may also swap the whole `task` (same I/O types required)
`baseline`	Reference variant name for paired comparison (`keyof variants` — autocompletes)
`trials`	Executions per cell (default 1; per-case `trials` wins; collapses to 1 under replay-strict)
`gates`	Declarative pass policy → exit code (see Gates)
`replay`	`ReplayMode` or `{ mode, cassette? }`
`concurrency` / `timeoutMs`	Per-evaluation execution bounds (defaults: config, else 5 / 60 000 ms)
`covers`	Optional Project Index definition ids, such as `prompt:support` or `flow:research`, that this eval protects when `task` is a deterministic stand-in
`tags` / `description`	Metadata, surfaced in the manifest and reporter

Inference contract: data, scorers, and expect are NoInfer positions — a typo'd case key errors on the case property, never by widening the task's input type. Gate keys under gates.scores are literal-typed from an array spelling of scorers; with the factory spelling they are validated at definition time instead (unknown key → immediate TypeError).

Cases

{
  name?: string                  // stable identity; defaults to a content hash of input
  input: TInput                  // typed from the task
  expected?: TExpected           // opaque data — delivered to scorers/expect, never matched implicitly
  expect?: (ctx) => void         // case-SPECIFIC assertions only (not portable to datasets)
  assert?: (ctx) => void         // case-SPECIFIC post-score assertions
  trials?: number                // wins over evaluation-level trials
  tags?: string[]
  only?: boolean
  skip?: boolean | string        // reason string shown in the reporter
}

Tasks that capture steps (flows/agents) additionally accept multi-turn cases: { turns: [{ user: '…' }, …] }.

Case identity is the key for watch-mode caching, history, and --case filtering. It is the explicit name (slugified) when one is given, else sha256(canonicalJson(input)).slice(0, 12).
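The identity rule can be sketched as follows. This is an illustration, not the library's implementation: canonicalJson is assumed here to mean recursively sorted object keys, and the slugify rule (lowercase, non-alphanumerics collapsed to "-") is hypothetical.

```typescript
import { createHash } from 'node:crypto'

// Assumption: "canonical" JSON = object keys sorted recursively, arrays kept in order.
function canonicalJson(value: unknown): string {
  if (value === undefined) return 'null'
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(',')}]`
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`)
    return `{${entries.join(',')}}`
  }
  return JSON.stringify(value)
}

// Hypothetical slug rule; the real slugifier may differ.
function slugify(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '')
}

// Explicit name wins; otherwise a 12-character prefix of the input's content hash.
function caseId(c: { name?: string; input: unknown }): string {
  if (c.name) return slugify(c.name)
  return createHash('sha256').update(canonicalJson(c.input)).digest('hex').slice(0, 12)
}
```

Under these assumptions, two unnamed cases whose inputs differ only in key order share an id, which is why a stable name is the safer choice once history matters.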

`target`

A Target is a parameterized, signal-capturing task. Bare primitives passed as task behave exactly as target.<kind>(primitive) with no defaults; use target.* only to set execution defaults or unlock variant parameters. Users never call a target; only the runner executes it.

target.prompt(p, { prompt?, model?, settings?, generate? })
target.flow(f, { model?, settings?, steps?, generate? })       // steps: per-step model/settings, keyed by step name
target.agent(a, { tools?, maxToolSteps?, model?, settings?, steps?, generate? })
target.retriever(r, { id?, query?, options? })                  // query: maps case input → query string when input ≠ { query }
target({ id?, run: (input, params) => output })                 // custom task with a name and/or typed params

Task kind	Capabilities (→ which `expect.*` namespaces exist)
`prompt`	`modelCalls`, `citations`, `safety`
`flow`	`modelCalls`, `steps`, `toolCalls`, `routing`, `safety`, `memory`
`agent`	`modelCalls`, `toolCalls`, `steps`, `handoffs`, `retrieval`, `citations`, `safety`, `memory`, `routing`
`retriever`	`retrieval`
`fn`	(none — value matchers + always-on only)

Agent tools: mocks are keyed by tool name; values are a static result or (args) => result. maxToolSteps bounds the loop (default 15).

generate/model resolution order: variant override → evaluation params → target defaults. Bind live model work in eval code or eval-local helper modules. A model-backed task with no generate available is a definition error (exit 2).
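A minimal sketch of that resolution order, assuming each layer is a plain object with optional model and generate fields. The ExecParams shape, resolveExecParams, and DefinitionError are hypothetical names used only for illustration.

```typescript
type Generate = (prompt: string) => Promise<string>

// Hypothetical shape: the fields each layer may contribute.
interface ExecParams {
  model?: string
  generate?: Generate
}

class DefinitionError extends Error {}

// First defined value wins: variant override, then evaluation params, then target defaults.
function resolveExecParams(
  variant: ExecParams,
  evaluationParams: ExecParams,
  targetDefaults: ExecParams,
): { model: string | undefined; generate: Generate } {
  const model = variant.model ?? evaluationParams.model ?? targetDefaults.model
  const generate = variant.generate ?? evaluationParams.generate ?? targetDefaults.generate
  // A model-backed task with no generate binding is a definition error, not a runtime failure.
  if (!generate) throw new DefinitionError('model-backed task has no generate binding')
  return { model, generate }
}
```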

`CaseContext`

The argument to expect callbacks:

{
  input: TInput
  output: TOutput
  expected: TExpected | undefined
  expect: BoundExpect            // see below
  variant: { name: string; params: Record<string, unknown> }
  trial: number                  // 0-based
  score(name, score, metadata?)  // ad-hoc per-case score; joins the scorer score model
  step(name, schema?)            // flow/agent step access; schema (Standard Schema) narrows output
  trace: { id?: string; url?: string }   // devtools deep link
  meta: { durationMs: number; costUsd?: number; usage?: TokenUsage }
}

`AssertContext`

The argument to post-score assert callbacks:

{
  input: TInput
  output: TOutput
  expected: TExpected | undefined
  expect: BoundExpect            // same matcher surface, records phase: "assert"
  score: ScoreMap<TScoreName>    // statically named scorer outputs
  scores: readonly CellScore[]   // all scores, including dynamic/ad-hoc and skipped scores
  variant: { name: string; params: Record<string, unknown> }
  trial: number
  step(name, schema?)
  trace: { id?: string; url?: string }
  meta: { durationMs: number; costUsd?: number; usage?: TokenUsage }
}

ctx.score is intentionally static-only: a scorer with literal name citation_valid appears as ctx.score.citation_valid, while dynamic runtime-named scores stay in ctx.scores. Numeric matchers record a structured CellAssertionExpression, so later evidence read models can show thresholds such as 0.58 >= 0.7 => false without parsing messages.
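To illustrate why a structured record beats message parsing, here is a sketch of a comparison captured as data and rendered afterwards. The field names (left, operator, right, result) are assumptions for this example; the actual CellAssertionExpression shape is not specified in this reference.

```typescript
type Operator = '>=' | '>' | '<=' | '<' | '=='

// Hypothetical record shape for a numeric assertion.
interface AssertionExpression {
  left: number
  operator: Operator
  right: number
  result: boolean
}

// Evaluate the comparison once and keep both operands alongside the outcome.
function compare(left: number, operator: Operator, right: number): AssertionExpression {
  const result =
    operator === '>=' ? left >= right
    : operator === '>' ? left > right
    : operator === '<=' ? left <= right
    : operator === '<' ? left < right
    : left === right
  return { left, operator, right, result }
}

// Rendering is a pure function of the record, so a UI never has to parse a message.
function render(e: AssertionExpression): string {
  return `${e.left} ${e.operator} ${e.right} => ${e.result}`
}
```

With this shape, render(compare(0.58, '>=', 0.7)) produces the string quoted above.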

`BoundExpect`

ctx.expect(value) → Vitest-compatible matchers (toBe, toEqual, toStrictEqual, toMatch, toMatchObject, toContain, toContainEqual, toHaveLength, toHaveProperty, toBeGreaterThan(OrEqual), toBeLessThan(OrEqual), toBeCloseTo, toBeDefined, toBeUndefined, toBeNull, toBeTruthy, toBeFalsy, toBeOneOf, toBeInstanceOf, toBeTypeOf, toSatisfy, .not). Matchers throw on failure (assertion-function semantics).
ctx.expect.soft(value) → same matchers, collect-don't-throw.
Always-on namespaces: latency (toBeUnderMs, p95()), cost (toBeUnderUsd, tokens(), toHaveModel, toHaveNoFallback), errors (toHaveNone, toHaveRetriedAtMost).
Capability namespaces (exist only when the task captures that signal):
- toolCalls — toHaveCalled(tool, args?), toHaveCalledAll, not.toHaveCalled, toMatchTrajectory('strict' | 'unordered' | 'subset' | 'superset', […]), toHaveCalledBefore(first, second), toHaveAllSucceeded, count()
- steps — toHaveRun, toHaveSucceeded, toHaveOrder(...names) (subsequence), count()
- handoffs — toHaveHandedOffTo, toHavePath(...agents), count()
- retrieval — toContainHit({ sourceId?, chunkId?, namespace? }), toHaveTopHit, hits(), count()
- citations — toCite, toAllResolve, toHaveNoDangling, toQuoteOutput, count()
- safety — toHavePassedGuardrails, toHaveBlocked(id), toHavePassedConstraint(id), toHaveAllConstraintsPassed
- memory — toHaveRead(key?), toHaveWritten(key?), toHaveValue(key, value)
- routing — toHaveSelected(route), toHaveClassifiedAs(label), toHaveSelectedModel(id)
- modelCalls — count(), toHaveUsedModel(id), toHaveNoFallback()

Signals are read from the observability trace. Asserting on a signal that was not captured in this execution throws UncapturedSignalError (naming the signal and the task kinds that capture it); it is never a vacuous pass. All assertion outcomes lower into a per-cell pass score (1/0).

New experiment records expose ExperimentCell.assertions.outcomes, an ordered ledger for passed, failed, uncaptured, and not-evaluated assertions. failures remains the failed-outcome compatibility projection for older consumers. Outcomes retain matcher messages when the runtime exposes them. They may also include sourceFrame: either a narrow authored-source snapshot with line roles and contentHash, or { kind: 'unavailable', reason } when the runner cannot prove authored source. When that snapshot contains the assertion call, outcomes may include subjectExpr, the authored argument passed to ctx.expect(...) or ctx.expect.soft(...).

Plain errors thrown from expect/assert callbacks are stored as errored cells with a best-effort error.sourceRef, and the local evidence API can resolve that source ref into the runtime-error check's source frame. Outcomes from post-score assert callbacks use phase: "assert"; outcomes from pre-score callbacks use phase: "expect". Scorers still run on expect-failed cells.

Trace-backed signal outcomes may also include spanIds, the concrete observability spans inspected by the matcher. Local cell evidence uses those IDs as exact trace hotspots; score-threshold failures without exact span IDs are marked with heuristic scorer/root-span hotspots instead.

Local Evidence Read Models

The local backend derives server-owned read models from experiment, baseline, source, and observability records:

QualityCellEvidence for one experiment cell (caseId, variantName, trial), including the ordered assertion ledger, normalized checks, authored sourceFrame, values at check, baseline output/deltas, trace hotspots, repro command, and provenance. Score floors and ceilings from gates or assertion expressions are exposed as score-threshold checks with synthesized messages such as 0.58 is below the 0.70 floor.
QualityEvaluationProgress for one evaluation, including recent run verdicts, pass rates, cost/duration, score mean/SEM/n series, and current baseline overlays.
Evaluation-to-experiment relation reads for UI list/detail screens: /api/quality/evaluations/{evaluationId}/experiments returns the latest experiment summaries for one evaluation, while /api/quality/evaluations/experiment-groups returns buckets grouped by evaluation and sorted by each bucket's latest experiment.

They are available through the local CLI/dev-server API and are intentionally not reconstructed by UI clients.

Source frames are captured by the first-party runner and by direct evaluation.run() calls. Stack refs that already point at authored source files are read directly from disk; bundled runtime locations still require source maps. Generated code without an authored source mapping is reported as an explicit unavailable frame instead of being shown as source.

Connected CLI runs also retain full trace detail. When crux quality run has a devtools server URL, the first-party runner installs the project's observability HTTP transport and flushes graph records before exit, so ExperimentCell.traceIds can resolve through /api/observability/runs/{runId}. Direct evaluation.run() uses the same HTTP transport when CRUX_DEVTOOLS_URL, DEVTOOLS_URL, or a reachable local localhost:4400 devtools server is available. Offline runs may keep traceIds without a retained graph; clients should render that as unavailable trace detail rather than a failed evaluation. A retained root trace can still have an empty compact waterfall for a callback-only cell that emitted no child spans; clients should expose the full trace link when the backend returns retainedTraceIds.

`scorers`

Any autoevals-compatible ({ input, output, expected }) => Score function works unmodified, where Score is { name, score: number | null, label?, metadata? } (null = skipped/not-applicable). Built-ins carry scorerName (compile-time gate-key linkage) and costClass: 'code' | 'model'.

Code-class: exact(), contains(), regex({ pattern }), levenshtein(), jsonValid(), jsonDiff(), retrieval.hitRateAtK(k), retrieval.recallAtK(k), retrieval.precisionAtK(k), retrieval.mrr(), retrieval.ndcg(k?) (the retrieval.* family reads expected: { sources: [{ sourceId, chunkId? }] }).
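As an illustration of the scorer contract, the sketch below writes two retrieval metrics as plain ({ input, output, expected }) => Score functions. It is not the built-in implementation: the output shape (an ordered array of hits) and the score names are assumptions, and a chunkId is only compared when the expectation specifies one.

```typescript
interface Score { name: string; score: number | null }
interface SourceRef { sourceId: string; chunkId?: string }

// Assumed shapes: output is the ranked hit list, expected carries the relevant sources.
interface RetrievalArgs {
  input: unknown
  output: SourceRef[]
  expected?: { sources: SourceRef[] }
}

const matches = (hit: SourceRef, want: SourceRef): boolean =>
  hit.sourceId === want.sourceId && (want.chunkId === undefined || hit.chunkId === want.chunkId)

// 1 when any expected source appears in the top k hits, else 0; null when nothing is expected.
function hitRateAtK(k: number) {
  const name = `hit_rate_at_${k}` // hypothetical score name
  return ({ output, expected }: RetrievalArgs): Score => {
    if (!expected || expected.sources.length === 0) return { name, score: null }
    const top = output.slice(0, k)
    const hit = expected.sources.some((want) => top.some((h) => matches(h, want)))
    return { name, score: hit ? 1 : 0 }
  }
}

// Per-case reciprocal rank of the first relevant hit; the mean over cases is the MRR.
function mrr() {
  return ({ output, expected }: RetrievalArgs): Score => {
    if (!expected || expected.sources.length === 0) return { name: 'mrr', score: null }
    const sources = expected.sources
    const rank = output.findIndex((h) => sources.some((want) => matches(h, want)))
    return { name: 'mrr', score: rank === -1 ? 0 : 1 / (rank + 1) }
  }
}
```

Returning score: null for a case with no expectation keeps it out of the mean instead of counting it as a zero.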

Model-backed scorers require explicit eval-local bindings when they spend tokens:

scorers.judge({
  name: 'helpful', // required, literal-typed → gate keys
  rubric: '…', // free-rubric grading 0–1, or:
  choiceScores: { good: 1, bad: 0 }, // classification with mapped scores
  select: (o) => o.answer, // REQUIRED when output isn't a string
  generate: qualityRuntime.generate,
  model: qualityRuntime.judgeModel,
  useCoT: true, // default true; rationale → metadata.rationale
})
scorers.embeddingSimilarity({ embed: embedText }) // cosine vs expected
scorers.rag.faithfulness({ generate: qualityRuntime.generate, model: qualityRuntime.judgeModel })
scorers.rag.answerRelevancy({ generate: qualityRuntime.generate, model: qualityRuntime.judgeModel })
scorers.rag.contextPrecision({ generate: qualityRuntime.generate, model: qualityRuntime.judgeModel })
scorers.rag.contextRecall({ generate: qualityRuntime.generate, model: qualityRuntime.judgeModel })

Via the factory spelling (scorers: (s) => […]) the same library arrives pre-bound to the evaluation's generics — judge.select is contextually typed against the output. The rag.* scorers read retrieved context from captured retrieval signals (fallback: input.context) and return score: null when none exists.
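The "cosine vs expected" behavior of an embedding scorer, with an explicitly bound embed function, might look like the sketch below. The score name, the string-typed output, and the clamp of negative similarity to 0 are assumptions, not documented behavior.

```typescript
interface Score { name: string; score: number | null }
type Embed = (text: string) => Promise<number[]>

// Cosine similarity of two equal-length vectors; 0 when either vector has zero norm.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('embedding dimensions differ')
  let dot = 0
  let normA = 0
  let normB = 0
  a.forEach((x, i) => {
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  })
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// The embed function is bound by the caller, so token spend stays visible in eval code.
function embeddingSimilarity(opts: { embed: Embed }) {
  const name = 'embedding_similarity' // hypothetical score name
  return async (args: { output: string; expected?: string }): Promise<Score> => {
    if (args.expected === undefined) return { name, score: null } // not applicable
    const [out, exp] = await Promise.all([opts.embed(args.output), opts.embed(args.expected)])
    return { name, score: Math.max(0, cosine(out, exp)) } // assumption: clamp to 0..1
  }
}
```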

Gates

gates: {
  passRate?: { min: number }                       // over the per-cell `pass` score
  scores?: { [scorerName]: { min?, max?, minDeltaVsBaseline? } }
  latency?: { p95Ms?, meanMs? }
  cost?: { maxPerCaseUsd?, maxTotalUsd? }
  consistency?: { passAtK?, passAllTrials? }       // with trials > 1
}

Zero-config default (no gates): assertions gate, scores inform; the run exits non-zero iff any cell errored or any expect failed. Declaring any gate replaces the default policy. Gates are fail-safe (errored cells always fail), are evaluated per non-baseline variant, and are demoted to informational on filtered runs. A minDeltaVsBaseline gate with no reference yet reports passed: false, informational: true, so it stays visible but never blocks.
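One way to read the score-gate rules is the sketch below. It is an interpretation under stated assumptions: VariantStats and evaluateScoreGate are hypothetical names, the gate is checked against the variant's mean score, and min/max are checked before the baseline delta.

```typescript
interface ScoreGate { min?: number; max?: number; minDeltaVsBaseline?: number }
interface GateResult { gate: string; passed: boolean; informational: boolean }

// Hypothetical per-variant inputs to a gate decision.
interface VariantStats {
  mean: number
  baselineMean?: number // undefined when no reference exists yet
  anyCellErrored?: boolean
}

function evaluateScoreGate(scorerName: string, gate: ScoreGate, stats: VariantStats): GateResult {
  const name = `scores.${scorerName}`
  const blocking = (passed: boolean): GateResult => ({ gate: name, passed, informational: false })

  // Errored cells always fail the gate.
  if (stats.anyCellErrored) return blocking(false)
  if (gate.min !== undefined && stats.mean < gate.min) return blocking(false)
  if (gate.max !== undefined && stats.mean > gate.max) return blocking(false)

  if (gate.minDeltaVsBaseline !== undefined) {
    // No reference yet: report the miss, but do not block the run.
    if (stats.baselineMean === undefined) return { gate: name, passed: false, informational: true }
    if (stats.mean - stats.baselineMean < gate.minDeltaVsBaseline) return blocking(false)
  }
  return blocking(true)
}
```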

`dataset(path, { input, expected? })`

Schema-validated file dataset — JSON, JSONL, or CSV by extension, validated with any Standard Schema library (zod, valibot, arktype). Resolved lazily by the runner; validation failure is a definition error. Rows are pure data (no expect callbacks).

`cassette(name, opts?)` and replay

type ReplayMode = 'live' | 'record-new' | 'replay-strict' | 'refresh'
cassette('support-agent', { mode?, match? })   // match: override the normalized key (advanced)

Interception at the ExecutorSpec/SdkGateway boundary; entry key = hash of (call kind, target id, prompt hash, model, canonicalized settings, tool-schema hash, input) — volatile fields excluded, so one cassette covers all variants and judge calls. Storage: .crux/quality/cassettes/<name>.json with { recordedAt, sdkVersion, models } metadata and a staleness warning past 90 days. Redaction applies at write time, always. Mode resolution: --replay → cassette() mode → evaluation declaration → quality.defaults.replay → live. Trials collapse to 1 under replay-strict; a strict miss fails the cell closed with the missing key. See the Replay guide.
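The mode-resolution chain and the trial collapse can be sketched as a first-defined-wins lookup. ReplaySources and its field names are hypothetical labels for the five sources listed above.

```typescript
type ReplayMode = 'live' | 'record-new' | 'replay-strict' | 'refresh'

// Hypothetical container for the places a mode can come from, highest precedence first.
interface ReplaySources {
  cliFlag?: ReplayMode          // --replay
  cassetteMode?: ReplayMode     // cassette(name, { mode })
  evaluationMode?: ReplayMode   // evaluate({ replay })
  configDefault?: ReplayMode    // quality.defaults.replay
}

// First defined source wins; 'live' is the final fallback.
function resolveReplayMode(s: ReplaySources): ReplayMode {
  return s.cliFlag ?? s.cassetteMode ?? s.evaluationMode ?? s.configDefault ?? 'live'
}

// Strict replay is deterministic, so repeated trials add nothing and collapse to one.
function effectiveTrials(mode: ReplayMode, requestedTrials: number): number {
  return mode === 'replay-strict' ? 1 : requestedTrials
}
```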

`Evaluation` and `Experiment`

interface Evaluation {
  readonly _tag: 'CruxEvaluation' // discovery discriminant
  readonly id: string | undefined // explicit, or undefined → path-derived at collect
  readonly manifest: EvaluationManifest // structural facts, computed WITHOUT executing the task
  run(overrides?: RunOverrides): Promise<Experiment>
}

interface RunOverrides {
  variants?: string[] // typed to the declared variant names
  cases?: string[] // names/ids, `*` glob
  trials?: number
  replayMode?: ReplayMode
  reuseOutputs?: boolean // re-score cached outputs without executing (judge iteration)
  signal?: AbortSignal
  concurrency?: number
}

Experiment is the persisted result: cases (one ExperimentCell per case × variant × trial, each linking devtools traceIds), per-variant aggregates (mean ± SEM per score, pass@k/pass^k), comparison (paired deltas vs the baseline), gates, passed, and fingerprints (configFingerprint, taskFingerprint). Experiment.promote({ id?, variant? }) → { baselineId, path } writes the committed Baseline record (explicit id required; filtered runs refused).
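For readers unfamiliar with the aggregate notation, the sketch below computes mean ± SEM and the two consistency rates. It assumes the usual definitions, which this reference does not spell out: SEM as sample standard deviation (n − 1 denominator) over √n, pass@k as "at least one trial passed", and pass^k as "every trial passed". Skipped scores (null) are left out of the statistics.

```typescript
interface Aggregate { mean: number; sem: number | null; n: number }

// Mean and standard error over the non-null scores of one scorer in one variant.
function meanAndSem(scores: Array<number | null>): Aggregate | null {
  const xs = scores.filter((s): s is number => s !== null)
  const n = xs.length
  if (n === 0) return null // nothing applicable to aggregate
  const mean = xs.reduce((sum, x) => sum + x, 0) / n
  if (n < 2) return { mean, sem: null, n } // SEM is undefined for a single sample
  const variance = xs.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1)
  return { mean, sem: Math.sqrt(variance / n), n }
}

// Each inner array holds the pass/fail results of one case's trials.
function passAtK(trialsByCase: boolean[][]): number {
  if (trialsByCase.length === 0) return 0
  return trialsByCase.filter((trials) => trials.some(Boolean)).length / trialsByCase.length
}

function passAllK(trialsByCase: boolean[][]): number {
  if (trialsByCase.length === 0) return 0
  return trialsByCase.filter((trials) => trials.length > 0 && trials.every(Boolean)).length / trialsByCase.length
}
```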

Records are versioned JSON contracts (schemaVersion, additive-only within a major): the Experiment record (.crux/quality/experiments/<ulid>.json, gitignored, also --json), the Evaluation manifest (crux quality list --json, devtools pre-run rendering; includes covers when explicit Project Index coverage targets are declared), and the Baseline record (.crux/quality/baselines/<evaluationId>.json, committed — how CI knows the reference; configFingerprint drift demotes comparisons to informational). Manifest + Experiment record are the complete agent-facing contract: list evaluations, run one, read the assertion outcome ledger or legacy failures with sourceRef + traceIds, edit, rerun — no parsing of human-oriented output.

Identity and discovery

Derived id = POSIX relative path from the quality root, eval suffix stripped, / → . — evals/support/refunds.eval.ts → evals.support.refunds; multi-export files append #<exportName> (default export appends nothing).
Duplicate ids across the project are a collect-time definition error.
Promotion requires an explicit id: the CLI prints the one-line pin (--pin-id in CI) and never rewrites source; the drift guard errors a later run whose id no longer matches a promoted baseline.
Colocated prompt({ tests }) lowers to an evaluation with id prompt:<promptId> (source 'prompt-tests' in the manifest; output-schema validation gates).
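The derivation rule for path-based ids can be sketched as below. deriveEvaluationId and findDuplicateIds are hypothetical helpers; only the .eval.ts suffix from the example is handled, and normalizing backslashes to POSIX separators is an assumption.

```typescript
// Path relative to the quality root → dotted id, with "#<exportName>" for named exports.
function deriveEvaluationId(relativePath: string, exportName: string = 'default'): string {
  const id = relativePath
    .replace(/\\/g, '/')          // assumption: normalize Windows separators first
    .replace(/\.eval\.ts$/, '')   // strip the eval suffix
    .split('/')
    .filter((segment) => segment.length > 0)
    .join('.')
  return exportName === 'default' ? id : `${id}#${exportName}`
}

// Ids that occur more than once; a non-empty result would be a collect-time definition error.
function findDuplicateIds(ids: string[]): string[] {
  const seen = new Set<string>()
  const duplicates = new Set<string>()
  for (const id of ids) {
    if (seen.has(id)) duplicates.add(id)
    else seen.add(id)
  }
  return Array.from(duplicates)
}
```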

`QualityConfig` (`crux.config.ts`)

quality: {
  id?: string                    // workbench id; default: package name
  dir?: string                   // persistence root; default '.crux/quality'
  include?: string | string[]    // discovery globs; default ['evals/**/*.eval.ts', '**/*.eval.ts']
  exclude?: string | string[]
  redact?: string[]              // value-relative cell paths; feedback paths are root-qualified
  defaults?: { trials?, concurrency?, timeoutMs?, replay? }
}

For evaluation cells, redact paths are relative to each snapshot value: customer.email redacts that field from persisted input, output, expected values, assertion values, and cassettes when present. Feedback records contain separate payload roots, so use root-qualified paths such as metadata.customer.email, expected.answer.privateNote, or proposal.statement. Authorization and API-key-style fields are redacted at every depth even when redact is empty.
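A sketch of value-relative redaction, applied separately to each persisted snapshot (input, output, expected, and so on). The placeholder string, the always-redacted key pattern, and the choice to walk through arrays without an index segment are all assumptions made for illustration.

```typescript
const REDACTED = '[REDACTED]' // hypothetical placeholder

// Assumption: what counts as an "Authorization and API-key-style" field name.
const ALWAYS_REDACTED = /^(authorization|api[-_]?key)$/i

// Returns a redacted copy; paths are dotted and relative to the value being redacted.
function redact(value: unknown, paths: string[], prefix: string = ''): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item, paths, prefix))
  if (value === null || typeof value !== 'object') return value
  const out: Record<string, unknown> = {}
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    const path = prefix ? `${prefix}.${key}` : key
    out[key] = ALWAYS_REDACTED.test(key) || paths.includes(path)
      ? REDACTED
      : redact(child, paths, path)
  }
  return out
}
```

Because the same path list runs against each snapshot independently, customer.email is removed wherever a snapshot happens to contain it, without naming the snapshot root.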

Inference utilities

InputOf<T> / OutputOf<T> / ParamsOf<T> / CapsOf<T> / ExpectedOf<T> extract an evaluation-relevant type from any task. CaseOf<T, TExpected?> is the documented one-annotation escape hatch for extracted case arrays:

export const cases = [
  /* … */
] satisfies CaseOf<typeof supportPrompt>[]