Assertions
Evaluation-level expect callbacks, Vitest-style matchers, and trace-backed signal assertions that fail honestly.
Level 2 adds the expect: callback, an assertion hook that runs for every case.
It receives the case's full context and a bound, capability-typed expect
surface:
export default evaluate({
  task: supportPrompt,
  data: [{ input: { question: 'How do refunds work?', locale: 'en' }, expected: { mustMention: 'refund' } }],
  expect: (ctx) => {
    ctx.expect(ctx.output.answer).toContain(String(ctx.expected?.mustMention ?? ''))
    ctx.expect(ctx.output.confidence).toBeGreaterThanOrEqual(0.5)
    ctx.expect.latency.toBeUnderMs(5000)
    ctx.expect.safety.toHavePassedGuardrails()
  },
})
The evaluation-level callback is the primary home for assertions: in practice,
about 90% of per-case assertions repeat the same structure, so write them
once here. Cases may also carry their own expect for genuinely
case-specific checks (the evaluation-level callback runs first; both report
independently). Cases with callbacks are not portable to datasets, so keep
case-level expect the exception.
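The ordering and independent reporting described above can be pictured with a small self-contained model. This is only a sketch under stated assumptions: runExpectCallbacks, CallbackReport, and the sample data are hypothetical names invented for illustration and are not part of the library's API.

```typescript
// Hypothetical model, not the library's runner: every name here is illustrative.
type Check<Ctx> = (ctx: Ctx) => void

interface CallbackReport {
  level: 'evaluation' | 'case'
  passed: boolean
  message?: string
}

// Run the shared callback first, then the optional case-level one.
// Each callback has its own try/catch, so one failure does not hide the other.
function runExpectCallbacks<Ctx>(
  ctx: Ctx,
  evaluationExpect: Check<Ctx>,
  caseExpect?: Check<Ctx>,
): CallbackReport[] {
  const reports: CallbackReport[] = []
  const levels: Array<[CallbackReport['level'], Check<Ctx> | undefined]> = [
    ['evaluation', evaluationExpect],
    ['case', caseExpect],
  ]
  for (const [level, check] of levels) {
    if (!check) continue
    try {
      check(ctx)
      reports.push({ level, passed: true })
    } catch (err) {
      reports.push({ level, passed: false, message: err instanceof Error ? err.message : String(err) })
    }
  }
  return reports
}

const sampleCtx = { output: { answer: 'We return your money within 14 days.' } }
const callbackReports = runExpectCallbacks(
  sampleCtx,
  // Shared check: fails for this answer because the word "refund" is absent.
  (ctx) => { if (!ctx.output.answer.includes('refund')) throw new Error('missing "refund"') },
  // Case-specific check: still runs and reports on its own.
  (ctx) => { if (!ctx.output.answer.includes('14 days')) throw new Error('missing "14 days"') },
)
```

In this model the evaluation-level report fails while the case-level report passes, which is the sense in which the two report independently.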
Use assert: when a check needs scorer results. It runs after all scorers,
receives the same ctx.expect matcher surface, and records outcomes with
phase: 'assert':
export default evaluate({
  task: supportPrompt,
  data: [{ input: { question: 'How do refunds work?', locale: 'en' } }],
  scorers: [
    scorers.judge({
      name: 'citation_valid',
      rubric: 'Are the cited claims supported?',
      select: (output) => output.answer,
    }),
  ],
  assert: (ctx) => {
    ctx.expect(ctx.score.citation_valid).toBeGreaterThanOrEqual(0.7)
  },
})
The case context
| Field | What it is |
|---|---|
| ctx.input / ctx.output | Typed from the task (zod schemas for prompts, signature for fns) |
| ctx.expected | The case's expected payload, or undefined; opaque data, never matched implicitly |
| ctx.expect | The bound assertion surface (below) |
| ctx.variant | { name, params } of the variant this cell ran under |
| ctx.trial | 0-based trial index |
| ctx.score(name, score, metadata?) | Ad-hoc per-case score (0–1); joins the same score model as scorers |
| ctx.step(name, schema?) | Flow/agent step access with Standard Schema narrowing |
| ctx.trace | { id, url }, the devtools deep link for this cell |
| ctx.meta | { durationMs, costUsd, usage } |
The post-score assert context
| Field | What it is |
|---|---|
| ctx.input / ctx.output | Same typed cell payload as expect |
| ctx.expected | The case's expected payload, or undefined |
| ctx.expect | Same bound matcher surface, recording phase: 'assert' |
| ctx.score | Readonly map of statically named scorer outputs |
| ctx.scores | All cell scores, including dynamic/ad-hoc scores and skipped null scores |
| ctx.variant / ctx.trial | Same cell coordinates as expect |
| ctx.step / ctx.trace | Same step access and trace link as expect |
| ctx.meta | { durationMs, costUsd, usage } |
Static scorer names are available as properties (ctx.score.citation_valid).
Scores produced by dynamic scorer functions or pre-score ctx.score(name, …)
remain visible in ctx.scores, where the runtime name can be inspected
explicitly. Numeric matchers retain a structured expression, so evidence
surfaces can show threshold checks like 0.58 >= 0.7 => false without parsing
failure messages.
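To make the idea of a retained structured expression concrete, here is a minimal sketch. The NumericExpression shape, its field names, and the compare and renderExpression helpers are assumptions made for illustration; the library's actual record schema is not shown in this document.

```typescript
// Hypothetical record shape; the field names are assumptions, not the library's schema.
type ComparisonOp = '>=' | '>' | '<=' | '<'

interface NumericExpression {
  actual: number
  op: ComparisonOp
  threshold: number
  result: boolean
}

// Evaluate the comparison once and keep every operand alongside the verdict.
function compare(actual: number, op: ComparisonOp, threshold: number): NumericExpression {
  const result =
    op === '>=' ? actual >= threshold :
    op === '>' ? actual > threshold :
    op === '<=' ? actual <= threshold :
    actual < threshold
  return { actual, op, threshold, result }
}

// An evidence surface can render from the fields directly, with no message parsing.
function renderExpression(e: NumericExpression): string {
  return `${e.actual} ${e.op} ${e.threshold} => ${e.result}`
}

const citationCheck = compare(0.58, '>=', 0.7)
const renderedCheck = renderExpression(citationCheck) // "0.58 >= 0.7 => false"
```

Because the operands survive as data, a UI could also sort or filter checks by how far the actual value sits from the threshold.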
Value matchers
ctx.expect(value) returns a Vitest-compatible matcher set: toBe,
toEqual, toStrictEqual, toMatch, toMatchObject, toContain,
toContainEqual, toHaveLength, toHaveProperty, toBeGreaterThan(OrEqual),
toBeLessThan(OrEqual), toBeCloseTo, toBeDefined, toBeUndefined,
toBeNull, toBeTruthy, toBeFalsy, toBeOneOf, toBeInstanceOf,
toBeTypeOf, toSatisfy, and .not.
Assertions throw on failure (Vitest semantics — later lines may rely on
earlier assertions having passed). The engine records every assertion that
ran, so reports show position: 3 ran · 2 not evaluated. New experiment
records also include assertions.outcomes, an ordered ledger of passed,
failed, uncaptured, and not-evaluated assertions with matcher metadata,
source refs, retained matcher messages, and actual/expected values where
available. When the runner can resolve authored source, outcomes can also
include a narrow sourceFrame snapshot and subjectExpr, the authored
argument passed to ctx.expect(...) or ctx.expect.soft(...). Generated-only
or missing source is reported as kind: 'unavailable' rather than shown as if
it were authored code. The older assertions.failures array remains available
for compatibility. If a helper throws a plain Error before a matcher records
an outcome, the cell is still marked errored, but local evidence records a
best-effort error.sourceRef and can resolve it to the callback crash line.
Trace-backed signal matchers can also attach spanIds to an outcome when the
matcher inspected concrete observability spans. The local evidence view treats
those span IDs as exact hotspots; score-threshold failures without exact spans
fall back to labeled heuristic hotspots such as the scorer span or root span.
The local backend also exposes a joined QualityCellEvidence record for one
case × variant × trial cell. That record contains the ordered assertion
ledger, normalized checks, authored source context, curated values at the
check, baseline comparison evidence, and trace hotspots so web devtools and
the TUI do not have to rebuild evidence from raw experiment files. Score gates
and score assertion expressions are exposed as score-threshold checks with
messages such as 0.58 is below the 0.70 floor.
When crux quality run is connected to a running devtools server, the runner
also posts the cell's full observability graph before it exits. The experiment
cell's traceIds then open in the Runs detail view. Direct
evaluation.run() calls do the same when CRUX_DEVTOOLS_URL or DEVTOOLS_URL is
set, or when a local server at localhost:4400 is reachable, which keeps
Vitest-generated experiments connected to their root run detail. If a run was
produced offline, or trace retention has expired, cell evidence still renders
from the experiment record and the trace detail degrades as unavailable. A
retained root trace can legitimately have no child spans when the cell only ran
plain callback assertions; the full trace link remains the retention signal.
To record a failure and keep going, use soft:
expect: (ctx) => {
  ctx.expect.soft(ctx.output.answer).toContain('refund')
  ctx.expect.soft(ctx.output.answer).toContain('14 days')
  ctx.score('answer-length', Math.min(1, ctx.output.answer.length / 200))
},
All assertion outcomes lower into a per-cell pass score (1 if all passed,
0 if any failed), which the default gate policy and passRate gates read.
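The difference between throwing and soft assertions, and the way outcomes lower into a single pass score, can be sketched with a toy ledger. AssertionLedger, runCallback, and the sample answer are hypothetical and exist only for this illustration; the real engine records far richer outcomes.

```typescript
// Hypothetical model of hard vs. soft assertions; not the library's engine.
interface Outcome {
  label: string
  passed: boolean
  soft: boolean
}

class AssertionLedger {
  readonly outcomes: Outcome[] = []

  // Hard assertion: record the outcome, then throw on failure so later lines do not run.
  hard(label: string, condition: boolean): void {
    this.outcomes.push({ label, passed: condition, soft: false })
    if (!condition) throw new Error(`assertion failed: ${label}`)
  }

  // Soft assertion: record the outcome and keep going either way.
  soft(label: string, condition: boolean): void {
    this.outcomes.push({ label, passed: condition, soft: true })
  }

  // Lower all recorded outcomes into one score: 1 if all passed, 0 if any failed.
  passScore(): 0 | 1 {
    return this.outcomes.every((o) => o.passed) ? 1 : 0
  }
}

function runCallback(ledger: AssertionLedger, callback: (l: AssertionLedger) => void): void {
  try {
    callback(ledger)
  } catch {
    // A hard failure stops the callback; its outcome is already in the ledger.
  }
}

const refundAnswer = 'Refunds take 30 days.'

const softLedger = new AssertionLedger()
runCallback(softLedger, (l) => {
  l.soft('mentions refund', refundAnswer.toLowerCase().includes('refund'))
  l.soft('mentions 14 days', refundAnswer.includes('14 days')) // fails, execution continues
  l.soft('is non-empty', refundAnswer.length > 0)
})

const hardLedger = new AssertionLedger()
runCallback(hardLedger, (l) => {
  l.hard('mentions 14 days', refundAnswer.includes('14 days')) // fails and throws
  l.hard('is non-empty', refundAnswer.length > 0) // never runs
})
```

Both ledgers lower to a pass score of 0, but the soft ledger holds three outcomes while the hard ledger holds only the one that ran before the throw.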
Signal namespaces — trace-backed, honest
The differentiator: assertions over what the execution actually did, read from the observability trace — not guessed from output shapes.
Three namespaces always exist:
ctx.expect.latency.toBeUnderMs(5000)
ctx.expect.cost.toBeUnderUsd(0.05)
ctx.expect.errors.toHaveNone()
The rest are capability-typed: they exist at compile time only when the task kind captures that signal family.
| Task kind | Capabilities |
|---|---|
| prompt | modelCalls, citations, safety |
| flow | modelCalls, steps, toolCalls, routing, safety, memory |
| agent | all nine: modelCalls, citations, safety, steps, toolCalls, routing, memory, handoffs, retrieval |
| retriever | retrieval |
| plain fn | none |
So ctx.expect.toolCalls autocompletes on an agent task and is a compile
error on a prompt task. Examples across the namespaces:
ctx.expect.toolCalls.toHaveCalled('lookupOrder', { orderId: '1234' })
ctx.expect.toolCalls.toMatchTrajectory('subset', [{ tool: 'lookupOrder' }])
ctx.expect.toolCalls.toHaveCalledBefore('search', 'write')
ctx.expect.steps.toHaveOrder('plan', 'write') // subsequence, not exhaustive
ctx.expect.handoffs.toHaveHandedOffTo('escalation-agent')
ctx.expect.retrieval.toContainHit({ sourceId: 'kb-refunds' })
ctx.expect.citations.toAllResolve()
ctx.expect.safety.toHaveBlocked('pii-guardrail')
ctx.expect.memory.toHaveWritten('customer-sentiment')
ctx.expect.routing.toHaveSelectedModel('gpt-5-mini')
ctx.expect.modelCalls.count().toBeLessThanOrEqual(3)
For trajectory assertions, prefer the outcome-first modes ('subset',
'superset') over 'strict' and 'unordered'. Exact trajectories are brittle, so
reserve 'strict' for hard protocol checks.
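The brittleness of exact matching is easy to see in a standalone sketch. The helpers below are illustrative only: matchesExactly stands in for an exact-sequence comparison, and isSubsequence models the "subsequence, not exhaustive" behavior noted for steps.toHaveOrder. Neither is the library's implementation, and the precise semantics of the 'subset' and 'superset' modes are not assumed here.

```typescript
// Illustrative helpers; exact equality stands in for a strict mode and is an assumption.
function matchesExactly(actual: string[], expected: string[]): boolean {
  return actual.length === expected.length && actual.every((entry, i) => entry === expected[i])
}

// True when every expected name appears in `actual` in the same relative order,
// with any number of other entries in between.
function isSubsequence(actual: string[], expected: string[]): boolean {
  let next = 0
  for (const entry of actual) {
    if (next < expected.length && entry === expected[next]) next++
  }
  return next === expected.length
}

const expectedTools = ['search', 'write']
const firstRun = ['search', 'write']
const laterRun = ['search', 'rerank', 'write'] // the agent added a harmless extra call

const exactFirst = matchesExactly(firstRun, expectedTools) // true
const exactLater = matchesExactly(laterRun, expectedTools) // false: the extra call breaks it
const orderedLater = isSubsequence(laterRun, expectedTools) // true: order still holds
```

The looser check keeps passing when the agent changes how it reaches the outcome, and still fails when the required order is violated.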
Honest failure semantics: asserting on a signal that was not captured in
this execution throws an UncapturedSignalError naming the signal and which
task kinds capture it — never a vacuous pass. The type system prevents most
of these at compile time; the runtime backstop covers variant task
substitution and conditional capture.
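The two layers described above, a compile-time capability check plus a runtime backstop, can be approximated in plain TypeScript. This is a sketch under assumptions: CAPABILITIES mirrors the task-kind table, while readSignal, kindsCapturing, TraceSignals, and the error's fields are hypothetical names. Only the error name UncapturedSignalError comes from the text, and its real shape may differ.

```typescript
// Hypothetical model of the capability table; the library's real types may differ.
const CAPABILITIES = {
  prompt: ['modelCalls', 'citations', 'safety'],
  flow: ['modelCalls', 'steps', 'toolCalls', 'routing', 'safety', 'memory'],
  agent: ['modelCalls', 'citations', 'safety', 'steps', 'toolCalls', 'routing', 'memory', 'handoffs', 'retrieval'],
  retriever: ['retrieval'],
  fn: [],
} as const

type TaskKind = keyof typeof CAPABILITIES
type SignalOf<K extends TaskKind> = (typeof CAPABILITIES)[K][number]

// Signals found in one execution's trace; a missing key means "not captured".
type TraceSignals = { [signal: string]: unknown[] | undefined }

class UncapturedSignalError extends Error {
  readonly signal: string
  readonly requestedBy: TaskKind
  readonly capturedBy: TaskKind[]
  constructor(signal: string, requestedBy: TaskKind, capturedBy: TaskKind[]) {
    super(`signal "${signal}" was not captured in this ${requestedBy} execution; captured by: ${capturedBy.join(', ')}`)
    this.name = 'UncapturedSignalError'
    this.signal = signal
    this.requestedBy = requestedBy
    this.capturedBy = capturedBy
    Object.setPrototypeOf(this, UncapturedSignalError.prototype) // keeps instanceof working on ES5 targets
  }
}

function kindsCapturing(signal: string): TaskKind[] {
  return (Object.keys(CAPABILITIES) as TaskKind[]).filter((kind) =>
    (CAPABILITIES[kind] as readonly string[]).includes(signal),
  )
}

// Compile time: `signal` must be a capability of task kind K.
// Run time: throw instead of passing vacuously when the trace lacks the signal.
function readSignal<K extends TaskKind>(kind: K, signal: SignalOf<K>, trace: TraceSignals): unknown[] {
  const signalName: string = signal
  const spans = trace[signalName]
  if (spans === undefined) throw new UncapturedSignalError(signalName, kind, kindsCapturing(signalName))
  return spans
}

// readSignal('prompt', 'toolCalls', {}) would not type-check: prompts do not capture toolCalls.
const partialTrace: TraceSignals = { modelCalls: [{ model: 'example-model' }] }
let backstopError: UncapturedSignalError | undefined
try {
  readSignal('agent', 'toolCalls', partialTrace) // type-checks, but this trace has no toolCalls
} catch (err) {
  if (err instanceof UncapturedSignalError) backstopError = err
}
```

The generic parameter handles the common case at compile time, and the thrown error covers executions where a signal that is valid for the task kind was simply never recorded.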
Step access
For flows and agents, ctx.step() reads a named step's result. Step outputs
are unknown until you narrow them with a Standard Schema (step names and
outputs are created imperatively inside the flow handler, so they cannot be
statically typed):
expect: (ctx) => {
  ctx.expect.steps.toHaveSucceeded('write')
  const write = ctx.step('write', z.object({ summary: z.string() }))
  ctx.expect(write.output.summary.length).toBeGreaterThan(0)
},
ctx.step(name) without a schema returns { output: unknown, status, durationMs }.
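How schema narrowing turns an unknown step output into a typed one can be sketched without any dependency. The interface below restates only the synchronous part of the Standard Schema v1 contract (the '~standard' property with a validate function). narrowStep, StepResult, the status values, and the hand-written summarySchema are hypothetical stand-ins, not the library's ctx.step.

```typescript
// Synchronous subset of the Standard Schema v1 contract, restated locally for the sketch.
interface StandardSchemaLike<Output> {
  readonly '~standard': {
    readonly version: 1
    readonly vendor: string
    readonly validate: (value: unknown) =>
      | { value: Output; issues?: undefined }
      | { issues: ReadonlyArray<{ message: string }> }
  }
}

// Assumed step shape; the status values are a guess made for this example.
interface StepResult<T = unknown> {
  output: T
  status: 'succeeded' | 'failed'
  durationMs: number
}

// Hypothetical stand-in for ctx.step(name, schema): validate, then return a typed copy.
function narrowStep<Output>(step: StepResult, schema: StandardSchemaLike<Output>): StepResult<Output> {
  const result = schema['~standard'].validate(step.output)
  if (result.issues) {
    throw new Error(`step output failed validation: ${result.issues.map((i) => i.message).join('; ')}`)
  }
  return { ...step, output: result.value }
}

// Hand-written schema playing the role of z.object({ summary: z.string() }).
const summarySchema: StandardSchemaLike<{ summary: string }> = {
  '~standard': {
    version: 1,
    vendor: 'example',
    validate: (value) => {
      if (typeof value === 'object' && value !== null && typeof (value as { summary?: unknown }).summary === 'string') {
        return { value: { summary: (value as { summary: string }).summary } }
      }
      return { issues: [{ message: 'expected { summary: string }' }] }
    },
  },
}

const rawStep: StepResult = { output: { summary: 'Refund issued.' }, status: 'succeeded', durationMs: 120 }
const writeStep = narrowStep(rawStep, summarySchema)
const summaryLength = writeStep.output.summary.length // typed access, no cast needed
```

Validation failures throw here, so an assertion that follows the narrowing call never runs against a malformed step output.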
Next
Scorers & gates — graded measures on top of binary assertions.