Assertions

Evaluation-level expect callbacks, Vitest-style matchers, and trace-backed signal assertions that fail honestly.

Level 2 adds expect: — an assertion callback that runs for every case. It receives the case's full context and a bound, capability-typed expect surface:

export default evaluate({
  task: supportPrompt,
  data: [{ input: { question: 'How do refunds work?', locale: 'en' }, expected: { mustMention: 'refund' } }],
  expect: (ctx) => {
    ctx.expect(ctx.output.answer).toContain(String(ctx.expected?.mustMention ?? ''))
    ctx.expect(ctx.output.confidence).toBeGreaterThanOrEqual(0.5)
    ctx.expect.latency.toBeUnderMs(5000)
    ctx.expect.safety.toHavePassedGuardrails()
  },
})

The evaluation-level callback is the primary assertion home — in practice ~90% of per-case assertions are the same structure repeated, so write them once here. Cases may also carry their own expect for genuinely case-specific checks (evaluation-level runs first; both report independently). Cases with callbacks are not portable to datasets — keep case expect the exception.

Use assert: when a check needs scorer results. It runs after all scorers, receives the same ctx.expect matcher surface, and records outcomes with phase: 'assert':

export default evaluate({
  task: supportPrompt,
  data: [{ input: { question: 'How do refunds work?', locale: 'en' } }],
  scorers: [
    scorers.judge({
      name: 'citation_valid',
      rubric: 'Are the cited claims supported?',
      select: (output) => output.answer,
    }),
  ],
  assert: (ctx) => {
    ctx.expect(ctx.score.citation_valid).toBeGreaterThanOrEqual(0.7)
  },
})

The case context

Field	What it is
`ctx.input` / `ctx.output`	Typed from the task (zod schemas for prompts, signature for fns)
`ctx.expected`	The case's `expected` payload, or `undefined` — opaque data, never matched implicitly
`ctx.expect`	The bound assertion surface (below)
`ctx.variant`	`{ name, params }` of the variant this cell ran under
`ctx.trial`	0-based trial index
`ctx.score(name, score, metadata?)`	Ad-hoc per-case score (0–1) — joins the same score model as scorers
`ctx.step(name, schema?)`	Flow/agent step access with Standard Schema narrowing
`ctx.trace`	`{ id, url }` — the devtools deep link for this cell
`ctx.meta`	`{ durationMs, costUsd, usage }`

The post-score assert context

Field	What it is
`ctx.input` / `ctx.output`	Same typed cell payload as `expect`
`ctx.expected`	The case's `expected` payload, or `undefined`
`ctx.expect`	Same bound matcher surface, recording `phase: 'assert'`
`ctx.score`	Readonly map of statically named scorer outputs
`ctx.scores`	All cell scores, including dynamic/ad-hoc scores and skipped `null` scores
`ctx.variant` / `ctx.trial`	Same cell coordinates as `expect`
`ctx.step` / `ctx.trace`	Same step access and trace link as `expect`
`ctx.meta`	`{ durationMs, costUsd, usage }`

Static scorer names are available as properties (ctx.score.citation_valid). Scores produced by dynamic scorer functions or pre-score ctx.score(name, …) remain visible in ctx.scores, where the runtime name can be inspected explicitly. Numeric matchers retain a structured expression, so evidence surfaces can show threshold checks like 0.58 >= 0.7 => false without parsing failure messages.
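
As a rough illustration of what "retaining a structured expression" could look like, the sketch below keeps the operands and operator of a threshold check next to its result, so a viewer can render the check directly. The `NumericExpression` shape and the `compare` and `render` helpers are hypothetical names invented for this example; they are not the library's actual types.

```typescript
// Hypothetical shapes for illustration only; not the library's real types.
type ComparisonOp = '>=' | '>' | '<=' | '<'

interface NumericExpression {
  actual: number
  op: ComparisonOp
  expected: number
  passed: boolean
}

// Evaluate a comparison and keep its parts instead of only a message string.
function compare(actual: number, op: ComparisonOp, expected: number): NumericExpression {
  const passed =
    op === '>=' ? actual >= expected
    : op === '>' ? actual > expected
    : op === '<=' ? actual <= expected
    : actual < expected
  return { actual, op, expected, passed }
}

// Render the expression for an evidence view without parsing failure text.
function render(expr: NumericExpression): string {
  return `${expr.actual} ${expr.op} ${expr.expected} => ${expr.passed}`
}

const thresholdCheck = compare(0.58, '>=', 0.7)
console.log(render(thresholdCheck)) // 0.58 >= 0.7 => false
```

Keeping the parts separate means a report can sort or filter by margin (`expected - actual`) without any string handling.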

Value matchers

ctx.expect(value) returns a Vitest-compatible matcher set: toBe, toEqual, toStrictEqual, toMatch, toMatchObject, toContain, toContainEqual, toHaveLength, toHaveProperty, toBeGreaterThan(OrEqual), toBeLessThan(OrEqual), toBeCloseTo, toBeDefined, toBeUndefined, toBeNull, toBeTruthy, toBeFalsy, toBeOneOf, toBeInstanceOf, toBeTypeOf, toSatisfy, and .not.

Assertions throw on failure (Vitest semantics: later lines may rely on earlier assertions having passed). The engine records every assertion that ran, so reports show position: 3 ran · 2 not evaluated. To record a failure and keep going, use soft:

expect: (ctx) => {
  ctx.expect.soft(ctx.output.answer).toContain('refund')
  ctx.expect.soft(ctx.output.answer).toContain('14 days')
  ctx.score('answer-length', Math.min(1, ctx.output.answer.length / 200))
},

New experiment records also include assertions.outcomes, an ordered ledger of passed, failed, uncaptured, and not-evaluated assertions with matcher metadata, source refs, retained matcher messages, and actual/expected values where available. When the runner can resolve authored source, outcomes can also include a narrow sourceFrame snapshot and subjectExpr, the authored argument passed to ctx.expect(...) or ctx.expect.soft(...). Generated-only or missing source is reported as kind: 'unavailable' rather than shown as if it were authored code. The older assertions.failures array remains available for compatibility.

If a helper throws a plain Error before a matcher records an outcome, the cell is still marked errored, but local evidence records a best-effort error.sourceRef and can resolve it to the callback crash line.

Trace-backed signal matchers can also attach spanIds to an outcome when the matcher inspected concrete observability spans. The local evidence view treats those span IDs as exact hotspots; score-threshold failures without exact spans fall back to labeled heuristic hotspots such as the scorer span or root span.

The local backend also exposes a joined QualityCellEvidence record for one case × variant × trial cell. That record contains the ordered assertion ledger, normalized checks, authored source context, curated values at the check, baseline comparison evidence, and trace hotspots so web devtools and the TUI do not have to rebuild evidence from raw experiment files. Score gates and score assertion expressions are exposed as score-threshold checks with messages such as 0.58 is below the 0.70 floor.

When crux quality run is connected to a running devtools server, the runner also posts the cell's full observability graph before it exits. The experiment cell's traceIds then open in the Runs detail view. Direct evaluation.run() calls do the same when CRUX_DEVTOOLS_URL, DEVTOOLS_URL, or a reachable local localhost:4400 server is available, which keeps Vitest-generated experiments connected to their root run detail. If a run was produced offline, or trace retention has expired, cell evidence still renders from the experiment record and the trace detail degrades as unavailable. A retained root trace can legitimately have no child spans when the cell only ran plain callback assertions; the full trace link remains the retention signal.

All assertion outcomes lower into a per-cell pass score (1 if all passed, 0 if any failed). The default gate policy and passRate gates read it.
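
The difference between throwing and soft assertions, and the lowering to a single pass score, can be sketched with a toy recorder. Everything here (`Recorder`, `runCell`, the `Outcome` shape, the `AssertionFailure` error name) is a hypothetical stand-in written for this example. It assumes a hard failure stops the callback while a soft failure only records an outcome, and it does not model the not-evaluated count.

```typescript
// Hypothetical recorder for illustration; the real engine's types are not shown on this page.
interface Outcome {
  label: string
  passed: boolean
  soft: boolean
}

class Recorder {
  readonly outcomes: Outcome[] = []

  // Hard check: record the outcome, then throw on failure so later lines do not run.
  check(label: string, passed: boolean): void {
    this.outcomes.push({ label, passed, soft: false })
    if (!passed) {
      const err = new Error(`assertion failed: ${label}`)
      err.name = 'AssertionFailure'
      throw err
    }
  }

  // Soft check: record the outcome and keep going.
  soft(label: string, passed: boolean): void {
    this.outcomes.push({ label, passed, soft: true })
  }
}

// Run a callback, absorb assertion failures, and lower the outcomes to 1 or 0.
function runCell(callback: (r: Recorder) => void): { outcomes: Outcome[]; pass: 0 | 1 } {
  const recorder = new Recorder()
  try {
    callback(recorder)
  } catch (err) {
    // Anything that is not an assertion failure is a genuine crash; rethrow it.
    if (!(err instanceof Error) || err.name !== 'AssertionFailure') throw err
  }
  const pass = recorder.outcomes.every((o) => o.passed) ? 1 : 0
  return { outcomes: recorder.outcomes, pass }
}

const answer = 'Refunds are issued on request.'

const softCell = runCell((r) => {
  r.soft('mentions refund', /refund/i.test(answer))
  r.soft('mentions 14 days', /14 days/.test(answer))
  r.soft('is non-empty', answer.length > 0)
})

const hardCell = runCell((r) => {
  r.check('mentions 14 days', /14 days/.test(answer))
  r.check('is non-empty', answer.length > 0) // never reached
})

console.log(softCell.outcomes.length, softCell.pass) // 3 0
console.log(hardCell.outcomes.length, hardCell.pass) // 1 0
```

Both cells score 0, but the soft cell reports all three outcomes while the hard cell stops after the first failure.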

Signal namespaces — trace-backed, honest

The differentiator: assertions over what the execution actually did, read from the observability trace — not guessed from output shapes.

Three namespaces always exist:

ctx.expect.latency.toBeUnderMs(5000)
ctx.expect.cost.toBeUnderUsd(0.05)
ctx.expect.errors.toHaveNone()

The rest are capability-typed: they exist at compile time only when the task kind captures that signal family.

Task kind	Capabilities
prompt	`modelCalls`, `citations`, `safety`
flow	`modelCalls`, `steps`, `toolCalls`, `routing`, `safety`, `memory`
agent	all nine: the flow set plus `citations`, `handoffs`, `retrieval`
retriever	`retrieval`
plain fn	none

So ctx.expect.toolCalls autocompletes on an agent task and is a compile error on a prompt task. Examples across the namespaces:

ctx.expect.toolCalls.toHaveCalled('lookupOrder', { orderId: '1234' })
ctx.expect.toolCalls.toMatchTrajectory('subset', [{ tool: 'lookupOrder' }])
ctx.expect.toolCalls.toHaveCalledBefore('search', 'write')
ctx.expect.steps.toHaveOrder('plan', 'write') // subsequence, not exhaustive
ctx.expect.handoffs.toHaveHandedOffTo('escalation-agent')
ctx.expect.retrieval.toContainHit({ sourceId: 'kb-refunds' })
ctx.expect.citations.toAllResolve()
ctx.expect.safety.toHaveBlocked('pii-guardrail')
ctx.expect.memory.toHaveWritten('customer-sentiment')
ctx.expect.routing.toHaveSelectedModel('gpt-5-mini')
ctx.expect.modelCalls.count().toBeLessThanOrEqual(3)
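
The "subsequence, not exhaustive" comment on toHaveOrder can be read as the following check. This is a minimal standalone sketch of subsequence matching over step names, assuming steps are compared by exact name; `isSubsequence` is a hypothetical helper, not part of the matcher surface.

```typescript
// True when `expected` appears in `actual` in the same relative order,
// with any number of other steps allowed in between.
function isSubsequence(actual: readonly string[], expected: readonly string[]): boolean {
  let next = 0
  for (const step of actual) {
    if (next < expected.length && step === expected[next]) next += 1
  }
  return next === expected.length
}

const executedSteps = ['plan', 'research', 'write', 'review']
console.log(isSubsequence(executedSteps, ['plan', 'write'])) // true
console.log(isSubsequence(executedSteps, ['write', 'plan'])) // false
```

Under this reading, extra steps such as 'research' never cause a failure; only a missing or reordered step does.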

For trajectory assertions, prefer the outcome-first modes ('subset', 'superset') over 'strict'/'unordered' — exact trajectories are brittle; reserve strict for hard protocol checks.
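
To make the four mode names concrete, here is one plausible reading of them over plain tool names. The source does not spell out the direction of 'subset' and 'superset', so this sketch assumes 'subset' means the expected calls are a subset of what actually ran and 'superset' means the reverse; it also assumes repeated calls are counted. `matchTrajectory` is a hypothetical function for illustration, not the library's matcher.

```typescript
type TrajectoryMode = 'strict' | 'unordered' | 'subset' | 'superset'

// Count how many times each tool name occurs.
function countTools(tools: readonly string[]): Map<string, number> {
  const counts = new Map<string, number>()
  tools.forEach((tool) => counts.set(tool, (counts.get(tool) ?? 0) + 1))
  return counts
}

// True when every tool in `inner` occurs at least as often in `outer`.
function containsAll(outer: readonly string[], inner: readonly string[]): boolean {
  const have = countTools(outer)
  let ok = true
  countTools(inner).forEach((needed, tool) => {
    if ((have.get(tool) ?? 0) < needed) ok = false
  })
  return ok
}

function matchTrajectory(
  mode: TrajectoryMode,
  actual: readonly string[],
  expected: readonly string[],
): boolean {
  switch (mode) {
    case 'strict': // same calls in the same order
      return actual.length === expected.length && actual.every((t, i) => t === expected[i])
    case 'unordered': // same calls, any order
      return actual.length === expected.length && containsAll(actual, expected)
    case 'subset': // assumption: expected calls all happened, extras allowed
      return containsAll(actual, expected)
    case 'superset': // assumption: nothing ran beyond the expected calls
      return containsAll(expected, actual)
  }
}

const actualCalls = ['search', 'lookupOrder', 'write']
console.log(matchTrajectory('subset', actualCalls, ['lookupOrder'])) // true
console.log(matchTrajectory('strict', actualCalls, ['lookupOrder'])) // false
console.log(matchTrajectory('unordered', actualCalls, ['write', 'search', 'lookupOrder'])) // true
```

The 'strict' case fails as soon as the agent adds one harmless extra call, which is the brittleness the paragraph above warns about.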

Honest failure semantics: asserting on a signal that was not captured in this execution throws an UncapturedSignalError naming the signal and which task kinds capture it — never a vacuous pass. The type system prevents most of these at compile time; the runtime backstop covers variant task substitution and conditional capture.
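
The runtime backstop can be pictured with a small standalone model. The capability table below is copied from the task-kind table on this page (with 'fn' standing in for "plain fn"); the `CapturedSignals` shape, `requireSignal`, `expectNoToolCalls`, and this local `UncapturedSignalError` class are hypothetical and only mirror the behavior described, not the library's implementation.

```typescript
type TaskKind = 'prompt' | 'flow' | 'agent' | 'retriever' | 'fn'
type Signal =
  | 'modelCalls' | 'citations' | 'safety' | 'steps' | 'toolCalls'
  | 'routing' | 'memory' | 'handoffs' | 'retrieval'

// Which task kinds capture which signal families (from the table on this page).
const CAPABILITIES: Record<TaskKind, readonly Signal[]> = {
  prompt: ['modelCalls', 'citations', 'safety'],
  flow: ['modelCalls', 'steps', 'toolCalls', 'routing', 'safety', 'memory'],
  agent: ['modelCalls', 'citations', 'safety', 'steps', 'toolCalls', 'routing', 'memory', 'handoffs', 'retrieval'],
  retriever: ['retrieval'],
  fn: [],
}

function kindsCapturing(signal: Signal): TaskKind[] {
  return (Object.keys(CAPABILITIES) as TaskKind[]).filter(
    (kind) => CAPABILITIES[kind].indexOf(signal) !== -1,
  )
}

// Local stand-in for the error described above: names the signal and the capturing kinds.
class UncapturedSignalError extends Error {
  readonly signal: Signal
  constructor(signal: Signal) {
    super(`Signal "${signal}" was not captured in this execution; captured by: ${kindsCapturing(signal).join(', ')}`)
    this.name = 'UncapturedSignalError'
    this.signal = signal
  }
}

// Trace data for one cell: a signal family is present only if it was captured.
type CapturedSignals = Partial<Record<Signal, readonly string[]>>

function requireSignal(captured: CapturedSignals, signal: Signal): readonly string[] {
  const events = captured[signal]
  if (events === undefined) throw new UncapturedSignalError(signal)
  return events
}

// "No tool calls" passes on an empty captured list, but throws when nothing was captured.
function expectNoToolCalls(captured: CapturedSignals): boolean {
  return requireSignal(captured, 'toolCalls').length === 0
}

console.log(expectNoToolCalls({ toolCalls: [] })) // true
try {
  expectNoToolCalls({}) // e.g. a variant substituted a task kind without tool-call capture
} catch (err) {
  console.log((err as Error).message)
}
```

The key distinction is between "captured and empty" and "never captured": only the first can honestly satisfy a negative assertion.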

Step access

For flows and agents, ctx.step() reads a named step's result. Step outputs are unknown until you narrow them with a Standard Schema (step names and outputs are created imperatively inside the flow handler, so they cannot be statically typed):

expect: (ctx) => {
  ctx.expect.steps.toHaveSucceeded('write')
  const write = ctx.step('write', z.object({ summary: z.string() }))
  ctx.expect(write.output.summary.length).toBeGreaterThan(0)
},

ctx.step(name) without a schema returns { output: unknown, status, durationMs }.
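
The narrowing itself can be sketched without any schema library. The interface below is a trimmed, synchronous-only copy of the Standard Schema v1 shape (the real spec also allows async validation and carries type metadata). `StepRecord`, `narrowStep`, and the hand-written `summarySchema` are hypothetical names for this example; with zod, the schema object would come from `z.object(...)` instead.

```typescript
// Trimmed, sync-only version of the Standard Schema v1 interface (assumption: async omitted).
interface Issue { readonly message: string }
type ValidationResult<T> =
  | { readonly value: T; readonly issues?: undefined }
  | { readonly issues: readonly Issue[] }

interface SyncSchema<T> {
  readonly '~standard': {
    readonly version: 1
    readonly vendor: string
    readonly validate: (value: unknown) => ValidationResult<T>
  }
}

// Shape returned by ctx.step(name) without a schema, per the text above.
interface StepRecord<T = unknown> {
  output: T
  status: string
  durationMs: number
}

// Validate an unknown step output and return the same record with a typed output.
function narrowStep<T>(step: StepRecord, schema: SyncSchema<T>): StepRecord<T> {
  const result = schema['~standard'].validate(step.output)
  if (result.issues) {
    throw new Error(`step output failed validation: ${result.issues.map((i) => i.message).join('; ')}`)
  }
  return { ...step, output: result.value }
}

// Hand-written schema standing in for z.object({ summary: z.string() }).
const summarySchema: SyncSchema<{ summary: string }> = {
  '~standard': {
    version: 1,
    vendor: 'example',
    validate: (value) => {
      if (typeof value === 'object' && value !== null) {
        const summary = (value as { summary?: unknown }).summary
        if (typeof summary === 'string') return { value: { summary } }
      }
      return { issues: [{ message: 'expected { summary: string }' }] }
    },
  },
}

const rawStep: StepRecord = { output: { summary: 'Refund issued.' }, status: 'succeeded', durationMs: 42 }
const writeStep = narrowStep(rawStep, summarySchema)
console.log(writeStep.output.summary.length > 0) // true
```

After narrowing, `writeStep.output.summary` is a `string` at compile time, which is what lets the assertion in the example above read `.length` safely.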
