Replay
Deterministic, token-free runs — cassettes record model calls at the executor boundary and replay them by normalized call identity.
Level 5 adds replay: — cassettes that intercept model calls at the adapter's
executor boundary, record outcomes once, and replay them deterministically.
Strict replay runs with zero live calls and fails closed on anything
unrecorded.
import { evaluate, cassette } from '@crux/core/quality'
export default evaluate('support.agent-loop', {
task: target.agent(supportAgent, { tools: { lookupOrder: { status: 'shipped' } } }),
data: cases,
replay: { mode: 'replay-strict', cassette: cassette('support-agent') },
})replay: 'replay-strict' (just the mode string) is enough — the cassette is
then named after the evaluation id.
Modes
| Mode | Behavior |
|---|---|
live | No cassette I/O (the default) |
record-new | Replay hits, record misses — the incremental workflow |
replay-strict | Replay only; a miss fails the cell closed with the missing key and a re-record hint |
refresh | Re-record every exercised key; unexercised entries are kept |
Mode resolution, strongest first: --replay CLI flag → cassette() mode →
the evaluation's replay: declaration → quality.defaults.replay → live.
Use CLI flags and package scripts for the run posture you want:
crux quality run --replay live # local smoke, no cassette I/O
crux quality run --replay record-new # record what's missing
crux quality run --ci --replay replay-strict # CI: zero live calls
crux quality run --replay refresh # refresh exercised entriesHow matching works
Interception happens at the ExecutorSpec/SdkGateway boundary — below prompts, flows, agents, tool loops, and judge scorers, so all of them replay transparently. Each call gets a normalized identity: a hash of (call kind, target id, prompt hash, model, canonicalized settings, tool-schema hash, input), with volatile fields (timestamps, request ids) excluded. That means:
- One cassette covers all variants — different params produce different keys, so a bakeoff replays every variant from the same file.
- Judge scorers replay too — a strict CI run spends zero judge tokens.
- Changing the prompt/model/settings invalidates exactly the affected entries, nothing else.
- Validation-retry sequences replay faithfully (invalid structured attempts are recorded with their errors).
Storage and hygiene
Cassettes live in .crux/quality/cassettes/<name>.json — commit them.
Metadata records recordedAt, SDK version, and models; the reporter warns
when a cassette is older than 90 days. Project redact dot-paths (plus
always-on defaults like authorization headers and API keys) are applied at
write time — secrets never reach disk. These paths are relative to the recorded
call/result payloads, so use domain paths such as customer.email; do not add
authorization/API-key paths unless you have a non-standard field name outside
the always-on set.
Under replay-strict, trials collapse to 1: replaying the same recording
n times measures nothing.
A CI posture that works
{
"scripts": {
"quality": "crux quality run --replay live",
"quality:record": "crux quality run --replay record-new",
"quality:ci": "crux quality run --ci --replay replay-strict --junit report.xml"
}
}- CI runs
crux quality run --ci --replay replay-strict— deterministic, token-free, fails closed on unrecorded calls and on gate regressions against the committed baseline. - Locally, record new work with
--replay record-new(orrefreshafter intentional prompt changes), review the cassette diff, and commit it alongside the change.
--ci controls reporter posture and exit-code friendliness. It does not need
to hide replay choices; keep replay explicit in scripts unless your project
intentionally sets quality.defaults.replay in config.
Next
Recipes — the pattern for each primitive.