Scorers & gates
Graded measures with the built-in scorer library, LLM judges, autoevals compatibility, datasets, and declarative pass policies.
Level 3 adds scorers (graded 0–1 measures, reported with honest statistics) and gates (the declarative pass policy that drives the exit code).
Scorers
A scorer is any autoevals-compatible function — ({ input, output, expected }) → { name, score: 0–1 | null, label?, metadata? }. The existing scorer
ecosystem plugs in unchanged. score: null means "skipped / not applicable"
and is excluded from aggregates.
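As an illustration of that contract, here is a minimal sketch of a hand-written scorer. The `ScorerArgs` and `Score` types and the `wordOverlap` scorer are hypothetical stand-ins written for this example, not part of the library; a real project would take its types from the evaluation package.

```typescript
// Local stand-ins for the scorer contract described above (assumed shapes).
type ScorerArgs<I, O, E> = { input: I; output: O; expected?: E }
type Score = {
  name: string
  score: number | null
  label?: string
  metadata?: Record<string, unknown>
}

// Lowercase a string and split it into unique words.
function uniqueWords(text: string): string[] {
  return Array.from(new Set(text.toLowerCase().split(/\W+/).filter(Boolean)))
}

// Hypothetical scorer: the fraction of expected words that appear in the output.
function wordOverlap({ output, expected }: ScorerArgs<string, string, string>): Score {
  const expectedWords = expected === undefined ? [] : uniqueWords(expected)
  // Nothing to compare against: return null so the case is skipped,
  // not counted as a zero.
  if (expectedWords.length === 0) {
    return { name: 'wordOverlap', score: null }
  }
  const outputWords = new Set(uniqueWords(output))
  const hits = expectedWords.filter((word) => outputWords.has(word)).length
  return {
    name: 'wordOverlap',
    score: hits / expectedWords.length,
    metadata: { hits, total: expectedWords.length },
  }
}
```

Returning `null` rather than `0` for a missing reference keeps not-applicable cases from dragging the mean down.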
Two spellings:
// Array form — importable, shareable (the default)
scorers: [scorers.exact(), scorers.levenshtein()],
// Factory form — the library arrives pre-bound to this evaluation's types
scorers: (s) => [
  s.judge({ name: 'helpful', rubric: 'Does the answer resolve the question?', select: (o) => o.answer }),
  s.levenshtein(),
],

Use the factory form when the task output is structured: select is then
contextually typed against ctx.output (no annotation needed). With the
standalone import, judge on a non-string output requires an annotated
select parameter.
The built-in library
Code-class (run anywhere, free):
| Scorer | Measures |
|---|---|
| scorers.exact() | Output equals expected |
| scorers.contains(opts?) | Output contains the needle (default: expected) |
| scorers.regex({ pattern }) | Pattern match |
| scorers.levenshtein() | Normalized edit distance to expected |
| scorers.jsonValid() / scorers.jsonDiff() | JSON validity / structural diff |
| scorers.retrieval.hitRateAtK(k) · recallAtK(k) · precisionAtK(k) · mrr() · ndcg(k?) | Ranking quality against expected: { sources: [{ sourceId, chunkId? }] } |
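For readers new to the ranking measures in the last row, the sketch below shows the textbook definitions for a single case. It is not the library's implementation: the function names are local to this example, and both the matching rule (a `chunkId` is compared only when the expected reference names one) and the precision denominator (`k`) are assumptions.

```typescript
type SourceRef = { sourceId: string; chunkId?: string }

// Assumed matching rule: sourceId must agree; chunkId is compared only
// when the expected reference specifies one.
function matches(retrieved: SourceRef, expected: SourceRef): boolean {
  if (retrieved.sourceId !== expected.sourceId) return false
  return expected.chunkId === undefined || retrieved.chunkId === expected.chunkId
}

function isRelevant(item: SourceRef, expected: SourceRef[]): boolean {
  return expected.some((e) => matches(item, e))
}

// 1 if any of the top-k results is relevant, otherwise 0.
function hitRateAtK(ranked: SourceRef[], expected: SourceRef[], k: number): number {
  return ranked.slice(0, k).some((r) => isRelevant(r, expected)) ? 1 : 0
}

// Share of the expected sources that appear in the top k.
// Returns null when there is nothing to recall.
function recallAtK(ranked: SourceRef[], expected: SourceRef[], k: number): number | null {
  if (expected.length === 0) return null
  const top = ranked.slice(0, k)
  const found = expected.filter((e) => top.some((r) => matches(r, e))).length
  return found / expected.length
}

// Share of the top-k slots filled by relevant results (denominator k, an assumption).
function precisionAtK(ranked: SourceRef[], expected: SourceRef[], k: number): number {
  if (k <= 0) return 0
  const relevant = ranked.slice(0, k).filter((r) => isRelevant(r, expected)).length
  return relevant / k
}

// Reciprocal rank of the first relevant result; 0 when none is relevant.
// MRR is the mean of this value across cases.
function reciprocalRank(ranked: SourceRef[], expected: SourceRef[]): number {
  const index = ranked.findIndex((r) => isRelevant(r, expected))
  return index === -1 ? 0 : 1 / (index + 1)
}
```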
Model-backed (bind generate, model, and embed from the eval or an
eval-local helper):
| Scorer | Measures |
|---|---|
| scorers.judge({ name, rubric \| choiceScores, select?, generate, model, useCoT? }) | LLM-as-judge: free rubric (0–1) or classification with mapped scores |
| scorers.embeddingSimilarity({ embed }) | Cosine similarity to expected |
| scorers.rag.faithfulness({ generate, model }) | Is the answer grounded in the retrieved context? |
| scorers.rag.answerRelevancy({ generate, model }) | Does the answer address the question? |
| scorers.rag.contextPrecision({ generate, model }) / contextRecall({ generate, model }) | Retrieved-context quality vs the question / reference |
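The cosine similarity behind the embedding scorer can be sketched in a few lines. This is the standard formula, not the library's code; in particular, clamping negative similarities to 0 to fit the 0–1 score range is an assumption made for this example.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('vectors must be non-empty and of equal length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat it as unrelated.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Assumption: negative similarities are clamped to 0 so the result is a 0–1 score.
function similarityScore(a: number[], b: number[]): number {
  return Math.max(0, Math.min(1, cosineSimilarity(a, b)))
}
```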
Judges run chain-of-thought by default; the rationale lands in
score.metadata.rationale. Judge and RAG scorers require explicit
generate/model bindings when they spend tokens. The rag.* scorers read
retrieved context from the captured retrieval signals (with input.context as
a fallback) and skip honestly (score: null) when there is none.
choiceScores replaces a rubric with classification:
s.judge({
  name: 'tone',
  choiceScores: { professional: 1, neutral: 0.6, rude: 0 },
  generate: qualityRuntime.generate,
  model: qualityRuntime.judgeModel,
  select: (o) => o.answer,
})

Datasets
Golden sets live in files, validated with any Standard Schema library (zod, valibot, arktype) — JSON, JSONL, or CSV by extension:
import { z } from 'zod'
import { evaluate, dataset } from '@crux/core/quality'

// supportPrompt is the task under evaluation, defined elsewhere.
export default evaluate('support.quality-bar', {
  task: supportPrompt,
  // Each row of the JSONL file is validated against these schemas.
  data: dataset('golden/support.jsonl', {
    input: z.object({ question: z.string(), locale: z.enum(['en', 'nl']) }),
    expected: z.object({ answer: z.string() }),
  }),
  scorers: (s) => [
    s.judge({ name: 'helpful', rubric: 'Does the answer resolve the question?', select: (o) => o.answer }),
    s.levenshtein(),
  ],
  gates: { passRate: { min: 0.95 }, scores: { helpful: { min: 0.7 } } },
})

data: accepts an inline case array, a dataset, or a mixed array of both
(concatenated). Datasets resolve lazily at execute time; a schema failure is
a definition error (exit 2). Dataset rows are pure data — they cannot carry
expect callbacks.
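To make the row-validation step concrete, here is a rough sketch of parsing a JSONL golden set and rejecting malformed rows. It does not use the library or a schema package: `isSupportCase` is a hand-written check standing in for the schema, and `parseJsonl` and its error message are hypothetical names for this example.

```typescript
type SupportCase = {
  input: { question: string; locale: 'en' | 'nl' }
  expected: { answer: string }
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null
}

// Hand-written stand-in for the schema check a validation library would perform.
function isSupportCase(value: unknown): value is SupportCase {
  if (!isRecord(value) || !isRecord(value.input) || !isRecord(value.expected)) return false
  const { question, locale } = value.input
  return (
    typeof question === 'string' &&
    (locale === 'en' || locale === 'nl') &&
    typeof value.expected.answer === 'string'
  )
}

// Parse JSONL text: one JSON object per non-empty line, each validated.
function parseJsonl(text: string): SupportCase[] {
  const rows: SupportCase[] = []
  text.split('\n').forEach((line, index) => {
    if (line.trim() === '') return
    const parsed: unknown = JSON.parse(line)
    if (!isSupportCase(parsed)) {
      throw new Error(`line ${index + 1}: row does not match the schema`)
    }
    rows.push(parsed)
  })
  return rows
}
```

Failing on the first bad row, with its line number, mirrors the idea that a schema failure is a problem with the definition rather than with the system under test.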
Gates
Without gates, the zero-config policy applies: assertions gate, scores
inform — the run exits non-zero iff any cell errored or any expect
failed; scorer values are reported but never block. Declaring any gate
replaces that policy:
gates: {
  passRate: { min: 0.95 }, // over the per-cell `pass` score
  scores: {
    helpful: { min: 0.7 }, // keyed by scorer name; autocompletes
    levenshtein: { min: 0.6, max: 1 },
  },
  latency: { p95Ms: 5000 },
  cost: { maxPerCaseUsd: 0.05, maxTotalUsd: 2 },
  consistency: { passAtK: 0.9, passAllTrials: true }, // with trials > 1
}

Gate keys under scores are typed from your scorer names when scorers are a
literal array declared before gates; with the factory form they are
validated at definition time instead (an unknown key throws immediately).
Gates are fail-safe: errored cells always fail the run regardless of
gate configuration. Filtered runs (--case, --variant, only) demote
gates to informational — they print, but never fail CI — because partial case
populations make the statistics dishonest.
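The policy described in this section can be summarized as a small decision function. The sketch below is one reading of the rules, not the library's code: the `Gates`, `RunSummary`, and `Verdict` shapes are hypothetical, only the `passRate` and `scores` gates are modeled, exit code 1 for a failed run is an assumption, and so is the choice that errored cells still fail a filtered run.

```typescript
// Hypothetical shapes for this sketch; the real types may differ.
type ScoreGate = { min?: number; max?: number }
type Gates = { passRate?: { min: number }; scores?: Record<string, ScoreGate> }
type RunSummary = {
  erroredCells: number // cells that errored
  failedExpects: number // failed expect assertions
  passRate: number // share of cells that passed
  meanScores: Record<string, number> // aggregate per scorer name
  filtered: boolean // true for --case / --variant / only runs
}
type Verdict = { exitCode: 0 | 1; failures: string[]; informational: string[] }

function applyGates(run: RunSummary, gates?: Gates): Verdict {
  const failures: string[] = []
  const informational: string[] = []

  // Errored cells fail the run whatever the gate configuration says.
  if (run.erroredCells > 0) failures.push(`${run.erroredCells} cell(s) errored`)

  // Zero-config policy: assertions gate, scores only inform.
  if (gates === undefined) {
    if (run.failedExpects > 0) failures.push(`${run.failedExpects} expect(s) failed`)
    return { exitCode: failures.length > 0 ? 1 : 0, failures, informational }
  }

  // Declared gates replace the zero-config policy.
  const breaches: string[] = []
  if (gates.passRate !== undefined && run.passRate < gates.passRate.min) {
    breaches.push(`passRate ${run.passRate} < ${gates.passRate.min}`)
  }
  const scoreGates = gates.scores ?? {}
  for (const name of Object.keys(scoreGates)) {
    const gate = scoreGates[name]
    const value = run.meanScores[name]
    if (value === undefined) {
      breaches.push(`no aggregate for scorer ${name}`)
      continue
    }
    if (gate.min !== undefined && value < gate.min) breaches.push(`${name} ${value} < ${gate.min}`)
    if (gate.max !== undefined && value > gate.max) breaches.push(`${name} ${value} > ${gate.max}`)
  }

  // Filtered runs print gate results but do not let them fail the run.
  if (run.filtered) informational.push(...breaches)
  else failures.push(...breaches)

  return { exitCode: failures.length > 0 ? 1 : 0, failures, informational }
}
```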
Statistics you can trust
Aggregate scores always report mean ± SEM (standard error of the mean),
and the reporter prints both. With trials: n, per-case consistency is
aggregated as pass@k (≥1 passing trial) and pass^k (all trials pass),
which the consistency gates read. Comparisons (next level) are paired per
case, not pooled.
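A short sketch of the statistics named above, using the standard formulas. The function names are local to this example, and using the sample standard deviation (dividing by n − 1) for the SEM is an assumption about the exact estimator.

```typescript
// Mean and standard error of the mean over non-null scores.
// Null scores (skipped cases) are excluded, as described for aggregates.
function meanAndSem(
  scores: Array<number | null>,
): { mean: number; sem: number | null; n: number } | null {
  const values = scores.filter((s): s is number => s !== null)
  const n = values.length
  if (n === 0) return null
  const mean = values.reduce((sum, v) => sum + v, 0) / n
  // SEM is undefined for a single sample.
  if (n === 1) return { mean, sem: null, n }
  // Sample variance (n - 1 denominator), an assumption for this sketch.
  const variance = values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / (n - 1)
  return { mean, sem: Math.sqrt(variance / n), n }
}

// pass@k for one case: at least one of the k trials passed.
function passAtK(trials: boolean[]): boolean {
  return trials.some((passed) => passed)
}

// pass^k for one case: every one of the k trials passed.
function passAllK(trials: boolean[]): boolean {
  return trials.length > 0 && trials.every((passed) => passed)
}

// Share of cases for which a per-case predicate holds.
function rate(cases: boolean[][], predicate: (trials: boolean[]) => boolean): number {
  if (cases.length === 0) return 0
  return cases.filter(predicate).length / cases.length
}
```

For example, with three trials per case, a case that passes once counts toward pass@k but not pass^k, so the gap between the two rates is a direct read on flakiness.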
Next
Variants & baselines — compare candidates and protect against regressions.