Recipes
The evaluation pattern for each primitive — prompts, flows, agents, retrieval, RAG pipelines, and context bakeoffs.
Every recipe is the same three keys — task, data, and whichever levels you
need — the primitive only changes what gets inferred and which signal
namespaces exist.
Prompt
Pass the prompt as the task; case inputs come from its merged input schema,
ctx.output from its output schema. Wrap in target.prompt() only to set
execution defaults:
import { evaluate, target } from '@crux/core/quality'
export default evaluate('support.refunds', {
task: target.prompt(supportPrompt, { model: 'gpt-5' }),
data: [{ input: { question: 'How do refunds work?', locale: 'en' } }],
expect: (ctx) => {
ctx.expect(ctx.output.answer).toContain('refund')
ctx.expect.safety.toHavePassedGuardrails()
ctx.expect.modelCalls.toHaveNoFallback()
},
})Flow
Flows capture steps (plus toolCalls, routing, memory, safety).
Assert order and success at the step level, narrow step outputs with a
schema, and override models per step in variants:
const researchFlow = flow<{ summary: string }, { topic: string }>('research', async (f) => {
const sources = await f.step('search', async () => [`results for ${f.input.topic}`])
return f.step('write', async () => ({ summary: sources.join('\n') }))
})
export default evaluate('research.pipeline', {
task: target.flow(researchFlow, { steps: { write: { model: 'gpt-5-mini' } } }),
data: [{ input: { topic: 'refunds' } }],
expect: (ctx) => {
ctx.expect.steps.toHaveOrder('search', 'write')
ctx.expect.steps.toHaveSucceeded('write')
const write = ctx.step('write', z.object({ summary: z.string() }))
ctx.expect(write.output.summary.length).toBeGreaterThan(0)
},
})Agent
Agents capture all nine signal families. target.agent() adds scripted tool
results (tools: mocks — static values or (args) => result functions) and
a loop bound (maxToolSteps). Multi-turn cases use turns:
export default evaluate('support.agent-loop', {
task: target.agent(supportAgent, {
tools: { lookupOrder: { status: 'shipped' } },
maxToolSteps: 8,
}),
data: [
{ input: { question: 'Where is my order?', locale: 'en' } },
{ turns: [{ user: 'Hi' }, { user: 'I want a refund for order 1234' }] },
],
expect: (ctx) => {
ctx.expect.toolCalls.toHaveCalled('lookupOrder')
ctx.expect.toolCalls.toMatchTrajectory('subset', [{ tool: 'lookupOrder' }])
ctx.expect.handoffs.count().toBeLessThanOrEqual(1)
},
replay: { mode: 'replay-strict', cassette: cassette('support-agent') },
})Retriever
task: target.retriever(r) takes { query: string } inputs by default (pass
a query: mapper for other shapes) and outputs the hit list. The
scorers.retrieval.* family reads expected: { sources: [...] }:
export default evaluate('docs.retrieval', {
task: target.retriever(docsRetriever),
data: [
{
input: { query: 'how do refunds work' },
expected: { sources: [{ sourceId: 'kb-refunds' }] },
},
],
expect: (ctx) => {
ctx.expect.retrieval.toContainHit({ sourceId: 'kb-refunds' })
ctx.expect.retrieval.count().toBeGreaterThan(0)
},
scorers: [scorers.retrieval.recallAtK(5), scorers.retrieval.mrr()],
})RAG pipeline
For end-to-end RAG, the task is your real pipeline (a function or flow that
retrieves, then answers). The judge-backed scorers.rag.* read the retrieved
context from the captured retrieval signals — no manual context plumbing —
and skip honestly (score: null) when nothing was retrieved:
export default evaluate('docs.rag', {
task: async (input: { question: string }) => {
const hits = await docsRetriever.retrieve(input.question)
return { answer: await answerFrom(hits), sources: hits.map((h) => h.sourceId) }
},
data: [{ input: { question: 'How do refunds work?' } }],
scorers: (s) => [s.rag.faithfulness(), s.rag.answerRelevancy()],
})Context bakeoff
Measure what a context buys you: two prompts with the same I/O contract, one grounded, compared as variants. (A static context doesn't change the merged input type, so the prompts stay compatible.)
const refundPolicy = context({
id: 'refund-policy',
system: 'Refunds are accepted within 14 days of purchase.',
})
const groundedPrompt = prompt({
id: 'support-grounded',
input: z.object({ question: z.string(), locale: z.enum(['en', 'nl']) }),
output: z.object({ answer: z.string(), confidence: z.number() }),
system: 'You answer support questions precisely.',
use: [refundPolicy],
})
export default evaluate('support.context-bakeoff', {
task: supportPrompt,
data: cases,
variants: {
ungrounded: {},
grounded: { prompt: groundedPrompt },
},
baseline: 'ungrounded',
scorers: (s) => [
s.judge({ name: 'grounded-answer', rubric: 'Is the answer specific about policy?', select: (o) => o.answer }),
],
gates: { scores: { 'grounded-answer': { minDeltaVsBaseline: 0 } } },
})Sharing cases
Cases are plain data — export arrays and reuse them across evaluations. The
one-annotation escape hatch for extracted arrays is satisfies CaseOf<…>:
import type { CaseOf } from '@crux/core/quality'
export const sharedCases = [
{ input: { question: 'How do refunds work?', locale: 'en' }, expected: { answer: 'Within 14 days.' } },
] satisfies CaseOf<typeof supportPrompt, { answer: string }>[]Vitest bridge and programmatic runs
evaluation.run(overrides?) is the programmatic entry — embed an evaluation
in any test suite as a one-liner, or drive subset runs while iterating:
it('support quality', async () => {
const experiment = await supportEval.run()
expect(experiment.passed).toBe(true)
})
// focused local iteration (filtered runs demote gates to informational)
await bakeoff.run({ variants: ['current', 'candidate'], replayMode: 'replay-strict' })
await gated.run({ cases: ['refund-*'], trials: 1 })