Scoring
LLM-as-a-judge scoring and pre-built quality metrics.
import { llmJudge, metrics, judgeConstraint } from '@crux/core/scoring'

llmJudge(config)
Create a reusable judge instance.
| Field | Type | Description |
|---|---|---|
| id | string | Judge identifier |
| criteria | string | What to evaluate |
| scale | { min: number, max: number } | Score range |
| rubric | Record<number, string>? | Score descriptions |
| chainOfThought | boolean? | Enable reasoning (default: true) |
| fewShot | JudgeFewShot[]? | Calibration examples |
| generate | GenerateObjectFn | Structured generation function |
| model | unknown | Judge model |
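A rubric keyed by score is easier to calibrate when every point on the scale has a description. The sketch below checks a rubric against an integer scale before it is handed to llmJudge. `missingRubricLevels` is a hypothetical helper, not part of the library, and whether the library itself requires a description for every level is not stated here.

```typescript
// Local mirrors of the documented `scale` and `rubric` field shapes.
type Scale = { min: number; max: number }
type Rubric = Record<number, string>

// Hypothetical helper: list the integer scale points that have no rubric text.
function missingRubricLevels(scale: Scale, rubric: Rubric): number[] {
  const missing: number[] = []
  for (let level = scale.min; level <= scale.max; level++) {
    if (rubric[level] === undefined) missing.push(level)
  }
  return missing
}

const scale: Scale = { min: 1, max: 5 }
const rubric: Rubric = {
  1: 'Ignores the query entirely',
  3: 'Addresses the query but misses key points',
  5: 'Fully addresses the query',
}

// Levels 2 and 4 have no description in this example rubric.
const gaps = missingRubricLevels(scale, rubric)
console.log(gaps) // [ 2, 4 ]
```

Leaving intermediate levels undescribed may be intentional; the check only makes the gaps visible.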
Returns: JudgeInstance
| Method | Description |
|---|---|
| .score(input) | Score an input/output pair |
| .id | Judge identifier |
.score() input:
| Field | Type | Description |
|---|---|---|
| input | string | The query/prompt |
| output | string | The response to evaluate |
| reference | string? | Optional ground truth |
.score() returns JudgeResult:
| Field | Type | Description |
|---|---|---|
| score | number | Clamped to scale |
| reasoning | string | Chain-of-thought explanation |
| metricId | string | Judge id |
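"Clamped to scale" means a raw model score outside the configured range is pulled back to the nearest bound. A minimal sketch of that behavior follows. `clampToScale` is a hypothetical stand-in rather than the library's code, and any rounding the library may apply is not covered here.

```typescript
type Scale = { min: number; max: number }

// Hypothetical stand-in: keep in-range scores as they are, pin the rest to a bound.
function clampToScale(raw: number, scale: Scale): number {
  return Math.min(scale.max, Math.max(scale.min, raw))
}

const fivePoint: Scale = { min: 1, max: 5 }

// 0 is below the range, 3.5 is inside it, 9 is above it.
const clamped = [0, 3.5, 9].map((raw) => clampToScale(raw, fivePoint))
console.log(clamped) // [ 1, 3.5, 5 ]
```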
metrics
Six pre-built judges. Each takes { generate: GenerateObjectFn, model } and returns a JudgeInstance.
| Metric | Criteria | Scale |
|---|---|---|
| metrics.relevance() | Is the output relevant to the input query? | 1–5 |
| metrics.faithfulness() | Is the output faithful to provided context? | 1–5 |
| metrics.coherence() | Is the output logically coherent and well-structured? | 1–5 |
| metrics.completeness() | Does the output address all aspects of the query? | 1–5 |
| metrics.toxicity() | Is the output free from harmful content? (5 = safe) | 1–5 |
| metrics.conciseness() | Is the output appropriately concise? | 1–5 |
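All six metrics share the 1–5 scale, and higher is better for each of them (toxicity included, where 5 means safe), so one threshold can flag weak dimensions. The sketch below works on results that have already been collected. `ScoredMetric` and `belowThreshold` are hypothetical local names, the metric ids are assumed, and the sample scores are made up for illustration.

```typescript
// Local mirror of the two JudgeResult fields used here.
interface ScoredMetric {
  metricId: string // assumed to match the metric name
  score: number // 1–5, higher is better for every pre-built metric
}

// Return the ids of metrics that scored strictly below the threshold.
function belowThreshold(results: ScoredMetric[], threshold: number): string[] {
  return results.filter((r) => r.score < threshold).map((r) => r.metricId)
}

// Illustrative values only, not real judge output.
const sample: ScoredMetric[] = [
  { metricId: 'relevance', score: 5 },
  { metricId: 'conciseness', score: 2 },
  { metricId: 'toxicity', score: 5 },
]

const weak = belowThreshold(sample, 3)
console.log(weak) // [ 'conciseness' ]
```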
const relevance = metrics.relevance({ generate: generateObjectFn, model })
const result = await relevance.score({ input: query, output: response })

judgeConstraint(judge, opts)
Bridge a judge into a normal Constraint for online enforcement of scored quality. The returned constraint introduces nothing new: it behaves exactly like a hand-written constraint(). The safety session runs it with retries, audits, and observability unchanged, and the judge's reasoning becomes the corrective feedback for regeneration rounds.
const brandVoice = llmJudge({
  id: 'brand-voice',
  criteria: 'Does the copy match the warm, direct brand voice?',
  scale: { min: 1, max: 10 },
})
const brandVoiceGate = judgeConstraint(brandVoice, { min: 7 })
// → an ordinary Constraint named "brand-voice"

| Option | Type | Description |
|---|---|---|
| min | number | Minimum acceptable score on the judge's own scale (inclusive) |
| severity | 'assert' \| 'suggest'? | Constraint severity (default: 'assert') |
| maxRetries | number? | Per-constraint retry budget (default: 2) |
| category | string? | Risk-category label carried into audits |
| feedback | (result: JudgeResult) => string? | Retry feedback (default: the judge's reasoning) |
| generate | GenerateObjectFn? | Judge generate override for the production call |
| model | unknown? | Judge model override for the production call |
| input | (output, ctx) => string? | Derive the judge's input field (default: empty) |
Returns: Constraint<TSchema> — pass it anywhere constraints are accepted (per-call, per-prompt, or createSafetyPlugin()). The factory is generic over the parsed-output schema like constraint() itself: annotate the input callback's parameter as ConstraintOutput<typeof mySchema> and output.parsed is typed instead of unknown.
Check metadata: on audit entries, metadata.judge carries a JudgeConstraintVerdict of the form { metricId, score, min, reasoning, detail? }. The detail field is present (and typed by the judge's TDetail) when the judge has a detailSchema.
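The documented gate semantics (an inclusive min, feedback defaulting to the judge's reasoning, and a verdict recorded for audits) can be modeled in a few lines. This is a sketch of the behavior described here, not the library's implementation. `evaluateGate` and `GateOutcome` are hypothetical names, and the interfaces are local mirrors of the documented shapes.

```typescript
// Local mirrors of the documented shapes; the real types come from '@crux/core/scoring'.
interface JudgeResult {
  score: number
  reasoning: string
  metricId: string
}

interface JudgeConstraintVerdict {
  metricId: string
  score: number
  min: number
  reasoning: string
  detail?: unknown
}

// Hypothetical result shape for this sketch.
interface GateOutcome {
  passed: boolean
  feedback?: string // only set when the gate fails
  verdict: JudgeConstraintVerdict
}

function evaluateGate(
  result: JudgeResult,
  opts: { min: number; feedback?: (r: JudgeResult) => string },
): GateOutcome {
  // `min` is inclusive: a score equal to it passes.
  const passed = result.score >= opts.min
  const verdict: JudgeConstraintVerdict = {
    metricId: result.metricId,
    score: result.score,
    min: opts.min,
    reasoning: result.reasoning,
  }
  if (passed) return { passed, verdict }
  // Default retry feedback is the judge's own reasoning.
  const feedback = opts.feedback ? opts.feedback(result) : result.reasoning
  return { passed, feedback, verdict }
}

// Illustrative judge results, not real output.
const atThreshold = evaluateGate(
  { metricId: 'brand-voice', score: 7, reasoning: 'Warm and direct.' },
  { min: 7 },
)
const belowMin = evaluateGate(
  { metricId: 'brand-voice', score: 4, reasoning: 'Tone is stiff and formal.' },
  { min: 7 },
)
```

Here `atThreshold.passed` is true because the bound is inclusive, while `belowMin` fails and carries the judge's reasoning as its feedback.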
For quality runs, scorers.judge() in @crux/core/quality reuses this judge machinery with rubric and choice-score modes.
Types
import type {
  JudgeConfig, JudgeInstance, JudgeResult, JudgeInput,
  JudgeScoreOptions, JudgeFewShot, JudgeConstraintOptions, JudgeConstraintVerdict,
} from '@crux/core/scoring'