Why Crux

Bad LLM output is rarely a model problem.

When LLM features fail in production, the cause usually isn't the prompt or the model. It's a missing memory write, a stale retrieval, a guardrail that should've blocked the input, a router that picked the wrong model, or an eval that should've caught the problem before ship. That layer is the harness: everything around the model call. Crux is a kit of typed building blocks for it. Pick what you need, drop the rest. No runtime to adopt, no framework to fight.

What the harness handles

The things a prompt alone can’t do.

A prompt sets the model’s instructions. A harness makes the call behave: shaping inputs, constraining outputs, observing everything in between. These are the capabilities every production LLM app ends up needing.

Steerability

Guardrails screen inputs. Constraints validate outputs and retry the model with feedback. Contexts shape the system message in priority order. The model behaves. It doesn't get to freelance.
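
To make "validate outputs and retry the model with feedback" concrete, here is a minimal sketch of that loop in plain TypeScript. It is not the Crux API: `generateWithRetry`, `Validation`, `Model`, and the stub model are hypothetical names made up for illustration, and the model call is a synchronous stub so the sketch stays self-contained (a real call would be async).

```typescript
// Hypothetical shapes for illustration; not the Crux API.
type Validation = { ok: true } | { ok: false; feedback: string }
type Model = (prompt: string) => string // synchronous stub; a real call would be async

function generateWithRetry(
  model: Model,
  prompt: string,
  validate: (output: string) => Validation,
  maxRetries: number,
): { output: string; attempts: number } {
  let current = prompt
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const output = model(current)
    const result = validate(output)
    if (result.ok) return { output, attempts: attempt }
    // Feed the validation error back so the next attempt can correct it.
    current = `${prompt}\n\nYour previous answer was rejected: ${result.feedback}`
  }
  throw new Error(`Output still invalid after ${maxRetries} retries`)
}

// Stub model: returns invalid JSON first, valid JSON once it sees feedback.
const stubModel: Model = (p) => (p.includes('rejected') ? '{"edits":[]}' : 'not json')

// Example constraint: the output must parse as JSON.
const isJson = (output: string): Validation => {
  try {
    JSON.parse(output)
    return { ok: true }
  } catch {
    return { ok: false, feedback: 'Output must be valid JSON.' }
  }
}

const retried = generateWithRetry(stubModel, 'Fix the intro', isJson, 2)
```

The design point is that the validator returns feedback rather than a bare boolean, so each retry carries the reason the last attempt failed instead of blindly resampling.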

Composable context

Brand voice, formatting rules, schema instructions, retrieved docs. Every contribution is a context() block. Priority-ordered, token-budgeted, reused across every prompt that needs them.
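
As a rough picture of what "priority-ordered, token-budgeted" could mean, the sketch below assembles a system message from blocks in plain TypeScript. It is not how Crux implements `context()`: `ContextBlock`, `estimateTokens`, and `assembleSystem` are hypothetical, the "higher number wins" ordering is an assumption, and the token count is a crude four-characters-per-token estimate.

```typescript
// Hypothetical types for illustration; not the Crux API.
interface ContextBlock {
  id: string
  priority: number // assumption: a higher number means more important
  text: string
}

// Crude estimate (about 4 characters per token); a real tokenizer would be more accurate.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

// Keep the highest-priority blocks that fit the budget, joined in priority order.
function assembleSystem(blocks: ContextBlock[], budget: number): string {
  const ordered = [...blocks].sort((a, b) => b.priority - a.priority)
  const kept: ContextBlock[] = []
  let used = 0
  for (const block of ordered) {
    const cost = estimateTokens(block.text)
    if (used + cost > budget) continue // skip blocks that would overflow the budget
    kept.push(block)
    used += cost
  }
  return kept.map((b) => b.text).join('\n\n')
}

const system = assembleSystem(
  [
    { id: 'docs', priority: 10, text: 'x'.repeat(400) }, // about 100 tokens, too large
    { id: 'brand', priority: 30, text: '## Brand\nProfessional, concise tone.' },
    { id: 'format', priority: 20, text: 'Return Markdown.' },
  ],
  50,
)
```

With a budget of 50 estimated tokens, the brand and format blocks fit and the oversized retrieved-docs block is dropped, so the lowest-priority content is what gets cut first.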

Type safety end to end

Inputs and outputs are Zod schemas. The compiler catches missing fields, refactors are real refactors, and the typed object you get back is the object you defined.

Provider portability

Define the prompt once. Switch OpenAI for Anthropic for Gemini at the call site, or hand it to your agent framework. Same definition, same tests, same traces.

Tests where the prompt lives

Inline test cases sit on the prompt itself. LLM-as-a-judge scoring catches drift. A CLI runner sweeps your prompt across providers in CI and surfaces regressions before they ship.

Observable by default

Every block emits a structured span: the resolved system message, the guardrail that fired, the constraint that retried, the judge that scored. Inspect spans locally in devtools, or export them through OpenTelemetry (OTel) in production.

Routing & cost as code

Cascade to a cheaper model when a classifier says it's safe. Cache responses. Attribute spend per prompt, model, session, or flow. Set warn/limit budgets that throw before the bill does.
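
The cascade and the warn/limit budget can be sketched in a few lines of plain TypeScript. None of this is the Crux API: `pickModel`, `Budget`, and `BudgetExceededError` are hypothetical names, the classifier is a stand-in function, and costs are assumed to be in US dollars.

```typescript
// Hypothetical names for illustration; not the Crux API.
type ModelTier = 'cheap' | 'strong'

// Cascade: use the cheap model only when the classifier says the request is simple enough.
function pickModel(input: string, isSimple: (input: string) => boolean): ModelTier {
  return isSimple(input) ? 'cheap' : 'strong'
}

class BudgetExceededError extends Error {}

// Tracks spend against a soft (warn) and a hard (limit) threshold, in USD.
class Budget {
  private spent = 0
  private readonly warnAt: number
  private readonly limit: number
  readonly warnings: string[] = []

  constructor(warnAt: number, limit: number) {
    this.warnAt = warnAt
    this.limit = limit
  }

  // Call before or after each model call with its estimated or actual cost.
  record(cost: number): void {
    if (this.spent + cost > this.limit) {
      throw new BudgetExceededError(`Budget of $${this.limit} would be exceeded`)
    }
    this.spent += cost
    if (this.spent >= this.warnAt) {
      this.warnings.push(`Spend reached $${this.spent.toFixed(2)}`)
    }
  }
}

// Stand-in classifier: treat short instructions as safe for the cheap model.
const tier = pickModel('Fix a typo', (text) => text.length < 40)

const budget = new Budget(1, 2)
budget.record(0.5)
budget.record(0.5) // reaches the warn threshold and records a warning
```

Checking the limit before adding the cost is what lets the budget throw before the spend happens rather than after.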

SLOT

The kit grows. Each new capability is its own import. Never a runtime upgrade, never a migration.

How Crux got here

Typed prompts grew up.

Crux didn’t start as a harness. It started as a way to make a single prompt typeable. Then we kept noticing the same gaps in production code, and the kit grew to fill them.

THEN · v0.1

Typed prompts

Crux started as a way to stop concatenating strings. Give prompts schemas, contexts, and TypeScript types so refactoring stopped being guesswork.

NEXT · v0.4

Context engineering

It grew to cover the whole input side: composable contexts, block-based memory, retrieval, compaction. The right inputs to the model, assembled deliberately.

NOW · v1.x

A harness around the model

Today Crux is the full layer between your app and your LLM: typed prompts, tools, guardrails, constraints, routing, evaluation, observability. Every block production LLM apps already need.

Same task. With and without a harness.

A prompt instructs. A harness steers.

Same edit task, written without and with Crux. The Crux version is longer because it does more: a guardrail screens the input, a context block carries brand voice, a constraint validates the output and retries the model when it's wrong. The model behaves, and you can see exactly why.

WITHOUT CRUX · A PROMPT · draft-edit.ts

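import { generateObject } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'
 
// Prompt logic scattered, duplicated, untyped.
// brandRules, formatRules, and userMessage are strings defined elsewhere in the app.
const systemPrompt =
  '## Brand\nProfessional, concise tone.\n\n' +
  brandRules + formatRules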
 
const result = await generateObject({
  model: openai('gpt-4o'),
  system: systemPrompt,
  prompt: userMessage,
  schema: z.object({
    edits: z.array(z.object({
      blockId: z.string(),
      text: z.string(),
    })),
  }),
})
 
// You get a typed object back, and that's it.
// No input screening. No retry on bad output.
// No trace. No tests to catch a provider swap.
// The model decides what 'behaves' means.

WITH CRUX · A HARNESS · draft-edit.ts

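import { context, prompt } from '@use-crux/core'
import { guardrail } from '@use-crux/core/safety'
import { constrain } from '@use-crux/core/constrain'
import { generate } from '@use-crux/ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'
 
// Same output schema as the version without Crux.
const editsShape = z.object({
  edits: z.array(z.object({
    blockId: z.string(),
    text: z.string(),
  })),
})
 
// detectPromptInjection is your own input validator, defined elsewhere.
 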
const brand = context({
  priority: 30,
  system: '## Brand\nProfessional, concise tone.',
})
 
const injection = guardrail({
  phase: 'input',
  validate: detectPromptInjection,
})
 
const valid = constrain({
  schema: editsShape,
  retry: { max: 2, withFeedback: true },
})
 
const edit = prompt({
  id: 'draft-edit',
  use: [brand, injection, valid],
  input:  z.object({ instruction: z.string() }),
  output: editsShape,
  system: 'You are an expert content editor.',
})
 
const result = await generate(edit, {
  model: openai('gpt-4o'),
  input: { instruction: 'Fix the intro' },
})
 
// Typed. Screened. Constrained. Traced.
// Swap openai('gpt-4o') for anthropic(...)
// and every test still passes.

Design principles

How Crux thinks about context.

Six choices that shape every API in the kit. If you've felt the friction of treating prompts like strings, these will sound familiar.

01
Prompts are data, not strings
A prompt is a typed data structure with schemas, contexts, hooks, and settings. You can inspect, test, and compose it before any model call.
02
Composition over configuration
Small primitives compose through consistent interfaces: asContext(), asTools(), CruxStore, generate(). Build exactly what you need.
03
SDK-agnostic by default
Prompts don't import a provider. The same definition runs on Vercel AI SDK, OpenAI, Google GenAI, or your agent framework.
04
Evaluation is not optional
Inline test cases, LLM-as-a-judge scoring, and a CLI runner make testing prompts as natural as testing code.
05
Small API surface
Core concepts are prompt(), context(), and generate(). Memory, compaction, scoring, and agents build on top through the same patterns.
06
Observable by default
Every generation, memory op, compaction, and eval is traceable through devtools. Zero overhead when disabled.

Scope

What Crux is not.

Bounded on purpose. Crux does one thing: the typed, observable layer around your model call. It leaves the rest to tools designed for it.

Not a runtime

There's no Crux server, no execution loop, no orchestration engine. Everything Crux gives you runs in your code, against your SDK, on your infra. Nothing to operationally adopt.

Not a framework

Crux doesn't manage your routing, deployment, or app structure. It's a toolkit you use where you need it.

Not an agent framework

Crux provides coordination primitives (blackboards, handoffs) but delegates execution to AI SDK, OpenAI, or Google GenAI. It also integrates with agent frameworks like Convex Agent and Mastra.

Not a prompt management platform

No hosted dashboard, no A/B testing SaaS. Your prompts live in your codebase, versioned with git, reviewed in PRs.

Who it's for

TypeScript developers shipping LLM features that need to actually work.

In Next.js, Node.js, or Convex. If you're managing more than a handful of prompts and shipping to users who expect reliable output, Crux gives you the structure to manage that complexity without adopting a heavyweight framework.

Next.js · Node.js · Convex · Vercel Edge · Cloudflare Workers · AWS Lambda

Get started in 5 minutes.

One install, one prompt. Add memory, retrieval, guardrails, and traces as you need them.

$ npm install @use-crux/core
