Budget manager

Advisory token-pressure tracker. Decide when to compact; apply your own strategy.

A budget manager tracks token usage across multiple sources (system, history, tools, retrieved context) and reports a pressure level: normal, warning, or critical. It does no compaction itself — it tells you when to act, not how.

What problem does this solve?

Different parts of a request consume tokens at different rates: a long-running chat history grows linearly, retrieved RAG context arrives in chunks, tools come and go between turns. To compact intelligently you need to know what is taking up the budget, not just that you're close to the limit.

The budget manager gives you a pure tracker: no LLM calls, no I/O, no side effects. The only thing it holds is the counts you report. You report token counts as you build the request; you read the pressure level when you need to decide what to drop or compress.

When should I use it?

You want fine-grained pressure measurement by source (system / history / tools / context)
You're combining compaction strategies (extract structured facts first, then summarize the rest)
You need a hook for telemetry — emit a metric every time pressure crosses a threshold
You want to make compaction decisions in your code, not via a built-in primitive

When should I NOT use it?

You're happy with sliding window's "compact when N is exceeded" — it doesn't need pressure tracking
You don't have a token counter — without one, the budget manager has no inputs
You only have one source of tokens — you can compute pressure inline without the abstraction

Quick start

budget.ts

import { createBudgetManager } from '@crux/core/compaction'

const budget = createBudgetManager({
  limit: 128_000,
  warningThreshold: 0.8,     // 80% used → warning
  criticalThreshold: 0.95,   // 95% used → critical
})

// Report token counts as you build the request
budget.report('system', 2000)
budget.report('history', 45_000)
budget.report('tools', 3000)
budget.report('rag', 12_000)

const state = budget.check()
state.used        // 62000
state.available   // 66000
state.pressure    // ~0.484 (62000 / 128000)
state.level       // 'normal'
state.breakdown   // { system: 2000, history: 45000, tools: 3000, rag: 12000 }
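
The values above are consistent with pressure being used tokens divided by the limit, and with the level being chosen by comparing pressure against the two thresholds. A minimal sketch of that arithmetic, assuming those semantics: the `classifyPressure` helper, its `>=` comparison, and the clamping of `available` at zero are illustrative assumptions, not the library's internals.

```typescript
type Level = 'normal' | 'warning' | 'critical'

interface PressureState {
  used: number
  available: number
  pressure: number
  level: Level
}

// Hypothetical helper: reproduces the quick-start numbers, not the library's code.
function classifyPressure(
  breakdown: Record<string, number>,
  limit: number,
  warningThreshold = 0.8,
  criticalThreshold = 0.95,
): PressureState {
  // Total tokens across every reported source.
  const used = Object.values(breakdown).reduce((sum, n) => sum + n, 0)
  const pressure = used / limit
  // Assumption: a level begins once pressure reaches its threshold (>=).
  const level: Level =
    pressure >= criticalThreshold ? 'critical'
    : pressure >= warningThreshold ? 'warning'
    : 'normal'
  return { used, available: Math.max(0, limit - used), pressure, level }
}

const example = classifyPressure(
  { system: 2000, history: 45_000, tools: 3000, rag: 12_000 },
  128_000,
)
// example.used === 62000, example.available === 66000,
// example.pressure === 0.484375, example.level === 'normal'
```

With the same limit, a history of 110,000 tokens alone would land at roughly 0.86 and classify as 'warning' under these assumptions.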

Reacting to pressure

The typical pattern is to build the request, measure, decide, then compact:

budget.report('system', countTokens(systemText))
budget.report('history', countTokens(messages))
budget.report('rag', countTokens(retrievedDocs))

const state = budget.check()

if (state.level === 'critical') {
  // Aggressive: extract structured facts + sliding-window
  await extractAndStoreFacts(messages)
  messages = await slidingWindow.getMessages()
} else if (state.level === 'warning') {
  // Soft: trim retrieved docs to top-K
  retrievedDocs = retrievedDocs.slice(0, 3)
}
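
The snippet above assumes a `countTokens` function is already in scope; the page does not show one. As a rough stand-in while prototyping, something like the sketch below may be enough. It uses the common rule of thumb of about 4 characters per token for English text, which can be far off for code or other languages, so a real tokenizer for your model is the safer choice. The accepted input shapes (a string, or an array of strings or `{ content }` objects) are assumptions for illustration.

```typescript
// Hypothetical stand-in for a real tokenizer. Accuracy is approximate at best.
function countTokens(
  input: string | Array<string | { content: string }>,
): number {
  // Normalize every supported input shape to a list of plain strings.
  const texts = typeof input === 'string'
    ? [input]
    : input.map((item) => (typeof item === 'string' ? item : item.content))
  const chars = texts.reduce((sum, text) => sum + text.length, 0)
  // Assumption: ~4 characters per token, rounded up.
  return Math.ceil(chars / 4)
}

const systemTokens = countTokens('You are a helpful assistant.')
const historyTokens = countTokens([
  { content: 'What is a budget manager?' },
  { content: 'It tracks token pressure by source.' },
])
```

Because the estimate is coarse, it may be worth setting the thresholds a little lower than you would with exact counts.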

Hooks

onBudgetCheck fires every time the pressure level changes, which makes it handy for metrics:

const budget = createBudgetManager({
  limit: 128_000,
  onBudgetCheck: ({ level, pressure, breakdown }) => {
    if (level === 'critical') {
      metrics.increment('budget.critical', { breakdown })
    }
  },
})

The hook also flows through to devtools and OTel via the standard instrumentation pipeline.
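
If you read the level yourself in a loop rather than relying on the hook, you may want the same "only on change" behavior so a metric is not emitted on every turn. A minimal sketch of that idea, assuming nothing about the library beyond the three level names; `onLevelChange` is a hypothetical helper, not part of `@crux/core`.

```typescript
type Level = 'normal' | 'warning' | 'critical'

// Hypothetical helper: remembers the last level seen and invokes the
// callback only when the new level differs from it.
function onLevelChange(
  callback: (next: Level, previous: Level) => void,
  initial: Level = 'normal',
): (current: Level) => void {
  let previous = initial
  return (current) => {
    if (current !== previous) {
      const last = previous
      previous = current
      callback(current, last)
    }
  }
}

const transitions: string[] = []
const notify = onLevelChange((next, prev) => {
  transitions.push(`${prev}->${next}`)
})

notify('normal')    // unchanged, callback not invoked
notify('warning')   // records "normal->warning"
notify('warning')   // unchanged, callback not invoked
notify('critical')  // records "warning->critical"
```

In practice the callback would forward to whatever metrics client you use instead of pushing to an array.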

Choosing thresholds

warningThreshold: 0.8 is a reasonable default — gives you headroom for the response itself
criticalThreshold: 0.95 leaves ~5% for safety; with less headroom than that, you risk a 4xx from the provider
For models with reserved-output budgets, subtract the response budget from limit before passing in
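
As a worked example of the last point, the sketch below subtracts a reserved output budget from the context window and shows the token counts at which each level would begin. The window size and output reservation are hypothetical numbers, and the rounding assumes a level starts once pressure reaches its threshold.

```typescript
// Hypothetical numbers: a 128k context window with 4,096 tokens reserved for output.
const contextWindow = 128_000
const maxOutputTokens = 4_096

// Pass this value as `limit` so pressure reflects input tokens only.
const inputLimit = contextWindow - maxOutputTokens // 123904

// First whole token count at or above each threshold.
const warningAt = Math.ceil(inputLimit * 0.8)   // 99124
const criticalAt = Math.ceil(inputLimit * 0.95) // 117709
```

With these numbers, the warning level starts about 3,300 tokens earlier than it would if the full 128,000 were passed as the limit.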
