
Token Optimization Playbook

Part of: Effective AI Utilization — Table of Contents

Tokens are the fundamental unit of both AI capability and AI cost. Every token you send is money spent and context window consumed. Optimization isn't about being cheap — it's about being intentional with a finite resource.

The BrianBot Baseline

BrianBot tracks tokens at the output level — every pipeline step increments totalInputTokens and totalOutputTokens on the episode record, and a UsageRecord is created per run. This is the minimum viable token tracking: you know what you spent, but only after you spent it.

What's missing is the other half: pre-flight budgeting. There's no check before submitting a prompt to ensure it fits within the model's context window. No chunking strategy if it doesn't. No cost estimate before the call is made. The costEstimateCents field exists in the schema but is always zero.
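A pre-flight check can be sketched in a few lines. This is a hypothetical helper, not code from BrianBot: it uses a rough chars/4 heuristic for token estimation (production code should use the provider's tokenizer or token-counting API), and the window size is illustrative.

```typescript
// Illustrative context window size; set this per model in real code.
const CONTEXT_WINDOW = 200_000;

// Rough heuristic for English prose: ~4 characters per token.
// Replace with a real tokenizer or the provider's count-tokens API.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Pre-flight check: does the prompt plus the reserved output budget fit?
function fitsContextWindow(prompt: string, maxOutputTokens: number): boolean {
  return estimateTokens(prompt) + maxOutputTokens <= CONTEXT_WINDOW;
}
```

If the check fails, the caller can chunk the input or summarize before submitting, instead of discovering the overflow as an API error.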

The Token Optimization Stack

Think of token optimization as five layers, from most to least impactful:

1. Model Selection (biggest lever)

Choosing Haiku over Sonnet for a task that doesn't need Sonnet's reasoning is the single biggest cost reduction. BrianBot's implicit tiering (Haiku for extraction, Sonnet for generation) is the right instinct. See Model Routing Strategies for the full framework.

2. Prompt Engineering

Every word in your system prompt is repeated on every call. A 500-token system prompt across 1000 calls is 500K input tokens. Audit ruthlessly: does this instruction actually change the output? If not, cut it. BrianBot's prompts are overridable at three levels (see Prompt Architecture) — use this to A/B test leaner prompts.
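The arithmetic above generalizes into a quick projection you can run during a prompt audit. This is a sketch with an illustrative price parameter; check your provider's current pricing before relying on the dollar figure.

```typescript
// Project the recurring cost of a system prompt over a day of calls.
// inputPricePerMTokUsd is USD per million input tokens (illustrative).
function systemPromptOverhead(
  promptTokens: number,
  callsPerDay: number,
  inputPricePerMTokUsd: number,
): { tokensPerDay: number; usdPerDay: number } {
  const tokensPerDay = promptTokens * callsPerDay;
  return {
    tokensPerDay,
    usdPerDay: (tokensPerDay / 1_000_000) * inputPricePerMTokUsd,
  };
}
```

Running this for the 500-token, 1000-call example makes the overhead concrete, which helps justify the time spent trimming instructions.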

3. Context Window Management

BrianBot uses a 30-day rolling window for memory summaries — the only context management strategy in the codebase. This is a good start but leaves opportunities: summarize older context rather than dropping it, prioritize recent and relevant over chronological, pre-compute embeddings to select only the most relevant context. See Context Window Management for detailed strategies.
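The "prioritize recent over chronological" idea can be sketched as a recency-first selector. This is a hypothetical helper, assuming per-message token counts are precomputed; it keeps the newest messages that fit a budget and drops the rest (a fuller version would summarize what it drops rather than discarding it).

```typescript
interface Message {
  text: string;
  tokens: number; // assumed precomputed per message
}

// Walk newest-to-oldest, keep messages while they fit the budget,
// then restore chronological order for the prompt.
function selectRecentContext(history: Message[], budget: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    if (used + history[i].tokens > budget) break;
    kept.push(history[i]);
    used += history[i].tokens;
  }
  return kept.reverse();
}
```

Swapping the recency walk for an embedding-similarity ranking gives the relevance-based variant mentioned above.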

4. Response Constraints

maxTokens per step ranges from 1024 (metadata) to 8192 (transcript). These caps prevent runaway generation costs, but they should be tuned to actual usage: if metadata responses average 200 tokens, a 1024 cap costs nothing in normal operation, while a 4096 cap quadruples the worst-case spend on a runaway response for no benefit.
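Tuning caps from observed usage can be reduced to a small formula. This is a sketch, not BrianBot's actual config logic: it caps each step at a headroom multiple of its observed average output, clamped to illustrative floor and ceiling values.

```typescript
// Derive a per-step maxTokens cap from observed average output size.
// headroom, floor, and ceil are illustrative defaults, not BrianBot's.
function tunedMaxTokens(
  avgOutputTokens: number,
  headroom = 3,
  floor = 256,
  ceil = 8192,
): number {
  return Math.min(ceil, Math.max(floor, avgOutputTokens * headroom));
}
```

For the 200-token metadata example above, this yields a 600-token cap: generous headroom for normal responses, but far less runaway exposure than 4096.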

5. Caching

BrianBot has no caching layer. For repeated or similar prompts, response caching (even short-lived) can dramatically reduce token usage. Anthropic's prompt caching feature can cache system prompts across calls. Neither is currently used.
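A short-lived response cache is small enough to sketch here. This is a minimal exact-match, client-side memoization layer (a hypothetical addition, distinct from and complementary to Anthropic's server-side prompt caching):

```typescript
// Minimal TTL cache for exact-match prompts. Entries expire after ttlMs.
class ResponseCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(prompt: string): string | undefined {
    const hit = this.store.get(prompt);
    if (!hit) return undefined;
    if (Date.now() > hit.expiresAt) {
      this.store.delete(prompt); // evict expired entry
      return undefined;
    }
    return hit.value;
  }

  set(prompt: string, value: string): void {
    this.store.set(prompt, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

Exact-match caching only pays off for prompts that repeat verbatim; near-duplicate prompts need normalization or semantic keying, which is a larger project.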

Implementing Cost Tracking

The gap between "we track tokens" and "we understand costs" is a multiplication:

cost = (inputTokens × inputPricePerToken) + (outputTokens × outputPricePerToken)

Maintain a price table per model, update it when pricing changes, and populate that costEstimateCents field. This turns token data into business intelligence. See Cost Tracking and Budget Controls for the full implementation.
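The multiplication above, plus a price table, is all the code this takes. The model names and per-million-token prices below are placeholders for illustration; keep a real table synced with the provider's published pricing.

```typescript
// Illustrative price table in USD per million tokens.
// Values are placeholders; sync with the provider's current pricing.
const PRICES_PER_MTOK: Record<string, { input: number; output: number }> = {
  "haiku-tier": { input: 0.8, output: 4 },
  "sonnet-tier": { input: 3, output: 15 },
};

// cost = inputTokens * inputPrice + outputTokens * outputPrice, in cents.
function costEstimateCents(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICES_PER_MTOK[model];
  if (!p) throw new Error(`no price entry for ${model}`);
  const usd =
    (inputTokens / 1_000_000) * p.input +
    (outputTokens / 1_000_000) * p.output;
  return usd * 100;
}
```

Writing this value back to the costEstimateCents field on each run turns the always-zero column into a queryable cost history.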

The 80/20 of Token Optimization

For most applications, these three moves capture 80% of the value: use the smallest model that produces acceptable output, keep system prompts under 300 tokens where possible, and implement response caching for any prompt pattern that repeats more than 10 times per day.

Related: Model Routing Strategies, Context Window Management, Cost Tracking and Budget Controls, Prompt Architecture

🏷️#ai 🏷️#token-optimization 🏷️#cost 🏷️#brianbot
