
Context Window Management

Part of: Effective AI Utilization — Table of Contents

Every AI model has a finite context window. How you fill that window determines the quality of the output. Stuff it with irrelevant context and you get diluted responses. Trim too aggressively and the model lacks the information it needs.

BrianBot's Approach: Rolling Window

BrianBot's only context management strategy is a 30-day rolling window over the memory summaries loaded into the transcript prompt. Memories older than 30 days are dropped. This is time-based filtering: simple and predictable, but not semantic. A memory from 45 days ago might be more relevant than one from yesterday.
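The rolling window can be sketched in a few lines. This is a minimal illustration, not BrianBot's actual code; the memory record shape (`text`, `created_at`) is assumed for the example.

```python
from datetime import datetime, timedelta, timezone

MEMORY_WINDOW_DAYS = 30  # the 30-day cutoff described above

def within_window(memories, now=None, window_days=MEMORY_WINDOW_DAYS):
    """Keep only memories newer than the rolling-window cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    return [m for m in memories if m["created_at"] >= cutoff]

now = datetime(2026, 3, 1, tzinfo=timezone.utc)
memories = [
    {"text": "recent note", "created_at": now - timedelta(days=2)},
    {"text": "stale note", "created_at": now - timedelta(days=45)},
]
kept = within_window(memories, now=now)
# The 45-day-old memory is dropped regardless of how relevant it is,
# which is exactly the weakness noted above.
```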

Strategies by Sophistication

Time-based filtering (BrianBot today): Drop context older than N days. Cheap, fast, lossy.

Summarization chains: Instead of dropping old context, summarize it. A month of memories becomes a paragraph. Two months becomes a sentence. Recursive summarization preserves information at decreasing fidelity.
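A tiered version of this idea might look like the sketch below. The `summarize` function here is a stand-in (plain truncation); in a real pipeline it would be an LLM summarization call. The tier boundaries and budgets are illustrative, not prescribed.

```python
def summarize(text, max_chars):
    """Stand-in summarizer: truncate. Real code would call a model."""
    return text if len(text) <= max_chars else text[:max_chars].rstrip() + "…"

def compress_history(entries, tiers=((30, None), (60, 200), (9999, 60))):
    """Apply decreasing fidelity by age.

    entries: list of (age_in_days, text)
    tiers:   (max_age_days, char_budget) pairs; None = keep verbatim.
    Recent entries survive intact; older ones shrink to a paragraph,
    then to a sentence-sized summary.
    """
    out = []
    for age_days, text in entries:
        for max_age, budget in tiers:
            if age_days <= max_age:
                out.append(text if budget is None else summarize(text, budget))
                break
    return out
```

The key design choice is that information is never fully discarded, only compressed, so a 45-day-old memory still contributes a trace of itself to the prompt.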

Relevance-based selection: Embed the current query and the available context, select by semantic similarity. This is what RAG systems do (see the Vector Search plan for MythOS). More compute upfront, but dramatically better context quality.
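A minimal sketch of relevance-based selection, using a toy bag-of-words "embedding" and cosine similarity so the example stays self-contained. A real system would call an embedding model and usually a vector index instead.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word-count vector. Stand-in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, chunks, k=2):
    """Return the k context chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "notes on api server deployment",
    "grocery list",
    "api latency report",
]
selected = top_k("deploy the api server", chunks, k=2)
# The grocery list never makes it into the prompt.
```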

Priority-based budgeting: Allocate portions of the context window to different categories: 40% for the primary content, 30% for relevant history, 20% for system instructions, 10% for examples. Enforce the budget by truncating the lowest-priority category first.
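The budgeting scheme above can be enforced with per-category caps, as in this sketch. Sections are represented as token lists and the 40/30/20/10 split from the text is hard-coded; a fuller version would also reallocate unused budget upward and cut the lowest-priority category first when reallocating.

```python
# Nominal share of the context window per category, highest priority first.
SHARES = {"primary": 0.40, "history": 0.30, "system": 0.20, "examples": 0.10}

def fit_to_budget(sections, window_tokens):
    """Trim each category's token list to its share of the window.

    sections: {category_name: list_of_tokens}
    Because the shares sum to 1.0, the trimmed total can never
    exceed window_tokens.
    """
    out = {}
    for name, share in SHARES.items():
        cap = int(window_tokens * share)
        out[name] = sections.get(name, [])[:cap]
    return out

sections = {
    "primary": ["t"] * 60,   # over budget: will be cut to 40% of window
    "history": ["t"] * 10,   # under budget: kept whole
    "system": ["t"] * 30,    # over budget: cut to 20% of window
    "examples": ["t"] * 20,  # over budget: cut to 10% of window
}
trimmed = fit_to_budget(sections, window_tokens=100)
```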

Pre-flight Token Counting

BrianBot doesn't count tokens before submission — a gap that risks silent truncation or API errors on oversized prompts. The fix: use a tokenizer (tiktoken for OpenAI, Anthropic's token counter) to measure the assembled prompt before sending, and apply your context management strategy if it exceeds the model's window minus your maxTokens reservation.
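The pre-flight check itself is small. This sketch uses a crude chars/4 approximation so it stays dependency-free; real code should swap in an exact tokenizer (tiktoken for OpenAI models, or Anthropic's token-counting API) behind the same function.

```python
def count_tokens(text):
    """Rough approximation (~4 chars per token). Replace with an
    exact tokenizer such as tiktoken in production."""
    return max(1, len(text) // 4)

def fits_context(prompt, context_window, max_tokens):
    """True if the prompt plus the reserved completion budget fits."""
    return count_tokens(prompt) + max_tokens <= context_window

prompt = "the fully assembled prompt string"
if not fits_context(prompt, context_window=200_000, max_tokens=4_096):
    # Oversized: apply a context management strategy (trim, summarize,
    # re-rank) before sending, instead of risking silent truncation.
    pass
```

Note the `max_tokens` reservation: the window must hold both the prompt and the response, so the check subtracts the completion budget rather than comparing the prompt against the raw window size.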

The Context Quality Principle

More context is not better context. A focused 2000-token prompt with exactly the right information outperforms a 50000-token prompt with everything possibly relevant. The goal isn't to fill the window — it's to give the model exactly what it needs and nothing more.

Related: Token Optimization Playbook, Prompt Architecture, AI Pipeline Design

🏷️#ai 🏷️#context-window 🏷️#optimization 🏷️#brianbot

Created with 💜 by One Inc | Copyright 2026