A context window is the total number of tokens — input plus output — that a 📝Large Language Model (LLM) can reference when generating a response. It functions as the model's working memory: everything the model can "see" during a single exchange, distinct from the broader corpus it was trained on. System prompts, conversation history, uploaded documents, images, tool definitions, the model's own reasoning, and the generated response all consume tokens within this shared space.

How It Works

Every turn in a conversation accumulates tokens linearly. The input phase contains all previous conversation history plus the current message; the output phase generates a response that becomes part of future input. Once the total approaches the context window limit, older content must be managed — either through summarization (compaction), selective removal, or starting a new session.

A larger context window allows handling more complex, multi-document, and lengthy prompts, but more context is not automatically better. As token count grows, accuracy and recall degrade — a phenomenon known as context rot. A focused 2,000-token prompt with exactly the right information routinely outperforms a 50,000-token prompt stuffed with everything possibly relevant. Curating what occupies the window matters as much as how large the window is.

Extended Thinking

When extended thinking is enabled, thinking tokens count toward the context window and are billed as output tokens. However, previous thinking blocks are automatically stripped from context in subsequent turns — they are not carried forward as input. This architecture is token-efficient: extensive reasoning happens without compounding token waste across a multi-turn conversation. The effective calculation becomes: context_window = (input_tokens - previous_thinking_tokens) + current_turn_tokens.

Context Awareness

Some models track their remaining token budget throughout a conversation, receiving updates after each tool call. This enables the model to pace itself on long-running tasks rather than guessing how much space remains — analogous to competing in a cooking show with a visible clock versus without one.

Managing Context

When conversations approach context limits, two primary strategies apply:

Server-side compaction — automatically summarizes earlier conversation content, enabling sessions that extend beyond the context window
Context editing — fine-grained strategies like clearing old tool results or thinking blocks to reclaim space

How It Works

Extended Thinking

Context Awareness

Managing Context

Contexts