Claude Code Four-Tier Context Compression Architecture Explained

Introduction

As AI coding agents become a standard part of software engineering workflows in 2026, long-running development sessions have exposed a recurring technical problem: Context Rot. Even models with very large context windows, including Anthropic’s Opus 4.6 and Sonnet 4.6 used in Claude Code, can suffer from degraded reasoning when the conversation history grows too large. The issue is not only token exhaustion, but also attention dilution. Earlier requirements, failed debugging attempts, tool outputs and finalized fixes compete for limited model focus, causing the agent to forget key constraints or even revert previously completed changes.

Although Opus 4.6 provides a native 200,000-token context window and can be extended to 1 million tokens, real development sessions consume context faster than expected. A medium-sized Spring Boot authentication refactoring task can use roughly 23,000 tokens after only 20 to 30 interaction rounds. The main contributors include source-file reads, grep results, shell outputs, test logs, system prompts, CLAUDE.md content and custom agent definitions. In a monitored 200k-token session using 132,500 tokens, interactive dialogue and tool results alone consumed 113,700 tokens, or 56.9% of the entire window.

To address this, Claude Code does not simply delete the oldest messages. Instead, it uses a progressive four-tier compression system designed to preserve essential task information while freeing context space. Drawing on source modules such as autoCompact.ts, microCompact.ts and reactiveCompact.ts, this article analyzes Claude Code’s compression pipeline, trigger thresholds, summarization prompts, failure safeguards and post-compaction recovery logic.

1. Why Claude Code Needs Context Compression

Claude Code’s compression mechanism is driven by two practical engineering problems: uncontrolled token growth and context-quality degradation.

A typical active coding session using 132,500 of 200,000 available tokens shows how quickly context becomes saturated. System prompts consume about 6.7k tokens, tool definitions 7.4k, custom agent configuration 196 tokens, persistent memory files 2.5k and preloaded skills 1.9k. The largest portion comes from historical dialogue and tool outputs, which occupy 113.7k tokens. Of the 67.5k tokens not yet used, 33k are reserved as an auto-compaction safety buffer, so only about 34.5k tokens remain freely available.
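The breakdown above can be checked with a few lines of arithmetic. This is only a sketch: the dictionary keys and the function name are illustrative, and the token figures are the rounded values quoted in the paragraph, so the totals land roughly 100 tokens away from the rounded 132,500 and 34,500 figures.

```python
# Token figures are the rounded values quoted in the text above.
# The key names and the helper function are illustrative, not Claude Code identifiers.
WINDOW = 200_000
AUTOCOMPACT_BUFFER = 33_000

usage = {
    "system_prompt": 6_700,
    "tool_definitions": 7_400,
    "custom_agents": 196,
    "memory_files": 2_500,
    "skills": 1_900,
    "messages_and_tool_results": 113_700,
}


def summarize_budget(usage, window, buffer):
    """Return used tokens, free tokens after the buffer, and the message share in percent."""
    used = sum(usage.values())
    free = window - used - buffer
    share = round(100 * usage["messages_and_tool_results"] / window, 2)
    return {"used": used, "free": free, "messages_share_pct": share}


report = summarize_budget(usage, WINDOW, AUTOCOMPACT_BUFFER)
print(report)
```

Dialogue and tool results dominate the budget, which is why the compression tiers described next target them first.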

This distribution explains why long-session coding agents cannot rely solely on larger context windows. Every file read, shell command, grep search or test run can inject large raw outputs into the conversation. Many of these outputs are useful briefly but become redundant after the model has already acted on them. If left untreated, they continue to occupy context and increase cost.

Context Rot is more damaging than token pressure alone. When early instructions, failed attempts and final solutions are all retained at full length, the model may lose the hierarchy of importance. A documented failure case involved a task to revise an API return format. After completing the fix and running tests, the model later modified unrelated functional logic and reversed the formatting improvement because the original requirement was buried inside an oversized history. Anthropic’s internal statistics reportedly also show that enabling a 1-million-token window reduces compaction frequency by only about 15%, which suggests that raw context expansion is not a complete solution.

2. Four-Tier Progressive Compression Pipeline

Claude Code’s compaction system follows a layered strategy: lightweight rule-based cleanup first, LLM summarization second, proactive request blocking third and emergency fallback deletion last.

Compression Stage	Trigger Rule	Core Processing Logic	Resource Profile
MicroCompaction	Before API calls; 60-minute idle threshold	Removes redundant old tool outputs while keeping the latest 5 results	No additional LLM call
AutoCompaction	Around 167k tokens on 200k context; around 967k on 1M context	Uses a forked LLM agent to generate structured summaries	One additional inference call
Blocking Limit	Around 177k tokens on 200k context (about 88.5% of the raw window)	Blocks outgoing API requests before failure	Prevents invalid 413 requests
Reactive Fallback	After prompt_too_long API error	Deletes earliest chronological messages	Last-resort emergency cleanup

2.1 MicroCompaction: Zero-Cost Pre-Cleanup

MicroCompaction runs silently before API requests and introduces no extra LLM cost. It is governed by predefined rules in modules such as timeBasedMCConfig.ts and targets high-volume tool outputs. Tools such as FILE_READ, SHELL, GREP, GLOB, WEB_SEARCH and FILE_EDIT can generate large historical payloads. Claude Code replaces older results with placeholder text such as [Old tool result content cleared], while preserving the latest five tool results through the keepRecent:5 rule.
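The rule described above can be sketched in a few lines. This is a minimal Python illustration, not the TypeScript source: the Message class and the micro_compact function are hypothetical, and only the placeholder text, the tool names and the keep-recent count of five come from the text.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

PLACEHOLDER = "[Old tool result content cleared]"  # placeholder text quoted above
KEEP_RECENT = 5                                    # the keepRecent:5 rule
COMPACTABLE_TOOLS = {"FILE_READ", "SHELL", "GREP", "GLOB", "WEB_SEARCH", "FILE_EDIT"}


@dataclass(frozen=True)
class Message:
    """Hypothetical message shape used only for this sketch."""
    role: str                       # "user", "assistant" or "tool"
    content: str
    tool_name: Optional[str] = None


def micro_compact(messages: List[Message], keep_recent: int = KEEP_RECENT) -> List[Message]:
    """Replace all but the newest `keep_recent` compactable tool results with a placeholder."""
    tool_indexes = [
        i for i, m in enumerate(messages)
        if m.role == "tool" and m.tool_name in COMPACTABLE_TOOLS
    ]
    # Everything except the last `keep_recent` tool results is considered stale.
    stale = set(tool_indexes[:-keep_recent]) if keep_recent > 0 else set(tool_indexes)
    return [
        replace(m, content=PLACEHOLDER) if i in stale else m
        for i, m in enumerate(messages)
    ]


history = [Message("user", "fix login")] + [
    Message("tool", f"output {i}", "GREP") for i in range(8)
]
compacted = micro_compact(history)
```

Because the pass is pure list manipulation, it costs no inference call, which matches the "zero-cost" description of this tier.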

The 60-minute idle threshold aligns with Prompt Cache’s one-hour TTL. If a user resumes after a long break, stale tool results can be cleared before full payload retransmission becomes necessary. This resembles operating-system page replacement: recently accessed data remains available, while older low-value outputs are evicted because their useful information has usually been absorbed into later discussion or code changes.
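A time-based trigger of this kind reduces to a single timestamp comparison. The sketch below assumes the check compares the time of the last activity with the current time. The function name is hypothetical, and whether the real comparison is inclusive is an assumption.

```python
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(minutes=60)  # the 60-minute figure from the text


def idle_cleanup_due(last_activity: datetime, now: datetime) -> bool:
    """Return True once the session has been idle for at least the threshold.

    Assumption: the boundary is inclusive (exactly 60 minutes counts as idle).
    """
    return now - last_activity >= IDLE_THRESHOLD


paused_at = datetime(2026, 1, 1, 9, 0)
print(idle_cleanup_due(paused_at, datetime(2026, 1, 1, 10, 0)))
```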

2.2 AutoCompaction: Structured LLM Summarization

AutoCompaction is the core compression layer. For a standard 200k-token Opus 4.6 context, Claude Code reserves 20,000 tokens for summary output. This number is derived from production compaction statistics where p99.99 summary length reached 17,387 tokens. After subtracting this reserve from 200k, the effective usable window becomes 180k tokens. A further 13k-token safety buffer places the default auto-compaction trigger at 167k tokens, or 83.5% of the raw context window.

For 1-million-token context variants, usable capacity becomes 980k tokens and the activation threshold is around 967k tokens, or 96.7% of raw capacity. Developers can adjust these thresholds through environment variables such as CLAUDE_CODE_AUTO_COMPACT_WINDOW and CLAUDE_AUTOCOMPACT_PCT_OVERRIDE.
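The threshold arithmetic for both window sizes can be reproduced directly from the figures above. The constant and function names are illustrative. The environment-variable overrides are not modeled because their exact semantics are not described here.

```python
SUMMARY_RESERVE = 20_000     # tokens reserved for summary output
AUTOCOMPACT_BUFFER = 13_000  # additional safety buffer


def usable_window(window: int) -> int:
    """Raw window minus the space reserved for the summary."""
    return window - SUMMARY_RESERVE


def auto_compact_threshold(window: int) -> int:
    """Token count at which automatic compaction would trigger."""
    return usable_window(window) - AUTOCOMPACT_BUFFER


def percent_of(tokens: int, window: int) -> float:
    return round(100 * tokens / window, 1)


for window in (200_000, 1_000_000):
    threshold = auto_compact_threshold(window)
    print(window, usable_window(window), threshold, percent_of(threshold, window))
```

Because the reserve and buffer are fixed token counts rather than percentages, the trigger sits proportionally much closer to the limit on the larger window.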

AutoCompaction has two summary-generation routes. The first is Session Memory Compact, an experimental incremental path that reuses pre-generated session notes if background memory is enabled. Final summaries are constrained between 10,000 and 40,000 tokens, and at least five recent text-containing message blocks are preserved. This route is faster because it avoids full conversation reprocessing.

The second is Full Conversation Summarization, the default production path. Claude Code creates an isolated one-turn forked agent that shares the parent session’s toolset and Prompt Cache prefix. This agent reads the conversation history and generates a structured summary using a dedicated compaction prompt.

The prompt design is strict. It explicitly forbids tool invocation, because an isolated summarization agent that calls tools can cause the compression to fail. Internal data reportedly shows that the rate of compaction failures caused by tool calls can reach 2.79% on an unconstrained Sonnet 4.6, compared with only 0.01% on Sonnet 4.5, which makes the prohibition necessary. The output uses two XML blocks: <analysis> and <summary>. The analysis block serves as temporary reasoning and is stripped later by formatCompactSummary, while the summary block preserves task intent, technical context, modified files, errors, fixes, pending work, original user instructions and suggested next steps. A key design detail is that original user prompts must be preserved verbatim to prevent requirement drift after compression.
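The separation of scratchpad and summary can be illustrated with a small post-processing step. This is a hypothetical Python stand-in for the idea, not the actual formatCompactSummary implementation. The fallback for output with no summary tags is an assumption.

```python
import re


def strip_analysis(raw: str) -> str:
    """Drop the <analysis> scratchpad and return the contents of <summary>.

    Assumption: if no <summary> block is present, the remaining text is
    returned as-is rather than treated as an error.
    """
    without_analysis = re.sub(r"<analysis>.*?</analysis>", "", raw, flags=re.DOTALL)
    match = re.search(r"<summary>(.*?)</summary>", without_analysis, flags=re.DOTALL)
    return (match.group(1) if match else without_analysis).strip()


raw_output = (
    "<analysis>scratch notes about the session</analysis>\n"
    "<summary>\nUser asked: \"Return JSON instead of XML\"\n</summary>"
)
print(strip_analysis(raw_output))
```

Only the summary text re-enters the conversation, so the model’s intermediate reasoning improves summary quality without costing context afterwards.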

Partial Compact further improves efficiency. If a prior summary already exists, Claude Code compresses only messages added after that summary through configurable from and up_to boundaries instead of reprocessing the entire history.
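One way to picture the partial path is to select only the messages that follow the most recent summary. The sketch below uses a hypothetical is_summary flag on plain dictionaries. How the real from and up_to boundaries are represented is not described here, so this is an assumption made for illustration.

```python
from typing import Dict, List


def messages_since_last_summary(messages: List[Dict]) -> List[Dict]:
    """Return the messages that come after the latest summary message.

    `is_summary` is a hypothetical marker. If no summary exists,
    the whole history is returned.
    """
    last_summary = max(
        (i for i, m in enumerate(messages) if m.get("is_summary")),
        default=-1,
    )
    return messages[last_summary + 1:]


history = [
    {"text": "old discussion"},
    {"text": "summary of earlier work", "is_summary": True},
    {"text": "new request"},
    {"text": "new tool output"},
]
pending = messages_since_last_summary(history)
```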

2.3 Blocking Limit and Circuit Breaker

If AutoCompaction cannot keep the session under control, Claude Code applies preventive safeguards.

The Blocking Limit keeps a manual compact buffer of around 3,000 tokens below the 180k usable window, which places it at about 177,000 tokens, or 88.5% of the raw 200k window. When context usage exceeds this limit, Claude Code blocks outgoing API submission rather than allowing a likely prompt_too_long failure. This avoids invalid requests, wasted latency and unnecessary billing.
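Assuming the limit is the usable window (raw window minus the 20,000-token summary reserve) minus the 3,000-token manual buffer, the check works out as follows. On a 200k window that gives 177,000 tokens, which is 88.5% of the raw window. The names are illustrative, and the inclusive comparison is an assumption.

```python
SUMMARY_RESERVE = 20_000        # reserve for summary output
MANUAL_COMPACT_BUFFER = 3_000   # manual compact buffer


def blocking_limit(window: int) -> int:
    """Token count above which requests would be held back (assumed formula)."""
    return window - SUMMARY_RESERVE - MANUAL_COMPACT_BUFFER


def should_block(current_tokens: int, window: int = 200_000) -> bool:
    """Assumption: reaching the limit exactly already blocks the request."""
    return current_tokens >= blocking_limit(window)


limit = blocking_limit(200_000)
print(limit, round(100 * limit / 200_000, 1))
```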

The circuit breaker prevents repeated compaction loops. After three consecutive auto-compact failures, automatic retries stop. March 2026 BigQuery production data showed 1,279 unique sessions with more than 50 sequential compression failures, with an extreme case reaching 3,272 invalid retries. Without the cutoff, these loops could waste about 250,000 API calls per day globally.
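The cutoff behaves like a classic circuit breaker keyed on consecutive failures. The class below is a minimal sketch under that assumption. The class and method names are hypothetical, and only the limit of three comes from the text.

```python
from typing import Callable

MAX_CONSECUTIVE_FAILURES = 3  # figure from the text


class CompactionCircuitBreaker:
    """Stops automatic compaction after too many consecutive failures."""

    def __init__(self, max_failures: int = MAX_CONSECUTIVE_FAILURES) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def run(self, compact: Callable[[], None]) -> bool:
        """Try one compaction and return True only if it ran and succeeded."""
        if self.is_open:
            return False  # breaker tripped: skip the call entirely
        try:
            compact()
        except Exception:
            self.consecutive_failures += 1
            return False
        self.consecutive_failures = 0  # any success resets the streak
        return True


calls = []


def always_fails() -> None:
    calls.append(1)
    raise RuntimeError("compaction failed")


breaker = CompactionCircuitBreaker()
results = [breaker.run(always_fails) for _ in range(5)]
```

In this run the failing compaction is attempted three times and then skipped, which is the behavior that keeps a broken session from generating thousands of retries.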

2.4 Reactive Fallback

Reactive Fallback is the last emergency layer. It activates only after a real prompt_too_long API response occurs despite earlier protections. The fallback deletes earliest messages chronologically until the request fits within valid limits. In mainstream external builds, this layer remains mostly stubbed, indicating that Anthropic relies primarily on proactive compression rather than post-failure recovery.
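Oldest-first deletion is simple to express. The sketch below uses a whitespace word count as a crude stand-in for a real tokenizer and always keeps at least the newest message. Both choices are assumptions made for illustration.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def drop_oldest_until_fits(
    messages: List[str],
    limit: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Remove messages from the front until the total fits within `limit`.

    Assumption: the newest message is never dropped, even if it alone
    exceeds the limit.
    """
    kept = list(messages)
    total = sum(count(m) for m in kept)
    while len(kept) > 1 and total > limit:
        total -= count(kept.pop(0))
    return kept


history = ["a b c", "d e", "f g h i", "j"]
trimmed = drop_oldest_until_fits(history, limit=6)
```

Unlike the summarization tiers, nothing from the dropped messages survives, which is why this path is reserved for emergencies.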

3. Post-Compaction Recovery Workflow

Compression does not end with summary generation. Claude Code runs Post-Compact Cleanup to restore session consistency. For main-thread compaction tasks, it resets MicroCompaction runtime flags, memory metadata cache, system prompt segment cache and tool approval records. At the same time, preloaded Skill content is intentionally retained so user-defined domain expertise is not lost.

A particularly useful recovery feature is automatic source-file reload. After compaction, Claude Code can reload up to five recently accessed files, capped at 5,000 tokens per file and 50,000 tokens in total. This lets developers continue coding without manually reopening files immediately after history replacement.
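The three caps can be combined into one selection pass. In this sketch the input is a list of (path, token_count) pairs ordered from most to least recently accessed. Truncating oversized files to the per-file cap, rather than skipping them, is an assumption, and all names are illustrative.

```python
from typing import List, Tuple

MAX_FILES = 5                # up to five recently accessed files
MAX_TOKENS_PER_FILE = 5_000  # per-file cap
TOTAL_BUDGET = 50_000        # overall cap


def select_files_to_restore(recent_files: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return (path, tokens_to_load) pairs that respect all three caps.

    Assumption: a file larger than the per-file cap is truncated, not skipped.
    """
    restored = []
    remaining = TOTAL_BUDGET
    for path, tokens in recent_files[:MAX_FILES]:
        allowance = min(tokens, MAX_TOKENS_PER_FILE, remaining)
        if allowance <= 0:
            break
        restored.append((path, allowance))
        remaining -= allowance
    return restored


recent = [
    ("auth/LoginService.java", 1_200),
    ("auth/TokenFilter.java", 9_000),
    ("config/Security.java", 300),
    ("auth/UserRepo.java", 5_000),
    ("README.md", 10),
    ("pom.xml", 10),
]
plan = select_files_to_restore(recent)
```

With five files at 5,000 tokens each, file reloads alone cannot exceed 25,000 tokens, so the 50,000-token total leaves headroom beyond the files themselves.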

4. Manual Tuning Options

Claude Code exposes several manual controls for developers.

The /compact command triggers compression on demand and can include directional instructions, such as /compact focus on the database migration code. These instructions are appended to the compaction prompt. Manual compaction reserves a smaller 3k-token buffer than automatic compaction’s 13k-token buffer. It is most useful at milestone boundaries or before switching task direction.

Project-level CLAUDE.md can also store persistent compaction guidance through a Compact Instructions block. These rules are injected into future compaction prompts automatically. Environment variables such as DISABLE_COMPACT and DISABLE_AUTO_COMPACT allow broader runtime configuration, especially in containerized or scripted development environments.

5. Role in Claude Code’s Context Governance System

Compaction works alongside three other Claude Code context-management components. CLAUDE.md acts as a static project cache loaded into new sessions. The memory system preserves user preferences and project metadata across sessions. Sub-agent isolation splits complex work into independent context sandboxes. Compaction functions as runtime garbage collection, condensing redundant dialogue into structured summaries during active development.

This combination reduces token inflation while preserving task continuity. In larger engineering organizations, similar principles can be applied across multiple model providers.

Conclusion

Claude Code’s four-tier compression architecture reflects a production-focused engineering approach. Instead of relying on a single summarization algorithm, it distributes context management across rule-based cleanup, structured LLM summarization, proactive blocking and emergency fallback deletion. Key thresholds, prompt constraints and safeguards are tied to production data, including BigQuery monitoring of invalid API calls and compaction failure loops from early 2026.

For AI engineering teams building custom coding agents or long-session LLM applications, this architecture offers several practical lessons: preserve original user requirements verbatim, separate scratchpad reasoning from reusable summaries, prevent uncontrolled retry loops and treat context compression as a runtime system rather than a one-time prompt trick. As agentic coding workflows continue to lengthen in 2026, progressive compaction is likely to become a standard infrastructure layer for commercial AI coding assistants, balancing cost control, historical fidelity and long-duration development reliability.