Context Editing + Compaction: Complete Strategy for Selective History Clearing and Server-Side Auto-Summarization
Chapter 27: Context Compaction: Automatic Summarization and Lossless Conversation Continuation
27.1 The Physical Limit of Context Windows
Even with Claude's 200K-token context window, long-running agent tasks eventually hit the ceiling. A multi-step debugging session, an extended research task, or a project assistant that runs for days accumulates conversation history faster than you might expect. The Messages API is stateless, so the full history is resent with every request. Each tool call, code snippet, and back-and-forth exchange therefore keeps occupying context until something removes it.
Context Compaction is the engineering response to this constraint. The goal is to compress accumulated conversation history with minimal information loss, enabling seamless continuation of the task.
Naive truncation — simply discarding old messages — is cheap but dangerous. An agent that forgets it already explored a failed approach, or that the user explicitly ruled out a particular solution, will waste time and frustrate users. Compaction preserves the semantically critical content while dramatically reducing token count.
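For contrast, naive truncation takes only a few lines. The minimal sketch below implements a sliding window over the message list. `truncate_history` and `max_messages` are illustrative names, and the sample history is invented. It shows how a constraint stated early in the session silently falls out of the window.

```python
def truncate_history(messages: list[dict], max_messages: int) -> list[dict]:
    """Naive sliding window: keep only the newest messages (illustrative)."""
    if max_messages <= 0:
        return []
    kept = messages[-max_messages:]
    # Start the window on a user turn so it does not open mid-exchange
    while kept and kept[0]["role"] != "user":
        kept = kept[1:]
    return kept


history = [
    {"role": "user", "content": "Fix the login bug. Never touch the payments module."},
    {"role": "assistant", "content": "Understood. Looking at auth first."},
    {"role": "user", "content": "Auth looks fine. What next?"},
    {"role": "assistant", "content": "The bug may be in session handling."},
    {"role": "user", "content": "Go ahead."},
]
window = truncate_history(history, max_messages=3)
# The "never touch payments" constraint is no longer in the window
print(any("payments" in m["content"] for m in window))  # False
```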
When to Trigger Compaction
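```python
def should_compact(messages: list[dict], system: str,
                   threshold: float = 0.75) -> bool:
    """
    Trigger compaction when estimated token usage exceeds `threshold`
    of the model's context limit. Character counting is a cheap heuristic;
    use API-reported token counts when you need precision.
    """
    MODEL_LIMIT = 200_000  # e.g. claude-opus-4-5, claude-sonnet-4-5
    total_chars = len(system)
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, str):
            total_chars += len(content)
        elif isinstance(content, list):
            for block in content:
                if not isinstance(block, dict):
                    continue
                if "text" in block:
                    total_chars += len(block["text"])
                elif block.get("type") == "tool_use":
                    # Tool inputs occupy context too
                    total_chars += len(str(block.get("input", "")))
                elif block.get("type") == "tool_result":
                    total_chars += len(str(block.get("content", "")))
    # Rough approximation: 3-4 chars per token; dividing by 3 errs
    # on the side of compacting early
    estimated_tokens = total_chars // 3
    return estimated_tokens / MODEL_LIMIT > threshold
```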
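You can replace the character heuristic with real token counts. Every Messages API response reports `usage.input_tokens`, and the Python SDK provides `client.messages.count_tokens(...)` to count a request before sending it. With prompt caching enabled, the `cache_read_input_tokens` and `cache_creation_input_tokens` usage fields must be added in as well. The minimal sketch below builds a trigger on such a count. It also reserves room for the reply, because the request's `max_tokens` has to fit in the same context window. `needs_compaction` and its defaults are illustrative assumptions.

```python
def needs_compaction(input_tokens: int, max_output_tokens: int = 4096,
                     context_limit: int = 200_000,
                     threshold: float = 0.75) -> bool:
    """
    Decide from an exact input-token count whether to compact before the
    next request. The reply's max_tokens is added because input and
    output must fit in the same context window.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    projected = input_tokens + max_output_tokens
    return projected / context_limit > threshold


print(needs_compaction(140_000))  # False: (140000 + 4096) / 200000 is about 0.72
print(needs_compaction(150_000))  # True: (150000 + 4096) / 200000 is about 0.77
```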
27.2 Claude Code's Built-In Compaction
Claude Code, Anthropic's official CLI, ships with automatic context compaction. Understanding its design informs how to build equivalent mechanisms in custom agents.
Trigger Conditions
Claude Code automatically compacts when:
- Budget threshold: Context token usage nears the model limit (the exact trigger point is internal to Claude Code and may change between releases)
- Manual command: User runs /compact
- Task boundary: A significantly different new request is detected
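A custom agent can mirror the budget and manual conditions with a small decision function. The sketch below covers only those two. Detecting a task boundary requires a semantic judgment, such as an extra classifier call, so it is left out. The function names and the "manual"/"budget" labels are hypothetical. Treating any text after the command as focus instructions for the summarizer is also an assumption of this sketch.

```python
from typing import Optional


def compaction_trigger(user_input: str, usage_ratio: float,
                       threshold: float = 0.75) -> Optional[str]:
    """
    Return why compaction should run before this turn, or None.
    Hypothetical policy: an explicit /compact command always wins;
    otherwise compact once estimated usage crosses the threshold.
    """
    if user_input.strip().split(maxsplit=1)[:1] == ["/compact"]:
        return "manual"
    if usage_ratio > threshold:
        return "budget"
    return None


def focus_instructions(user_input: str) -> str:
    """Text after '/compact', passed to the summarizer as optional focus."""
    parts = user_input.strip().split(maxsplit=1)
    return parts[1] if len(parts) == 2 and parts[0] == "/compact" else ""
```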
Compaction Flow
```text
Full conversation history (100K tokens)
        │
        ▼
┌──────────────────────────┐
│  Summarization subtask   │
│  Model: Claude Haiku     │
│  Goal: extract key state │
└──────────────────────────┘
        │
        ▼
Summary message (~2K tokens)
        │
        ▼
┌──────────────────────────┐
│  Rebuilt history         │
│  [summary] + [recent N]  │
└──────────────────────────┘
        │
        ▼
Compacted context (~20K tokens)
```
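The flow keeps the most recent messages verbatim next to the summary. Choosing that tail by a token budget rather than a fixed turn count keeps the compacted context near a predictable size. The minimal sketch below assumes the same 3-characters-per-token estimate used earlier. `recent_by_budget` and the 18K default are illustrative: the default is roughly the ~20K total in the flow minus a ~2K summary.

```python
def recent_by_budget(messages: list[dict], budget_tokens: int = 18_000,
                     chars_per_token: int = 3) -> list[dict]:
    """
    Walk backwards from the newest message, keeping messages until the
    estimated token budget would be exceeded. Always keeps at least the
    newest message.
    """
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):
        # +1 so even empty messages cost something
        cost = len(str(msg.get("content", ""))) // chars_per_token + 1
        if kept and used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()
    return kept
```

Before using such a tail in a real history, move its start back to a plain user turn so that no tool_result is separated from the tool_use it answers.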
What the Summary Contains
For a coding session, a structured summary along these lines captures the state needed to resume work (illustrative example):
```markdown
## Auto-Generated Conversation Summary
### Task Progress
- Completed: Set up FastAPI project structure, created auth module skeleton
- In Progress: Implementing JWT token validation middleware
- Pending: Write unit tests for auth endpoints
### Key Decisions Made
- Using PyJWT library (not python-jose) for token handling
- Token expiry: 15 min access, 7 days refresh
- Refresh tokens stored in Redis keyed by user_id
### Current File State
- Modified: src/auth/middleware.py, src/auth/models.py
- Key implementation: JWTMiddleware class (lines 45-89)
### Active Context
- Working on: validate_token() function
- Last error: AttributeError on line 67, payload["sub"] not found
```
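The summary replaces the history it was built from. A cheap structural check before swapping it in can therefore catch a truncated or off-format summary. This sketch assumes the `###` section headings from the example above. `REQUIRED_SECTIONS` and `missing_sections` are illustrative names, not Claude Code settings.

```python
REQUIRED_SECTIONS = ("Task Progress", "Key Decisions Made",
                     "Current File State", "Active Context")


def missing_sections(summary: str,
                     required: tuple[str, ...] = REQUIRED_SECTIONS) -> list[str]:
    """Return the required '### ' headings that do not appear in the summary."""
    headings = {line.strip()[4:].strip() for line in summary.splitlines()
                if line.strip().startswith("### ")}
    return [name for name in required if name not in headings]
```

If anything is missing, a reasonable policy is to retry the summary call or skip compaction for this turn, rather than discard history in favor of an incomplete summary.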
27.3 Building a Custom Compactor
```python
import json
from dataclasses import dataclass

import anthropic


@dataclass
class CompactionResult:
    summary: str
    compressed_messages: list[dict]
    tokens_saved: int
    messages_summarized: int


def _is_tool_result_turn(msg: dict) -> bool:
    """True for a user message that carries tool_result blocks."""
    content = msg.get("content")
    return msg["role"] == "user" and isinstance(content, list) and any(
        isinstance(b, dict) and b.get("type") == "tool_result" for b in content
    )


def tail_start(messages: list[dict], keep: int) -> int:
    """
    Index where the verbatim tail begins: about `keep` messages from the end,
    moved back to a plain user turn. This keeps roles alternating after the
    summary and never separates a tool_result from its tool_use (the API
    rejects orphaned tool_result blocks).
    """
    i = max(0, len(messages) - max(keep, 1))
    while i > 0 and (messages[i]["role"] != "user"
                     or _is_tool_result_turn(messages[i])):
        i -= 1
    return i


class ContextCompactor:
    """Custom context compactor for agent systems"""

    TASK_PROMPTS = {
        "general": """Generate a structured summary of this conversation including:
1. Completed tasks and key decisions
2. Work currently in progress
3. Outstanding action items
4. Important constraints or user preferences
The summary must be detailed enough for a new assistant instance to continue
seamlessly without repeating completed work.""",
        "coding": """Summarize this coding session including:
1. Task goal and completion status
2. Modified files and key changes
3. Current blocking issue or error (if any)
4. Technical decisions (libraries chosen, architecture patterns)
5. Next action to take""",
        "research": """Summarize this research session including:
1. Research question and objective
2. Key findings gathered so far
3. Hypotheses or paths already eliminated
4. What still needs to be investigated""",
    }

    def __init__(self, client: anthropic.Anthropic):
        self.client = client
        self.summary_model = "claude-haiku-4-5"

    def _generate_summary(self, messages: list[dict],
                          task_type: str = "general") -> str:
        prompt = self.TASK_PROMPTS.get(task_type, self.TASK_PROMPTS["general"])
        # Flatten the history into a truncated plain-text transcript
        parts = []
        for msg in messages:
            role = msg["role"].upper()
            content = msg.get("content", "")
            if isinstance(content, str):
                parts.append(f"{role}: {content[:2000]}")
            elif isinstance(content, list):
                for block in content:
                    if not isinstance(block, dict):
                        continue
                    if block.get("type") == "text":
                        parts.append(f"{role}: {block['text'][:1000]}")
                    elif block.get("type") == "tool_use":
                        parts.append(
                            f"{role}: [tool_call: {block['name']}"
                            f"({json.dumps(block['input'])[:150]})]"
                        )
                    elif block.get("type") == "tool_result":
                        parts.append(
                            f"TOOL_RESULT: {str(block.get('content', ''))[:400]}"
                        )
        resp = self.client.messages.create(
            model=self.summary_model,
            max_tokens=1024,
            messages=[{"role": "user",
                       "content": f"{prompt}\n\n---\n\n" + "\n\n".join(parts)}],
        )
        return resp.content[0].text

    def compact(self, messages: list[dict], keep_recent_turns: int = 3,
                task_type: str = "general") -> CompactionResult:
        split = tail_start(messages, keep_recent_turns * 2)
        if split == 0:
            # Nothing old enough to summarize
            return CompactionResult("", messages, 0, 0)
        to_summarize = messages[:split]
        to_keep = messages[split:]
        summary = self._generate_summary(to_summarize, task_type)
        compressed = [
            {"role": "user",
             "content": f"[Auto-generated summary of {len(to_summarize)} prior messages]\n\n{summary}"},
            {"role": "assistant",
             "content": "I have the prior context. Please continue."},
        ] + to_keep
        chars_saved = sum(len(str(m.get("content", ""))) for m in to_summarize) - len(summary)
        return CompactionResult(
            summary=summary,
            compressed_messages=compressed,
            tokens_saved=max(0, chars_saved // 3),
            messages_summarized=len(to_summarize),
        )
```
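A compacted history is only useful if the API accepts it. The most common failure is a tool_result whose matching tool_use was summarized away, or the reverse. The minimal validator below assumes histories are stored as plain dicts in the Messages API shape. It is meant to run on `compressed_messages` before the next request. `tool_pairing_problems` is an illustrative name.

```python
def _block_ids(msg: dict, block_type: str, key: str) -> set:
    """Collect `key` from every block of `block_type` in a message."""
    content = msg.get("content")
    if not isinstance(content, list):
        return set()
    return {b.get(key) for b in content
            if isinstance(b, dict) and b.get("type") == block_type}


def tool_pairing_problems(messages: list[dict]) -> list[str]:
    """
    List tool_use/tool_result pairing errors: every tool_result must answer a
    tool_use in the immediately preceding assistant message, and every
    tool_use (except in the final message) must be answered in the
    immediately following user message.
    """
    problems: list[str] = []
    for i, msg in enumerate(messages):
        if msg["role"] == "user":
            prev = messages[i - 1] if i > 0 else {"role": None}
            offered = (_block_ids(prev, "tool_use", "id")
                       if prev["role"] == "assistant" else set())
            for tid in _block_ids(msg, "tool_result", "tool_use_id") - offered:
                problems.append(f"message {i}: tool_result {tid} has no matching tool_use")
        elif msg["role"] == "assistant" and i + 1 < len(messages):
            nxt = messages[i + 1]
            answered = (_block_ids(nxt, "tool_result", "tool_use_id")
                        if nxt["role"] == "user" else set())
            for tid in _block_ids(msg, "tool_use", "id") - answered:
                problems.append(f"message {i}: tool_use {tid} is never answered")
    return problems
```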
Smart Compactor: Preserve Critical Messages
Not all messages are equal. Calls to state-changing tools, tool results that report errors, and explicit user constraints should survive compaction intact rather than being summarized:
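```python
class SmartCompactor(ContextCompactor):
    CRITICAL_TOOLS = {"write_file", "execute_code", "database_query",
                      "create_resource", "deploy"}
    # English keywords plus Chinese ones: 不能 "cannot", 必须 "must", 禁止 "forbidden"
    HARD_CONSTRAINTS = ["must not", "never", "required", "constraint",
                        "不能", "必须", "禁止"]

    @staticmethod
    def _blocks(msg: dict, block_type: str) -> list[dict]:
        """All content blocks of the given type in a message."""
        content = msg.get("content")
        if not isinstance(content, list):
            return []
        return [b for b in content
                if isinstance(b, dict) and b.get("type") == block_type]

    def _is_critical(self, msg: dict) -> bool:
        # Calls to state-changing tools
        if any(b.get("name") in self.CRITICAL_TOOLS
               for b in self._blocks(msg, "tool_use")):
            return True
        # Tool results that report an error
        for b in self._blocks(msg, "tool_result"):
            text = str(b.get("content", "")).lower()
            if b.get("is_error") or "error" in text or "exception" in text:
                return True
        # Explicit user constraints
        content = msg.get("content", "")
        if msg["role"] == "user" and isinstance(content, str):
            return any(kw in content.lower() for kw in self.HARD_CONSTRAINTS)
        return False

    def _tail_start(self, messages: list[dict], keep: int) -> int:
        # Start the verbatim tail on a plain user turn, never on an
        # assistant turn or on a tool_result cut off from its tool_use
        i = max(0, len(messages) - max(keep, 1))
        while i > 0 and (messages[i]["role"] != "user"
                         or self._blocks(messages[i], "tool_result")):
            i -= 1
        return i

    def smart_compact(self, messages: list[dict], keep_recent_turns: int = 3,
                      task_type: str = "coding") -> CompactionResult:
        split = self._tail_start(messages, keep_recent_turns * 2)
        to_process, to_keep = messages[:split], messages[split:]

        # A tool_use and its tool_result must stay adjacent, so each
        # critical message pulls its partner along with it
        keep_idx: set[int] = set()
        for i, m in enumerate(to_process):
            if not self._is_critical(m):
                continue
            keep_idx.add(i)
            if self._blocks(m, "tool_use") and i + 1 < len(to_process):
                keep_idx.add(i + 1)
            if self._blocks(m, "tool_result") and i > 0:
                keep_idx.add(i - 1)
        critical = [m for i, m in enumerate(to_process) if i in keep_idx]
        ordinary = [m for i, m in enumerate(to_process) if i not in keep_idx]

        summary = self._generate_summary(ordinary, task_type) if ordinary else ""
        compressed = []
        if summary:
            compressed += [
                {"role": "user", "content": f"[Conversation summary]\n{summary}"},
                {"role": "assistant", "content": "Understood, proceeding."},
            ]
        # Any adjacent same-role turns left by the filtering are combined
        # into a single turn by the Messages API
        compressed += critical + to_keep
        chars_saved = sum(len(str(m.get("content", ""))) for m in ordinary) - len(summary)
        return CompactionResult(summary, compressed, max(0, chars_saved // 3),
                                len(ordinary))
```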
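The plain substring test for "error" also fires on benign output such as "0 errors". A tool_result block can carry an `is_error` flag, which is a more reliable signal when your tool runner sets it. The minimal sketch below combines that flag with a stricter text pattern. The regular expression is an illustrative heuristic, not an exhaustive list of failure formats, and `tool_result_failed` is a hypothetical name.

```python
import re

# Heuristic failure markers: Python tracebacks, CamelCase exception names
# such as AttributeError or NullPointerException, and "error:" at line start
_FAILURE = re.compile(
    r"Traceback \(most recent call last\)|\b\w*(?:Error|Exception)\b|^error:",
    re.MULTILINE,
)


def tool_result_failed(block: dict) -> bool:
    """True if a tool_result block reports a failure."""
    if block.get("is_error"):
        return True
    content = block.get("content", "")
    if isinstance(content, list):  # content may be a list of text blocks
        content = " ".join(b.get("text", "") for b in content if isinstance(b, dict))
    return bool(_FAILURE.search(str(content)))
```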
27.4 Integrating Compaction into the Agent Loop
```python
import anthropic


class CompactionAwareAgent:
    COMPACT_THRESHOLD = 0.75
    MODEL_LIMIT = 200_000

    def __init__(self, system: str = ""):
        self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        self.compactor = SmartCompactor(self.client)
        self.messages: list[dict] = []
        self.system = system
        self.compaction_count = 0

    def _token_estimate(self) -> int:
        total = len(self.system)
        for m in self.messages:
            total += len(str(m.get("content", "")))
        return total // 3

    def _maybe_compact(self, task_type: str = "coding"):
        ratio = self._token_estimate() / self.MODEL_LIMIT
        if ratio > self.COMPACT_THRESHOLD:
            print(f"[Compaction] Usage at {ratio:.0%}, compacting...")
            result = self.compactor.smart_compact(
                self.messages, keep_recent_turns=5, task_type=task_type
            )
            self.messages = result.compressed_messages
            self.compaction_count += 1
            print(f"[Compaction #{self.compaction_count}] "
                  f"Summarized {result.messages_summarized} messages, "
                  f"saved ~{result.tokens_saved:,} tokens")

    def _create(self, tools: list[dict] | None):
        # Build kwargs on every call: compaction rebinds self.messages,
        # so a dict built once would keep sending the stale list
        kwargs = {
            "model": "claude-opus-4-5",
            "max_tokens": 4096,
            "messages": self.messages,
        }
        if self.system:
            kwargs["system"] = self.system
        if tools:
            kwargs["tools"] = tools
        return self.client.messages.create(**kwargs)

    def turn(self, user_input: str, tools: list[dict] | None = None,
             task_type: str = "coding") -> str:
        self.messages.append({"role": "user", "content": user_input})
        self._maybe_compact(task_type)
        response = self._create(tools)
        # Handle tool use loop (simplified)
        while response.stop_reason == "tool_use" and tools:
            # Store blocks as plain dicts so the compactor can inspect them
            self.messages.append({
                "role": "assistant",
                "content": [b.model_dump(exclude_none=True) for b in response.content],
            })
            tool_results = self._handle_tools(response.content)
            self.messages.append({"role": "user", "content": tool_results})
            self._maybe_compact(task_type)
            response = self._create(tools)
        text = "".join(b.text for b in response.content if b.type == "text")
        if text:  # non-final messages must have non-empty content
            self.messages.append({"role": "assistant", "content": text})
        return text

    def _handle_tools(self, content) -> list[dict]:
        # Placeholder: a real agent dispatches each call to its implementation
        return [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": f"[Tool {b.name} executed]"}
            for b in content if b.type == "tool_use"
        ]
```
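The `_handle_tools` method above is a stub. A real dispatcher could map tool names to Python callables and report failures with `is_error`, so that both the model and the compactor's critical-message check can see them. The `TOOL_REGISTRY` contents and the `handle_tools` name are hypothetical. Blocks are read by attribute (`.type`, `.id`, `.name`, `.input`), as the SDK's content-block objects expose them. `SimpleNamespace` stands in for those objects in the usage example.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable

# Hypothetical tools; a real agent would register file, shell, or API tools
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "read_config": lambda key: {"timeout": 30}[key],
}


def handle_tools(content: list,
                 registry: dict[str, Callable[..., Any]] = TOOL_REGISTRY) -> list[dict]:
    """Run every tool_use block and return matching tool_result dicts."""
    results = []
    for block in content:
        if block.type != "tool_use":
            continue
        result = {"type": "tool_result", "tool_use_id": block.id}
        try:
            output = registry[block.name](**block.input)
            result["content"] = output if isinstance(output, str) else json.dumps(output)
        except Exception as exc:  # unknown tool or failing tool: tell the model
            result["content"] = f"{type(exc).__name__}: {exc}"
            result["is_error"] = True
        results.append(result)
    return results


calls = [
    SimpleNamespace(type="text", text="Let me compute."),
    SimpleNamespace(type="tool_use", id="t1", name="add", input={"a": 2, "b": 3}),
    SimpleNamespace(type="tool_use", id="t2", name="missing", input={}),
]
for r in handle_tools(calls):
    print(r)
# {'type': 'tool_result', 'tool_use_id': 't1', 'content': '5'}
# {'type': 'tool_result', 'tool_use_id': 't2', 'content': "KeyError: 'missing'", 'is_error': True}
```

To wire it into the agent, `_handle_tools` could simply return `handle_tools(content)`.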
27.5 Measuring Compaction Quality
Compaction introduces a risk: silent information loss. The only way to know what you lost is to test systematically.
```python
import anthropic


def measure_compaction_fidelity(
    client: anthropic.Anthropic,
    original_messages: list[dict],
    compacted_messages: list[dict],
    probe_questions: list[str],
) -> float:
    """
    Measure how much information survived compaction by comparing
    answers to probe questions under both contexts. The original history
    must still fit in the model's context window, so run this as an
    evaluation step on captured sessions.
    """
    if not probe_questions:
        raise ValueError("probe_questions must not be empty")

    def ask(history: list[dict], question: str) -> str:
        q_msg = [{"role": "user",
                  "content": f"Answer in one sentence: {question}"}]
        return client.messages.create(
            model="claude-haiku-4-5", max_tokens=128,
            messages=history + q_msg
        ).content[0].text

    consistent = 0
    for question in probe_questions:
        original_ans = ask(original_messages, question)
        compacted_ans = ask(compacted_messages, question)
        verdict = client.messages.create(
            model="claude-haiku-4-5", max_tokens=5,
            messages=[{"role": "user",
                       "content": f"Do these two answers convey the same meaning?\n"
                                  f"A: {original_ans}\nB: {compacted_ans}\n"
                                  f"Reply YES or NO only."}]
        ).content[0].text.strip().upper()
        if "YES" in verdict:
            consistent += 1
    return consistent / len(probe_questions)
```
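LLM-judged probes cost three model calls per question. A cheaper first pass assumes you can list the facts that must survive (library choices, constraints, file names) and checks whether each one appears verbatim in the summary text. It catches blatant drops but not paraphrases, so it complements the probe-question method rather than replacing it. `fact_recall` is an illustrative name.

```python
def fact_recall(summary: str, facts: list[str]) -> tuple[float, list[str]]:
    """
    Return the fraction of `facts` found in `summary` (case-insensitive)
    and the list of facts that are missing.
    """
    if not facts:
        raise ValueError("facts must not be empty")
    text = summary.lower()
    missing = [f for f in facts if f.lower() not in text]
    return (len(facts) - len(missing)) / len(facts), missing


summary = """### Key Decisions Made
- Using PyJWT library (not python-jose) for token handling
- Token expiry: 15 min access, 7 days refresh"""
score, missing = fact_recall(summary, ["PyJWT", "15 min", "Redis"])
print(round(score, 2), missing)  # 0.67 ['Redis']
```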
27.6 Hierarchical Compaction for Very Long Sessions
When a single-level summary would itself become too long (sessions spanning hundreds of turns), use hierarchical compaction:
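```python
import anthropic


def hierarchical_compact(client: anthropic.Anthropic,
                         messages: list[dict],
                         chunk_size: int = 20) -> list[dict]:
    """Two-level compaction for very long sessions"""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")

    def is_plain_user(msg: dict) -> bool:
        # A user turn that carries no tool_result blocks
        content = msg.get("content")
        return msg["role"] == "user" and not (
            isinstance(content, list)
            and any(isinstance(b, dict) and b.get("type") == "tool_result"
                    for b in content))

    # Keep roughly the last chunk verbatim, starting on a plain user turn
    # so no tool_result is cut off from its tool_use
    tail = max(0, len(messages) - chunk_size)
    while tail > 0 and not is_plain_user(messages[tail]):
        tail -= 1
    if tail == 0:
        return messages  # Too short to need compaction

    compactor = ContextCompactor(client)
    head = messages[:tail]
    chunks = [head[i:i + chunk_size] for i in range(0, len(head), chunk_size)]

    # Level 1: summarize each chunk independently
    l1_summaries = [f"[Segment {i + 1}]\n{compactor._generate_summary(chunk)}"
                    for i, chunk in enumerate(chunks)]

    # Level 2: merge the L1 summaries into one when there are many segments
    if len(l1_summaries) > 3:
        combined = "\n\n".join(l1_summaries)
        final = client.messages.create(
            model="claude-haiku-4-5", max_tokens=2000,
            messages=[{"role": "user",
                       "content": "Merge these segment summaries into one "
                                  f"coherent narrative:\n\n{combined}"}]
        ).content[0].text
    else:
        final = "\n\n".join(l1_summaries)

    return [
        {"role": "user", "content": f"[Session history summary]\n{final}"},
        {"role": "assistant", "content": "I have the full prior context. Please continue."},
    ] + messages[tail:]
```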
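Two levels suffice for most sessions. If even the merged summary grows too long, the same idea generalizes to a tree: keep merging groups of summaries until one remains. In the minimal sketch below, the summarizer is passed in as a function, for example a wrapper around the Haiku merge call in `hierarchical_compact`. `tree_reduce` and `fan_in` are illustrative names, and the `" + ".join` merger in the usage line is a stand-in for a model call.

```python
from typing import Callable


def tree_reduce(summaries: list[str],
                merge: Callable[[list[str]], str],
                fan_in: int = 4) -> str:
    """
    Merge summaries in groups of `fan_in`, level by level, until a single
    summary remains. Each call to `merge` sees at most `fan_in` inputs,
    so no single merge prompt grows with session length.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")  # otherwise no progress
    if not summaries:
        return ""
    level = list(summaries)
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        # A group of one needs no merge call
        level = [merge(g) if len(g) > 1 else g[0] for g in groups]
    return level[0]


print(tree_reduce(["s1", "s2", "s3", "s4", "s5"], " + ".join, fan_in=2))
# s1 + s2 + s3 + s4 + s5
```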
Summary
Context compaction is an essential component of any production agent whose sessions can outgrow the context window. The key engineering principles:
- Trigger compaction at 70-80% context usage — before hitting the wall, not after
- Use a fast, cheap model (Claude Haiku) for summary generation to keep compaction cost low
- Preserve critical messages (state-changing tool calls, errors, explicit constraints) verbatim, keeping each tool_use together with its tool_result
- Keep the most recent N turns uncompacted for precise short-term continuity
- Validate compaction quality with probe questions — silent information loss is the primary risk
- For sessions spanning hundreds of turns, use two-level hierarchical compaction
The next chapter moves from context management to knowledge retrieval: RAG architecture — how to give Claude access to knowledge that exceeds any context window by building retrieval-augmented generation systems.