Context Editing + Compaction: Complete Strategy for Selective History Clearing and Server-Side Auto-Summarization

Chapter 27: Context Compaction: Automatic Summarization and Lossless Conversation Continuation

27.1 The Physical Limit of Context Windows

Even with Claude's 200K token context window, long-running agent tasks will eventually hit the ceiling. A multi-step code debugging session, extended research task, or days-long project assistant accumulates conversation history faster than you might expect — each tool call, each code snippet, each back-and-forth exchange burns tokens that never return.

Context Compaction is the engineering response to this constraint. The goal is to compress accumulated conversation history with minimal information loss, enabling seamless continuation of the task.

Naive truncation — simply discarding old messages — is cheap but dangerous. An agent that forgets it already explored a failed approach, or that the user explicitly ruled out a particular solution, will waste time and frustrate users. Compaction preserves the semantically critical content while dramatically reducing token count.

When to Trigger Compaction

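def should_compact(messages: list[dict], system: str,
                   threshold: float = 0.75) -> bool:
    """
    Trigger compaction when estimated token usage exceeds `threshold`
    (a fraction between 0 and 1) of the model's context limit.
    """
    MODEL_LIMIT = 200_000  # claude-opus-4-5, claude-sonnet-4-5

    total_chars = len(system)
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, str):
            total_chars += len(content)
        elif isinstance(content, list):
            for block in content:
                if not isinstance(block, dict):
                    continue
                if "text" in block:
                    total_chars += len(block["text"])
                else:
                    # tool_use / tool_result blocks have no "text" key, but
                    # their payloads still consume context
                    total_chars += len(str(block.get("input", "")))
                    total_chars += len(str(block.get("content", "")))

    # Rough approximation: 3-4 chars per token. Dividing by 3 errs on the
    # high side, so compaction triggers slightly early rather than late.
    estimated_tokens = total_chars // 3
    return estimated_tokens / MODEL_LIMIT > threshold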
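
The character heuristic is deliberately crude. Where an exact figure matters, the count can come from the API instead. The sketch below assumes the installed Anthropic Python SDK exposes `client.messages.count_tokens` and that its response carries an `input_tokens` field; the function name `should_compact_exact` and its parameters are illustrative, not part of the original code.

```python
from typing import Any


def should_compact_exact(client: Any, messages: list[dict], system: str,
                         model: str = "claude-sonnet-4-5",
                         model_limit: int = 200_000,
                         threshold: float = 0.75) -> bool:
    """Like should_compact, but asks the API for the token count.

    Assumes `client` is an anthropic.Anthropic instance, or any object
    with a compatible `messages.count_tokens` method.
    """
    kwargs: dict[str, Any] = {"model": model, "messages": messages}
    if system:  # only send a system prompt when there is one
        kwargs["system"] = system
    count = client.messages.count_tokens(**kwargs)
    return count.input_tokens / model_limit > threshold
```

Each exact check costs one extra request, so a reasonable compromise is to run the cheap heuristic on every turn and fall back to the exact count only when the estimate is close to the threshold.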

27.2 Claude Code's Built-In Compaction

Claude Code, Anthropic's official CLI, ships with automatic context compaction. Understanding its design informs how to build equivalent mechanisms in custom agents.

Trigger Conditions

Claude Code compacts the conversation when any of the following occurs:

Budget threshold: Context token usage exceeds ~75% of the model limit
Manual command: User runs /compact
Task boundary: A significantly different new request is detected

Compaction Flow

Full conversation history (100K tokens)
            │
            ▼
┌──────────────────────────┐
│  Summarization subtask   │
│  Model: Claude Haiku     │
│  Goal: extract key state │
└──────────────────────────┘
            │
            ▼
Summary message (~2K tokens)
            │
            ▼
┌──────────────────────────┐
│  Rebuilt history         │
│  [summary] + [recent N]  │
└──────────────────────────┘
            │
            ▼
Compacted context (~20K tokens)
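
The figures in the diagram imply a simple budget: the compacted context is the summary plus whatever recent messages are kept verbatim. The helper below is an illustrative sketch (the name `compaction_budget` is an assumption) that makes the arithmetic explicit.

```python
def compaction_budget(full_tokens: int, summary_tokens: int,
                      recent_tokens: int) -> dict[str, float]:
    """Compute the size of the compacted context and the share of tokens freed."""
    compacted = summary_tokens + recent_tokens
    return {
        "compacted_tokens": compacted,
        "tokens_freed": full_tokens - compacted,
        "reduction": 1 - compacted / full_tokens,
    }


# Figures from the diagram: 100K of history, a ~2K summary, ~20K after
# compaction, which implies roughly 18K tokens of recent messages kept verbatim.
print(compaction_budget(100_000, 2_000, 18_000))
```

With these numbers the summary accounts for only a tenth of the compacted context; the recent verbatim messages dominate, so `keep_recent_turns` is the main lever on the final size.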

What the Summary Contains

For a coding session, Claude Code generates structured summaries covering:

## Auto-Generated Conversation Summary

### Task Progress
- Completed: Set up FastAPI project structure, created auth module skeleton
- In Progress: Implementing JWT token validation middleware  
- Pending: Write unit tests for auth endpoints

### Key Decisions Made
- Using PyJWT library (not python-jose) for token handling
- Token expiry: 15 min access, 7 days refresh
- Refresh tokens stored in Redis keyed by user_id

### Current File State
- Modified: src/auth/middleware.py, src/auth/models.py
- Key implementation: JWTMiddleware class (lines 45-89)

### Active Context
- Working on: validate_token() function
- Last error: AttributeError on line 67, payload["sub"] not found
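
A custom agent that wants summaries in this shape can hold the fields in a small data structure and render them on demand. The sketch below is hypothetical: `SessionSummary` and its field names mirror the headings of the example above and are not a format Claude Code is known to use internally.

```python
from dataclasses import dataclass, field


@dataclass
class SessionSummary:
    completed: list[str] = field(default_factory=list)
    in_progress: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    modified_files: list[str] = field(default_factory=list)
    working_on: str = ""
    last_error: str = ""

    def to_markdown(self) -> str:
        """Render the summary using the section headings shown above."""
        lines = ["## Auto-Generated Conversation Summary", "", "### Task Progress"]
        for label, items in (("Completed", self.completed),
                             ("In Progress", self.in_progress),
                             ("Pending", self.pending)):
            if items:
                lines.append(f"- {label}: {', '.join(items)}")
        lines += ["", "### Key Decisions Made"]
        lines += [f"- {d}" for d in self.decisions]
        lines += ["", "### Current File State"]
        if self.modified_files:
            lines.append(f"- Modified: {', '.join(self.modified_files)}")
        lines += ["", "### Active Context"]
        if self.working_on:
            lines.append(f"- Working on: {self.working_on}")
        if self.last_error:
            lines.append(f"- Last error: {self.last_error}")
        return "\n".join(lines)
```

Keeping the summary structured means later compactions can merge fields (for example, moving items from `in_progress` to `completed`) instead of summarizing a summary.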

27.3 Building a Custom Compactor

import anthropic
from dataclasses import dataclass

@dataclass
class CompactionResult:
    summary: str
    compressed_messages: list[dict]
    tokens_saved: int
    messages_summarized: int

class ContextCompactor:
    """Custom context compactor for agent systems"""

    TASK_PROMPTS = {
        "general": """Generate a structured summary of this conversation including:
1. Completed tasks and key decisions
2. Work currently in progress
3. Outstanding action items
4. Important constraints or user preferences

The summary must be detailed enough for a new assistant instance to continue
seamlessly without repeating completed work.""",

        "coding": """Summarize this coding session including:
1. Task goal and completion status
2. Modified files and key changes
3. Current blocking issue or error (if any)
4. Technical decisions (libraries chosen, architecture patterns)
5. Next action to take""",

        "research": """Summarize this research session including:
1. Research question and objective
2. Key findings gathered so far
3. Hypotheses or paths already eliminated
4. What still needs to be investigated"""
    }

    def __init__(self, client: anthropic.Anthropic):
        self.client = client
        self.summary_model = "claude-haiku-4-5"

    def _generate_summary(self, messages: list[dict],
                           task_type: str = "general") -> str:
        import json  # local import, hoisted out of the per-block loop

        prompt = self.TASK_PROMPTS.get(task_type, self.TASK_PROMPTS["general"])

        # Flatten the history into a role-tagged transcript, truncating long
        # entries so the summarization request itself stays small
        parts = []
        for msg in messages:
            role = msg["role"].upper()
            content = msg.get("content", "")
            if isinstance(content, str):
                parts.append(f"{role}: {content[:2000]}")
            elif isinstance(content, list):
                for block in content:
                    if isinstance(block, dict):
                        if block.get("type") == "text":
                            parts.append(f"{role}: {block['text'][:1000]}")
                        elif block.get("type") == "tool_use":
                            parts.append(
                                f"{role}: [tool_call: {block['name']}"
                                f"({json.dumps(block['input'])[:150]})]"
                            )
                        elif block.get("type") == "tool_result":
                            parts.append(
                                f"TOOL_RESULT: {str(block.get('content',''))[:400]}"
                            )

        resp = self.client.messages.create(
            model=self.summary_model,
            max_tokens=1024,
            messages=[{"role": "user",
                        "content": f"{prompt}\n\n---\n\n" + "\n\n".join(parts)}]
        )
        return resp.content[0].text

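    @staticmethod
    def _starts_turn(msg: dict) -> bool:
        """True for a user message that carries no tool_result blocks."""
        if msg.get("role") != "user":
            return False
        content = msg.get("content", "")
        return isinstance(content, str) or not any(
            isinstance(b, dict) and b.get("type") == "tool_result"
            for b in content
        )

    def compact(self, messages: list[dict], keep_recent_turns: int = 3,
                task_type: str = "general") -> CompactionResult:
        # Start from the last keep_recent_turns * 2 messages, then move the
        # split earlier until the kept tail begins on a plain user message.
        # Cutting between a tool_use and its tool_result would leave an
        # orphaned tool_result, which the API rejects.
        split = max(0, len(messages) - keep_recent_turns * 2)
        while 0 < split < len(messages) and not self._starts_turn(messages[split]):
            split -= 1
        if split == 0:
            return CompactionResult("", messages, 0, 0)

        to_summarize = messages[:split]
        to_keep = messages[split:]

        summary = self._generate_summary(to_summarize, task_type)

        compressed = [
            {"role": "user",
             "content": f"[Auto-generated summary of {len(to_summarize)} prior messages]\n\n{summary}"},
            {"role": "assistant",
             "content": "I have the prior context. Please continue."}
        ] + to_keep

        # Same ~3 chars per token heuristic used for the trigger estimate
        chars_saved = sum(len(str(m.get("content", ""))) for m in to_summarize) - len(summary)
        return CompactionResult(
            summary=summary,
            compressed_messages=compressed,
            tokens_saved=max(0, chars_saved // 3),
            messages_summarized=len(to_summarize)
        )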

Smart Compactor: Preserve Critical Messages

Not all messages are equal. Tool call results that caused state changes, and explicit user constraints, should survive compaction intact:

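class SmartCompactor(ContextCompactor):
    CRITICAL_TOOLS = {"write_file", "execute_code", "database_query",
                      "create_resource", "deploy"}

    @staticmethod
    def _has_block(msg: dict, block_type: str) -> bool:
        """True if the message content contains a block of the given type."""
        content = msg.get("content", "")
        return isinstance(content, list) and any(
            isinstance(b, dict) and b.get("type") == block_type for b in content
        )

    def _is_critical(self, msg: dict) -> bool:
        content = msg.get("content", "")
        if isinstance(content, list):
            for block in content:
                if not isinstance(block, dict):
                    continue
                if block.get("type") == "tool_use" and block.get("name") in self.CRITICAL_TOOLS:
                    return True
                if block.get("type") == "tool_result":
                    text = str(block.get("content", "")).lower()
                    if "error" in text or "exception" in text:
                        return True
        if isinstance(content, str):
            # The last three keywords are Chinese: "cannot", "must", "forbidden"
            hard_constraints = ["must not", "never", "required", "constraint",
                                "不能", "必须", "禁止"]
            if any(kw in content.lower() for kw in hard_constraints):
                return True
        return False

    def smart_compact(self, messages: list[dict], keep_recent_turns: int = 3,
                      task_type: str = "coding") -> CompactionResult:
        # Split off the recent tail, backing up so it never starts with a
        # tool_result whose tool_use would fall into the summarized part
        split = max(0, len(messages) - keep_recent_turns * 2)
        while 0 < split < len(messages) and self._has_block(messages[split], "tool_result"):
            split -= 1
        to_process, to_keep = messages[:split], messages[split:]

        # Keep critical messages verbatim, together with the other half of any
        # tool_use/tool_result pair: the API rejects a tool call without its
        # result, and a result without its call
        kept = set()
        for i, m in enumerate(to_process):
            if not self._is_critical(m):
                continue
            kept.add(i)
            if self._has_block(m, "tool_use") and i + 1 < len(to_process):
                kept.add(i + 1)
            if self._has_block(m, "tool_result") and i > 0:
                kept.add(i - 1)

        critical = [m for i, m in enumerate(to_process) if i in kept]
        ordinary = [m for i, m in enumerate(to_process) if i not in kept]

        summary = self._generate_summary(ordinary, task_type) if ordinary else ""

        compressed = []
        if summary:
            compressed += [
                {"role": "user", "content": f"[Conversation summary]\n{summary}"},
                {"role": "assistant", "content": "Understood, proceeding."}
            ]
        compressed += critical + to_keep

        # Estimate savings from what was actually summarized (~3 chars per token)
        chars_saved = sum(len(str(m.get("content", ""))) for m in ordinary) - len(summary)
        return CompactionResult(summary, compressed, max(0, chars_saved // 3),
                                len(ordinary))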
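
Keeping selected messages verbatim while summarizing their neighbors can separate a `tool_use` block from its `tool_result`, and the Messages API expects each tool result to directly follow the assistant message that issued the call. A cheap structural check before sending a compacted history catches this. The function below is a sketch under that assumption; `find_orphaned_tool_blocks` is a hypothetical helper name.

```python
def find_orphaned_tool_blocks(messages: list[dict]) -> list[str]:
    """Return ids of tool_use blocks with no tool_result in the next message,
    and of tool_result blocks with no tool_use in the previous message."""

    def ids(msg: dict, block_type: str, key: str) -> set[str]:
        content = msg.get("content", "")
        if not isinstance(content, list):
            return set()
        return {b[key] for b in content
                if isinstance(b, dict) and b.get("type") == block_type}

    orphans: list[str] = []
    for i, msg in enumerate(messages):
        prev_uses = ids(messages[i - 1], "tool_use", "id") if i > 0 else set()
        next_results = (ids(messages[i + 1], "tool_result", "tool_use_id")
                        if i + 1 < len(messages) else set())
        orphans += sorted(ids(msg, "tool_use", "id") - next_results)
        orphans += sorted(ids(msg, "tool_result", "tool_use_id") - prev_uses)
    return orphans
```

An empty result means every tool call in the history is still paired. Note that a trailing assistant message with a pending `tool_use` is also reported, since its result has not been appended yet.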

27.4 Integrating Compaction into the Agent Loop

class CompactionAwareAgent:
    COMPACT_THRESHOLD = 0.75
    MODEL_LIMIT = 200_000

    def __init__(self, system: str = ""):
        self.client = anthropic.Anthropic()
        self.compactor = SmartCompactor(self.client)
        self.messages: list[dict] = []
        self.system = system
        self.compaction_count = 0

    def _token_estimate(self) -> int:
        total = len(self.system)
        for m in self.messages:
            total += len(str(m.get("content", "")))
        return total // 3

    def _maybe_compact(self, task_type: str = "coding"):
        ratio = self._token_estimate() / self.MODEL_LIMIT
        if ratio > self.COMPACT_THRESHOLD:
            print(f"[Compaction] Usage at {ratio:.0%} — compacting...")
            result = self.compactor.smart_compact(
                self.messages, keep_recent_turns=5, task_type=task_type
            )
            self.messages = result.compressed_messages
            self.compaction_count += 1
            print(f"[Compaction #{self.compaction_count}] "
                  f"Summarized {result.messages_summarized} messages, "
                  f"saved ~{result.tokens_saved:,} tokens")

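    def turn(self, user_input: str, tools: list[dict] | None = None,
             task_type: str = "coding") -> str:
        self.messages.append({"role": "user", "content": user_input})
        self._maybe_compact(task_type)
        response = self._call_model(tools)

        # Handle tool use loop (simplified)
        while response.stop_reason == "tool_use" and tools:
            # Store blocks as plain dicts: the compactor only inspects dict blocks
            self.messages.append({
                "role": "assistant",
                "content": [self._to_dict(b) for b in response.content
                            if b.type in ("text", "tool_use")],
            })
            tool_results = self._handle_tools(response.content)
            self.messages.append({"role": "user", "content": tool_results})
            self._maybe_compact(task_type)
            response = self._call_model(tools)

        text = next((b.text for b in response.content if hasattr(b, "text")), "")
        self.messages.append({"role": "assistant", "content": text})
        return text

    def _call_model(self, tools: list[dict] | None):
        # Rebuild kwargs on every call. _maybe_compact rebinds self.messages to
        # a new list, so a kwargs dict built once would keep sending the old,
        # uncompacted history.
        kwargs = {
            "model": "claude-opus-4-5",
            "max_tokens": 4096,
            "system": self.system,
            "messages": self.messages,
        }
        if tools:
            kwargs["tools"] = tools
        return self.client.messages.create(**kwargs)

    @staticmethod
    def _to_dict(block) -> dict:
        # Covers the two block types this simplified loop stores
        if block.type == "tool_use":
            return {"type": "tool_use", "id": block.id,
                    "name": block.name, "input": block.input}
        return {"type": "text", "text": block.text}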

    def _handle_tools(self, content) -> list[dict]:
        return [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": f"[Tool {b.name} executed]"}
            for b in content if b.type == "tool_use"
        ]
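
The `_handle_tools` stub above returns a placeholder string for every call. In a real agent that method would dispatch to actual implementations. The following is a minimal sketch of one way to do that, assuming tools are plain Python callables held in a registry; `TOOL_REGISTRY`, `handle_tools`, and the `add` tool are hypothetical names introduced for illustration.

```python
from types import SimpleNamespace
from typing import Any, Callable

# Hypothetical registry mapping tool names to plain Python callables.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
}


def handle_tools(content: list[Any]) -> list[dict]:
    """Run each tool_use block and build the matching tool_result blocks."""
    results = []
    for block in content:
        if getattr(block, "type", None) != "tool_use":
            continue
        try:
            output = TOOL_REGISTRY[block.name](**block.input)
            result = {"type": "tool_result", "tool_use_id": block.id,
                      "content": str(output)}
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop
            result = {"type": "tool_result", "tool_use_id": block.id,
                      "content": f"{type(exc).__name__}: {exc}",
                      "is_error": True}
        results.append(result)
    return results


# Usage with a stand-in block (real ones come from response.content)
demo = [SimpleNamespace(type="tool_use", id="t1", name="add",
                        input={"a": 2, "b": 3})]
print(handle_tools(demo))
```

Returning errors as tool results matters for compaction too: the smart compactor treats results containing "error" or "exception" as critical, so failures reported this way survive summarization.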

27.5 Measuring Compaction Quality

Compaction introduces a risk: silent information loss. The only way to know what you lost is to test systematically.

def measure_compaction_fidelity(
    client: anthropic.Anthropic,
    original_messages: list[dict],
    compacted_messages: list[dict],
    probe_questions: list[str]
) -> float:
    """
    Measure how much information survived compaction by comparing
    answers to probe questions under both contexts.
    """
    consistent = 0
    for question in probe_questions:
        q_msg = [{"role": "user",
                  "content": f"Answer in one sentence: {question}"}]

        original_ans = client.messages.create(
            model="claude-haiku-4-5", max_tokens=128,
            messages=original_messages + q_msg
        ).content[0].text

        compacted_ans = client.messages.create(
            model="claude-haiku-4-5", max_tokens=128,
            messages=compacted_messages + q_msg
        ).content[0].text

        verdict = client.messages.create(
            model="claude-haiku-4-5", max_tokens=5,
            messages=[{"role": "user",
                        "content": f"Do these two answers convey the same meaning?\n"
                                   f"A: {original_ans}\nB: {compacted_ans}\n"
                                   f"Reply YES or NO only."}]
        ).content[0].text.strip().upper()

        if "YES" in verdict:
            consistent += 1

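    # Guard against division by zero when no probe questions are supplied
    if not probe_questions:
        raise ValueError("probe_questions must not be empty")
    return consistent / len(probe_questions)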
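
LLM-judged probes cost three API calls per question. A cheaper, deterministic complement is to check whether known key facts (library names, file paths, identifiers) still appear in the summary text. The sketch below assumes simple case-insensitive substring matching; `key_fact_retention` is a hypothetical helper, and it cannot detect facts that were paraphrased or distorted.

```python
def key_fact_retention(summary: str,
                       key_facts: list[str]) -> tuple[float, list[str]]:
    """Return (fraction of facts found, list of missing facts).

    A fact counts as retained when it appears in the summary as a
    case-insensitive substring.
    """
    if not key_facts:
        raise ValueError("key_facts must not be empty")
    haystack = summary.lower()
    missing = [f for f in key_facts if f.lower() not in haystack]
    return 1 - len(missing) / len(key_facts), missing
```

Running this on every compaction and reserving the LLM-based probe for periodic audits keeps the quality check affordable.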

27.6 Hierarchical Compaction for Very Long Sessions

When a single-level summary would itself become too long (sessions spanning hundreds of turns), use hierarchical compaction:

def hierarchical_compact(client: anthropic.Anthropic,
                          messages: list[dict],
                          chunk_size: int = 20) -> list[dict]:
    """Two-level compaction for very long sessions"""
    compactor = ContextCompactor(client)

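    chunks = [messages[i:i+chunk_size]
              for i in range(0, len(messages), chunk_size)]
    # Zero or one chunk: nothing to summarize, so return the history unchanged
    # (this also avoids indexing chunks[-1] on an empty list)
    if len(chunks) <= 1:
        return messages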

    # Level 1: summarize each chunk independently
    l1_summaries = []
    for i, chunk in enumerate(chunks[:-1]):
        s = compactor._generate_summary(chunk)
        l1_summaries.append(f"[Segment {i+1}]\n{s}")

    # Level 2: merge all L1 summaries into one
    if len(l1_summaries) > 3:
        combined = "\n\n".join(l1_summaries)
        final = client.messages.create(
            model="claude-haiku-4-5", max_tokens=2000,
            messages=[{"role": "user",
                        "content": f"Merge these segment summaries into one "
                                   f"coherent narrative:\n\n{combined}"}]
        ).content[0].text
    else:
        final = "\n\n".join(l1_summaries)

    return [
        {"role": "user", "content": f"[Session history summary]\n{final}"},
        {"role": "assistant", "content": "I have the full prior context. Please continue."}
    ] + chunks[-1]  # Keep the last chunk verbatim
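
Fixed-size slicing can cut between a tool call and its result, so the final chunk kept verbatim may begin with an orphaned `tool_result`. One way around this, sketched here under the assumption that a turn starts at a user message without `tool_result` blocks, is to cut only at turn boundaries. `chunk_on_turn_boundaries` is a hypothetical replacement for the list comprehension that builds `chunks`.

```python
def chunk_on_turn_boundaries(messages: list[dict],
                             chunk_size: int = 20) -> list[list[dict]]:
    """Split messages into chunks of at least chunk_size messages, cutting
    only before a user message that carries no tool_result blocks."""

    def starts_turn(msg: dict) -> bool:
        if msg.get("role") != "user":
            return False
        content = msg.get("content", "")
        if isinstance(content, str):
            return True
        return not any(isinstance(b, dict) and b.get("type") == "tool_result"
                       for b in content)

    chunks: list[list[dict]] = []
    current: list[dict] = []
    for msg in messages:
        # Close the current chunk only when it is full AND a new turn begins
        if len(current) >= chunk_size and starts_turn(msg):
            chunks.append(current)
            current = []
        current.append(msg)
    if current:
        chunks.append(current)
    return chunks
```

Chunks may run slightly over `chunk_size` while waiting for the next boundary, which is usually an acceptable trade for keeping every tool exchange intact.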

Summary

Context compaction is an essential component of any production agent system that runs long sessions. The key engineering principles:

Trigger compaction at 70-80% context usage — before hitting the wall, not after
Use a fast, cheap model (Claude Haiku) for summary generation to keep compaction cost low
Preserve critical messages (state-changing tool calls, explicit constraints) verbatim
Keep the most recent N turns uncompacted for precise short-term continuity
Validate compaction quality with probe questions — silent information loss is the primary risk
For sessions spanning hundreds of turns, use two-level hierarchical compaction

The next chapter moves from context management to knowledge retrieval: RAG architecture — how to give Claude access to knowledge that exceeds any context window by building retrieval-augmented generation systems.
