Chapter 16

Dual Compression System: Context Window Management Mechanism

A 100K context window sounds enormous — until you start a real programming task. This chapter reveals how Hermes uses an elegant dual compression system to fit nearly unlimited work history into a finite context window.


16.1 Why Compression Is Necessary: The Reality of 100K Window Consumption

16.1.1 The Gap Between Theory and Reality

A 100K token context window sounds more than sufficient. But let's examine how much a real 2-hour programming session actually consumes:

# 2-hour programming session token consumption analysis (empirical data)

session_token_analysis = {
    "session_basics": {
        "total_duration": "127 minutes",
        "tool_calls": 83,
        "file_reads": 24,
        "shell_commands": 31,
        "python_executions": 28,
        "total_tokens": 94_847
    },
    
    "token_source_distribution": {
        "system_prompt":          {"tokens": 2_156,  "pct": "2.3%"},
        "MEMORY.md + skill inject":{"tokens": 3_847,  "pct": "4.1%"},
        "user_messages":          {"tokens": 8_234,  "pct": "8.7%"},
        "model_thinking_chain":   {"tokens": 12_891, "pct": "13.6%"},
        "model_response_text":    {"tokens": 11_447, "pct": "12.1%"},
        "tool_call_parameters":   {"tokens": 7_823,  "pct": "8.2%"},
        "tool_return_results":    {"tokens": 48_449, "pct": "51.1%"},  # ← BIGGEST!
    }
}

# Tool return results consume 51.1% of total tokens!
# This is the primary optimization target for the compression system.
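
The percentages above follow directly from the raw counts. As a minimal sketch (the dictionary keys are shortened, illustrative names, not identifiers from Hermes), the shares can be recomputed and the largest contributor picked out like this:

```python
# Raw token counts copied from the session analysis above.
token_sources = {
    "system_prompt": 2_156,
    "memory_and_skills": 3_847,
    "user_messages": 8_234,
    "model_thinking_chain": 12_891,
    "model_response_text": 11_447,
    "tool_call_parameters": 7_823,
    "tool_return_results": 48_449,
}

total = sum(token_sources.values())

# Share of the total for each source, in percent, rounded to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in token_sources.items()}

# The source with the highest raw count is the first optimization target.
largest = max(token_sources, key=token_sources.get)

print(f"total tokens: {total:,}")
print(f"largest source: {largest} ({shares[largest]}%)")
```

Ranking sources this way is what motivates targeting tool outputs rather than, say, the model's own responses.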

16.1.2 Typical Token Explosion Scenarios

Scenario 1: Reading a large log file
──────────────────────────────────────
[Tool Call] file_read: "server.log"
[Tool Result] 2000 lines of log output
[Token Cost] ~15,000 tokens per tool call

Scenario 2: Running test suite
──────────────────────────────────────
[Tool Call] shell_exec: "pytest tests/ -v"
[Tool Result] 847 test case outputs
[Token Cost] ~12,000 tokens

Scenario 3: Analyzing Python package list
──────────────────────────────────────────
[Tool Call] python_exec: "list(pkg_resources.working_set)"
[Token Cost] ~3,000 tokens

Together, these 3 tool calls consume ~30,000 tokens:
nearly a third of a 100K window, and almost all of a 32K one!
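
To put the three scenarios in proportion, the following sketch adds up their approximate costs and expresses the sum as a fraction of two window sizes. The helper name `window_fraction` is illustrative only.

```python
# Approximate per-call costs taken from the three scenarios above.
scenario_costs = {
    "file_read server.log": 15_000,
    "pytest tests/ -v": 12_000,
    "package list": 3_000,
}

def window_fraction(tokens_used: int, window_size: int) -> float:
    """Fraction of a context window consumed by `tokens_used` tokens."""
    return tokens_used / window_size

total_cost = sum(scenario_costs.values())

for window in (32_000, 100_000):
    print(f"{window:>7,}-token window: {window_fraction(total_cost, window):.0%} used")
```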

16.1.3 Token Exhaustion Without Compression

Context window exhaustion timeline (no compression):

Token Usage
100K ┤              ╭──── OVERFLOW!
 80K ┤           ╭──╯
 60K ┤        ╭──╯
 40K ┤     ╭──╯
 20K ┤  ╭──╯
  0K └─────────────────────────────────→ Time
     0      30min    60min    90min   120min
     
     Session terminates after ~45-50 minutes: tokens exhausted!

16.2 The "Sacred Zone" Protection Mechanism

16.2.1 Sacred Zone Definition

The "Sacred Zone" is the portion of context that is absolutely never compressed. It consists of three components:

┌──────────────────────────────────────────────────────────┐
│                  Full Context Window                     │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │               Sacred Zone                          │  │
│  │      ← Always preserved verbatim →                 │  │
│  │                                                     │  │
│  │  [1] System prompt + MEMORY.md + Skill injection   │  │
│  │      (~6K tokens, fixed content)                   │  │
│  │                                                     │  │
│  │  [2] First conversation turn (first user message   │  │
│  │      + first assistant response) — task anchor     │  │
│  │                                                     │  │
│  │  [3] Most recent ~20K tokens (~15-20 recent steps) │  │
│  │      (maintains working memory integrity)           │  │
│  └────────────────────────────────────────────────────┘  │
│                          │                               │
│  ┌────────────────────────────────────────────────────┐  │
│  │            Compressible Zone                        │  │
│  │   ← Old tool outputs replaced with summaries →     │  │
│  │                                                     │  │
│  │  [Tool call 1] python_exec: pd.read_csv(...)       │  │
│  │  [Tool result] ████████ COMPRESSED (12K → 150 tok) │  │
│  │                                                     │  │
│  │  [Tool call 2] file_read: server.log               │  │
│  │  [Tool result] ████████ COMPRESSED (15K → 200 tok) │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

16.2.2 Why Protect System Prompt and First Turn?

System prompt protection:

  1. Defines the model's behavioral rules and tool list — losing this fundamentally changes model behavior
  2. MEMORY.md and skill injections contain the user's core context — losing them causes severe context disruption
  3. System prompt content is fixed; its global behavioral influence is enormous relative to its token cost

First conversation turn protection:

The first turn contains the original intent and constraints of the task. Research shows LLMs are prone to "goal drift" in multi-step tasks — as tool calls accumulate, the original task gets forgotten. Keeping the first turn acts as an "anchor," preventing goal drift:

# Goal drift example
initial_task = "Write a Python script to read a CSV and generate a report"

# Step 8 model thought (if first turn was compressed):
drifted_thought = """
<think>
I've installed pandas and matplotlib, data cleaning is done,
now I should... optimize database queries? (GOAL DRIFT!)
</think>
"""

# Step 8 model thought (first turn preserved):
anchored_thought = """
<think>
Based on the initial task, I need to generate the final report file.
Data analysis is complete, now I should plot charts with matplotlib
and save as a PDF report.
</think>
"""

16.2.3 Why 20K for Recent Protection?

The recent ~20K token protection serves several key engineering purposes:

  1. State continuity: Recent tool results are the direct basis for the next decision
  2. Error recovery: The agent needs complete error information to recover correctly
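  3. Intermediate result reference: The agent frequently needs specific values from earlier steps

from typing import List

class SacredZoneDetector:
    def __init__(self, recent_protection_tokens: int = 20_000):
        self.recent_protection_tokens = recent_protection_tokens

    def find_sacred_boundary(self, messages: List[Message]) -> int:
        """Return the index where the recent Sacred Zone starts.

        Messages between the first turn and this index are compressible.
        """
        # Layout: [0] system msg, [1] first user msg, [2] first assistant response
        first_dialog_end = 2

        # If the loop below never uses up the budget, everything after the
        # first turn counts as "recent" and nothing is compressible
        recent_boundary_idx = first_dialog_end + 1
        total_recent_tokens = 0

        # Count backwards from the end to find the 20K token boundary
        for i in range(len(messages) - 1, first_dialog_end, -1):
            total_recent_tokens += count_tokens(messages[i].content)
            if total_recent_tokens >= self.recent_protection_tokens:  # default 20K
                # Always keep at least the newest message protected
                recent_boundary_idx = min(i + 1, len(messages) - 1)
                break

        return recent_boundary_idx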
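
The backward scan is easier to follow on a toy conversation. The sketch below is a self-contained approximation, not the Hermes implementation: the `Message` dataclass, the word-counting `count_tokens`, and the tiny 40-token budget are all stand-ins chosen so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str
    content: str

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

def find_recent_boundary(messages: List[Message], budget: int, first_turn_end: int = 2) -> int:
    """Index of the first message inside the protected recent window."""
    boundary = first_turn_end + 1  # default: everything after the first turn is protected
    used = 0
    # Walk from newest to oldest, stopping before the protected first turn.
    for i in range(len(messages) - 1, first_turn_end, -1):
        used += count_tokens(messages[i].content)
        if used >= budget:
            boundary = i + 1  # message i would exceed the budget, so it stays outside
            break
    return boundary

messages = [
    Message("system", "rules"),
    Message("user", "write a report script"),
    Message("assistant", "ok starting"),
    Message("tool", "row " * 50),        # 50 tokens of old output
    Message("assistant", "loaded data"),
    Message("tool", "line " * 30),       # 30 tokens of recent output
    Message("assistant", "next step"),
]

boundary = find_recent_boundary(messages, budget=40)
compressible = list(range(3, boundary))  # indices between the first turn and the recent window
print(boundary, compressible)
```

With this budget, the three newest messages (34 tokens) stay protected and only the 50-token tool output at index 3 becomes a compression candidate.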

16.3 Old Tool Output Replacement Algorithm

16.3.1 Type-Specific Compression Strategies

import asyncio

class ToolOutputCompressor:
    COMPRESSION_THRESHOLD = 500   # tokens
    TARGET_COMPRESSED_LENGTH = 150  # tokens
    
    async def compress(self, tool_result: ToolResult) -> str:
        content = tool_result.content
        # Small outputs are cheap to keep, so they pass through unchanged
        if count_tokens(content) <= self.COMPRESSION_THRESHOLD:
            return content
        
        content_type = self._detect_content_type(content)
        
        # Map each detected type to its strategy. _detect_content_type,
        # _compress_json_response, _compress_file_content and _compress_general
        # are not shown in this listing.
        dispatch = {
            "code_output":   self._compress_code_output,
            "log_file":      self._compress_log_file,
            "tabular_data":  self._compress_tabular_data,
            "json_response": self._compress_json_response,
            "file_content":  self._compress_file_content,
        }
        
        compressor = dispatch.get(content_type, self._compress_general)
        # Strategies may be sync or async; await only coroutine functions
        if asyncio.iscoroutinefunction(compressor):
            return await compressor(content)
        return compressor(content)
    
    def _compress_code_output(self, content: str) -> str:
        """Keep head + errors + tail"""
        lines = content.split('\n')
        if len(lines) <= 30:
            return content
        
        head = lines[:20]
        errors = [l for l in lines if any(kw in l for kw in ['Error', 'Exception', 'Traceback', 'Warning'])]
        tail = lines[-10:]
        
        result = head
        if errors:
            result += ["\n[Error messages]"] + errors[:5]
        result += [f"\n[...{len(lines) - 30} lines omitted...]"]
        result += tail
        return '\n'.join(result)
    
    def _compress_log_file(self, content: str) -> str:
        """Extract key events from logs"""
        lines = content.split('\n')
        important = [l for l in lines if any(kw in l for kw in ['ERROR', 'CRITICAL', 'WARN'])]
        
        summary = (
            f"[Log Summary] Total: {len(lines)} lines, "
            f"ERRORs: {sum(1 for l in lines if 'ERROR' in l)}, "
            f"WARNs: {sum(1 for l in lines if 'WARN' in l)}\n"
        )
        if important:
            summary += f"\nKey logs ({len(important)} items):\n" + '\n'.join(important[:20])
        return summary
    
    def _compress_tabular_data(self, content: str) -> str:
        """Keep structure info and statistical summary"""
        lines = content.split('\n')
        if len(lines) <= 10:
            return content
        
        return (
            f"[Tabular Data Summary]\n"
            f"Columns: {lines[0]}\n"
            f"Total rows: {len(lines) - 1}\n"
            f"First 5 rows:\n" + '\n'.join(lines[1:6]) + "\n"
            f"Last 3 rows:\n" + '\n'.join(lines[-3:])
        )
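
The listing above calls `_detect_content_type` without showing it. The following is a hypothetical sketch of how such a detector could work using cheap heuristics. The rules, the thresholds, and the `"general"` fallback label are assumptions made for illustration, not the Hermes implementation.

```python
import json

def detect_content_type(content: str) -> str:
    """Guess a content category for a tool output using cheap heuristics."""
    stripped = content.strip()
    lines = stripped.splitlines()

    # JSON: the text must actually parse, not just start with a bracket.
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json_response"
        except json.JSONDecodeError:
            pass

    # Logs: more than half of the lines carry a severity keyword.
    levels = ("DEBUG", "INFO", "WARN", "ERROR", "CRITICAL")
    if lines and sum(any(lv in ln for lv in levels) for ln in lines) / len(lines) > 0.5:
        return "log_file"

    # Tables: every line has the same non-zero number of commas.
    comma_counts = {ln.count(",") for ln in lines}
    if len(lines) > 1 and len(comma_counts) == 1 and comma_counts.pop() > 0:
        return "tabular_data"

    # Program output: a Python traceback is a strong signal.
    if "Traceback (most recent call last)" in stripped:
        return "code_output"

    return "general"

print(detect_content_type('{"status": "ok"}'))
print(detect_content_type("2024-01-01 INFO start\n2024-01-01 ERROR boom"))
print(detect_content_type("name,age\nann,31\nbob,27"))
```

Any label missing from the dispatch table, such as `"general"` here, would fall through to the default strategy via `dispatch.get(...)`.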

16.3.2 Compression Timing Control

class CompressionController:
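    def __init__(self, config: HermesConfig):
        self.trigger_threshold = 0.75  # compress when 75% full
        self.target_threshold = 0.50   # compress to 50%
        self.max_context = config.context_window_size
        # Collaborators used by compress_session()
        self.sacred_zone_detector = SacredZoneDetector()
        self.tool_compressor = ToolOutputCompressor()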
    
    def should_compress(self, session: Session) -> bool:
        return session.token_count / self.max_context > self.trigger_threshold
    
    async def compress_session(self, session: Session) -> CompressionResult:
        before_tokens = session.token_count
        sacred_boundary = self.sacred_zone_detector.find_sacred_boundary(session.messages)
        
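        # Identify compressible tool outputs: before the sacred zone, above the
        # size threshold, and not already compressed in an earlier pass
        candidates = [
            (i, msg) for i, msg in enumerate(session.messages)
            if i < sacred_boundary
            and msg.role == "tool"
            and not msg.metadata.get("compressed")
            and count_tokens(msg.content) > ToolOutputCompressor.COMPRESSION_THRESHOLD
        ]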
        
        # Sort by size descending, compress largest first
        candidates.sort(key=lambda x: count_tokens(x[1].content), reverse=True)
        
        compressed_count = 0
        for i, msg in candidates:
            if session.token_count / self.max_context <= self.target_threshold:
                break  # Target reached
            
            original_content = msg.content
            compressed_content = await self.tool_compressor.compress(ToolResult(content=original_content))
            
            session.messages[i].content = compressed_content
            session.messages[i].metadata["compressed"] = True
            session.messages[i].metadata["original_tokens"] = count_tokens(original_content)
            session.update_token_count()
            compressed_count += 1
        
        return CompressionResult(
            before_tokens=before_tokens,
            after_tokens=session.token_count,
            compression_ratio=(before_tokens - session.token_count) / before_tokens,
            messages_compressed=compressed_count
        )
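
The interplay of the 75% trigger and the 50% target can be simulated with plain integers. This sketch assumes, purely for illustration, that every compressed output shrinks to a fixed 150 tokens. The function name `plan_compression` and its parameters are hypothetical.

```python
from typing import List, Tuple

def plan_compression(
    output_sizes: List[int],
    other_tokens: int,
    window: int = 100_000,
    trigger: float = 0.75,
    target: float = 0.50,
    compressed_size: int = 150,   # assumed fixed size of a compressed output
    min_size: int = 500,          # outputs at or below this size are never compressed
) -> Tuple[List[int], int]:
    """Return (indices of outputs to compress, total tokens afterwards)."""
    total = other_tokens + sum(output_sizes)
    if total / window <= trigger:
        return [], total  # below the trigger, nothing to do

    # Largest outputs first, so the target is reached with the fewest rewrites.
    order = sorted(range(len(output_sizes)), key=lambda i: output_sizes[i], reverse=True)
    chosen = []
    for i in order:
        if total / window <= target:
            break  # target reached
        if output_sizes[i] <= min_size:
            break  # everything left is too small to be worth compressing
        total -= output_sizes[i] - compressed_size
        chosen.append(i)
    return chosen, total

sizes = [15_000, 400, 12_000, 3_000, 9_000]
chosen, after = plan_compression(sizes, other_tokens=40_000)
print(chosen, after)
```

Here the session starts at 79,400 tokens, compresses the three largest outputs, and stops at 43,850 tokens without touching the 3,000-token output.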

16.4 Empirical Compression Rate Data

16.4.1 Compression Effectiveness by Tool Type

Tool Type                    Avg Original Tokens  Avg Compressed Tokens  Compression Rate  Info Retention
───────────────────────────────────────────────────────────────────────────────────────────────────────
python_exec (heavy output)   8,432                342                    96.0%             85%
file_read (code files)       12,841               687                    94.6%             92%
shell_exec (command output)  4,127                198                    95.2%             80%
web_search (result lists)    6,234                456                    92.7%             88%
sqlite (query results)       9,876                512                    94.8%             94%
Overall average              8,302                439                    94.7%             88%
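
The compression rate here is read as the fraction of tokens removed, 1 - compressed / original. That reading is an assumption, since the chapter does not state the formula. The sketch applies it to the averaged columns: it reproduces the overall figure and most rows, and small deviations of about 0.1 points are to be expected if the table averages per-call rates instead.

```python
# (avg original tokens, avg compressed tokens) per tool type, from the table above.
rows = {
    "python_exec": (8_432, 342),
    "file_read": (12_841, 687),
    "shell_exec": (4_127, 198),
    "web_search": (6_234, 456),
    "sqlite": (9_876, 512),
}

def compression_rate(original: float, compressed: float) -> float:
    """Fraction of tokens removed by compression."""
    return 1 - compressed / original

# Per-tool rates in percent, rounded to one decimal.
rates = {tool: round(100 * compression_rate(o, c), 1) for tool, (o, c) in rows.items()}

# Column averages and the rate computed from them.
avg_original = sum(o for o, _ in rows.values()) / len(rows)
avg_compressed = sum(c for _, c in rows.values()) / len(rows)
overall = round(100 * compression_rate(avg_original, avg_compressed), 1)

print(rates)
print(avg_original, avg_compressed, overall)
```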

16.4.2 2-Hour Programming Session Analysis

2-hour session (83 tool calls):

WITHOUT compression:
    Raw tool results total:  ~127,000 tokens
    Other content:           ~47,000 tokens
    Total:                   ~174,000 tokens
    → Far exceeds 100K limit; session terminates at ~45 minutes!

WITH dual compression:
    System prompt (Sacred):        2,156 tokens
    MEMORY.md + Skills (Sacred):   3,847 tokens
    First conversation (Sacred):   1,234 tokens
    Recent 20K (Sacred):          19,847 tokens
    Historical dialogue (verbatim): 8,234 tokens
    Old tool results (compressed): 11,341 tokens (orig ~96K, ratio 88%)
    ──────────────────────────────────────────────
    Total:                        46,659 tokens (46.7% utilization)
    
    Saved ~127,000 tokens → session runs full 127 minutes!
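
The budget above can be checked with a few lines of arithmetic. The key names are illustrative; the numbers are copied from the breakdown, and the 174,000 and 96,000 figures are the approximate values quoted in the text.

```python
WINDOW = 100_000

# Token budget with dual compression, copied from the breakdown above.
budget = {
    "system_prompt": 2_156,
    "memory_and_skills": 3_847,
    "first_conversation": 1_234,
    "recent_window": 19_847,
    "historical_dialogue": 8_234,
    "old_tool_results_compressed": 11_341,
}

total = sum(budget.values())
utilization = total / WINDOW

uncompressed_total = 174_000                 # approximate, from the text
saved = uncompressed_total - total

old_results_original = 96_000                # approximate, from the text
reduction = 1 - budget["old_tool_results_compressed"] / old_results_original

print(f"total: {total:,} ({utilization:.1%} of window)")
print(f"saved: ~{saved:,} tokens, old tool results reduced by {reduction:.0%}")
```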

16.4.3 Context Usage Over Time

Token usage over time:

Without compression:
  45 min → ~85K tokens → overflow, session ends

With dual compression (50% target):
  0 min  →   2K tokens
  30 min →  38K tokens (first compression triggered)
  30 min →  31K tokens (after compression)
  60 min →  46K tokens
  90 min →  49K tokens (second compression triggered)
  127 min → 47K tokens (session naturally ends, task complete)

16.5 Synergy with Anthropic Prompt Caching

16.5.1 Prompt Caching Mechanism

When using Claude 3.5 as Hermes's model backend, Anthropic's Prompt Caching can be enabled:

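import logging
from typing import List

import anthropic

class AnthropicCachedBackend:
    def __init__(self):
        # AsyncAnthropic reads ANTHROPIC_API_KEY from the environment
        self.client = anthropic.AsyncAnthropic()

    # messages: role/content dicts in the Messages API format
    async def generate(self, messages: List[dict], system: str) -> str: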
        system_with_cache = [
            {
                "type": "text",
                "text": system,
                "cache_control": {"type": "ephemeral"}  # Cache this section
            }
        ]
        
        response = await self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            system=system_with_cache,
            messages=messages
        )
        
        usage = response.usage
        logging.info(
            f"Tokens - Input: {usage.input_tokens}, "
            f"Cache Read: {getattr(usage, 'cache_read_input_tokens', 0)}, "
            f"Output: {usage.output_tokens}"
        )
        
        return response.content[0].text

16.5.2 Combined Cost Impact

Dual Compression × Prompt Caching Combined Benefits:
Scenario: 127-minute programming session (Claude 3.5 Sonnet pricing)

┌──────────────────────────────────────────────────────────┐
│                     Cost Comparison                      │
│                                                          │
│  No compression + No caching:                            │
│    Session terminates at ~45 min — task not completable  │
│                                                          │
│  Compression + No caching:                               │
│    Total API input across all calls: ~1,240,000 tokens   │
│    Cost: 1,240,000 × $3/1M = $3.72                      │
│                                                          │
│  Compression + Prompt Caching (Sacred Zone cached):      │
│    26K Sacred Zone tokens cached                         │
│    Cache reads cost 90% less than regular input          │
│    Estimated cost: ~$1.87 (saves ~50%)                  │
│                                                          │
│  Dual compression = functional guarantee (enables task)  │
│  Prompt caching = cost optimization (reduces expense)    │
└──────────────────────────────────────────────────────────┘
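
The cost figures can be approximated with a simple model in which cache reads are billed at 10% of the base input price. The sketch ignores output tokens and the surcharge on cache writes. The 55% cached fraction is an assumption chosen because it lands close to the ~$1.87 estimate above; the chapter does not state it.

```python
INPUT_PRICE_PER_MTOK = 3.00     # USD per million input tokens, as used in the text
CACHE_READ_MULTIPLIER = 0.10    # cache reads cost 10% of the base input price

def input_cost(total_input_tokens: int, cached_fraction: float = 0.0) -> float:
    """USD cost of input tokens when `cached_fraction` of them are cache reads."""
    cached = total_input_tokens * cached_fraction
    uncached = total_input_tokens - cached
    per_token = INPUT_PRICE_PER_MTOK / 1_000_000
    return uncached * per_token + cached * per_token * CACHE_READ_MULTIPLIER

no_cache = input_cost(1_240_000)
with_cache = input_cost(1_240_000, cached_fraction=0.55)  # assumed fraction

print(f"no caching:   ${no_cache:.2f}")
print(f"with caching: ${with_cache:.2f} ({1 - with_cache / no_cache:.0%} saved)")
```

Because the saving scales with the cached fraction, keeping the Sacred Zone prefix byte-for-byte stable matters more than its absolute size.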

16.5.3 Cache Hit Rate Optimization

class CacheOptimizer:
    def optimize_for_caching(self, system_prompt: str, memory_injection: str) -> str:
        """
        Cache optimization principles:
        1. Place stable content at the front (system prompt, MEMORY.md)
        2. Place frequently changing content at the back (task-specific skills)
        3. Sacred Zone content never compressed → maintains cache validity
        """
        # Stable content → high cache hit rate
        stable_content = f"""
{system_prompt}

## User Persistent Memory (unchanged between sessions)
{memory_injection}
"""
        # Task-specific skill injections go after → lower cache hit rate, that's OK
        return stable_content
    
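    def analyze_cache_effectiveness(self, api_responses: List[dict]) -> dict:
        # Anthropic reports cached and uncached input separately:
        # input_tokens excludes tokens read from or written to the cache
        uncached = sum(r.get('input_tokens', 0) for r in api_responses)
        cache_write = sum(r.get('cache_creation_input_tokens', 0) for r in api_responses)
        cache_read = sum(r.get('cache_read_input_tokens', 0) for r in api_responses)
        total_input = uncached + cache_write + cache_read
        hit_rate = cache_read / max(total_input, 1)
        return {
            "cache_hit_rate": f"{hit_rate:.1%}",
            # Reads cost 10% of normal input; cache-write surcharges are ignored here
            "estimated_savings": f"{hit_rate * 90:.1f}%"
        }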

16.6 Complete Dual Compression System Implementation

class DualCompressionSystem:
    """
    Hermes Dual Compression System
    
    Compression Layer 1: Sacred Zone Protection (structural compression)
      - Protects system prompt, first turn, most recent 20K tokens
      - Content-compresses tool outputs outside the Sacred Zone
    
    Compression Layer 2: Intelligent Content Summarization (semantic compression)
      - Uses type-specific compression strategies for different tool outputs
      - Retains key information, discards redundant detail
    """
    
    async def maybe_compress(self, session: Session) -> Optional[CompressionResult]:
        if not self.compression_controller.should_compress(session):
            return None
        
        logging.info(
            f"Compression triggered: {session.token_count} tokens "
            f"({session.token_count/self.compression_controller.max_context:.0%} usage)"
        )
        
        result = await self.compression_controller.compress_session(session)
        
        self.compression_log.append({
            "timestamp": datetime.now().isoformat(),
            "before_tokens": result.before_tokens,
            "after_tokens": result.after_tokens,
            "ratio": result.compression_ratio,
            "messages_compressed": result.messages_compressed
        })
        
        logging.info(
            f"Compression complete: {result.before_tokens} → {result.after_tokens} tokens "
            f"(ratio {result.compression_ratio:.1%}, compressed {result.messages_compressed} messages)"
        )
        
        return result
    
    def get_compression_stats(self) -> dict:
        if not self.compression_log:
            return {"total_compressions": 0}
        
        total_saved = sum(l["before_tokens"] - l["after_tokens"] for l in self.compression_log)
        avg_ratio = sum(l["ratio"] for l in self.compression_log) / len(self.compression_log)
        
        return {
            "total_compressions": len(self.compression_log),
            "total_tokens_saved": total_saved,
            "average_compression_ratio": f"{avg_ratio:.1%}",
        }

16.7 Compression Failure Fallback Strategy

class CompressionFallback:
    async def handle_compression_failure(self, session: Session, current_ratio: float) -> str:
        """
        Fallback priority:
        1. Aggressive compression: compress all non-Sacred Zone content to minimum
        2. Partial truncation: drop oldest non-Sacred Zone messages
        3. Session archival: archive current session, start new session with summary
        """
        # Escalate with severity: check the highest threshold first so that
        # every branch is reachable
        if current_ratio > 0.95:
            summary = await self._create_session_summary(session)
            await self._archive_and_restart(session, summary)
            return f"[New session created] Original session archived. Summary: {summary[:200]}"
        elif current_ratio > 0.90:
            return self._truncate_oldest(session)
        else:
            return await self._aggressive_compress(session)
    
    async def _create_session_summary(self, session: Session) -> str:
        prompt = f"""Create a concise summary of this in-progress Agent session
to enable continuation in a new session:

Original task: {session.initial_task}
Completed steps: {session.completed_steps_summary}
Current state: {session.current_state}
Pending items: {session.pending_items}

Generate a summary (max 300 words) containing all critical information needed to continue:"""
        
        return await self.llm.generate(prompt, max_tokens=500)
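
The fallback priority can also be expressed as a small pure function, which makes the escalation order easy to test in isolation. This is one reading of the docstring, assuming the strategies escalate as the window fills up. The function name and the returned labels are illustrative.

```python
def choose_fallback(current_ratio: float) -> str:
    """Pick a fallback strategy from the fraction of the context window in use."""
    # Check the highest threshold first, otherwise the 0.95 branch is unreachable.
    if current_ratio > 0.95:
        return "archive_and_restart"   # last resort: summarize and start a new session
    if current_ratio > 0.90:
        return "truncate_oldest"       # drop the oldest non-Sacred Zone messages
    return "aggressive_compress"       # squeeze all non-Sacred Zone content first

for ratio in (0.85, 0.92, 0.97):
    print(f"{ratio:.0%} -> {choose_fallback(ratio)}")
```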

Chapter Summary

Discussion Questions

  1. The Sacred Zone's "most recent 20K tokens" is based on engineering intuition. How would you design an adaptive Sacred Zone size that dynamically adjusts based on task complexity and tool call frequency?
  2. The log compression algorithm retains only ERROR and WARN-level logs. But if an Agent needs to analyze DEBUG logs for performance diagnosis, this strategy is problematic. How do you make compression strategies more "context-aware"?
  3. Session archival (new session + summary) is a last resort. Information loss in summaries is unavoidable. How do you quantify this information loss and minimize its impact in system design?
  4. Prompt Caching is only effective between requests with the same prefix. Does Hermes's Sacred Zone design intentionally optimize for Prompt Caching? If so, where is this reflected?