Chapter 58

Token Budget Management and Toolset Optimization

Tokens are the Agent's fuel, and the budget is its range. An Agent with no awareness of its token budget is like a driver who never checks the fuel gauge: it discovers that the context window is full only when there is nowhere left to go.

58.1 The Token Budget Framework

Why Token Budget Management Matters

The Hermes Agent's context window is a finite resource. Take Claude 3.5 Sonnet's 200K-token context window as an example: having that much room does not mean it can be filled arbitrarily, for four reasons:

Linear cost growth: Input token count directly drives API fees
Superlinear latency growth: Longer contexts slow down prefill, because self-attention compute grows roughly quadratically with sequence length, which degrades response speed
Attention dilution: In very long contexts, models tend to use earlier information less reliably
Tool call pollution: Historical tool results consume large numbers of tokens and crowd out the context of the current task

Layered Token Budget Architecture

Total Token Budget (example: 100,000 tokens)
├── System Prompt:          5,000 tokens  (5%)
├── Task Description:       2,000 tokens  (2%)
├── MEMORY.md content:      3,000 tokens  (3%)
├── Tool Definitions:       8,000 tokens  (8%)
├── Conversation History:  20,000 tokens (20%)
├── Tool Call Results:     30,000 tokens (30%)
├── Reasoning Buffer:      22,000 tokens (22%)
└── Safety Margin:         10,000 tokens (10%)
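
The layered split above can be expressed as a small allocation helper. This is a minimal sketch and not part of Hermes Agent: the names `BUDGET_SHARES` and `allocate_budget` are hypothetical, and the percentages simply mirror the example tree.

```python
from typing import Dict

# Percent of the total budget per component (assumed values, copied from the example tree).
BUDGET_SHARES: Dict[str, int] = {
    "system_prompt": 5,
    "task_description": 2,
    "memory": 3,
    "tool_definitions": 8,
    "conversation_history": 20,
    "tool_results": 30,
    "reasoning_buffer": 22,
    "safety_margin": 10,
}

def allocate_budget(total_tokens: int) -> Dict[str, int]:
    """Split a total token budget into per-component caps."""
    # Guard against shares drifting away from 100% when the table is edited.
    if sum(BUDGET_SHARES.values()) != 100:
        raise ValueError("budget shares must sum to 100")
    return {name: total_tokens * pct // 100 for name, pct in BUDGET_SHARES.items()}

allocation = allocate_budget(100_000)
print(allocation["tool_results"])  # 30000
print(sum(allocation.values()))    # 100000
```

Integer percentages keep the arithmetic exact. For totals that are not multiples of 100, the floor division can leave a few tokens unallocated, which effectively adds them to the safety margin.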

Recommended Budget Profiles by Task Type

Task Type	System Prompt	Tool Defs	History	Tool Results	Reasoning Buffer
Simple Q&A	2K	1K	5K	5K	10K
Code Generation	3K	3K	10K	20K	15K
Data Analysis	3K	5K	8K	40K	15K
Document Writing	4K	2K	15K	10K	20K
Multi-step Research	5K	8K	20K	30K	20K
System Operations	3K	6K	10K	35K	15K

58.2 Dynamic Tool Loading

The Case for Dynamic Loading

Hermes Agent supports dozens of tools, and each tool's schema definition typically costs 200–500 tokens. Loading every tool upfront adds up quickly:

50 tools × avg 300 tokens = 15,000 tokens
= 15% of a 100K budget

In practice, a single task typically uses only 3–5 tools. Dynamic loading saves 70–80% of tool-definition tokens.
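
The arithmetic behind the savings figure can be checked directly. The sketch below uses a hypothetical helper, `tool_definition_savings`, and assumes every schema costs the 300-token average used above. Under that assumption, loading only 5 of 50 tools would save about 90%. The 70–80% range corresponds to roughly 10–15 loaded tools, which is plausible once always-loaded tools and same-category siblings are counted.

```python
def tool_definition_savings(total_tools: int, loaded_tools: int, avg_tokens: int = 300) -> float:
    """Fraction of tool-definition tokens saved by loading only a subset of tools."""
    full_cost = total_tools * avg_tokens      # cost of loading every schema upfront
    loaded_cost = loaded_tools * avg_tokens   # cost of the subset actually loaded
    return 1 - loaded_cost / full_cost

for loaded in (5, 10, 15):
    print(loaded, f"{tool_definition_savings(50, loaded):.0%}")
```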

Classification and On-Demand Loading Strategy

from typing import Dict, List, Optional, Set

TOOL_REGISTRY = {
    "web_search": {
        "category": "search",
        "token_cost": 280,
        "triggers": ["search", "find", "look up", "research"],
        "schema": {...}
    },
    "arxiv_search": {
        "category": "search",
        "token_cost": 320,
        "triggers": ["paper", "arxiv", "academic", "publication"],
        "schema": {...}
    },
    "code_executor": {
        "category": "code",
        "token_cost": 450,
        "triggers": ["run", "execute", "code", "script"],
        "schema": {...}
    },
    "file_reader": {
        "category": "file",
        "token_cost": 220,
        "triggers": ["read", "open file", "load file", "file content"],
        "schema": {...}
    },
    "sql_executor": {
        "category": "data",
        "token_cost": 380,
        "triggers": ["database", "SQL", "query", "table"],
        "schema": {...}
    },
}

class DynamicToolLoader:
    """
    Loads tools on demand based on task description,
    minimizing token consumption.
    """
    
    def __init__(self, token_budget: int = 8000):
        self.token_budget = token_budget
        self.loaded_tools: Dict[str, dict] = {}
        self.always_loaded: Set[str] = set()
    
    def set_always_loaded(self, tool_names: List[str]):
        self.always_loaded = set(tool_names)
    
    def infer_needed_tools(self, task_description: str) -> List[str]:
        task_lower = task_description.lower()
        needed = set(self.always_loaded)
        
        for tool_name, meta in TOOL_REGISTRY.items():
            for trigger in meta["triggers"]:
                if trigger.lower() in task_lower:
                    needed.add(tool_name)
                    # Also load tools in the same category
                    category = meta["category"]
                    for other_tool, other_meta in TOOL_REGISTRY.items():
                        if other_meta["category"] == category:
                            needed.add(other_tool)
                    break
        
        return list(needed)
    
    def load_tools_for_task(
        self,
        task_description: str,
        explicit_tools: Optional[List[str]] = None
    ) -> Dict[str, dict]:
        if explicit_tools:
            needed_tools = set(explicit_tools) | self.always_loaded
        else:
            needed_tools = set(self.infer_needed_tools(task_description))
        
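        # Admit always-loaded tools first, then the rest cheapest first.
        # Sorting by cost alone could drop an always-loaded tool on a tight budget.
        sorted_tools = sorted(
            needed_tools,
            key=lambda t: (
                t not in self.always_loaded,
                TOOL_REGISTRY.get(t, {}).get("token_cost", 999),
            ),
        )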
        
        loaded = {}
        total_tokens = 0
        
        for tool_name in sorted_tools:
            if tool_name not in TOOL_REGISTRY:
                continue
            cost = TOOL_REGISTRY[tool_name]["token_cost"]
            if total_tokens + cost <= self.token_budget:
                loaded[tool_name] = TOOL_REGISTRY[tool_name]["schema"]
                total_tokens += cost
        
        self.loaded_tools = loaded
        return loaded
    
    def get_token_usage(self) -> dict:
        total = sum(
            TOOL_REGISTRY[name]["token_cost"]
            for name in self.loaded_tools
            if name in TOOL_REGISTRY
        )
        return {
            "loaded_tools": list(self.loaded_tools.keys()),
            "tool_count": len(self.loaded_tools),
            "token_cost": total,
            "budget_used_pct": f"{total / self.token_budget * 100:.1f}%"
        }

# Usage example
loader = DynamicToolLoader(token_budget=8000)
loader.set_always_loaded(["file_reader"])

task = "Search for the latest AI papers and save results to a CSV file"
tools = loader.load_tools_for_task(task)
print("Loaded tools:", loader.get_token_usage())
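
One caveat with the loader above is that its trigger check is a plain substring test, so a trigger such as "run" also fires on "prune". A possible refinement is to match triggers on word boundaries. The sketch below is an assumption rather than Hermes Agent behavior, and `matches_trigger` is a hypothetical helper name.

```python
import re
from typing import Iterable

def matches_trigger(task_description: str, triggers: Iterable[str]) -> bool:
    """Return True if any trigger appears as a whole word or phrase."""
    for trigger in triggers:
        # \b anchors stop "run" from matching inside "prune" or "runtime".
        pattern = r"\b" + re.escape(trigger) + r"\b"
        if re.search(pattern, task_description, flags=re.IGNORECASE):
            return True
    return False

print(matches_trigger("Prune the old branches", ["run", "execute"]))    # False
print(matches_trigger("Run the migration script", ["run", "execute"]))  # True
```

The trade-off is recall: with whole-word matching, "search" no longer fires on "searching", so inflected forms would need to be listed as separate triggers.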

58.3 MEMORY.md Slimming Strategies

The MEMORY.md Token Problem

Hermes Agent's MEMORY.md grows continuously with use and can reach 5,000–10,000 tokens. Because it is injected in full into every conversation, stale entries waste tokens every time.

Strategy 1: Tiered Storage

# MEMORY.md Tiered Structure

## Hot Memory — Read every time (≤500 tokens)
- User basic preferences
- Active projects (max 3)
- Recent key decisions

## Warm Memory — On-demand (≤2000 tokens)
- Project detailed background
- Tool usage history summary
- Common configuration parameters

## Cold Memory — Archived files (unlimited)
- Historical session summaries
- Completed project records
- Error analysis reports
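
With a tiered layout like this, only the Hot Memory section needs to be injected by default. The following sketch shows one way to pull a single tier out of the file. It assumes each tier is a level-2 (`## `) heading as in the structure above, and `extract_section` is a hypothetical helper name.

```python
def extract_section(memory_text: str, heading_prefix: str) -> str:
    """Return the body of every '## ' section whose title starts with heading_prefix."""
    collected = []
    in_section = False
    for line in memory_text.splitlines():
        if line.startswith("## "):
            # A level-2 heading either opens the wanted section or closes it.
            in_section = line[3:].strip().startswith(heading_prefix)
            continue
        if in_section:
            collected.append(line)
    return "\n".join(collected).strip()

sample = """# MEMORY.md
## Hot Memory
- Prefers concise answers
## Warm Memory
- Project background
"""
print(extract_section(sample, "Hot Memory"))  # - Prefers concise answers
```

Matching on a prefix means the heading can carry extra text, such as a token limit, without breaking the lookup.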

Strategy 2: Automatic Summarization Compression

import re
from datetime import datetime, timedelta
from typing import Tuple

class MemoryOptimizer:
    def __init__(self, memory_path: str, token_limit: int = 3000):
        self.memory_path = memory_path
        self.token_limit = token_limit
    
    def estimate_tokens(self, text: str) -> int:
        # Rough heuristic: ~4 characters per token for English text.
        return len(text) // 4
    
    def compress_old_entries(self, content: str, keep_days: int = 7) -> str:
        """Drop dated entries older than keep_days, plus their undated follow-on lines."""
        cutoff_date = datetime.now() - timedelta(days=keep_days)
        date_pattern = re.compile(r'(\d{4}-\d{2}-\d{2})')
        lines = content.split('\n')
        compressed_lines = []
        skip_block = False
        archived_count = 0
        
        for line in lines:
            # A heading opens a new section, so it ends any skipped block;
            # otherwise one old entry would also swallow the headings after it.
            if line.startswith('#'):
                skip_block = False
            match = date_pattern.search(line)
            if match:
                try:
                    line_date = datetime.strptime(match.group(1), '%Y-%m-%d')
                    if line_date < cutoff_date:
                        skip_block = True
                        archived_count += 1
                        continue
                    else:
                        skip_block = False
                except ValueError:
                    # Looks like a date but is not one (e.g. 2024-13-45): treat as plain text.
                    pass
            if not skip_block:
                compressed_lines.append(line)
        
        if archived_count > 0:
            # Old entries are only dropped here; copy them to a cold-storage
            # file first if they must remain recoverable.
            compressed_lines.append(f"\n<!-- {archived_count} old entries archived -->")
        return '\n'.join(compressed_lines)
    
    def auto_trim(self, llm_summarizer=None) -> Tuple[int, int]:
        with open(self.memory_path, 'r', encoding='utf-8') as f:
            content = f.read()
        before_tokens = self.estimate_tokens(content)
        
        if before_tokens <= self.token_limit:
            return before_tokens, before_tokens
        
        content = self.compress_old_entries(content)
        
        if self.estimate_tokens(content) > self.token_limit and llm_summarizer:
            content = llm_summarizer(content, max_tokens=self.token_limit)
        
        with open(self.memory_path, 'w', encoding='utf-8') as f:
            f.write(content)
        
        return before_tokens, self.estimate_tokens(content)
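
The `llm_summarizer` argument is any callable that accepts the memory text and a `max_tokens` keyword and returns shorter text. For testing without a model, a stand-in can be useful. The sketch below is such a stand-in, not a real summarizer: `truncating_summarizer` is a hypothetical name, and it assumes the newest entries sit at the end of the file.

```python
def truncating_summarizer(content: str, max_tokens: int) -> str:
    """Stand-in for an LLM summarizer: keep the newest lines that fit the budget."""
    max_chars = max_tokens * 4  # same ~4 chars/token heuristic as estimate_tokens
    kept = []
    used = 0
    # Walk from the end of the file, assuming the newest entries come last.
    for line in reversed(content.splitlines()):
        cost = len(line) + 1  # +1 for the newline
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(reversed(kept))

text = "\n".join(f"line {i}" for i in range(100))
trimmed = truncating_summarizer(text, max_tokens=10)
print(trimmed.splitlines()[-1])  # line 99
```

It could be passed as `auto_trim(llm_summarizer=truncating_summarizer)`. Unlike a real summary, this drops headings and older content entirely, so it is only suitable for exercising the code path.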

58.4 System Prompt Compression

Before vs. After Compression

Original (~800 tokens):

You are Hermes, an autonomous AI agent developed by NousResearch.
You are designed to help users accomplish complex tasks by breaking
them down into smaller steps and using various tools available to you.
You should always be helpful, accurate, and efficient...

Compressed (~200 tokens):

You: Hermes Agent (NousResearch). Mode: autonomous task execution.
Rules: ①Break tasks into steps ②Verify tool outputs ③Ask when unclear ④Log all actions
Tools: {TOOLS_PLACEHOLDER}
Format: JSON for structured data, markdown for reports.

Compression Principles

Principle	Example
Lists over paragraphs	`①②③` instead of "First... Second... Third..."
Remove filler phrases	Delete "You should always be..."
Use placeholders	`{TOOLS_PLACEHOLDER}` injected dynamically
Abbreviate	"ctx" for "context", "req" for "requirement"
Drop obvious rules	Don't state what LLMs already do by default
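
The placeholder principle can be sketched as a small rendering step that fills the tool list at request time. The helper name `render_system_prompt` and the comma-separated tool format are assumptions made for illustration. The template text is the compressed prompt shown earlier.

```python
from typing import List

COMPRESSED_PROMPT = (
    "You: Hermes Agent (NousResearch). Mode: autonomous task execution.\n"
    "Rules: ①Break tasks into steps ②Verify tool outputs ③Ask when unclear ④Log all actions\n"
    "Tools: {TOOLS_PLACEHOLDER}\n"
    "Format: JSON for structured data, markdown for reports."
)

def render_system_prompt(template: str, tool_names: List[str]) -> str:
    """Fill the tool placeholder with a comma-separated list of loaded tool names."""
    # str.replace is used instead of str.format so that literal braces
    # (for example in JSON snippets) elsewhere in the prompt stay untouched.
    return template.replace("{TOOLS_PLACEHOLDER}", ", ".join(sorted(tool_names)))

prompt = render_system_prompt(COMPRESSED_PROMPT, ["web_search", "file_reader"])
print(prompt.splitlines()[2])  # Tools: file_reader, web_search
```

Sorting the names keeps the rendered prompt stable across runs, which may help if prompts are cached or compared.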

58.5 Token Budget Monitor Implementation

import time
import logging
from dataclasses import dataclass
from typing import Dict, List
from collections import deque

@dataclass
class TokenUsageRecord:
    timestamp: float
    session_id: str
    component: str
    tokens_used: int
    operation: str  # 'input' or 'output'

class TokenBudgetMonitor:
    def __init__(self, total_budget: int = 100_000):
        self.total_budget = total_budget
        self.current_usage: Dict[str, int] = {
            'system_prompt': 0, 'task_description': 0, 'memory': 0,
            'tools': 0, 'history': 0, 'tool_results': 0, 'output_buffer': 0,
        }
        self.history: deque = deque(maxlen=1000)
        self.alert_threshold = 0.85
        self.logger = logging.getLogger('agent.token_monitor')
    
    def record_usage(self, component: str, tokens: int,
                     session_id: str = "", operation: str = "input"):
        self.current_usage[component] = self.current_usage.get(component, 0) + tokens
        self.history.append(TokenUsageRecord(
            timestamp=time.time(), session_id=session_id,
            component=component, tokens_used=tokens, operation=operation
        ))
        self._check_alerts()
    
    def _check_alerts(self):
        total_used = sum(self.current_usage.values())
        ratio = total_used / self.total_budget
        if ratio >= 1.0:
            self.logger.error(f"TOKEN BUDGET EXCEEDED: {total_used}/{self.total_budget}")
        elif ratio >= self.alert_threshold:
            self.logger.warning(f"Token budget at {ratio*100:.1f}%")
    
    def get_budget_status(self) -> dict:
        total_used = sum(self.current_usage.values())
        return {
            "total_budget": self.total_budget,
            "total_used": total_used,
            "remaining": self.total_budget - total_used,
            "usage_percentage": f"{total_used/self.total_budget*100:.1f}%",
            "breakdown": dict(self.current_usage),
            # Use >= so the flag agrees with the threshold test in _check_alerts.
            "alert": total_used >= self.total_budget * self.alert_threshold
        }
    
    def suggest_optimizations(self) -> List[str]:
        suggestions = []
        total_used = sum(self.current_usage.values())
        if total_used == 0:
            return suggestions
        
        thresholds = {
            'tools': (0.15, "Enable dynamic tool loading"),
            'history': (0.25, "Enable rolling summary compression"),
            'tool_results': (0.35, "Retain only key tool output"),
            'memory': (0.05, "Run MEMORY.md slim-down optimization"),
        }
        for component, (threshold, suggestion) in thresholds.items():
            ratio = self.current_usage.get(component, 0) / total_used
            if ratio > threshold:
                suggestions.append(f"{component} at {ratio*100:.1f}%: {suggestion}")
        
        return suggestions

# Example usage
monitor = TokenBudgetMonitor(total_budget=100_000)
monitor.record_usage('system_prompt', 3200, 'sess_001')
monitor.record_usage('memory', 2800, 'sess_001')
monitor.record_usage('tools', 4500, 'sess_001')
monitor.record_usage('history', 18000, 'sess_001')
monitor.record_usage('tool_results', 32000, 'sess_001')

status = monitor.get_budget_status()
print(f"Usage: {status['usage_percentage']}, Remaining: {status['remaining']:,}")
for tip in monitor.suggest_optimizations():
    print(f"Optimization: {tip}")

58.6 Budget Profiles and Dynamic Adjustment

BUDGET_PROFILES = {
    "simple_qa":       {"total": 20_000,  "tools": 1_000,  "history": 5_000,  "tool_results": 5_000},
    "code_generation": {"total": 60_000,  "tools": 4_000,  "history": 10_000, "tool_results": 25_000},
    "data_analysis":   {"total": 80_000,  "tools": 5_000,  "history": 8_000,  "tool_results": 45_000},
    "long_research":   {"total": 150_000, "tools": 8_000,  "history": 30_000, "tool_results": 70_000},
}

When the budget runs low during execution, apply cuts in this priority order:

1. Compress conversation history: Replace early turns with summaries (60–80% savings)
2. Unload non-core tools: Remove schemas for tools not needed in the current step
3. Truncate tool results: Keep only the first N lines of each tool output
4. Load only Hot Memory: Temporarily skip the Warm/Cold MEMORY.md sections
5. Reduce reasoning depth: Shorten Chain-of-Thought prompts
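
The priority order above can be sketched as a loop that applies one cut at a time and stops as soon as usage fits the budget. Everything in this sketch is illustrative: `truncate_tool_result`, `CUTS`, and `apply_cuts` are hypothetical names, and the keep percentages are assumptions (only the 30% kept for history follows from the 60–80% savings quoted above). Reducing reasoning depth is a prompt-level change and is not modeled here.

```python
from typing import Dict, List, Tuple

def truncate_tool_result(text: str, max_lines: int = 20) -> str:
    """Keep only the first max_lines lines and note how many were dropped."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - max_lines
    return "\n".join(lines[:max_lines] + [f"[{omitted} more lines truncated]"])

# (cut name, usage component, percent of tokens kept after the cut); assumed values.
CUTS: List[Tuple[str, str, int]] = [
    ("compress_history", "history", 30),
    ("unload_non_core_tools", "tools", 50),
    ("truncate_tool_results", "tool_results", 50),
    ("hot_memory_only", "memory", 20),
]

def apply_cuts(usage: Dict[str, int], budget: int) -> List[str]:
    """Apply cuts in priority order until usage fits; mutates usage in place."""
    applied = []
    for name, component, keep_pct in CUTS:
        if sum(usage.values()) <= budget:
            break  # already within budget, so later cuts are not needed
        usage[component] = usage.get(component, 0) * keep_pct // 100
        applied.append(name)
    return applied

usage = {"history": 30_000, "tools": 8_000, "tool_results": 60_000, "memory": 3_000}
print(apply_cuts(usage, budget=78_000))  # ['compress_history', 'unload_non_core_tools']
print(sum(usage.values()))               # 76000
```

Stopping at the first point where usage fits keeps the cheaper, less lossy cuts in play and leaves tool results and memory untouched unless they are really needed.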

Chapter Summary

Token budget management is a core production capability for Hermes Agent:

Layered budget framework: Allocate total budget across components with a safety margin to prevent context overflow
Dynamic tool loading: Load tools on demand based on task description, saving 70–80% of tool-definition tokens
MEMORY.md slimming: Tiered storage plus automatic summarization keeps memory within 3K tokens
System prompt compression: Terse, list-style wording and placeholders reduce prompt tokens by 60–75%
Real-time monitoring: TokenBudgetMonitor tracks per-component consumption, warns as the budget fills, and suggests optimizations before limits are hit

Review Questions

If a task mid-execution suddenly requires a new tool but the token budget is nearly exhausted, how would you design a "tool swap-in/swap-out" mechanism?
What content in MEMORY.md should never be compressed? How would you mark these "anchor memories"?
When analyzing a 100KB file within token budget constraints, how would you design a chunked processing strategy?
For tasks requiring extended execution (more than 10 tool call rounds), how would you design a rolling context window?
