Token Budget Management and Toolset Optimization
Chapter 58: Token Budget Management and Toolset Optimization
Tokens are the Agent's fuel, and the budget is the range limit. Running an Agent without token budget awareness is like driving without checking the fuel gauge: you only discover the context window is full when there's nowhere left to go.
58.1 The Token Budget Framework
Why Token Budget Management Matters
The Hermes Agent's context window is a finite resource. Take Claude 3.5 Sonnet's 200K-token context window as an example: having that much room does not mean it can be filled arbitrarily, because every extra token has a cost:
- Linear cost growth: Input token count directly drives API fees
- Quadratic latency growth: Attention work during prefill grows roughly quadratically with context length, so longer contexts mean slower prefill and slower responses
- Attention dilution: In extremely long contexts, model attention to earlier information drops significantly
- Tool call pollution: Historical tool results consume large volumes of tokens, diluting current task context
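The first two points can be made concrete with a little arithmetic. The sketch below is illustrative only: the price is a placeholder rather than a quoted rate, and the quadratic term models just the attention part of prefill, not end-to-end latency. The function names are assumptions introduced here.

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost grows linearly with the number of input tokens."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


def relative_attention_work(context_tokens: int, baseline_tokens: int) -> float:
    """Attention work relative to a baseline, assuming quadratic scaling."""
    return (context_tokens / baseline_tokens) ** 2


PLACEHOLDER_PRICE = 3.0  # hypothetical USD per million input tokens

for tokens in (25_000, 50_000, 100_000, 200_000):
    cost = input_cost_usd(tokens, PLACEHOLDER_PRICE)
    work = relative_attention_work(tokens, baseline_tokens=25_000)
    print(f"{tokens:>7,} tokens  cost ${cost:.3f}  attention work x{work:.0f}")
```

Doubling the context doubles the input cost but roughly quadruples the attention work, which is why trimming context tends to help latency more than it helps the bill.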
Layered Token Budget Architecture
Total Token Budget (example: 100,000 tokens)
├── System Prompt: 5,000 tokens (5%)
├── Task Description: 2,000 tokens (2%)
├── MEMORY.md content: 3,000 tokens (3%)
├── Tool Definitions: 8,000 tokens (8%)
├── Conversation History: 20,000 tokens (20%)
├── Tool Call Results: 30,000 tokens (30%)
├── Reasoning Buffer: 22,000 tokens (22%)
└── Safety Margin: 10,000 tokens (10%)
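The same split can be derived from percentage shares instead of being hard-coded, which makes it easy to rescale for a different total. A minimal sketch, assuming the shares from the example tree; `BUDGET_SHARES` and `allocate_budget` are illustrative names, not part of Hermes Agent.

```python
from typing import Dict

# Shares copied from the example tree; they must sum to 100 percent.
BUDGET_SHARES: Dict[str, int] = {
    "system_prompt": 5,
    "task_description": 2,
    "memory": 3,
    "tool_definitions": 8,
    "conversation_history": 20,
    "tool_results": 30,
    "reasoning_buffer": 22,
    "safety_margin": 10,
}


def allocate_budget(total_tokens: int, shares: Dict[str, int]) -> Dict[str, int]:
    """Turn integer percentage shares into per-component token allowances."""
    if sum(shares.values()) != 100:
        raise ValueError("budget shares must sum to 100 percent")
    # Integer division rounds down, so the allocation never exceeds the total.
    return {name: total_tokens * pct // 100 for name, pct in shares.items()}


allocation = allocate_budget(100_000, BUDGET_SHARES)
for component, tokens in allocation.items():
    print(f"{component:22s} {tokens:>7,} tokens")
```

Using integer percentages keeps the arithmetic exact; with fractional shares, rounding would need to be handled explicitly so the parts still fit inside the total.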
Recommended Budget Profiles by Task Type
| Task Type | System Prompt | Tool Defs | History | Tool Results | Reasoning Buffer |
|---|---|---|---|---|---|
| Simple Q&A | 2K | 1K | 5K | 5K | 10K |
| Code Generation | 3K | 3K | 10K | 20K | 15K |
| Data Analysis | 3K | 5K | 8K | 40K | 15K |
| Document Writing | 4K | 2K | 15K | 10K | 20K |
| Multi-step Research | 5K | 8K | 20K | 30K | 20K |
| System Operations | 3K | 6K | 10K | 35K | 15K |
58.2 Dynamic Tool Loading
The Case for Dynamic Loading
Hermes Agent supports dozens of tools, and each tool's schema definition averages 200–500 tokens. Loading all tools upfront is expensive:
50 tools × avg 300 tokens = 15,000 tokens
= 15% of a 100K budget
In practice, a single task typically uses only 3–5 tools. Loading just those, plus a handful of always-on and same-category tools, saves roughly 70–80% of tool-definition tokens.
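The savings figure follows directly from how many schemas end up loaded. A small sketch of that arithmetic, assuming every schema costs the 300-token average; `tool_definition_savings` is a name introduced here for illustration.

```python
def tool_definition_savings(total_tools: int, loaded_tools: int,
                            avg_tokens: int = 300) -> float:
    """Fraction of tool-definition tokens saved by loading only a subset."""
    upfront = total_tools * avg_tokens
    dynamic = loaded_tools * avg_tokens
    return 1 - dynamic / upfront


for loaded in (5, 10, 15):
    saved = tool_definition_savings(total_tools=50, loaded_tools=loaded)
    print(f"{loaded:>2} of 50 tools loaded: {saved:.0%} saved")
```

Under these assumptions, a 70–80% saving corresponds to loading about 10–15 of 50 schemas; loading only the 3–5 tools a task strictly needs would save more.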
Classification and On-Demand Loading Strategy
from typing import Dict, List, Optional, Set
TOOL_REGISTRY = {
    "web_search": {
        "category": "search",
        "token_cost": 280,
        "triggers": ["search", "find", "look up", "research"],
        "schema": {...}
    },
    "arxiv_search": {
        "category": "search",
        "token_cost": 320,
        "triggers": ["paper", "arxiv", "academic", "publication"],
        "schema": {...}
    },
    "code_executor": {
        "category": "code",
        "token_cost": 450,
        "triggers": ["run", "execute", "code", "script"],
        "schema": {...}
    },
    "file_reader": {
        "category": "file",
        "token_cost": 220,
        "triggers": ["read", "open file", "load file", "file content"],
        "schema": {...}
    },
    "sql_executor": {
        "category": "data",
        "token_cost": 380,
        "triggers": ["database", "SQL", "query", "table"],
        "schema": {...}
    },
}
class DynamicToolLoader:
    """
    Loads tools on demand based on task description,
    minimizing token consumption.
    """

    def __init__(self, token_budget: int = 8000):
        self.token_budget = token_budget
        self.loaded_tools: Dict[str, dict] = {}
        self.always_loaded: Set[str] = set()

    def set_always_loaded(self, tool_names: List[str]):
        self.always_loaded = set(tool_names)

    def infer_needed_tools(self, task_description: str) -> List[str]:
        # Note: this is plain substring matching, so a short trigger such as
        # "run" also fires inside longer words such as "prune".
        task_lower = task_description.lower()
        needed = set(self.always_loaded)
        for tool_name, meta in TOOL_REGISTRY.items():
            for trigger in meta["triggers"]:
                if trigger.lower() in task_lower:
                    needed.add(tool_name)
                    # Also load tools in the same category
                    category = meta["category"]
                    for other_tool, other_meta in TOOL_REGISTRY.items():
                        if other_meta["category"] == category:
                            needed.add(other_tool)
                    break  # one matching trigger per tool is enough
        return list(needed)

    def load_tools_for_task(
        self,
        task_description: str,
        explicit_tools: Optional[List[str]] = None
    ) -> Dict[str, dict]:
        if explicit_tools:
            needed_tools = set(explicit_tools) | self.always_loaded
        else:
            needed_tools = set(self.infer_needed_tools(task_description))
        # Always-loaded tools go first so the budget cannot squeeze them out;
        # the rest are sorted by token cost (cheapest first).
        sorted_tools = sorted(
            needed_tools,
            key=lambda t: (
                t not in self.always_loaded,
                TOOL_REGISTRY.get(t, {}).get("token_cost", 999),
            )
        )
        loaded = {}
        total_tokens = 0
        for tool_name in sorted_tools:
            if tool_name not in TOOL_REGISTRY:
                continue  # unknown tool name: skip silently
            cost = TOOL_REGISTRY[tool_name]["token_cost"]
            if total_tokens + cost <= self.token_budget:
                loaded[tool_name] = TOOL_REGISTRY[tool_name]["schema"]
                total_tokens += cost
        self.loaded_tools = loaded
        return loaded

    def get_token_usage(self) -> dict:
        total = sum(
            TOOL_REGISTRY[name]["token_cost"]
            for name in self.loaded_tools
            if name in TOOL_REGISTRY
        )
        return {
            "loaded_tools": list(self.loaded_tools.keys()),
            "tool_count": len(self.loaded_tools),
            "token_cost": total,
            "budget_used_pct": f"{total / self.token_budget * 100:.1f}%"
        }
# Usage example
loader = DynamicToolLoader(token_budget=8000)
loader.set_always_loaded(["file_reader"])
task = "Search for the latest AI papers and save results to a CSV file"
tools = loader.load_tools_for_task(task)
print("Loaded tools:", loader.get_token_usage())
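One caveat with the loader is that triggers are matched as substrings, so "run" also matches "prune" and "code" matches "decode". A stricter alternative is to match whole words or phrases. The sketch below is one possible approach, not the loader's actual behavior; `trigger_matches` is a hypothetical helper.

```python
import re


def trigger_matches(trigger: str, task_description: str) -> bool:
    """Return True only when the trigger appears as a whole word or phrase."""
    pattern = r"\b" + re.escape(trigger.lower()) + r"\b"
    return re.search(pattern, task_description.lower()) is not None


print(trigger_matches("run", "Prune stale branches"))           # substring only
print(trigger_matches("run", "Run the migration script"))       # whole word
print(trigger_matches("open file", "Please open file a.txt"))   # whole phrase
print(trigger_matches("paper", "Find the latest AI papers"))    # plural form
```

The trade-off is that whole-word matching no longer catches inflected forms such as "papers", so plural or derived forms would need to be listed as triggers explicitly.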
58.3 MEMORY.md Slimming Strategies
The MEMORY.md Token Problem
Hermes Agent's MEMORY.md grows continuously with use and can reach 5,000–10,000 tokens. Because it is injected in full into every conversation, most of those tokens are wasted on content irrelevant to the current task.
Strategy 1: Tiered Storage
# MEMORY.md Tiered Structure
## Hot Memory: Read every time (≤500 tokens)
- User basic preferences
- Active projects (max 3)
- Recent key decisions
## Warm Memory: On-demand (≤2000 tokens)
- Project detailed background
- Tool usage history summary
- Common configuration parameters
## Cold Memory: Archived files (unlimited)
- Historical session summaries
- Completed project records
- Error analysis reports
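With this layout, only the tiers a task needs have to be injected. The sketch below shows one way to do that, assuming each tier starts with a `## ` heading whose first word names the tier; `split_tiers`, `build_memory_context`, and the sample text are illustrative, not part of Hermes Agent.

```python
from typing import Dict, Iterable, List

SAMPLE_MEMORY = """\
## Hot Memory: read every time
- Prefers concise answers
## Warm Memory: on demand
- Project background notes
## Cold Memory: archived
- Old session summaries
"""


def split_tiers(memory_text: str) -> Dict[str, str]:
    """Group lines under each '## ' heading, keyed by the heading's first word."""
    tiers: Dict[str, List[str]] = {}
    current = None
    for line in memory_text.splitlines():
        heading = line[3:].strip() if line.startswith("## ") else ""
        if heading:
            current = heading.split()[0].lower()  # "hot", "warm", "cold"
            tiers[current] = []
        elif current is not None:
            tiers[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in tiers.items()}


def build_memory_context(memory_text: str,
                         wanted: Iterable[str] = ("hot",)) -> str:
    """Return only the requested tiers, in the order requested."""
    tiers = split_tiers(memory_text)
    return "\n\n".join(tiers[name] for name in wanted if name in tiers)


print(build_memory_context(SAMPLE_MEMORY))
print(build_memory_context(SAMPLE_MEMORY, wanted=("hot", "warm")))
```

Defaulting to the hot tier keeps the per-conversation cost small; warm sections can be added when the task description references a known project.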
Strategy 2: Automatic Summarization Compression
from datetime import datetime, timedelta
from typing import Tuple
import re


class MemoryOptimizer:
    def __init__(self, memory_path: str, token_limit: int = 3000):
        self.memory_path = memory_path
        self.token_limit = token_limit

    def estimate_tokens(self, text: str) -> int:
        return len(text) // 4  # rough heuristic: ~4 chars per token

    def compress_old_entries(self, content: str, keep_days: int = 7) -> str:
        """Drop dated entries older than keep_days, with their follow-on lines."""
        cutoff_date = datetime.now() - timedelta(days=keep_days)
        date_pattern = re.compile(r'(\d{4}-\d{2}-\d{2})')
        lines = content.split('\n')
        compressed_lines = []
        skip_block = False
        archived_count = 0
        for line in lines:
            # A heading starts a new section, so stop skipping; otherwise an
            # old entry would swallow the section headings that follow it.
            if line.startswith('#'):
                skip_block = False
            match = date_pattern.search(line)
            if match:
                try:
                    line_date = datetime.strptime(match.group(1), '%Y-%m-%d')
                    if line_date < cutoff_date:
                        skip_block = True
                        archived_count += 1
                        continue
                    else:
                        skip_block = False
                except ValueError:
                    pass  # looked like a date but was not valid; keep the line
            if not skip_block:
                compressed_lines.append(line)
        if archived_count > 0:
            # Old entries are removed here; copy them to an archive file first
            # if they must be kept as Cold Memory.
            compressed_lines.append(f"\n<!-- {archived_count} old entries archived -->")
        return '\n'.join(compressed_lines)

    def auto_trim(self, llm_summarizer=None) -> Tuple[int, int]:
        """Trim the memory file in place; return (tokens_before, tokens_after)."""
        with open(self.memory_path, 'r', encoding='utf-8') as f:
            content = f.read()
        before_tokens = self.estimate_tokens(content)
        if before_tokens <= self.token_limit:
            return before_tokens, before_tokens
        content = self.compress_old_entries(content)
        # Fall back to LLM summarization only if date-based trimming was not enough
        if self.estimate_tokens(content) > self.token_limit and llm_summarizer:
            content = llm_summarizer(content, max_tokens=self.token_limit)
        with open(self.memory_path, 'w', encoding='utf-8') as f:
            f.write(content)
        return before_tokens, self.estimate_tokens(content)
58.4 System Prompt Compression
Before vs. After Compression
Original (~800 tokens):
You are Hermes, an autonomous AI agent developed by NousResearch.
You are designed to help users accomplish complex tasks by breaking
them down into smaller steps and using various tools available to you.
You should always be helpful, accurate, and efficient...
Compressed (~200 tokens):
You: Hermes Agent (NousResearch). Mode: autonomous task execution.
Rules: ① Break tasks into steps ② Verify tool outputs ③ Ask when unclear ④ Log all actions
Tools: {TOOLS_PLACEHOLDER}
Format: JSON for structured data, markdown for reports.
Compression Principles
| Principle | Example |
|---|---|
| Lists over paragraphs | ①②③ instead of "First... Second... Third..." |
| Remove filler phrases | Delete "You should always be..." |
| Use placeholders | {TOOLS_PLACEHOLDER} injected dynamically |
| Abbreviate | "ctx" for "context", "req" for "requirement" |
| Drop obvious rules | Don't state what LLMs already do by default |
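The placeholder principle pairs naturally with dynamic tool loading: the prompt template stays fixed and only the tool list changes per task. A minimal sketch, assuming the compressed prompt shown earlier and a plain list of tool names; `render_system_prompt` is a hypothetical helper.

```python
from typing import Iterable

COMPRESSED_PROMPT = (
    "You: Hermes Agent (NousResearch). Mode: autonomous task execution.\n"
    "Rules: ① Break tasks into steps ② Verify tool outputs "
    "③ Ask when unclear ④ Log all actions\n"
    "Tools: {TOOLS_PLACEHOLDER}\n"
    "Format: JSON for structured data, markdown for reports."
)


def render_system_prompt(template: str, tool_names: Iterable[str]) -> str:
    """Fill the tools placeholder with a sorted, comma-separated tool list."""
    tools = ", ".join(sorted(tool_names)) or "none"
    # str.replace is used instead of str.format so that any other literal
    # braces in the prompt (for example JSON snippets) are left untouched.
    return template.replace("{TOOLS_PLACEHOLDER}", tools)


print(render_system_prompt(COMPRESSED_PROMPT, ["web_search", "arxiv_search"]))
```

Sorting the names keeps the rendered prompt stable across runs, which can help when prompts are cached or compared between sessions.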
58.5 Token Budget Monitor Implementation
import time
import logging
from dataclasses import dataclass
from typing import Dict, List
from collections import deque
@dataclass
class TokenUsageRecord:
    timestamp: float
    session_id: str
    component: str
    tokens_used: int
    operation: str  # 'input' or 'output'


class TokenBudgetMonitor:
    def __init__(self, total_budget: int = 100_000):
        self.total_budget = total_budget
        self.current_usage: Dict[str, int] = {
            'system_prompt': 0, 'task_description': 0, 'memory': 0,
            'tools': 0, 'history': 0, 'tool_results': 0, 'output_buffer': 0,
        }
        self.history: deque = deque(maxlen=1000)  # keep the last 1000 records
        self.alert_threshold = 0.85
        self.logger = logging.getLogger('agent.token_monitor')

    def record_usage(self, component: str, tokens: int,
                     session_id: str = "", operation: str = "input"):
        # Usage accumulates per component; unknown components are added on the fly
        self.current_usage[component] = self.current_usage.get(component, 0) + tokens
        self.history.append(TokenUsageRecord(
            timestamp=time.time(), session_id=session_id,
            component=component, tokens_used=tokens, operation=operation
        ))
        self._check_alerts()

    def _check_alerts(self):
        total_used = sum(self.current_usage.values())
        ratio = total_used / self.total_budget
        if ratio >= 1.0:
            self.logger.error(f"TOKEN BUDGET EXCEEDED: {total_used}/{self.total_budget}")
        elif ratio >= self.alert_threshold:
            self.logger.warning(f"Token budget at {ratio*100:.1f}%")

    def get_budget_status(self) -> dict:
        total_used = sum(self.current_usage.values())
        return {
            "total_budget": self.total_budget,
            "total_used": total_used,
            "remaining": self.total_budget - total_used,
            "usage_percentage": f"{total_used/self.total_budget*100:.1f}%",
            "breakdown": {k: v for k, v in self.current_usage.items()},
            # Same comparison as _check_alerts, so both report the alert together
            "alert": total_used >= self.total_budget * self.alert_threshold
        }

    def suggest_optimizations(self) -> List[str]:
        suggestions = []
        total_used = sum(self.current_usage.values())
        if total_used == 0:
            return suggestions
        # Thresholds are shares of tokens used so far, not of the total budget
        thresholds = {
            'tools': (0.15, "Enable dynamic tool loading"),
            'history': (0.25, "Enable rolling summary compression"),
            'tool_results': (0.35, "Retain only key tool output"),
            'memory': (0.05, "Run MEMORY.md slim-down optimization"),
        }
        for component, (threshold, suggestion) in thresholds.items():
            ratio = self.current_usage.get(component, 0) / total_used
            if ratio > threshold:
                suggestions.append(f"{component} at {ratio*100:.1f}%: {suggestion}")
        return suggestions
# Example usage
monitor = TokenBudgetMonitor(total_budget=100_000)
monitor.record_usage('system_prompt', 3200, 'sess_001')
monitor.record_usage('memory', 2800, 'sess_001')
monitor.record_usage('tools', 4500, 'sess_001')
monitor.record_usage('history', 18000, 'sess_001')
monitor.record_usage('tool_results', 32000, 'sess_001')
status = monitor.get_budget_status()
print(f"Usage: {status['usage_percentage']}, Remaining: {status['remaining']:,}")
for tip in monitor.suggest_optimizations():
    print(f"Optimization: {tip}")
58.6 Budget Profiles and Dynamic Adjustment
BUDGET_PROFILES = {
"simple_qa": {"total": 20_000, "tools": 1_000, "history": 5_000, "tool_results": 5_000},
"code_generation": {"total": 60_000, "tools": 4_000, "history": 10_000, "tool_results": 25_000},
"data_analysis": {"total": 80_000, "tools": 5_000, "history": 8_000, "tool_results": 45_000},
"long_research": {"total": 150_000, "tools": 8_000, "history": 30_000, "tool_results": 70_000},
}
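Each profile reserves tokens for tools, history, and tool results; whatever is left of the total has to cover the system prompt, MEMORY.md, reasoning, and any safety margin. The sketch below looks up a profile and computes that remainder. `select_profile` and `unreserved_tokens` are hypothetical helpers, and falling back to the smallest profile for unknown task types is an assumption, not documented behavior.

```python
from typing import Dict

BUDGET_PROFILES: Dict[str, Dict[str, int]] = {
    "simple_qa": {"total": 20_000, "tools": 1_000, "history": 5_000, "tool_results": 5_000},
    "code_generation": {"total": 60_000, "tools": 4_000, "history": 10_000, "tool_results": 25_000},
    "data_analysis": {"total": 80_000, "tools": 5_000, "history": 8_000, "tool_results": 45_000},
    "long_research": {"total": 150_000, "tools": 8_000, "history": 30_000, "tool_results": 70_000},
}


def select_profile(task_type: str, default: str = "simple_qa") -> Dict[str, int]:
    """Look up a profile, falling back to the default for unknown task types."""
    return BUDGET_PROFILES.get(task_type, BUDGET_PROFILES[default])


def unreserved_tokens(profile: Dict[str, int]) -> int:
    """Tokens left after the profile's explicitly reserved components."""
    reserved = sum(tokens for name, tokens in profile.items() if name != "total")
    return profile["total"] - reserved


for name in BUDGET_PROFILES:
    print(f"{name:16s} unreserved: {unreserved_tokens(select_profile(name)):>6,}")
```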
When budget runs low during execution, apply cuts in this priority order:
- Compress conversation history: Replace early turns with summaries (60–80% savings)
- Unload non-core tools: Remove schemas for tools not needed in the current step
- Truncate tool results: Keep only the first N lines of each tool output
- Load only Hot Memory: Temporarily skip Warm/Cold MEMORY.md sections
- Reduce reasoning depth: Shorten Chain-of-Thought prompts
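The priority order above can be expressed as a list of cuts applied one at a time until usage fits the budget. This is a sketch under stated assumptions: the keep percentages are placeholders chosen for illustration (only the history figure loosely follows the 60–80% savings quoted above), and the component names and `degrade_until_fits` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Usage = Dict[str, int]
Step = Tuple[str, Callable[[Usage], Usage]]


def keep(component: str, keep_percent: int) -> Callable[[Usage], Usage]:
    """Build a step that keeps only keep_percent of one component's tokens."""
    def apply(usage: Usage) -> Usage:
        trimmed = dict(usage)  # copy, so the caller's dict is not mutated
        trimmed[component] = usage.get(component, 0) * keep_percent // 100
        return trimmed
    return apply


# Ordered by priority; keep percentages are illustrative, not measured.
DEGRADATION_STEPS: List[Step] = [
    ("compress_history", keep("history", 30)),
    ("unload_non_core_tools", keep("tools", 50)),
    ("truncate_tool_results", keep("tool_results", 50)),
    ("hot_memory_only", keep("memory", 20)),
    ("reduce_reasoning_depth", keep("reasoning", 70)),
]


def degrade_until_fits(usage: Usage, budget: int) -> Tuple[Usage, List[str]]:
    """Apply cuts in priority order, stopping as soon as usage fits the budget."""
    applied: List[str] = []
    for name, step in DEGRADATION_STEPS:
        if sum(usage.values()) <= budget:
            break
        usage = step(usage)
        applied.append(name)
    return usage, applied


usage = {"history": 30_000, "tools": 8_000, "tool_results": 40_000,
         "memory": 3_000, "reasoning": 15_000}
trimmed, applied = degrade_until_fits(usage, budget=72_000)
print(applied, sum(trimmed.values()))
```

Stopping at the first point where usage fits means the cheaper, less lossy cuts are always tried before the ones that degrade reasoning quality.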
Chapter Summary
Token budget management is a core production capability for Hermes Agent:
- Layered budget framework: Allocate total budget across components with a safety margin to prevent context overflow
- Dynamic tool loading: Load tools on demand based on task description, saving 70–80% of tool-definition tokens
- MEMORY.md slimming: Tiered storage plus automatic summarization keeps memory within 3K tokens
- System prompt compression: Terse, list-based wording and placeholders reduce prompt tokens by 60–75%
- Real-time monitoring: TokenBudgetMonitor tracks per-component consumption and triggers optimization before limits are hit
Review Questions
- If a task mid-execution suddenly requires a new tool but the token budget is nearly exhausted, how would you design a "tool swap-in/swap-out" mechanism?
- What content in MEMORY.md should never be compressed? How would you mark these "anchor memories"?
- When analyzing a 100KB file within token budget constraints, how would you design a chunked processing strategy?
- For tasks requiring extended execution (more than 10 tool call rounds), how would you design a rolling context window?