Chapter 13

Hermes System Architecture Overview

In a well-designed architecture, each module knows its boundaries and the data flows are easy to trace. Hermes's system architecture is built this way: each layer has a clear responsibility, and every interface has an explicit contract.

13.1 Architecture Design Philosophy

13.1.1 Four Core Design Principles

Hermes's system architecture follows four core principles:

Layered Isolation: The core engine doesn't depend on specific tool implementations; tools don't depend on specific platforms
Unidirectional Data Flow: Requests flow from user to engine and responses flow from engine to user, with no circular dependencies between layers
Memory Persistence: Cross-session knowledge accumulation is the system's core value
Open Extension: Adding new tools and platforms doesn't require modifying the core engine
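
The first and last principles can be illustrated with a toy example. This is a minimal sketch, not Hermes code: the `Tool` protocol, `Engine`, `EchoTool`, and `UpperTool` are hypothetical names chosen for illustration. The point is that the engine depends only on a small contract, so a new tool can be added without editing the engine.

```python
from typing import Protocol


class Tool(Protocol):
    """The only thing the engine knows about a tool (hypothetical contract)."""

    name: str

    def run(self, argument: str) -> str: ...


class Engine:
    """Toy core engine: depends on the Tool protocol, never on a concrete tool."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, tool_name: str, argument: str) -> str:
        return self._tools[tool_name].run(argument)


class EchoTool:
    name = "echo"

    def run(self, argument: str) -> str:
        return argument


class UpperTool:
    name = "upper"

    def run(self, argument: str) -> str:
        return argument.upper()


engine = Engine()
engine.register(EchoTool())
engine.register(UpperTool())  # a new tool is added without touching Engine
```

Calling `engine.call("upper", "hi")` dispatches through the protocol alone; `Engine` never imports or names `UpperTool`.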

13.1.2 Overall System Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Hermes Agent System                        │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │          Platform Adapter Layer                       │   │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │   │
│  │  │   CLI    │ │ REST API │ │ Web UI   │ │  SDK   │ │   │
│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ │   │
│  └───────┴─────────────┴─────────────┴───────────┴──────┘   │
│                        │ Unified message format               │
│                        ↓                                      │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              Core Engine Layer                        │   │
│  │  ┌──────────────┐ ┌──────────┐ ┌────────────────┐   │   │
│  │  │ Conversation │ │ Planner  │ │Context Compressor│  │   │
│  │  │   Manager    │ │          │ │                 │   │   │
│  │  └──────┬───────┘ └─────┬────┘ └────────┬────────┘  │   │
│  │         └───────────────┴───────────────┘            │   │
│  │                         │                            │   │
│  │                  ┌──────┴──────┐                     │   │
│  │                  │Model Interface│                    │   │
│  │                  └──────┬──────┘                     │   │
│  └─────────────────────────┼──────────────────────────  ┘   │
│                             │                                │
│          ┌──────────────────┼──────────────────┐            │
│          ↓                  ↓                  ↓             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Tool Layer  │  │ Memory Layer │  │Model Backend │      │
│  │  40+ tools   │  │ 3-tier memory│  │ Hermes/GPT/  │      │
│  │  + plugins   │  │              │  │ Claude       │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │            Persistence Layer                          │   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐           │   │
│  │  │ SQLite   │  │Filesystem│  │ VectorDB │           │   │
│  │  │(sessions)│  │(MEMORY.md)│  │(semantic)│           │   │
│  │  └──────────┘  └──────────┘  └──────────┘           │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘

13.2 Core Engine Layer Detail

13.2.1 Conversation Manager

The conversation manager is the system's "state machine," responsible for maintaining the entire session lifecycle:

class ConversationManager:
    async def process_message(self, session_id: str, user_message: str) -> str:
        session = self.get_or_create_session(session_id)
        
        # Record the user message first so memory retrieval can see the current request
        session.add_message(role="user", content=user_message)
        
        # Inject persistent memory (MEMORY.md + Skill library)
        context = await self.memory_manager.inject_context(session)
        
        # Compression check
        if session.token_count > self.config.compression_threshold:
            await self.compressor.compress(session)
        
        # Run Agent loop
        response = await self.run_agent_loop(session, context)
        
        # Extract and save new skills
        await self.memory_manager.extract_skills(session, response)
        return response
    
    async def run_agent_loop(self, session: Session, context: str) -> str:
        for step in range(self.config.max_steps):
            response = await self.model_interface.generate(
                messages=session.get_context_window(),
                tools=self.tool_registry.get_schemas(),
                system=context
            )
            
            if response.type == "final_response":
                return response.content
            elif response.type == "tool_call":
                tool_result = await self.tool_executor.execute(response.tool_call)
                session.add_tool_result(response.tool_call, tool_result)
            elif response.type == "thinking":
                session.add_thought(response.content)
        
        # Step budget exhausted without a final response: fall back to a summary
        return await self.model_interface.generate_summary(session)
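
To make the loop's control flow concrete, here is a small self-contained sketch that can be run without a model. Everything in it is a stand-in: `ScriptedModel`, `ModelReply`, `ToySession`, and the `count_rows` tool are hypothetical, and the tool call is reduced to a tool name. It only mirrors the three branches (thinking, tool call, final response) and the step budget.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ModelReply:
    type: str            # "final_response", "tool_call", or "thinking"
    content: str = ""
    tool_call: str = ""  # simplified: just the tool name


@dataclass
class ToySession:
    messages: list = field(default_factory=list)


class ScriptedModel:
    """Stand-in for the model interface: replays a fixed list of replies."""

    def __init__(self, replies):
        self._replies = iter(replies)

    async def generate(self, messages):
        return next(self._replies)


async def run_agent_loop(model, session, tools, max_steps=5):
    for _ in range(max_steps):
        reply = await model.generate(session.messages)
        if reply.type == "final_response":
            return reply.content
        if reply.type == "tool_call":
            # Run the tool and record its result so the next step can see it
            result = tools[reply.tool_call]()
            session.messages.append(("tool", reply.tool_call, result))
        elif reply.type == "thinking":
            session.messages.append(("thought", reply.content))
    return "step budget exhausted"


def demo():
    model = ScriptedModel([
        ModelReply(type="thinking", content="I need the row count first."),
        ModelReply(type="tool_call", tool_call="count_rows"),
        ModelReply(type="final_response", content="The file has 50 rows."),
    ])
    session = ToySession()
    answer = asyncio.run(run_agent_loop(model, session, {"count_rows": lambda: 50}))
    return answer, session.messages
```

`demo()` walks through one thought, one tool call, and a final answer. Passing a script that never produces a final response shows the fallback path once `max_steps` is reached.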

13.2.2 Planner

class Planner:
    PLANNING_PROMPT = """
Before executing the task, create a clear plan:

Task: {task}

Output format:
## Task Analysis
[Understand the goal and constraints]

## Execution Plan
1. Step 1: [specific action]
2. Step 2: [specific action]
...

## Potential Risks
[Identify potential issues and alternatives]

## Success Criteria
[How to determine task completion]
"""
    
    def should_replan(self, execution_state: ExecutionState) -> bool:
        if execution_state.failure_rate > 0.3:
            return True
        if execution_state.unexpected_discovery:
            return True
        return False
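
The `ExecutionState` type is not defined in this chapter. A minimal sketch of what it might look like, assuming the failure rate is simply failed steps divided by attempted steps (the field names `steps_attempted` and `steps_failed` are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ExecutionState:
    steps_attempted: int = 0
    steps_failed: int = 0
    unexpected_discovery: bool = False

    @property
    def failure_rate(self) -> float:
        # Avoid division by zero before any step has run
        if self.steps_attempted == 0:
            return 0.0
        return self.steps_failed / self.steps_attempted


def should_replan(state: ExecutionState, max_failure_rate: float = 0.3) -> bool:
    """Same rule as Planner.should_replan, with the threshold as a parameter."""
    return state.failure_rate > max_failure_rate or state.unexpected_discovery
```

With this definition, 3 failures out of 10 steps does not trigger replanning because the comparison is strict, while 4 out of 10 does.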

13.2.3 Context Compressor (Interface)

class ContextCompressor:
    def __init__(self, config: HermesConfig):
        self.sacred_zone_tokens = config.sacred_zone_tokens  # default 20K
        self.target_ratio = config.compression_ratio         # target ~50%
    
    async def compress(self, session: Session) -> None:
        messages = session.messages
        sacred_start = self._identify_sacred_zone_start(messages)
        
        # Only tool outputs older than the sacred zone are compressed;
        # messages inside the sacred zone stay verbatim.
        for msg in messages[:sacred_start]:
            if msg.role == "tool":
                msg.content = self._compress_tool_output(msg.content)
        
        session.messages = messages
        session.update_token_count()
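
The helpers `_identify_sacred_zone_start` and `_compress_tool_output` are not shown above. The following is one plausible sketch under stated assumptions: the sacred zone is taken to be the most recent messages that fit in the token budget, tokens are estimated at roughly four characters each, and compression is plain truncation. A real implementation would likely use the model's tokenizer and a smarter summarizer.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


def estimate_tokens(text: str) -> int:
    """Very rough estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def identify_sacred_zone_start(messages: list, sacred_zone_tokens: int) -> int:
    """Return the index of the first message inside the protected tail."""
    budget = sacred_zone_tokens
    start = len(messages)
    # Walk backward from the newest message until the budget runs out
    for i in range(len(messages) - 1, -1, -1):
        cost = estimate_tokens(messages[i].content)
        if cost > budget:
            break
        budget -= cost
        start = i
    return start


def compress_tool_output(text: str, keep_chars: int = 40) -> str:
    """Truncate long tool output and note how much was dropped."""
    if len(text) <= keep_chars:
        return text
    return text[:keep_chars] + f"... [{len(text) - keep_chars} chars omitted]"


def compress(messages: list, sacred_zone_tokens: int) -> int:
    """Compress old tool outputs in place; return the sacred zone start index."""
    start = identify_sacred_zone_start(messages, sacred_zone_tokens)
    for msg in messages[:start]:
        if msg.role == "tool":
            msg.content = compress_tool_output(msg.content)
    return start
```

Only tool messages before the returned index are touched, so the newest tool output stays intact even when it is large.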

13.3 Tool Layer Architecture

13.3.1 Tool Registry

class ToolRegistry:
    def _load_builtin_tools(self):
        """Load the built-in tools by category (a representative subset of the 40+ is shown)"""
        builtin_categories = {
            "code_execution":    [PythonExecTool(), ShellExecTool(), JavaScriptTool()],
            "file_operations":   [FileReadTool(), FileWriteTool(), PdfParserTool(), ExcelTool()],
            "web_search":        [WebSearchTool(), WebFetchTool(), ApiCallTool()],
            "data_processing":   [SqliteTool(), CsvAnalysisTool(), JsonTool()],
            "system_tools":      [ProcessManagerTool(), NetworkTool(), GitTool()],
            "ai_tools":          [ImageAnalysisTool(), TextEmbeddingTool(), SummarizerTool()],
        }
        for category_tools in builtin_categories.values():
            for tool in category_tools:
                self.register(tool)
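
The `register` and `get_schemas` methods used elsewhere in this chapter are not shown. A minimal sketch of how they could work follows. The `category` argument and the category filter are assumptions added here to connect the registry to the `tools.builtin.categories` list in the configuration file, and `FakeTool` is a hypothetical stand-in for a real tool class.

```python
class FakeTool:
    """Minimal stand-in for a BaseTool subclass (hypothetical)."""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description

    def get_schema(self) -> dict:
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": {"type": "object", "properties": {}},
            },
        }


class ToolRegistry:
    def __init__(self):
        self._tools = {}       # tool name -> tool instance
        self._categories = {}  # tool name -> category

    def register(self, tool, category: str = "plugin") -> None:
        # Reject duplicates so one tool cannot silently shadow another
        if tool.name in self._tools:
            raise ValueError(f"tool {tool.name!r} is already registered")
        self._tools[tool.name] = tool
        self._categories[tool.name] = category

    def get(self, name: str):
        return self._tools[name]

    def get_schemas(self, enabled_categories=None) -> list:
        """Schemas for all tools, or only those in the enabled categories."""
        return [
            tool.get_schema()
            for name, tool in self._tools.items()
            if enabled_categories is None or self._categories[name] in enabled_categories
        ]


registry = ToolRegistry()
registry.register(FakeTool("python_exec", "Run Python code"), category="code_execution")
registry.register(FakeTool("web_search", "Search the web"), category="web_search")
```

Filtering at schema time means a disabled category is simply never advertised to the model, which is one simple way to honor the configuration without unloading tools.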

13.3.2 Tool Base Class Design

import asyncio
from abc import ABC, abstractmethod


class BaseTool(ABC):
    name: str
    description: str
    parameters_schema: dict
    timeout: int = 30
    
    @abstractmethod
    async def execute(self, **kwargs) -> ToolResult:
        pass
    
    def get_schema(self) -> dict:
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": self.parameters_schema
            }
        }
    
    async def safe_execute(self, **kwargs) -> ToolResult:
        try:
            return await asyncio.wait_for(self.execute(**kwargs), timeout=self.timeout)
        except asyncio.TimeoutError:
            return ToolResult(success=False, error=f"Tool {self.name} timed out ({self.timeout}s)")
        except Exception as e:
            return ToolResult(success=False, error=f"Tool {self.name} error: {str(e)}")


class PythonExecTool(BaseTool):
    name = "python_exec"
    description = "Execute Python code in a secure sandbox"
    parameters_schema = {
        "type": "object",
        "properties": {
            "code": {"type": "string", "description": "Python code to execute"},
            "timeout": {"type": "integer", "default": 30}
        },
        "required": ["code"]
    }
    
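    async def execute(self, code: str, timeout: int = 30) -> ToolResult:
        # self.sandbox is assumed to be injected when the tool is constructed (not shown).
        # safe_execute() also enforces the class-level timeout (30s), so a larger
        # per-call timeout passed here never takes effect.
        result = await self.sandbox.run_python(code, timeout=timeout)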
        return ToolResult(
            success=result.exit_code == 0,
            output=result.stdout,
            error=result.stderr if result.exit_code != 0 else None
        )
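
The value of `safe_execute` is that failures reach the agent loop as data rather than as exceptions. The sketch below repeats a trimmed copy of the base class so that it runs on its own; `ToolResult` is defined here as a simple dataclass (an assumption, since the chapter does not show it), and `SlowTool` and `BrokenTool` are hypothetical tools built to fail.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:
    success: bool
    output: str = ""
    error: Optional[str] = None


class BaseTool(ABC):
    name: str = "base"
    timeout: float = 30

    @abstractmethod
    async def execute(self, **kwargs) -> ToolResult: ...

    async def safe_execute(self, **kwargs) -> ToolResult:
        try:
            return await asyncio.wait_for(self.execute(**kwargs), timeout=self.timeout)
        except asyncio.TimeoutError:
            return ToolResult(success=False, error=f"Tool {self.name} timed out ({self.timeout}s)")
        except Exception as e:
            return ToolResult(success=False, error=f"Tool {self.name} error: {e}")


class SlowTool(BaseTool):
    name = "slow"
    timeout = 0.05  # deliberately tiny so the demo finishes quickly

    async def execute(self, **kwargs) -> ToolResult:
        await asyncio.sleep(1)
        return ToolResult(success=True, output="done")


class BrokenTool(BaseTool):
    name = "broken"

    async def execute(self, **kwargs) -> ToolResult:
        raise RuntimeError("disk full")


async def demo():
    return await SlowTool().safe_execute(), await BrokenTool().safe_execute()
```

Both calls return a `ToolResult` with `success=False` and a readable `error` string, which the conversation manager can append to the session so the model can decide whether to retry, switch tools, or report the failure.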

13.3.3 MCP Tool Integration

class MCPToolAdapter(BaseTool):
    """Adapts MCP protocol tools to the Hermes tool interface"""
    
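    def __init__(self, mcp_server_url: str, tool_name: str):
        self.mcp_client = MCPClient(mcp_server_url)
        self.tool_name = tool_name          # name as known to the MCP server
        self.name = f"mcp_{tool_name}"      # namespaced name exposed to the model
    
    async def initialize(self):
        """Fetch tool description from MCP server"""
        tools = await self.mcp_client.list_tools()
        tool = next((t for t in tools if t.name == self.tool_name), None)
        if tool is None:
            raise ValueError(f"MCP server does not expose a tool named {self.tool_name!r}")
        self.parameters_schema = tool.input_schema  # read by BaseTool.get_schema()
        self.description = tool.description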
    
    async def execute(self, **kwargs) -> ToolResult:
        response = await self.mcp_client.call_tool(
            name=self.tool_name, arguments=kwargs
        )
        return ToolResult(
            success=not response.is_error,
            output=response.content[0].text if response.content else "",
            error=str(response.content) if response.is_error else None
        )

13.4 Memory Layer Architecture

class MemoryManager:
    def __init__(self, config: HermesConfig):
        self.working_memory  = WorkingMemory(max_tokens=config.context_window_size)
        self.episodic_memory = EpisodicMemory(storage=SQLiteStorage(config.db_path))
        self.semantic_memory = SemanticMemory(
            storage=VectorDB(config.vector_db_path),
            embedding_model=config.embedding_model
        )
    
    async def inject_context(self, session: Session) -> str:
        relevant_skills   = await self.semantic_memory.search(
            query=session.current_task, top_k=5, threshold=0.75
        )
        relevant_episodes = await self.episodic_memory.search(
            query=session.current_task, top_k=3
        )
        
        context_parts = []
        if relevant_skills:
            context_parts.append(self._format_skills(relevant_skills))
        if relevant_episodes:
            context_parts.append(self._format_episodes(relevant_episodes))
        
        return "\n\n".join(context_parts)
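
The `top_k` and `threshold` arguments to the semantic search can be understood with a small pure-Python sketch. It assumes the threshold is a minimum cosine similarity between the query embedding and each stored skill embedding; the vector store actually used may score results differently. The skill names and two-dimensional vectors are made up for illustration.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, items: dict, top_k: int = 5, threshold: float = 0.75) -> list:
    """Return up to top_k item names whose similarity is at least threshold."""
    scored = [(cosine(query_vec, vec), name) for name, vec in items.items()]
    hits = [(score, name) for score, name in scored if score >= threshold]
    hits.sort(reverse=True)  # best match first
    return [name for _, name in hits[:top_k]]


# Hypothetical skill embeddings
skills = {
    "csv_analysis": [1.0, 0.0],
    "plotting": [0.8, 0.6],
    "git_workflow": [0.0, 1.0],
}
```

The threshold filters first and `top_k` truncates second, so a query with no close matches injects nothing instead of padding the prompt with weak results.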

13.5 Platform Adapter Layer

13.5.1 Multi-Platform Support Architecture

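import asyncio
from abc import ABC, abstractmethod


class PlatformAdapter(ABC):
    @abstractmethod
    async def receive_input(self) -> UserInput: pass
    
    @abstractmethod
    async def send_output(self, response: AgentResponse) -> None: pass


class CLIAdapter(PlatformAdapter):
    async def receive_input(self) -> UserInput:
        # input() blocks, so run it in a worker thread to keep the event loop responsive
        text = await asyncio.to_thread(input, "\nYou: ")
        return UserInput(text=text.strip())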
    
    async def send_output(self, response: AgentResponse) -> None:
        print("\nHermes: ", end="", flush=True)
        async for token in response.token_stream:
            print(token, end="", flush=True)
        print()


class RestAPIAdapter(PlatformAdapter):
    def _register_routes(self):
        @self.app.post("/v1/chat/completions")
        async def chat_completions(request: ChatRequest):
            """OpenAI API compatible endpoint"""
            response = await self.process(request)
            return ChatResponse(
                id=response.id,
                object="chat.completion",
                choices=[{"message": {"role": "assistant", "content": response.content}}]
            )
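
Because the engine only sees the two adapter methods, supporting a new platform means writing one more subclass. As an illustration, here is a sketch of an in-memory adapter that could be used in tests. It is simplified on purpose: plain strings stand in for `UserInput` and `AgentResponse`, and `InMemoryAdapter`, `serve_once`, and `fake_engine` are hypothetical names.

```python
import asyncio
from abc import ABC, abstractmethod


class PlatformAdapter(ABC):
    @abstractmethod
    async def receive_input(self) -> str: ...

    @abstractmethod
    async def send_output(self, response: str) -> None: ...


class InMemoryAdapter(PlatformAdapter):
    """Feeds scripted inputs to the engine and records every output."""

    def __init__(self, scripted_inputs):
        self._inputs = list(scripted_inputs)
        self.outputs = []

    async def receive_input(self) -> str:
        return self._inputs.pop(0)

    async def send_output(self, response: str) -> None:
        self.outputs.append(response)


async def serve_once(adapter: PlatformAdapter, engine) -> None:
    """One request/response cycle: adapter -> engine -> adapter."""
    text = await adapter.receive_input()
    reply = await engine(text)
    await adapter.send_output(reply)


async def fake_engine(text: str) -> str:
    # Stand-in for the core engine
    return f"echo: {text}"
```

`serve_once` never checks which adapter it was given, which is the isolation this layer is meant to provide.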

13.6 Configuration File Structure

# hermes_config.yaml

model:
  backend: "local"  # local | openai | anthropic
  local:
    model_path: "./models/hermes-4-q4_k_m.gguf"
    n_gpu_layers: 48
    n_ctx: 32768
    temperature: 0.7

tools:
  builtin:
    enabled: true
    categories: ["code_execution", "file_operations", "web_search", "data_processing"]
  sandbox:
    type: "docker"
    image: "hermes-sandbox:latest"
    memory_limit: "1g"
    network: "restricted"
  mcp_servers:
    - name: "github"
      url: "mcp://localhost:3001"

memory:
  working:
    max_tokens: 32768
  episodic:
    storage: "sqlite"
    db_path: "./data/episodes.db"
    retention_days: 90
  semantic:
    storage: "chroma"
    db_path: "./data/skills"
    embedding_model: "nomic-embed-text"

compression:
  enabled: true
  threshold_tokens: 24000
  target_ratio: 0.5
  sacred_zone_tokens: 20000

learning:
  enabled: true
  atropos:
    enabled: false
    judge_model: "gpt-4o"
    training_interval: 100

adapters:
  cli:
    enabled: true
  api:
    enabled: true
    host: "0.0.0.0"
    port: 8080
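
The configuration above does not show how environment variables are substituted. One common approach is to expand placeholders in the raw text before handing it to a YAML parser. The sketch below assumes a shell-like `${NAME}` and `${NAME:-default}` syntax, which may not match what Hermes actually accepts; `substitute_env` and `HERMES_PORT` are hypothetical names.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}
_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def substitute_env(text: str, env=None) -> str:
    """Replace ${NAME} / ${NAME:-default} placeholders with environment values."""
    env = os.environ if env is None else env

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Fail loudly rather than leave a placeholder in the config
        raise KeyError(f"environment variable {name} is not set")

    return _PLACEHOLDER.sub(replace, text)
```

For example, `substitute_env("port: ${HERMES_PORT:-8080}", env={})` yields `port: 8080`, and the result can then be passed to a YAML loader. Raising on a missing variable without a default keeps a typo from turning into a silently wrong setting.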

13.7 Complete Request Data Flow

User input: "Analyze sales trends in data.csv"
    │
    ↓ [Platform Adapter Layer]
    CLI/API receives input
    Formats to standard UserInput
    │
    ↓ [Core Engine - Conversation Manager]
    Create/retrieve Session
    │
    ↓ [Memory Layer Injection]
    Retrieve relevant Skills (data analysis)
    Retrieve relevant history (CSV operation experience)
    Compose SystemPrompt
    │
    ↓ [Core Engine - Planner]
    Generate task plan:
    1. Read CSV file
    2. Explore data structure
    3. Calculate trend metrics
    4. Generate visualization
    5. Output report
    │
    ↓ [Model Interface Layer]
    Send request to model (with tool definitions)
    Model generates: <think>...</think> + tool call
    │
    ↓ [Tool Layer]
    Execute python_exec: pd.read_csv('data.csv')
    Returns: DataFrame info (50 rows × 5 columns)
    │
    ↓ [Loop: Model Generate → Tool Execute × N]
    python_exec → statistical analysis
    python_exec → plot (matplotlib)
    file_write → save chart
    │
    ↓ [Model Final Response]
    Generate natural language analysis report
    │
    ↓ [Memory Layer Learning]
    Extract new skill: "CSV sales trend analysis"
    Store in semantic memory
    │
    ↓ [Platform Adapter Layer]
    Format response
    Return to user

Chapter Summary

Hermes uses a five-layer architecture: Platform Adapter, Core Engine, Tool, Memory, and Persistence layers
Core engine contains three sub-components (Conversation Manager, Planner, and Context Compressor) that share a single Model Interface
Tool layer provides unified management of 40+ built-in tools, custom plugins, and MCP protocol tools
Memory layer implements three-tier memory (working/episodic/semantic) persisted via SQLite, the filesystem (MEMORY.md), and a vector database
Configuration uses YAML format with environment variable substitution and hot-reload support
Data flow is unidirectional: user → engine → tools → engine → memory → user

Discussion Questions

Hermes's architecture completely separates the tool layer from the memory layer. What benefits does this design bring? In what scenarios might tighter coupling be necessary?
The platform adapter layer isolates platform-specific differences through a unified interface. If adding a "Telegram Bot" platform, where would code need to be added?
Configuration supports hot-reload, but some config changes (like model path) require a restart. How would you distinguish "hot-update" from "cold-update" configuration items in system design?
After a tool execution timeout, how should the system gracefully recover? In an Agent loop, what recovery strategy should a tool timeout trigger?
