Chapter 41

Hermes as an MCP Server: Exposing Capabilities to Other Clients

Introduction

Most practitioners deploy Hermes Agent as a consumer — calling filesystem tools, querying databases, browsing the web. There is, however, a powerful inversion of this pattern: making Hermes itself a Model Context Protocol (MCP) Server, so that Claude Code, Cursor, or any MCP-compatible client can invoke Hermes's reasoning capabilities on demand. This "reverse MCP" architecture lets organizations centralize AI business logic in a private Hermes instance and expose it as a first-class service to every tool in their AI stack.

This chapter walks through the architecture design, a complete Python implementation, configuration examples for popular clients, and an honest assessment of when this pattern shines — and when it falls short.


41.1 Why Wrap Hermes as an MCP Server?

The Limitation of the Standard Model

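In the canonical Hermes deployment, Hermes is the MCP client. It connects to external MCP servers (filesystem, database, web-search) and calls their tools to complete tasks. This is clean and effective, but it creates silos: every other client has to implement its own logic, and Hermes's workflows stay invisible to the rest of the toolchain.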

What the Reverse Architecture Unlocks

| Standard Architecture | Reverse MCP Architecture |
|---|---|
| Each client implements its own logic | Business logic centralized in Hermes |
| Claude Code uses only its native tools | Claude Code gains Hermes's 64K+ context reasoning |
| Hermes workflows are opaque | Hermes workflows exposed as callable tools |
| Data must leave the private network | Hermes stays on-premise; clients call in via MCP |

41.2 Architecture Design

┌─────────────────────────────────────────────────────┐
│                  MCP Client Layer                    │
│  ┌────────────┐  ┌──────────┐  ┌──────────────────┐ │
│  │ Claude Code│  │  Cursor  │  │ Custom IDE Plugin │ │
│  └─────┬──────┘  └────┬─────┘  └────────┬─────────┘ │
└────────┼──────────────┼─────────────────┼───────────┘
         │  MCP Protocol (JSON-RPC 2.0)   │
         ▼              ▼                 ▼
┌──────────────────────────────────────────────────────┐
│            Hermes MCP Wrapper  (this chapter)        │
│  ┌────────────────────────────────────────────────┐  │
│  │         MCP Server  (stdio / HTTP)             │  │
│  │  tools/list   tools/call   resources/read      │  │
│  └────────────────────┬───────────────────────────┘  │
│                       │                              │
│  ┌────────────────────▼───────────────────────────┐  │
│  │         Hermes Agent Core                      │  │
│  │  • Task planning  • Tool selection             │  │
│  │  • Multi-step reasoning  • Memory              │  │
│  └────────────────────┬───────────────────────────┘  │
│                       │                              │
│  ┌────────────────────▼───────────────────────────┐  │
│  │    Ollama / vLLM / llama.cpp backend           │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘
         │ (optional downstream tool calls)
         ▼
┌────────────────────────────────────────────────────┐
│  filesystem MCP │ database MCP │ web-search MCP    │
└────────────────────────────────────────────────────┘

Three Core Design Principles

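1. Meaningful tool granularity. Resist the temptation to expose a single run_agent(prompt) tool. Break capabilities into semantically clear, domain-specific tools, such as the analyze_code, research_topic, review_document, and execute_workflow tools implemented in section 41.3.

2. Streaming support. Hermes inference on a 70B model takes 5–60 seconds. Implement SSE or MCP progress notifications to prevent client-side timeouts and provide progressive feedback. The implementation in section 41.3 returns each response in one piece, so streaming is left as an extension.

3. Session isolation. Each MCP client connection gets its own context ID. Never let context from one client bleed into another's session.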
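The session-isolation principle can be illustrated with a small in-memory store keyed by context ID. This is only a sketch under stated assumptions: the SessionStore class, its method names, and the max_messages cap are hypothetical and do not appear in the chapter's implementation, and how a context ID is obtained from a client connection is left open.

```python
from collections import defaultdict


class SessionStore:
    """Keeps a separate message history per context ID (hypothetical helper)."""

    def __init__(self, max_messages: int = 20):
        self._histories: dict[str, list[dict]] = defaultdict(list)
        self._max_messages = max_messages

    def append(self, context_id: str, role: str, content: str) -> None:
        history = self._histories[context_id]
        history.append({"role": role, "content": content})
        # Drop the oldest messages so one session cannot grow without bound.
        del history[:-self._max_messages]

    def messages(self, context_id: str) -> list[dict]:
        # Return a copy so callers cannot mutate another session's history.
        return list(self._histories.get(context_id, []))

    def close(self, context_id: str) -> None:
        # Forget everything about a session once its client disconnects.
        self._histories.pop(context_id, None)
```

A tool handler could prepend store.messages(context_id) to the messages it sends to Hermes and append the reply afterwards. Because every lookup goes through the context ID, one client's history never reaches another client's prompt.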


41.3 Complete Python Implementation

Installation

pip install mcp httpx pydantic python-dotenv

asyncio ships with the Python standard library and should not be installed from PyPI.

Project Layout

hermes-mcp-server/
├── server.py          # MCP Server entry point
├── hermes_client.py   # Hermes/Ollama communication
├── tools.py           # Tool definitions (schema)
├── config.py          # Configuration management
└── .env               # Environment variables

config.py

import os
from dataclasses import dataclass
from dotenv import load_dotenv

load_dotenv()

@dataclass
class HermesConfig:
    hermes_base_url: str = os.getenv("HERMES_BASE_URL", "http://localhost:11434")
    hermes_model: str = os.getenv("HERMES_MODEL", "nous-hermes2:70b-q4_0")
    mcp_host: str = os.getenv("MCP_HOST", "127.0.0.1")
    mcp_port: int = int(os.getenv("MCP_PORT", "8765"))
    max_tokens: int = int(os.getenv("MAX_TOKENS", "4096"))
    temperature: float = float(os.getenv("TEMPERATURE", "0.1"))
    context_window: int = int(os.getenv("CONTEXT_WINDOW", "65536"))
    request_timeout: int = int(os.getenv("REQUEST_TIMEOUT", "120"))
    stream_timeout: int = int(os.getenv("STREAM_TIMEOUT", "300"))

config = HermesConfig()
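
tools.py (sketch)

The project layout lists tools.py, and server.py imports HERMES_TOOLS from it, but the file itself is not shown. The following is a minimal sketch of what it might contain. Everything here is an assumption: the descriptions, the _tool helper, and the schema fields are inferred from the handler signatures in server.py, not taken from the original project.

```python
# tools.py: hypothetical sketch. Schemas are inferred from the handler
# signatures in server.py and are not the original project's definitions.


def _tool(name: str, description: str, properties: dict, required: list) -> dict:
    """Build one entry in the shape server.py reads: name, description, inputSchema."""
    return {
        "name": name,
        "description": description,
        "inputSchema": {"type": "object", "properties": properties, "required": required},
    }


HERMES_TOOLS = [
    _tool(
        "analyze_code",
        "Review source code for bugs, security, performance, and quality issues.",
        {
            "code": {"type": "string", "description": "Source code to analyze"},
            "language": {"type": "string", "default": "auto"},
            "focus": {
                "type": "string",
                "enum": ["bugs", "security", "performance", "quality", "all"],
                "default": "all",
            },
            "context": {"type": "string", "default": ""},
        },
        ["code"],
    ),
    _tool(
        "research_topic",
        "Research a topic at a chosen depth and return a formatted write-up.",
        {
            "query": {"type": "string"},
            "depth": {"type": "string", "enum": ["quick", "standard", "deep"], "default": "standard"},
            "output_format": {"type": "string", "default": "markdown"},
        },
        ["query"],
    ),
    _tool(
        "review_document",
        "Review a document against a list of criteria in a chosen tone.",
        {
            "content": {"type": "string"},
            "criteria": {"type": "array", "items": {"type": "string"}},
            "tone": {"type": "string", "enum": ["strict", "balanced", "encouraging"], "default": "balanced"},
        },
        ["content"],
    ),
    _tool(
        "execute_workflow",
        "Run a named multi-step workflow with the given parameters.",
        {
            "workflow_name": {
                "type": "string",
                "enum": ["code_review_pipeline", "data_analysis", "report_generation"],
            },
            "params": {"type": "object"},
        },
        ["workflow_name", "params"],
    ),
]
```

Declaring enum values in the schema lets a client reject an invalid focus, depth, or tone before the request reaches Hermes.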

hermes_client.py

import httpx
from typing import Optional
from config import config

class HermesClient:
    """
    Abstracts communication with the Hermes inference backend.
    Automatically switches between Ollama and OpenAI-compatible APIs (vLLM).
    """
    
    async def chat_completion(
        self,
        messages: list[dict],
        stream: bool = False,
        tools: Optional[list] = None
    ):
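        # Both helpers read the whole body with response.json(), which cannot
        # parse a streamed reply, so fail early rather than mid-request.
        if stream:
            raise NotImplementedError("Streaming responses are not handled by this client")
        # Heuristic backend detection: Ollama's default port or "ollama" in the URL.
        if "11434" in config.hermes_base_url or "ollama" in config.hermes_base_url.lower():
            return await self._ollama_chat(messages, stream, tools)
        return await self._openai_chat(messages, stream, tools)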

    async def _ollama_chat(self, messages, stream, tools):
        payload = {
            "model": config.hermes_model,
            "messages": messages,
            "stream": stream,
            "options": {
                "temperature": config.temperature,
                "num_predict": config.max_tokens,
                "num_ctx": config.context_window,
            }
        }
        if tools:
            payload["tools"] = tools

        async with httpx.AsyncClient(
            base_url=config.hermes_base_url,
            timeout=httpx.Timeout(config.request_timeout)
        ) as client:
            response = await client.post("/api/chat", json=payload)
            response.raise_for_status()
            return response.json()

    async def _openai_chat(self, messages, stream, tools):
        payload = {
            "model": config.hermes_model,
            "messages": messages,
            "stream": stream,
            "temperature": config.temperature,
            "max_tokens": config.max_tokens,
        }
        if tools:
            payload["tools"] = tools
            payload["tool_choice"] = "auto"

        async with httpx.AsyncClient(
            base_url=config.hermes_base_url,
            timeout=httpx.Timeout(config.request_timeout)
        ) as client:
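            response = await client.post("/v1/chat/completions", json=payload)
            response.raise_for_status()
            data = response.json()
            # Normalize to Ollama's response shape so callers can always read
            # response["message"]["content"], whichever backend is in use.
            return {"message": data["choices"][0]["message"]}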

hermes_client = HermesClient()
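
The client above returns each reply in one piece. When Ollama's /api/chat is called with "stream": true, it sends newline-delimited JSON objects instead, each carrying a fragment of the reply and a done flag. The sketch below shows one way to turn those lines into text chunks. The function name iter_ollama_chunks is hypothetical, and the sketch works on plain strings so that it runs without a backend.

```python
import json
from typing import Iterable, Iterator


def iter_ollama_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragment from each streamed /api/chat line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip keep-alive blank lines
        event = json.loads(line)
        piece = event.get("message", {}).get("content", "")
        if piece:
            yield piece
        if event.get("done"):
            break  # the final object marks the end of the reply


# Sample lines in the shape Ollama streams; the content is made up.
sample = [
    '{"message": {"role": "assistant", "content": "Hello"}, "done": false}',
    '{"message": {"role": "assistant", "content": " world"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
streamed_text = "".join(iter_ollama_chunks(sample))
```

In a real server the lines would come from an httpx streaming request (client.stream(...) together with response.aiter_lines()), which calls for an async variant of this loop. Each chunk could then be forwarded to the MCP client as progress rather than held until the end.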

server.py (MCP Server entry point)

import asyncio
import json
import logging
import mcp.server.stdio
from mcp.server import Server
from mcp.server.models import InitializationOptions
from mcp import types
from tools import HERMES_TOOLS
from hermes_client import hermes_client
from config import config

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("hermes-mcp")

app = Server("hermes-agent-mcp")

@app.list_tools()
async def handle_list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name=t["name"],
            description=t["description"],
            inputSchema=t["inputSchema"]
        )
        for t in HERMES_TOOLS
    ]

@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    logger.info(f"Tool called: {name}")
    try:
        if name == "analyze_code":
            result = await _analyze_code(**arguments)
        elif name == "research_topic":
            result = await _research_topic(**arguments)
        elif name == "review_document":
            result = await _review_document(**arguments)
        elif name == "execute_workflow":
            result = await _execute_workflow(**arguments)
        else:
            result = f"Unknown tool: '{name}'"
        return [types.TextContent(type="text", text=result)]
    except Exception as e:
        logger.error(f"Tool {name} error: {e}")
        return [types.TextContent(type="text", text=f"Error: {str(e)}")]

async def _analyze_code(code, language="auto", focus="all", context=""):
    focus_prompts = {
        "bugs": "Focus on logic errors, boundary conditions, and runtime exceptions.",
        "security": "Focus on injection attacks, insecure deserialization, and data exposure.",
        "performance": "Focus on algorithmic complexity, database query patterns, and memory leaks.",
        "quality": "Focus on readability, maintainability, and SOLID principles.",
        "all": "Comprehensive review: bugs, security, performance, and code quality."
    }
    messages = [
        {"role": "system", "content": "You are a senior software engineer. Provide structured code analysis with severity ratings (High/Medium/Low), specific line references, and concrete fix suggestions with code examples."},
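        # Fall back to the comprehensive review when an unknown focus is passed,
        # otherwise the prompt would read "Focus: None".
        {"role": "user", "content": f"Analyze this {language} code.\n\nFocus: {focus_prompts.get(focus, focus_prompts['all'])}\n{f'Context: {context}' if context else ''}\n\n```{language}\n{code}\n```"}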
    ]
    response = await hermes_client.chat_completion(messages)
    return response["message"]["content"]

async def _research_topic(query, depth="standard", output_format="markdown"):
    depth_config = {
        "quick": "Brief, under 300 words, key points only.",
        "standard": "Standard depth, ~1000 words, covering main aspects.",
        "deep": "Deep research, 2000+ words, multi-angle analysis with examples."
    }
    messages = [
        {"role": "system", "content": f"You are a professional researcher. Produce output in {output_format} format."},
        {"role": "user", "content": f"Research: {query}\n\nDepth requirement: {depth_config[depth]}"}
    ]
    response = await hermes_client.chat_completion(messages)
    return response["message"]["content"]

async def _review_document(content, criteria=None, tone="balanced"):
    if criteria is None:
        criteria = ["accuracy", "clarity", "completeness", "structure"]
    tone_desc = {
        "strict": "Be rigorous and direct about all problems.",
        "balanced": "Be objective — acknowledge strengths and flag weaknesses.",
        "encouraging": "Be supportive, highlight positives, suggest improvements gently."
    }
    messages = [
        {"role": "system", "content": f"You are a professional document reviewer. Tone: {tone_desc[tone]}"},
        {"role": "user", "content": f"Review criteria: {', '.join(criteria)}\n\nDocument:\n{content}"}
    ]
    response = await hermes_client.chat_completion(messages)
    return response["message"]["content"]

async def _execute_workflow(workflow_name, params):
    workflows = {
        "code_review_pipeline": "You are a code review pipeline expert. Execute a full review cycle.",
        "data_analysis": "You are a data analyst. Perform comprehensive data analysis.",
        "report_generation": "You are a professional report writer. Generate a structured report.",
    }
    if workflow_name not in workflows:
        return f"Unknown workflow '{workflow_name}'. Available: {', '.join(workflows.keys())}"
    messages = [
        {"role": "system", "content": workflows[workflow_name]},
        {"role": "user", "content": f"Execute workflow with parameters:\n{json.dumps(params, indent=2)}"}
    ]
    response = await hermes_client.chat_completion(messages)
    return response["message"]["content"]

async def main():
    logger.info(f"Hermes MCP Server starting — backend: {config.hermes_base_url}")
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="hermes-agent-mcp",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
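                    # get_capabilities reads attributes on this object, so None would fail.
                    notification_options=mcp.server.NotificationOptions(),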
                    experimental_capabilities={}
                )
            )
        )

if __name__ == "__main__":
    asyncio.run(main())
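
As more tools are added, the if/elif chain in handle_call_tool can be replaced by a lookup table. The sketch below illustrates that alternative with stand-in handlers so it runs without a backend. HANDLERS and dispatch are hypothetical names, and the stand-in bodies do not call Hermes.

```python
import asyncio
from typing import Awaitable, Callable


async def _analyze_code(code: str, language: str = "auto") -> str:
    # Stand-in for the real handler so the sketch runs on its own.
    return f"analyzed {len(code)} chars of {language}"


async def _research_topic(query: str) -> str:
    # Stand-in for the real handler.
    return f"researched {query}"


# Map each tool name to its coroutine; adding a tool becomes a one-line change.
HANDLERS: dict[str, Callable[..., Awaitable[str]]] = {
    "analyze_code": _analyze_code,
    "research_topic": _research_topic,
}


async def dispatch(name: str, arguments: dict) -> str:
    handler = HANDLERS.get(name)
    if handler is None:
        # Raising lets the caller report a real error instead of a normal-looking reply.
        raise ValueError(f"Unknown tool: {name!r}")
    return await handler(**arguments)
```

Inside handle_call_tool, the body of the try block would then reduce to a single call such as result = await dispatch(name, arguments).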

41.4 Client Configuration Examples

Claude Code (.mcp.json)

{
  "mcpServers": {
    "hermes-agent": {
      "command": "python",
      "args": ["/path/to/hermes-mcp-server/server.py"],
      "env": {
        "HERMES_BASE_URL": "http://localhost:11434",
        "HERMES_MODEL": "nous-hermes2:70b-q4_0",
        "MAX_TOKENS": "4096",
        "TEMPERATURE": "0.1",
        "CONTEXT_WINDOW": "65536",
        "REQUEST_TIMEOUT": "120"
      }
    }
  }
}

Cursor (.cursor/mcp.json)

{
  "mcpServers": {
    "hermes": {
      "command": "python",
      "args": ["/path/to/hermes-mcp-server/server.py"],
      "env": {
        "HERMES_BASE_URL": "http://localhost:11434",
        "HERMES_MODEL": "nous-hermes2:70b-q4_0"
      }
    }
  }
}

vLLM Backend Switch

To use vLLM instead of Ollama, simply change:

HERMES_BASE_URL=http://localhost:8000   # vLLM default port
HERMES_MODEL=NousResearch/Hermes-4-70B  # HuggingFace model ID

The client picks the backend from the URL: any base URL that does not contain 11434 or "ollama" is sent to the OpenAI-compatible /v1/chat/completions endpoint.
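
The two backends also return differently shaped JSON. Ollama's /api/chat puts the reply under message.content, while OpenAI-compatible servers such as vLLM put it under choices[0].message.content. The sketch below reads the reply text from either shape. The name extract_text is hypothetical and not part of the chapter's code.

```python
def extract_text(response: dict) -> str:
    """Return the assistant text from an Ollama or OpenAI-style chat response."""
    if "choices" in response:
        # OpenAI-compatible shape (vLLM): choices[0].message.content
        return response["choices"][0]["message"].get("content") or ""
    if "message" in response:
        # Ollama /api/chat shape: message.content
        return response["message"].get("content") or ""
    raise ValueError("Unrecognized chat response shape")


ollama_reply = {"message": {"role": "assistant", "content": "from ollama"}, "done": True}
openai_reply = {"choices": [{"message": {"role": "assistant", "content": "from vllm"}}]}
texts = [extract_text(ollama_reply), extract_text(openai_reply)]
```

Routing every tool handler through one helper like this keeps the backend switch a pure configuration change.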


41.5 Use Cases and Limitations

When to Use This Pattern

| Use Case | Why Reverse MCP Fits |
|---|---|
| Private codebase analysis | Hermes accesses internal code; results flow to Claude Code via MCP |
| Complex domain reasoning | Encapsulate proprietary business rules as services |
| Long document processing | Leverage Hermes's 64K+ context window for contracts, research papers |
| Standardized AI pipelines | Package multi-step workflows as single tool calls |
| Air-gapped / on-premise | All inference stays on-network; no data leaves |

Known Limitations

Latency: A Hermes 70B inference round-trip takes 5–60 seconds — unsuitable for sub-second interactive use.

Concurrency: A single GPU instance serializes requests. Under multi-client load, implement a request queue or scale horizontally (see Chapter 48).

Statelessness: The tools in this wrapper are stateless. Each tool call starts a fresh conversation with Hermes, so multi-turn dialogue history must be passed explicitly in tool parameters or managed by the wrapper.

Not ideal for: real-time applications, high-throughput workloads (>50 QPS), or simple queries where a direct API call is faster.
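
The concurrency limitation can be handled inside the wrapper by admitting only a fixed number of inference calls at a time and letting the rest wait. The sketch below uses asyncio.Semaphore for that purpose. InferenceGate, its peak counter, and the fake inference coroutine are hypothetical and exist only to make the example self-contained.

```python
import asyncio


class InferenceGate:
    """Limits how many inference calls run at once (hypothetical helper)."""

    def __init__(self, max_concurrent: int = 1):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._active = 0
        self.peak = 0  # highest number of calls observed running together

    async def run(self, coro_fn, *args):
        # Callers beyond the limit wait here until a slot frees up.
        async with self._semaphore:
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                return await coro_fn(*args)
            finally:
                self._active -= 1


async def fake_inference(i: int) -> int:
    # Stand-in for a slow Hermes call.
    await asyncio.sleep(0.01)
    return i * 2


async def demo():
    # Create the gate inside the running loop so the semaphore binds to it.
    gate = InferenceGate(max_concurrent=1)
    results = await asyncio.gather(*(gate.run(fake_inference, i) for i in range(3)))
    return results, gate.peak
```

With max_concurrent set to the number of requests the backend can serve in parallel, extra clients queue in the wrapper instead of piling up on the GPU. A bounded queue with a timeout would be a natural next step.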


Chapter Summary

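This chapter built a working Hermes MCP wrapper: the reverse architecture and its design principles, a Python implementation served over stdio, client configuration for Claude Code and Cursor, and an assessment of where the pattern fits. Streaming and session isolation remain extensions, which the review questions below explore.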

The reverse MCP pattern's core value proposition is capability reuse: a single Hermes deployment becomes a shared AI service that any MCP-compatible client can call.

Review Questions

  1. The current implementation collects the full Hermes response before returning it to the MCP client. How would you modify the server to support progressive streaming, and what MCP protocol capabilities would that require?

  2. Design a session isolation scheme that allows multiple Claude Code instances to maintain separate conversation histories when calling the same Hermes MCP Server.

  3. If Hermes itself needs to call a downstream MCP server (e.g., a database tool) while simultaneously acting as an MCP server, map the full request chain and identify where circular dependency risks could emerge.
