Hermes as an MCP Server: Exposing Capabilities to Other Clients
Chapter 41. Hermes as an MCP Server: Exposing Capabilities to External Clients
Introduction
Most practitioners deploy Hermes Agent as a consumer: calling filesystem tools, querying databases, browsing the web. There is, however, a powerful inversion of this pattern: making Hermes itself a Model Context Protocol (MCP) Server, so that Claude Code, Cursor, or any MCP-compatible client can invoke Hermes's reasoning capabilities on demand. This "reverse MCP" architecture lets organizations centralize AI business logic in a private Hermes instance and expose it as a first-class service to every tool in their AI stack.
This chapter walks through the architecture design, a complete Python implementation, configuration examples for popular clients, and an honest assessment of when this pattern shines and when it falls short.
41.1 Why Wrap Hermes as an MCP Server?
The Limitation of the Standard Model
In the canonical Hermes deployment, Hermes is the MCP client. It connects to external MCP servers (filesystem, database, web-search) and calls their tools to complete tasks. This is clean and effective, but it creates silos:
- Claude Code cannot directly invoke Hermes's reasoning capabilities
- Private business logic embedded in Hermes workflows is inaccessible to other AI clients
- Multiple AI tools must independently re-implement the same domain knowledge
What the Reverse Architecture Unlocks
| Standard Architecture | Reverse MCP Architecture |
|---|---|
| Each client implements its own logic | Business logic centralized in Hermes |
| Claude Code uses only its native tools | Claude Code gains Hermes's 64K+ context reasoning |
| Hermes workflows are opaque | Hermes workflows exposed as callable tools |
| Data must leave the private network | Hermes stays on-premise; clients call in via MCP |
41.2 Architecture Design
┌──────────────────────────────────────────────────────┐
│                   MCP Client Layer                   │
│  ┌─────────────┐  ┌────────┐  ┌───────────────────┐  │
│  │ Claude Code │  │ Cursor │  │ Custom IDE Plugin │  │
│  └──────┬──────┘  └───┬────┘  └─────────┬─────────┘  │
└─────────┼─────────────┼─────────────────┼────────────┘
          │  MCP Protocol (JSON-RPC 2.0)  │
          ▼             ▼                 ▼
┌──────────────────────────────────────────────────────┐
│          Hermes MCP Wrapper (this chapter)           │
│  ┌────────────────────────────────────────────────┐  │
│  │            MCP Server (stdio / TCP)            │  │
│  │    tools/list   tools/call   resources/read    │  │
│  └───────────────────────┬────────────────────────┘  │
│                          │                           │
│  ┌───────────────────────▼────────────────────────┐  │
│  │               Hermes Agent Core                │  │
│  │  • Task planning          • Tool selection     │  │
│  │  • Multi-step reasoning   • Memory             │  │
│  └───────────────────────┬────────────────────────┘  │
│                          │                           │
│  ┌───────────────────────▼────────────────────────┐  │
│  │       Ollama / vLLM / llama.cpp backend        │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘
                           │  (optional downstream tool calls)
                           ▼
┌──────────────────┬────────────────┬──────────────────┐
│  filesystem MCP  │  database MCP  │  web-search MCP  │
└──────────────────┴────────────────┴──────────────────┘
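To make the "MCP Protocol (JSON-RPC 2.0)" arrow concrete, the sketch below builds the kind of message a client sends for tools/call and the kind of reply the wrapper returns. The helper names make_tools_call and make_text_result are hypothetical and exist only for illustration; in practice the MCP SDK does this framing for you.

```python
import json


def make_tools_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize a tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


def make_text_result(request_id: int, text: str) -> str:
    """Serialize the matching response with a single text content item."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,  # must echo the id of the request being answered
        "result": {"content": [{"type": "text", "text": text}], "isError": False},
    })


request = make_tools_call(1, "analyze_code", {"code": "print(1)", "language": "python"})
response = make_text_result(1, "No issues found.")
```

The request and response are paired only by the shared id, which is why several clients can have calls in flight on separate connections without confusing each other's results.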
Three Core Design Principles
1. Meaningful tool granularity
Resist the temptation to expose a single run_agent(prompt) tool. Break capabilities into semantically clear, domain-specific tools:
- analyze_code: bug detection, security audits, performance review
- research_topic: deep analysis with structured output
- review_document: professional document critique
- execute_workflow: run a named, parameterized Agent pipeline
2. Streaming support
Hermes inference on a 70B model takes 5 to 60 seconds. Implement SSE or MCP streaming to prevent client-side timeouts and provide progressive feedback.
3. Session isolation
Each MCP client connection gets its own context ID. Never let context from one client bleed into another's session.
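The session isolation principle can be sketched as a small in-memory store keyed by context ID. SessionStore and its methods are hypothetical names, not part of the chapter's implementation, and how a session ID is obtained from the MCP connection is left as an assumption.

```python
from collections import defaultdict


class SessionStore:
    """Keeps a separate message history per session ID (hypothetical helper)."""

    def __init__(self, max_messages: int = 20):
        # max_messages is assumed to be a positive integer.
        self._histories: dict[str, list[dict]] = defaultdict(list)
        self._max_messages = max_messages

    def append(self, session_id: str, role: str, content: str) -> None:
        history = self._histories[session_id]
        history.append({"role": role, "content": content})
        # Drop the oldest messages so one session cannot grow without bound.
        del history[:-self._max_messages]

    def messages(self, session_id: str) -> list[dict]:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._histories.get(session_id, []))


store = SessionStore(max_messages=2)
for text in ["first", "second", "third"]:
    store.append("client-a", "user", text)
```

Because every read and write goes through a session ID, a tool handler can only see the history of the client it is serving; an unknown ID simply yields an empty history.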
41.3 Complete Python Implementation
Installation
pip install mcp httpx pydantic python-dotenv
Project Layout
hermes-mcp-server/
├── server.py          # MCP Server entry point
├── hermes_client.py   # Hermes/Ollama communication
├── tools.py           # Tool definitions (schema)
├── config.py          # Configuration management
└── .env               # Environment variables
config.py
import os
from dataclasses import dataclass

from dotenv import load_dotenv

# Load .env before the defaults below are evaluated.
load_dotenv()


@dataclass
class HermesConfig:
    hermes_base_url: str = os.getenv("HERMES_BASE_URL", "http://localhost:11434")
    hermes_model: str = os.getenv("HERMES_MODEL", "nous-hermes2:70b-q4_0")
    mcp_host: str = os.getenv("MCP_HOST", "127.0.0.1")
    mcp_port: int = int(os.getenv("MCP_PORT", "8765"))
    max_tokens: int = int(os.getenv("MAX_TOKENS", "4096"))
    temperature: float = float(os.getenv("TEMPERATURE", "0.1"))
    context_window: int = int(os.getenv("CONTEXT_WINDOW", "65536"))
    request_timeout: int = int(os.getenv("REQUEST_TIMEOUT", "120"))
    stream_timeout: int = int(os.getenv("STREAM_TIMEOUT", "300"))


config = HermesConfig()
hermes_client.py
import httpx
from typing import Optional

from config import config


class HermesClient:
    """
    Abstracts communication with the Hermes inference backend.
    Switches between the Ollama API and OpenAI-compatible APIs (vLLM) based on the base URL.
    Both paths return a dict shaped like Ollama's reply: {"message": {"content": ...}}.
    """

    async def chat_completion(
        self,
        messages: list[dict],
        stream: bool = False,
        tools: Optional[list] = None
    ):
        # Both paths parse the body as one JSON document, so keep stream=False
        # until the server is extended to forward chunks.
        if "11434" in config.hermes_base_url or "ollama" in config.hermes_base_url.lower():
            return await self._ollama_chat(messages, stream, tools)
        return await self._openai_chat(messages, stream, tools)

    async def _ollama_chat(self, messages, stream, tools):
        payload = {
            "model": config.hermes_model,
            "messages": messages,
            "stream": stream,
            "options": {
                "temperature": config.temperature,
                "num_predict": config.max_tokens,
                "num_ctx": config.context_window,
            }
        }
        if tools:
            payload["tools"] = tools
        async with httpx.AsyncClient(
            base_url=config.hermes_base_url,
            timeout=httpx.Timeout(config.request_timeout)
        ) as client:
            response = await client.post("/api/chat", json=payload)
            response.raise_for_status()
            return response.json()

    async def _openai_chat(self, messages, stream, tools):
        payload = {
            "model": config.hermes_model,
            "messages": messages,
            "stream": stream,
            "temperature": config.temperature,
            "max_tokens": config.max_tokens,
        }
        if tools:
            payload["tools"] = tools
            payload["tool_choice"] = "auto"
        async with httpx.AsyncClient(
            base_url=config.hermes_base_url,
            timeout=httpx.Timeout(config.request_timeout)
        ) as client:
            response = await client.post("/v1/chat/completions", json=payload)
            response.raise_for_status()
            data = response.json()
            # Normalize to the Ollama shape so callers can always read
            # response["message"]["content"], whichever backend answered.
            return {"message": data["choices"][0]["message"]}


hermes_client = HermesClient()
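tools.py

server.py imports HERMES_TOOLS from tools.py, which is not listed in this chapter. The following is a minimal sketch of what that module could look like, with parameter names and defaults inferred from the handler signatures in server.py. The descriptions, the enum values, and the _tool helper are assumptions for illustration.

```python
# tools.py (hypothetical sketch; schemas inferred from the handlers in server.py)


def _tool(name: str, description: str, properties: dict, required: list[str]) -> dict:
    """Build one tool entry in the dict shape that server.py reads."""
    return {
        "name": name,
        "description": description,
        "inputSchema": {"type": "object", "properties": properties, "required": required},
    }


HERMES_TOOLS = [
    _tool(
        "analyze_code",
        "Review code for bugs, security, performance, and quality issues.",
        {
            "code": {"type": "string"},
            "language": {"type": "string", "default": "auto"},
            "focus": {"type": "string", "default": "all",
                      "enum": ["bugs", "security", "performance", "quality", "all"]},
            "context": {"type": "string", "default": ""},
        },
        ["code"],
    ),
    _tool(
        "research_topic",
        "Research a topic and return structured output.",
        {
            "query": {"type": "string"},
            "depth": {"type": "string", "default": "standard",
                      "enum": ["quick", "standard", "deep"]},
            "output_format": {"type": "string", "default": "markdown"},
        },
        ["query"],
    ),
    _tool(
        "review_document",
        "Critique a document against the given criteria.",
        {
            "content": {"type": "string"},
            "criteria": {"type": "array", "items": {"type": "string"}},
            "tone": {"type": "string", "default": "balanced",
                     "enum": ["strict", "balanced", "encouraging"]},
        },
        ["content"],
    ),
    _tool(
        "execute_workflow",
        "Run a named, parameterized workflow.",
        {
            "workflow_name": {"type": "string"},
            "params": {"type": "object"},
        },
        ["workflow_name"],
    ),
]
```

Keeping the schemas in plain dicts means the property names must stay in sync with the keyword arguments of the handlers, since server.py passes the arguments through with `**arguments`.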
server.py (MCP Server entry point)
import asyncio
import json
import logging

import mcp.server.stdio
from mcp import types
from mcp.server import NotificationOptions, Server
from mcp.server.models import InitializationOptions

from tools import HERMES_TOOLS
from hermes_client import hermes_client
from config import config

# Logging goes to stderr by default, which keeps stdout free for the stdio transport.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("hermes-mcp")

app = Server("hermes-agent-mcp")


@app.list_tools()
async def handle_list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name=t["name"],
            description=t["description"],
            inputSchema=t["inputSchema"]
        )
        for t in HERMES_TOOLS
    ]


@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    logger.info(f"Tool called: {name}")
    arguments = arguments or {}  # tolerate calls that send no arguments
    try:
        if name == "analyze_code":
            result = await _analyze_code(**arguments)
        elif name == "research_topic":
            result = await _research_topic(**arguments)
        elif name == "review_document":
            result = await _review_document(**arguments)
        elif name == "execute_workflow":
            result = await _execute_workflow(**arguments)
        else:
            result = f"Unknown tool: '{name}'"
        return [types.TextContent(type="text", text=result)]
    except Exception as e:
        logger.error(f"Tool {name} error: {e}")
        return [types.TextContent(type="text", text=f"Error: {str(e)}")]


async def _analyze_code(code, language="auto", focus="all", context=""):
    focus_prompts = {
        "bugs": "Focus on logic errors, boundary conditions, and runtime exceptions.",
        "security": "Focus on injection attacks, insecure deserialization, and data exposure.",
        "performance": "Focus on algorithmic complexity, database query patterns, and memory leaks.",
        "quality": "Focus on readability, maintainability, and SOLID principles.",
        "all": "Comprehensive review: bugs, security, performance, and code quality."
    }
    # Fall back to the comprehensive review if the client sends an unknown focus.
    focus_text = focus_prompts.get(focus, focus_prompts["all"])
    context_text = f"Context: {context}\n" if context else ""
    messages = [
        {"role": "system", "content": "You are a senior software engineer. Provide structured code analysis with severity ratings (High/Medium/Low), specific line references, and concrete fix suggestions with code examples."},
        {"role": "user", "content": f"Analyze this {language} code.\n\nFocus: {focus_text}\n{context_text}\n```{language}\n{code}\n```"}
    ]
    response = await hermes_client.chat_completion(messages)
    return response["message"]["content"]


async def _research_topic(query, depth="standard", output_format="markdown"):
    depth_config = {
        "quick": "Brief, under 300 words, key points only.",
        "standard": "Standard depth, ~1000 words, covering main aspects.",
        "deep": "Deep research, 2000+ words, multi-angle analysis with examples."
    }
    depth_text = depth_config.get(depth, depth_config["standard"])
    messages = [
        {"role": "system", "content": f"You are a professional researcher. Produce output in {output_format} format."},
        {"role": "user", "content": f"Research: {query}\n\nDepth requirement: {depth_text}"}
    ]
    response = await hermes_client.chat_completion(messages)
    return response["message"]["content"]


async def _review_document(content, criteria=None, tone="balanced"):
    if criteria is None:
        criteria = ["accuracy", "clarity", "completeness", "structure"]
    tone_desc = {
        "strict": "Be rigorous and direct about all problems.",
        "balanced": "Be objective: acknowledge strengths and flag weaknesses.",
        "encouraging": "Be supportive, highlight positives, suggest improvements gently."
    }
    tone_text = tone_desc.get(tone, tone_desc["balanced"])
    messages = [
        {"role": "system", "content": f"You are a professional document reviewer. Tone: {tone_text}"},
        {"role": "user", "content": f"Review criteria: {', '.join(criteria)}\n\nDocument:\n{content}"}
    ]
    response = await hermes_client.chat_completion(messages)
    return response["message"]["content"]


async def _execute_workflow(workflow_name, params=None):
    workflows = {
        "code_review_pipeline": "You are a code review pipeline expert. Execute a full review cycle.",
        "data_analysis": "You are a data analyst. Perform comprehensive data analysis.",
        "report_generation": "You are a professional report writer. Generate a structured report.",
    }
    if workflow_name not in workflows:
        return f"Unknown workflow '{workflow_name}'. Available: {', '.join(workflows.keys())}"
    messages = [
        {"role": "system", "content": workflows[workflow_name]},
        {"role": "user", "content": f"Execute workflow with parameters:\n{json.dumps(params or {}, indent=2)}"}
    ]
    response = await hermes_client.chat_completion(messages)
    return response["message"]["content"]


async def main():
    logger.info(f"Hermes MCP Server starting, backend: {config.hermes_base_url}")
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="hermes-agent-mcp",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
                    # get_capabilities reads attributes from this object, so None will not work.
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )


if __name__ == "__main__":
    asyncio.run(main())
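The handlers above wait for the complete response before replying. As a first step toward the streaming principle from section 41.2, the sketch below consumes the newline-delimited JSON that Ollama's /api/chat returns when "stream" is true and yields text fragments as they arrive. The function iter_ollama_deltas is a hypothetical helper, the stream is faked so the example runs offline, and forwarding the fragments to the MCP client is not shown.

```python
import asyncio
import json
from typing import AsyncIterator


async def iter_ollama_deltas(lines: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield text fragments from a newline-delimited Ollama chat stream."""
    async for line in lines:
        if not line.strip():
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        delta = chunk.get("message", {}).get("content", "")
        if delta:
            yield delta
        if chunk.get("done"):
            break  # the final chunk carries done=true


async def _fake_stream() -> AsyncIterator[str]:
    # Stand-in for httpx's response.aiter_lines() so the sketch needs no server.
    for line in [
        '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
        "",
        '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
        '{"message": {"role": "assistant", "content": ""}, "done": true}',
    ]:
        yield line


async def collect() -> str:
    return "".join([delta async for delta in iter_ollama_deltas(_fake_stream())])


result = asyncio.run(collect())
```

Against a live backend, the same generator could be fed from `response.aiter_lines()` inside an `async with client.stream("POST", "/api/chat", json=payload) as response:` block, with the longer stream_timeout from config.py applied to that request.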
41.4 Client Configuration Examples
Claude Code (.mcp.json)
{
  "mcpServers": {
    "hermes-agent": {
      "command": "python",
      "args": ["/path/to/hermes-mcp-server/server.py"],
      "env": {
        "HERMES_BASE_URL": "http://localhost:11434",
        "HERMES_MODEL": "nous-hermes2:70b-q4_0",
        "MAX_TOKENS": "4096",
        "TEMPERATURE": "0.1",
        "CONTEXT_WINDOW": "65536",
        "REQUEST_TIMEOUT": "120"
      }
    }
  }
}
Cursor (.cursor/mcp.json)
{
  "mcpServers": {
    "hermes": {
      "command": "python",
      "args": ["/path/to/hermes-mcp-server/server.py"],
      "env": {
        "HERMES_BASE_URL": "http://localhost:11434",
        "HERMES_MODEL": "nous-hermes2:70b-q4_0"
      }
    }
  }
}
vLLM Backend Switch
To use vLLM instead of Ollama, simply change:
HERMES_BASE_URL=http://localhost:8000 # vLLM default port
HERMES_MODEL=NousResearch/Hermes-4-70B # HuggingFace model ID
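The client picks the backend from the URL: a base URL containing port 11434 or the string "ollama" is sent requests in Ollama's /api/chat format, and anything else is treated as an OpenAI-compatible endpoint such as vLLM.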
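The URL check in HermesClient.chat_completion can be isolated into a small function to see exactly which URLs land on which path. detect_backend is a hypothetical name; the logic mirrors the condition in hermes_client.py.

```python
def detect_backend(base_url: str) -> str:
    """Mirror the URL heuristic used by HermesClient.chat_completion."""
    if "11434" in base_url or "ollama" in base_url.lower():
        return "ollama"
    return "openai"


examples = {
    "http://localhost:11434": detect_backend("http://localhost:11434"),
    "http://localhost:8000": detect_backend("http://localhost:8000"),
    "http://OLLAMA.internal:9000": detect_backend("http://OLLAMA.internal:9000"),
    # An Ollama instance behind a proxy on another port is misclassified.
    "http://gpu-box:8080": detect_backend("http://gpu-box:8080"),
}
```

The last entry shows the limit of a string heuristic: an Ollama server reachable under a neutral hostname and a non-default port is treated as OpenAI-compatible. If that applies to your deployment, an explicit backend setting would be more robust than URL matching.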
41.5 Use Cases and Limitations
When to Use This Pattern
| Use Case | Why Reverse MCP Fits |
|---|---|
| Private codebase analysis | Hermes accesses internal code; results flow to Claude Code via MCP |
| Complex domain reasoning | Encapsulate proprietary business rules as services |
| Long document processing | Leverage Hermes's 64K+ context window for contracts, research papers |
| Standardized AI pipelines | Package multi-step workflows as single tool calls |
| Air-gapped / on-premise | All inference stays on-network; no data leaves |
Known Limitations
Latency: A Hermes 70B inference round-trip takes 5 to 60 seconds, which makes it unsuitable for sub-second interactive use.
Concurrency: A single GPU instance serializes requests. Under multi-client load, implement a request queue or scale horizontally (see Chapter 48).
Statelessness: Each tool call in this wrapper is handled independently, and the server keeps no conversation history between calls. Multi-turn dialogue history must be passed explicitly in tool parameters or managed by the wrapper.
Not ideal for: real-time applications, high-throughput workloads (>50 QPS), or simple queries where a direct API call is faster.
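For the concurrency limitation, one simple form of request queue is an asyncio.Semaphore that caps how many calls reach the backend at once, with the rest waiting their turn. InferenceGate and _fake_inference are hypothetical names, and the sleep stands in for a real model call.

```python
import asyncio


class InferenceGate:
    """Caps the number of concurrent backend calls (hypothetical helper)."""

    def __init__(self, max_concurrent: int = 1):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.active = 0  # calls currently inside the gate
        self.peak = 0    # highest concurrency observed so far

    async def run(self, coro_fn, *args):
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await coro_fn(*args)
            finally:
                self.active -= 1


async def _fake_inference(prompt: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a slow model call
    return prompt.upper()


async def demo():
    gate = InferenceGate(max_concurrent=1)
    results = await asyncio.gather(
        *(gate.run(_fake_inference, p) for p in ["a", "b", "c"])
    )
    return results, gate.peak


results, peak = asyncio.run(demo())
```

In the wrapper, each handler would route its call to hermes_client.chat_completion through a shared gate. Callers queue inside the server instead of piling onto the GPU, at the cost of longer waits for later requests, so the request timeout needs to account for time spent in the queue.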
Chapter Summary
This chapter delivered a complete, working Hermes MCP wrapper:
- A layered architecture: MCP Server → Agent Core → Inference Backend
- Auto-detection of Ollama vs. vLLM backends with zero code changes
- Four semantically distinct tools covering code analysis, research, document review, and workflow execution
- Ready-to-paste configuration for Claude Code and Cursor
The reverse MCP pattern's core value proposition is capability reuse: a single Hermes deployment becomes a shared AI service that any MCP-compatible client can call.
Review Questions
- The current implementation collects the full Hermes response before returning it to the MCP client. How would you modify the server to support progressive streaming, and what MCP protocol capabilities would that require?
- Design a session isolation scheme that allows multiple Claude Code instances to maintain separate conversation histories when calling the same Hermes MCP Server.
- If Hermes itself needs to call a downstream MCP server (e.g., a database tool) while simultaneously acting as an MCP server, map the full request chain and identify where circular dependency risks could emerge.