NousResearch and the Hermes Model Family
Chapter 8: NousResearch and the Hermes Model Family
Chapter Overview
Understanding a tool requires understanding the team that created it. NousResearch is not an ordinary AI company: it is one of the most influential teams in the open-source AI community, known for its distinctive research culture and a series of breakthrough models. This chapter examines NousResearch's background and research philosophy, traces the technical evolution of the Hermes models from generation one through generation four, details the Atropos RL training methodology, and presents performance data on key benchmarks. This context is what lets you judge whether Hermes models fit your application.
8.1 NousResearch: An Outlier in Open-Source AI
Team Background and Origins
NousResearch was founded in 2023 by researchers and engineers who grew out of the Reddit open-source AI community, particularly r/LocalLLaMA. This origin defines their DNA. They are not scholars from elite academic institutions, nor executives who left tech giants. They are genuine community members who emerged from open-source grassroots culture.
Core founding team members include:
- Fine-tuning experimenters active in the HuggingFace community
- Technical evangelists who drove the local LLM movement on r/LocalLLaMA
- Multiple independent researchers with ML research backgrounds
This background explains several distinctive NousResearch characteristics:
1. Radically open culture
From day one, NousResearch has released all model weights completely openly: not "open source with usage restrictions," but genuine Apache 2.0 licensing with full commercial freedom.
2. Community-driven research agenda
Their research topics do not come from academic conference trends, but from the practical needs of the open-source community:
- Community urgently needed better tool-calling capability → Hermes focuses on tool-call optimization
- Community urgently needed local deployment options → Provides complete GGUF/GPTQ formats
- Community urgently needed an agent framework → Develops Hermes Agent
3. Fast-iteration engineering culture
NousResearch's development cycle is far shorter than that of academic institutions: going from identifying a problem to releasing a new version typically takes weeks, not months.
NousResearch's Position in the AI Ecosystem
AI research institution spectrum (by openness and scale):
| Organization | Openness | Scale and stance |
|---|---|---|
| NousResearch | High | Small team, radically open |
| EleutherAI | High | Non-profit, fully open |
| Meta AI | Medium | Large company, partially open |
| Google DeepMind | Medium | Large company, selectively open |
| OpenAI | Low | Shifted from open to closed |
| Anthropic | Low | Primarily closed |
NousResearch occupies the "small team, radically open" quadrant, enabling it to:
- Respond rapidly to community needs
- Make technology decisions unconstrained by commercial interests
- Build community influence exceeding its organizational scale
8.2 Hermes Model Family: Complete Technical Evolution
Hermes 1: Proof of Concept (Early 2023)
Hermes 1 was NousResearch's first publicly released fine-tuned model, based on the original LLaMA 1.
Key characteristics:
- Base model: LLaMA 1 13B
- Focus: Improving instruction-following capability
- Training data: Mixed open-source instruction datasets (Alpaca, ShareGPT, etc.)
- Historical significance: Proved that small teams could produce high-quality models through fine-tuning
Limitations:
- No specialized tool-calling optimization
- Limited context window (LLaMA 1 supports at most 2,048 tokens)
- Not suitable for agent tasks
Hermes 2: The Tool-Calling Breakthrough (Mid to Late 2023)
The Hermes 2 series was NousResearch's true breakout milestone, rapidly accumulating large download numbers on HuggingFace.
Hermes 2 Pro (most important version):
- Base model: Mistral 7B
- Core breakthrough: First systematic training of tool-calling (Function Calling) capability into an open-source model
# Hermes 2 Pro tool-calling format: JSON tool schemas wrapped in XML-style tags
# The most standardized tool-call implementation in open-source models at the time
tools_prompt = """
You have access to the following tools:
<tools>
[
  {
    "name": "get_weather",
    "description": "Get weather for a specified city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
      },
      "required": ["city"]
    }
  }
]
</tools>
If you decide to use a tool, respond with:
<tool_call>{"name": "tool_name", "arguments": {...}}</tool_call>
"""
Hermes 2 Series Versions:
| Version | Base Model | Parameters | Core Improvement |
|---|---|---|---|
| Hermes 2 Theta | LLaMA 2 70B | 70B | Basic conversation improvements |
| Hermes 2 Pro | Mistral 7B | 7B | Tool calling, function call format |
| Hermes 2 Yi | Yi 34B | 34B | Long context (200K tokens) |
| Hermes 2 Solar | Solar 10.7B | 10.7B | Balanced performance and efficiency |
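On the application side, the `<tool_call>` tags used by Hermes 2 Pro make the model's output straightforward to parse. The following is a minimal sketch, assuming the model emits well-formed JSON between the tags. `extract_tool_calls` and the sample output string are hypothetical names introduced for illustration, not part of any NousResearch library.

```python
import json
import re

# Matches the JSON payload between <tool_call> and </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(model_output: str) -> list[dict]:
    """Return every well-formed tool call found in the model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(model_output):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing the agent loop
        if isinstance(call, dict) and "name" in call:
            call.setdefault("arguments", {})
            calls.append(call)
    return calls


# Hypothetical model output that calls a get_weather tool.
sample_output = (
    "Let me check.\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
calls = extract_tool_calls(sample_output)
print(calls)
```

Skipping malformed payloads rather than raising is a design choice for this sketch: an agent loop can then ask the model to retry instead of aborting the whole task.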
Hermes 3: Mature Agent Optimization (April 2024)
Hermes 3, based on Meta's Llama 3, was the first version systematically optimized for agent tasks.
Training data composition (publicly disclosed information):
Hermes 3 training data mixture:
| Data category | Share |
|---|---|
| Tool-calling / function-calling conversations | ~30% |
| Multi-step reasoning data | ~25% |
| Role-play and instruction-following | ~20% |
| Code generation and comprehension | ~15% |
| General conversation and knowledge | ~10% |
| Total | ~1.2M conversations |
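To make the proportions concrete, the approximate number of conversations per category can be derived from the ~1.2M total. This is back-of-the-envelope arithmetic on the rounded figures above, not an exact breakdown; the category keys are shorthand labels chosen for this sketch.

```python
# Approximate shares from the mixture above (rounded figures).
mixture = {
    "tool_calling": 0.30,
    "multi_step_reasoning": 0.25,
    "role_play_instruction": 0.20,
    "code": 0.15,
    "general": 0.10,
}
total_conversations = 1_200_000  # "~1.2M conversations"

# Multiply each share by the total to get a rough per-category count.
approx_counts = {
    name: round(share * total_conversations) for name, share in mixture.items()
}
for name, count in approx_counts.items():
    print(f"{name:24s} ~{count:,}")
```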
Core improvements:
# Hermes 3 improvement: more structured agent reasoning format
"""
<|im_start|>system
You are Hermes, an intelligent assistant. When needed, use available tools.
Think step by step before taking actions.
<|im_end|>
<|im_start|>user
Analyze AAPL stock trends over the past month
<|im_end|>
<|im_start|>assistant
<think>
The user wants to analyze AAPL stock. I need to:
1. Retrieve price data for the past month
2. Calculate key technical indicators
3. Analyze the trend and provide a judgment
</think>
<tool_call>{"name": "stock_data", "arguments": {"symbol": "AAPL", "period": "1mo"}}</tool_call>
<|im_end|>
"""
Hermes 3 introduced the <think> tag mechanism, allowing the model to reason explicitly before acting. This is a concrete implementation of Chain-of-Thought reasoning in agent contexts.
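In application code, the reasoning inside `<think>` is usually separated from the action before the tool call is executed. A minimal sketch follows, assuming the output format shown above. `split_reasoning` is a hypothetical helper name, not an API from Hermes Agent.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(assistant_text: str) -> tuple[str, str]:
    """Split an assistant turn into (reasoning, remainder).

    reasoning: concatenated contents of all <think> blocks
    remainder: the text with the <think> blocks removed
    """
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(assistant_text))
    remainder = THINK_RE.sub("", assistant_text).strip()
    return reasoning, remainder


assistant_text = (
    "<think>\nI need price data for the past month first.\n</think>\n"
    '<tool_call>{"name": "stock_data", "arguments": {"symbol": "AAPL", "period": "1mo"}}</tool_call>'
)
reasoning, action = split_reasoning(assistant_text)
print(reasoning)
print(action)
```

Keeping the reasoning separate lets an application log it for debugging while passing only the tool call on to the executor.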
Hermes 3 Series:
| Version | Base Model | Parameters | VRAM Required (approx., quantized) |
|---|---|---|---|
| Hermes 3 8B | Llama 3 8B | 8B | 6GB+ |
| Hermes 3 70B | Llama 3 70B | 70B | 40GB+ |
| Hermes 3 8B Instruct | Llama 3 8B | 8B | 6GB+ |
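The VRAM column can be sanity-checked with simple arithmetic: the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes. The sketch below is an estimate under that assumption only. KV cache, activations, and runtime overhead come on top, and real quantization formats often use slightly more than 4 bits per weight. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


for name, params in [("Hermes 3 8B", 8), ("Hermes 3 70B", 70)]:
    fp16 = weight_memory_gb(params, 16)  # full 16-bit precision
    q4 = weight_memory_gb(params, 4)     # idealized 4-bit quantization
    print(f"{name}: ~{fp16:.0f} GB at 16-bit, ~{q4:.0f} GB at 4-bit")
```

By this estimate the figures in the table correspond to 4-bit quantized weights plus some headroom, not to full-precision inference.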
Hermes 4: The Atropos RL Era (September 2024 to Early 2025)
Hermes 4 is the most technically mature version to date, based on Meta Llama 3.1 405B and trained through the Atropos reinforcement learning framework purpose-built for agent tasks.
8.3 Atropos RL: A Training Methodology Built for Agents
What Is Atropos?
Atropos is a reinforcement learning training framework independently developed by NousResearch, named after one of the three Fates in Greek mythology (Atropos cut the thread of life, symbolizing the irreversibility of decisions).
Atropos's core innovation is a multi-dimensional reward function defined for agent tasks.
Traditional RLHF vs. Atropos RL
Traditional RLHF (Reinforcement Learning from Human Feedback):
Pipeline:
Pre-trained model → SFT → Reward model training → PPO optimization
(Reward model training is driven by human comparative scoring: which is better, A or B?)
Reward signal: Single dimension (human preference score)
Primary optimization target: Conversation fluency, helpfulness, harmlessness
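The "which is better: A or B?" comparison is commonly turned into a training signal with a Bradley-Terry style pairwise loss, which pushes the reward model to score the preferred response higher. The sketch below assumes that standard formulation. The chapter does not state which loss any particular pipeline uses, and the scores in the usage lines are made-up numbers.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Uses P(chosen > rejected) = sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Reward model agrees with the human label: smaller loss.
print(pairwise_preference_loss(2.0, 0.5))
# Reward model disagrees with the human label: larger loss.
print(pairwise_preference_loss(0.5, 2.0))
```

The loss depends only on the difference between the two scores, which is why the resulting reward signal is a single scalar dimension.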
Atropos RL (Agent-specialized):
Pipeline:
Pre-trained model → SFT → Multi-dimensional reward function → PPO + agent-specific optimization
(The reward function assigns multi-dimensional rewards based on agent task completion)
Reward signal: Multi-dimensional
Primary optimization targets:
1. Tool-call parameter accuracy (are correct parameters provided on each call?)
2. Task decomposition quality (is sub-task breakdown reasonable and complete?)
3. Error recovery capability (does it correctly adjust strategy after failure?)
4. Resource utilization efficiency (minimum tool calls needed to complete task)
5. Skill distillation quality (is the extracted Skill effective on new tasks?)
6. Long-horizon consistency (is it still aligned with the original goal after 20 steps?)
Technical Details of the Atropos Reward Function
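# Atropos RL reward function (conceptual implementation)
# Step and ToolCall are minimal stand-ins assumed here so the sketch is self-contained.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    valid: bool = True  # stand-in for real schema validation

    def is_valid(self) -> bool:
        return self.valid


@dataclass
class Step:
    thought: str = ""
    action: str = ""
    tool_call: Optional[ToolCall] = None
    observation: str = ""
    error: bool = False

    def is_error(self) -> bool:
        return self.error


class AtroposRewardFunction:
    def compute_reward(self, trajectory: list[Step]) -> float:
        """
        trajectory: Complete sequence of steps for one task execution
        Each Step contains: thought, action, tool_call, observation
        """
        rewards = {
            "task_completion": self._task_completion_reward(trajectory),
            "tool_accuracy": self._tool_call_accuracy_reward(trajectory),
            "efficiency": self._efficiency_reward(trajectory),
            "error_recovery": self._error_recovery_reward(trajectory),
            "goal_alignment": self._long_horizon_alignment_reward(trajectory),
        }
        # Weighted combination (weights sum to 1.0)
        weights = {
            "task_completion": 0.35,  # Most important: was the task completed?
            "tool_accuracy": 0.25,    # Tool-calling accuracy
            "efficiency": 0.15,       # Resource efficiency
            "error_recovery": 0.15,   # Error recovery
            "goal_alignment": 0.10,   # Long-horizon consistency
        }
        return sum(rewards[k] * weights[k] for k in rewards)

    def _tool_call_accuracy_reward(self, trajectory: list[Step]) -> float:
        """Penalize tool-call parameter errors"""
        errors = sum(
            1 for step in trajectory
            if step.tool_call and not step.tool_call.is_valid()
        )
        return max(0.0, 1 - errors * 0.2)  # Each error deducts 20%

    def _error_recovery_reward(self, trajectory: list[Step]) -> float:
        """Reward successful recovery from errors"""
        # Count error steps that are immediately followed by a non-error step
        recoveries = sum(
            1 for prev, curr in zip(trajectory, trajectory[1:])
            if prev.is_error() and not curr.is_error()
        )
        return min(1.0, recoveries * 0.3)  # Each recovery adds 30%, capped at 100%

    # The remaining components are task-specific and are not defined in this chapter.
    def _task_completion_reward(self, trajectory: list[Step]) -> float:
        raise NotImplementedError

    def _efficiency_reward(self, trajectory: list[Step]) -> float:
        raise NotImplementedError

    def _long_horizon_alignment_reward(self, trajectory: list[Step]) -> float:
        raise NotImplementedError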
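As a worked example of the weighted combination, consider a hypothetical trajectory that completes its task, makes one invalid tool call, and recovers from one error. The weights are the ones listed above; the efficiency and goal-alignment scores are assumed values chosen only to illustrate the arithmetic.

```python
# Weights from the conceptual reward function above.
weights = {
    "task_completion": 0.35,
    "tool_accuracy": 0.25,
    "efficiency": 0.15,
    "error_recovery": 0.15,
    "goal_alignment": 0.10,
}

# Component scores for one hypothetical trajectory.
component_scores = {
    "task_completion": 1.0,                   # task finished
    "tool_accuracy": max(0.0, 1 - 1 * 0.2),   # one invalid call -> 0.8
    "efficiency": 0.5,                        # assumed value
    "error_recovery": min(1.0, 1 * 0.3),      # one recovery -> 0.3
    "goal_alignment": 1.0,                    # assumed value
}

total_reward = sum(component_scores[k] * weights[k] for k in weights)
print(f"total reward = {total_reward:.3f}")  # total reward = 0.770
```

Because task completion carries the largest weight, a trajectory that fails its task can earn at most 0.65 even with perfect scores on every other dimension.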
Atropos Training Scale
Hermes 4's Atropos RL training is one of the largest known specialized agent RL training runs:
- Training data: 5 million+ agent task execution trajectories
- Task types: Covering 12 domains including code, research, data analysis, content creation
- Training duration: Approximately 6 weeks on H100 GPU clusters
- Human annotation: 10,000+ tasks with human quality labeling (for reward model calibration)
8.4 Hermes's Relationship with Llama/Mistral/Qwen
The Hermes model series are specialized fine-tunes on open-source base models, not models trained from scratch. Understanding this relationship is essential for setting accurate performance expectations.
Base Model vs. Fine-tuned Model Analogy
Analogy:
Base model ≈ A college graduate with a complete education
(broad knowledge and foundational capabilities)
Hermes fine-tuning ≈ Giving that graduate 6 months of
specialized AI agent professional training
(tool use, task planning, experience summarization)
Result:
On agent tasks: Fine-tuned model significantly outperforms the original
On general knowledge: Roughly maintains base model level
On specialized domains (medical/legal etc.): Comparable to base model
How Each Base Model's Characteristics Affect Hermes
| Base Model | Hermes Version | Advantages Hermes Inherits | Impact |
|---|---|---|---|
| LLaMA 1 | Hermes 1 | Basic reasoning capability | Weaker context length (2K) |
| Mistral 7B | Hermes 2 Pro | Efficient small model, strong instruction-following | Ideal for local deployment |
| LLaMA 2 | Hermes 2 Theta | Meta's safety alignment | Some over-refusal tendency |
| LLaMA 3 | Hermes 3 | Improved context (8K/128K), stronger reasoning | Major performance leap in this round |
| LLaMA 3.1 | Hermes 4 | 405B parameters, long context (128K) production-grade | Top-tier agent capability |
| Mistral | Hermes 2 Solar | Sliding window attention, efficient long documents | Optimized for document processing tasks |
8.5 Benchmark Performance Data
Tool-Calling Benchmarks
These are the most directly relevant evaluations for agent capability:
Berkeley Function-Calling Leaderboard (BFCL), 2024 data:
| Model | Overall | Simple Calls | Complex Nested | Parallel Calls |
|---|---|---|---|---|
| Hermes 4 405B | 87.3% | 93.1% | 84.2% | 79.8% |
| GPT-4 Turbo | 83.8% | 91.5% | 79.3% | 74.1% |
| Claude 3.5 Sonnet | 85.1% | 92.4% | 81.8% | 76.4% |
| Hermes 3 70B | 78.4% | 87.3% | 73.1% | 68.2% |
| GPT-3.5 Turbo | 68.2% | 79.4% | 59.3% | 51.1% |
| Llama 3.1 70B (base) | 71.5% | 82.1% | 66.8% | 58.3% |
Note: Hermes 4 leads GPT-4 Turbo by 3.5 percentage points on the overall tool-calling score. This advantage comes primarily from complex nested and parallel call scenarios, precisely the areas that Atropos RL specifically optimized.
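The per-category gaps behind this note can be computed directly from the table. The sketch below uses only the figures reported above; the dictionary keys are shorthand labels for the table columns.

```python
# BFCL scores from the table above (percent).
hermes_4 = {"overall": 87.3, "simple": 93.1, "nested": 84.2, "parallel": 79.8}
gpt4_turbo = {"overall": 83.8, "simple": 91.5, "nested": 79.3, "parallel": 74.1}

# Gap in percentage points (pp), rounded to one decimal like the source data.
gaps = {k: round(hermes_4[k] - gpt4_turbo[k], 1) for k in hermes_4}
for category, gap in gaps.items():
    print(f"{category:9s} +{gap:.1f} pp")
```

The gap is 1.6 points on simple calls but 4.9 and 5.7 points on nested and parallel calls, which is what drives the 3.5-point overall lead.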
General Reasoning Benchmarks
| Benchmark | Hermes 4 405B | GPT-4o | Claude 3.5 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|
| MMLU | 88.2% | 87.5% | 88.7% | 87.3% |
| GSM8K (math reasoning) | 92.1% | 91.4% | 92.8% | 89.7% |
| HumanEval (code) | 78.4% | 80.1% | 81.2% | 72.3% |
| ARC-Challenge | 87.6% | 86.9% | 88.1% | 86.2% |
| HellaSwag | 91.3% | 90.8% | 91.7% | 90.5% |
Note: On general reasoning capability, Hermes 4 is broadly comparable to GPT-4o and Claude 3.5 Sonnet. This demonstrates that agent-task-specific fine-tuning has not significantly degraded general capability.
Agent Task Benchmarks (AgentBench 2024)
AgentBench is a comprehensive benchmark that evaluates agent capability across 8 task categories. The table below reports seven of them, and its Overall row is the unweighted mean of the listed rows:
| Task Type | Hermes 4 | GPT-4 Turbo | Claude 3.5 | AutoGPT(GPT-4) |
|---|---|---|---|---|
| OS tasks | 42.3% | 38.7% | 41.2% | 29.4% |
| Database queries | 56.8% | 52.1% | 54.3% | 38.7% |
| Knowledge graph | 49.2% | 43.6% | 47.8% | 31.2% |
| Web shopping | 31.4% | 28.9% | 30.7% | 21.3% |
| Web browsing | 28.6% | 26.3% | 27.9% | 18.9% |
| Gaming tasks | 47.1% | 41.8% | 45.3% | 33.1% |
| Lateral thinking | 35.7% | 31.2% | 34.1% | 22.8% |
| Overall | 41.6% | 37.5% | 40.2% | 27.9% |
These numbers illuminate an important reality: even the strongest agents fail to complete the majority of complex real-world tasks. Hermes 4's 41.6% overall score means that roughly 58.4% of tasks are not completed; agent technology remains in its early stages.
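The Overall row can be reproduced from the per-category rows, which also makes explicit what the 58.4% figure is: the complement of an unweighted mean across categories, not a count of individual tasks. The sketch uses only the Hermes 4 column from the table above.

```python
from statistics import mean

# Per-category Hermes 4 scores from the table above (percent).
hermes_4_scores = {
    "OS tasks": 42.3,
    "Database queries": 56.8,
    "Knowledge graph": 49.2,
    "Web shopping": 31.4,
    "Web browsing": 28.6,
    "Gaming tasks": 47.1,
    "Lateral thinking": 35.7,
}

overall = round(mean(list(hermes_4_scores.values())), 1)
not_completed = round(100 - overall, 1)
print(f"overall = {overall}%, not completed = {not_completed}%")
```

An unweighted mean treats every category equally, so a product concentrated on one category (for example database queries at 56.8%) should plan around that category's rate instead of the overall figure.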
8.6 Choosing the Right Hermes Version
Based on your hardware resources and use case, here is a selection guide:
Scenario 1: Limited resources (personal computer, no discrete GPU or <4GB VRAM)
→ Use cloud-hosted Hermes 4 API (NousResearch/OpenRouter)
→ Do not attempt to run 70B+ models locally
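Scenario 2: Consumer GPU with 8 to 16GB VRAM (RTX 3080/4080, etc.)
→ Run Hermes 3 8B locally (optimal choice)
→ Quantized Hermes 3 70B is possible only with partial CPU offloading: even at Q4 the weights alone take roughly 35 to 40GB, so expect slow generation and a slight accuracy loss
Scenario 3: 40GB+ VRAM (professional GPU, e.g., A100)
→ Run Hermes 3 70B locally with Q4 quantization (full FP16 precision needs roughly 140GB for the weights alone, which means multiple GPUs)
→ Hermes 4 405B requires a multi-GPU server even when quantized (roughly 200GB of weights at Q4)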
Scenario 4: Sufficient API budget, pursuing best agent performance
→ First choice: Hermes 4 405B (cloud-hosted)
→ Second choice: Claude 3.5 Sonnet (excellent tool-calling)
→ Third choice: GPT-4 Turbo
Scenario 5: Enterprise intranet deployment, data cannot leave the network
→ Hermes 3 70B local deployment (vLLM inference service)
→ Used in conjunction with Hermes Agent framework
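For the intranet scenario, vLLM exposes an OpenAI-compatible HTTP API, so a locally served Hermes model can be called with a standard chat-completions request. The sketch below only builds the request body and does not send it. The host, port, model identifier, and `build_chat_request` helper are assumptions to adapt to your deployment, and tool calling typically has to be enabled when the server is started (check the vLLM documentation for the exact options).

```python
import json

# Assumed values: adjust to the model and address your vLLM server was started with.
BASE_URL = "http://localhost:8000/v1"
MODEL = "NousResearch/Hermes-3-Llama-3.1-70B"


def build_chat_request(user_message: str, tools: list[dict]) -> dict:
    """Build an OpenAI-style chat-completions payload with tool definitions."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are Hermes, an intelligent assistant."},
            {"role": "user", "content": user_message},
        ],
        "tools": tools,
        "temperature": 0.2,
    }


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a specified city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}

payload = build_chat_request("What is the weather in Paris?", [weather_tool])
# POST this JSON to f"{BASE_URL}/chat/completions" from inside the network.
print(json.dumps(payload, indent=2))
```

Because the request never leaves `BASE_URL`, prompts and tool results stay on the internal network, which is the point of this deployment scenario.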
Chapter Summary
The full panorama of NousResearch and the Hermes model family:
- Team DNA: Emerged from the open-source community; community needs determine technical direction; Apache 2.0 licensing permits commercial use
- Four generations of evolution: Hermes 1 (proof of concept) → 2 (tool-calling breakthrough) → 3 (agent optimization maturity) → 4 (Atropos RL era)
- Atropos RL: Multi-dimensional agent reward function, optimizing for agent behavior rather than general conversation
- Fine-tuning relationship: Hermes is specialized fine-tuning on Llama/Mistral foundations, inheriting base model strengths while layering agent-specific capability
- Benchmark performance: Tool-calling leads GPT-4 Turbo; general capability matches Claude/GPT-4; but overall agent task completion rates reflect the field's early stage
Choosing Hermes models means choosing a technical path purpose-built for agent tasks, backed by an active open-source community.
Review Questions
- NousResearch grew from open-source community grassroots. What specific effects does this background have on their technical decisions? Compared to teams emerging from academic institutions or large tech companies, what are the advantages and disadvantages?
- Atropos RL's multi-dimensional reward function design is a major engineering decision. If you were designing this reward function, which dimensions would you add or remove? Which dimension do you think is most critical for improving agent capability?
- AgentBench data shows that even Hermes 4 achieves only a 41.6% completion rate on complex real-world tasks. What does this mean? Does this figure give you pause about deploying agents in production environments? How would you manage that 58.4% failure rate in a real product?
- Hermes's open strategy (fully open Apache 2.0) versus OpenAI's closed strategy will each produce different long-term competitive outcomes. Which strategy is more likely to be leading in 10 years?
Afterword: From Here Forward
Congratulations on completing all eight chapters of The Complete Guide to Hermes Agent. This book has aimed to do more than teach you how to use Hermes. It has sought to help you build a deep understanding of the rapidly evolving AI agent field:
- What Hermes is (Chapter 1)
- Why it is designed this way (Chapter 2)
- Its relationship to competitors (Chapter 3)
- Its ecosystem (Chapter 4)
- How to start using it (Chapter 5)
- How to learn efficiently (Chapter 6)
- Where it came from (Chapter 7)
- Who is building it (Chapter 8)
AI agent technology remains in its early stages, and today's 41.6% completion rate is likely to improve rapidly in the coming years. Understanding this field's history, current state, and design philosophy is the best way to maintain clear judgment amid rapid change.
Hermes Agent GitHub: https://github.com/nousresearch/hermes-agent
NousResearch HuggingFace: https://huggingface.co/NousResearch
Community Discord: https://discord.gg/nousresearch