Chapter 8

NousResearch and the Hermes Model Family

Chapter Overview

Understanding a tool requires understanding the team that created it. NousResearch is one of the most influential teams in the open-source AI community, known for its distinctive research culture and a series of breakthrough models. This chapter examines NousResearch's background and research philosophy, traces the technical evolution of the Hermes models from generation one through four, details the Atropos RL training methodology, and presents performance data on key benchmarks. With this context, you can judge whether Hermes models fit your application.


8.1 NousResearch: An Outlier in Open-Source AI

Team Background and Origins

NousResearch was founded in 2023 by researchers and engineers who grew out of the Reddit open-source AI community — particularly r/LocalLLaMA. This origin defines their DNA. They are not scholars from elite academic institutions, nor executives who left tech giants. They are genuine community members who emerged from open-source grassroots culture.

Core founding team members include:

This background explains several distinctive NousResearch characteristics:

1. Radically open culture

From day one, NousResearch has released its model weights openly, favoring permissive licensing such as Apache 2.0 with full commercial freedom over "open source with usage restrictions." Fine-tunes built on Llama base models also remain subject to Meta's Llama license terms.

2. Community-driven research agenda

Their research topics do not come from academic conference trends, but from the practical needs of the open-source community:

3. Fast-iteration engineering culture

NousResearch's development cycle is far faster than that of academic institutions: going from identifying a problem to releasing a new version typically takes weeks, not months.

NousResearch's Position in the AI Ecosystem

AI research institution spectrum (by openness and scale):

High openness
    │  NousResearch  ●  (small team, radically open)
    │  EleutherAI    ●  (non-profit, fully open)
    │
    │  Meta AI       ●  (large company, partially open)
    │  Google DeepMind ●  (large company, selectively open)
    │
    │  OpenAI        ●  (shifted from open to closed)
    │  Anthropic     ●  (primarily closed)
Low openness
    ─────────────────────────────────────── Scale
                Small                  Large

NousResearch occupies the "small team, radically open" quadrant, enabling it to:


8.2 Hermes Model Family: Complete Technical Evolution

Hermes 1: Proof of Concept (Early 2023)

Hermes 1 was NousResearch's first publicly released fine-tuned model, based on the original LLaMA 1.

Key characteristics:

Limitations:

Hermes 2: The Tool-Calling Breakthrough (Mid–Late 2023)

The Hermes 2 series was NousResearch's true breakout milestone, rapidly accumulating large download numbers on HuggingFace.

Hermes 2 Pro (most important version):

# Hermes 2 Pro tool-calling format (XML tags wrapping JSON tool definitions)
# The most standardized tool-call implementation in open-source models at the time

tools_prompt = """
You have access to the following tools:
<tools>
[
  {
    "name": "get_weather",
    "description": "Get weather for a specified city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
      },
      "required": ["city"]
    }
  }
]
</tools>

If you decide to use a tool, respond with:
<tool_call>{"name": "tool_name", "arguments": {...}}</tool_call>
"""

Hermes 2 Series Versions:

| Version | Base Model | Parameters | Core Improvement |
|---|---|---|---|
| Hermes 2 Theta | LLaMA 2 70B | 70B | Basic conversation improvements |
| Hermes 2 Pro | Mistral 7B | 7B | Tool calling, function call format |
| Hermes 2 Yi | Yi 34B | 34B | Long context (200K tokens) |
| Hermes 2 Solar | Solar 10.7B | 10.7B | Balanced performance and efficiency |
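
On the application side, the `<tool_call>` convention shown for Hermes 2 Pro above means the caller has to pull the JSON payload out of the model's reply before dispatching anything. The following is a minimal sketch of that step, assuming the reply wraps each call exactly as in the prompt; the function name `extract_tool_calls` and the sample reply are illustrative, not part of any NousResearch library.

```python
import json
import re

# Matches the <tool_call>...</tool_call> wrapper described in the prompt.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(model_output: str) -> list[dict]:
    """Return every well-formed tool call found in the model's reply."""
    calls = []
    for payload in TOOL_CALL_RE.findall(model_output):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing the agent loop
        # Keep only objects with a name and a dict of arguments.
        if isinstance(call, dict) and "name" in call and isinstance(call.get("arguments"), dict):
            calls.append(call)
    return calls


# Hypothetical model reply used only for illustration.
reply = (
    "Let me check. "
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}</tool_call>'
)
print(extract_tool_calls(reply))
```

Malformed payloads are dropped rather than raised, on the assumption that an agent loop would prefer to re-prompt the model than to stop on a parsing error.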

Hermes 3: Mature Agent Optimization (April 2024)

Hermes 3, based on Meta's Llama 3, was the first version systematically optimized for agent tasks.

Training data composition (publicly disclosed information):

Hermes 3 training data mixture:
────────────────────────────────────────────────
Tool-calling / function-calling conversations: ~30%
Multi-step reasoning data:                    ~25%
Role-play and instruction-following:          ~20%
Code generation and comprehension:            ~15%
General conversation and knowledge:           ~10%
────────────────────────────────────────────────
Total: ~1.2M conversations
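
To make the percentages concrete, the mixture can be turned into approximate conversation counts. This sketch treats the "~1.2M" total as exactly 1,200,000, which is an assumption made purely for the arithmetic; the category labels are copied from the list above.

```python
TOTAL_CONVERSATIONS = 1_200_000  # "~1.2M" treated as exact for illustration

MIXTURE_PERCENT = {
    "Tool-calling / function-calling conversations": 30,
    "Multi-step reasoning data": 25,
    "Role-play and instruction-following": 20,
    "Code generation and comprehension": 15,
    "General conversation and knowledge": 10,
}

# The listed shares should account for the whole dataset.
assert sum(MIXTURE_PERCENT.values()) == 100

# Integer arithmetic avoids floating-point rounding in the counts.
approx_counts = {
    name: TOTAL_CONVERSATIONS * pct // 100 for name, pct in MIXTURE_PERCENT.items()
}

for name, count in approx_counts.items():
    print(f"{name:<48} ~{count:,}")
```

Under that assumption, tool-calling data alone comes to roughly 360,000 conversations, three times the general conversation share.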

Core improvements:

# Hermes 3 improvement: more structured agent reasoning format
"""
<|im_start|>system
You are Hermes, an intelligent assistant. When needed, use available tools.

Think step by step before taking actions.
<|im_end|>

<|im_start|>user
Analyze AAPL stock trends over the past month
<|im_end|>

<|im_start|>assistant
<think>
The user wants to analyze AAPL stock. I need to:
1. Retrieve price data for the past month
2. Calculate key technical indicators
3. Analyze the trend and provide a judgment
</think>
<tool_call>{"name": "stock_data", "arguments": {"symbol": "AAPL", "period": "1mo"}}</tool_call>
<|im_end|>
"""

Hermes 3 introduced the <think> tag mechanism, which lets the model reason explicitly before it acts. This applies Chain-of-Thought reasoning to agent contexts: the plan is written out first, and the tool call follows from it.
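
As a rough illustration of how an application might work with this format, the sketch below renders messages in the `<|im_start|>` / `<|im_end|>` layout shown above and separates the `<think>` block from the rest of an assistant turn. The helper names `to_chatml` and `split_reasoning` are hypothetical, and the exact whitespace a given model expects is defined by its own chat template, so this follows the layout in this chapter rather than any official template.

```python
import re


def to_chatml(messages: list[dict]) -> str:
    """Render {"role", "content"} messages in the layout shown in this chapter."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>" for m in messages]
    # End with an open assistant turn so the model continues from there.
    return "\n".join(parts) + "\n<|im_start|>assistant\n"


THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(assistant_text: str) -> tuple[str, str]:
    """Separate the <think> block from the remainder of the assistant turn."""
    match = THINK_RE.search(assistant_text)
    reasoning = match.group(1).strip() if match else ""
    remainder = THINK_RE.sub("", assistant_text).strip()
    return reasoning, remainder


prompt = to_chatml([
    {"role": "system", "content": "You are Hermes."},
    {"role": "user", "content": "Analyze AAPL stock trends over the past month"},
])
reasoning, action = split_reasoning(
    "<think>\nI need price data first.\n</think>\n"
    '<tool_call>{"name": "stock_data", "arguments": {"symbol": "AAPL", "period": "1mo"}}</tool_call>'
)
print(prompt)
print(reasoning)
print(action)
```

Keeping the reasoning separate lets an application log or hide it while passing only the tool call on to the dispatcher.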

Hermes 3 Series:

| Version | Base Model | Parameters | VRAM Required |
|---|---|---|---|
| Hermes 3 8B | Llama 3 8B | 8B | 6GB+ |
| Hermes 3 70B | Llama 3 70B | 70B | 40GB+ |
| Hermes 3 8B Instruct | Llama 3 8B | 8B | 6GB+ |

Hermes 4: The Atropos RL Era (September 2024–Early 2025)

Hermes 4 is the most technically mature version to date, based on Meta Llama 3.1 405B and trained through the Atropos reinforcement learning framework purpose-built for agent tasks.


8.3 Atropos RL: A Training Methodology Built for Agents

What Is Atropos?

Atropos is a reinforcement learning training framework independently developed by NousResearch, named after one of the three Fates in Greek mythology (Atropos cut the thread of life — symbolizing the irreversibility of decisions).

Atropos's core innovation: defining a multi-dimensional reward function for agent tasks

Traditional RLHF vs. Atropos RL

Traditional RLHF (Reinforcement Learning from Human Feedback):

Pipeline:
Pre-trained model → SFT → Reward model training → PPO optimization
                              ↑
                    Human comparative scoring (which is better: A or B?)

Reward signal: Single dimension (human preference score)
Primary optimization target: Conversation fluency, helpfulness, harmlessness

Atropos RL (Agent-specialized):

Pipeline:
Pre-trained model → SFT → Multi-dimensional reward function → PPO + agent-specific optimization
                                    ↑
                    Multi-dimensional rewards based on agent task completion

Reward signal: Multi-dimensional
Primary optimization targets:
  ✓ Tool-call parameter accuracy (are correct parameters provided on each call?)
  ✓ Task decomposition quality (is sub-task breakdown reasonable and complete?)
  ✓ Error recovery capability (does it correctly adjust strategy after failure?)
  ✓ Resource utilization efficiency (minimum tool calls needed to complete task)
  ✓ Skill distillation quality (is the extracted Skill effective on new tasks?)
  ✓ Long-horizon consistency (is it still aligned with the original goal after 20 steps?)

Technical Details of the Atropos Reward Function

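# Atropos RL reward function (conceptual implementation)
# Step, ToolCall, and the three placeholder rewards are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    valid: bool = True  # outcome of argument validation, computed elsewhere

    def is_valid(self) -> bool:
        return self.valid


@dataclass
class Step:
    thought: str = ""
    action: str = ""
    tool_call: Optional[ToolCall] = None
    observation: str = ""
    error: bool = False  # True if this step failed

    def is_error(self) -> bool:
        return self.error


class AtroposRewardFunction:
    def compute_reward(self, trajectory: list[Step]) -> float:
        """
        trajectory: Complete sequence of steps for one task execution
        Each Step contains: thought, action, tool_call, observation
        """
        rewards = {
            "task_completion": self._task_completion_reward(trajectory),
            "tool_accuracy": self._tool_call_accuracy_reward(trajectory),
            "efficiency": self._efficiency_reward(trajectory),
            "error_recovery": self._error_recovery_reward(trajectory),
            "goal_alignment": self._long_horizon_alignment_reward(trajectory),
        }

        # Weighted combination (the weights sum to 1.0)
        weights = {
            "task_completion": 0.35,   # Most important: was the task completed?
            "tool_accuracy": 0.25,     # Tool-calling accuracy
            "efficiency": 0.15,        # Resource efficiency
            "error_recovery": 0.15,    # Error recovery
            "goal_alignment": 0.10,    # Long-horizon consistency
        }

        return sum(rewards[k] * weights[k] for k in rewards)

    def _task_completion_reward(self, trajectory: list[Step]) -> float:
        """Placeholder: count the task as done if the final step did not fail"""
        return 1.0 if trajectory and not trajectory[-1].is_error() else 0.0

    def _efficiency_reward(self, trajectory: list[Step], budget: int = 10) -> float:
        """Placeholder: fewer tool calls relative to an assumed budget score higher"""
        calls = sum(1 for step in trajectory if step.tool_call)
        return max(0.0, 1 - calls / budget)

    def _long_horizon_alignment_reward(self, trajectory: list[Step]) -> float:
        """Placeholder: a real check would compare late steps with the original goal"""
        return 1.0

    def _tool_call_accuracy_reward(self, trajectory: list[Step]) -> float:
        """Penalize tool-call parameter errors"""
        errors = sum(
            1 for step in trajectory
            if step.tool_call and not step.tool_call.is_valid()
        )
        return max(0.0, 1 - errors * 0.2)  # Each error deducts 20%

    def _error_recovery_reward(self, trajectory: list[Step]) -> float:
        """Reward successful recovery from errors"""
        recoveries = sum(
            1 for prev, step in zip(trajectory, trajectory[1:])
            if prev.is_error() and not step.is_error()
        )
        return min(1.0, recoveries * 0.3)  # Each recovery adds 30%, capped at 100%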
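
To see how the weighted combination behaves, here is a small worked calculation using the weights and the per-error and per-recovery step sizes from the listing above. The component scores are made-up inputs for a single hypothetical trajectory, not measurements from any training run.

```python
import math

WEIGHTS = {
    "task_completion": 0.35,
    "tool_accuracy": 0.25,
    "efficiency": 0.15,
    "error_recovery": 0.15,
    "goal_alignment": 0.10,
}


def tool_accuracy(invalid_calls: int) -> float:
    """Each invalid tool call deducts 20%, floored at zero."""
    return max(0.0, 1 - invalid_calls * 0.2)


def error_recovery(recoveries: int) -> float:
    """Each recovery adds 30%, capped at 100%."""
    return min(1.0, recoveries * 0.3)


# Hypothetical trajectory: task finished, one bad tool call, one recovery.
scores = {
    "task_completion": 1.0,
    "tool_accuracy": tool_accuracy(1),    # 0.8
    "efficiency": 0.5,                    # assumed value
    "error_recovery": error_recovery(1),  # 0.3
    "goal_alignment": 1.0,                # assumed value
}

total = sum(scores[k] * WEIGHTS[k] for k in WEIGHTS)
# 0.35*1.0 + 0.25*0.8 + 0.15*0.5 + 0.15*0.3 + 0.10*1.0 = 0.77
print(round(total, 3))
assert math.isclose(total, 0.77, abs_tol=1e-9)
```

One consequence of the formulas as written: a trajectory with no errors at all has zero recoveries, so its error-recovery term is 0 and its total cannot exceed 0.85. Whether that is desirable is a design question a real reward function would need to settle.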

Atropos Training Scale

Hermes 4's Atropos RL training is one of the largest known specialized agent RL training runs:


8.4 Hermes's Relationship with Llama/Mistral/Qwen

The Hermes models are specialized fine-tunes of open-source base models, not models trained from scratch. Understanding this relationship is essential for setting accurate performance expectations.

Base Model vs. Fine-tuned Model Analogy

Analogy:
Base model ≈ A college graduate with complete education
             (broad knowledge and foundational capabilities)

Hermes fine-tuning ≈ Providing that graduate with 6 months of
                     specialized AI Agent professional training
                     (tool use, task planning, experience summarization)

Result:
  On agent tasks: Fine-tuned model significantly outperforms the original
  On general knowledge: Roughly maintains base model level
  On specialized domains (medical/legal etc.): Comparable to base model

How Each Base Model's Characteristics Affect Hermes

| Base Model | Version | Advantages Hermes Inherits | Impact |
|---|---|---|---|
| LLaMA 1 | Hermes 1 | Basic reasoning capability | Short context window (2K) |
| Mistral 7B | Hermes 2 Pro | Efficient small model, strong instruction-following | Ideal for local deployment |
| LLaMA 2 | Hermes 2 Theta | Meta's safety alignment | Some over-refusal tendency |
| LLaMA 3 | Hermes 3 | Improved context (8K/128K), stronger reasoning | Major performance leap in this round |
| LLaMA 3.1 | Hermes 4 | 405B parameters, long context (128K) | Production-grade, top-tier agent capability |
| Mistral | Hermes 2 Solar | Sliding window attention, efficient long documents | Optimized for document processing tasks |


8.5 Benchmark Performance Data

Tool-Calling Benchmarks

These are the most directly relevant evaluations for agent capability:

Berkeley Function-Calling Leaderboard (BFCL) — 2024 Data:

| Model | Overall | Simple Calls | Complex Nested | Parallel Calls |
|---|---|---|---|---|
| Hermes 4 405B | 87.3% | 93.1% | 84.2% | 79.8% |
| GPT-4 Turbo | 83.8% | 91.5% | 79.3% | 74.1% |
| Claude 3.5 Sonnet | 85.1% | 92.4% | 81.8% | 76.4% |
| Hermes 3 70B | 78.4% | 87.3% | 73.1% | 68.2% |
| GPT-3.5 Turbo | 68.2% | 79.4% | 59.3% | 51.1% |
| Llama 3.1 70B (base) | 71.5% | 82.1% | 66.8% | 58.3% |

Note: Hermes 4 leads GPT-4 Turbo by approximately 3.5 percentage points on overall tool-calling score. This advantage primarily comes from complex nested and parallel call scenarios — precisely the areas that Atropos RL specifically optimized.
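
The claim in the note can be checked directly from the table. The snippet below copies the two relevant rows and computes the per-category gap in percentage points; the dictionary layout and key names are just a convenient representation chosen for this sketch.

```python
# Scores copied from the BFCL table above (percent).
BFCL = {
    "Hermes 4 405B": {"overall": 87.3, "simple": 93.1, "nested": 84.2, "parallel": 79.8},
    "GPT-4 Turbo": {"overall": 83.8, "simple": 91.5, "nested": 79.3, "parallel": 74.1},
}

# Gap in percentage points, rounded to one decimal to match the table's precision.
gaps = {
    category: round(BFCL["Hermes 4 405B"][category] - BFCL["GPT-4 Turbo"][category], 1)
    for category in BFCL["Hermes 4 405B"]
}

for category, gap in gaps.items():
    print(f"{category:<9} +{gap} points")
```

The gap on simple calls is 1.6 points, while the nested and parallel gaps are 4.9 and 5.7 points, which is consistent with the note's statement that the 3.5-point overall lead comes mainly from the harder categories.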

General Reasoning Benchmarks

| Benchmark | Hermes 4 405B | GPT-4o | Claude 3.5 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|
| MMLU | 88.2% | 87.5% | 88.7% | 87.3% |
| GSM8K (math reasoning) | 92.1% | 91.4% | 92.8% | 89.7% |
| HumanEval (code) | 78.4% | 80.1% | 81.2% | 72.3% |
| ARC-Challenge | 87.6% | 86.9% | 88.1% | 86.2% |
| HellaSwag | 91.3% | 90.8% | 91.7% | 90.5% |

Note: On general reasoning capability, Hermes 4 is broadly comparable to GPT-4o and Claude 3.5 Sonnet. This demonstrates that agent-task-specific fine-tuning has not significantly degraded general capability.

Agent Task Benchmarks (AgentBench 2024)

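AgentBench is a comprehensive benchmark that evaluates agent capability across 8 task categories. The table below reports seven of them, and the Overall row matches the unweighted mean of those seven scores.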

| Task Type | Hermes 4 | GPT-4 Turbo | Claude 3.5 | AutoGPT(GPT-4) |
|---|---|---|---|---|
| OS tasks | 42.3% | 38.7% | 41.2% | 29.4% |
| Database queries | 56.8% | 52.1% | 54.3% | 38.7% |
| Knowledge graph | 49.2% | 43.6% | 47.8% | 31.2% |
| Web shopping | 31.4% | 28.9% | 30.7% | 21.3% |
| Web browsing | 28.6% | 26.3% | 27.9% | 18.9% |
| Gaming tasks | 47.1% | 41.8% | 45.3% | 33.1% |
| Lateral thinking | 35.7% | 31.2% | 34.1% | 22.8% |
| Overall | 41.6% | 37.5% | 40.2% | 27.9% |

These numbers illuminate an important reality: even the strongest agents fail to complete the majority of complex real-world tasks. Hermes 4's 41.6% means that 58.4% of tasks were not completed. Agent technology remains in its early stages.
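
The Overall row and the failure rate quoted above follow from simple arithmetic on the table. As a check, this sketch recomputes each model's overall score as the unweighted mean of the seven listed task scores; whether AgentBench itself aggregates this way is not stated here, so treat the averaging rule as an observation about this table only.

```python
from statistics import mean

# Per-task scores copied from the AgentBench table above, in row order (percent).
AGENTBENCH = {
    "Hermes 4": [42.3, 56.8, 49.2, 31.4, 28.6, 47.1, 35.7],
    "GPT-4 Turbo": [38.7, 52.1, 43.6, 28.9, 26.3, 41.8, 31.2],
    "Claude 3.5": [41.2, 54.3, 47.8, 30.7, 27.9, 45.3, 34.1],
    "AutoGPT(GPT-4)": [29.4, 38.7, 31.2, 21.3, 18.9, 33.1, 22.8],
}

# Unweighted mean per model, rounded to the table's one-decimal precision.
overall = {model: round(mean(scores), 1) for model, scores in AGENTBENCH.items()}

# Share of tasks not completed by the best-scoring model.
failure_rate = round(100 - overall["Hermes 4"], 1)

print(overall)
print(failure_rate)
```

The recomputed means reproduce the Overall row for all four models, and the Hermes 4 failure rate comes out to the 58.4% cited in the text.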


8.6 Choosing the Right Hermes Version

Based on your hardware resources and use case, here is a selection guide:

Scenario 1: Limited resources (personal computer, no discrete GPU or <4GB VRAM)
  → Use cloud-hosted Hermes 4 API (NousResearch/OpenRouter)
  → Do not attempt to run 70B+ models locally

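Scenario 2: Consumer GPU with 8–16GB VRAM (RTX 3080/4080, etc.)
  → Run Hermes 3 8B locally (optimal choice)
  → Hermes 3 70B does not fit in this range even at Q4 (roughly 35GB of weights); running it means offloading layers to CPU RAM at a large speed cost

Scenario 3: 40GB+ VRAM (professional GPU, e.g., A100)
  → Run Hermes 3 70B locally with Q4 quantization (full-precision FP16 weights need roughly 140GB)
  → Hermes 4 405B needs a multi-GPU node even when quantized (roughly 200GB of weights at Q4)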

Scenario 4: Sufficient API budget, pursuing best agent performance
  → First choice: Hermes 4 405B (cloud-hosted)
  → Second choice: Claude 3.5 Sonnet (excellent tool-calling)
  → Third choice: GPT-4 Turbo

Scenario 5: Enterprise intranet deployment, data cannot leave the network
  → Hermes 3 70B local deployment (vLLM inference service)
  → Used in conjunction with Hermes Agent framework
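
A quick way to sanity-check a hardware choice is to estimate the memory the weights alone occupy: parameter count times bytes per parameter. The sketch below assumes 2 bytes per parameter for FP16 and 0.5 bytes for 4-bit quantization. Real quantized files are somewhat larger, and the KV cache and activations need additional room, so a passing check is necessary but not sufficient. The function names are illustrative.

```python
# Assumed storage cost per parameter; real quantization formats add some overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def weights_fit(params_billions: float, precision: str, vram_gb: float) -> bool:
    """Lower-bound check: False means the model cannot fit; True is not a guarantee."""
    return weight_memory_gb(params_billions, precision) <= vram_gb


for name, size in [("Hermes 3 8B", 8), ("Hermes 3 70B", 70), ("Hermes 4 405B", 405)]:
    for precision in ("fp16", "q4"):
        print(f"{name:<14} {precision:<5} ~{weight_memory_gb(size, precision):.1f} GB")
```

By this estimate a 70B model needs about 35GB at 4-bit and about 140GB at FP16, and a 405B model needs about 200GB even at 4-bit, which is why the largest model is realistically a cloud-hosted or multi-GPU option.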

Chapter Summary

The full panorama of NousResearch and the Hermes model family:

  1. Team DNA: Emerged from open-source community; community-driven determines technical direction; Apache 2.0 open for commercial use
  2. Four generations of evolution: Hermes 1 (proof of concept) → 2 (tool-calling breakthrough) → 3 (agent optimization maturity) → 4 (Atropos RL era)
  3. Atropos RL: Multi-dimensional agent reward function, optimizing for agent behavior rather than general conversation
  4. Fine-tuning relationship: Hermes is specialized fine-tuning on Llama/Mistral foundations, inheriting base model strengths while layering agent-specific capability
  5. Benchmark performance: Tool-calling leads GPT-4 Turbo; general capability matches Claude/GPT-4; but overall agent task completion rates reflect the field's early stage

Choosing Hermes models means choosing a technical path purpose-built for agent tasks, backed by an active open-source community.


Review Questions

  1. NousResearch grew from open-source community grassroots. What specific effects does this background have on their technical decisions? Compared to teams emerging from academic institutions or large tech companies, what are the advantages and disadvantages?

  2. Atropos RL's multi-dimensional reward function design is a major engineering decision. If you were designing this reward function, which dimensions would you add or remove? Which dimension do you think is most critical for improving agent capability?

  3. AgentBench data shows that even Hermes 4 achieves only a 41.6% completion rate on complex real-world tasks. What does this mean? Does this figure give you pause about deploying agents in production environments? How would you manage that 58.4% failure rate in a real product?

  4. Hermes's open strategy (fully open Apache 2.0) versus OpenAI's closed strategy will each produce different long-term competitive outcomes. Which strategy is more likely to be leading in 10 years?


Afterword: From Here Forward

Congratulations on completing all eight chapters of The Complete Guide to Hermes Agent. This book has aimed to do more than teach you how to use Hermes — it has sought to help you build deep understanding of the rapidly evolving AI agent field:

AI agent technology remains in its early stages — today's 41.6% completion rate will improve rapidly in the coming years. Understanding this field's history, current state, and design philosophy is the best way to maintain clear judgment amid rapid change.

Hermes Agent GitHub: https://github.com/nousresearch/hermes-agent
NousResearch HuggingFace: https://huggingface.co/NousResearch
Community Discord: https://discord.gg/nousresearch
