Chapter 9: Hermes Version Milestones — The Technical Evolution from 1 to 4

Each generation of Hermes represents not merely an expansion of parameter scale, but a fundamental shift in training philosophy. Understanding this evolutionary trajectory is the prerequisite for mastering Hermes 4's capability boundaries.


9.1 Version Evolution Overview

Since the Hermes series launched in 2023, NousResearch has released four major versions, each representing a qualitative leap in training data composition, alignment strategy, and target tasks. The table below summarizes key dimensions across versions:

| Dimension | Hermes 1 | Hermes 2 | Hermes 3 | Hermes 4 |
|---|---|---|---|---|
| Release | 2023 Q2 | 2023 Q4 | 2024 Q2 | 2025 Q1 |
| Base Model | LLaMA-13B | Mistral-7B / LLaMA-2-70B | LLaMA-3-8B/70B | LLaMA-3.1-405B |
| Core Positioning | General instruction following | Enhanced reasoning & roleplay | Multimodal + function calling | Native Agent trajectory training |
| Training Data Scale | ~300K dialogues | ~1M dialogues | ~3M dialogues + synthetic | ~10M Agent trajectories |
| Primary Alignment | SFT | SFT + RLHF | SFT + DPO | SFT + Atropos RL |
| Tool Calling | None | Limited | Standard FC | Native 40+ tools |
| AgentBench Score | N/A | 18.2 | 34.7 | 61.3 |

9.2 Hermes 1: Laying the Groundwork for Instruction Following (2023 Q2)

9.2.1 Origins

In spring 2023, ChatGPT's explosive growth made the open-source community acutely aware of the strategic importance of "instruction following." While LLaMA's base weights achieved high pretraining quality, they lacked natural conversational ability. Hermes 1's mission was to give LLaMA-13B practical assistant capabilities through carefully curated dialogue data.

9.2.2 Core Technical Breakthroughs

The Data Quality Revolution

Hermes 1's greatest contribution was not model scale but data philosophy. The NousResearch team applied rigorous quality filtering to GPT-4-generated conversations, retaining only about 15%. Criteria included:

# Core logic of Hermes 1 data filtering (reconstructed).
# extract_keywords() and is_chitchat() are helper functions not shown here.
def quality_filter(conversation: dict) -> bool:
    instruction = conversation["instruction"]
    response = conversation["response"]

    # Length check: very short instructions are often ambiguous
    if len(instruction.split()) < 5:
        return False

    # Keyword coverage between response and instruction
    instruction_keywords = extract_keywords(instruction)
    if not instruction_keywords:
        # Nothing to match against: reject instead of dividing by zero
        return False
    coverage = sum(1 for kw in instruction_keywords
                   if kw in response) / len(instruction_keywords)
    if coverage < 0.6:
        return False

    # Exclude pure small-talk dialogues
    if is_chitchat(instruction, response):
        return False

    return True

Training Recipe

Hermes 1 used standard SFT, training for approximately 3 epochs on 4 × A100-80G GPUs:

torchrun --nproc_per_node=4 train.py \
  --model_name_or_path meta-llama/Llama-13b-hf \
  --data_path ./hermes_v1_filtered.json \
  --bf16 True \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-5 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03
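
The flags above imply an effective batch size and a step count that are worth working out explicitly. The sketch below does that arithmetic. It assumes that the ~300K dialogues from the overview table is the size of the filtered training set, and that steps are counted the way most trainers count them (one optimizer step per effective batch). The function name `sft_schedule` is illustrative only.

```python
import math

def sft_schedule(num_examples: int, num_gpus: int, per_device_batch: int,
                 grad_accum: int, epochs: int, warmup_ratio: float) -> dict:
    """Approximate optimizer-step counts implied by a data-parallel SFT run."""
    # Sequences consumed per optimizer step across all GPUs
    effective_batch = num_gpus * per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    # Warmup covers the first warmup_ratio fraction of all steps
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Values taken from the command above; 300_000 examples is an assumption
plan = sft_schedule(300_000, num_gpus=4, per_device_batch=4,
                    grad_accum=8, epochs=3, warmup_ratio=0.03)
print(plan)
```

Under these assumptions each optimizer step sees 128 sequences, so the cosine schedule spans roughly seven thousand steps, with warmup taking up only the first couple of hundred.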

9.2.3 Benchmark Results

| Benchmark | LLaMA-13B (base) | Hermes 1 | Relative improvement |
|---|---|---|---|
| MT-Bench (score out of 10) | 4.2 | 6.8 | +61.9% |
| AlpacaEval | 38.4% | 67.2% | +75.0% |
| MMLU | 46.9% | 52.3% | +11.5% |
| HumanEval | 15.8% | 24.1% | +52.5% |

9.2.4 Limitations

Hermes 1 was fundamentally still imitation learning — it learned to mimic GPT-4's style without genuine reasoning chain training. Performance on multi-step math and complex logic tasks was notably weak.


9.3 Hermes 2: Dual Breakthroughs in Reasoning and Roleplay (2023 Q4)

9.3.1 Strategic Shift

The second half of 2023 witnessed the "7B awakening" — Mistral-7B's release proved that small models could achieve surprising capability density. Hermes 2 pivoted to Mistral-7B as its primary base while also releasing a LLaMA-2-70B variant.

More importantly, Hermes 2 introduced roleplay data and chain-of-thought (CoT) data, forming a dual-track training strategy.

9.3.2 Core Innovation: The ChatML Format

One of Hermes 2's underappreciated contributions was popularizing the ChatML format, which later became the de facto open-source standard:

<|im_start|>system
You are a helpful AI assistant named Hermes.
<|im_end|>
<|im_start|>user
Please explain the basic principles of quantum entanglement.
<|im_end|>
<|im_start|>assistant
Quantum entanglement is one of the most mysterious phenomena in quantum mechanics...
<|im_end|>

Advantages of this format:

  1. Clear role boundaries preventing role confusion
  2. Natural extension to multi-turn conversations
  3. Broad support in major inference frameworks (vLLM, llama.cpp)
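
To make the role boundaries concrete, here is a minimal sketch of rendering a message list into a ChatML prompt. It follows the common convention in which `<|im_end|>` directly follows the message content. The exact whitespace for a given checkpoint should come from its tokenizer's chat template, so treat the helper `to_chatml` as an illustration rather than the canonical formatter.

```python
def to_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render role/content messages as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful AI assistant named Hermes."},
    {"role": "user", "content": "Please explain the basic principles of quantum entanglement."},
])
print(prompt)
```

Multi-turn conversations need no special handling: each earlier assistant reply is just another message in the list.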

9.3.3 The Reasoning Leap

Hermes 2 incorporated large volumes of Chain-of-Thought training data, including:

# Hermes 2 CoT data example (math reasoning)
example = {
    "instruction": "A factory produces 500 parts per day, 3% of which are defective. "
                   "If a quality inspector checks 200 parts per day, "
                   "how many days on average to find 5 defective parts?",
    "response": """Let me work through this step by step:

**Step 1: Calculate defective parts produced per day**
- 500 parts/day × 3% = 15 defective parts/day

**Step 2: Calculate defectives found per inspection**
- Inspector checks 200/500 = 40% of parts
- Expected defectives found = 15 × 40% = 6 per day

**Step 3: Days to find 5 defective parts**
- 5 ÷ 6 ≈ 0.83 days

**Answer: approximately 0.83 days (less than one full day)**""",
    "source": "hermes2_math_cot"
}

9.3.4 Benchmark Results

| Benchmark | Hermes 1 | Hermes 2 (7B) | Hermes 2 (70B) |
|---|---|---|---|
| MT-Bench | 6.8 | 7.4 | 8.1 |
| GSM8K | 32.1% | 63.4% | 81.2% |
| HumanEval | 24.1% | 42.3% | 67.8% |
| MMLU | 52.3% | 64.1% | 76.4% |

9.4 Hermes 3: The Function Calling Era (2024 Q2)

9.4.1 Context

Two trends reshaped the open-source LLM landscape in early 2024:

  1. LLaMA-3 release: Meta delivered the strongest open-source base models to date
  2. Function calling standardization: OpenAI's Function Calling API became the industry standard

Hermes 3, built on LLaMA-3-8B/70B, tackled both fronts simultaneously.

9.4.2 Native Function Calling

Hermes 3 implemented a complete Structured Function Calling system:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a specified city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city"]
            }
        }
    }
]
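
A tool schema only describes what can be called; the host application still has to route the model's call to real code. The sketch below shows one simple way to do that. The registry, the stub `get_weather` implementation, and the wire format (a JSON object with `name` and `arguments` keys, mirroring the OpenAI style) are all assumptions for illustration, not a documented Hermes 3 interface.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Hypothetical stub: a real implementation would query a weather service
    return {"city": city, "unit": unit, "temperature": None}

# Maps tool names from the schema to Python callables
REGISTRY = {"get_weather": get_weather}

def dispatch(raw_call: str) -> dict:
    """Parse a JSON tool call emitted by the model and run the matching function."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return REGISTRY[name](**call["arguments"])

result = dispatch('{"name": "get_weather", "arguments": {"city": "Beijing"}}')
print(result)
```

The result would then be serialized and passed back to the model as a tool message so it can compose the final answer.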

9.4.3 DPO Replacing PPO

Hermes 3 completely abandoned PPO in favor of DPO (Direct Preference Optimization):

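Why abandon PPO? PPO-based RLHF needs a separately trained reward model and an on-policy sampling loop, whereas DPO optimizes the policy directly on preference pairs with a simple classification-style loss against a frozen reference model: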

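import torch
import torch.nn.functional as F

def dpo_loss(model, ref_model, chosen, rejected, beta=0.1):
    # log_prob() is assumed to return one summed sequence log-probability per example
    log_prob_chosen = model.log_prob(chosen)
    log_prob_rejected = model.log_prob(rejected)
    # The reference model is frozen, so it needs no gradients
    with torch.no_grad():
        ref_log_prob_chosen = ref_model.log_prob(chosen)
        ref_log_prob_rejected = ref_model.log_prob(rejected)

    # Implicit rewards: scaled log-ratio of policy to reference
    chosen_reward = beta * (log_prob_chosen - ref_log_prob_chosen)
    rejected_reward = beta * (log_prob_rejected - ref_log_prob_rejected)

    # logsigmoid is numerically safer than log(sigmoid(x)) for large negative margins
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()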
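
A toy calculation helps build intuition for how this loss behaves. The sketch below evaluates the same formula for a single preference pair using plain floats. The log-probability values are made up for illustration, and `dpo_loss_single` is a hypothetical helper rather than part of any training library.

```python
import math

def dpo_loss_single(lp_chosen: float, lp_rejected: float,
                    ref_chosen: float, ref_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    margin = beta * ((lp_chosen - ref_chosen) - (lp_rejected - ref_rejected))
    # -log(sigmoid(m)) written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the margin is zero and the loss is log(2)
baseline = dpo_loss_single(-12.0, -12.0, -12.0, -12.0)

# Policy raises the chosen response and lowers the rejected one: loss decreases
improved = dpo_loss_single(-10.0, -15.0, -12.0, -12.0)

# Policy prefers the rejected response: loss increases
worse = dpo_loss_single(-15.0, -10.0, -12.0, -12.0)

print(baseline, improved, worse)
```

The loss depends only on how far the policy has moved relative to the reference, which is what keeps training anchored to the starting model.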

9.4.4 Key Benchmarks

| Benchmark | Hermes 2 (70B) | Hermes 3 (8B) | Hermes 3 (70B) |
|---|---|---|---|
| MT-Bench | 8.1 | 8.0 | 8.9 |
| BFCL (Function Calling) | N/A | 72.4% | 84.1% |
| AgentBench | 18.2 | 28.3 | 34.7 |
| MMLU | 76.4% | 73.1% | 82.6% |

9.5 Hermes 4: The First Native Agent Trajectory Model (2025 Q1)

9.5.1 A Paradigm Revolution

Hermes 4 is the qualitative leap of the entire series. Previous generations trained on "dialogue fragments" — single Q&A exchanges or short conversation sequences. Hermes 4 is the first model trained primarily on complete Agent trajectories.

What is an Agent trajectory?

Trajectory = Task Description + [Obs₁→Act₁→Obs₂→Act₂→...→Obsₙ→Final Response]

A complete trajectory contains:
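
As a rough illustration of the formula above, a trajectory can be modeled as a small data structure: a task description, a sequence of observation/action pairs, a last observation, and the final response. The class and field names below are assumptions made for this sketch, not the actual Hermes 4 training schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str  # what the agent saw (Obs_i)
    action: str       # what the agent did in response (Act_i)

@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)
    final_observation: str = ""  # Obs_n, which precedes the final response
    final_response: str = ""

    def render(self) -> str:
        """Flatten the trajectory into one training-style text record."""
        lines = [f"Task: {self.task}"]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"Obs{i}: {step.observation}")
            lines.append(f"Act{i}: {step.action}")
        lines.append(f"Obs{len(self.steps) + 1}: {self.final_observation}")
        lines.append(f"Final: {self.final_response}")
        return "\n".join(lines)

traj = Trajectory(
    task="Report the row count of the sales CSV",
    steps=[
        Step("Workspace opened, no file path given", "shell_exec: ls -la"),
        Step("sales_data_2024.csv  report.xlsx",
             "python_exec: pd.read_csv('sales_data_2024.csv')"),
    ],
    final_observation="Successfully read 1842 rows",
    final_response="The file contains 1842 rows.",
)
print(traj.render())
```

The key contrast with a dialogue sample is that the whole record, including intermediate observations, is one training example.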

9.5.2 The Atropos RL Framework

Hermes 4's training introduces Atropos — a reinforcement learning framework designed specifically for Agent scenarios:

┌──────────────────────────────────────────────────┐
│              Atropos Training Loop               │
│                                                  │
│  ┌──────────┐  trajectory  ┌────────────────┐    │
│  │  Agent   │─────────────→│  Environment   │    │
│  │ (model)  │              │ (tools/sandbox)│    │
│  └──────────┘              └───────┬────────┘    │
│       ↑                            │ observations│
│       │ policy update              ↓             │
│  ┌────┴───────┐  scores  ┌────────────────┐      │
│  │RL Optimizer│←─────────│  Judge LLM     │      │
│  │ (PPO/GRPO) │          │  (evaluator)   │      │
│  └────────────┘          └────────────────┘      │
└──────────────────────────────────────────────────┘
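
The diagram names PPO/GRPO as optimizer options without detailing either. As a generic illustration of the GRPO idea (this is not the Atropos API, which this chapter does not show), the sketch below turns judge scores for a group of trajectories sampled from the same task into group-relative advantages. The function name and the `eps` parameter are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(scores: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize judge scores within one group of rollouts for the same task."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # eps avoids division by zero when every rollout receives the same score
    return [(s - mu) / (sigma + eps) for s in scores]

# Four rollouts of one task, scored by the judge in [0, 1] (made-up values)
advantages = group_relative_advantages([0.9, 0.4, 0.4, 0.1])
print(advantages)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; a group in which all rollouts score the same contributes no learning signal.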

9.5.3 Training Data Composition

Hermes 4's ~10M training samples break down as follows:

| Data Type | Proportion | Source |
|---|---|---|
| Agent trajectories (successful) | 45% | Atropos auto-collection |
| Agent trajectories (failed+corrected) | 20% | Contrastive learning data |
| Tool-calling dialogues | 15% | Synthetic generation |
| General instruction following | 12% | Curated from prior versions |
| Code generation | 8% | GitHub + synthetic |
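
Converting the proportions into absolute counts makes the mix easier to reason about. The sketch below does so for a nominal total of 10 million samples; the round total and the dictionary keys are assumptions for illustration.

```python
MIX_PERCENT = {
    "agent_trajectories_successful": 45,
    "agent_trajectories_failed_corrected": 20,
    "tool_calling_dialogues": 15,
    "general_instruction_following": 12,
    "code_generation": 8,
}

def mix_counts(total: int, mix: dict[str, int]) -> dict[str, int]:
    """Split a sample budget according to integer percentages."""
    if sum(mix.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix.items()}

counts = mix_counts(10_000_000, MIX_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under this reading, trajectories of both kinds account for about 6.5 million samples, and the failed-then-corrected subset alone is larger than the general instruction-following portion.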

9.5.4 Self-Repair Capability

Hermes 4 was trained on extensive "error-fix" trajectories, giving it remarkable self-correction ability:

[Tool Call] python_exec: "import pandas as pd; df = pd.read_csv('data.csv')"
[Tool Result] Error: FileNotFoundError: data.csv not found

[Hermes 4 Self-Correction]
<think>File not in current directory, need to list files first</think>
[Tool Call] shell_exec: "ls -la"
[Tool Result] sales_data_2024.csv  report.xlsx  ...

[Tool Call] python_exec: "df = pd.read_csv('sales_data_2024.csv')"
[Tool Result] Successfully read 1842 rows
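
On the host side, this behavior only works if tool errors are returned to the model as observations instead of aborting the run. The sketch below shows a minimal loop with that property. The scripted policy stands in for the model, and the tool names, the in-memory file table, and the action format are all hypothetical.

```python
FILES = {"sales_data_2024.csv": 1842}  # stand-in for a real workspace

def read_csv(name: str) -> str:
    if name not in FILES:
        raise FileNotFoundError(f"{name} not found")
    return f"Successfully read {FILES[name]} rows"

def list_files(_: str) -> str:
    return "  ".join(FILES)

TOOLS = {"read_csv": read_csv, "list_files": list_files}

def scripted_policy(history: list[tuple]) -> dict:
    """Stand-in for the model: picks the next action from the last observation."""
    if not history:
        return {"type": "tool", "tool": "read_csv", "input": "data.csv"}
    last_tool, _, last_obs = history[-1]
    if last_obs.startswith("Error"):
        return {"type": "tool", "tool": "list_files", "input": "."}
    if last_tool == "list_files":
        return {"type": "tool", "tool": "read_csv", "input": last_obs.split()[0]}
    return {"type": "final", "content": last_obs}

def run_agent(policy, tools, max_steps: int = 5):
    """Tool loop in which exceptions become observations, not crashes."""
    history = []
    for _ in range(max_steps):
        action = policy(history)
        if action["type"] == "final":
            return action["content"], history
        try:
            observation = tools[action["tool"]](action["input"])
        except Exception as exc:
            observation = f"Error: {type(exc).__name__}: {exc}"
        history.append((action["tool"], action["input"], observation))
    return None, history  # step budget exhausted

answer, history = run_agent(scripted_policy, TOOLS)
print(answer)
```

The `max_steps` cap matters as much as the error handling: without it, a model that keeps retrying the same failing call would loop indefinitely.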

9.5.5 Benchmark Results

| Benchmark | Hermes 3 (70B) | Hermes 4 (405B) | GPT-4o |
|---|---|---|---|
| AgentBench | 34.7 | 61.3 | 68.4 |
| GAIA (Level 1) | 42.1% | 71.8% | 73.2% |
| Terminal-Bench | 31.2% | 58.9% | 62.1% |
| HumanEval | 82.6% | 91.4% | 90.2% |
| BFCL | 84.1% | 92.7% | 91.3% |

9.6 Migration Considerations

9.6.1 Hermes 2/3 → Hermes 4

Prompt Format Changes

# Hermes 3 system prompt style
hermes3_system = "You are Hermes, a helpful AI assistant."

# Hermes 4 recommended (explicit Agent framing)
hermes4_system = """You are Hermes, an autonomous AI agent with access to tools.
You think step-by-step before acting, use tools when needed, 
and verify your results before responding."""

Tool Call Format Upgrade

Hermes 4 enforces stricter JSON Schema validation:

# Hermes 3 (relaxed)
tool_call_h3 = "<tool>get_weather(city='Beijing')</tool>"

# Hermes 4 (strict JSON)
tool_call_h4 = {
    "name": "get_weather",
    "parameters": {"city": "Beijing", "unit": "celsius"}
}
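
When migrating, it can help to check tool calls on the host side before executing them. The sketch below is a deliberately small validator covering only the schema features used in this chapter's `get_weather` example (`required`, string types, and `enum`). It is not a full JSON Schema implementation, and rejecting unknown parameters is a strictness choice made here, not something the text specifies.

```python
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

def validate_call(call: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call passed."""
    errors = []
    params = call.get("parameters", {})
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in params:
            errors.append(f"missing required parameter: {name}")
    for name, value in params.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")
            continue
        if spec.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} not in {spec['enum']}")
    return errors

good = {"name": "get_weather", "parameters": {"city": "Beijing", "unit": "celsius"}}
bad = {"name": "get_weather", "parameters": {"unit": "kelvin"}}
print(validate_call(good, WEATHER_SCHEMA))
print(validate_call(bad, WEATHER_SCHEMA))
```

Returning the error list to the model as an observation, rather than raising, gives it a chance to repair the call on the next step.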

9.6.2 Scenario Recommendations

| Scenario | Recommended | Reason |
|---|---|---|
| Resource-constrained (<16GB VRAM) | Hermes 3 (8B) | Best performance/resource ratio |
| Single-turn Q&A tasks | Hermes 3 (70B) | Lower cost |
| Complex Agent tasks | Hermes 4 (405B) | Qualitative Agent improvement |
| Local deploy + Agent | Hermes 4 (quantized) | Only local Agent-class option |
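
A quick back-of-the-envelope estimate helps when weighing these options. The sketch below computes the memory needed for model weights alone at a given precision. It ignores the KV cache, activations, and runtime overhead, so real requirements are higher. The function name is illustrative, and 1 GB is taken as 10^9 bytes.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for weights only, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

for label, params, bits in [
    ("8B @ 16-bit", 8, 16),
    ("8B @ 4-bit", 8, 4),
    ("70B @ 16-bit", 70, 16),
    ("405B @ 4-bit", 405, 4),
]:
    print(f"{label}: {weight_memory_gb(params, bits):.1f} GB")
```

By this estimate an 8B model only fits under 16 GB once quantized, and a 405B model still needs roughly 200 GB for weights at 4-bit, so "local" deployment of the largest model in practice means a multi-GPU server rather than a single workstation card.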

9.6.3 Migration Checklist

□ Does system prompt include Agent-related description?
□ Do tool definitions conform to JSON Schema spec?
□ Context window expanded from 8K to 32K+?
□ Chain-of-thought (<think> tags) enabled?
□ Error handling adapted for multi-step retries?
□ Timeout settings allow sufficient time for long trajectories?

9.7 The Deeper Logic of Evolution

Looking back at four generations of Hermes, a clear philosophical arc emerges:

Hermes 1: Learning to "speak"    (instruction following)
    ↓
Hermes 2: Learning to "think"    (multi-step reasoning)
    ↓
Hermes 3: Learning to "use tools" (function calling)
    ↓
Hermes 4: Learning to "complete tasks" (Agent trajectories)

Each breakthrough represents a fundamental shift in training data philosophy. Hermes 4's milestone significance: it is the first open-source model to define intelligence through task completion trajectories rather than conversation fragments.


Chapter Summary

Discussion Questions

  1. Hermes 4's training data includes 20% "failure trajectories" — why are error cases so important for Agent training?
  2. What core problem does the progression SFT → RLHF → DPO → Atropos RL reflect? What does each method solve, and what new problems does it introduce?
  3. If you needed to choose between Hermes 3 (70B) and a Hermes 4 build quantized to a similar memory footprint, how would you evaluate their capability boundaries?
  4. What is the fundamental difference between "training on trajectories" versus "training on dialogues"? How does this affect the model's generalization ability?