Chapter 9: Hermes Version Milestones — The Technical Evolution from 1 to 4

Each generation of Hermes represents not merely an expansion of parameter scale, but a fundamental shift in training philosophy. Understanding this evolutionary trajectory is the prerequisite for mastering Hermes 4's capability boundaries.

9.1 Version Evolution Overview

Since the Hermes series launched in 2023, NousResearch has released four major versions, each representing a qualitative leap in training data composition, alignment strategy, and target tasks. The table below summarizes key dimensions across versions:

Dimension	Hermes 1	Hermes 2	Hermes 3	Hermes 4
Release	2023 Q2	2023 Q4	2024 Q2	2025 Q1
Base Model	LLaMA-13B	Mistral-7B / LLaMA-2-70B	LLaMA-3-8B/70B	LLaMA-3.1-405B
Core Positioning	General instruction following	Enhanced reasoning & roleplay	Multimodal + function calling	Native Agent trajectory training
Training Data Scale	~300K dialogues	~1M dialogues	~3M dialogues + synthetic	~10M Agent trajectories
Primary Alignment	SFT	SFT + RLHF	SFT + DPO	SFT + Atropos RL
Tool Calling	None	Limited	Standard FC	Native 40+ tools
AgentBench Score	N/A	18.2	34.7	61.3

9.2 Hermes 1: Laying the Groundwork for Instruction Following (2023 Q2)

9.2.1 Origins

In spring 2023, ChatGPT's explosive growth made the open-source community acutely aware of the strategic importance of "instruction following." While LLaMA's base weights achieved high pretraining quality, they lacked natural conversational ability. Hermes 1's mission was to give LLaMA-13B practical assistant capabilities through carefully curated dialogue data.

9.2.2 Core Technical Breakthroughs

The Data Quality Revolution

Hermes 1's greatest contribution was not model scale but data philosophy. The NousResearch team applied rigorous quality filtering to GPT-4-generated conversations, retaining only about 15%. Criteria included:

Instruction clarity: Is the instruction explicit and actionable?
Response fidelity: Does the response fully address the instruction?
Reasoning visibility: Are explicit reasoning steps present?
Format consistency: Is Markdown used appropriately?

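# Core logic of Hermes 1 data filtering (reconstructed).
# extract_keywords and is_chitchat are simple stand-ins so the sketch runs;
# the helpers actually used are not documented here.
import re

STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "what", "how"}
GREETING = re.compile(r"^(hi|hello|hey|thanks|thank you)\b", re.IGNORECASE)

def extract_keywords(text: str) -> set[str]:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}

def is_chitchat(instruction: str, response: str) -> bool:
    return bool(GREETING.match(instruction.strip()))

def quality_filter(conversation: dict) -> bool:
    instruction = conversation["instruction"]
    response = conversation["response"]

    # Length check: very short instructions are often ambiguous
    if len(instruction.split()) < 5:
        return False

    # Keyword coverage between response and instruction (case-insensitive)
    instruction_keywords = extract_keywords(instruction)
    if not instruction_keywords:
        return False  # nothing to match against; also avoids division by zero
    response_lower = response.lower()
    coverage = sum(1 for kw in instruction_keywords
                   if kw in response_lower) / len(instruction_keywords)
    if coverage < 0.6:
        return False

    # Exclude pure small-talk dialogues
    if is_chitchat(instruction, response):
        return False

    return True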

Training Recipe

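Hermes 1 used standard SFT, trained for approximately 3 epochs on 4 × A100-80GB GPUs. With a per-device batch of 4 and 8 gradient-accumulation steps, the effective batch size is 4 × 4 × 8 = 128 sequences per optimizer step: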

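# Illustrative command: train.py and the model path are placeholders for your
# own SFT script and LLaMA-13B checkpoint.
torchrun --nproc_per_node=4 train.py \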
  --model_name_or_path meta-llama/Llama-13b-hf \
  --data_path ./hermes_v1_filtered.json \
  --bf16 True \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-5 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03
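
To make the schedule implied by these flags concrete, the sketch below derives the step counts and the learning rate at a few points. It assumes roughly 300K training dialogues (the figure from the overview table) and a linear-warmup, cosine-decay-to-zero schedule; the actual sample count and scheduler details of the original run are not documented here.

```python
import math

# Values taken from the command above
NUM_GPUS = 4
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8
EPOCHS = 3
PEAK_LR = 2e-5
WARMUP_RATIO = 0.03

# Assumption: ~300K training dialogues, as quoted in the overview table
NUM_SAMPLES = 300_000

effective_batch = NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"total optimizer steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the run is short: a few thousand optimizer steps, with the warmup phase covering only the first 3% of them.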

9.2.3 Benchmark Results

Benchmark	LLaMA-13B (base)	Hermes 1	Relative Improvement
MT-Bench (score, 1-10)	4.2	6.8	+61.9%
AlpacaEval (win rate)	38.4%	67.2%	+75.0%
MMLU	46.9%	52.3%	+11.5%
HumanEval	15.8%	24.1%	+52.5%
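
The improvement figures in the table are relative changes, not absolute point differences, which is why MMLU's 5.4-point gain appears as +11.5%. A short check that recomputes them from the two score columns:

```python
def relative_improvement(base: float, tuned: float) -> float:
    """Relative change from base to tuned, in percent."""
    return (tuned - base) / base * 100.0


# (base, Hermes 1) pairs copied from the table above
scores = {
    "MT-Bench": (4.2, 6.8),
    "AlpacaEval": (38.4, 67.2),
    "MMLU": (46.9, 52.3),
    "HumanEval": (15.8, 24.1),
}

for name, (base, tuned) in scores.items():
    print(f"{name}: +{relative_improvement(base, tuned):.1f}%")
```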

9.2.4 Limitations

Hermes 1 was fundamentally still imitation learning — it learned to mimic GPT-4's style without genuine reasoning chain training. Performance on multi-step math and complex logic tasks was notably weak.

9.3 Hermes 2: Dual Breakthroughs in Reasoning and Roleplay (2023 Q4)

9.3.1 Strategic Shift

The second half of 2023 witnessed the "7B awakening" — Mistral-7B's release proved that small models could achieve surprising capability density. Hermes 2 pivoted to Mistral-7B as its primary base while also releasing a LLaMA-2-70B variant.

More importantly, Hermes 2 introduced roleplay data and chain-of-thought (CoT) data, forming a dual-track training strategy.

9.3.2 Core Innovation: The ChatML Format

One of Hermes 2's underappreciated contributions was popularizing the ChatML format (originally introduced by OpenAI) among open-source models, where it later became a de facto standard:

<|im_start|>system
You are a helpful AI assistant named Hermes.
<|im_end|>
<|im_start|>user
Please explain the basic principles of quantum entanglement.
<|im_end|>
<|im_start|>assistant
Quantum entanglement is one of the most mysterious phenomena in quantum mechanics...
<|im_end|>

Advantages of this format:

Clear role boundaries preventing role confusion
Natural extension to multi-turn conversations
Broad support across major inference frameworks (vLLM, llama.cpp)
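
As an illustration of how little machinery the format needs, here is a minimal renderer that turns a list of role/content messages into a ChatML prompt. The function name to_chatml is hypothetical. The sketch places the end marker directly after the content, as common chat templates do, whereas the listing above shows it on its own line for readability. In practice the tokenizer's own chat template should be preferred.

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render role/content messages as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful AI assistant named Hermes."},
    {"role": "user", "content": "Please explain the basic principles of quantum entanglement."},
]
prompt = to_chatml(messages)
print(prompt)
```

Multi-turn conversations need no special handling: each additional turn is one more start/end pair appended to the same string.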

9.3.3 The Reasoning Leap

Hermes 2 incorporated large volumes of Chain-of-Thought training data, including:

Mathematical reasoning: Enhanced step-by-step solutions from MetaMath and GSM8K
Code reasoning: Programming solutions with progressive annotations
Logical reasoning: Structured deductive/inductive reasoning examples

# Hermes 2 CoT data example (math reasoning)
example = {
    "instruction": "A factory produces 500 parts per day, 3% of which are defective. "
                   "If a quality inspector checks 200 parts per day, "
                   "how many days on average to find 5 defective parts?",
    "response": """Let me work through this step by step:

**Step 1: Calculate defective rate**
- 500 parts/day × 3% = 15 defective parts/day

**Step 2: Calculate defectives found per inspection**
- Inspector checks 200/500 = 40% of parts
- Expected defectives found = 15 × 40% = 6 per day

**Step 3: Days to find 5 defective parts**
- 5 ÷ 6 ≈ 0.83 days

**Answer: approximately 0.83 days (less than one full day)**""",
    "source": "hermes2_math_cot"
}
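
The arithmetic in this sample can be checked mechanically, which is one way CoT data might be screened before training. The sketch below recomputes the answer by two routes: expected defectives found per day, and expected number of parts inspected until the fifth defective. Treating the inspected parts as a random 3%-defective sample is an assumption of the example itself.

```python
PARTS_INSPECTED_PER_DAY = 200
DEFECT_RATE = 0.03
TARGET_DEFECTS = 5

# Route 1: expected defectives found per day, then days to reach the target
expected_found_per_day = PARTS_INSPECTED_PER_DAY * DEFECT_RATE
days_route_1 = TARGET_DEFECTS / expected_found_per_day

# Route 2: expected parts inspected until the fifth defective, then convert to days
expected_parts_needed = TARGET_DEFECTS / DEFECT_RATE
days_route_2 = expected_parts_needed / PARTS_INSPECTED_PER_DAY

print(f"route 1: {days_route_1:.2f} days")
print(f"route 2: {days_route_2:.2f} days")
```

Both routes agree with the 0.83 days stated in the sample's final answer.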

9.3.4 Benchmark Results

Benchmark	Hermes 1	Hermes 2 (7B)	Hermes 2 (70B)
MT-Bench	6.8	7.4	8.1
GSM8K	32.1%	63.4%	81.2%
HumanEval	24.1%	42.3%	67.8%
MMLU	52.3%	64.1%	76.4%

9.4 Hermes 3: The Function Calling Era (2024 Q2)

9.4.1 Context

Two trends reshaped the open-source LLM landscape in early 2024:

LLaMA-3 release: Meta delivered the strongest open-source base models to date
Function calling standardization: OpenAI's Function Calling API became the industry standard

Hermes 3, built on LLaMA-3-8B/70B, tackled both fronts simultaneously.

9.4.2 Native Function Calling

Hermes 3 supported structured function calling, with each tool declared in the OpenAI-style JSON Schema format:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a specified city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city"]
            }
        }
    }
]
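
A definition like this is only useful if the arguments the model emits are checked against it. The sketch below is a deliberately small validator covering just the keywords used above (required, type, enum). It is not a full JSON Schema implementation, and the function name validate_arguments is hypothetical.

```python
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a specified city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# Mapping from JSON Schema type names to Python types (subset)
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    schema = tool["function"]["parameters"]
    properties = schema.get("properties", {})
    errors = []
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unknown argument: {name}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
    return errors


print(validate_arguments(WEATHER_TOOL, {"city": "Beijing", "unit": "celsius"}))
print(validate_arguments(WEATHER_TOOL, {"unit": "kelvin"}))
```

Returning a list of problems rather than raising on the first one makes it easy to feed all errors back to the model in a single retry message.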

9.4.3 DPO Replacing PPO

Hermes 3 completely abandoned PPO in favor of DPO (Direct Preference Optimization):

Why abandon PPO?

Requires a separate reward model, increasing training cost
Hyperparameter tuning is complex and prone to reward hacking
DPO optimizes directly on preference data for greater stability

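import torch
import torch.nn.functional as F

def dpo_loss(model, ref_model, chosen, rejected, beta=0.1):
    # model.log_prob is assumed to return the summed log-probability of each
    # response given its prompt, as a tensor of shape (batch,)
    log_prob_chosen = model.log_prob(chosen)
    log_prob_rejected = model.log_prob(rejected)
    # The reference model is frozen, so no gradients are needed
    with torch.no_grad():
        ref_log_prob_chosen = ref_model.log_prob(chosen)
        ref_log_prob_rejected = ref_model.log_prob(rejected)

    # Implicit rewards: scaled log-ratio of policy to reference
    chosen_reward = beta * (log_prob_chosen - ref_log_prob_chosen)
    rejected_reward = beta * (log_prob_rejected - ref_log_prob_rejected)

    # logsigmoid is numerically safer than log(sigmoid(x)) for large margins
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()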
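
To build intuition for how this loss behaves, the following scalar version uses only the standard library and plugs in made-up log-probabilities. The numbers are illustrative, not measurements; the function names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss_scalar(policy_chosen: float, policy_rejected: float,
                    ref_chosen: float, ref_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO loss for a single preference pair, given summed log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -log_sigmoid(margin)


# Policy identical to the reference: margin is 0, so the loss is log(2)
baseline = dpo_loss_scalar(-12.0, -13.0, -12.0, -13.0)

# Policy has shifted probability toward the chosen response: loss drops
improved = dpo_loss_scalar(-10.0, -15.0, -12.0, -13.0)

# Policy has shifted toward the rejected response: loss rises
worse = dpo_loss_scalar(-14.0, -11.0, -12.0, -13.0)

print(f"baseline={baseline:.4f} improved={improved:.4f} worse={worse:.4f}")
```

The loss depends only on how far the policy has moved relative to the reference on each response, which is why no separate reward model is needed.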

9.4.4 Key Benchmarks

Benchmark	Hermes 2 (70B)	Hermes 3 (8B)	Hermes 3 (70B)
MT-Bench	8.1	8.0	8.9
BFCL (Function Calling)	N/A	72.4%	84.1%
AgentBench	18.2	28.3	34.7
MMLU	76.4%	73.1%	82.6%

9.5 Hermes 4: The First Native Agent Trajectory Model (2025 Q1)

9.5.1 A Paradigm Revolution

Hermes 4 marks the largest shift in the series. Previous generations trained on "dialogue fragments": single Q&A exchanges or short conversation sequences. Hermes 4 is the first Hermes model trained primarily on complete Agent trajectories.

What is an Agent trajectory?

Trajectory = Task Description + [Obs₁→Act₁→Obs₂→Act₂→...→Obsₙ→Final Response]

A complete trajectory contains:

Initial task (possibly vague or complex)
Multiple tool calls and their results
Intermediate reasoning steps
Errors and self-corrections
Final answer
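
One way to picture such a trajectory as a training record is the data structure below. The class and field names are hypothetical and chosen only to mirror the components listed above; the actual storage format is not described in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One observation and the action taken in response to it."""
    observation: str
    action: str
    observation_is_error: bool = False


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)
    final_response: str = ""

    def error_count(self) -> int:
        return sum(step.observation_is_error for step in self.steps)

    def recovered(self) -> bool:
        """True if at least one error occurred and a final response was still produced."""
        return self.error_count() > 0 and bool(self.final_response)


traj = Trajectory(
    task="Summarize the sales CSV in the working directory",
    steps=[
        Step("task received", "python_exec: read data.csv"),
        Step("FileNotFoundError: data.csv", "shell_exec: ls -la", observation_is_error=True),
        Step("sales_data_2024.csv  report.xlsx", "python_exec: read sales_data_2024.csv"),
    ],
    final_response="The file contains 1842 rows of sales records.",
)
print(len(traj.steps), traj.error_count(), traj.recovered())
```

Keeping the error step inside the record, rather than filtering it out, is what lets a model learn the correction that follows it.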

9.5.2 The Atropos RL Framework

Hermes 4's training introduces Atropos — a reinforcement learning framework designed specifically for Agent scenarios:

┌──────────────────────────────────────────────────────┐
│                Atropos Training Loop                 │
│                                                      │
│  ┌──────────────┐ trajectory   ┌─────────────────┐   │
│  │    Agent     │─────────────→│   Environment   │   │
│  │   (model)    │              │ (tools/sandbox) │   │
│  └──────────────┘              └────────┬────────┘   │
│         ↑                  observations │            │
│         │ policy update                 ↓            │
│  ┌──────┴───────┐    scores    ┌─────────────────┐   │
│  │ RL Optimizer │←─────────────│    Judge LLM    │   │
│  │  (PPO/GRPO)  │              │   (evaluator)   │   │
│  └──────────────┘              └─────────────────┘   │
└──────────────────────────────────────────────────────┘
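
The step from judge scores to a policy update can be illustrated with the group-relative normalization used by GRPO-style optimizers: several trajectories are sampled for the same task, and each score is compared with the group's mean. This is a generic sketch of that idea, not Atropos's actual API, and whether Atropos normalizes in exactly this way is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize judge scores within one group of trajectories for the same task."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # eps keeps the division defined when every trajectory gets the same score
    return [(score - mu) / (sigma + eps) for score in scores]


# Hypothetical judge scores for four trajectories sampled on one task
judge_scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(judge_scores)

for score, advantage in zip(judge_scores, advantages):
    print(f"score={score:.1f} advantage={advantage:+.2f}")
```

Trajectories that beat the group average receive a positive advantage and are reinforced; those below it are discouraged, without needing a learned value function.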

9.5.3 Training Data Composition

Hermes 4's ~10M training samples break down as follows:

Data Type	Proportion	Source
Agent trajectories (successful)	45%	Atropos auto-collection
Agent trajectories (failed+corrected)	20%	Contrastive learning data
Tool-calling dialogues	15%	Synthetic generation
General instruction following	12%	Curated from prior versions
Code generation	8%	GitHub + synthetic
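
Converting the proportions into approximate sample counts makes the mix easier to reason about. The sketch assumes the total is exactly 10M, whereas the chapter gives only an approximate figure.

```python
TOTAL_SAMPLES = 10_000_000  # assumption: the "~10M" figure taken as exact

# Percentages copied from the table above
MIX_PERCENT = {
    "Agent trajectories (successful)": 45,
    "Agent trajectories (failed+corrected)": 20,
    "Tool-calling dialogues": 15,
    "General instruction following": 12,
    "Code generation": 8,
}

# The proportions should cover the whole dataset
assert sum(MIX_PERCENT.values()) == 100

counts = {name: TOTAL_SAMPLES * pct // 100 for name, pct in MIX_PERCENT.items()}
trajectory_share = sum(
    pct for name, pct in MIX_PERCENT.items() if name.startswith("Agent trajectories")
)

for name, count in counts.items():
    print(f"{name:<40} {count:>10,}")
print(f"Agent trajectories overall: {trajectory_share}%")
```

Together, the two trajectory categories make up roughly two thirds of the data, which is what "trained primarily on Agent trajectories" means in practice.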

9.5.4 Self-Repair Capability

Hermes 4 was trained on extensive "error-fix" trajectories, so it can recover from a failed tool call mid-task instead of stopping at the first error:

[Tool Call] python_exec: "import pandas as pd; df = pd.read_csv('data.csv')"
[Tool Result] Error: FileNotFoundError: data.csv not found

[Hermes 4 Self-Correction]
<think>File not in current directory, need to list files first</think>
[Tool Call] shell_exec: "ls -la"
[Tool Result] sales_data_2024.csv  report.xlsx  ...

[Tool Call] python_exec: "df = pd.read_csv('sales_data_2024.csv')"
[Tool Result] Successfully read 1842 rows
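
The same observe-then-correct pattern can be written as ordinary control flow, which is useful when building a harness around the model or generating synthetic "error-fix" data. The sketch below reproduces the shape of the exchange above with the standard library only. The function names and the file contents are made up for the example.

```python
import csv
import tempfile
from pathlib import Path


def count_rows(path: Path) -> int:
    """Count data rows in a CSV file, excluding the header."""
    with path.open(newline="") as handle:
        return max(0, sum(1 for _ in csv.reader(handle)) - 1)


def load_with_repair(directory: Path, guessed_name: str) -> tuple[str, int]:
    """Try the guessed file first; on failure, list the directory and retry."""
    try:
        return guessed_name, count_rows(directory / guessed_name)
    except FileNotFoundError:
        # Observation step: look at what is actually there
        candidates = sorted(p for p in directory.iterdir() if p.suffix == ".csv")
        if not candidates:
            raise
        # Correction step: retry with the first CSV found
        return candidates[0].name, count_rows(candidates[0])


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "sales_data_2024.csv").write_text("region,revenue\nnorth,10\nsouth,12\n")
    (workdir / "report.xlsx").write_text("")
    name, rows = load_with_repair(workdir, "data.csv")
    print(name, rows)
```

A hard-coded fallback like this handles one failure mode; the point of training on trajectories is that the model learns to choose the diagnostic step itself.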

9.5.5 Benchmark Results

Benchmark	Hermes 3 (70B)	Hermes 4 (405B)	GPT-4o
AgentBench	34.7	61.3	68.4
GAIA (Level 1)	42.1%	71.8%	73.2%
Terminal-Bench	31.2%	58.9%	62.1%
HumanEval	82.6%	91.4%	90.2%
BFCL	84.1%	92.7%	91.3%

9.6 Migration Considerations

9.6.1 Hermes 2/3 → Hermes 4

Prompt Format Changes

# Hermes 3 system prompt style
hermes3_system = "You are Hermes, a helpful AI assistant."

# Hermes 4 recommended (explicit Agent framing)
hermes4_system = """You are Hermes, an autonomous AI agent with access to tools.
You think step-by-step before acting, use tools when needed, 
and verify your results before responding."""

Tool Call Format Upgrade

Hermes 4 enforces stricter JSON Schema validation:

# Hermes 3 (relaxed)
tool_call_h3 = "<tool>get_weather(city='Beijing')</tool>"

# Hermes 4 (strict JSON)
tool_call_h4 = {
    "name": "get_weather",
    "parameters": {"city": "Beijing", "unit": "celsius"}
}
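
If existing logs or prompts contain calls in the relaxed style, they can be converted mechanically. The sketch below parses the function-call string with Python's ast module and emits the strict dictionary form. Both formats are taken from the listing above, the function name legacy_to_strict is hypothetical, and only keyword arguments with literal values are supported.

```python
import ast
import re

LEGACY_CALL = re.compile(r"^<tool>(.*)</tool>$", re.DOTALL)


def legacy_to_strict(tool_call: str) -> dict:
    """Convert "<tool>name(key='value')</tool>" into {"name": ..., "parameters": {...}}."""
    match = LEGACY_CALL.match(tool_call.strip())
    if match is None:
        raise ValueError("not a <tool>...</tool> call")

    # Parse the inner text as a Python expression without executing it
    node = ast.parse(match.group(1).strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a simple function call")
    if node.args or any(kw.arg is None for kw in node.keywords):
        raise ValueError("only keyword arguments can be mapped to parameter names")

    # literal_eval accepts AST nodes and rejects anything that is not a literal
    parameters = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return {"name": node.func.id, "parameters": parameters}


print(legacy_to_strict("<tool>get_weather(city='Beijing')</tool>"))
```

Optional parameters such as unit are not filled in by the converter; defaults belong in the tool implementation or in a later validation step.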

9.6.2 Scenario Recommendations

Scenario	Recommended	Reason
Resource-constrained (<16GB VRAM)	Hermes 3 (8B)	Best performance/resource ratio
Single-turn Q&A tasks	Hermes 3 (70B)	Lower cost
Complex Agent tasks	Hermes 4 (405B)	Qualitative Agent improvement
Local deploy + Agent	Hermes 4 (quantized)	Only local Agent-class option; note that 4-bit weights for 405B parameters still need roughly 200 GB of memory

9.6.3 Migration Checklist

□ Does system prompt include Agent-related description?
□ Do tool definitions conform to JSON Schema spec?
□ Context window expanded from 8K to 32K+?
□ Chain-of-thought (<think> tags) enabled?
□ Error handling adapted for multi-step retries?
□ Timeout settings allow sufficient time for long trajectories?
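
Most of these items can be turned into an automated pre-flight check. The sketch below covers five of the six (tool-definition conformance needs a schema validator and is left out). AgentConfig and its field names are hypothetical, and the 300-second timeout floor is an arbitrary placeholder rather than a recommendation from this chapter.

```python
from dataclasses import dataclass


@dataclass
class AgentConfig:
    """Hypothetical deployment settings; field names are illustrative."""
    system_prompt: str
    context_window: int
    thinking_enabled: bool
    max_retries: int
    request_timeout_s: float


MIN_CONTEXT = 32_768    # "32K+" from the checklist
MIN_TIMEOUT_S = 300.0   # placeholder value, tune for your workloads


def migration_issues(cfg: AgentConfig) -> list[str]:
    """Return the checklist items this configuration does not yet satisfy."""
    issues = []
    if "agent" not in cfg.system_prompt.lower():
        issues.append("system prompt has no Agent-related description")
    if cfg.context_window < MIN_CONTEXT:
        issues.append(f"context window below {MIN_CONTEXT} tokens")
    if not cfg.thinking_enabled:
        issues.append("<think> reasoning is disabled")
    if cfg.max_retries < 1:
        issues.append("no retries configured for multi-step errors")
    if cfg.request_timeout_s < MIN_TIMEOUT_S:
        issues.append("timeout may be too short for long trajectories")
    return issues


legacy = AgentConfig("You are Hermes, a helpful AI assistant.", 8_192, False, 0, 30.0)
for issue in migration_issues(legacy):
    print("-", issue)
```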

9.7 The Deeper Logic of Evolution

Looking back at four generations of Hermes, a clear philosophical arc emerges:

Hermes 1: Learning to "speak"    (instruction following)
    ↓
Hermes 2: Learning to "think"    (multi-step reasoning)
    ↓
Hermes 3: Learning to "use tools" (function calling)
    ↓
Hermes 4: Learning to "complete tasks" (Agent trajectories)

Each breakthrough represents a fundamental shift in training data philosophy. Hermes 4's milestone significance: it is the first open-source model to define intelligence through task completion trajectories rather than conversation fragments.

Chapter Summary

Hermes 1 established a data-quality-first training philosophy
Hermes 2 introduced CoT and roleplay data, achieving a reasoning leap at the 7B scale, and popularized the ChatML format, which became a de facto open-source standard
Hermes 3 standardized function calling; DPO replaced PPO as the mainstream alignment method
Hermes 4 is the first open-source model trained primarily on Agent trajectories, reaching 61.3 on AgentBench against GPT-4o's 68.4
Migration requires particular attention to prompt format, tool call specifications, and context window changes

Discussion Questions

Hermes 4's training data includes 20% "failure trajectories" — why are error cases so important for Agent training?
What core problem does the progression SFT → RLHF → DPO → Atropos RL reflect? What does each method solve, and what new problems does it introduce?
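If you needed to choose between Hermes 3 (70B) and a Hermes 4 (405B) build quantized to fit a comparable memory budget, how would you evaluate their capability boundaries? (Quantization lowers the precision of the weights, not the parameter count.)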
What is the fundamental difference between "training on trajectories" versus "training on dialogues"? How does this affect the model's generalization ability?
