Chapter 9: Hermes Version Milestones — The Technical Evolution from 1 to 4
Each generation of Hermes represents not merely an expansion of parameter scale, but a fundamental shift in training philosophy. Understanding this evolutionary trajectory is the prerequisite for mastering Hermes 4's capability boundaries.
9.1 Version Evolution Overview
Since the Hermes series launched in 2023, NousResearch has released four major versions, each representing a qualitative leap in training data composition, alignment strategy, and target tasks. The table below summarizes key dimensions across versions:
| Dimension | Hermes 1 | Hermes 2 | Hermes 3 | Hermes 4 |
|---|---|---|---|---|
| Release | 2023 Q2 | 2023 Q4 | 2024 Q2 | 2025 Q1 |
| Base Model | LLaMA-13B | Mistral-7B / LLaMA-2-70B | LLaMA-3-8B/70B | LLaMA-3.1-405B |
| Core Positioning | General instruction following | Enhanced reasoning & roleplay | Multimodal + function calling | Native Agent trajectory training |
| Training Data Scale | ~300K dialogues | ~1M dialogues | ~3M dialogues + synthetic | ~10M samples (65% Agent trajectories) |
| Primary Alignment | SFT | SFT + RLHF | SFT + DPO | SFT + Atropos RL |
| Tool Calling | None | Limited | Standard FC | Native 40+ tools |
| AgentBench Score | N/A | 18.2 | 34.7 | 61.3 |
9.2 Hermes 1: Laying the Groundwork for Instruction Following (2023 Q2)
9.2.1 Origins
In spring 2023, ChatGPT's explosive growth made the open-source community acutely aware of the strategic importance of "instruction following." While LLaMA's base weights achieved high pretraining quality, they lacked natural conversational ability. Hermes 1's mission was to give LLaMA-13B practical assistant capabilities through carefully curated dialogue data.
9.2.2 Core Technical Breakthroughs
The Data Quality Revolution
Hermes 1's greatest contribution was not model scale but data philosophy. The NousResearch team applied rigorous quality filtering to GPT-4-generated conversations, retaining only about 15% of them. The filtering criteria included:
- Instruction clarity: Is the instruction explicit and actionable?
- Response fidelity: Does the response fully address the instruction?
- Reasoning visibility: Are explicit reasoning steps present?
- Format consistency: Is Markdown used appropriately?
# Core logic of Hermes 1 data filtering (reconstructed)
# extract_keywords() and is_chitchat() are helper functions not shown in the text.
def quality_filter(conversation: dict) -> bool:
    instruction = conversation["instruction"]
    response = conversation["response"]

    # Length check: very short instructions are often ambiguous
    if len(instruction.split()) < 5:
        return False

    # Keyword coverage between response and instruction
    instruction_keywords = extract_keywords(instruction)
    if not instruction_keywords:
        # Coverage is undefined without keywords; reject instead of dividing by zero
        return False
    response_lower = response.lower()  # match keywords case-insensitively
    coverage = sum(1 for kw in instruction_keywords
                   if kw.lower() in response_lower) / len(instruction_keywords)
    if coverage < 0.6:
        return False

    # Exclude pure small-talk dialogues
    if is_chitchat(instruction, response):
        return False

    return True
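The filter above relies on two helpers, `extract_keywords` and `is_chitchat`, that the text does not define. The following is a minimal sketch of what they might look like, assuming simple stopword and greeting heuristics. The stopword list, the greeting pattern, and the 20-word reply threshold are illustrative assumptions, not the team's actual implementation.

```python
import re

# Hypothetical stopword list; a real pipeline would use a much fuller one.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "or",
             "is", "are", "be", "how", "what", "why", "do", "does", "you",
             "please", "me", "my", "it", "this", "that", "with"}

# Hypothetical small-talk openers.
CHITCHAT_PATTERN = re.compile(
    r"^\s*(hi|hello|hey|thanks|thank you|good (morning|evening|night))\b",
    re.IGNORECASE,
)


def extract_keywords(text: str) -> list[str]:
    """Return unique lowercase content words in order of first appearance."""
    seen: dict[str, None] = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word not in STOPWORDS and len(word) > 2:
            seen.setdefault(word, None)
    return list(seen)


def is_chitchat(instruction: str, response: str) -> bool:
    """Flag exchanges that open with a greeting and receive a very short reply."""
    return bool(CHITCHAT_PATTERN.match(instruction)) and len(response.split()) < 20
```

Heuristics this simple will misclassify some samples, so a sketch like this is best treated as a first-pass filter in front of a stronger classifier or manual review.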
Training Recipe
Hermes 1 used standard SFT trained for approximately 3 epochs on 4 × A100-80G GPUs:
torchrun --nproc_per_node=4 train.py \
    --model_name_or_path meta-llama/Llama-13b-hf \
    --data_path ./hermes_v1_filtered.json \
    --bf16 True \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03
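The command fixes the per-device batch size and the accumulation steps but not the effective batch size, which is the number that interacts with the learning rate. A short sketch of the arithmetic, assuming plain data parallelism across the four GPUs (the function name is illustrative):

```python
def effective_batch_size(num_gpus: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Sequences that contribute to each optimizer step under data parallelism."""
    return num_gpus * per_device_batch * grad_accum_steps


# Values taken from the torchrun command above.
print(effective_batch_size(num_gpus=4, per_device_batch=4, grad_accum_steps=8))  # 128
```

If the GPU count changes, adjusting `--gradient_accumulation_steps` to keep this product constant is one way to keep the rest of the recipe comparable.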
9.2.3 Benchmark Results
| Benchmark | LLaMA-13B (base) | Hermes 1 | Relative Improvement |
|---|---|---|---|
| MT-Bench | 4.2 | 6.8 | +61.9% |
| AlpacaEval | 38.4% | 67.2% | +75.0% |
| MMLU | 46.9% | 52.3% | +11.5% |
| HumanEval | 15.8% | 24.1% | +52.5% |
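The last column of the table reports relative change, not percentage-point difference, which is easy to misread for the benchmarks that are themselves percentages. A small sketch that reproduces the column from the two score columns (the helper name is illustrative):

```python
def relative_improvement(base: float, new: float) -> float:
    """Relative change from base to new, in percent, rounded to one decimal."""
    return round((new - base) / base * 100, 1)


scores = {
    "MT-Bench": (4.2, 6.8),
    "AlpacaEval": (38.4, 67.2),
    "MMLU": (46.9, 52.3),
    "HumanEval": (15.8, 24.1),
}
for name, (base, new) in scores.items():
    print(f"{name}: +{relative_improvement(base, new)}%")
```

For MMLU, for instance, the relative gain of 11.5% corresponds to an absolute gain of 5.4 points.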
9.2.4 Limitations
Hermes 1 was fundamentally still imitation learning — it learned to mimic GPT-4's style without genuine reasoning chain training. Performance on multi-step math and complex logic tasks was notably weak.
9.3 Hermes 2: Dual Breakthroughs in Reasoning and Roleplay (2023 Q4)
9.3.1 Strategic Shift
The second half of 2023 witnessed the "7B awakening" — Mistral-7B's release proved that small models could achieve surprising capability density. Hermes 2 pivoted to Mistral-7B as its primary base while also releasing a LLaMA-2-70B variant.
More importantly, Hermes 2 introduced roleplay data and chain-of-thought (CoT) data, forming a dual-track training strategy.
9.3.2 Core Innovation: The ChatML Format
One of Hermes 2's underappreciated contributions was popularizing the ChatML format, which later became the de facto open-source standard:
<|im_start|>system
You are a helpful AI assistant named Hermes.
<|im_end|>
<|im_start|>user
Please explain the basic principles of quantum entanglement.
<|im_end|>
<|im_start|>assistant
Quantum entanglement is one of the most mysterious phenomena in quantum mechanics...
<|im_end|>
Advantages of this format:
- Clear role boundaries preventing role confusion
- Natural extension to multi-turn conversations
- Broad support in major inference frameworks (vLLM, llama.cpp)
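To make the role boundaries concrete, here is a minimal sketch of rendering a message list into ChatML. The function name and the `add_generation_prompt` flag are illustrative, and the sketch places `<|im_end|>` immediately after the message content, which is an assumption; in practice the chat template shipped with a model's tokenizer is the authoritative layout.

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render {'role', 'content'} messages as a ChatML string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml(
    [
        {"role": "system", "content": "You are a helpful AI assistant named Hermes."},
        {"role": "user", "content": "Please explain the basic principles of quantum entanglement."},
    ],
    add_generation_prompt=True,
)
print(prompt)
```

Because every turn is wrapped the same way, extending to multi-turn conversations only requires appending more messages to the list.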
9.3.3 The Reasoning Leap
Hermes 2 incorporated large volumes of Chain-of-Thought training data, including:
- Mathematical reasoning: Enhanced step-by-step solutions from MetaMath and GSM8K
- Code reasoning: Programming solutions with progressive annotations
- Logical reasoning: Structured deductive/inductive reasoning examples
# Hermes 2 CoT data example (math reasoning)
example = {
    "instruction": "A factory produces 500 parts per day, 3% of which are defective. "
                   "If a quality inspector checks 200 parts per day, "
                   "how many days on average to find 5 defective parts?",
    "response": """Let me work through this step by step:
**Step 1: Calculate defective parts per day**
- 500 parts/day × 3% = 15 defective parts/day
**Step 2: Calculate defectives found per day of inspection**
- Inspector checks 200/500 = 40% of parts
- Expected defectives found = 15 × 40% = 6 per day
**Step 3: Days to find 5 defective parts**
- 5 ÷ 6 ≈ 0.83 days
**Answer: approximately 0.83 days (less than one full day)**""",
    "source": "hermes2_math_cot"
}
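A CoT sample is only useful as training data if its arithmetic holds, so it can be worth recomputing the answer programmatically. A quick sketch that follows the same three steps as the sample above, assuming defects are spread uniformly across the day's output (the function name is illustrative):

```python
def expected_days_to_find(parts_per_day: int, defect_rate: float,
                          inspected_per_day: int, target_defects: int) -> float:
    """Expected days until the inspector has seen target_defects defective parts."""
    defective_per_day = parts_per_day * defect_rate          # step 1
    inspected_fraction = inspected_per_day / parts_per_day   # step 2a
    found_per_day = defective_per_day * inspected_fraction   # step 2b
    return target_defects / found_per_day                    # step 3


days = expected_days_to_find(500, 0.03, 200, 5)
print(f"{days:.2f}")  # 0.83
```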
9.3.4 Benchmark Results
| Benchmark | Hermes 1 | Hermes 2 (7B) | Hermes 2 (70B) |
|---|---|---|---|
| MT-Bench | 6.8 | 7.4 | 8.1 |
| GSM8K | 32.1% | 63.4% | 81.2% |
| HumanEval | 24.1% | 42.3% | 67.8% |
| MMLU | 52.3% | 64.1% | 76.4% |
9.4 Hermes 3: The Function Calling Era (2024 Q2)
9.4.1 Context
Two trends reshaped the open-source LLM landscape in early 2024:
- LLaMA-3 release: Meta delivered the strongest open-source base models to date
- Function calling standardization: OpenAI's Function Calling API became the industry standard
Hermes 3, built on LLaMA-3-8B/70B, tackled both fronts simultaneously.
9.4.2 Native Function Calling
Hermes 3 implemented a complete Structured Function Calling system:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a specified city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city"]
            }
        }
    }
]
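A tool definition like this is only half of the loop: the caller still has to check what the model emits before executing anything. The sketch below validates a JSON tool call against the definition by hand, covering the tool name, required parameters, unknown parameters, and enum values. It assumes the call arrives as a JSON object with `name` and `arguments` keys, which is an assumption about the wire format; a production system would more likely use a full JSON Schema validator.

```python
import json

# Compact copy of the weather tool definition so the example runs on its own.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]


def validate_call(raw: str, tools: list[dict]) -> tuple[str, dict]:
    """Parse a JSON tool call and check it against the matching tool definition."""
    call = json.loads(raw)
    specs = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    name = call.get("name")
    if name not in specs:
        raise ValueError(f"unknown tool: {name!r}")
    schema = specs[name]
    args = call.get("arguments", {})
    missing = [key for key in schema.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    for key, value in args.items():
        prop = schema["properties"].get(key)
        if prop is None:
            raise ValueError(f"unexpected parameter: {key!r}")
        if "enum" in prop and value not in prop["enum"]:
            raise ValueError(f"{key}={value!r} is not one of {prop['enum']}")
    return name, args


print(validate_call('{"name": "get_weather", "arguments": {"city": "Beijing"}}', tools))
```

Rejecting a malformed call with a descriptive error also gives the model something concrete to correct on the next turn.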
9.4.3 DPO Replacing PPO
Hermes 3 dropped PPO entirely in favor of DPO (Direct Preference Optimization).
Why abandon PPO?
- PPO requires a separate reward model, increasing training cost
- PPO's hyperparameter tuning is complex, and the policy is prone to reward hacking
- DPO optimizes directly on preference data, which tends to be more stable
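import torch
import torch.nn.functional as F

# model.log_prob() is assumed to return the summed token log-probability per sequence
def dpo_loss(model, ref_model, chosen, rejected, beta=0.1):
    log_prob_chosen = model.log_prob(chosen)
    log_prob_rejected = model.log_prob(rejected)
    # The reference model is frozen, so it needs no gradients
    with torch.no_grad():
        ref_log_prob_chosen = ref_model.log_prob(chosen)
        ref_log_prob_rejected = ref_model.log_prob(rejected)
    chosen_reward = beta * (log_prob_chosen - ref_log_prob_chosen)
    rejected_reward = beta * (log_prob_rejected - ref_log_prob_rejected)
    # logsigmoid is numerically more stable than log(sigmoid(x))
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()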
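For reference, the function above corresponds to the standard DPO objective. Here $x$ is the prompt, $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ matches the `beta` argument:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

The two $\beta \log$ ratio terms are the `chosen_reward` and `rejected_reward` variables in the code, and the expectation becomes the batch mean.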
9.4.4 Key Benchmarks
| Benchmark | Hermes 2 (70B) | Hermes 3 (8B) | Hermes 3 (70B) |
|---|---|---|---|
| MT-Bench | 8.1 | 8.0 | 8.9 |
| BFCL (Function Calling) | N/A | 72.4% | 84.1% |
| AgentBench | 18.2 | 28.3 | 34.7 |
| MMLU | 76.4% | 73.1% | 82.6% |
9.5 Hermes 4: The First Native Agent Trajectory Model (2025 Q1)
9.5.1 A Paradigm Revolution
Hermes 4 is the qualitative leap of the entire series. Previous generations trained on "dialogue fragments" — single Q&A exchanges or short conversation sequences. Hermes 4 is the first model trained primarily on complete Agent trajectories.
What is an Agent trajectory?
Trajectory = Task Description + [Obs₁→Act₁→Obs₂→Act₂→...→Obsₙ→Final Response]
A complete trajectory contains:
- Initial task (possibly vague or complex)
- Multiple tool calls and their results
- Intermediate reasoning steps
- Errors and self-corrections
- Final answer
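One way to picture such a trajectory as a training record is a small data structure that mirrors the formula above. This is a sketch for illustration only: the class and field names are assumptions, not the schema used to train Hermes 4.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    observation: str        # what the agent saw (task context, tool result, error text)
    action: str             # the tool call or reasoning the agent produced in response
    is_error: bool = False  # True when the observation reported a failure


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)
    final_observation: str = ""
    final_response: str = ""

    def has_self_correction(self) -> bool:
        """True when the agent kept acting after at least one error observation."""
        return any(step.is_error for step in self.steps)


trajectory = Trajectory(
    task="Summarize the sales data",
    steps=[
        Step("Task received", "python_exec: read data.csv"),
        Step("Error: FileNotFoundError", "shell_exec: ls -la", is_error=True),
        Step("sales_data_2024.csv report.xlsx", "python_exec: read sales_data_2024.csv"),
    ],
    final_observation="Successfully read 1842 rows",
    final_response="The file contains 1842 rows of sales records.",
)
print(len(trajectory.steps), trajectory.has_self_correction())  # 3 True
```

Keeping the error flag on each step makes it straightforward to separate clean trajectories from those that include a recovery.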
9.5.2 The Atropos RL Framework
Hermes 4's training introduces Atropos — a reinforcement learning framework designed specifically for Agent scenarios:
┌──────────────────────────────────────────────────┐
│ Atropos Training Loop │
│ │
│ ┌──────────┐ trajectory ┌────────────────┐ │
│ │ Agent │─────────────→│ Environment │ │
│ │ (model) │ │ (tools/sandbox)│ │
│ └──────────┘ └───────┬────────┘ │
│ ↑ │ observations│
│ │ policy update ↓ │
│ ┌────┴───────┐ scores ┌────────────────┐ │
│ │ RL Optimizer│←─────────│ Judge LLM │ │
│ │ (PPO/GRPO) │ │ (evaluator) │ │
│ └────────────┘ └────────────────┘ │
└──────────────────────────────────────────────────┘
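The diagram names PPO/GRPO as optimizer options but does not show how judge scores turn into a learning signal. In GRPO-style training, a common approach is to sample several trajectories for the same task and normalize their scores within that group. The sketch below illustrates only that normalization step; it is not the Atropos API, and the function name and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize judge scores within one group of trajectories for the same task."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # eps avoids division by zero when every trajectory received the same score.
    return [(score - mu) / (sigma + eps) for score in scores]


# Four trajectories sampled for one task, scored by the judge between 0 and 1.
advantages = group_relative_advantages([0.9, 0.4, 0.4, 0.1])
print([round(a, 2) for a in advantages])
```

Trajectories that beat the group average receive positive advantages and are reinforced; those below it are discouraged, without the need for a separately trained value model.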
9.5.3 Training Data Composition
Hermes 4's ~10M training samples break down as follows:
| Data Type | Proportion | Source |
|---|---|---|
| Agent trajectories (successful) | 45% | Atropos auto-collection |
| Agent trajectories (failed+corrected) | 20% | Contrastive learning data |
| Tool-calling dialogues | 15% | Synthetic generation |
| General instruction following | 12% | Curated from prior versions |
| Code generation | 8% | GitHub + synthetic |
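The proportions translate into approximate sample counts as follows. This sketch takes the "~10M" total at face value, so the outputs are estimates rather than exact dataset sizes.

```python
TOTAL_SAMPLES = 10_000_000  # "~10M" from the text; approximate

# Percentages from the table above.
MIX = {
    "Agent trajectories (successful)": 45,
    "Agent trajectories (failed+corrected)": 20,
    "Tool-calling dialogues": 15,
    "General instruction following": 12,
    "Code generation": 8,
}
assert sum(MIX.values()) == 100  # the proportions should cover the whole dataset

counts = {name: TOTAL_SAMPLES * pct // 100 for name, pct in MIX.items()}
for name, count in counts.items():
    print(f"{name}: ~{count:,}")
```

Taken together, the two trajectory categories account for roughly 6.5M of the 10M samples.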
9.5.4 Self-Repair Capability
Hermes 4 was trained on extensive "error-fix" trajectories, giving it remarkable self-correction ability:
[Tool Call] python_exec: "import pandas as pd; df = pd.read_csv('data.csv')"
[Tool Result] Error: FileNotFoundError: data.csv not found
[Hermes 4 Self-Correction]
<think>File not in current directory, need to list files first</think>
[Tool Call] shell_exec: "ls -la"
[Tool Result] sales_data_2024.csv report.xlsx ...
[Tool Call] python_exec: "df = pd.read_csv('sales_data_2024.csv')"
[Tool Result] Successfully read 1842 rows
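On the application side, self-correction only works if tool errors are fed back to the model as observations instead of aborting the run. The sketch below shows that loop with stub tools and a scripted policy standing in for the model; every name in it is hypothetical.

```python
from typing import Callable, Optional

ToolCall = tuple[str, str]  # (tool name, argument)
FILES = {"sales_data_2024.csv", "report.xlsx"}  # stub file system


def read_csv(path: str) -> str:
    if path not in FILES:
        raise FileNotFoundError(f"{path} not found")
    return f"Read {path}"


def list_files(_: str) -> str:
    return " ".join(sorted(FILES, reverse=True))


def run_episode(policy: Callable[[str], Optional[ToolCall]],
                tools: dict[str, Callable[[str], str]],
                task: str, max_steps: int = 5) -> list[tuple[ToolCall, str]]:
    """Run tool calls until the policy stops, feeding errors back as observations."""
    history = []
    observation = task
    for _ in range(max_steps):
        call = policy(observation)
        if call is None:
            break
        name, arg = call
        try:
            observation = tools[name](arg)
        except Exception as exc:  # the error text becomes the next observation
            observation = f"Error: {type(exc).__name__}: {exc}"
        history.append((call, observation))
    return history


def scripted_policy(observation: str) -> Optional[ToolCall]:
    """Stand-in for the model: list files after a missing-file error, then retry."""
    if observation.startswith("Error: FileNotFoundError"):
        return ("list_files", "")
    if observation.startswith("Read"):
        return None  # done
    if ".csv" in observation:
        csv_name = next(word for word in observation.split() if word.endswith(".csv"))
        return ("read_csv", csv_name)
    return ("read_csv", "data.csv")  # first, naive attempt


TOOLS = {"read_csv": read_csv, "list_files": list_files}
for call, result in run_episode(scripted_policy, TOOLS, "Load the sales data"):
    print(call, "->", result)
```

The `max_steps` cap matters in practice: without it, a model that keeps repeating a failing call would loop indefinitely.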
9.5.5 Benchmark Results
| Benchmark | Hermes 3 (70B) | Hermes 4 (405B) | GPT-4o |
|---|---|---|---|
| AgentBench | 34.7 | 61.3 | 68.4 |
| GAIA (Level 1) | 42.1% | 71.8% | 73.2% |
| Terminal-Bench | 31.2% | 58.9% | 62.1% |
| HumanEval | 82.6% | 91.4% | 90.2% |
| BFCL | 84.1% | 92.7% | 91.3% |
9.6 Migration Considerations
9.6.1 Hermes 2/3 → Hermes 4
Prompt Format Changes
# Hermes 3 system prompt style
hermes3_system = "You are Hermes, a helpful AI assistant."
# Hermes 4 recommended (explicit Agent framing)
hermes4_system = """You are Hermes, an autonomous AI agent with access to tools.
You think step-by-step before acting, use tools when needed,
and verify your results before responding."""
Tool Call Format Upgrade
Hermes 4 enforces stricter JSON Schema validation:
# Hermes 3 (relaxed)
tool_call_h3 = "<tool>get_weather(city='Beijing')</tool>"
# Hermes 4 (strict JSON)
tool_call_h4 = {
    "name": "get_weather",
    "parameters": {"city": "Beijing", "unit": "celsius"}
}
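When migrating stored prompts or logs, calls in the relaxed `<tool>name(key='value')</tool>` style shown above can be converted mechanically. The sketch below parses the call with Python's `ast` module instead of `eval`, so nothing in the string is executed. The function name is illustrative, and the output keys follow the `name`/`parameters` shape of the strict example above.

```python
import ast
import re

TOOL_TAG = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)


def convert_tool_call(text: str) -> dict:
    """Convert '<tool>fn(key=value)</tool>' into a {'name', 'parameters'} dict."""
    match = TOOL_TAG.search(text)
    if match is None:
        raise ValueError("no <tool>...</tool> block found")
    node = ast.parse(match.group(1).strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a call of the form name(key=value)")
    if node.args or any(kw.arg is None for kw in node.keywords):
        raise ValueError("only keyword arguments can be mapped to parameter names")
    # literal_eval accepts only literals, so arbitrary expressions are rejected.
    params = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return {"name": node.func.id, "parameters": params}


print(convert_tool_call("<tool>get_weather(city='Beijing')</tool>"))
# {'name': 'get_weather', 'parameters': {'city': 'Beijing'}}
```

Optional parameters such as `unit` are not invented by the converter; any defaults have to be filled in by the caller.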
9.6.2 Scenario Recommendations
| Scenario | Recommended | Reason |
|---|---|---|
| Resource-constrained (<16GB VRAM) | Hermes 3 (8B) | Best performance/resource ratio |
| Single-turn Q&A tasks | Hermes 3 (70B) | Lower cost |
| Complex Agent tasks | Hermes 4 (405B) | Qualitative Agent improvement |
| Local deploy + Agent | Hermes 4 (quantized) | Only local Agent-class option |
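The hardware side of these recommendations can be sanity-checked with a back-of-the-envelope estimate of weight memory. The sketch below counts weights only, in decimal gigabytes, and ignores the KV cache, activations, and quantization overhead, so real requirements will be higher; the function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for label, params, bits in [
    ("Hermes 3 (8B), 16-bit", 8, 16),
    ("Hermes 3 (8B), 4-bit", 8, 4),
    ("Hermes 3 (70B), 16-bit", 70, 16),
    ("Hermes 4 (405B), 16-bit", 405, 16),
    ("Hermes 4 (405B), 4-bit", 405, 4),
]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions, the 8B model needs quantization to fit comfortably under 16 GB, and a 4-bit 405B model still needs on the order of 200 GB for weights, so "local" deployment of Hermes 4 implies a multi-GPU machine.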
9.6.3 Migration Checklist
□ Does system prompt include Agent-related description?
□ Do tool definitions conform to JSON Schema spec?
□ Context window expanded from 8K to 32K+?
□ Chain-of-thought (<think> tags) enabled?
□ Error handling adapted for multi-step retries?
□ Timeout settings allow sufficient time for long trajectories?
9.7 The Deeper Logic of Evolution
Looking back at four generations of Hermes, a clear philosophical arc emerges:
Hermes 1: Learning to "speak" (instruction following)
↓
Hermes 2: Learning to "think" (multi-step reasoning)
↓
Hermes 3: Learning to "use tools" (function calling)
↓
Hermes 4: Learning to "complete tasks" (Agent trajectories)
Each breakthrough represents a fundamental shift in training data philosophy. Hermes 4's milestone significance: it is the first open-source model to define intelligence through task completion trajectories rather than conversation fragments.
Chapter Summary
- Hermes 1 established a data-quality-first training philosophy
- Hermes 2 introduced CoT and roleplay data, achieving a reasoning leap at the 7B scale, and popularized the ChatML format that became the de facto open-source standard
- Hermes 3 standardized function calling; DPO replaced PPO as the mainstream alignment method
- Hermes 4 is the first open-source model trained primarily on Agent trajectories, scoring 61.3 on AgentBench against GPT-4o's 68.4
- Migration requires particular attention to prompt format, tool call specifications, and context window changes
Discussion Questions
- Hermes 4's training data includes 20% "failure trajectories" — why are error cases so important for Agent training?
- What core problem does the progression SFT → RLHF → DPO → Atropos RL reflect? What does each method solve, and what new problems does it introduce?
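- If you needed to choose between Hermes 3 (70B) and Hermes 4 (405B, quantized to fit a similar memory budget), how would you evaluate their capability boundaries? Note that quantization reduces bits per weight, not the parameter count.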
- What is the fundamental difference between "training on trajectories" versus "training on dialogues"? How does this affect the model's generalization ability?