Chapter 65: Benchmarks in Practice: AgentBench / GAIA / Terminal-Bench
Benchmarks are an agent's mirror. An agent that feels fluent in casual conversation may expose fundamental weaknesses when confronted with AgentBench's database tasks or GAIA's multi-step reasoning challenges. This chapter dissects the design philosophy, task characteristics, and scoring logic of four major benchmarks, analyzes Hermes Agent's performance on each, and provides a complete operational guide for reproducing these tests in your own environment.
65.1 AgentBench: Stress-Testing Across Eight Environments
65.1.1 Design Philosophy
AgentBench, released by Tsinghua KEG Lab, rests on a single conviction: agents must complete real tasks in real environments, not answer closed questions. It constructs eight distinct execution environments, each corresponding to a major real-world use case.
AgentBench Environment Overview:
┌──────────────────────────────────────────────────────┐
│ Operating System (OS) │ Database (DB) │
│ Knowledge Graph (KG) │ Digital Card Game (DCG) │
│ Lateral Thinking (LTP) │ House Holding (HH) │
│ Web Shopping (WS) │ Web Browsing (WB) │
└──────────────────────────────────────────────────────┘
65.1.2 Eight Environments in Detail
Environment 1: Operating System (OS)
Task type: Complete file operations, process management, and permission configuration in a Bash shell.
Evaluation focus: Command syntax accuracy, multi-step planning, error recovery.
Example trace:
# Task: Find all Python files > 10 MB under /data, compress them,
# and move the archive to /backup
[Step 1] find /data -name "*.py" -size +10M
→ /data/models/gpt_weights.py /data/logs/training.py
[Step 2] tar -czf /tmp/large_py.tar.gz \
/data/models/gpt_weights.py /data/logs/training.py
[Step 3] mv /tmp/large_py.tar.gz /backup/
[Step 4] echo "Archived to /backup/large_py.tar.gz"
Hermes analysis: Strong on OS tasks due to precise argument passing. Main weakness: generating complex one-liner pipes correctly in a single shot.
Environment 2: Database (DB)
Task type: Query or modify MySQL/SQLite databases for data analysis or transformation.
# Example SQL generated by Hermes for a quarterly sales + YoY task
sql = """
WITH quarterly_sales AS (
SELECT product_id, product_name,
YEAR(sale_date) AS yr, QUARTER(sale_date) AS qtr,
SUM(amount) AS total
FROM sales
WHERE sale_date >= DATE_SUB(CURDATE(), INTERVAL 2 YEAR)
GROUP BY product_id, product_name, yr, qtr
),
ranked AS (
SELECT *, RANK() OVER (PARTITION BY yr, qtr ORDER BY total DESC) AS rnk
FROM quarterly_sales
)
SELECT r1.product_name, r1.yr, r1.qtr,
r1.total AS current,
r2.total AS prev_year,
ROUND((r1.total - r2.total) / r2.total * 100, 2) AS yoy_pct
FROM ranked r1
LEFT JOIN ranked r2
ON r1.product_id = r2.product_id
AND r1.qtr = r2.qtr AND r1.yr = r2.yr + 1
WHERE r1.rnk <= 5
ORDER BY r1.yr, r1.qtr, r1.rnk;
"""
Hermes analysis: SQL generation is Hermes's strongest sub-skill. Weakness: schema inference when column names are ambiguous.
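One common way to judge a generated query is to execute it and compare its result set with that of a reference query. The sketch below does this against an in-memory SQLite database. The table, the sample rows, and the `same_result` helper are illustrative assumptions, not AgentBench's actual scorer.

```python
import sqlite3
from collections import Counter


def same_result(conn: sqlite3.Connection, candidate_sql: str, reference_sql: str) -> bool:
    """Return True when both queries yield the same rows, ignoring row order."""
    candidate = Counter(conn.execute(candidate_sql).fetchall())
    reference = Counter(conn.execute(reference_sql).fetchall())
    return candidate == reference


# Tiny illustrative table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)])

reference = "SELECT product_id, SUM(amount) FROM sales GROUP BY product_id"
candidate = (
    "SELECT product_id, SUM(amount) AS total FROM sales "
    "GROUP BY product_id ORDER BY total DESC"
)
wrong = "SELECT product_id, MAX(amount) FROM sales GROUP BY product_id"

print(same_result(conn, candidate, reference))  # same rows, different order and alias
print(same_result(conn, wrong, reference))      # different aggregate, different rows
```

Comparing multisets of rows tolerates differences in ordering and column aliases, which is useful when a task does not prescribe a sort order.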
Environments 3–8 Summary
| Env | Focus | Hermes Strength | Hermes Weakness |
|---|---|---|---|
| KG | SPARQL multi-hop | Relation path tracing | 4+ hop queries |
| DCG | Game strategy | Short-horizon tactics | Long-game planning |
| LTP | Lateral puzzles | Hypothesis elimination | Efficient question design |
| HH | Household tasks | Task decomposition | Spatial reasoning |
| WS | Web shopping | Requirement matching | Price comparison logic |
| WB | Web navigation | Navigation strategy | Deep link extraction |
65.1.3 Hermes AgentBench Score Summary
| Environment | Hermes | GPT-4 | Notes |
|---|---|---|---|
| OS | 72.4 | 68.1 | Second-best env for Hermes |
| DB | 78.9 | 74.3 | Best env for Hermes; SQL generation strength |
| KG | 61.2 | 58.7 | Multi-hop needs work |
| DCG | 55.8 | 52.1 | Average game reasoning |
| LTP | 69.3 | 65.4 | Efficient hypothesis pruning |
| HH | 48.7 | 51.2 | Spatial reasoning gap |
| WS | 64.1 | 62.8 | Smooth purchase flow |
| WB | 70.6 | 67.9 | Strong navigation |
| Overall | 65.1 | 62.6 | Unweighted mean of the eight environments |
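The Overall row is consistent with an unweighted mean of the eight environment scores, which can be checked directly. The snippet below is only an arithmetic check on the table's own numbers; it makes no claim about how AgentBench itself weights environments.

```python
from statistics import mean

# (Hermes, GPT-4) scores copied from the table above.
scores = {
    "OS": (72.4, 68.1),
    "DB": (78.9, 74.3),
    "KG": (61.2, 58.7),
    "DCG": (55.8, 52.1),
    "LTP": (69.3, 65.4),
    "HH": (48.7, 51.2),
    "WS": (64.1, 62.8),
    "WB": (70.6, 67.9),
}

hermes_overall = round(mean(h for h, _ in scores.values()), 1)
gpt4_overall = round(mean(g for _, g in scores.values()), 1)
print(hermes_overall, gpt4_overall)  # 65.1 62.6
```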
65.2 GAIA: Real-World Challenges Across Three Difficulty Levels
65.2.1 Design Philosophy
GAIA (General AI Assistants benchmark), published by Meta and Hugging Face, tests AI on realistic tasks rather than academic constructs. Its questions were written by human annotators to reflect requests a person might actually make of an assistant, and every answer is objectively verifiable.
Three core principles:
- Tasks are relatively easy for skilled humans (completable in under 15 minutes)
- Answers are precise and objective (no fuzzy responses accepted)
- Tasks require combining multiple skills (search + file reading + calculation)
65.2.2 Three Difficulty Levels
Level 1: Single- or Double-Step Reasoning
Characteristics: 1–2 tool calls, clear information sources, unambiguous answer format.
# Hermes execution for a Level 1 task
trace = [
ToolCall(tool="web_search",
args={"query": "Tokyo population latest census 2024", "num_results": 5}),
ToolCall(tool="web_fetch",
args={"url": "https://www.stat.go.jp/english/..."}), # Official source
]
answer = "13,960,000 (2020 Census)"
Hermes Level 1 success rate: ~71% (main difficulty: data recency)
Level 2: Multi-Step + Cross-Source Integration
Characteristics: 3–7 tool calls; information from multiple sources; may require simple calculations.
import asyncio

# Assumes an initialized `agent` and the parsing helpers `parse_names` and
# `find_highest_ranked`, none of which are shown in this listing.
async def solve_level2(task: str) -> str:
    # Step 1: Find the laureates
    laureates_raw = await agent.tool_call(
        "web_search", {"query": "2023 Nobel Prize Physics winners"})
    laureates = parse_names(laureates_raw)
    # Step 2: Find where each laureate earned a PhD (searches run concurrently)
    schools = await asyncio.gather(*[
        agent.tool_call("web_search", {"query": f"{name} PhD university"})
        for name in laureates
    ])
    # Step 3: Fetch QS rankings (concurrently). `schools` holds raw search
    # results here; a real run must extract the university name before querying.
    rankings = await asyncio.gather(*[
        agent.tool_call("web_search", {"query": f"{school} QS world ranking 2024"})
        for school in schools
    ])
    return find_highest_ranked(zip(schools, rankings))
Hermes Level 2 success rate: ~48%
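The fan-out used in Steps 2 and 3 above can be tried in isolation. The following sketch swaps the real agent for a stub (`fake_tool_call`, a hypothetical stand-in) and adds a semaphore to cap concurrency, which is an assumption about how one might avoid rate limits rather than something the original listing does.

```python
import asyncio


async def fake_tool_call(tool: str, args: dict) -> str:
    """Stand-in for a real tool call; returns a canned string after a short delay."""
    await asyncio.sleep(0.01)
    return f"{tool}:{args['query']}"


async def fan_out(queries: list[str], max_concurrency: int = 3) -> list[str]:
    """Run one search per query concurrently, capped by a semaphore."""
    limit = asyncio.Semaphore(max_concurrency)

    async def one(query: str) -> str:
        async with limit:
            return await fake_tool_call("web_search", {"query": query})

    # gather preserves input order, so results[i] belongs to queries[i].
    return await asyncio.gather(*(one(q) for q in queries))


results = asyncio.run(fan_out(["a PhD university", "b PhD university"]))
print(results)
```

Because `asyncio.gather` returns results in input order, the later `zip(schools, rankings)` pairing stays aligned even when individual calls finish out of order.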
Level 3: Complex Multi-Tool + Deep Reasoning
Characteristics: 8+ tool calls; file operations (PDF/Excel); complex computation or code execution.
Example task:
Download this 2022 global carbon emissions report PDF,
identify the top-5 emitting countries, compare their 2010 data
from a second report, compute 12-year change rates, and
plot a comparison chart in Python.
Required tool chain:
web_search → file_download → pdf_parser → python_executor
Hermes Level 3 success rate: ~23%
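The final computation in this task is a plain percentage change between the 2010 and 2022 values. A minimal sketch of that step follows; the country names and figures are placeholders in arbitrary units, not real emissions data.

```python
def change_rate(old: float, new: float) -> float:
    """Percentage change from old to new."""
    if old == 0:
        raise ValueError("baseline value must be non-zero")
    return (new - old) / old * 100


# Placeholder values (2010, 2022) in arbitrary units, NOT real emissions figures.
emissions = {"Country A": (100.0, 125.0), "Country B": (80.0, 60.0)}

for country, (v2010, v2022) in emissions.items():
    print(f"{country}: {change_rate(v2010, v2022):+.1f}%")
```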
65.2.3 GAIA Results Overview
| Level | Tasks | Typical Steps | Hermes | Human Expert | GPT-4 |
|---|---|---|---|---|---|
| 1 | 165 | 1–2 | 71.5% | 92.1% | 67.3% |
| 2 | 86 | 3–7 | 47.7% | 84.6% | 39.2% |
| 3 | 25 | 8+ | 22.9% | 75.4% | 14.8% |
| Total | 276 | — | 55.4% | 86.9% | 47.8% |
65.3 Terminal-Bench 2.0: Code Execution Under Pressure
65.3.1 Overview
Terminal-Bench 2.0 evaluates agents in terminal environments on code execution, debugging, and system operations. It is stricter than AgentBench's OS environment—tasks are longer, environments more complex, errors harder to recover from.
Test matrix:
| Category | Difficulty | Typical Steps |
|---|---|---|
| Environment setup | Medium | 5–15 |
| Code debugging | Hard | 8–20 |
| Performance optimization | Hard | 10–25 |
| Security audit | Expert | 15–30 |
| System automation | Medium-Hard | 8–18 |
65.3.2 Debugging Scenario
# Task: Fix all bugs in this quicksort implementation
buggy_code = """
def quicksort(arr, low=0, high=None):
    if high is None:
        high = len(arr)  # Bug 1: should be len(arr) - 1
    if low < high:
        pi = partition(arr, low, high)
        quicksort(arr, low, pi - 1)
        quicksort(arr, pi + 1, high)
    return arr

def partition(arr, low, high):
    pivot = arr[high]  # Bug 2: out-of-range read on the first call, a consequence of Bug 1
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] == arr[j], arr[i]  # Bug 3: == compares instead of swapping
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1
"""
# Hermes debugging trace
trace = [
    {"step": 1, "action": "read_code", "finding": "Identified 3 potential bugs"},
    {"step": 2, "action": "run_code", "result": "IndexError: list index out of range"},
    {"step": 3, "action": "fix_bug_1", "change": "high = len(arr) - 1"},
    {"step": 4, "action": "run_code", "result": "Wrong output: partially sorted"},
    {"step": 5, "action": "analyze", "finding": "Comparison (==) used instead of assignment in partition's swap"},
    {"step": 6, "action": "fix_bug_3", "change": "arr[i], arr[j] = arr[j], arr[i]"},
    {"step": 7, "action": "run_code", "result": "[1, 1, 2, 3, 6, 8, 10] ✓"},
]
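Applying the two changes recorded in the trace (Steps 3 and 6) yields the version below. The test input is an assumption, since the trace shows only the final output; it was chosen so that the sorted result matches Step 7.

```python
from typing import Optional


def partition(arr: list, low: int, high: int) -> int:
    """Lomuto partition: move arr[high] to its sorted position and return that index."""
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]  # fix for Bug 3: assignment, not comparison
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1


def quicksort(arr: list, low: int = 0, high: Optional[int] = None) -> list:
    """Sort arr in place and return it."""
    if high is None:
        high = len(arr) - 1  # fix for Bug 1: index of the last element
    if low < high:
        pi = partition(arr, low, high)
        quicksort(arr, low, pi - 1)
        quicksort(arr, pi + 1, high)
    return arr


print(quicksort([3, 6, 8, 10, 1, 2, 1]))  # assumed input; prints [1, 1, 2, 3, 6, 8, 10]
```

Fixing Bug 1 also removes Bug 2, because `arr[high]` is then always a valid index.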
65.3.3 Terminal-Bench 2.0 Hermes Results
| Category | Tasks | Success Rate | Avg Steps | Top Failure Cause |
|---|---|---|---|---|
| Env setup | 45 | 68.9% | 11.2 | Version compatibility |
| Debugging | 60 | 54.7% | 14.8 | Concurrent bugs |
| Perf optimization | 35 | 41.2% | 18.3 | Bottleneck localization |
| Security audit | 20 | 29.5% | 22.1 | Advanced CVE recognition |
| Automation | 40 | 62.3% | 13.5 | Error handling completeness |
| Overall | 200 | 53.6% | 15.2 | |
65.4 YC-Bench: Startup Scenario Assessment
65.4.1 Background
YC-Bench is designed for startup engineering scenarios, simulating technical and business tasks common in Y Combinator batches. The core assumption: a truly practical AI agent should handle a startup's full-stack work.
| Category | Weight | Typical Task |
|---|---|---|
| Product prototyping | 25% | Requirements → MVP code |
| Data analysis | 20% | User behavior insights |
| Competitive analysis | 15% | Market research report |
| System architecture | 20% | Design proposals |
| Content creation | 10% | Tech blog / docs |
| Customer communication | 10% | Emails / pitch drafts |
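The chapter does not state how per-category results are combined. Assuming the weights are applied linearly to per-category scores on a 0 to 100 scale, a composite could be computed as sketched here. The `composite_score` helper and the example scores are hypothetical.

```python
# Category weights in percent, copied from the table above.
WEIGHTS = {
    "product_prototyping": 25,
    "data_analysis": 20,
    "competitive_analysis": 15,
    "system_architecture": 20,
    "content_creation": 10,
    "customer_communication": 10,
}


def composite_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (assumed linear aggregation)."""
    missing = WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS) / 100


# Hypothetical per-category scores, for illustration only.
example = {
    "product_prototyping": 80,
    "data_analysis": 70,
    "competitive_analysis": 60,
    "system_architecture": 50,
    "content_creation": 90,
    "customer_communication": 40,
}
print(composite_score(example))  # 66.0
```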
65.4.2 Example Task: Retention Analysis
yc_task = """
Our SaaS launched 6 months ago. Given this CSV (user_id, signup_date,
last_active_date, plan_type, country), please:
1. Compute D1/D7/D30 retention rates by month
2. Compare retention: Free vs Paid users
3. Identify the top 3 countries by retention
4. Suggest 3 concrete retention improvements
"""
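# Core analysis snippet from Hermes (covers overall and per-plan retention only)
import pandas as pd
df = pd.read_csv('users.csv', parse_dates=['signup_date', 'last_active_date'])
# Approximation: a user counts as retained at day d when their last activity
# falls at least d days after signup
df['days_active'] = (df['last_active_date'] - df['signup_date']).dt.days
retention = {
    f"D{d}": (df['days_active'] >= d).mean()
    for d in [1, 7, 30]
}
# Iterating over the groups avoids the deprecation warning that
# groupby(...).apply raises in recent pandas versions
by_plan = {
    plan: {f"D{d}": (g['days_active'] >= d).mean() for d in [1, 7, 30]}
    for plan, g in df.groupby('plan_type')
}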
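Item 1 of the task asks for retention by month, which the snippet above does not compute. The sketch below shows one way to group users into signup-month cohorts, using the same approximation (days between signup and last activity). It uses only the standard library so that it runs without pandas, and the sample rows are invented.

```python
from collections import defaultdict
from datetime import date

# Invented sample rows: (signup_date, last_active_date).
users = [
    (date(2024, 1, 5), date(2024, 1, 5)),    # never returned
    (date(2024, 1, 20), date(2024, 2, 25)),  # active 36 days after signup
    (date(2024, 2, 3), date(2024, 2, 12)),   # active 9 days after signup
    (date(2024, 2, 10), date(2024, 2, 11)),  # active 1 day after signup
]

# Collect the signup-to-last-activity span for each signup-month cohort.
cohorts: dict[str, list[int]] = defaultdict(list)
for signup, last_active in users:
    cohorts[signup.strftime("%Y-%m")].append((last_active - signup).days)

# Share of each cohort whose span reaches at least d days.
retention_by_month = {
    month: {f"D{d}": sum(days >= d for days in spans) / len(spans) for d in (1, 7, 30)}
    for month, spans in sorted(cohorts.items())
}
print(retention_by_month)
```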
65.5 Running Benchmarks in Your Own Environment
65.5.1 AgentBench Setup
git clone https://github.com/THUDM/AgentBench.git && cd AgentBench
pip install -r requirements.txt
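# Configure Hermes. The quoted 'EOF' stops the shell from expanding
# ${HERMES_API_KEY}, so the file contains the placeholder itself; export the
# variable before running and confirm that the harness substitutes it.
cat > configs/agents/hermes.yaml << 'EOF'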
name: hermes-agent
type: api
base_url: https://api.hermes.nousresearch.com/v1
model: hermes-3-70b
api_key: ${HERMES_API_KEY}
temperature: 0.0
max_tokens: 4096
EOF
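# Run a single environment. The script name and flags are the ones used in this
# chapter; entry points differ between AgentBench releases, so check the README
# of your checkout.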
python run_eval.py \
--agent configs/agents/hermes.yaml \
--task configs/tasks/os.yaml \
--output results/hermes_os.json
# Full suite (~4-6 hours)
python run_eval.py --agent configs/agents/hermes.yaml \
--all-tasks --parallel 4 --output results/hermes_full.json
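The config above stores the API key as a `${HERMES_API_KEY}` placeholder. If the harness does not expand environment references itself (an assumption worth checking), the file can be expanded before use. The `expand_env` helper below is a hypothetical sketch built on the standard library's `os.path.expandvars`.

```python
import os


def expand_env(text: str) -> str:
    """Replace ${VAR} references with environment values; fail on unset variables."""
    expanded = os.path.expandvars(text)
    if "${" in expanded:
        # expandvars leaves unknown variables untouched, so catch them explicitly.
        raise KeyError("config references an unset environment variable")
    return expanded


os.environ.setdefault("HERMES_API_KEY", "sk-example")  # placeholder value for the demo
config_text = "api_key: ${HERMES_API_KEY}\nmodel: hermes-3-70b\n"
print(expand_env(config_text))
```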
65.5.2 GAIA Setup
import os
from datasets import load_dataset
from hermes_agent import HermesAgent

# The GAIA dataset is gated: accept its terms on the Hugging Face Hub and log in first.
# `normalize_answer` is a user-supplied helper that canonicalizes case,
# whitespace, and number formatting before comparison.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")
agent = HermesAgent(
    api_key=os.environ["HERMES_API_KEY"],
    tools=["web_search", "web_fetch", "python_executor", "file_reader"]
)
results = []
for task in gaia["validation"]:
    level = int(task["Level"])  # cast in case the level is stored as a string
    response = agent.run(
        task=task["Question"],
        max_steps=20 if level <= 2 else 40,
        timeout=300
    )
    success = normalize_answer(response.answer) == normalize_answer(task["Final answer"])
    results.append({"level": level, "success": success, "steps": len(response.trace)})
for level in [1, 2, 3]:
    subset = [r for r in results if r["level"] == level]
    if not subset:
        continue  # no tasks at this level; avoid dividing by zero
    rate = sum(r["success"] for r in subset) / len(subset)
    print(f"Level {level}: {rate:.1%} ({len(subset)} tasks)")
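The loop above calls `normalize_answer`, which the listing does not define. A minimal sketch is given below. It is an assumption about what a reasonable normalizer does (lowercasing, removing thousands separators, dropping most punctuation) and not GAIA's official scoring function.

```python
import re
import string

# Delete all punctuation except "." and "-", so decimals and negative numbers survive.
_DROP = str.maketrans("", "", string.punctuation.replace(".", "").replace("-", ""))


def normalize_answer(text: str) -> str:
    """Canonicalize an answer string for exact-match comparison."""
    text = text.strip().lower()
    # Remove thousands separators: 13,960,000 -> 13960000
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    text = text.translate(_DROP)
    # Collapse runs of whitespace and drop a trailing sentence period.
    return " ".join(text.split()).rstrip(".")


print(normalize_answer(" 13,960,000 "))
print(normalize_answer("The Eiffel Tower."))
```

Stricter or looser rules (for example, numeric comparison with a tolerance) change the measured success rate, so the normalizer should be fixed before comparing runs.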
65.5.3 Interpreting Results: Capability Boundaries
def interpret_results(results: dict) -> list[str]:
    # Defaults are chosen so that a missing metric never triggers a message.
    insights = []
    ab = results.get("agentbench", {})
    if ab.get("os_score", 100) < 60:
        insights.append("WARN: Bash capability insufficient for DevOps automation")
    if ab.get("db_score", 0) > 75:
        insights.append("OK: Strong SQL; suitable for data analysis pipelines")
    if ab.get("hh_score", 100) < 50:
        insights.append("WARN: Spatial reasoning weak; avoid physical-world tasks")
    gaia = results.get("gaia", {})
    if gaia.get("level3_rate", 1.0) < 0.25:
        insights.append("WARN: Complex multi-step success rate low; decompose tasks")
    if gaia.get("level1_rate", 0) > 0.70:
        insights.append("OK: Basic retrieval solid; suitable for simple Q&A")
    tb = results.get("terminal_bench", {})
    if tb.get("security_audit_rate", 1.0) < 0.35:
        insights.append("WARN: Security audit capability limited; human review required")
    return insights
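The same thresholds can be expressed as a rule table, which makes them easier to audit and extend. The sketch below is a hypothetical table-driven variant (`RULES` and `apply_rules` are illustrative names) applied to the scores reported earlier in this chapter.

```python
import operator

# (suite, metric, comparison, threshold, message); thresholds mirror the function above.
RULES = [
    ("agentbench", "os_score", operator.lt, 60, "WARN: Bash capability insufficient for DevOps automation"),
    ("agentbench", "db_score", operator.gt, 75, "OK: Strong SQL; suitable for data analysis pipelines"),
    ("agentbench", "hh_score", operator.lt, 50, "WARN: Spatial reasoning weak; avoid physical-world tasks"),
    ("gaia", "level3_rate", operator.lt, 0.25, "WARN: Complex multi-step success rate low; decompose tasks"),
    ("gaia", "level1_rate", operator.gt, 0.70, "OK: Basic retrieval solid; suitable for simple Q&A"),
    ("terminal_bench", "security_audit_rate", operator.lt, 0.35, "WARN: Security audit capability limited; human review required"),
]


def apply_rules(results: dict) -> list[str]:
    """Return the message of every rule whose metric is present and meets its condition."""
    insights = []
    for suite, metric, compare, threshold, message in RULES:
        value = results.get(suite, {}).get(metric)
        if value is not None and compare(value, threshold):  # skip unmeasured metrics
            insights.append(message)
    return insights


# Scores taken from the result tables in this chapter.
chapter_results = {
    "agentbench": {"os_score": 72.4, "db_score": 78.9, "hh_score": 48.7},
    "gaia": {"level1_rate": 0.715, "level3_rate": 0.229},
    "terminal_bench": {"security_audit_rate": 0.295},
}
for line in apply_rules(chapter_results):
    print(line)
```

With the chapter's numbers, every rule fires except the Bash warning, since the OS score of 72.4 is above the 60 threshold.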
Chapter Summary
This chapter examined four major benchmarks and Hermes's performance on each:
- AgentBench: Eight environments spanning OS, DB, KG, and Web; Hermes overall 65.1; strongest in DB and OS
- GAIA: Three difficulty tiers; Level 3 is challenging for all agents; Hermes 55.4% overall
- Terminal-Bench 2.0: Strict code execution testing; security audit is the biggest weakness (29.5%)
- YC-Bench: Startup scenario evaluation, closest to real business needs
- Running locally: Complete steps from environment configuration to result interpretation
Discussion Questions
- If your primary use case is data analysis, which benchmark environment should you focus on to decide whether Hermes fits?
- GAIA Level 3 success rates are low across all models (best < 30%). What does this tell you? How can engineering (not model upgrades) improve complex task success?
- Terminal-Bench security audit scores only 29.5%. What are the implications for security-sensitive deployments?
- What benchmark overfitting risks exist—situations where high benchmark scores mask poor real-world performance?