Chapter 65

Benchmarks in Practice: AgentBench, GAIA, and Terminal-Bench

Benchmarks are an agent's mirror. An agent that feels fluent in casual conversation may expose fundamental weaknesses when confronted with AgentBench's database tasks or GAIA's multi-step reasoning challenges. This chapter dissects the design philosophy, task characteristics, and scoring logic of four major benchmarks, analyzes Hermes Agent's performance on each, and provides a complete operational guide for reproducing these tests in your own environment.

65.1 AgentBench: Stress-Testing Across Eight Environments

65.1.1 Design Philosophy

AgentBench, released by Tsinghua KEG Lab, rests on a single conviction: agents must complete real tasks in real environments, not answer closed questions. It constructs eight distinct execution environments, each corresponding to a major real-world use case.

AgentBench Environment Overview:
┌──────────────────────────────────────────────────────┐
│  Operating System (OS)  │  Database (DB)             │
│  Knowledge Graph (KG)   │  Digital Card Game (DCG)   │
│  Lateral Thinking (LTP) │  House Holding (HH)        │
│  Web Shopping (WS)      │  Web Browsing (WB)         │
└──────────────────────────────────────────────────────┘

65.1.2 Eight Environments in Detail

Environment 1: Operating System (OS)

Task type: Complete file operations, process management, and permission configuration in a Bash shell.

Evaluation focus: Command syntax accuracy, multi-step planning, error recovery.

Example trace:

# Task: Find all Python files > 10 MB under /data, compress them,
#       and move the archive to /backup

[Step 1] find /data -name "*.py" -size +10M
         → /data/models/gpt_weights.py  /data/logs/training.py

[Step 2] tar -czf /tmp/large_py.tar.gz \
         /data/models/gpt_weights.py /data/logs/training.py

[Step 3] mv /tmp/large_py.tar.gz /backup/

[Step 4] echo "Archived to /backup/large_py.tar.gz"

Hermes analysis: Strong on OS tasks due to precise argument passing. Main weakness: generating complex one-liner pipes correctly in a single shot.
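
To check an agent's shell output against an independent implementation, the same find, tar, mv sequence can be sketched in Python. This is a minimal sketch, not part of AgentBench: the function name `archive_large_files`, the `min_bytes` parameter, and the fixed archive name are assumptions made for illustration.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path
from typing import Optional


def archive_large_files(src: Path, dest_dir: Path, pattern: str = "*.py",
                        min_bytes: int = 10 * 1024 * 1024) -> Optional[Path]:
    """Archive files under src that match pattern and exceed min_bytes.

    Returns the path of the archive inside dest_dir, or None if nothing matched.
    """
    # Step 1: equivalent of `find src -name pattern -size +10M`
    matches = sorted(p for p in src.rglob(pattern)
                     if p.is_file() and p.stat().st_size > min_bytes)
    if not matches:
        return None

    dest_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        # Step 2: build the archive in a scratch directory first
        staged = Path(tmp) / "large_py.tar.gz"
        with tarfile.open(staged, "w:gz") as tar:
            for path in matches:
                tar.add(path, arcname=path.relative_to(src).as_posix())
        # Step 3: move the finished archive into place
        final = dest_dir / staged.name
        shutil.move(str(staged), str(final))
    return final
```

Called as `archive_large_files(Path("/data"), Path("/backup"))`, it leaves the original files in place, as the shell trace does.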

Environment 2: Database (DB)

Task type: Query or modify MySQL/SQLite databases for data analysis or transformation.

# Example SQL (MySQL dialect) generated by Hermes for a quarterly sales + YoY task
sql = """
WITH quarterly_sales AS (
    SELECT product_id, product_name,
           YEAR(sale_date) AS yr, QUARTER(sale_date) AS qtr,
           SUM(amount) AS total
    FROM sales
    WHERE sale_date >= DATE_SUB(CURDATE(), INTERVAL 2 YEAR)
    GROUP BY product_id, product_name, yr, qtr
),
ranked AS (
    SELECT *, RANK() OVER (PARTITION BY yr, qtr ORDER BY total DESC) AS rnk
    FROM quarterly_sales
)
SELECT r1.product_name, r1.yr, r1.qtr,
       r1.total AS current,
       r2.total AS prev_year,
       ROUND((r1.total - r2.total) / r2.total * 100, 2) AS yoy_pct
FROM ranked r1
LEFT JOIN ranked r2
    ON r1.product_id = r2.product_id
   AND r1.qtr = r2.qtr AND r1.yr = r2.yr + 1
WHERE r1.rnk <= 5
ORDER BY r1.yr, r1.qtr, r1.rnk;
"""

Hermes analysis: SQL generation is Hermes's strongest sub-skill. Weakness: schema inference when column names are ambiguous.
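
One common mitigation for ambiguous column names is to read the schema from the database and place it in the prompt before any SQL is generated. The sketch below does this for SQLite using only the standard library. `describe_schema` and `schema_prompt` are hypothetical helper names, and nothing here is taken from AgentBench or Hermes.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> dict:
    """Return {table: [(column, declared_type), ...]} for every user table."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in cols]
    return schema


def schema_prompt(schema: dict) -> str:
    """Render the schema as compact text suitable for a model prompt."""
    lines = []
    for table, cols in schema.items():
        col_text = ", ".join(f"{name} {ctype}" for name, ctype in cols)
        lines.append(f"{table}({col_text})")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, sale_date TEXT, amount REAL)")
print(schema_prompt(describe_schema(conn)))
# prints: sales(product_id INTEGER, sale_date TEXT, amount REAL)
```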

Environments 3–8 Summary

Env	Focus	Hermes Strength	Hermes Weakness
KG	SPARQL multi-hop	Relation path tracing	4+ hop queries
DCG	Game strategy	Short-horizon tactics	Long-game planning
LTP	Lateral puzzles	Hypothesis elimination	Efficient question design
HH	Household tasks	Task decomposition	Spatial reasoning
WS	Web shopping	Requirement matching	Price comparison logic
WB	Web navigation	Navigation strategy	Deep link extraction

65.1.3 Hermes AgentBench Score Summary

Environment	Hermes	GPT-4	Notes
OS	72.4	68.1	Second-best env for Hermes
DB	78.9	74.3	SQL generation strength
KG	61.2	58.7	Multi-hop needs work
DCG	55.8	52.1	Average game reasoning
LTP	69.3	65.4	Efficient hypothesis pruning
HH	48.7	51.2	Spatial reasoning gap
WS	64.1	62.8	Smooth purchase flow
WB	70.6	67.9	Strong navigation
Overall	65.1	62.6
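
The Overall row is consistent with an unweighted mean of the eight environment scores. The short check below recomputes it from the table and lists the environments where Hermes trails GPT-4. It is only a consistency check on the table, under the assumption of equal weighting, and not a reproduction of AgentBench's own aggregation.

```python
hermes = {"OS": 72.4, "DB": 78.9, "KG": 61.2, "DCG": 55.8,
          "LTP": 69.3, "HH": 48.7, "WS": 64.1, "WB": 70.6}
gpt4 = {"OS": 68.1, "DB": 74.3, "KG": 58.7, "DCG": 52.1,
        "LTP": 65.4, "HH": 51.2, "WS": 62.8, "WB": 67.9}


def overall(scores: dict) -> float:
    """Unweighted mean across environments, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


# Environments where Hermes scores below GPT-4
trailing = [env for env in hermes if hermes[env] < gpt4[env]]

print(overall(hermes), overall(gpt4), trailing)
# prints: 65.1 62.6 ['HH']
```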

65.2 GAIA: Real-World Challenges Across Three Difficulty Levels

65.2.1 Design Philosophy

GAIA (General AI Assistants benchmark), published by researchers at Meta and Hugging Face, tests AI assistants on real-world tasks rather than academic constructs. Its questions were written and validated by human annotators, and every answer is objectively verifiable.

Three core principles:

Tasks are relatively easy for skilled humans (completable in under 15 minutes)
Answers are precise and objective (no fuzzy responses accepted)
Tasks require combining multiple skills (search + file reading + calculation)
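
The second principle is what makes automatic scoring possible: a prediction either matches the reference answer after light normalization or it does not. The function below is a simplified sketch of that idea, not GAIA's official scorer, which handles more cases (for example list-valued answers). The names `answers_match`, `_to_number`, and `_norm_text` are illustrative.

```python
import re
import string


def _to_number(text: str):
    """Parse '1,234.5', '$12', or '45%' as a float; return None if not numeric."""
    cleaned = text.strip().replace(",", "").replace("$", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def _norm_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def answers_match(predicted: str, truth: str) -> bool:
    """Compare numerically when the reference is a number, else as normalized text."""
    truth_num = _to_number(truth)
    if truth_num is not None:
        pred_num = _to_number(predicted)
        return pred_num is not None and pred_num == truth_num
    return _norm_text(predicted) == _norm_text(truth)
```

Note that a hedged prediction such as "about 1234" fails against the reference "1234", which reflects the "no fuzzy responses" rule.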

65.2.2 Three Difficulty Levels

Level 1: Single- or Double-Step Reasoning

Characteristics: 1–2 tool calls, clear information sources, unambiguous answer format.

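# Hermes execution for a Level 1 task
from dataclasses import dataclass

@dataclass
class ToolCall:
    """Minimal stand-in for the agent's tool-call record (illustrative)."""
    tool: str
    args: dict

trace = [
    ToolCall(tool="web_search",
             args={"query": "Tokyo population latest census 2024", "num_results": 5}),
    ToolCall(tool="web_fetch",
             args={"url": "https://www.stat.go.jp/english/..."}),  # Official source
]
answer = "13,960,000 (2020 Census)"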

Hermes Level 1 success rate: ~71% (main difficulty: data recency)

Level 2: Multi-Step + Cross-Source Integration

Characteristics: 3–7 tool calls; information from multiple sources; may require simple calculations.

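import asyncio

# Task (as implied by the steps): among the 2023 Physics Nobel laureates,
# whose PhD university has the best QS 2024 ranking?
# `agent`, `parse_names`, `parse_school`, and `find_highest_ranked` are assumed
# helpers; only the control flow is shown.
async def solve_level2(task: str) -> str:
    # Step 1: Find laureates
    laureates_raw = await agent.tool_call(
        "web_search", {"query": "2023 Nobel Prize Physics winners"})
    laureates = parse_names(laureates_raw)

    # Step 2: Find their alma maters (in parallel), then extract the school names
    school_results = await asyncio.gather(*[
        agent.tool_call("web_search", {"query": f"{name} PhD university"})
        for name in laureates
    ])
    schools = [parse_school(raw) for raw in school_results]

    # Step 3: Fetch QS rankings (in parallel)
    rankings = await asyncio.gather(*[
        agent.tool_call("web_search", {"query": f"{school} QS world ranking 2024"})
        for school in schools
    ])

    return find_highest_ranked(zip(schools, rankings))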

Hermes Level 2 success rate: ~48%

Level 3: Complex Multi-Tool + Deep Reasoning

Characteristics: 8+ tool calls; file operations (PDF/Excel); complex computation or code execution.

Example task:
Download this 2022 global carbon emissions report PDF,
identify the top-5 emitting countries, compare their 2010 data
from a second report, compute 12-year change rates, and
plot a comparison chart in Python.

Required tool chain:
web_search → file_download → pdf_parser → python_executor

Hermes Level 3 success rate: ~23%
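
The arithmetic at the end of the Level 3 example is simple, but it is a frequent source of wrong final answers (wrong base year, mismatched country names). A minimal sketch of the change-rate step follows. `change_rate` and `change_table` are illustrative names, and the inputs are assumed to be per-country totals already extracted from the two reports.

```python
def change_rate(start: float, end: float) -> float:
    """Percentage change from start to end, relative to start."""
    if start == 0:
        raise ValueError("start value must be non-zero")
    return (end - start) / start * 100.0


def change_table(values_2010: dict, values_2022: dict) -> dict:
    """Per-country percentage change for countries present in both reports."""
    return {
        country: round(change_rate(values_2010[country], values_2022[country]), 1)
        for country in values_2022
        if country in values_2010
    }
```

Countries missing from the 2010 report are skipped rather than guessed, so a gap in the second source shows up as a shorter table instead of a silent error.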

65.2.3 GAIA Results Overview

Level	Tasks	Avg Steps	Hermes	Human Expert	GPT-4
1	165	1–2	71.5%	92.1%	67.3%
2	86	3–7	47.7%	84.6%	39.2%
3	25	8+	22.9%	75.4%	14.8%
Total	276	—	55.4%	86.9%	47.8%

65.3 Terminal-Bench 2.0: Code Execution Under Pressure

65.3.1 Overview

Terminal-Bench 2.0 evaluates agents in terminal environments on code execution, debugging, and system operations. It is stricter than AgentBench's OS environment: tasks are longer, environments are more complex, and errors are harder to recover from.

Test matrix:

Category	Difficulty	Typical Steps
Environment setup	Medium	5–15
Code debugging	Hard	8–20
Performance optimization	Hard	10–25
Security audit	Expert	15–30
System automation	Medium-Hard	8–18

65.3.2 Debugging Scenario

# Task: Fix all bugs in this quicksort implementation

buggy_code = """
def quicksort(arr, low=0, high=None):
    if high is None:
        high = len(arr)          # Bug 1: should be len(arr) - 1
    if low < high:
        pi = partition(arr, low, high)
        quicksort(arr, low, pi - 1)
        quicksort(arr, pi + 1, high)
    return arr

def partition(arr, low, high):
    pivot = arr[high]            # Bug 2: symptom of Bug 1 (arr[high] is out of range)
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] == arr[j], arr[i]  # Bug 3: == not =
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1
"""

# Hermes debugging trace
trace = [
    {"step": 1, "action": "read_code",    "finding": "Identified 3 potential bugs"},
    {"step": 2, "action": "run_code",     "result": "IndexError: list index out of range"},
    {"step": 3, "action": "fix_bug_1",    "change": "high = len(arr) - 1"},
    {"step": 4, "action": "run_code",     "result": "Wrong output — partially sorted"},
    {"step": 5, "action": "analyze",      "finding": "Assignment operator error in the partition swap (== instead of =)"},
    {"step": 6, "action": "fix_bug_3",    "change": "arr[i], arr[j] = arr[j], arr[i]"},
    {"step": 7, "action": "run_code",     "result": "[1, 1, 2, 3, 6, 8, 10] ✓"},
]
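
For reference, this is the implementation after the two fixes in the trace (steps 3 and 6) are applied. The input list in the usage line is an assumption chosen to match the output shown in step 7; the task's actual test input is not given.

```python
def partition(arr, low, high):
    """Lomuto partition: place arr[high] at its sorted position and return it."""
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]  # fixed: assignment, not comparison
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1


def quicksort(arr, low=0, high=None):
    """Sort arr in place and return it."""
    if high is None:
        high = len(arr) - 1  # fixed: last valid index, not the length
    if low < high:
        pi = partition(arr, low, high)
        quicksort(arr, low, pi - 1)
        quicksort(arr, pi + 1, high)
    return arr


print(quicksort([3, 6, 8, 10, 1, 2, 1]))
# prints: [1, 1, 2, 3, 6, 8, 10]
```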

65.3.3 Terminal-Bench 2.0 Hermes Results

Category	Tasks	Success Rate	Avg Steps	Top Failure Cause
Env setup	45	68.9%	11.2	Version compatibility
Debugging	60	54.7%	14.8	Concurrent bugs
Perf optimization	35	41.2%	18.3	Bottleneck localization
Security audit	20	29.5%	22.1	Advanced CVE recognition
Automation	40	62.3%	13.5	Error handling completeness
Overall	200	53.6%	15.2

65.4 YC-Bench: Startup Scenario Assessment

65.4.1 Background

YC-Bench is designed for startup engineering scenarios, simulating technical and business tasks common in Y Combinator batches. The core assumption: a truly practical AI agent should handle a startup's full-stack work.

Category	Weight	Typical Task
Product prototyping	25%	Requirements → MVP code
Data analysis	20%	User behavior insights
Competitive analysis	15%	Market research report
System architecture	20%	Design proposals
Content creation	10%	Tech blog / docs
Customer communication	10%	Emails / pitch drafts
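
If the overall score is taken to be the weight-averaged category score (an assumption; the aggregation rule is not stated here), it can be computed as below. `weighted_score` is an illustrative name, and category scores are assumed to be on a common 0 to 100 scale.

```python
WEIGHTS = {
    "Product prototyping": 0.25,
    "Data analysis": 0.20,
    "Competitive analysis": 0.15,
    "System architecture": 0.20,
    "Content creation": 0.10,
    "Customer communication": 0.10,
}


def weighted_score(category_scores: dict) -> float:
    """Combine per-category scores using the weights from the table."""
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise KeyError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
```

Because prototyping and architecture together carry 45% of the weight, an agent that is weak at code and design cannot compensate with strong writing alone.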

65.4.2 Example Task: Retention Analysis

yc_task = """
Our SaaS launched 6 months ago. Given this CSV (user_id, signup_date,
last_active_date, plan_type, country), please:
1. Compute D1/D7/D30 retention rates by month
2. Compare retention: Free vs Paid users
3. Identify the top 3 countries by retention
4. Suggest 3 concrete retention improvements
"""

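# Core analysis snippet from Hermes
import pandas as pd

df = pd.read_csv('users.csv', parse_dates=['signup_date', 'last_active_date'])
# Days between signup and last activity, used as a proxy for "still active on day d"
df['days_active'] = (df['last_active_date'] - df['signup_date']).dt.days

# Overall D1/D7/D30 retention: share of users last seen at least d days after signup
retention = {
    f"D{d}": (df['days_active'] >= d).mean()
    for d in [1, 7, 30]
}

# Same metric split by plan (Free vs Paid)
by_plan = {
    plan: {f"D{d}": (g['days_active'] >= d).mean() for d in [1, 7, 30]}
    for plan, g in df.groupby('plan_type')
}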
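
The snippet above covers items 1 and 2 only in aggregate; the task also asks for retention by month. A dependency-free sketch of the monthly cohort breakdown follows, using the same proxy (days between signup and last activity). `retention_by_signup_month` is an illustrative name, and the input is assumed to be already parsed into `date` pairs.

```python
from collections import defaultdict
from datetime import date


def retention_by_signup_month(users, days=(1, 7, 30)):
    """users: iterable of (signup_date, last_active_date) pairs of date objects.

    Returns {"YYYY-MM": {"D1": rate, "D7": rate, ...}} keyed by signup month.
    """
    cohorts = defaultdict(list)
    for signup, last_active in users:
        cohorts[signup.strftime("%Y-%m")].append((last_active - signup).days)
    return {
        month: {f"D{d}": sum(span >= d for span in spans) / len(spans) for d in days}
        for month, spans in sorted(cohorts.items())
    }
```

Recent cohorts need care: users who signed up fewer than 30 days before the export cannot yet count toward D30, so their D30 rate is biased low.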

65.5 Running Benchmarks in Your Own Environment

65.5.1 AgentBench Setup

git clone https://github.com/THUDM/AgentBench.git && cd AgentBench
pip install -r requirements.txt

# Configure Hermes
cat > configs/agents/hermes.yaml << 'EOF'
name: hermes-agent
type: api
base_url: https://api.hermes.nousresearch.com/v1
model: hermes-3-70b
api_key: ${HERMES_API_KEY}
temperature: 0.0
max_tokens: 4096
EOF

# Run a single environment
python run_eval.py \
  --agent configs/agents/hermes.yaml \
  --task configs/tasks/os.yaml \
  --output results/hermes_os.json

# Full suite (~4-6 hours)
python run_eval.py --agent configs/agents/hermes.yaml \
  --all-tasks --parallel 4 --output results/hermes_full.json

65.5.2 GAIA Setup

from datasets import load_dataset
from hermes_agent import HermesAgent
import os

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")
agent = HermesAgent(
    api_key=os.environ["HERMES_API_KEY"],
    tools=["web_search", "web_fetch", "python_executor", "file_reader"]
)

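def normalize_answer(text) -> str:
    # Simplified stand-in for GAIA's official scorer: trim, lowercase, collapse spaces
    return " ".join(str(text).strip().lower().split())

results = []
for task in gaia["validation"]:
    level = int(task["Level"])  # cast in case the level is stored as a string
    response = agent.run(
        task=task["Question"],
        max_steps=20 if level <= 2 else 40,
        timeout=300
    )
    success = normalize_answer(response.answer) == normalize_answer(task["Final answer"])
    results.append({"level": level, "success": success, "steps": len(response.trace)})

for level in [1, 2, 3]:
    subset = [r for r in results if r["level"] == level]
    if not subset:
        continue  # avoid dividing by zero when a level has no tasks
    rate = sum(r["success"] for r in subset) / len(subset)
    print(f"Level {level}: {rate:.1%}  ({len(subset)} tasks)")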

65.5.3 Interpreting Results: Capability Boundaries

def interpret_results(results: dict) -> list[str]:
    insights = []
    ab = results.get("agentbench", {})
    if ab.get("os_score", 100) < 60:
        insights.append("WARN: Bash capability insufficient for DevOps automation")
    if ab.get("db_score", 0) > 75:
        insights.append("OK: Strong SQL — suitable for data analysis pipelines")
    if ab.get("hh_score", 100) < 50:
        insights.append("WARN: Spatial reasoning weak — avoid physical-world tasks")

    gaia = results.get("gaia", {})
    if gaia.get("level3_rate", 1.0) < 0.25:  # a missing metric should not trigger a warning
        insights.append("WARN: Complex multi-step success rate low — decompose tasks")
    if gaia.get("level1_rate", 0) > 0.70:
        insights.append("OK: Basic retrieval solid — suitable for simple Q&A")

    tb = results.get("terminal_bench", {})
    if tb.get("security_audit_rate", 1.0) < 0.35:  # a missing metric should not trigger a warning
        insights.append("WARN: Security audit capability limited — human review required")

    return insights
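
As the number of thresholds grows, the same logic can be kept in a rule table so that adding a check does not mean adding another `if`. The sketch below is one possible restructuring, not the chapter's implementation. It copies the thresholds from `interpret_results`, treats unmeasured metrics as "no insight", and feeds in the scores reported in this chapter.

```python
import operator

# (suite, metric, comparison, threshold, message); thresholds as in interpret_results
RULES = [
    ("agentbench", "os_score", operator.lt, 60,
     "WARN: Bash capability insufficient for DevOps automation"),
    ("agentbench", "db_score", operator.gt, 75,
     "OK: Strong SQL, suitable for data analysis pipelines"),
    ("agentbench", "hh_score", operator.lt, 50,
     "WARN: Spatial reasoning weak, avoid physical-world tasks"),
    ("gaia", "level3_rate", operator.lt, 0.25,
     "WARN: Complex multi-step success rate low, decompose tasks"),
    ("gaia", "level1_rate", operator.gt, 0.70,
     "OK: Basic retrieval solid, suitable for simple Q&A"),
    ("terminal_bench", "security_audit_rate", operator.lt, 0.35,
     "WARN: Security audit capability limited, human review required"),
]


def interpret(results: dict) -> list:
    insights = []
    for suite, metric, compare, threshold, message in RULES:
        value = results.get(suite, {}).get(metric)
        # A metric that was never measured produces no insight either way
        if value is not None and compare(value, threshold):
            insights.append(message)
    return insights


chapter_scores = {
    "agentbench": {"os_score": 72.4, "db_score": 78.9, "hh_score": 48.7},
    "gaia": {"level1_rate": 0.715, "level3_rate": 0.229},
    "terminal_bench": {"security_audit_rate": 0.295},
}
for line in interpret(chapter_scores):
    print(line)
```

With these inputs, every rule except the Bash warning fires: two OK lines (SQL, basic retrieval) and three warnings (spatial reasoning, complex multi-step tasks, security audit).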

Chapter Summary

This chapter examined four major benchmarks and Hermes's performance on each:

AgentBench: Eight environments spanning OS, DB, KG, and Web; Hermes overall 65.1; strongest in DB and OS
GAIA: Three difficulty tiers; Level 3 is challenging for all agents; Hermes 55.4% overall
Terminal-Bench 2.0: Strict code execution testing; security audit is the biggest weakness (29.5%)
YC-Bench: Startup scenario evaluation, closest to real business needs
Running locally: Complete steps from environment configuration to result interpretation

Discussion Questions

If your primary use case is data analysis, which benchmark environment should you focus on to decide whether Hermes fits?
GAIA Level 3 success rates are low across all models (best < 30%). What does this tell you? How can engineering (not model upgrades) improve complex task success?
Terminal-Bench security audit scores only 29.5%. What are the implications for security-sensitive deployments?
What benchmark overfitting risks exist—situations where high benchmark scores mask poor real-world performance?
