Chapter 65

Benchmarks in Practice: AgentBench, GAIA and Terminal-Bench

Benchmarks are an agent's mirror. An agent that feels fluent in casual conversation may expose fundamental weaknesses when confronted with AgentBench's database tasks or GAIA's multi-step reasoning challenges. This chapter dissects the design philosophy, task characteristics, and scoring logic of four major benchmarks, analyzes Hermes Agent's performance on each, and provides a complete operational guide for reproducing these tests in your own environment.


65.1 AgentBench: Stress-Testing Across Eight Environments

65.1.1 Design Philosophy

AgentBench, released by Tsinghua KEG Lab, rests on a single conviction: agents must complete real tasks in real environments, not answer closed questions. It constructs eight distinct execution environments, each corresponding to a major real-world use case.

AgentBench Environment Overview:
┌────────────────────────┬─────────────────────────┐
│ Operating System (OS)  │ Database (DB)           │
│ Knowledge Graph (KG)   │ Digital Card Game (DCG) │
│ Lateral Thinking (LTP) │ House Holding (HH)      │
│ Web Shopping (WS)      │ Web Browsing (WB)       │
└────────────────────────┴─────────────────────────┘

65.1.2 Eight Environments in Detail

Environment 1: Operating System (OS)

Task type: Complete file operations, process management, and permission configuration in a Bash shell.

Evaluation focus: Command syntax accuracy, multi-step planning, error recovery.

Example trace:

# Task: Find all Python files > 10 MB under /data, compress them,
#       and move the archive to /backup

[Step 1] find /data -name "*.py" -size +10M
         → /data/models/gpt_weights.py  /data/logs/training.py

[Step 2] tar -czf /tmp/large_py.tar.gz \
         /data/models/gpt_weights.py /data/logs/training.py

[Step 3] mv /tmp/large_py.tar.gz /backup/

[Step 4] echo "Archived to /backup/large_py.tar.gz"

Hermes analysis: Strong on OS tasks due to precise argument passing. Main weakness: generating complex one-liner pipes correctly in a single shot.
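
For comparison, the same find, compress, move sequence can be sketched in Python, which avoids the quoting problems a one-line shell pipe runs into with unusual file names. The function name `archive_large_files` and its parameters are illustrative assumptions; they are not part of AgentBench or Hermes.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def archive_large_files(root, backup_dir, min_bytes, pattern="*.py",
                        archive_name="large_py.tar.gz"):
    """Compress files under root that match pattern and exceed min_bytes,
    then move the archive into backup_dir.

    Returns the final archive path, or None when nothing matched.
    """
    root, backup_dir = Path(root), Path(backup_dir)
    matches = sorted(p for p in root.rglob(pattern)
                     if p.is_file() and p.stat().st_size > min_bytes)
    if not matches:
        return None

    # Build the archive in a scratch directory so a failed run leaves
    # nothing half-written in the backup location.
    with tempfile.TemporaryDirectory() as scratch:
        tmp_archive = Path(scratch) / archive_name
        with tarfile.open(tmp_archive, "w:gz") as tar:
            for path in matches:
                tar.add(path, arcname=str(path.relative_to(root)))
        backup_dir.mkdir(parents=True, exist_ok=True)
        final_path = backup_dir / archive_name
        shutil.move(str(tmp_archive), str(final_path))
    return final_path
```

Calling `archive_large_files("/data", "/backup", 10 * 1024 * 1024)` would mirror the task in the trace, with the size threshold expressed in bytes.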

Environment 2: Database (DB)

Task type: Query or modify MySQL/SQLite databases for data analysis or transformation.

# Example SQL generated by Hermes for a quarterly sales + YoY task
sql = """
WITH quarterly_sales AS (
    SELECT product_id, product_name,
           YEAR(sale_date) AS yr, QUARTER(sale_date) AS qtr,
           SUM(amount) AS total
    FROM sales
    WHERE sale_date >= DATE_SUB(CURDATE(), INTERVAL 2 YEAR)
    GROUP BY product_id, product_name, yr, qtr
),
ranked AS (
    SELECT *, RANK() OVER (PARTITION BY yr, qtr ORDER BY total DESC) AS rnk
    FROM quarterly_sales
)
SELECT r1.product_name, r1.yr, r1.qtr,
       r1.total AS current_total,   -- avoid CURRENT, a reserved word in standard SQL
       r2.total AS prev_year,
       ROUND((r1.total - r2.total) / NULLIF(r2.total, 0) * 100, 2) AS yoy_pct
FROM ranked r1
LEFT JOIN ranked r2
    ON r1.product_id = r2.product_id
   AND r1.qtr = r2.qtr AND r1.yr = r2.yr + 1
WHERE r1.rnk <= 5
ORDER BY r1.yr, r1.qtr, r1.rnk;
"""

Hermes analysis: SQL generation is Hermes's strongest sub-skill. Weakness: schema inference when column names are ambiguous.
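
The query above relies on MySQL's YEAR(), QUARTER(), and DATE_SUB(), none of which exist in SQLite, the other engine this environment uses. A minimal sketch of the year-over-year join on SQLite follows, using Python's built-in sqlite3 module. The `sales` columns are taken from the query above; the rows are synthetic, and the ranking step is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product_id INTEGER, product_name TEXT,
                    sale_date TEXT, amount REAL);
-- Synthetic rows for illustration only
INSERT INTO sales VALUES
  (1, 'Widget', '2023-02-10', 100.0),
  (1, 'Widget', '2024-02-12', 150.0),
  (2, 'Gadget', '2023-03-05', 200.0),
  (2, 'Gadget', '2024-01-20', 180.0);
""")

query = """
WITH quarterly_sales AS (
    SELECT product_id, product_name,
           CAST(strftime('%Y', sale_date) AS INTEGER) AS yr,
           -- integer division maps months 1-3 to 1, 4-6 to 2, and so on
           (CAST(strftime('%m', sale_date) AS INTEGER) + 2) / 3 AS qtr,
           SUM(amount) AS total
    FROM sales
    GROUP BY product_id, product_name, yr, qtr
)
SELECT cur.product_name, cur.yr, cur.qtr,
       cur.total AS current_total,
       prev.total AS prev_year,
       ROUND((cur.total - prev.total) / NULLIF(prev.total, 0) * 100, 2) AS yoy_pct
FROM quarterly_sales AS cur
LEFT JOIN quarterly_sales AS prev
    ON cur.product_id = prev.product_id
   AND cur.qtr = prev.qtr AND cur.yr = prev.yr + 1
ORDER BY cur.yr, cur.qtr, cur.product_name;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Quarters with no prior-year counterpart come back with NULL in `prev_year` and `yoy_pct`, which is the behavior the LEFT JOIN is there to provide.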

Environments 3–8 Summary

| Env | Focus            | Hermes Strength        | Hermes Weakness           |
|-----|------------------|------------------------|---------------------------|
| KG  | SPARQL multi-hop | Relation path tracing  | 4+ hop queries            |
| DCG | Game strategy    | Short-horizon tactics  | Long-game planning        |
| LTP | Lateral puzzles  | Hypothesis elimination | Efficient question design |
| HH  | Household tasks  | Task decomposition     | Spatial reasoning         |
| WS  | Web shopping     | Requirement matching   | Price comparison logic    |
| WB  | Web navigation   | Navigation strategy    | Deep link extraction      |

65.1.3 Hermes AgentBench Score Summary

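| Environment | Hermes | GPT-4 | Notes                        |
|-------------|--------|-------|------------------------------|
| OS          | 72.4   | 68.1  | Second-highest Hermes score  |
| DB          | 78.9   | 74.3  | SQL generation strength      |
| KG          | 61.2   | 58.7  | Multi-hop needs work         |
| DCG         | 55.8   | 52.1  | Average game reasoning       |
| LTP         | 69.3   | 65.4  | Efficient hypothesis pruning |
| HH          | 48.7   | 51.2  | Spatial reasoning gap        |
| WS          | 64.1   | 62.8  | Smooth purchase flow         |
| WB          | 70.6   | 67.9  | Strong navigation            |
| Overall     | 65.1   | 62.6  |                              |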
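
The Overall row is consistent with an unweighted mean over the eight environments. The short check below reproduces it from the per-environment scores in the table and also reports where Hermes trails GPT-4; the variable names are illustrative.

```python
from statistics import mean

# (Hermes, GPT-4) scores copied from the table above
scores = {
    "OS":  (72.4, 68.1),
    "DB":  (78.9, 74.3),
    "KG":  (61.2, 58.7),
    "DCG": (55.8, 52.1),
    "LTP": (69.3, 65.4),
    "HH":  (48.7, 51.2),
    "WS":  (64.1, 62.8),
    "WB":  (70.6, 67.9),
}

hermes_overall = round(mean(h for h, _ in scores.values()), 1)
gpt4_overall = round(mean(g for _, g in scores.values()), 1)

# Per-environment margin: positive means Hermes leads
gaps = {env: round(h - g, 1) for env, (h, g) in scores.items()}
weakest = min(gaps, key=gaps.get)

print(hermes_overall, gpt4_overall)
print(weakest, gaps[weakest])
```

House Holding is the only environment with a negative margin, which matches the spatial reasoning gap noted in the table.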

65.2 GAIA: Real-World Challenges Across Three Difficulty Levels

65.2.1 Design Philosophy

GAIA (General AI Assistants benchmark), published by Meta and Hugging Face, tests AI on realistic assistant tasks rather than academic constructs. Its questions are human-written to mirror real requests, and every answer is objectively verifiable.

Three core principles:

  1. Tasks are relatively easy for skilled humans (completable in under 15 minutes)
  2. Answers are precise and objective (no fuzzy responses accepted)
  3. Tasks require combining multiple skills (search + file reading + calculation)

65.2.2 Three Difficulty Levels

Level 1: Single- or Double-Step Reasoning

Characteristics: 1–2 tool calls, clear information sources, unambiguous answer format.

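# Hermes execution for a Level 1 task
# ToolCall is a minimal stand-in record so the trace can be constructed as written
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict

trace = [
    ToolCall(tool="web_search",
             args={"query": "Tokyo population latest census 2024", "num_results": 5}),
    ToolCall(tool="web_fetch",
             args={"url": "https://www.stat.go.jp/english/..."}),  # Official source
]
answer = "13,960,000 (2020 Census)"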

Hermes Level 1 success rate: ~71% (main difficulty: data recency)

Level 2: Multi-Step + Cross-Source Integration

Characteristics: 3–7 tool calls; information from multiple sources; may require simple calculations.

import asyncio

# agent, parse_names, parse_school, and find_highest_ranked are assumed to be
# defined elsewhere; they are illustrative helpers, not a published API.
async def solve_level2(task: str) -> str:
    # Step 1: Find laureates
    laureates_raw = await agent.tool_call(
        "web_search", {"query": "2023 Nobel Prize Physics winners"})
    laureates = parse_names(laureates_raw)

    # Step 2: Find their alma maters (in parallel)
    schools_raw = await asyncio.gather(*[
        agent.tool_call("web_search", {"query": f"{name} PhD university"})
        for name in laureates
    ])
    # Reduce each raw search result to a school name before reusing it in a query
    schools = [parse_school(raw) for raw in schools_raw]

    # Step 3: Fetch QS rankings (in parallel)
    rankings = await asyncio.gather(*[
        agent.tool_call("web_search", {"query": f"{school} QS world ranking 2024"})
        for school in schools
    ])

    return find_highest_ranked(zip(schools, rankings))

Hermes Level 2 success rate: ~48%
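
The parallel fan-out in steps 2 and 3 can be tried without a live agent. The sketch below uses a stub in place of the real tool layer and adds a concurrency cap, which is an assumption about what a rate-limited search tool would need. `StubAgent`, `fan_out`, and the placeholder queries are all hypothetical.

```python
import asyncio


class StubAgent:
    """Stand-in for a tool-calling agent: returns canned text after a short delay."""

    def __init__(self, canned):
        self.canned = canned
        self.calls = 0

    async def tool_call(self, tool, args):
        self.calls += 1
        await asyncio.sleep(0.01)  # simulate network latency
        return self.canned.get(args["query"], "")


async def fan_out(agent, queries, max_concurrency=3):
    """Issue one web_search per query, at most max_concurrency at a time.

    asyncio.gather returns results in the same order as its inputs, so the
    output can be zipped back against the queries.
    """
    limit = asyncio.Semaphore(max_concurrency)

    async def one(query):
        async with limit:
            return await agent.tool_call("web_search", {"query": query})

    return await asyncio.gather(*(one(q) for q in queries))


# Placeholder names and answers, not real search results
canned = {"alpha PhD university": "University A",
          "beta PhD university": "University B"}
agent = StubAgent(canned)
queries = [f"{name} PhD university" for name in ("alpha", "beta")]
results = asyncio.run(fan_out(agent, queries))
print(list(zip(queries, results)))
```

Because result order follows input order, a failure to parse one result does not shift the others out of alignment with their queries.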

Level 3: Complex Multi-Tool + Deep Reasoning

Characteristics: 8+ tool calls; file operations (PDF/Excel); complex computation or code execution.

Example task:
Download this 2022 global carbon emissions report PDF,
identify the top-5 emitting countries, compare their 2010 data
from a second report, compute 12-year change rates, and
plot a comparison chart in Python.

Required tool chain:
web_search → file_download → pdf_parser → python_executor

Hermes Level 3 success rate: ~23%
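
The final python_executor step of such a task is mostly arithmetic. A minimal sketch of the ranking and change-rate computation follows, under the assumption that the two reports have already been parsed into name-to-value mappings. The function names are illustrative and the numbers are synthetic placeholders, not real emissions data.

```python
def change_rate(old, new):
    """Percentage change from old to new; None when old is zero."""
    if old == 0:
        return None
    return (new - old) / old * 100.0


def top_n_changes(current, baseline, n=5):
    """Rank entities by current value, then attach the change since baseline.

    Entities missing from the baseline get None instead of a guessed value.
    """
    top = sorted(current, key=current.get, reverse=True)[:n]
    return [
        (name, current[name],
         change_rate(baseline[name], current[name]) if name in baseline else None)
        for name in top
    ]


# Synthetic placeholder values
current_2022 = {"A": 120.0, "B": 90.0, "C": 45.0}
baseline_2010 = {"A": 100.0, "B": 100.0}

for row in top_n_changes(current_2022, baseline_2010, n=3):
    print(row)
```

Returning None for a missing baseline keeps a parsing gap visible in the output, which matters when the answer must match exactly.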

65.2.3 GAIA Results Overview

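| Level | Tasks | Avg Steps | Hermes | Human Expert | GPT-4 |
|-------|-------|-----------|--------|--------------|-------|
| 1     | 165   | 1–2       | 71.5%  | 92.1%        | 67.3% |
| 2     | 86    | 3–7       | 47.7%  | 84.6%        | 39.2% |
| 3     | 25    | 8+        | 22.9%  | 75.4%        | 14.8% |
| Total | 276   | n/a       | 55.4%  | 86.9%        | 47.8% |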

65.3 Terminal-Bench 2.0: Code Execution Under Pressure

65.3.1 Overview

Terminal-Bench 2.0 evaluates agents in terminal environments on code execution, debugging, and system operations. It is stricter than AgentBench's OS environment: tasks are longer, environments are more complex, and errors are harder to recover from.

Test matrix:

| Category                 | Difficulty  | Typical Steps |
|--------------------------|-------------|---------------|
| Environment setup        | Medium      | 5–15          |
| Code debugging           | Hard        | 8–20          |
| Performance optimization | Hard        | 10–25         |
| Security audit           | Expert      | 15–30         |
| System automation        | Medium-Hard | 8–18          |

65.3.2 Debugging Scenario

# Task: Fix all bugs in this quicksort implementation

buggy_code = """
def quicksort(arr, low=0, high=None):
    if high is None:
        high = len(arr)          # Bug 1: should be len(arr) - 1
    if low < high:
        pi = partition(arr, low, high)
        quicksort(arr, low, pi - 1)
        quicksort(arr, pi + 1, high)
    return arr

def partition(arr, low, high):
    pivot = arr[high]            # Bug 2: off-by-one risk from Bug 1
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] == arr[j], arr[i]  # Bug 3: == not =
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1
"""

# Hermes debugging trace
trace = [
    {"step": 1, "action": "read_code",    "finding": "Identified 3 potential bugs"},
    {"step": 2, "action": "run_code",     "result": "IndexError: list index out of range"},
    {"step": 3, "action": "fix_bug_1",    "change": "high = len(arr) - 1"},
    {"step": 4, "action": "run_code",     "result": "Wrong output: partially sorted"},
    {"step": 5, "action": "analyze",      "finding": "Comparison (==) used where assignment (=) was intended in partition's swap"},
    {"step": 6, "action": "fix_bug_3",    "change": "arr[i], arr[j] = arr[j], arr[i]"},
    {"step": 7, "action": "run_code",     "result": "[1, 1, 2, 3, 6, 8, 10] ✓"},
]
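
For reference, here is the implementation with both fixes from the trace applied, followed by a randomized check against Python's built-in `sorted`. The input list is an assumption chosen to match the output shown in step 7; the original task input is not given.

```python
import random


def partition(arr, low, high):
    """Lomuto partition: move arr[high] to its sorted position and return that index."""
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]  # fix for Bug 3: assignment, not ==
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1


def quicksort(arr, low=0, high=None):
    """Sort arr in place and return it."""
    if high is None:
        high = len(arr) - 1  # fix for Bug 1: last valid index
    if low < high:
        pi = partition(arr, low, high)
        quicksort(arr, low, pi - 1)
        quicksort(arr, pi + 1, high)
    return arr


print(quicksort([3, 6, 8, 10, 1, 2, 1]))

# Randomized regression check against the built-in sort
rng = random.Random(0)
for _ in range(200):
    data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 30))]
    assert quicksort(list(data)) == sorted(data)
```

A randomized comparison like this is a cheap way for an agent to confirm a fix beyond the single failing input it started from.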

65.3.3 Terminal-Bench 2.0 Hermes Results

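| Category          | Tasks | Success Rate | Avg Steps | Top Failure Cause           |
|-------------------|-------|--------------|-----------|-----------------------------|
| Env setup         | 45    | 68.9%        | 11.2      | Version compatibility       |
| Debugging         | 60    | 54.7%        | 14.8      | Concurrent bugs             |
| Perf optimization | 35    | 41.2%        | 18.3      | Bottleneck localization     |
| Security audit    | 20    | 29.5%        | 22.1      | Advanced CVE recognition    |
| Automation        | 40    | 62.3%        | 13.5      | Error handling completeness |
| Overall           | 200   | 53.6%        | 15.2      |                             |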

65.4 YC-Bench: Startup Scenario Assessment

65.4.1 Background

YC-Bench is designed for startup engineering scenarios, simulating technical and business tasks common in Y Combinator batches. The core assumption: a truly practical AI agent should handle a startup's full-stack work.

| Category               | Weight | Typical Task            |
|------------------------|--------|-------------------------|
| Product prototyping    | 25%    | Requirements → MVP code |
| Data analysis          | 20%    | User behavior insights  |
| Competitive analysis   | 15%    | Market research report  |
| System architecture    | 20%    | Design proposals        |
| Content creation       | 10%    | Tech blog / docs        |
| Customer communication | 10%    | Emails / pitch drafts   |
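
If the category weights combine linearly into a single score, which the table suggests but this section does not state, the aggregate could be computed as sketched below. `weighted_score` is a hypothetical helper, and any per-category scores passed to it are the caller's own measurements.

```python
# Category weights copied from the table above (percent)
WEIGHTS = {
    "product_prototyping": 25,
    "data_analysis": 20,
    "competitive_analysis": 15,
    "system_architecture": 20,
    "content_creation": 10,
    "customer_communication": 10,
}


def weighted_score(category_scores, weights=WEIGHTS):
    """Weighted mean of per-category scores, on the same scale as the inputs.

    Raises ValueError when a weighted category has no score, so a missing
    category cannot silently count as zero.
    """
    missing = sorted(set(weights) - set(category_scores))
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    total_weight = sum(weights.values())
    return sum(category_scores[c] * w for c, w in weights.items()) / total_weight
```

With every category at the same score the aggregate equals that score, which is a quick sanity check that the weights sum to 100.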

65.4.2 Example Task: Retention Analysis

yc_task = """
Our SaaS launched 6 months ago. Given this CSV (user_id, signup_date,
last_active_date, plan_type, country), please:
1. Compute D1/D7/D30 retention rates by month
2. Compare retention: Free vs Paid users
3. Identify the top 3 countries by retention
4. Suggest 3 concrete retention improvements
"""

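# Core analysis snippet from Hermes
import pandas as pd

df = pd.read_csv('users.csv', parse_dates=['signup_date', 'last_active_date'])
# Proxy for retention: a user counts as retained at day d if their last
# activity falls at least d days after signup
df['days_active'] = (df['last_active_date'] - df['signup_date']).dt.days

DAYS = [1, 7, 30]
retention = {f"D{d}": (df['days_active'] >= d).mean() for d in DAYS}

# Iterate over the groups directly instead of groupby.apply, which would
# return a Series of dicts
by_plan = {
    plan: {f"D{d}": (group['days_active'] >= d).mean() for d in DAYS}
    for plan, group in df.groupby('plan_type')
}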
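
The snippet above covers the overall and per-plan rates. Items 1 and 3 of the task, the monthly breakdown and the country ranking, can be sketched the same way. The rows below are synthetic and follow the CSV layout from the task description; ranking countries by D30 is an assumption, since the task does not say which retention figure to rank on.

```python
import pandas as pd

# Synthetic rows in the CSV layout from the task description
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "signup_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-10"]),
    "last_active_date": pd.to_datetime(
        ["2024-01-06", "2024-03-01", "2024-02-03", "2024-02-20"]),
    "plan_type": ["free", "paid", "free", "paid"],
    "country": ["US", "DE", "US", "DE"],
})

df["days_active"] = (df["last_active_date"] - df["signup_date"]).dt.days
cols = []
for d in (1, 7, 30):
    df[f"D{d}"] = df["days_active"] >= d  # boolean flag per retention window
    cols.append(f"D{d}")

# Item 1: retention by signup month (mean of a boolean column is a rate)
by_month = df.groupby(df["signup_date"].dt.to_period("M"))[cols].mean()

# Item 3: top 3 countries, ranked here by D30 retention
by_country = (df.groupby("country")[cols].mean()
                .sort_values("D30", ascending=False)
                .head(3))

print(by_month)
print(by_country)
```

Users who signed up fewer than 30 days before the data was exported cannot have reached D30 yet, so the most recent cohort's D30 figure understates retention unless those users are excluded.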

65.5 Running Benchmarks in Your Own Environment

65.5.1 AgentBench Setup

git clone https://github.com/THUDM/AgentBench.git && cd AgentBench
pip install -r requirements.txt

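# Configure Hermes. The quoted 'EOF' keeps ${HERMES_API_KEY} literal in the file,
# so the key itself is never written to disk; this assumes the config loader
# expands environment variables when it reads the YAML.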
cat > configs/agents/hermes.yaml << 'EOF'
name: hermes-agent
type: api
base_url: https://api.hermes.nousresearch.com/v1
model: hermes-3-70b
api_key: ${HERMES_API_KEY}
temperature: 0.0
max_tokens: 4096
EOF

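# Run a single environment (script name and flags as used in this chapter;
# confirm the entry point against the repository README for your checkout)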
python run_eval.py \
  --agent configs/agents/hermes.yaml \
  --task configs/tasks/os.yaml \
  --output results/hermes_os.json

# Full suite (~4-6 hours)
python run_eval.py --agent configs/agents/hermes.yaml \
  --all-tasks --parallel 4 --output results/hermes_full.json

65.5.2 GAIA Setup

from datasets import load_dataset
from hermes_agent import HermesAgent
import os

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")
agent = HermesAgent(
    api_key=os.environ["HERMES_API_KEY"],
    tools=["web_search", "web_fetch", "python_executor", "file_reader"]
)

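# normalize_answer is assumed to be defined elsewhere: it should canonicalize
# case, whitespace, and number formatting before the exact-match comparison.
results = []
for task in gaia["validation"]:
    level = int(task["Level"])  # cast so the comparison also works if Level is a string
    response = agent.run(
        task=task["Question"],
        max_steps=20 if level <= 2 else 40,
        timeout=300
    )
    success = normalize_answer(response.answer) == normalize_answer(task["Final answer"])
    results.append({"level": level, "success": success, "steps": len(response.trace)})

for level in [1, 2, 3]:
    subset = [r for r in results if r["level"] == level]
    if not subset:  # avoid ZeroDivisionError when a level has no tasks
        continue
    rate = sum(r["success"] for r in subset) / len(subset)
    print(f"Level {level}: {rate:.1%}  ({len(subset)} tasks)")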
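
The loop above calls `normalize_answer` without defining it. A minimal sketch is given below, assuming that exact match after light canonicalization is an acceptable approximation. This is not the official GAIA scorer, which applies its own normalization rules, so scores computed this way may differ from leaderboard numbers.

```python
import re
import string


def normalize_answer(text):
    """Reduce an answer to a canonical string for exact-match comparison."""
    s = str(text).strip().lower()

    # Numeric answers: drop currency and percent signs, thousands separators,
    # and internal spaces, then compare by value.
    numeric = re.sub(r"[,$%\s]", "", s)
    try:
        value = float(numeric)
    except ValueError:
        pass
    else:
        # Render whole numbers without a trailing ".0" so "42" matches "42.0"
        return str(int(value)) if value.is_integer() else repr(value)

    # Text answers: strip ASCII punctuation and collapse runs of whitespace
    s = s.translate(str.maketrans("", "", string.punctuation))
    return " ".join(s.split())


print(normalize_answer("13,960,000"))
print(normalize_answer("  St. Petersburg "))
```

Stripping units and punctuation makes the comparison more forgiving than a raw string match, at the cost of treating "3.5%" and "3.5" as equal; whether that trade is acceptable depends on the task set.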

65.5.3 Interpreting Results: Capability Boundaries

def interpret_results(results: dict) -> list[str]:
    # Defaults are chosen so that a missing metric never triggers a message:
    # "below threshold" checks default high, "above threshold" checks default low.
    insights = []
    ab = results.get("agentbench", {})
    if ab.get("os_score", 100) < 60:
        insights.append("WARN: Bash capability insufficient for DevOps automation")
    if ab.get("db_score", 0) > 75:
        insights.append("OK: Strong SQL; suitable for data analysis pipelines")
    if ab.get("hh_score", 100) < 50:
        insights.append("WARN: Spatial reasoning weak; avoid physical-world tasks")

    gaia = results.get("gaia", {})
    if gaia.get("level3_rate", 1.0) < 0.25:
        insights.append("WARN: Complex multi-step success rate low; decompose tasks")
    if gaia.get("level1_rate", 0) > 0.70:
        insights.append("OK: Basic retrieval solid; suitable for simple Q&A")

    tb = results.get("terminal_bench", {})
    if tb.get("security_audit_rate", 1.0) < 0.35:
        insights.append("WARN: Security audit capability limited; human review required")

    return insights
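
The same thresholds can also be expressed as data, which makes them easier to review and extend. The sketch below is a table-driven variant applied to the scores reported earlier in this chapter. `RULES` and `apply_rules` are illustrative names, and the result-dictionary layout is assumed to match the one `interpret_results` reads.

```python
import operator

# (suite, metric, comparison, threshold, message); thresholds mirror interpret_results
RULES = [
    ("agentbench", "os_score", operator.lt, 60,
     "WARN: Bash capability insufficient for DevOps automation"),
    ("agentbench", "db_score", operator.gt, 75,
     "OK: Strong SQL; suitable for data analysis pipelines"),
    ("agentbench", "hh_score", operator.lt, 50,
     "WARN: Spatial reasoning weak; avoid physical-world tasks"),
    ("gaia", "level3_rate", operator.lt, 0.25,
     "WARN: Complex multi-step success rate low; decompose tasks"),
    ("gaia", "level1_rate", operator.gt, 0.70,
     "OK: Basic retrieval solid; suitable for simple Q&A"),
    ("terminal_bench", "security_audit_rate", operator.lt, 0.35,
     "WARN: Security audit capability limited; human review required"),
]


def apply_rules(results, rules=RULES):
    """Return the message of every rule whose metric is present and crosses its threshold."""
    insights = []
    for suite, metric, compare, threshold, message in rules:
        value = results.get(suite, {}).get(metric)
        if value is not None and compare(value, threshold):
            insights.append(message)
    return insights


# Scores reported in this chapter's result tables
reported = {
    "agentbench": {"os_score": 72.4, "db_score": 78.9, "hh_score": 48.7},
    "gaia": {"level1_rate": 0.715, "level3_rate": 0.229},
    "terminal_bench": {"security_audit_rate": 0.295},
}

for line in apply_rules(reported):
    print(line)
```

On the reported numbers this yields two OK lines (SQL and basic retrieval) and three warnings (spatial reasoning, Level 3 tasks, and security audit); a metric that was never measured produces no message at all.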

Chapter Summary

This chapter examined four major benchmarks and Hermes's performance on each:

  1. AgentBench: Eight environments spanning OS, DB, KG, and Web; Hermes overall 65.1; strongest in DB and OS
  2. GAIA: Three difficulty tiers; Level 3 is challenging for all agents; Hermes 55.4% overall
  3. Terminal-Bench 2.0: Strict code execution testing; security audit is the biggest weakness (29.5%)
  4. YC-Bench: Startup scenario evaluation, closest to real business needs
  5. Running locally: Complete steps from environment configuration to result interpretation

Discussion Questions

  1. If your primary use case is data analysis, which benchmark environment should you focus on to decide whether Hermes fits?
  2. GAIA Level 3 success rates are low across all models (best < 30%). What does this tell you? How can engineering (not model upgrades) improve complex task success?
  3. Terminal-Bench security audit scores only 29.5%. What are the implications for security-sensitive deployments?
  4. What benchmark overfitting risks exist, that is, situations where high benchmark scores mask poor real-world performance?