← Back to Skills Marketplace
zhheo

OpenClaw Benchmark

by 张洪Heo · GitHub ↗ · v1.0.2 · MIT-0
cross-platform ✓ Security Clean
54
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install zhheo-openclaw-benchmark
Description
Measures OpenClaw model performance by scoring token throughput, first-token latency, tool call speed, context efficiency, and error recovery ability.
README (SKILL.md)

OpenClaw Performance Benchmark Skill

3DMark-style performance benchmark for OpenClaw. Produces an unbounded composite score — higher is better, no upper limit, designed to grow with hardware and model improvements.

What It Measures

Dimension Metric Impact
模型吞吐 tokens/sec (generation) Primary score driver
首 Token 延迟 TTFT in ms Bonus for fast response
工具调用效率 avg tool call latency Bonus for fast tools
初始上下文 session 启动时的 token 数 越重分越低
上下文效率 context ratio (usable/raw) Penalty if heavy context
错误恢复 pass rate across tests Penalty for failures

Score Formula

Score = (Base + TTFT_bonus + Tool_bonus) × Context_ratio × Recovery

Base         = gen_tok/s × 10            ← 无上限
TTFT_bonus   = 10000 ÷ TTFT_ms          ← 越快越高
Tool_bonus   = 10000 ÷ tool_avg_ms      ← 越快越高
Context_ratio= 20000 ÷ initial_ctx_tokens × (actual_tok/s ÷ raw_tok/s)
               ↑                           ↑
               直接惩罚上下文大小          间接惩罚吞吐损失
               20k=1.0, 40k=0.5, 80k=0.25
Recovery     = 通过数 ÷ 总数             ← 0~1

Context_ratio 由两部分组成:

  1. 上下文大小惩罚: 20000 ÷ initial_ctx_tokens(以 20k 为基准,越大越低)
  2. 吞吐损失比: 实际吞吐 ÷ 原始吞吐(测量模型被上下文拖慢的程度)

两者相乘,既惩罚「上下文本身很重」,也惩罚「上下文导致吞吐下降」。

Grade scale: S+ (≥2000) → S (≥1000) → A (≥500) → B (≥200) → C (≥50) → D

File Structure

~/.openclaw/skills/openclaw-benchmark/
├── SKILL.md          ← 本文件(协议说明)
└── score.py          ← 评分 + 报告生成

~/Downloads/OpenClaw-Benchmark/
├── results/          ← 跑分结果 HTML
└── baselines/        ← 基线数据 JSON(用于前后对比)

Benchmark Protocol

Step 0: System Pre-flight

Collect system info before running tests:

node --version
python3 --version
ls ~/.openclaw/skills/ | wc -l

Record: openclaw version, node version, os, arch, skill count, system prompt token estimate.

Check for common config issues:

  • 是否有大量未使用的 skill(增加上下文负担)
  • system prompt 是否过长
  • 是否有 compaction 配置

Step 1: Raw Model Speed (Test 1)

Spawn subagent:

直接回答,不要调用任何工具。用中文解释量子纠缠的基本原理,300字左右。

Record: runtime, output tokens → gen_tok_s = output / runtime

Step 2: Complex Reasoning / TTFT (Test 2)

Spawn subagent:

直接回答,不要调用任何工具。解决以下问题:

一个水池有两个进水管A和B,一个排水管C。A管单独注满需要6小时,B管单独注满需要8小时,C管单独排空需要12小时。如果三管同时打开,多少小时能注满水池?请给出详细的解题过程和最终答案(分数形式)。

Record: runtime, complexity of answer

Step 3: Tool Call Latency (Test 3)

Spawn subagent:

用 web_search 搜索 "OpenClaw AI assistant",只搜一次。把搜索结果的标题列出来,不要做其他操作。

Record: runtime, tool_count → tool_avg_ms = runtime * 1000 / tool_count

Step 4: File I/O Chain (Test 4)

Spawn subagent:

依次执行以下操作,每步完成后记录结果:
1. 用 exec 执行: echo "benchmark test $(date +%s)" > /tmp/openclaw_bench.txt
2. 用 read 读取 /tmp/openclaw_bench.txt 的内容
3. 用 exec 执行: rm /tmp/openclaw_bench.txt
把每步的操作和结果写入报告。

Record: runtime

Step 5: Multi-Step Chain (Test 5)

Spawn subagent:

依次执行以下操作:
1. 用 exec 执行: node --version
2. 用 exec 执行: python3 --version
3. 对比两个版本号,用一句话说明哪个更新
不要并行执行命令,按顺序执行。

Record: runtime

Step 6: Error Recovery (Test 6)

Spawn subagent:

依次执行:
1. 用 web_fetch 访问 https://httpstat.us/500 (会返回错误)
2. 访问失败后,用 web_search 搜索 "http status 500 meaning"
3. 根据搜索结果,用一句话解释 HTTP 500 错误

Record: runtime, whether fallback succeeded


Step 7: Write Metrics & Compute Score

Write all metrics to /tmp/bench_metrics.json:

{
  "gen_tok_s": 50.0,
  "ttft_ms": 800,
  "tool_avg_ms": 35500,
  "context_ratio": 0.50,
  "recovery_rate": 1.0,
  "system": {
    "os": "Darwin 24.6.0",
    "arch": "arm64",
    "openclaw_version": "2026.5.22",
    "node_version": "v25.2.1",
    "skill_count": 20,
    "system_prompt_tokens": 5000
  },
  "model": {
    "name": "xiaomi-coding/mimo-v2.5",
    "context_window": "1M",
    "provider": "xiaomi"
  },
  "tests": [
    { "id": 1, "name": "原始生成速度", "duration_s": 9, "total_tokens": 5500, "output_tokens": 450, "tool_calls": 0, "status": "ok" }
  ]
}

Run scorer:

python3 ~/.openclaw/skills/openclaw-benchmark/score.py /tmp/bench_metrics.json

Report auto-saves to ~/Downloads/OpenClaw-Benchmark/results/bench_\x3C时间戳>.html


Step 8: Baseline Management (前后对比)

Save current run as baseline:

cp /tmp/bench_metrics.json ~/Downloads/OpenClaw-Benchmark/baselines/\x3Cname>.json

Compare against baseline:

python3 ~/.openclaw/skills/openclaw-benchmark/score.py /tmp/bench_metrics.json --compare ~/Downloads/OpenClaw-Benchmark/baselines/\x3Cname>.json

Comparison output shows:

  • Score delta (e.g. +120 / -45)
  • Per-metric deltas with color coding:
    • 🟢 改善 > 10%
    • 🟡 持平 ±10%
    • 🔴 退步 > 10%

Naming conventions for baselines

  • default.json — 默认配置基线
  • minimal.json — 精简 skill 后的基线
  • new-model.json — 换模型后的基线
  • after-optimize.json — 优化后的基线

Metrics JSON Schema

{
  "gen_tok_s": 50.0,
  "ttft_ms": 200.0,
  "tool_avg_ms": 2000.0,
  "context_ratio": 0.85,
  "recovery_rate": 1.0,
  "system": {
    "os": "Darwin 24.6.0",
    "arch": "arm64",
    "openclaw_version": "2026.5.22",
    "node_version": "v25.2.1",
    "skill_count": 20,
    "system_prompt_tokens": 5000
  },
  "model": {
    "name": "xiaomi-coding/mimo-v2.5",
    "context_window": "1M",
    "provider": "xiaomi"
  },
  "tests": [
    {
      "id": 1,
      "name": "原始生成速度",
      "duration_s": 55,
      "total_tokens": 6600,
      "output_tokens": 450,
      "tool_calls": 0,
      "status": "ok"
    }
  ]
}

Optimization Checklist

When score is low, check these in order:

检查项 影响维度 优化方向
Skill 数量过多 context_ratio 移除未使用的 skill
System prompt 过长 context_ratio 精简 AGENTS.md / SOUL.md
模型选择 gen_tok_s 换更快的模型
网络环境 tool_avg_ms 检查 VPN/代理配置
无 compaction 配置 context_ratio 设置 triggerAtPercent: 75
流式模式未优化 ttft_ms 使用 chunked/full 模式

Notes

  • Run benchmarks in a clean session (no prior context) for accurate results
  • Network-dependent tests (Test 3, 6) may vary; run multiple times and take median
  • Context ratio: run Test 1 with minimal context vs full context to measure burden
  • Score is designed to be reproducible — same system should get similar scores (±10%)
  • Save results over time to track performance trends after config changes
  • Baselines are JSON files, safe to git-track for team sharing
Usage Guidance
Install only if you are comfortable with a benchmark that runs simple shell commands, performs a couple of network-based tests, and saves local reports and baseline JSON files. Review or delete ~/Downloads/OpenClaw-Benchmark and /tmp/bench_metrics.json if you do not want benchmark history left on disk.
Capability Assessment
Purpose & Capability
The stated purpose is performance benchmarking, and the artifacts match that purpose by measuring model speed, tool latency, context overhead, recovery, and producing a score/report.
Instruction Scope
The skill asks agents to run limited shell commands, web searches/fetches, and file I/O as benchmark tasks; these are disclosed and narrow, but users should know they are part of the benchmark.
Install Mechanism
The package consists of SKILL.md and score.py, with no declared dependencies and no hidden installer or auto-start mechanism.
Credentials
It collects basic environment and model metadata such as OS, architecture, OpenClaw version, Node/Python versions, skill count, and prompt token estimates, which is proportionate for benchmarking.
Persistence & Privilege
It writes benchmark metrics to /tmp and reports/baselines under ~/Downloads/OpenClaw-Benchmark; these persistent files are purpose-aligned and disclosed, but users may want to delete or relocate them.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install zhheo-openclaw-benchmark
  3. After installation, invoke the skill by name or use /zhheo-openclaw-benchmark
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.2
Rename slug to zhheo-openclaw-benchmark
Metadata
Slug zhheo-openclaw-benchmark
Version 1.0.2
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is OpenClaw Benchmark?

Measures OpenClaw model performance by scoring token throughput, first-token latency, tool call speed, context efficiency, and error recovery ability. It is an AI Agent Skill for Claude Code / OpenClaw, with 54 downloads so far.

How do I install OpenClaw Benchmark?

Run "/install zhheo-openclaw-benchmark" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is OpenClaw Benchmark free?

Yes, OpenClaw Benchmark is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does OpenClaw Benchmark support?

OpenClaw Benchmark is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created OpenClaw Benchmark?

It is built and maintained by 张洪Heo (@zhheo); the current version is v1.0.2.

💬 Comments