zhheo

OpenClaw Benchmark

Author: 张洪Heo · GitHub · v1.0.2 · MIT-0
cross-platform ✓ Security check passed
54 total downloads · 0 favorites · 0 current installs · 1 version
Install in OpenClaw
/install zhheo-openclaw-benchmark
Description
Measures OpenClaw model performance by scoring token throughput, first-token latency, tool call speed, context efficiency, and error recovery ability.
Usage (SKILL.md)

OpenClaw Performance Benchmark Skill

3DMark-style performance benchmark for OpenClaw. Produces an unbounded composite score — higher is better, no upper limit, designed to grow with hardware and model improvements.

What It Measures

Dimension | Metric | Impact
Model throughput | tokens/sec (generation) | Primary score driver
First-token latency | TTFT in ms | Bonus for fast response
Tool call efficiency | avg tool call latency | Bonus for fast tools
Initial context | token count at session start | Heavier context lowers the score
Context efficiency | context ratio (usable/raw) | Penalty if heavy context
Error recovery | pass rate across tests | Penalty for failures

Score Formula

Score = (Base + TTFT_bonus + Tool_bonus) × Context_ratio × Recovery

Base         = gen_tok/s × 10            ← unbounded
TTFT_bonus   = 10000 ÷ TTFT_ms          ← faster is higher
Tool_bonus   = 10000 ÷ tool_avg_ms      ← faster is higher
Context_ratio= 20000 ÷ initial_ctx_tokens × (actual_tok/s ÷ raw_tok/s)
               ↑                           ↑
               direct penalty for          indirect penalty for
               context size                throughput loss
               20k=1.0, 40k=0.5, 80k=0.25
Recovery     = passed ÷ total            ← 0 to 1

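Context_ratio is the product of two factors:

  1. Context size penalty: 20000 ÷ initial_ctx_tokens (20k tokens is the baseline; larger contexts score lower)
  2. Throughput loss ratio: actual throughput ÷ raw throughput (measures how much the context slows the model down)

Multiplying the two penalizes both a context that is heavy in itself and a context that reduces throughput.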

Grade scale: S+ (≥2000) → S (≥1000) → A (≥500) → B (≥200) → C (≥50) → D
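
The formula and grade scale above can be sketched in Python as follows. This is a minimal illustration built only from the definitions in this document, not the contents of score.py; the function names are hypothetical.

```python
# Hypothetical helper names; score.py may be organized differently.

def context_ratio(initial_ctx_tokens: float, actual_tok_s: float, raw_tok_s: float) -> float:
    """Size penalty (20k-token baseline) multiplied by the throughput-loss ratio."""
    return (20000 / initial_ctx_tokens) * (actual_tok_s / raw_tok_s)


def composite_score(gen_tok_s: float, ttft_ms: float, tool_avg_ms: float,
                    ctx_ratio: float, recovery: float) -> float:
    """Score = (Base + TTFT_bonus + Tool_bonus) x Context_ratio x Recovery."""
    base = gen_tok_s * 10
    ttft_bonus = 10000 / ttft_ms
    tool_bonus = 10000 / tool_avg_ms
    return (base + ttft_bonus + tool_bonus) * ctx_ratio * recovery


def grade(score: float) -> str:
    """Map a score to the grade scale: S+ / S / A / B / C / D."""
    for threshold, label in ((2000, "S+"), (1000, "S"), (500, "A"), (200, "B"), (50, "C")):
        if score >= threshold:
            return label
    return "D"
```

For example, gen_tok_s=50, ttft_ms=800, tool_avg_ms=35500, context_ratio=0.5, and recovery_rate=1.0 give (500 + 12.5 + 0.28) × 0.5 ≈ 256.4, which falls in grade B.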

File Structure

~/.openclaw/skills/openclaw-benchmark/
├── SKILL.md          ← this file (protocol description)
└── score.py          ← scoring + report generation

~/Downloads/OpenClaw-Benchmark/
├── results/          ← benchmark result HTML reports
└── baselines/        ← baseline JSON data (for before/after comparison)

Benchmark Protocol

Step 0: System Pre-flight

Collect system info before running tests:

node --version
python3 --version
ls ~/.openclaw/skills/ | wc -l

Record: openclaw version, node version, os, arch, skill count, system prompt token estimate.

Check for common config issues:

  • Whether there are many unused skills (they add to the context burden)
  • Whether the system prompt is too long
  • Whether compaction is configured
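
Part of the pre-flight data can also be gathered from Python. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it does not collect the OpenClaw version or the system prompt token estimate, because this document does not say how to obtain them.

```python
import platform
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def tool_version(command: str) -> Optional[str]:
    """Return the output of `<command> --version`, or None if the tool is not on PATH."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run([command, "--version"], capture_output=True, text=True)
    return (result.stdout or result.stderr).strip() or None


def collect_system_info(skills_dir: Path = Path.home() / ".openclaw" / "skills") -> dict:
    """Collect the subset of Step 0 fields that the standard library can provide."""
    if skills_dir.is_dir():
        # Skip dotfiles so the count matches `ls | wc -l`.
        skill_count = sum(1 for entry in skills_dir.iterdir() if not entry.name.startswith("."))
    else:
        skill_count = 0
    return {
        "os": f"{platform.system()} {platform.release()}",
        "arch": platform.machine(),
        "node_version": tool_version("node"),
        "python_version": platform.python_version(),
        "skill_count": skill_count,
    }
```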

Step 1: Raw Model Speed (Test 1)

Spawn subagent:

Answer directly without calling any tools. Explain the basic principles of quantum entanglement in Chinese, in about 300 characters.

Record: runtime, output tokens → gen_tok_s = output / runtime
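
The throughput calculation is a single division; a minimal sketch follows (the function name is hypothetical).

```python
def generation_throughput(output_tokens: int, runtime_s: float) -> float:
    """gen_tok_s = output tokens / runtime in seconds."""
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    return output_tokens / runtime_s
```

For instance, 450 output tokens in 9 seconds gives 50.0 tokens/sec.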

Step 2: Complex Reasoning / TTFT (Test 2)

Spawn subagent:

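Answer directly without calling any tools. Solve the following problem:

A pool has two inlet pipes, A and B, and one drain pipe, C. Pipe A alone fills the pool in 6 hours, pipe B alone fills it in 8 hours, and pipe C alone empties it in 12 hours. If all three pipes are opened at the same time, how many hours does it take to fill the pool? Give a detailed solution and the final answer (as a fraction).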

Record: runtime, complexity of answer
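
When checking the subagent's answer by hand, a reference value helps. The sketch below works the pool problem with exact fractions; this document does not say whether or how answer correctness feeds into the score, so treat it only as a sanity check.

```python
from fractions import Fraction

# Net fill rate in pools per hour: A and B fill, C drains.
fill_rate = Fraction(1, 6) + Fraction(1, 8) - Fraction(1, 12)  # 5/24 of the pool per hour

# Time to fill one pool is the reciprocal of the net rate.
hours_to_fill = 1 / fill_rate  # 24/5 hours, i.e. 4.8 hours
```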

Step 3: Tool Call Latency (Test 3)

Spawn subagent:

Use web_search to search for "OpenClaw AI assistant", only once. List the titles of the search results and do nothing else.

Record: runtime, tool_count → tool_avg_ms = runtime * 1000 / tool_count
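
The average tool latency follows directly from the recorded values; a minimal sketch with a guard against a zero tool count (the function name is hypothetical):

```python
def tool_avg_latency_ms(runtime_s: float, tool_count: int) -> float:
    """tool_avg_ms = runtime * 1000 / tool_count."""
    if tool_count <= 0:
        raise ValueError("tool_count must be at least 1")
    return runtime_s * 1000 / tool_count
```

A 35.5-second run with one tool call gives 35500.0 ms.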

Step 4: File I/O Chain (Test 4)

Spawn subagent:

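Perform the following operations in order and record the result after each step:
1. Use exec to run: echo "benchmark test $(date +%s)" > /tmp/openclaw_bench.txt
2. Use read to read the contents of /tmp/openclaw_bench.txt
3. Use exec to run: rm /tmp/openclaw_bench.txt
Write each step's operation and result into the report.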

Record: runtime

Step 5: Multi-Step Chain (Test 5)

Spawn subagent:

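Perform the following operations in order:
1. Use exec to run: node --version
2. Use exec to run: python3 --version
3. Compare the two version numbers and state in one sentence which is newer
Do not run the commands in parallel; run them sequentially.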

Record: runtime

Step 6: Error Recovery (Test 6)

Spawn subagent:

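Perform in order:
1. Use web_fetch to access https://httpstat.us/500 (this returns an error)
2. After the request fails, use web_search to search for "http status 500 meaning"
3. Based on the search results, explain the HTTP 500 error in one sentence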

Record: runtime, whether fallback succeeded
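
Recovery in the score formula is the number of passed tests divided by the total. A minimal sketch, assuming each test record carries a "status" field where "ok" means passed (any other value, such as the hypothetical "failed" below, counts as a failure):

```python
def recovery_rate(tests: list) -> float:
    """Recovery = passed / total, a value between 0 and 1."""
    if not tests:
        raise ValueError("at least one test result is required")
    passed = sum(1 for test in tests if test.get("status") == "ok")
    return passed / len(tests)


# Example: five tests passed and the error-recovery test failed.
results = [{"id": i, "status": "ok"} for i in range(1, 6)] + [{"id": 6, "status": "failed"}]
rate = recovery_rate(results)  # 5 / 6
```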


Step 7: Write Metrics & Compute Score

Write all metrics to /tmp/bench_metrics.json:

{
  "gen_tok_s": 50.0,
  "ttft_ms": 800,
  "tool_avg_ms": 35500,
  "context_ratio": 0.50,
  "recovery_rate": 1.0,
  "system": {
    "os": "Darwin 24.6.0",
    "arch": "arm64",
    "openclaw_version": "2026.5.22",
    "node_version": "v25.2.1",
    "skill_count": 20,
    "system_prompt_tokens": 5000
  },
  "model": {
    "name": "xiaomi-coding/mimo-v2.5",
    "context_window": "1M",
    "provider": "xiaomi"
  },
  "tests": [
    { "id": 1, "name": "原始生成速度", "duration_s": 9, "total_tokens": 5500, "output_tokens": 450, "tool_calls": 0, "status": "ok" }
  ]
}
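
Writing the metrics file can be done with the standard json module. The helper below is a hypothetical sketch; only the default path comes from this document.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, path: str = "/tmp/bench_metrics.json") -> Path:
    """Serialize the metrics dict as UTF-8 JSON and return the path written."""
    target = Path(path)
    # ensure_ascii=False keeps non-ASCII test names readable in the file.
    target.write_text(json.dumps(metrics, ensure_ascii=False, indent=2), encoding="utf-8")
    return target
```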

Run scorer:

python3 ~/.openclaw/skills/openclaw-benchmark/score.py /tmp/bench_metrics.json

The report is saved automatically to ~/Downloads/OpenClaw-Benchmark/results/bench_<timestamp>.html


Step 8: Baseline Management (Before/After Comparison)

Save current run as baseline:

cp /tmp/bench_metrics.json ~/Downloads/OpenClaw-Benchmark/baselines/<name>.json

Compare against baseline:

python3 ~/.openclaw/skills/openclaw-benchmark/score.py /tmp/bench_metrics.json --compare ~/Downloads/OpenClaw-Benchmark/baselines/<name>.json

Comparison output shows:

  • Score delta (e.g. +120 / -45)
  • Per-metric deltas with color coding:
    • 🟢 Improved by more than 10%
    • 🟡 Unchanged (within ±10%)
    • 🔴 Regressed by more than 10%
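
The ±10% color coding can be sketched as below. How score.py actually computes deltas is not described here, so this is an illustration under two assumptions: the change is measured relative to the baseline value, and ttft_ms and tool_avg_ms count as improved when they go down.

```python
# Assumption: for latency metrics, a lower value is an improvement.
LOWER_IS_BETTER = {"ttft_ms", "tool_avg_ms"}


def classify_delta(metric: str, baseline: float, current: float, threshold: float = 0.10) -> str:
    """Return 'improved', 'unchanged', or 'regressed' for one metric."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    change = (current - baseline) / baseline
    if metric in LOWER_IS_BETTER:
        change = -change  # a drop in latency counts as a gain
    if change > threshold:
        return "improved"
    if change < -threshold:
        return "regressed"
    return "unchanged"
```

Under these assumptions, gen_tok_s going from 50 to 60 is classified as improved, while ttft_ms going from 800 to 1000 is classified as regressed.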

Naming conventions for baselines

  • default.json: baseline for the default configuration
  • minimal.json: baseline after trimming skills
  • new-model.json: baseline after switching models
  • after-optimize.json: baseline after optimization

Metrics JSON Schema

{
  "gen_tok_s": 50.0,
  "ttft_ms": 200.0,
  "tool_avg_ms": 2000.0,
  "context_ratio": 0.85,
  "recovery_rate": 1.0,
  "system": {
    "os": "Darwin 24.6.0",
    "arch": "arm64",
    "openclaw_version": "2026.5.22",
    "node_version": "v25.2.1",
    "skill_count": 20,
    "system_prompt_tokens": 5000
  },
  "model": {
    "name": "xiaomi-coding/mimo-v2.5",
    "context_window": "1M",
    "provider": "xiaomi"
  },
  "tests": [
    {
      "id": 1,
      "name": "原始生成速度",
      "duration_s": 55,
      "total_tokens": 6600,
      "output_tokens": 450,
      "tool_calls": 0,
      "status": "ok"
    }
  ]
}
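
Before handing a metrics file to the scorer, a quick structural check can catch obvious mistakes. The sketch below is hypothetical: the field names come from the schema above, but which fields score.py actually requires is an assumption.

```python
NUMERIC_FIELDS = ("gen_tok_s", "ttft_ms", "tool_avg_ms", "context_ratio", "recovery_rate")


def validate_metrics(metrics: dict) -> list:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems = []
    for key in NUMERIC_FIELDS:
        value = metrics.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key}: missing or not a number")
        elif key in ("ttft_ms", "tool_avg_ms") and value <= 0:
            # These two fields are divisors in the score formula.
            problems.append(f"{key}: must be greater than 0")
        elif key == "recovery_rate" and not 0 <= value <= 1:
            problems.append(f"{key}: must be between 0 and 1")
    if not isinstance(metrics.get("tests"), list):
        problems.append("tests: missing or not a list")
    return problems
```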

Optimization Checklist

When score is low, check these in order:

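Check | Affected metric | Optimization
Too many skills | context_ratio | Remove unused skills
System prompt too long | context_ratio | Trim AGENTS.md / SOUL.md
Model choice | gen_tok_s | Switch to a faster model
Network environment | tool_avg_ms | Check VPN/proxy configuration
No compaction configured | context_ratio | Set triggerAtPercent: 75
Streaming mode not optimized | ttft_ms | Use chunked/full mode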

Notes

  • Run benchmarks in a clean session (no prior context) for accurate results
  • Network-dependent tests (Test 3, 6) may vary; run multiple times and take median
  • Context ratio: run Test 1 with minimal context vs full context to measure burden
  • Score is designed to be reproducible — same system should get similar scores (±10%)
  • Save results over time to track performance trends after config changes
  • Baselines are JSON files, safe to git-track for team sharing
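
For the network-dependent tests, taking the median of several runs can be sketched as below. The function name and the sample runtimes are hypothetical.

```python
from statistics import median


def median_runtime(runtimes_s: list) -> float:
    """Return the median runtime in seconds across repeated runs of one test."""
    if not runtimes_s:
        raise ValueError("at least one runtime is required")
    return median(runtimes_s)


# Three hypothetical runs of Test 3, in seconds.
typical = median_runtime([35.5, 41.2, 33.9])  # 35.5
```

The median is less sensitive than the mean to a single slow run caused by a network hiccup.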
Safe Usage Advice
Install only if you are comfortable with a benchmark that runs simple shell commands, performs a couple of network-based tests, and saves local reports and baseline JSON files. Review or delete ~/Downloads/OpenClaw-Benchmark and /tmp/bench_metrics.json if you do not want benchmark history left on disk.
Capability Assessment
Purpose & Capability
The stated purpose is performance benchmarking, and the artifacts match that purpose by measuring model speed, tool latency, context overhead, recovery, and producing a score/report.
Instruction Scope
The skill asks agents to run limited shell commands, web searches/fetches, and file I/O as benchmark tasks; these are disclosed and narrow, but users should know they are part of the benchmark.
Install Mechanism
The package consists of SKILL.md and score.py, with no declared dependencies and no hidden installer or auto-start mechanism.
Credentials
It collects basic environment and model metadata such as OS, architecture, OpenClaw version, Node/Python versions, skill count, and prompt token estimates, which is proportionate for benchmarking.
Persistence & Privilege
It writes benchmark metrics to /tmp and reports/baselines under ~/Downloads/OpenClaw-Benchmark; these persistent files are purpose-aligned and disclosed, but users may want to delete or relocate them.
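How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install zhheo-openclaw-benchmark
  3. After installation, invoke the Skill by name or trigger it with /zhheo-openclaw-benchmark
  4. Provide the inputs described in the Skill's parameter notes to get structured output
Version History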
v1.0.2
Rename slug to zhheo-openclaw-benchmark
Metadata
Slug zhheo-openclaw-benchmark
Version 1.0.2
License MIT-0
Total installs 0
Current installs 0
Historical versions 1
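Frequently Asked Questions

What is OpenClaw Benchmark?

Measures OpenClaw model performance by scoring token throughput, first-token latency, tool call speed, context efficiency, and error recovery ability. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 54 total downloads to date.

How do I install OpenClaw Benchmark?

Run the command "/install zhheo-openclaw-benchmark" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is required.

Is OpenClaw Benchmark free?

Yes. OpenClaw Benchmark is completely free and released under the MIT-0 license, so it can be freely downloaded, installed, and used.

Which platforms does OpenClaw Benchmark support?

OpenClaw Benchmark is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed OpenClaw Benchmark?

It is developed and maintained by 张洪Heo (@zhheo); the current version is v1.0.2.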
