antutuadmin

Benchclaw Openclaw Benchmark

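Author: Antutu · GitHub · v1.1.1 · MIT-0
cross-platform ✓ Security check passed
330 total downloads · 1 favorite · 0 current installs · 17 versions
Install in OpenClaw
/install benchclaw
Description
BenchClaw - OpenClaw Agent benchmark scoring tool. Benchmark, scoring, evaluation, grading. BenchClaw is a professional-grade performance evaluation framework for OpenClaw Agents. It focuses on multi-dimensional, automated quantitative evaluation and capability benchmarking of AI Agents, integrating task distribution, precise scoring...
Usage instructions (SKILL.md)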

BenchClaw Benchmark Skill

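BenchClaw is a complete benchmarking and hot-update distribution system for OpenClaw Agents. It automatically pulls questions from the server, drives the Agent to execute them and collects the output, then scores the results with rule-based validation and generates reports.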


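Prerequisites

  • Python 3.11+ (3.13 recommended)
  • The openclaw CLI is installed and runnable on this machine
  • The local OpenClaw Gateway is running

Python dependencies are installed automatically on first run (no sudo required); see "Quick Start" below.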


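⚠️ Required confirmation before running

Before running the evaluation, first read OpenClaw's default model configuration, then show the following information and wait for the user to confirm:

📊 The BenchClaw evaluation is about to start

  • ⏱️ Estimated duration: 10-90 minutes (depending on model speed and network conditions)
  • 💰 Token consumption: about 2-3M tokens (this incurs API costs; please confirm your budget is sufficient)
  • 📋 Evaluation content: 25 questions across 5 categories: capability, config, security, hardware, permission
  • ⚠️ During the run: OpenClaw can still respond to other messages, but performance will be reduced
  • 🤖 Evaluation model: {agents.defaults.model.primary}
  • ⚠️ The evaluation uses OpenClaw's configured default model, regardless of your current session.

Reply with exactly one of these three options:

  • Report name: reply "show", "start", or "confirm" → results are uploaded to the leaderboard, which displays "{Agent name}" 🚀
  • Anonymous upload: reply "anonymous" → results are uploaded to the leaderboard without a name 🚀
  • Local only: reply "local only" → upload_to_server=false; nothing is submitted and cached results are not re-uploaded; questions are still fetched over HTTPS, and scores and reports are produced locally 🚀

⚠️ "{Agent name}" is your Agent identity in OpenClaw, not the human user's name.

Based on the user's reply, write temp/caller_info.txt and start the evaluation: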

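# About caller_info.txt:
# The evaluation process (main.py) runs in the background, isolated from the current session.
# This file tells main.py whether to upload to the leaderboard, which display name to use, and how to notify the user once the evaluation finishes.
# It holds parseable key=value fields, one per line; main.py reads and applies them (a missing upload_to_server is treated as true).
# The file lives in the local temp/ directory; its contents are not uploaded to the leaderboard server (unless upload is enabled).

# When the user replies "local only / do not upload" (questions are still fetched online; nothing is submitted and cached results are not re-uploaded):
echo "upload_to_server=false" >> scripts/../temp/caller_info.txt
# Optional: keep this consistent with the display name for local reports; without upload, show_name only affects local labeling
echo "agent_name=<Agent name, or leave empty>" >> scripts/../temp/caller_info.txt
echo "show_name=true" >> scripts/../temp/caller_info.txt

# When the user replies "show / start / confirm" (upload to the leaderboard):
# agent_name: the Agent's own name (your OpenClaw identity, not the human user's name)
echo "upload_to_server=true" >> scripts/../temp/caller_info.txt
echo "agent_name=<Agent name>" >> scripts/../temp/caller_info.txt
echo "show_name=true" >> scripts/../temp/caller_info.txt

# When the user replies "anonymous" (upload to the leaderboard, shown anonymously):
echo "upload_to_server=true" >> scripts/../temp/caller_info.txt
echo "agent_name=" >> scripts/../temp/caller_info.txt
echo "show_name=false" >> scripts/../temp/caller_info.txt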
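The same three reply modes can be sketched in Python. This is a minimal illustration, not the skill's own code: the helper names `write_caller_info` and `read_caller_info` and the mode labels are hypothetical, and only the three keys and the "missing upload_to_server means true" default come from the text above.

```python
from pathlib import Path

# Mode labels are hypothetical; the keys and values mirror the echo commands.
MODES = {
    "named": {"upload_to_server": "true", "show_name": "true"},
    "anonymous": {"upload_to_server": "true", "show_name": "false"},
    "local": {"upload_to_server": "false", "show_name": "true"},
}


def write_caller_info(path, mode, agent_name=""):
    """Write one key=value pair per line for the chosen reply mode."""
    fields = dict(MODES[mode])
    # Anonymous runs always carry an empty agent_name.
    fields["agent_name"] = "" if mode == "anonymous" else agent_name
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    body = "\n".join(f"{key}={value}" for key, value in fields.items())
    target.write_text(body + "\n", encoding="utf-8")


def read_caller_info(path):
    """Parse key=value lines; a missing upload_to_server defaults to true."""
    info = {"upload_to_server": "true"}
    target = Path(path)
    if target.exists():
        for raw in target.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            info[key.strip()] = value.strip()
    return info
```

Unlike the `>>` commands, this sketch overwrites the file on each call, so keys left over from an earlier run cannot linger. Whether main.py expects a fresh file or tolerates appended duplicates is not stated in the text.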

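Then start the evaluation in the background:

cd scripts
# Start the evaluation process (it runs in the foreground of this shell and streams progress to stdout)
# The log is also written to temp/benchclaw.log and can be followed at any time: tail -f temp/benchclaw.log
python main.py

After starting, tell the user:

✅ The evaluation has started and should finish in 10-90 minutes. You will be notified automatically when it completes, so there is no need to wait.

💡 Note for TUI users: if you trigger the evaluation directly from the TUI or a terminal, do not write caller_info.txt (that is, skip the echo commands above). Progress and results are printed straight to the terminal (stdout), where you can follow them.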


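During the run: progress monitoring

The evaluation runs in the background, and main.py is meant to push progress directly via openclaw message send (this still has to be implemented by a human developer; see improvement plan A2).

Until A2 is implemented: the user can send "check progress" or "progress" at any time, and the AI reads the log and reports: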

tail -10 scripts/../temp/benchclaw.log | grep -E "正在测试|-> ok|-> failed|total_score"
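For environments without `tail` and `grep`, the same filter can be approximated in Python. This is a sketch only: `progress_lines` is a hypothetical helper, the four patterns are copied from the command above, and nothing else about the log format is assumed.

```python
import re
from collections import deque

# Patterns copied verbatim from the grep -E expression.
PROGRESS = re.compile(r"正在测试|-> ok|-> failed|total_score")


def progress_lines(log_path, tail=10):
    """Return the progress-related lines among the last `tail` log lines."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        last = deque(fh, maxlen=tail)  # keeps only the final `tail` lines
    return [line.rstrip("\n") for line in last if PROGRESS.search(line)]
```

Calling `progress_lines("temp/benchclaw.log")` returns a list that can be joined and sent back to the user as the progress report.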

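After the evaluation: upload (optional) and notify the user

  • upload_to_server=true (default): when the evaluation finishes, main.py automatically uploads the results to the leaderboard (show_name was confirmed before the run), and the notification includes "uploaded to the leaderboard" plus the ranking, if any.
  • upload_to_server=false: the submit endpoint is not called and cached results are not re-uploaded; the notification reads "local only, not uploaded to the leaderboard" and points the user to the reports in data/.

Example notification when results are uploaded:

🏆 BenchClaw evaluation complete! Uploaded to the leaderboard.

📊 Overall score: 79,915
✅ Passed: 23/25 questions
⏱️ Duration: 13.6 minutes
🏅 Leaderboard rank: ahead of 90.7% of users (if ranking data is available)

Send "report" to see the detailed results.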


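Result display format

After receiving the evaluation results, present them to the user in the following format (this format is mandatory):

🏆 BenchClaw evaluation complete!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Overall score: {total_score}
Accuracy: {accuracy_score}/{max_accuracy} | Speed bonus: +{speed_score}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📋 Category scores:
| Category | Pass rate | Accuracy | Speed score |
|------|--------|--------|--------|
| 🧠 Capability | {n}/5 | {accuracy}/50 | +{speed} |
| ⚙️ Config     | {n}/5 | {accuracy}/50 | +{speed} |
| 🛡️ Security   | {n}/5 | {accuracy}/50 | +{speed} |
| 💻 Hardware   | {n}/5 | {accuracy}/50 | +{speed} |
| 🔐 Permission | {n}/5 | {accuracy}/50 | +{speed} |

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⏱️ Total duration: {minutes} minutes
{Rating by duration: < 8 minutes ⚡ very fast / 8-15 minutes ✅ normal / 15-25 minutes 🟡 slow / > 25 minutes 🔴 too slow}

💰 Token consumption: {count} (input {input} / output {output})
{Rating by consumption: < 1M ✅ very economical / 1-2M 🟡 normal / 2-3M 🟠 high / > 3M 🔴 too high}

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔍 Three-dimension bottleneck diagnosis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🤖 Model: {model_name}, average speed {avg_tps} TPS
{Rating by avg_tps: > 5000 ⚡ very fast / 2000-5000 ✅ normal / 1000-2000 🟡 slow / < 1000 🔴 too slow}

💻 Hardware: {show if cpu_peak/mem_stats data is available, otherwise skip this line}
{CPU peak rating: < 60% ✅ ample / 60-80% 🟡 tight / > 80% 🔴 bottleneck}
{Free memory rating: > 2GB ✅ ample / 1-2GB 🟡 tight / < 1GB 🔴 bottleneck}

💡 Top improvement suggestion:
{Give the single most important concrete suggestion for the weakest dimension, for example:}
→ Model speed is low ({avg_tps} TPS): try a faster model, such as a lighter-weight inference model
→ Free memory is insufficient ({mem_avail}GB): close other programs or add more memory

{If any questions failed, list them:}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
❌ Failed questions:
- {question_id}: {failure_reason}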
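The rating bands in the template above can be expressed as small threshold lookups. The sketch below is illustrative: the function names and English labels are my own, and the handling of exact boundary values (8 minutes, 2M tokens, and so on) is an assumption, since the template does not say which band a boundary value belongs to. Here a boundary value falls into the band that starts at it.

```python
def rate(value, bands, fallback):
    """Return the label of the first band whose upper bound exceeds value."""
    for upper, label in bands:
        if value < upper:
            return label
    return fallback


def rate_duration(minutes):
    return rate(minutes, [(8, "very fast"), (15, "normal"), (25, "slow")], "too slow")


def rate_tokens(tokens):
    bands = [(1_000_000, "very economical"), (2_000_000, "normal"), (3_000_000, "high")]
    return rate(tokens, bands, "too high")


def rate_tps(avg_tps):
    return rate(avg_tps, [(1000, "too slow"), (2000, "slow"), (5000, "normal")], "very fast")


def rate_cpu_peak(percent):
    return rate(percent, [(60, "ample"), (80, "tight")], "bottleneck")


def rate_free_memory(gigabytes):
    return rate(gigabytes, [(1, "bottleneck"), (2, "tight")], "ample")
```

With the figures from the example notification, `rate_duration(13.6)` yields "normal".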

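Quick Start

Run the full evaluation

Recommended (handles dependencies automatically):

bash run.sh

run.sh checks whether the dependencies are installed, installs them if they are missing (no sudo required), and then starts the evaluation.

Manual (pip already available):

cd scripts
# Install dependencies into the user directory (--user: no sudo/root needed, system Python untouched)
# The only dependencies are cryptography (encrypted communication) and psutil (hardware information collection)
pip install -r requirements.txt --user --quiet
python main.py

⚠️ If dependency installation fails (usually because the server has no pip), you can ask the AI to run the following command in the conversation:

python3 -m ensurepip --upgrade && python3 -m pip install -r scripts/requirements.txt --user

Generate or view reports separately

cd scripts
python report.py --input ../temp/results.json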

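Task Categories

BenchClaw always contains 25 systematic evaluation questions covering 5 core dimensions:

| Category | Identifier | Focus |
|------|------|------|
| Basic capability | capability | The Agent's core abilities: instruction following, file operations, tool calls, web retrieval |
| Configuration management | config | Accuracy when reading and modifying OpenClaw and environment configuration |
| Security defense | security | Refusing dangerous instructions; resisting prompt injection and malicious damage |
| Hardware operations | hardware | Interaction with device information, system status, and hardware resources |
| Permission boundaries | permission | Behavior in restricted environments, verifying that permission controls work |

Scoring System

Per-question total = accuracy score + speed score

  1. Accuracy Score: file existence + content rule validation + penalty deductions
  2. Speed score (TPS Score): a bonus based on token throughput (TPS = Total Tokens / Duration Seconds)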
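The two formulas above can be written out directly. This is a minimal sketch: the function names are hypothetical, and how BenchClaw converts a TPS value into speed-score points is not described in the text, so that mapping is deliberately left out.

```python
def tokens_per_second(total_tokens, duration_seconds):
    """TPS = Total Tokens / Duration Seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return total_tokens / duration_seconds


def question_total(accuracy_score, speed_score):
    """Per-question total = accuracy score + speed score."""
    return accuracy_score + speed_score


# Illustrative numbers only, not real benchmark output.
print(tokens_per_second(120_000, 60))  # 2000.0
print(question_total(40, 12.5))        # 52.5
```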

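Results & Reports

Generated automatically when the evaluation finishes:

  • data/report_summary.md: summary report (total score, per-category summary)
  • data/report_detail.md: detailed report (duration, tokens, and score breakdown per question)
  • temp/results.json: raw data

# Show the total score
jq '.stats.score' temp/results.json

# Show the category scores
jq '.stats.category_stats' temp/results.json

# List failed questions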
jq '.results[] | select(.success == false) | {id, category, error}' temp/results.json
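Where jq is not installed, the same three lookups can be done with Python's standard library. The sketch assumes only the field names that appear in the jq filters above (`stats.score`, `stats.category_stats`, and `results[]` entries with `id`, `category`, `error`, `success`); the helper names are hypothetical.

```python
import json


def load_results(path):
    """Load temp/results.json as a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def total_score(data):
    return data["stats"]["score"]


def category_stats(data):
    return data["stats"]["category_stats"]


def failed_questions(data):
    """Mirror of: .results[] | select(.success == false) | {id, category, error}"""
    return [
        {"id": r.get("id"), "category": r.get("category"), "error": r.get("error")}
        for r in data.get("results", [])
        if r.get("success") is False
    ]
```

Typical use is `failed_questions(load_results("temp/results.json"))`.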

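Offline Cache & Upload

Data Transparency

What is uploaded (only when upload_to_server is not disabled; it is enabled by default)
Consistent with _build_upload_payload in scripts/server.py, mainly:

  • Session and verification: api_session_id, api_hash, client_version
  • Model and display: model_name, openclaw_version, openclaw_name (display name), category totals s1 to s5, per-question scores b1 to b25
  • Per-question run blocks r1 to r25: timestamps, token counts (including cache read/write), returncode, error, truncated stdout / stderr (lengths set by UPLOAD_STDOUT_TRUNCATE_LENGTH / UPLOAD_STDERR_TRUNCATE_LENGTH in scripts/config.py, currently 2000 / 500 characters), and the accuracy and TPS-related score fields
  • Environment: host_type, env_info (for example CPU core count, memory in GB, OS, python_version)
  • Request header: X-Bench-Session-Id (bench_session_id, stored in data/cache.json)

What is not uploaded as a "full transcript"
OpenClaw session transcripts (.jsonl) are only read locally to total up tokens (see scripts/agent_cli.py); the full conversation record is not placed verbatim into the upload JSON.

stdout/stderr sanitization (not exhaustive)
Before upload, each question's stdout/stderr goes through regex replacement for some common key, path, and email formats; this cannot guarantee that all sensitive content is removed. Implementation: _SANITIZE_RULES and _sanitize_output in scripts/server.py (around lines 41–71); the self-test is test_sanitize in the same file (around lines 494–538), run with: cd scripts && python server.py sanitize. A complete excerpt suitable for review is in UPLOAD_DISCLOSURE.md in the repository.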
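To make the "best-effort regex replacement" idea concrete, here is a small sketch of that style of sanitizer. It is not the skill's `_SANITIZE_RULES`: the three patterns, the replacement tokens, and the choice to sanitize before truncating are all assumptions made for illustration.

```python
import re

# Hypothetical rules; the real list lives in scripts/server.py.
RULES = [
    (re.compile(r"sk-[A-Za-z0-9_-]{16,}"), "[REDACTED_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
    (re.compile(r"/(?:home|Users)/[^/\s]+"), "/[REDACTED_USER_DIR]"),
]


def sanitize(text, max_len):
    """Apply each rule in order, then truncate to max_len characters."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text[:max_len]
```

A secret in a format that no rule anticipates passes through unchanged, which is why the text describes the sanitization as non-exhaustive.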

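Transport
Question fetching and uploading use HTTPS; when upload is enabled, the body is encrypted with an RSA+AES hybrid scheme (public key in scripts/config.py / crypto). The question pack is delivered as plaintext JSON over HTTPS. Questions are always fetched over the network (unless the machine is offline); with upload_to_server=false, main.py does not call the submit endpoint and does not run flush_pending_uploads, but it still fetches questions.

The upload host defaults to benchclawapi.antutu.com (BENCHCLAW_API_HOST in scripts/config.py). To disable upload, set upload_to_server=false (a parsed field) in temp/caller_info.txt; do not rely on editing config.py alone in place of the convention defined in SKILL.md.

  • Offline re-upload: when submission fails, results are encrypted and written to disk; on the next start, if upload_to_server is still true, flush_pending_uploads re-submits them.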
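The cache-and-retry behavior described above can be sketched as a tiny on-disk queue. Everything here is hypothetical except the rule itself: the `*.pending` file layout and the names `queue_pending` and `flush_pending` are invented for illustration, and the real flush_pending_uploads may work differently.

```python
from pathlib import Path


def queue_pending(cache_dir, name, encrypted_blob):
    """Store an already-encrypted payload that could not be submitted."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / f"{name}.pending").write_bytes(encrypted_blob)


def flush_pending(cache_dir, submit, upload_to_server):
    """Retry queued payloads; skip entirely when upload is disabled.

    `submit` is a callable taking bytes and returning True on success.
    Returns the number of payloads that were submitted and removed.
    """
    if not upload_to_server:
        return 0
    sent = 0
    for path in sorted(Path(cache_dir).glob("*.pending")):
        if submit(path.read_bytes()):
            path.unlink()  # drop the payload only after a confirmed success
            sent += 1
    return sent
```

Payloads that fail again stay on disk for the next start, and a local-only run leaves the queue untouched.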

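Evaluation Flow

main.py
  ├─ 1. Clean up previous sessions and the workspace
  ├─ 2. Read caller_info (including upload_to_server); if true, re-upload previously failed records; if false, skip re-upload
  ├─ 3. Fetch the question bank from the server (25 questions, HTTPS)
  ├─ 4. Run the questions one by one (isolated session + token accounting + rule validation)
  ├─ 5. Aggregate statistics (total score, TPS, pass rate)
  ├─ 6. Generate the report (Markdown)
  └─ 7. If upload_to_server: encrypt and upload to the server; otherwise skip the upload (local only)
Security usage recommendations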
This package is coherent with a benchmarking tool: it will read your OpenClaw sessions, run a background benchmarking process that invokes the local openclaw CLI, and (unless you choose local-only) upload aggregated per-question results to benchclawapi.antutu.com. Before installing or running: 1) If you don't want any network contact, set upload_to_server=false in temp/caller_info.txt (note: the client will still HTTPS fetch questions even in local-only mode according to the docs). 2) Be aware stdout/stderr sanitization is best-effort and not exhaustive—avoid running this on sessions that contain sensitive secrets you wouldn't want included (even truncated). 3) Inspect scripts/requirements.txt and run.sh (it uses pip with --require-hashes if available) and consider running inside an isolated virtual environment or container. 4) If you plan to upload results, confirm you trust the leaderboard host (benchclawapi.antutu.com) since aggregated metrics and some identifiers (session id, agent name if provided) are transmitted. 5) If you need higher assurance, run the benchmark with upload disabled and inspect all generated data in data/ and temp/ before enabling submission.
Functional analysis
Type: OpenClaw Skill Name: benchclaw Version: 1.1.1 BenchClaw is a professional-grade benchmarking tool for OpenClaw agents that evaluates performance across five dimensions. It features a transparent data collection process, including hardware monitoring (via psutil) and truncated log collection, with explicit regex-based sanitization in `scripts/server.py` to redact API keys and sensitive paths before encrypted upload to `benchclawapi.antutu.com`. The bundle demonstrates high security standards by using hash-verified dependencies in `scripts/requirements.txt`, providing a local-only mode to opt-out of data sharing, and requiring explicit user confirmation via `SKILL.md` instructions before execution.
Capability tags
crypto, requires-oauth-token, requires-sensitive-credentials
Capability assessment
Purpose & Capability
Name/description (agent benchmark) match what the code does: fetching question sets, driving the local openclaw CLI, measuring token usage, scoring, report generation, and optionally submitting aggregated results. Required binaries (python3, openclaw, pip) and reading OpenClaw session files are justified by the stated purpose.
Instruction Scope
Runtime instructions and code operate on local OpenClaw sessions, write results to skill-local data/ and temp/, fetch questions over HTTPS and (by default) POST aggregated, truncated stdout/stderr and per-question metrics to the configured API. The SKILL.md and code explicitly document this behavior and an opt-out for uploads. Important note: sanitization of stdout/stderr is explicitly "best-effort / not exhaustive"—sensitive content could remain in uploads if present in outputs.
Install Mechanism
No registry install spec; the bundle contains scripts and a run.sh that creates a venv and installs requirements (it attempts --require-hashes). Nothing is downloaded from unknown third-party shorteners or personal IPs in the provided scripts. The approach is local and documented.
Credentials
The skill requests no external credentials and only optionally reads BENCHCLAW_RSA_PUBLIC_KEY_PEM (override). It does read OpenClaw session transcripts and other OpenClaw configuration data for token accounting and model info—this is proportionate for benchmarking but does access user-local agent transcripts (which the project claims are not uploaded in full).
Persistence & Privilege
always is false; the skill runs as a normal, user-invoked/executable skill and writes only to its own data/ and temp/ directories. It does not request system-wide modifications or other skills' credentials.
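How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install benchclaw
  3. Once installed, call the Skill by name or trigger it with /benchclaw
  4. Provide the inputs described in the Skill's parameter notes to get structured output
Version history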
v1.1.1
BenchClaw 1.1.1 introduces configurable result uploading with improved transparency and new local-only mode - Added support for `upload_to_server=false` (via caller_info.txt), allowing users to opt out of uploading benchmark results; results/leaderboard upload is now optional and locally controllable. - Updated permissions and documentation to clarify data transfer scope and local-only evaluation mode; references new UPLOAD_DISCLOSURE.md for full upload details. - Enhanced usage instructions and user prompts to clearly distinguish "public", "anonymous", and "local-only" evaluation. - README, SKILL.md, and configuration updated for improved reproducibility, transparency, and stricter network disclosure. - Added UPLOAD_DISCLOSURE.md documenting exactly what data is uploaded, when enabled.
v1.1.0
BenchClaw v1.1.0 - Adds bench_session_id for correlating evaluation runs; sent as X-Bench-Session-Id and stored in cache.json. - Updates permission descriptions and documentation to use bench_session_id instead of device fingerprint. - Refines README, tags, metadata, and internal naming to improve clarity and accuracy. - No breaking API or workflow changes; functionality and quick start remain as before.
v1.0.9
- Updated to version 1.0.9. - Expanded and standardized metadata tags for improved discoverability and search. - Updated the skill name in the metadata to "benchclaw - openclaw benchmark" for clarity. - Metadata version, tags, and description wording streamlined for consistency and broader keyword coverage. - No changes to core benchmarking features or evaluation logic.
v1.0.8
- Version bumped to 1.0.8. - Updated metadata version in SKILL.md to 1.0.8. - No other content or logic changes found in SKILL.md or documentation.
v1.0.7
Version 1.0.7 - Updated metadata to version 1.0.7 in SKILL.md. - Documentation and metadata refinements; no breaking changes to interfaces or behavior. - No visible changes to core functionality or user experience.
v1.0.6
- Updated version to 1.0.6. - SKILL.md: Bumped metadata version and updated documentation to match the new release. - No user-facing feature or API changes are listed; changes are primarily version and documentation updates.
v1.0.5
- Version bumped to 1.0.5. - Updated metadata version in SKILL.md to "1.0.5". - Minor documentation and metadata updates, no user-facing functional changes noted.
v1.0.4
- Added explicit Python package (cryptography, psutil, requests) requirements and skill type to metadata for improved compatibility. - Updated agent naming instructions: clarified that only the Agent's OpenClaw identity should be reported (not human user names); prompt texts updated accordingly. - Adjusted estimated benchmark duration in documentation from 10-60 minutes to 10-90 minutes. - Updated skill version to 1.0.4.
v1.0.3
- Added homepage and repository URLs to metadata for easy access to project resources. - Introduced descriptive "tags" for better discoverability. - "pip" is now listed as a required binary in prerequisites. - No changes made to benchmark logic or user workflows—documentation update only.
v1.0.2
- Updated permission descriptions to clarify the content and scope of uploaded evaluation results, including task-level stdout/stderr redaction, hardware/environment details, and enhanced data redaction procedures. - Permissions now specify use of AESGCM + RSA encryption for uploads and mention stdout/stderr truncation and sanitization steps. - No other functional or user-facing changes.
v1.0.1
Benchclaw 1.0.1 - Updated scoring categories: now reports Capability, Config, Security, Hardware, Permission (order and naming adjusted). - Clarified instructions for writing caller_info.txt; only includes agent_name and show_name, removing channel and target. - Improved security note: question pack fetched as plain JSON over HTTPS, uploads still AES-256-GCM encrypted. - Documentation updates for more accurate result format and reporting process. - Added scripts/session.py and updated other scripts to improve evaluation flow and session handling.
v0.1.5
Changelog for benchclaw v0.1.5 - Security upgrade: Results upload now uses RSA+AES hybrid encryption, with public key built-in for data protection. - HMAC configuration changed: config variable renamed to BENCHCLAW_HMAC_KEY; HMAC now only signs the `hash` for score uploads. - Result upload instructions and data transparency notes updated to reflect improved encryption scheme. - The skill no longer reads SOUL.md for the agent name; now uses the current user's name directly when populating `caller_info.txt`. - Updated documentation in SKILL.md to match new security practices and data handling procedures.
v0.1.4
benchclaw 0.1.4 - Updated scripts/openclawbot.py. - No changes to documentation or user-facing features described in SKILL.md.
v0.1.3
benchclaw 0.1.3 - Updated scripts/openclawbot.py with changes relevant to the skill's operation. - No updates to user-facing features or documentation detected in this version. - Maintains existing benchmark, scoring, and report workflow for OpenClaw Agents.
v0.1.2
No changes detected in this version. - No file or documentation changes found between versions. - Functionality, features, and documentation remain unchanged from the previous version.
v0.1.1
No file changes detected for version 0.1.1. There are no updates or new features in this release.
v0.1.0
BenchClaw 0.1.0 initial release. - Introduces a professional, automated benchmarking framework for OpenClaw Agents, providing multi-dimensional evaluation (capability, performance, cost, config, security). - Supports task fetching, automated execution, precision scoring, and report generation with hot update functionality. - Implements user confirmation step with clear display of test duration, token costs, and privacy guarantees before benchmarking. - Automatically uploads encrypted evaluation results (agent scores, token usage, task outcomes) without any personal data; supports offline caching and retry. - Detailed result reporting template and CLI tools provided for viewing and analyzing benchmarking outcomes. - Ensures local privacy: only anonymized device fingerprint is stored for result correlation, with no PII collected.
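Metadata
Slug benchclaw
Version 1.1.1
License MIT-0
Total installs 0
Current installs 0
Historical versions 17
FAQ

What is Benchclaw Openclaw Benchmark?

BenchClaw - OpenClaw Agent benchmark scoring tool. Benchmark, scoring, evaluation, grading. BenchClaw is a professional-grade performance evaluation framework for OpenClaw Agents. It focuses on multi-dimensional, automated quantitative evaluation and capability benchmarking of AI Agents, integrating task distribution, precise scoring... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 330 total downloads so far.

How do I install Benchclaw Openclaw Benchmark?

Run the command "/install benchclaw" in the OpenClaw or Claude Code chat box to install it in one step, with no extra configuration.

Is Benchclaw Openclaw Benchmark free?

Yes. Benchclaw Openclaw Benchmark is completely free under the MIT-0 license and can be downloaded, installed, and used freely.

Which platforms does Benchclaw Openclaw Benchmark support?

Benchclaw Openclaw Benchmark is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Benchclaw Openclaw Benchmark?

It is developed and maintained by Antutu (@antutuadmin); the current version is v1.1.1.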
