Agent Benchmark
by yuyonghao-123 · GitHub ↗ · v0.1.1 · MIT-0
147 Downloads · 0 Stars · 0 Active Installs · 2 Versions
Install in OpenClaw
/install yuyonghao-agent-benchmark
Description
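Provides an AI agent capability evaluation based on 12 standardized tasks covering file operations, data processing, system operations, robustness, and code quality, with automatic scoring and report generation.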
Usage Guidance
What to check before installing or running this skill:
- Do not run the skill with sensitive environment variables present. The runner forwards your process.env into any executed task code and tasks can read files and env vars.
- The SKILL.md instructs running a PowerShell runner that is not present in the package; instead a Node-based runner (index.js) is included — confirm which runner you intend to use and why documentation and code differ.
- Review index.js and any task files (tasks.json, default-tasks.json, extended tasks) before execution. The runner will write task code to disk and execute it; untrusted tasks could perform arbitrary I/O or network activity.
- Note the runner writes reports to '../../memory/benchmark-results.md' (outside the skill folder). If you don't want persistent copies in agent memory, modify the code or run in an isolated workspace.
- Ensure required interpreters (python/node/go) are either intentionally available or intentionally absent. If you don't want the skill to execute arbitrary interpreters, run it in a sandbox without those binaries.
- Ask the publisher (author/owner) for clarification: why does the documentation reference a PowerShell runner while the code is Node-based, and why are some files (benchmark-runner.ps1) referenced but missing? If you cannot verify the source or intent, avoid running it on sensitive machines.
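The first point in the list above (keeping sensitive environment variables away from task code) can be approximated by launching the runner from a wrapper that passes only an allowlisted environment. The following is a minimal sketch in Python, independent of the skill's own code. The names `ALLOWED_VARS`, `scrubbed_env`, and `run_with_scrubbed_env` are hypothetical, and the allowlist is an assumption that would need adjusting for a real setup. The demo command is a stand-in child process, not the skill's runner, whose command-line arguments are not documented here.

```python
import os
import subprocess
import sys

# Hypothetical allowlist: variables a child process usually needs to start.
ALLOWED_VARS = ("PATH", "HOME", "TEMP", "TMP", "SYSTEMROOT")


def scrubbed_env(env, allowed=ALLOWED_VARS):
    """Return a copy of env containing only allowlisted variable names."""
    return {name: value for name, value in env.items() if name.upper() in allowed}


def run_with_scrubbed_env(command, env=None):
    """Run command with a minimal environment and return its stdout."""
    source = os.environ if env is None else env
    result = subprocess.run(
        command,
        env=scrubbed_env(source),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Simulate a secret in the parent environment, then check whether
    # a child process started through the wrapper can see it.
    os.environ["DEMO_SECRET_TOKEN"] = "do-not-leak"
    child = [sys.executable, "-c", "import os; print('DEMO_SECRET_TOKEN' in os.environ)"]
    print(run_with_scrubbed_env(child).strip())
```

An allowlist is used rather than a blocklist because it fails closed: a secret with an unexpected name is dropped by default instead of being forwarded.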
Capability Analysis
Type: OpenClaw Skill
Name: yuyonghao-agent-benchmark
Version: 0.1.1
The skill bundle is a benchmarking tool for AI agents, providing a suite of tasks to evaluate capabilities in file operations, data processing, and system interaction. It includes a Node.js runner (index.js) for Python, Node.js, and Go tasks, and references a PowerShell-based runner for a larger set of system tasks. While the tool executes arbitrary code and accesses environment variables as part of its benchmarking process, these actions are consistent with its stated purpose, and no evidence of malicious intent, data exfiltration, or unauthorized persistence was found.
Capability Assessment
Purpose & Capability
The README/SKILL.md describes a PowerShell-based 'benchmark-runner.ps1' and instructs running PowerShell scripts, but the package actually contains a Node CLI (index.js) and Node test harness. The SKILL metadata lists no required binaries, yet index.js spawns external runtimes ('python', 'node', 'go run'). The repository also ships multiple task sets (PowerShell-style tasks and Python tasks) so it's unclear which runner is authoritative. These inconsistencies mean the claimed purpose (PowerShell edition runner) doesn't fully align with the actual code and runtime assumptions.
Instruction Scope
SKILL.md tells users to run a local PowerShell runner path that is not present in the file manifest. The instructions encourage providing custom task files; index.js will write task code to disk and execute it with child processes, forwarding full process.env to children and allowing those tasks to read environment variables and the filesystem. The runtime behavior (create temp dirs, execute arbitrary code from tasks, and write reports) is broader than the missing/mismatched documentation implies and could run arbitrary user-supplied code.
Install Mechanism
There is no install spec (instruction-only), which is low-risk. However, the Node program expects external interpreters (python/node/go) to be available. Those required binaries are not declared in the skill metadata or SKILL.md, so runtime failures or hidden execution of local interpreters are possible if binaries exist. No remote downloads or unusual install steps are present.
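Because the interpreters are not declared anywhere, it can help to check which of them are actually reachable before running the benchmark. The sketch below uses Python's standard `shutil.which` to look up the three names mentioned above. The helper name `find_interpreters` is hypothetical, and the lookup assumes the runner finds these binaries through `PATH`, which the listing does not confirm.

```python
import shutil

# Interpreter names that the analysis above says index.js spawns.
RUNTIMES = ("python", "node", "go")


def find_interpreters(names=RUNTIMES):
    """Map each interpreter name to its resolved path on PATH, or None."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_interpreters().items():
        status = path if path else "not found on PATH"
        print(f"{name}: {status}")
```

A `None` entry means tasks in that language would likely fail to start. A non-`None` entry means the runner could execute task code with that binary.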
Credentials
The skill requests no explicit credentials, but index.js passes the agent's full environment into spawned child processes and some default tasks read environment variables (e.g., task-011). The runner also writes reports to '../../memory/benchmark-results.md' (outside the skill folder), which persists data into an agent memory area. Executing arbitrary task code therefore has the ability to read environment variables and exfiltrate data if malicious task definitions are provided — this capability is proportional if you only run trusted tasks, but risky otherwise.
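As a first pass when reviewing task definitions, the code strings can be scanned for environment access before anything is executed. The sketch below is a rough textual check, not a substitute for reading the tasks. The patterns and the names `ENV_ACCESS_PATTERNS` and `find_env_access` are assumptions made for illustration. The function takes a raw code string because the layout of the task files is not described in this listing.

```python
import re

# Hypothetical patterns for environment access in the three task languages.
ENV_ACCESS_PATTERNS = {
    "python": re.compile(r"os\.environ|os\.getenv"),
    "node": re.compile(r"process\.env"),
    "go": re.compile(r"os\.(Getenv|Environ|LookupEnv)"),
}


def find_env_access(code):
    """Return the sorted list of languages whose env-access pattern appears in code."""
    return sorted(
        lang for lang, pattern in ENV_ACCESS_PATTERNS.items() if pattern.search(code)
    )


if __name__ == "__main__":
    samples = [
        "import os\nprint(os.environ.get('HOME'))",
        "console.log(process.env.HOME)",
        "print(1 + 1)",
    ]
    for code in samples:
        print(find_env_access(code))
```

A match only flags a task for closer reading. Obfuscated code can evade simple patterns, so an empty result does not show that a task is safe.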
Persistence & Privilege
always: false (good), but index.js writes a generated report into a path two levels up ('../../memory/benchmark-results.md'), which is outside the skill directory and likely into the agent's persistent memory area. The skill therefore persists output in a global location without declaring that behavior in metadata or prompting the user.
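To see where a report written to '../../memory/benchmark-results.md' would land, the relative path can be normalized against a candidate base directory. The sketch below does this with POSIX path rules. The install location `/workspace/skills/yuyonghao-agent-benchmark` is hypothetical. Whether the runner resolves the path against the skill directory or the current working directory is not stated in this listing, so the base directory is a parameter.

```python
import posixpath

# Relative report path quoted in the analysis above.
REPORT_RELATIVE_PATH = "../../memory/benchmark-results.md"


def resolve_report_path(base_dir, relative=REPORT_RELATIVE_PATH):
    """Normalize the report path as if it were resolved against base_dir."""
    return posixpath.normpath(posixpath.join(base_dir, relative))


def is_outside(base_dir, target):
    """Return True when target does not live under base_dir."""
    base = posixpath.normpath(base_dir)
    return posixpath.commonpath([base, target]) != base


if __name__ == "__main__":
    # Hypothetical install location, used only for illustration.
    skill_dir = "/workspace/skills/yuyonghao-agent-benchmark"
    report = resolve_report_path(skill_dir)
    print(report)
    print(is_outside(skill_dir, report))
```

Under that assumed location, the report resolves to `/workspace/memory/benchmark-results.md`, a sibling of the `skills` directory rather than a path inside the skill folder.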
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install yuyonghao-agent-benchmark
- After installation, invoke the skill by name or use /yuyonghao-agent-benchmark
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.1.1
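- Updated package.json, bumping the version from 0.1.0 to 0.1.1
- No changes to SKILL.md or the core feature documentation
- This release is a minor package metadata update with no functional or documentation changes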
v0.1.0
AI Agent benchmark with 4 tests passing
Frequently Asked Questions
What is Agent Benchmark?
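Agent Benchmark evaluates AI agent capabilities using 12 standardized tasks covering file operations, data processing, system operations, robustness, and code quality, then scores the results automatically and generates a report. It is an AI Agent Skill for Claude Code / OpenClaw, with 147 downloads so far.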
How do I install Agent Benchmark?
Run "/install yuyonghao-agent-benchmark" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Agent Benchmark free?
Yes, Agent Benchmark is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Agent Benchmark support?
Agent Benchmark is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Agent Benchmark?
It is built and maintained by yuyonghao-123 (@yuyonghao-123); the current version is v0.1.1.