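Description

Automatically evaluates an AI agent's overall capability across five dimensions (file operations, data processing, system operations, robustness, and code quality) using 12 standardized tasks.

Usage (SKILL.md)

agent-benchmark - AI Agent Capability Benchmark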

Name: Agent Benchmark
Author: yuyonghao-123

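Version: 0.1.0
Author: 小蒲萄 (Clawd)
Created: 2026-03-18
Type: SWE-bench Lite (PowerShell Edition)

📖 Introduction

A benchmark system for evaluating how well an AI agent completes tasks, inspired by SWE-bench.

Core features:

✅ 12 standardized test tasks
✅ Evaluation across 5 capability dimensions
✅ Automated scoring and report generation
✅ Per-category capability analysis
✅ Extensible with custom tasks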

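🎯 Evaluation Dimensions

Dimension	Weight	What is tested
File operations	15%	File reading and writing, directory management, path handling
Data processing	35%	JSON/CSV, arrays, strings, regular expressions
System operations	15%	Environment variables, date and time, system information
Robustness	15%	Error handling, exception catching, boundary conditions
Code quality	20%	Function definitions, code reuse, best practices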
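
The table gives a weight per dimension but does not say how the weights are applied. One plausible reading is a weighted sum of per-dimension scores. The sketch below assumes that reading; it is written in Python for illustration rather than in the PowerShell of the runner, and the function name and the example scores are hypothetical.

```python
# Weights copied from the dimension table above.
# ASSUMPTION: dimensions are combined as a weighted sum; the source does not confirm this.
WEIGHTS = {
    "File operations": 0.15,
    "Data processing": 0.35,
    "System operations": 0.15,
    "Robustness": 0.15,
    "Code quality": 0.20,
}


def weighted_overall(dimension_scores):
    """Combine per-dimension scores (each in 0..1) into one weighted score."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Hypothetical per-dimension scores, for illustration only.
example_scores = {
    "File operations": 1.0,
    "Data processing": 0.8,
    "System operations": 1.0,
    "Robustness": 0.5,
    "Code quality": 0.9,
}
print(round(weighted_overall(example_scores), 3))
```

Because data processing carries 35% of the weight, a weak result there moves the overall score more than an equally weak result in any other dimension.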

🚀 Quick Start

Run the full benchmark

# Change to the skill directory
cd C:\Users\99236\.openclaw\workspace\skills\agent-benchmark

# Run the benchmark
.\src\benchmark-runner.ps1

Use a custom task file

# Use a custom task set
.\src\benchmark-runner.ps1 -TaskFile ".\tasks\custom-tasks.json"

# Specify the output directory
.\src\benchmark-runner.ps1 -OutputDir ".\reports\my-benchmark"

# Verbose output mode
.\src\benchmark-runner.ps1 -Verbose

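📋 Test Task List

Easy (basic capabilities)

ID	Task	Category	Points	Focus
task-001	File creation and content writing	File operations	10	Basic file I/O
task-002	JSON data processing	Data processing	10	JSON parsing/serialization
task-003	Array operations	Data processing	10	Filtering/aggregation
task-011	Environment variable access	System operations	10	Reading system information

Medium (intermediate capabilities)

ID	Task	Category	Points	Focus
task-004	Directory listing and filtering	File operations	15	File search/statistics
task-005	String manipulation	Data processing	15	Multi-step transformation
task-006	Date and time calculation	System operations	15	Time difference calculation
task-007	CSV data generation	Data processing	20	Structured data export
task-012	Regular expression matching	Data processing	20	Pattern extraction

Hard (advanced capabilities)

ID	Task	Category	Points	Focus
task-008	Error handling test	Robustness	20	try-catch exception handling
task-009	Multi-step data pipeline	Data processing	25	Complex data flow
task-010	Function definition and invocation	Code quality	25	Code reuse/algorithms

Total: 195 points
Passing threshold: 60% (117 points)
Excellence threshold: 80% (156 points)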
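
The total and both thresholds follow directly from the point values in the three tables. A short Python check, using only numbers from the task list (the variable names are illustrative):

```python
# Point values copied from the Easy, Medium, and Hard task tables.
TASK_POINTS = {
    "task-001": 10, "task-002": 10, "task-003": 10, "task-011": 10,  # Easy
    "task-004": 15, "task-005": 15, "task-006": 15,                  # Medium
    "task-007": 20, "task-012": 20,                                  # Medium
    "task-008": 20, "task-009": 25, "task-010": 25,                  # Hard
}

total_points = sum(TASK_POINTS.values())
# Integer arithmetic avoids floating-point rounding in the thresholds.
passing_points = total_points * 60 // 100
excellence_points = total_points * 80 // 100

print(total_points, passing_points, excellence_points)  # 195 117 156
```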

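📊 Scoring Criteria

Per-task scoring

Task score = base score + verification score + efficiency score

Base score:
- Completed successfully: 50%
- Partially completed: 30%
- Failed: 0%

Verification score (50%):
- Output matches the expected output: 50%
- Output partially matches: 25%
- Output does not match: 0%

Efficiency score (10%):
- Completed within the expected time: +10%
- Timed out: 0%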
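
The formula above can be sketched as a small function. Note that the three components add up to 110% in the best case, while the rating table works on a 0 to 1 scale; the sketch assumes the result is capped at 1.0, which the source does not state. It is Python for illustration, not the runner's actual scoring code, and all names are hypothetical.

```python
BASE_SCORE = {"success": 0.5, "partial": 0.3, "failure": 0.0}
VERIFICATION_SCORE = {"match": 0.5, "partial": 0.25, "mismatch": 0.0}
EFFICIENCY_BONUS = 0.1


def task_score(status, verification, elapsed_seconds, expected_seconds):
    """Score one task as base + verification + efficiency."""
    score = BASE_SCORE[status] + VERIFICATION_SCORE[verification]
    # The bonus is applied literally as written: any run within the expected time earns it.
    if elapsed_seconds <= expected_seconds:
        score += EFFICIENCY_BONUS
    # ASSUMPTION: scores are capped at 1.0 so that 0.5 + 0.5 + 0.1 does not exceed the scale.
    return min(score, 1.0)


print(task_score("success", "match", 2.3, 5))      # full marks, capped
print(task_score("partial", "partial", 9.0, 5))    # partial credit, no bonus
```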

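Overall rating

Average score	Rating	Description
≥ 0.9	🏆 S	Production-ready, exceeds expectations
≥ 0.8	✅ A	Production-ready, excellent performance
≥ 0.7	👍 B	Good, minor room for improvement
≥ 0.6	⚠️ C	Passing, needs improvement
< 0.6	❌ D	Failing, needs substantial improvement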
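
The rating bands translate into a simple threshold lookup. A minimal Python sketch, with a hypothetical function name:

```python
# Thresholds copied from the overall rating table, highest first.
RATING_BANDS = [(0.9, "S"), (0.8, "A"), (0.7, "B"), (0.6, "C")]


def rating(average_score):
    """Map an average score in 0..1 to a rating letter."""
    for threshold, letter in RATING_BANDS:
        if average_score >= threshold:
            return letter
    return "D"


print(rating(0.85))  # A
```

The average of 0.85 used here is the one shown in the sample report, which therefore corresponds to an A rating.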

📁 File Structure

skills/agent-benchmark/
├── SKILL.md                      # Skill documentation (this file)
├── src/
│   ├── benchmark-runner.ps1      # Benchmark runner
│   └── scoring-engine.ps1        # Scoring engine (optional extension)
├── tasks/
│   ├── default-tasks.json        # Default task set (12 tasks)
│   ├── custom-tasks.json         # Custom tasks (added by the user)
│   └── advanced-tasks.json       # Advanced task set (future extension)
├── reports/
│   ├── benchmark-report-*.md     # Test reports (generated automatically)
│   └── summary.json              # Summary data (optional)
└── README.md                     # Usage instructions

🧪 Example Tasks

Example task: JSON data processing

{
  "id": "task-002",
  "name": "JSON Data Processing",
  "category": "Data Processing",
  "difficulty": "Easy",
  "description": "Parse JSON and extract specific field",
  "script": "$data = @{Name='Clawd'; Type='AI'; Version='1.0'} | ConvertTo-Json; $parsed = $data | ConvertFrom-Json; Write-Host \"Name: $($parsed.Name)\"",
  "expectedOutput": "Name: Clawd",
  "expectedTimeSeconds": 5,
  "points": 10
}

Example task: error handling test

{
  "id": "task-008",
  "name": "Error Handling Test",
  "category": "Robustness",
  "difficulty": "Hard",
  "description": "Handle errors gracefully with try-catch",
  "script": "try {\n  $content = Get-Content -Path 'nonexistent-file.txt' -ErrorAction Stop\n  Write-Host 'File found'\n} catch {\n  Write-Host 'Error handled: File not found'\n}",
  "expectedOutput": "Error handled: File not found",
  "expectedTimeSeconds": 5,
  "points": 20
}

📈 Example Report

Running the benchmark generates a Markdown report:

# Agent Benchmark Report

**Generated**: 2026-03-18 09:30:00  
**Total Tasks**: 12  
**Average Score**: 0.85 / 1.0  
**Average Time**: 8.5s

## 📊 Summary

| Metric | Value |
|--------|-------|
| **Success Rate** | 83.3% |
| **Partial Completion** | 8.3% |
| **Failure Rate** | 8.3% |
| **Average Score** | 0.85 |
| **Avg Execution Time** | 8.5s |

## 📈 Detailed Results

| Task | Category | Difficulty | Status | Score | Time (s) |
|------|----------|------------|--------|-------|----------|
| File Creation | File Ops | Easy | ✅ Success | 1.0 | 2.3 |
| JSON Processing | Data | Easy | ✅ Success | 1.0 | 3.1 |
| ... | ... | ... | ... | ... | ... |

🔧 Custom Tasks

Create a custom task set

Create tasks/custom-tasks.json:

{
  "metadata": {
    "version": "1.0.0",
    "name": "My Custom Tasks",
    "description": "Custom benchmark for specific use case"
  },
  "tasks": [
    {
      "id": "custom-001",
      "name": "My Custom Task",
      "category": "Custom Category",
      "difficulty": "Medium",
      "description": "Description of what to do",
      "script": "Write-Host 'Hello World'",
      "expectedOutput": "Hello World",
      "expectedTimeSeconds": 5,
      "points": 15
    }
  ]
}
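
Before handing a custom task file to the runner, it can help to check that every task carries the fields used in the examples. The Python sketch below is an illustration under that assumption: the required-field list is inferred from the sample tasks in this document, not from a published schema, and `validate_task_file` is a hypothetical helper.

```python
import json

# ASSUMPTION: these fields are required because every example task in the document has them.
REQUIRED_FIELDS = {
    "id": str, "name": str, "category": str, "difficulty": str,
    "description": str, "script": str, "expectedOutput": str,
    "expectedTimeSeconds": (int, float), "points": (int, float),
}


def validate_task_file(text):
    """Return a list of problems in a task-file JSON string (empty if none).

    Raises json.JSONDecodeError if the text is not valid JSON.
    """
    data = json.loads(text)
    tasks = data.get("tasks")
    if not isinstance(tasks, list) or not tasks:
        return ["'tasks' must be a non-empty list"]
    problems = []
    seen_ids = set()
    for index, task in enumerate(tasks):
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(task.get(field), expected_type):
                problems.append(f"task {index}: missing or mistyped '{field}'")
        task_id = task.get("id")
        if task_id in seen_ids:
            problems.append(f"task {index}: duplicate id '{task_id}'")
        seen_ids.add(task_id)
    return problems


# The custom task from the example above, rebuilt as a Python dict.
sample_file = json.dumps({
    "metadata": {"version": "1.0.0", "name": "My Custom Tasks"},
    "tasks": [{
        "id": "custom-001", "name": "My Custom Task",
        "category": "Custom Category", "difficulty": "Medium",
        "description": "Description of what to do",
        "script": "Write-Host 'Hello World'",
        "expectedOutput": "Hello World",
        "expectedTimeSeconds": 5, "points": 15,
    }],
})
print(validate_task_file(sample_file))  # []
```

A multi-line script stored with raw line breaks inside the JSON string fails at the `json.loads` step, which is a quick way to catch that mistake.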

Run the custom tasks

.\src\benchmark-runner.ps1 -TaskFile ".\tasks\custom-tasks.json"

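🎯 Use Cases

✅ Good fit

Agent capability evaluation - testing before a new release
Regression testing - making sure an update has not broken existing functionality
Comparison testing - comparing different configurations or models
Capability diagnosis - identifying weak areas
Continuous integration - automated quality gates

❌ Poor fit

Performance benchmarking - the suite is not performance-oriented
Security testing - it contains no security-related tasks
Large-scale load testing - it is designed to be lightweight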

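📊 Metric Definitions

Task Completion Rate

Number of successfully completed tasks / total number of tasks

Average Execution Time

Arithmetic mean of the execution times of all tasks

Success Rate

Proportion of tasks that score 80% or higher

Error Rate

Proportion of tasks that raise an error during execution

Code Quality

Composite score based on code structure, reusability, and best practices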
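
The first four metrics are simple ratios and means over the per-task results. The Python sketch below shows one way to compute them; the result fields (`status`, `score`, `seconds`) and the sample data are hypothetical, since the document does not describe the runner's internal result format. Code Quality is left out because the source defines it only qualitatively.

```python
def summarize(results):
    """Compute the ratio and mean metrics from a list of per-task results."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    return {
        # Tasks that completed successfully / all tasks.
        "completion_rate": sum(r["status"] == "success" for r in results) / total,
        # Tasks scoring 0.8 or higher / all tasks.
        "success_rate": sum(r["score"] >= 0.8 for r in results) / total,
        # Tasks that raised an error / all tasks.
        "error_rate": sum(r["status"] == "error" for r in results) / total,
        # Arithmetic mean of execution times.
        "average_time": sum(r["seconds"] for r in results) / total,
    }


# Hypothetical results, for illustration only.
sample_results = [
    {"status": "success", "score": 1.0, "seconds": 2.0},
    {"status": "success", "score": 0.85, "seconds": 3.0},
    {"status": "partial", "score": 0.55, "seconds": 4.0},
    {"status": "error", "score": 0.0, "seconds": 7.0},
]
print(summarize(sample_results))
```

Completion rate and success rate can differ: a task may complete successfully yet score below 0.8 if its output only partially matches.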

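🔍 Troubleshooting

Common problems

Q: A task times out

Cause: the script is too complex or contains an infinite loop
Fix: review the task script, or increase expectedTimeSeconds

Q: Output verification fails

Cause: the actual output does not match expectedOutput
Fix: adjust the expectedOutput regular expression, or check the script logic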
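
If expectedOutput is treated as a regular expression, as the fix above suggests, a literal string containing metacharacters can fail to match itself. The Python sketch below illustrates the effect by comparing three matching rules. It assumes nothing about how the runner actually compares output, and Python's `re` module differs in details from the .NET regex engine that PowerShell uses; `diagnose_output` is a hypothetical helper.

```python
import re


def diagnose_output(actual, expected):
    """Compare actual output to expectedOutput under three matching rules."""
    actual = actual.strip()  # ignore trailing newlines added by the shell
    try:
        regex_hit = re.search(expected, actual) is not None
    except re.error:
        regex_hit = False  # expected is not a valid pattern
    return {
        "exact": actual == expected,
        "contains": expected in actual,
        "regex": regex_hit,
    }


# Plain text matches under every rule.
print(diagnose_output("Name: Clawd\r\n", "Name: Clawd"))

# "$" is an end-of-string anchor, so the literal text does not match as a pattern.
print(diagnose_output("Total: $5", "Total: $5"))

# Escaping the expected text restores the regex match.
print(diagnose_output("Total: $5", re.escape("Total: $5")))
```

When a task fails verification even though the output looks identical, checking for unescaped characters such as `$`, `.`, `(`, or `[` in expectedOutput is a reasonable first step.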

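Q: Report generation fails

Cause: the output directory does not exist, or there is a permission problem
Fix: make sure OutputDir exists and is writable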

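📝 Changelog

v0.1.0 (2026-03-18)

✅ Initial release
✅ 12 standardized test tasks
✅ 5 evaluation dimensions
✅ Automated scoring system
✅ Markdown report generation
✅ Custom task support

🤝 Contributing

New test tasks are welcome! Please follow these guidelines:

Assign a clear difficulty level (Easy/Medium/Hard)
Make the expected output verifiable automatically
Keep execution time within 30 seconds
Do not depend on external networks or APIs
Do not modify system configuration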
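
The first three guidelines can be checked mechanically before a task is submitted. The Python sketch below is one possible pre-submission check, not a tool shipped with the skill; `check_contribution` is a hypothetical name. The network and system-configuration rules still need a manual review of the script itself.

```python
ALLOWED_DIFFICULTIES = {"Easy", "Medium", "Hard"}
MAX_EXPECTED_SECONDS = 30


def check_contribution(task):
    """Return guideline violations for one task dict (empty list if none)."""
    issues = []
    if task.get("difficulty") not in ALLOWED_DIFFICULTIES:
        issues.append("difficulty must be Easy, Medium, or Hard")
    if not task.get("expectedOutput"):
        issues.append("expectedOutput is empty, so the result cannot be verified automatically")
    seconds = task.get("expectedTimeSeconds")
    if not isinstance(seconds, (int, float)) or seconds > MAX_EXPECTED_SECONDS:
        issues.append("expectedTimeSeconds must be a number no greater than 30")
    return issues


# Hypothetical submission that breaks two of the rules.
candidate = {
    "id": "custom-002",
    "difficulty": "Extreme",
    "expectedOutput": "done",
    "expectedTimeSeconds": 120,
}
for issue in check_contribution(candidate):
    print(issue)
```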

📄 License

MIT License

Last updated: 2026-03-18

Security recommendations

What to consider before installing or running:
- Clarify the runner: SKILL.md describes a PowerShell runner, but the package includes index.js (Node) that actually runs tasks. Ask the author which runner is intended.
- Runtimes required: index.js may spawn python, node, and go; the registry metadata declares no required binaries. Ensure you want those interpreters available on the host and that you trust code executed by them.
- Arbitrary code execution: the benchmark executes task-provided code by writing files and spawning interpreters. If you (or untrusted third parties) add tasks, they can run arbitrary commands with your user permissions. Run only in an isolated or sandboxed environment, or inspect tasks before execution.
- Persistent writes: the tool writes reports into a '../../memory' path outside the skill directory. Inspect where that resolves in your environment and whether you want benchmark outputs (which may include environment values) stored persistently.
- Review tasks and index.js: before running, open tasks/*.json and index.js to confirm no task prints secrets or calls external endpoints; the provided files show no external network calls, but tasks can be extended.
- Safe deployment recommendations: run in a disposable VM or container, or run with restricted user privileges and no sensitive env vars present; consider removing support for languages you don't want to allow (or run the runner in a capability-restricted sandbox).
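
One way to act on the advice about sensitive environment variables is to start the runner with an allowlisted environment instead of the full one. The Python sketch below shows the idea; the allowlist contents and the `scrubbed_env` helper are assumptions for illustration and are not part of the skill.

```python
import os

# ASSUMPTION: a minimal set of variables most interpreters need; adjust for your host.
ALLOWED_VARS = ("PATH", "SystemRoot", "TEMP", "TMP", "HOME")


def scrubbed_env(source=None, allowed=ALLOWED_VARS):
    """Return a copy of the environment that keeps only allowlisted variables."""
    source = os.environ if source is None else source
    # Compare case-insensitively because Windows variable names ignore case.
    allowed_upper = {name.upper() for name in allowed}
    return {name: value for name, value in source.items()
            if name.upper() in allowed_upper}


# Hypothetical environment containing a secret.
demo = {"PATH": "/usr/bin", "API_KEY": "secret-value", "HOME": "/home/user"}
print(scrubbed_env(demo))  # {'PATH': '/usr/bin', 'HOME': '/home/user'}
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the benchmark, so that tasks which read environment variables cannot copy secrets into a report.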

Functional analysis

Type: OpenClaw Skill
Name: agent-benchmark
Version: 0.1.0

The bundle provides a benchmark suite for evaluating AI agent capabilities by executing arbitrary code (PowerShell, Python, Node.js) and accessing system metadata, such as environment variables ($env:USERNAME, $env:COMPUTERNAME) and process lists. While these high-risk capabilities are plausibly required for the stated purpose of a capability assessment tool, the framework for arbitrary code execution and the collection of system information are inherently risky. Furthermore, the documentation in SKILL.md contains hardcoded local paths (e.g., C:\Users\99236\...), and the bundle includes a Node.js runner (index.js) that dynamically writes and executes code from JSON task files, which could be leveraged for unintended execution if the task files are modified.

Capability assessment

ℹ Purpose & Capability

The stated purpose (agent capability benchmark) aligns with the included tasks and scoring logic. However the SKILL.md emphasizes a PowerShell runner (src/benchmark-runner.ps1) while the repository contains a Node.js index.js that implements a runner and executes arbitrary language code (python/node/go). The package metadata declares no required binaries, yet index.js expects interpreters/runtimes (python, node, go). This mismatch is disproportionate to the documented purpose and should be clarified.

⚠ Instruction Scope

SKILL.md instructs users to run a PowerShell script (src/benchmark-runner.ps1) and includes PowerShell task scripts, but the actual executable logic is index.js (Node). index.js writes files, creates temp directories, writes and executes user-supplied code (from tasks.json/tasks) by spawning child processes, and includes behavior not documented in SKILL.md. The instructions in SKILL.md do not fully describe what will be executed on the host.

ℹ Install Mechanism

There is no install spec (instruction-only claimed), which is lower risk, but the package contains Node code that will be executed if you run it. The tool expects external runtimes (python/node/go) though no required-binaries are declared. No remote download or obscure URLs are present in the package, which reduces installer risk, but the lack of declared runtime requirements is an inconsistency.

ℹ Credentials

The skill does not declare required environment variables, which matches registry metadata. index.js spawns processes inheriting process.env and some benchmark tasks intentionally read environment variables (task-011). That's reasonable for 'system operations' tests, but reports capture task outputs (which may include env values) and the tool will persist those outputs—so running tasks that print sensitive environment values could leak them into local reports.

⚠ Persistence & Privilege

index.js generates reports and explicitly writes a report to a relative '../../memory/benchmark-results.md' (i.e., escapes the package directory). Writing into a 'memory' path outside the skill directory can place results into agent persistent storage; SKILL.md did not document this. The skill is not marked always:true, but this unexpected persistent write and the discrepancy between documented runner and shipped Node runner is a privilege/persistence concern and should be clarified.
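
To see why the '../../memory/benchmark-results.md' target is flagged, the relative path can be resolved against the skill directory and compared with it. The Python sketch below does this; it assumes the path is resolved relative to the skill directory, which depends on how index.js builds it, and `escapes_directory` is a hypothetical helper.

```python
from pathlib import Path


def escapes_directory(base_dir, relative_target):
    """Return True if relative_target resolves to a location outside base_dir."""
    base = Path(base_dir).resolve()
    target = (base / relative_target).resolve()
    # The target stays inside only if base is the target itself or one of its ancestors.
    return target != base and base not in target.parents


# Directory layout taken from the file structure section of SKILL.md.
skill_dir = "skills/agent-benchmark"
print(escapes_directory(skill_dir, "../../memory/benchmark-results.md"))  # True
print(escapes_directory(skill_dir, "reports/summary.json"))               # False
```

Running the same check with the real installation path shows exactly which directory would receive the report.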

Version history

v0.1.0

agent-benchmark v0.1.0
- Initial release of the skill.
- Includes 12 standardized benchmark tasks across 5 ability dimensions.
- Automated scoring system and Markdown report generation.
- Supports custom task definition and analysis by category.
- Provides PowerShell-based test runner and clear reporting structure.

Metadata

Slug agent-benchmark

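Version 0.1.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

FAQ

What is Agent Benchmark?

It automatically evaluates an AI agent's overall capability across five dimensions (file operations, data processing, system operations, robustness, and code quality) using 12 standardized tasks. It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 124 times to date.

How do I install Agent Benchmark?

Run the command "/install agent-benchmark" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Agent Benchmark free?

Yes. Agent Benchmark is completely free and is released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Agent Benchmark support?

Agent Benchmark is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Agent Benchmark?

It is developed and maintained by yuyonghao-123 (@yuyonghao-123); the current version is v0.1.0.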
