wanng-ide

Hle Benchmark Evolver

Author: WANGJUNJIE · GitHub ↗ · v1.0.0
cross-platform ⚠ suspicious
Total downloads: 735
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install hle-benchmark-evolver
Description
Runs HLE-oriented benchmark reward ingestion and curriculum generation for capability-evolver. Use when the user asks to optimize Humanity's Last Exam score,...
Usage (SKILL.md)

HLE Benchmark Evolver

This skill operationalizes HLE score-driven evolution for OpenClaw.

When to Use

  • User asks to improve HLE score (for example target >= 60%).
  • User provides question-level benchmark output and wants it converted to reward.
  • User wants easy-first curriculum queue and next-focus questions.
  • User asks for an immediate benchmark result snapshot.

Inputs

  • Benchmark report JSON path (--report=/abs/path/report.json)
  • Optional benchmark id (defaults to cais/hle)

Workflow

  1. Validate the report JSON exists and is parseable.
  2. Ingest report into capability-evolver benchmark reward state.
  3. Generate curriculum signals:
    • benchmark_*
    • curriculum_stage:*
    • focus_subject:*
    • focus_modality:*
    • question_focus:*
  4. Return a compact result summary for this run.
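
Step 1 of the workflow can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the helper name `loadReport` and the shape of its return value are hypothetical, and the sketch assumes nothing about the report schema beyond it being a JSON object or array.

```javascript
const fs = require('node:fs');

// Hypothetical helper (not part of the skill): checks that the report file
// exists and parses as JSON. Returns { ok: true, report } on success or
// { ok: false, error } on failure instead of throwing.
function loadReport(reportPath) {
  if (!fs.existsSync(reportPath)) {
    return { ok: false, error: `report not found: ${reportPath}` };
  }
  try {
    const report = JSON.parse(fs.readFileSync(reportPath, 'utf8'));
    if (report === null || typeof report !== 'object') {
      return { ok: false, error: 'report must be a JSON object or array' };
    }
    return { ok: true, report };
  } catch (err) {
    // Covers unreadable paths (for example a directory) and malformed JSON.
    return { ok: false, error: `could not read or parse report: ${err.message}` };
  }
}
```

Returning an error object rather than throwing lets a caller print a compact failure summary before any ingestion is attempted.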

Run

node skills/hle-benchmark-evolver/run_result.js --report=/absolute/path/hle_report.json

Full automatic loop (starts evolution cycle):

node skills/hle-benchmark-evolver/run_pipeline.js --report=/absolute/path/hle_report.json --cycles=1

If your evaluator can be called from a shell, let the pipeline generate the report on each cycle:

node skills/hle-benchmark-evolver/run_pipeline.js \
  --report=/absolute/path/hle_report.json \
  --eval_cmd="python /path/to/eval_hle.py --out {{report}}" \
  --cycles=3 --interval_ms=2000
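
Because the `{{report}}` placeholder is substituted into a command string that ends up in a shell, the report path is best quoted before substitution. The sketch below shows one way to do that with POSIX single-quote escaping. It is an assumption about how substitution could be hardened, not a description of what run_pipeline.js does; `shellQuote` and `fillReportPlaceholder` are hypothetical names.

```javascript
// Hypothetical helper: wraps a value in single quotes for a POSIX shell,
// escaping embedded single quotes as '\'' so the value stays one argument.
function shellQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

// Hypothetical helper: replaces every {{report}} placeholder with the quoted
// report path. split/join is used so that "$" sequences in the path are not
// interpreted as replacement patterns.
function fillReportPlaceholder(template, reportPath) {
  return template.split('{{report}}').join(shellQuote(reportPath));
}

// Example: a path containing a space remains a single shell argument.
console.log(fillReportPlaceholder('python eval_hle.py --out {{report}}', '/tmp/hle report.json'));
```

Quoting only protects the substituted path. The rest of the `--eval_cmd` string is still executed as given, so it must come from a trusted source.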

If no --report is provided, it defaults to:

skills/capability-evolver/assets/gep/hle_report.template.json

Output Contract

Always print JSON with these fields:

  • benchmark_id
  • run_id
  • accuracy
  • reward
  • trend
  • curriculum_stage
  • queue_size
  • focus_subjects
  • focus_modalities
  • next_questions
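
A caller that consumes this output could check the contract with something like the sketch below. The field names come from the list above; the helper names `missingFields` and `parseSummary` are hypothetical, and no assumptions are made about field types because SKILL.md does not specify them.

```javascript
// Field names taken from the Output Contract list above.
const REQUIRED_FIELDS = [
  'benchmark_id', 'run_id', 'accuracy', 'reward', 'trend',
  'curriculum_stage', 'queue_size', 'focus_subjects',
  'focus_modalities', 'next_questions',
];

// Hypothetical helper: returns the contract fields absent from a summary.
function missingFields(summary) {
  if (summary === null || typeof summary !== 'object') {
    return [...REQUIRED_FIELDS];
  }
  return REQUIRED_FIELDS.filter(
    (name) => !Object.prototype.hasOwnProperty.call(summary, name)
  );
}

// Hypothetical helper: parses the skill's printed JSON and throws if any
// contract field is missing.
function parseSummary(stdoutText) {
  const summary = JSON.parse(stdoutText);
  const missing = missingFields(summary);
  if (missing.length > 0) {
    throw new Error(`summary is missing fields: ${missing.join(', ')}`);
  }
  return summary;
}
```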

Notes

  • This skill handles reward/curriculum ingestion. It does not directly solve HLE questions.
  • run_pipeline.js links ingestion, evolve, and solidify into one executable loop.
Security Recommendations
This skill appears to implement HLE report ingestion and curriculum generation, but take these precautions before installing or running it:

  • Ensure the expected sibling modules exist: capability-evolver (or feishu-evolver-wrapper). Inspect their src/gep/benchmarkReward.js and index.js to confirm which state files they touch and what side effects they have.
  • Avoid passing untrusted commands to --eval_cmd. The pipeline will run that command via the shell (and may execute it as a temporary script using a login shell), giving it full access to the agent's environment and filesystem; it can read env vars and files or exfiltrate data.
  • Run the skill in an isolated/test environment first (no secrets in the environment, limited filesystem access) and try it with the provided sample report to observe its behaviour.
  • Review where the benchmark state is stored (reward.getStatePath()) and ensure you are comfortable with reads/writes to that path.
  • If you must run eval_cmd, prefer a controlled evaluator executable you trust and pass it a restricted environment (or run it inside a sandbox/container).

If you want a safer install, ask the skill author to: declare the dependency on capability-evolver, document the state path and files touched, and add explicit warnings and safeguards around executing arbitrary eval_cmd shell commands.
Functional Analysis
Type: OpenClaw Skill
Name: hle-benchmark-evolver
Version: 1.0.0

The skill is highly suspicious due to a critical Remote Code Execution (RCE) vulnerability. The `SKILL.md` documentation explicitly instructs the OpenClaw agent to execute arbitrary shell commands via the `--eval_cmd` parameter. The `run_pipeline.js` script implements this directly by calling `child_process.spawnSync('bash', ['-c', command])` on the provided command. The command includes a `{{report}}` placeholder, which could be leveraged for further injection if the report path is attacker-controlled. While this capability is presented as a feature for integrating external evaluators, it represents a severe prompt injection and shell injection risk: an attacker who controls the input to this skill can execute arbitrary code on the host system.
Capability Assessment
Purpose & Capability
The code implements ingestion, reporting, and a pipeline that calls out to a 'capability-evolver' (or a 'feishu-evolver-wrapper') module and invokes that skill's index.js for evolve/solidify. That dependency is not declared in the SKILL.md or package metadata; the skill will fail or behave differently if those sibling modules are missing. Otherwise the requested capabilities (parse report → ingest → generate curriculum signals → optionally drive evolve/solidify) match the stated purpose.
Instruction Scope
SKILL.md and the scripts allow/encourage executing arbitrary evaluator commands via --eval_cmd which are run through the shell (runShell) and may be written to a temporary script and executed via 'bash -l'. This grants those commands full access to the process environment and filesystem that the agent runs with and can run arbitrary code, read files, or exfiltrate data. The instructions do not warn about that risk or restrict which commands may be executed.
Install Mechanism
There is no network download or install spec — the skill is instruction + local JS files only. No external packages are fetched. That lowers install risk, but the skill expects local sibling modules to exist (capability-evolver or feishu-evolver-wrapper).
Credentials
The skill declares no required env vars, which is consistent with its metadata, but at runtime it spawns child processes and passes the full process.env to them. Those child processes (eval_cmd or invoked index.js in capability-evolver) can access any environment secrets available to the agent. Also the skill reads/writes state files via the external benchmarkReward module — the path and contents of those files are not documented in SKILL.md.
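One way to limit this exposure is to hand child processes an allow-listed environment and to launch them without a shell. The sketch below is a hedged illustration of that idea, not the skill's code: `buildChildEnv`, `runRestricted`, and the `DEFAULT_ALLOWED` list are hypothetical and would need adjusting to what a given evaluator actually requires.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical allow-list: only these variables reach the child process.
const DEFAULT_ALLOWED = ['PATH', 'HOME', 'LANG'];

// Copies only allow-listed variables from the source environment.
function buildChildEnv(allowed = DEFAULT_ALLOWED, source = process.env) {
  const env = {};
  for (const name of allowed) {
    if (source[name] !== undefined) env[name] = source[name];
  }
  return env;
}

// Runs a program with an argument array (no shell interpretation) and the
// reduced environment, returning the spawnSync result object.
function runRestricted(program, args, allowed = DEFAULT_ALLOWED) {
  return spawnSync(program, args, {
    env: buildChildEnv(allowed),
    shell: false,
    encoding: 'utf8',
  });
}
```

With this pattern, a secret such as an API token in the agent's environment is not visible to the evaluator unless its name is added to the allow-list. Filesystem access is unaffected, so a sandbox or container is still advisable for untrusted evaluators.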
Persistence & Privilege
The skill sets always:false and does not install itself persistently. It writes temporary shell scripts to the current working directory when executing complex commands and relies on state files in the capability-evolver module's state path. It does not modify other skills' configs directly, but it calls other skill code (capability-evolver) which could have broader effects; verify those sibling modules before use.
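How to Use
  1. Make sure OpenClaw is installed (local or Docker deployment).
  2. Enter the install command in the chat box: /install hle-benchmark-evolver
  3. Once installed, invoke the skill by name or trigger it with /hle-benchmark-evolver
  4. Provide the inputs described in the skill's parameters to receive structured output.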
Version History
v1.0.0
  • Initial release of hle-benchmark-evolver skill for OpenClaw.
  • Enables ingestion of HLE benchmark report JSONs to drive curriculum and evolution workflows.
  • Supports easy-first curriculum queues, focus area suggestion, and immediate result summaries.
  • Offers shell commands for both single-run and fully automated evolution-feedback loops.
  • Always outputs compact, structured JSON summarizing key progress metrics and curriculum focus.
Metadata
Slug hle-benchmark-evolver
Version 1.0.0
License
Total installs 0
Current installs 0
Version count 1
FAQ

What is Hle Benchmark Evolver?

Runs HLE-oriented benchmark reward ingestion and curriculum generation for capability-evolver. Use when the user asks to optimize Humanity's Last Exam score,... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 735 total downloads to date.

How do I install Hle Benchmark Evolver?

Run the command "/install hle-benchmark-evolver" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Hle Benchmark Evolver free?

Yes. Hle Benchmark Evolver is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Hle Benchmark Evolver support?

Hle Benchmark Evolver is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Hle Benchmark Evolver?

It is developed and maintained by WANGJUNJIE (@wanng-ide); the current version is v1.0.0.
