
Hle Benchmark Evolver

Name: Hle Benchmark Evolver
Author: wanng-ide

Author: WANGJUNJIE · GitHub · v1.0.0

cross-platform ⚠ suspicious

Total downloads: 735

Install in OpenClaw

/install hle-benchmark-evolver

Description

Runs HLE-oriented benchmark reward ingestion and curriculum generation for capability-evolver. Use when the user asks to optimize Humanity's Last Exam score,...

Usage instructions (SKILL.md)

HLE Benchmark Evolver

This skill operationalizes HLE score-driven evolution for OpenClaw.

When to Use

The user asks to improve an HLE score (for example, a target of >= 60%).
The user provides question-level benchmark output and wants it converted to a reward.
The user wants an easy-first curriculum queue and next-focus questions.
The user asks for an immediate benchmark result snapshot.

Inputs

Benchmark report JSON path (--report=/abs/path/report.json)
Optional benchmark id (defaults to cais/hle)

Workflow

Validate the report JSON exists and is parseable.
Ingest report into capability-evolver benchmark reward state.
Generate curriculum signals:
- benchmark_*
- curriculum_stage:*
- focus_subject:*
- focus_modality:*
- question_focus:*
Return a compact result summary for this run.
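
The first workflow step (check that the report exists and parses) could be sketched as follows. This is a minimal illustration only: `loadReport` is a hypothetical helper, not a function from the skill's published scripts, and it makes no assumption about the report's internal schema.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper (not part of the skill): resolve the path, confirm the
// file exists, and parse it as JSON, failing with a readable message otherwise.
function loadReport(reportPath) {
  const resolved = path.resolve(reportPath);
  if (!fs.existsSync(resolved)) {
    throw new Error(`report not found: ${resolved}`);
  }
  const raw = fs.readFileSync(resolved, "utf8");
  try {
    return JSON.parse(raw);
  } catch (err) {
    throw new Error(`report is not valid JSON: ${err.message}`);
  }
}
```

Failing early here keeps a malformed report from reaching the ingestion step, where an error would be harder to trace back to the input file.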

Run

node skills/hle-benchmark-evolver/run_result.js --report=/absolute/path/hle_report.json

Fully automatic loop (starts an evolution cycle):

node skills/hle-benchmark-evolver/run_pipeline.js --report=/absolute/path/hle_report.json --cycles=1

If your evaluator can be called from a shell, let the pipeline generate the report on each cycle:

node skills/hle-benchmark-evolver/run_pipeline.js \
  --report=/absolute/path/hle_report.json \
  --eval_cmd="python /path/to/eval_hle.py --out {{report}}" \
  --cycles=3 --interval_ms=2000

If --report is omitted, the report path defaults to:

skills/capability-evolver/assets/gep/hle_report.template.json

Output Contract

Always print JSON with these fields:

benchmark_id
run_id
accuracy
reward
trend
curriculum_stage
queue_size
focus_subjects
focus_modalities
next_questions
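
A caller that consumes this output may want to confirm that every contract field is present before relying on it. The sketch below assumes the skill's stdout has been captured as a string; `missingFields` is a hypothetical helper, and only the field names come from the list above. Field types are not checked because the listing does not document them.

```javascript
// Field names copied from the Output Contract list.
const REQUIRED_FIELDS = [
  "benchmark_id", "run_id", "accuracy", "reward", "trend",
  "curriculum_stage", "queue_size", "focus_subjects",
  "focus_modalities", "next_questions",
];

// Hypothetical helper: return the contract fields absent from the printed JSON.
function missingFields(stdoutText) {
  const result = JSON.parse(stdoutText);
  // Anything other than a plain JSON object cannot satisfy the contract.
  if (result === null || typeof result !== "object" || Array.isArray(result)) {
    return [...REQUIRED_FIELDS];
  }
  return REQUIRED_FIELDS.filter(
    (name) => !Object.prototype.hasOwnProperty.call(result, name)
  );
}
```

An empty array means all ten fields were printed.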

Notes

This skill handles reward/curriculum ingestion. It does not directly solve HLE questions.
run_pipeline.js links ingestion, evolve, and solidify into one executable loop.

Security recommendations

This skill appears to implement HLE report ingestion and curriculum generation, but take these precautions before installing or running it:
- Ensure the expected sibling modules exist: capability-evolver (or feishu-evolver-wrapper). Inspect their src/gep/benchmarkReward.js and index.js to confirm which state files they touch and what side effects they have.
- Avoid passing untrusted commands to --eval_cmd. The pipeline will run that command via the shell (and may execute it as a temporary script using a login shell), giving it full access to the agent's environment and filesystem; it can read env vars and files or exfiltrate data.
- Run the skill in an isolated/test environment first (no secrets in environment, limited filesystem access) and try it with the provided sample report to observe behaviour.
- Review where the benchmark state is stored (reward.getStatePath()) and ensure you are comfortable with reads/writes to that path.
- If you must run eval_cmd, prefer to run a controlled evaluator executable you trust and pass a restricted environment (or run inside a sandbox/container).

If you want a safer install, ask the skill author to: declare the dependency on capability-evolver, document the state path and files touched, and add explicit warnings and safeguards around executing arbitrary eval_cmd shell commands.
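
One way to follow the advice about a trusted executable and a restricted environment is to launch the evaluator yourself, outside the pipeline, without a shell and with an allow-listed environment. The sketch below uses Node's `child_process.spawnSync`; `runEvaluator` and its `allowedEnvNames` parameter are hypothetical names for illustration and are not part of the skill.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: run a trusted evaluator with no shell and only an
// allow-listed subset of the parent's environment variables.
function runEvaluator(executable, args, allowedEnvNames = ["PATH"]) {
  const env = {};
  for (const name of allowedEnvNames) {
    if (process.env[name] !== undefined) env[name] = process.env[name];
  }
  // shell: false passes args as an argv array, so no shell parsing occurs.
  const result = spawnSync(executable, args, { env, shell: false, encoding: "utf8" });
  if (result.error) throw result.error;
  return { status: result.status, stdout: result.stdout, stderr: result.stderr };
}
```

For example, `runEvaluator("python", ["/path/to/eval_hle.py", "--out", reportPath])` would write the report without exposing other environment variables to the evaluator. This limits environment exposure only; filesystem access still needs a sandbox or container.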

Functional analysis

Type: OpenClaw Skill
Name: hle-benchmark-evolver
Version: 1.0.0

The skill is highly suspicious due to a critical Remote Code Execution (RCE) vulnerability. The `SKILL.md` documentation explicitly instructs the OpenClaw agent to execute arbitrary shell commands via the `--eval_cmd` parameter. The `run_pipeline.js` script then directly implements this by using `child_process.spawnSync('bash', ['-c', command])` to execute the provided command, which includes a `{{report}}` placeholder that could also be leveraged for further injection if the report path is attacker-controlled. While this capability is presented as a feature to integrate external evaluators, it represents a severe prompt injection and shell injection risk, allowing an attacker to execute arbitrary code on the host system if they can control the input to this skill.
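
The `{{report}}` injection concern can be illustrated with a small mitigation sketch: quote the report path for a POSIX shell before substituting it into the command template. Both helpers below are hypothetical and do not reflect how `run_pipeline.js` performs the substitution. They assume the template does not already wrap the placeholder in quotes.

```javascript
// Hypothetical helper: wrap a value in single quotes for a POSIX shell,
// escaping any embedded single quote as '\''.
function shellQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

// Hypothetical helper: substitute every {{report}} placeholder with the
// quoted path. split/join avoids the special "$" patterns of String.replace.
function fillReportPlaceholder(template, reportPath) {
  return template.split("{{report}}").join(shellQuote(reportPath));
}
```

Quoting neutralizes an attacker-controlled path only. It does nothing about an attacker-controlled command template, which remains arbitrary code execution.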

Capability assessment

ℹ Purpose & Capability

The code implements ingestion, reporting, and a pipeline that calls out to a 'capability-evolver' (or a 'feishu-evolver-wrapper') module and invokes that skill's index.js for evolve/solidify. That dependency is not declared in the SKILL.md or package metadata; the skill will fail or behave differently if those sibling modules are missing. Otherwise the requested capabilities (parse report → ingest → generate curriculum signals → optionally drive evolve/solidify) match the stated purpose.

⚠ Instruction Scope

SKILL.md and the scripts allow/encourage executing arbitrary evaluator commands via --eval_cmd which are run through the shell (runShell) and may be written to a temporary script and executed via 'bash -l'. This grants those commands full access to the process environment and filesystem that the agent runs with and can run arbitrary code, read files, or exfiltrate data. The instructions do not warn about that risk or restrict which commands may be executed.

✓ Install Mechanism

There is no network download or install spec — the skill is instruction + local JS files only. No external packages are fetched. That lowers install risk, but the skill expects local sibling modules to exist (capability-evolver or feishu-evolver-wrapper).

⚠ Credentials

The skill declares no required env vars, which is consistent with its metadata, but at runtime it spawns child processes and passes the full process.env to them. Those child processes (eval_cmd or invoked index.js in capability-evolver) can access any environment secrets available to the agent. Also the skill reads/writes state files via the external benchmarkReward module — the path and contents of those files are not documented in SKILL.md.

ℹ Persistence & Privilege

The skill sets always:false and uses no explicit persistent installation. It writes temporary shell scripts to the current working directory when executing complex commands and relies on state files in the capability-evolver module's state path. It does not modify other skills' configs directly, but it calls other skill code (capability-evolver), which could have broader effects; verify those sibling modules before use.

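How to use

Make sure OpenClaw is installed (locally or via Docker).
Enter the install command in the chat box: /install hle-benchmark-evolver
Once installation finishes, invoke the Skill by name or trigger it with /hle-benchmark-evolver.
Provide the required inputs as described in the Skill's parameter documentation to get structured output.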

Version history

v1.0.0

- Initial release of hle-benchmark-evolver skill for OpenClaw.
- Enables ingestion of HLE benchmark report JSONs to drive curriculum and evolution workflows.
- Supports easy-first curriculum queues, focus area suggestion, and immediate result summaries.
- Offers shell commands for both single-run and fully automated evolution-feedback loops.
- Always outputs compact, structured JSON summarizing key progress metrics and curriculum focus.

Metadata

Slug: hle-benchmark-evolver

Version: 1.0.0

License: not specified

Total installs: 0

Current installs: 0

Historical versions: 1

FAQ