wanng-ide

Hle Benchmark Evolver

by WANGJUNJIE · GitHub ↗ · v1.0.0
cross-platform ⚠ suspicious
Downloads: 735 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install hle-benchmark-evolver
Description
Runs HLE-oriented benchmark reward ingestion and curriculum generation for capability-evolver. Use when the user asks to optimize Humanity's Last Exam score,...
README (SKILL.md)

HLE Benchmark Evolver

This skill operationalizes HLE score-driven evolution for OpenClaw.

When to Use

  • User asks to improve the HLE score (for example, a target of >= 60%).
  • User provides question-level benchmark output and wants it converted to reward.
  • User wants an easy-first curriculum queue and next-focus questions.
  • User asks for an immediate benchmark result snapshot.

Inputs

  • Benchmark report JSON path (--report=/abs/path/report.json)
  • Optional benchmark id (defaults to cais/hle)

Workflow

  1. Validate the report JSON exists and is parseable.
  2. Ingest the report into the capability-evolver benchmark reward state.
  3. Generate curriculum signals:
    • benchmark_*
    • curriculum_stage:*
    • focus_subject:*
    • focus_modality:*
    • question_focus:*
  4. Return a compact result summary for this run.
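
Step 1 of the workflow can be sketched as follows. This is a minimal illustration, not the skill's actual code: `loadReport` is a hypothetical helper name, and the error messages are assumptions.

```javascript
const fs = require('node:fs');

// Hypothetical helper (not taken from the skill's source):
// returns the parsed report, or throws an error that names the failing path.
function loadReport(reportPath) {
  if (!fs.existsSync(reportPath)) {
    throw new Error(`Report not found: ${reportPath}`);
  }
  const raw = fs.readFileSync(reportPath, 'utf8');
  try {
    return JSON.parse(raw);
  } catch (err) {
    throw new Error(`Report is not valid JSON: ${reportPath} (${err.message})`);
  }
}
```

Failing early here keeps a missing or malformed report from reaching the ingestion step, where the resulting error would likely be harder to trace.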

Run

node skills/hle-benchmark-evolver/run_result.js --report=/absolute/path/hle_report.json

Full automatic loop (starts evolution cycle):

node skills/hle-benchmark-evolver/run_pipeline.js --report=/absolute/path/hle_report.json --cycles=1

If your evaluator can be called from the shell, let the pipeline generate the report on each cycle:

node skills/hle-benchmark-evolver/run_pipeline.js \
  --report=/absolute/path/hle_report.json \
  --eval_cmd="python /path/to/eval_hle.py --out {{report}}" \
  --cycles=3 --interval_ms=2000

If no --report is provided, the report path defaults to:

skills/capability-evolver/assets/gep/hle_report.template.json
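
The argument handling described above might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the listing does not show how run_pipeline.js actually substitutes `{{report}}`. The quoting step is an added safeguard in this sketch, not a documented behaviour of the skill.

```javascript
// Default path as stated in the README; everything else is a hypothetical sketch.
const DEFAULT_REPORT = 'skills/capability-evolver/assets/gep/hle_report.template.json';

// Read --report=<path> from an argv-style array, falling back to the default.
function resolveReportPath(argv) {
  const prefix = '--report=';
  const arg = argv.find((a) => a.startsWith(prefix));
  const value = arg ? arg.slice(prefix.length) : '';
  return value || DEFAULT_REPORT;
}

// Quote a value for a POSIX shell: wrap it in single quotes and escape embedded ones.
function shellQuote(value) {
  return `'${String(value).replace(/'/g, "'\\''")}'`;
}

// Replace every {{report}} occurrence with the shell-quoted report path.
function fillReportPlaceholder(template, reportPath) {
  return template.split('{{report}}').join(shellQuote(reportPath));
}
```

Quoting the path means a report location containing spaces or shell metacharacters is passed to the evaluator as a single argument. It assumes the template does not already wrap `{{report}}` in quotes of its own.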

Output Contract

Always print JSON with these fields:

  • benchmark_id
  • run_id
  • accuracy
  • reward
  • trend
  • curriculum_stage
  • queue_size
  • focus_subjects
  • focus_modalities
  • next_questions
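
A caller can check the printed JSON against this contract before relying on it. The sketch below assumes only the field names listed above; `missingFields` is a hypothetical helper, and the types of the individual fields are not documented in the listing, so only their presence is checked.

```javascript
// Field names copied from the Output Contract list.
const REQUIRED_FIELDS = [
  'benchmark_id', 'run_id', 'accuracy', 'reward', 'trend',
  'curriculum_stage', 'queue_size', 'focus_subjects',
  'focus_modalities', 'next_questions',
];

// Hypothetical helper: parse the skill's stdout and list any absent fields.
// Throws if the text is not valid JSON.
function missingFields(stdoutText) {
  const result = JSON.parse(stdoutText);
  if (result === null || typeof result !== 'object') {
    return [...REQUIRED_FIELDS];
  }
  return REQUIRED_FIELDS.filter(
    (field) => !Object.prototype.hasOwnProperty.call(result, field)
  );
}
```

An empty array means every contract field is present; anything else names the fields the run failed to emit.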

Notes

  • This skill handles reward/curriculum ingestion. It does not directly solve HLE questions.
  • run_pipeline.js links ingestion, evolve, and solidify into one executable loop.
Usage Guidance
This skill appears to implement HLE report ingestion and curriculum generation, but take these precautions before installing or running it:
  • Ensure the expected sibling modules exist: capability-evolver (or feishu-evolver-wrapper). Inspect their src/gep/benchmarkReward.js and index.js to confirm what state files and side effects they perform.
  • Avoid passing untrusted commands to --eval_cmd. The pipeline will run that command via the shell (and may execute it as a temporary script using a login shell), giving it full access to the agent's environment and filesystem; it can read env vars and files or exfiltrate data.
  • Run the skill in an isolated/test environment first (no secrets in environment, limited filesystem access) and try it with the provided sample report to observe behaviour.
  • Review where the benchmark state is stored (reward.getStatePath()) and ensure you are comfortable with reads/writes to that path.
  • If you must run eval_cmd, prefer to run a controlled evaluator executable you trust and pass a restricted environment (or run inside a sandbox/container).
If you want a safer install, ask the skill author to: declare the dependency on capability-evolver, document the state path and files touched, and add explicit warnings and safeguards around executing arbitrary eval_cmd shell commands.
Capability Analysis
Type: OpenClaw Skill
Name: hle-benchmark-evolver
Version: 1.0.0
The skill is highly suspicious due to a critical Remote Code Execution (RCE) vulnerability. The `SKILL.md` documentation explicitly instructs the OpenClaw agent to execute arbitrary shell commands via the `--eval_cmd` parameter. The `run_pipeline.js` script then directly implements this by using `child_process.spawnSync('bash', ['-c', command])` to execute the provided command, which includes a `{{report}}` placeholder that could also be leveraged for further injection if the report path is attacker-controlled. While this capability is presented as a feature to integrate external evaluators, it represents a severe prompt injection and shell injection risk, allowing an attacker to execute arbitrary code on the host system if they can control the input to this skill.
Capability Assessment
Purpose & Capability
The code implements ingestion, reporting, and a pipeline that calls out to a 'capability-evolver' (or a 'feishu-evolver-wrapper') module and invokes that skill's index.js for evolve/solidify. That dependency is not declared in the SKILL.md or package metadata; the skill will fail or behave differently if those sibling modules are missing. Otherwise the requested capabilities (parse report → ingest → generate curriculum signals → optionally drive evolve/solidify) match the stated purpose.
Instruction Scope
SKILL.md and the scripts allow, and even encourage, executing arbitrary evaluator commands via --eval_cmd. These commands are run through the shell (runShell) and may be written to a temporary script and executed via 'bash -l'. They therefore get full access to the process environment and filesystem that the agent runs with, and can run arbitrary code, read files, or exfiltrate data. The instructions do not warn about that risk or restrict which commands may be executed.
Install Mechanism
There is no network download or install spec — the skill is instruction + local JS files only. No external packages are fetched. That lowers install risk, but the skill expects local sibling modules to exist (capability-evolver or feishu-evolver-wrapper).
Credentials
The skill declares no required env vars, which is consistent with its metadata, but at runtime it spawns child processes and passes the full process.env to them. Those child processes (eval_cmd or invoked index.js in capability-evolver) can access any environment secrets available to the agent. Also the skill reads/writes state files via the external benchmarkReward module — the path and contents of those files are not documented in SKILL.md.
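
One way to limit this exposure is to run a trusted evaluator yourself, without a shell and with an allow-listed environment, and then hand the resulting report to the skill via --report. The sketch below is an assumption-laden illustration and is not part of the skill: `ENV_ALLOWLIST`, `minimalEnv`, and `runEvaluator` are hypothetical names, and the allow-list contents should be adapted to what your evaluator actually needs.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical allow-list: only these variables reach the evaluator.
const ENV_ALLOWLIST = ['PATH', 'HOME', 'LANG'];

// Copy only allow-listed variables from the source environment.
function minimalEnv(source = process.env) {
  const env = {};
  for (const key of ENV_ALLOWLIST) {
    if (source[key] !== undefined) env[key] = source[key];
  }
  return env;
}

// Run an executable with an argument array. With shell: false (the default),
// arguments are passed as-is and are never parsed by a shell.
function runEvaluator(executable, args) {
  return spawnSync(executable, args, {
    env: minimalEnv(),
    encoding: 'utf8',
    shell: false,
  });
}
```

Because the child process sees only the allow-listed variables, secrets held in the agent's environment are not inherited. This does not restrict filesystem or network access, so a sandbox or container is still advisable for untrusted evaluators.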
Persistence & Privilege
The skill sets always:false and performs no explicit persistent installation. It writes temporary shell scripts to the current working directory when executing complex commands and relies on state files in the capability-evolver module's state path. It does not modify other skills' configs directly, but it calls other skill code (capability-evolver), which could have broader effects; verify those sibling modules before use.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install hle-benchmark-evolver
  3. After installation, invoke the skill by name or use /hle-benchmark-evolver
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
  • Initial release of hle-benchmark-evolver skill for OpenClaw.
  • Enables ingestion of HLE benchmark report JSONs to drive curriculum and evolution workflows.
  • Supports easy-first curriculum queues, focus area suggestion, and immediate result summaries.
  • Offers shell commands for both single-run and fully automated evolution-feedback loops.
  • Always outputs compact, structured JSON summarizing key progress metrics and curriculum focus.
Metadata
Slug hle-benchmark-evolver
Version 1.0.0
License
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Hle Benchmark Evolver?

Runs HLE-oriented benchmark reward ingestion and curriculum generation for capability-evolver. Use when the user asks to optimize Humanity's Last Exam score,... It is an AI Agent Skill for Claude Code / OpenClaw, with 735 downloads so far.

How do I install Hle Benchmark Evolver?

Run "/install hle-benchmark-evolver" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Hle Benchmark Evolver free?

Yes, Hle Benchmark Evolver is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Hle Benchmark Evolver support?

Hle Benchmark Evolver is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Hle Benchmark Evolver?

It is built and maintained by WANGJUNJIE (@wanng-ide); the current version is v1.0.0.
