Downloads: 230
Stars: 0
Active Installs: 0
Versions: 3
Install in OpenClaw
/install openclaw-skill-eval
Description
Skill evaluation framework. Use when: testing trigger rate, quality compare (with/without skill), or model comparison. Runs via sessions_spawn + sessions_his...
Usage Guidance
This skill is internally consistent with its stated purpose, but take these precautions before running it:
- Review SKILL.md and the bundled scripts (especially anything that writes files) so you understand what will be stored under eval-workspace/.
- Don't run evaluations against skills or prompts that will surface sensitive credentials or personal data; persisted histories include tool calls and tool results and may capture secrets.
- Because the workflow uses sandbox="inherit" and cleanup="keep", spawned subagents inherit the main agent environment and their histories are retained. Consider running in a disposable or test account or environment if your agent has access to any sensitive registrations or tokens.
- If you need to test locally, create a clean ~/.openclaw/openclaw.json or ensure skills.load.extraDirs points only to safe directories; the resolver reads that file to find skill paths.
- The skill does not auto-install anything (fake-tool requires manual copy + gateway restart), and there are no remote downloads in SKILL.md — still, inspect any scripts before executing them in your environment.
If you want to proceed safely: run evaluations on a non-production agent, delete eval-workspace/ after reviewing results, and avoid exposing real credentials during tests.
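The check on skills.load.extraDirs mentioned above can be done before any evaluation run. The following is a minimal sketch, assuming the config file is JSON and that extraDirs is a list of path strings nested under skills and load. The key path comes from the guidance above, but the exact schema is an assumption, and the function name and allow-list are hypothetical.

```python
import json
from pathlib import Path


def unexpected_extra_dirs(config_text, allowed_roots):
    """Return extraDirs entries that fall outside every allowed root.

    Assumes the config is JSON with a list at skills.load.extraDirs;
    that layout is inferred from the key name, not from a published schema.
    """
    config = json.loads(config_text)
    extra_dirs = config.get("skills", {}).get("load", {}).get("extraDirs", [])
    allowed = [Path(root).expanduser().resolve() for root in allowed_roots]

    flagged = []
    for entry in extra_dirs:
        path = Path(entry).expanduser().resolve()
        # A directory is acceptable if it equals or sits beneath an allowed root.
        if not any(path == root or root in path.parents for root in allowed):
            flagged.append(entry)
    return flagged
```

To apply it, pass the text of ~/.openclaw/openclaw.json together with the directories you consider safe, and review any entries it returns before running an evaluation.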
Capability Analysis
Type: OpenClaw Skill
Name: openclaw-skill-eval
Version: 1.1.1
The bundle is a comprehensive evaluation framework that utilizes high-risk capabilities, including spawning autonomous sub-sessions via 'sessions_spawn' and executing local Python scripts through 'subprocess' and 'exec' (e.g., in scripts/legacy/run_orchestrator.py). While these behaviors are aligned with the stated purpose of benchmarking AI skills, the framework lacks input sanitization, creating a risk of Remote Code Execution (RCE) if the agent processes untrusted evaluation metadata. Additionally, viewer/generate_review.py starts a local HTTP server and uses 'os.kill' to manage ports, which is aggressive behavior for a skill bundle. No evidence of intentional data exfiltration or backdoors was found, but the broad permissions required for operation warrant a suspicious classification.
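The sanitization gap described above is commonly closed by validating untrusted fields and passing arguments as a list rather than through a shell. The following is a minimal sketch of that general technique. It is not the code in scripts/legacy/run_orchestrator.py, and build_analysis_command, the --skill flag, and the name pattern are all hypothetical.

```python
import re
import subprocess
import sys

# Hypothetical rule: skill names are lowercase letters, digits, and hyphens.
SKILL_NAME = re.compile(r"[a-z0-9][a-z0-9-]{0,63}")


def build_analysis_command(script_path, skill_name):
    """Build an argv list for a local analysis script, rejecting odd names."""
    if not SKILL_NAME.fullmatch(skill_name):
        raise ValueError(f"refusing suspicious skill name: {skill_name!r}")
    # A list argv (no shell=True) means the name is never parsed by a shell.
    return [sys.executable, script_path, "--skill", skill_name]


def run_analysis(script_path, skill_name):
    """Run the script with the validated name and capture its output."""
    command = build_analysis_command(script_path, skill_name)
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

With this shape, a metadata value such as "x; rm -rf /" raises ValueError before any process is started.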
Capability Assessment
Purpose & Capability
Name/description match the included files and runtime instructions: the repository contains resolver and analysis scripts, example evals, and a SKILL.md that instructs the agent to spawn subagents and run local Python analysis. The requested actions (reading skill paths, running trigger/quality/model workflows, writing per-iteration workspaces) are coherent with an evaluation framework.
Instruction Scope
Runtime instructions explicitly tell the agent to read ~/.openclaw/openclaw.json to locate skill directories, call sessions_spawn and sessions_history, run local Python scripts via exec, and write full evaluation data to eval-workspace/. Those actions are expected for an eval tool, but they grant the skill access to user config and full conversation histories (including tool calls/results).
Install Mechanism
No install spec is present (instruction-only skill). Scripts are bundled in the repo and meant to be run locally via exec. There are no remote downloads or installers referenced in SKILL.md; requirements.txt lists requests but analysis scripts are documented as offline. This is low install risk.
Credentials
The skill declares no required env vars or credentials (good). However, it reads ~/.openclaw/openclaw.json and requires subagents to use sandbox="inherit" in spawn calls, which means the spawned sessions may inherit the main agent's registration environment/skill context. While not an explicit credential request, this can expose the same runtime environment to subagents — the behavior is explainable by the tool's purpose but worth noting.
Persistence & Privilege
Workflows require cleanup="keep" and saving full_history.json / raw transcripts to eval-workspace/<skill>/iter-N/. Persisting full session histories (tool_use + tool_result) can retain sensitive data (API keys, tokens, user-provided secrets) if any eval touches them. Combined with sandbox="inherit", retained histories may contain environment-derived data. This is expected for an evaluation tool but represents a real privacy/storage risk that users must manage.
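Before sharing or archiving a workspace, retained histories can be scanned for strings that look like credentials. The following is a rough sketch that treats a history file as plain text. The patterns are illustrative heuristics rather than an exhaustive detector, and redact_secrets is a hypothetical helper, not part of the skill.

```python
import re

# Illustrative patterns only; real tokens come in many more shapes.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),                   # API-key-like strings
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/-]{20,}=*"),   # bearer tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),                        # AWS access key IDs
]


def redact_secrets(text):
    """Replace secret-looking substrings and report how many were found."""
    total = 0
    for pattern in SECRET_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        total += count
    return text, total
```

One way to use it is to read each full_history.json under eval-workspace/<skill>/iter-N/, pass its contents through redact_secrets, and treat any nonzero count as a reason to inspect that iteration by hand.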
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install openclaw-skill-eval
- After installation, invoke the skill by name or use /openclaw-skill-eval
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.1.1
Security: Add runtime actions disclosure, change fake-tool setup to manual (no auto gateway restart or skill install).
v1.1.0
Negative trigger detection (precision, F1), scenario tiering (Tier 1 core / Tier 2 optional / Tier 3 roadmap), false positive diagnosis. 26 unit tests.
v1.0.0
Initial release: trigger rate detection (positive + negative), quality compare, description diagnosis, model comparison. 26 unit tests.
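The precision and F1 figures added in v1.1.0 presumably follow the standard definitions. The following is a minimal sketch of those definitions, assuming each eval case records whether the skill should have triggered and whether it did. The tuple layout and function name are hypothetical and do not reflect the framework's actual data format.

```python
def trigger_metrics(cases):
    """Compute precision, recall, and F1 from (expected, triggered) pairs."""
    tp = sum(1 for expected, triggered in cases if expected and triggered)
    fp = sum(1 for expected, triggered in cases if not expected and triggered)
    fn = sum(1 for expected, triggered in cases if expected and not triggered)

    # Guard each ratio so an empty denominator yields 0.0 instead of an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Negative cases (prompts where the skill should stay silent) affect only precision: each false trigger adds to fp, which is why a positive-only test set cannot reveal an over-eager description.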
Frequently Asked Questions
What is Skill Eval?
Skill Eval is an AI Agent Skill for Claude Code / OpenClaw, with 230 downloads so far. Its listing describes it as: "Skill evaluation framework. Use when: testing trigger rate, quality compare (with/without skill), or model comparison. Runs via sessions_spawn + sessions_his..."
How do I install Skill Eval?
Run "/install openclaw-skill-eval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Skill Eval free?
Yes, Skill Eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Skill Eval support?
Skill Eval is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Skill Eval?
It is built and maintained by Xiaoxing9 (@xiaoxing9); the current version is v1.1.1.