
Skill Eval

Author: xiaoxing9

by Xiaoxing9 · v1.1.1 · MIT-0

cross-platform ⚠ suspicious

Downloads: 230

Install in OpenClaw

/install openclaw-skill-eval

Description

Skill evaluation framework. Use when: testing trigger rate, quality compare (with/without skill), or model comparison. Runs via sessions_spawn + sessions_his...

Usage Guidance

This skill is internally consistent with its stated purpose, but take these precautions before running it:

- Review SKILL.md and the bundled scripts (especially anything that writes files) so you understand what will be stored under eval-workspace/.
- Don't run evaluations against skills or prompts that will surface sensitive credentials or personal data; persisted histories include tool calls and tool results and may capture secrets.
- Because the workflow uses sandbox="inherit" and cleanup="keep", spawned subagents inherit the main agent environment and histories are retained. Consider running in a disposable/test account or environment if you have any sensitive registrations or tokens available to your agent.
- If you need to test locally, create a clean ~/.openclaw/openclaw.json or ensure skills.load.extraDirs points only to safe directories; the resolver reads that file to find skill paths.
- The skill does not auto-install anything (fake-tool requires manual copy + gateway restart), and there are no remote downloads in SKILL.md. Still, inspect any scripts before executing them in your environment.

If you want to proceed safely: run evaluations on a non-production agent, delete eval-workspace/ after reviewing results, and avoid exposing real credentials during tests.

Capability Analysis

Type: OpenClaw Skill
Name: openclaw-skill-eval
Version: 1.1.1

The bundle is a comprehensive evaluation framework that utilizes high-risk capabilities, including spawning autonomous sub-sessions via 'sessions_spawn' and executing local Python scripts through 'subprocess' and 'exec' (e.g., in scripts/legacy/run_orchestrator.py). While these behaviors are aligned with the stated purpose of benchmarking AI skills, the framework lacks input sanitization, creating a risk of Remote Code Execution (RCE) if the agent processes untrusted evaluation metadata. Additionally, viewer/generate_review.py starts a local HTTP server and uses 'os.kill' to manage ports, which is aggressive behavior for a skill bundle. No evidence of intentional data exfiltration or backdoors was found, but the broad permissions required for operation warrant a suspicious classification.
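To make the input-sanitization concern concrete, here is a minimal sketch of how a caller might validate a skill name taken from evaluation metadata before handing it to a local script. It is not the bundle's code: the allowlist pattern, the function names, and the `--skill` flag are all hypothetical and chosen only for illustration.

```python
import re
import subprocess
import sys

# Hypothetical allowlist: letters, digits, underscore, hyphen; max 64 chars.
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,63}")


def validated_skill_name(name):
    """Return name unchanged if it matches the allowlist, else raise ValueError."""
    if not isinstance(name, str) or not SAFE_NAME.fullmatch(name):
        raise ValueError(f"unsafe skill name: {name!r}")
    return name


def run_analysis(script, skill_name):
    """Run a local script with the validated name as a plain argument.

    Arguments are passed as a list and no shell is involved, so the name
    is never interpreted as shell syntax. The --skill flag is an assumption.
    """
    return subprocess.run(
        [sys.executable, script, "--skill", validated_skill_name(skill_name)],
        check=True,
        capture_output=True,
        text=True,
    )
```

Rejecting anything outside a narrow allowlist is stricter than escaping, but it also blocks path traversal such as `../` when the name is later used to build a workspace path.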

Capability Assessment

✓ Purpose & Capability

Name/description match the included files and runtime instructions: the repository contains resolver and analysis scripts, example evals, and a SKILL.md that instructs the agent to spawn subagents and run local Python analysis. The requested actions (reading skill paths, running trigger/quality/model workflows, writing per-iteration workspaces) are coherent with an evaluation framework.

ℹ Instruction Scope

Runtime instructions explicitly tell the agent to read ~/.openclaw/openclaw.json to locate skill directories, call sessions_spawn and sessions_history, run local Python scripts via exec, and write full evaluation data to eval-workspace/. Those actions are expected for an eval tool, but they grant the skill access to user config and full conversation histories (including tool calls/results).
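Before running the skill, it can help to see which directories the config exposes. The sketch below reads the skills.load.extraDirs entry from a config file. It assumes the file is strict JSON and that extraDirs is a list of path strings; neither is confirmed by this listing, and the function name is hypothetical.

```python
import json
from pathlib import Path


def extra_skill_dirs(config_path):
    """Return the skills.load.extraDirs entries from a config file.

    Returns an empty list when the file or any key on the path is missing.
    Assumes strict JSON; a config with comments or trailing commas would
    raise json.JSONDecodeError here.
    """
    config_path = Path(config_path)
    if not config_path.is_file():
        return []
    node = json.loads(config_path.read_text(encoding="utf-8"))
    for key in ("skills", "load", "extraDirs"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return [str(d) for d in node] if isinstance(node, list) else []
```

Calling `extra_skill_dirs(Path.home() / ".openclaw" / "openclaw.json")` lists the directories the resolver would be able to reach, so you can confirm they are safe before an evaluation run.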

✓ Install Mechanism

No install spec is present (instruction-only skill). Scripts are bundled in the repo and meant to be run locally via exec. There are no remote downloads or installers referenced in SKILL.md; requirements.txt lists requests but analysis scripts are documented as offline. This is low install risk.

ℹ Credentials

The skill declares no required env vars or credentials (good). However, it reads ~/.openclaw/openclaw.json and requires subagents to use sandbox="inherit" in spawn calls, which means the spawned sessions may inherit the main agent's registration environment/skill context. While not an explicit credential request, this can expose the same runtime environment to subagents — the behavior is explainable by the tool's purpose but worth noting.

⚠ Persistence & Privilege

Workflows require cleanup="keep" and saving full_history.json / raw transcripts to eval-workspace/<skill>/iter-N/. Persisting full session histories (tool_use + tool_result) can retain sensitive data (API keys, tokens, user-provided secrets) if any eval touches them. Combined with sandbox="inherit", retained histories may contain environment-derived data. This is expected for an evaluation tool but represents a real privacy/storage risk that users must manage.
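One way to manage that risk is to redact secret-looking strings from a saved history before archiving or sharing it. The sketch below walks any JSON-like value recursively; the two patterns are hypothetical examples, and the structure of full_history.json is not documented here, so the function makes no assumptions about field names.

```python
import re

# Hypothetical patterns; extend them for the token formats you actually use.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._\-]{20,}"),
]


def redact(value):
    """Recursively replace secret-looking substrings in a JSON-like value."""
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS:
            value = pattern.sub("[REDACTED]", value)
        return value
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    # Numbers, booleans, and None pass through unchanged.
    return value
```

Pattern-based redaction only catches formats you anticipate, so it complements rather than replaces deleting eval-workspace/ after review.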

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install openclaw-skill-eval
After installation, invoke the skill by name or use /openclaw-skill-eval
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.1.1

Security: Add runtime actions disclosure, change fake-tool setup to manual (no auto gateway restart or skill install).

v1.1.0

v1.1: Negative trigger detection (precision, F1), scenario tiering (Tier 1 core / Tier 2 optional / Tier 3 roadmap), false positive diagnosis. 26 unit tests.
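For readers unfamiliar with the metrics named above, the following sketch shows the standard precision, recall, and F1 definitions applied to trigger results. It is not taken from the skill's scripts: the function name and the input shape (pairs of "should trigger" and "did trigger" booleans) are assumptions for illustration.

```python
def trigger_metrics(results):
    """Compute precision, recall, and F1 from (should_trigger, did_trigger) pairs.

    A false positive is a negative case (should not trigger) that triggered.
    """
    tp = fp = fn = 0
    for should, did in results:
        if should and did:
            tp += 1
        elif did:
            fp += 1
        elif should:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Negative cases matter because a skill that triggers on every prompt has perfect recall; only precision, and therefore F1, exposes the false positives.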

v1.0.0

Initial release: trigger rate detection (positive + negative), quality compare, description diagnosis, model comparison. 26 unit tests.

Metadata

Slug openclaw-skill-eval

Version 1.1.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 3

Frequently Asked Questions

What is Skill Eval?

Skill Eval is a skill evaluation framework for testing trigger rate, comparing quality with and without a skill, and comparing models. It runs via sessions_spawn + sessions_his... and is an AI Agent Skill for Claude Code / OpenClaw, with 230 downloads so far.

How do I install Skill Eval?

Run "/install openclaw-skill-eval" in the OpenClaw or Claude Code chat to install it in one step. The skill does not auto-install anything else; the fake-tool setup requires a manual copy and a gateway restart.

Is Skill Eval free?

Yes, Skill Eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Skill Eval support?

Skill Eval is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Skill Eval?

It is built and maintained by Xiaoxing9 (@xiaoxing9); the current version is v1.1.1.
