Skill-Eval
by jensen-srp · GitHub · v0.4.0 · MIT-0
310 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install skill-eval
Description
Autonomous engine that systematically evaluates and ranks agent skills across models using rubric grading, error taxonomy, and improvement feedback loops.
Usage Guidance
This skill contains a detailed engine design but includes no code, no install spec, and no declared credentials, yet its instructions assume running scripts, writing files, and calling multiple external LLM providers. Before installing or enabling it:
1) Ask the publisher for the missing artifacts (scripts, eval configs) and for a clear list of required API keys and network endpoints.
2) Confirm where it will write outputs and whether it will read other skills or registries; restrict its filesystem scope in a sandbox if possible.
3) Require explicit declarations of any model provider credentials (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) and limit which credentials it may use.
4) If you must run it, do so in an isolated environment with restricted network and filesystem access, and audit its operations and generated files.
The source is unknown; treat it as untrusted until those gaps are resolved.
Capability Analysis
Type: OpenClaw Skill
Name: skill-eval
Version: 0.4.0
The `skill-eval` bundle describes an autonomous system for evaluating and improving AI agent skills. It features high-risk capabilities, including spawning subagents for comparative testing (Phase 3) and a 'self-evolving' mechanism (Phases 9 and 12) that directs the agent to modify its own instructions in `SKILL.md` based on evaluation results. While these functions are consistent with the stated purpose of a benchmarking engine, the combination of self-modification and sub-agent execution constitutes a high-risk operational profile. No evidence of intentional malice, such as data exfiltration or unauthorized access, was identified.
Capability Assessment
Purpose & Capability
The stated purpose (autonomous, multi-model skill-evaluation engine) is plausible, but the SKILL.md describes running scripts, producing files (skill-cards, leaderboard HTML), and calling multiple external model providers. The skill package contains only SKILL.md and no scripts, no eval config files, and declares no required credentials—this is inconsistent with a tool that must orchestrate external model APIs and filesystem outputs.
Instruction Scope
The instructions reference reading/writing structured directories (evals/, workspaces/, knowledge/, skill-cards/, leaderboard/), running generator scripts (generate_skill_card.py, generate_leaderboard.py), and contacting multiple model providers for execution/judging/improvement. Those operations imply filesystem and network/API access, and potentially reading many other skills' manifests; none of these scopes are declared or constrained in the package. Because the skill is instruction-only, the agent would be given broad discretion to create files, call external models, or fetch registries to satisfy the instructions.
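One way to constrain the undeclared filesystem scope is to confine writes to the directories the instructions name. The sketch below is illustrative only: the helper `is_write_allowed` and the idea of a single workspace root are assumptions, not part of the skill package; only the directory names come from the listing.

```python
from pathlib import Path

# Directory names are taken from the listing; treating them as an allowlist
# under one workspace root is an assumption made for this sketch.
ALLOWED_DIRS = ("evals", "workspaces", "knowledge", "skill-cards", "leaderboard")


def is_write_allowed(root: Path, target: str) -> bool:
    """Return True only if `target` resolves inside an allowed directory under `root`.

    Requires Python 3.9+ for Path.is_relative_to.
    """
    root = root.resolve()
    # Resolving collapses ".." segments, so traversal attempts are caught.
    resolved = (root / target).resolve()
    return any(resolved.is_relative_to(root / name) for name in ALLOWED_DIRS)
```

Under this hypothetical policy a write to `evals/run1/result.json` passes, while `evals/../../etc/passwd`, an absolute path outside the root, or `SKILL.md` itself is rejected. Blocking `SKILL.md` is a deliberate choice here, since the analysis flags self-modification of that file as high risk.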
Install Mechanism
There is no install spec and no code files — the skill is instruction-only. This minimizes direct install-time risks (no downloaded executables). However, runtime instructions still imply actions (network calls, file writes) which are outside the install scope.
Credentials
The SKILL.md explicitly expects interaction with multiple external model providers (Anthropic, OpenAI, Google) but the skill declares no required environment variables or primary credential. A real deployment would normally require API keys or tokens for those services. The absence of declared credentials is an incoherence: either the skill expects preexisting global access (not documented) or it omits required sensitive permissions. Both cases should be clarified before use.
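Because the package declares no credentials, an operator may want to check explicitly which provider keys are present before letting the skill run. The following is a minimal sketch under stated assumptions: `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are the names mentioned in the Usage Guidance, `GOOGLE_API_KEY` is a guess, and the helper `missing_credentials` is hypothetical.

```python
import os

# The first two variable names appear in the Usage Guidance;
# GOOGLE_API_KEY is an assumed name, not confirmed by the skill.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_credentials(providers, env=None):
    """Return the sorted env var names that are unset or empty for `providers`."""
    env = os.environ if env is None else env
    return sorted(
        PROVIDER_ENV_VARS[p] for p in providers if not env.get(PROVIDER_ENV_VARS[p])
    )
```

Calling `missing_credentials(["openai", "anthropic"])` before a run gives a concrete list of what the environment lacks, which is the kind of explicit declaration the skill itself omits.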
Persistence & Privilege
The skill does not request always:true, does not include install-time modifications, and is user-invocable only. The SKILL.md describes writing outputs into its own workspace directories, which is a normal level of presence for an evaluation tool and does not, from the provided material, claim system-wide privilege changes.
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install skill-eval
- After installation, invoke the skill by name or use /skill-eval
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.4.0
Skill-Eval v0.4.0 introduces multi-model evaluation and improvement capabilities.
- Added support for evaluating skills across multiple execution models, enabling per-model scoring and consistency analysis.
- Introduced distinct roles for execution, judge, and improvement models; these can be configured globally or per-skill.
- Output reports (skill cards) and the leaderboard now display per-model results and highlight cross-model performance.
- Improved handling for unavailable models, dependency-gated and phantom tooling skills, and unsubstantiated claims.
- Expanded knowledge base with an improvement engine for skill rewrites based on evaluation outcomes.
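The listing does not say how per-model scoring or consistency analysis is computed. As a rough illustration only, one plausible reading is a mean rubric score per execution model plus the spread of those means across models. The function `summarize` and the choice of population standard deviation are assumptions, not the skill's documented method.

```python
from statistics import mean, pstdev


def summarize(scores_by_model):
    """Compute a mean score per model and the spread of those means.

    `scores_by_model` maps a model name to a non-empty list of rubric scores.
    A spread of 0.0 means every model averaged the same score.
    """
    if not scores_by_model:
        raise ValueError("at least one model is required")
    per_model = {model: mean(scores) for model, scores in scores_by_model.items()}
    # Population standard deviation of the per-model means (hypothetical metric).
    spread = pstdev(list(per_model.values()))
    return per_model, spread
```

For example, `summarize({"model-a": [1.0], "model-b": [0.0]})` yields per-model means of 1.0 and 0.0, with a spread of 0.5; a skill that scores well on one model and poorly on another would surface as a high spread.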
Frequently Asked Questions
What is Skill-Eval?
Autonomous engine that systematically evaluates and ranks agent skills across models using rubric grading, error taxonomy, and improvement feedback loops. It is an AI Agent Skill for Claude Code / OpenClaw, with 310 downloads so far.
How do I install Skill-Eval?
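Run "/install skill-eval" in the OpenClaw or Claude Code chat to install it in one step. Installation itself needs no extra setup, but, as noted under Credentials, running evaluations against external model providers will likely require API keys that the skill does not declare.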
Is Skill-Eval free?
Yes, Skill-Eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Skill-Eval support?
Skill-Eval is cross-platform: it runs anywhere OpenClaw / Claude Code is available.
Who created Skill-Eval?
It is built and maintained by jensen-srp (@jensen-srp); the current version is v0.4.0.