
Skill-Eval

Name: Skill-Eval
Author: jensen-srp

By jensen-srp · GitHub · v0.4.0 · MIT-0

cross-platform ⚠ suspicious

310

Total downloads

Current installs

Versions

Install in OpenClaw

/install skill-eval

Description

Autonomous engine that systematically evaluates and ranks agent skills across models using rubric grading, error taxonomy, and improvement feedback loops.

Safe-use recommendations

This skill contains a detailed engine design but includes no code, no install step, and no declared credentials, yet its instructions assume running scripts, writing files, and calling multiple external LLM providers. Before installing or enabling it:

1) Ask the publisher for the missing artifacts (scripts, eval configs) and for a clear list of required API keys and network endpoints.
2) Confirm where it will write outputs and whether it will read other skills or registries; restrict its filesystem scope in a sandbox if possible.
3) Require explicit declarations of any model provider credentials (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) and limit which credentials it may use.
4) If you must run it, do so in an isolated environment with restricted network and filesystem access, and audit its operations and generated files.

The source is unknown. Treat it as untrusted until those gaps are resolved.
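Point 3 above can be approximated at launch time by handing the evaluation process only the credentials you explicitly allow. The following is a minimal sketch under stated assumptions, not part of the skill: `restricted_env` is a hypothetical helper, the allowlist is the operator's choice, and the key values are fake placeholders.

```python
import os
from typing import Iterable, Mapping, Optional


def restricted_env(
    allowed: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of the environment containing only allowlisted variables.

    Hypothetical helper: the skill declares no credentials itself, so the
    allowlist has to be decided by whoever runs it.
    """
    source = os.environ if environ is None else environ
    allowed_set = set(allowed)
    return {name: value for name, value in source.items() if name in allowed_set}


# Example with a fake environment (no real keys involved).
fake_environ = {
    "OPENAI_API_KEY": "sk-test",
    "ANTHROPIC_API_KEY": "ant-test",
    "AWS_SECRET_ACCESS_KEY": "should-not-leak",
    "PATH": "/usr/bin",
}
child_env = restricted_env(["ANTHROPIC_API_KEY", "PATH"], fake_environ)
print(sorted(child_env))
```

The resulting dictionary can be passed as the `env=` argument of `subprocess.run`, so the child process never sees variables outside the allowlist. This limits credential exposure only; it does not restrict network or filesystem access.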

Functional analysis

Type: OpenClaw Skill
Name: skill-eval
Version: 0.4.0

The `skill-eval` bundle describes an autonomous system for evaluating and improving AI agent skills. It features high-risk capabilities, including spawning subagents for comparative testing (Phase 3) and a 'self-evolving' mechanism (Phases 9 and 12) that directs the agent to modify its own instructions in `SKILL.md` based on evaluation results. While these functions are consistent with the stated purpose of a benchmarking engine, the combination of self-modification and sub-agent execution constitutes a high-risk operational profile. No evidence of intentional malice, such as data exfiltration or unauthorized access, was identified.
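Because the analysis flags self-modification of `SKILL.md`, one simple audit is to record a content hash before a run and compare it afterwards. This is a sketch only: `file_digest` and `was_modified` are hypothetical helper names, and the check detects that a change happened, not what changed.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def was_modified(path: Path, baseline_digest: str) -> bool:
    """True if the file's current contents differ from the recorded baseline."""
    return file_digest(path) != baseline_digest
```

A typical use would be to call `file_digest(Path("SKILL.md"))` before invoking the skill, store the result, and call `was_modified` with that stored value once the run finishes.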

Capability assessment

⚠ Purpose & Capability

The stated purpose (autonomous, multi-model skill-evaluation engine) is plausible, but the SKILL.md describes running scripts, producing files (skill-cards, leaderboard HTML), and calling multiple external model providers. The skill package contains only SKILL.md and no scripts, no eval config files, and declares no required credentials—this is inconsistent with a tool that must orchestrate external model APIs and filesystem outputs.

⚠ Instruction Scope

The instructions reference reading/writing structured directories (evals/, workspaces/, knowledge/, skill-cards/, leaderboard/), running generator scripts (generate_skill_card.py, generate_leaderboard.py), and contacting multiple model providers for execution/judging/improvement. Those operations imply filesystem and network/API access, and potentially reading many other skills' manifests; none of these scopes are declared or constrained in the package. Because the skill is instruction-only, the agent would be given broad discretion to create files, call external models, or fetch registries to satisfy the instructions.
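Since the package does not constrain these directories itself, an operator could enforce the scope externally. The sketch below checks whether a path stays inside the directories named above; `is_in_scope` and the example root are assumptions for illustration, not anything the skill ships.

```python
from pathlib import Path

# Directory names taken from the SKILL.md description quoted above.
ALLOWED_DIRS = ("evals", "workspaces", "knowledge", "skill-cards", "leaderboard")


def is_in_scope(candidate: str, root: str, allowed_dirs=ALLOWED_DIRS) -> bool:
    """True if `candidate` resolves to a path inside one of root/<allowed_dir>.

    Relative paths are interpreted against `root`; absolute paths are taken
    as given. `..` segments are normalized before the comparison.
    """
    root_path = Path(root).resolve()
    target = (root_path / candidate).resolve()
    for name in allowed_dirs:
        base = root_path / name
        if target == base or base in target.parents:
            return True
    return False


# Hypothetical workspace root, used only for the example.
print(is_in_scope("evals/run1/result.json", "/srv/skill-eval"))
print(is_in_scope("evals/../../etc/passwd", "/srv/skill-eval"))
```

A check like this is advisory: it only helps if every write is routed through it. OS-level sandboxing remains the stronger control.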

✓ Install Mechanism

There is no install spec and no code files — the skill is instruction-only. This minimizes direct install-time risks (no downloaded executables). However, runtime instructions still imply actions (network calls, file writes) which are outside the install scope.

⚠ Credentials

The SKILL.md explicitly expects interaction with multiple external model providers (Anthropic, OpenAI, Google), but the skill declares no required environment variables or primary credential. A real deployment would normally require API keys or tokens for those services. The absence of declared credentials is an inconsistency: either the skill expects preexisting global access (not documented) or it omits required sensitive permissions. Either case should be clarified before use.

✓ Persistence & Privilege

The skill does not request always:true, does not include install-time modifications, and is user-invocable only. The SKILL.md describes writing outputs into its own workspace directories, which is a normal level of presence for an evaluation tool and does not, from the provided material, claim system-wide privilege changes.

How to use

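Make sure OpenClaw is installed (locally or via Docker).
Enter the install command in the chat box: /install skill-eval
Once installed, invoke the Skill by name or trigger it with /skill-eval
Provide the inputs described in the Skill's parameter documentation to get structured output.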

Version history

v0.4.0

Skill-Eval v0.4.0 introduces multi-model evaluation and improvement capabilities.
- Added support for evaluating skills across multiple execution models, enabling per-model scoring and consistency analysis.
- Introduced distinct roles for execution, judge, and improvement models; these can be configured globally or per-skill.
- Output reports (skill cards) and the leaderboard now display per-model results and highlight cross-model performance.
- Improved handling for unavailable models, dependency-gated and phantom tooling skills, and unsubstantiated claims.
- Expanded knowledge base with an improvement engine for skill rewrites based on evaluation outcomes.
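The release notes do not say how per-model scoring or consistency is computed. As one plausible reading, the sketch below averages each model's rubric scores and uses the population standard deviation of those averages as a cross-model spread. The function name, the input shape, and the choice of metric are all assumptions, and the model names and scores are made up.

```python
from statistics import mean, pstdev


def per_model_summary(scores):
    """Average each model's rubric scores and measure cross-model spread.

    `scores` maps a model name to a list of numeric rubric scores.
    Returns (per-model means, population std dev of those means).
    Models with no scores are skipped; spread is 0.0 for fewer than two models.
    """
    means = {model: mean(values) for model, values in scores.items() if values}
    spread = pstdev(list(means.values())) if len(means) > 1 else 0.0
    return means, spread


# Illustrative data only.
example_scores = {
    "model-a": [4, 5, 3],
    "model-b": [2, 3, 4],
    "model-c": [4, 4, 4],
}
means, spread = per_model_summary(example_scores)
print(means, round(spread, 3))
```

A lower spread suggests the skill behaves similarly across models; a higher one suggests its quality depends on which model executes it.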

Metadata

Slug skill-eval

Version 0.4.0

License MIT-0

Total installs 0

Current installs 0

Versions 1

FAQ

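What is Skill-Eval?

Autonomous engine that systematically evaluates and ranks agent skills across models using rubric grading, error taxonomy, and improvement feedback loops. It is an AI agent skill plugin for Claude Code / OpenClaw, with 310 total downloads to date.

How do I install Skill-Eval?

Run the command "/install skill-eval" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is required.

Is Skill-Eval free?

Yes. Skill-Eval is completely free and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does Skill-Eval support?

Skill-Eval is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Skill-Eval?

It is developed and maintained by jensen-srp (@jensen-srp); the current version is v0.4.0.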