sky-lv

Agent Quality Tester

Author: SKY-lv · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 85
Favorites: 0
Current installs: 1
Versions: 1
Install in OpenClaw
/install agent-quality-tester
Description
Evaluates AI agent outputs across accuracy, efficiency, safety, coherence, and adaptability, providing scores and improvement suggestions.
Usage (SKILL.md)

Agent Evaluator

Score any AI agent's behavior across 5 objective dimensions.

Scoring Dimensions

Dimension      Weight   What it measures
Accuracy       30%      Correctness of outputs and decisions
Efficiency     20%      Resource usage, speed, token optimization
Safety         20%      Harmlessness, no prompt injection, data privacy
Coherence      15%      Logical consistency across turns
Adaptability   15%      Learning from feedback, self-correction
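
SKILL.md does not say how the per-dimension scores are combined. A minimal sketch, assuming the overall score is a plain weighted mean of the five dimension scores on a 0-10 scale; the names `WEIGHTS` and `overall_score` are hypothetical and not taken from the shipped script, which according to the capability assessment on this page uses different dimensions and weights. With the scores from the example report, this assumption gives 8.115, which matches the reported 8.1.

```python
# Weights as listed in the Scoring Dimensions table (they sum to 1.0).
WEIGHTS = {
    "accuracy": 0.30,
    "efficiency": 0.20,
    "safety": 0.20,
    "coherence": 0.15,
    "adaptability": 0.15,
}


def overall_score(scores: dict) -> float:
    """Return the weighted mean of per-dimension scores (0-10 scale).

    Assumption: a simple weighted mean; the source does not confirm this.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Per-dimension scores copied from the example report on this page.
example = {
    "accuracy": 8.5,
    "efficiency": 7.0,
    "safety": 9.2,
    "coherence": 8.0,
    "adaptability": 7.5,
}
print(round(overall_score(example), 1))  # 8.1
```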

Evaluation Flow

  1. Input: Agent's recent conversation or output samples
  2. Analysis: Score each dimension using LLM-as-judge
  3. Report: Detailed breakdown + improvement suggestions
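
The three steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the shipped implementation: the `Judge` signature, `evaluate`, `report`, and `stub_judge` are all hypothetical, and the judge is a plain callable so that either an LLM call or a local heuristic could be plugged in.

```python
from typing import Callable, Dict

DIMENSIONS = ["accuracy", "efficiency", "safety", "coherence", "adaptability"]

# Hypothetical judge signature: (dimension, transcript) -> score in [0, 10].
Judge = Callable[[str, str], float]


def evaluate(transcript: str, judge: Judge) -> Dict[str, float]:
    """Step 2: ask the judge for one score per dimension, clamped to 0-10."""
    return {
        dim: max(0.0, min(10.0, float(judge(dim, transcript))))
        for dim in DIMENSIONS
    }


def report(scores: Dict[str, float]) -> str:
    """Step 3: render a plain-text breakdown, weakest dimension first."""
    lines = ["AGENT EVALUATION REPORT"]
    for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
        label = name.capitalize() + ":"
        lines.append(f"{label:<15}{value:.1f}/10")
    return "\n".join(lines)


def stub_judge(dimension: str, transcript: str) -> float:
    # Placeholder standing in for a real LLM call; always returns 7.0.
    return 7.0


# Step 1: the input is a conversation transcript or output sample.
print(report(evaluate("user: hi\nagent: hello", stub_judge)))
```

Keeping the judge behind a callable makes it easy to see, and to test, which scoring method is actually in use.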

Quick Start

Evaluate the agent in my conversation history

Example Output

AGENT EVALUATION REPORT
========================
Accuracy:      8.5/10 ████████▓░
Efficiency:    7.0/10 ███████░░░
Safety:        9.2/10 █████████▒
Coherence:     8.0/10 ████████░░
Adaptability:  7.5/10 ███████▓░░
------------------------
OVERALL:       8.1/10

Top Issues:
- [HIGH] Efficiency: Consider using caching for repeated calls
- [MEDIUM] Adaptability: Add self-reflection step after each task

Recommendations:
1. Implement cost-guard for token tracking
2. Add error-recovery loop for failed API calls
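
The bar glyphs in the example report are not explained in SKILL.md. A minimal sketch that reproduces the five bars shown, assuming one full cell per whole point, `▓` for a fractional part of at least 0.5, `▒` for a smaller non-zero fraction, and `░` as padding; the thresholds are inferred from the example only, and `render_bar` is a hypothetical name.

```python
def render_bar(score: float, width: int = 10) -> str:
    """Render a 0-10 score as a fixed-width text bar.

    Assumption: glyph thresholds are inferred from the example report,
    not taken from the shipped script.
    """
    score = max(0.0, min(float(width), score))  # clamp to the bar range
    full = int(score)                           # whole cells
    frac = score - full                         # leftover fraction
    if frac >= 0.5:
        partial = "▓"
    elif frac > 0:
        partial = "▒"
    else:
        partial = ""
    return ("█" * full + partial).ljust(width, "░")


for name, value in [("Accuracy", 8.5), ("Safety", 9.2), ("Efficiency", 7.0)]:
    label = name + ":"
    print(f"{label:<15}{value:.1f}/10 {render_bar(value)}")
```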

Use Cases

  • Before shipping: Validate agent quality before release
  • Regression testing: Detect quality drops after updates
  • A/B comparison: Compare two agents or prompts objectively
  • User feedback loop: Convert user corrections into objective scores
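
For the regression-testing and A/B use cases, two score reports can be compared dimension by dimension. A minimal sketch, assuming each report is a dict of 0-10 scores; `find_regressions` and the 0.5-point `tolerance` are hypothetical choices, not part of the Skill.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.5) -> dict:
    """Return {dimension: drop} where the candidate scores lower than the
    baseline by more than `tolerance` points. Dimensions missing from
    either report are ignored."""
    return {
        name: round(baseline[name] - candidate[name], 2)
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }


# Illustrative numbers only, not measured results.
before = {"accuracy": 8.5, "efficiency": 7.0, "safety": 9.2}
after = {"accuracy": 8.4, "efficiency": 5.5, "safety": 9.2}
print(find_regressions(before, after))  # {'efficiency': 1.5}
```

A tolerance is useful here because judge-based scores tend to vary slightly between runs, so small differences should not be flagged as regressions.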

MIT License © SKY-lv

Safety recommendations
This package contains an evaluator script and docs that disagree about what is being measured and how. Before installing or trusting results:

  1. Confirm with the author which dimensions should be scored and whether the implementation should use an LLM; the code currently uses simple regex heuristics, not an external judge.
  2. If you need scores for 'coherence' or 'adaptability', inspect and/or modify the code to implement those measures, or decline to use it.
  3. Run the script on sample, non-sensitive logs to see how it scores and whether the suggestions make sense.
  4. Because the SKILL.md and README differ from the code, treat outputs as potentially misleading until the inconsistencies are resolved.
Functional analysis
Type: OpenClaw Skill
Name: agent-quality-tester
Version: 1.0.0

The bundle provides a utility for evaluating AI agent logs using a Node.js script (agent_evaluator.js) that performs basic keyword-based scoring across several dimensions such as accuracy and safety. The code uses standard file system modules (fs, path) to read logs and contains no network calls, obfuscation, or instructions designed to exfiltrate data or execute unauthorized commands. From a security perspective, the behavior is consistent with the stated purpose in SKILL.md and README.md, although the scored dimensions and scoring method differ from the documentation (see Purpose & Capability).
Capability assessment
Purpose & Capability
The manifest and SKILL.md claim evaluation across Accuracy, Efficiency, Safety, Coherence, and Adaptability (weighted 30/20/20/15/15), but the included code implements different criteria: accuracy, efficiency, clarity, safety, and helpfulness, with different weights (25/20/15/20/20). The declared purpose (measuring those five dimensions) does not match the actual implementation: coherence and adaptability are absent from the code, and 'clarity' and 'helpfulness' are used instead. This mismatch is material because users expect scores for the named dimensions.
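
The mismatch described above can be made explicit with a small comparison. A sketch only: the `documented` weights come from the SKILL.md table, while the `implemented` mapping assumes the 25/20/15/20/20 weights apply to the code's dimensions in the order they are listed here, which this page does not confirm. `diff_dimensions` is a hypothetical helper.

```python
# Weights in percent. `implemented` assumes the listed order maps one-to-one
# onto the listed weights (an assumption, not verified against the script).
documented = {"accuracy": 30, "efficiency": 20, "safety": 20,
              "coherence": 15, "adaptability": 15}
implemented = {"accuracy": 25, "efficiency": 20, "clarity": 15,
               "safety": 20, "helpfulness": 20}


def diff_dimensions(doc: dict, impl: dict) -> dict:
    """Summarize how implemented dimensions differ from documented ones."""
    shared = sorted(doc.keys() & impl.keys())
    return {
        "missing_from_code": sorted(doc.keys() - impl.keys()),
        "undocumented": sorted(impl.keys() - doc.keys()),
        "weight_changed": {k: (doc[k], impl[k]) for k in shared if doc[k] != impl[k]},
    }


print(diff_dimensions(documented, implemented))
```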
Instruction Scope
SKILL.md suggests an 'LLM-as-judge' approach and describes evaluation flow in abstract terms, but the shipped JS performs simple local text analysis using regex heuristics and reads user-provided files. The instructions do not direct reading unrelated system files or exfiltration, but the description and the implementation diverge (claimed LLM-based judgment vs local heuristic scoring). The SKILL.md also lists weights that don't match the code/README.
Install Mechanism
There is no install spec and there are no external downloads, only a local JS file and docs. This is low risk from an installation and supply-chain perspective.
Credentials
The skill requests no environment variables, no credentials, and no config paths. The code reads only an input file provided at runtime; there are no hidden credential accesses.
Persistence & Privilege
The skill does not request permanent/always-on presence and uses normal invocation. It does not attempt to modify other skills or agent-wide configuration.
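How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install agent-quality-tester
  3. Once installation completes, invoke the Skill by name or trigger it with /agent-quality-tester
  4. Provide the inputs described in the Skill's parameter notes to get structured output.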
Version history
v1.0.0
Initial release: introduces automated, multi-dimensional evaluation for AI agents.
  • Scores agent behavior across 5 key dimensions: accuracy, efficiency, safety, coherence, and adaptability.
  • Generates detailed evaluation reports with actionable improvement suggestions.
  • Supports quick conversation evaluations, regression testing, and A/B comparisons.
  • Designed for both pre-release validation and ongoing agent quality monitoring.
  • MIT Licensed.
Metadata
Slug: agent-quality-tester
Version: 1.0.0
License: MIT-0
Total installs: 1
Current installs: 1
Version count: 1
FAQ

What is Agent Quality Tester?

Evaluates AI agent outputs across accuracy, efficiency, safety, coherence, and adaptability, providing scores and improvement suggestions. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 85 total downloads so far.

How do I install Agent Quality Tester?

Run the command "/install agent-quality-tester" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

Is Agent Quality Tester free?

Yes. Agent Quality Tester is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Agent Quality Tester support?

Agent Quality Tester is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Agent Quality Tester?

It is developed and maintained by SKY-lv (@sky-lv); the current version is v1.0.0.
