Agent Quality Tester

Author: SKY-lv · v1.0.0 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install agent-quality-tester

Description

Evaluates AI agent outputs across accuracy, efficiency, safety, coherence, and adaptability, providing scores and improvement suggestions.

Usage (SKILL.md)

Agent Evaluator

Score any AI agent's behavior across 5 objective dimensions.

Scoring Dimensions

Dimension	Weight	What it measures
Accuracy	30%	Correctness of outputs and decisions
Efficiency	20%	Resource usage, speed, token optimization
Safety	20%	Harmlessness, no prompt injection, data privacy
Coherence	15%	Logical consistency across turns
Adaptability	15%	Learning from feedback, self-correction
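A minimal sketch of how the weights in the table could combine into a single overall score, assuming a plain weighted mean on a 0-10 scale. The names `WEIGHTS` and `overall_score` are hypothetical and not taken from the shipped script; the sample values are the per-dimension scores from the Example Output in this document.

```python
# Weights copied from the Scoring Dimensions table; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.30,
    "efficiency": 0.20,
    "safety": 0.20,
    "coherence": 0.15,
    "adaptability": 0.15,
}


def overall_score(scores: dict[str, float]) -> float:
    """Return the weighted mean of per-dimension scores (0-10 scale).

    Assumption: the overall score is a simple weighted mean. The source
    does not state how the overall figure is actually derived.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    sample = {
        "accuracy": 8.5,
        "efficiency": 7.0,
        "safety": 9.2,
        "coherence": 8.0,
        "adaptability": 7.5,
    }
    print(f"OVERALL: {overall_score(sample):.1f}/10")
```

Under this assumption the sample scores give 8.115, which rounds to the 8.1 shown in the example report.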

Evaluation Flow

1. Input: Agent's recent conversation or output samples
2. Analysis: Score each dimension using LLM-as-judge
3. Report: Detailed breakdown + improvement suggestions

Quick Start

Evaluate the agent in my conversation history

Example Output

AGENT EVALUATION REPORT
========================
Accuracy:      8.5/10 ████████▓░
Efficiency:    7.0/10 ███████░░░
Safety:        9.2/10 █████████▒
Coherence:     8.0/10 ████████░░
Adaptability:  7.5/10 ███████▓░░
------------------------
OVERALL:       8.1/10

Top Issues:
- [HIGH] Efficiency: Consider using caching for repeated calls
- [MEDIUM] Adaptability: Add self-reflection step after each task

Recommendations:
1. Implement cost-guard for token tracking
2. Add error-recovery loop for failed API calls
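The score bars in the report above are consistent with a simple rendering rule: one full block per whole point, one partial block for any fractional remainder, and light blocks as padding. The sketch below reproduces the five sample bars under that assumption. The function `render_bar` and the 0.5 threshold between the two partial glyphs are inferred from the sample, not confirmed by the source.

```python
def render_bar(score: float, width: int = 10) -> str:
    """Render a score as a fixed-width block bar.

    Assumed rule (inferred from the sample report only):
      - each whole point is a full block
      - a remainder of 0.5 or more is a dark partial block
      - a smaller non-zero remainder is a medium partial block
      - the rest of the bar is padded with light blocks
    """
    score = max(0.0, min(float(score), float(width)))  # clamp to the bar range
    full = int(score)
    remainder = score - full
    if remainder >= 0.5:
        partial = "▓"
    elif remainder > 1e-9:  # ignore floating-point noise on whole numbers
        partial = "▒"
    else:
        partial = ""
    padding = width - full - len(partial)
    return "█" * full + partial + "░" * padding


if __name__ == "__main__":
    for name, value in [("Accuracy", 8.5), ("Safety", 9.2)]:
        print(f"{name + ':':<14} {value:.1f}/10 {render_bar(value)}")
```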

Use Cases

Before shipping: Validate agent quality before release
Regression testing: Detect quality drops after updates
A/B comparison: Compare two agents or prompts objectively
User feedback loop: Convert user corrections into objective scores
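For the regression-testing and A/B use cases, the comparison reduces to diffing two sets of per-dimension scores. The following is a minimal sketch, assuming each run yields a plain mapping from dimension name to a 0-10 score. The function name `find_regressions` and the `tolerance` threshold are hypothetical choices for illustration, not part of the skill.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.5,
) -> dict[str, float]:
    """Return dimensions whose score dropped by more than `tolerance`.

    Values in the result are the signed change (candidate - baseline).
    Dimensions missing from `candidate` are skipped rather than guessed.
    """
    regressions = {}
    for name, old in baseline.items():
        new = candidate.get(name)
        if new is not None and old - new > tolerance:
            regressions[name] = round(new - old, 2)
    return regressions


if __name__ == "__main__":
    before = {"accuracy": 8.5, "efficiency": 7.0, "safety": 9.2}
    after = {"accuracy": 8.4, "efficiency": 6.0, "safety": 9.2}
    print(find_regressions(before, after))
```

A tolerance is worth keeping because scores from heuristic or LLM-based judges tend to fluctuate between runs; the right threshold depends on how noisy the scorer is.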

MIT License © SKY-lv

Safe-use advice

This package contains an evaluator script and docs that disagree about what is being measured and how. Before installing or trusting results:

(1) Confirm with the author which dimensions should be scored and whether the implementation should use an LLM. The code currently uses simple regex heuristics, not an external judge.
(2) If you need scores for 'coherence' or 'adaptability', inspect and/or modify the code to implement those measures, or decline to use it.
(3) Run the script on sample, non-sensitive logs to see how it scores and whether the suggestions make sense.
(4) Because the SKILL.md and README differ from the code, treat outputs as potentially misleading until the inconsistencies are resolved.

Functional analysis

Type: OpenClaw Skill
Name: agent-quality-tester
Version: 1.0.0

The bundle provides a utility for evaluating AI agent logs using a Node.js script (agent_evaluator.js) that performs basic keyword-based scoring across several dimensions like accuracy and safety. The code uses standard file system modules (fs, path) to read logs and contains no network calls, obfuscation, or instructions designed to exfiltrate data or execute unauthorized commands. The behavior is entirely consistent with the stated purpose in SKILL.md and README.md.

Capability assessment

⚠ Purpose & Capability

The manifest and SKILL.md claim evaluation across Accuracy, Efficiency, Safety, Coherence, and Adaptability (with specific weights), but the included code implements different criteria: accuracy, efficiency, clarity, safety, and helpfulness with different weights (25/20/15/20/20). The declared purpose (measure those five dimensions) does not match the actual implementation—coherence and adaptability are absent from code, and 'clarity' and 'helpfulness' are used instead. This mismatch is material because users expect scores for the named dimensions.
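The mismatch described above can be made concrete by diffing the two dimension sets. This is an illustrative sketch only: `DECLARED` comes from the SKILL.md table, while `IMPLEMENTED` pairs the code's dimension names with the 25/20/15/20/20 weights on the assumption that the weights are listed in the same order as the names. The function `diff_dimensions` is hypothetical.

```python
# Weights (percent) as declared in SKILL.md.
DECLARED = {
    "accuracy": 30,
    "efficiency": 20,
    "safety": 20,
    "coherence": 15,
    "adaptability": 15,
}

# Weights (percent) reported for the shipped code.
# Assumption: 25/20/15/20/20 follows the order the names are listed in.
IMPLEMENTED = {
    "accuracy": 25,
    "efficiency": 20,
    "clarity": 15,
    "safety": 20,
    "helpfulness": 20,
}


def diff_dimensions(declared: dict[str, int], implemented: dict[str, int]):
    """Return (missing, extra, reweighted) between docs and code."""
    missing = declared.keys() - implemented.keys()   # documented, not coded
    extra = implemented.keys() - declared.keys()     # coded, not documented
    reweighted = {
        name: (declared[name], implemented[name])
        for name in declared.keys() & implemented.keys()
        if declared[name] != implemented[name]
    }
    return missing, extra, reweighted


if __name__ == "__main__":
    missing, extra, reweighted = diff_dimensions(DECLARED, IMPLEMENTED)
    print("missing from code:", sorted(missing))
    print("undocumented in SKILL.md:", sorted(extra))
    print("weight changes (declared, implemented):", reweighted)
```

Only two of the five dimensions (efficiency and safety) match in both name and weight under this reading, which is why scores from the script should not be read against the SKILL.md table.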

ℹ Instruction Scope

SKILL.md suggests an 'LLM-as-judge' approach and describes evaluation flow in abstract terms, but the shipped JS performs simple local text analysis using regex heuristics and reads user-provided files. The instructions do not direct reading unrelated system files or exfiltration, but the description and the implementation diverge (claimed LLM-based judgment vs local heuristic scoring). The SKILL.md also lists weights that don't match the code/README.

✓ Install Mechanism

There is no install spec and no external downloads—only a local JS file and docs. This is low risk from an installation/execution-supply-chain perspective.

✓ Credentials

The skill requests no environment variables, no credentials, and no config paths. The code reads only an input file provided at runtime; there are no hidden credential accesses.

✓ Persistence & Privilege

The skill does not request permanent/always-on presence and uses normal invocation. It does not attempt to modify other skills or agent-wide configuration.

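How to use

1. Make sure OpenClaw is installed (local or Docker deployment).
2. Enter the install command in the chat box: /install agent-quality-tester
3. Once installation completes, invoke the Skill by name or trigger it with /agent-quality-tester
4. Provide the required inputs as described in the Skill's parameter documentation to receive structured output.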

Version history

v1.0.0

Initial release: introduces automated, multi-dimensional evaluation for AI agents.
- Scores agent behavior across 5 key dimensions: accuracy, efficiency, safety, coherence, and adaptability.
- Generates detailed evaluation reports with actionable improvement suggestions.
- Supports quick conversation evaluations, regression testing, and A/B comparisons.
- Designed for both pre-release validation and ongoing agent quality monitoring.
- MIT Licensed.

Metadata

Slug: agent-quality-tester

Version: 1.0.0

License: MIT-0

Total installs: 1

Current installs: 1

Versions: 1

FAQ