Agent Evaluation

Name: Agent Evaluation
Author: rustyorb

Description

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a field where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Safe Usage Recommendations

Reasonable to install as an evaluation aid. Treat it as high-level guidance, not an automated evaluator, and avoid placing confidential benchmark, prompt, or production data into agent prompts unless your normal data-handling controls allow it.

Functional Analysis

Type: OpenClaw Skill

Name: agent-evaluation

Version: 1.0.0

The skill bundle contains standard metadata and a markdown file describing 'Agent Evaluation'. The markdown content sets a persona for the AI agent and provides extensive information about evaluating LLM agents, including capabilities, requirements, patterns, and anti-patterns. There are no instructions for malicious execution, data exfiltration, persistence, or prompt injection aiming to subvert the agent's intended purpose or security boundaries. All content is descriptive and aligns with the stated goal of agent evaluation.

Capability Assessment

✓ Purpose & Capability

The artifact purpose and content align: it describes agent testing, benchmark design, capability assessment, reliability metrics, regression testing, and evaluation anti-patterns.

✓ Instruction Scope

Instructions are advisory and domain-scoped; they do not ask the agent to override user intent, access private data, run commands, or perform high-impact actions.

✓ Install Mechanism

The package contains a single non-executable SKILL.md file with metadata frontmatter and no scripts, binaries, dependencies, or install hooks.

✓ Credentials

No environment variables, credentials, local files, network services, accounts, or external APIs are requested.

✓ Persistence & Privilege

No persistence, privilege escalation, background execution, memory storage, credential handling, or session/profile use is described.

Version History

v1.0.0

- Initial release of agent-evaluation skill for testing and benchmarking LLM agents.
- Supports behavioral testing, capability assessment, reliability metrics, and production monitoring.
- Includes practical testing patterns: statistical test evaluation, behavioral contract testing, and adversarial testing.
- Highlights common anti-patterns and sharp edges in LLM agent evaluation.
- Designed for use alongside related skills such as multi-agent orchestration and autonomous agents.
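
The release notes name "statistical test evaluation" as one of the testing patterns but this page does not show what it looks like. The following is a minimal sketch of one common reading of that idea, not the skill's own implementation: because an agent's output varies from run to run, a test case is executed many times and judged on a confidence interval for its pass rate rather than on a single run. The function names (`wilson_interval`, `evaluate`, `flaky_agent_case`), the 95% Wilson score interval, and the thresholds are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, Dict, Tuple


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def evaluate(run_case: Callable[[], bool], trials: int, min_pass_rate: float) -> Dict:
    """Run one test case repeatedly; pass only if the interval's lower bound clears the bar."""
    passes = sum(1 for _ in range(trials) if run_case())
    low, high = wilson_interval(passes, trials)
    return {
        "passes": passes,
        "trials": trials,
        "pass_rate": passes / trials,
        "ci": (low, high),
        "ok": low >= min_pass_rate,
    }


# Hypothetical stand-in for a real agent call: succeeds about 80% of the time.
_rng = random.Random(0)


def flaky_agent_case() -> bool:
    return _rng.random() < 0.8


if __name__ == "__main__":
    print(evaluate(flaky_agent_case, trials=50, min_pass_rate=0.6))
```

Judging on the lower bound rather than the raw pass rate makes the check conservative: a case that passes 4 of 5 runs is not treated as equivalent to one that passes 80 of 100.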

Metadata

Slug agent-evaluation

Version 1.0.0

License: not specified

Total installs 184

Current installs 60

Number of versions 1

FAQ

What is Agent Evaluation?

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a field where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent. It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 5372 times to date.

How do I install Agent Evaluation?

Run the command "/install agent-evaluation" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Agent Evaluation free?

Yes. Agent Evaluation is completely free (free and open source) and may be downloaded, installed, and used freely.

Which platforms does Agent Evaluation support?

Agent Evaluation is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Agent Evaluation?

It is developed and maintained by rustyorb (@rustyorb); the current version is v1.0.0.
