Agent Evaluation

Name: Agent Evaluation
Author: rustyorb

Description

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. According to the skill, even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Usage Guidance

Reasonable to install as an evaluation aid. Treat it as high-level guidance, not an automated evaluator, and avoid placing confidential benchmark, prompt, or production data into agent prompts unless your normal data-handling controls allow it.

Capability Analysis

Type: OpenClaw Skill
Name: agent-evaluation
Version: 1.0.0

The skill bundle contains standard metadata and a markdown file describing 'Agent Evaluation'. The markdown content sets a persona for the AI agent and provides extensive information about evaluating LLM agents, including capabilities, requirements, patterns, and anti-patterns. There are no instructions for malicious execution, data exfiltration, persistence, or prompt injection aiming to subvert the agent's intended purpose or security boundaries. All content is descriptive and aligns with the stated goal of agent evaluation.

Capability Assessment

✓ Purpose & Capability

The artifact purpose and content align: it describes agent testing, benchmark design, capability assessment, reliability metrics, regression testing, and evaluation anti-patterns.

✓ Instruction Scope

Instructions are advisory and domain-scoped; they do not ask the agent to override user intent, access private data, run commands, or perform high-impact actions.

✓ Install Mechanism

The package contains a single non-executable SKILL.md file with metadata frontmatter and no scripts, binaries, dependencies, or install hooks.

✓ Credentials

No environment variables, credentials, local files, network services, accounts, or external APIs are requested.

✓ Persistence & Privilege

No persistence, privilege escalation, background execution, memory storage, credential handling, or session/profile use is described.

Version History

v1.0.0

- Initial release of agent-evaluation skill for testing and benchmarking LLM agents.
- Supports behavioral testing, capability assessment, reliability metrics, and production monitoring.
- Includes practical testing patterns: statistical test evaluation, behavioral contract testing, and adversarial testing.
- Highlights common anti-patterns and sharp edges in LLM agent evaluation.
- Designed for use alongside related skills such as multi-agent orchestration and autonomous agents.
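
The release notes name "statistical test evaluation" as a pattern but this listing does not show what it looks like. As a rough illustration only (not code from the skill itself), one common reading is: run the same task several times, since agent output is non-deterministic, and report a pass rate with a confidence interval rather than a single pass/fail. The names `wilson_interval`, `evaluate`, `agent`, `task`, and `check` below are hypothetical, and the agent is assumed to be any callable that maps a task string to an output string.

```python
import math
import random
from typing import Callable


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


def evaluate(
    agent: Callable[[str], str],   # hypothetical: any callable standing in for an agent
    task: str,
    check: Callable[[str], bool],  # hypothetical: returns True when the output is acceptable
    trials: int = 20,
) -> dict:
    """Run the agent `trials` times on one task and summarize the pass rate."""
    passes = sum(1 for _ in range(trials) if check(agent(task)))
    low, high = wilson_interval(passes, trials)
    return {
        "passes": passes,
        "trials": trials,
        "pass_rate": passes / trials,
        "ci_low": low,
        "ci_high": high,
    }


if __name__ == "__main__":
    # Stand-in for a real agent: answers correctly about 70% of the time.
    rng = random.Random(0)

    def flaky_agent(task: str) -> str:
        return "4" if rng.random() < 0.7 else "5"

    print(evaluate(flaky_agent, "What is 2 + 2?", lambda out: out == "4"))
```

The interval makes the uncertainty explicit: with only 20 trials the bounds are wide, so a small difference in pass rate between two agent versions should not be read as a regression without more runs.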

Metadata

Slug agent-evaluation

Version 1.0.0

License —

All-time Installs 184

Active Installs 60

Total Versions 1

Frequently Asked Questions

What is Agent Evaluation?

Agent Evaluation is a skill for testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. According to the skill, even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent. It is an AI Agent Skill for Claude Code / OpenClaw, with 5372 downloads so far.

How do I install Agent Evaluation?

Run "/install agent-evaluation" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Agent Evaluation free?

Yes, Agent Evaluation is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Agent Evaluation support?

Agent Evaluation is cross-platform: it runs anywhere OpenClaw or Claude Code is available.

Who created Agent Evaluation?

It is built and maintained by rustyorb (@rustyorb); the current version is v1.0.0.
