
Agent Evaluation

by rustyorb · GitHub ↗ · v1.0.0
cross-platform ✓ Security Clean
Downloads 5372
Stars 8
Active Installs 60
Versions 1
Install in OpenClaw
/install agent-evaluation
Description
Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a field where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
Usage Guidance
Reasonable to install as an evaluation aid. Treat it as high-level guidance, not an automated evaluator, and avoid placing confidential benchmark, prompt, or production data into agent prompts unless your normal data-handling controls allow it.
Capability Analysis
Type: OpenClaw Skill
Name: agent-evaluation
Version: 1.0.0
The skill bundle contains standard metadata and a markdown file describing 'Agent Evaluation'. The markdown content sets a persona for the AI agent and provides extensive information about evaluating LLM agents, including capabilities, requirements, patterns, and anti-patterns. There are no instructions for malicious execution, data exfiltration, persistence, or prompt injection aiming to subvert the agent's intended purpose or security boundaries. All content is descriptive and aligns with the stated goal of agent evaluation.
Capability Assessment
Purpose & Capability
The artifact purpose and content align: it describes agent testing, benchmark design, capability assessment, reliability metrics, regression testing, and evaluation anti-patterns.
Instruction Scope
Instructions are advisory and domain-scoped; they do not ask the agent to override user intent, access private data, run commands, or perform high-impact actions.
Install Mechanism
The package contains a single non-executable SKILL.md file with metadata frontmatter and no scripts, binaries, dependencies, or install hooks.
Credentials
No environment variables, credentials, local files, network services, accounts, or external APIs are requested.
Persistence & Privilege
No persistence, privilege escalation, background execution, memory storage, credential handling, or session/profile use is described.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install agent-evaluation
  3. After installation, invoke the skill by name or use /agent-evaluation
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
- Initial release of agent-evaluation skill for testing and benchmarking LLM agents.
- Supports behavioral testing, capability assessment, reliability metrics, and production monitoring.
- Includes practical testing patterns: statistical test evaluation, behavioral contract testing, and adversarial testing.
- Highlights common anti-patterns and sharp edges in LLM agent evaluation.
- Designed for use alongside related skills such as multi-agent orchestration and autonomous agents.
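The release notes name statistical test evaluation as one of the included patterns, but this listing does not show what that looks like in practice. The following is only a rough sketch of the general idea, not code from the skill: run a nondeterministic agent on the same task several times and report a pass rate with a confidence interval instead of a single pass/fail. The names `wilson_interval`, `evaluate`, and `flaky_agent` are hypothetical, and the agent is a random stand-in rather than a real LLM call.

```python
import math
import random
from typing import Callable, Dict, Tuple


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def evaluate(
    agent: Callable[[str], str],
    task: str,
    check: Callable[[str], bool],
    trials: int = 50,
) -> Dict[str, object]:
    """Run the agent `trials` times on one task and summarize the pass rate."""
    passes = sum(1 for _ in range(trials) if check(agent(task)))
    low, high = wilson_interval(passes, trials)
    return {
        "passes": passes,
        "trials": trials,
        "pass_rate": passes / trials,
        "ci95": (low, high),
    }


def flaky_agent(task: str) -> str:
    # Hypothetical stand-in for an LLM agent: answers correctly about 80% of the time.
    return "4" if random.random() < 0.8 else "5"


if __name__ == "__main__":
    random.seed(0)  # fixed seed so the demo run is repeatable
    result = evaluate(flaky_agent, "What is 2 + 2?", lambda out: out.strip() == "4")
    print(result)
```

A regression gate built on this would typically compare the lower bound of the interval against a threshold rather than the raw pass rate, so that a lucky run on a small number of trials does not pass by chance.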
Metadata
Slug agent-evaluation
Version 1.0.0
License
All-time Installs 184
Active Installs 60
Total Versions 1
Frequently Asked Questions

What is Agent Evaluation?

Agent Evaluation is a skill for testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a field where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent. It is an AI Agent Skill for Claude Code / OpenClaw, with 5372 downloads so far.

How do I install Agent Evaluation?

Run "/install agent-evaluation" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Agent Evaluation free?

Yes, Agent Evaluation is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Agent Evaluation support?

Agent Evaluation is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Agent Evaluation?

It is built and maintained by rustyorb (@rustyorb); the current version is v1.0.0.
