Description

AI-powered agent evaluation and benchmarking assistant — design evaluation suites, run structured assessments (task completion rate, latency, safety, reasoni...

README (SKILL.md)

AI Agent Evaluator

Name: Ai Agent Evaluator
Author: gechengling

Your expert companion for evaluating, benchmarking, and improving AI agents.

In 2026, AI agents are deployed in production at scale, but most teams lack systematic ways to measure their reliability, safety, and real-world performance. This skill bridges that gap by guiding you through rigorous, structured agent evaluation workflows.

What This Skill Does

Evaluation Suite Design: Build custom test suites tailored to your agent's domain (coding, customer support, research, data analysis, etc.)
Benchmark Analysis: Interpret industry benchmarks (SWE-Bench, AgentBench, WebArena, BFCL, ToolBench) and map them to your use case
Multi-Framework Comparison: Compare CrewAI, LangChain, AutoGen, LlamaIndex, and OpenAI Assistants across cost, latency, and task success rate
Failure Mode Analysis: Systematically identify where and why your agent fails
Red Teaming Support: Design adversarial tests to probe agent safety and edge cases
Evaluation Report Generation: Produce structured reports with scores, recommendations, and improvement roadmap

Trigger Phrases

English:

"evaluate my AI agent"
"benchmark this agent"
"compare CrewAI vs LangChain"
"how to test an AI agent"
"agent quality assurance"
"my agent keeps failing at X"
"design evaluation suite for agent"
"agent red teaming"
"production readiness check for agent"

Chinese / 中文:

AI Agent 评估
智能体基准测试
Agent 质量保障
如何测试 AI Agent
比较 CrewAI 和 LangChain
Agent 失败分析
大模型 Agent 上线前检查
智能体对比测试
Agent 红队测试

Core Workflows

Workflow 1: Quick Agent Health Check

Input: Agent description, task type, sample inputs/outputs
Steps:

Classify your agent type (tool-calling, reasoning, multi-step, RAG-based)
Define 5 critical success criteria for your domain
Run 10-question diagnostic on failure patterns
Output health score + top 3 risks

Workflow 2: Benchmark Selection & Interpretation

Input: Agent capabilities, deployment domain
Steps:

Map domain → relevant benchmarks
Explain benchmark methodology (what it tests, limitations)
Show current SOTA scores and realistic targets
Recommend evaluation cadence (dev/staging/production)

Workflow 3: Custom Evaluation Suite Design

Input: Agent goal, available test data, budget/time
Steps:

Define evaluation dimensions (accuracy, latency, safety, cost)
Generate 20-50 representative test cases with ground truth
Set pass/fail thresholds per dimension
Recommend tooling (PromptFoo, Maxim AI, DeepEval, Braintrust)
Provide scoring rubric + analysis template
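
As a rough illustration of the "test cases with ground truth" and "pass/fail thresholds per dimension" steps in Workflow 3, the sketch below scores a tiny suite in plain Python. The `EvalCase` class, the exact-match scoring rule, and every threshold value are assumptions made up for this example; the skill itself does not prescribe them, and real suites would normally use one of the tools listed in the workflow.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected: str  # ground-truth answer


# Hypothetical thresholds; "min" = higher is better, "max" = lower is better.
THRESHOLDS = {
    "accuracy": ("min", 0.90),
    "latency_s": ("max", 3.0),
    "safety": ("min", 0.98),
    "cost_usd": ("max", 0.05),
}


def exact_match_accuracy(cases, outputs):
    """Fraction of outputs equal to the ground truth (case-insensitive)."""
    hits = sum(
        out.strip().lower() == case.expected.strip().lower()
        for case, out in zip(cases, outputs)
    )
    return hits / len(cases)


def check_thresholds(measured):
    """Return a pass/fail flag for every dimension in THRESHOLDS."""
    verdict = {}
    for name, (kind, limit) in THRESHOLDS.items():
        value = measured[name]
        verdict[name] = value >= limit if kind == "min" else value <= limit
    return verdict


cases = [EvalCase("2 + 2?", "4"), EvalCase("Capital of France?", "Paris")]
outputs = ["4", "paris "]  # stand-in for real agent responses

measured = {
    "accuracy": exact_match_accuracy(cases, outputs),
    "latency_s": 3.4,  # made-up measurement
    "safety": 0.99,    # made-up measurement
    "cost_usd": 0.02,  # made-up measurement
}
verdict = check_thresholds(measured)
print(verdict)
print("suite passed:", all(verdict.values()))
```

In this made-up run only the latency dimension misses its threshold, so the suite as a whole fails even though accuracy is perfect. Keeping the verdict per dimension makes that kind of trade-off visible.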

Workflow 4: Failure Mode Deep Dive

Input: Agent logs, failed task transcripts
Steps:

Categorize failures (tool call error, hallucination, loop, context loss, safety block)
Calculate failure rate by category
Root cause analysis for top-3 failure patterns
Actionable fixes: prompt adjustments, retrieval improvements, tool schema corrections
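
The "failure rate by category" and "top-3 failure patterns" steps of Workflow 4 come down to simple counting once each transcript has been labeled. The following is a minimal sketch, assuming the labels were assigned by hand or by a separate grader; the label strings and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical labeled log: one entry per evaluated task.
# None marks a task that succeeded.
labels = [
    "tool_call_error", None, "hallucination", "tool_call_error", None,
    "loop", None, "context_loss", "tool_call_error", "hallucination",
]


def failure_rates(labels):
    """Failure rate per category, as a fraction of all evaluated tasks."""
    total = len(labels)
    counts = Counter(label for label in labels if label is not None)
    return {category: n / total for category, n in counts.items()}


def top_failures(labels, k=3):
    """The k most frequent failure categories with their counts."""
    counts = Counter(label for label in labels if label is not None)
    return counts.most_common(k)


rates = failure_rates(labels)
top3 = top_failures(labels)

for category, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{category}: {rate:.0%}")
print("top 3:", top3)
```

Rates here are computed over all tasks rather than over failures only, so they can be read directly as "share of runs affected". Dividing by the number of failed tasks instead would give each category's share of the failures, which is an equally reasonable choice for root cause work.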

Workflow 5: Multi-Agent Framework Comparison

Input: Use case requirements (e.g., "code review pipeline with 3 agents")
Steps:

Score CrewAI / LangChain / AutoGen / LlamaIndex on 8 dimensions
Estimate cost per 1,000 runs
Provide side-by-side architecture diagram (text)
Final recommendation with rationale
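
The cost-estimate step of Workflow 5 is mostly arithmetic over token counts. The sketch below shows one way to do it, assuming cost is dominated by LLM calls billed per million tokens. The function name, the call profile, and the prices are all hypothetical placeholders, not real vendor pricing or measured framework behavior.

```python
def cost_per_1000_runs(
    calls_per_run: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
) -> float:
    """Estimated LLM spend in USD for 1,000 pipeline runs."""
    cost_per_call = (
        input_tokens_per_call * usd_per_1m_input
        + output_tokens_per_call * usd_per_1m_output
    ) / 1_000_000
    return cost_per_call * calls_per_run * 1000


# Hypothetical profile: 3 agents, each making 4 LLM calls per run.
estimate = cost_per_1000_runs(
    calls_per_run=12,
    input_tokens_per_call=2_000,
    output_tokens_per_call=500,
    usd_per_1m_input=2.50,    # placeholder price
    usd_per_1m_output=10.00,  # placeholder price
)
print(f"${estimate:.2f} per 1,000 runs")
```

With these made-up inputs the estimate comes to roughly $120 per 1,000 runs. Frameworks differ mainly in `calls_per_run` and prompt overhead, so measuring those two numbers on a handful of real runs is usually the most informative part of the comparison.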

Example Interactions

User: "I built a customer support agent with GPT-4o. How do I know if it's ready for production?"

Agent Evaluator:

Great question. Let's run a production readiness assessment across 5 dimensions:

Task Completion Rate: target >92% for Tier-1 support queries
Hallucination Rate: should be <2% on factual product questions
Escalation Accuracy: correctly identifies when to route to human
Latency P95: target <3s for chat use case
Safety Compliance: passes 50-case adversarial test suite

To start, share 10 sample conversations (anonymized) and I'll generate your baseline scorecard.

User: "Compare CrewAI vs LangChain for a financial report analysis pipeline"

Agent Evaluator:
Here's a side-by-side for your use case...
[generates structured comparison table + recommendation]
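
The Latency P95 target in the customer support example above can be checked from raw response times. This is a minimal sketch using the nearest-rank percentile definition, which is one of several common conventions; the helper name and the sample latencies are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the data is less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds for 20 chat turns.
latencies = [
    0.8, 1.1, 0.9, 1.4, 2.2, 1.0, 1.3, 0.7, 1.9, 2.8,
    1.2, 1.5, 0.95, 1.05, 3.6, 1.7, 1.25, 2.1, 0.85, 1.6,
]

p95 = percentile_nearest_rank(latencies, 95)
meets_target = p95 < 3.0  # the 3-second chat target from the example
print(f"P95 = {p95}s, meets target: {meets_target}")
```

In this sample a single 3.6 s outlier does not break the target, because P95 tolerates the slowest 5% of requests. With only 20 samples that is exactly one request, so a larger sample is needed before the number means much.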

Key Concepts Covered

| Concept | Description |
|---------|-------------|
| SWE-Bench | Software engineering task benchmark (GitHub issues) |
| AgentBench | Multi-domain agent task evaluation suite |
| BFCL | Berkeley Function Calling Leaderboard |
| WebArena | Browser automation + web task benchmark |
| Task Success Rate (TSR) | % of tasks completed correctly end-to-end |
| Step Success Rate (SSR) | % of individual reasoning steps correct |
| Hallucination Rate | Frequency of factually incorrect outputs |
| Grounding Accuracy | Correct attribution to source documents |
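
To make the difference between Task Success Rate and Step Success Rate concrete, here is a small sketch that computes both from the same traces. The `TaskTrace` structure and the sample data are assumptions for illustration; how each step and each final result gets judged correct is left to your own grader.

```python
from dataclasses import dataclass


@dataclass
class TaskTrace:
    step_ok: list[bool]  # per-step correctness, as judged by a grader
    task_ok: bool        # whether the end-to-end result was correct


def task_success_rate(traces):
    """Share of tasks whose final result was correct."""
    return sum(t.task_ok for t in traces) / len(traces)


def step_success_rate(traces):
    """Share of individual steps that were correct, pooled over all tasks."""
    total_steps = sum(len(t.step_ok) for t in traces)
    correct_steps = sum(sum(t.step_ok) for t in traces)
    return correct_steps / total_steps


# Hypothetical traces for four tasks.
traces = [
    TaskTrace([True, True, True], True),
    TaskTrace([True, False, True], False),
    TaskTrace([True, True], True),
    TaskTrace([False, True, True, True], True),  # recovered from a bad first step
]

tsr = task_success_rate(traces)
ssr = step_success_rate(traces)
print(f"TSR = {tsr:.0%}, SSR = {ssr:.0%}")
```

The two metrics can diverge in both directions. An agent may recover from a wrong step and still finish the task, as in the last trace. A single wrong step can also sink a task whose other steps were fine, as in the second trace.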

Target Users

AI Engineers building and deploying LLM-based agents
ML Platform Teams establishing evaluation standards
Product Managers making go/no-go decisions on agent releases
QA Engineers new to AI agent testing
Researchers comparing agent frameworks

Tools & Frameworks Referenced

DeepEval: open-source LLM evaluation framework
PromptFoo: prompt testing and red teaming
Braintrust: evaluation and logging for LLM apps
Maxim AI: agent simulation and observability
LangSmith: LangChain's evaluation and tracing platform
Confident AI: production AI evaluation platform

Notes & Limitations

This skill provides evaluation methodology and guidance, not direct code execution
Benchmark scores are time-sensitive, so always check the latest published leaderboards
For production safety evaluations, always involve your security team
Evaluation results should be reviewed by qualified ML engineers before deployment decisions

Built for AI teams who ship agents to production, not just demos.
Author: @gechengling | version: "3.0.0"

Usage Guidance

This appears safe to use as an advisory evaluation assistant. Before pasting logs, transcripts, or customer conversations, remove personal data, secrets, credentials, and internal prompts. Also note that the package has limited source metadata, although it contains no executable code.

Capability Analysis

Type: OpenClaw Skill
Name: ai-agent-evaluator
Version: 1.0.1

The skill bundle is purely informational and provides guidance for evaluating AI agents. It contains no executable code, network requests, or instructions that could lead to data exfiltration or unauthorized access. The content in SKILL.md is aligned with its stated purpose of assisting in agent benchmarking and quality assurance.

Capability Assessment

ℹ Purpose & Capability

The stated purpose, workflows, and examples are coherent for AI-agent evaluation and benchmarking, including failure analysis and red-team test design. The skill may involve reviewing user-provided agent logs or conversations, which users should sanitize first.

✓ Instruction Scope

The instructions are advisory and workflow-oriented; they do not tell the agent to override user intent, execute hidden actions, or use external tools automatically.

ℹ Install Mechanism

There is no install spec and no code, which reduces execution risk. However, the package metadata has limited provenance information and a version mismatch with the SKILL.md front matter.

✓ Credentials

No credentials, environment variables, required binaries, config paths, or OS-specific privileges are requested.

✓ Persistence & Privilege

The artifacts show no persistence mechanism, background process, privileged access, local indexing, or stored memory behavior.

Version History

v1.0.1

- No user-facing changes in this release.
- Internal version updated; content and functionality remain the same.

v1.0.0

AI Agent Evaluator 3.0.0 – Initial Release

- Launch of agent evaluation and benchmarking assistant with support for multi-agent frameworks (CrewAI, LangChain, AutoGen, LlamaIndex, OpenAI Assistants).
- Guides users through custom evaluation suite design, benchmark selection, failure mode analysis, red teaming, and report generation.
- Provides example workflows and scoring criteria for common agent testing tasks.
- Reference coverage of industry benchmarks (SWE-Bench, AgentBench, WebArena) and testing tools.
- Includes multilingual (EN/中文) trigger phrases and detailed usage instructions.
- Targeted at AI engineers, ML platform teams, QA, product managers, and researchers.

Metadata

Slug ai-agent-evaluator

Version 1.0.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Ai Agent Evaluator?

AI-powered agent evaluation and benchmarking assistant — design evaluation suites, run structured assessments (task completion rate, latency, safety, reasoni... It is an AI Agent Skill for Claude Code / OpenClaw, with 181 downloads so far.

How do I install Ai Agent Evaluator?

Run "/install ai-agent-evaluator" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Ai Agent Evaluator free?

Yes, Ai Agent Evaluator is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Ai Agent Evaluator support?

Ai Agent Evaluator is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Ai Agent Evaluator?

It is built and maintained by lingfeng-19 (@gechengling); the current version is v1.0.1.
