
Ai Agent Evaluator

Name: Ai Agent Evaluator
Author: gechengling

Author lingfeng-19 · GitHub ↗ · v1.0.1 · MIT-0

cross-platform ✓ Security check passed

181

Total downloads

Current installs

Version count

Install in OpenClaw

/install ai-agent-evaluator

Description

AI-powered agent evaluation and benchmarking assistant — design evaluation suites, run structured assessments (task completion rate, latency, safety, reasoni...

Usage Instructions (SKILL.md)

AI Agent Evaluator

Your expert companion for evaluating, benchmarking, and improving AI agents.

In 2026, AI agents are deployed in production at scale, but most teams lack systematic ways to measure their reliability, safety, and real-world performance. This skill bridges that gap by guiding you through rigorous, structured agent evaluation workflows.

What This Skill Does

Evaluation Suite Design: Build custom test suites tailored to your agent's domain (coding, customer support, research, data analysis, etc.)
Benchmark Analysis: Interpret industry benchmarks (SWE-Bench, AgentBench, WebArena, BFCL, ToolBench) and map them to your use case
Multi-Framework Comparison: Compare CrewAI, LangChain, AutoGen, LlamaIndex, and OpenAI Assistants across cost, latency, and task success rate
Failure Mode Analysis: Systematically identify where and why your agent fails
Red Teaming Support: Design adversarial tests to probe agent safety and edge cases
Evaluation Report Generation: Produce structured reports with scores, recommendations, and an improvement roadmap

Trigger Phrases

English:

"evaluate my AI agent"
"benchmark this agent"
"compare CrewAI vs LangChain"
"how to test an AI agent"
"agent quality assurance"
"my agent keeps failing at X"
"design evaluation suite for agent"
"agent red teaming"
"production readiness check for agent"

Chinese / 中文:

AI Agent 评估 (AI agent evaluation)
智能体基准测试 (agent benchmarking)
Agent 质量保障 (agent quality assurance)
如何测试 AI Agent (how to test an AI agent)
比较 CrewAI 和 LangChain (compare CrewAI and LangChain)
Agent 失败分析 (agent failure analysis)
大模型 Agent 上线前检查 (pre-launch check for an LLM agent)
智能体对比测试 (agent comparison testing)
Agent 红队测试 (agent red teaming)

Core Workflows

Workflow 1: Quick Agent Health Check

Input: Agent description, task type, sample inputs/outputs
Steps:

Classify your agent type (tool-calling, reasoning, multi-step, RAG-based)
Define 5 critical success criteria for your domain
Run a 10-question diagnostic on failure patterns
Output a health score and the top 3 risks

Workflow 2: Benchmark Selection & Interpretation

Input: Agent capabilities, deployment domain
Steps:

Map domain → relevant benchmarks
Explain benchmark methodology (what it tests, limitations)
Show current SOTA scores and realistic targets
Recommend evaluation cadence (dev/staging/production)

Workflow 3: Custom Evaluation Suite Design

Input: Agent goal, available test data, budget/time
Steps:

Define evaluation dimensions (accuracy, latency, safety, cost)
Generate 20-50 representative test cases with ground truth
Set pass/fail thresholds per dimension
Recommend tooling (PromptFoo, Maxim AI, DeepEval, Braintrust)
Provide a scoring rubric and analysis template
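As a rough illustration of the "pass/fail thresholds per dimension" step in Workflow 3, the sketch below checks one run's measurements against per-dimension limits. The skill itself does not execute code, so everything here is an assumption made for the example: the `Threshold` class, the `check_dimensions` helper, the dimension names, and all numeric limits are hypothetical and should be replaced with values that fit your domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    limit: float
    higher_is_better: bool  # True for rates such as accuracy, False for latency or cost


# Hypothetical thresholds for illustration only; tune them to your own domain.
THRESHOLDS = {
    "accuracy": Threshold(limit=0.90, higher_is_better=True),
    "latency_p95_s": Threshold(limit=3.0, higher_is_better=False),
    "safety_pass_rate": Threshold(limit=0.98, higher_is_better=True),
    "cost_per_run_usd": Threshold(limit=0.05, higher_is_better=False),
}


def check_dimensions(measured, thresholds=THRESHOLDS):
    """Return {dimension: passed} for every dimension that has a threshold."""
    verdict = {}
    for name, threshold in thresholds.items():
        value = measured[name]  # raises KeyError if a dimension was not measured
        if threshold.higher_is_better:
            verdict[name] = value >= threshold.limit
        else:
            verdict[name] = value <= threshold.limit
    return verdict


# Made-up measurements from a single evaluation run.
measured = {
    "accuracy": 0.93,
    "latency_p95_s": 3.4,
    "safety_pass_rate": 0.99,
    "cost_per_run_usd": 0.04,
}
verdict = check_dimensions(measured)
failed = [name for name, passed in verdict.items() if not passed]
print(failed)  # ['latency_p95_s']
```

Keeping the direction (`higher_is_better`) next to each limit avoids the common mistake of applying `>=` to a latency or cost metric.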

Workflow 4: Failure Mode Deep Dive

Input: Agent logs, failed task transcripts
Steps:

Categorize failures (tool call error, hallucination, loop, context loss, safety block)
Calculate failure rate by category
Root cause analysis for the top 3 failure patterns
Actionable fixes: prompt adjustments, retrieval improvements, tool schema corrections
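The "failure rate by category" and "top 3 failure patterns" steps of Workflow 4 can be sketched as below. This is a minimal example under stated assumptions: each failed run has already been labeled by hand with one of the five categories named above, and the identifiers `CATEGORIES`, `failure_rates`, and `top_patterns`, along with the sample labels, are hypothetical.

```python
from collections import Counter

# Category names mirror the Workflow 4 list; the snake_case spelling is an assumption.
CATEGORIES = (
    "tool_call_error",
    "hallucination",
    "loop",
    "context_loss",
    "safety_block",
)


def failure_rates(labels, total_runs):
    """Failure rate per category, as a fraction of all runs (failed or not)."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown failure categories: {sorted(unknown)}")
    return {category: counts[category] / total_runs for category in CATEGORIES}


def top_patterns(rates, n=3):
    """The n categories with the highest failure rate, highest first."""
    return sorted(rates, key=rates.get, reverse=True)[:n]


# Made-up labels for 10 failed runs out of 50 total runs.
labels = ["hallucination"] * 4 + ["tool_call_error"] * 3 + ["loop"] * 2 + ["context_loss"]
rates = failure_rates(labels, total_runs=50)
print(top_patterns(rates))  # ['hallucination', 'tool_call_error', 'loop']
```

Dividing by total runs rather than by failed runs keeps the rates comparable across evaluation rounds whose overall failure counts differ.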

Workflow 5: Multi-Agent Framework Comparison

Input: Use case requirements (e.g., "code review pipeline with 3 agents")
Steps:

Score CrewAI / LangChain / AutoGen / LlamaIndex on 8 dimensions
Estimate cost per 1,000 runs
Provide a side-by-side architecture diagram (text)
Final recommendation with rationale
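For the "estimate cost per 1,000 runs" step of Workflow 5, a back-of-the-envelope calculation might look like the sketch below. It assumes cost is dominated by LLM token usage and that every call in a run has a similar token profile. The function name, parameters, and all prices and token counts are hypothetical placeholders, not figures for any real model or framework.

```python
def cost_per_1000_runs(
    llm_calls_per_run,
    input_tokens_per_call,
    output_tokens_per_call,
    usd_per_million_input,
    usd_per_million_output,
):
    """Estimated LLM spend in USD for 1,000 pipeline runs (token cost only)."""
    cost_per_call = (
        input_tokens_per_call * usd_per_million_input
        + output_tokens_per_call * usd_per_million_output
    ) / 1_000_000
    return 1000 * llm_calls_per_run * cost_per_call


# Hypothetical profile: a 3-agent pipeline where each agent makes 2 LLM calls per run.
estimate = cost_per_1000_runs(
    llm_calls_per_run=6,
    input_tokens_per_call=2000,
    output_tokens_per_call=500,
    usd_per_million_input=2.50,
    usd_per_million_output=10.00,
)
print(f"${estimate:.2f} per 1,000 runs")  # $60.00 per 1,000 runs
```

Frameworks differ mainly in how many calls and how much prompt overhead they add per run, so measuring `llm_calls_per_run` and the token counts from real traces matters more than the formula itself.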

Example Interactions

User: "I built a customer support agent with GPT-4o. How do I know if it's ready for production?"

Agent Evaluator:

Great question. Let's run a production readiness assessment across 5 dimensions:

Task Completion Rate: target >92% for Tier-1 support queries

Hallucination Rate: should be <2% on factual product questions

Escalation Accuracy: correctly identifies when to route to a human

Latency P95: target <3 s for the chat use case

Safety Compliance: passes a 50-case adversarial test suite

To start, share 10 sample conversations (anonymized) and I'll generate your baseline scorecard.

User: "Compare CrewAI vs LangChain for a financial report analysis pipeline"

Agent Evaluator:

Here's a side-by-side for your use case...
[generates structured comparison table + recommendation]
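The "Latency P95" target in the first example above refers to the 95th-percentile response time. One simple way to compute it from logged latencies is the nearest-rank method, sketched here. The helper name `percentile_nearest_rank` and the sample latencies are made up for illustration; other percentile definitions, such as interpolated ones, can give slightly different values on small samples.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it.

    pct must be an integer from 1 to 100 so that the rank is computed exactly.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


# Made-up response times in seconds for 20 chat turns.
latencies_s = [
    1.2, 0.9, 1.5, 2.8, 1.1, 1.0, 3.6, 1.3, 1.4, 0.8,
    1.7, 2.1, 1.9, 1.6, 2.4, 1.0, 1.2, 2.0, 4.2, 1.8,
]
p95 = percentile_nearest_rank(latencies_s, 95)
print(p95, "meets target" if p95 < 3.0 else "misses the 3 s target")  # 3.6 misses the 3 s target
```

In this sample most turns finish well under 3 s, yet the P95 still misses the target because of two slow responses, which is the kind of tail behavior an average would hide.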

Key Concepts Covered

| Concept | Description |
|---------|-------------|
| SWE-Bench | Software engineering task benchmark (GitHub issues) |
| AgentBench | Multi-domain agent task evaluation suite |
| BFCL | Berkeley Function Calling Leaderboard |
| WebArena | Browser automation + web task benchmark |
| Task Success Rate (TSR) | % of tasks completed correctly end-to-end |
| Step Success Rate (SSR) | % of individual reasoning steps correct |
| Hallucination Rate | Frequency of factually incorrect outputs |
| Grounding Accuracy | Correct attribution to source documents |
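To make the difference between Task Success Rate and Step Success Rate from the table above concrete, here is a small sketch that computes both from the same set of runs. The `Run` record and the sample data are hypothetical, and the sketch assumes each run has been graded both end-to-end and per step.

```python
from dataclasses import dataclass


@dataclass
class Run:
    task_ok: bool        # was the task completed correctly end-to-end?
    steps_ok: list       # one bool per individual step in the run


def task_success_rate(runs):
    """Fraction of runs whose end-to-end result was correct."""
    return sum(run.task_ok for run in runs) / len(runs)


def step_success_rate(runs):
    """Fraction of all individual steps, pooled across runs, that were correct."""
    steps = [ok for run in runs for ok in run.steps_ok]
    return sum(steps) / len(steps)


# Made-up graded runs.
runs = [
    Run(task_ok=True, steps_ok=[True, True, True]),
    Run(task_ok=False, steps_ok=[True, False, True]),
    Run(task_ok=True, steps_ok=[True, True]),
    Run(task_ok=False, steps_ok=[True, True, False, False]),
]
print(task_success_rate(runs), step_success_rate(runs))  # 0.5 0.75
```

SSR is typically higher than TSR because a single wrong step can sink an otherwise correct run, so a large gap between the two points to fragile multi-step chains rather than weak individual steps.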

Target Users

AI Engineers building and deploying LLM-based agents
ML Platform Teams establishing evaluation standards
Product Managers making go/no-go decisions on agent releases
QA Engineers new to AI agent testing
Researchers comparing agent frameworks

Tools & Frameworks Referenced

DeepEval: open-source LLM evaluation framework
PromptFoo: prompt testing and red teaming
Braintrust: evaluation and logging for LLM apps
Maxim AI: agent simulation and observability
LangSmith: LangChain's evaluation and tracing platform
Confident AI: production AI evaluation platform

Notes & Limitations

This skill provides evaluation methodology and guidance, not direct code execution
Benchmark scores are time-sensitive; always check the latest published leaderboards
For production safety evaluations, always involve your security team
Evaluation results should be reviewed by qualified ML engineers before deployment decisions

Built for AI teams who ship agents to production, not just demos.
Author: @gechengling | version: "3.0.0"

Safe Usage Advice

This appears safe to use as an advisory evaluation assistant. Before pasting logs, transcripts, or customer conversations, remove personal data, secrets, credentials, and internal prompts. Also note that the package has limited source metadata, although it contains no executable code.

Functional Analysis

Type: OpenClaw Skill
Name: ai-agent-evaluator
Version: 1.0.1

The skill bundle is purely informational and provides guidance for evaluating AI agents. It contains no executable code, network requests, or instructions that could lead to data exfiltration or unauthorized access. The content in SKILL.md is aligned with its stated purpose of assisting in agent benchmarking and quality assurance.

Capability Assessment

ℹ Purpose & Capability

The stated purpose, workflows, and examples are coherent for AI-agent evaluation and benchmarking, including failure analysis and red-team test design. The skill may involve reviewing user-provided agent logs or conversations, which users should sanitize first.

✓ Instruction Scope

The instructions are advisory and workflow-oriented; they do not tell the agent to override user intent, execute hidden actions, or use external tools automatically.

ℹ Install Mechanism

There is no install spec and no code, which reduces execution risk. However, the package metadata has limited provenance information and a version mismatch with the SKILL.md front matter.

✓ Credentials

No credentials, environment variables, required binaries, config paths, or OS-specific privileges are requested.

✓ Persistence & Privilege

The artifacts show no persistence mechanism, background process, privileged access, local indexing, or stored memory behavior.

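How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install ai-agent-evaluator
Once installation completes, invoke the Skill by name or trigger it with /ai-agent-evaluator
Provide the required inputs described in the Skill's parameter notes to get structured output

Version History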

v1.0.1

- No user-facing changes in this release.
- Internal version updated; content and functionality remain the same.

v1.0.0

AI Agent Evaluator 3.0.0 – Initial Release
- Launch of agent evaluation and benchmarking assistant with support for multi-agent frameworks (CrewAI, LangChain, AutoGen, LlamaIndex, OpenAI Assistants).
- Guides users through custom evaluation suite design, benchmark selection, failure mode analysis, red teaming, and report generation.
- Provides example workflows and scoring criteria for common agent testing tasks.
- Reference coverage of industry benchmarks (SWE-Bench, AgentBench, WebArena) and testing tools.
- Includes multilingual (EN/中文) trigger phrases and detailed usage instructions.
- Targeted at AI engineers, ML platform teams, QA, product managers, and researchers.

Metadata

Slug ai-agent-evaluator

Version 1.0.1

License MIT-0

Total installs 0

Current installs 0

Version count 2

FAQ