gechengling

Ai Agent Evaluator

Author: lingfeng-19 · GitHub · v1.0.1 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 181
Favorites: 0
Current installs: 0
Versions: 2
Install in OpenClaw
/install ai-agent-evaluator
Description
AI-powered agent evaluation and benchmarking assistant — design evaluation suites, run structured assessments (task completion rate, latency, safety, reasoni...
Usage instructions (SKILL.md)


AI Agent Evaluator

Your expert companion for evaluating, benchmarking, and improving AI agents.

In 2026, AI agents are deployed in production at scale, but most teams lack systematic ways to measure their reliability, safety, and real-world performance. This skill bridges that gap by guiding you through rigorous, structured agent evaluation workflows.

What This Skill Does

  • Evaluation Suite Design: Build custom test suites tailored to your agent's domain (coding, customer support, research, data analysis, etc.)
  • Benchmark Analysis: Interpret industry benchmarks (SWE-Bench, AgentBench, WebArena, BFCL, ToolBench) and map them to your use case
  • Multi-Framework Comparison: Compare CrewAI, LangChain, AutoGen, LlamaIndex, and OpenAI Assistants across cost, latency, and task success rate
  • Failure Mode Analysis: Systematically identify where and why your agent fails
  • Red Teaming Support: Design adversarial tests to probe agent safety and edge cases
  • Evaluation Report Generation: Produce structured reports with scores, recommendations, and an improvement roadmap

Trigger Phrases

English:

  • "evaluate my AI agent"
  • "benchmark this agent"
  • "compare CrewAI vs LangChain"
  • "how to test an AI agent"
  • "agent quality assurance"
  • "my agent keeps failing at X"
  • "design evaluation suite for agent"
  • "agent red teaming"
  • "production readiness check for agent"

Chinese / 中文:

  • AI Agent 评估 (AI agent evaluation)
  • 智能体基准测试 (agent benchmarking)
  • Agent 质量保障 (agent quality assurance)
  • 如何测试 AI Agent (how to test an AI agent)
  • 比较 CrewAI 和 LangChain (compare CrewAI and LangChain)
  • Agent 失败分析 (agent failure analysis)
  • 大模型 Agent 上线前检查 (pre-launch check for LLM agents)
  • 智能体对比测试 (agent comparison testing)
  • Agent 红队测试 (agent red teaming)

Core Workflows

Workflow 1: Quick Agent Health Check

Input: Agent description, task type, sample inputs/outputs
Steps:

  1. Classify your agent type (tool-calling, reasoning, multi-step, RAG-based)
  2. Define 5 critical success criteria for your domain
  3. Run a 10-question diagnostic on failure patterns
  4. Output a health score and the top 3 risks
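The skill does not say how the health score in step 4 is computed. As a minimal sketch, assuming the score is simply the share of diagnostic questions the agent passes and that risks are ranked by an assumed 1-to-5 severity, it might look like this. The names `DiagnosticItem`, `health_score`, and `top_risks` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticItem:
    question: str   # one diagnostic question (hypothetical structure)
    passed: bool    # True if the agent handled this case acceptably
    severity: int   # assumed scale: 1 (minor) to 5 (critical) if it fails


def health_score(items: list[DiagnosticItem]) -> float:
    """Share of diagnostic questions passed, on a 0-100 scale (assumed formula)."""
    if not items:
        raise ValueError("no diagnostic items")
    return 100.0 * sum(item.passed for item in items) / len(items)


def top_risks(items: list[DiagnosticItem], n: int = 3) -> list[DiagnosticItem]:
    """Failed items, most severe first, truncated to the top n."""
    failed = [item for item in items if not item.passed]
    return sorted(failed, key=lambda item: item.severity, reverse=True)[:n]
```

An equal-weight score is the simplest choice; weighting each question by severity would be a reasonable alternative if some failures matter far more than others.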

Workflow 2: Benchmark Selection & Interpretation

Input: Agent capabilities, deployment domain
Steps:

  1. Map domain → relevant benchmarks
  2. Explain benchmark methodology (what it tests, limitations)
  3. Show current SOTA scores and realistic targets
  4. Recommend evaluation cadence (dev/staging/production)
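Step 1 of this workflow can be pictured as a lookup table. The sketch below is illustrative only: the domain keys are assumptions, the groupings follow the one-line descriptions in the Key Concepts table, and ToolBench is placed under tool calling on the assumption that it targets tool use. `BENCHMARKS_BY_DOMAIN` and `relevant_benchmarks` are hypothetical names.

```python
# Assumed domain labels mapped to the benchmarks this skill mentions.
BENCHMARKS_BY_DOMAIN: dict[str, list[str]] = {
    "coding": ["SWE-Bench"],
    "web-automation": ["WebArena"],
    "tool-calling": ["BFCL", "ToolBench"],
    "general": ["AgentBench"],
}


def relevant_benchmarks(domains: list[str]) -> list[str]:
    """Benchmarks for the given domains, de-duplicated, in first-seen order.

    Unknown domains fall back to the multi-domain "general" entry.
    """
    seen: dict[str, None] = {}
    for domain in domains:
        fallback = BENCHMARKS_BY_DOMAIN["general"]
        for name in BENCHMARKS_BY_DOMAIN.get(domain.lower(), fallback):
            seen.setdefault(name, None)
    return list(seen)
```

For example, `relevant_benchmarks(["coding", "tool-calling"])` yields SWE-Bench, BFCL, and ToolBench in that order.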

Workflow 3: Custom Evaluation Suite Design

Input: Agent goal, available test data, budget/time
Steps:

  1. Define evaluation dimensions (accuracy, latency, safety, cost)
  2. Generate 20-50 representative test cases with ground truth
  3. Set pass/fail thresholds per dimension
  4. Recommend tooling (PromptFoo, Maxim AI, DeepEval, Braintrust)
  5. Provide scoring rubric + analysis template
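To make steps 1 to 3 concrete, here is a minimal harness sketch covering two of the dimensions (accuracy and latency). It does not use any of the tools named in step 4. The names `EvalCase` and `run_suite`, the exact-match scoring, and the threshold parameters are all assumptions for illustration, and the agent is modeled as a plain function from prompt to answer.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # ground-truth answer for this prompt


def run_suite(
    agent: Callable[[str], str],
    cases: list[EvalCase],
    min_accuracy: float,         # pass threshold, fraction in [0, 1]
    max_mean_latency_s: float,   # pass threshold, seconds per case
) -> dict:
    """Run every case once and compare results against per-dimension thresholds."""
    if not cases:
        raise ValueError("empty test suite")
    correct = 0
    total_latency = 0.0
    for case in cases:
        start = time.perf_counter()
        answer = agent(case.prompt)
        total_latency += time.perf_counter() - start
        # Assumed scoring: case-insensitive exact match after trimming whitespace.
        correct += answer.strip().lower() == case.expected.strip().lower()
    accuracy = correct / len(cases)
    mean_latency = total_latency / len(cases)
    return {
        "accuracy": accuracy,
        "mean_latency_s": mean_latency,
        "accuracy_pass": accuracy >= min_accuracy,
        "latency_pass": mean_latency <= max_mean_latency_s,
    }
```

Exact match is only suitable for short, closed-form answers; free-text outputs usually need a rubric or a model-based grader instead.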

Workflow 4: Failure Mode Deep Dive

Input: Agent logs, failed task transcripts
Steps:

  1. Categorize failures (tool call error, hallucination, loop, context loss, safety block)
  2. Calculate failure rate by category
  3. Perform root cause analysis for the top 3 failure patterns
  4. Propose actionable fixes: prompt adjustments, retrieval improvements, tool schema corrections
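Steps 2 and 3 reduce to counting once each failed run has been labeled. A minimal sketch, assuming every failure carries exactly one of the five category labels from step 1 and that rates are expressed as a fraction of all runs (not just failed ones). `failure_rates` and `top_patterns` are hypothetical names.

```python
from collections import Counter

# The five categories from step 1, as assumed label strings.
CATEGORIES = ("tool_call_error", "hallucination", "loop", "context_loss", "safety_block")


def failure_rates(failure_labels: list[str], total_runs: int) -> dict[str, float]:
    """Failure rate per category, as a fraction of all runs."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    unknown = set(failure_labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(failure_labels)
    return {category: counts[category] / total_runs for category in CATEGORIES}


def top_patterns(rates: dict[str, float], n: int = 3) -> list[str]:
    """Categories with a non-zero rate, highest first, truncated to n."""
    observed = [category for category in rates if rates[category] > 0]
    return sorted(observed, key=rates.get, reverse=True)[:n]
```

The categories returned by `top_patterns` are the ones to carry into root cause analysis.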

Workflow 5: Multi-Agent Framework Comparison

Input: Use case requirements (e.g., "code review pipeline with 3 agents")
Steps:

  1. Score CrewAI / LangChain / AutoGen / LlamaIndex on 8 dimensions
  2. Estimate cost per 1,000 runs
  3. Provide side-by-side architecture diagram (text)
  4. Final recommendation with rationale
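The arithmetic behind steps 1 and 2 can be sketched as a weighted average plus a token-based cost estimate. The skill does not list its 8 dimensions or any pricing, so the dimension names, weights, token counts, and prices are inputs you supply; nothing here encodes real framework scores. `weighted_score` and `cost_per_1000_runs` are hypothetical names.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores; weights need not sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


def cost_per_1000_runs(
    input_tokens_per_run: float,
    output_tokens_per_run: float,
    input_price_per_1m: float,   # currency units per 1,000,000 input tokens
    output_price_per_1m: float,  # currency units per 1,000,000 output tokens
) -> float:
    """Estimated model cost of 1,000 runs, assuming token-based pricing only."""
    per_run = (
        input_tokens_per_run * input_price_per_1m
        + output_tokens_per_run * output_price_per_1m
    ) / 1_000_000
    return 1000 * per_run
```

The cost estimate ignores infrastructure, retries, and tool-call fees, so treat it as a lower bound when comparing frameworks.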

Example Interactions

User: "I built a customer support agent with GPT-4o. How do I know if it's ready for production?"

Agent Evaluator:

Great question. Let's run a production readiness assessment across 5 dimensions:

  1. Task Completion Rate: target >92% for Tier-1 support queries
  2. Hallucination Rate: should be <2% on factual product questions
  3. Escalation Accuracy: correctly identifies when to route to human
  4. Latency P95: target <3s for chat use case
  5. Safety Compliance: passes 50-case adversarial test suite

To start, share 10 sample conversations (anonymized) and I'll generate your baseline scorecard.

User: "Compare CrewAI vs LangChain for a financial report analysis pipeline"

Agent Evaluator:
Here's a side-by-side for your use case...
[generates structured comparison table + recommendation]
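The three numeric targets in the first example above (task completion above 92%, hallucination below 2%, P95 latency below 3 s) can be checked mechanically. The sketch below assumes rates are given as fractions and computes P95 with the nearest-rank method, which is one of several percentile definitions. Escalation accuracy and safety compliance are left out because the example gives no numeric threshold for the former and the latter depends on your own adversarial suite. `p95` and `readiness` are hypothetical names.

```python
import math


def p95(latencies_s: list[float]) -> float:
    """95th percentile latency using the nearest-rank method."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def readiness(
    task_completion_rate: float,  # fraction in [0, 1]
    hallucination_rate: float,    # fraction in [0, 1]
    latencies_s: list[float],     # per-request latencies in seconds
) -> dict[str, bool]:
    """Compare measured values against the targets quoted in the example."""
    return {
        "task_completion": task_completion_rate > 0.92,
        "hallucination": hallucination_rate < 0.02,
        "latency_p95": p95(latencies_s) < 3.0,
    }
```

With only a handful of samples the P95 is dominated by a single slow request, so collect substantially more than the 10 sample conversations before trusting the latency check.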

Key Concepts Covered

| Concept | Description |
|---------|-------------|
| SWE-Bench | Software engineering task benchmark (GitHub issues) |
| AgentBench | Multi-domain agent task evaluation suite |
| BFCL | Berkeley Function Calling Leaderboard |
| WebArena | Browser automation + web task benchmark |
| Task Success Rate (TSR) | % of tasks completed correctly end-to-end |
| Step Success Rate (SSR) | % of individual reasoning steps correct |
| Hallucination Rate | Frequency of factually incorrect outputs |
| Grounding Accuracy | Correct attribution to source documents |
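TSR and SSR from the table above are easy to confuse, so a short sketch may help. It assumes each task has a single end-to-end outcome and a list of per-step outcomes, both already judged correct or incorrect by some grader. `task_success_rate` and `step_success_rate` are hypothetical names.

```python
def task_success_rate(task_outcomes: list[bool]) -> float:
    """TSR: percentage of tasks completed correctly end-to-end."""
    if not task_outcomes:
        raise ValueError("no tasks")
    return 100.0 * sum(task_outcomes) / len(task_outcomes)


def step_success_rate(steps_per_task: list[list[bool]]) -> float:
    """SSR: percentage of individual steps that were correct, pooled over all tasks."""
    steps = [ok for task in steps_per_task for ok in task]
    if not steps:
        raise ValueError("no steps")
    return 100.0 * sum(steps) / len(steps)
```

The two can diverge sharply: an agent that gets 7 of 9 steps right across three tasks but finishes only one task correctly has an SSR near 78% and a TSR near 33%, because a single bad step can sink a whole task.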

Target Users

  • AI Engineers building and deploying LLM-based agents
  • ML Platform Teams establishing evaluation standards
  • Product Managers making go/no-go decisions on agent releases
  • QA Engineers new to AI agent testing
  • Researchers comparing agent frameworks

Tools & Frameworks Referenced

  • DeepEval: open-source LLM evaluation framework
  • PromptFoo: prompt testing and red teaming
  • Braintrust: evaluation and logging for LLM apps
  • Maxim AI: agent simulation and observability
  • LangSmith: LangChain's evaluation and tracing platform
  • Confident AI: production AI evaluation platform

Notes & Limitations

  • This skill provides evaluation methodology and guidance, not direct code execution
  • Benchmark scores are time-sensitive; always check the latest published leaderboards
  • For production safety evaluations, always involve your security team
  • Evaluation results should be reviewed by qualified ML engineers before deployment decisions

Built for AI teams who ship agents to production, not just demos.
Author: @gechengling | version: "3.0.0"
Safe usage advice
This appears safe to use as an advisory evaluation assistant. Before pasting logs, transcripts, or customer conversations, remove personal data, secrets, credentials, and internal prompts. Also note that the package has limited source metadata, although it contains no executable code.
Functional analysis
Type: OpenClaw Skill
Name: ai-agent-evaluator
Version: 1.0.1

The skill bundle is purely informational and provides guidance for evaluating AI agents. It contains no executable code, network requests, or instructions that could lead to data exfiltration or unauthorized access. The content in SKILL.md is aligned with its stated purpose of assisting in agent benchmarking and quality assurance.
Capability assessment
Purpose & Capability
The stated purpose, workflows, and examples are coherent for AI-agent evaluation and benchmarking, including failure analysis and red-team test design. The skill may involve reviewing user-provided agent logs or conversations, which users should sanitize first.
Instruction Scope
The instructions are advisory and workflow-oriented; they do not tell the agent to override user intent, execute hidden actions, or use external tools automatically.
Install Mechanism
There is no install spec and no code, which reduces execution risk. However, the package metadata has limited provenance information and a version mismatch with the SKILL.md front matter.
Credentials
No credentials, environment variables, required binaries, config paths, or OS-specific privileges are requested.
Persistence & Privilege
The artifacts show no persistence mechanism, background process, privileged access, local indexing, or stored memory behavior.
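How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install ai-agent-evaluator
  3. Once installed, invoke the Skill by name or trigger it with /ai-agent-evaluator
  4. Provide the required inputs according to the Skill's parameter description to get structured output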
Version history
v1.0.1
  • No user-facing changes in this release.
  • Internal version updated; content and functionality remain the same.
v1.0.0
AI Agent Evaluator 3.0.0 – Initial Release
  • Launch of agent evaluation and benchmarking assistant with support for multi-agent frameworks (CrewAI, LangChain, AutoGen, LlamaIndex, OpenAI Assistants).
  • Guides users through custom evaluation suite design, benchmark selection, failure mode analysis, red teaming, and report generation.
  • Provides example workflows and scoring criteria for common agent testing tasks.
  • Reference coverage of industry benchmarks (SWE-Bench, AgentBench, WebArena) and testing tools.
  • Includes multilingual (EN/中文) trigger phrases and detailed usage instructions.
  • Targeted at AI engineers, ML platform teams, QA, product managers, and researchers.
Metadata
Slug: ai-agent-evaluator
Version: 1.0.1
License: MIT-0
Total installs: 0
Current installs: 0
Version count: 2
FAQ

What is Ai Agent Evaluator?

"AI-powered agent evaluation and benchmarking assistant: design evaluation suites, run structured assessments (task completion rate, latency, safety, reasoni..." It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 181 total downloads to date.

How do I install Ai Agent Evaluator?

Run the command "/install ai-agent-evaluator" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

Is Ai Agent Evaluator free?

Yes. Ai Agent Evaluator is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Ai Agent Evaluator support?

Ai Agent Evaluator is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Ai Agent Evaluator?

It is developed and maintained by lingfeng-19 (@gechengling); the current version is v1.0.1.
