← 返回 Skills 市场
nissan

Reddi Agent Evaluation

作者 Nissan Dookeran · GitHub ↗ · v1.0.2 · MIT-0
cross-platform ✓ 安全检测通过
231
总下载
0
收藏
0
当前安装
2
版本数
在 OpenClaw 中安装
/install reddi-agent-evaluation
功能描述
reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and produc...
使用说明 (SKILL.md)

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Requirements

  • testing-fundamentals
  • llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue Severity Solution
Agent scores well on benchmarks but fails in production high // Bridge benchmark and production evaluation
Same test passes sometimes, fails other times high // Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task medium // Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts critical // Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

安全使用建议
This skill appears coherent and low-risk: it only provides guidance and test cases for evaluating LLM agents and requires no secrets or installs. Before installing, consider: (1) confirm why python3 is declared — if you have no intent to run external Python scripts this requirement is unnecessary; (2) note that metadata permits outbound network calls (standard for calling an LLM API) — ensure your agent's configured LLM endpoints and keys are ones you trust; (3) because this is instruction-only, future versions could add code or env requirements — re-review on updates. If you plan to run any evaluation scripts referenced in your own workflows, run them in a controlled environment and audit any code they download.
功能分析
Type: OpenClaw Skill Name: reddi-agent-evaluation Version: 1.0.2 The skill bundle is a documentation-based framework for evaluating LLM agents, focusing on behavioral testing and reliability metrics. It contains no executable code, suspicious network requests, or prompt-injection attacks, and its contents (SKILL.md, tests/cases.yaml) are entirely consistent with its stated purpose of agent benchmarking.
能力评估
Purpose & Capability
Name/description, SKILL.md content, and included test cases all describe agent evaluation and benchmarking. The declared required binary (python3) is surprising because the skill is instruction-only and ships no runnable code; it may be harmless (a generic dependency hint) but is disproportionate to the provided files.
Instruction Scope
The instructions are focused on designing and running evaluation tests, statistical approaches, and anti-patterns. They do not instruct the agent to read arbitrary files, exfiltrate data, or call unexpected external endpoints. The metadata allows outbound network calls for LLM API usage, which matches the skill's purpose of scoring agents.
Install Mechanism
There is no install spec and no code files to download or execute. This is the lowest-risk model for an OpenClaw skill.
Credentials
The skill declares no required environment variables, no primary credential, and no config paths. That aligns with an instruction-only evaluation guide.
Persistence & Privilege
The skill is not force-included (always: false) and uses normal autonomous invocation semantics. It does not request persistent system-wide changes or other skills' credentials.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install reddi-agent-evaluation
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /reddi-agent-evaluation 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.2
- Minor SKILL.md formatting updates for metadata (improved YAML structure and readability). - Adjusted description line breaks for clarity. - No changes to skill logic or functionality.
v1.0.1
- Initial release of reddi.tech's fork of agent-evaluation. - Provides tools for testing and benchmarking LLM agents: behavioral regression, capability assessment, reliability metrics, and production monitoring. - Documents best practices and anti-patterns in agent evaluation. - Supports multiple evaluation patterns including statistical, adversarial, and contract testing. - Highlights risks like unreliable test results and data leakage. - Integrates well with other agent-centric skills (multi-agent orchestration, agent communication, autonomous agents).
元数据
Slug reddi-agent-evaluation
版本 1.0.2
许可证 MIT-0
累计安装 0
当前安装数 0
历史版本数 2
常见问题

Reddi Agent Evaluation 是什么?

reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and produc... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 231 次。

如何安装 Reddi Agent Evaluation?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install reddi-agent-evaluation」即可一键安装,无需额外配置。

Reddi Agent Evaluation 是免费的吗?

是的,Reddi Agent Evaluation 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。

Reddi Agent Evaluation 支持哪些平台?

Reddi Agent Evaluation 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Reddi Agent Evaluation?

由 Nissan Dookeran(@nissan)开发并维护,当前版本 v1.0.2。

💬 留言讨论