← 返回 Skills 市场

Reddi Agent Evaluation

Name: Reddi Agent Evaluation
Author: nissan

作者 Nissan Dookeran · GitHub ↗ · v1.0.2 · MIT-0

cross-platform ✓ 安全检测通过

231

总下载

当前安装

版本数

在 OpenClaw 中安装

/install reddi-agent-evaluation

功能描述

reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and produc...

使用说明 (SKILL.md)

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue	Severity	Solution
Agent scores well on benchmarks but fails in production	high	// Bridge benchmark and production evaluation
Same test passes sometimes, fails other times	high	// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task	medium	// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts	critical	// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

安全使用建议

This skill appears coherent and low-risk: it only provides guidance and test cases for evaluating LLM agents and requires no secrets or installs. Before installing, consider: (1) confirm why python3 is declared — if you have no intent to run external Python scripts this requirement is unnecessary; (2) note that metadata permits outbound network calls (standard for calling an LLM API) — ensure your agent's configured LLM endpoints and keys are ones you trust; (3) because this is instruction-only, future versions could add code or env requirements — re-review on updates. If you plan to run any evaluation scripts referenced in your own workflows, run them in a controlled environment and audit any code they download.

功能分析

Type: OpenClaw Skill Name: reddi-agent-evaluation Version: 1.0.2 The skill bundle is a documentation-based framework for evaluating LLM agents, focusing on behavioral testing and reliability metrics. It contains no executable code, suspicious network requests, or prompt-injection attacks, and its contents (SKILL.md, tests/cases.yaml) are entirely consistent with its stated purpose of agent benchmarking.

能力评估

ℹ Purpose & Capability

Name/description, SKILL.md content, and included test cases all describe agent evaluation and benchmarking. The declared required binary (python3) is surprising because the skill is instruction-only and ships no runnable code; it may be harmless (a generic dependency hint) but is disproportionate to the provided files.

✓ Instruction Scope

The instructions are focused on designing and running evaluation tests, statistical approaches, and anti-patterns. They do not instruct the agent to read arbitrary files, exfiltrate data, or call unexpected external endpoints. The metadata allows outbound network calls for LLM API usage, which matches the skill's purpose of scoring agents.

✓ Install Mechanism

There is no install spec and no code files to download or execute. This is the lowest-risk model for an OpenClaw skill.

✓ Credentials

The skill declares no required environment variables, no primary credential, and no config paths. That aligns with an instruction-only evaluation guide.

✓ Persistence & Privilege

The skill is not force-included (always: false) and uses normal autonomous invocation semantics. It does not request persistent system-wide changes or other skills' credentials.

如何使用

确保已安装 OpenClaw（本地或 Docker 部署）
在对话框中输入安装命令：/install reddi-agent-evaluation
安装完成后，直接呼叫该 Skill 的名称或使用 /reddi-agent-evaluation 触发
根据 Skill 的参数说明提供必要输入，即可获得结构化输出

版本历史

v1.0.2

- Minor SKILL.md formatting updates for metadata (improved YAML structure and readability). - Adjusted description line breaks for clarity. - No changes to skill logic or functionality.

v1.0.1

- Initial release of reddi.tech's fork of agent-evaluation. - Provides tools for testing and benchmarking LLM agents: behavioral regression, capability assessment, reliability metrics, and production monitoring. - Documents best practices and anti-patterns in agent evaluation. - Supports multiple evaluation patterns including statistical, adversarial, and contract testing. - Highlights risks like unreliable test results and data leakage. - Integrates well with other agent-centric skills (multi-agent orchestration, agent communication, autonomous agents).

元数据

Slug reddi-agent-evaluation

版本 1.0.2

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 2

常见问题