← Back to Skills Marketplace
nissan

Reddi Agent Evaluation

by Nissan Dookeran · GitHub ↗ · v1.0.2 · MIT-0
cross-platform ✓ Security Clean
231
Downloads
0
Stars
0
Active Installs
2
Versions
Install in OpenClaw
/install reddi-agent-evaluation
Description
reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and produc...
README (SKILL.md)

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Requirements

  • testing-fundamentals
  • llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue Severity Solution
Agent scores well on benchmarks but fails in production high // Bridge benchmark and production evaluation
Same test passes sometimes, fails other times high // Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task medium // Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts critical // Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Usage Guidance
This skill appears coherent and low-risk: it only provides guidance and test cases for evaluating LLM agents and requires no secrets or installs. Before installing, consider: (1) confirm why python3 is declared — if you have no intent to run external Python scripts this requirement is unnecessary; (2) note that metadata permits outbound network calls (standard for calling an LLM API) — ensure your agent's configured LLM endpoints and keys are ones you trust; (3) because this is instruction-only, future versions could add code or env requirements — re-review on updates. If you plan to run any evaluation scripts referenced in your own workflows, run them in a controlled environment and audit any code they download.
Capability Analysis
Type: OpenClaw Skill Name: reddi-agent-evaluation Version: 1.0.2 The skill bundle is a documentation-based framework for evaluating LLM agents, focusing on behavioral testing and reliability metrics. It contains no executable code, suspicious network requests, or prompt-injection attacks, and its contents (SKILL.md, tests/cases.yaml) are entirely consistent with its stated purpose of agent benchmarking.
Capability Assessment
Purpose & Capability
Name/description, SKILL.md content, and included test cases all describe agent evaluation and benchmarking. The declared required binary (python3) is surprising because the skill is instruction-only and ships no runnable code; it may be harmless (a generic dependency hint) but is disproportionate to the provided files.
Instruction Scope
The instructions are focused on designing and running evaluation tests, statistical approaches, and anti-patterns. They do not instruct the agent to read arbitrary files, exfiltrate data, or call unexpected external endpoints. The metadata allows outbound network calls for LLM API usage, which matches the skill's purpose of scoring agents.
Install Mechanism
There is no install spec and no code files to download or execute. This is the lowest-risk model for an OpenClaw skill.
Credentials
The skill declares no required environment variables, no primary credential, and no config paths. That aligns with an instruction-only evaluation guide.
Persistence & Privilege
The skill is not force-included (always: false) and uses normal autonomous invocation semantics. It does not request persistent system-wide changes or other skills' credentials.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install reddi-agent-evaluation
  3. After installation, invoke the skill by name or use /reddi-agent-evaluation
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.2
- Minor SKILL.md formatting updates for metadata (improved YAML structure and readability). - Adjusted description line breaks for clarity. - No changes to skill logic or functionality.
v1.0.1
- Initial release of reddi.tech's fork of agent-evaluation. - Provides tools for testing and benchmarking LLM agents: behavioral regression, capability assessment, reliability metrics, and production monitoring. - Documents best practices and anti-patterns in agent evaluation. - Supports multiple evaluation patterns including statistical, adversarial, and contract testing. - Highlights risks like unreliable test results and data leakage. - Integrates well with other agent-centric skills (multi-agent orchestration, agent communication, autonomous agents).
Metadata
Slug reddi-agent-evaluation
Version 1.0.2
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 2
Frequently Asked Questions

What is Reddi Agent Evaluation?

reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and produc... It is an AI Agent Skill for Claude Code / OpenClaw, with 231 downloads so far.

How do I install Reddi Agent Evaluation?

Run "/install reddi-agent-evaluation" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Reddi Agent Evaluation free?

Yes, Reddi Agent Evaluation is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Reddi Agent Evaluation support?

Reddi Agent Evaluation is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Reddi Agent Evaluation?

It is built and maintained by Nissan Dookeran (@nissan); the current version is v1.0.2.

💬 Comments