← Back to Skills Marketplace

agent-evaluation

Name: agent-evaluation
Author: abeltennyson

by AbelTennyson · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ⚠ suspicious

Downloads

Stars

Active Installs

Versions

Install in OpenClaw

/install abel-agent-evaluation

Description

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents...

README (SKILL.md)

📖 Complete setup guide: https://skillboss.co/skill.md

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

API Integration

All LLM calls within agent evaluation workflows use SkillBoss API Hub via a single unified endpoint:

import requests, os

SKILLBOSS_API_KEY = os.environ["SKILLBOSS_API_KEY"]

def call_llm(messages: list) -> str:
    r = requests.post(
        "https://api.heybossai.com/v1/pilot",
        headers={"Authorization": f"Bearer {SKILLBOSS_API_KEY}", "Content-Type": "application/json"},
        json={
            "type": "chat",
            "inputs": {"messages": messages},
            "prefer": "balanced"
        },
        timeout=60,
    )
    return r.json()["result"]["choices"][0]["message"]["content"]

Required environment variables: SKILLBOSS_API_KEY

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue	Severity	Solution
Agent scores well on benchmarks but fails in production	high	// Bridge benchmark and production evaluation
Same test passes sometimes, fails other times	high	// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task	medium	// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts	critical	// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Usage Guidance

Before installing, verify and accept that this skill routes every LLM request (including prompts, test data, and outputs) to a third-party endpoint (api.heybossai.com) using SKILLBOSS_API_KEY. Ask the publisher to: (1) update the registry metadata to declare SKILLBOSS_API_KEY as a required/primary credential, (2) replace placeholder GitHub links in README, and (3) provide a clear privacy/data-retention policy for SkillBoss/heybossai. If you cannot trust that provider with potentially sensitive prompts or evaluation data, do not install; instead use a skill that lets you run evaluations against your own models or an audited provider.

Capability Analysis

Type: OpenClaw Skill Name: abel-agent-evaluation Version: 1.0.0 The skill provides a framework and documentation for benchmarking and evaluating LLM agents. It includes a standard Python snippet in SKILL.md for interacting with the HeyBossAI API (api.heybossai.com) using an environment variable. There is no evidence of malicious code, data exfiltration, or harmful prompt injection.

Capability Tags

crypto

Capability Assessment

⚠ Purpose & Capability

The claimed purpose (agent evaluation / benchmarking) aligns with making LLM calls to a provider, but the registry metadata lists no required environment variables while SKILL.md clearly requires SKILLBOSS_API_KEY. README contains placeholder install instructions (github ACCOUNT). These mismatches indicate sloppy or incomplete packaging and reduce trust in the manifest.

⚠ Instruction Scope

SKILL.md explicitly instructs agents to send all LLM calls to https://api.heybossai.com/v1/pilot using SKILLBOSS_API_KEY pulled from the environment. That means prompts, test inputs, and model outputs would be transmitted to a third party — a legitimate design for an evaluation skill but a material data-exfiltration risk if you don't trust that provider. The instructions do not request unrelated files, but they do centralize all LLM traffic off-box.

ℹ Install Mechanism

This is an instruction-only skill (no install spec, no code files), which is lower risk because nothing is written to disk. However, README's manual-install example contains a placeholder GitHub URL (https://github.com/ACCOUNT/...), suggesting the package may be incomplete or not maintained.

⚠ Credentials

SKILL.md requires a single credential, SKILLBOSS_API_KEY, which is proportionate for calling an external LLM API — but the registry metadata did not declare any required env vars or a primary credential. The missing declaration is an incoherence that could hide sensitive environment usage from users and tooling.

✓ Persistence & Privilege

The skill does not request permanent presence (always:false), has no install actions, and does not modify other skill configurations. No elevated persistence or cross-skill privileges are requested.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install abel-agent-evaluation
After installation, invoke the skill by name or use /abel-agent-evaluation
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release

Metadata

Slug abel-agent-evaluation

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is agent-evaluation?

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents... It is an AI Agent Skill for Claude Code / OpenClaw, with 83 downloads so far.

How do I install agent-evaluation?

Run "/install abel-agent-evaluation" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is agent-evaluation free?

Yes, agent-evaluation is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does agent-evaluation support?

agent-evaluation is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created agent-evaluation?

It is built and maintained by AbelTennyson (@abeltennyson); the current version is v1.0.0.

More Skills