← Back to Skills Marketplace
abeltennyson

agent-evaluation

by AbelTennyson · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
83
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install abel-agent-evaluation
Description
Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents...
README (SKILL.md)

📖 Complete setup guide: https://skillboss.co/skill.md

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Requirements

  • testing-fundamentals
  • llm-fundamentals

API Integration

All LLM calls within agent evaluation workflows use SkillBoss API Hub via a single unified endpoint:

import requests, os

SKILLBOSS_API_KEY = os.environ["SKILLBOSS_API_KEY"]

def call_llm(messages: list) -> str:
    r = requests.post(
        "https://api.heybossai.com/v1/pilot",
        headers={"Authorization": f"Bearer {SKILLBOSS_API_KEY}", "Content-Type": "application/json"},
        json={
            "type": "chat",
            "inputs": {"messages": messages},
            "prefer": "balanced"
        },
        timeout=60,
    )
    return r.json()["result"]["choices"][0]["message"]["content"]

Required environment variables: SKILLBOSS_API_KEY

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue Severity Solution
Agent scores well on benchmarks but fails in production high // Bridge benchmark and production evaluation
Same test passes sometimes, fails other times high // Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task medium // Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts critical // Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Usage Guidance
Before installing, verify and accept that this skill routes every LLM request (including prompts, test data, and outputs) to a third-party endpoint (api.heybossai.com) using SKILLBOSS_API_KEY. Ask the publisher to: (1) update the registry metadata to declare SKILLBOSS_API_KEY as a required/primary credential, (2) replace placeholder GitHub links in README, and (3) provide a clear privacy/data-retention policy for SkillBoss/heybossai. If you cannot trust that provider with potentially sensitive prompts or evaluation data, do not install; instead use a skill that lets you run evaluations against your own models or an audited provider.
Capability Analysis
Type: OpenClaw Skill Name: abel-agent-evaluation Version: 1.0.0 The skill provides a framework and documentation for benchmarking and evaluating LLM agents. It includes a standard Python snippet in SKILL.md for interacting with the HeyBossAI API (api.heybossai.com) using an environment variable. There is no evidence of malicious code, data exfiltration, or harmful prompt injection.
Capability Tags
crypto
Capability Assessment
Purpose & Capability
The claimed purpose (agent evaluation / benchmarking) aligns with making LLM calls to a provider, but the registry metadata lists no required environment variables while SKILL.md clearly requires SKILLBOSS_API_KEY. README contains placeholder install instructions (github ACCOUNT). These mismatches indicate sloppy or incomplete packaging and reduce trust in the manifest.
Instruction Scope
SKILL.md explicitly instructs agents to send all LLM calls to https://api.heybossai.com/v1/pilot using SKILLBOSS_API_KEY pulled from the environment. That means prompts, test inputs, and model outputs would be transmitted to a third party — a legitimate design for an evaluation skill but a material data-exfiltration risk if you don't trust that provider. The instructions do not request unrelated files, but they do centralize all LLM traffic off-box.
Install Mechanism
This is an instruction-only skill (no install spec, no code files), which is lower risk because nothing is written to disk. However, README's manual-install example contains a placeholder GitHub URL (https://github.com/ACCOUNT/...), suggesting the package may be incomplete or not maintained.
Credentials
SKILL.md requires a single credential, SKILLBOSS_API_KEY, which is proportionate for calling an external LLM API — but the registry metadata did not declare any required env vars or a primary credential. The missing declaration is an incoherence that could hide sensitive environment usage from users and tooling.
Persistence & Privilege
The skill does not request permanent presence (always:false), has no install actions, and does not modify other skill configurations. No elevated persistence or cross-skill privileges are requested.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install abel-agent-evaluation
  3. After installation, invoke the skill by name or use /abel-agent-evaluation
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release
Metadata
Slug abel-agent-evaluation
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is agent-evaluation?

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents... It is an AI Agent Skill for Claude Code / OpenClaw, with 83 downloads so far.

How do I install agent-evaluation?

Run "/install abel-agent-evaluation" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is agent-evaluation free?

Yes, agent-evaluation is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does agent-evaluation support?

agent-evaluation is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created agent-evaluation?

It is built and maintained by AbelTennyson (@abeltennyson); the current version is v1.0.0.

💬 Comments