← 返回 Skills 市场

agent-evaluation

Name: agent-evaluation
Author: abeltennyson

作者 AbelTennyson · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ⚠ suspicious

总下载

当前安装

版本数

在 OpenClaw 中安装

/install abel-agent-evaluation

功能描述

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents...

使用说明 (SKILL.md)

📖 Complete setup guide: https://skillboss.co/skill.md

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

API Integration

All LLM calls within agent evaluation workflows use SkillBoss API Hub via a single unified endpoint:

import requests, os

SKILLBOSS_API_KEY = os.environ["SKILLBOSS_API_KEY"]

def call_llm(messages: list) -> str:
    r = requests.post(
        "https://api.heybossai.com/v1/pilot",
        headers={"Authorization": f"Bearer {SKILLBOSS_API_KEY}", "Content-Type": "application/json"},
        json={
            "type": "chat",
            "inputs": {"messages": messages},
            "prefer": "balanced"
        },
        timeout=60,
    )
    return r.json()["result"]["choices"][0]["message"]["content"]

Required environment variables: SKILLBOSS_API_KEY

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue	Severity	Solution
Agent scores well on benchmarks but fails in production	high	// Bridge benchmark and production evaluation
Same test passes sometimes, fails other times	high	// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task	medium	// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts	critical	// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

安全使用建议

Before installing, verify and accept that this skill routes every LLM request (including prompts, test data, and outputs) to a third-party endpoint (api.heybossai.com) using SKILLBOSS_API_KEY. Ask the publisher to: (1) update the registry metadata to declare SKILLBOSS_API_KEY as a required/primary credential, (2) replace placeholder GitHub links in README, and (3) provide a clear privacy/data-retention policy for SkillBoss/heybossai. If you cannot trust that provider with potentially sensitive prompts or evaluation data, do not install; instead use a skill that lets you run evaluations against your own models or an audited provider.

功能分析

Type: OpenClaw Skill Name: abel-agent-evaluation Version: 1.0.0 The skill provides a framework and documentation for benchmarking and evaluating LLM agents. It includes a standard Python snippet in SKILL.md for interacting with the HeyBossAI API (api.heybossai.com) using an environment variable. There is no evidence of malicious code, data exfiltration, or harmful prompt injection.

能力标签

crypto

能力评估

⚠ Purpose & Capability

The claimed purpose (agent evaluation / benchmarking) aligns with making LLM calls to a provider, but the registry metadata lists no required environment variables while SKILL.md clearly requires SKILLBOSS_API_KEY. README contains placeholder install instructions (github ACCOUNT). These mismatches indicate sloppy or incomplete packaging and reduce trust in the manifest.

⚠ Instruction Scope

SKILL.md explicitly instructs agents to send all LLM calls to https://api.heybossai.com/v1/pilot using SKILLBOSS_API_KEY pulled from the environment. That means prompts, test inputs, and model outputs would be transmitted to a third party — a legitimate design for an evaluation skill but a material data-exfiltration risk if you don't trust that provider. The instructions do not request unrelated files, but they do centralize all LLM traffic off-box.

ℹ Install Mechanism

This is an instruction-only skill (no install spec, no code files), which is lower risk because nothing is written to disk. However, README's manual-install example contains a placeholder GitHub URL (https://github.com/ACCOUNT/...), suggesting the package may be incomplete or not maintained.

⚠ Credentials

SKILL.md requires a single credential, SKILLBOSS_API_KEY, which is proportionate for calling an external LLM API — but the registry metadata did not declare any required env vars or a primary credential. The missing declaration is an incoherence that could hide sensitive environment usage from users and tooling.

✓ Persistence & Privilege

The skill does not request permanent presence (always:false), has no install actions, and does not modify other skill configurations. No elevated persistence or cross-skill privileges are requested.

如何使用

确保已安装 OpenClaw（本地或 Docker 部署）
在对话框中输入安装命令：/install abel-agent-evaluation
安装完成后，直接呼叫该 Skill 的名称或使用 /abel-agent-evaluation 触发
根据 Skill 的参数说明提供必要输入，即可获得结构化输出

版本历史

v1.0.0

Initial release

元数据

Slug abel-agent-evaluation

版本 1.0.0

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 1

常见问题