Agent Cost Eval Kit

Name: Agent Cost Eval Kit
Author: choosenobody

Version v1.0.1 · License MIT-0

cross-platform ✓ Security check passed

Install in OpenClaw

/install agent-cost-eval-kit

Description

Evaluate whether an agent cost-control change actually reduced waste without obvious quality or reliability regressions. After an audit or manual change, hel...

Usage (SKILL.md)

Agent Cost Eval Kit

Evaluate whether a cost-control change should be kept, reverted, narrowed, or tested further.

This skill does not find waste. It helps you judge whether a change you already made or are considering is working out.

When To Use

Use this after you have already:

run waste-audit or agent-routing-waste-audit and applied a change, OR
manually reduced retries, changed a fallback chain, switched a model tier, changed a sub-agent assignment, rescheduled a recurring job, or narrowed a routing path

Now you want to know: is the change actually working, or is it creating new problems?

Install

Workspace install:

openclaw skills install agent-cost-eval-kit

Install for all local agents:

openclaw skills install agent-cost-eval-kit --global

To force-update an existing install:

openclaw skills install agent-cost-eval-kit --global --force

Activation

Primary activation phrase:

eval agent cost change

Acceptable examples:

eval agent cost change from this before/after run
eval agent cost change after reducing retries
eval agent cost change after changing fallback policy
eval agent cost change after switching model tier
eval agent cost change after narrowing local/cloud routing

Required Input

At minimum, provide:

Change made: What you changed
Before: Short summary of before behavior
After: Short summary of after behavior
Observed result: What you noticed

Better inputs — include if available:

Task type: What kind of task was being run
Risk class: Low / Medium / High / Blocked (see Risk Class section)
Cost / token / latency data: Any numbers you collected
Quality or reliability issue observed: Did anything get worse
Human notes: Your own assessment

Do not invent data. If you do not have it, say so.

If before/after evidence is missing or too thin, this skill will respond with one of the following instead of a keep/revert verdict:

Needs More Samples

Unsafe to Judge

Input examples: See references/before-after-examples.md for concrete before/after templates you can copy, modify, and paste.
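
As a rough illustration of the input lists above, the following Python sketch models the four required fields and the optional ones as a record, and reports which required fields are still empty. The `ChangeReport` class, its field names, and `missing_required` are hypothetical helpers for organizing your own notes; the skill itself takes free-form text and defines no such schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical record mirroring the "Required Input" and "Better inputs" lists.
# Field names are illustrative assumptions, not part of the skill.
@dataclass
class ChangeReport:
    change_made: str = ""
    before: str = ""
    after: str = ""
    observed_result: str = ""
    task_type: Optional[str] = None
    risk_class: Optional[str] = None  # "Low", "Medium", "High", or "Blocked"
    metrics: dict = field(default_factory=dict)  # any cost/token/latency numbers
    quality_issue: Optional[str] = None
    human_notes: Optional[str] = None


REQUIRED = ("change_made", "before", "after", "observed_result")


def missing_required(report: ChangeReport) -> List[str]:
    """Return the names of required fields that are empty or whitespace-only."""
    return [name for name in REQUIRED if not getattr(report, name).strip()]
```

Leaving an optional field as `None` rather than guessing a value is in line with the "do not invent data" rule.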

What You Will Get

1. Decision

One of:

Keep Change — evidence supports keeping the change
Revert Change — evidence suggests the change caused problems
Narrow Change — keep the change but limit it to lower-risk tasks
Needs More Samples — not enough evidence to decide
Unsafe to Judge — high-risk change with insufficient evidence to evaluate safely
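
One possible reading of how the five decisions relate to each other is sketched below in Python. The ordering of the checks, the `decide` function, and its parameters are assumptions made for illustration; the skill weighs the evidence in prose and does not publish a fixed rule. Note that for High or Blocked work, a "Keep Change" result still requires human review under the Risk Class section.

```python
def decide(evidence_level: int, risk_class: str,
           regression_observed: bool,
           regression_limited_to_some_tasks: bool = False) -> str:
    """Map a few coarse signals to one of the five decision labels.

    Assumed rule order (not taken from the skill):
    thin evidence first, then regressions, then keep.
    """
    high_stakes = risk_class in ("High", "Blocked")

    # Level 1 evidence is anecdotal or a single sample: too thin to decide.
    if evidence_level <= 1:
        return "Unsafe to Judge" if high_stakes else "Needs More Samples"

    # A regression either scopes the change down or rolls it back.
    if regression_observed:
        if regression_limited_to_some_tasks:
            return "Narrow Change"
        return "Revert Change"

    return "Keep Change"
```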

2. Evidence Level

Level 1: Anecdotal or single sample
Level 2: Small before/after sample
Level 3: Repeated samples with cost, latency, success, and quality notes
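
The three levels can be approximated from sample counts and from which kinds of notes were collected. The sketch below is a minimal Python illustration; the `evidence_level` function and the `repeated_threshold` default of 5 are assumptions, since the skill does not state how many samples count as "repeated".

```python
def evidence_level(before_samples: int, after_samples: int,
                   has_cost: bool, has_latency: bool,
                   has_success: bool, has_quality_notes: bool,
                   repeated_threshold: int = 5) -> int:
    """Return 1, 2, or 3 following the Evidence Level descriptions.

    repeated_threshold is an assumed cutoff for "repeated samples".
    """
    smaller = min(before_samples, after_samples)

    # Level 1: anecdotal, or at most a single sample on one side.
    if smaller <= 1:
        return 1

    # Level 3 needs repeated samples plus all four kinds of notes.
    complete = has_cost and has_latency and has_success and has_quality_notes
    if smaller >= repeated_threshold and complete:
        return 3

    # Everything in between is a small before/after sample.
    return 2
```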

3. Before / After Summary

Compact structured summary covering:

route / model / retry / fallback if available
token use
estimated cost
latency
success / failure
quality notes
reliability notes

4. Cost Result

Separately report:

token change
estimated cost change
latency change
recurring impact if relevant
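
The changes listed above are simple before/after arithmetic. The Python sketch below shows one way to compute them from numbers you collected yourself; the function names and the metric keys `tokens`, `cost_usd`, and `latency_s` are assumptions for illustration, not a format the skill requires.

```python
from typing import Dict, Optional


def percent_change(before: float, after: float) -> Optional[float]:
    """Relative change from before to after, in percent; None if before is 0."""
    if before == 0:
        return None
    return (after - before) / before * 100.0


def cost_result(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Percent change for each metric present in both measurements."""
    keys = ("tokens", "cost_usd", "latency_s")  # assumed metric names
    return {k: percent_change(before[k], after[k])
            for k in keys if k in before and k in after}


def recurring_saving(cost_before: float, cost_after: float, runs_per_period: int) -> float:
    """Estimated saving per period for a recurring job (positive means cheaper)."""
    return (cost_before - cost_after) * runs_per_period
```

A negative percentage means the metric went down. Metrics missing from either side are skipped rather than estimated.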

5. Quality and Reliability Result

Check for:

obvious quality loss
more failures
more retries
worse fallback behavior
incomplete outputs
missing safety checks
human reviewer concerns

6. Risk Class

Assign one of:

Low — simple intelligence tasks, summarization, low-stakes data tasks
Medium — complex multi-step tasks, internal tooling
High — coding, code review, security analysis, production operations
Blocked — wallet operations, payment operations, legal or compliance workflows, irreversible actions

High-risk or blocked workflows require human review before any model downgrade or change is kept.
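
The human-review rule above amounts to a simple gate on the risk class. A minimal Python sketch, where the `RiskClass` enum and the `requires_human_review` helper are hypothetical names introduced here:

```python
from enum import Enum


class RiskClass(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"
    BLOCKED = "Blocked"


def requires_human_review(risk: RiskClass) -> bool:
    """High and Blocked workflows need human review before a change is kept."""
    return risk in (RiskClass.HIGH, RiskClass.BLOCKED)
```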

7. Recommendation

One practical action:

keep the change
revert the change
narrow the change to lower-risk tasks
exclude high-risk workflows
collect more samples
run a shadow test
compare with a human quality rubric

8. Manual Verification Prompt

A ready-to-copy prompt for your agent. The Manual Verification Prompt should appear in the first answer.

Please evaluate this cost-control change.

Change made: <what you changed>
Before summary: <before behavior>
After summary: <after behavior>
Task type: <task type>
Risk class: <Low / Medium / High / Blocked>

<include any cost/token/latency data, observed quality issues, or human notes here>

Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action.
Redact secrets before pasting anything here.
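
If you fill in this prompt often, the placeholders can be substituted programmatically. The Python sketch below is one way to do that; `build_prompt` and its default values are assumptions for illustration, while the template wording is copied from the prompt above.

```python
PROMPT_TEMPLATE = """Please evaluate this cost-control change.

Change made: {change_made}
Before summary: {before}
After summary: {after}
Task type: {task_type}
Risk class: {risk_class}

{details}

Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action.
Redact secrets before pasting anything here."""


def build_prompt(change_made: str, before: str, after: str,
                 task_type: str = "unknown", risk_class: str = "unknown",
                 details: str = "No additional data available.") -> str:
    """Fill the Manual Verification Prompt; unknown fields are labeled, not guessed."""
    return PROMPT_TEMPLATE.format(
        change_made=change_made, before=before, after=after,
        task_type=task_type, risk_class=risk_class, details=details,
    )
```

The defaults say "unknown" explicitly so that missing information stays visible to the evaluator.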

Safety Boundaries

This skill must state clearly what it will and will not do.

It will:

evaluate before/after evidence you provide
ask for more samples if evidence is thin
flag high-risk workflows that need human review
refuse to recommend keeping Blocked workflow changes without explicit human approval
tell you when it is unsafe to judge

It will not:

find recurring job waste (use waste-audit)
audit routing waste from scratch (use agent-routing-waste-audit)
auto-apply policy changes
edit config files
switch models or providers
guarantee equal quality at lower cost
approve high-risk workflow downgrades without explicit human review
replace human review for production, coding, security, wallet, payment, legal, compliance, or irreversible-action workflows
require you to paste secrets, private keys, API keys, credentials, or full private logs — redact before pasting
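
Redaction can be partly automated before pasting logs. The Python sketch below masks a few common secret shapes; the patterns and the `redact` helper are heuristic assumptions, not a complete secret scanner, so a manual read-through is still needed.

```python
import re

# Heuristic patterns only (assumptions): key=value style secrets and PEM private keys.
KEY_VALUE = re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[:=]\s*)(\S+)")
PEM_KEY = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.S,
)


def redact(text: str) -> str:
    """Mask values that look like credentials; the label and separator are kept."""
    text = PEM_KEY.sub("[REDACTED PRIVATE KEY]", text)
    text = KEY_VALUE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    return text
```

The word boundary after `token` keeps plain metric names such as `tokens=500` intact.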

Relationship to Other Agent Cost Control Skills

Skill	Role
waste-audit	Finds recurring OpenClaw job waste
agent-routing-waste-audit	Finds routing / retry / fallback / model-assignment waste
agent-cost-eval-kit	Evaluates whether a cost-control change should be kept, reverted, narrowed, or tested further

This skill comes after an audit or manual change. It is not the first audit step. Use it when you already know what you changed and want to know if it is working.

Safe Usage Advice

This appears safe to install as an evaluation aid. Prefer workspace install unless you intentionally want it available to all local agents, and use force-update only when you mean to replace an existing install. Do not paste secrets, API keys, private keys, or full private logs into the evaluation prompt.

Capability Tags

crypto · requires-wallet · requires-sensitive-credentials

Capability Assessment

✓ Purpose & Capability

The purpose is coherent: it evaluates before/after evidence about cost, latency, quality, reliability, and risk after a user-directed cost-control change. The artifacts contain markdown guidance and examples, not executable automation.

✓ Instruction Scope

The runtime instructions are scoped to evaluation only and explicitly say not to edit, disable, delete, switch models, change config automatically, or replace human review for high-risk workflows.

ℹ Install Mechanism

The skill documents standard OpenClaw workspace, global, and force-update install commands. Global and force installs have broader persistence, but they are disclosed and not hidden.

ℹ Credentials

Requested inputs fit the stated purpose. Metadata capability tags mention crypto, wallet, and sensitive credentials, but the artifact text only mentions those as blocked or redaction examples and says credentials/full private logs are not required.

ℹ Persistence & Privilege

A global install would make the skill available to all local agents, and force-update can replace an installed version. There is no background worker, executable script, privilege escalation, network action, or hidden persistence in the artifacts.

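How to Use

Make sure OpenClaw is installed (locally or via Docker).
Enter the install command in the chat box: /install agent-cost-eval-kit
Once installation finishes, invoke the skill by name or trigger it with /agent-cost-eval-kit
Provide the inputs described in the skill's instructions to get structured output.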

Version History

v1.0.1

Fix: Change 'simple情报任务' to 'simple intelligence tasks' in Risk Class section (English-only website)

v1.0.0

Initial release. Evaluate whether a cost-control change should be kept, reverted, narrowed, or tested further.

Metadata

Slug agent-cost-eval-kit

Version 1.0.1

License MIT-0

Total installs 0

Current installs 0

Versions 2
