Agent Cost Eval Kit

by choosenobody · v1.0.1 · MIT-0

cross-platform ✓ Security Clean

Install in OpenClaw

/install agent-cost-eval-kit

Description

Evaluate whether an agent cost-control change actually reduced waste without obvious quality or reliability regressions. After an audit or manual change, hel...

README (SKILL.md)

Agent Cost Eval Kit

Evaluate whether a cost-control change should be kept, reverted, narrowed, or tested further.

This skill does not find waste. It helps you judge whether a change you already made or are considering is working out.

When To Use

Use this after you have already:

run waste-audit or agent-routing-waste-audit and applied a change, OR
manually reduced retries, changed a fallback chain, switched a model tier, changed a sub-agent assignment, rescheduled a recurring job, or narrowed a routing path

Now you want to know: is the change actually working, or is it creating new problems?

Install

Workspace install:

openclaw skills install agent-cost-eval-kit

Install for all local agents:

openclaw skills install agent-cost-eval-kit --global

To force-update an existing install:

openclaw skills install agent-cost-eval-kit --global --force

Activation

Primary activation phrase:

eval agent cost change

Acceptable examples:

eval agent cost change from this before/after run
eval agent cost change after reducing retries
eval agent cost change after changing fallback policy
eval agent cost change after switching model tier
eval agent cost change after narrowing local/cloud routing

Required Input

At minimum, provide:

Change made: What you changed
Before: Short summary of before behavior
After: Short summary of after behavior
Observed result: What you noticed

Better inputs — include if available:

Task type: What kind of task was being run
Risk class: Low / Medium / High / Blocked (see Risk Class section)
Cost / token / latency data: Any numbers you collected
Quality or reliability issue observed: Did anything get worse
Human notes: Your own assessment

Do not invent data. If you do not have it, say so.

If before/after evidence is missing or too thin, this skill will respond with one of:

Needs More Samples

Unsafe to Judge

Input examples: See references/before-after-examples.md for concrete before/after templates you can copy, modify, and paste.
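
The minimum-input rule above can be checked before you paste anything into the agent. The following is a minimal sketch, not part of the skill itself: the `EvalInput` class, its field names, and `missing_required` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalInput:
    """Hypothetical container for the inputs listed above."""
    # Minimum fields
    change_made: str = ""
    before: str = ""
    after: str = ""
    observed_result: str = ""
    # Optional "better inputs"; None means "not available", never a guess
    task_type: Optional[str] = None
    risk_class: Optional[str] = None
    metrics: Optional[str] = None
    issue_observed: Optional[str] = None
    human_notes: Optional[str] = None


REQUIRED_FIELDS = ("change_made", "before", "after", "observed_result")


def missing_required(inp: EvalInput) -> List[str]:
    """Return the names of minimum fields that are empty or whitespace-only."""
    return [name for name in REQUIRED_FIELDS if not getattr(inp, name).strip()]
```

If `missing_required` returns a non-empty list, it is probably better to collect that information first than to fill the gap with invented data.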

What You Will Get

1. Decision

One of:

Keep Change — evidence supports keeping the change
Revert Change — evidence suggests the change caused problems
Narrow Change — keep the change but limit it to lower-risk tasks
Needs More Samples — not enough evidence to decide
Unsafe to Judge — high-risk change with insufficient evidence to evaluate safely

2. Evidence Level

Level 1: Anecdotal or single sample
Level 2: Small before/after sample
Level 3: Repeated samples with cost, latency, success, and quality notes

3. Before / After Summary

Compact structured summary covering:

route / model / retry / fallback if available
token use
estimated cost
latency
success / failure
quality notes
reliability notes

4. Cost Result

Separately report:

token change
estimated cost change
latency change
recurring impact if relevant
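
If you collected numbers, the separate changes listed above reduce to simple arithmetic. The sketch below is one possible way to compute them, under assumptions: `RunStats`, `cost_result`, and the `runs_per_month` parameter are hypothetical names, and the skill does not prescribe any particular units or formula.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunStats:
    """Hypothetical per-run averages that you measured yourself."""
    tokens: float
    cost_usd: float
    latency_s: float


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the value went down."""
    if before == 0:
        raise ValueError("before value must be non-zero to compute a percentage")
    return (after - before) / before * 100.0


def cost_result(before: RunStats, after: RunStats, runs_per_month: int = 0) -> Dict[str, float]:
    """Report token, cost, and latency changes separately."""
    result = {
        "token_change_pct": pct_change(before.tokens, after.tokens),
        "cost_change_pct": pct_change(before.cost_usd, after.cost_usd),
        "latency_change_pct": pct_change(before.latency_s, after.latency_s),
    }
    # Recurring impact is only reported when a run frequency is supplied.
    if runs_per_month:
        result["monthly_cost_change_usd"] = (after.cost_usd - before.cost_usd) * runs_per_month
    return result
```

Keeping the three percentages separate matters because a change can cut tokens and cost while making latency worse.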

5. Quality and Reliability Result

Check for:

obvious quality loss
more failures
more retries
worse fallback behavior
incomplete outputs
missing safety checks
human reviewer concerns

6. Risk Class

Assign one of:

Low — simple intelligence tasks, summarization, low-stakes data tasks
Medium — complex multi-step tasks, internal tooling
High — coding, code review, security analysis, production operations
Blocked — wallet operations, payment operations, legal or compliance workflows, irreversible actions

High-risk or blocked workflows require human review before any model downgrade or change is kept.
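
The human-review rule above can be expressed as a small gate. This is an illustrative sketch only: `RiskClass` and `can_keep_change` are hypothetical names, and the skill itself contains no executable code.

```python
from enum import Enum


class RiskClass(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"
    BLOCKED = "Blocked"


def requires_human_review(risk: RiskClass) -> bool:
    """High and Blocked workflows need a human in the loop."""
    return risk in (RiskClass.HIGH, RiskClass.BLOCKED)


def can_keep_change(risk: RiskClass, human_reviewed: bool) -> bool:
    """A change may be kept only if review is not required or has already happened."""
    return human_reviewed or not requires_human_review(risk)
```

For example, `can_keep_change(RiskClass.HIGH, human_reviewed=False)` is `False`, which corresponds to holding the decision until a person has looked at the evidence.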

7. Recommendation

One practical action:

keep the change
revert the change
narrow the change to lower-risk tasks
exclude high-risk workflows
collect more samples
run a shadow test
compare with a human quality rubric

8. Manual Verification Prompt

A ready-to-copy prompt for your agent. The Manual Verification Prompt should appear in the first answer.

Please evaluate this cost-control change.

Change made: <what you changed>
Before summary: <before behavior>
After summary: <after behavior>
Task type: <task type>
Risk class: <Low / Medium / High / Blocked>

<include any cost/token/latency data, observed quality issues, or human notes here>

Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action.
Redact secrets before pasting anything here.
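
If you fill in this prompt often, a small helper can do the substitution. The following is a rough sketch, assuming you keep the prompt text as a Python string; `render_prompt` and its parameter names are hypothetical and not provided by the skill. Missing optional fields are marked as not provided rather than guessed.

```python
PROMPT_TEMPLATE = """Please evaluate this cost-control change.

Change made: {change_made}
Before summary: {before}
After summary: {after}
Task type: {task_type}
Risk class: {risk_class}

{extra}

Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action.
Redact secrets before pasting anything here."""


def render_prompt(
    change_made: str,
    before: str,
    after: str,
    task_type: str = "not provided",
    risk_class: str = "not provided",
    extra: str = "No additional data provided.",
) -> str:
    """Fill the verification prompt; the caller is responsible for redacting secrets first."""
    return PROMPT_TEMPLATE.format(
        change_made=change_made,
        before=before,
        after=after,
        task_type=task_type,
        risk_class=risk_class,
        extra=extra,
    )
```

The helper only formats text. It does not redact anything, so secrets still need to be removed by hand before the values are passed in.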

Safety Boundaries

This skill must state clearly what it will and will not do.

It will:

evaluate before/after evidence you provide
ask for more samples if evidence is thin
flag high-risk workflows that need human review
refuse to recommend keeping Blocked workflow changes without explicit human approval
tell you when it is unsafe to judge

It will not:

find recurring job waste (use waste-audit)
audit routing waste from scratch (use agent-routing-waste-audit)
auto-apply policy changes
edit config files
switch models or providers
guarantee equal quality at lower cost
approve high-risk workflow downgrades without explicit human review
replace human review for production, coding, security, wallet, payment, legal, compliance, or irreversible-action workflows
require you to paste secrets, private keys, API keys, credentials, or full private logs — redact before pasting

Relationship to Other Agent Cost Control Skills

Skill	Role
waste-audit	Finds recurring OpenClaw job waste
agent-routing-waste-audit	Finds routing / retry / fallback / model-assignment waste
agent-cost-eval-kit	Evaluates whether a cost-control change should be kept, reverted, narrowed, or tested further

This skill comes after an audit or manual change. It is not the first audit step. Use it when you already know what you changed and want to know if it is working.

What This Will Not Do

It will not find recurring job waste. Use waste-audit.
It will not audit routing waste from scratch. Use agent-routing-waste-audit.
It will not auto-apply policy changes.
It will not edit config files.
It will not switch models or providers.
It will not guarantee equal quality at lower cost.
It will not approve high-risk workflow downgrades without explicit human review.
It will not replace human review for production, coding, security, wallet, payment, legal, compliance, or irreversible-action workflows.

Usage Guidance

This appears safe to install as an evaluation aid. Prefer workspace install unless you intentionally want it available to all local agents, and use force-update only when you mean to replace an existing install. Do not paste secrets, API keys, private keys, or full private logs into the evaluation prompt.

Capability Tags

crypto · requires-wallet · requires-sensitive-credentials

Capability Assessment

✓ Purpose & Capability

The purpose is coherent: it evaluates before/after evidence about cost, latency, quality, reliability, and risk after a user-directed cost-control change. The artifacts contain markdown guidance and examples, not executable automation.

✓ Instruction Scope

The runtime instructions are scoped to evaluation only and explicitly say not to edit, disable, delete, switch models, change config automatically, or replace human review for high-risk workflows.

ℹ Install Mechanism

The skill documents standard OpenClaw workspace, global, and force-update install commands. Global and force installs have broader persistence, but they are disclosed and not hidden.

ℹ Credentials

Requested inputs fit the stated purpose. Metadata capability tags mention crypto, wallet, and sensitive credentials, but the artifact text only mentions those as blocked or redaction examples and says credentials/full private logs are not required.

ℹ Persistence & Privilege

A global install would make the skill available to all local agents, and force-update can replace an installed version. There is no background worker, executable script, privilege escalation, network action, or hidden persistence in the artifacts.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install agent-cost-eval-kit
After installation, invoke the skill by name or use /agent-cost-eval-kit
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.1

Fix: Change 'simple情报任务' to 'simple intelligence tasks' in Risk Class section (English-only website)

v1.0.0

Initial release. Evaluate whether a cost-control change should be kept, reverted, narrowed, or tested further.

Metadata

Slug agent-cost-eval-kit

Version 1.0.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Agent Cost Eval Kit?

Agent Cost Eval Kit is an AI Agent Skill for Claude Code / OpenClaw, with 53 downloads so far. Its listing description reads: "Evaluate whether an agent cost-control change actually reduced waste without obvious quality or reliability regressions. After an audit or manual change, hel..."

How do I install Agent Cost Eval Kit?

Run "/install agent-cost-eval-kit" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Agent Cost Eval Kit free?

Yes, Agent Cost Eval Kit is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Agent Cost Eval Kit support?

Agent Cost Eval Kit is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Agent Cost Eval Kit?

It is built and maintained by choosenobody (@choosenobody); the current version is v1.0.1.
