Agent Cost Eval Kit

by choosenobody · GitHub ↗ · v1.0.1 · MIT-0
cross-platform ✓ Security Clean
53 Downloads · 0 Stars · 0 Active Installs · 2 Versions
Install in OpenClaw
/install agent-cost-eval-kit
Description
Evaluate whether an agent cost-control change actually reduced waste without obvious quality or reliability regressions. After an audit or manual change, hel...
README (SKILL.md)

Agent Cost Eval Kit

Evaluate whether a cost-control change should be kept, reverted, narrowed, or tested further.

This skill does not find waste. It helps you judge whether a change you already made or are considering is working out.

When To Use

Use this after you have already:

  • run waste-audit or agent-routing-waste-audit and applied a change, OR
  • manually reduced retries, changed a fallback chain, switched a model tier, changed a sub-agent assignment, rescheduled a recurring job, or narrowed a routing path

Now you want to know: is the change actually working, or is it creating new problems?

Install

Workspace install:

openclaw skills install agent-cost-eval-kit

Install for all local agents:

openclaw skills install agent-cost-eval-kit --global

To force-update an existing install:

openclaw skills install agent-cost-eval-kit --global --force

Activation

Primary activation phrase:

eval agent cost change

Acceptable examples:

eval agent cost change from this before/after run
eval agent cost change after reducing retries
eval agent cost change after changing fallback policy
eval agent cost change after switching model tier
eval agent cost change after narrowing local/cloud routing

Required Input

At minimum, provide:

  • Change made: What you changed
  • Before: Short summary of before behavior
  • After: Short summary of after behavior
  • Observed result: What you noticed

Better inputs — include if available:

  • Task type: What kind of task was being run
  • Risk class: Low / Medium / High / Blocked (see Risk Class section)
  • Cost / token / latency data: Any numbers you collected
  • Quality or reliability issue observed: Did anything get worse
  • Human notes: Your own assessment

Do not invent data. If you do not have it, say so.
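
As an illustration of the input lists above, the fields could be collected in a small record before you write the request. This is a minimal sketch, not part of the skill: the names `EvalInput` and `missing_required` are hypothetical, and the skill itself only needs the information as plain text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalInput:
    # Required fields, named after the "At minimum" list (hypothetical names).
    change_made: str
    before: str
    after: str
    observed_result: str
    # Optional fields from the "Better inputs" list; None means "not collected".
    task_type: Optional[str] = None
    risk_class: Optional[str] = None
    cost_data: Optional[str] = None
    quality_issue: Optional[str] = None
    human_notes: Optional[str] = None


REQUIRED = ("change_made", "before", "after", "observed_result")


def missing_required(inp: EvalInput) -> list[str]:
    """Return the required fields that are empty or whitespace-only."""
    return [name for name in REQUIRED if not getattr(inp, name).strip()]
```

Leaving an optional field as `None` rather than guessing a value matches the rule above: if you do not have the data, say so.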

If before/after evidence is missing or too thin, this skill will respond:

Needs More Samples

or

Unsafe to Judge

Input examples: See references/before-after-examples.md for concrete before/after templates you can copy, modify, and paste.

What You Will Get

1. Decision

One of:

  • Keep Change — evidence supports keeping the change
  • Revert Change — evidence suggests the change caused problems
  • Narrow Change — keep the change but limit it to lower-risk tasks
  • Needs More Samples — not enough evidence to decide
  • Unsafe to Judge — high-risk change with insufficient evidence to evaluate safely

2. Evidence Level

  • Level 1: Anecdotal or single sample
  • Level 2: Small before/after sample
  • Level 3: Repeated samples with cost, latency, success, and quality notes
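
The three levels can be read as a simple classification over how many samples you have and how complete your notes are. The sketch below is only an illustration: the function name `evidence_level` and the `repeated_min` threshold are assumptions, since the skill does not state where "small" ends and "repeated" begins.

```python
def evidence_level(n_before: int, n_after: int, has_full_notes: bool,
                   repeated_min: int = 10) -> int:
    """Map sample counts to Level 1, 2, or 3.

    repeated_min is an assumed cutoff, not a value defined by the skill.
    has_full_notes means cost, latency, success, and quality notes exist.
    """
    smaller = min(n_before, n_after)
    if smaller <= 1:
        return 1  # anecdotal or single sample
    if smaller >= repeated_min and has_full_notes:
        return 3  # repeated samples with full notes
    return 2      # small before/after sample
```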

3. Before / After Summary

Compact structured summary covering:

  • route / model / retry / fallback if available
  • token use
  • estimated cost
  • latency
  • success / failure
  • quality notes
  • reliability notes

4. Cost Result

Separately report:

  • token change
  • estimated cost change
  • latency change
  • recurring impact if relevant
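
If you have numbers, the four items above reduce to simple before/after arithmetic. The following is a minimal sketch under stated assumptions: the dictionary keys (`tokens`, `cost_usd`, `latency_s`), the USD unit, and the `runs_per_month` parameter are hypothetical and not defined by the skill.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from before to after; negative means a reduction."""
    if before == 0:
        raise ValueError("before value must be non-zero")
    return (after - before) / before * 100.0


def cost_result(before: dict, after: dict, runs_per_month: int = 0) -> dict:
    """Report token, cost, and latency changes separately."""
    result = {
        "token_change_pct": pct_change(before["tokens"], after["tokens"]),
        "cost_change_pct": pct_change(before["cost_usd"], after["cost_usd"]),
        "latency_change_pct": pct_change(before["latency_s"], after["latency_s"]),
    }
    if runs_per_month:
        # Recurring impact: per-run cost difference scaled by monthly run count.
        result["monthly_cost_change_usd"] = (
            (after["cost_usd"] - before["cost_usd"]) * runs_per_month
        )
    return result
```

Keeping the three percentages separate matters because a change can lower tokens and cost while raising latency, which a single combined score would hide.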

5. Quality and Reliability Result

Check for:

  • obvious quality loss
  • more failures
  • more retries
  • worse fallback behavior
  • incomplete outputs
  • missing safety checks
  • human reviewer concerns

6. Risk Class

Assign one of:

  • Low — simple intelligence tasks, summarization, low-stakes data tasks
  • Medium — complex multi-step tasks, internal tooling
  • High — coding, code review, security analysis, production operations
  • Blocked — wallet operations, payment operations, legal or compliance workflows, irreversible actions

High-risk or blocked workflows require human review before any model downgrade or change is kept.
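
The review rule above can be expressed as a small gate. This is a sketch only: `RiskClass` and `passes_review_gate` are hypothetical names, and passing the gate does not mean the change should be kept, only that the human-review requirement is not violated.

```python
from enum import Enum


class RiskClass(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"
    BLOCKED = "Blocked"


def passes_review_gate(risk: RiskClass, human_reviewed: bool) -> bool:
    """High and Blocked workflows need human review before a change is kept."""
    if risk in (RiskClass.HIGH, RiskClass.BLOCKED):
        return human_reviewed
    return True
```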

7. Recommendation

One practical action:

  • keep the change
  • revert the change
  • narrow the change to lower-risk tasks
  • exclude high-risk workflows
  • collect more samples
  • run a shadow test
  • compare with a human quality rubric

8. Manual Verification Prompt

A ready-to-copy prompt for your agent. The skill should include it in its first answer.

Please evaluate this cost-control change.

Change made: <what you changed>
Before summary: <before behavior>
After summary: <after behavior>
Task type: <task type>
Risk class: <Low / Medium / High / Blocked>

<include any cost/token/latency data, observed quality issues, or human notes here>

Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action.
Redact secrets before pasting anything here.
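
If you fill in this prompt often, the placeholders can be substituted programmatically. The sketch below assumes a hypothetical helper, `build_prompt`, and hypothetical default strings for missing fields; the template text is copied from the prompt above.

```python
PROMPT_TEMPLATE = """Please evaluate this cost-control change.

Change made: {change_made}
Before summary: {before}
After summary: {after}
Task type: {task_type}
Risk class: {risk_class}

{extra}

Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action.
Redact secrets before pasting anything here."""


def build_prompt(change_made: str, before: str, after: str,
                 task_type: str = "not provided",
                 risk_class: str = "not provided",
                 extra: str = "No additional data provided.") -> str:
    """Fill the template; missing optional fields are stated, not invented."""
    return PROMPT_TEMPLATE.format(
        change_made=change_made, before=before, after=after,
        task_type=task_type, risk_class=risk_class, extra=extra,
    )
```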

Safety Boundaries

This skill must state clearly what it will and will not do.

It will:

  • evaluate before/after evidence you provide
  • ask for more samples if evidence is thin
  • flag high-risk workflows that need human review
  • refuse to recommend keeping Blocked workflow changes without explicit human approval
  • tell you when it is unsafe to judge

It will not:

  • find recurring job waste (use waste-audit)
  • audit routing waste from scratch (use agent-routing-waste-audit)
  • auto-apply policy changes
  • edit config files
  • switch models or providers
  • guarantee equal quality at lower cost
  • approve high-risk workflow downgrades without explicit human review
  • replace human review for production, coding, security, wallet, payment, legal, compliance, or irreversible-action workflows
  • require you to paste secrets, private keys, API keys, credentials, or full private logs — redact before pasting
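
Redaction before pasting can be partly automated. The sketch below is illustrative and deliberately not exhaustive: the `redact` function and both patterns are assumptions, they will miss many secret formats, and you should still read the text yourself before sharing it.

```python
import re

# Assumed patterns for illustration only; real logs need a broader set.
PEM_BLOCK = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)
KEY_VALUE = re.compile(
    r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[:=]\s*)\S+"
)


def redact(text: str) -> str:
    """Replace PEM private key blocks and key=value style secrets."""
    text = PEM_BLOCK.sub("[REDACTED PRIVATE KEY]", text)
    # Keep the field name and separator so the log stays readable.
    return KEY_VALUE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
```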

Relationship to Other Agent Cost Control Skills

| Skill | Role |
| --- | --- |
| waste-audit | Finds recurring OpenClaw job waste |
| agent-routing-waste-audit | Finds routing / retry / fallback / model-assignment waste |
| agent-cost-eval-kit | Evaluates whether a cost-control change should be kept, reverted, narrowed, or tested further |

This skill comes after an audit or manual change. It is not the first audit step. Use it when you already know what you changed and want to know if it is working.

Usage Guidance
This appears safe to install as an evaluation aid. Prefer workspace install unless you intentionally want it available to all local agents, and use force-update only when you mean to replace an existing install. Do not paste secrets, API keys, private keys, or full private logs into the evaluation prompt.
Capability Tags
crypto, requires-wallet, requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The purpose is coherent: it evaluates before/after evidence about cost, latency, quality, reliability, and risk after a user-directed cost-control change. The artifacts contain markdown guidance and examples, not executable automation.
Instruction Scope
The runtime instructions are scoped to evaluation only and explicitly say not to edit, disable, delete, switch models, change config automatically, or replace human review for high-risk workflows.
Install Mechanism
The skill documents standard OpenClaw workspace, global, and force-update install commands. Global and force installs have broader persistence, but they are disclosed and not hidden.
Credentials
Requested inputs fit the stated purpose. Metadata capability tags mention crypto, wallet, and sensitive credentials, but the artifact text only mentions those as blocked or redaction examples and says credentials/full private logs are not required.
Persistence & Privilege
A global install would make the skill available to all local agents, and force-update can replace an installed version. There is no background worker, executable script, privilege escalation, network action, or hidden persistence in the artifacts.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install agent-cost-eval-kit
  3. After installation, invoke the skill by name or use /agent-cost-eval-kit
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.1
Fix: Change 'simple情报任务' to 'simple intelligence tasks' in Risk Class section (English-only website)
v1.0.0
Initial release. Evaluate whether a cost-control change should be kept, reverted, narrowed, or tested further.
Metadata
Slug agent-cost-eval-kit
Version 1.0.1
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 2
Frequently Asked Questions

What is Agent Cost Eval Kit?

Evaluate whether an agent cost-control change actually reduced waste without obvious quality or reliability regressions. After an audit or manual change, hel... It is an AI Agent Skill for Claude Code / OpenClaw, with 53 downloads so far.

How do I install Agent Cost Eval Kit?

Run "/install agent-cost-eval-kit" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Agent Cost Eval Kit free?

Yes, Agent Cost Eval Kit is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Agent Cost Eval Kit support?

Agent Cost Eval Kit is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Agent Cost Eval Kit?

It is built and maintained by choosenobody (@choosenobody); the current version is v1.0.1.
