Agent Cost Eval Kit
Evaluate whether a cost-control change should be kept, reverted, narrowed, or tested further.
This skill does not find waste. It helps you judge whether a change you already made or are considering is working out.
When To Use
Use this after you have already:
- run waste-audit or agent-routing-waste-audit and applied a change, OR
- manually reduced retries, changed a fallback chain, switched a model tier, changed a sub-agent assignment, rescheduled a recurring job, or narrowed a routing path
Now you want to know: is the change actually working, or is it creating new problems?
Install
Workspace install:
openclaw skills install agent-cost-eval-kit
Install for all local agents:
openclaw skills install agent-cost-eval-kit --global
To force-update an existing install:
openclaw skills install agent-cost-eval-kit --global --force
Activation
Primary activation phrase:
eval agent cost change
Acceptable examples:
eval agent cost change from this before/after run
eval agent cost change after reducing retries
eval agent cost change after changing fallback policy
eval agent cost change after switching model tier
eval agent cost change after narrowing local/cloud routing
Required Input
At minimum, provide:
- Change made: What you changed
- Before: Short summary of before behavior
- After: Short summary of after behavior
- Observed result: What you noticed
Better inputs — include if available:
- Task type: What kind of task was being run
- Risk class: Low / Medium / High / Blocked (see Risk Class section)
- Cost / token / latency data: Any numbers you collected
- Quality or reliability issue observed: Did anything get worse
- Human notes: Your own assessment
Do not invent data. If you do not have it, say so.
If before/after evidence is missing or too thin, this skill will respond:
Needs More Samples
or
Unsafe to Judge
Input examples: See references/before-after-examples.md for concrete before/after templates you can copy, modify, and paste.
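As a rough illustration of the minimum-input rule, the sketch below checks a report for the four required fields before it is sent for evaluation. The `ChangeReport` class, its field names, and `missing_required` are hypothetical helpers invented for this example; the skill itself accepts free-form text and does not expose this structure.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical structure mirroring the Required Input list; not part of the skill.
@dataclass
class ChangeReport:
    change_made: str = ""
    before: str = ""
    after: str = ""
    observed_result: str = ""
    task_type: Optional[str] = None   # optional "better input"
    risk_class: Optional[str] = None  # optional "better input"


REQUIRED_FIELDS = ("change_made", "before", "after", "observed_result")


def missing_required(report: ChangeReport) -> list[str]:
    """Return the names of required fields that are empty or whitespace-only."""
    return [name for name in REQUIRED_FIELDS if not getattr(report, name).strip()]
```

If `missing_required` returns a non-empty list, the honest move is to say which inputs are missing rather than fill them in, which matches the "Do not invent data" rule above.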
What You Will Get
1. Decision
One of:
- Keep Change: evidence supports keeping the change
- Revert Change: evidence suggests the change caused problems
- Narrow Change: keep the change but limit it to lower-risk tasks
- Needs More Samples: not enough evidence to decide
- Unsafe to Judge: high-risk change with insufficient evidence to evaluate safely
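One way to picture how the five decisions relate is the sketch below. It is not the skill's actual logic, which this document does not specify. The function name `decide`, its parameters, and every threshold (for example, treating High or Blocked work with evidence below Level 3 as unsafe to judge) are assumptions made for illustration.

```python
def decide(
    evidence_level: int,
    risk_class: str,
    regression_observed: bool,
    regression_limited_to_higher_risk: bool = False,
) -> str:
    """Map evidence level (1-3), risk class, and regressions to a decision label.

    All thresholds here are illustrative assumptions, not documented behavior.
    """
    high_risk = risk_class in {"High", "Blocked"}

    # Assumed rule: high-risk changes need the strongest evidence to be judged at all.
    if high_risk and evidence_level < 3:
        return "Unsafe to Judge"
    # Assumed rule: a single or anecdotal sample is never enough to decide.
    if evidence_level < 2:
        return "Needs More Samples"
    if regression_observed:
        # If problems only show up on riskier tasks, limit the change instead of dropping it.
        if regression_limited_to_higher_risk:
            return "Narrow Change"
        return "Revert Change"
    # For High or Blocked work this result still requires human review before it is acted on.
    return "Keep Change"
```

For instance, `decide(2, "High", False)` yields "Unsafe to Judge" under these assumed thresholds, while `decide(2, "Low", False)` yields "Keep Change".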
2. Evidence Level
- Level 1: Anecdotal or single sample
- Level 2: Small before/after sample
- Level 3: Repeated samples with cost, latency, success, and quality notes
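The three levels can be read as a simple classification over how many samples you have and whether they carry full metrics. The sketch below is one possible reading; the source does not define what counts as "small" or "repeated", so the `repeated_threshold` default of 5 is an assumption, as are the function and parameter names.

```python
def evidence_level(
    sample_count: int,
    has_full_metrics: bool,
    repeated_threshold: int = 5,  # assumed cut-off; the source gives no number
) -> int:
    """Classify evidence as Level 1, 2, or 3.

    has_full_metrics means the samples include cost, latency, success,
    and quality notes, as Level 3 requires.
    """
    if sample_count <= 1:
        return 1  # anecdotal or single sample
    if sample_count >= repeated_threshold and has_full_metrics:
        return 3  # repeated samples with full metrics
    return 2      # small before/after sample, or repeated samples missing metrics
```

Note that many samples without quality notes stay at Level 2 in this sketch, since Level 3 asks for both repetition and complete metrics.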
3. Before / After Summary
Compact structured summary covering:
- route / model / retry / fallback if available
- token use
- estimated cost
- latency
- success / failure
- quality notes
- reliability notes
4. Cost Result
Separately report:
- token change
- estimated cost change
- latency change
- recurring impact if relevant
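The arithmetic behind these four figures is a signed percent change per metric, plus a per-run cost difference multiplied by run frequency for recurring jobs. The sketch below shows that calculation; the dictionary keys (`tokens`, `cost_usd`, `latency_s`) and the `runs_per_month` parameter are hypothetical names chosen for this example.

```python
def percent_change(before: float, after: float) -> float:
    """Signed percent change from before to after; negative means a reduction."""
    if before == 0:
        raise ValueError("before value must be non-zero to compute a percent change")
    return (after - before) * 100.0 / before


def cost_result(before: dict, after: dict, runs_per_month: int = 0) -> dict:
    """Report token, cost, and latency changes separately.

    Expects hypothetical keys 'tokens', 'cost_usd', and 'latency_s' in both dicts.
    """
    result = {
        "token_change_pct": percent_change(before["tokens"], after["tokens"]),
        "cost_change_pct": percent_change(before["cost_usd"], after["cost_usd"]),
        "latency_change_pct": percent_change(before["latency_s"], after["latency_s"]),
    }
    # Recurring impact only applies when the task runs on a schedule.
    if runs_per_month > 0:
        per_run_delta = after["cost_usd"] - before["cost_usd"]
        result["monthly_cost_delta_usd"] = per_run_delta * runs_per_month
    return result
```

With placeholder inputs of 1000 tokens and 4.0 s before versus 800 tokens and 5.0 s after, this reports a 20% token reduction alongside a 25% latency increase. Reporting the metrics separately keeps a cost saving from hiding a latency regression.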
5. Quality and Reliability Result
Check for:
- obvious quality loss
- more failures
- more retries
- worse fallback behavior
- incomplete outputs
- missing safety checks
- human reviewer concerns
6. Risk Class
Assign one of:
- Low — simple intelligence tasks, summarization, low-stakes data tasks
- Medium — complex multi-step tasks, internal tooling
- High — coding, code review, security analysis, production operations
- Blocked — wallet operations, payment operations, legal or compliance workflows, irreversible actions
High-risk or blocked workflows require human review before any model downgrade or change is kept.
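The human-review rule can be expressed as a small gate. This is a minimal sketch assuming the risk class arrives as one of the four labels above; the function name `needs_human_review` and the case-insensitive matching are illustrative choices, not part of the skill.

```python
RISK_CLASSES = {"Low", "Medium", "High", "Blocked"}
HUMAN_REVIEW_REQUIRED = {"High", "Blocked"}


def needs_human_review(risk_class: str) -> bool:
    """Return True when a change may not be kept without human review."""
    normalized = risk_class.strip().capitalize()
    if normalized not in RISK_CLASSES:
        # Fail loudly on unknown labels instead of silently treating them as low risk.
        raise ValueError(f"unknown risk class: {risk_class!r}")
    return normalized in HUMAN_REVIEW_REQUIRED
```

Raising on an unrecognized label is a deliberate choice here: defaulting an unknown class to "Low" would let a mislabeled high-risk workflow skip review.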
7. Recommendation
One practical action:
- keep the change
- revert the change
- narrow the change to lower-risk tasks
- exclude high-risk workflows
- collect more samples
- run a shadow test
- compare with a human quality rubric
8. Manual Verification Prompt
A ready-to-copy prompt for your agent. The Manual Verification Prompt should appear in the first answer.
Please evaluate this cost-control change.
Change made: <what you changed>
Before summary: <before behavior>
After summary: <after behavior>
Task type: <task type>
Risk class: <Low / Medium / High / Blocked>
<include any cost/token/latency data, observed quality issues, or human notes here>
Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action.
Redact secrets before pasting anything here.
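If you fill in this prompt often, a small helper can substitute your values into the template. The sketch below is one way to do that with plain string formatting; `build_verification_prompt` and its parameter names are hypothetical, and the helper does no redaction, so secrets still need to be removed by hand before pasting.

```python
PROMPT_TEMPLATE = """Please evaluate this cost-control change.
Change made: {change_made}
Before summary: {before}
After summary: {after}
Task type: {task_type}
Risk class: {risk_class}
{extra}
Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action."""


def build_verification_prompt(
    change_made: str,
    before: str,
    after: str,
    task_type: str,
    risk_class: str,
    extra: str = "No additional data provided.",
) -> str:
    """Fill the manual verification prompt; inputs are inserted as-is, unredacted."""
    return PROMPT_TEMPLATE.format(
        change_made=change_made,
        before=before,
        after=after,
        task_type=task_type,
        risk_class=risk_class,
        extra=extra,
    )
```

The `extra` argument is where cost, token, or latency numbers and human notes would go; the default text makes it explicit when none were supplied.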
Safety Boundaries
This skill must state clearly what it will and will not do.
It will:
- evaluate before/after evidence you provide
- ask for more samples if evidence is thin
- flag high-risk workflows that need human review
- refuse to recommend keeping Blocked workflow changes without explicit human approval
- tell you when it is unsafe to judge
It will not:
- find recurring job waste (use waste-audit)
- audit routing waste from scratch (use agent-routing-waste-audit)
- auto-apply policy changes
- edit config files
- switch models or providers
- guarantee equal quality at lower cost
- approve high-risk workflow downgrades without explicit human review
- replace human review for production, coding, security, wallet, payment, legal, compliance, or irreversible-action workflows
- require you to paste secrets, private keys, API keys, credentials, or full private logs — redact before pasting
Relationship to Other Agent Cost Control Skills
| Skill | Role |
|---|---|
| waste-audit | Finds recurring OpenClaw job waste |
| agent-routing-waste-audit | Finds routing / retry / fallback / model-assignment waste |
| agent-cost-eval-kit | Evaluates whether a cost-control change should be kept, reverted, narrowed, or tested further |
This skill comes after an audit or manual change. It is not the first audit step. Use it when you already know what you changed and want to know if it is working.
What This Will Not Do
- It will not find recurring job waste. Use waste-audit.
- It will not audit routing waste from scratch. Use agent-routing-waste-audit.
- It will not auto-apply policy changes.
- It will not edit config files.
- It will not switch models or providers.
- It will not guarantee equal quality at lower cost.
- It will not approve high-risk workflow downgrades without explicit human review.
- It will not replace human review for production, coding, security, wallet, payment, legal, compliance, or irreversible-action workflows.
To install and run the skill from chat:
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install agent-cost-eval-kit
- After installation, invoke the skill by name or use /agent-cost-eval-kit
- Provide required inputs per the skill's parameter spec and get structured output
What is Agent Cost Eval Kit?
Evaluate whether an agent cost-control change actually reduced waste without obvious quality or reliability regressions. After an audit or manual change, hel... It is an AI Agent Skill for Claude Code / OpenClaw, with 53 downloads so far.
How do I install Agent Cost Eval Kit?
Run "/install agent-cost-eval-kit" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Agent Cost Eval Kit free?
Yes, Agent Cost Eval Kit is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Agent Cost Eval Kit support?
Agent Cost Eval Kit is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Agent Cost Eval Kit?
It is built and maintained by choosenobody (@choosenobody); the current version is v1.0.1.