Agent Cost Eval Kit
/install agent-cost-eval-kit
Evaluate whether a cost-control change should be kept, reverted, narrowed, or tested further.
This skill does not find waste. It helps you judge whether a change you already made or are considering is working out.
When To Use
Use this after you have already:
- run waste-audit or agent-routing-waste-audit and applied a change, OR
- manually reduced retries, changed a fallback chain, switched a model tier, changed a sub-agent assignment, rescheduled a recurring job, or narrowed a routing path
Now you want to know: is the change actually working, or is it creating new problems?
Install
Workspace install:
openclaw skills install agent-cost-eval-kit
Install for all local agents:
openclaw skills install agent-cost-eval-kit --global
To force-update an existing install:
openclaw skills install agent-cost-eval-kit --global --force
Activation
Primary activation phrase:
eval agent cost change
Acceptable examples:
eval agent cost change from this before/after run
eval agent cost change after reducing retries
eval agent cost change after changing fallback policy
eval agent cost change after switching model tier
eval agent cost change after narrowing local/cloud routing
Required Input
At minimum, provide:
- Change made: What you changed
- Before: Short summary of before behavior
- After: Short summary of after behavior
- Observed result: What you noticed
Better inputs — include if available:
- Task type: What kind of task was being run
- Risk class: Low / Medium / High / Blocked (see Risk Class section)
- Cost / token / latency data: Any numbers you collected
- Quality or reliability issue observed: Did anything get worse
- Human notes: Your own assessment
Do not invent data. If you do not have it, say so.
If before/after evidence is missing or too thin, this skill will respond with Needs More Samples or Unsafe to Judge.
Input examples: See references/before-after-examples.md for concrete before/after templates you can copy, modify, and paste.
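If you keep these inputs in a script before pasting them, a minimal Python sketch of the record and an early thin-evidence check might look like this. `CostChangeInput`, its field names, and the `triage_input` rules (including treating a High or Blocked change with no numbers as unsafe) are hypothetical assumptions; the skill itself reads the text you paste, not this code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostChangeInput:
    # Required fields (hypothetical names mirroring the Required Input list)
    change_made: str
    before: str
    after: str
    observed_result: str
    # Optional fields: leave as None when you do not have the data
    task_type: Optional[str] = None
    risk_class: Optional[str] = None  # "Low", "Medium", "High", or "Blocked"
    cost_token_latency_data: Optional[str] = None
    quality_issue: Optional[str] = None
    human_notes: Optional[str] = None


def triage_input(inp: CostChangeInput) -> Optional[str]:
    """Return an early verdict when evidence is missing, or None to continue.

    The rules here are illustrative assumptions, not the skill's own logic.
    """
    required = [inp.change_made, inp.before, inp.after, inp.observed_result]
    missing_required = any(not value.strip() for value in required)
    high_risk = inp.risk_class in ("High", "Blocked")
    # Assumed rule: a high-risk change with gaps or no numbers is unsafe to judge
    if high_risk and (missing_required or inp.cost_token_latency_data is None):
        return "Unsafe to Judge"
    if missing_required:
        return "Needs More Samples"
    return None
```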
What You Will Get
1. Decision
One of:
- Keep Change: evidence supports keeping the change
- Revert Change: evidence suggests the change caused problems
- Narrow Change: keep the change but limit it to lower-risk tasks
- Needs More Samples: not enough evidence to decide
- Unsafe to Judge: high-risk change with insufficient evidence to evaluate safely
2. Evidence Level
- Level 1: Anecdotal or single sample
- Level 2: Small before/after sample
- Level 3: Repeated samples with cost, latency, success, and quality notes
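One way to make the evidence levels concrete is a small Python function. This is a sketch under assumptions: `Sample`, `evidence_level`, and the `repeated_min=3` threshold for "repeated samples" are hypothetical, and a single paired before/after run is treated as Level 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    # One observed run; True means that kind of data or note was recorded
    has_cost: bool
    has_latency: bool
    has_success: bool
    has_quality_notes: bool


def evidence_level(before: List[Sample], after: List[Sample], repeated_min: int = 3) -> int:
    """Assign Level 1-3 (repeated_min is an assumed threshold)."""
    paired = min(len(before), len(after))
    if paired <= 1:
        return 1  # anecdotal or single sample
    complete = all(
        s.has_cost and s.has_latency and s.has_success and s.has_quality_notes
        for s in before + after
    )
    if paired >= repeated_min and complete:
        return 3  # repeated samples with cost, latency, success, and quality notes
    return 2  # small before/after sample
```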
3. Before / After Summary
Compact structured summary covering:
- route / model / retry / fallback if available
- token use
- estimated cost
- latency
- success / failure
- quality notes
- reliability notes
4. Cost Result
Separately report:
- token change
- estimated cost change
- latency change
- recurring impact if relevant
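The cost deltas are simple arithmetic on numbers you already collected. A minimal Python sketch, assuming hypothetical per-run averages in a `RunStats` record and an optional `runs_per_month` count for recurring jobs:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunStats:
    # Hypothetical per-run averages you measured or estimated
    tokens: int
    cost_usd: float  # estimated cost per run
    latency_s: float


def pct_change(before: float, after: float) -> Optional[float]:
    # Percent change relative to the before value; None when there is no baseline
    if before == 0:
        return None
    return (after - before) / before * 100.0


def cost_result(before: RunStats, after: RunStats, runs_per_month: Optional[int] = None) -> dict:
    result = {
        "token_change_pct": pct_change(before.tokens, after.tokens),
        "cost_change_pct": pct_change(before.cost_usd, after.cost_usd),
        "latency_change_pct": pct_change(before.latency_s, after.latency_s),
    }
    if runs_per_month is not None:
        # Recurring impact: per-run cost difference scaled by run frequency
        result["monthly_cost_delta_usd"] = (after.cost_usd - before.cost_usd) * runs_per_month
    return result
```

Negative percentages mean the after run used less. Pass `runs_per_month` only for recurring jobs; the cost values are whatever estimates you supply, and nothing here looks up provider pricing.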
5. Quality and Reliability Result
Check for:
- obvious quality loss
- more failures
- more retries
- worse fallback behavior
- incomplete outputs
- missing safety checks
- human reviewer concerns
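A rough Python sketch of the counter-based part of this check, assuming you logged hypothetical counts (`ReliabilityCounts`) over the same number of runs before and after. Obvious quality loss still needs a human read; here it only enters through `reviewer_concerns`.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReliabilityCounts:
    # Hypothetical counters; compare only windows with the same number of runs
    failures: int = 0
    retries: int = 0
    fallbacks: int = 0
    incomplete_outputs: int = 0
    skipped_safety_checks: int = 0


def reliability_flags(
    before: ReliabilityCounts,
    after: ReliabilityCounts,
    reviewer_concerns: Optional[List[str]] = None,
) -> List[str]:
    flags = []
    # Flag every counter that went up after the change
    for f in fields(ReliabilityCounts):
        b = getattr(before, f.name)
        a = getattr(after, f.name)
        if a > b:
            flags.append(f"more {f.name}: {b} -> {a}")
    # Human reviewer concerns are passed through unchanged
    for concern in reviewer_concerns or []:
        flags.append(f"human reviewer: {concern}")
    return flags
```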
6. Risk Class
Assign one of:
- Low — simple intelligence tasks, summarization, low-stakes data tasks
- Medium — complex multi-step tasks, internal tooling
- High — coding, code review, security analysis, production operations
- Blocked — wallet operations, payment operations, legal or compliance workflows, irreversible actions
High-risk or blocked workflows require human review before any model downgrade or change is kept.
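The risk classes and the human-review rule can be sketched in Python as follows. The `TASK_RISK` keys are illustrative labels taken from the list above, and defaulting unknown task types to High is an assumption chosen to fail toward more review.

```python
from enum import Enum


class RiskClass(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"
    BLOCKED = "Blocked"


# Illustrative task labels drawn from the Risk Class list; extend for your own tasks
TASK_RISK = {
    "summarization": RiskClass.LOW,
    "internal tooling": RiskClass.MEDIUM,
    "coding": RiskClass.HIGH,
    "code review": RiskClass.HIGH,
    "security analysis": RiskClass.HIGH,
    "production operations": RiskClass.HIGH,
    "wallet operations": RiskClass.BLOCKED,
    "payment operations": RiskClass.BLOCKED,
    "legal or compliance": RiskClass.BLOCKED,
    "irreversible action": RiskClass.BLOCKED,
}


def classify(task_type: str) -> RiskClass:
    # Unknown task types fall back to HIGH (assumption: prefer more review)
    return TASK_RISK.get(task_type.strip().lower(), RiskClass.HIGH)


def can_keep_without_review(risk: RiskClass) -> bool:
    # High and Blocked workflows need human review before a change is kept
    return risk in (RiskClass.LOW, RiskClass.MEDIUM)
```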
7. Recommendation
One practical action:
- keep the change
- revert the change
- narrow the change to lower-risk tasks
- exclude high-risk workflows
- collect more samples
- run a shadow test
- compare with a human quality rubric
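A sketch of how evidence level, risk class, and observed regressions might combine into one of the five decisions, with one possible decision-to-recommendation pairing. The `decide` rules and the `RECOMMENDATION` table are hypothetical assumptions, not the skill's published logic.

```python
def decide(level: int, risk: str, cost_improved: bool, regressions: int,
           human_approved: bool = False) -> str:
    """Pick a decision from evidence level (1-3), risk class, and regression count."""
    needs_review = risk in ("High", "Blocked")
    # High-risk changes with thin evidence and no human sign-off are not judged
    if needs_review and not human_approved and level < 3:
        return "Unsafe to Judge"
    if level == 1:
        return "Needs More Samples"
    if regressions > 0:
        # Quality or reliability got worse after the change
        return "Revert Change"
    if not cost_improved:
        # Assumed rule: no measurable saving yet, so gather more data first
        return "Needs More Samples"
    if needs_review and not human_approved:
        # Never keep an unreviewed High/Blocked change; limit it to lower-risk tasks
        return "Narrow Change"
    return "Keep Change"


# One possible pairing; "exclude high-risk workflows" or "compare with a human
# quality rubric" may fit better case by case
RECOMMENDATION = {
    "Keep Change": "keep the change",
    "Revert Change": "revert the change",
    "Narrow Change": "narrow the change to lower-risk tasks",
    "Needs More Samples": "collect more samples",
    "Unsafe to Judge": "run a shadow test",
}
```

The ordering matters: the high-risk check runs first, so an unreviewed High or Blocked change never reaches Keep Change.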
8. Manual Verification Prompt
A ready-to-copy prompt for your agent. The Manual Verification Prompt should appear in the first answer.
Please evaluate this cost-control change.
Change made: <what you changed>
Before summary: <before behavior>
After summary: <after behavior>
Task type: <task type>
Risk class: <Low / Medium / High / Blocked>
<include any cost/token/latency data, observed quality issues, or human notes here>
Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action.
Redact secrets before pasting anything here.
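If you fill this prompt from a script, a minimal Python sketch might look like the following. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical helpers; missing fields are written as "not provided" rather than guessed, and redaction is still your job before you call it.

```python
from typing import Optional

# Mirrors the Manual Verification Prompt above; {placeholders} are filled by build_prompt
PROMPT_TEMPLATE = """Please evaluate this cost-control change.
Change made: {change}
Before summary: {before}
After summary: {after}
Task type: {task_type}
Risk class: {risk}
{extra}
Please evaluate whether this change should be kept, reverted, narrowed, or tested further.
Do not edit, disable, delete, switch models, or change any config automatically.
Inspect only. Return your evaluation with evidence level and recommended action."""


def build_prompt(
    change: str,
    before: str,
    after: str,
    task_type: Optional[str] = None,
    risk: Optional[str] = None,
    extra: Optional[str] = None,
) -> str:
    """Fill the prompt; redact secrets in every argument before calling this."""

    def or_missing(value: Optional[str]) -> str:
        # Say plainly that data is missing instead of inventing it
        return value if value else "not provided"

    return PROMPT_TEMPLATE.format(
        change=change,
        before=before,
        after=after,
        task_type=or_missing(task_type),
        risk=or_missing(risk),
        extra=extra if extra else "No cost/token/latency data or notes provided.",
    )
```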
Safety Boundaries
This skill must state clearly what it will and will not do.
It will:
- evaluate before/after evidence you provide
- ask for more samples if evidence is thin
- flag high-risk workflows that need human review
- refuse to recommend keeping Blocked workflow changes without explicit human approval
- tell you when it is unsafe to judge
It will not:
- find recurring job waste (use waste-audit)
- audit routing waste from scratch (use agent-routing-waste-audit)
- auto-apply policy changes
- edit config files
- switch models or providers
- guarantee equal quality at lower cost
- approve high-risk workflow downgrades without explicit human review
- replace human review for production, coding, security, wallet, payment, legal, compliance, or irreversible-action workflows
- require you to paste secrets, private keys, API keys, credentials, or full private logs — redact before pasting
Relationship to Other Agent Cost Control Skills
| Skill | Role |
|---|---|
| waste-audit | Finds recurring OpenClaw job waste |
| agent-routing-waste-audit | Finds routing / retry / fallback / model-assignment waste |
| agent-cost-eval-kit | Evaluates whether a cost-control change should be kept, reverted, narrowed, or tested further |
This skill comes after an audit or manual change. It is not the first audit step. Use it when you already know what you changed and want to know if it is working.
What This Will Not Do
- It will not find recurring job waste. Use waste-audit.
- It will not audit routing waste from scratch. Use agent-routing-waste-audit.
- It will not auto-apply policy changes.
- It will not edit config files.
- It will not switch models or providers.
- It will not guarantee equal quality at lower cost.
- It will not approve high-risk workflow downgrades without explicit human review.
- It will not replace human review for production, coding, security, wallet, payment, legal, compliance, or irreversible-action workflows.
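Getting Started
- Make sure OpenClaw is installed (local or Docker deployment).
- Enter the install command in the chat box: /install agent-cost-eval-kit
- After installation, call the skill by name or trigger it with /agent-cost-eval-kit.
- Provide the inputs described in the skill's parameter notes to get structured output.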
What is Agent Cost Eval Kit?
Evaluate whether an agent cost-control change actually reduced waste without obvious quality or reliability regressions. After an audit or manual change, hel... It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 53 times so far.
How do I install Agent Cost Eval Kit?
Run the command "/install agent-cost-eval-kit" in the OpenClaw or Claude Code chat box to install it in one step, with no extra configuration.
Is Agent Cost Eval Kit free?
Yes. Agent Cost Eval Kit is completely free under the MIT-0 license; you can download, install, and use it freely.
Which platforms does Agent Cost Eval Kit support?
Agent Cost Eval Kit is cross-platform and works in any environment where OpenClaw / Claude Code is deployed.
Who develops Agent Cost Eval Kit?
It is developed and maintained by choosenobody (@choosenobody). The current version is v1.0.1.