# Agent Scorecard: Output Quality Framework

Install with `/install agent-scorecard`.

Configurable quality evaluation for AI agent outputs. Define criteria, run evaluations, track quality over time.

Agent Scorecard gives you a structured, repeatable way to measure whether your AI agent is producing good output, and whether it's getting better or worse over time. No LLM-as-judge, no API calls, no external dependencies. Everything runs locally with pattern-based automated checks and optional human scoring.

---

## The Problem

You changed your agent's system prompt. Is the output better now? You don't know. You added a new tool. Did response quality degrade? You have a feeling, but no data. Quality management for AI agents is mostly vibes.

Agent Scorecard replaces vibes with numbers.

## What It Does

### 1. Define Quality Dimensions (`config_example.json`)

- Configure what "quality" means for your use case
- Set dimensions: accuracy, completeness, tone, format compliance, consistency, or your own
- Define rubrics (what does a 1 vs. a 5 look like for each dimension?)
- Set weights (accuracy matters more than tone? Give it 2× weight)
- Set pass/fail thresholds per dimension

### 2. Evaluate (`scorecard.py`)

- **Automated mode:** pattern-based checks run instantly with zero API calls
  - Response length analysis (too short? too long?)
  - Format compliance (expected headers, lists, code blocks present?)
  - Sycophancy detection ("Great question!" markers)
  - Filler/hedge word density ("basically", "perhaps", "I think")
  - Required section verification
  - Style consistency (sentence length variation)
- **Manual mode:** interactive, rubric-guided human scoring
- **Blended mode:** combine auto scores with human judgment (averaged)
- Aggregate scoring with a configurable method (weighted average, minimum, or geometric mean)

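
To make the idea of a pattern-based check concrete, here is a minimal sketch of three of the checks listed above. It is not the code in `scorecard.py`: the function names, marker lists, and heading regex are all assumptions chosen for illustration, and the real markers are tuned through the config file.

```python
import re

# Hypothetical marker lists; the tool's real lists are set in its config.
SYCOPHANCY_MARKERS = ("great question", "excellent question", "happy to help")
FILLER_WORDS = ("basically", "perhaps", "i think", "actually", "just")


def find_sycophancy(text):
    """Return the sycophancy markers that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [m for m in SYCOPHANCY_MARKERS if m in lowered]


def filler_density(text):
    """Return filler/hedge occurrences divided by the total word count."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    if not words:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(filler) + r"\b", lowered))
        for filler in FILLER_WORDS
    )
    return hits / len(words)


def has_required_sections(text, required):
    """Check that every required title appears as a markdown heading."""
    found = re.findall(r"^#+\s*(.+)$", text, flags=re.MULTILINE)
    headings = {title.strip().lower() for title in found}
    return all(title.lower() in headings for title in required)
```

Checks like these are deterministic and cheap, which is what makes them usable without an LLM judge; the trade-off is that they measure surface signals rather than correctness.
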
### 3. Track (`scorecard_track.py`)

- Append every evaluation to a JSONL history file
- Filter by agent, task type, or time period
- Compute trends per dimension (improving, degrading, stable)
- Linear regression slope for quantified direction
- Sparkline visualisations in the terminal

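
The trend and sparkline steps can be sketched with the standard library alone. This is an illustrative version, not the implementation in `scorecard_track.py`: the function names, the `tolerance` cut-off, and the 1 to 5 score scale for the sparkline are assumptions. The slope is computed by hand because `statistics.linear_regression` only exists from Python 3.10, while the tool targets 3.8+.

```python
BARS = "▁▂▃▄▅▆▇█"


def trend_slope(scores):
    """Least-squares slope of scores against their index (0, 1, 2, ...)."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def classify_trend(slope, tolerance=0.05):
    """Label a slope; the tolerance is an assumed value, not the tool's."""
    if slope > tolerance:
        return "improving"
    if slope < -tolerance:
        return "degrading"
    return "stable"


def sparkline(scores, low=1.0, high=5.0):
    """Map each score on the low..high scale to one of eight block characters."""
    span = high - low
    chars = []
    for score in scores:
        frac = min(max((score - low) / span, 0.0), 1.0)  # clamp to [0, 1]
        chars.append(BARS[round(frac * (len(BARS) - 1))])
    return "".join(chars)
```

A slope is expressed in score points per evaluation, so a value of 0.1 means the dimension gains roughly one tenth of a point with each new run.
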
### 4. Compare (`scorecard_track.py`)

- Before/after comparison (last N evals vs. previous N)
- Per-dimension delta with direction indicators
- Well suited to measuring the impact of config changes

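
As a rough illustration of the before/after comparison, the sketch below averages each dimension over the last `n` evaluations and over the `n` before them. The record shape (a plain `{dimension: score}` dict per evaluation) and the function name `compare_windows` are assumptions; the tool's history records may be structured differently.

```python
def compare_windows(history, n):
    """Compare mean per-dimension scores of the last n evals with the n before.

    `history` is a list of {dimension_name: score} dicts, oldest first
    (an assumed shape, used here only for illustration).
    """
    if n <= 0 or len(history) < 2 * n:
        raise ValueError("n must be positive and history must hold 2 * n evals")
    previous, recent = history[-2 * n:-n], history[-n:]

    def mean(records, dim):
        values = [r[dim] for r in records if dim in r]
        return sum(values) / len(values) if values else None

    deltas = {}
    for dim in sorted({d for r in previous + recent for d in r}):
        before, after = mean(previous, dim), mean(recent, dim)
        if before is None or after is None:
            continue  # dimension missing from one window; nothing to compare
        delta = after - before
        arrow = "↑" if delta > 0 else "↓" if delta < 0 else "→"
        deltas[dim] = (before, after, delta, arrow)
    return deltas
```

Run it once before a prompt or config change has accumulated `n` evaluations and once after, and the per-dimension deltas show which dimensions moved.
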
### 5. Report (`scorecard_report.py`)

- Single evaluation reports (markdown or JSON)
- History summary reports with tables and sparklines
- Per-dimension breakdowns with rubric reference
- Export to files or stdout

---

## Quick Start

```bash
# 1. Configure
cp config_example.json scorecard_config.json
# Edit dimensions, thresholds, and weights for your use case

# 2. Evaluate a response
python3 scorecard.py --config scorecard_config.json --input response.txt

# 3. Evaluate and save to history
python3 scorecard.py --config scorecard_config.json --input response.txt --save history.jsonl

# 4. Manual scoring mode
python3 scorecard.py --config scorecard_config.json --input response.txt --manual --save history.jsonl

# 5. View trends
python3 scorecard_track.py --history history.jsonl --summary

# 6. Compare before/after (last 10 vs previous 10)
python3 scorecard_track.py --history history.jsonl --compare 10

# 7. Generate a report
python3 scorecard_report.py --config scorecard_config.json --history history.jsonl
```

## Programmatic Usage

```python
import json

from scorecard import Scorecard, _load_config

cfg = _load_config("scorecard_config.json")
sc = Scorecard(cfg)

# Read the agent output to evaluate
with open("agent_response.txt", encoding="utf-8") as f:
    text = f.read()

result = sc.evaluate(text, agent="my-agent", task_type="code-review")

print(result.summary())
# Overall: 3.85/5 (PASS)
#   ✓ Accuracy: 4.0/5 (threshold 3, weight 2.0) [auto]
#   ✓ Completeness: 3.5/5 (threshold 3, weight 1.5) [auto]
#   ...

# Save for tracking: append one JSON object per line (JSONL)
with open("history.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(result.to_dict()) + "\n")
```

---

## Use Cases

- **Prompt engineering:** Measure whether prompt changes improve output quality
- **Model comparison:** Same task, different models: which scores higher?
- **Agent regression testing:** Catch quality degradation before it ships
- **Team quality standards:** Define shared rubrics for consistent evaluation
- **Continuous monitoring:** Track quality trends over days, weeks, or months
- **A/B testing:** Quantified before/after comparisons

## What's Included

| File | Purpose |
|------|---------|
| `scorecard.py` | Main evaluation engine: define, evaluate, score |
| `scorecard_track.py` | Historical tracking and trend analysis |
| `scorecard_report.py` | Report generation (markdown, JSON) |
| `config_example.json` | Full configuration template with all tunables |
| `LIMITATIONS.md` | What this tool doesn't do |
| `LICENSE` | MIT License |

## Requirements

- Python 3.8+
- No external dependencies (stdlib only)
- Works on any OS
- Framework-agnostic (works with any AI agent framework)

## Configuration

See `config_example.json` for the complete reference. Key areas:

- **`DIMENSIONS`**: Quality dimensions with rubrics, weights, thresholds, and auto-checks
- **`AUTO_CHECKS`**: Tuning for each pattern-based check (markers, thresholds, penalties)
- **`AGGREGATE_METHOD`**: How to combine dimension scores (`"weighted_average"`, `"minimum"`, `"geometric_mean"`)
- **`HISTORY_FILE`**: Where to store evaluation history
- **`REPORT_OUTPUT_DIR`**: Where reports are saved

---

## License

MIT. See the `LICENSE` file.

---

## ⚠️ Security Note: Config File

Configuration is loaded from a plain JSON file. Parsing JSON never executes code, so config files are safe to share.

- The config path is validated for existence and size (1 MB cap) before loading
- The path must point to a `.json` file; a non-JSON path raises `ValueError`
- Keep your config under version control; it defines your quality rubrics and scoring weights
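
The validation rules above could look roughly like the following. This is a sketch of the described behaviour, not the tool's `_load_config`: the name `load_config_safely` is hypothetical, and the exception types for a missing or oversized file are assumptions (only the `ValueError` for a non-JSON path is stated above).

```python
import json
import os

MAX_CONFIG_BYTES = 1024 * 1024  # 1 MB cap, as described above


def load_config_safely(path):
    """Validate extension, existence, and size, then parse the JSON config."""
    if not path.lower().endswith(".json"):
        raise ValueError(f"config must be a .json file: {path}")
    if not os.path.isfile(path):
        # Assumed exception type for a missing file.
        raise FileNotFoundError(f"config not found: {path}")
    if os.path.getsize(path) > MAX_CONFIG_BYTES:
        # Assumed exception type for an oversized file.
        raise ValueError(f"config exceeds {MAX_CONFIG_BYTES} bytes: {path}")
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # json.load only parses data; it never runs code
```

Checking the size before opening the file keeps an accidental multi-gigabyte path from being read into memory.
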

## ⚠️ Disclaimer

This software is provided "AS IS", without warranty of any kind, express or implied.

**USE AT YOUR OWN RISK.**

- The author(s) are NOT liable for any damages, losses, or consequences arising from the use or misuse of this software, including but not limited to financial loss, data loss, security breaches, business interruption, or any indirect/consequential damages.
- This software does NOT constitute financial, legal, trading, or professional advice.
- Users are solely responsible for evaluating whether this software is suitable for their use case, environment, and risk tolerance.
- No guarantee is made regarding accuracy, reliability, completeness, or fitness for any particular purpose.
- The author(s) are not responsible for how third parties use, modify, or distribute this software after purchase.

By downloading, installing, or using this software, you acknowledge that you have read this disclaimer and agree to use the software entirely at your own risk.

**DATA DISCLAIMER:** This software processes and stores data locally on your system. The author(s) are not responsible for data loss, corruption, or unauthorized access resulting from software bugs, system failures, or user error. Always maintain independent backups of important data. This software does not transmit data externally unless explicitly configured by the user.

---

## Support & Links

| | |
|---|---|
| 🐛 **Bug Reports** | [email protected] |
| ☕ **Ko-fi** | [ko-fi.com/theshadowrose](https://ko-fi.com/theshadowrose) |
| 🛒 **Gumroad** | [shadowyrose.gumroad.com](https://shadowyrose.gumroad.com) |
| 🐦 **Twitter** | [@TheShadowyRose](https://twitter.com/TheShadowyRose) |
| 🐙 **GitHub** | [github.com/TheShadowRose](https://github.com/TheShadowRose) |
| 🧠 **PromptBase** | [promptbase.com/profile/shadowrose](https://promptbase.com/profile/shadowrose) |

*Built with [OpenClaw](https://github.com/openclaw/openclaw). Thank you for making this possible.*

---

🛠️ **Need something custom?** Custom OpenClaw agents & skills starting at $500. If you can describe it, I can build it. → [Hire me on Fiverr](https://www.fiverr.com/s/jjmlZ0v)

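## Installation

- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: `/install agent-scorecard`
- Once installed, invoke the Skill by name or trigger it with `/agent-scorecard`
- Provide the inputs described in the Skill's parameter documentation to get structured output

## FAQ

**What is Agent Scorecard?**
Configurable quality evaluation for AI agent outputs. Define criteria, run evaluations, track quality over time. No LLM-as-judge, no API calls, pattern-based... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 439 downloads to date.

**How do I install Agent Scorecard?**
Run `/install agent-scorecard` in the OpenClaw or Claude Code chat box for a one-step install. No additional configuration is needed.

**Is Agent Scorecard free?**
Yes. Agent Scorecard is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

**Which platforms does Agent Scorecard support?**
Agent Scorecard is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

**Who develops Agent Scorecard?**
It is developed and maintained by Shadow Rose (@theshadowrose). The current version is v1.0.6.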