Description

Configurable quality evaluation for AI agent outputs. Define criteria, run evaluations, track quality over time. No LLM-as-judge, no API calls, pattern-based...

README (SKILL.md)

# Agent Scorecard – Output Quality Framework

Name: Agent Scorecard
Author: theshadowrose

Configurable quality evaluation for AI agent outputs. Define criteria, run evaluations, track quality over time.

Agent Scorecard gives you a structured, repeatable way to measure whether your AI agent is producing good output, and whether it's getting better or worse over time. No LLM-as-judge, no API calls, no external dependencies. Everything runs locally with pattern-based automated checks and optional human scoring.

---

## The Problem

You changed your agent's system prompt. Is the output better now? You don't know. You added a new tool. Did response quality degrade? You have a feeling, but no data. Quality management for AI agents is mostly vibes.

Agent Scorecard replaces vibes with numbers.

## What It Does

### 1. Define Quality Dimensions (`config_example.json`)

- Configure what "quality" means for your use case
- Set dimensions: accuracy, completeness, tone, format compliance, consistency, or your own
- Define rubrics (what does a 1 vs a 5 look like for each dimension?)
- Set weights (accuracy matters more than tone? Give it 2× weight)
- Set pass/fail thresholds per dimension

### 2. Evaluate (`scorecard.py`)

- **Automated mode:** Pattern-based checks run instantly with zero API calls
  - Response length analysis (too short? too long?)
  - Format compliance (expected headers, lists, code blocks present?)
  - Sycophancy detection ("Great question!" markers)
  - Filler/hedge word density ("basically", "perhaps", "I think")
  - Required section verification
  - Style consistency (sentence length variation)
- **Manual mode:** Interactive rubric-guided human scoring
- **Blended mode:** Combine auto scores with human judgment (averaged)
- Aggregate scoring with configurable method (weighted average, minimum, geometric mean)
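The three aggregation methods named above can be sketched as follows. This is an illustration and not the code in `scorecard.py`: the `aggregate` function, its signature, and the choice to weight the geometric mean are assumptions made for the example.

```python
import math

def aggregate(scores, weights, method="weighted_average"):
    """Combine per-dimension scores (1-5 scale) into one overall score.

    Hypothetical helper: the real tool may weight or round differently.
    """
    if not scores:
        raise ValueError("no dimension scores to aggregate")
    total_weight = sum(weights[d] for d in scores)
    if method == "weighted_average":
        return sum(scores[d] * weights[d] for d in scores) / total_weight
    if method == "minimum":
        # The weakest dimension decides the overall score; weights are ignored.
        return min(scores.values())
    if method == "geometric_mean":
        # Weighted geometric mean; assumes every score is greater than zero.
        log_sum = sum(weights[d] * math.log(scores[d]) for d in scores)
        return math.exp(log_sum / total_weight)
    raise ValueError(f"unknown aggregate method: {method}")

scores = {"accuracy": 4.0, "completeness": 3.5, "tone": 5.0}
weights = {"accuracy": 2.0, "completeness": 1.5, "tone": 1.0}
for method in ("weighted_average", "minimum", "geometric_mean"):
    print(method, round(aggregate(scores, weights, method), 2))
```

The minimum is the strictest choice because one weak dimension caps the result. The geometric mean sits between the two: it never exceeds the weighted average and penalizes uneven scores more.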

### 3. Track (`scorecard_track.py`)

- Append every evaluation to a JSONL history file
- Filter by agent, task type, time period
- Compute trends per dimension (improving, degrading, stable)
- Linear regression slope for quantified direction
- Sparkline visualisations in terminal
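To make the trend step concrete, here is a minimal sketch of a least-squares slope, a three-way trend label, and a terminal sparkline. The function names, the `tolerance` cutoff of 0.05, and the 1-to-5 scale bounds are assumptions for illustration; `scorecard_track.py` may use different names and thresholds.

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def trend_label(values, tolerance=0.05):
    """Classify the slope; the tolerance is a hypothetical cutoff."""
    s = slope(values)
    if s > tolerance:
        return "improving"
    if s < -tolerance:
        return "degrading"
    return "stable"

BARS = "▁▂▃▄▅▆▇█"

def sparkline(values, lo=1.0, hi=5.0):
    """Map each score on the lo..hi scale to one of eight block characters."""
    chars = []
    for v in values:
        frac = (min(max(v, lo), hi) - lo) / (hi - lo)
        chars.append(BARS[min(int(frac * len(BARS)), len(BARS) - 1)])
    return "".join(chars)

history = [3.0, 3.2, 3.1, 3.6, 3.8, 4.1]  # overall scores, oldest first
print(sparkline(history), trend_label(history), round(slope(history), 3))
```

The slope is expressed in score points per evaluation, so a value of 0.2 means the score rose by about one fifth of a point per run over the window.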

### 4. Compare (`scorecard_track.py`)

- Before/after comparison (last N evals vs previous N)
- Per-dimension delta with direction indicators
- Perfect for measuring the impact of config changes
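The before/after comparison can be sketched as two adjacent windows over the history. The record layout here is an assumption: each record is taken to have a `"dimensions"` dict mapping a dimension name to its score, which may not match what the tool actually writes to the JSONL file. The `compare_windows` name is also hypothetical.

```python
def compare_windows(records, n):
    """Mean score per dimension for the last n records vs the n before them."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(records) < 2 * n:
        raise ValueError(f"need at least {2 * n} records, got {len(records)}")
    previous, recent = records[-2 * n:-n], records[-n:]

    def means(window):
        collected = {}
        for rec in window:
            # "dimensions" is an assumed field name, not confirmed by the source.
            for dim, score in rec["dimensions"].items():
                collected.setdefault(dim, []).append(score)
        return {dim: sum(v) / len(v) for dim, v in collected.items()}

    before, after = means(previous), means(recent)
    result = {}
    # Only dimensions present in both windows can be compared.
    for dim in sorted(before.keys() & after.keys()):
        delta = after[dim] - before[dim]
        arrow = "↑" if delta > 0 else "↓" if delta < 0 else "→"
        result[dim] = (before[dim], after[dim], delta, arrow)
    return result

records = [
    {"dimensions": {"accuracy": 3.0, "tone": 4.0}},
    {"dimensions": {"accuracy": 3.0, "tone": 4.0}},
    {"dimensions": {"accuracy": 4.0, "tone": 3.5}},
    {"dimensions": {"accuracy": 4.0, "tone": 3.5}},
]
for dim, (old, new, delta, arrow) in compare_windows(records, 2).items():
    print(f"{dim}: {old:.2f} -> {new:.2f} ({delta:+.2f}) {arrow}")
```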

### 5. Report (`scorecard_report.py`)

- Single evaluation reports (markdown or JSON)
- History summary reports with tables and sparklines
- Per-dimension breakdowns with rubric reference
- Export to files or stdout

---

## Quick Start

```bash
# 1. Configure
cp config_example.json scorecard_config.json
# Edit dimensions, thresholds, and weights for your use case

# 2. Evaluate a response
python3 scorecard.py --config scorecard_config.json --input response.txt

# 3. Evaluate and save to history
python3 scorecard.py --config scorecard_config.json --input response.txt --save history.jsonl

# 4. Manual scoring mode
python3 scorecard.py --config scorecard_config.json --input response.txt --manual --save history.jsonl

# 5. View trends
python3 scorecard_track.py --history history.jsonl --summary

# 6. Compare before/after (last 10 vs previous 10)
python3 scorecard_track.py --history history.jsonl --compare 10

# 7. Generate a report
python3 scorecard_report.py --config scorecard_config.json --history history.jsonl
```

## Programmatic Usage

```python
import json

from scorecard import Scorecard, _load_config

cfg = _load_config("scorecard_config.json")
sc = Scorecard(cfg)

# Read the agent output to evaluate
with open("agent_response.txt", encoding="utf-8") as f:
    text = f.read()

result = sc.evaluate(text, agent="my-agent", task_type="code-review")

print(result.summary())
# Overall: 3.85/5 (PASS)
#   ✓ Accuracy: 4.0/5 (threshold 3, weight 2.0) [auto]
#   ✓ Completeness: 3.5/5 (threshold 3, weight 1.5) [auto]
#   ...

# Save for tracking: append one JSON object per line
with open("history.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(result.to_dict()) + "\n")
```

---

## Use Cases

- **Prompt engineering:** Measure whether prompt changes improve output quality
- **Model comparison:** Same task, different models: which scores higher?
- **Agent regression testing:** Catch quality degradation before it ships
- **Team quality standards:** Define shared rubrics for consistent evaluation
- **Continuous monitoring:** Track quality trends over days/weeks/months
- **A/B testing:** Quantified before/after comparisons

## What's Included

| File | Purpose |
|------|---------|
| `scorecard.py` | Main evaluation engine: define, evaluate, score |
| `scorecard_track.py` | Historical tracking and trend analysis |
| `scorecard_report.py` | Report generation (markdown, JSON) |
| `config_example.json` | Full configuration template with all tunables |
| `LIMITATIONS.md` | What this tool doesn't do |
| `LICENSE` | MIT License |

## Requirements

- Python 3.8+
- No external dependencies (stdlib only)
- Works on any OS
- Platform-agnostic (works with any AI agent framework)

## Configuration

See `config_example.json` for the complete reference. Key areas:

- **`DIMENSIONS`:** Quality dimensions with rubrics, weights, thresholds, and auto-checks
- **`AUTO_CHECKS`:** Tuning for each pattern-based check (markers, thresholds, penalties)
- **`AGGREGATE_METHOD`:** How to combine dimension scores (`"weighted_average"`, `"minimum"`, `"geometric_mean"`)
- **`HISTORY_FILE`:** Where to store evaluation history
- **`REPORT_OUTPUT_DIR`:** Where reports are saved

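A quick sanity check on an edited config might look like the sketch below. It relies only on the top-level key names and aggregate method values listed above. The `check_config` helper is hypothetical, and treating every key as required is an assumption, since the tool may supply defaults for some of them.

```python
import json

# Key names and method values are taken from the list above.
EXPECTED_KEYS = ("DIMENSIONS", "AUTO_CHECKS", "AGGREGATE_METHOD",
                 "HISTORY_FILE", "REPORT_OUTPUT_DIR")
AGGREGATE_METHODS = ("weighted_average", "minimum", "geometric_mean")

def check_config(cfg):
    """Return a list of problems found; an empty list means none were found."""
    # Assumption: all keys are required. The real tool may be more lenient.
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in cfg]
    method = cfg.get("AGGREGATE_METHOD")
    if method is not None and method not in AGGREGATE_METHODS:
        problems.append(f"unknown AGGREGATE_METHOD: {method!r}")
    return problems

sample = json.loads('{"DIMENSIONS": {}, "AGGREGATE_METHOD": "median"}')
for problem in check_config(sample):
    print(problem)
```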
---

## License

MIT. See the `LICENSE` file.

---

## ⚠️ Security Note: Config File

Configuration is loaded from a JSON file. Loading it only parses data and never executes code, so config files are safe to share.

- Config path is validated for existence and size (1MB cap) before loading
- Must be a `.json` file; a non-JSON path raises `ValueError`
- Keep your config under version control; it defines your quality rubrics and scoring weights

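The validation steps listed above could look roughly like this. It is a sketch, not the tool's own `_load_config`: the function name `load_json_config` is hypothetical, and the exception types for the missing-file and oversize cases are assumptions (the note only confirms `ValueError` for a non-JSON path).

```python
import json
import os

MAX_CONFIG_BYTES = 1024 * 1024  # 1MB cap, as stated in the note above

def load_json_config(path):
    """Validate the path, then parse the file as JSON data."""
    if not path.lower().endswith(".json"):
        raise ValueError(f"config must be a .json file: {path}")
    if not os.path.isfile(path):
        # Assumed exception type for a missing file.
        raise FileNotFoundError(f"config not found: {path}")
    if os.path.getsize(path) > MAX_CONFIG_BYTES:
        # Assumed exception type for an oversize file.
        raise ValueError(f"config exceeds {MAX_CONFIG_BYTES} bytes: {path}")
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # parses data only; nothing in the file is executed
```

Checking the extension first means a path such as `scorecard_config.py` is rejected before the file is ever opened.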
## ⚠️ Disclaimer

This software is provided "AS IS", without warranty of any kind, express or implied.

**USE AT YOUR OWN RISK.**

- The author(s) are NOT liable for any damages, losses, or consequences arising from
  the use or misuse of this software, including but not limited to financial loss,
  data loss, security breaches, business interruption, or any indirect/consequential damages.
- This software does NOT constitute financial, legal, trading, or professional advice.
- Users are solely responsible for evaluating whether this software is suitable for
  their use case, environment, and risk tolerance.
- No guarantee is made regarding accuracy, reliability, completeness, or fitness
  for any particular purpose.
- The author(s) are not responsible for how third parties use, modify, or distribute
  this software after purchase.

By downloading, installing, or using this software, you acknowledge that you have read
this disclaimer and agree to use the software entirely at your own risk.

**DATA DISCLAIMER:** This software processes and stores data locally on your system.
The author(s) are not responsible for data loss, corruption, or unauthorized access
resulting from software bugs, system failures, or user error. Always maintain
independent backups of important data. This software does not transmit data externally
unless explicitly configured by the user.

---

## Support & Links

| | |
|---|---|
| 🐛 **Bug Reports** | [email protected] |
| ☕ **Ko-fi** | [ko-fi.com/theshadowrose](https://ko-fi.com/theshadowrose) |
| 🛒 **Gumroad** | [shadowyrose.gumroad.com](https://shadowyrose.gumroad.com) |
| 🐦 **Twitter** | [@TheShadowyRose](https://twitter.com/TheShadowyRose) |
| 🐙 **GitHub** | [github.com/TheShadowRose](https://github.com/TheShadowRose) |
| 🧠 **PromptBase** | [promptbase.com/profile/shadowrose](https://promptbase.com/profile/shadowrose) |

*Built with [OpenClaw](https://github.com/openclaw/openclaw). Thank you for making this possible.*

---

🛠️ **Need something custom?** Custom OpenClaw agents & skills starting at $500. If you can describe it, I can build it. → [Hire me on Fiverr](https://www.fiverr.com/s/jjmlZ0v)

Usage Guidance

This skill appears coherent and implements a local, pattern-based scorecard for agent text. Before installing or running it: (1) review the included Python files yourself (they are present and readable) or run them in an isolated environment (virtualenv/container); (2) inspect and edit config_example.json to avoid recording sensitive content (the history file is plaintext JSONL and append-only by design); (3) be aware automated checks are surface-level (no semantic fact-checking) — rely on manual scoring for accuracy; (4) back up or secure the history/report directories if they might contain sensitive outputs; (5) if you need real-time or large-scale usage, consider replacing the JSONL history with a proper datastore because the tool has no concurrency safety. Overall, the package is internally consistent with low risk, but follow standard caution because the source provenance is 'unknown' and data stored is local and unencrypted.

Capability Analysis

Type: OpenClaw Skill
Name: agent-scorecard
Version: 1.0.6

The Agent Scorecard bundle is a legitimate quality evaluation framework for AI agent outputs, designed to run locally using only the Python standard library. The core logic in scorecard.py, scorecard_report.py, and scorecard_track.py implements pattern-based scoring (e.g., regex checks for headers, sycophancy markers, and sentence length) and historical trend analysis without any external API calls or network activity. No evidence of data exfiltration, malicious execution, or prompt injection was found; the instructions in SKILL.md and README.md are strictly aligned with the stated purpose of providing structured metrics for agent performance.
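For a sense of what pattern-based checks of this kind involve, here is a small sketch covering the three examples mentioned: headers, sycophancy markers, and sentence length. The marker list, function names, and sentence-splitting regex are all assumptions for illustration and are not taken from `scorecard.py`.

```python
import re
import statistics

# Hypothetical marker list; the real tool reads its markers from the config.
SYCOPHANCY_MARKERS = ("great question", "excellent question", "happy to help")

def count_markers(text, markers=SYCOPHANCY_MARKERS):
    """Count case-insensitive occurrences of each marker phrase."""
    lowered = text.lower()
    return sum(lowered.count(m) for m in markers)

def has_markdown_header(text):
    """True if any line starts with one to six '#' followed by text."""
    return re.search(r"^#{1,6}\s+\S", text, flags=re.MULTILINE) is not None

def sentence_length_spread(text):
    """Population standard deviation of sentence lengths, in words."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

sample = (
    "## Review\n"
    "Great question! The loop is off by one. "
    "Fix the range bound and add a test for the empty list."
)
print(count_markers(sample), has_markdown_header(sample),
      round(sentence_length_spread(sample), 2))
```

Checks like these only look at surface features, which is why the guidance above recommends manual scoring for accuracy.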

Capability Assessment

✓ Purpose & Capability

Name/description match the included Python modules (scorecard.py, scorecard_track.py, scorecard_report.py) and config examples; functionality (pattern-based checks, manual scoring, history tracking, reports) is implemented with no extraneous credentials or binaries.

✓ Instruction Scope

SKILL.md and README instruct running the included Python scripts on local files, saving history to a JSONL file, and generating reports. The runtime instructions only reference local files and config; they do not instruct reading unrelated system paths or exfiltrating data.

✓ Install Mechanism

No install spec or remote downloads are declared. The package is instruction + source files only and uses the Python standard library; nothing is fetched from external URLs at install time.

✓ Credentials

No environment variables, credentials, or config paths are required. The tool writes/reads local JSON/JSONL files (history, config, reports) which is appropriate for a tracking/reporting utility.

✓ Persistence & Privilege

always:false and no special privileges requested. The tool persists its own history and report files locally (append-only JSONL), which is expected behavior and limited in scope.

Version History

v1.0.6

No user-facing changes in this release.

- Version number updated to 1.0.6
- No other changes detected

v1.0.5

**Configuration system migrated from Python to JSON.**

- Configuration files are now in JSON format instead of executable Python.
- All user config, CLI, and documentation examples updated to use `.json` config files (previously `.py`).
- Loading the config no longer executes code; loading is now safe and only parses JSON.
- Updated usage, programmatic interface, and security documentation to reflect the new JSON config approach.
- Provided new JSON config template (`config_example.json`).

v1.0.4

- Added config_example.json as an example configuration file in JSON format.
- Updated documentation in SKILL.md to reflect the new version.
- Minor code or documentation updates in scorecard.py; no functional or breaking changes noted.

v1.0.3

- Fixed character encoding in the skill name ("�" replaced).
- Updated version number to 1.0.3 in SKILL.md.
- No functional or feature changes; documentation only.

v1.0.2

- Security: Added a prominent security note about config file (`scorecard_config.py`) execution and risks of running untrusted Python configs.
- Documentation: Updated SKILL.md to reflect the new security notice and version bump to 1.0.2.
- Version: Incremented version from 1.0.0 to 1.0.2 in documentation.

v1.0.0

Initial release of Agent Scorecard – Output Quality Framework

- Enables configurable, rubric-based evaluation of AI agent outputs with pattern-based automated checks (no LLM-as-judge, no external API calls).
- Supports custom quality dimensions, weights, and thresholds; allows for both automated and manual scoring.
- Tracks output quality over time, aggregates scoring, shows trends, and enables before/after comparisons.
- Generates detailed reports and allows programmatic usage for integration into workflows.
- Requires only Python 3.8+, with zero external dependencies; works fully locally.

Metadata

Slug agent-scorecard

Version 1.0.6

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 6

Frequently Asked Questions

What is Agent Scorecard?

Configurable quality evaluation for AI agent outputs. Define criteria, run evaluations, track quality over time. No LLM-as-judge, no API calls, pattern-based... It is an AI Agent Skill for Claude Code / OpenClaw, with 439 downloads so far.

How do I install Agent Scorecard?

Run "/install agent-scorecard" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Agent Scorecard free?

Yes, Agent Scorecard is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Agent Scorecard support?

Agent Scorecard is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Agent Scorecard?

It is built and maintained by Shadow Rose (@theshadowrose); the current version is v1.0.6.
