Agent Eval

by luaqnyin · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
141 Downloads · 0 Stars · 1 Active Installs · 1 Versions
Install in OpenClaw
/install agent-eval
Description
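A closed-loop quantitative evaluation system based on Karpathy AutoResearch and multi-agent retrospectives, providing automatic yes/no judgment of tasks and continuous optimization.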
README (SKILL.md)

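Agent Eval: Quantitative Evaluation + Self-Evolution Closed Loop

Core idea: what can be measured can be optimized. Built on the Karpathy AutoResearch eval loop + the Zhuzi (诸子) Agent retrospective system + the Phoenix Memory architecture.

General Eval Workflow

Generate → evaluate → score → analyze failure points → change one small thing → rerun → keep the change if the score rises, revert it if the score falls → loop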
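The loop above can be sketched as a simple hill-climbing routine. This is a minimal illustration under stated assumptions, not the skill's actual implementation: `run_eval_loop`, `propose_change`, and `score` are hypothetical names, the toy scoring function stands in for running a real eval checklist, and treating a flat score as "revert" is an assumption (the workflow only specifies what happens when the score rises or falls).

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def run_eval_loop(
    config: T,
    propose_change: Callable[[T], T],
    score: Callable[[T], float],
    rounds: int = 20,
) -> tuple[T, float]:
    """Keep a small change only if the eval score goes up; otherwise revert."""
    best_score = score(config)
    for _ in range(rounds):
        candidate = propose_change(config)   # change one small thing
        candidate_score = score(candidate)   # rerun the evals
        if candidate_score > best_score:     # score rose: keep the change
            config, best_score = candidate, candidate_score
        # score fell (or stayed flat, by assumption): discard the candidate
    return config, best_score


# Toy usage (hypothetical): the "config" is a number and the score peaks at 10.
rng = random.Random(0)
final, final_score = run_eval_loop(
    0.0,
    propose_change=lambda x: x + rng.uniform(-1, 1),
    score=lambda x: -abs(x - 10),
)
```

Because a candidate is only accepted when it scores strictly higher, the returned score can never be worse than the starting score.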

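Rules for Writing Evals

  • Every eval must be a binary yes/no judgment
  • No subjective evals (❌ "Is it well written?" → ✅ "Does the title contain a specific number?")
  • Each eval tests one independent dimension, with no overlap
  • 3-6 evals work best; with too many, the agent starts optimizing for the checklist itself
  • Use real historical tasks as the test set; do not fabricate them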
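As an illustration of a binary, non-subjective eval, the "title contains a specific number" example from the rules could look like the sketch below. The function name is hypothetical, and matching any digit is a simplifying assumption (it will not catch spelled-out numbers).

```python
import re


def title_has_specific_number(title: str) -> bool:
    """Binary eval: does the title contain at least one digit?"""
    return re.search(r"\d", title) is not None


print(title_has_specific_number("5 ways to cut review time by 30%"))  # True
print(title_has_specific_number("Ways to cut review time"))           # False
```

The check returns only True or False and measures a single dimension, which is what makes it usable as one line of a checklist.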

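Per-Agent Eval Checklists

✍️ Content (official documents / paper writing)

| # | Check item | Weight | Notes |
|---|---|---|---|
| C1 | Formatting: do heading numbering, font size, and line spacing meet the official-document or paper standard? | | Check against GB/T 9704 or the target journal |
| C2 | No AI traces: is the body free of common AI phrases such as "作为AI" ("as an AI"), "笔者" ("the author"), or "综上所述" ("in summary")? | | Full-text search |
| C3 | Data authenticity: do all numbers and percentages come from verifiable sources (not estimates)? | Very high | Corresponds to PAT-20260403-001/002 |
| C4 | Role accuracy: does the document type match (notices, reports, requests for instructions, and papers each have their own conventions)? | | Genre identification |
| C5 | First-draft usability: can it be used without major revisions by the boss? | | Historical comparison |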
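C2 is described as a full-text search, so it lends itself to a direct check. The sketch below is an assumption about how such a check might be written: `AI_PHRASES` and `passes_c2` are hypothetical names, and the phrase list contains only the three examples given in C2.

```python
AI_PHRASES = ("作为AI", "笔者", "综上所述")  # the phrases named in C2; extend as needed


def passes_c2(body: str, phrases: tuple[str, ...] = AI_PHRASES) -> bool:
    """C2 passes only when none of the listed phrases appears in the text."""
    return not any(phrase in body for phrase in phrases)


print(passes_c2("本报告共三章。"))          # True
print(passes_c2("综上所述,方案可行。"))    # False
```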

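🛡️ Shield (contract / legal review)

| # | Check item | Weight | Notes |
|---|---|---|---|
| S1 | High-risk clause marking: are all 🟡🟠🔴 risk clauses flagged? | Very high | Red lines such as 质7条/保8条 |
| S2 | Legal basis: does each review comment cite a specific statute or regulation? | | No unsupported assertions |
| S3 | Actionability: does it give concrete revision suggestions (rather than only saying "there is a risk")? | | The counterparty can apply them directly |
| S4 | Omission check: are all important clauses covered (payment / breach / termination / confidentiality)? | | Check against the checklist |
| S5 | Hospital fit: does it account for clauses specific to public hospitals (fiscal audit / procurement process)? | | Industry customization |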

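😈 Devil (fault-finding / research quality control)

5-stage peer-review process + bias-detection framework (distilled from K-Dense ScholarEval)

Review process:

  1. Initial assessment (research question + overall quality + major flaws)
  2. Section-by-section review (abstract → introduction → methods → results → discussion → references)
  3. Methodology + statistical rigor
  4. Reproducibility + transparency
  5. Figures and tables + data presentation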

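Bias-detection checklist (must be covered):

  • Cognitive bias: confirmation bias, HARKing, publication bias, cherry-picking
  • Selection bias: sampling bias, attrition bias, survivorship bias
  • Analysis bias: p-hacking, outcome switching, selective reporting, subgroup fishing
  • Confounding: unmeasured confounders, alternative explanations

| # | Check item | Weight | Notes |
|---|---|---|---|
| D1 | Finds real problems: does it point out at least 1 non-obvious, substantive flaw? | Very high | Not just formatting nitpicks |
| D2 | Well-supported: is each criticism backed by a specific argument, data, or literature? | | No groundless claims |
| D3 | Bias detection: does it cover at least 2 of the 5 bias categories? | | Distilled from K-Dense |
| D4 | Statistical review: does it check effect size, multiple comparisons, and sample size? | | Distilled from K-Dense |
| D5 | No false positives: does it avoid unjustified criticism of correct content? | | No nitpicking for its own sake |
| D6 | Actionable suggestions: does it give a direction for improvement (rather than only rejecting)? | | Constructive criticism |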
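A rough way to automate D3 is to count how many bias categories a review mentions. The sketch below is a crude keyword heuristic under stated assumptions: `BIAS_KEYWORDS`, `bias_categories_covered`, and `passes_d3` are hypothetical names, the keywords are taken from the four categories in the bias-detection checklist, and a real evaluator would likely need something more robust than substring matching.

```python
# Hypothetical keyword map built from the bias-detection checklist.
BIAS_KEYWORDS = {
    "cognitive": ["confirmation bias", "harking", "publication bias",
                  "cherry-picking", "确认偏差", "发表偏差", "樱桃挑选"],
    "selection": ["sampling bias", "attrition bias", "survivorship bias",
                  "采样偏差", "失访偏差", "幸存者偏差"],
    "analysis": ["p-hacking", "outcome switching", "selective reporting",
                 "subgroup fishing", "结局切换", "选择性报告", "亚组钓鱼"],
    "confounding": ["unmeasured confound", "alternative explanation",
                    "未测量混杂", "替代解释"],
}


def bias_categories_covered(review: str) -> set[str]:
    """Return the categories for which at least one keyword appears in the review."""
    text = review.lower()
    return {
        category
        for category, keywords in BIAS_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def passes_d3(review: str, minimum: int = 2) -> bool:
    """D3 passes when the review touches at least `minimum` bias categories."""
    return len(bias_categories_covered(review)) >= minimum


print(passes_d3("The sample shows survivorship bias and likely p-hacking."))  # True
print(passes_d3("Only confirmation bias is discussed."))                      # False
```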

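📜 Sage (classical texts / literature / traditional Chinese medicine)

| # | Check item | Weight | Notes |
|---|---|---|---|
| SG1 | Accurate sourcing: are quotations labeled with their real source (title / volume / chapter)? | Very high | No fabricated classical texts |
| SG2 | Contextual fit: is the quotation relevant to the topic under discussion (not forced in)? | | Semantic relevance |
| SG3 | Modern rendering: is the classical text explained clearly in modern language? | | Translation quality |
| SG4 | Depth: does it offer in-depth interpretation beyond surface-level quotation? | | Not Baidu Baike style |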

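🎓 Scholar (academic search / paper assistance)

8-dimension scoring framework (distilled from K-Dense ScholarEval)

8 dimensions: problem definition | literature review | methodology | data sources | analysis and interpretation | presentation of results | academic writing | citation standards

| # | Check item | Weight | Notes |
|---|---|---|---|
| SC1 | Real citations: do all cited works actually exist (verifiable by DOI / arXiv ID)? | Very high | Core defense against AI hallucination |
| SC2 | Relevance: are the search results relevant to the topic (at least 3 of the top 5 highly relevant)? | | |
| SC3 | Recency: are the cited works mainly from the last 3 years (classic works excepted)? | | |
| SC4 | Completeness: are the topic's main subfields covered? | | Do not search in only one direction |
| SC5 | Critical stance: does the literature review distinguish "summary" from "critique" (not a mere list)? | | K-Dense ScholarEval |
| SC6 | Methodology review: is the methodological quality of the cited works assessed? | | K-Dense ScholarEval |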
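SC2 and SC3 state numeric thresholds, so they can be written as small predicates. The sketch below is illustrative only: the function names are hypothetical, the relevance flags are assumed to come from a prior judgment step, "mainly" is read as "more than half", and "the last 3 years" is read as a publication year no earlier than the current year minus 3. All of these readings are assumptions.

```python
def passes_sc2(relevance_flags: list[bool], top_n: int = 5, needed: int = 3) -> bool:
    """SC2: at least `needed` of the first `top_n` results are highly relevant."""
    return sum(relevance_flags[:top_n]) >= needed


def passes_sc3(citations: list[tuple[int, bool]], current_year: int, window: int = 3) -> bool:
    """SC3: more than half of the non-classic citations fall inside the window.

    Each citation is (publication_year, is_classic); classics are excluded.
    """
    years = [year for year, is_classic in citations if not is_classic]
    if not years:
        return True  # nothing left to judge once classics are excluded (assumption)
    recent = sum(1 for year in years if year >= current_year - window)
    return recent > len(years) / 2


print(passes_sc2([True, False, True, True, False, False]))   # True
print(passes_sc3([(2025, False), (2024, False), (2010, False), (1998, True)],
                 current_year=2026))                         # True
```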

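📊 Analyzer (data analysis)

| # | Check item | Weight | Notes |
|---|---|---|---|
| A1 | Source labeling: are the data source and date clearly stated? | Very high | Corresponds to PAT-20260403 |
| A2 | Reproducible calculation: can key figures be verified by hand from the raw data? | | No guesswork |
| A3 | Method description: is the analysis method stated (descriptive statistics / regression / chi-square, etc.)? | | |
| A4 | Limitations: are the data's limitations and scope of applicability stated? | | Honesty principle |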

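🏥 Medical (healthcare / internet hospital)

GRADE evidence grading + bias detection (distilled from K-Dense CDS + Critical-Thinking)

GRADE levels: 1A (strong recommendation + high quality) → 1B → 2A → 2B → 2C (weak recommendation + very low quality)

| # | Check item | Weight | Notes |
|---|---|---|---|
| M1 | Policy basis: does it cite the latest national / provincial policy documents? | | Internet-hospital policy changes quickly |
| M2 | Data currency: is the cited hospital / industry data within the freshness red line? | Very high | IMA red line |
| M3 | Clinical relevance: are the recommendations supported by clinical / management practice? | | |
| M4 | Evidence grading: are key recommendations labeled with a GRADE level (or a stated evidence strength)? | | K-Dense CDS |
| M5 | Bias awareness: does it point out potential biases in the cited studies (selection / measurement / confounding)? | | K-Dense Critical-Thinking |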

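Scoring

Score = Σ(weights of passed items) / Σ(weights of all items) × 100

Grades:
  90+ = 🟢 Excellent (can be delivered automatically)
  70-89 = 🟡 Good (needs spot checks)
  50-69 = 🟠 Needs improvement (manual review required)
  <50  = 🔴 Unacceptable (re-run)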
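The scoring formula and grade tiers can be worked through in a few lines. This is a sketch under stated assumptions: the README only labels some items as "very high" weight and gives no numeric weights, so the mapping used here (3 for "very high", 1 for unlabeled items) is hypothetical, as are the function names.

```python
def weighted_score(results: dict[str, bool], weights: dict[str, float]) -> float:
    """Score = sum of passed weights / sum of all weights * 100."""
    total = sum(weights[item] for item in results)
    if total == 0:
        raise ValueError("at least one item must carry a non-zero weight")
    passed = sum(weights[item] for item, ok in results.items() if ok)
    return passed / total * 100


def grade(score: float) -> str:
    """Map a 0-100 score to the four tiers."""
    if score >= 90:
        return "🟢"
    if score >= 70:
        return "🟡"
    if score >= 50:
        return "🟠"
    return "🔴"


# Hypothetical weights: "very high" items count 3, unlabeled items count 1.
weights = {"C1": 1, "C2": 1, "C3": 3, "C4": 1, "C5": 1}
results = {"C1": True, "C2": True, "C3": False, "C4": True, "C5": True}
score = weighted_score(results, weights)
print(round(score, 1), grade(score))  # 57.1 🟠
```

With these weights, failing the single "very high" item C3 drops the score to 4/7 × 100 ≈ 57.1, which lands in the manual-review tier even though four of five items passed.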

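Cron Auto-Evaluation Workflow

Daily at 23:30: agent self-evaluation

  1. Read all of that agent's task records for the day (memory/YYYY-MM-DD.md)
  2. Run the eval checklist item by item on each task
  3. Compute the day's average score
  4. Record it in memory/evolution/<agent-id>.md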
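Steps 3 and 4 of the daily self-evaluation could be sketched as follows. The format of the daily task log is not specified, so this example assumes the per-task scores have already been computed; `record_daily_eval` and the exact entry layout are hypothetical.

```python
from pathlib import Path
from statistics import mean


def record_daily_eval(root: Path, agent_id: str, day: str, task_scores: list[float]) -> float:
    """Average the day's task scores and append them to the agent's evolution log."""
    daily_avg = mean(task_scores)
    log_path = root / "memory" / "evolution" / f"{agent_id}.md"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Append so that earlier days stay in the log.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(f"\n## {day}\n")
        fh.write(f"- Tasks: {len(task_scores)}\n")
        fh.write(f"- Eval average: {daily_avg:.0f}/100\n")
    return daily_avg
```

Calling `record_daily_eval(Path("."), "content", "2026-04-04", [80, 70, 84])` would append an entry with an average of 78 to `memory/evolution/content.md` under the current directory.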

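Every Sunday at 00:00: CEO weekly report

  1. Aggregate each agent's weekly average score
  2. Identify downward score trends → trigger an investigation
  3. Identify agents with stable high scores → confirm no intervention is needed
  4. Extract the week's top-3 failure points → write them to patterns.md
  5. Send the overall score trend to the boss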
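The aggregation in steps 1 and 4 amounts to a mean per agent plus a frequency count of failed items. A minimal sketch, assuming the week's daily scores and failed item IDs have already been collected (`weekly_summary` and its input shapes are hypothetical):

```python
from collections import Counter
from statistics import mean


def weekly_summary(
    daily_scores: dict[str, list[float]],
    failed_items: list[str],
) -> tuple[dict[str, float], list[tuple[str, int]]]:
    """Return each agent's weekly average and the three most frequent failed items."""
    averages = {agent: mean(scores) for agent, scores in daily_scores.items()}
    top_failures = Counter(failed_items).most_common(3)
    return averages, top_failures


averages, top_failures = weekly_summary(
    {"content": [80, 70], "shield": [90, 92]},
    ["C3", "C3", "S2", "C5", "C3", "S2", "D1"],
)
print(averages)          # {'content': 75, 'shield': 91}
print(top_failures[:2])  # [('C3', 3), ('S2', 2)]
```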

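Conditions That Trigger Optimization

  • An agent scores below 70 for 3 consecutive days → automatic SOUL.md check
  • An eval item fails 5 times in a row → write a PAT record
  • The weekly average drops by more than 10% → trigger a review of the agent's model/configuration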
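Each trigger above is a simple predicate over recent history. The sketch below uses hypothetical function names and makes one interpretive assumption: a drop of "more than 10%" is read as a relative drop against the previous week's average, not a drop of 10 points.

```python
def below_threshold_streak(daily_scores: list[float], threshold: float = 70, days: int = 3) -> bool:
    """True when the most recent `days` daily scores are all below `threshold`."""
    recent = daily_scores[-days:]
    return len(recent) == days and all(score < threshold for score in recent)


def consecutive_failures(item_results: list[bool], limit: int = 5) -> bool:
    """True when the most recent `limit` runs of one eval item all failed."""
    recent = item_results[-limit:]
    return len(recent) == limit and not any(recent)


def weekly_drop(prev_avg: float, curr_avg: float, max_drop: float = 0.10) -> bool:
    """True when the weekly average fell by more than `max_drop` (relative, by assumption)."""
    return prev_avg > 0 and (prev_avg - curr_avg) / prev_avg > max_drop


print(below_threshold_streak([85, 69, 65, 60]))       # True
print(consecutive_failures([True] + [False] * 5))     # True
print(weekly_drop(80, 70))                            # True
```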

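Integration With Existing Systems

| Existing component | How eval connects |
|---|---|
| Phoenix Memory L0 | Daily logs already contain task records; eval reads them directly |
| patterns.md | Eval failure patterns are written to PAT automatically |
| Five-layer quality inspection | Eval is the quantitative standard for layers L2-L3 |
| AGENTS.md autoresearch | The eval score is autoresearch's loss function |
| Heartbeat (HEARTBEAT.md) | The Sunday eval weekly report is included in the heartbeat check |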

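Evolution Log Format

memory/evolution/<agent-id>.md:

# <Agent name> Evolution Log

## 2026-04-04
- Average daily tasks: 3
- Eval average: 78/100 🟡
- Item results: C1✅ C2✅ C3❌ C4✅ C5🟡
- Failure analysis: C3 data authenticity below standard (2 of 3 tasks used estimated data)
- Improvement: emphasize "all numbers must cite their source" in the spawn instructions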

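Notes

  • Eval is a tool, not the goal. The score is only a means; the ultimate goal is the boss's satisfaction
  • Do not chase scores for their own sake: if an eval item no longer means anything, delete it
  • A new agent must have its evals defined before it goes live; an agent without evals does not graduate from probation
  • The boss's verbal feedback > the eval score (when they conflict, the boss wins)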
Usage Guidance
Before installing, verify and restrict what files this skill may access and how reports are sent:

  1. Ask the author to declare explicit required config paths and minimal file permissions (read-only vs. write) for memory/YYYY-MM-DD.md, memory/evolution/<agent-id>.md, patterns.md, AGENTS.md, and HEARTBEAT.md.
  2. Confirm the exact delivery mechanism for 'send to boss' (email? internal message?) and ensure it cannot exfiltrate data to arbitrary endpoints.
  3. Run the skill in an isolated/test environment first and audit everything it writes to disk.
  4. Remove or redact any sensitive information from agent memory files before use.
  5. Prefer adding explicit least-privilege controls (scoped service account or directory-level sandboxing) and logging/auditing of all actions.
  6. If the author cannot clarify the missing config/permission declarations, treat the mismatch as a red flag and avoid granting broad filesystem access.
Capability Analysis
Type: OpenClaw Skill
Name: agent-eval
Version: 1.0.0

The skill bundle defines a framework for quantitative evaluation and self-improvement of AI agents. SKILL.md contains detailed checklists for various domains (e.g., legal, academic, medical) and outlines a workflow for agents to self-assess their performance, log evolution data, and generate reports. No malicious execution patterns, data exfiltration, or unauthorized system access were identified; the instructions are strictly focused on quality assurance and process optimization.
Capability Assessment
Purpose & Capability
Name/description and the SKILL.md content consistently describe a closed-loop agent evaluation system (generate → evaluate → modify → rerun). However, the manifest declares no required config paths or environment variables while the instructions explicitly expect access to many local files (e.g., memory/YYYY-MM-DD.md, memory/evolution/<agent-id>.md, patterns.md, AGENTS.md, HEARTBEAT.md). That mismatch (declared zero file access vs. explicit file I/O in SKILL.md) is an inconsistency.
Instruction Scope
Runtime instructions tell the agent to read daily agent task logs and many repository files, and to write evolution logs, PAT records, and patterns.md. They also say '将整体评分趋势发给老板' ("send the overall score trend to the boss") without specifying a delivery mechanism. The file reads/writes are within the skill's evaluation purpose, but the instructions are broad and partly vague about how reports are transmitted, giving the agent wide latitude to access and potentially transmit sensitive data.
Install Mechanism
Instruction-only skill with no install steps and no code files; nothing is written to disk by an installer. This is the lowest install risk.
Credentials
The skill requests no credentials or declared config paths, yet SKILL.md requires reading/writing multiple local files (agent memory and config-like documents). The absence of declared required config paths/permissions underrepresents the actual access the skill needs and prevents applying least-privilege controls.
Persistence & Privilege
always:false (no forced always-on). The skill envisions scheduled daily/weekly evaluation loops and autonomous agent actions. Autonomous invocation combined with file access and vague report-sending instructions increases blast radius if misused; however, autonomous invocation alone is the platform default and not itself flagged as high risk.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install agent-eval
  3. After installation, invoke the skill by name or use /agent-eval
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release introducing a modular, quantifiable agent evaluation framework with self-improvement feedback loops.

  • Provides standardized yes/no checklists and scoring rules for diverse agent types (Content, Legal, Science, Literature, Analysis, Medical, etc.).
  • Establishes weighted, dimension-specific evaluation items and time-based auto-evaluation workflows (daily self-review, weekly CEO reports).
  • Defines clear scoring tiers with actionable triggers for optimization and tracking.
  • Integrates with existing memory, quality, and research systems for seamless agent evolution.
  • Prioritizes real-world task sets and explicit improvement cycles.
Metadata
Slug agent-eval
Version 1.0.0
License MIT-0
All-time Installs 1
Active Installs 1
Total Versions 1
Frequently Asked Questions

What is Agent Eval?

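Agent Eval is a closed-loop quantitative evaluation system based on Karpathy AutoResearch and multi-agent retrospectives, providing automatic yes/no judgment of tasks and continuous optimization. It is an AI Agent Skill for Claude Code / OpenClaw, with 141 downloads so far.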

How do I install Agent Eval?

Run "/install agent-eval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Agent Eval free?

Yes, Agent Eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Agent Eval support?

Agent Eval is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Agent Eval?

It is built and maintained by luaqnyin (@luaqnyin); the current version is v1.0.0.
