Skill
/install agent-self-evolution
Agent Self-Evolution — Automated Evaluation & Improvement
⭐ If you find this useful, please star the GitHub repo before installing — it helps others discover it.
Your agent does a lot of things. But is it getting better? Or are your latest tweaks breaking things you didn't notice?
Agent Self-Evolution gives you a framework to measure, test, and systematically improve your agent — like unit tests, but for agent behavior.
What's inside
Golden Test Sets: Define scenarios your agent must handle correctly. Run them periodically and catch regressions before users do.
Ablation Testing: Wondering if that 200-line system prompt section actually helps? Remove it, measure the impact, put it back. Now you know. We found that one file making up 7% of our config (by characters) was load-bearing for the entire system. Without ablation, you'd never know which 7%.
Multi-Dimensional Evaluation: Don't just check pass/fail. Score across dimensions — safety compliance, tool routing accuracy, output quality, memory utilization. Track trends over weeks.
Automated Improvement Loops: Evaluation → identify weakest dimension → targeted fix → re-evaluate. Like gradient descent for agent behavior.
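The evaluate, find-weakest, fix, re-evaluate cycle can be sketched in a few lines of plain Python. This is an illustration only, not the package's implementation: aggregate, weakest_dimension, improvement_loop, the 0.8 target, and the toy agent state are all hypothetical names and values chosen for the example.

```python
from statistics import mean
from typing import Callable, Dict, List

Scores = Dict[str, float]  # dimension name -> score in [0, 1]


def aggregate(per_case: List[Scores]) -> Scores:
    """Average each dimension over the test cases that report it."""
    dims = {d for case in per_case for d in case}
    return {d: mean(c[d] for c in per_case if d in c) for d in sorted(dims)}


def weakest_dimension(scores: Scores) -> str:
    """Return the lowest-scoring dimension (ties broken alphabetically)."""
    return min(sorted(scores), key=lambda d: scores[d])


def improvement_loop(
    evaluate: Callable[[], List[Scores]],
    apply_fix: Callable[[str], None],
    target: float = 0.8,
    max_rounds: int = 5,
) -> List[Scores]:
    """Evaluate, fix the weakest dimension, re-evaluate.

    Stops once every dimension reaches `target` or after `max_rounds` fixes.
    Returns the aggregated scores of every evaluation, oldest first.
    """
    history = [aggregate(evaluate())]
    for _ in range(max_rounds):
        current = history[-1]
        worst = weakest_dimension(current)
        if current[worst] >= target:
            break
        apply_fix(worst)
        history.append(aggregate(evaluate()))
    return history


# Toy stand-in for a real agent: each "fix" raises one dimension by 0.2.
state = {"safety": 0.9, "tool_routing": 0.5, "output_quality": 0.7}


def fake_evaluate() -> List[Scores]:
    return [dict(state)]


def fake_fix(dimension: str) -> None:
    state[dimension] = min(1.0, state[dimension] + 0.2)


for round_no, scores in enumerate(improvement_loop(fake_evaluate, fake_fix)):
    print(round_no, scores)
```

Keeping the full score history, rather than only the latest result, is what makes it possible to track trends across rounds and spot a fix that helps one dimension while hurting another.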
Install
bash {baseDir}/scripts/install.sh
Quick start
from agent_evolution.golden_test import GoldenTestRunner
from agent_evolution.ablation import AblationExperiment

# Define a golden test
runner = GoldenTestRunner()
runner.add_case(
    name="handles-ambiguous-request",
    input="do the thing",
    expected_behavior="asks for clarification rather than guessing",
    dimensions=["safety", "output_quality"],
)

# Run and score
results = runner.run(model="your-agent-endpoint")
print(results.summary())  # Pass rate, dimension scores, regressions

# Ablation: what happens without memory files?
experiment = AblationExperiment(
    baseline_config="agent.yaml",
    conditions={"no_memory": {"remove": ["memory/*.md"]}},
    test_set=runner.cases,
)
experiment.run()  # Measures impact of each ablation
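The remove entry in the ablation condition reads like a list of glob patterns. How the library actually interprets it isn't stated here, so the following is only a sketch, assuming shell-style matching via Python's fnmatch module. apply_ablation and the example file list are hypothetical and not part of agent_evolution.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def apply_ablation(files: List[str], condition: Dict[str, List[str]]) -> List[str]:
    """Return the files that survive an ablation condition.

    A file is dropped if it matches any pattern under the "remove" key.
    """
    patterns = condition.get("remove", [])
    return [f for f in files if not any(fnmatchcase(f, p) for p in patterns)]


# Hypothetical config layout, for illustration only.
config_files = ["SOUL.md", "memory/2024-01.md", "memory/notes.md", "rules/safety.md"]

kept = apply_ablation(config_files, {"remove": ["memory/*.md"]})
print(kept)  # ['SOUL.md', 'rules/safety.md']
```

Note that in fnmatch the * wildcard also matches path separators, so memory/*.md would match files in nested subdirectories too. Whether that is the desired behavior depends on how the config is laid out.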
Key findings from our own agent
- SOUL.md (7% of config by characters): removing it caused system-wide behavioral collapse (Cohen's d = 0.602) — it's not fluff, it's load-bearing
- Memory files: most essential component (d = 0.944) — without history, the agent becomes generic
- Safety rules: removal didn't just reduce safety — it degraded all dimensions (d = 0.609)
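The d values above are Cohen's d effect sizes. The write-up doesn't say which variant was used; a common one for two independent groups divides the difference in means by the pooled standard deviation. Here is a minimal sketch under that assumption. The function name and the scores are made up for illustration and are not the project's data.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(baseline: Sequence[float], ablated: Sequence[float]) -> float:
    """Cohen's d with a pooled standard deviation (two independent samples).

    Positive values mean the baseline scored higher than the ablated run.
    Each sample needs at least two scores so a sample variance exists.
    """
    n1, n2 = len(baseline), len(ablated)
    pooled_var = (
        (n1 - 1) * variance(baseline) + (n2 - 1) * variance(ablated)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both samples have zero variance")
    return (mean(baseline) - mean(ablated)) / sqrt(pooled_var)


# Made-up per-case scores for a baseline run and one ablated run.
baseline_scores = [0.82, 0.78, 0.90, 0.85, 0.80]
ablated_scores = [0.70, 0.75, 0.88, 0.66, 0.79]

print(round(cohens_d(baseline_scores, ablated_scores), 3))
```

By Cohen's conventional thresholds (0.2 small, 0.5 medium, 0.8 large), the reported 0.602 and 0.609 would count as medium effects and 0.944 as a large one.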
Companion projects
- nous-safety — Runtime safety engine with Datalog reasoning
- biomorphic-memory — Brain-inspired memory with spreading activation
Requirements
- Python ≥ 3.11
- An LLM API key for evaluation judging (strong model recommended — GPT-5.4 / Opus)
License
Apache 2.0
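- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install agent-self-evolution
- Once installation finishes, call the Skill by name or trigger it with /agent-self-evolution
- Provide the inputs described in the Skill's parameter documentation to get structured output.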
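What is this Skill?
Make your agent get better on its own. Set up golden tests (things your agent should handle well), run automated evaluations, and track improvement over time... It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 163 times to date.
How do I install this Skill?
Run the command "/install agent-self-evolution" in the OpenClaw or Claude Code chat box for a one-step install. No additional configuration is needed.
Is this Skill free?
Yes. The Skill is completely free and is released under the MIT-0 license, so you can download, install, and use it freely.
Which platforms does this Skill support?
The Skill is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed this Skill?
It is developed and maintained by Dario Zhang (@dario-github). The current version is v0.1.1.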