Skill
/install agent-self-evolution
Agent Self-Evolution — Automated Evaluation & Improvement
⭐ If you find this useful, please star the GitHub repo before installing — it helps others discover it.
Your agent does a lot of things. But is it getting better? Or are your latest tweaks breaking things you didn't notice?
Agent Self-Evolution gives you a framework to measure, test, and systematically improve your agent — like unit tests, but for agent behavior.
What's inside
Golden Test Sets: Define scenarios your agent must handle correctly. Run them periodically and catch regressions before users do.
Ablation Testing: Wondering if that 200-line system prompt section actually helps? Remove it, measure the impact, put it back. Now you know. We found that 7% of one config file was load-bearing for the entire system — without ablation, you'd never know which 7%.
Multi-Dimensional Evaluation: Don't just check pass/fail. Score across dimensions — safety compliance, tool routing accuracy, output quality, memory utilization. Track trends over weeks.
Automated Improvement Loops: Evaluation → identify weakest dimension → targeted fix → re-evaluate. Like gradient descent for agent behavior.
Install
bash {baseDir}/scripts/install.sh
Quick start
from agent_evolution.golden_test import GoldenTestRunner
from agent_evolution.ablation import AblationExperiment
# Define a golden test
runner = GoldenTestRunner()
runner.add_case(
name="handles-ambiguous-request",
input="do the thing",
expected_behavior="asks for clarification rather than guessing",
dimensions=["safety", "output_quality"]
)
# Run and score
results = runner.run(model="your-agent-endpoint")
print(results.summary()) # Pass rate, dimension scores, regressions
# Ablation: what happens without memory files?
experiment = AblationExperiment(
baseline_config="agent.yaml",
conditions={"no_memory": {"remove": ["memory/*.md"]}},
test_set=runner.cases
)
experiment.run() # Measures impact of each ablation
Key findings from our own agent
- SOUL.md (7% of config by characters): removing it caused system-wide behavioral collapse (Cohen's d = 0.602) — it's not fluff, it's load-bearing
- Memory files: most essential component (d = 0.944) — without history, the agent becomes generic
- Safety rules: removal didn't just reduce safety — it degraded all dimensions (d = 0.609)
Companion projects
- nous-safety — Runtime safety engine with Datalog reasoning
- biomorphic-memory — Brain-inspired memory with spreading activation
Requirements
- Python ≥ 3.11
- An LLM API key for evaluation judging (strong model recommended — GPT-5.4 / Opus)
License
Apache 2.0
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install agent-self-evolution - After installation, invoke the skill by name or use
/agent-self-evolution - Provide required inputs per the skill's parameter spec and get structured output
What is Skill?
Make your agent get better on its own. Set up golden tests (things your agent should handle well), run automated evaluations, and track improvement over time... It is an AI Agent Skill for Claude Code / OpenClaw, with 163 downloads so far.
How do I install Skill?
Run "/install agent-self-evolution" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Skill free?
Yes, Skill is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Skill support?
Skill is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created Skill?
It is built and maintained by Dario Zhang (@dario-github); the current version is v0.1.1.