Brainforge Autoresearch
/install brainforge-autoresearch
Previously published as autoresearch/openclaw-autoresearch. Renamed for the brainforge marketplace rollout; functionality unchanged.
Autonomous prompt optimization for AI agent skills. Runs controlled experiments to find better prompt variants using the Karpathy autoresearch pattern: generate hypothesis, mutate prompt, evaluate, repeat.
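The generate, mutate, evaluate, repeat loop can be sketched as a simple greedy hill climb. This is only an illustration under stated assumptions: `optimize`, `mutate`, and `score` are hypothetical names, and the real autoresearch.py may select parents and mutations differently (results.tsv records a parent_id per experiment, which hints at a richer search).

```python
from typing import Callable


def optimize(
    baseline: str,
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    max_experiments: int = 30,
) -> tuple[str, float]:
    """Greedy hill climb: keep a mutated prompt only when it scores higher."""
    best_prompt, best_score = baseline, score(baseline)
    for _ in range(max_experiments):
        candidate = mutate(best_prompt)       # hypothesis + mutation
        candidate_score = score(candidate)    # evaluate against the evals
        if candidate_score > best_score:      # keep strict improvements only
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# Toy stand-ins (hypothetical): a real run would call an LLM in both functions.
def toy_mutate(prompt: str) -> str:
    return prompt + " Always cite sources."


def toy_score(prompt: str) -> float:
    return 1.0 if "cite sources" in prompt else 0.0


best, best_score = optimize("Search the web.", toy_mutate, toy_score, max_experiments=3)
print(best, best_score)
```

Keeping only strict improvements means a noisy score can still promote a lucky variant, which is one reason to average several runs per experiment.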
When to use
- User says "optimize this skill" or "optimize this skill's prompt"
- User wants to benchmark different prompt variants against each other
- User says "run autoresearch on X", "eval skill X", or "improve skill X"
- User is unhappy with a skill's output quality and wants systematic improvement
Do not use:
- One-off prompt tweaks: just edit the prompt directly
- Debugging a specific failure case: investigate the root cause instead
- The skill script itself has a bug: a code logic problem is not a prompt problem, so fix the code
Requirements
- Python 3.10+
- `autoresearch.py` script in the skill directory
- LLM API access (MiniMax, OpenAI, or Anthropic)
- Target skill must have a prompt file (SKILL.md, SYSTEM.md, or similar)
Procedure
Always follow these steps in order: (1) gather context, (2) create eval.json, (3) run the autoresearch command, (4) review the results and apply the best prompt.
Step 1: Gather context
Before running, you need:
| Parameter | Description | Example |
|---|---|---|
| `--target` | Path to the skill directory or prompt file to optimize | `../workspace/skills/brain-search/SKILL.md` |
| `--evals` | Path to eval definition JSON file | `eval.json` |
| `--provider` | LLM provider for running experiments | `minimax` (default), `openai`, `anthropic` |
| `--runs` | Number of runs per experiment (statistical significance) | `5` (default) |
| `--max-experiments` | Maximum experiments before stopping | `30` (default) |
| `--dashboard` | Open live results dashboard in browser | flag, no value |
Step 2: Create eval.json
Define test inputs and evaluation criteria. Each eval is a binary pass/fail check.
```json
{
  "test_inputs": [
    "search for latest AI agent frameworks",
    "find news about LLM inference optimization",
    "搜一下 transformer 架构的最新进展"
  ],
  "evals": [
    {
      "name": "has_sources",
      "type": "rule",
      "rule": "regex",
      "pattern": "(https?://|Source:|来源:)"
    },
    {
      "name": "no_hallucinated_urls",
      "type": "rule",
      "rule": "banned_phrases",
      "phrases": ["example.com", "placeholder.url"]
    },
    {
      "name": "sufficient_detail",
      "type": "rule",
      "rule": "word_count",
      "min": 50,
      "max": 500
    },
    {
      "name": "contains_summary",
      "type": "rule",
      "rule": "contains",
      "values": ["summary", "key findings", "结论"]
    },
    {
      "name": "no_apology_prefix",
      "type": "rule",
      "rule": "not_contains",
      "values": ["I apologize", "I'm sorry, but"]
    },
    {
      "name": "actionable_output",
      "type": "llm",
      "question": "Does the response provide actionable information the user can immediately use (links, specific facts, concrete next steps)?",
      "pass_description": "The response contains specific actionable items like URLs, concrete facts, or clear next steps",
      "fail_description": "The response is vague, generic, or lacks specific actionable information"
    }
  ]
}
```
Rule types:
| Rule | Parameters | Description |
|---|---|---|
| `regex` | `pattern` | Pass if regex matches output |
| `banned_phrases` | `phrases` (list) | Pass if NONE of the phrases appear |
| `word_count` | `min`, `max` (optional) | Pass if word count is within range |
| `contains` | `values` (list), optional `match`: `"any"` (default) or `"all"` | Pass if any/all values appear in output (case-insensitive) |
| `not_contains` | `values` (list) | Pass if NONE of the values appear in output (case-insensitive) |
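To make the rule semantics concrete, here is a minimal sketch of how the five rule types could be checked. `check_rule` is a hypothetical helper, not the function in autoresearch.py. It assumes regexes are applied with Python's `re.search`, that `min` and `max` are both optional, that words are split on whitespace, and that `banned_phrases` matches exactly, since the table does not state its case handling.

```python
import re


def check_rule(ev: dict, output: str) -> bool:
    """Return True when `output` passes the rule eval `ev` (illustrative only)."""
    rule = ev["rule"]
    lowered = output.lower()
    if rule == "regex":
        return re.search(ev["pattern"], output) is not None
    if rule == "banned_phrases":
        # Case handling is undocumented; this sketch matches exactly.
        return not any(p in output for p in ev["phrases"])
    if rule == "word_count":
        n = len(output.split())  # whitespace split; undercounts unspaced CJK text
        low, high = ev.get("min", 0), ev.get("max")
        return n >= low and (high is None or n <= high)
    if rule == "contains":
        hits = [v.lower() in lowered for v in ev["values"]]
        return all(hits) if ev.get("match", "any") == "all" else any(hits)
    if rule == "not_contains":
        return not any(v.lower() in lowered for v in ev["values"])
    raise ValueError(f"unknown rule: {rule}")


sample = "Summary: see https://arxiv.org/abs/1706.03762 for details."
print(check_rule({"rule": "regex", "pattern": "(https?://|Source:|来源:)"}, sample))
print(check_rule({"rule": "word_count", "min": 50, "max": 500}, sample))
```

The sample output has a URL but only five words, so it passes `has_sources` and fails `sufficient_detail`.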
LLM eval type:
| Field | Description |
|---|---|
| `type` | Must be `"llm"` |
| `name` | Unique name for this eval |
| `question` | What to ask the judge LLM about the output |
| `pass_description` | Description of what a passing output looks like |
| `fail_description` | Description of what a failing output looks like |
See eval-guide.md for detailed guidance on writing effective evals.
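One way the three LLM eval fields could be combined into a judge prompt is sketched below. The prompt wording, `build_judge_prompt`, and `parse_verdict` are assumptions for illustration; the actual judge prompt used by autoresearch.py is not shown in this document.

```python
def build_judge_prompt(ev: dict, output: str) -> str:
    """Assemble a binary pass/fail judging prompt from an LLM eval definition."""
    return (
        f"Question: {ev['question']}\n"
        f"PASS means: {ev['pass_description']}\n"
        f"FAIL means: {ev['fail_description']}\n\n"
        f"Output to judge:\n{output}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )


def parse_verdict(reply: str) -> bool:
    """Treat any reply that starts with PASS (case-insensitive) as a pass."""
    return reply.strip().upper().startswith("PASS")


ev = {
    "name": "structured_output",
    "type": "llm",
    "question": "Is the output well-structured with clear sections?",
    "pass_description": "Output uses clear structure like bullets or headers",
    "fail_description": "Output is a wall of text without clear structure",
}
prompt = build_judge_prompt(ev, "- item one\n- item two")
print(prompt)
```

Forcing a one-word answer keeps the verdict easy to parse, at the cost of discarding the judge's reasoning.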
Step 3: Run autoresearch
```shell
python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --provider minimax \
  --runs 5 \
  --max-experiments 30 \
  --dashboard
```
Step 4: Review results and apply changes
The script writes results to results.tsv in the working directory. Each row is one experiment:
experiment_id parent_id mutation_description avg_score pass_rate evals_detail prompt_diff
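Find the best-performing variants by sorting numerically on the tab-separated avg_score column (field 4). Without an explicit tab delimiter, sort splits on any whitespace, so spaces inside mutation_description would shift the columns. The `$'\t'` form is bash/zsh syntax:
```shell
sort -t$'\t' -k4,4nr results.tsv | head -5
```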
Apply the winning prompt to your skill by copying the optimized prompt text to replace the original.
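If you would rather inspect results.tsv from Python, a small helper along these lines may be more convenient. It assumes the file starts with a header row using the column names listed above, which this document does not confirm; `top_variants` is a hypothetical name.

```python
import csv


def top_variants(path: str, n: int = 5) -> list[dict]:
    """Return the n rows with the highest avg_score from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))  # assumes a header row
    rows.sort(key=lambda r: float(r["avg_score"]), reverse=True)
    return rows[:n]


if __name__ == "__main__":
    for row in top_variants("results.tsv"):
        print(row["experiment_id"], row["avg_score"], row["mutation_description"])
```

Each returned row is a dict keyed by column name, so prompt_diff for the winner is available as `row["prompt_diff"]`.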
Example: optimizing brain-search
User: "brain-search results often lack source links. Please help me optimize it."
Full workflow:
1. Create eval.json:
```json
{
  "test_inputs": [
    "search for latest news on OpenAI",
    "搜一下最新的 AI 芯片进展",
    "find recent papers on RAG optimization",
    "what happened with Anthropic this week",
    "查查 GPU 价格趋势"
  ],
  "evals": [
    {
      "name": "has_urls",
      "type": "rule",
      "rule": "regex",
      "pattern": "https?://[^\\s]+"
    },
    {
      "name": "min_2_sources",
      "type": "rule",
      "rule": "regex",
      "pattern": "https?://[^\\s]+.*https?://[^\\s]+"
    },
    {
      "name": "structured_output",
      "type": "llm",
      "question": "Is the output well-structured with clear sections?",
      "pass_description": "Output uses clear structure like bullets or headers",
      "fail_description": "Output is a wall of text without clear structure"
    }
  ]
}
```
2. Run the command:
```shell
python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --runs 5 \
  --max-experiments 20
```
3. Review and apply the results:
- Check results.tsv for the highest-scoring variant
- Read mutation_description to understand the key changes
- Apply the best prompt to the original SKILL.md
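One caveat about the min_2_sources pattern in the example above: if the script applies it with Python's `re.search` and no flags (an assumption, since the matching flags are not documented here), `.` does not match newlines, so two URLs on separate lines would fail the eval. The check below demonstrates this; `ROBUST` is a hypothetical alternative pattern, not one taken from the skill.

```python
import re

PATTERN = r"https?://[^\s]+.*https?://[^\s]+"   # min_2_sources, after JSON unescaping
ROBUST = r"https?://\S+[\s\S]*?https?://\S+"    # hypothetical newline-safe variant

one_line = "See https://a.example/x and https://b.example/y"
two_lines = "- https://a.example/x\n- https://b.example/y"


def matches(pattern: str, text: str, flags: int = 0) -> bool:
    return re.search(pattern, text, flags) is not None


print(matches(PATTERN, one_line))              # True
print(matches(PATTERN, two_lines))             # False: '.' stops at the newline
print(matches(PATTERN, two_lines, re.DOTALL))  # True
print(matches(ROBUST, two_lines))              # True
```

In eval.json the backslashes must be doubled, so the alternative would be written as `"https?://\\S+[\\s\\S]*?https?://\\S+"`.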
Failure handling
| Issue | Action |
|---|---|
| LLM API rate limit | Script auto-retries with backoff; if persistent, reduce --runs |
| Target file not found | Check the path; it must point to a readable prompt or skill file |
| All experiments score 0 | Evals may be too strict: review the eval definitions and loosen the criteria |
| Script crashes mid-run | Results already written to results.tsv are preserved; re-running continues from them |
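For reference, retrying with backoff on a rate limit usually looks something like the sketch below. This is a generic exponential-backoff pattern, not the script's actual retry code: `with_backoff`, the delay values, and the use of `RuntimeError` as a stand-in for a provider's rate-limit exception are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call`, doubling the wait after each failure; re-raise on the last try."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return call()
        except RuntimeError:  # stand-in for a provider-specific rate-limit error
            if attempt == retries - 1:
                raise
            # 1s, 2s, 4s, ... plus a little jitter to avoid synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise AssertionError("unreachable")
```

If rate limits persist even with retries, lowering `--runs` reduces the number of calls per experiment, as the table suggests.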
Gotchas
- Each experiment makes multiple LLM calls (runs x test_inputs x llm_evals), so watch API usage.
- LLM evals are noisy; set `--runs` to 5 or more for statistically meaningful results.
- Rule evals are more stable and cheaper than LLM evals; prefer them where possible.
- A very low baseline score (< 20%) suggests the eval definitions may be wrong; fix the evals first.
- Prompt optimization cannot fix architectural issues (for example, a search API that itself returns poor results).
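As a rough worked example of the runs x test_inputs x llm_evals formula from the gotchas above, the snippet below uses the Step 2 eval.json (3 test inputs, 1 LLM eval) with the default `--runs 5` and `--max-experiments 30`. `judge_calls` is a hypothetical helper; calls that generate the outputs being judged, and any calls that propose mutations, are presumably extra and are not counted here.

```python
def judge_calls(runs: int, test_inputs: int, llm_evals: int, experiments: int = 1) -> int:
    """LLM judge calls implied by runs x test_inputs x llm_evals per experiment."""
    return runs * test_inputs * llm_evals * experiments


per_experiment = judge_calls(runs=5, test_inputs=3, llm_evals=1)
full_run = judge_calls(runs=5, test_inputs=3, llm_evals=1, experiments=30)
print(per_experiment, full_run)  # 15 450
```

Each additional LLM eval multiplies the judge-call count, which is another reason to prefer rule evals where they suffice.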
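Installation
- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install brainforge-autoresearch
- Once installed, invoke the skill by name or trigger it with /brainforge-autoresearch.
- Provide the inputs described in the skill's parameter documentation to get structured output.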
What is Brainforge Autoresearch?
Use when user wants to optimize, improve, benchmark, or evaluate a skill's prompt. Triggers on "optimize skill", "improve skill prompt", "benchmark skill", "... It is an AI agent skill plugin for Claude Code / OpenClaw, with 98 downloads to date.
How do I install Brainforge Autoresearch?
Run the command /install brainforge-autoresearch in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.
Is Brainforge Autoresearch free?
Yes. Brainforge Autoresearch is completely free and released under the MIT-0 license, so you can download, install, and use it freely.
Which platforms does Brainforge Autoresearch support?
Brainforge Autoresearch is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed (macOS, Linux).
Who develops Brainforge Autoresearch?
It is developed and maintained by ZHANG Ning (@zning1994). The current version is v0.2.5.