
Brainforge Autoresearch

Author: ZHANG Ning · v0.2.5 · MIT-0
Platforms: macOS, Linux · Security check passed
Total downloads: 98 · Favorites: 0 · Current installs: 0 · Versions: 1

Install in OpenClaw:
/install brainforge-autoresearch

Description
Use when user wants to optimize, improve, benchmark, or evaluate a skill's prompt. Triggers on "optimize skill", "improve skill prompt", "benchmark skill", "...
Usage (SKILL.md)

brainforge-autoresearch

Previously published as autoresearch / openclaw-autoresearch. Renamed for the brainforge marketplace rollout — functionality unchanged.

Autonomous prompt optimization for AI agent skills. Runs controlled experiments to find better prompt variants using the Karpathy autoresearch pattern: generate hypothesis, mutate prompt, evaluate, repeat.
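
The generate, mutate, evaluate, repeat cycle described above can be sketched as a simple hill-climbing loop. This is only an illustration, not the logic of autoresearch.py: the function names, the greedy keep-if-better policy, and the toy mutate/evaluate stand-ins are all assumptions, and a real run would call an LLM for both steps.

```python
import random
from typing import Callable

def autoresearch_loop(
    baseline: str,
    mutate: Callable[[str], str],
    evaluate: Callable[[str], float],
    max_experiments: int = 30,
) -> tuple[str, float]:
    """Greedy hill-climb: keep a mutated prompt only if it scores higher."""
    best_prompt, best_score = baseline, evaluate(baseline)
    for _ in range(max_experiments):
        candidate = mutate(best_prompt)   # hypothesis + mutation
        score = evaluate(candidate)       # run the evals on the variant
        if score > best_score:            # keep strict improvements only
            best_prompt, best_score = candidate, score
    return best_prompt, best_score

# Toy stand-ins (hypothetical); real runs would call an LLM for both steps.
HINTS = ["Always cite sources.", "End with a summary.", "Use bullet points."]

def toy_mutate(prompt: str) -> str:
    """Append one randomly chosen instruction to the prompt."""
    return prompt + " " + random.choice(HINTS)

def toy_evaluate(prompt: str) -> float:
    """Fake pass rate: fraction of the hints present in the prompt."""
    return sum(h in prompt for h in HINTS) / len(HINTS)

random.seed(0)
best, score = autoresearch_loop("Search the web.", toy_mutate, toy_evaluate)
print(score, best)
```

The strict `>` comparison means a variant that merely ties the current best is discarded, which keeps the prompt from growing without a measurable gain.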

When to use

  • User says "optimize this skill's prompt" (Chinese: "优化一下这个 skill")
  • User wants to benchmark different prompt variants
  • User says "run autoresearch on X" / "eval skill X" / "improve skill X"
  • User is unhappy with skill output quality and wants systematic improvement

Do not use:

  • One-off prompt tweaks: just edit the prompt directly
  • Debugging a specific failure case: investigate the root cause instead
  • The skill script itself has a bug: that is a code logic problem, not a prompt problem, so fix the code

Requirements

  • Python 3.10+
  • autoresearch.py script in the skill directory
  • LLM API access (MiniMax, OpenAI, or Anthropic)
  • Target skill must have a prompt file (SKILL.md, SYSTEM.md, or similar)

Procedure

Always follow these steps in order: (1) gather context, (2) create eval.json, (3) run the autoresearch command, (4) review results and apply the best prompt.

Step 1: Gather context

Before running, you need:

| Parameter | Description | Example |
| --- | --- | --- |
| --target | Path to the skill directory or prompt file to optimize | ../workspace/skills/brain-search/SKILL.md |
| --evals | Path to eval definition JSON file | eval.json |
| --provider | LLM provider for running experiments | minimax (default), openai, anthropic |
| --runs | Number of runs per experiment (statistical significance) | 5 (default) |
| --max-experiments | Maximum experiments before stopping | 30 (default) |
| --dashboard | Open live results dashboard in browser | flag, no value |
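
For readers who want to see how these flags and defaults fit together, here is a minimal argparse sketch. It mirrors the parameter list above but is not taken from autoresearch.py; in particular, treating --target and --evals as required is an assumption.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="autoresearch.py")
    # Assumption: both paths are mandatory.
    parser.add_argument("--target", required=True,
                        help="skill directory or prompt file to optimize")
    parser.add_argument("--evals", required=True,
                        help="path to the eval definition JSON file")
    parser.add_argument("--provider", default="minimax",
                        choices=["minimax", "openai", "anthropic"])
    parser.add_argument("--runs", type=int, default=5,
                        help="runs per experiment")
    parser.add_argument("--max-experiments", type=int, default=30,
                        help="stop after this many experiments")
    parser.add_argument("--dashboard", action="store_true",
                        help="open the live results dashboard")
    return parser

args = build_parser().parse_args(["--target", "SKILL.md", "--evals", "eval.json"])
print(args.provider, args.runs, args.max_experiments, args.dashboard)
```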

Step 2: Create eval.json

Define test inputs and evaluation criteria. Each eval is a binary pass/fail check.

{
  "test_inputs": [
    "search for latest AI agent frameworks",
    "find news about LLM inference optimization",
    "搜一下 transformer 架构的最新进展"
  ],
  "evals": [
    {
      "name": "has_sources",
      "type": "rule",
      "rule": "regex",
      "pattern": "(https?://|Source:|来源:)"
    },
    {
      "name": "no_hallucinated_urls",
      "type": "rule",
      "rule": "banned_phrases",
      "phrases": ["example.com", "placeholder.url"]
    },
    {
      "name": "sufficient_detail",
      "type": "rule",
      "rule": "word_count",
      "min": 50,
      "max": 500
    },
    {
      "name": "contains_summary",
      "type": "rule",
      "rule": "contains",
      "values": ["summary", "key findings", "结论"]
    },
    {
      "name": "no_apology_prefix",
      "type": "rule",
      "rule": "not_contains",
      "values": ["I apologize", "I'm sorry, but"]
    },
    {
      "name": "actionable_output",
      "type": "llm",
      "question": "Does the response provide actionable information the user can immediately use (links, specific facts, concrete next steps)?",
      "pass_description": "The response contains specific actionable items like URLs, concrete facts, or clear next steps",
      "fail_description": "The response is vague, generic, or lacks specific actionable information"
    }
  ]
}

Rule types:

| Rule | Parameters | Description |
| --- | --- | --- |
| regex | pattern | Pass if regex matches output |
| banned_phrases | phrases (list) | Pass if NONE of the phrases appear |
| word_count | min, max (optional) | Pass if word count is within range |
| contains | values (list), optional match: "any" (default) or "all" | Pass if any/all values appear in output (case-insensitive) |
| not_contains | values (list) | Pass if NONE of the values appear in output (case-insensitive) |
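
As a rough illustration of how these rule types could be checked, the sketch below implements each one as a pass/fail test. The function name check_rule is hypothetical, and several details are assumptions rather than documented behavior: banned_phrases is matched case-sensitively, words are counted by splitting on whitespace (which undercounts unsegmented Chinese text), and both min and max are treated as optional.

```python
import re

def check_rule(ev: dict, output: str) -> bool:
    """Return True if `output` passes the rule eval `ev` (hypothetical helper)."""
    rule = ev["rule"]
    if rule == "regex":
        return re.search(ev["pattern"], output) is not None
    if rule == "banned_phrases":
        # Assumption: plain case-sensitive substring match.
        return not any(phrase in output for phrase in ev["phrases"])
    if rule == "word_count":
        # Assumption: whitespace-delimited words; missing bounds are open.
        n = len(output.split())
        return ev.get("min", 0) <= n <= ev.get("max", float("inf"))
    # contains / not_contains are documented as case-insensitive.
    lowered = output.lower()
    hits = [value.lower() in lowered for value in ev.get("values", [])]
    if rule == "contains":
        return all(hits) if ev.get("match", "any") == "all" else any(hits)
    if rule == "not_contains":
        return not any(hits)
    raise ValueError(f"unknown rule: {rule}")

output = "Summary: see https://openai.com/blog for details"
print(check_rule({"rule": "regex", "pattern": "(https?://|Source:|来源:)"}, output))
print(check_rule({"rule": "word_count", "min": 50, "max": 500}, output))
```

Running a candidate output through a helper like this before starting a long optimization run is a cheap way to confirm that each rule behaves as intended.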

LLM eval type:

| Field | Description |
| --- | --- |
| type | Must be "llm" |
| name | Unique name for this eval |
| question | What to ask the judge LLM about the output |
| pass_description | Description of what a passing output looks like |
| fail_description | Description of what a failing output looks like |
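
One plausible way these fields could be combined into a judge prompt is sketched below. The prompt wording, the PASS/FAIL reply convention, and both function names are assumptions made for illustration; the actual judge prompt used by the script is not documented here.

```python
def build_judge_prompt(ev: dict, output: str) -> str:
    """Assemble a grading prompt from an LLM eval definition (hypothetical)."""
    return (
        "You are grading the output of an AI skill.\n"
        f"Question: {ev['question']}\n"
        f"PASS means: {ev['pass_description']}\n"
        f"FAIL means: {ev['fail_description']}\n"
        "Answer with exactly one word, PASS or FAIL.\n\n"
        f"Output to grade:\n{output}"
    )

def parse_verdict(reply: str) -> bool:
    """Binary result: True only when the judge's reply starts with PASS."""
    return reply.strip().upper().startswith("PASS")

ev = {
    "name": "structured_output",
    "type": "llm",
    "question": "Is the output well-structured with clear sections?",
    "pass_description": "Output uses clear structure like bullets or headers",
    "fail_description": "Output is a wall of text without clear structure",
}
prompt = build_judge_prompt(ev, "- item one\n- item two")
print(prompt)
```

Writing pass_description and fail_description as concrete, observable properties gives the judge less room for interpretation, which should reduce noise between runs.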

See eval-guide.md for detailed guidance on writing effective evals.
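
Before spending API calls, it can help to sanity-check eval.json against the field lists above. The validator below is a sketch under the assumption that the listed parameters are the required ones; validate_evals and its messages are hypothetical and not part of the skill.

```python
import json

# Required parameters per rule, read off the rule table (assumption:
# word_count needs neither bound to be present).
RULE_PARAMS = {
    "regex": ["pattern"],
    "banned_phrases": ["phrases"],
    "word_count": [],
    "contains": ["values"],
    "not_contains": ["values"],
}
LLM_FIELDS = ["question", "pass_description", "fail_description"]

def validate_evals(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means it looks fine."""
    problems = []
    if not config.get("test_inputs"):
        problems.append("test_inputs is missing or empty")
    if not config.get("evals"):
        problems.append("evals is missing or empty")
    seen = set()
    for i, ev in enumerate(config.get("evals", [])):
        name = ev.get("name", f"#{i}")
        if name in seen:
            problems.append(f"{name}: duplicate name")
        seen.add(name)
        if ev.get("type") == "rule":
            rule = ev.get("rule")
            if rule not in RULE_PARAMS:
                problems.append(f"{name}: unknown rule {rule!r}")
                continue
            required = RULE_PARAMS[rule]
        elif ev.get("type") == "llm":
            required = LLM_FIELDS
        else:
            problems.append(f"{name}: type must be 'rule' or 'llm'")
            continue
        problems += [f"{name}: missing {key!r}" for key in required if key not in ev]
    return problems

config = json.loads("""
{
  "test_inputs": ["search for latest AI agent frameworks"],
  "evals": [
    {"name": "has_sources", "type": "rule", "rule": "regex"},
    {"name": "actionable_output", "type": "llm",
     "question": "Is the response actionable?",
     "pass_description": "Contains concrete next steps",
     "fail_description": "Vague or generic"}
  ]
}
""")
problems = validate_evals(config)
print(problems)
```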

Step 3: Run autoresearch

python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --provider minimax \
  --runs 5 \
  --max-experiments 30 \
  --dashboard

Step 4: Review results and apply changes

The script writes results to results.tsv in the working directory. Each row is one experiment:

experiment_id  parent_id  mutation_description  avg_score  pass_rate  evals_detail  prompt_diff

Find the best performing variant:

sort -t$'\t' -k4,4nr results.tsv | head -5

Apply the winning prompt to your skill by copying the optimized prompt text to replace the original.
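
The same ranking can be done in Python, which avoids problems with tabs and spaces inside the mutation_description column. This sketch assumes results.tsv starts with a header row using the column names listed above; the rows in the sample are made up purely for demonstration and are not real results.

```python
import csv
import io

def top_variants(tsv_file, n: int = 5) -> list[dict]:
    """Return the n rows with the highest avg_score."""
    rows = list(csv.DictReader(tsv_file, delimiter="\t"))
    return sorted(rows, key=lambda r: float(r["avg_score"]), reverse=True)[:n]

HEADER = ["experiment_id", "parent_id", "mutation_description",
          "avg_score", "pass_rate", "evals_detail", "prompt_diff"]
# Illustrative rows only; the ID format and scores are invented.
ROWS = [
    ["exp_000", "", "baseline", "0.40", "0.40", "", ""],
    ["exp_001", "exp_000", "require source URLs", "0.73", "0.73", "", ""],
    ["exp_002", "exp_001", "add summary section", "0.67", "0.67", "", ""],
]
sample = io.StringIO("\n".join("\t".join(row) for row in [HEADER, *ROWS]))

best = top_variants(sample, n=2)
for row in best:
    print(row["experiment_id"], row["avg_score"], row["mutation_description"])
```

To use it on a real run, pass an open file instead of the sample, for example `open("results.tsv", newline="", encoding="utf-8")`.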

Example: optimizing brain-search

User: brain-search results often lack source links; please help me optimize it.

Full workflow:

1. Create eval.json:
   {
     "test_inputs": [
       "search for latest news on OpenAI",
       "搜一下最新的 AI 芯片进展",
       "find recent papers on RAG optimization",
       "what happened with Anthropic this week",
       "查查 GPU 价格趋势"
     ],
     "evals": [
       {
         "name": "has_urls",
         "type": "rule",
         "rule": "regex",
         "pattern": "https?://[^\\s]+"
       },
       {
         "name": "min_2_sources",
         "type": "rule",
         "rule": "regex",
         "pattern": "https?://[^\\s]+.*https?://[^\\s]+"
       },
       {
         "name": "structured_output",
         "type": "llm",
         "question": "Is the output well-structured with clear sections?",
         "pass_description": "Output uses clear structure like bullets or headers",
         "fail_description": "Output is a wall of text without clear structure"
       }
     ]
   }

2. Run the command:
   python autoresearch.py \
     --target ../workspace/skills/brain-search/SKILL.md \
     --evals eval.json \
     --runs 5 \
     --max-experiments 20

3. Review and apply the results:
   - Check results.tsv for the highest-scoring variant
   - Read mutation_description to understand the key changes
   - Apply the best prompt to the original SKILL.md
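
When applying the best prompt by hand, keeping a copy of the original makes it easy to roll back. The helper below is a hypothetical sketch of that step; apply_prompt is not part of the skill, and the .baseline suffix is borrowed from the backup file name mentioned in this listing's security notes.

```python
import shutil
import tempfile
from pathlib import Path

def apply_prompt(target: Path, new_prompt: str) -> Path:
    """Back up `target` once to <name>.baseline, then overwrite it."""
    backup = target.with_name(target.name + ".baseline")
    if not backup.exists():            # never clobber an earlier baseline
        shutil.copy2(target, backup)
    target.write_text(new_prompt, encoding="utf-8")
    return backup

# Demo in a throwaway directory so no real skill file is touched.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "SKILL.md"
    skill.write_text("Search the web.", encoding="utf-8")
    backup = apply_prompt(skill, "Search the web. Always cite source URLs.")
    original = backup.read_text(encoding="utf-8")
    updated = skill.read_text(encoding="utf-8")

print(backup.name, "|", original, "->", updated)
```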

Failure handling

| Issue | Action |
| --- | --- |
| LLM API rate limit | Script auto-retries with backoff; if persistent, reduce --runs |
| Target file not found | Check path, must be readable prompt/skill file |
| All experiments score 0 | Evals may be too strict; review eval definitions, loosen criteria |
| Script crashes mid-run | Results already written to results.tsv are preserved; re-run to continue |
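
For context on the rate-limit row, retry with exponential backoff generally looks like the sketch below. It is a generic illustration, not the script's actual retry code: RateLimitError, call_with_backoff, the attempt limit, and the delay schedule are all assumptions.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever error a provider raises on HTTP 429."""

def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait 1s, 2s, 4s, ... (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise                       # out of attempts, surface the error
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Demo: a call that fails twice, then succeeds. `sleep` is replaced with
# list.append so the example records the delays instead of waiting.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

delays = []
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))
```

If retries keep exhausting their attempts, lowering --runs reduces the number of calls per experiment and therefore the request rate.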

Gotchas

  • Each experiment makes multiple LLM calls (runs × test_inputs × llm_evals), so watch API usage.
  • LLM evals are noisy; set --runs to 5 or higher for statistically meaningful results.
  • Rule evals are more stable and cheaper than LLM evals; prefer them where possible.
  • A very low baseline score (< 20%) suggests the eval definitions may be wrong; fix the evals first.
  • Prompt optimization cannot fix architectural issues (for example, a search API that itself returns poor results).
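
To make the API-usage point concrete, the arithmetic can be sketched as follows. The judge-call count uses the runs × test_inputs × llm_evals product; the extra generation call per test input per run is an assumption about how outputs are produced, and estimate_calls is a hypothetical helper.

```python
def estimate_calls(runs: int, n_inputs: int, n_llm_evals: int,
                   experiments: int) -> dict:
    """Rough LLM call budget for one optimization run."""
    judge = runs * n_inputs * n_llm_evals   # runs x test_inputs x llm_evals
    generate = runs * n_inputs              # assumption: one call per output
    return {
        "judge_per_experiment": judge,
        "generate_per_experiment": generate,
        "total": (judge + generate) * experiments,
    }

# 5 runs, 3 test inputs, 1 LLM eval, 30 experiments (the documented defaults
# combined with a three-input, one-LLM-eval eval.json).
budget = estimate_calls(runs=5, n_inputs=3, n_llm_evals=1, experiments=30)
print(budget)
```

Because rule evals add nothing to the judge term, replacing an LLM eval with an equivalent rule directly shrinks the budget.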
Security recommendations
This skill appears to do what it claims: mutate and test prompts by calling LLM APIs. Before installing or running: (1) be aware that the target prompt, your test inputs, and generated variants will be sent to the LLM provider you supply (OpenAI/Minimax/Anthropic or a compatible endpoint); do NOT point it at prompts containing secrets, credentials, or private data you don't want transmitted. (2) Limit which API key you provide and monitor usage/costs because the tool runs many model calls. (3) Review autoresearch.py (network endpoints and logging) if you need to verify no unexpected external hosts are used; you can set OPENAI_BASE_URL to a self-hosted compatible endpoint if you prefer. (4) The script backs up the original SKILL.md to SKILL.md.baseline, but still treat file writes as local modifications and run in a controlled workspace. Overall the components are coherent and proportionate to the stated purpose.
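
Since the target prompt is transmitted to the provider, a quick scan for credential-like strings before running can catch obvious mistakes. The patterns below are illustrative heuristics chosen for this sketch, not an exhaustive or official list, and find_secrets is a hypothetical helper; expect both false positives and misses.

```python
import re

# Heuristic patterns (assumptions): key-like tokens, KEY=value style
# assignments, and PEM private key headers.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    re.compile(r"(?i)(api[_-]?key|secret|password|token)\w*\s*[:=]\s*\S+"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

def find_secrets(text: str) -> list[str]:
    """Return 'line N: ...' entries for lines that look like they hold secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(p.search(line) for p in SECRET_PATTERNS):
            hits.append(f"line {lineno}: {line.strip()[:40]}")
    return hits

prompt_text = "You are a search agent.\nOPENAI_API_KEY=sk-not-a-real-key\nAlways cite sources."
hits = find_secrets(prompt_text)
print(hits)
```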
Functional analysis
Type: OpenClaw Skill
Name: brainforge-autoresearch
Version: 0.2.5
The brainforge-autoresearch skill bundle is a legitimate tool designed for autonomous prompt optimization using an iterative mutation and evaluation loop (the Karpathy autoresearch pattern). The core logic in autoresearch.py is implemented using only the Python standard library and interacts with official LLM APIs (OpenAI, Anthropic, MiniMax) to evaluate and improve prompt variants. While the tool possesses high-privilege capabilities such as modifying local files and utilizing API keys, these actions are transparently documented and essential for its stated purpose of optimizing AI agent skills. No evidence of data exfiltration, unauthorized execution, or malicious prompt injection was found in the code or instructions.
Capability tags
crypto · can-make-purchases · requires-sensitive-credentials
Capability assessment
Purpose & Capability
Name/description (prompt optimizer) match the included script and docs. Requesting an LLM API key (OPENAI_API_KEY / MINIMAX_API_KEY / ANTHROPIC_API_KEY) and Python is appropriate for running experiments against LLM providers.
Instruction Scope
SKILL.md and autoresearch.py instruct the agent to read a target SKILL.md (or prompt file), generate mutations, run them against test inputs, and upload those requests to the configured LLM provider. This is within scope, but important to note: the tool will transmit the target prompt, test inputs, and generated mutations to third-party LLM endpoints as part of evaluation (expected behavior for this tool).
Install Mechanism
Instruction-only skill with a bundled Python script; no external install downloads or package managers are used. The script uses only stdlib urllib/ssl for network calls — no high-risk remote install steps observed.
Credentials
Metadata declares any of MINIMAX_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY (with OPENAI_API_KEY as primary) and optional OPENAI_BASE_URL — these align with the providers implemented in the script. No unrelated secrets or excessive environment access are requested.
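
The mapping between providers and environment variables described above could be resolved along these lines. This is a sketch only: resolve_credentials and the returned dictionary shape are hypothetical, and applying OPENAI_BASE_URL only to the openai provider is an assumption.

```python
import os

PROVIDER_ENV = {
    "minimax": "MINIMAX_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def resolve_credentials(provider: str, env=None) -> dict:
    """Look up the API key (and optional base URL) for a provider."""
    env = os.environ if env is None else env
    var = PROVIDER_ENV[provider]
    key = env.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set")
    config = {"provider": provider, "api_key": key}
    # Assumption: the optional base URL override applies to openai only.
    if provider == "openai" and env.get("OPENAI_BASE_URL"):
        config["base_url"] = env["OPENAI_BASE_URL"]
    return config

# Demo with a fake environment so no real key is needed or printed.
fake_env = {"OPENAI_API_KEY": "test-key",
            "OPENAI_BASE_URL": "http://localhost:8000/v1"}
config = resolve_credentials("openai", env=fake_env)
print(config["provider"], config.get("base_url"))
```

Passing the environment in as a parameter keeps the lookup testable and makes it explicit which single key a given run is allowed to see.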
Persistence & Privilege
always is false and the skill does not request elevated platform privileges. It writes output artifacts (results.tsv, dashboard.html, SKILL.md.baseline) to the working directory, which is expected for its function and scoped to its own outputs.
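How to use
  1. Make sure OpenClaw is installed (local or Docker deployment)
  2. Enter the install command in the chat box: /install brainforge-autoresearch
  3. Once installed, invoke the skill by name or trigger it with /brainforge-autoresearch
  4. Provide the required inputs per the skill's parameter description to get structured output
Version history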
v0.2.5
Renamed skill to brainforge-autoresearch. Old slug openclaw-autoresearch will be merged as redirect.
Metadata
Slug: brainforge-autoresearch
Version: 0.2.5
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
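FAQ

What is Brainforge Autoresearch?

Use when user wants to optimize, improve, benchmark, or evaluate a skill's prompt. Triggers on "optimize skill", "improve skill prompt", "benchmark skill", "... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 98 total downloads to date.

How do I install Brainforge Autoresearch?

Run the command "/install brainforge-autoresearch" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Brainforge Autoresearch free?

Yes. Brainforge Autoresearch is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Brainforge Autoresearch support?

Brainforge Autoresearch runs cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (macOS, Linux).

Who developed Brainforge Autoresearch?

It is developed and maintained by ZHANG Ning (@zning1994); the current version is v0.2.5.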
