Description

Use when user wants to optimize, improve, benchmark, or evaluate a skill's prompt. Triggers on "optimize skill", "improve skill prompt", "benchmark skill", "...

Usage (SKILL.md)

brainforge-autoresearch

Name: Brainforge Autoresearch
Author: zning1994

Previously published as autoresearch / openclaw-autoresearch. Renamed for the brainforge marketplace rollout — functionality unchanged.

Autonomous prompt optimization for AI agent skills. Runs controlled experiments to find better prompt variants using the Karpathy autoresearch pattern: generate hypothesis, mutate prompt, evaluate, repeat.

When to use

User says "optimize this skill's prompt" (or the Chinese equivalent, "优化一下这个 skill")
User wants to benchmark the effect of different prompt variants
User says "run autoresearch on X" / "eval skill X" / "improve skill X"
User is unhappy with skill output quality and wants systematic improvement

Do not use:

One-off prompt tweaks: just edit the prompt directly
Debugging a specific failure case: investigate the root cause instead
The skill script itself has a bug: a code logic problem is not a prompt problem, so fix the code

Requirements

Python 3.10+
autoresearch.py script in the skill directory
LLM API access (MiniMax, OpenAI, or Anthropic)
Target skill must have a prompt file (SKILL.md, SYSTEM.md, or similar)

Procedure

Always follow these steps in order: (1) gather context, (2) create eval.json, (3) run the autoresearch command, (4) review the results and apply the best prompt.

Step 1: Gather context

Before running, you need:

Parameter	Description	Example
`--target`	Path to the skill directory or prompt file to optimize	`../workspace/skills/brain-search/SKILL.md`
`--evals`	Path to eval definition JSON file	`eval.json`
`--provider`	LLM provider for running experiments	`minimax` (default), `openai`, `anthropic`
`--runs`	Number of runs per experiment (statistical significance)	`5` (default)
`--max-experiments`	Maximum experiments before stopping	`30` (default)
`--dashboard`	Open live results dashboard in browser	flag, no value
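
The flags and defaults in the table above can be mirrored in a small argparse sketch. This only illustrates how the documented options fit together; it is not the parser autoresearch.py actually uses, and treating `--target` and `--evals` as required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="autoresearch.py")
    # Assumption: the two paths are mandatory; the table does not say so explicitly.
    parser.add_argument("--target", required=True,
                        help="skill directory or prompt file to optimize")
    parser.add_argument("--evals", required=True,
                        help="path to the eval definition JSON file")
    parser.add_argument("--provider", default="minimax",
                        choices=["minimax", "openai", "anthropic"])
    parser.add_argument("--runs", type=int, default=5,
                        help="runs per experiment")
    parser.add_argument("--max-experiments", type=int, default=30,
                        help="stop after this many experiments")
    parser.add_argument("--dashboard", action="store_true",
                        help="open the live results dashboard")
    return parser


# Only the two paths are supplied, so every other option takes its default.
args = build_parser().parse_args(["--target", "SKILL.md", "--evals", "eval.json"])
print(args.provider, args.runs, args.max_experiments, args.dashboard)
```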

Step 2: Create eval.json

Define test inputs and evaluation criteria. Each eval is a binary pass/fail check.

{
  "test_inputs": [
    "search for latest AI agent frameworks",
    "find news about LLM inference optimization",
    "搜一下 transformer 架构的最新进展"
  ],
  "evals": [
    {
      "name": "has_sources",
      "type": "rule",
      "rule": "regex",
      "pattern": "(https?://|Source:|来源:)"
    },
    {
      "name": "no_hallucinated_urls",
      "type": "rule",
      "rule": "banned_phrases",
      "phrases": ["example.com", "placeholder.url"]
    },
    {
      "name": "sufficient_detail",
      "type": "rule",
      "rule": "word_count",
      "min": 50,
      "max": 500
    },
    {
      "name": "contains_summary",
      "type": "rule",
      "rule": "contains",
      "values": ["summary", "key findings", "结论"]
    },
    {
      "name": "no_apology_prefix",
      "type": "rule",
      "rule": "not_contains",
      "values": ["I apologize", "I'm sorry, but"]
    },
    {
      "name": "actionable_output",
      "type": "llm",
      "question": "Does the response provide actionable information the user can immediately use (links, specific facts, concrete next steps)?",
      "pass_description": "The response contains specific actionable items like URLs, concrete facts, or clear next steps",
      "fail_description": "The response is vague, generic, or lacks specific actionable information"
    }
  ]
}
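
Because a malformed eval file wastes API calls, it can help to check the structure before running. The following is a minimal sketch of such a check, assuming the field names used in this section; `validate_eval_config` is a hypothetical helper, not part of autoresearch.py.

```python
# Assumed required parameters per rule, based on the eval format in this section.
RULE_PARAMS = {
    "regex": ["pattern"],
    "banned_phrases": ["phrases"],
    "word_count": [],  # min and max are treated as optional here
    "contains": ["values"],
    "not_contains": ["values"],
}
LLM_FIELDS = ["question", "pass_description", "fail_description"]


def validate_eval_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    if not config.get("test_inputs"):
        problems.append("test_inputs is missing or empty")
    if not config.get("evals"):
        problems.append("evals is missing or empty")
    seen = set()
    for i, ev in enumerate(config.get("evals", [])):
        name = ev.get("name", f"<eval #{i}>")
        if name in seen:
            problems.append(f"{name}: duplicate eval name")
        seen.add(name)
        if ev.get("type") == "rule":
            rule = ev.get("rule")
            if rule not in RULE_PARAMS:
                problems.append(f"{name}: unknown rule {rule!r}")
                continue
            for key in RULE_PARAMS[rule]:
                if key not in ev:
                    problems.append(f"{name}: rule {rule!r} needs {key!r}")
        elif ev.get("type") == "llm":
            for key in LLM_FIELDS:
                if key not in ev:
                    problems.append(f"{name}: llm eval needs {key!r}")
        else:
            problems.append(f"{name}: type must be 'rule' or 'llm'")
    return problems
```

To use it, load the file with `json.load` and print whatever `validate_eval_config` returns before starting a run.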

Rule types:

Rule	Parameters	Description
`regex`	`pattern`	Pass if regex matches output
`banned_phrases`	`phrases` (list)	Pass if NONE of the phrases appear
`word_count`	`min`, `max` (optional)	Pass if word count is within range
`contains`	`values` (list), optional `match`: `"any"` (default) or `"all"`	Pass if any/all values appear in output (case-insensitive)
`not_contains`	`values` (list)	Pass if NONE of the values appear in output (case-insensitive)
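
The rule semantics in the table can be sketched in a few lines of Python. This is an illustrative reading of the table, not the code in autoresearch.py; the function name `run_rule_eval`, the case-insensitive handling of `banned_phrases`, and whitespace-based word counting are all assumptions.

```python
import re


def run_rule_eval(ev: dict, output: str) -> bool:
    """Return True if `output` passes the rule eval `ev` (semantics assumed from the table)."""
    rule = ev["rule"]
    text = output.lower()
    if rule == "regex":
        return re.search(ev["pattern"], output) is not None
    if rule == "banned_phrases":
        # Assumption: matched case-insensitively; the table does not specify this.
        return not any(p.lower() in text for p in ev["phrases"])
    if rule == "word_count":
        # Whitespace split; this undercounts text written without spaces (e.g. Chinese).
        n = len(output.split())
        return ev.get("min", 0) <= n <= ev.get("max", float("inf"))
    if rule == "contains":
        hits = [v.lower() in text for v in ev["values"]]
        return all(hits) if ev.get("match", "any") == "all" else any(hits)
    if rule == "not_contains":
        return not any(v.lower() in text for v in ev["values"])
    raise ValueError(f"unknown rule: {rule}")


sample = "Summary: see https://example.org/a for details"
print(run_rule_eval({"rule": "contains", "values": ["summary", "结论"]}, sample))
```

Note that with Python's `re` module, `.` does not match a newline by default, so a pattern meant to span several lines should use `[\s\S]` instead of `.`.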

LLM eval type:

Field	Description
`type`	Must be `"llm"`
`name`	Unique name for this eval
`question`	What to ask the judge LLM about the output
`pass_description`	Description of what a passing output looks like
`fail_description`	Description of what a failing output looks like
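
One way the three text fields might be combined into a judge prompt is sketched below. The prompt wording and both helper names (`build_judge_prompt`, `parse_verdict`) are hypothetical; the judge prompt that autoresearch.py really sends is not documented here.

```python
def build_judge_prompt(ev: dict, output: str) -> str:
    """Assemble a binary pass/fail grading prompt from an LLM eval definition."""
    return (
        "You are grading the output of an AI skill.\n"
        f"Question: {ev['question']}\n"
        f"PASS means: {ev['pass_description']}\n"
        f"FAIL means: {ev['fail_description']}\n"
        "Answer with exactly one word, PASS or FAIL.\n\n"
        f"Output to grade:\n{output}"
    )


def parse_verdict(reply: str) -> bool:
    """Treat a reply whose first word is PASS as a pass; anything else fails."""
    words = reply.strip().split()
    return bool(words) and words[0].strip(".,:!").upper() == "PASS"
```

Defaulting to a fail on any unexpected reply keeps a confused judge from inflating scores, which matters because each eval is a binary check.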

See eval-guide.md for detailed guidance on writing effective evals.

Step 3: Run autoresearch

python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --provider minimax \
  --runs 5 \
  --max-experiments 30 \
  --dashboard

Step 4: Review results and apply changes

The script writes results to results.tsv in the working directory. Each row is one experiment:

experiment_id  parent_id  mutation_description  avg_score  pass_rate  evals_detail  prompt_diff

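Find the best-performing variants by sorting on avg_score (column 4). The file is tab-separated, so set the delimiter explicitly (bash or zsh syntax); otherwise spaces in mutation_description shift the column count:

sort -t$'\t' -k4,4nr results.tsv | head -5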

Apply the winning prompt to your skill by copying the optimized prompt text to replace the original.
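
As an alternative to the shell one-liner, the same ranking can be done with Python's csv module. This sketch assumes results.tsv starts with a header row containing the column names listed above; the sample rows are made-up illustration data, not real results.

```python
import csv
import io


def top_variants(tsv_text: str, n: int = 5) -> list[dict]:
    """Return the n rows with the highest avg_score, best first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    rows.sort(key=lambda r: float(r["avg_score"]), reverse=True)
    return rows[:n]


# Made-up rows for illustration only; a real file comes from autoresearch.py.
sample = (
    "experiment_id\tparent_id\tmutation_description\tavg_score\tpass_rate\tevals_detail\tprompt_diff\n"
    "0\t\tbaseline\t0.40\t0.20\t{}\t\n"
    "1\t0\tadd explicit source requirement\t0.75\t0.60\t{}\t+Always cite sources\n"
)

for row in top_variants(sample, n=1):
    print(row["experiment_id"], row["avg_score"], row["mutation_description"])
```

For a real run, read the file with `open("results.tsv", encoding="utf-8").read()` and pass the text to `top_variants`.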

Example: optimizing brain-search

User: brain-search results often lack source links. Please help me optimize it.

Full workflow:

1. Create eval.json:
   {
     "test_inputs": [
       "search for latest news on OpenAI",
       "搜一下最新的 AI 芯片进展",
       "find recent papers on RAG optimization",
       "what happened with Anthropic this week",
       "查查 GPU 价格趋势"
     ],
     "evals": [
       {
         "name": "has_urls",
         "type": "rule",
         "rule": "regex",
         "pattern": "https?://[^\\s]+"
       },
       {
         "name": "min_2_sources",
         "type": "rule",
         "rule": "regex",
         "pattern": "https?://[^\\s]+[\\s\\S]*https?://[^\\s]+"
       },
       {
         "name": "structured_output",
         "type": "llm",
         "question": "Is the output well-structured with clear sections?",
         "pass_description": "Output uses clear structure like bullets or headers",
         "fail_description": "Output is a wall of text without clear structure"
       }
     ]
   }

2. Run the command:
   python autoresearch.py \
     --target ../workspace/skills/brain-search/SKILL.md \
     --evals eval.json \
     --runs 5 \
     --max-experiments 20

3. Review and apply the results:
   - Check results.tsv for the highest-scoring variant
   - Read mutation_description to understand the key changes
   - Apply the best prompt to the original SKILL.md

Failure handling

Issue	Action
LLM API rate limit	Script auto-retries with backoff; if persistent, reduce `--runs`
Target file not found	Check the path; it must point to a readable prompt or skill file
All experiments score 0	Evals may be too strict — review eval definitions, loosen criteria
Script crashes mid-run	Results already written to `results.tsv` are preserved; re-run the command to continue

Gotchas

Each experiment makes multiple LLM calls (runs × test_inputs × llm_evals), so watch API usage
LLM evals are noisy; set --runs to 5 or more for statistically meaningful results
Rule evals are more stable and cheaper than LLM evals; prefer rule evals
A very low baseline score (< 20%) suggests the eval definitions may be flawed; fix the evals first
Prompt optimization cannot fix architectural issues (for example, a search API that itself returns poor results)
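
The call-count formula from the first gotcha (runs × test_inputs × llm_evals) can be turned into a quick budget check before a run. This sketch applies that formula as stated; it probably excludes the calls that generate the skill outputs and the prompt mutations, so treat the result as a lower bound. The function name is hypothetical.

```python
def estimate_judge_calls(runs: int, test_inputs: int, llm_evals: int,
                         experiments: int = 1) -> int:
    """Apply runs x test_inputs x llm_evals, scaled by the number of experiments."""
    return runs * test_inputs * llm_evals * experiments


# Step 2 example: 3 test inputs, 1 LLM eval, default --runs 5.
print(estimate_judge_calls(5, 3, 1))        # 15 per experiment
# Same config across the default --max-experiments 30.
print(estimate_judge_calls(5, 3, 1, 30))    # 450 in total
```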

Safe usage recommendations

This skill appears to do what it claims: mutate and test prompts by calling LLM APIs. Before installing or running:

(1) Be aware that the target prompt, your test inputs, and generated variants will be sent to the LLM provider you supply (OpenAI, MiniMax, Anthropic, or a compatible endpoint); do NOT point it at prompts containing secrets, credentials, or private data you don't want transmitted.
(2) Limit which API key you provide and monitor usage and costs, because the tool runs many model calls.
(3) Review autoresearch.py (network endpoints and logging) if you need to verify that no unexpected external hosts are used; you can set OPENAI_BASE_URL to a self-hosted compatible endpoint if you prefer.
(4) The script backs up the original SKILL.md to SKILL.md.baseline, but still treat file writes as local modifications and run in a controlled workspace.

Overall the components are coherent and proportionate to the stated purpose.

Functional analysis

Type: OpenClaw Skill
Name: brainforge-autoresearch
Version: 0.2.5

The brainforge-autoresearch skill bundle is a legitimate tool designed for autonomous prompt optimization using an iterative mutation and evaluation loop (the Karpathy autoresearch pattern). The core logic in autoresearch.py is implemented using only the Python standard library and interacts with official LLM APIs (OpenAI, Anthropic, MiniMax) to evaluate and improve prompt variants. While the tool possesses high-privilege capabilities such as modifying local files and utilizing API keys, these actions are transparently documented and essential for its stated purpose of optimizing AI agent skills. No evidence of data exfiltration, unauthorized execution, or malicious prompt injection was found in the code or instructions.

Capability tags

crypto
can-make-purchases
requires-sensitive-credentials

Capability assessment

✓ Purpose & Capability

Name/description (prompt optimizer) match the included script and docs. Requesting an LLM API key (OPENAI_API_KEY / MINIMAX_API_KEY / ANTHROPIC_API_KEY) and Python is appropriate for running experiments against LLM providers.

ℹ Instruction Scope

SKILL.md and autoresearch.py instruct the agent to read a target SKILL.md (or prompt file), generate mutations, run them against test inputs, and upload those requests to the configured LLM provider. This is within scope, but important to note: the tool will transmit the target prompt, test inputs, and generated mutations to third-party LLM endpoints as part of evaluation (expected behavior for this tool).

✓ Install Mechanism

Instruction-only skill with a bundled Python script; no external install downloads or package managers are used. The script uses only stdlib urllib/ssl for network calls — no high-risk remote install steps observed.

✓ Credentials

Metadata declares any of MINIMAX_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY (with OPENAI_API_KEY as primary) and optional OPENAI_BASE_URL — these align with the providers implemented in the script. No unrelated secrets or excessive environment access are requested.

✓ Persistence & Privilege

always is false and the skill does not request elevated platform privileges. It writes output artifacts (results.tsv, dashboard.html, SKILL.md.baseline) to the working directory, which is expected for its function and scoped to its own outputs.

Version history

v0.2.5

Renamed skill to brainforge-autoresearch. Old slug openclaw-autoresearch will be merged as redirect.

Metadata

Slug brainforge-autoresearch

Version 0.2.5

License MIT-0

Total installs 0

Current installs 0

Historical versions 1

FAQ