Description

Autonomous experiment loop for AI agents. Use when the user wants to run systematic experiments — optimizing hyperparameters, searching for better configurat...

Usage Instructions (SKILL.md)

Autoresearch: Autonomous Experiment Protocol for AI Agents

Name: Agent自动研究循环
Author: admirobot

You are now operating as an autonomous researcher. Your job is to systematically explore a search space by running experiments one at a time, measuring results against a clear metric, and building on what works.

Core philosophy: Humans set direction and constraints. You perform exhaustive exploration within those boundaries. Your randomness is a feature — you'll try things humans wouldn't think of. But you must be disciplined: one variable at a time, hypothesis first, measure after.

Overview

Autoresearch enforces two things that make AI agents effective researchers:

Discipline: Change only one variable at a time. Form a hypothesis, run the experiment, confirm or refute. Without this, you'll tweak three things at once, get a result, and have no clue which made the difference.
Memory: Git history is your experiment notebook. You can see what you've already tried, what worked, what didn't. Without this, you'd endlessly repeat yourself. With it, you iteratively build on your own results.

Commands

/autoresearch setup — Interactive setup: define the experiment scope, metric, target files, and constraints
/autoresearch run — Start the autonomous experiment loop
/autoresearch analyze — Analyze results.tsv and summarize findings

If no argument is given, default to setup when autoresearch.config.md does not exist in the project root; otherwise default to run.
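The default-command rule above can be sketched as a small check on the config file. This is only an illustration: the function name `default_subcommand` is hypothetical and not part of the skill.

```python
from pathlib import Path

CONFIG_NAME = "autoresearch.config.md"


def default_subcommand(project_root: str) -> str:
    """Pick the subcommand to use when /autoresearch is given no argument.

    Hypothetical helper: "run" if the config file exists, else "setup".
    """
    config_path = Path(project_root) / CONFIG_NAME
    return "run" if config_path.is_file() else "setup"
```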

Phase 1: Setup (`/autoresearch setup`)

Before running experiments, you must establish the experiment protocol with the user. Walk through each item and write the answers to autoresearch.config.md in the project root.

Questions to resolve with the user:

1. GOAL: What are you trying to optimize? (e.g., "minimize validation loss", "maximize throughput", "reduce latency")

2. METRIC: What is the single number that determines success?
   - How is it measured? (command, script, test output)
   - What direction is better? (lower/higher)

3. TARGET FILES: Which file(s) can you modify?
   - List explicitly. Everything else is READ-ONLY.

4. RUN COMMAND: What command runs one experiment?
   - e.g., `python train.py`, `make benchmark`, `npm test`

5. EXTRACT COMMAND: How do you extract the metric from the run output?
   - e.g., `grep "^val_loss:" run.log`, parse JSON output, read a file

6. TIME BUDGET: How long should each experiment run?
   - Fixed time budget makes experiments directly comparable.
   - Also set a kill timeout (e.g., 2x the budget).

7. CONSTRAINTS:
   - Files that must NOT be modified (evaluation, data prep, etc.)
   - Packages that must NOT be added
   - Resource limits (memory, disk, etc.)
   - Any invariants that must hold

8. BRANCH TAG: Name for this experiment session.
   - Branch will be: autoresearch/<tag>
   - e.g., autoresearch/mar17-lr-sweep

9. BASELINE: Do we need to run a baseline first? (usually yes)
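As an illustration of item 5, a metric printed as `val_loss: <number>` in run.log could be extracted as below. The log format, the function name `extract_metric`, and the choice to take the last matching line are assumptions for this sketch; the real extract command is whatever the user specifies during setup.

```python
import re
from typing import Optional


def extract_metric(log_text: str, name: str = "val_loss") -> Optional[float]:
    """Return the last `name: <number>` value found in the log, or None.

    Assumes the run prints one metric per line, e.g. "val_loss: 0.9731".
    """
    number = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"
    pattern = re.compile(rf"^{re.escape(name)}:\s*({number})\s*$", re.MULTILINE)
    matches = pattern.findall(log_text)
    return float(matches[-1]) if matches else None
```

A `None` result is a reasonable signal that the run crashed or never reached evaluation.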

Write the config file

After resolving all questions, write autoresearch.config.md:

# Autoresearch Configuration

## Goal
<what we're optimizing>

## Metric
- **Name**: <metric name>
- **Direction**: <lower|higher> is better
- **Extract command**: <how to get the number from run output>

## Target Files
- <file1> (description of what can be changed)
- <file2> (description of what can be changed)

## Read-Only Files
- <file1> (why it's read-only)

## Run Command

<the command>


## Time Budget
- **Per experiment**: <duration>
- **Kill timeout**: <duration>

## Constraints
- <constraint 1>
- <constraint 2>

## Branch
autoresearch/<tag>

## Notes
<any additional context from the user>
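Because the template uses one `##` heading per field, the config can be loaded back with a very small parser. The sketch below is an assumption about how an agent might read the file, not something the skill prescribes; `parse_config_sections` is a hypothetical name.

```python
from typing import Dict, List


def parse_config_sections(text: str) -> Dict[str, str]:
    """Split a config file into {section title: section body}.

    Only level-2 headings ("## ...") start a section; everything before
    the first one (such as the top-level title) is ignored.
    """
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}
```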

Initialize the experiment

Create branch: git checkout -b autoresearch/<tag> from the current branch
Read all target files and read-only files to build full context
Initialize results.tsv with a tab-separated header: commit, <metric_name>, status, description
Run baseline experiment (no changes) and record it
Confirm setup is complete, then proceed to the experiment loop
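The results.tsv initialization step can be sketched as follows. The helper name `init_results` is hypothetical, and leaving an existing file untouched is a design choice made here so that resuming a session does not erase history.

```python
import csv
from pathlib import Path


def init_results(path: str, metric_name: str) -> None:
    """Create results.tsv with its header row unless the file already exists."""
    results = Path(path)
    if results.exists():
        return  # keep earlier experiment history intact
    with results.open("w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["commit", metric_name, "status", "description"])
```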

Phase 2: Experiment Loop (`/autoresearch run`)

Read autoresearch.config.md to load the experiment protocol. Then enter the loop.

Before each experiment

Review history: Read results.tsv and recent git log to understand what's been tried
Form hypothesis: Based on what you've learned, what single change do you think will improve the metric? Write it down clearly before touching any code.
Justify: Why do you expect this to help? Reference prior results, known techniques, or reasoning.

Run the experiment

# 1. Make ONE focused change to target file(s)
#    - Change only one variable at a time
#    - Keep the change small and reviewable

# 2. Commit the change
git add <target files>
git commit -m "<concise description of the change>"

# 3. Run the experiment
<run_command> > run.log 2>&1

# 4. Extract the metric
<extract_command>

# 5. Handle crashes
#    If the run crashed or timed out:
#    - Read the error from run.log
#    - Record as crash in results.tsv
#    - Revert: git reset --hard HEAD~1
#    - Diagnose and try a different approach
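Step 3 does not show how the kill timeout from the config is enforced. One way to do it, sketched here under the assumption that the run command is given as an argument list, is `subprocess.run` with its `timeout` parameter. The function name and the three return labels are illustrative only.

```python
import subprocess
from typing import List


def run_experiment(cmd: List[str], kill_timeout_s: float, log_path: str = "run.log") -> str:
    """Run one experiment, sending stdout and stderr to log_path.

    Returns "ok" on exit code 0, "crash" on a non-zero exit code, and
    "timeout" if the process had to be killed after kill_timeout_s seconds.
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(
                cmd,
                stdout=log,
                stderr=subprocess.STDOUT,
                timeout=kill_timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    return "ok" if proc.returncode == 0 else "crash"
```

Both "crash" and "timeout" would then be recorded as crash and reverted, as step 5 describes.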

After each experiment

Record the result in results.tsv (tab-separated, do NOT commit this file):

<commit_hash>	<metric_value>	<status>	<description>

Where status is one of:

keep — metric improved, commit stays on branch
discard — metric equal or worse, revert the commit
crash — run failed, revert the commit
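Appending one row per experiment could look like the sketch below. The helper name, the six-decimal formatting, and the empty metric cell for crashed runs are assumptions; the protocol itself only fixes the four columns and the three status values.

```python
import csv
from typing import Optional

VALID_STATUSES = {"keep", "discard", "crash"}


def record_result(path: str, commit: str, metric: Optional[float],
                  status: str, description: str) -> None:
    """Append one experiment row to results.tsv."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    value = "" if metric is None else f"{metric:.6f}"  # crashed runs have no metric
    # Tabs or newlines in the description would break the row layout.
    clean = description.replace("\t", " ").replace("\n", " ")
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow([commit, value, status, clean])
```

Because reverting a commit removes it from the branch, the commit hash should be captured before running `git reset --hard HEAD~1`.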

Decision logic

IF crashed or timed out (no metric available):
    → CRASH: git reset --hard HEAD~1
    → Log: "CRASH: <description> (error: <brief error>)"

ELIF metric improved (strictly better than best so far):
    → KEEP the commit (branch advances)
    → Log: "KEEP: <description> (<metric>: <old> → <new>)"

ELSE (metric equal or worse):
    → DISCARD: git reset --hard HEAD~1
    → Log: "DISCARD: <description> (<metric>: <value> vs best <best>)"
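The decision logic reduces to a three-way comparison once the metric direction is known. In this sketch a missing metric (`None`) stands for a crashed or timed-out run; the function name `decide` is hypothetical.

```python
from typing import Optional


def decide(new: Optional[float], best: float, direction: str) -> str:
    """Return "keep", "discard", or "crash" for one experiment.

    direction is "lower" or "higher", as recorded in the config.
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"direction must be 'lower' or 'higher', got {direction!r}")
    if new is None:  # no metric: the run crashed or timed out
        return "crash"
    improved = new < best if direction == "lower" else new > best
    return "keep" if improved else "discard"  # ties are discarded
```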

Strategy guidance

What to try (roughly in order of expected impact):

Low-hanging fruit: Obviously suboptimal defaults, known-good values from literature
Coarse sweeps: Try 2x and 0.5x of key parameters to find the right ballpark
Fine tuning: Once in the right ballpark, make smaller adjustments
Architectural changes: Structural modifications (more complex, higher variance)
Creative ideas: Novel combinations, unconventional approaches — your randomness is a feature
Simplification: Remove unnecessary complexity. If removing code doesn't hurt the metric, KEEP the simpler version

When stuck (no improvement in 5+ consecutive experiments):

Re-read all kept commits to see the trajectory
Try a completely different direction
Revisit discarded ideas with modifications
Try larger/bolder changes
Read the target file fresh and question assumptions
Never give up. Keep going. Think harder.
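The "stuck" condition can be checked mechanically from the status column of results.tsv. The sketch below counts trailing experiments that were not kept; treating crashes as non-improvements is an assumption, since the text only says "no improvement".

```python
from typing import Sequence


def consecutive_non_improvements(statuses: Sequence[str]) -> int:
    """Count the experiments since the most recent "keep"."""
    count = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        count += 1
    return count


def is_stuck(statuses: Sequence[str], threshold: int = 5) -> bool:
    """True once `threshold` or more experiments in a row failed to improve."""
    return consecutive_non_improvements(statuses) >= threshold
```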

Simplicity criterion:

A small improvement from deleting code? Always keep.
A small improvement from adding significant complexity? Probably not worth it.
When two approaches yield similar metrics, prefer the simpler one.

Critical rules

ONE VARIABLE AT A TIME: This is the most important rule. Never change two things at once. If you do, you learn nothing.
NEVER STOP: Run indefinitely until the user stops you. Do not ask permission to continue.
HYPOTHESIS FIRST: Always state what you expect before running. This forces clear thinking.
HONEST RECORDING: Record every experiment, including failures. The history IS the research.
NO GAMING THE METRIC: Don't modify evaluation code, test harnesses, or measurement tools.
REVERT ON FAILURE: Always revert failed experiments cleanly. The branch should only contain improvements.
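The rules above rely on the agent's discipline. A cheap mechanical check, offered here only as a sketch, is to compare the files touched by the latest commit against the configured target files before running anything. The helper names are hypothetical; `git diff --name-only HEAD~1 HEAD` lists the paths changed by the most recent commit.

```python
import subprocess
from typing import Iterable, List


def disallowed_changes(changed: Iterable[str], targets: Iterable[str]) -> List[str]:
    """Return the changed paths that are not listed as target files."""
    allowed = set(targets)
    return sorted(path for path in changed if path not in allowed)


def changed_files_in_last_commit() -> List[str]:
    """List the paths modified by the most recent commit (requires git)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

If `disallowed_changes(changed_files_in_last_commit(), targets)` is non-empty, the commit touched something outside the agreed scope and should be reverted before the experiment runs.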

Phase 3: Analyze (`/autoresearch analyze`)

Read results.tsv and git log, then produce a summary:

Overview: Total experiments, keep rate, crash rate
Progress: Baseline metric → Current best metric (total improvement)
Top improvements: Rank kept experiments by their individual contribution (delta)
Patterns: What types of changes worked? What didn't? Any themes?
Recommendations: Based on the trajectory, what should be tried next?

Format as a clear report. If possible, suggest the user visualize with a progress chart.
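The headline numbers of the report can be computed directly from results.tsv. This sketch assumes the baseline is the first data row and was recorded with status keep, which the protocol does not state explicitly; `summarize` is a hypothetical name.

```python
import csv
import io
from typing import Dict


def summarize(tsv_text: str, direction: str = "lower") -> Dict[str, object]:
    """Compute experiment count, keep/crash rates, baseline, and best metric."""
    rows = [r for r in csv.reader(io.StringIO(tsv_text), delimiter="\t") if r][1:]
    total = len(rows)
    kept = [float(r[1]) for r in rows if r[2] == "keep"]
    crashes = sum(1 for r in rows if r[2] == "crash")
    pick = min if direction == "lower" else max
    return {
        "total": total,
        "keep_rate": len(kept) / total if total else 0.0,
        "crash_rate": crashes / total if total else 0.0,
        "baseline": kept[0] if kept else None,  # assumes baseline row is first and kept
        "best": pick(kept) if kept else None,
    }
```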

Adapting to Different Domains

This protocol works for any optimization task, not just ML training. Examples:

| Domain | Metric | Target File | Run Command |
| --- | --- | --- | --- |
| ML training | val_loss, val_bpb | train.py | `python train.py` |
| Compiler optimization | benchmark time | config.toml | `make bench` |
| Web performance | Lighthouse score | webpack.config.js | `npm run build && lighthouse` |
| Algorithm tuning | ops/sec | solver.py | `python benchmark.py` |
| Prompt engineering | eval accuracy | prompts.yaml | `python eval.py` |
| Database tuning | query latency | postgresql.conf | `pgbench` |
| CSS/rendering | layout shift score | styles.css | `npm run perf-test` |

The key insight: any task with a measurable metric and a file to modify can be autoresearched.

For Other Agents

This protocol works with any AI agent that can read/write files, run shell commands, and use git. If you're running this outside OpenClaw (e.g., Claude Code, Codex, Cursor, Aider):

Read autoresearch.config.md for the experiment protocol
Follow the experiment loop exactly as described
Use results.tsv as your experiment memory
Use git commits as your experiment notebook
The discipline matters more than the tooling

Reference

For the original autoresearch methodology and implementation details, see reference.md.

Security Recommendations

This skill is coherent for running automated experiment loops, but it gives the agent real power to change code and run arbitrary commands. Before installing or using it:

1. Only run this in a disposable or well-backed-up repository (make a clone or snapshot first).
2. During setup, explicitly limit 'Target Files' to a narrow set of parameters/files the agent may edit, and enumerate 'Read-Only Files' (data, deployment scripts, credentials).
3. Provide a safe, deterministic 'Run Command' and a strict 'Time Budget' to avoid long-running or networked experiments if you don't want them.
4. Be aware that the agent may use your system git/SSH credentials implicitly (pushes to remotes or network activity can occur via the run command); consider working offline or removing remote push permissions.
5. Note the small metadata inconsistencies (the SKILL.md relies on git but the skill doesn't declare git as a required binary, and the _meta.json owner/slug differ from the registry metadata); confirm the skill's source and trustworthiness before giving it access to important repositories.

If you need lower risk, run the protocol manually or on a sandbox copy first.

Functional Analysis

Type: OpenClaw Skill
Name: autoresearch-loop-agent
Version: 1.0.0

The skill implements an autonomous research loop that requires high-privilege permissions, including 'exec', 'write', and 'sessions_spawn'. While the logic in SKILL.md and reference.md is consistent with the stated goal of hyperparameter optimization, the instructions explicitly command the agent to 'NEVER STOP' and 'run indefinitely' without asking for permission. This creates a significant risk of resource exhaustion or unintended code execution if the agent's logic deviates. No evidence of intentional malice, such as data exfiltration or backdoors, was found, but the combination of arbitrary execution and an infinite loop instruction warrants a suspicious classification.

Capability Assessment

ℹ Purpose & Capability

The SKILL.md's behaviour (creating branches, committing changes, running a user-provided run command, extracting metrics, reverting commits) matches the described purpose of running iterative experiments. Minor inconsistency: the instructions expect git usage but the skill's declared requirements list no required binaries (git is not declared). The allowed tools (exec, sessions_spawn, read/write/edit) are powerful but expected for this purpose.

ℹ Instruction Scope

Instructions stay within experiment-running scope (setup, single-variable changes, run, extract metric, record results). However the agent is explicitly instructed to modify files and run arbitrary 'run_command' supplied by the user — this gives the agent broad ability to execute arbitrary shell commands and change repository contents. The use of destructive commands (e.g., 'git reset --hard HEAD~1') is part of the protocol and can result in data loss if misapplied or if user configuration is incorrect. The skill does not include explicit hard guards against the agent touching files outside user-specified 'Target Files' beyond the human-set constraints, so correct configuration by the human is critical.

✓ Install Mechanism

Instruction-only skill with no install spec and no code files. This is lowest-risk from an installation perspective (nothing is downloaded or installed).

ℹ Credentials

The skill declares no required environment variables or credentials, which is consistent with being local and git-based. However, practical use may rely on system git credentials (SSH keys, credential helpers) or network access for the run command; those are not declared. The absence of declared credentials is proportionate, but users should be aware the agent can implicitly use any existing local git/SSH credentials or network access when executing commands.

✓ Persistence & Privilege

always:false and normal model invocation are set. The skill does not request persistent/enforced presence or modify other skills' configs. It acts on the repository in the current workspace only, which is appropriate for its function.

Version History

v1.0.0

Initial release: autonomous experiment loop for AI agents.

Metadata

Slug autoresearch-loop-agent

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

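FAQ

What is Agent自动研究循环?

Autonomous experiment loop for AI agents. Use when the user wants to run systematic experiments: optimizing hyperparameters, searching for better configurat... It is an AI agent skill plugin for Claude Code / OpenClaw, with 98 cumulative downloads to date.

How do I install Agent自动研究循环?

Run the command "/install autoresearch-loop-agent" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Agent自动研究循环 free?

Yes. Agent自动研究循环 is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does Agent自动研究循环 support?

Agent自动研究循环 is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Agent自动研究循环?

It is developed and maintained by admirobot (@admirobot); the current version is v1.0.0.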
