Feature Description

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based...

Usage Instructions (SKILL.md)

autoresearch

Name: Karpathy Autoresearch
Author: alannjaf

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.

Triggers

Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.

Description

Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.

How It Works

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  1. BASELINE │────▶│  2. MUTATE   │────▶│  3. EVALUATE │────▶│  4. DECIDE   │
│  Score the   │     │  Change one  │     │  Run scoring │     │  Better?     │
│  current     │     │  thing       │     │  function    │     │ Keep : Revert│
│  version     │     │              │     │              │     │              │
└──────────────┘     └──────────────┘     └──────────────┘     └──────┬───────┘
                                                                      │
                                                                Loop back to 2

Instructions

Step 1: Identify the Mutable File

The mutable file is the thing you're optimizing. It can be:

A SKILL.md prompt/instructions
A trading strategy config (thresholds, parameters)
A content template (YouTube script format, ad copy structure)
Any text file where changes produce measurable differences

Create or identify this file. Example:

my-skill/
├── SKILL.md          ← this is your mutable file
├── eval/
│   ├── test_cases.json
│   └── score.py

Step 2: Create an Evaluation Function

Your eval function must:

Take the current mutable file as input
Run it against test cases
Return a numeric score (higher = better)

The eval can be anything:

LLM-as-judge: Send output to an LLM, ask it to score 1-100
Backtest: Run a strategy against historical data, measure Sharpe/returns
A/B metrics: CTR, engagement, conversion rate
Binary pass/fail: Count how many test cases pass out of N
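As an illustration of the binary pass/fail option, the sketch below scores a batch of outputs against yes/no checks and returns the fraction that pass. The check functions, the sample outputs, and the name pass_rate are hypothetical and not part of the skill.

```python
from typing import Callable, Sequence

# A binary check takes one output string and returns True (pass) or False (fail).
Check = Callable[[str], bool]


def pass_rate(outputs: Sequence[str], checks: Sequence[Check]) -> float:
    """Return the fraction of (output, check) pairs that pass, in [0, 1]."""
    total = len(outputs) * len(checks)
    if total == 0:
        raise ValueError("need at least one output and one check")
    passed = sum(1 for out in outputs for check in checks if check(out))
    return passed / total


# Hypothetical checks for an email-subject skill.
checks = [
    lambda s: len(s.split()) <= 8,            # short enough
    lambda s: any(ch.isdigit() for ch in s),  # contains a number
]
outputs = [
    "5 ways to cut cloud costs",
    "Our quarterly newsletter is finally here for you today",
]
print(f"SCORE: {pass_rate(outputs, checks)}")
```

Binary checks tend to be less noisy than a 1-100 judge score, at the cost of being coarser; with few test cases, a single flipped check can move the score noticeably.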

Template eval function (customize for your domain):

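# eval/score.py
import json
import sys


def run_and_score(current_version, case) -> float:
    """
    Score one test case. Placeholder: YOUR SCORING LOGIC HERE.
    Example: run the prompt on the case, compare output to expected.
    """
    raise NotImplementedError("implement run_and_score() for your domain")


def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float; higher is better.
    """
    with open(mutable_file_path, encoding="utf-8") as f:
        current_version = f.read()

    with open(test_cases_path, encoding="utf-8") as f:
        test_cases = json.load(f)

    if not test_cases:
        raise ValueError("test_cases is empty; nothing to score")

    # Average the per-case scores into a single number.
    scores = [run_and_score(current_version, case) for case in test_cases]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: score.py <mutable_file> <test_cases.json>")
    score = evaluate(sys.argv[1], sys.argv[2])
    print(f"SCORE: {score}")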

Step 3: Run the Autoresearch Loop

The loop follows this exact pattern:

1. Git init (if not already) — every experiment is a commit
2. Run eval on current version → get BASELINE score
3. For each experiment (1..N):
   a. Read the current mutable file
   b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule)
   c. Write the mutated version
   d. Run eval → get NEW score
   e. If NEW > BASELINE:
      - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}"
      - Update BASELINE = NEW
      - Log: "✅ KEPT — improvement"
   f. If NEW <= BASELINE:
      - Git checkout the mutable file (revert)
      - Log: "❌ REVERTED — no improvement"
4. Print final summary: experiments run, improvements found, final score
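The keep-or-revert logic of the loop above can be sketched in a few lines. This is a minimal in-memory version under stated assumptions: the evaluate and mutate callables are supplied by the caller, "revert" simply means discarding the candidate rather than running git checkout, and the toy scoring function and mutation are hypothetical stand-ins for a real eval and a real agent-generated change.

```python
import random
from typing import Callable, List, Tuple


def autoresearch(
    current: str,
    evaluate: Callable[[str], float],
    mutate: Callable[[str], str],
    n_experiments: int = 20,
) -> Tuple[str, float, List[str]]:
    """Keep a mutation only when it strictly beats the best score so far."""
    baseline = evaluate(current)
    log: List[str] = []
    for n in range(1, n_experiments + 1):
        candidate = mutate(current)
        new_score = evaluate(candidate)
        if new_score > baseline:
            log.append(f"exp-{n}: KEPT | score: {baseline} -> {new_score}")
            current, baseline = candidate, new_score
        else:
            # Dropping the candidate plays the role of the revert step.
            log.append(f"exp-{n}: REVERTED | score: {new_score} <= {baseline}")
    return current, baseline, log


# Toy stand-ins (hypothetical): the "file" is a number stored as text,
# and the score peaks when that number equals 10.
def toy_eval(text: str) -> float:
    return -abs(float(text) - 10.0)


rng = random.Random(0)


def toy_mutate(text: str) -> str:
    return str(float(text) + rng.choice([-1.0, 1.0]))


best, score, log = autoresearch("0", toy_eval, toy_mutate, n_experiments=50)
print(best, score, sum("KEPT" in line for line in log))
```

Because only strict improvements are kept, the best score can never decrease. If the eval is noisy (for example an LLM judge), a lucky run can still be kept as an "improvement", so averaging several eval runs per candidate may be worth the extra cost.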

Agent Instructions for Running the Loop

When the user says "run autoresearch on X", follow this procedure:

Locate the mutable file — ask the user or infer from context
Locate or create the eval function — the user must have a way to score
Initialize git tracking in the project directory
Run baseline eval — record the starting score
Begin experiment loop:
- Read the mutable file
- Think about what single change might improve the score
- Make the change (be specific — change ONE thing per experiment)
- Run eval
- Keep or revert based on score
- Log the result
Continue for N experiments (default: 20, or until user stops)
Report results:
- Starting score → Final score
- Number of experiments run
- Number of improvements kept
- Summary of what changes worked
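The "Starting score → Final score" line is usually reported with a percentage. A minimal sketch, assuming the percentage is the change relative to the starting score (an assumption, though it is consistent with the figures quoted under Proven Results); the helper name relative_change is hypothetical.

```python
def relative_change(start: float, final: float) -> float:
    """Percentage change from start to final, relative to the starting score."""
    if start == 0:
        raise ValueError("relative change is undefined when the starting score is 0")
    return (final - start) / abs(start) * 100.0


# Start/final pairs quoted in this document's case studies.
runs = {
    "gold trading (Sharpe)": (5.80, 12.23),
    "YouTube Shorts (LLM judge)": (94.3, 96.7),
}
for name, (start, final) in runs.items():
    print(f"{name}: {start} → {final} ({relative_change(start, final):+.1f}%)")
```

A relative figure depends heavily on the baseline: a score that starts near its ceiling can only show a small percentage gain, so the absolute scores should be reported alongside it.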

Mutation Strategy

Good mutations change ONE thing at a time:

Numeric parameters: Adjust thresholds, weights, window sizes
Prompt wording: Rephrase instructions, add/remove constraints
Structure: Reorder sections, add examples, remove redundancy
Rules: Add a new rule, tighten an existing one, relax a constraint

Bad mutations change everything at once — you can't learn what worked.
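For numeric parameters, the "one thing at a time" rule can be enforced mechanically. The sketch below nudges exactly one numeric value in a config and returns a description suitable for a commit message. The config keys, the 10% step size, and the function name mutate_one are illustrative assumptions, not the skill's own implementation.

```python
import copy
import random
from typing import Any, Dict, Tuple


def mutate_one(config: Dict[str, Any], rng: random.Random,
               scale: float = 0.1) -> Tuple[Dict[str, Any], str]:
    """Return a copy of config with exactly one numeric value nudged, plus a description."""
    numeric_keys = [k for k, v in config.items()
                    if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if not numeric_keys:
        raise ValueError("config has no numeric parameters to mutate")
    key = rng.choice(numeric_keys)
    old = config[key]
    direction = rng.choice([-1, 1])
    if isinstance(old, int):
        # Integers move by at least 1 so the mutation is never a no-op.
        new = old + direction * max(1, round(abs(old) * scale))
    else:
        # Floats move by `scale` of their size, or by `scale` itself at zero.
        new = old + direction * (abs(old) * scale or scale)
    mutated = copy.deepcopy(config)
    mutated[key] = new
    return mutated, f"{key}: {old} -> {new}"


# Hypothetical strategy config.
config = {"ema_fast": 8, "ema_slow": 24, "momentum_threshold": 0.003, "symbol": "XAUUSD"}
mutated, description = mutate_one(config, random.Random(42))
print(description)
```

Returning the description together with the mutated copy keeps the experiment log honest: the message records the one value that actually changed, and the original config stays untouched in case the experiment is reverted.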

Step 4: Git Tracking

Every experiment MUST be tracked in git:

# Before starting
git init
git add -A
git commit -m "baseline: score {X}"

# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"

# After each failed mutation
git checkout -- {mutable_file}

This gives you:

Full history of every experiment
Ability to diff any two versions
Easy rollback if something breaks
A log of what mutations worked vs didn't
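If the loop is driven from Python, the git steps above can be built as argument lists so that nothing passes through a shell. This is a sketch under assumptions: the helper names are hypothetical, and run_all is defined but not called here, since it requires a git repository with a configured user identity.

```python
import subprocess
from typing import List


def commit_message(n: int, description: str, old: float, new: float) -> str:
    """Build the experiment commit message in the format used above."""
    return f"exp-{n}: {description} | {old} → {new}"


def keep_commands(message: str) -> List[List[str]]:
    """Commands for a successful mutation: stage everything, then commit."""
    return [["git", "add", "-A"], ["git", "commit", "-m", message]]


def revert_commands(mutable_file: str) -> List[List[str]]:
    """Commands for a failed mutation: restore the file from the last commit."""
    return [["git", "checkout", "--", mutable_file]]


def run_all(commands: List[List[str]], cwd: str) -> None:
    """Run each command in cwd, raising CalledProcessError on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)


# Show what would be executed, without touching any repository.
for cmd in keep_commands(commit_message(3, "add 'start with a verb'", 74.1, 75.3)):
    print(cmd)
for cmd in revert_commands("SKILL.md"):
    print(cmd)
```

Note that git add -A stages every change in the working tree, not only the mutable file, which is one more reason to run experiments in a dedicated repository.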

Proven Results

Case Study 1: Gold Trading Strategy

Task: Optimize XAUUSD trading parameters
Mutable file: Strategy config (EMA periods, momentum threshold, position sizing)
Eval function: Backtest on historical data → Sharpe ratio
Baseline: Sharpe 5.80
Experiments: 86 in 25 minutes
Final: Sharpe 12.23 (+111%)
Key discoveries: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization
See: references/gold-results.md

Case Study 2: YouTube Shorts Scripts

Task: Optimize script-writing prompt for higher quality scores
Mutable file: SKILL.md prompt instructions
Eval function: LLM judge scoring 1-100
Baseline: 94.3/100
Experiments: 11
Final: 96.7/100 (+2.5%)
Key discoveries: Atomic sentences, strict 40-50 word range, stronger negative examples
See: references/youtube-results.md

Example Usage

User: "Run autoresearch on my email subject line skill"

Agent workflow:

Read the skill's SKILL.md (mutable file)
Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
Baseline: 72.4/100
Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
... continue for 20 experiments
Final: 79.2/100 (+9.4%)

User: "Optimize my trading strategy config"

Agent workflow:

Read strategy.json (mutable file)
Eval: run backtest script → Sharpe ratio
Baseline: Sharpe 2.1
Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅
Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌
... continue
Final: Sharpe 3.8 (+81%)

Safe Usage Recommendations

High-level points to consider before installing or running this skill:

- Functionality is coherent: it implements the mutate→evaluate→keep loop and includes reference scripts (loop.py, evaluate.py). That said, the package metadata omits some practical requirements; verify you have git and a safe working directory available.
- Review and control the 'mutable file' and working directory: the agent will read, write, and git-commit whatever file you point it at. Do NOT point it at system configs, secrets, SSH keys, or any repository containing credentials. Run experiments in an isolated project or sandbox.
- Eval commands run arbitrary subprocesses: loop.py runs whatever eval command you supply (via shell=True) and parses numeric output. Ensure your eval harness is trusted and does not perform unwanted network calls or exfiltration. Treat the eval command as code you must review.
- evaluate.py requires you to implement score_one(); by default it raises NotImplementedError. If you implement LLM-based judging or backtests, you will likely need API keys and data access that the skill does not declare; keep credentials out of the mutable file and out of experiment commits.
- The skill's README includes a crypto payment address and Telegram contact for a paid 'Pro' tier. This is external monetization and unrelated to the skill code; be cautious when sending funds or contacting external handles.
- Best practices: run the skill in an isolated environment, inspect and possibly modify the provided scripts before running, keep a separate git repo or sandbox for experiments, and avoid letting the agent autonomously run experiments on repositories containing sensitive data. If you plan to use LLM judges or external services, create and scope API keys appropriately (principle of least privilege).
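One way to reduce the shell-injection exposure described above is to run the eval as an argument list and accept only a strictly formatted score line. The sketch below is not the bundled loop.py; it assumes the eval prints a line of the form "SCORE: <number>", as the template score.py in this document does, and the function names are hypothetical.

```python
import re
import subprocess
import sys
from typing import List

# Accept only a line that consists of "SCORE: <number>".
SCORE_RE = re.compile(r"^SCORE:\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$", re.MULTILINE)


def parse_score(stdout: str) -> float:
    """Extract the first 'SCORE: <number>' line from the eval output."""
    match = SCORE_RE.search(stdout)
    if match is None:
        raise ValueError("no 'SCORE: <number>' line found in eval output")
    return float(match.group(1))


def run_eval(argv: List[str], timeout_s: float = 60.0) -> float:
    """Run the eval as an argument list (no shell) and parse its score."""
    result = subprocess.run(argv, capture_output=True, text=True,
                            timeout=timeout_s, check=True)
    return parse_score(result.stdout)


# Stand-in eval command: a Python one-liner that prints a score line.
argv = [sys.executable, "-c", "print('SCORE: 0.75')"]
print(run_eval(argv))
```

Passing a list means file names and mutation text are never interpreted by a shell, and the timeout bounds a runaway eval. The eval program itself is still arbitrary code and needs the same review as before.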

Functional Analysis

Type: OpenClaw Skill
Name: karpathy-autoresearch
Version: 1.0.0

The skill bundle implements an autonomous 'mutate-evaluate-commit' loop that requires high-privilege file system and shell access. A significant vulnerability exists in 'scripts/loop.py', which uses 'subprocess.run(shell=True)' to execute evaluation commands, creating a risk of arbitrary command injection (RCE) during the autonomous mutation process. While the logic appears aligned with the stated 'autoresearch' purpose, the combination of autonomous code modification and unsanitized shell execution, alongside a 'Pro' tier requesting USDT payments to a specific wallet (TYMGjMAs3t5qaEU5xBqLQNpnGeLQ4G6gmM), presents a high-risk profile.

Capability Assessment

ℹ Purpose & Capability

The name/description (autoresearch loop) aligns with included code (loop.py, evaluate.py) and README. However the skill metadata declares no required binaries or credentials while the scripts implicitly require git and a working shell environment and realistically will need LLM/backtest tooling and possibly API keys to implement evals — a mild incoherence between declared requirements and what is needed in practice.

ℹ Instruction Scope

SKILL.md precisely instructs the agent to locate a mutable file, create or use an eval function, initialize git, mutate the file, run evals, and commit/revert. That scope is consistent with the purpose. Caution: these instructions give the agent permission to modify and commit arbitrary files in the target workdir — if the 'mutable file' or working directory is pointed at sensitive configs or system files the loop could change them. The reference loop expects interactive or agent-driven mutations and runs arbitrary eval commands provided by the user.

✓ Install Mechanism

No install spec (instruction-only) and included scripts are a reference implementation. This is low install risk — nothing is downloaded from arbitrary URLs. The files will exist on disk as part of the skill bundle, which is expected.

ℹ Credentials

The skill declares no required env vars, but realistic use (LLM-judge, backtest data, external scoring harnesses) will likely require API keys, data access, or other credentials that are not declared. The README also asks users to pay $99 USDT to a provided crypto address and DM a Telegram handle to unlock a 'Pro' tier — this is external monetization/contact and not a credential leak, but it is unrelated to skill functionality and could be a red flag for some users.

ℹ Persistence & Privilege

always:false (normal). The skill can be invoked autonomously (default). Combined with its ability to modify files and run arbitrary eval commands (shell subprocess with shell=True), autonomous runs could have a wide blast radius if the agent is allowed to operate on sensitive directories. The skill does not request persistent system-level privileges or modify other skills' configs.

Version History

v1.0.0

Initial release: autonomous prompt/strategy optimization. Gold trading +111% Sharpe, YouTube Shorts +2.5% quality.

Metadata

Slug karpathy-autoresearch

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of released versions 1

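FAQ

What is Karpathy Autoresearch?

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 117 cumulative downloads to date.

How do I install Karpathy Autoresearch?

Run the command "/install karpathy-autoresearch" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is required.

Is Karpathy Autoresearch free?

Yes. Karpathy Autoresearch is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does Karpathy Autoresearch support?

Karpathy Autoresearch is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Karpathy Autoresearch?

It is developed and maintained by Alannjaf (@alannjaf); the current version is v1.0.0.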
