demo112

Ab Test Runner

Author: demo112 · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 132
Favorites: 0
Current installs: 1
Versions: 1
Install in OpenClaw
/install ab-test-runner
Description
Design and execute A/B testing experiments for LLM prompts, agent behaviors, and content production. Activate when user says "run an AB test", "design an exp...
Usage (SKILL.md)

AB Test Runner

Run structured A/B experiments: hypothesis → design → execute → archive → update findings.


Workflow

Hypothesis → Operational definition → Variant design → Execution → Analysis → Archive → Template update

Step 1: Hypothesis

Input from user: a question or claim to test

Your task: Formalize it into the standard format:

domain: <prompt|behavior|engineering|content>
variable: <what you're changing>
hypothesis: <a specific, testable causal or difference statement>

Examples:

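  • "Which works better, natural language or shenlun-style (formal civil-service essay) language?" → domain: prompt, variable: language style, hypothesis: natural language yields higher output quality than shenlun-style language
  • "Speech rate +15% vs. speech rate +5%" → domain: content, variable: TTS speech rate, hypothesis: a +15% speech rate yields a higher completion rate than +5%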
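
The standard format above can be captured in a small typed record so that malformed hypotheses are caught early. This is a minimal sketch, not part of the skill: the `Hypothesis` class and `render` function are hypothetical names, and the only facts taken from the source are the three field names and the four allowed domains.

```python
from dataclasses import dataclass

# The four domains listed in the standard format.
ALLOWED_DOMAINS = {"prompt", "behavior", "engineering", "content"}


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for the three fields of the standard format."""
    domain: str
    variable: str
    hypothesis: str

    def __post_init__(self) -> None:
        if self.domain not in ALLOWED_DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")
        if not self.variable.strip() or not self.hypothesis.strip():
            raise ValueError("variable and hypothesis must be non-empty")


def render(h: Hypothesis) -> str:
    """Render the hypothesis as the three-line standard format."""
    return f"domain: {h.domain}\nvariable: {h.variable}\nhypothesis: {h.hypothesis}"


example = Hypothesis(
    domain="content",
    variable="TTS speech rate",
    hypothesis="A +15% speech rate yields a higher completion rate than +5%",
)
print(render(example))
```

Rejecting an unknown domain at construction time keeps later steps from silently carrying a mistyped value into the manifest.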

If the user hasn't specified: Ask 3 questions:

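  1. Which variable are you changing? (What is A, and what is B?)
  2. How will you judge which is better? (Cooper's subjective judgment / objective metrics / both)
  3. How many samples does each group need?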

Step 2: Operational definition

Create the experiment manifest before running:

{
  "experiment_id": "hyp-XXX",
  "executed_at": "<ISO timestamp>",
  "domain": "...",
  "variable": "...",
  "hypothesis": "...",
  "n_per_group": <10-30>,
  "rubric": {
    "<dimension A>": "0-3: <criterion>",
    "<dimension B>": "0-3: <criterion>",
    "<dimension C>": "0-3: <criterion>",
    "<length compliance>": "0-1: <criterion>"
  },
  "success_criteria": "<criterion for the hypothesis to hold>"
}

Rules:

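  • At least 10 samples per group (with fewer than 10, explicitly label the result "directional signal, not statistically significant")
  • A multi-dimensional rubric (3 dimensions + 1 compliance item) resists drift better than a single score
  • The scoring method must be stated: self (self-scoring) / cross (cross-scoring) / external (independent agent)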
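
The three rules above lend themselves to a quick pre-flight check on the manifest. The sketch below is illustrative only: `check_manifest` is a hypothetical helper, the scoring method is passed separately because the manifest shown has no field for it, and it assumes graded dimensions are the rubric values starting with `0-3` while the compliance item starts with `0-1`.

```python
SCORING_METHODS = {"self", "cross", "external"}
MIN_SAMPLES = 10  # threshold taken from the rules


def check_manifest(manifest: dict, scoring_method: str) -> list[str]:
    """Return a list of notes for a manifest dict (hypothetical helper)."""
    notes = []
    if manifest.get("n_per_group", 0) < MIN_SAMPLES:
        notes.append("directional signal, not statistically significant")
    rubric = manifest.get("rubric", {})
    # Assumption: graded dimensions start with "0-3", the compliance item with "0-1".
    graded = [k for k, v in rubric.items() if v.startswith("0-3")]
    compliance = [k for k, v in rubric.items() if v.startswith("0-1")]
    if len(graded) != 3 or len(compliance) != 1:
        notes.append("rubric should have 3 graded dimensions + 1 compliance item")
    if scoring_method not in SCORING_METHODS:
        notes.append(f"unknown scoring method: {scoring_method!r}")
    return notes


# Dimension names below are made up for illustration.
manifest = {
    "experiment_id": "hyp-XXX",
    "n_per_group": 8,
    "rubric": {
        "accuracy": "0-3: ...",
        "clarity": "0-3: ...",
        "depth": "0-3: ...",
        "length_ok": "0-1: ...",
    },
}
notes = check_manifest(manifest, "cross")
print(notes)
```

With `n_per_group` set to 8, the only note returned is the directional-signal label, which matches the first rule.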

Step 3: Variant design

Define exactly what each variant receives:

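Variant A: Control group
- prompt: <full control prompt>
- sample tasks: <3 representative tasks>

Variant B: Treatment group
- prompt: <treatment prompt that changes only one variable>
- sample tasks: <the exact same 3 tasks as A>

Iron rule: change only one variable at a time. Everything else (tasks, temperature, max_tokens) must be identical.

Quality variance: the tasks within each variant should vary in difficulty so that the outputs span high, medium, and low scores.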
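
The one-variable rule can be checked mechanically when each variant is written down as a config dict. This is a sketch under that assumption; `changed_keys` and `assert_single_variable` are hypothetical helpers, and the prompts and task names are placeholders.

```python
def changed_keys(control: dict, treatment: dict) -> list[str]:
    """Return the sorted keys whose values differ between the two variant configs."""
    missing = object()  # sentinel so a key absent on one side counts as a difference
    keys = set(control) | set(treatment)
    return sorted(k for k in keys if control.get(k, missing) != treatment.get(k, missing))


def assert_single_variable(control: dict, treatment: dict) -> str:
    """Raise unless exactly one field differs; return the name of that field."""
    diff = changed_keys(control, treatment)
    if len(diff) != 1:
        raise ValueError(f"expected exactly 1 changed field, got {diff}")
    return diff[0]


variant_a = {
    "prompt": "Summarize the text.",
    "tasks": ("task-1", "task-2", "task-3"),
    "temperature": 0.7,
    "max_tokens": 512,
}
variant_b = {**variant_a, "prompt": "Summarize the text in plain language."}
print(assert_single_variable(variant_a, variant_b))  # prints "prompt"
```

Note that this only verifies that a single config field changed. Whether the treatment prompt itself changes just one thing still needs a human read.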


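Step 4: Execution

Parallel subagent template

Spawn N subagents (one per group, or one per task); each task includes:
1. Experiment ID + variant label
2. The full rubric (so the agent knows the scoring criteria)
3. Task description
4. Execution instructions: generate output → score against the rubric → record self_score + reasoning
5. Output format: {id, variant, output, self_score, reasoning}

Concurrency limit: run at most 3 subagents concurrently per experiment to avoid 429 rate limiting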
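
One common way to enforce a cap of 3 concurrent workers is a semaphore. The sketch below uses Python's `asyncio.Semaphore`; `run_subagent` is a stand-in that only sleeps, since the source does not say how the platform actually spawns subagents.

```python
import asyncio

MAX_CONCURRENT = 3  # concurrency cap from the rule above


async def run_subagent(task: dict) -> dict:
    """Placeholder for the platform's real spawn call, which the source does not specify."""
    await asyncio.sleep(0.01)  # stands in for the model call
    return {"id": task["id"], "variant": task["variant"], "output": "...",
            "self_score": None, "reasoning": ""}


async def run_all(tasks: list[dict]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(task: dict) -> dict:
        async with sem:  # at most MAX_CONCURRENT tasks are inside this block at once
            return await run_subagent(task)

    # gather returns results in the same order as the input tasks
    return await asyncio.gather(*(guarded(t) for t in tasks))


tasks = [{"id": i, "variant": "A" if i % 2 == 0 else "B"} for i in range(6)]
results = asyncio.run(run_all(tasks))
print([r["id"] for r in results])  # [0, 1, 2, 3, 4, 5]
```

All six tasks are scheduled up front, but only three hold the semaphore at any moment; the rest wait their turn, which keeps the request rate bounded.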

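Cross-scoring (if self-scoring is used)

Once all self-scoring is complete:

Spawn 1 more subagent to do blind scoring:
- Input: anonymized versions of all outputs (A/B labels shuffled)
- Task: score each output against the rubric without knowing which variant it belongs to
- Output: {id, cross_score, reasoning}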
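
Blinding comes down to two operations: strip and shuffle before scoring, then map the scores back afterwards. The following is a sketch, assuming each result is a dict with `id`, `variant`, and `output`; the `anonymize` and `unblind` helpers and the `item-NNN` ID scheme are made up for illustration.

```python
import random


def anonymize(results: list[dict], seed: int = 0) -> tuple[list[dict], dict[str, dict]]:
    """Shuffle outputs and replace id/variant with opaque blind IDs.

    Returns (blind_items, key). The key maps blind IDs back to the originals
    and must not be shown to the scoring subagent.
    """
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    shuffled = results[:]
    rng.shuffle(shuffled)
    blind_items, key = [], {}
    for i, r in enumerate(shuffled):
        blind_id = f"item-{i:03d}"
        blind_items.append({"id": blind_id, "output": r["output"]})
        key[blind_id] = {"id": r["id"], "variant": r["variant"]}
    return blind_items, key


def unblind(scores: list[dict], key: dict[str, dict]) -> list[dict]:
    """Re-attach the original id and variant to each {id, cross_score, reasoning} record."""
    return [{**key[s["id"]], "cross_score": s["cross_score"], "reasoning": s["reasoning"]}
            for s in scores]


raw = [
    {"id": "a1", "variant": "A", "output": "first output", "self_score": 8},
    {"id": "b1", "variant": "B", "output": "second output", "self_score": 7},
]
blind, key = anonymize(raw)
print(blind)  # only blind IDs and output text reach the scorer
```

The self-scores are dropped along with the labels, so the blind scorer cannot anchor on them.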

Data collection

Aggregate all results into memory/experiments/auto-ab-results.json

{
  "experiment_id": "hyp-XXX",
  "executed_at": "...",
  "variant_a": { "label": "...", "samples": N, "avg_score": X.X },
  "variant_b": { "label": "...", "samples": N, "avg_score": X.X },
  "winner": "A|B|none",
  "conclusion": "..."
}
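
The per-variant fields in this structure can be derived from the individual sample records. A minimal sketch follows, assuming each record carries a `variant` label and a single numeric `score`; the `summarize` helper and the sample values are hypothetical, and `winner` and `conclusion` are left as placeholders for the analysis step.

```python
import json
from statistics import mean


def summarize(experiment_id: str, executed_at: str, records: list[dict], labels: dict) -> dict:
    """Build the results structure from per-sample records (hypothetical helper)."""
    by_variant = {"A": [], "B": []}
    for r in records:
        by_variant[r["variant"]].append(r["score"])

    def block(v: str) -> dict:
        scores = by_variant[v]
        return {"label": labels[v], "samples": len(scores),
                "avg_score": round(mean(scores), 1) if scores else None}

    return {
        "experiment_id": experiment_id,
        "executed_at": executed_at,
        "variant_a": block("A"),
        "variant_b": block("B"),
        # winner and conclusion are filled in after the analysis step
        "winner": "none",
        "conclusion": "",
    }


# Illustrative records only, not real experiment data.
records = [
    {"variant": "A", "score": 7}, {"variant": "A", "score": 8},
    {"variant": "B", "score": 6}, {"variant": "B", "score": 7},
]
summary = summarize("hyp-XXX", "2024-01-01T00:00:00Z", records,
                    {"A": "control", "B": "treatment"})
print(json.dumps(summary, indent=2))
```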

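Step 5: Analysis

Judge based on the results:

| Condition | Conclusion |
| --- | --- |
| A significantly better than B (large effect size) | confirmed |
| B significantly better than A | refuted (hypothesis direction was wrong) |
| Holds on some dimensions only | partially_confirmed |
| No difference, sample size sufficient | inconclusive |
| Fewer than 10 samples | insufficient_sample |

Compute the mean difference and the direction of the effect size, then state a specific conclusion.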
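
A sketch of how the mean difference and effect size could drive the verdict. Several things here are assumptions rather than the skill's own method: Cohen's d with a pooled standard deviation is swapped in as the effect-size measure, 0.8 (Cohen's conventional "large" threshold) is used as the cutoff because the source names none, and the hypothesis is taken to be "A outperforms B". The `partially_confirmed` case needs per-dimension comparisons and is not covered.

```python
from math import sqrt
from statistics import mean, stdev

MIN_SAMPLES = 10    # from the sample-size rule
LARGE_EFFECT = 0.8  # assumption: conventional "large" threshold for Cohen's d


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with pooled standard deviation; positive means A scored higher.

    Requires at least 2 samples per group.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled_var == 0:
        if diff == 0:
            return 0.0
        return float("inf") if diff > 0 else float("-inf")
    return diff / sqrt(pooled_var)


def verdict(a: list[float], b: list[float]) -> str:
    """Map overall scores for A and B to one of the conclusion labels."""
    if min(len(a), len(b)) < MIN_SAMPLES:
        return "insufficient_sample"
    d = cohens_d(a, b)
    if d >= LARGE_EFFECT:
        return "confirmed"
    if d <= -LARGE_EFFECT:
        return "refuted"
    return "inconclusive"


# Illustrative scores only, not real experiment data.
scores_a = [8, 7, 9, 8, 7, 8, 9, 8, 7, 8]
scores_b = [6, 5, 6, 7, 5, 6, 6, 5, 7, 6]
print(mean(scores_a) - mean(scores_b), verdict(scores_a, scores_b))
```

The sample-size check runs first, so a striking difference on 5 samples per group is still reported as `insufficient_sample` rather than as a win.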


Step 6: Archive

Write to the hypotheses registry

memory/experiments/auto-ab-hypotheses.json: append or update the corresponding hyp entry
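
"Append or update" is an upsert keyed on the experiment ID. The sketch below assumes the registry file is a JSON list of entries that each carry an `experiment_id`; the source does not show the registry schema, so that layout, the `status` field, and the `upsert_hypothesis` helper are all hypothetical. The demo writes to a temporary directory rather than the real path.

```python
import json
import tempfile
from pathlib import Path


def upsert_hypothesis(registry_path: Path, entry: dict) -> list[dict]:
    """Append the entry, or update the existing one with the same experiment_id."""
    if registry_path.exists():
        entries = json.loads(registry_path.read_text(encoding="utf-8"))
    else:
        entries = []
    for i, existing in enumerate(entries):
        if existing.get("experiment_id") == entry["experiment_id"]:
            entries[i] = {**existing, **entry}  # update: new fields overwrite old ones
            break
    else:
        entries.append(entry)  # no match found: append
    registry_path.parent.mkdir(parents=True, exist_ok=True)
    registry_path.write_text(json.dumps(entries, ensure_ascii=False, indent=2),
                             encoding="utf-8")
    return entries


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "experiments" / "auto-ab-hypotheses.json"
    upsert_hypothesis(path, {"experiment_id": "hyp-XXX", "status": "running"})
    final = upsert_hypothesis(path, {"experiment_id": "hyp-XXX", "status": "confirmed"})
print(final)
```

Running the same experiment twice leaves one entry with the latest status instead of two competing entries.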

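Write the detailed analysis

memory/experiments/hyp-XXX.md: include the full rubric, sample statistics, conclusion, and suggestions for follow-up experiments

Update the template

memory/experiments/AB-test-design-template.md

  • New findings → append to Section 10 "Core findings"
  • New pitfalls → append to Section 11 "Known pitfalls"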

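Step 7: Report back to Cooper

Concise report format:

Experiment hyp-XXX | <domain>
Hypothesis: <one-sentence hypothesis>
Result: A wins / B wins / no difference
Key finding: <one sentence>
Conclusion: <whether Cooper should change the current practice>
Next step: <if continuing, what comes next>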
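
The six-line report can be produced from a summary dict with a small formatter. This is a sketch with English field labels; `format_report`, the dict keys, and the example values are all hypothetical.

```python
RESULT_TEXT = {"A": "A wins", "B": "B wins", "none": "no difference"}


def format_report(summary: dict) -> str:
    """Render the concise six-line report from a summary dict (hypothetical keys)."""
    lines = [
        f"Experiment {summary['experiment_id']} | {summary['domain']}",
        f"Hypothesis: {summary['hypothesis']}",
        f"Result: {RESULT_TEXT[summary['winner']]}",
        f"Key finding: {summary['finding']}",
        f"Conclusion: {summary['recommendation']}",
        f"Next step: {summary['next_step']}",
    ]
    return "\n".join(lines)


# Placeholder values for illustration, not real findings.
report = format_report({
    "experiment_id": "hyp-XXX",
    "domain": "prompt",
    "hypothesis": "Natural language yields higher output quality",
    "winner": "A",
    "finding": "<one sentence>",
    "recommendation": "<whether to change the current practice>",
    "next_step": "<follow-up experiment, if any>",
})
print(report)
```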

Key configuration

  • Data file: memory/experiments/auto-ab-results.json
  • Hypotheses registry: memory/experiments/auto-ab-hypotheses.json
  • Template: memory/experiments/AB-test-design-template.md
  • Pitfalls: memory/experiments/AB-test-design-template.md Section 11

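Known pitfalls (read before running)

  1. API token=0: run a health check before executing; retry immediately on failure
  2. Self-score inflation: do not rely on self_score as the only metric; calibrate with cross_score
  3. Iteration inflection point: quality declines after more than 3 rounds of iteration; flag this in the report
  4. Output non-determinism: use at least 10 samples per group to offset randomness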
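
For the self-score inflation pitfall, one simple diagnostic is the average gap between self and cross scores on the same outputs. The sketch below assumes the two scores have already been joined into one record per output; `score_inflation` is a hypothetical helper and the numbers are illustrative.

```python
from statistics import mean


def score_inflation(records: list[dict]) -> float:
    """Mean of (self_score - cross_score); positive means self-scores run higher."""
    gaps = [r["self_score"] - r["cross_score"] for r in records
            if r.get("self_score") is not None and r.get("cross_score") is not None]
    if not gaps:
        raise ValueError("no records with both self_score and cross_score")
    return mean(gaps)


# Illustrative scores only, not real experiment data.
joined = [
    {"id": "a1", "self_score": 9, "cross_score": 7},
    {"id": "a2", "self_score": 8, "cross_score": 7},
    {"id": "b1", "self_score": 9, "cross_score": 8},
]
print(round(score_inflation(joined), 2))
```

Computing the gap separately per variant would also show whether one variant's outputs are inflated more than the other's, which a pooled average hides.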

Based on hands-on experience from Batch 2 (6 experiments, 190 samples) and the Hypothesis series (5 hypotheses)

Safety recommendations
This skill appears to implement a sensible A/B testing workflow, but there are a few things to check before installing: (1) Confirm what the platform "memory/experiments" path maps to and whether you’re comfortable having experiment inputs/outputs (possibly sensitive text) persisted there. (2) Ask the author to declare any required config paths or environment variables — the doc references an "API token" health check and rate limits but metadata lists none; clarify which API/token would be used and whether the skill will use your account/credits. (3) Understand that the skill spawns subagents (concurrency up to 3) and performs automated scoring — this can consume model calls/credits and produce persistent artifacts. (4) If you need stricter data control, request that the skill explicitly expose config options for storage location, concurrency limits, and whether outputs are retained or redacted. If the author provides those clarifications, the skill would be coherent; until then treat it cautiously.
Functional analysis
Type: OpenClaw Skill
Name: ab-test-runner
Version: 1.0.0

The ab-test-runner skill is a well-structured tool designed for conducting A/B experiments on LLM prompts and behaviors. It follows a scientific workflow (hypothesis, design, execution, analysis) and uses subagents for parallel processing and cross-evaluation. All file operations are confined to a local 'memory/experiments/' directory for logging and archiving results, and no indicators of data exfiltration, malicious execution, or prompt injection were found in SKILL.md or _meta.json.
Capability assessment
Purpose & Capability
The name, description, and SKILL.md all consistently implement an A/B test workflow for prompts/agent outputs/content. That high-level purpose matches the instructions (design variants, run agents, score, archive). However, the SKILL.md expects the agent to write/read from memory/experiments/* files and to perform an "API token" health check, yet the skill metadata lists no required config paths or credentials. This mismatch is unexpected and should be clarified.
Instruction Scope
Runtime instructions direct the agent to spawn multiple subagents, perform self- and cross-scoring, and persist results to memory/experiments/auto-ab-*.json and .md files. The instructions explicitly reference system-like paths (memory/...) and instruct file writes/updates, but the skill metadata did not declare any config paths. The instructions also state an "API token=0" health check and rate-limit handling (avoid 429) without declaring which external API/token is involved. These gaps create ambiguity about what the agent will read/write and which credentials or external services will be used.
Install Mechanism
This is instruction-only with no install spec and no code files. That is low-risk from an installation perspective — nothing is downloaded or written by an installer step.
Credentials
Declared requirements list no environment variables or credentials, but the SKILL.md references an "API token" health check and implies calls that could be rate-limited (429). The skill will also persist potentially sensitive outputs into memory/experiments files. The lack of declared env/config requirements is disproportionate to the operational behaviors described and makes it unclear whether the skill will implicitly use the agent's LLM/API credentials or other tokens.
Persistence & Privilege
always:false (good). The skill instructs persistent storage to memory/experiments/* which grants it ongoing access to its own experiment history. That is reasonable for an experiment runner, but users should be aware that experiment inputs/outputs (potentially sensitive) will be written to persistent memory by default.
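How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install ab-test-runner
  3. After installation, invoke the Skill by name or trigger it with /ab-test-runner
  4. Provide the required inputs according to the Skill's parameter documentation to get structured output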
Version history
v1.0.0
Initial release of ab-test-runner: a structured workflow for A/B testing LLM prompts, agent behaviors, and content.
  • Guides users through hypothesis setup, variable selection, and metric definition.
  • Ensures standardized experiment design, including sample size rules and multi-dimensional rubrics.
  • Automates experiment execution via subagents, supports self/cross/external scoring.
  • Aggregates results, calculates effects, archives findings, and recommends next steps.
  • Includes detailed templates and pitfall documentation to improve testing consistency and result reliability.
Metadata
Slug ab-test-runner
Version 1.0.0
License MIT-0
Total installs 1
Current installs 1
Version count 1
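FAQ

What is Ab Test Runner?

Design and execute A/B testing experiments for LLM prompts, agent behaviors, and content production. Activate when user says "run an AB test", "design an exp... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 132 total downloads to date.

How do I install Ab Test Runner?

Run the command "/install ab-test-runner" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Ab Test Runner free?

Yes. Ab Test Runner is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Ab Test Runner support?

Ab Test Runner is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Ab Test Runner?

It is developed and maintained by demo112 (@demo112); the current version is v1.0.0.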
