Description

Multi-agent double-blind A/B testing workflow. Runs multi-round, double-blind controlled tests across multiple AI models/agents. Core roles: Coordinator, Contestants A/B, and Judge. Trigger phrases: "A/B test", "double-blind test", "compare AI models", "...

README (SKILL.md)


# A/B Test Agent Workflow

Name: AB Test Agent Workflow (EN)
Author: johnsmithfan

Multi-agent double-blind A/B testing workflow: the Coordinator orchestrates, the Contestants run in parallel, and the Judge scores blind.

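## When to Use

✅ Trigger this Skill when the user says any of the following:

- "A/B test"
- "double-blind test"
- "compare AI models"
- "model evaluation"
- "run a blind test"

❌ Not suitable for: single-model assessment, simple Q&A, or quick prototype verification.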

## Workflow Architecture

```
┌─────────────────────────────────────────────────────────┐
│                       Coordinator                       │
│  ① Receive the task and round configuration             │
│  ② Send the prompt to Contestant A                      │
│  ③ Send the prompt to Contestant B                      │
│  ④ Collect outputs → anonymize as "plan1"/"plan2"       │
│  ⑤ Send the anonymized plans to the Judge               │
│  ⑥ Collect scores → record the result                   │
│  ⑦ Repeat ②-⑥ for N rounds                              │
│  ⑧ Aggregate → reveal identities → structured report    │
└─────────────────────────────────────────────────────────┘
        ↓                    ↓                    ↓
  ┌──────────┐        ┌──────────┐        ┌──────────┐
  │Contestant│        │Contestant│        │  Judge   │
  │    A     │        │    B     │        │ (blind)  │
  └──────────┘        └──────────┘        └──────────┘
```

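## Role Definitions

### 1. Coordinator (main session)
- Receives user input (task, number of rounds, models under test, rubric)
- Dispatches sub-agents and collects their outputs
- Performs anonymization
- Aggregates results and outputs the final report

### 2. Contestants A / B
- Each receives the same task prompt
- Each generates its output independently
- Neither knows whom it is being compared against
- Spawned in isolation via `sessions_spawn` (`runtime=subagent`)

### 3. Judge
- Receives only "plan1" and "plan2", with no source information
- Scores each against the rubric
- Provides comments and a recommended winner
- Spawned in isolation via `sessions_spawn` (`runtime=subagent`)
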
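## Execution Methods

### Method 1: Pure AI coordination (recommended)
Run the workflow directly in this session; no script is required.

**Prompt template (sent to Contestant A, general task):**
```
You are Contestant A. Complete the following task. Output only the result; do not describe who you are and do not add any prefix:

[TASK]

Output format (follow strictly):
[CONTENT_A]
[your complete output]
[/CONTENT_A]
```

**Prompt template (sent to Contestant B, general task):**
```
You are Contestant B. Complete the following task. Output only the result; do not describe who you are and do not add any prefix:

[TASK]

Output format (follow strictly):
[CONTENT_B]
[your complete output]
[/CONTENT_B]
```
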
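**Prompt template (sent to Contestant A, code generation task):**
````
You are Contestant A. Complete the following task.

Task: [TASK]

⚠️ Important: output the complete code first, then the run result. The complete code must appear inside the [CONTENT_A] tags; even if you are about to time out, return the code first.

Output format (follow strictly):
[CONTENT_A]
Code:
```python
[your complete code]
```

Run result:
[run result, if any]
[/CONTENT_A]
````

**Prompt template (sent to Contestant B, code generation task):**
````
You are Contestant B. Complete the following task.

Task: [TASK]

⚠️ Important: output the complete code first, then the run result. The complete code must appear inside the [CONTENT_B] tags; even if you are about to time out, return the code first.

Output format (follow strictly):
[CONTENT_B]
Code:
```python
[your complete code]
```

Run result:
[run result, if any]
[/CONTENT_B]
````
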
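The contestant templates wrap each reply in `[CONTENT_A]` or `[CONTENT_B]` tags, so the Coordinator has to strip those tags before anonymizing. A minimal sketch of that step follows; the function name `extract_content` and its fallback behavior for missing tags are assumptions for illustration, not part of the skill.

```python
def extract_content(raw: str, label: str) -> str:
    """Return the text between [CONTENT_<label>] and [/CONTENT_<label>].

    Assumed fallbacks: if the opening tag is missing, return the whole
    trimmed reply; if only the closing tag is missing (for example the
    reply was cut off by a timeout), return everything after the opening tag.
    """
    open_tag = f"[CONTENT_{label}]"
    close_tag = f"[/CONTENT_{label}]"
    start = raw.find(open_tag)
    if start == -1:
        return raw.strip()
    start += len(open_tag)
    end = raw.find(close_tag, start)
    if end == -1:
        end = len(raw)
    return raw[start:end].strip()
```

Keeping the truncated-reply fallback matches the intent of the code-task templates, which ask contestants to return code first so that a cut-off reply is still usable.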
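**Prompt template (sent to the Judge):**
```
You are a strict and impartial evaluation expert. Score the following two anonymous plans.

Evaluation task: [TASK]

Scoring dimensions (10 points each):
1. Accuracy (is the answer correct?)
2. Completeness (does it cover all key points?)
3. Expression quality (is the language fluent and clear?)
4. Creativity/depth (does it offer original insight?)

plan1:
[SOLUTION_1]

plan2:
[SOLUTION_2]

Output format (follow strictly):
[SCORES]
plan1-Accuracy: X/10 (brief reason)
plan2-Accuracy: X/10 (brief reason)
plan1-Completeness: X/10 (brief reason)
plan2-Completeness: X/10 (brief reason)
plan1-Expression quality: X/10 (brief reason)
plan2-Expression quality: X/10 (brief reason)
plan1-Creativity/depth: X/10 (brief reason)
plan2-Creativity/depth: X/10 (brief reason)
[/SCORES]
[TOTAL_A]sum of the four plan1 scores[/TOTAL_A]
[TOTAL_B]sum of the four plan2 scores[/TOTAL_B]
[WINNER]plan1, plan2, or tie[/WINNER]
[COMMENT]overall comment (within 150 characters)[/COMMENT]
```
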
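The tagged output format above is meant to be machine-readable. The sketch below shows one way a Coordinator might parse it; `read_tag` and `parse_judge_reply` are hypothetical names, and recomputing the totals from the score lines (instead of trusting `[TOTAL_A]`/`[TOTAL_B]`) is a design assumption rather than documented behavior of `judge_prompts.py`.

```python
import re
from typing import Optional

# Matches lines such as "plan1-Accuracy: 9/10 (reason)"; the dimension name is free text.
SCORE_LINE = re.compile(
    r"^(plan[12])-(.+?)\s*[:：]\s*(\d+(?:\.\d+)?)\s*/\s*10", re.MULTILINE
)


def read_tag(text: str, name: str) -> Optional[str]:
    """Return the stripped text inside [name]...[/name], or None if absent."""
    match = re.search(rf"\[{name}\](.*?)\[/{name}\]", text, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_judge_reply(text: str) -> dict:
    """Parse a Judge reply into per-plan scores, totals, winner, and comment."""
    block = read_tag(text, "SCORES") or ""
    scores = {"plan1": {}, "plan2": {}}
    for plan, dimension, value in SCORE_LINE.findall(block):
        scores[plan][dimension.strip()] = float(value)
    # Assumption: recompute totals from the parsed lines to catch arithmetic slips.
    totals = {plan: sum(dims.values()) for plan, dims in scores.items()}
    return {
        "scores": scores,
        "totals": totals,
        "winner": read_tag(text, "WINNER"),
        "comment": read_tag(text, "COMMENT"),
    }
```

Because the dimension name is captured as free text, the same parser works with a custom rubric as long as the `planN-<dimension>: X/10` line shape is kept.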
### Method 2: Script-driven
```
python scripts/runner.py --prompt "Write a poem about spring" --rounds 3 --model-a claude-sonnet-4 --model-b gpt-4o
```

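## Execution Process in Detail

### Step 1: Receive configuration
```
User input:
  - Task prompt
  - Number of test rounds (default 3)
  - Scoring dimensions (custom rubric allowed)
  - Optional: models under test
```

### Step 2: Double-blind dispatch
```
Round N:
  → Send the prompt to Contestant A (A's own version)
  → Send the prompt to Contestant B (B's own version)
  Wait for both in parallel; neither knows the other exists
```
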
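The parallel dispatch in Step 2 can be sketched with a thread pool. This is an illustration only: `spawn` is a hypothetical callable standing in for whatever starts an isolated sub-agent session (the skill names `sessions_spawn`, whose actual signature is not shown in this document), and `build_prompt` paraphrases the general-task template.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def build_prompt(task: str, label: str) -> str:
    """Fill the general-task template for one contestant (label is 'A' or 'B')."""
    return (
        f"You are Contestant {label}. Complete the following task. "
        "Output only the result; do not describe who you are "
        "and do not add any prefix:\n\n"
        f"{task}\n\n"
        "Output format (follow strictly):\n"
        f"[CONTENT_{label}]\n[your complete output]\n[/CONTENT_{label}]"
    )


def dispatch_round(task: str, spawn: Callable[[str, str], str]) -> Dict[str, str]:
    """Send each contestant its own prompt in parallel and collect raw replies.

    `spawn(label, prompt)` is a hypothetical stand-in that returns the
    sub-agent's reply as a string.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            label: pool.submit(spawn, label, build_prompt(task, label))
            for label in ("A", "B")
        }
        # result() blocks until that contestant has answered.
        return {label: future.result() for label, future in futures.items()}
```

Each prompt mentions only its own contestant label, so neither sub-agent learns that a second contestant exists.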
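### Step 3: Anonymization
```
Collect A's output → label it S1
Collect B's output → label it S2
Randomize the order in which S1 and S2 are shown to the Judge (prevents order bias)
→ Send to the Judge
```
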
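The randomized presentation order in Step 3 might look like the sketch below. The function name `assign_blind_labels` and the shape of the returned mapping are assumptions; the point is that the Coordinator keeps a private key from blind label to contestant and never sends it to the Judge.

```python
import random
from typing import Dict, Optional, Tuple


def assign_blind_labels(
    output_a: str, output_b: str, rng: Optional[random.Random] = None
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Randomly map the two outputs to 'plan1'/'plan2'.

    Returns (plans, key). `plans` is what the Judge sees; `key` maps each
    blind label back to 'contestant_a' or 'contestant_b' and stays with
    the Coordinator until the final report.
    """
    rng = rng or random.Random()
    entries = [("contestant_a", output_a), ("contestant_b", output_b)]
    rng.shuffle(entries)  # a fresh shuffle every round prevents order bias
    plans = {f"plan{i}": text for i, (_, text) in enumerate(entries, start=1)}
    key = {f"plan{i}": who for i, (who, _) in enumerate(entries, start=1)}
    return plans, key
```

Passing a seeded `random.Random` makes a run reproducible for debugging, while the default draws a new order each time.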
### Step 4: Blind evaluation
```
The Judge receives S1 and S2 (no source information)
Scores each item against the rubric
Outputs scores + comments + winner
```

### Step 5: Record results
```
Round N result:
  S1 = [A's output]
  S2 = [B's output]
  Judge scores: S1=X, S2=Y
  Winner: Z
```

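Because the Judge only ever names `plan1` or `plan2`, the Coordinator has to translate the verdict back to contestants when recording a round. A minimal sketch, assuming a hypothetical `key` mapping from blind label to contestant and a hypothetical record layout:

```python
from typing import Dict


def record_round(
    round_no: int, key: Dict[str, str], totals: Dict[str, float], winner: str
) -> dict:
    """Translate a Judge verdict on plan1/plan2 back to contestant names.

    `key` maps blind labels to contestants, for example
    {"plan1": "contestant_b", "plan2": "contestant_a"}.
    `winner` is "plan1" or "plan2"; any other value is recorded as a tie.
    """
    scores = {key[plan]: total for plan, total in totals.items()}
    return {
        "round": round_no,
        "scores": scores,
        "winner": key.get(winner, "tie"),
    }
```

Doing this translation only inside the Coordinator keeps the Judge's view free of contestant identities.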
### Step 6: Summary
```
After all rounds are complete:
  - Aggregate the scores from each round
  - Compute win rates
  - Reveal identities
  - Output the final report
```

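The aggregation in Step 6 is simple arithmetic over the per-round records. The sketch below assumes each record has a `scores` dictionary and a `winner` field keyed by `contestant_a`/`contestant_b`; it also assumes the overall winner is decided by number of rounds won, which the document does not specify.

```python
from typing import List


def summarize(rounds: List[dict]) -> dict:
    """Aggregate per-round records into totals, win counts, and win rates."""
    names = ("contestant_a", "contestant_b")
    totals = {n: sum(r["scores"][n] for r in rounds) for n in names}
    wins = {n: sum(1 for r in rounds if r["winner"] == n) for n in names}
    ties = sum(1 for r in rounds if r["winner"] not in names)
    n_rounds = len(rounds)
    win_rate = {n: (wins[n] / n_rounds if n_rounds else 0.0) for n in names}
    # Assumption: the overall winner is whoever won more rounds.
    if wins["contestant_a"] == wins["contestant_b"]:
        overall = "tie"
    else:
        overall = max(names, key=lambda n: wins[n])
    return {
        "totals": totals,
        "wins": wins,
        "ties": ties,
        "win_rate": win_rate,
        "winner": overall,
    }
```

With three rounds scored 33/33, 25/27, and 25/20 and won by A, B, and A, this yields totals of 83 and 80 and a 2 to 1 win count.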
## Result Report Template

```json
{
  "test_summary": {
    "task": "...",
    "rounds": 3,
    "contestant_a": "Model A / Agent A",
    "contestant_b": "Model B / Agent B",
    "rubric": ["Accuracy", "Completeness", "Expression quality", "Creativity/depth"]
  },
  "rounds": [
    {
      "round": 1,
      "contestant_a_output": "...",
      "contestant_b_output": "...",
      "judge_scores": {
        "contestant_a": [9, 8, 9, 7],
        "contestant_b": [8, 9, 8, 8]
      },
      "winner": "contestant_a",
      "judge_comment": "..."
    }
  ],
  "final_result": {
    "total_score_a": 83,
    "total_score_b": 80,
    "wins_a": 2,
    "wins_b": 1,
    "winner": "Model A",
    "confidence": "Medium (2 wins to 1; consider adding rounds)"
  }
}
```

## File Structure

```
ab-test-agent-workflow/
├── SKILL.md                    ← this file (workflow description)
├── scripts/
│   ├── runner.py               ← multi-round driver engine + self-test mode
│   ├── judge_prompts.py        ← Judge prompt building + parsing
│   └── anonymizer.py           ← anonymization tool (filters identity markers)
└── references/
    ├── rubric_templates.md     ← scoring templates per task type
    └── workflow_guide.md       ← detailed step-by-step execution guide
```

## Self-Test Commands

```bash
# Self-test mode (no subagent needed; verifies the workflow logic)
python scripts/runner.py --test --rounds 3

# Preview the prompts (without actually executing)
python scripts/runner.py --prompt "Write a poem about spring" --skip-spawn
```

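For readers who need to write their own driver, the flags used in this document's commands could be declared with `argparse` roughly as follows. This is a hypothetical reconstruction from the command lines alone, not the actual `runner.py`; in particular, requiring `--prompt` unless `--test` is given is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags that appear in this document's example commands."""
    parser = argparse.ArgumentParser(prog="runner.py")
    parser.add_argument("--prompt", help="task prompt sent to both contestants")
    parser.add_argument("--rounds", type=int, default=3, help="number of test rounds")
    parser.add_argument("--model-a", help="model for Contestant A")
    parser.add_argument("--model-b", help="model for Contestant B")
    parser.add_argument("--test", action="store_true",
                        help="self-test mode, no sub-agents")
    parser.add_argument("--skip-spawn", action="store_true",
                        help="preview prompts without executing")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv; assumed rule: --prompt is required unless --test is set."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.test and not args.prompt:
        parser.error("--prompt is required unless --test is given")
    return args
```

The default of 3 rounds mirrors the default stated in Step 1.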
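## Rubric Template Quick Reference

| Task type | Recommended scoring dimensions |
|-----------|--------------------------------|
| Writing/copywriting | Accuracy, completeness, expression, creativity |
| Code generation | Correctness, readability, efficiency, security |
| Logical reasoning | Accuracy, depth of reasoning, clarity of explanation |
| Knowledge Q&A | Accuracy, completeness, credibility |
| Creative writing | Originality, literary quality, thematic fit |
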
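A rubric table like this can drive the Judge prompt directly. The sketch below copies the dimension lists from the table; the dictionary keys and function names are hypothetical. It also shows why the maximum total depends on the rubric: a three-dimension rubric tops out at 30, not 40.

```python
from typing import Dict, List

# Dimension lists copied from the quick-reference table; the keys are hypothetical.
RUBRICS: Dict[str, List[str]] = {
    "writing": ["Accuracy", "Completeness", "Expression", "Creativity"],
    "code_generation": ["Correctness", "Readability", "Efficiency", "Security"],
    "logical_reasoning": ["Accuracy", "Depth of reasoning", "Clarity of explanation"],
    "knowledge_qa": ["Accuracy", "Completeness", "Credibility"],
    "creative_writing": ["Originality", "Literary quality", "Thematic fit"],
}


def dimension_lines(task_type: str) -> str:
    """Numbered dimension list for the Judge prompt."""
    return "\n".join(
        f"{i}. {name}" for i, name in enumerate(RUBRICS[task_type], start=1)
    )


def score_lines(task_type: str) -> str:
    """One 'planN-<dimension>: X/10' line per plan and dimension, inside [SCORES] tags."""
    rows = [
        f"plan{plan}-{name}: X/10 (brief reason)"
        for name in RUBRICS[task_type]
        for plan in (1, 2)
    ]
    return "[SCORES]\n" + "\n".join(rows) + "\n[/SCORES]"


def max_total(task_type: str) -> int:
    """Highest possible total when every dimension is scored out of 10."""
    return 10 * len(RUBRICS[task_type])
```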
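## Known Issues and Handling Tips

### Timeout handling
- **Symptom**: near the 57s timeout limit, a sub-agent may output only run logs and never return the complete code.
- **Fix**: in code-task prompts, explicitly require "**output the complete code first, then the run result**", so that the code is returned first even on timeout.
- **Timeout retry**: if the Judge produces no output within 60s, spawn a new Judge session.
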
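The Judge retry rule can be sketched as a small wrapper. Here `spawn_judge(prompt, timeout_s)` is a hypothetical callable that starts a fresh session on every call and raises `TimeoutError` when the deadline passes; the limit of two attempts is also an assumption, since the document only says a new session can be spawned.

```python
from typing import Callable


def judge_with_retry(
    spawn_judge: Callable[[str, float], str],
    prompt: str,
    timeout_s: float = 60.0,
    max_attempts: int = 2,
) -> str:
    """Call a fresh Judge session, retrying when it times out or returns nothing."""
    last_error = None
    for _ in range(max_attempts):
        try:
            reply = spawn_judge(prompt, timeout_s)
        except TimeoutError as exc:
            last_error = exc
            continue  # spawn a new session on the next iteration
        if reply and reply.strip():
            return reply
    raise RuntimeError(
        f"Judge produced no output after {max_attempts} attempts"
    ) from last_error
```

Treating an empty reply the same as a timeout covers the case where the session ends on time but returns no text.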
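### Anonymization risk
- If an output contains a contestant's name (e.g. "As Claude") or an explicit signature, the Judge can easily guess the source.
- **Fix**: preprocess outputs with `scripts/anonymizer.py` to remove identity markers (Claude/GPT/Gemini/Contestant A/Contestant B, etc.).
- State explicitly in the Judge prompt: "You do not know which contestant plan1 comes from."
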
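Since `scripts/anonymizer.py` is described but its contents are not shown here, the following is only a naive sketch of what such a filter might do: replace a fixed list of identity terms with a neutral placeholder. The term list comes from the bullet above; the function name and placeholder are assumptions.

```python
import re

# Terms from the list above, plus the original Chinese contestant labels.
IDENTITY_TERMS = [
    "Claude", "GPT", "Gemini",
    "Contestant A", "Contestant B",
    "参赛者A", "参赛者B",
]

# Longest terms first so a longer marker is never shadowed by a shorter one.
_PATTERN = re.compile(
    "|".join(re.escape(t) for t in sorted(IDENTITY_TERMS, key=len, reverse=True)),
    re.IGNORECASE,
)


def scrub_identity(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace known identity markers with a neutral placeholder."""
    return _PATTERN.sub(placeholder, text)
```

A plain substring filter like this leaves version suffixes (for example the `-4o` in a model name) and paraphrased self-references in place, so treat it as a first pass rather than a guarantee of blindness.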
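### Score parsing failure
- If the Judge's output does not follow the format (e.g. the `[SCORES]` tags are missing), the parser falls back to smart extraction.
- **Recommendation**: use `[SCORES]...[/SCORES]` in the Judge prompt to constrain the output format strictly.
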
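The document does not say how the "smart extraction" fallback works, so the sketch below is one plausible approach rather than the actual parser: prefer the `[SCORES]` block when it exists, otherwise scan the whole reply for lines that mention a plan number followed by an `X/10` score. All names are hypothetical.

```python
import re
from typing import Dict, List

STRICT = re.compile(r"\[SCORES\](.*?)\[/SCORES\]", re.DOTALL)
# Loose pattern: "plan 1 ... 7/10" on one line, case-insensitive.
LOOSE = re.compile(r"plan\s*([12])\b[^\n]*?(\d+(?:\.\d+)?)\s*/\s*10", re.IGNORECASE)


def extract_scores(reply: str) -> Dict[str, List[float]]:
    """Collect X/10 scores per plan, preferring the [SCORES] block when present."""
    match = STRICT.search(reply)
    # Fallback: with no tagged block, scan the entire reply instead.
    body = match.group(1) if match else reply
    scores: Dict[str, List[float]] = {"plan1": [], "plan2": []}
    for plan_no, value in LOOSE.findall(body):
        scores[f"plan{plan_no}"].append(float(value))
    return scores
```

An empty list for either plan is a useful signal that the reply could not be parsed and the Judge should be asked again.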
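### Same-model testing
- When both contestants use the same model (e.g. both are qclaw/modelroute), outputs are highly similar and the Judge tends to call a tie.
- This is normal and does not indicate a problem with the workflow.
- **Recommendation**: expect clear score gaps only when comparing different models.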

Usage Guidance

This skill's instructions describe scripts (runner.py, anonymizer.py) and subagent spawning, but those files are not included, so don't assume safe tooling is provided. Before using:

(1) Verify that your platform actually isolates subagents (sessions_spawn/runtime=subagent); if it does not, the "blind" labels may leak.
(2) Do not execute code produced by Contestant agents on your machine without sandboxing; the prompts explicitly ask for runnable code and results, which could be malicious or access local resources.
(3) If you intend to use script-driven mode, obtain or review runner.py and anonymizer.py from a trusted source and inspect them for data exfiltration, network calls, or unsafe file/system access.
(4) If you rely on anonymization, implement or verify an anonymizer that actually removes model/agent identifiers rather than trusting prompts alone.
(5) Consider testing this workflow in a restricted/sandbox environment, and add the missing scripts to the package (or remove the references) so the package contents match its documentation.

Capability Analysis

Type: OpenClaw Skill
Name: ab-test-agent-workflow
Version: 1.1.0-en2

The skill bundle defines a legitimate A/B testing workflow for comparing AI models using a coordinator-contestant-judge architecture. The instructions in SKILL.md are well-structured, focusing on task distribution, output anonymization, and objective scoring using sub-agents. No evidence of data exfiltration, malicious code execution, or harmful prompt injection was found in the provided documentation (SKILL.md) or metadata (_meta.json).

Capability Assessment

⚠ Purpose & Capability

The skill's name/description (A/B blind test orchestration) matches the prompts and workflow. However the SKILL.md lists a repository layout including scripts/runner.py and scripts/anonymizer.py whereas the package contains only SKILL.md (no code files). Claiming helper scripts that are not included is an incoherence: users might expect executable tooling but must implement or obtain it elsewhere.

ℹ Instruction Scope

Instructions stay within the A/B testing scope (coordination, anonymization, blind judging). They assume the ability to spawn isolated subagents (sessions_spawn / runtime=subagent) and to run code-generation tasks that return runnable code/results. Those assumptions are platform-dependent; without guaranteed subagent isolation or a provided anonymizer, the workflow could leak origin information or cause generated code to be run in an unsafe environment.

✓ Install Mechanism

No install spec or external downloads are present (instruction-only). This is low-risk from an install/execution supply-chain perspective — the main risk is behavioral (see instruction scope).

✓ Credentials

The skill requests no environment variables, no credentials, and no config paths. That is proportionate for a coordination/workflow instruction-only skill.

✓ Persistence & Privilege

The skill does not request always:true and is user-invocable only. It does not ask to modify other skills or system settings. Autonomous invocation is allowed (default) but not accompanied by other broad privileges.

Version History

v1.1.0-en2

Full body English translation

v1.1.0-en

English version

v1.1.0

**Summary:** Version 1.1.0 introduces an extensive, clearly documented, multi-agent, double-blind A/B testing workflow for comparing multiple AI models/agents across multiple rounds.

- Full SKILL.md documentation added, detailing workflow architecture, roles, prompt templates, execution steps, and reporting format.
- Supports both in-chat (AI-coordinated) and script-driven execution methods.
- Provides standardized prompts for contestants and judges, including code generation and evaluation tasks.
- Includes anonymization tools and fallback strategies for format/parsing issues.
- Reference rubric templates and troubleshooting guidance provided for robust, repeatable A/B testing.

Metadata

Slug ab-test-agent-workflow

Version 1.1.0-en2

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 3

Frequently Asked Questions

What is AB Test Agent Workflow (EN)?

Multi-agent double-blind A/B testing workflow. Runs multi-round, double-blind controlled tests across multiple AI models/agents. Core roles: Coordinator, Contestants A/B, and Judge. Trigger phrases: "A/B test", "double-blind test", "compare AI models", "... It is an AI Agent Skill for Claude Code / OpenClaw, with 133 downloads so far.

How do I install AB Test Agent Workflow (EN)?

Run "/install ab-test-agent-workflow" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is AB Test Agent Workflow (EN) free?

Yes, AB Test Agent Workflow (EN) is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does AB Test Agent Workflow (EN) support?

AB Test Agent Workflow (EN) is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created AB Test Agent Workflow (EN)?

It is built and maintained by JohnSmithfan (@johnsmithfan); the current version is v1.1.0-en2.
