
AB Test Agent Workflow (EN)

by JohnSmithfan · GitHub ↗ · v1.1.0-en2 · MIT-0
cross-platform ⚠ suspicious
133 Downloads · 0 Stars · 1 Active Installs · 3 Versions
Install in OpenClaw
/install ab-test-agent-workflow
Description
Multi-agent, double-blind A/B testing workflow. Runs multi-round, double-blind controlled tests across multiple AI models/Agents. Core roles: Coordinator, Contestants A/B, and Judge. Trigger phrases: "A/B test", "double-blind test", "compare AI models", "...
README (SKILL.md)

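# A/B Test Agent Workflow

A multi-agent, double-blind A/B testing workflow: the Coordinator runs the test, the Contestants work in parallel, and the Judge scores blind.

## When to use

✅ Trigger this Skill when the user says any of the following:

- "A/B test"
- "double-blind test"
- "compare AI models"
- "model evaluation"
- "run a blind test"

❌ Not suitable for: single-model assessment, simple Q&A, or quick prototype verification.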

## Workflow architecture

```
┌─────────────────────────────────────────────────────────┐
│                       Coordinator                       │
│  ① Receive the task and round configuration            │
│  ② Send the prompt to Contestant A                      │
│  ③ Send the prompt to Contestant B                      │
│  ④ Collect outputs → anonymize as Solution 1/Solution 2 │
│  ⑤ Send the anonymous solutions to the Judge            │
│  ⑥ Collect scores → record the result                   │
│  ⑦ Repeat ②-⑥ for N rounds                              │
│  ⑧ Aggregate → reveal identities → structured report    │
└─────────────────────────────────────────────────────────┘
        ↓                    ↓                    ↓
  ┌──────────┐        ┌──────────┐        ┌──────────┐
  │Contestant│        │Contestant│        │  Judge   │
  │    A     │        │    B     │        │ (blind)  │
  └──────────┘        └──────────┘        └──────────┘
```
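
## Role definitions

### 1. Coordinator (main session)
- Receives user input (task, number of rounds, models under test, Rubric)
- Dispatches sub-agents and collects their output
- Performs anonymization
- Aggregates results and outputs the final report

### 2. Contestants (Contestant A / B)
- Each receives the same task Prompt (the templates differ only in the A/B label)
- Each generates its output independently
- Neither knows who it is being compared against
- Run in isolation via `sessions_spawn` (`runtime=subagent`)

### 3. Judge
- Receives only "Solution 1" and "Solution 2" (source unknown)
- Scores against the Rubric
- Provides comments and a recommended winner
- Runs in isolation via `sessions_spawn` (`runtime=subagent`)
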
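## Execution modes

### Mode 1: pure AI coordination (recommended)
Run the workflow directly in this session; no script is needed.

**Prompt template (sent to Contestant A, general task):**
```
You are Contestant A. Complete the following task. Output only the result; do not state who you are and do not add a prefix:

[TASK]

Output format (follow strictly):
[CONTENT_A]
[your complete output]
[/CONTENT_A]
```

**Prompt template (sent to Contestant B, general task):**
```
You are Contestant B. Complete the following task. Output only the result; do not state who you are and do not add a prefix:

[TASK]

Output format (follow strictly):
[CONTENT_B]
[your complete output]
[/CONTENT_B]
```
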
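**Prompt template (sent to Contestant A, code-generation task):**
````
You are Contestant A. Complete the following task.

Task: [TASK]

⚠️ Important: output the complete code first, then the run results. The code must appear in full inside the [CONTENT_A] tags; even on timeout, return the code first.

Output format (follow strictly):
[CONTENT_A]
【Code】
```python
[your complete code]
```

【Run results】
[run results, if any]
[/CONTENT_A]
````

**Prompt template (sent to Contestant B, code-generation task):**
````
You are Contestant B. Complete the following task.

Task: [TASK]

⚠️ Important: output the complete code first, then the run results. The code must appear in full inside the [CONTENT_B] tags; even on timeout, return the code first.

Output format (follow strictly):
[CONTENT_B]
【Code】
```python
[your complete code]
```

【Run results】
[run results, if any]
[/CONTENT_B]
````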
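
The Coordinator has to pull each Contestant's answer back out of the `[CONTENT_A]` / `[CONTENT_B]` wrapper before anonymizing it. A minimal sketch of that step follows; `extract_content` is a hypothetical helper name, not something shipped with this Skill, and the handling of a missing closing tag is an assumption motivated by the timeout requirement in the code-generation templates.

```python
def extract_content(raw: str, label: str) -> str:
    """Return the text between [CONTENT_<label>] and [/CONTENT_<label>].

    Assumed fallbacks (not specified by the Skill):
    - closing tag missing (reply cut off): return everything after the opening tag
    - opening tag missing: return the whole reply, stripped
    """
    open_tag = f"[CONTENT_{label}]"
    close_tag = f"[/CONTENT_{label}]"

    start = raw.find(open_tag)
    if start == -1:
        return raw.strip()

    start += len(open_tag)
    end = raw.find(close_tag, start)
    body = raw[start:] if end == -1 else raw[start:end]
    return body.strip()
```

Returning the partial body when the closing tag is absent keeps whatever code a timed-out Contestant managed to emit, which is the behavior the "code first" instruction is meant to protect.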
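
**Prompt template (sent to the Judge):**
```
You are a strict and impartial evaluation expert. Score the following two anonymous solutions.

Evaluation task: [TASK]

Scoring dimensions (10 points maximum per dimension):
1. Accuracy (is the answer correct)
2. Completeness (does it cover all key points)
3. Expression quality (is the language fluent and clear)
4. Creativity/depth (does it offer original insight)

Solution 1:
[SOLUTION_1]

Solution 2:
[SOLUTION_2]

Output format (follow strictly):
[SCORES]
Solution 1 - Accuracy: X/10 (brief reason)
Solution 2 - Accuracy: X/10 (brief reason)
Solution 1 - Completeness: X/10 (brief reason)
Solution 2 - Completeness: X/10 (brief reason)
Solution 1 - Expression quality: X/10 (brief reason)
Solution 2 - Expression quality: X/10 (brief reason)
Solution 1 - Creativity/depth: X/10 (brief reason)
Solution 2 - Creativity/depth: X/10 (brief reason)
[/SCORES]
[TOTAL_A]sum of the four scores[/TOTAL_A]
[TOTAL_B]sum of the four scores[/TOTAL_B]
[WINNER]Solution 1, Solution 2, or tie[/WINNER]
[COMMENT]overall comment (150 words or fewer)[/COMMENT]
```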
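
A Judge reply in this format can be parsed with a couple of regular expressions. The sketch below is illustrative only: `parse_judge_reply` and `tag_body` are hypothetical names, and the real `judge_prompts.py` is not available to compare against. It assumes each score line has the shape `<label>-<dimension>: <n>/10`, and it recomputes totals from the per-dimension scores instead of trusting `[TOTAL_A]` / `[TOTAL_B]`, since the template does not say which solution each total refers to.

```python
import re

# One score line: "<label> - <dimension>: <n>/10 ..." (ASCII or full-width colon).
SCORE_LINE = re.compile(
    r"^\s*(?P<label>.+?)\s*-\s*(?P<dim>.+?)\s*[::]\s*"
    r"(?P<score>\d+(?:\.\d+)?)\s*/\s*10"
)


def tag_body(text, tag):
    """Return the stripped text inside [tag]...[/tag], or None if absent."""
    m = re.search(rf"\[{tag}\](.*?)\[/{tag}\]", text, re.DOTALL)
    return m.group(1).strip() if m else None


def parse_judge_reply(reply):
    """Parse a Judge reply into scores, recomputed totals, winner, and comment."""
    scores = {}
    for line in (tag_body(reply, "SCORES") or "").splitlines():
        m = SCORE_LINE.match(line)
        if m:
            scores.setdefault(m["label"], {})[m["dim"]] = float(m["score"])
    return {
        "scores": scores,
        # Totals are summed here rather than read from the TOTAL tags.
        "totals": {label: sum(dims.values()) for label, dims in scores.items()},
        "winner": tag_body(reply, "WINNER"),
        "comment": tag_body(reply, "COMMENT"),
    }
```

An empty `scores` dictionary signals that the strict format was not followed, which is the point at which a caller would fall back to looser extraction or re-ask the Judge.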

### Mode 2: script-driven
```bash
python scripts/runner.py --prompt "Write a poem about spring" --rounds 3 --model-a claude-sonnet-4 --model-b gpt-4o
```

## Execution process in detail

### Step 1: receive the configuration
```
User input:
  - Task Prompt
  - Number of test rounds (default 3)
  - Scoring dimensions (custom Rubric allowed)
  - Optional: the models under test
```

### Step 2: double-blind dispatch
```
Round N:
  → Send the Prompt to Contestant A (A's version of the template)
  → Send the Prompt to Contestant B (B's version of the template)
  Wait for both in parallel; neither side knows the other exists
```
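
In script-driven mode, the parallel wait in this step could be expressed with a thread pool. This is a sketch under assumptions: `ask` is a hypothetical callable `(contestant_label, prompt) -> reply text` standing in for whatever actually spawns a subagent, and `dispatch_round` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch_round(prompt_a, prompt_b, ask):
    """Send both prompts concurrently and return (reply_a, reply_b).

    `ask` is a placeholder for the platform-specific subagent call.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(ask, "A", prompt_a)
        future_b = pool.submit(ask, "B", prompt_b)
        # result() blocks until each call finishes, so both run in parallel.
        return future_a.result(), future_b.result()
```

Each Contestant call receives only its own prompt, so nothing from the other side can reach it through this function.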

### Step 3: anonymization
```
Collect A's output → label it S1
Collect B's output → label it S2
Randomize the presentation order (guards against order bias)
→ Send to the Judge
```
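
One way to randomize the order while still being able to reveal identities later is to shuffle the pair and keep a private key on the Coordinator side. The sketch below is an assumption about how this could be done, not the contents of `anonymizer.py`; the slot names reuse the `[SOLUTION_1]` / `[SOLUTION_2]` placeholders from the Judge template.

```python
import random


def anonymize(output_a, output_b, rng=None):
    """Shuffle the two outputs into SOLUTION_1 / SOLUTION_2 slots.

    Returns (solutions, key): `solutions` goes to the Judge, `key` maps each
    slot back to its contestant and stays with the Coordinator.
    """
    rng = rng if rng is not None else random.Random()
    entries = [("contestant_a", output_a), ("contestant_b", output_b)]
    rng.shuffle(entries)

    solutions = {}
    key = {}
    for i, (who, text) in enumerate(entries, start=1):
        solutions[f"SOLUTION_{i}"] = text
        key[f"SOLUTION_{i}"] = who
    return solutions, key
```

Passing a seeded `random.Random` makes a run reproducible for debugging; leaving `rng` unset gives a fresh order each round.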
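
### Step 4: blind evaluation
```
The Judge receives S1 and S2 (no source information)
Scores each Rubric dimension
Outputs scores + comments + winner
```

### Step 5: record the result
```
Round N result:
  S1 = [A's output]
  S2 = [B's output]
  Judge scores: S1=X, S2=Y
  Winner: Z
```

### Step 6: aggregation
```
After all rounds complete:
  - Aggregate the scores from each round
  - Compute win rates
  - Reveal identities
  - Output the final report
```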

## Result report template

```json
{
  "test_summary": {
    "task": "...",
    "rounds": 3,
    "contestant_a": "Model A / Agent A",
    "contestant_b": "Model B / Agent B",
    "rubric": ["Accuracy", "Completeness", "Expression quality", "Creativity"]
  },
  "rounds": [
    {
      "round": 1,
      "contestant_a_output": "...",
      "contestant_b_output": "...",
      "judge_scores": {
        "contestant_a": [9, 8, 9, 7],
        "contestant_b": [8, 9, 8, 8]
      },
      "winner": "contestant_a",
      "judge_comment": "..."
    }
  ],
  "final_result": {
    "total_score_a": 83,
    "total_score_b": 80,
    "wins_a": 2,
    "wins_b": 1,
    "winner": "Model A",
    "confidence": "Medium (A won 2 rounds and B won 1; consider adding more rounds)"
  }
}
```
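
The numeric fields of `final_result` can be derived from the `rounds` array. The function below is a sketch that reuses the field names from the template; `summarize` itself and the `win_rate_*` fields are additions for illustration, and how ties or the `confidence` label should be decided is left open because the Skill does not specify it.

```python
def summarize(rounds):
    """Aggregate per-round judge scores into totals, win counts, and win rates."""
    total_a = sum(sum(r["judge_scores"]["contestant_a"]) for r in rounds)
    total_b = sum(sum(r["judge_scores"]["contestant_b"]) for r in rounds)
    wins_a = sum(1 for r in rounds if r["winner"] == "contestant_a")
    wins_b = sum(1 for r in rounds if r["winner"] == "contestant_b")
    n = len(rounds)
    return {
        "total_score_a": total_a,
        "total_score_b": total_b,
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": n - wins_a - wins_b,
        # Win rate is wins divided by rounds played; 0.0 when there are no rounds.
        "win_rate_a": wins_a / n if n else 0.0,
        "win_rate_b": wins_b / n if n else 0.0,
    }
```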
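
## File structure

```
ab-test-agent-workflow/
├── SKILL.md                    ← this file (workflow description)
├── scripts/
│   ├── runner.py               ← multi-round driver + self-test mode
│   ├── judge_prompts.py        ← Judge prompt building + parsing
│   └── anonymizer.py           ← anonymization tool (filters identity markers)
└── references/
    ├── rubric_templates.md     ← scoring templates per task type
    └── workflow_guide.md       ← detailed step-by-step execution guide
```

## Self-test commands

```bash
# Self-test mode (no subagent needed; verifies the workflow logic)
python scripts/runner.py --test --rounds 3

# Preview the Prompt (without actually executing)
python scripts/runner.py --prompt "Write a poem about spring" --skip-spawn
```

## Rubric template quick reference

| Task type | Recommended scoring dimensions |
|---------|------------|
| Writing/copywriting | Accuracy, completeness, expression, creativity |
| Code generation | Correctness, readability, efficiency, security |
| Logical reasoning | Accuracy, depth of reasoning, clarity of explanation |
| Knowledge Q&A | Accuracy, completeness, credibility |
| Creative writing | Originality, literary quality, fit to theme |
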
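## Known issues and handling tips

### Timeout handling
- **Symptom**: a sub-agent running up against the timeout (around 57 s) may output only run logs and never return the complete code.
- **Fix**: in the code-task Prompt, explicitly require "**output the complete code first, then the run results**", so the code is returned first even on timeout.
- **Timeout retry**: if the Judge produces no output within 60 s, spawn a new Judge session.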
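
The retry rule for the Judge could look like the following in a script. This is a sketch with assumptions: `spawn` is a hypothetical zero-argument callable that starts a fresh Judge session and returns its reply, and `call_with_retry` and `max_attempts` are illustrative names. Note that a Python thread cannot be killed, so a timed-out call is abandoned and may keep running in the background.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_retry(spawn, timeout_s=60.0, max_attempts=2):
    """Call spawn() with a time limit; on timeout, try again with a fresh call.

    Returns the first reply that arrives in time, or None if every attempt
    times out.
    """
    for _ in range(max_attempts):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(spawn)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            continue  # give up on this attempt and start a new one
        finally:
            # Do not block on the abandoned worker thread.
            pool.shutdown(wait=False)
    return None
```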
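
### Anonymization risk
- If an output contains the contestant's name (for example "As Claude") or an explicit signature, the Judge can easily guess the source.
- **Fix**: preprocess with `scripts/anonymizer.py` to remove identity markers (Claude, GPT, Gemini, Contestant A, Contestant B, and so on).
- State explicitly in the Judge prompt: "You do not know which contestant Solution 1 came from."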
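
Because `scripts/anonymizer.py` may not be available, here is a rough sketch of what such a preprocessing pass might do. The pattern list is an assumption built from the markers named above, `scrub_identity` is a hypothetical name, and simple pattern matching like this will miss indirect self-references, so it should not be relied on as the only safeguard.

```python
import re

# Assumed identity markers; extend for the models actually under test.
IDENTITY_PATTERNS = [
    r"Claude",
    r"GPT(?:-\w+(?:\.\w+)*)?",            # also covers forms like GPT-4o
    r"Gemini",
    r"Contestant\s*[AB](?![A-Za-z])",     # avoid matching "contestant answers"
    r"参赛者\s*[AB](?![A-Za-z])",
]
_IDENTITY_RE = re.compile("|".join(IDENTITY_PATTERNS), re.IGNORECASE)


def scrub_identity(text, placeholder="[REDACTED]"):
    """Replace every matched identity marker with a neutral placeholder."""
    return _IDENTITY_RE.sub(placeholder, text)
```

Replacing with a visible placeholder, not deleting, keeps sentences readable for the Judge and makes it obvious in the record that something was removed.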
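
### Score parsing failures
- If the Judge's output is malformed (for example, missing the `[SCORES]` tags), the parser falls back to heuristic extraction.
- **Recommendation**: constrain the output format strictly with `[SCORES]...[/SCORES]` in the Judge prompt.

### Same-model tests
- When both contestants use the same model (for example, both on qclaw/modelroute), the outputs are highly similar and the Judge tends to call a tie.
- This is expected and does not indicate a problem with the workflow.
- **Recommendation**: scores separate clearly only when the models being compared differ.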
Usage Guidance
This skill's instructions describe scripts (runner.py, anonymizer.py) and subagent spawning, but those files are not included, so don't assume safe tooling is provided. Before using:
  1. Verify your platform actually isolates subagents (sessions_spawn/runtime=subagent); if not, the 'blind' labels may leak.
  2. Do not execute code produced by Contestant agents on your machine without sandboxing; the prompts explicitly ask for runnable code and results, which could be malicious or access local resources.
  3. If you intend to use script-driven mode, obtain or review runner.py and anonymizer.py from a trusted source and inspect them for data exfiltration, network calls, or unsafe file/system access.
  4. If you rely on anonymization, implement or verify an anonymizer that actually removes model/agent identifiers rather than trusting prompts alone.
  5. Consider testing this workflow in a restricted/sandbox environment, and add the missing scripts to the package (or remove the references) so the package contents match its documentation.
Capability Analysis
Type: OpenClaw Skill
Name: ab-test-agent-workflow
Version: 1.1.0-en2
The skill bundle defines a legitimate A/B testing workflow for comparing AI models using a coordinator-contestant-judge architecture. The instructions in SKILL.md are well-structured, focusing on task distribution, output anonymization, and objective scoring using sub-agents. No evidence of data exfiltration, malicious code execution, or harmful prompt injection was found in the provided documentation (SKILL.md) or metadata (_meta.json).
Capability Assessment
Purpose & Capability
The skill's name/description (A/B blind test orchestration) matches the prompts and workflow. However the SKILL.md lists a repository layout including scripts/runner.py and scripts/anonymizer.py whereas the package contains only SKILL.md (no code files). Claiming helper scripts that are not included is an incoherence: users might expect executable tooling but must implement or obtain it elsewhere.
Instruction Scope
Instructions stay within the A/B testing scope (coordination, anonymization, blind judging). They assume the ability to spawn isolated subagents (sessions_spawn / runtime=subagent) and to run code-generation tasks that return runnable code/results. Those assumptions are platform-dependent; without guaranteed subagent isolation or a provided anonymizer, the workflow could leak origin information or cause generated code to be run in an unsafe environment.
Install Mechanism
No install spec or external downloads are present (instruction-only). This is low-risk from an install/execution supply-chain perspective — the main risk is behavioral (see instruction scope).
Credentials
The skill requests no environment variables, no credentials, and no config paths. That is proportionate for a coordination/workflow instruction-only skill.
Persistence & Privilege
The skill does not request always:true and is user-invocable only. It does not ask to modify other skills or system settings. Autonomous invocation is allowed (default) but not accompanied by other broad privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install ab-test-agent-workflow
  3. After installation, invoke the skill by name or use /ab-test-agent-workflow
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.1.0-en2
Full body English translation
v1.1.0-en
English version
v1.1.0
**Summary:** Version 1.1.0 introduces an extensive, clearly documented, multi-agent, double-blind A/B testing workflow for comparing multiple AI models/agents across multiple rounds.
- Full SKILL.md documentation added, detailing workflow architecture, roles, prompt templates, execution steps, and reporting format.
- Supports both in-chat (AI-coordinated) and script-driven execution methods.
- Provides standardized prompts for contestants and judges, including code generation and evaluation tasks.
- Includes anonymization tools and fallback strategies for format/parsing issues.
- Reference rubric templates and troubleshooting guidance provided for robust, repeatable A/B testing.
Metadata
Slug ab-test-agent-workflow
Version 1.1.0-en2
License MIT-0
All-time Installs 1
Active Installs 1
Total Versions 3
Frequently Asked Questions

What is AB Test Agent Workflow (EN)?

Multi-agent, double-blind A/B testing workflow. Runs multi-round, double-blind controlled tests across multiple AI models/Agents. Core roles: Coordinator, Contestants A/B, and Judge. Trigger phrases: "A/B test", "double-blind test", "compare AI models", "... It is an AI Agent Skill for Claude Code / OpenClaw, with 133 downloads so far.

How do I install AB Test Agent Workflow (EN)?

Run "/install ab-test-agent-workflow" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is AB Test Agent Workflow (EN) free?

Yes, AB Test Agent Workflow (EN) is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does AB Test Agent Workflow (EN) support?

AB Test Agent Workflow (EN) is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created AB Test Agent Workflow (EN)?

It is built and maintained by JohnSmithfan (@johnsmithfan); the current version is v1.1.0-en2.
