功能描述

Interpret A/B test results for ecommerce campaigns and pages by checking statistical significance, practical effect size, and next-step recommendations.

使用说明 (SKILL.md)

A/B Interpreter

Name: A/B Interpreter
Author: leooooooow

Most ecommerce A/B tests get called too early, too late, or on the wrong metric — and the team ships whichever variant "looks like" it won without a clean read on whether the lift was real, large enough to matter, or durable past the novelty window. This skill turns raw test results into a disciplined go or no-go verdict with statistical significance, practical effect size, segment checks, and a prescribed next step so every test actually moves the business.

Quick Reference

Decision	Strong signal	Acceptable	Weak / Redesign
Statistical significance	p \x3C 0.05, two-tailed, sample powered to 80%	p between 0.05 and 0.10 with large sample	p > 0.10 or sample underpowered
Practical effect size	Lift exceeds MDE and the lower bound of the 95% CI is positive	Lift exceeds MDE but CI barely crosses zero	Lift below MDE even if "significant"
Test duration	Ran across at least two full weekly cycles	Ran 10 to 14 days, no major anomalies	Under 7 days or overlaps a holiday/promo
Sample ratio	Observed split within 1% of planned	Split within 2%	SRM > 2% — suspect tracking
Guardrail metrics	All guardrails flat or positive	One guardrail flat, one minor regression	Any revenue, refund, or AOV guardrail regressed
Segment stability	Lift positive across all key segments	Lift positive in 3 of 4 segments	Contradicting segments (new vs returning, device, geo)
Novelty risk	Lift holds in final week of test	Lift decays mildly over time	Lift front-loaded and decays past week two

Solves

Teams ship winners that were noise, then can't reproduce the lift in production
Stakeholders disagree on whether a 3% lift with p = 0.08 is a "win"
Growth managers call tests after 3 days on a heavy-traffic weekend and miss weekday reality
Novelty-effect lifts decay in production and no one notices until the quarter closes
Segment-level contradictions (tablet users regress) get averaged away in the topline
Guardrail metrics like refund rate silently offset the headline conversion win
Test readouts lack a prescribed next step, so learnings do not compound

Workflow

Ingest the test setup. Read the hypothesis, variants, traffic split, start and end dates, primary metric, minimum detectable effect (MDE), and guardrail list. Flag any required field that is missing and stop if the primary metric is not defined in advance.
Sanity-check the data. Confirm the observed traffic split matches the planned split (SRM check). Look for outlier days, tracking gaps, or bot spikes. If the split is off by more than 2%, mark the test inconclusive and recommend re-running.
Run the core statistical check. Use a two-proportion z-test for conversion rate, Welch's t-test for continuous revenue metrics. Compute the p-value, the observed lift, and a 95% confidence interval around the lift.
Compare lift against MDE. A significant result that falls below MDE is practically meaningless; a non-significant result that exceeds MDE suggests under-powering and an extension. The lower bound of the CI must exceed zero for a confident ship call.
Check segments and guardrails. Split by new vs. returning, device, traffic source, and geography. Any segment that contradicts the aggregate direction should be called out. Any guardrail regression (bounce rate, refund rate, add-to-cart) should be weighed against the primary lift.
Assess novelty and seasonality. If the test ran under 14 days, flag novelty risk. Compare week one to week two. If lift concentrated in week one, recommend extending or running a repeat test after a cooldown.
Deliver the verdict and next step. Write a plain-language verdict (ship, kill, extend, redesign), the reasoning a skeptical analyst would demand, and the prescribed follow-up test or monitoring metric. Every readout ends with one concrete action.

Example 1: Shopify PDP button copy test

Inputs. Hypothesis: "Buy Now" outperforms "Add to Cart" on the product page. 14-day test, 50/50 split, 48,200 visitors each arm. Control conversions: 1,446 (3.00%). Variant conversions: 1,640 (3.40%). MDE set at 0.25 percentage points. Guardrails: refund rate and AOV.

Analysis. Two-proportion z-test gives p = 0.02 with a 95% CI on the lift of +0.08 to +0.72 pp. Observed lift of 0.40 pp exceeds MDE of 0.25. Sample split was 50.1 / 49.9. New vs. returning both show directional lift. Refund rate flat. AOV down 1.2% (not significant, CI crosses zero). Week one lift 0.45, week two 0.35 — mild decay, within tolerance.

Verdict. Ship. The result is statistically and practically significant; guardrails are intact. Next step: monitor AOV weekly for four weeks post-rollout and re-test "Buy Now" against a bundled offer CTA in Q3.

Example 2: Email subject line test with under-powered readout

Inputs. Two subject lines sent to 8,000 recipients each. Opens: A = 1,440 (18.0%), B = 1,560 (19.5%). MDE of 2 percentage points was required to justify the winner for list-wide rollout.

Analysis. Two-proportion z-test p = 0.022 — "significant". Observed lift of 1.5 pp is below the 2.0 pp MDE. 95% CI on the lift is +0.2 to +2.8, straddling the MDE. No segment breakdown provided.

Verdict. Do not ship yet. The result is statistically significant but falls short of the practical threshold the team pre-committed to. Extend the test or re-run with a larger list to narrow the CI. Do not rationalize the "win" just because p \x3C 0.05 — the pre-committed MDE exists precisely to prevent that.

Common mistakes

Peeking. Calling the test early the moment p drops below 0.05 inflates false positive rates dramatically. Pre-commit to a sample size and honor it.
Ignoring MDE. A statistically significant but practically tiny lift does not justify a rollout. The MDE is the business's pre-committed threshold.
No SRM check. A broken randomizer can fabricate lifts. Always verify the observed split matches the planned split.
Averaging away segments. A +2% topline with a -5% tablet regression is not a clean win. Pull the segment cuts before celebrating.
One-week tests. Weekday vs. weekend behavior differs, and novelty effects dominate week one. Run at least two full business cycles.
Mixing up absolute and relative lift. "20% lift" on a 3% base rate means +0.6 pp, not +20 pp. Always state which you are reporting.
Treating p-values as probability the hypothesis is true. A p-value is the probability of the observed data under the null, not the probability the variant is better.
No guardrails declared upfront. Retrofitting guardrails after seeing results invites cherry-picking. Name them before the test starts.
Ignoring post-ship drift. Novelty decay and seasonality can erode a shipped win. Always prescribe a post-ship monitoring window.

Resources

references/output-template.md — The four-block readout structure
references/statistical-tests.md — Choosing the right test for the metric type
references/segment-playbook.md — Standard segments to cut and what each contradiction signals
assets/ab-readout-checklist.md — Pre-flight checklist for every readout you deliver

安全使用建议

This skill appears coherent and focused on legitimate A/B readout tasks. Before installing or running it: (1) confirm where the skill came from (no homepage/source is listed) and consider requesting provenance or a maintainer contact; (2) only supply the minimal data needed — prefer aggregated counts and CIs where possible, or anonymize per-user rows before providing them, because some recommended tests call for per-user revenue series; (3) ensure human review of any 'Ship' decisions (the skill gives recommendations, not business authorization); and (4) if you plan to run it autonomously, restrict the data the agent can access to avoid accidental exposure of PII or other sensitive datasets.

功能分析

Type: OpenClaw Skill Name: ab-interpreter Version: 1.1.0 The skill bundle is a well-documented framework for analyzing A/B test results using statistical methods. It contains instructions (SKILL.md), quality checklists (assets/ab-readout-checklist.md), and educational references (references/statistical-tests.md) that guide an AI agent through a disciplined analytical workflow. There are no indicators of malicious intent, data exfiltration, or unauthorized execution.

能力标签

cryptocan-make-purchases

能力评估

✓ Purpose & Capability

The name/description (A/B Interpreter for ecommerce) match the SKILL.md and reference docs: it explains which tests to run (two-proportion z, Welch's t, segment cuts), what inputs are required (hypothesis, MDE, per-arm counts or per-user revenue series), and what outputs to produce. No unexpected binaries, env vars, or config paths are requested.

ℹ Instruction Scope

Instructions are narrowly focused on ingesting test setup and computing statistics, segment and guardrail checks, novelty assessment, and producing a verdict/template. There is no instruction to read system files, environment variables, or to call external endpoints. One important runtime expectation: some checks (e.g., Welch's t for revenue) require per-user revenue series rather than just aggregated summaries — that implies the agent will need raw test data supplied at runtime. Treat that as a data-scope requirement rather than a platform access issue.

✓ Install Mechanism

No install spec and no code files — this is an instruction-only skill. Nothing is downloaded or written to disk by the skill package itself.

ℹ Credentials

The skill requests no environment variables, credentials, or config paths (proportionate). However, some recommended analyses need raw per-user data (potentially sensitive), so the primary privacy/secret concern is the dataset you provide at runtime, not the skill asking for secrets.

✓ Persistence & Privilege

The skill is not marked always:true and does not request persistent system changes. It uses the platform's normal autonomous-invocation setting (disable-model-invocation:false), which is expected and not, by itself, a problem.

版本历史

v1.1.0

**A/B Interpreter v1.1.0 Changelog** - Added comprehensive quick-reference table to clarify strong, acceptable, and weak test signals. - Expanded workflow steps with detailed sanity checks, statistical analysis, segmentation review, and novelty/seasonality assessment. - Provided two practical example analyses (Shopify PDP test and email subject line test) with inputs, step-by-step reasoning, and verdict. - Added section of common mistakes with practical warnings and corrections for real-world A/B testing. - Included a new resources section with references to output templates, statistical test guides, segment playbooks, and a checklist. - Added 4 supporting files: output template, statistical test guide, segment analysis guide, and pre-readout checklist.

v1.0.0

Initial release.

元数据

Slug ab-interpreter

版本 1.1.0

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 2

常见问题

A/B Interpreter 是什么？

Interpret A/B test results for ecommerce campaigns and pages by checking statistical significance, practical effect size, and next-step recommendations. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 91 次。

如何安装 A/B Interpreter？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install ab-interpreter」即可一键安装，无需额外配置。

A/B Interpreter 是免费的吗？

是的，A/B Interpreter 完全免费，采用 MIT-0 许可证，可自由下载、安装和使用。

A/B Interpreter 支持哪些平台？

A/B Interpreter 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（cross-platform）。

谁开发了 A/B Interpreter？

由 LeroyCreates（@leooooooow）开发并维护，当前版本 v1.1.0。

A/B Interpreter

A/B Interpreter

Quick Reference

Solves

Workflow

Example 1: Shopify PDP button copy test

Example 2: Email subject line test with under-powered readout

Common mistakes

Resources

A/B Interpreter 是什么？

如何安装 A/B Interpreter？

A/B Interpreter 是免费的吗？

A/B Interpreter 支持哪些平台？

谁开发了 A/B Interpreter？

💬 留言讨论