Description

Cognitive discipline for AI-native scientific experimentation. Trigger when setting up controlled experiments with LLM agents, designing reproducible evaluat...

Usage Instructions (SKILL.md)

research-harness

Name: Research Harness v1.3
Author: zhelunsun

Version: 1.3.0

Cognitive discipline for AI-native scientific experimentation: guardrails, not recipes.

When to Use

Trigger this skill when the user:

Sets up a new AI-native research experiment repo
Designs controlled experiments with LLM agents
Needs reproducible evaluation, statistics, and error analysis
Wants to structure a research workspace for long-running agent collaboration
Wants agent-safe research governance that prevents overclaiming
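Says anything like: "research harness", "experiment framework", "AI科研" (AI research), "对照实验" (controlled experiment), "评分体系" (scoring system), "可复现性" (reproducibility), "科研workflow" (research workflow), "agent协作科研" (agent-collaborative research), "可控实验" (controllable experiment), "效应量" (effect size)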

Core Philosophy

This skill does not prescribe what experiment to run. It prescribes how to think while running it.

Research agents fail not because they lack capability, but because they:

Scale before validating the minimum loop
Overclaim what the data proves
Treat surprising results as methodology failures before checking the execution chain
Delete failed runs to make progress look cleaner
Change baselines or rubrics silently
Invent designs from thin air without tracing them to published precedents or stated assumptions

The antidote is cognitive discipline — a set of non-negotiable mental habits enforced by repo structure, not by prompt reminders. Detailed reasoning for each discipline is in references/scientific-thinking.md.

A critical addition from real-world practice: governance before scale. When a closed loop produces strong signals, the most tempting mistake is to expand immediately. The correct move is to lock down claim boundaries, audit artifacts, fix provenance gaps, and only then scale. A dedicated calibration phase between closed-loop and full-scale expansion is a sign of maturity, not delay.

Six Cognitive Disciplines

#	Discipline	Core Question	Deep dive
1	Minimum Closed Loop Before Scale	Can the smallest version produce distinguishable signals?	`references/experiment-design.md`
2	Isolated Variables & Attributable Baselines	Does each group add exactly one variable?	`references/experiment-design.md`
3	Dual-Track Validation	Do two independent scoring systems agree?	`references/scoring-statistics.md`
4	Effect Size Over Significance	What is the magnitude, not just the p-value?	`references/scoring-statistics.md`
5	Pipeline Before Interpretation	Was the execution chain verified before the hypothesis was questioned?	`references/scientific-thinking.md`
6	Theoretical Grounding Before Design	Can every design decision trace to a published precedent or an explicitly stated hypothesis?	`references/methodology-grounding.md`

Disciplines 1-2: experiment design. 3-4: scoring & statistics. 5: critical reasoning. 6: methodology accountability.

Seven Governance Rules

#	Rule	Principle
1	Human Owns Direction; Agent Owns Execution	Agent cannot change research questions, promote evidence without review, or make academic decisions
2	Evidence Has Status; AI Output Is Not Fact	All AI-generated evidence starts as `candidate`; only back-to-source verification promotes to `verified`
3	Failed Runs Are Data, Not Trash	Register every run in the manifest; failures are process evidence against survivorship bias
4	Protected Surfaces Change Only By Proposal	Baselines, rubrics, raw results, and schema require version bump + documented proposal
5	Every Handoff Needs an Alignment Doc	Short doc replaces long chat history for agent onboarding
6	Calibrate Before Scaling	After a closed loop produces strong signals, lock down claim boundaries, audit artifacts, and fix provenance before expanding to full scale
7	Methodology Review Gate	Any "novel" design (no direct precedent) must have a Design Justification Document with ≥2 published precedents or explicit assumption declarations before entering experiment execution

Details in references/agent-collaboration.md.
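Rule 2 can be read as a two-state machine. The sketch below is a minimal illustration, not the skill's own implementation: the `Evidence` record, its field names, and the `promote` helper are hypothetical. The source specifies only the `candidate` and `verified` statuses and that promotion requires back-to-source verification.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Only these two statuses are named in the governance rules.
CANDIDATE = "candidate"
VERIFIED = "verified"


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record; field names are illustrative."""
    claim: str
    source: str                               # where the claim was supposedly found
    status: str = CANDIDATE                   # all AI-generated evidence starts here
    source_checked_by: Optional[str] = None   # reviewer who re-read the source


def promote(item: Evidence, reviewer: str, source_confirms_claim: bool) -> Evidence:
    """Promote candidate evidence only after a back-to-source check by a reviewer."""
    if not reviewer:
        raise PermissionError("promotion requires a named human reviewer")
    if not source_confirms_claim:
        # A failed check leaves the item as a candidate; nothing is deleted.
        return item
    return replace(item, status=VERIFIED, source_checked_by=reviewer)
```

Because the record is frozen, promotion returns a new object instead of mutating the old one, so the original candidate entry stays available for audit.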

Phase Workflow

Phase 0 · Scaffold

Goal: Set up the three-layer repo and root entry files.

thinking-space/ — research direction, claims, decisions (human)
execution-layer/ — briefs, logs, results, drafts (agent)
code-workshop/ — runnable artifacts, packages

Root files: AGENTS.md (workspace map), PLAN.md (phase panel), WORKFLOW.md (procedure), harness/README.md (governance).

Directory skeleton and rationale: references/repo-architecture.md.
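The scaffold step above could be automated along these lines. This is a sketch under assumptions: the `scaffold` function and the placeholder file contents are hypothetical, and only the directory and file names come from the text above. The full skeleton lives in references/repo-architecture.md.

```python
from pathlib import Path
from typing import List

# Names taken from the Phase 0 description.
LAYERS = ["thinking-space", "execution-layer", "code-workshop"]
ROOT_FILES = ["AGENTS.md", "PLAN.md", "WORKFLOW.md", "harness/README.md"]


def scaffold(root: Path) -> List[Path]:
    """Create missing layers and root files; never overwrite existing ones."""
    created = []
    for layer in LAYERS:
        path = root / layer
        if not path.exists():
            path.mkdir(parents=True)
            created.append(path)
    for name in ROOT_FILES:
        path = root / name
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            # Placeholder heading only; real content is written by the human or agent.
            path.write_text(f"# {path.name}\n", encoding="utf-8")
            created.append(path)
    return created
```

Running it twice is safe: the second call returns an empty list and leaves any edited files alone.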

Phase 1 · Harden

Goal: Make the repo self-checking before formal execution.

Module contracts — Each core module gets a CONTRACT.md (purpose, inputs, outputs, invariants, local validator). Template in references/repo-architecture.md.
Local validators: scripts/validate_<module>.py per module; scripts/validate_repo_state.py as aggregator. Gate rule: 0 FAIL before any formal run.
Multi-level audit — Structure validation as layered gates: schema-valid → evidence-governed → planning-useful → experiment-ready. Use a default mode (WARN only) during early phases and a strict mode (FAIL on gaps) before scaling. This keeps the repo moving without compromising the expansion gate.
Experiment manifest — experiments/results/manifest.csv as run-level provenance ledger (run_id, wave, task_id, group, model, version metadata, status, retry_of, git_commit).
Protected surfaces — Baselines, rubrics, raw results, scoring config, schema. Require version bump + proposal to change.
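An append-only manifest writer is one way to keep the run ledger honest. The sketch below is illustrative: `register_run` is a hypothetical helper, `model_version` stands in for the unspecified "version metadata" columns, and status values such as `failed` are assumptions. The other column names follow the manifest description above.

```python
import csv
from pathlib import Path
from typing import Dict

# `model_version` is an assumed stand-in for the "version metadata" fields.
FIELDS = ["run_id", "wave", "task_id", "group", "model", "model_version",
          "status", "retry_of", "git_commit"]


def register_run(manifest: Path, row: Dict[str, str]) -> None:
    """Append one run to the ledger; existing rows are never rewritten."""
    missing = [f for f in FIELDS if f not in row]
    if missing:
        raise ValueError(f"missing manifest fields: {missing}")
    existing_ids = set()
    if manifest.exists():
        with manifest.open(newline="", encoding="utf-8") as fh:
            existing_ids = {r["run_id"] for r in csv.DictReader(fh)}
    if row["run_id"] in existing_ids:
        raise ValueError(f"run_id {row['run_id']!r} already registered")
    if row["retry_of"] and row["retry_of"] not in existing_ids:
        raise ValueError("retry_of must reference a registered run")
    is_new = not manifest.exists()
    manifest.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" appends, so a failed run stays in the file when its retry lands.
    with manifest.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({f: row[f] for f in FIELDS})
```

A retry is registered as a new row whose `retry_of` points at the failed run, so the failure remains visible as process evidence.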

Phase 2 · Design

Goal: Design attributable controlled experiments with grounded methodology.

Progressive building: minimum artifacts → schema validation → small task set → dry run → scoring → expand. Design details in references/experiment-design.md.
Controlled groups: Baseline → incremental treatments. Adjacent groups differ by exactly one variable.
Gold checklists: Every task has must_include, forbidden, and scoring_notes. For multi-group experiments, separate gold checklists into planning_gold (items all groups can achieve) and evidence_gold (items only augmented groups can access).
Metric fairness annotation: Each evaluation metric must declare evidence_access_required (true/false). Score families with different access requirements must be reported separately, never mixed into a single total. Details in references/scoring-statistics.md.
Artifact QA: Creating knowledge artifacts (cards, schemas, tasks) follows a scaffold→validate→commit cycle. Validators catch syntax errors, enum mismatches, and registry sync issues before they accumulate. Details in references/repo-architecture.md.
Output contract: Agent output follows a strict schema (YAML/JSON). The scorer and analysis pipeline depend on this contract.
Design justification (Discipline 6): Before implementing any non-trivial schema, scoring system, or experiment protocol, write a Design Justification Document. Template and grounding framework in references/methodology-grounding.md. This is required, not optional.
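To make the gold-checklist and metric-fairness rules concrete, here is a minimal sketch. Several things are assumptions rather than part of the skill: nesting `must_include` and `forbidden` under each of `planning_gold` and `evidence_gold`, the case-insensitive substring matching, and scoring zero when a forbidden item appears. The actual scoring rules are described in references/scoring-statistics.md.

```python
from typing import Dict, List

# Each score family declares whether it needs evidence access.
METRICS = {
    "planning": {"evidence_access_required": False},
    "evidence": {"evidence_access_required": True},
}


def score_checklist(output: str, must_include: List[str], forbidden: List[str]) -> float:
    """Fraction of must_include items present; 0.0 if any forbidden item appears."""
    text = output.lower()
    if any(term.lower() in text for term in forbidden):
        return 0.0
    if not must_include:
        return 1.0
    hits = sum(term.lower() in text for term in must_include)
    return hits / len(must_include)


def score_task(output: str, task: Dict) -> Dict[str, float]:
    """Report each score family separately; no combined total is computed."""
    return {
        family: score_checklist(
            output,
            task[f"{family}_gold"]["must_include"],
            task[f"{family}_gold"]["forbidden"],
        )
        for family in METRICS
    }
```

Returning a dictionary keyed by family, with no total, keeps a baseline group from being penalized on items it could never access.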

Phase 3 · Execute & Analyze

Goal: Run experiments, score, compute statistics, analyze errors.

Preflight gate: local validators must pass. Then:

Dry run — print prompt, no API call
Smoke run — 1 task × 2 groups, verify output parsing
Wave 1 — small set × all groups, minimum viable data
Scoring: Track A (rule-based) + Track B (semantic) cross-validation. Details in references/scoring-statistics.md.
Statistics: report Cohen's d as the primary statistic with a 95% CI, alongside a paired t-test and a Wilcoxon test. A --reproduce flag provides one-click reproducibility.
Error analysis: hallucination, output depth, specificity, task appropriateness.
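The effect-size-first reporting described above might look like the following standard-library sketch. It assumes paired per-task scores and uses the paired variant of Cohen's d (mean difference divided by the standard deviation of the differences); the source does not say which variant the skill uses. The percentile bootstrap for the 95% CI is also an assumption, and all function names are hypothetical.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def paired_cohens_d(treatment: Sequence[float], baseline: Sequence[float]) -> float:
    """Cohen's d for paired scores: mean difference / SD of the differences."""
    if len(treatment) != len(baseline):
        raise ValueError("paired scores must have the same length")
    diffs = [t - b for t, b in zip(treatment, baseline)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def paired_t_statistic(treatment: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired t statistic, which equals the paired d times sqrt(n)."""
    return paired_cohens_d(treatment, baseline) * math.sqrt(len(treatment))


def bootstrap_ci(treatment: Sequence[float], baseline: Sequence[float],
                 n_resamples: int = 2000, seed: int = 0,
                 level: float = 0.95) -> Tuple[float, float]:
    """Percentile bootstrap CI for the paired d, resampling task pairs."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    pairs = list(zip(treatment, baseline))
    estimates = []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in pairs]
        t, b = zip(*sample)
        try:
            estimates.append(paired_cohens_d(t, b))
        except (statistics.StatisticsError, ZeroDivisionError):
            continue  # degenerate resample with zero variance; skip it
    estimates.sort()
    lo = estimates[int((1 - level) / 2 * len(estimates))]
    hi = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lo, hi
```

The Wilcoxon test is left out here because the standard library does not provide one. In practice it would typically come from a statistics package such as SciPy (`scipy.stats.wilcoxon`), with `scipy.stats.ttest_rel` supplying the p-value for the paired t-test.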

Phase 4 · Handoff & Writing

Goal: Package results for the next phase or agent.

Claim-safe memo: After any closed loop that produces strong signals, write a memo separating: supported findings, positive signals, not-yet-supported claims, and required next evidence. This prevents overclaiming and directly drives the next phase. Never jump to expansion without calibrating the claim boundary first.
Alignment doc: ~1 page with state, entry files, new surfaces, preflight commands, protected surfaces. Never pass chat history.
Upstream proposals: Any insight affecting direction goes to sync/upstream_proposals/ first. Template in references/agent-collaboration.md.
Writing markers: [REF-MISSING], [CRITICAL-CHECK], [TODO]. Never use AI-generated numbers without verification.
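The writing markers lend themselves to a simple pre-handoff scan. This is a sketch only: `find_markers` and `ready_for_handoff` are hypothetical helper names, and treating any remaining marker as a blocker is an assumption about how the markers are meant to be used.

```python
import re
from typing import List, Tuple

# The three markers named in the Phase 4 description.
MARKER_RE = re.compile(r"\[(REF-MISSING|CRITICAL-CHECK|TODO)\]")


def find_markers(draft: str) -> List[Tuple[int, str]]:
    """Return (line_number, marker) for each unresolved marker in the draft."""
    found = []
    for lineno, line in enumerate(draft.splitlines(), start=1):
        for match in MARKER_RE.finditer(line):
            found.append((lineno, match.group(1)))
    return found


def ready_for_handoff(draft: str) -> bool:
    """A draft with any remaining marker still needs human verification."""
    return not find_markers(draft)
```

For example, a line such as `Accuracy rose by 12% [CRITICAL-CHECK]` is reported with its line number, so the unverified figure can be traced back to its source before the draft moves on.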

Non-Negotiables

No unverified citation becomes a research fact
No debug result becomes a formal result
No agent changes baseline, rubric, or metric definitions without a proposal
No raw result is overwritten
No failed experiment is deleted
No phase gate passes before validators report zero FAIL
No closed loop expands to full scale before calibration: lock down claim boundaries, audit artifacts, and fix provenance gaps first
No non-trivial design enters execution without a Design Justification Document (Discipline 6)
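The zero-FAIL gate, combined with the default and strict audit modes from Phase 1, can be sketched as a small aggregator. The `run_gates` function, the shape of the check callables, and the choice to stop at the first failing gate in strict mode are all assumptions. Only the gate names and the WARN/FAIL distinction come from the text above.

```python
from typing import Callable, Dict, List, Tuple

# Gate order from the multi-level audit in Phase 1.
GATES = ["schema-valid", "evidence-governed", "planning-useful", "experiment-ready"]

# A check returns a list of gap descriptions; an empty list means the gate is clean.
Check = Callable[[], List[str]]


def run_gates(checks: Dict[str, Check], strict: bool) -> Tuple[bool, List[str]]:
    """Run gates in order. Default mode reports gaps as WARN; strict mode as FAIL."""
    report = []
    fail_count = 0
    level = "FAIL" if strict else "WARN"
    for gate in GATES:
        gaps = checks[gate]()
        for gap in gaps:
            report.append(f"{level} [{gate}] {gap}")
        if gaps and strict:
            fail_count += len(gaps)
            break  # assumed: later gates build on earlier ones, so stop here
    return fail_count == 0, report
```

In default mode the function always passes and only lists warnings, which keeps early work moving. In strict mode any gap blocks the phase gate.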

References

references/repo-architecture.md — three-layer repo, module contracts, manifest, validators
references/experiment-design.md — progressive building, controlled groups, gold checklists
references/scoring-statistics.md — dual-track validation, effect size, reproducibility
references/scientific-thinking.md — cognitive disciplines for agent-led research
references/agent-collaboration.md — governance, evidence status, alignment docs
references/methodology-grounding.md — theoretical grounding for schema, scoring, and experiment design

Safe Usage Recommendations

Install if you want agents to follow a stricter research workflow. Expect it to influence how agents organize experiment repos, write validation scripts, preserve failed runs, and request human review before changing research claims, rubrics, or schemas.

Capability Assessment

✓ Purpose & Capability

The stated purpose is a research experiment harness, and the artifacts consistently provide methodology, repo structure, validation, scoring, and human-review guardrails.

ℹ Instruction Scope

The skill tells agents to create research artifacts, run local validators, and in some cases run experiments automatically, but these actions are purpose-aligned and bounded by human-review and protected-surface rules.

✓ Install Mechanism

The package contains markdown documentation and references only; SkillSpector reports no executable scripts and dependency/static scans are clean.

✓ Credentials

Expected file creation is limited to research workspace structure, manifests, validators, proposals, and reports; no broad local indexing, credential use, or unrelated system access is described.

✓ Persistence & Privilege

No startup hooks, background workers, privilege escalation, credential/session handling, or persistent external service behavior appears in the artifacts.

Version History

v1.3.0

Added Discipline 6 (Theoretical Grounding Before Design), Governance Rule 7 (Methodology Review Gate), and the methodology-grounding.md reference document.

Metadata

Slug agentic-research-harness

Version 1.3.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

Frequently Asked Questions