Description

Use when building a Planner→Generator→Evaluator multi-agent harness with the Claude SDK. Triggers: "build a harness", "multi-agent pipeline", "agent loop", "...

README (SKILL.md)

Long-Running App Harness — SDK Implementation

Name: long-run-harness
Author: is-xins-xiaobai

Produces a runnable harness that orchestrates Claude agents via claude_agent_sdk. You are writing the harness, not running inside it.

Use query() + ClaudeAgentOptions for agentic loops and tool() + create_sdk_mcp_server() for structured output. Never call anthropic.Anthropic() directly.

pip install claude-agent-sdk

Output structure:

harness/
  harness.py; config.yaml; config.py; log.py
  agents/ planner.py; generator.py; evaluator.py
  models/ state.py
  prompts/ planner.md; generator.md; evaluator.md

Routing

User Signal	Route
"build a harness / pipeline"	Start at Phase 1
"add an evaluator"	Jump to Phase 4
"add state / handoff"	Jump to Phase 5
"looping forever / broken"	Check feedback loop termination in Phase 5
"just explain what a harness does"	Explain concept, don't write code

Phase 1: Design the Harness

Load: $SKILL_DIR/instructions/planner-questions.md

⚠️ HARD GATE: Ask the design questions. Get answers to 1–3 before writing any code:

1. What does the harness build? (sets Generator tools + Evaluator rubric)
2. Python or TypeScript? (default: Python)
3. Models per agent? (default: all claude-opus-4-7; non-defaults → config.yaml)

Create skeleton:

mkdir -p harness/agents harness/models harness/prompts harness/harness-logs
touch harness/harness.py harness/log.py harness/agents/__init__.py harness/models/__init__.py

config.yaml + config.py — all tunable parameters here; never hardcode in agent files. Load: $SKILL_DIR/instructions/config.md for the full HarnessConfig dataclass.

cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
# Always: cfg.agents.generator_model  — never: "claude-opus-4-7"
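As a rough illustration of the config-driven approach, the sketch below shows one possible shape for the config object. Only `cfg.agents.generator_model`, `cfg.loop.max_iterations`, and `cfg.loop.min_iterations` appear in this document; `planner_model`, `evaluator_model`, the numeric defaults, and `from_dict` are assumptions. The full `HarnessConfig` lives in config.md and may differ. YAML parsing is left out; `from_dict` takes the mapping a YAML loader would return.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentsConfig:
    # generator_model is named in this document; the other two are assumed names.
    planner_model: str = "claude-opus-4-7"
    generator_model: str = "claude-opus-4-7"
    evaluator_model: str = "claude-opus-4-7"


@dataclass
class LoopConfig:
    max_iterations: int = 5  # hypothetical default
    min_iterations: int = 1  # hypothetical default


@dataclass
class HarnessConfig:
    agents: AgentsConfig = field(default_factory=AgentsConfig)
    loop: LoopConfig = field(default_factory=LoopConfig)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "HarnessConfig":
        """Build a config from the mapping a YAML parser would return."""
        return cls(
            agents=AgentsConfig(**raw.get("agents", {})),
            loop=LoopConfig(**raw.get("loop", {})),
        )


# Keys missing from the mapping fall back to the dataclass defaults.
cfg = HarnessConfig.from_dict(
    {"agents": {"generator_model": "my-model"}, "loop": {"max_iterations": 3}}
)
```

Agent files would then read `cfg.agents.generator_model` and never see a model string directly.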

models/state.py — write first; all other files import from it. Load: $SKILL_DIR/instructions/context-handoff.md (HandoffState, EvalResult, format_handoff_for_prompt). Load: $SKILL_DIR/instructions/sprint-contracts.md (SprintContract + negotiation protocol).
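To make the handoff idea concrete, here is a minimal sketch of a JSON-backed state object. It is not the implementation from context-handoff.md: the fields `sprint` and `open_issues`, the `path` argument to `save()`, and the prompt wording are all assumptions. Only the names `HandoffState`, `completed_features`, and `format_handoff_for_prompt` come from this document.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class HandoffState:
    sprint: int = 1  # hypothetical field
    completed_features: list[str] = field(default_factory=list)
    open_issues: list[str] = field(default_factory=list)  # hypothetical field

    def save(self, path: Path) -> None:
        """Persist the state as JSON so a crashed run can be resumed."""
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "HandoffState":
        return cls(**json.loads(path.read_text()))


def format_handoff_for_prompt(state: HandoffState) -> str:
    """Render the state as a short text block for the next agent's prompt."""
    done = ", ".join(state.completed_features) or "none"
    issues = ", ".join(state.open_issues) or "none"
    return f"Sprint {state.sprint}. Completed: {done}. Open issues: {issues}."
```

Because the state is plain JSON on disk, it can be inspected and sanitized between sprints.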

log.py — dual stdout + timestamped file under harness-logs/. Load: $SKILL_DIR/instructions/logging.md for full implementation.

log.setup(PROJECT_DIR, label="run")  # once in main()
logger = log.get()                   # in every agent

Phase 2: Planner Agent

Load: $SKILL_DIR/instructions/planner-questions.md for system prompt template. Load: $SKILL_DIR/instructions/agent-patterns.md for full run_planner implementation.

run_planner(brief, session_id, cfg) → (reply, new_session_id). ClaudeAgentOptions(resume=session_id) continues session without resending history.

spec, session_id = "", None
while "SPEC_COMPLETE" not in spec:
    user_input = input("[Planner asks]: ").strip() if session_id else initial_brief
    spec, session_id = run_planner(user_input, session_id, cfg)
SPEC_PATH.write_text(spec.replace("SPEC_COMPLETE", "").strip())
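The loop above depends on the live SDK and on `input()`, which makes it hard to test. A variant with both injected might look like the sketch below. `collect_spec`, the `ask` parameter, and the `max_turns` cap are assumptions added for illustration, and `run_planner` here takes only two arguments (the `cfg` argument is omitted for brevity).

```python
from typing import Callable, Optional, Tuple

PlannerFn = Callable[[str, Optional[str]], Tuple[str, Optional[str]]]


def collect_spec(
    initial_brief: str,
    run_planner: PlannerFn,
    ask: Callable[[str], str] = input,
    max_turns: int = 10,  # hypothetical cap so the dialogue cannot loop forever
) -> str:
    """Talk to the planner until it emits SPEC_COMPLETE, then return the spec."""
    reply, session_id = "", None
    for _ in range(max_turns):
        # First turn sends the brief; later turns relay the user's answer.
        user_input = ask("[Planner asks]: ").strip() if session_id else initial_brief
        reply, session_id = run_planner(user_input, session_id)
        if "SPEC_COMPLETE" in reply:
            return reply.replace("SPEC_COMPLETE", "").strip()
    raise RuntimeError(f"Planner did not finish within {max_turns} turns")
```

In a test, `run_planner` can be a stub that returns a question on the first call and a finished spec on the second, and `ask` can be a lambda that returns a canned answer.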

Phase 3: Generator Agent

Load: $SKILL_DIR/instructions/agent-patterns.md for run_generator + self_assess implementations.

def run_generator(
    spec, contract, project_dir,
    handoff=None, strategic_framing=None, cfg=None,
) -> str: ...

ClaudeAgentOptions(
    model=cfg.agents.generator_model,
    allowed_tools=["Write", "Read", "Edit", "Bash", "Glob"],
    cwd=str(project_dir), permission_mode="bypassPermissions",
)

After generation, call self_assess() — catches gaps before the Evaluator via submit_assessment MCP tool. If not confident → extra pass with concerns as strategic_framing.

Phase 4: Evaluator Agent

Load: $SKILL_DIR/instructions/agent-patterns.md for full implementation. Load: $SKILL_DIR/instructions/evaluation-rubrics.md for system prompt + rubric criteria.

Two roles: run_evaluator() (post-generation gate) + review_contract() (pre-sprint criteria review).

# submit_grade schema: contract_results[{id, status, evidence}], rubric_scores{id: 1–5}, feedback
def run_evaluator(spec, contract, app_url, rubric_track="A", cfg=None) -> EvalResult: ...

⚠️ Deterministic verdict: Never trust verdict from the LLM. Recompute in _build_eval_result() from contract_results + rubric_scores using cfg.verdict.* thresholds.

Phase 5: Harness Loop

Load: $SKILL_DIR/instructions/iteration-loop.md for run_sprint, strategic_decision, git_commit.

def main():
    cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
    log.setup(PROJECT_DIR, label="run")

def run_sprint(spec, contract, project_dir, handoff=None, cfg=None):
    iteration = 0
    while iteration < cfg.loop.max_iterations:
        iteration += 1
        # 1. Generate — try/except; crash is a valid (poor) outcome
        # 2. Self-assess — extra pass if not confident
        # 3. git_commit("wip: sprint N iteration I")
        # 4. Evaluate → EvalResult
        # 5a. Pass + iteration < min_iterations → quality-improvement continue
        #     Pass + min_iterations met → git_commit("feat") + return
        # 5b. Fail → strategic_decision() → REFINE or PIVOT → set strategic_framing
    # Exhausted: input() if isatty() else return last result
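The termination logic of the loop can be exercised without the SDK by injecting the agent calls. The sketch below is an illustration only: `run_sprint_sketch` and its parameters are hypothetical, self-assessment is omitted, and `strategic_decision()` is reduced to always choosing REFINE.

```python
from typing import Callable, Optional, Tuple


def run_sprint_sketch(
    generate: Callable[[Optional[str]], None],
    evaluate: Callable[[], Tuple[str, str]],
    commit: Callable[[str], None],
    sprint: int,
    max_iterations: int,
    min_iterations: int,
) -> Tuple[str, int]:
    """Run one sprint; return (last verdict, iterations used)."""
    framing: Optional[str] = None
    verdict = "FAIL"
    for iteration in range(1, max_iterations + 1):
        try:
            generate(framing)
        except Exception:
            # A crashed generation is a valid (poor) outcome; evaluate what exists.
            pass
        commit(f"wip: sprint {sprint} iteration {iteration}")
        verdict, feedback = evaluate()
        if verdict == "PASS" and iteration >= min_iterations:
            commit(f"feat: sprint {sprint} complete")
            return verdict, iteration
        # Early pass or failure: carry the feedback into the next iteration.
        prefix = "IMPROVE" if verdict == "PASS" else "REFINE"
        framing = f"{prefix}: {feedback}"
    # Iterations exhausted: hand back the last verdict instead of looping on.
    return verdict, max_iterations
```

The `for` loop over a fixed range is what guarantees termination; no branch inside the body can extend the iteration count.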

Git checkpoints (see iteration-loop.md for git_commit() helper):

Event	Message
SPEC written	`feat: generate SPEC.md`
Contract negotiated	`chore: sprint N contract`
Each iteration	`wip: sprint N iteration I`
Sprint passes	`feat: sprint N complete`
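For illustration, the checkpoint table could be encoded as templates next to a small commit helper. This is a sketch rather than the helper from iteration-loop.md: the event keys, `checkpoint_message`, and the `project_dir` parameter of `git_commit` are assumptions. The git invocations themselves are standard.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Event keys are hypothetical; the messages mirror the table above.
_TEMPLATES = {
    "spec": "feat: generate SPEC.md",
    "contract": "chore: sprint {sprint} contract",
    "iteration": "wip: sprint {sprint} iteration {iteration}",
    "pass": "feat: sprint {sprint} complete",
}


def checkpoint_message(
    event: str, sprint: Optional[int] = None, iteration: Optional[int] = None
) -> str:
    """Fill in the commit message template for a checkpoint event."""
    return _TEMPLATES[event].format(sprint=sprint, iteration=iteration)


def git_commit(project_dir: Path, message: str) -> bool:
    """Stage everything and commit; return False when nothing changed."""
    subprocess.run(["git", "add", "-A"], cwd=project_dir, check=True)
    # `git diff --cached --quiet` exits 0 when the index has no staged changes.
    staged = subprocess.run(["git", "diff", "--cached", "--quiet"], cwd=project_dir)
    if staged.returncode == 0:
        return False
    subprocess.run(["git", "commit", "-m", message], cwd=project_dir, check=True)
    return True
```

Skipping the commit when nothing is staged avoids a non-zero exit from `git commit` on iterations where the Generator changed no files.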

Setup: pip install claude-agent-sdk && export ANTHROPIC_API_KEY=sk-...

Verify (run from inside harness/): python -c "from agents.planner import run_planner; print('OK')"

Common Mistakes

Mistake	Fix
Trusting LLM's `verdict` field	Recompute in `_build_eval_result()` from `contract_results` + `rubric_scores`
Hardcoding model names	Use `cfg.agents.generator_model` — never a string literal
Not calling `handoff.save()` before Evaluator	On crash, Evaluator result is lost
Using `input()` in CI	Guard with `sys.stdin.isatty()` first
Accumulating messages across sprints	Each sprint is a fresh `query()` call — no cross-sprint history
Marking `completed_features` from Generator claim	Only promote after Evaluator PASS verdict
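The `input()` guard from the table might be wrapped as below. `ask_or_default` and its `stream` parameter are hypothetical names; the sketch assumes that a non-interactive run should fall back to a default answer rather than block.

```python
import sys
from typing import Optional, TextIO


def ask_or_default(prompt: str, default: str, stream: Optional[TextIO] = None) -> str:
    """Prompt only when attached to a terminal; otherwise return the default."""
    stream = stream if stream is not None else sys.stdin
    if not stream.isatty():
        # CI or piped input: never block waiting for a human.
        return default
    print(prompt, end="", flush=True)
    answer = stream.readline().strip()
    return answer or default
```

When the iteration budget is exhausted, the harness could call `ask_or_default("Continue? [y/N] ", "n")` and treat the default as "return the last result".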

When to Simplify

Component	Remove / simplify when
Planner agent	User provides SPEC directly
Contract negotiation	Human has strong opinions; use config-file mode
Generator self-assessment	Evaluator consistently passes first attempt
`max_iterations` → 3	Correctness-only task, no quality/aesthetic goal
`min_iterations` → 1	Early passes are always good enough
Refine/pivot `strategic_decision`	Single sprint or correctness task
`HandoffState`	Sprint fits in one context window
Evaluator	Task within Generator's reliable baseline

Usage Guidance

Use this skill only if you intend to create an autonomous code-writing harness. Before running the generated harness on real work, pin the SDK dependency, run it in a sandbox or disposable repo, change bypassed permissions to approval-based permissions where possible, restrict Bash, and review handoff files and logs for secrets or misleading instructions.

Capability Analysis

Type: OpenClaw Skill

Name: long-run-harness

Version: 1.0.0

The skill bundle implements a multi-agent orchestration harness that utilizes high-risk capabilities, specifically arbitrary shell execution via the 'Bash' tool and explicit permission bypassing ('bypassPermissions') in SKILL.md and agent-patterns.md. While these features are plausibly required for the stated purpose of autonomous application development, they create a significant attack surface for remote code execution. No evidence of intentional malice, data exfiltration, or unauthorized persistence was found.

Capability Tags

crypto, requires-sensitive-credentials

Capability Assessment

ℹ Purpose & Capability

The stated purpose—generating a Planner→Generator→Evaluator app-building harness—matches the multi-agent SDK code, loops, evaluator, handoff state, and logging instructions. No hidden exfiltration endpoint or destructive intent is evident in the provided artifacts.

⚠ Instruction Scope

The generated Generator agent is instructed to use Write, Read, Edit, Bash, and Glob with permission_mode="bypassPermissions". That is purpose-aligned for an app-building harness, but it removes normal per-action review for high-impact local actions.

ℹ Install Mechanism

The skill is instruction-only and tells the user to install claude-agent-sdk with an unpinned pip command. This is central to the purpose, but users should pin and verify the dependency.

⚠ Credentials

The generated harness can repeatedly modify project files, run shell commands, and make git commits inside a user-supplied project directory. This is powerful enough that it should be run only in a sandbox or disposable repository unless carefully reviewed.

ℹ Persistence & Privilege

The skill intentionally creates logs, handoff JSON files, session resumes, and bounded long-running loops. These are disclosed and purpose-aligned, but they persist model-generated context and tool inputs that users should inspect and sanitize.

Version History

v1.0.0

Initial release of the skill for building multi-agent Claude SDK harnesses.

- Guides users through designing and implementing a Planner→Generator→Evaluator orchestrator using `claude-agent-sdk`.
- Enforces best practices: config-driven parameters, explicit agent looping, and robust evaluation logic.
- Provides detailed, phase-based instructions for harness structure, agent implementation, and iteration management.
- Highlights common mistakes and when to simplify the harness.

Metadata

Slug long-run-harness

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is long-run-harness?

long-run-harness is an AI Agent Skill for Claude Code / OpenClaw, used when building a Planner→Generator→Evaluator multi-agent harness with the Claude SDK. Triggers: "build a harness", "multi-agent pipeline", "agent loop", "... It has 57 downloads so far.

How do I install long-run-harness?

Run "/install long-run-harness" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is long-run-harness free?

Yes, long-run-harness is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does long-run-harness support?

long-run-harness is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created long-run-harness?

It is built and maintained by 小白 (@is-xins-xiaobai); the current version is v1.0.0.
