is-xins-xiaobai

long-run-harness

by 小白 · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
57 Downloads · 0 Stars · 0 Active Installs · 1 Versions
Install in OpenClaw
/install long-run-harness
Description
Use when building a Planner→Generator→Evaluator multi-agent harness with the Claude SDK. Triggers: "build a harness", "multi-agent pipeline", "agent loop", "...
README (SKILL.md)

Long-Running App Harness — SDK Implementation

Produces a runnable harness that orchestrates Claude agents via claude_agent_sdk. You are writing the harness, not running inside it.

Use query() + ClaudeAgentOptions for agentic loops; tool() + create_sdk_mcp_server() for structured output. Never call anthropic.Anthropic() directly.

pip install claude-agent-sdk

Output structure:

harness/
  harness.py; config.yaml; config.py; log.py
  agents/ planner.py; generator.py; evaluator.py
  models/ state.py
  prompts/ planner.md; generator.md; evaluator.md

Routing

| User Signal | Route |
| --- | --- |
| "build a harness / pipeline" | Start at Phase 1 |
| "add an evaluator" | Jump to Phase 4 |
| "add state / handoff" | Jump to Phase 5 |
| "looping forever / broken" | Check feedback loop termination in Phase 5 |
| "just explain what a harness does" | Explain concept, don't write code |

Phase 1: Design the Harness

Load: $SKILL_DIR/instructions/planner-questions.md

⚠️ HARD GATE: Ask the design questions. Get answers to 1–3 before writing any code:

  1. What does the harness build? (sets Generator tools + Evaluator rubric)
  2. Python or TypeScript? (default: Python)
  3. Models per agent? (default: all claude-opus-4-7; non-defaults → config.yaml)

Create skeleton:

mkdir -p harness/agents harness/models harness/prompts harness/harness-logs
touch harness/harness.py harness/log.py harness/agents/__init__.py harness/models/__init__.py

config.yaml + config.py — all tunable parameters here; never hardcode in agent files. Load: $SKILL_DIR/instructions/config.md for the full HarnessConfig dataclass.

cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
# Always: cfg.agents.generator_model  — never: "claude-opus-4-7"
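
The full HarnessConfig lives in config.md, which is not reproduced here. As a rough sketch of the idea only, the nested dataclasses below mirror the attribute paths used in this README (cfg.agents.generator_model, cfg.loop.max_iterations). The planner_model and evaluator_model fields, the numeric defaults, and the from_dict helper are assumptions for illustration; a real load() would parse config.yaml into the same dictionary shape first.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentsConfig:
    # Defaults follow the README's "all claude-opus-4-7" rule; field names
    # other than generator_model are assumed.
    planner_model: str = "claude-opus-4-7"
    generator_model: str = "claude-opus-4-7"
    evaluator_model: str = "claude-opus-4-7"


@dataclass
class LoopConfig:
    max_iterations: int = 5  # assumed default
    min_iterations: int = 1  # assumed default


@dataclass
class HarnessConfig:
    agents: AgentsConfig = field(default_factory=AgentsConfig)
    loop: LoopConfig = field(default_factory=LoopConfig)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "HarnessConfig":
        # Missing sections fall back to the dataclass defaults.
        return cls(
            agents=AgentsConfig(**raw.get("agents", {})),
            loop=LoopConfig(**raw.get("loop", {})),
        )


# Example: only non-default values need to appear in the config file.
cfg = HarnessConfig.from_dict(
    {"agents": {"generator_model": "my-model"}, "loop": {"max_iterations": 3}}
)
```

Agent files then read cfg.agents.generator_model and never see a model string of their own.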

models/state.py — write first; all other files import from it. Load: $SKILL_DIR/instructions/context-handoff.md (HandoffState, EvalResult, format_handoff_for_prompt). Load: $SKILL_DIR/instructions/sprint-contracts.md (SprintContract + negotiation protocol).
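
The real HandoffState is defined in context-handoff.md. The sketch below only illustrates the general shape such a state object might take: a dataclass that round-trips through JSON and renders itself for a prompt. Every field name here, and the path argument to save(), is an assumption rather than the skill's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class HandoffState:
    # Hypothetical fields; the real schema is in context-handoff.md.
    sprint: int = 1
    completed_features: list[str] = field(default_factory=list)
    pending_features: list[str] = field(default_factory=list)
    notes: str = ""

    def save(self, path: Path) -> None:
        """Persist the state as JSON so a crashed run can be resumed."""
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "HandoffState":
        return cls(**json.loads(path.read_text(encoding="utf-8")))


def format_handoff_for_prompt(state: HandoffState) -> str:
    """Render the state as a short text block for the next agent's prompt."""
    done = ", ".join(state.completed_features) or "none"
    todo = ", ".join(state.pending_features) or "none"
    return (
        f"Sprint {state.sprint}\n"
        f"Completed: {done}\n"
        f"Pending: {todo}\n"
        f"Notes: {state.notes or 'none'}"
    )
```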

log.py — dual stdout + timestamped file under harness-logs/. Load: $SKILL_DIR/instructions/logging.md for full implementation.

log.setup(PROJECT_DIR, label="run")  # once in main()
logger = log.get()                   # in every agent

Phase 2: Planner Agent

Load: $SKILL_DIR/instructions/planner-questions.md for system prompt template. Load: $SKILL_DIR/instructions/agent-patterns.md for full run_planner implementation.

run_planner(brief, session_id, cfg) → (reply, new_session_id). ClaudeAgentOptions(resume=session_id) continues session without resending history.

spec, session_id = "", None
while "SPEC_COMPLETE" not in spec:
    user_input = input("[Planner asks]: ").strip() if session_id else initial_brief
    spec, session_id = run_planner(user_input, session_id, cfg)
SPEC_PATH.write_text(spec.replace("SPEC_COMPLETE", "").strip())
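
The loop above runs until the planner emits SPEC_COMPLETE. One way to exercise that control flow without the SDK is to inject the planner and the input function, as in the sketch below. collect_spec, its parameters, and the max_turns cap are hypothetical additions for illustration; the real run_planner is in agent-patterns.md.

```python
from typing import Callable, Optional, Tuple

# (user_text, session_id) -> (reply, new_session_id); stands in for run_planner.
PlannerFn = Callable[[str, Optional[str]], Tuple[str, Optional[str]]]


def collect_spec(
    initial_brief: str,
    planner: PlannerFn,
    ask: Callable[[str], str],
    max_turns: int = 10,  # assumed safety cap so the dialogue cannot loop forever
) -> str:
    """Drive the planner dialogue until it marks the spec as complete."""
    spec, session_id = "", None
    for _ in range(max_turns):
        # First turn sends the brief; later turns show the planner's reply
        # and forward the user's answer.
        if session_id:
            user_input = ask(f"{spec}\n[Planner asks]: ").strip()
        else:
            user_input = initial_brief
        spec, session_id = planner(user_input, session_id)
        if "SPEC_COMPLETE" in spec:
            return spec.replace("SPEC_COMPLETE", "").strip()
    raise RuntimeError(f"planner did not finish within {max_turns} turns")


def fake_planner(text: str, session_id: Optional[str]) -> Tuple[str, Optional[str]]:
    """Scripted stand-in: asks one question, then returns a finished spec."""
    if session_id is None:
        return "Which platform?", "session-1"
    return "# Spec\nBuild it\nSPEC_COMPLETE", "session-1"


spec_text = collect_spec("build a todo app", fake_planner, ask=lambda prompt: " web ")
```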

Phase 3: Generator Agent

Load: $SKILL_DIR/instructions/agent-patterns.md for run_generator + self_assess implementations.

def run_generator(
    spec, contract, project_dir,
    handoff=None, strategic_framing=None, cfg=None,
) -> str: ...

ClaudeAgentOptions(
    model=cfg.agents.generator_model,
    allowed_tools=["Write", "Read", "Edit", "Bash", "Glob"],
    cwd=str(project_dir), permission_mode="bypassPermissions",
)

After generation, call self_assess() — catches gaps before the Evaluator via submit_assessment MCP tool. If not confident → extra pass with concerns as strategic_framing.


Phase 4: Evaluator Agent

Load: $SKILL_DIR/instructions/agent-patterns.md for full implementation. Load: $SKILL_DIR/instructions/evaluation-rubrics.md for system prompt + rubric criteria.

Two roles: run_evaluator() (post-generation gate) + review_contract() (pre-sprint criteria review).

# submit_grade schema: contract_results[{id, status, evidence}], rubric_scores{id: 1–5}, feedback
def run_evaluator(spec, contract, app_url, rubric_track="A", cfg=None) -> EvalResult: ...

⚠️ Deterministic verdict: Never trust verdict from the LLM. Recompute in _build_eval_result() from contract_results + rubric_scores using cfg.verdict.* thresholds.


Phase 5: Harness Loop

Load: $SKILL_DIR/instructions/iteration-loop.md for run_sprint, strategic_decision, git_commit.

def main():
    cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
    log.setup(PROJECT_DIR, label="run")

def run_sprint(spec, contract, project_dir, handoff=None, cfg=None):
    while iteration < cfg.loop.max_iterations:
        # 1. Generate — try/except; crash is a valid (poor) outcome
        # 2. Self-assess — extra pass if not confident
        # 3. git_commit("wip: sprint N iter I")
        # 4. Evaluate → EvalResult
        # 5a. Pass + iteration < min_iterations → quality-improvement continue
        #     Pass + min_iterations met → git_commit("feat") + return
        # 5b. Fail → strategic_decision() → REFINE or PIVOT → set strategic_framing
    # Exhausted: input() if isatty() else return last result
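
The outline above can be reduced to a small, bounded loop. The sketch below takes the agents as injected callables so the termination logic can be checked in isolation; it leaves out self-assessment and the interactive prompt on exhaustion. The function signature and the "quality-improvement" framing string are assumptions, not the implementation in iteration-loop.md.

```python
from typing import Callable, Optional


def run_sprint(
    generate: Callable[[Optional[str]], None],  # receives strategic_framing
    evaluate: Callable[[], bool],               # True means the Evaluator passed
    decide: Callable[[], str],                  # returns "REFINE" or "PIVOT"
    commit: Callable[[str], None],              # git checkpoint hook
    max_iterations: int,
    min_iterations: int,
    sprint: int = 1,
) -> bool:
    """Run one sprint; always terminates after at most max_iterations passes."""
    strategic_framing: Optional[str] = None
    passed = False
    for iteration in range(1, max_iterations + 1):
        try:
            generate(strategic_framing)
        except Exception:
            pass  # a crashed generation is still evaluated as a (poor) outcome
        commit(f"wip: sprint {sprint} iteration {iteration}")
        passed = evaluate()
        if passed and iteration >= min_iterations:
            commit(f"feat: sprint {sprint} complete")
            return True
        if passed:
            # Passed early: spend the remaining required iterations on quality.
            strategic_framing = "quality-improvement"
        else:
            strategic_framing = decide()
    return passed  # exhausted: hand back the last result
```

Using a for loop over a fixed range, rather than a while loop with a manually updated counter, removes the most common cause of a sprint that never ends.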

Git checkpoints (see iteration-loop.md for git_commit() helper):

| Event | Message |
| --- | --- |
| SPEC written | feat: generate SPEC.md |
| Contract negotiated | chore: sprint N contract |
| Each iteration | wip: sprint N iteration I |
| Sprint passes | feat: sprint N complete |
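
The git_commit() helper itself is in iteration-loop.md. As an illustration of the checkpoint scheme only, the sketch below maps each event in the table to its message and wraps the git CLI with subprocess. The event keys, the function names' signatures, and the skip-when-nothing-staged behaviour are assumptions.

```python
import subprocess
from pathlib import Path

# Hypothetical event keys mapped to the checkpoint messages listed above.
_TEMPLATES = {
    "spec": "feat: generate SPEC.md",
    "contract": "chore: sprint {sprint} contract",
    "iteration": "wip: sprint {sprint} iteration {iteration}",
    "sprint_pass": "feat: sprint {sprint} complete",
}


def checkpoint_message(event: str, sprint: int = 0, iteration: int = 0) -> str:
    """Build the commit message for a checkpoint event."""
    return _TEMPLATES[event].format(sprint=sprint, iteration=iteration)


def git_commit(project_dir: Path, message: str) -> bool:
    """Stage everything and commit; return False when there is nothing to commit."""
    subprocess.run(["git", "add", "-A"], cwd=project_dir, check=True)
    # `git diff --cached --quiet` exits with 1 when staged changes exist.
    staged = subprocess.run(["git", "diff", "--cached", "--quiet"], cwd=project_dir)
    if staged.returncode == 0:
        return False
    subprocess.run(["git", "commit", "-m", message], cwd=project_dir, check=True)
    return True
```

A typical call would be git_commit(project_dir, checkpoint_message("iteration", sprint=2, iteration=3)), run inside a repository that already has a user name and email configured.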

Setup: pip install claude-agent-sdk && export ANTHROPIC_API_KEY=sk-...

Verify: python -c "from agents.planner import run_planner; print('OK')"


Common Mistakes

| Mistake | Fix |
| --- | --- |
| Trusting LLM's verdict field | Recompute in _build_eval_result() from contract_results + rubric_scores |
| Hardcoding model names | Use cfg.agents.generator_model, never a string literal |
| Not calling handoff.save() before Evaluator | On crash, Evaluator result is lost |
| Using input() in CI | Guard with sys.stdin.isatty() first |
| Accumulating messages across sprints | Each sprint is a fresh query() call; no cross-sprint history |
| Marking completed_features from Generator claim | Only promote after Evaluator PASS verdict |
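
The input() guard from the list above might be wrapped in a small helper like the one below, which returns a default instead of blocking when no terminal is attached. The helper name and its stream parameter are hypothetical; the stream argument exists only so the behaviour can be tested without a real terminal.

```python
import sys
from typing import Optional, TextIO


def ask_or_default(prompt: str, default: str, stream: Optional[TextIO] = None) -> str:
    """Prompt a human when attached to a terminal; otherwise return the default."""
    source = stream if stream is not None else sys.stdin
    if not source.isatty():
        # CI, cron, or piped input: never block waiting for an answer.
        return default
    print(prompt, end="", flush=True)
    answer = source.readline().strip()
    return answer or default
```

In the harness loop this would replace a bare input() call at the point where iterations are exhausted, with the default set to whatever "return the last result" means for the caller.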

When to Simplify

| Component | Remove / simplify when |
| --- | --- |
| Planner agent | User provides SPEC directly |
| Contract negotiation | Human has strong opinions; use config-file mode |
| Generator self-assessment | Evaluator consistently passes first attempt |
| max_iterations → 3 | Correctness-only task, no quality/aesthetic goal |
| min_iterations → 1 | Early passes are always good enough |
| Refine/pivot strategic_decision | Single sprint or correctness task |
| HandoffState | Sprint fits in one context window |
| Evaluator | Task within Generator's reliable baseline |

Usage Guidance
Use this skill only if you intend to create an autonomous code-writing harness. Before running the generated harness on real work, pin the SDK dependency, run it in a sandbox or disposable repo, change bypassed permissions to approval-based permissions where possible, restrict Bash, and review handoff files and logs for secrets or misleading instructions.
Capability Analysis
Type: OpenClaw Skill Name: long-run-harness Version: 1.0.0 The skill bundle implements a multi-agent orchestration harness that utilizes high-risk capabilities, specifically arbitrary shell execution via the 'Bash' tool and explicit permission bypassing ('bypassPermissions') in SKILL.md and agent-patterns.md. While these features are plausibly required for the stated purpose of autonomous application development, they create a significant attack surface for remote code execution. No evidence of intentional malice, data exfiltration, or unauthorized persistence was found.
Capability Tags
crypto · requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The stated purpose—generating a Planner→Generator→Evaluator app-building harness—matches the multi-agent SDK code, loops, evaluator, handoff state, and logging instructions. No hidden exfiltration endpoint or destructive intent is evident in the provided artifacts.
Instruction Scope
The generated Generator agent is instructed to use Write, Read, Edit, Bash, and Glob with permission_mode="bypassPermissions". That is purpose-aligned for an app-building harness, but it removes normal per-action review for high-impact local actions.
Install Mechanism
The skill is instruction-only and tells the user to install claude-agent-sdk with an unpinned pip command. This is central to the purpose, but users should pin and verify the dependency.
Credentials
The generated harness can repeatedly modify project files, run shell commands, and make git commits inside a user-supplied project directory. This is powerful enough that it should be run only in a sandbox or disposable repository unless carefully reviewed.
Persistence & Privilege
The skill intentionally creates logs, handoff JSON files, session resumes, and bounded long-running loops. These are disclosed and purpose-aligned, but they persist model-generated context and tool inputs that users should inspect and sanitize.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install long-run-harness
  3. After installation, invoke the skill by name or use /long-run-harness
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of the skill for building multi-agent Claude SDK harnesses.
- Guides users through designing and implementing a Planner→Generator→Evaluator orchestrator using `claude-agent-sdk`.
- Enforces best practices: config-driven parameters, explicit agent looping, and robust evaluation logic.
- Provides detailed, phase-based instructions for harness structure, agent implementation, and iteration management.
- Highlights common mistakes and when to simplify the harness.
Metadata
Slug long-run-harness
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is long-run-harness?

Use when building a Planner→Generator→Evaluator multi-agent harness with the Claude SDK. Triggers: "build a harness", "multi-agent pipeline", "agent loop", "... It is an AI Agent Skill for Claude Code / OpenClaw, with 57 downloads so far.

How do I install long-run-harness?

Run "/install long-run-harness" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is long-run-harness free?

Yes, long-run-harness is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does long-run-harness support?

long-run-harness is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created long-run-harness?

It is built and maintained by 小白 (@is-xins-xiaobai); the current version is v1.0.0.
