Description

Use when building, designing, or reviewing a multi-agent system for production — routing agents, orchestrating subagents, guarding tools with permissions, ma...

README (SKILL.md)

Production Agent Design

Name: Agent Guru
Author: weixuanjiang

Core Principle

The LLM is the reasoning engine. Your code is the execution engine. The loop is the contract between them.

Every production concern — safety, cost, retries, logging, permissions — lives in the harness, not the prompt. A prompt that says "be careful with deletions" is a suggestion. A GuardedToolNode that intercepts delete_* calls is a guarantee.
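
To make the prompt-versus-harness distinction concrete, here is a minimal sketch of a guard that blocks `delete_*` calls in code. It does not reproduce the skill's `GuardedToolNode`; the `PermissionRule` fields, the `guarded_call` function, and the two toy tools are assumptions made for illustration only.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class PermissionRule:
    """Hypothetical rule shape: a glob over tool names plus a decision."""
    pattern: str
    decision: str  # "allow" or "deny"


class PermissionDenied(Exception):
    """Raised instead of running a tool that the rules do not allow."""


def guarded_call(
    tools: Dict[str, Callable[..., Any]],
    rules: List[PermissionRule],
    name: str,
    **kwargs: Any,
) -> Any:
    """Run a tool only if the first matching rule allows it; deny by default."""
    for rule in rules:
        if fnmatchcase(name, rule.pattern):
            if rule.decision == "allow":
                return tools[name](**kwargs)
            raise PermissionDenied(f"{name} blocked by rule {rule.pattern!r}")
    raise PermissionDenied(f"{name} matched no rule")


# Toy tools standing in for real ones (assumed for the example).
tools = {
    "read_file": lambda path: f"contents of {path}",
    "delete_file": lambda path: f"deleted {path}",
}
rules = [PermissionRule("delete_*", "deny"), PermissionRule("read_*", "allow")]
```

With these rules, `guarded_call(tools, rules, "read_file", path="a.txt")` runs the tool, while any `delete_*` call raises `PermissionDenied` regardless of what the model was told in its prompt. Denying unmatched tool names by default is a design choice of this sketch.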

When to Use This Skill

Designing a new multi-agent system from scratch
Adding safety, cost controls, or observability to an existing agent
Debugging runaway cost, infinite loops, or context window exhaustion
Choosing between single-agent vs multi-agent topology
Implementing human-in-the-loop (HITL) for irreversible actions
Setting up session persistence and resumption

Architecture at a Glance

INGRESS (HTTP / CLI / Webhook / Schedule)
    │
ROUTER LAYER          — classify intent, dispatch cheaply
    │
ORCHESTRATOR          — decompose tasks, delegate to specialists
    ├── Agent A (scoped tools)
    └── Agent B (scoped tools)
         │
TOOL LAYER            — validate schema → check permission → execute → truncate
         │
CROSS-CUTTING CONCERNS
    ├── MEMORY         (short-term / working / long-term)
    ├── OBSERVABILITY  (traces, cost, session replay)
    └── RESILIENCE     (retry, circuit breaker, loop guard)
         │
PERSISTENCE           — checkpoints (Redis / Postgres) + audit log
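
The tool layer's four stages (validate schema → check permission → execute → truncate) can be sketched as a single wrapper. This is a rough illustration under stated assumptions: `run_tool`, its parameters, the `required` field map, and the 2000-character budget are hypothetical names and values, not part of the skill's reference files.

```python
import json
from typing import Any, Callable, Dict

MAX_RESULT_CHARS = 2000  # assumed budget; tune to the model's context window


def run_tool(
    fn: Callable[..., Any],
    args: Dict[str, Any],
    required: Dict[str, type],
    is_allowed: Callable[[Dict[str, Any]], bool],
    max_chars: int = MAX_RESULT_CHARS,
) -> str:
    """Validate, authorize, execute, and truncate one tool call."""
    # 1. Validate schema: each required field must be present with the right type.
    for field, expected in required.items():
        if not isinstance(args.get(field), expected):
            return f"error: field {field!r} must be {expected.__name__}"
    # 2. Check permission before anything runs.
    if not is_allowed(args):
        return "error: permission denied"
    # 3. Execute; a failure is returned as text so the model can observe it.
    try:
        result = fn(**args)
    except Exception as exc:
        return f"error: {type(exc).__name__}: {exc}"
    # 4. Truncate so a large result cannot exhaust the context window.
    text = result if isinstance(result, str) else json.dumps(result, default=str)
    if len(text) > max_chars:
        omitted = len(text) - max_chars
        text = text[:max_chars] + f"\n[truncated {omitted} chars]"
    return text
```

Returning errors as strings rather than raising mirrors the error-as-observation pattern listed in the Quick Reference; a real harness would likely wrap the text in its framework's tool-message type.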

Single Agent vs Multi-Agent

Task scoped to ONE domain?
  YES → Single ReAct agent with appropriate tools
  NO  → Independent subtasks?
          YES → Parallel multi-agent (supervisor + specialists)
          NO  → Sequential / hierarchical orchestrator
                  │
              Any irreversible step requiring human review?
                YES → Plan-then-execute with HITL interrupt
                NO  → Orchestrator with auto-delegation

Rule: Start with a single agent. Add multi-agent complexity only when you hit a concrete limit — context window size, tool set sprawl, latency, or accuracy.
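
The decision tree above can also be written as a small function, which is handy for design reviews or tests. The function name, its boolean parameters, and the returned labels are assumptions chosen for illustration.

```python
def choose_topology(
    single_domain: bool,
    independent_subtasks: bool,
    needs_human_review: bool,
) -> str:
    """Walk the topology decision tree top to bottom and return a label."""
    if single_domain:
        return "single ReAct agent"
    if independent_subtasks:
        return "parallel multi-agent (supervisor + specialists)"
    # Sequential / hierarchical branch: split on irreversible steps.
    if needs_human_review:
        return "plan-then-execute with HITL interrupt"
    return "orchestrator with auto-delegation"
```

Note that `needs_human_review` only matters on the sequential branch, exactly as in the tree; a single-domain task returns early without consulting it.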

Framework Selection

Need	Use
Complex branching, HITL, durable persistence, fine-grained control	LangGraph
Simple loop, minimal boilerplate, rapid prototype, leaf agents	Strands
Orchestration graph + simple leaf agents	LangGraph + Strands hybrid

Reference Files

Load these on demand using the triggers listed below. Do not load all of them upfront.

File	Load when...
references/router-layer.md	Designing intent routing, building a classifier node, handling misrouting
references/orchestrator-layer.md	Decomposing tasks, spawning subagents, implementing plan-then-execute
references/tool-safety-layer.md	Designing tools, adding permission rules, implementing HITL or killswitch
references/memory-layer.md	Context window approaching limit, adding long-term memory, injecting project context
references/observability-layer.md	Adding tracing, tracking token cost, debugging agent behavior, setting up alerts
references/resilience-layer.md	Adding retry logic, circuit breakers, preventing infinite loops
references/persistence-layer.md	Choosing a checkpointer, implementing session resume, session branching
references/production-checklist.md	Before deploying to production — full ~40-point readiness checklist

Quick Reference

Pattern	Key implementation	Reference
Intent routing	`conditional_edges` + confidence threshold	`router-layer.md`
Scoped subagents	`create_react_agent` with tool subset	`orchestrator-layer.md`
Plan-then-execute	Two nodes, read-only tools in plan phase	`orchestrator-layer.md`
Tool schema	`args_schema=PydanticModel` on `@tool`	`tool-safety-layer.md`
Permission guard	`GuardedToolNode` with `PermissionRule` list	`tool-safety-layer.md`
HITL interrupt	`interrupt()` + `Command(resume=...)`	`tool-safety-layer.md`
Runtime concurrency	`is_concurrency_safe(input)` per tool call	`tool-safety-layer.md`
Abort hierarchy	Query-level abort + sibling-level child abort	`tool-safety-layer.md`
Tiered compaction	budget → snip → microcompact → autocompact	`memory-layer.md`
Auto-compaction	Summarization node at 80% context	`memory-layer.md`
Context injection	`AGENT.md` loaded into system prompt	`memory-layer.md`
Full trace	`BaseCallbackHandler` + structured events	`observability-layer.md`
Cost tracking	Per-turn token accounting in callback	`observability-layer.md`
Config snapshot	Freeze all feature flags at query entry	`observability-layer.md`
Diminishing returns	Track token deltas; stop if the last 2 deltas are each < 500 tokens	`resilience-layer.md`
Output limit escalation	Escalate to 64k tokens before compaction	`resilience-layer.md`
Streaming cleanup	Tombstone partial messages on fallback	`resilience-layer.md`
Error-as-observation	`try/except` → `ToolMessage`	`resilience-layer.md`
Circuit breaker	State machine wrapping tool fn	`resilience-layer.md`
Session resume	Checkpointer + stable `thread_id`	`persistence-layer.md`
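
As one example from the table, the circuit breaker row ("state machine wrapping tool fn") might look like the sketch below. The class name, the three state labels, the thresholds, and the injectable `clock` are assumptions; the version in `resilience-layer.md` may differ.

```python
import time
from typing import Any, Callable


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Wrap a tool function in a closed -> open -> half_open state machine."""

    def __init__(
        self,
        fn: Callable[..., Any],
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fn = fn
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            self.state = "half_open"  # cool-down elapsed: allow one trial call
        try:
            result = self.fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

Wrapping a tool is then `safe_search = CircuitBreaker(search)`, where `search` is whatever callable the agent would otherwise invoke directly. Passing `clock` as a parameter is a choice made here so the cool-down can be tested without sleeping.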

Gotchas

Safety rules must be code, not prompts. A prompt saying "don't delete production data" is not a safety control.
Never dump the full parent message history into a subagent. Pass only the specific task and relevant data — context pollution degrades performance and wastes tokens.
InMemorySaver is for development only. Use Redis or Postgres checkpointers in production.
interrupt() pauses the graph. Resume it by calling graph.invoke(Command(resume=...), config=config) — forgetting this leaves the agent stuck.
Tool result truncation is mandatory. Large tool outputs (file reads, search results) will exhaust the context window if not truncated before returning.
Always set max_iterations. Without a loop guard, a miscalibrated agent runs indefinitely and incurs unbounded cost.
Apply compaction in tiers. Budget tool results → snip → microcompact → autocompact. Jumping straight to full summarization wastes tokens when a cheaper step would suffice.
Track diminishing returns, not just token budget. An agent can burn through its iteration budget producing nearly empty continuations. Stop when the last 2 deltas are both below ~500 tokens.
Snapshot config at query entry. Never re-read feature flags or env vars mid-turn — a remote config change during a 30-second response causes inconsistent behavior within a single turn.
Concurrency safety must be checked at runtime. Schema metadata cannot determine if a bash command is safe — inspect the actual input string at call time. Fail conservatively (serial) if parsing fails.
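
Two of the gotchas above, the `max_iterations` loop guard and the diminishing-returns check, combine naturally in one loop. The following is a minimal sketch: `should_stop`, `run_loop`, and the `step` callback (which returns the tokens produced on each iteration) are hypothetical names, and the 500-token threshold follows the approximate figure given above.

```python
from typing import Callable, List, Tuple


def should_stop(token_deltas: List[int], min_delta: int = 500, window: int = 2) -> bool:
    """True when the last `window` deltas are all below `min_delta` tokens."""
    if len(token_deltas) < window:
        return False
    return all(delta < min_delta for delta in token_deltas[-window:])


def run_loop(step: Callable[[int], int], max_iterations: int = 10) -> Tuple[int, str]:
    """Call step(i) until output dries up or the iteration cap is reached.

    Returns (iterations_run, stop_reason).
    """
    deltas: List[int] = []
    for i in range(max_iterations):
        deltas.append(step(i))
        if should_stop(deltas):
            return i + 1, "diminishing_returns"
    return max_iterations, "max_iterations"
```

A real agent loop would also exit when the model signals completion; that branch is omitted here to keep the two guards visible.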

Usage Guidance

This is a content-rich, instruction-only playbook for production multi-agent systems, and it appears coherent with that purpose. Before you copy or run any examples:

(1) review and remove hardcoded credentials and replace them with secured secrets;
(2) sandbox code that reads files (AGENT.md, ~/.agent, /etc) to avoid unintentionally exposing local secrets;
(3) validate any remote endpoints before allowing the agent to call them (the remote killswitch example calls an internal config URL);
(4) adopt the GuardedToolNode / HITL patterns for any destructive tooling; and
(5) if you need higher assurance, ask the publisher for provenance (homepage, repo) or run the code in an isolated dev environment.

If you want a deeper risk review, provide the publisher/source URL or say which code snippets you intend to reuse.

Capability Analysis

Type: OpenClaw Skill
Name: agent-guru
Version: 1.0.0

The skill bundle is a comprehensive architectural guide and reference library for building production-grade, safe, and observable multi-agent systems using frameworks like LangGraph. It contains high-quality code examples for critical safety patterns such as human-in-the-loop (HITL) interrupts, permission guards (GuardedToolNode), context window management (auto-compaction), and cost tracking. There is no evidence of malicious intent, data exfiltration, or obfuscation; rather, the content focuses on preventing common agent failure modes like infinite loops and unauthorized tool execution.

Capability Assessment

✓ Purpose & Capability

The name/description (production multi-agent design) aligns with the content: detailed architecture patterns, tooling, and code examples for routing, orchestration, safety, memory, observability and persistence. It does not request unrelated credentials, binaries, or installs.

ℹ Instruction Scope

SKILL.md and the reference files contain runnable examples that read local files (e.g., AGENT.md from working_dir, ~/.agent, /etc/agent/global), connect to DBs/Redis/Postgres (example connection strings), spin up an HTTP endpoint, and fetch remote policy via httpx.get. Those are appropriate for the stated purpose (production agent harnesses) but they do instruct accessing filesystem and network resources — review and sandbox any copied examples before running.

✓ Install Mechanism

Instruction-only skill with no install spec or shipped code — lowest install risk. Example pip install lines appear in docs (langgraph, langgraph-supervisor) but no code is downloaded by the skill itself.

ℹ Credentials

The skill does not declare required env vars or credentials, but examples reference environment-driven config (os.getenv), DB URLs, Redis/Postgres connection examples, and snapshotting of MAX_OUTPUT_TOKENS etc. These are reasonable for production guidance but you should not copy hardcoded credentials (e.g., 'postgresql://user:pass@db:5432/agents') into real deployments and should limit which env vars or secrets are used.

✓ Persistence & Privilege

always is false and there is no install-time persistence or privileged modification of other skills. The guidance describes persistent components (checkpointers, vector stores) that are normal in production — the skill itself does not request permanent platform privileges.

Version History

v1.0.0

Initial release: Comprehensive multi-agent production design patterns and reference for LangGraph-based frameworks.
- Introduces best practices and architectural patterns for scalable, safe, and observable agent systems.
- Includes decision trees for agent topology and framework selection (LangGraph, Strands).
- Provides modular, on-demand reference files for each system layer (routing, orchestrator, tools, memory, observability, resilience, persistence, checklist).
- Documents concrete implementation tips, safeguards, and gotchas for production reliability and cost control.
- Emphasizes code-level enforcement of safety, memory management, error handling, and concurrency controls.

Metadata

Slug agent-guru

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Agent Guru?

Agent Guru is an AI Agent Skill for Claude Code / OpenClaw, intended for building, designing, or reviewing multi-agent systems for production. It has 100 downloads so far.

How do I install Agent Guru?

Run "/install agent-guru" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Agent Guru free?

Yes, Agent Guru is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Agent Guru support?

Agent Guru is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Agent Guru?

It is built and maintained by Weixuan Jiang (@weixuanjiang); the current version is v1.0.0.
