Description

Evaluate everything the PA agent manages — tasks, skills, PA network health, billing, calendar connections, and memory quality. Use when: owner asks for an e...

README (SKILL.md)

Load Local Context

CONTEXT_FILE="/opt/ocana/openclaw/workspace/skills/eval/.context"
[ -f "$CONTEXT_FILE" ] && source "$CONTEXT_FILE"
# Then use: $OWNER_PHONE, $WORKSPACE, $TASKS_FILE, $MONDAY_TOKEN_FILE, $GOG_CREDS, etc.

Eval Skill

Name: Eval
Author: netanel-abergel

Structured evaluation of everything the agent manages.

When to Use

Trigger phrases:

"run eval"
"what's working and what isn't"
"rate yourself"
"check everything"

Pre-Eval Behavioral Checks (Always)

React 👍 when owner triggers eval
React ✅ when report is complete
PA directory source: /opt/ocana/openclaw/workspace/PA_LIST.md
Calendar check: use direct API (NOT gog CLI)

Eval Report Format

📋 Full Eval — [DATE]

━━━ SELF PERFORMANCE ━━━
Execution:      [1-5] [comment]
Accuracy:       [1-5] [comment]
Memory:         [1-5] [comment]
Proactivity:    [1-5] [comment]
Communication:  [1-5] [comment]
TOTAL: [X]/25

━━━ ACTIVE TASKS ━━━
✅ Done today:   [count]
🟡 In progress:  [count]
❌ Stalled:      [count] — [list stalled tasks]

━━━ PA NETWORK ━━━
✅ Working:  [list]
⚠️ Issues:   [list with issue]
❌ Down:     [list]

━━━ SKILLS ━━━
Installed: [count]
Used today: [list]
Unused (7+ days): [list]

━━━ INTEGRATIONS ━━━
Calendar (owner):     [connected ✅ / broken ❌ / unknown ?]
monday.com:           [connected ✅ / broken ❌]
Email (gog):          [connected ✅ / broken ❌]
GitHub backup:        [last push: X ago]
WhatsApp:             [connected ✅ / disconnected ❌]

━━━ MEMORY HEALTH ━━━
Daily notes:     [today's file exists? ✅/❌]
Long-term:       [MEMORY.md size — OK / bloated]
Learnings:       [count this week]
Last backup:     [X ago]

━━━ RECOMMENDATIONS ━━━
1. [Most important thing to fix]
2. [Second priority]
3. [Optional improvement]

Running the Eval

Step 1 — Self Performance Score

Score each dimension 1–5 based on today's activity:

Execution (1–5):
- 5: All tasks completed without reminders
- 3: Most tasks done, some follow-up needed
- 1: Multiple tasks missed or forgotten

Accuracy (1–5):
- 5: No corrections from owner
- 3: 1–2 corrections
- 1: Multiple errors or wrong outputs

Memory (1–5):
- 5: Recalled context correctly every time
- 3: Missed some context but caught on
- 1: Repeated the same mistakes

Proactivity (1–5):
- 5: Acted before being asked, multiple times
- 3: Responded to requests, minimal initiative
- 1: Only reacted, no proactive actions

Communication (1–5):
- 5: Clear, concise, no unnecessary narration
- 3: Occasionally verbose or unclear
- 1: Shared reasoning, listed options, narrated steps
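
The rubric above can be totalled mechanically. The following is a minimal sketch, assuming the five scores arrive as a dict keyed by dimension name; the function name `total_self_score` is hypothetical and not part of the skill.

```python
DIMENSIONS = ("Execution", "Accuracy", "Memory", "Proactivity", "Communication")

def total_self_score(scores):
    """Validate the five 1-5 scores and return (total, max_total)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        value = scores[name]
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return sum(scores[d] for d in DIMENSIONS), 5 * len(DIMENSIONS)

total, max_total = total_self_score(
    {"Execution": 4, "Accuracy": 5, "Memory": 3, "Proactivity": 3, "Communication": 4}
)
print(f"TOTAL: {total}/{max_total}")  # prints "TOTAL: 19/25"
```

Rejecting out-of-range values early keeps a malformed score from silently skewing the TOTAL line of the report.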

Step 2 — Task Audit

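# Keep $TASKS_FILE from .context if it is set; otherwise use the default path
TASKS_FILE="${TASKS_FILE:-$HOME/.openclaw/workspace/memory/tasks.md}"

# grep -c prints 0 by itself when nothing matches; the echo only covers a missing file
echo "Tasks done:"
grep -c "\[x\]" "$TASKS_FILE" 2>/dev/null || [ -f "$TASKS_FILE" ] || echo 0

echo "Tasks in progress:"
grep -c "\[ \]" "$TASKS_FILE" 2>/dev/null || [ -f "$TASKS_FILE" ] || echo 0

# Stalled = in progress for 2+ days (line carries neither today's nor yesterday's date)
TODAY=$(date +%Y-%m-%d)
# GNU date first, BSD/macOS date as a fallback
YESTERDAY=$(date -d '1 day ago' +%Y-%m-%d 2>/dev/null || date -v-1d +%Y-%m-%d)
echo "Stalled tasks (2+ days old):"
grep "\[ \]" "$TASKS_FILE" 2>/dev/null | grep -v -e "$TODAY" -e "$YESTERDAY" || echo "none"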
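
The same audit can be expressed in Python when the counts need to feed the report directly. This is a sketch only: it assumes, as the grep pipeline implies, that tasks use `[x]` / `[ ]` checkboxes and that each task line carries a `YYYY-MM-DD` date. The function name `audit_tasks` and the sample lines are hypothetical.

```python
import re
from datetime import date, timedelta

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def audit_tasks(lines, today):
    """Count done and open tasks; list open tasks not dated today or yesterday."""
    recent = {today.isoformat(), (today - timedelta(days=1)).isoformat()}
    done, in_progress, stalled = 0, 0, []
    for line in lines:
        if "[x]" in line:
            done += 1
        elif "[ ]" in line:
            in_progress += 1
            # Same rule as the shell version: stalled means no recent date on the line.
            if not recent.intersection(DATE_RE.findall(line)):
                stalled.append(line.strip())
    return {"done": done, "in_progress": in_progress, "stalled": stalled}

sample = [
    "- [x] 2025-01-10 Send invoice",
    "- [ ] 2025-01-10 Book flights",
    "- [ ] 2025-01-06 Renew domain",
]
result = audit_tasks(sample, date(2025, 1, 10))
print(result["done"], result["in_progress"], result["stalled"])
```

Passing `today` explicitly keeps the function deterministic, which makes the stalled rule easy to check against known dates.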

Step 3 — PA Network Health

BILLING_FILE="$HOME/.openclaw/workspace/memory/billing-status.json"

echo "PA Network Status:"
# Pass the path through the environment: the quoted heredoc does not expand shell variables
BILLING_FILE="$BILLING_FILE" python3 << 'PYEOF'
import json
import os

with open(os.environ['BILLING_FILE']) as f:
    data = json.load(f)
for pa in data['issues']:
    status = "✅" if pa['status'] == 'resolved' else "⚠️"
    print(f"  {status} {pa['pa']} ({pa['owner']}): {pa['status']}")
PYEOF

Step 4 — Skills Audit

SKILLS_DIR="$HOME/.openclaw/workspace/skills"

echo "Installed skills:"
ls "$SKILLS_DIR" | grep -v README | wc -l

echo "Skills list:"
ls "$SKILLS_DIR" | grep -v README

Step 5 — Integration Health

# Test Anthropic billing
API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "x-api-key: ${ANTHROPIC_API_KEY:-none}" \
  -H "anthropic-version: 2023-06-01" \
  https://api.anthropic.com/v1/models 2>/dev/null)

# Interpret result
if [ "$API_STATUS" = "200" ]; then echo "Billing: ✅ OK"
elif [ "$API_STATUS" = "402" ]; then echo "Billing: ❌ OUT OF CREDITS"
elif [ "$API_STATUS" = "401" ]; then echo "Billing: ❌ Invalid key"
else echo "Billing: ? HTTP $API_STATUS"
fi

# Test GitHub backup
LAST_PUSH=$(git -C "$HOME/.openclaw/workspace" log -1 --format="%ar" 2>/dev/null)
echo "Last backup: $LAST_PUSH"

# Test monday.com
if [ -f "$HOME/.credentials/monday-api-token.txt" ]; then
  MONDAY_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -X POST https://api.monday.com/v2 \
    -H "Authorization: $(cat "$HOME/.credentials/monday-api-token.txt")" \
    -H "Content-Type: application/json" \
    -d '{"query": "{ me { id } }"}' 2>/dev/null)
  [ "$MONDAY_STATUS" = "200" ] && echo "monday.com: ✅" || echo "monday.com: ❌ ($MONDAY_STATUS)"
else
  echo "monday.com: ? (no token found)"
fi

Step 6 — Memory Health

TODAY=$(date -u +%Y-%m-%d)
WORKSPACE="$HOME/.openclaw/workspace"

# Check daily notes exist
[ -f "$WORKSPACE/memory/$TODAY.md" ] \
  && echo "Daily notes: ✅" \
  || echo "Daily notes: ❌ not created yet"

# Check MEMORY.md size (warn if >200 lines)
MEMORY_LINES=$(wc -l < "$WORKSPACE/MEMORY.md" 2>/dev/null || echo 0)
if [ "$MEMORY_LINES" -gt 200 ]; then
  echo "MEMORY.md: ⚠️ Large ($MEMORY_LINES lines) — consider pruning"
else
  echo "MEMORY.md: ✅ ($MEMORY_LINES lines)"
fi

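# Count learnings logged (all "##" entries; not filtered by week)
# grep -c already prints 0 when nothing matches, so only default when the file is missing
LEARNINGS=$(grep -c "^##" "$WORKSPACE/.learnings/LEARNINGS.md" 2>/dev/null)
echo "Total learnings logged: ${LEARNINGS:-0}"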

Recommendations Logic

After running all steps, generate recommendations:

If any PA has billing_error AND status != resolved:
  → "Fix billing for [PA list] — they can't function"

If any task has status in_progress for 2+ days:
  → "Follow up on stalled task: [task name]"

If MEMORY.md > 200 lines:
  → "Prune MEMORY.md — it's getting bloated"

If daily notes don't exist:
  → "Create today's memory file"

If last backup > 6 hours ago:
  → "Run git backup"

If API billing = 402:
  → "My own API key is out of credits — alert the admin immediately"
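
The rules above map naturally onto a single function over a snapshot of the collected results. This is a hedged sketch: the `state` keys (`billing_issues`, `stalled_tasks`, `memory_lines`, `daily_notes_exist`, `hours_since_backup`, `api_status`) and the function name `recommend` are hypothetical, and the message wording is abbreviated from the rules.

```python
def recommend(state):
    """Apply the if/then rules to a snapshot dict and return recommendation strings."""
    recs = []
    unresolved = [
        issue["pa"]
        for issue in state.get("billing_issues", [])
        if issue.get("status") != "resolved"
    ]
    if unresolved:
        recs.append(f"Fix billing for {', '.join(unresolved)} (they can't function)")
    for task in state.get("stalled_tasks", []):
        recs.append(f"Follow up on stalled task: {task}")
    if state.get("memory_lines", 0) > 200:
        recs.append("Prune MEMORY.md")
    if not state.get("daily_notes_exist", True):
        recs.append("Create today's memory file")
    if state.get("hours_since_backup", 0) > 6:
        recs.append("Run git backup")
    if state.get("api_status") == 402:
        recs.append("API key is out of credits: alert the admin immediately")
    return recs

snapshot = {
    "billing_issues": [
        {"pa": "pa-a", "status": "open"},
        {"pa": "pa-b", "status": "resolved"},
    ],
    "stalled_tasks": ["Renew domain"],
    "memory_lines": 250,
    "daily_notes_exist": True,
    "hours_since_backup": 2,
    "api_status": 200,
}
for rec in recommend(snapshot):
    print(rec)
```

The sketch emits recommendations in rule order; ranking them for the report's "most important first" list is left to the caller.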

Scheduling

Run eval:

On demand — when owner asks
Weekly — every Sunday at 09:00
After major incidents — billing crisis, WA disconnect, etc.
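
For the weekly trigger, the next run time can be computed rather than hard-coded. A minimal sketch, assuming naive local time (the skill does not state a timezone); `next_weekly_eval` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def next_weekly_eval(now):
    """Return the next Sunday 09:00 strictly after `now` (naive local time)."""
    days_ahead = (6 - now.weekday()) % 7  # Monday is 0, Sunday is 6
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=9, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # It is already Sunday at or past 09:00, so schedule next week's run.
        candidate += timedelta(days=7)
    return candidate

print(next_weekly_eval(datetime(2025, 1, 8, 12, 0)))  # prints "2025-01-12 09:00:00"
```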

Cost Tips

Cheap: Reading files, scoring, formatting — any small model
Expensive: Summarizing large memory files — skip if not asked
Avoid: Running all API health checks every hour — cache for 30 min
Batch: Run all health checks in one pass, not one at a time
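
The 30-minute caching tip could look like the sketch below. It is an in-memory illustration only; an agent that restarts between runs would need to persist the cache to a file. The names `cached_check` and `fake_billing_check` are hypothetical.

```python
import time

CACHE_TTL_SECONDS = 30 * 60  # the 30-minute window from the cost tips
_cache = {}  # check name -> (timestamp, result); in-memory only

def cached_check(name, check, now=None):
    """Return the cached result for `name` if still fresh, else run `check()`."""
    now = time.time() if now is None else now
    hit = _cache.get(name)
    if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    result = check()
    _cache[name] = (now, result)
    return result

calls = []

def fake_billing_check():
    """Stand-in for a real API probe; records each invocation."""
    calls.append(1)
    return "ok"

cached_check("billing", fake_billing_check, now=0)
cached_check("billing", fake_billing_check, now=600)   # 10 min later: served from cache
cached_check("billing", fake_billing_check, now=1800)  # 30 min later: probe runs again
print(len(calls))  # prints "2"
```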

Minimum Model

Any model that can:

Read files
Apply if/then scoring rules
Format a structured report

No advanced reasoning needed.

PA Performance Scoring (Merged from pa-eval skill)

Use this section when evaluating individual PA agents (weekly self-eval or on-demand when owner gives feedback).

Scoring Dimensions (1–5 each, max 40 points)

Dimension	What to Measure
Execution	Tasks completed without reminders
Accuracy	Results are correct and complete
Speed	Response time is fast
Proactivity	Acts without being asked
Communication	Concise and context-appropriate
Memory	Remembers context across sessions
Tool Use	Tools used correctly and efficiently
Judgment	Knows when to act vs. when to ask

Grade: A (36–40), B (28–35), C (20–27), D (<20)
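
The grade bands translate directly into a lookup. A minimal sketch, assuming the total is the sum of eight integer scores (so it ranges from 8 to 40); `grade` is a hypothetical function name.

```python
def grade(total):
    """Map a total score (8 dimensions x 1-5) to a letter grade."""
    if not 8 <= total <= 40:
        raise ValueError(f"total must be between 8 and 40, got {total}")
    if total >= 36:
        return "A"
    if total >= 28:
        return "B"
    if total >= 20:
        return "C"
    return "D"

print(grade(31))  # prints "B"
```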

Owner Feedback Signals

Log these automatically when detected:

Signal	Action
👍 reaction / "thanks" / "great"	Log +1 positive
👎 reaction / "wrong" / "not good"	Log -1, record the correction
Owner re-asks the same question	Log -1 memory gap
Owner does the task themselves	Log -1 initiative gap
Owner surprised by proactive action	Log +2 proactivity

Rule: Log feedback signals immediately — don't batch them.
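
One way to honor the "log immediately" rule is to append an entry at the moment a signal is detected. The sketch below is illustrative: the signal keys (`positive`, `negative`, `memory_gap`, `initiative_gap`, `proactive_surprise`) and the `log_signal` helper are hypothetical names, and the log is a plain list rather than the skill's actual storage.

```python
from datetime import datetime, timezone

# Deltas taken from the signal table above; the key names are hypothetical.
SIGNAL_DELTAS = {
    "positive": 1,
    "negative": -1,
    "memory_gap": -1,
    "initiative_gap": -1,
    "proactive_surprise": 2,
}

def log_signal(log, kind, note="", now=None):
    """Append one feedback entry to `log` right away and return it."""
    if kind not in SIGNAL_DELTAS:
        raise ValueError(f"unknown signal: {kind}")
    stamp = now if now is not None else datetime.now(timezone.utc)
    entry = {
        "time": stamp.isoformat(),
        "kind": kind,
        "delta": SIGNAL_DELTAS[kind],
        "note": note,
    }
    log.append(entry)
    return entry

feedback_log = []
log_signal(feedback_log, "positive")
log_signal(feedback_log, "negative", note="wrong meeting time; corrected to 14:00")
log_signal(feedback_log, "proactive_surprise")
print(sum(entry["delta"] for entry in feedback_log))  # prints "2"
```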

Weekly Eval File

Save to .learnings/eval/YYYY-MM-DD.md with: scores table, owner feedback, tasks completed/failed, what went well, what to improve, actions for next week.
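
A sketch of writing that file follows. The path pattern comes from the line above; the heading layout inside the file and the `write_weekly_eval` helper are assumptions, and the example writes into a temporary directory rather than a real workspace.

```python
import tempfile
from datetime import date
from pathlib import Path

SECTIONS = (
    "Scores",
    "Owner feedback",
    "Tasks completed/failed",
    "What went well",
    "What to improve",
    "Actions for next week",
)

def write_weekly_eval(workspace, day, bodies=None):
    """Write .learnings/eval/YYYY-MM-DD.md under `workspace` and return its path."""
    bodies = bodies or {}
    path = Path(workspace) / ".learnings" / "eval" / f"{day.isoformat()}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    parts = [f"# Weekly Eval {day.isoformat()}"]
    for section in SECTIONS:
        # Sections without supplied text are written as empty placeholders.
        parts.append(f"\n## {section}\n{bodies.get(section, '')}")
    path.write_text("\n".join(parts) + "\n", encoding="utf-8")
    return path

with tempfile.TemporaryDirectory() as tmp:
    written = write_weekly_eval(tmp, date(2025, 1, 12), {"What went well": "No missed tasks."})
    print(written.name)  # prints "2025-01-12.md"
```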

Benchmark Tests (Run Monthly)

Task Completion Rate: completed / assigned × 100% — Target: >90%
Accuracy Rate: (tasks - corrections) / tasks × 100% — Target: >95%
Memory Retention: Ask about something discussed 7+ days ago — Target: >80% recall
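
The first two benchmarks are simple ratios. A minimal sketch with hypothetical function names and made-up counts for illustration; Memory Retention is omitted because it depends on manual recall checks.

```python
def completion_rate(completed, assigned):
    """Task Completion Rate in percent: completed / assigned x 100."""
    if assigned <= 0:
        raise ValueError("assigned must be positive")
    return 100.0 * completed / assigned

def accuracy_rate(tasks, corrections):
    """Accuracy Rate in percent: (tasks - corrections) / tasks x 100."""
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    return 100.0 * (tasks - corrections) / tasks

# Illustrative numbers only, not measured results.
completion = completion_rate(completed=46, assigned=50)
accuracy = accuracy_rate(tasks=40, corrections=1)
print(completion, completion > 90)  # prints "92.0 True"
print(accuracy, accuracy > 95)      # prints "97.5 True"
```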

Usage Guidance

This skill reads many local files and tokens (billing-status.json, workspace files, .context, $HOME/.credentials/monday-api-token.txt, and environment variables such as the Anthropic API key), but the package metadata does not declare those requirements, which is a red flag. Before installing or enabling this skill:

1) Ask the publisher to explicitly list required env vars and config paths and justify why each is needed.
2) Inspect the .context file and any referenced credential files to see what secrets would be read; remove or rotate secrets you don't want the skill to access.
3) Run the skill in a sandboxed/test account or container with read-only copies of workspace files and fake tokens first.
4) If you must run it in production, restrict the skill's access to the minimal files and tokens required and consider rotating any tokens used for testing.

If the publisher provides a clear, documented list of required inputs and a justification, this assessment could be re-evaluated; as-is, the omission of declared credentials/config paths makes the skill suspicious.

Capability Assessment

⚠ Purpose & Capability

The skill's stated purpose is to evaluate the agent's tasks, integrations, billing, calendar, and memory; that purpose legitimately requires checking local state and integration tokens. However, the skill metadata declares no required environment variables, credentials, or config paths while the instructions reference many local files and tokens (e.g., /opt/ocana/... files, $HOME/.credentials/monday-api-token.txt, ANTHROPIC_API_KEY). The omission of these required inputs in metadata is disproportionate and misleading.

⚠ Instruction Scope

SKILL.md explicitly instructs the agent to source a local .context file and to read numerous files and run shell/python/curl/git commands against paths like /opt/ocana/openclaw/workspace/* and $HOME/.openclaw/workspace/*, and to read token files and env vars to test APIs. These actions go beyond a simple checklist and access potentially sensitive credentials and owner data (owner phone, tokens, billing JSON). The instructions grant broad discretion to read system state and secrets without documenting limits.

✓ Install Mechanism

There is no install spec and no code files — the skill is instruction-only. That reduces the risk of arbitrary code being fetched or executed from untrusted sources.

⚠ Credentials

Although the skill metadata lists no required env vars or config paths, the runtime steps depend on env vars and files (e.g., ANTHROPIC_API_KEY, $HOME/.credentials/monday-api-token.txt, various workspace files and .context values like GOG_CREDS or MONDAY_TOKEN_FILE). Requesting access to multiple local credential sources without declaring them is disproportionate and increases the potential for secret exposure.

✓ Persistence & Privilege

The skill is not always-on and is user-invocable (defaults). It does not request permanent inclusion or declare modifications to other skills or system-wide settings. Autonomous invocation is allowed by platform default, which increases blast radius in general, but this skill does not request extra persistence privileges.

Version History

v1.1.1

Minor update with behavioral checks and context loading improvements.

- Added explicit "Load Local Context" step for sourcing environment variables.
- Introduced "Pre-Eval Behavioral Checks" section, including automatic reactions when eval is triggered and completed.
- Clarified use of PA directory file and requirement to use direct calendar API.
- Removed non-English trigger phrases for clarity and consistency.
- No structural changes to main eval logic or scoring.

v1.1.0

Skill consolidation 2026-04-02: merged redundant skills, improved descriptions, added production lessons

v1.0.0

First release of the eval skill for PA agent performance and health auditing.

- Provides structured self-assessment with quality scoring across key categories.
- Audits tasks, skills, integrations, and PA network health in a single report.
- Runs on demand, weekly, or after major incidents.
- Includes actionable recommendations logic based on detected issues.
- Report results are formatted for easy review and follow-up.

Metadata

Slug eval

Version 1.1.1

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 3

Frequently Asked Questions

What is Eval?

Eval is an AI Agent Skill for Claude Code / OpenClaw that evaluates everything the PA agent manages: tasks, skills, PA network health, billing, calendar connections, and memory quality. It has 141 downloads so far.

How do I install Eval?

Run "/install eval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Eval free?

Yes, Eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Eval support?

Eval is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Eval?

It is built and maintained by Netanel Abergel (@netanel-abergel); the current version is v1.1.1.
