yh22e

OpenClaw Smartness Eval

by 圆规 · GitHub ↗ · v0.3.2 · MIT-0
cross-platform ⚠ suspicious
319 Downloads · 1 Star · 0 Active Installs · 5 Versions
Install in OpenClaw
/install openclaw-smartness-eval
Description
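A comprehensive smartness evaluation skill for OpenClaw. It produces an overall score, evidence, risks, and trends across 14 dimensions (including planning ability and hallucination control). Aligned with the CLEAR/T-Eval/Anthropic industry standards.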
Usage Guidance
This skill is broadly coherent with its stated purpose (a workspace-centered evaluation tool), but before installing:
  1. Review eval.py (especially validate_command(), subprocess.run usage, and any code paths that could enable network access or writes outside its stated output dir).
  2. Confirm you trust the other workspace scripts it will invoke (many test commands call scripts that are not bundled). Those external scripts could read secrets or make network calls.
  3. Treat the .reasoning/reasoning-store.sqlite and state logs as sensitive; if you don't want those inspected, do not install or run the skill.
  4. If you enable --llm-judge, confirm exactly which summary fields are sent and test in an isolated environment first.
  5. As a safe practice, run python3 scripts/check.py and then run eval.py in a sandboxed workspace (or with --no-probes / dry-run options) to observe behavior before granting it unfettered access or allowing autonomous invocations.
Capability Analysis
Type: OpenClaw Skill
Name: openclaw-smartness-eval
Version: 0.3.2
The openclaw-smartness-eval bundle is a comprehensive framework designed to measure AI agent performance across 14 dimensions. While the core logic in `scripts/eval.py` utilizes `subprocess` to execute test commands, it implements a robust security gate (`validate_command`) that enforces a whitelist of allowed path prefixes, restricts the interpreter to `python3`, and explicitly blocks inline code execution (`-c`, `exec`), absolute paths, and path traversal. Data collection is limited to local workspace state files and logs, and the optional network access for LLM-based scoring is documented and requires user-provided API keys. The bundle demonstrates high transparency and security-conscious design, including anti-gaming probes and integrity checks.
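For readers auditing the gate, the checks described above can be sketched roughly as follows. This is a minimal illustration and not the code in `scripts/eval.py`: the function name `validate_command_sketch`, the `ALLOWED_PREFIXES` tuple, and the exact blocked-token set are assumptions made for this example.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical values for illustration; the real whitelist lives in scripts/eval.py.
ALLOWED_PREFIXES = ("scripts/", "tests/")
BLOCKED_TOKENS = {"-c", "exec"}


def validate_command_sketch(command: str) -> bool:
    """Return True only if the command passes every check listed above."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes or similar parse errors
    if len(tokens) < 2 or tokens[0] != "python3":
        return False  # the interpreter must be python3, followed by a script
    if any(tok in BLOCKED_TOKENS for tok in tokens[1:]):
        return False  # inline code execution is rejected
    script = tokens[1]
    path = PurePosixPath(script)
    if path.is_absolute() or ".." in path.parts:
        return False  # no absolute paths and no path traversal
    return script.startswith(ALLOWED_PREFIXES)
```

Comparing a few accepted and rejected commands against the real `validate_command()` is a quick way to check that the shipped gate behaves at least as strictly as this sketch.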
Capability Assessment
Purpose & Capability
The skill claims to produce a 14-dimension evaluation and to read runtime state/logs; the commands and listed state files align with that purpose. However, many task commands reference other workspace scripts (message-analyzer-v5.py, security-config-audit.py, etc.) that are not bundled with the skill and thus require a full OpenClaw environment. This dependency on host scripts is plausible, but installers should be aware of it.
Instruction Scope
SKILL.md and docs state the tool is read-only (reads many state/*.json and .reasoning/reasoning-store.sqlite) and only writes to state/smartness-eval/. The runtime also spawns subprocesses to run tests. While the manifest claims a validate_command() gate, executing other workspace scripts (via allowed prefixes like 'scripts/') can cause those scripts to access the network, read secrets, or modify state. The skill's safety therefore depends on both its validate_command implementation and the trustworthiness of the other workspace scripts. Verify validate_command and inspect eval.py before granting execution privileges.
Install Mechanism
No external install spec or remote downloads; the package is instruction/code-only and uses only bundled Python scripts. This is low-risk from a supply-chain/download perspective.
Credentials
The skill declares no required env vars; optional LLM judge requires DEEPSEEK_API_KEY or OPENAI_API_KEY only when explicitly enabled. However it reads potentially sensitive local artifacts (.reasoning/reasoning-store.sqlite, message-analyzer logs, etc.). Those reads are coherent for an evaluator but are sensitive — ensure you are comfortable exposing the reasoning DB and logs to the skill runtime.
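The opt-in behavior described above (API keys are consulted only when the LLM judge is explicitly enabled) can be sketched as below. This is an assumption-laden illustration, not the skill's code: the helper name `resolve_judge_key` and the preference order between the two variables are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_judge_key(llm_judge: bool, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the NAME of the env var to use, or None when the judge is disabled.

    Returning the variable name rather than its value keeps the secret
    out of logs and reports.
    """
    if not llm_judge:
        return None  # judge disabled: the environment is never inspected
    # Assumed preference order; the source only says either key is accepted.
    for name in ("DEEPSEEK_API_KEY", "OPENAI_API_KEY"):
        if env.get(name):
            return name
    raise RuntimeError("LLM judge enabled but no API key is set")
```

With this shape, a run without the judge flag never reads the environment, which is the property an installer would want to confirm in the real script.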
Persistence & Privilege
always:false and docs state it writes only to its own state/smartness-eval/ directory. Autonomous invocation is enabled by default (platform normal). Combined with the skill's read access to internal logs and ability to run workspace scripts, autonomous invocation increases blast radius — consider whether you want the agent to be able to run this skill without manual approval each run.
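One way an installer might check the write-scope claim is to test whether each path written during a sandboxed run resolves inside state/smartness-eval/. The helper below is a hypothetical audit aid, not part of the skill; the name `is_within_output_dir` and the workspace layout are assumptions.

```python
from pathlib import Path


def is_within_output_dir(workspace: Path, target: Path) -> bool:
    """True if `target` resolves inside <workspace>/state/smartness-eval/."""
    output_root = (workspace / "state" / "smartness-eval").resolve()
    # Joining with an absolute target discards the workspace prefix,
    # so absolute paths outside the workspace are rejected below.
    candidate = (workspace / target).resolve()
    return candidate == output_root or output_root in candidate.parents
```

Resolving both paths before comparing them means `..` segments cannot be used to escape the output directory in this check.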
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install openclaw-smartness-eval
  3. After installation, invoke the skill by name or use /openclaw-smartness-eval
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.3.2
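v0.3.2: Added a complete Security Declaration that explicitly states the read-only file list, write scope, command whitelist mechanism, network access policy, and no-side-effects guarantee, to reduce suspicious flagging on ClawHub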
v0.3.1
fix: Fixed rendering errors caused by CDATA tags in all Markdown files
v0.3.0
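v0.3.0: Added planning ability and hallucination control dimensions (12→14 dimensions), fixed normalization in all scoring formulas, expanded anti-gaming probes (7→15), added 6 tests (28→34), aligned with the CLEAR/T-Eval/Anthropic industry standards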
v0.2.1
Version 0.2.1
- Added scripts/state_probe.py for new state probing and reliability checks.
- Made minor updates to existing scripts and configuration files for improved robustness.
- Documented that LLM Judge option now only triggers external API calls when explicitly enabled with --llm-judge.
- No breaking changes; all previous usage and report formats remain supported.
v0.2.0
v0.2.0: Independent scoring formulas for 12 dimensions, 28 tests, multi-source data fusion, LLM Judge, pass@k, anti-gaming probes
Metadata
Slug openclaw-smartness-eval
Version 0.3.2
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 5
Frequently Asked Questions

What is OpenClaw Smartness Eval?

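A comprehensive smartness evaluation skill for OpenClaw. It produces an overall score, evidence, risks, and trends across 14 dimensions (including planning ability and hallucination control) and is aligned with the CLEAR/T-Eval/Anthropic industry standards. It is an AI Agent Skill for Claude Code / OpenClaw, with 319 downloads so far.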

How do I install OpenClaw Smartness Eval?

Run "/install openclaw-smartness-eval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is OpenClaw Smartness Eval free?

Yes, OpenClaw Smartness Eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does OpenClaw Smartness Eval support?

OpenClaw Smartness Eval is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created OpenClaw Smartness Eval?

It is built and maintained by 圆规 (@yh22e); the current version is v0.3.2.
