modelshow

Name: modelshow
Author: schbz

Description

Blind multi-model comparison with architecturally guaranteed de-anonymization. Trigger with "mdls" or "modelshow" for double-blind evaluation of AI model res...

Safe usage recommendations

The skill appears to implement the described multi-model blind-judging workflow, but take these precautions before installing or running it:

1) Understand the anonymity tradeoff: although the wording claims the orchestrator never sees placeholders, the anonymize step returns an anonymization_map (placeholder→model) to the orchestrator. Keep that map secret, or modify the workflow if you require stronger guarantees.
2) Review and edit config.json: remove any model aliases you don't want queried and set outputDir to a safe location.
3) Be aware that the skill reads your OpenClaw config (~/.openclaw/openclaw.json) to resolve model aliases and writes reports to your home directory. If that file contains sensitive information, inspect save_results.py and the blind_judge scripts first.
4) External-content behavior: SKILL.md says the agent will fetch referenced URLs or files and prepend them to prompts. To avoid exposing local or remote data, disable this behavior or run the skill only on non-sensitive prompts.
5) If you require the stronger property that nobody except the judge can de-anonymize, either remove the return of anonymization_map from the anonymize phase or ensure that anonymization and finalize run inside a trusted atomic environment that never exposes the map to the orchestrator.
6) If you are unsure, run the skill on harmless test prompts first and manually inspect the outputs and saved files.
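As a minimal sketch of the config.json review in item 2, the check below confirms that outputDir resolves inside a directory you consider safe. Only the `outputDir` key comes from the listing above; the function name `output_dir_is_inside` and the idea of an allowed root are illustrative assumptions, not part of the skill.

```python
import json
from pathlib import Path


def output_dir_is_inside(config_path: Path, allowed_root: Path) -> bool:
    """Return True if config.json's outputDir resolves inside allowed_root.

    Hypothetical helper: only the "outputDir" key is taken from the skill's
    description; the rest of the config schema is not assumed here.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Expand "~" and collapse ".." segments and symlinks before comparing.
    output_dir = Path(config["outputDir"]).expanduser().resolve()
    root = allowed_root.expanduser().resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    return output_dir.is_relative_to(root)
```

Resolving both paths first means a value such as `~/reports/../.ssh` is judged by where it actually points rather than by its textual prefix.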

Functional analysis

Type: OpenClaw Skill
Name: modelshow
Version: 1.0.1

ModelShow is a legitimate tool for double-blind AI model evaluation and benchmarking. The skill bundle contains Python scripts (`judge_pipeline.py`, `save_results.py`, and `blind_judge_manager.py`) that handle the anonymization of model responses and the local persistence of results in Markdown and JSON formats. The code uses standard libraries, implements cryptographically secure shuffling for unbiased judging, and restricts file operations to the user's local OpenClaw workspace. No evidence of data exfiltration, malicious command execution, or prompt-injection attacks was found.
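To illustrate what cryptographically secure shuffling for blind judging can look like, here is a minimal sketch built on `secrets.SystemRandom()`, which the version history names. The function name `anonymize`, the "Response A" label scheme, and the return shape are assumptions made for illustration; the skill's actual scripts may differ.

```python
import secrets
import string


def anonymize(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Shuffle responses with an OS-backed CSPRNG and assign neutral labels.

    Returns (blinded, anonymization_map): blinded maps a placeholder such as
    "Response A" to the response text, and anonymization_map maps the same
    placeholder back to the model name. Names and labels are hypothetical.
    """
    if len(responses) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 responses")

    rng = secrets.SystemRandom()  # draws from the OS entropy source, not a seeded PRNG
    models = list(responses)
    rng.shuffle(models)  # in-place shuffle, so label order is unrelated to input order

    blinded: dict[str, str] = {}
    anonymization_map: dict[str, str] = {}
    for letter, model in zip(string.ascii_uppercase, models):
        placeholder = f"Response {letter}"
        blinded[placeholder] = responses[model]
        anonymization_map[placeholder] = model
    return blinded, anonymization_map
```

Only `blinded` would be shown to a judge; whoever holds `anonymization_map` can reverse the labels.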

Capability assessment

ℹ Purpose & Capability

The name/description (double-blind multi-model evaluation) matches the code and instructions: scripts anonymize responses, call a judge, de-anonymize outputs, and save results. The details (config.json listing model aliases, judge model, timeouts, outputDir) are consistent with that purpose. The skill reads OpenClaw agent config (~/.openclaw/openclaw.json) to resolve model aliases and writes results to a user-writable output directory — this is expected for producing human-friendly output, but it does mean the skill touches user config and home-directory storage which is beyond pure in-memory evaluation.

⚠ Instruction Scope

SKILL.md instructs the orchestrator to fetch external content referenced by prompts (URLs, files, preferences) and prepend it to model tasks, to read and write config.json under the skill baseDir, and to spawn judge sub-agents that are instructed to run local commands (piping JSON into judge_pipeline.py). Those instructions permit reading user files and fetching external URLs, operations that go beyond simply sending prompts to models and could surface private data. In addition, the documentation and scripts claim an 'architectural guarantee' that the orchestrator never sees placeholder labels, yet the anonymize operation (judge_pipeline.py/blind_judge_manager.py) returns an explicit anonymization_map/reverse_map to the caller, which would allow the orchestrator to de-anonymize results. This is a functional contradiction between the stated guarantee and the actual code and instructions.
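One way to narrow that gap is to hand the caller only the blinded responses plus a finalize function, instead of the map itself. The sketch below is an illustration under stated assumptions: `make_blind_session`, the "Response 1" labels, and the score format are hypothetical and are not taken from the skill's code. Python closures do not enforce secrecy, so this shows only the interface shape; real isolation would need a separate process or another trusted environment.

```python
import secrets
from typing import Callable

Scores = dict[str, float]


def make_blind_session(
    responses: dict[str, str],
) -> tuple[dict[str, str], Callable[[Scores], Scores]]:
    """Return blinded responses and a finalize() that de-anonymizes scores.

    The placeholder-to-model mapping lives only in this closure and is never
    returned. Hypothetical design, not the skill's actual implementation.
    """
    rng = secrets.SystemRandom()
    models = list(responses)
    rng.shuffle(models)
    mapping = {f"Response {i + 1}": model for i, model in enumerate(models)}
    blinded = {label: responses[model] for label, model in mapping.items()}

    def finalize(scores_by_label: Scores) -> Scores:
        # Reject labels the session never issued rather than silently dropping them.
        unknown = set(scores_by_label) - set(mapping)
        if unknown:
            raise KeyError(f"unknown placeholders: {sorted(unknown)}")
        return {mapping[label]: score for label, score in scores_by_label.items()}

    return blinded, finalize
```

With this shape the orchestrator passes `blinded` to the judge and later calls `finalize` on the judge's scores, so model names appear only in the final result.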

✓ Install Mechanism

No remote install steps or downloads are present in the skill bundle (instruction-only install spec, but with local Python scripts included). There are no brew/npm downloads or network-install commands. The code shipped in the skill will be installed locally when the skill is added; no remote code is fetched at install time.

ℹ Credentials

The skill declares no required environment variables or external credentials, which is appropriate. However, runtime behavior reads ~/.openclaw/openclaw.json (to resolve model aliases) and writes results to a home directory outputDir by default. Those file accesses are proportionate to the described features (alias resolution, saving reports), but they are also access to user configuration and filesystem that users should be aware of. Importantly, the pipeline returns anonymization_map data to the orchestrator during the anonymize phase, undermining the claimed 'orchestrator never sees placeholders' guarantee — that exposes mappings that can deanonymize results if retained.

✓ Persistence & Privilege

The skill does not request always:true or other elevated installation privileges. It writes results to an outputDir under the user's home by design, and offers an optional utility to copy JSON/MD to a web directory. Those behaviors involve filesystem persistence but are within the scope of expected functionality (saving reports). There is no code that modifies other skills or system-wide agent settings.

Version history

v1.0.1

**ModelShow 1.0.1: Major upgrade with save support and robust, cryptographically randomized judging.**

- Introduces mandatory, verifiable result saving via the new save_results.py script after every run.
- Guarantees cryptographically random blind order by shuffling responses using secrets.SystemRandom().
- Architecturally enforced de-anonymization: model names are revealed only after scoring; orchestrators never see placeholders.
- Professional, judge-centric output now includes a holistic "Overall Assessment" of cross-model patterns.
- Adds update_modelshow_index.py and README.md for improved documentation and index management.
- Enhanced polling, robust agent/timeout tracking, and improved status feedback for a more reliable parallel comparison workflow.

v1.0.0

- Initial release of the ModelShow skill for double-blind multi-model comparison with automatic de-anonymization.
- Supports parallel querying of multiple AI models and automatic polling for completion.
- Independent judge evaluates anonymized responses and outputs de-anonymized rankings with reasoning.
- Step-by-step workflow: initiate, spawn model agents, poll for results, anonymize, judge, de-anonymize, and present formatted output.
- Trigger with "mdls" or "modelshow" at the start of a message.

v0.1.0

- Initial release of ModelShow: an unbiased, blind multi-model comparison tool for AI-generated content.
- Trigger with "mdls" or "modelshow" followed by a prompt to compare responses from multiple AI models.
- Automatic response anonymization, polling, and progress updates during model and judge evaluation.
- Results are de-anonymized and displayed with model names, scores, and judge justifications.
- Handles model timeouts and minimum response thresholds, notifying users if too few models answer.
- Open-source and configurable; see more at https://github.com/schbz/modelshow.
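The timeout and minimum-response handling mentioned above could look roughly like the sketch below. The function name `select_for_judging`, the use of `None` to mark a timed-out model, and the default threshold of 2 are assumptions for illustration only; the skill's real configuration keys and defaults are not shown in this listing.

```python
from typing import Optional


def select_for_judging(
    results: dict[str, Optional[str]],
    min_responses: int = 2,  # assumed default, not taken from the skill
) -> tuple[Optional[dict[str, str]], Optional[str]]:
    """Drop timed-out models and decide whether judging can proceed.

    results maps model name to response text, or to None if the model timed
    out. Returns (answered, None) when enough models replied, otherwise
    (None, notice) with a message suitable for showing to the user.
    """
    answered = {model: text for model, text in results.items() if text is not None}
    timed_out = sorted(model for model, text in results.items() if text is None)

    if len(answered) < min_responses:
        notice = (
            f"Only {len(answered)} of {len(results)} models answered "
            f"(need at least {min_responses}); timed out: {', '.join(timed_out)}"
        )
        return None, notice
    return answered, None
```

Returning a notice instead of raising keeps the decision about how to inform the user with the caller.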

Metadata

Slug modelshow

Version 1.0.1

License MIT-0

Total installs 2

Current installs 2

Number of versions 3

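FAQ

What is modelshow?

Blind multi-model comparison with architecturally guaranteed de-anonymization. Trigger with "mdls" or "modelshow" for double-blind evaluation of AI model res... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 483 downloads to date.

How do I install modelshow?

Run the command "/install modelshow" in an OpenClaw or Claude Code chat to install it in one step; no additional configuration is required.

Is modelshow free?

Yes. modelshow is completely free and is released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does modelshow support?

modelshow is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops modelshow?

modelshow is developed and maintained by Sky Sloane (@schbz). The current version is v1.0.1.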
