modelshow

Name: modelshow
Author: schbz

Description

Blind multi-model comparison with architecturally guaranteed de-anonymization. Trigger with "mdls" or "modelshow" for double-blind evaluation of AI model res...

Usage Guidance

The skill appears to implement the described multi-model blind-judging workflow, but take these precautions before installing or running it:

1) Understand the anonymity tradeoff: despite wording claiming the orchestrator never sees placeholders, the anonymize step returns an anonymization_map (placeholder→model) to the orchestrator. Keep that map secret or modify the workflow if you require stronger guarantees.
2) Review and edit config.json: remove any model aliases you don't want queried and set outputDir to a safe location.
3) Be aware that the skill reads your OpenClaw config (~/.openclaw/openclaw.json) to resolve model aliases and writes reports to your home directory. If that file contains sensitive info, inspect save_results.py and the blind_judge scripts first.
4) External-content behavior: SKILL.md says the agent will fetch referenced URLs or files and prepend them to prompts. To avoid exposing local or remote data, disable this behavior or run the skill only on non-sensitive prompts.
5) If you require the stronger property that nobody except the judge can deanonymize, either remove the return of anonymization_map from the anonymize phase or ensure that anonymize and finalize run inside a trusted atomic environment that never exposes the map to the orchestrator.
6) If you are unsure, run the skill on harmless test prompts first and manually inspect the outputs and saved files.
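The config review in step 2 can be sketched as a small pre-flight check. This is a minimal sketch, not the skill's own code: the `models` key and the `review_config` helper are hypothetical names chosen for illustration, and only `outputDir` is a field the review actually mentions. Adjust the keys to the real config.json schema before relying on it.

```python
from pathlib import Path


def review_config(config: dict, allowed_models: set[str], safe_root: Path) -> dict:
    """Return a copy of config limited to allowed models, with outputDir checked.

    "models" is a hypothetical key name; "outputDir" is the field named in the review.
    """
    cleaned = dict(config)
    # Keep only the model aliases the user has explicitly approved.
    cleaned["models"] = [m for m in config.get("models", []) if m in allowed_models]

    # Resolve both paths so "~" and ".." segments cannot escape the safe root.
    out_dir = Path(config["outputDir"]).expanduser().resolve()
    root = safe_root.expanduser().resolve()
    if not out_dir.is_relative_to(root):
        raise ValueError(f"outputDir {out_dir} is outside {root}")

    cleaned["outputDir"] = str(out_dir)
    return cleaned
```

Running a check like this before the first invocation makes it harder for an unexpected alias or an output path outside the intended directory to slip through unnoticed.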

Capability Analysis

Type: OpenClaw Skill
Name: modelshow
Version: 1.0.1

ModelShow is a legitimate tool for double-blind AI model evaluation and benchmarking. The skill bundle contains Python scripts (`judge_pipeline.py`, `save_results.py`, and `blind_judge_manager.py`) that handle the anonymization of model responses and the local persistence of results in Markdown and JSON formats. The code uses standard libraries, implements cryptographically secure shuffling for unbiased judging, and restricts file operations to the user's local OpenClaw workspace. No evidence of data exfiltration, malicious command execution, or prompt-injection attacks was found.
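For readers unfamiliar with the shuffling technique, the following is a minimal sketch of a blind ordering built on `secrets.SystemRandom()`, which the changelog names as the source of randomness. The function name `blind_order` and the "Response A" label format are assumptions made for illustration; the skill's actual scripts may label and structure things differently.

```python
import secrets


def blind_order(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Shuffle model responses and assign placeholder labels.

    Returns (anonymized, mapping), where anonymized maps placeholder -> text
    and mapping maps placeholder -> model name. Supports up to 26 responses.
    """
    rng = secrets.SystemRandom()  # draws from the OS entropy source
    models = list(responses)
    rng.shuffle(models)

    anonymized: dict[str, str] = {}
    mapping: dict[str, str] = {}
    for i, model in enumerate(models):
        label = f"Response {chr(ord('A') + i)}"  # hypothetical label format
        anonymized[label] = responses[model]
        mapping[label] = model
    return anonymized, mapping
```

Note that this sketch returns the mapping to its caller, which is the same pattern the review flags: whoever receives `mapping` can de-anonymize the responses.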

Capability Assessment

ℹ Purpose & Capability

The name/description (double-blind multi-model evaluation) matches the code and instructions: scripts anonymize responses, call a judge, de-anonymize outputs, and save results. The details (config.json listing model aliases, judge model, timeouts, outputDir) are consistent with that purpose. The skill reads OpenClaw agent config (~/.openclaw/openclaw.json) to resolve model aliases and writes results to a user-writable output directory — this is expected for producing human-friendly output, but it does mean the skill touches user config and home-directory storage which is beyond pure in-memory evaluation.

⚠ Instruction Scope

SKILL.md instructs the orchestrator to fetch external content referenced by prompts (URLs, files, preferences) and prepend it to model tasks, to read and write config.json under the skill baseDir, and to spawn judge sub-agents; it also instructs the judge to run local commands (piping JSON into judge_pipeline.py). Those instructions permit reading user files and fetching external URLs, operations that go beyond simply sending prompts to models and could surface private data. In addition, the documentation and scripts claim an 'architectural guarantee' that the orchestrator never sees placeholder labels, but the anonymize operation (judge_pipeline.py/blind_judge_manager.py) returns an explicit anonymization_map/reverse_map to the caller, which would allow the orchestrator to deanonymize. That is a functional contradiction between the stated guarantee and the actual code/instructions.
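One way to narrow the interface so that the anonymize step does not hand the map back as data is to return a finalize callback instead. The sketch below is an illustration under stated assumptions, not the skill's implementation: `anonymize_sealed` and `finalize` are hypothetical names, and the numeric "Response 1" labels are invented here.

```python
import secrets
from typing import Callable


def anonymize_sealed(
    responses: dict[str, str],
) -> tuple[dict[str, str], Callable[[dict[str, float]], dict[str, float]]]:
    """Return anonymized responses plus a finalize() callback.

    The placeholder -> model map stays inside the closure and is never
    returned as a value (hypothetical design, for illustration only).
    """
    rng = secrets.SystemRandom()
    models = list(responses)
    rng.shuffle(models)
    mapping = {f"Response {i + 1}": model for i, model in enumerate(models)}
    anonymized = {label: responses[model] for label, model in mapping.items()}

    def finalize(scores: dict[str, float]) -> dict[str, float]:
        # Translate judge scores keyed by placeholder into scores keyed by model.
        return {mapping[label]: score for label, score in scores.items()}

    return anonymized, finalize
```

This only narrows the API and is not a security boundary: within a single Python process the closure can still be introspected, and a caller could probe `finalize` with made-up scores. A real guarantee would need the map held in a separate trusted process, as the usage guidance suggests.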

✓ Install Mechanism

No remote install steps or downloads are present in the skill bundle (instruction-only install spec, but with local Python scripts included). There are no brew/npm downloads or network-install commands. The code shipped in the skill will be installed locally when the skill is added; no remote code is fetched at install time.

ℹ Credentials

The skill declares no required environment variables or external credentials, which is appropriate. However, runtime behavior reads ~/.openclaw/openclaw.json (to resolve model aliases) and writes results to a home directory outputDir by default. Those file accesses are proportionate to the described features (alias resolution, saving reports), but they are also access to user configuration and filesystem that users should be aware of. Importantly, the pipeline returns anonymization_map data to the orchestrator during the anonymize phase, undermining the claimed 'orchestrator never sees placeholders' guarantee — that exposes mappings that can deanonymize results if retained.
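Before letting the skill read a configuration file, a user could scan the file for entries that look sensitive. The following is a rough sketch of such a scan; the helper name `find_sensitive_paths` and the list of hint substrings are assumptions, and the actual layout of openclaw.json is not described in this review.

```python
SENSITIVE_HINTS = ("key", "token", "secret", "password")  # assumed hint substrings


def find_sensitive_paths(node, prefix: str = "") -> list[str]:
    """Walk parsed JSON and return dotted paths whose key names look sensitive."""
    hits: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if any(hint in str(key).lower() for hint in SENSITIVE_HINTS):
                hits.append(path)
            hits.extend(find_sensitive_paths(value, path))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_sensitive_paths(value, f"{prefix}[{index}]"))
    return hits
```

To use it, load the file with `json.loads(Path("~/.openclaw/openclaw.json").expanduser().read_text())` and pass the result in. A name-based scan like this is only a heuristic; it flags suspicious key names and says nothing about what the values contain.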

✓ Persistence & Privilege

The skill does not request always:true or other elevated installation privileges. It writes results to an outputDir under the user's home by design, and offers an optional utility to copy JSON/MD to a web directory. Those behaviors involve filesystem persistence but are within the scope of expected functionality (saving reports). There is no code that modifies other skills or system-wide agent settings.

Version History

v1.0.1

**ModelShow 1.0.1: Major upgrade with save support and robust, cryptographically randomized judging.**

- Introduces mandatory, verifiable result saving via the new save_results.py script after every run.
- Guarantees cryptographically random blind order by shuffling responses using secrets.SystemRandom().
- Architecturally enforced de-anonymization: Model names are revealed only after scoring; orchestrators never see placeholders.
- Professional, judge-centric output now includes holistic "Overall Assessment" of cross-model patterns.
- Adds update_modelshow_index.py and README.md for improved documentation and index management.
- Enhanced polling, robust agent/timeout tracking, and improved status feedback for a more reliable parallel comparison workflow.

v1.0.0

- Initial release of ModelShow skill for double-blind multi-model comparison with automatic de-anonymization.
- Supports parallel querying of multiple AI models and automatic polling for completion.
- Independent judge evaluates anonymized responses and outputs de-anonymized rankings with reasoning.
- Step-by-step workflow: initiate, spawn model agents, poll for results, anonymize, judge, de-anonymize, and present formatted output.
- Trigger with "mdls" or "modelshow" at the start of a message.

v0.1.0

- Initial release of ModelShow: an unbiased, blind multi-model comparison tool for AI-generated content.
- Trigger with "mdls" or "modelshow" followed by a prompt to compare responses from multiple AI models.
- Automatic response anonymization, polling, and progress updates during model and judge evaluation.
- Results are de-anonymized and displayed with model names, scores, and judge justifications.
- Handles model timeouts and minimum response thresholds, notifying users if too few models answer.
- Open-source and configurable, see more at https://github.com/schbz/modelshow.

Metadata

Slug modelshow

Version 1.0.1

License MIT-0

All-time Installs 2

Active Installs 2

Total Versions 3

Frequently Asked Questions

What is modelshow?

Blind multi-model comparison with architecturally guaranteed de-anonymization. Trigger with "mdls" or "modelshow" for double-blind evaluation of AI model res... It is an AI Agent Skill for Claude Code / OpenClaw, with 483 downloads so far.

How do I install modelshow?

Run "/install modelshow" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is modelshow free?

Yes, modelshow is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does modelshow support?

modelshow is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created modelshow?

It is built and maintained by Sky Sloane (@schbz); the current version is v1.0.1.
