
EvalScope

Name: EvalScope
Author: yunnglin

by Yunlin Mao · v1.0.1 · MIT-0

cross-platform ✓ Security Clean

Downloads: 166

Install in OpenClaw

/install skill-evalscope

Description

Translates natural language requests into evalscope CLI commands. Core capabilities: (1) Model accuracy evaluation (eval) — runs 156+ benchmarks (Math, Codin...

Usage Guidance

This skill appears to do what it says: it builds evalscope CLI commands. Before installing or running commands: (1) verify the evalscope package or source you will install, prefer an official PyPI or GitHub release, and install into a virtualenv or container; (2) be cautious when providing API keys or endpoint URLs, supply credentials only to endpoints you trust, and avoid posting keys to remote or untrusted services; (3) when running perf tests, confirm that each endpoint is an intended target, since benchmarks can generate heavy traffic; (4) use mock_llm or sandbox modes if you want to test without contacting external models; (5) review outputs and reports before sharing them, and do not expose sensitive logs. If you want a deeper review, provide the evalscope PyPI project URL or the package code so it can be inspected.
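As one way to act on point (3), a perf-test endpoint can be checked against an explicit allowlist before any traffic is sent. The sketch below is illustrative only: the function name `is_intended_target` and the example hosts are assumptions, not part of the skill or of evalscope.

```python
from urllib.parse import urlparse


def is_intended_target(url, allowed_hosts):
    """Return True only if the URL's host is in the explicit allowlist."""
    # urlparse lowercases the hostname; it is None when no host can be parsed.
    host = urlparse(url).hostname
    return host is not None and host in {h.lower() for h in allowed_hosts}


# Hypothetical allowlist: only locally served models may be stress tested.
allowed = {"localhost", "127.0.0.1"}
print(is_intended_target("http://localhost:8000/v1/chat/completions", allowed))
print(is_intended_target("https://api.example.com/v1/chat/completions", allowed))
```

Running the check before composing a perf command keeps accidental load off endpoints you do not own.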

Capability Analysis

Type: OpenClaw Skill
Name: skill-evalscope
Version: 1.0.1

The skill bundle provides a legitimate integration for the EvalScope LLM evaluation framework, allowing an agent to perform model benchmarking, performance stress testing, and result visualization via the `evalscope` CLI. While the skill involves high-risk capabilities such as executing shell commands, installing Python packages, and handling API keys, these actions are transparently documented and strictly aligned with the tool's stated purpose. No evidence of data exfiltration, malicious obfuscation, or harmful prompt injection was found across the documentation or command references (SKILL.md, eval-reference.md, perf-reference.md).

Capability Assessment

✓ Purpose & Capability

The name and description match the instructions: the SKILL.md converts natural‑language requests into evalscope CLI commands for evaluation, perf, discovery, and visualization. There are no unrelated environment variables, binaries, or config paths declared.
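To make the translation step concrete, here is a minimal sketch of turning an already-parsed request into an argument list. It is not the skill's actual logic. The flags `--model`, `--datasets`, and `--limit` are assumptions taken from EvalScope's public quick start rather than from this listing, so confirm them with `evalscope eval --help`; the model and dataset names are placeholders.

```python
import shlex


def build_eval_command(model, datasets, limit=None):
    """Build an argv list for an evaluation run (flags are assumptions)."""
    argv = ["evalscope", "eval", "--model", model, "--datasets", *datasets]
    if limit is not None:
        # Cap the number of samples per dataset for a quick smoke test.
        argv += ["--limit", str(limit)]
    return argv


# Hypothetical structured form of "evaluate this model on gsm8k, 5 samples".
request = {"model": "Qwen/Qwen2.5-0.5B-Instruct", "datasets": ["gsm8k"], "limit": 5}
print(shlex.join(build_eval_command(**request)))
```

Keeping the command as a list until the last moment avoids shell-quoting mistakes when model paths or dataset names contain spaces.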

ℹ Instruction Scope

Instructions stay within evaluation, performance, and visualization workflows. They direct the agent to run evalscope CLI commands, read/write output directories (./outputs), and optionally launch a local Gradio UI. The doc also contains examples showing use of API endpoints and API keys—so the agent may be instructed to send requests to network endpoints and to accept user-provided secrets for those endpoints.

ℹ Install Mechanism

The skill is instruction‑only (no install spec), but SKILL.md recommends installing evalscope via pip (pip install evalscope or extras). Installing an external PyPI package (and extras) can pull many dependencies; that is expected for a CLI tool but is a moderate operational risk if you don't trust the upstream package or want to avoid installing packages system‑wide.
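One way to avoid a system-wide install is to create a dedicated virtual environment first. The sketch below only builds the two commands and does not run them; the helper name `isolated_install_commands` and the directory `.venv-evalscope` are illustrative assumptions.

```python
import sys
from pathlib import Path


def isolated_install_commands(env_dir, package="evalscope"):
    """Return argv lists that create a venv and install the package into it."""
    env = Path(env_dir)
    # The venv's interpreter lives in Scripts/ on Windows and bin/ elsewhere.
    if sys.platform == "win32":
        env_python = env / "Scripts" / "python.exe"
    else:
        env_python = env / "bin" / "python"
    return [
        [sys.executable, "-m", "venv", str(env)],
        [str(env_python), "-m", "pip", "install", package],
    ]


for cmd in isolated_install_commands(".venv-evalscope"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once you have confirmed that the package name matches the official PyPI project.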

✓ Credentials

The registry metadata requests no environment variables or credentials. The runtime instructions, however, include many optional flags that accept API URLs and API keys (e.g., --api-key, judge-model-args, wandb API keys). These are reasonable for a benchmarking tool but mean the agent or user may be prompted to provide secrets when evaluating remote/API‑served models.
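Because secrets are passed on the command line, it can help to mask them before a command is echoed or logged. The following is a minimal sketch under stated assumptions: the helper `redact` and the environment variable `EVAL_API_KEY` are hypothetical, and the command shown is deliberately incomplete.

```python
import os

SECRET_FLAGS = {"--api-key"}  # flag named in the text above; extend as needed


def redact(argv, secret_flags=SECRET_FLAGS):
    """Return a copy of argv with the value of each secret flag masked."""
    out, mask_next = [], False
    for arg in argv:
        if mask_next:
            out.append("***")
            mask_next = False
        elif arg in secret_flags:
            out.append(arg)
            mask_next = True
        elif "=" in arg and arg.split("=", 1)[0] in secret_flags:
            # Handles the --flag=value spelling.
            out.append(arg.split("=", 1)[0] + "=***")
        else:
            out.append(arg)
    return out


# Read the key from the environment (hypothetical variable name) instead of
# pasting it into chat; the argv here is an incomplete illustration.
api_key = os.environ.get("EVAL_API_KEY", "")
argv = ["evalscope", "eval", "--api-key", api_key]
print(" ".join(redact(argv)))
```

The real argv is still what gets executed; only the logged copy is masked.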

✓ Persistence & Privilege

The skill is not always‑enabled and does not request persistent privileges. It does not instruct modifying other skills or global agent config. Running evalscope commands may create output directories and logs under ./outputs, which is normal.
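Since logs and reports accumulate under ./outputs, a quick scan before sharing can confirm that a known secret did not end up in any file. This is a rough sketch, not a substitute for a proper secret scanner; `files_containing` and the example key are hypothetical.

```python
from pathlib import Path


def files_containing(root, needle):
    """List files under root whose text contains needle (e.g. an API key)."""
    root = Path(root)
    if not root.is_dir():
        return []
    hits = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Ignore undecodable bytes so binary artifacts do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in text:
                hits.append(path)
    return hits


# Placeholder key; in practice pass the real secret from a safe source.
for hit in files_containing("./outputs", "sk-example-key"):
    print(f"secret found in {hit}")
```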

How to Use

1. Make sure OpenClaw is installed (local or Docker).
2. Run the install command in chat: /install skill-evalscope
3. After installation, invoke the skill by name or use /skill-evalscope
4. Provide the required inputs per the skill's parameter spec and get structured output.

Version History

v1.0.1

No code changes; skill description clarified for accuracy and scope.
- Expanded the skill description to concisely enumerate core EvalScope capabilities: model evaluation, performance benchmarking, benchmark discovery, and results visualization.
- Clarified trigger scenarios for when this skill should be invoked.
- No changes to CLI guidance, workflows, or example commands.

v1.0.0

Initial release of EvalScope skill: natural language to evalscope CLI for LLM evaluation and benchmarking.
- Translates requests into evalscope CLI commands for model eval, perf, and result visualization.
- Discovers and filters benchmarks using tags and detailed queries.
- Supports local checkpoints, API endpoints, and mock pipelines.
- Guides through setup, model selection, benchmark selection, and parameterization.
- Summarizes and points to results and reports after runs.

Metadata

Slug: skill-evalscope
Version: 1.0.1
License: MIT-0
All-time Installs: 0
Active Installs: 0
Total Versions: 2

Frequently Asked Questions

What is EvalScope?

EvalScope is an AI agent skill for Claude Code / OpenClaw that translates natural language requests into evalscope CLI commands. Core capabilities: (1) Model accuracy evaluation (eval), which runs 156+ benchmarks (Math, Codin... It has 166 downloads so far.

How do I install EvalScope?

Run "/install skill-evalscope" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is EvalScope free?

Yes, EvalScope is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does EvalScope support?

EvalScope is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created EvalScope?

It is built and maintained by Yunlin Mao (@yunnglin); the current version is v1.0.1.
