chekhovin

llm-benchmark-analyst

by Chekhovin · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
308
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install llm-benchmark-analyst
Description
Search and analyze LLM benchmark results within a fixed benchmark universe, then produce evidence-based model strength and weakness reports or domain-leader...
README (SKILL.md)

LLM Benchmark Analyst

Overview

Use this skill to research benchmark evidence and write structured reports about:

  1. a single model's strengths and weaknesses
  2. best models in a capability domain
  3. what a benchmark measures and how trustworthy it is
  4. predecessor vs current-model progress

Default to the user's language. Never invent scores, ranks, dates, benchmark variants, or missing table values.

Core constraints

  • Restrict the benchmark universe to references/benchmark-source.md. If a benchmark is not in that file, exclude it.
  • Use references/core-dimensions.md to collapse scattered benchmarks into a small set of report dimensions.
  • Follow references/search-playbook.md for routing, overlap expansion, evidence gathering, and comparison anchors.
  • Follow references/report-template.md for output structure.
  • Apply references/data-defect-warnings.md benchmark by benchmark, inline and again in the limitations section.
  • Prefer official benchmark or benchmark-author pages. Use aggregators mainly to discover links and context.
  • Record the evaluation mode exactly: benchmark version, split, difficulty, public/private, verified/original, with-tools/without-tools, pass@k, and any visible sub-score names.
  • Keep score units exact. Do not average incompatible metrics into a fake composite.
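
The last two constraints can be illustrated with a minimal sketch. It assumes a hypothetical `Score` record and hypothetical unit and mode labels (none of these names come from the skill's reference files), and it refuses to average scores unless they share one unit and one evaluation mode.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Score:
    """One benchmark score. Field names are illustrative, not from the skill."""
    benchmark: str
    value: float
    unit: str  # hypothetical labels, e.g. "percent" or "elo"
    mode: str  # evaluation mode recorded exactly, e.g. "verified, pass@1"


def average_same_metric(scores: list[Score]) -> float:
    """Average scores only when every score shares one unit and one mode."""
    if not scores:
        raise ValueError("no scores to average")
    kinds = {(s.unit, s.mode) for s in scores}
    if len(kinds) > 1:
        # Mixed units or modes would produce a fake composite, so stop here.
        raise ValueError(f"incompatible metrics, refusing to average: {sorted(kinds)}")
    return mean(s.value for s in scores)
```

Raising an error rather than silently normalizing is a design choice for this sketch: it forces the report to present incompatible scores side by side instead of blending them.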

Required workflow

  1. Normalize the model identity before searching

    • Resolve exact provider, family, generation, version suffix, and release label.
    • Put time and version first. Reject ambiguous aliases like claude, gemini pro, gpt latest, or qwen max until you have the exact currently relevant model string for the searched leaderboard rows.
    • Capture the evaluation time point or access date for every key score.
  2. Route the request through core dimensions before web crawling

    • Start with references/core-dimensions.md to select the primary dimension(s).
    • Then list candidate benchmarks inside those dimensions.
    • Only then start website-by-website retrieval.
    • Keep the first pass narrow and token-efficient: start from the best 3-6 benchmarks for the asked domain, then expand only if needed.
  3. Expand beyond section labels

    • Do not let the source document's headings blind you.
    • After selecting the primary dimension, inspect benchmark descriptions and overlap tags to find relevant benchmarks that live in other sections.
    • Example: a coding analysis may need coding benchmarks, agentic coding benchmarks, general benchmarks with coding components, and research/math benchmarks with strong code components.
    • Example: a multimodal analysis may need vision benchmarks, OCR, GUI/computer-use, multimodal deep-research, and omni/video/audio benchmarks.
  4. Collect evidence in this order

    • official leaderboard or benchmark site
    • benchmark paper or benchmark README
    • benchmark-author blog or release note
    • trusted aggregator
    • vendor blog only as secondary evidence, clearly labeled as vendor-reported if no independent leaderboard row exists
  5. Use multimodal extraction when the leaderboard is not machine-readable

    • If the page uses images, canvas, screenshots, or chart-only rendering and plain text extraction misses the table, inspect screenshots or page images.
    • Extract only values that are clearly visible.
    • Mark the provenance as image-extracted.
    • If the image is unreadable or partially occluded, say so instead of guessing.
  6. Apply anchor comparisons

    • For code or agentic coding, compare against the latest available Claude Opus, latest Claude Sonnet, and latest GPT family model.
    • For multimodal analysis, compare against the latest available Gemini model. Add the latest GPT multimodal model if relevant.
    • For intelligence or reasoning analysis, compare against the latest available GPT family model.
    • Never assume which model is currently latest. Search that first.
  7. Apply predecessor comparison

    • If data exists, compare the target model with its immediate predecessor or last broadly comparable prior generation from the same provider/family.
    • Only compare like-for-like benchmark variants. If the predecessor only appears under a different benchmark mode, say the comparison is not clean.
  8. Attach defect warnings

    • Any benchmark with a known quality or methodology issue must carry an inline warning from references/data-defect-warnings.md.
    • If the report's conclusion depends heavily on warned benchmarks, lower confidence and say so explicitly.
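
Two of the workflow steps above lend themselves to a short sketch: rejecting ambiguous aliases (step 1) and preferring higher-quality sources (step 4). The tier keys, the `source` field, and both function names are assumptions made for illustration; the skill itself does not define them.

```python
import re

# Source tiers in the order given in step 4; lower number = preferred.
# The key names are hypothetical labels.
SOURCE_PRIORITY = {
    "official_leaderboard": 0,
    "paper_or_readme": 1,
    "author_blog": 2,
    "trusted_aggregator": 3,
    "vendor_blog": 4,
}

# The aliases that step 1 names as too ambiguous to search with.
AMBIGUOUS_ALIASES = {"claude", "gemini pro", "gpt latest", "qwen max"}


def is_ambiguous(model_name: str) -> bool:
    """True if the name is one of the known ambiguous aliases."""
    normalized = re.sub(r"\s+", " ", model_name.strip().lower())
    return normalized in AMBIGUOUS_ALIASES


def best_source(evidence: list[dict]) -> dict:
    """Pick the evidence row that comes from the most preferred source tier."""
    return min(evidence, key=lambda row: SOURCE_PRIORITY[row["source"]])
```

A vendor-blog row selected this way would still need the vendor-reported label that step 4 requires; the sketch only handles ordering.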

Decision rules

  • When the user asks for best models in a domain, do not use only one benchmark. Use a cluster of relevant benchmarks and explain why each one matters.
  • When the user asks what a model is good or bad at, synthesize at the core-dimension level first, then support with benchmark evidence.
  • When benchmark scores conflict, prefer freshness, exact version match, official source quality, and the number of agreeing benchmarks over one standout score.
  • Treat very small gaps as non-decisive when the benchmark is noisy, image-extracted, or known to be unstable.
  • Always include one short clause describing what each benchmark actually tests.
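
The rule about very small gaps could be sketched as follows. The `noise_margin` parameter is an assumption: the skill does not specify a threshold, so the caller has to choose one per benchmark.

```python
def compare(score_a: float, score_b: float, noise_margin: float) -> str:
    """Compare two like-for-like scores on the same benchmark variant.

    Returns "a" or "b" for a clear leader, or "non-decisive" when the gap
    falls within noise_margin (a caller-chosen, hypothetical threshold).
    """
    gap = score_a - score_b
    if abs(gap) <= noise_margin:
        return "non-decisive"
    return "a" if gap > 0 else "b"
```

For a benchmark that is noisy, image-extracted, or known to be unstable, a wider margin would be the natural choice.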

Minimum evidence to capture

For every benchmark you cite, capture:

  • benchmark name
  • what it tests in one short phrase
  • exact model row name
  • exact score and unit
  • rank or relative placement if visible
  • benchmark variant, split, or mode
  • date or access time point
  • source quality note if not official
  • data warning if applicable
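
One way to hold these fields is a small record type with a completeness check. This is a sketch only: the `Evidence` class, its field names, and `missing_required` are hypothetical and are not defined by the skill.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """One cited benchmark result; field names mirror the list above."""
    benchmark: str
    tests: str       # what the benchmark tests, in one short phrase
    model_row: str   # exact model row name as shown on the leaderboard
    score: float
    unit: str
    variant: str     # benchmark variant, split, or mode
    accessed: str    # date or access time point
    rank: Optional[int] = None          # only if visible
    source_note: Optional[str] = None   # only if the source is not official
    data_warning: Optional[str] = None  # only if applicable


def missing_required(e: Evidence) -> list[str]:
    """Return the names of required text fields that were left blank."""
    required = ["benchmark", "tests", "model_row", "unit", "variant", "accessed"]
    return [name for name in required if not getattr(e, name).strip()]
```

A blank field is reported rather than filled in, which matches the instruction never to invent missing values.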

Output expectations

Use the matching template in references/report-template.md.

At minimum, every substantive report must include:

  • a scope and identity section
  • a short executive summary
  • strengths
  • weaknesses or gaps
  • evidence table
  • comparison section
  • data-defect warnings and confidence
  • methodology or exclusions
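
A draft report could be checked against this minimum list with a few lines of code. The sketch assumes the report's section headings are available as plain strings that match the list wording; the actual headings in references/report-template.md may differ.

```python
# Minimum sections from the list above, lowercased for comparison.
REQUIRED_SECTIONS = [
    "scope and identity",
    "executive summary",
    "strengths",
    "weaknesses or gaps",
    "evidence table",
    "comparison",
    "data-defect warnings and confidence",
    "methodology or exclusions",
]


def missing_sections(headings: list[str]) -> list[str]:
    """Return required sections that do not appear among the given headings."""
    present = {h.strip().lower() for h in headings}
    return [s for s in REQUIRED_SECTIONS if s not in present]
```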

Resource map

  • references/core-dimensions.md: benchmark routing and de-fragmentation map
  • references/search-playbook.md: token-efficient search order, overlap expansion, and comparison rules
  • references/data-defect-warnings.md: warning catalog and ready-to-use caution language
  • references/report-template.md: output structures for single-model, domain-leader, and benchmark-explainer tasks
  • references/benchmark-source.md: full allowed benchmark universe copied from the user's benchmark document

Example tasks

  • analyze gpt-5's coding and agentic coding strengths and weaknesses, and compare it with the latest claude opus, claude sonnet, and gpt model
  • find the best multimodal models right now using only the approved benchmark list and explain each benchmark briefly
  • write a report on qwen's reasoning strengths, benchmark gaps, predecessor comparison, and all data-quality caveats
  • tell me which models lead in deep research and search, with benchmark-specific warnings and freshness notes
Usage Guidance
This skill appears to do what it claims (search benchmark leaderboards and produce evidence-based reports) and is low-risk in terms of missing installs or credential requests. However:
  1. The SKILL.md was flagged for Unicode control characters. Open it in a raw/text editor that shows hidden/control characters and verify there are no hidden or altered instructions.
  2. Confirm the skill's source/owner before installing (no homepage is provided).
  3. Because it relies on web browsing and image extraction, avoid running it with access to sensitive accounts or data until you trust it.
  4. If you allow autonomous invocation, test it interactively first with harmless queries and monitor the agent's external requests.
  5. Check the URLs in the references/ files manually. Many external links are present and the skill will instruct browsing to those sites, so ensure that aligns with your policies.
Capability Analysis
Type: OpenClaw Skill · Name: llm-benchmark-analyst · Version: 1.0.0
The skill bundle is a highly structured and legitimate tool designed to enable an AI agent to perform LLM benchmark analysis. It includes a comprehensive reference list of global benchmarks (benchmark-source.md), detailed routing logic (core-dimensions.md), and a search playbook that emphasizes identity normalization and evidence-based reporting. The instructions in SKILL.md and search-playbook.md are strictly aligned with the stated purpose of model comparison and include sophisticated guidance on handling data defects and multimodal extraction for non-text leaderboards. No evidence of malicious intent, data exfiltration, or harmful prompt injection was found; in fact, the bundle includes references to security-focused benchmarks like SKILL-INJECT to evaluate agent robustness.
Capability Assessment
Purpose & Capability
Name, description, and bundled reference files align: the skill's goal is structured benchmark search and reporting; it restricts scope to the provided references and doesn't request unrelated credentials or system access. The instruction-only design (no binaries, no env vars, no installs) is proportionate to the stated purpose.
Instruction Scope
SKILL.md instructs the agent to browse web pages and perform multimodal extraction (text/image/canvas). That is functionally coherent, but the static scan flagged 'unicode-control-chars' in the SKILL.md (a prompt-injection pattern). Unicode control characters can hide or obfuscate instructions and may be used to manipulate or subvert the evaluation or runtime behavior. Inspect the raw SKILL.md for hidden control characters and verify that no hidden directives or altered text exist.
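The inspection recommended here can be partly automated. The following is a minimal sketch, not the scanner that produced the flag: it reports every character in Unicode general category Cc (control) or Cf (format), apart from newline, carriage return, and tab. The function name is hypothetical.

```python
import unicodedata


def find_hidden_chars(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for each control or format character.

    Newline, carriage return, and tab are treated as ordinary whitespace.
    Lines are split on "\n" only, so other line-breaking control characters
    are reported instead of being silently consumed.
    """
    hits = []
    line_no, col = 1, 0
    for ch in text:
        if ch == "\n":
            line_no += 1
            col = 0
            continue
        col += 1
        if ch not in "\r\t" and unicodedata.category(ch) in {"Cc", "Cf"}:
            hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits
```

To scan the file, read it as text with `encoding="utf-8"` and pass the string to `find_hidden_chars`. Hits still need manual review, since some format characters are legitimate (for example, zero-width joiners inside emoji sequences).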
Install Mechanism
No install spec and no code files beyond reference docs — lowest install risk. Nothing is downloaded or written to disk by the package itself.
Credentials
The skill declares no required environment variables, credentials, or config paths. Its needs (web browsing, multimodal extraction) are reasonable for the described functionality and do not demand secrets or broad system access.
Persistence & Privilege
always:false and default autonomous invocation are set (normal). Because the skill can be invoked autonomously and instructs web crawling and image extraction, it can perform network retrievals during runs — this is expected for a research/reporting skill but increases the operational blast radius if the skill contained hidden or malicious instructions. Combine this with the prompt-injection signal when deciding whether to enable autonomous runs.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install llm-benchmark-analyst
  3. After installation, invoke the skill by name or use /llm-benchmark-analyst
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of LLM Benchmark Analyst skill.
  • Enables searching and analysis of LLM benchmark results within a fixed, pre-approved benchmark universe.
  • Generates evidence-based reports on model strengths/weaknesses, domain leaders, benchmark explanations, and predecessor comparisons.
  • Enforces strict normalization of model/version and prioritizes official benchmark sources, with detailed provenance for every score.
  • Integrates workflow for benchmark selection, multimodal leaderboard extraction, overlap expansion, and defect warning application.
  • Outputs follow structured templates, always including scope, summary, evidence tables, comparisons, data defects, and exclusions.
  • Does not invent or guess missing data; maintains strict reporting fidelity and transparency.
Metadata
Slug llm-benchmark-analyst
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is llm-benchmark-analyst?

llm-benchmark-analyst is an AI Agent Skill for Claude Code / OpenClaw, with 308 downloads so far. Its listed description reads: "search and analyze llm benchmark results within a fixed benchmark universe, then produce evidence-based model strength and weakness reports or domain-leader..."

How do I install llm-benchmark-analyst?

Run "/install llm-benchmark-analyst" in the OpenClaw or Claude Code chat to install it in one step; no extra setup is required.

Is llm-benchmark-analyst free?

Yes, llm-benchmark-analyst is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does llm-benchmark-analyst support?

llm-benchmark-analyst is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created llm-benchmark-analyst?

It is built and maintained by Chekhovin (@chekhovin); the current version is v1.0.0.
