← 返回 Skills 市场
tankisstank

Benchmark Model Provider

作者 tankisstank · GitHub ↗ · v1.0.5 · MIT-0
cross-platform ✓ 安全检测通过
171
总下载
0
收藏
0
当前安装
6
版本数
在 OpenClaw 中安装
/install benchmark-model-provider
功能描述
Benchmark and rank AI providers/models against a user-specific prompt suite derived from the user's purpose, domain, and usage frequency. Use when users ask...
使用说明 (SKILL.md)

Benchmark Model Provider

Use this skill to help users choose the most suitable model for their own workflow instead of giving generic “best model” advice.

Tiếng Việt
Dùng skill này khi Boss muốn biết model nào thật sự đáng dùng cho workflow hằng ngày: model nào research tốt hơn, viết báo cáo ổn hơn, code ngon hơn, rẻ hơn, nhanh hơn, hay đáng dùng lâu dài hơn. Skill này không trả lời kiểu cảm tính, mà dựng benchmark theo đúng nhu cầu thực tế của người dùng rồi chấm, rerank và xuất report rõ ràng.

中文说明
当用户想知道“哪个模型更聪明、更便宜、更适合日常工作流、更适合研究/写报告/编程”时,使用这个技能。它不会给出泛泛而谈的“最佳模型”建议,而是根据用户自己的实际任务构建基准测试,保留原始结果、重新排序,并生成可审阅、可分享的报告。

Treat the benchmark as a personal decision framework:

  • derive the benchmark from the user's real work
  • keep the run auditable
  • preserve raw outputs for reranking
  • generate outputs that can be reviewed, shared, and published cleanly

What this skill is for

People often ask questions like:

  • Which model is smarter?
  • Which model is cheaper to run daily?
  • Which model is deeper or more useful for my job?
  • Should I use a local model or a service model?

This skill exists to answer those questions with a repeatable benchmark process, not with vague preferences.


Core operating flow

  1. Collect benchmark context
    • purpose
    • domain
    • usage frequency
  2. Build or select a benchmark spec with 5–10 domain-specific questions
  3. List currently available providers/models from trusted local OpenClaw context when allowed
  4. Ask whether the user wants to use the current list or add more models
  5. Verify every user-supplied model before running; if the name does not match, ask again or suggest the closest valid model id
  6. Run each model independently on the same benchmark set
  7. Preserve raw outputs and metrics so the run can be audited and reranked later
  8. Score results across quality, depth, cost, and speed metrics
  9. Build reports in markdown / HTML / PDF
  10. Optionally suggest simple ways to publish the generated HTML report (Vercel, Netlify, Cloudflare Pages, GitHub Pages) if the user wants a shareable link

Default decisions

Area Default
Benchmark mode prompt_only
Overall scoring quality + depth + cost
Speed handling measured and reported, excluded from default overall
Execution strategy sequential unless orchestration is needed
Web publish target (no built-in publish) — suggest Vercel / Netlify / Cloudflare Pages / GitHub Pages

Workflow rules

Benchmark input rules

  • Default to prompt_only unless the user explicitly wants agent_context.
  • In prompt_only, send only the raw prompt.
  • Do not inject extra context, memory, few-shot examples, or hidden scaffolding in prompt_only mode.
  • In agent_context, use one fixed shared system/context layer for all compared models and record it in metadata.

Execution rules

  • Support both sequential and subagent_orchestrated execution strategies.
  • Allow bounded parallel execution for subagents (for example --max-parallel 4) when the endpoint can tolerate it.
  • Treat rerank as a first-class operation; do not rerun models when only the scoring formula changes.
  • Report progress at every major step so the user never feels the process is hanging.
  • During batch execution, surface a clear update whenever one agent/model finishes.
  • Normalize model ids before calling the endpoint when the provider catalog exposes raw model ids but the user/runtime spec may contain provider-prefixed names.
  • If the endpoint returns naming/provider mismatch errors, explain the mismatch clearly instead of leaving only a raw 502/unknown-provider error.

Output rules

  • Mark every estimated metric clearly.
  • Rewrite reports/landing pages to the newest snapshot.
  • Do not append patch fragments to stale output.
  • Reports should include: ranking table, cost table, executive summary, overall assessment, recommended model selection, and full answer details.
  • Default the report language to the user's current conversation language.
  • Only switch the report language when the user explicitly asks for a different language or a bilingual output.
  • PDF output must use Unicode-capable fonts so Vietnamese, Chinese, and multilingual content render correctly.
  • Multilingual support means the renderer can display multiple languages correctly; it does not mean the skill should arbitrarily change the report language.
  • Ask before delivering externally via Vercel or other web publishing.

Safety and trust boundary

This skill may perform network I/O depending on how the benchmark spec is configured.

Safe-by-design intent

  • Example specs should use placeholder endpoints, not a private hardcoded runtime.
  • The user should supply only trusted API endpoints and credentials.
  • Publishing should happen only when the user explicitly wants delivery.

Important runtime notes

  • run_benchmark.py sends prompts to the base_url configured in the benchmark spec.
  • This skill does not publish to Vercel/Netlify/Cloudflare/GitHub automatically. It only generates local HTML/PDF artifacts.
  • If you want a shareable link, publish the generated HTML folder using one of these services: Vercel, Netlify, Cloudflare Pages, or GitHub Pages.
  • Only run the skill with endpoints, tokens, and outputs you trust.

For detailed runtime assumptions, read:

  • references/runtime-safety.md
  • references/environment-vars.md
  • references/pricing-sources.md

What to read

Read only what you need:

  • references/initial-project-spec.md — authoritative design baseline
  • references/benchmark-schema.md — benchmark spec structure, run artifacts, file layout
  • references/scoring-rubric.md — scoring model, normalization rules, default weights
  • references/pricing-sources.md — pricing precedence and estimation policy
  • references/execution-modes.md — benchmark modes, execution strategies, operational modes
  • references/output-modes.md — delivery choices, publish rules, progress feedback rules
  • references/runtime-safety.md — trust boundaries, network behavior, safe usage guidance
  • references/environment-vars.md — expected environment variables and dependency notes
  • examples/*.yaml — benchmark context templates and ready-made examples in multiple languages

Scripts

Script Purpose
scripts/build_benchmark_spec.py Build a benchmark spec from benchmark context
scripts/run_benchmark.py Execute benchmark runs and write raw outputs/metrics
scripts/estimate_tokens.py Estimate token counts when provider usage is missing
scripts/resolve_pricing.py Resolve pricing sources and compute estimated/official pricing
scripts/score_models.py Combine raw metrics and rubric scores into rankings
scripts/build_report.py Build markdown, HTML, and PDF report artifacts
scripts/publish_report.py No deployment automation. Export/copy PDF and print suggested static hosting options (Vercel/Netlify/Cloudflare Pages/GitHub Pages).

Output contract

Try to produce these artifacts whenever possible:

  • versioned benchmark spec
  • raw per-model answer files
  • raw metrics JSON
  • score breakdown JSON
  • markdown summary report
  • HTML landing page
  • PDF output when requested
  • publish result metadata when delivery occurs
安全使用建议
This skill appears coherent for model benchmarking, but it will send prompts, outputs, and the BENCHMARK_API_KEY to whatever base_url you configure in a benchmark spec. Before running: (1) verify the base_url is a trusted OpenAI‑compatible endpoint, (2) test with non-sensitive prompts first, (3) run in an isolated environment and install PyYAML/reportlab from requirements.txt, and (4) only provide Vercel/Netlify/GitHub tokens if you explicitly want automatic publish — the skill documents that publishing is a separate, opt-in step. If you need tighter safeguards, review run_benchmark.py and publish_report.py to confirm how credentials and artifacts are used/stored.
功能分析
Type: OpenClaw Skill Name: benchmark-model-provider Version: 1.0.5 The benchmark-model-provider skill is a comprehensive framework for evaluating AI models against user-specific workflows. It includes scripts for generating benchmark specifications, executing prompts via OpenAI-compatible APIs (run_benchmark.py), heuristic scoring based on response features (score_models.py), and generating multi-format reports (build_report.py). The bundle demonstrates high security awareness: run_benchmark.py includes a safety validator (_is_safe_base_url) that enforces HTTPS and blocks localhost/IP addresses to prevent SSRF or accidental credential exposure, and publish_report.py explicitly avoids automated shell execution for deployments to remain within safety boundaries. The logic is transparent, well-documented, and aligns strictly with its stated purpose.
能力评估
Purpose & Capability
Name/description, required binary (python3), required env (BENCHMARK_API_KEY), example specs, and scripts all align with a benchmarking tool that calls OpenAI‑compatible endpoints. The listed optional publishing helpers (Vercel/Netlify) are consistent with the report-publishing feature.
Instruction Scope
SKILL.md and scripts explicitly perform network I/O to the base_url from a benchmark spec and use the BENCHMARK_API_KEY for auth. This is expected for the stated purpose, but means prompts, model outputs, and the API key will be sent to whichever endpoint the user configures — the skill warns about this. The instructions do not ask for unrelated secrets or arbitrary system files.
Install Mechanism
There is no platform install spec (no remote downloads). The repo includes Python scripts and a small requirements.txt (PyYAML, reportlab). This is low risk; packages are standard and the code is shipped with the skill. Users should still install dependencies in an isolated environment before running.
Credentials
Only BENCHMARK_API_KEY is required (declared as primary). References mention an optional VERCEL_TOKEN for non-interactive publishing, but that is not required by default. No unrelated credentials or excessive env requests are present.
Persistence & Privilege
The skill does not request always:true or system-wide privileges. It stores run artifacts (raw outputs, metrics, reports) locally for audit/reranking — consistent with its purpose. Publishing to web hosts is explicit and documented; it only occurs when the user chooses that step.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install benchmark-model-provider
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /benchmark-model-provider 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.5
Security hardening: remove Vercel auto-deploy; add base_url safety checks; update docs with static hosting suggestions.
v1.0.4
Add multilingual SKILL.md intros (VI/ZH), provider/model listing workflow guidance, model-id normalization for OpenAI-compatible endpoints, per-agent progress updates, improved scoring heuristics, Unicode PDF rendering, and richer report summaries with cost/recommendation sections.
v1.0.3
Metadata precision release: keep BENCHMARK_API_KEY as the only required environment variable, mark it as primaryEnv, and keep VERCEL_TOKEN documented as optional for web publishing only.
v1.0.2
Metadata alignment release: declared BENCHMARK_API_KEY and VERCEL_TOKEN in skill metadata for registry scanning, while retaining placeholder endpoints, dependency manifest, and runtime safety guidance.
v1.0.1
Polish release: improved SKILL.md markdown presentation, replaced private default endpoint examples with generic placeholders, added requirements.txt, added environment/dependency documentation, and added runtime safety guidance for trusted endpoints and publishing.
v1.0.0
Initial release: user-specific benchmark spec generation, prompt-only and agent-context benchmark modes, sequential and orchestrated runs, scoring across quality/depth/cost with speed tracking, HTML/PDF reporting, and publish helpers for delivery workflows.
元数据
Slug benchmark-model-provider
版本 1.0.5
许可证 MIT-0
累计安装 0
当前安装数 0
历史版本数 6
常见问题

Benchmark Model Provider 是什么?

Benchmark and rank AI providers/models against a user-specific prompt suite derived from the user's purpose, domain, and usage frequency. Use when users ask... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 171 次。

如何安装 Benchmark Model Provider?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install benchmark-model-provider」即可一键安装,无需额外配置。

Benchmark Model Provider 是免费的吗?

是的,Benchmark Model Provider 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。

Benchmark Model Provider 支持哪些平台?

Benchmark Model Provider 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Benchmark Model Provider?

由 tankisstank(@tankisstank)开发并维护,当前版本 v1.0.5。

💬 留言讨论