yuzhihui886

LLM Tester

Author: yuzhihui886 · GitHub · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 68
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install llm-tester
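Description
A comparison testing tool for LLM models. It runs batch comparisons across multiple models, automatically records elapsed time, token consumption, and success rate, and generates a comparison report in JSON format. Use it when you need to evaluate how different LLM models perform on a specific task.
Usage (SKILL.md)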

LLM Tester - Model Comparison Testing Tool

Overview

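Automates comparison testing of multiple LLM models on the same tasks, records metrics such as speed, token consumption, and success rate, and generates a standardized JSON report. Suited to model selection, prompt optimization, cost evaluation, and similar scenarios.

When to use

  • You need to compare the performance of different LLM models
  • You need to evaluate model quality on a specific task
  • You need to batch-test multiple samples
  • You need a standardized test report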

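Core features

| Feature | Description |
| --- | --- |
| Multi-model comparison | Tests several models in one run and compares the results automatically |
| Batch testing | Accepts multiple sample files and prompt templates |
| Performance logging | Automatically records elapsed time, token consumption, and success rate |
| Timeout control | A model that exceeds the timeout is skipped without blocking the other models |
| JSON report | Standardized output that is easy to integrate into CI/CD |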
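
The timeout behaviour described above (one slow model is skipped and the rest still run) can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/llm_benchmark.py: `run_models` and the `send` callable are hypothetical names, and the "timeout" and "error" status strings are assumptions, since the listing only shows "success".

```python
import time


def run_models(models, prompt, send, timeout=60):
    """Call send(model, prompt, timeout) once per model.

    `send` is a hypothetical callable that performs the API request and
    raises TimeoutError when the model takes too long. A real HTTP client
    raises its own exception type (requests raises
    requests.exceptions.Timeout), which `send` would need to translate.
    """
    results = {}
    for model in models:
        started = time.perf_counter()
        try:
            entry = {"status": "success", "result": send(model, prompt, timeout)}
        except TimeoutError:
            # Assumed status value: record the timeout and move on.
            entry = {"status": "timeout"}
        except Exception as exc:
            # Any other failure is recorded instead of aborting the run.
            entry = {"status": "error", "error": str(exc)}
        entry["time"] = round(time.perf_counter() - started, 2)
        results[model] = entry
    return results
```

Because every exception is caught inside the loop, a failure for one model only affects that model's entry in the result dictionary.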

CLI usage

# Basic usage
python3 scripts/llm_benchmark.py --samples samples/ --prompts prompts/ --models qwen3.6-plus qwen3-max

# Set the timeout and the output path
python3 scripts/llm_benchmark.py \
  --samples samples/ \
  --prompts prompts/ \
  --models qwen3.6-plus qwen3-max-2026-01-23 \
  --timeout 90 \
  --output reports/report.json

# Short flags
python3 scripts/llm_benchmark.py -s samples/ -p prompts/ -m qwen3.6-plus qwen3-max -t 90 -o report.json

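Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| --samples / -s | | Directory of test samples (containing .txt files) |
| --prompts / -p | | Directory of prompt templates (containing .txt files) |
| --models / -m | | Models to test (default: qwen3.6-plus qwen3-max) |
| --timeout / -t | | Timeout per model in seconds (default: 60) |
| --output / -o | | Report output path (default: reports/benchmark-report.json) |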
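
For readers who want to see how these flags and defaults could map onto a parser, here is a sketch using Python's standard argparse module. It is an assumption about the shape of the CLI, not the actual parser in scripts/llm_benchmark.py: `build_parser` is a hypothetical name, the timeout is assumed to be an integer, and both directory flags are left optional because the listing does not show which parameters are required.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Compare LLM models on the same samples and prompts."
    )
    parser.add_argument("--samples", "-s", help="directory of .txt sample files")
    parser.add_argument("--prompts", "-p", help="directory of .txt prompt templates")
    parser.add_argument(
        "--models", "-m", nargs="+",
        default=["qwen3.6-plus", "qwen3-max"],
        help="one or more model names to test",
    )
    # Assumption: whole seconds; the real script may accept floats.
    parser.add_argument("--timeout", "-t", type=int, default=60,
                        help="timeout per model in seconds")
    parser.add_argument("--output", "-o", default="reports/benchmark-report.json",
                        help="path of the JSON report")
    return parser


# Equivalent to the short-flag command shown in the CLI usage section.
args = build_parser().parse_args(
    ["-s", "samples/", "-p", "prompts/",
     "-m", "qwen3.6-plus", "qwen3-max", "-t", "90", "-o", "report.json"]
)
```

`nargs="+"` is what lets `--models` take a space-separated list of names, as in the documented commands.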

Environment variables

| Variable | Description | Default |
| --- | --- | --- |
| DASHSCOPE_API_KEY | API key | None (must be set) |
| LLM_API_BASE | API endpoint | https://coding.dashscope.aliyuncs.com/v1/chat/completions |
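
A script can read these two variables along the following lines. This is a sketch under stated assumptions: `load_api_config` and `DEFAULT_API_BASE` are hypothetical names, and raising RuntimeError for a missing key is one possible way to fail early, not necessarily what the skill does.

```python
import os

# Default endpoint as documented in the environment variable table.
DEFAULT_API_BASE = "https://coding.dashscope.aliyuncs.com/v1/chat/completions"


def load_api_config(env=None):
    """Return the API key and endpoint, failing early if the key is missing.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    api_key = env.get("DASHSCOPE_API_KEY")
    if not api_key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    return {
        "api_key": api_key,
        "api_base": env.get("LLM_API_BASE", DEFAULT_API_BASE),
    }
```

Setting LLM_API_BASE to a trusted proxy or self-hosted endpoint overrides the default without any code change.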

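Test data format

Sample files (samples/*.txt)

Plain-text files, one test sample per file.

Prompt templates (prompts/*.txt)

Use the {text} placeholder, which is replaced with the sample content at run time.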
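
Loading such a directory could look like the sketch below. `load_text_files` is a hypothetical helper, and keying each sample by its file name without the extension is an assumption, loosely suggested by report keys such as "chapter_1_style-analysis".

```python
from pathlib import Path


def load_text_files(directory):
    """Return {file stem: content} for every .txt file, sorted by name.

    Files with other extensions are ignored, matching the note that
    samples must be .txt files.
    """
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    }
```

The same helper would work for the prompt template directory, since both hold plain .txt files.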

Analyze the writing style of the following text and output JSON:
{"complexity": "simple or complex", "pace": "fast/relaxed/tight", "labels": ["label 1", "label 2", "label 3"]}

Text:
{text}
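
One detail worth illustrating: a template like the one above contains literal JSON braces, so filling it with `str.format` would fail on them. A plain `str.replace` on the placeholder avoids that. The sketch below assumes this approach; `render_prompt` is a hypothetical name, and the 2000-character cap is taken from the security review on this page rather than from the script itself.

```python
# Per the security review in this listing, at most 2000 characters of
# each sample are sent to the API; treated here as an assumption.
MAX_SAMPLE_CHARS = 2000


def render_prompt(template, sample_text, max_chars=MAX_SAMPLE_CHARS):
    """Substitute the sample into the template's {text} placeholder."""
    if "{text}" not in template:
        raise ValueError("prompt template must contain the {text} placeholder")
    # str.replace leaves other braces (such as the JSON example) untouched.
    return template.replace("{text}", sample_text[:max_chars])


template = 'Output JSON: {"labels": []}\nText:\n{text}'
prompt = render_prompt(template, "It was a dark and stormy night.")
```

Checking for the placeholder up front turns a silently unchanged prompt into an explicit error.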

Report format

{
  "timestamp": "2026-04-14 13:30:00",
  "config": {
    "samples_dir": "samples/",
    "prompts_dir": "prompts/",
    "models": ["qwen3.6-plus", "qwen3-max-2026-01-23"],
    "timeout": 60
  },
  "summary": {
    "qwen3.6-plus": {
      "success_rate": "16/16 (100%)",
      "avg_time": "43.8s",
      "avg_tokens": "2952",
      "total_time": "701.4s",
      "total_tokens": 47230
    },
    "qwen3-max-2026-01-23": {
      "success_rate": "16/16 (100%)",
      "avg_time": "6.1s",
      "avg_tokens": "735",
      "total_time": "97.7s",
      "total_tokens": 11762
    }
  },
  "results": {
    "chapter_1_style-analysis": {
      "qwen3.6-plus": {"status": "success", "time": 36.13, "tokens": 2496, "result": {...}},
      "qwen3-max-2026-01-23": {"status": "success", "time": 13.12, "tokens": 707, "result": {...}}
    }
  }
}
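
The summary values in this example are consistent with simple averages over 16 runs (701.4 s / 16 ≈ 43.8 s and 47230 / 16 ≈ 2952 tokens for qwen3.6-plus). A sketch of how such a summary could be derived from per-run entries follows. `summarize` is a hypothetical function, the string formats are inferred from the example, and counting only successful runs toward the totals is an assumption.

```python
def summarize(runs):
    """Aggregate per-run entries (status, time, tokens) into summary fields."""
    total = len(runs)
    ok = [run for run in runs if run["status"] == "success"]
    passed = len(ok)
    # Assumption: failed runs do not contribute to time or token totals.
    total_time = sum(run["time"] for run in ok)
    total_tokens = sum(run["tokens"] for run in ok)
    percent = round(100 * passed / total) if total else 0
    return {
        "success_rate": f"{passed}/{total} ({percent}%)",
        "avg_time": f"{total_time / passed:.1f}s" if passed else "0.0s",
        "avg_tokens": f"{total_tokens / passed:.0f}" if passed else "0",
        "total_time": f"{total_time:.1f}s",
        "total_tokens": total_tokens,
    }


summary = summarize([
    {"status": "success", "time": 2.0, "tokens": 100},
    {"status": "success", "time": 4.0, "tokens": 300},
    {"status": "timeout", "time": 60.0, "tokens": 0},
])
```

Note that the example report stores averages and total time as strings but total_tokens as a number, so consumers need to parse the former before comparing them.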

Dependencies

  • Python 3.10+
  • requests >= 2.31

Install the dependencies:

pip install -r scripts/requirements.txt

Integration with other skills

Integration with style-analyzer

# Test style-analysis capability
python3 scripts/llm_benchmark.py \
  --samples samples/novel-chapters/ \
  --prompts prompts/style-analysis.txt \
  --models qwen3.6-plus qwen3-max \
  --output reports/style-benchmark.json

Integration with foreshadowing-tracker

# Test foreshadowing-detection capability
python3 scripts/llm_benchmark.py \
  --samples samples/novel-chapters/ \
  --prompts prompts/foreshadowing.txt \
  --models qwen3.6-plus qwen3-max \
  --output reports/foreshadowing-benchmark.json
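
Since the reports are plain JSON, a follow-up step (for example in CI) can read one and flag models that fall below a chosen success rate. The sketch below is an illustration only: `failing_models` and the `min_rate` threshold are hypothetical, and it relies on the "passed/total (percent%)" string format shown in the example report.

```python
import re


def failing_models(report, min_rate=1.0):
    """Return the models whose success ratio is below min_rate.

    Expects report["summary"][model]["success_rate"] to look like
    "16/16 (100%)", as in the example report.
    """
    failures = []
    for model, stats in report["summary"].items():
        match = re.match(r"(\d+)/(\d+)", stats["success_rate"])
        if match is None:
            # Unparseable entries are treated as failures.
            failures.append(model)
            continue
        passed, total = int(match.group(1)), int(match.group(2))
        if total == 0 or passed / total < min_rate:
            failures.append(model)
    return failures


example = {
    "summary": {
        "qwen3.6-plus": {"success_rate": "16/16 (100%)"},
        "qwen3-max": {"success_rate": "12/16 (75%)"},
    }
}
below_threshold = failing_models(example)
```

In practice the report dictionary would come from `json.load` on a file such as reports/style-benchmark.json, and a non-empty result could be turned into a non-zero exit code.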

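Notes

  • The DASHSCOPE_API_KEY environment variable must be set
  • Prompt templates must contain the {text} placeholder
  • Sample files must be in .txt format
  • A timeout of 60 to 90 seconds is recommended
Safe usage advice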
This tool appears to do what it says: it reads .txt samples and prompts, posts formatted prompts to an HTTP LLM API, collects timing/token info, and writes a JSON report. Before installing or running:
  1. Be aware you must set DASHSCOPE_API_KEY (the registry metadata incorrectly omitted this); without it the script will fail.
  2. The default API endpoint (LLM_API_BASE) is https://coding.dashscope.aliyuncs.com; confirm you trust that service, because your sample texts (up to 2000 chars each) and prompts are sent to it. Avoid sending sensitive data unless you control or trust the endpoint.
  3. Install the dependency (requests) in a controlled environment (pip install -r scripts/requirements.txt).
  4. If you prefer not to use the default endpoint, set LLM_API_BASE to a trusted API or proxy that you control.
  5. If you need higher assurance, review the API provider's privacy/security terms or run benchmarking against local/self-hosted models.
Functional analysis
Type: OpenClaw Skill
Name: llm-tester
Version: 1.0.0
The skill is a legitimate LLM benchmarking tool designed to compare performance across different models using user-provided samples and prompts. The core logic in `scripts/llm_benchmark.py` handles API requests to a configurable endpoint (defaulting to Alibaba's DashScope) and generates local JSON reports, with no evidence of data exfiltration, malicious execution, or prompt injection.
Capability assessment
Purpose & Capability
The name/description (LLM model comparison/benchmarking) matches the included script: it loads samples and prompts, calls an LLM HTTP API, records timing/tokens, and writes a JSON report. Default model names (qwen...) and the behaviour are consistent with the stated purpose. Note: the registry metadata claims no required environment variables, but both SKILL.md and the script require DASHSCOPE_API_KEY and optionally LLM_API_BASE, so the metadata is inconsistent with the skill itself.
Instruction Scope
SKILL.md instructs running scripts/llm_benchmark.py with sample and prompt directories and to set DASHSCOPE_API_KEY; the script performs only the expected actions (reads .txt files, formats prompts, posts to API_BASE, aggregates results, and writes a report). It does not access other system files or extra env vars. Note: it will transmit sample contents (up to 2000 characters per sample) to an external HTTP API — expected for a benchmarking tool but important for privacy.
Install Mechanism
No install spec is provided (instruction-only install), and dependencies are minimal (requests). The skill includes scripts/requirements.txt and instructs pip install -r, which is reasonable. No downloads from arbitrary URLs or extraction behavior present.
Credentials
The script requires DASHSCOPE_API_KEY and supports overriding LLM_API_BASE; both are proportional to an HTTP-based LLM client. However, registry metadata incorrectly lists 'Required env vars: none' while both SKILL.md and scripts use DASHSCOPE_API_KEY — this mismatch is a practical risk (user may not realize a secret is needed). Also, providing the API key and sample data will send potentially sensitive content to the external service, so confirm trustworthiness of the key/endpoint before use.
Persistence & Privilege
The skill does not request permanent/always-on presence, does not modify other skills or system configurations, and does not add privileged behaviour. It only runs when invoked.
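How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install llm-tester
  3. Once installed, invoke the skill by name or trigger it with /llm-tester
  4. Provide the inputs listed in the skill's parameter reference to get structured output
Version history
v1.0.0
Initial release: multi-model comparison testing tool with batch testing, performance logging, and JSON report generation
Metadata
Slug llm-tester
Version 1.0.0
License MIT-0
Total installs 0
Current installs 0
Versions 1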
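FAQ

What is LLM Tester?

A comparison testing tool for LLM models. It runs batch comparisons across multiple models, automatically records elapsed time, token consumption, and success rate, and generates a comparison report in JSON format. Use it when you need to evaluate how different LLM models perform on a specific task. It is an AI agent skill plugin for Claude Code / OpenClaw and has been downloaded 68 times so far.

How do I install LLM Tester?

Run the command "/install llm-tester" in the OpenClaw or Claude Code chat box to install it in one step. Installation itself needs no extra configuration, but running the benchmark requires the DASHSCOPE_API_KEY environment variable.

Is LLM Tester free?

Yes. LLM Tester is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does LLM Tester support?

LLM Tester is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops LLM Tester?

It is developed and maintained by yuzhihui886 (@yuzhihui886); the current version is v1.0.0.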
