Llm Evaluator
Author: aiwithabidi · v1.0.0
Total downloads: 375
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install llm-evaluator
Description
LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical trac...
Usage instructions (SKILL.md)
LLM Evaluator ⚖️
LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.
When to Use
- Evaluating quality of search results or AI responses
- Scoring traces for relevance, accuracy, hallucination detection
- Batch scoring recent unscored traces
- Quality assurance on agent outputs
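For the batch-scoring case, the selection step can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the trace dictionaries, their `scores` and `timestamp` fields, and the `select_unscored` helper are all hypothetical stand-ins for whatever Langfuse returns.

```python
from typing import Iterable


def select_unscored(traces: Iterable[dict], limit: int = 20) -> list[dict]:
    """Return up to `limit` of the most recent traces that have no scores yet.

    Assumption: each trace is a dict with a sortable "timestamp" and an
    optional "scores" list. The real Langfuse trace shape may differ.
    """
    unscored = [t for t in traces if not t.get("scores")]
    # Newest first, so a small limit covers the most recent activity.
    unscored.sort(key=lambda t: t["timestamp"], reverse=True)
    return unscored[:limit]


if __name__ == "__main__":
    sample = [
        {"id": "a", "timestamp": 1, "scores": []},
        {"id": "b", "timestamp": 2, "scores": [{"name": "relevance", "value": 0.9}]},
        {"id": "c", "timestamp": 3},
    ]
    print([t["id"] for t in select_unscored(sample, limit=20)])
```

Filtering before applying the limit means already-scored traces do not use up the batch budget.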
Usage
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test
# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>
# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance
# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
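The three subcommands above suggest a command-line layout along these lines. This is a hedged sketch using Python's standard `argparse`, not the contents of `scripts/evaluator.py`: the default of 20 for `--limit`, the ability to pass several names to `--evaluators`, and the help strings are assumptions.

```python
import argparse

EVALUATORS = ["relevance", "accuracy", "hallucination", "helpfulness"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the test / score / backfill commands."""
    parser = argparse.ArgumentParser(prog="evaluator.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("test", help="run the sample cases")

    score = sub.add_parser("score", help="score one Langfuse trace")
    score.add_argument("trace_id")
    # Assumption: omitting --evaluators runs all four evaluators.
    score.add_argument("--evaluators", nargs="+", choices=EVALUATORS,
                       default=EVALUATORS)

    backfill = sub.add_parser("backfill", help="score recent unscored traces")
    # Assumption: the default limit is illustrative only.
    backfill.add_argument("--limit", type=int, default=20)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["score", "abc123", "--evaluators", "relevance"])
    print(args.command, args.trace_id, args.evaluators)
    # prints: score abc123 ['relevance']
```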
Evaluators
| Evaluator | Measures | Scale |
|---|---|---|
| relevance | Response relevance to query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Made-up information detection | 0–1 |
| helpfulness | Overall usefulness | 0–1 |
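To make the table concrete, here is a minimal sketch of how an LLM-as-a-Judge scorer might turn each evaluator into a 0–1 score. The prompt wording, the JSON reply format, and the names `build_prompt`, `parse_score`, and `evaluate` are assumptions for illustration. The judge is passed in as a plain callable, so no real model or Langfuse API is invoked, and the direction of the hallucination scale is left unspecified because the source does not state it.

```python
import json
from typing import Callable, Iterable

# One rubric question per evaluator, paraphrasing the "Measures" column.
RUBRICS = {
    "relevance": "Rate how relevant the response is to the query.",
    "accuracy": "Rate the factual correctness of the response.",
    "hallucination": "Rate the response for made-up information.",
    "helpfulness": "Rate the overall usefulness of the response.",
}


def build_prompt(evaluator: str, query: str, response: str) -> str:
    """Assemble a judge prompt (hypothetical wording)."""
    return (
        f"{RUBRICS[evaluator]}\n"
        f"Query: {query}\n"
        f"Response: {response}\n"
        'Reply with JSON like {"score": 0.5}, where score is between 0 and 1.'
    )


def parse_score(raw: str) -> float:
    """Parse the judge's JSON reply and clamp the score to the 0-1 scale."""
    score = float(json.loads(raw)["score"])
    return min(1.0, max(0.0, score))


def evaluate(judge: Callable[[str], str], query: str, response: str,
             evaluators: Iterable[str] = tuple(RUBRICS)) -> dict[str, float]:
    """Run each requested evaluator through the judge callable."""
    return {name: parse_score(judge(build_prompt(name, query, response)))
            for name in evaluators}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same score."""
    return '{"score": 0.8}'


if __name__ == "__main__":
    print(evaluate(fake_judge, "capital of France?", "Paris", ["relevance"]))
    # prints: {'relevance': 0.8}
```

Clamping in `parse_score` keeps an out-of-range judge reply from breaking the 0–1 scale.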
Credits
Built by M. Abidi | agxntsix.ai | YouTube | GitHub
Part of the AgxntSix Skill Suite for OpenClaw agents.
Security recommendations
Before installing or running this skill, inspect the included scripts (scripts/evaluator.py) yourself. Pay particular attention to:
1) the hardcoded LANGFUSE_SECRET_KEY/LANGFUSE_PUBLIC_KEY and LANGFUSE_HOST: verify they are not production secrets and consider removing or replacing them with environment-configured values;
2) the code path that reads ~/.openclaw/workspace/.env to obtain an OpenRouter key: ensure you are comfortable with that file being read or set OPENROUTER_API_KEY explicitly instead;
3) network endpoints (openrouter.ai and the langfuse host): run in an isolated environment if you do not fully trust them.
If you plan to use this in production, rotate any exposed credentials, replace hardcoded keys with proper environment variables or configuration, and run the script in a sandbox while monitoring outbound network traffic. If you are unsure, request the author to remove embedded keys and document any file reads and external endpoints explicitly.
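One way to replace hardcoded keys with environment-configured values is sketched below. It is an illustration under stated assumptions rather than a patch for `scripts/evaluator.py`: the `load_config` helper is hypothetical, and it assumes all four variable names mentioned in this review are supplied by the environment.

```python
import os
from typing import Mapping

# Variable names taken from the review text; treating all four as
# required is an assumption.
REQUIRED_VARS = (
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
    "OPENROUTER_API_KEY",
)


def load_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read credentials from the environment and fail loudly if any is unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    try:
        config = load_config()
        print("Loaded settings:", sorted(config))
    except RuntimeError as exc:
        print(exc)
```

Failing at startup when a variable is missing avoids silently falling back to an embedded key.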
Functional analysis
Type: OpenClaw Skill
Name: llm-evaluator
Version: 1.0.0
The skill contains hardcoded sensitive credentials (LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY) within scripts/evaluator.py, which is a significant security vulnerability. Furthermore, the script automatically attempts to read the OPENROUTER_API_KEY from the user's workspace environment file (~/.openclaw/workspace/.env). While these behaviors are technically functional for the stated purpose of LLM evaluation, the inclusion of hardcoded secrets and the automated harvesting of API keys from the filesystem are high-risk practices.
Capability assessment
Purpose & Capability
Name/description (Langfuse + OpenRouter) matches the script's behavior: it evaluates traces and posts scores to Langfuse using an OpenRouter-backed judge model. However, the code embeds LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY values and a LANGFUSE_HOST that are not declared in requires.env or SKILL.md; shipping hardcoded service credentials is unexpected and disproportionate to the stated purpose.
Instruction Scope
SKILL.md directs running the included Python script, which is expected. The script also attempts to read a user-local file (~/.openclaw/workspace/.env) to find OPENROUTER_API_KEY if the env var isn't set; this config-file access is not declared in requires.config_paths and is an additional data access surface that users should be aware of.
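The environment-first lookup with a file fallback described here could be made explicit and opt-in, roughly as below. This is a sketch and not the script's real logic: `read_dotenv_value` and `resolve_openrouter_key` are hypothetical helpers, and the `.env` parsing handles only simple `KEY=value` lines.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_dotenv_value(path: Path, key: str) -> Optional[str]:
    """Return the value for `key` from a simple KEY=value file, if present."""
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def resolve_openrouter_key(env: Mapping[str, str] = os.environ,
                           dotenv_path: Optional[Path] = None) -> Optional[str]:
    """Prefer the environment variable; read a file only when the caller opts in."""
    if env.get("OPENROUTER_API_KEY"):
        return env["OPENROUTER_API_KEY"]
    if dotenv_path is not None:
        return read_dotenv_value(dotenv_path, "OPENROUTER_API_KEY")
    return None
```

Requiring the caller to pass `dotenv_path` makes the file read visible at the call site instead of happening implicitly.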
Install Mechanism
No install spec (instruction-only with an included script). That keeps install risk low — nothing is downloaded or executed automatically beyond running the bundled Python script.
Credentials
The registry declares only OPENROUTER_API_KEY as required (which is appropriate). But the code embeds Langfuse public/secret keys and a Langfuse host URL; these are effectively credentials baked into the skill rather than requested from the environment. Also the script will make network calls to Langfuse and OpenRouter, which is expected but worth noting.
Persistence & Privilege
The skill does not request always:true and does not request system-wide persistence. It runs network operations and writes scores to Langfuse, which is consistent with its purpose and not an unusual privilege level.
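How to use
- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install llm-evaluator
- Once installation finishes, call the skill by name or trigger it with /llm-evaluator.
- Provide the inputs described in the skill's parameter documentation to get structured output.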
Version history
v1.0.0
- Initial release of the llm-evaluator skill.
- Provides an LLM-as-a-Judge system for evaluating AI outputs using relevance, accuracy, hallucination, and helpfulness scores.
- Integrates with Langfuse and uses GPT-5-nano for efficient automated judging.
- Enables batch backfill scoring for historical traces and real-time evaluation of outputs.
- Command-line interface for testing, scoring specific traces, and running backfills.
FAQ
What is Llm Evaluator?
LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical trac... It is an AI agent skill plugin for Claude Code / OpenClaw, with 375 total downloads to date.
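How do I install Llm Evaluator?
Run the command "/install llm-evaluator" in the OpenClaw or Claude Code chat box to install it in one step. The install itself needs no further configuration, although the registry lists OPENROUTER_API_KEY as a required environment variable for running the skill.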
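Is Llm Evaluator free?
Yes. Llm Evaluator is completely free (free and open source) and can be freely downloaded, installed, and used.
Which platforms does Llm Evaluator support?
Llm Evaluator is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed Llm Evaluator?
It is developed and maintained by aiwithabidi (@aiwithabidi). The current version is v1.0.0.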