
Llm Evaluator

Name: Llm Evaluator
Author: aiwithabidi

By aiwithabidi · GitHub · v1.0.0

cross-platform ⚠ suspicious

Total downloads: 375

Install in OpenClaw

/install llm-evaluator

Description

LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical trac...

Usage instructions (SKILL.md)

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

Evaluating quality of search results or AI responses
Scoring traces for relevance, accuracy, hallucination detection
Batch scoring recent unscored traces
Quality assurance on agent outputs

Usage

# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
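
As a rough illustration of the selection step behind a backfill like the one above, the sketch below picks the most recent traces that have no scores yet, up to a limit. The trace dictionaries, the `scores` and `timestamp` fields, and the `select_unscored` helper are all hypothetical; the bundled script fetches traces from Langfuse and may work differently.

```python
from typing import Dict, List


def select_unscored(traces: List[Dict], limit: int = 20) -> List[Dict]:
    """Return up to `limit` traces without scores, newest first.

    Hypothetical helper: assumes each trace is a dict with an ISO-8601
    `timestamp` string and an optional `scores` list.
    """
    unscored = [t for t in traces if not t.get("scores")]
    # ISO-8601 timestamps in the same timezone sort correctly as strings.
    unscored.sort(key=lambda t: t["timestamp"], reverse=True)
    return unscored[:limit]


# Illustrative in-memory data, not real Langfuse output.
sample_traces = [
    {"id": "t1", "timestamp": "2025-01-01T10:00:00Z", "scores": []},
    {
        "id": "t2",
        "timestamp": "2025-01-02T10:00:00Z",
        "scores": [{"name": "relevance", "value": 0.9}],
    },
    {"id": "t3", "timestamp": "2025-01-03T10:00:00Z"},
]

selected_ids = [t["id"] for t in select_unscored(sample_traces, limit=20)]
print(selected_ids)
```

Skipping traces that already carry scores keeps repeated backfill runs from re-scoring the same traces, which matters when every judgment costs a model call.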

Evaluators

Evaluator	Measures	Scale
relevance	Response relevance to query	0–1
accuracy	Factual correctness	0–1
hallucination	Made-up information detection	0–1
helpfulness	Overall usefulness	0–1
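
To make the 0–1 scale concrete, here is a minimal sketch of turning a judge model's reply into per-evaluator scores. It assumes the judge answers with a flat JSON object keyed by evaluator name; that reply format and the `parse_judge_scores` helper are assumptions for illustration, not the format the bundled script is known to use.

```python
import json
from typing import Dict

EVALUATORS = ("relevance", "accuracy", "hallucination", "helpfulness")


def parse_judge_scores(raw: str) -> Dict[str, float]:
    """Parse a judge reply into per-evaluator scores clamped to [0, 1].

    Assumes (hypothetically) that the judge replies with a flat JSON
    object such as {"relevance": 0.8, ...}. Unknown keys are ignored
    and missing or non-numeric values are skipped.
    """
    data = json.loads(raw)
    scores: Dict[str, float] = {}
    for name in EVALUATORS:
        value = data.get(name)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        scores[name] = min(1.0, max(0.0, float(value)))
    return scores


example_reply = '{"relevance": 0.8, "accuracy": 1.4, "hallucination": "n/a", "tone": 0.5}'
parsed = parse_judge_scores(example_reply)
print(parsed)
```

Clamping and skipping malformed values keeps a single bad judge reply from writing an out-of-range score.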

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub
Part of the AgxntSix Skill Suite for OpenClaw agents.


Security recommendations

Before installing or running this skill, inspect the included script (scripts/evaluator.py) yourself. Pay particular attention to:

1) The hardcoded LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_HOST values: verify they are not production secrets, and consider removing them or replacing them with environment-configured values.
2) The code path that reads ~/.openclaw/workspace/.env to obtain an OpenRouter key: make sure you are comfortable with that file being read, or set OPENROUTER_API_KEY explicitly instead.
3) The network endpoints (openrouter.ai and the Langfuse host): run in an isolated environment if you do not fully trust them.

If you plan to use this in production, rotate any exposed credentials, replace hardcoded keys with environment variables or configuration, and run the script in a sandbox while monitoring outbound network traffic. If you are unsure, ask the author to remove the embedded keys and to document all file reads and external endpoints explicitly.
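
One way to start that inspection is a quick heuristic scan for credential-like names assigned to string literals. The sketch below is an assumption-laden helper, not a security tool: the name pattern (KEY, SECRET, TOKEN) and the `find_hardcoded_secrets` function are hypothetical, and the sample source uses placeholder values rather than the actual contents of scripts/evaluator.py.

```python
import re
from typing import List, Tuple

# Matches assignments such as NAME = "literal" where NAME looks like a
# credential. The name pattern is a heuristic, not an exhaustive list.
SECRET_ASSIGNMENT = re.compile(
    r"""^\s*(?P<name>[A-Z0-9_]*(?:KEY|SECRET|TOKEN)[A-Z0-9_]*)\s*=\s*["'](?P<value>[^"']+)["']"""
)


def find_hardcoded_secrets(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, variable_name) for suspicious string literals."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = SECRET_ASSIGNMENT.match(line)
        if match:
            findings.append((lineno, match.group("name")))
    return findings


# Illustrative snippet with placeholder values, not the real script contents.
sample_source = '''
import os
LANGFUSE_SECRET_KEY = "placeholder-secret"
LANGFUSE_PUBLIC_KEY = "placeholder-public"
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
'''

findings = find_hardcoded_secrets(sample_source)
print(findings)
```

To check a real file, pass its text instead, for example `find_hardcoded_secrets(Path("scripts/evaluator.py").read_text())` with `Path` imported from `pathlib`. A clean result does not prove the script is safe; it only rules out the most obvious pattern.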

Functional analysis

Type: OpenClaw Skill
Name: llm-evaluator
Version: 1.0.0

The skill contains hardcoded sensitive credentials (LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY) within scripts/evaluator.py, which is a significant security vulnerability. Furthermore, the script automatically attempts to read the OPENROUTER_API_KEY from the user's workspace environment file (~/.openclaw/workspace/.env). While these behaviors are technically functional for the stated purpose of LLM evaluation, the inclusion of hardcoded secrets and the automated harvesting of API keys from the filesystem are high-risk practices.

Capability assessment

⚠ Purpose & Capability

Name/description (Langfuse + OpenRouter) matches the script's behavior: it evaluates traces and posts scores to Langfuse using an OpenRouter-backed judge model. However, the code embeds LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY values and a LANGFUSE_HOST that are not declared in requires.env or SKILL.md; shipping hardcoded service credentials is unexpected and disproportionate to the stated purpose.

⚠ Instruction Scope

SKILL.md directs running the included Python script, which is expected. The script also attempts to read a user-local file (~/.openclaw/workspace/.env) to find OPENROUTER_API_KEY if the env var isn't set; this config-file access is not declared in requires.config_paths and is an additional data access surface that users should be aware of.
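
For readers who want to recognize this kind of fallback when reviewing the script, the following sketch shows one common shape it can take: use the environment variable if present, otherwise parse a KEY=VALUE file. The `resolve_api_key` helper and its parsing rules are hypothetical and are not taken from the script; the demonstration uses a temporary file rather than the real workspace .env.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(
    name: str,
    env: Mapping[str, str],
    env_file: Optional[Path] = None,
) -> Optional[str]:
    """Return `name` from `env`, else from a KEY=VALUE line in `env_file`."""
    value = env.get(name)
    if value:
        return value  # Environment variable set: the file is never opened.
    if env_file is None or not env_file.is_file():
        return None
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            return raw.strip().strip("'\"")
    return None


# Demonstration against a throwaway file rather than the real workspace .env.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / ".env"
    demo_file.write_text("# demo\nOPENROUTER_API_KEY='from-file'\n", encoding="utf-8")
    from_env = resolve_api_key(
        "OPENROUTER_API_KEY", {"OPENROUTER_API_KEY": "from-env"}, demo_file
    )
    from_file = resolve_api_key("OPENROUTER_API_KEY", {}, demo_file)
    missing = resolve_api_key("OPENROUTER_API_KEY", {}, None)

print(from_env, from_file, missing)
```

In this sketch the file is only touched when the variable is absent, which is why setting OPENROUTER_API_KEY explicitly would sidestep the file read under the same logic. Whether the bundled script behaves identically should be confirmed by reading it.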

✓ Install Mechanism

No install spec (instruction-only with an included script). That keeps install risk low — nothing is downloaded or executed automatically beyond running the bundled Python script.

⚠ Credentials

The registry declares only OPENROUTER_API_KEY as required (which is appropriate). But the code embeds Langfuse public/secret keys and a Langfuse host URL; these are effectively credentials baked into the skill rather than requested from the environment. Also the script will make network calls to Langfuse and OpenRouter, which is expected but worth noting.
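
A minimal sketch of the alternative, where the Langfuse settings are requested from the environment instead of being embedded, might look like the following. The `LangfuseSettings` class and `load_langfuse_settings` function are hypothetical names; only the three variable names come from the assessment above, and the values shown are placeholders.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LangfuseSettings:
    public_key: str
    secret_key: str
    host: str


def load_langfuse_settings(env: Mapping[str, str]) -> LangfuseSettings:
    """Build settings from environment-style variables, failing fast if any is missing."""
    names = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return LangfuseSettings(
        public_key=env["LANGFUSE_PUBLIC_KEY"],
        secret_key=env["LANGFUSE_SECRET_KEY"],
        host=env["LANGFUSE_HOST"],
    )


# Placeholder values for illustration only.
demo_env = {
    "LANGFUSE_PUBLIC_KEY": "placeholder-public",
    "LANGFUSE_SECRET_KEY": "placeholder-secret",
    "LANGFUSE_HOST": "https://langfuse.example.invalid",
}
settings = load_langfuse_settings(demo_env)
print(settings.host)
```

In real use `os.environ` would be passed as `env`. Failing fast on a missing variable makes the dependency visible to the user, which is the property the hardcoded approach lacks.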

✓ Persistence & Privilege

The skill does not request always:true and does not request system-wide persistence. It runs network operations and writes scores to Langfuse, which is consistent with its purpose and not an unusual privilege level.

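How to use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install llm-evaluator
Once installation finishes, invoke the Skill by name or trigger it with /llm-evaluator
Provide the inputs described in the Skill's parameter documentation to get structured output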

Version history

v1.0.0

- Initial release of the llm-evaluator skill.
- Provides an LLM-as-a-Judge system for evaluating AI outputs using relevance, accuracy, hallucination, and helpfulness scores.
- Integrates with Langfuse and uses GPT-5-nano for efficient automated judging.
- Enables batch backfill scoring for historical traces and real-time evaluation of outputs.
- Command-line interface for testing, scoring specific traces, and running backfills.

Metadata

Slug: llm-evaluator

Version: 1.0.0

License: not specified

Total installs: 0

Current installs: 0

Version count: 1

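FAQ

What is Llm Evaluator?

LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical trac... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 375 total downloads to date.

How do I install Llm Evaluator?

Run the command "/install llm-evaluator" in the OpenClaw or Claude Code chat box to install it in one step. Installation itself needs no additional configuration, although the registry lists OPENROUTER_API_KEY as a required environment variable for running the skill.

Is Llm Evaluator free?

Yes. Llm Evaluator is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Llm Evaluator support?

Llm Evaluator is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Llm Evaluator?

It is developed and maintained by aiwithabidi (@aiwithabidi). The current version is v1.0.0.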