LLM Evaluator Pro

Name: LLM Evaluator Pro
Author: aiwithabidi

By aiwithabidi · GitHub · v1.0.0

cross-platform ⚠ suspicious

Total downloads: 739

Install in OpenClaw

/install llm-evaluator-pro

Description

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace...

Usage Instructions (SKILL.md)

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

Evaluating quality of search results or AI responses
Scoring traces for relevance, accuracy, hallucination detection
Batch scoring recent unscored traces
Quality assurance on agent outputs

Usage

# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20

Evaluators

Evaluator	Measures	Scale
relevance	Response relevance to query	0–1
accuracy	Factual correctness	0–1
hallucination	Made-up information detection	0–1
helpfulness	Overall usefulness	0–1
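
Each evaluator reports a score on the 0–1 scale. As a minimal sketch of how a judge reply could be turned into such scores: the JSON reply shape and the helper name `parse_judge_scores` are assumptions for illustration and are not taken from `evaluator.py`.

```python
import json

# Evaluator names from the table above.
EVALUATORS = ("relevance", "accuracy", "hallucination", "helpfulness")


def parse_judge_scores(reply: str, evaluators=EVALUATORS) -> dict:
    """Parse a judge reply (assumed to be a JSON object) into 0-1 scores.

    Hypothetical helper: the real script's reply format is not documented.
    Missing or non-numeric entries are skipped; values are clamped to [0, 1].
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    scores = {}
    for name in evaluators:
        value = data.get(name)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            scores[name] = min(1.0, max(0.0, float(value)))
    return scores
```

Clamping keeps an out-of-range judge answer from being recorded as a score outside the documented scale, and skipping malformed entries avoids writing a misleading default.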

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub
Part of the AgxntSix Skill Suite for OpenClaw agents.

📅 Need help setting up OpenClaw for your business? Book a free consultation

Security Recommendations

This skill largely does what its README says, but there are several red flags you should resolve before running it in a production environment:

1) The script contains hardcoded Langfuse API keys and a hardcoded Langfuse host, and it uses those values directly. That could send your trace data to somebody else's account, or let the script act under that account. Treat the embedded keys as suspicious and do not rely on them.
2) The script will attempt to read ~/.openclaw/workspace/.env for an OPENROUTER_API_KEY if you don't set one in the environment; that file may contain unrelated secrets. The skill metadata did not declare that config path.
3) Dependencies (requests, openai, langfuse) are not declared; running the script without knowing what must be installed is fragile.

Recommended actions before installing or using:

- Inspect the evaluator.py file fully (remove or rotate any embedded keys).
- Replace the hardcoded LF_AUTH/LF_API values with explicit env-based configuration and ensure the host points to a Langfuse instance you control.
- Avoid running the script as-is on systems with sensitive ~/.openclaw/workspace/.env files; run it in an isolated test environment or container first.
- If you need to trust this skill, ask the publisher to provide a version that reads credentials only from declared env vars (no defaults), documents the required Python packages, and documents exactly which endpoints will receive data.

If the publisher confirms that the embedded keys are inert placeholders and the code is changed to respect environment values only, these concerns would be reduced.
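
The recommendation to read credentials only from declared environment variables, with no defaults, could look roughly like the sketch below. The three key names are the ones the skill declares; `LANGFUSE_HOST` as the host variable and the helper `load_config` are assumptions for illustration, not the publisher's code.

```python
import os

# Env var names declared by the skill; no defaults are provided on purpose.
# LANGFUSE_HOST is an assumed name for the host setting.
REQUIRED_VARS = (
    "OPENROUTER_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


def load_config(env=None) -> dict:
    """Return the required settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing fast on a missing variable, rather than falling back to an embedded key or a workspace .env file, means the script can only talk to endpoints and accounts the operator configured explicitly.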

Functional Analysis

Type: OpenClaw Skill
Name: llm-evaluator-pro
Version: 1.0.0

The skill is classified as suspicious due to two issues found in `scripts/evaluator.py`. First, it attempts to read the `OPENROUTER_API_KEY` from `~/.openclaw/workspace/.env` if the key is not found in the environment variables. This is an information disclosure risk, because it gives the skill access to a potentially sensitive file within the OpenClaw workspace. Second, it contains hardcoded Langfuse API keys (`sk-lf-115cb6b4-7153-4fe6-9255-bf28f8b115de`, `pk-lf-8a9322b9-5eb1-4e8b-815e-b3428dc69bc4`) and an internal host (`http://langfuse-web:3000`). Although these values can potentially be overridden by environment variables, embedding them is poor security practice and a potential credential leak if the keys are sensitive and end up used in an unintended context.
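
When inspecting the script before use, a quick scan for embedded keys of this shape can help. The following is a minimal sketch that assumes Langfuse-style keys are `sk-lf-` or `pk-lf-` followed by a UUID, as in the values quoted above; the helper name `find_embedded_keys` is hypothetical.

```python
import re

# Matches Langfuse-style keys of the shape quoted above: "sk-lf-" or "pk-lf-"
# followed by a UUID. Other key formats are out of scope for this sketch.
LANGFUSE_KEY_RE = re.compile(
    r"\b[sp]k-lf-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def find_embedded_keys(source: str) -> list:
    """Return (line_number, redacted_key) pairs for each match in source text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in LANGFUSE_KEY_RE.finditer(line):
            key = match.group(0)
            # Keep only the prefix so the report itself leaks nothing.
            hits.append((lineno, key[:6] + "..."))
    return hits
```

For example, passing the text of `scripts/evaluator.py` to `find_embedded_keys` would list the line numbers where such keys appear. A clean result does not prove the file is free of secrets, since only this one key format is checked.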

Capability Assessment

ℹ Purpose & Capability

Name/description match the code: it uses OpenRouter (GPT judge) and Langfuse to score traces. Requesting OPENROUTER_API_KEY and Langfuse keys is consistent with the described function. However the code contains hardcoded Langfuse keys and host values, which undermines the declared requirement model (the skill claims to require env vars but will fall back to embedded credentials).

⚠ Instruction Scope

SKILL.md instructs running the included Python script. The script, however, attempts to read ~/.openclaw/workspace/.env for the OpenRouter key (a config path not declared in metadata) and uses hardcoded Langfuse credentials and host to call the Langfuse API. Reading an undeclared workspace .env can expose other secrets, and always posting scores to a hardcoded Langfuse endpoint (with embedded keys) could transmit data to an unexpected third-party account.

ℹ Install Mechanism

There is no install spec. The skill includes a Python script but does not declare its Python package dependencies (requests, openai, langfuse). That is a coherence and usability issue, since the script may fail. The lack of an install step is not malicious in itself, but it lowers installation auditability and increases risk, because it is unclear which packages users will install in order to run the script.
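
Because the dependencies are undeclared, it can be useful to check which of the named packages are present before running anything. A minimal sketch using only the standard library follows; the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Packages the review says the script needs but does not declare.
REQUIRED_PACKAGES = ("requests", "openai", "langfuse")


def missing_packages(names=REQUIRED_PACKAGES) -> list:
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_packages()` in the target environment returns the subset that still needs to be installed, without importing (and therefore executing) any of the packages.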

⚠ Credentials

Declared env vars (OPENROUTER_API_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY) are appropriate for the stated purpose. However the script: (1) sets default LANGFUSE keys in code, (2) hardcodes LF_AUTH and LF_API values rather than reading the environment, and (3) attempts to parse ~/.openclaw/workspace/.env if OPENROUTER_API_KEY is not set. These behaviors mean the skill can use embedded credentials and read an undeclared local .env file, which is disproportionate and suspicious.

✓ Persistence & Privilege

The skill is not force-included (always=false) and does not request persistent platform privileges. It does not attempt to modify other skills or global agent configuration. Autonomy is enabled by default but is not an additional red flag here.

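How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install llm-evaluator-pro
Once installation completes, invoke the Skill by name or trigger it with /llm-evaluator-pro
Provide the required inputs according to the Skill's parameter documentation to get structured output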

Version History

v1.0.0

LLM-as-a-Judge evaluator via Langfuse

Metadata

Slug llm-evaluator-pro

Version 1.0.0

License —

Total installs 1

Current installs 1

Versions 1

FAQ