
LLM Evaluator Pro

by aiwithabidi · GitHub ↗ · v1.0.0
cross-platform ⚠ suspicious
Downloads: 739 · Stars: 1 · Active Installs: 1 · Versions: 1
Install in OpenClaw
/install llm-evaluator-pro
Description
LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace...
README (SKILL.md)

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance, accuracy, hallucination detection
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20

Evaluators

Evaluator     | Measures                        | Scale
relevance     | Response relevance to query     | 0–1
accuracy      | Factual correctness             | 0–1
hallucination | Made-up information detection   | 0–1
helpfulness   | Overall usefulness              | 0–1
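
Each evaluator reports a score on the 0–1 scale listed above. The following is a minimal sketch of how a judge reply could be checked against that scale. It assumes, hypothetically, that the judge returns JSON with a `score` field; the function name `parse_judge_score` and the reply format are illustrative and are not taken from evaluator.py.

```python
import json

# Evaluator names as listed in the README table.
EVALUATORS = ("relevance", "accuracy", "hallucination", "helpfulness")


def parse_judge_score(evaluator: str, reply: str) -> float:
    """Parse a (hypothetical) JSON judge reply and enforce the 0-1 scale."""
    if evaluator not in EVALUATORS:
        raise ValueError(f"unknown evaluator: {evaluator}")
    data = json.loads(reply)      # assumed reply shape: {"score": <number>}
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"{evaluator} score {score} is outside the 0-1 scale")
    return score
```

Rejecting out-of-range values, rather than clamping them, keeps a malformed judge reply from being recorded silently as a valid score.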

Credits

Built by M. Abidi | agxntsix.ai YouTube | GitHub
Part of the AgxntSix Skill Suite for OpenClaw agents.


Usage Guidance
This skill largely does what its README says, but there are several red flags to resolve before running it in a production environment:
  1. The script contains hardcoded Langfuse API keys and a hardcoded Langfuse host, and uses those values directly. That could send your trace data to an account you do not control, or let the script act under somebody else's account. Treat the embedded keys as suspicious and do not rely on them.
  2. If OPENROUTER_API_KEY is not set in the environment, the script attempts to read it from ~/.openclaw/workspace/.env, a file that may contain unrelated secrets. The skill metadata does not declare that config path.
  3. The dependencies (requests, openai, langfuse) are not declared, so running the script without knowing what will be installed is fragile.
Recommended actions before installing or using the skill:
  • Inspect evaluator.py in full, and remove or rotate any embedded keys.
  • Replace the hardcoded LF_AUTH/LF_API values with explicit env-based configuration, and make sure the host points to a Langfuse instance you control.
  • Avoid running the script as-is on systems with a sensitive ~/.openclaw/workspace/.env file; run it in an isolated test environment or container first.
  • If you need to trust this skill, ask the publisher for a version that reads credentials only from declared env vars (no defaults), documents the required Python packages, and documents exactly which endpoints will receive data.
If the publisher confirms that the embedded keys are inert placeholders and changes the code to respect environment values only, these concerns would be reduced.
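The recommendation to read credentials only from declared env vars, with no defaults, can be sketched as follows. This is an illustration under stated assumptions and not the publisher's code: the three key names come from the skill's declared env vars, while `LANGFUSE_HOST` and the function name `load_config` are assumptions made for this example.

```python
import os
from typing import Mapping

# The first three names are the skill's declared env vars.
# LANGFUSE_HOST is an assumed name for the host setting.
REQUIRED_VARS = (
    "OPENROUTER_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Return credentials from the environment only; never fall back to defaults."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing fast on a missing variable means the script can neither fall back to embedded keys nor go looking in a workspace .env file.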
Capability Analysis
Type: OpenClaw Skill
Name: llm-evaluator-pro
Version: 1.0.0
The skill is classified as suspicious because of two key vulnerabilities found in `scripts/evaluator.py`:
  1. It attempts to read `OPENROUTER_API_KEY` from `~/.openclaw/workspace/.env` if the variable is not found in the environment. This is a local file inclusion/information disclosure risk, because it grants the skill access to a potentially sensitive file within the OpenClaw workspace.
  2. It contains hardcoded Langfuse API keys (`sk-lf-115cb6b4-7153-4fe6-9255-bf28f8b115de`, `pk-lf-8a9322b9-5eb1-4e8b-815e-b3428dc69bc4`) and an internal host (`http://langfuse-web:3000`). While these are potentially overridden by environment variables, embedding them is poor security practice and a potential credential leak if the keys are sensitive and used in an unintended context.
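When inspecting a script for embedded credentials like the ones reported above, a small scan can help. The sketch below assumes Langfuse keys follow the `sk-lf-<uuid>` / `pk-lf-<uuid>` shape seen in this analysis; the function name `find_embedded_keys` is hypothetical.

```python
import re

# Assumed key shape, based on the two keys quoted in the analysis:
# "sk-lf-" or "pk-lf-" followed by a lowercase hex UUID.
KEY_PATTERN = re.compile(
    r"\b[sp]k-lf-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b"
)


def find_embedded_keys(source: str) -> list[tuple[int, str]]:
    """Return (line number, key) pairs for every Langfuse-style key in the text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in KEY_PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

To check the skill's script, pass in the file contents, for example `find_embedded_keys(pathlib.Path("scripts/evaluator.py").read_text())`. An empty result only means that no key of this exact shape was found; it does not prove the file is free of other secrets.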
Capability Assessment
Purpose & Capability
Name/description match the code: it uses OpenRouter (GPT judge) and Langfuse to score traces. Requesting OPENROUTER_API_KEY and Langfuse keys is consistent with the described function. However the code contains hardcoded Langfuse keys and host values, which undermines the declared requirement model (the skill claims to require env vars but will fall back to embedded credentials).
Instruction Scope
SKILL.md instructs the user to run the included Python script. The script, however, attempts to read ~/.openclaw/workspace/.env for the OpenRouter key (a config path not declared in metadata) and uses hardcoded Langfuse credentials and a hardcoded host to call the Langfuse API. Reading an undeclared workspace .env file can expose other secrets, and always posting scores to a hardcoded Langfuse endpoint (with embedded keys) could transmit data to an unexpected third-party account.
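One way to guard against data going to an unexpected endpoint is to check the configured host against an explicit allowlist before any request is made. The sketch below is illustrative only: the function name `is_allowed_host` and the allowlist itself are assumptions, not part of the skill.

```python
from urllib.parse import urlparse


def is_allowed_host(url: str, allowed_hosts: set[str]) -> bool:
    """Return True only for http(s) URLs whose hostname is explicitly allowed."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in allowed_hosts
```

A caller would refuse to post scores unless `is_allowed_host(host, {"langfuse.internal.example"})` is true, where the allowlisted name stands in for a Langfuse instance you control.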
Install Mechanism
There is no install spec. The skill includes a Python script but does not declare its Python package dependencies (requests, openai, langfuse). That is a coherence and usability issue (the script may fail). The lack of an install step lowers installation auditability; it is not itself malicious, but it still increases risk because it is unclear which packages users will install in order to run the script.
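Because the dependencies are undeclared, it can help to check which of the three packages named above are importable before running the script. This is a minimal sketch using the standard library; the function name `missing_packages` is hypothetical.

```python
import importlib.util

# Package names taken from the analysis above; the skill itself declares none.
REQUIRED_PACKAGES = ("requests", "openai", "langfuse")


def missing_packages(names=REQUIRED_PACKAGES) -> list[str]:
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_packages()` in the target environment lists what still has to be installed. It does not check versions, since the skill documents none.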
Credentials
Declared env vars (OPENROUTER_API_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY) are appropriate for the stated purpose. However the script: (1) sets default LANGFUSE keys in code, (2) hardcodes LF_AUTH and LF_API values rather than reading the environment, and (3) attempts to parse ~/.openclaw/workspace/.env if OPENROUTER_API_KEY is not set. These behaviors mean the skill can use embedded credentials and read an undeclared local .env file, which is disproportionate and suspicious.
Persistence & Privilege
The skill is not force-included (always=false) and does not request persistent platform privileges. It does not attempt to modify other skills or global agent configuration. Autonomy is enabled by default but is not an additional red flag here.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install llm-evaluator-pro
  3. After installation, invoke the skill by name or use /llm-evaluator-pro
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
LLM-as-a-Judge evaluator via Langfuse
Metadata
Slug llm-evaluator-pro
Version 1.0.0
License
All-time Installs 1
Active Installs 1
Total Versions 1
Frequently Asked Questions

What is LLM Evaluator Pro?

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace... It is an AI Agent Skill for Claude Code / OpenClaw, with 739 downloads so far.

How do I install LLM Evaluator Pro?

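Run "/install llm-evaluator-pro" in the OpenClaw or Claude Code chat to install it in one step. To run the evaluator script you also need the OPENROUTER_API_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY environment variables and the Python packages requests, openai, and langfuse, which the skill does not declare or install for you.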

Is LLM Evaluator Pro free?

Yes, LLM Evaluator Pro is completely free (open-source). You can download, install and use it at no cost.

Which platforms does LLM Evaluator Pro support?

LLM Evaluator Pro is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created LLM Evaluator Pro?

It is built and maintained by aiwithabidi (@aiwithabidi); the current version is v1.0.0.
