
Llm Evaluator

by aiwithabidi · GitHub ↗ · v1.0.0
cross-platform ⚠ suspicious
Downloads: 375 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install llm-evaluator
Description
LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical trac...
README (SKILL.md)

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance, accuracy, hallucination detection
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
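
The three subcommands above map naturally onto an argparse layout. The following is only a sketch of how such a command line could be structured, not the actual contents of scripts/evaluator.py: the function name build_parser, the default limit of 20, and the choice to let --evaluators accept several names are all assumptions made for illustration.

```python
import argparse

# Evaluator names taken from the skill's Evaluators table.
EVALUATORS = ["relevance", "accuracy", "hallucination", "helpfulness"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI layout mirroring the documented test/score/backfill commands."""
    parser = argparse.ArgumentParser(prog="evaluator.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("test", help="run the bundled sample cases")

    score = sub.add_parser("score", help="score one Langfuse trace")
    score.add_argument("trace_id")
    # Assumption: one or more evaluator names may follow the flag.
    score.add_argument(
        "--evaluators", nargs="+", choices=EVALUATORS, default=EVALUATORS
    )

    backfill = sub.add_parser("backfill", help="score recent unscored traces")
    # Assumption: 20 is used as the default only because the usage example shows it.
    backfill.add_argument("--limit", type=int, default=20)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["score", "abc123", "--evaluators", "relevance"])
print(args.command, args.trace_id, args.evaluators)
```

Restricting --evaluators with choices means a mistyped evaluator name fails at parse time rather than after a trace has been fetched.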

Evaluators

| Evaluator | Measures | Scale |
| relevance | Response relevance to query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Made-up information detection | 0–1 |
| helpfulness | Overall usefulness | 0–1 |
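
To make the 0–1 scale concrete, here is a minimal sketch of the two pure steps in an LLM-as-a-Judge loop: building the judge prompt and turning the judge's reply into a bounded score. The prompt wording, the JSON reply format, and the names build_judge_prompt and parse_score are assumptions for illustration; the skill's real prompts are not shown in this listing, and the listing does not say whether a high hallucination score is good or bad.

```python
import json
import re

# Descriptions copied from the Evaluators table.
EVALUATORS = {
    "relevance": "Response relevance to query",
    "accuracy": "Factual correctness",
    "hallucination": "Made-up information detection",
    "helpfulness": "Overall usefulness",
}


def build_judge_prompt(evaluator: str, query: str, response: str) -> str:
    """Build a judge prompt (hypothetical wording) for one evaluator."""
    return (
        f"You are grading an AI response on '{evaluator}': "
        f"{EVALUATORS[evaluator]}.\n"
        f"Query: {query}\n"
        f"Response: {response}\n"
        'Reply with JSON only: {"score": <number from 0 to 1>}'
    )


def parse_score(raw: str) -> float:
    """Extract the score from a judge reply and clamp it to [0, 1]."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    score = float(json.loads(match.group(0))["score"])
    return max(0.0, min(1.0, score))


print(parse_score('Here you go: {"score": 0.8}'))
```

Clamping guards against a judge model that returns a value outside the documented range, so downstream dashboards never see a score above 1 or below 0.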

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub
Part of the AgxntSix Skill Suite for OpenClaw agents.


Usage Guidance
Before installing or running this skill, inspect the included scripts (scripts/evaluator.py) yourself. Pay particular attention to:
  1. The hardcoded LANGFUSE_SECRET_KEY/LANGFUSE_PUBLIC_KEY and LANGFUSE_HOST: verify they are not production secrets and consider removing them or replacing them with environment-configured values.
  2. The code path that reads ~/.openclaw/workspace/.env to obtain an OpenRouter key: make sure you are comfortable with that file being read, or set OPENROUTER_API_KEY explicitly instead.
  3. The network endpoints (openrouter.ai and the Langfuse host): run in an isolated environment if you do not fully trust them.
If you plan to use this in production, rotate any exposed credentials, replace hardcoded keys with proper environment variables or configuration, and run the script in a sandbox while monitoring outbound network traffic. If you are unsure, ask the author to remove the embedded keys and to document all file reads and external endpoints explicitly.
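
As one way to follow the advice about environment-configured values, the Langfuse settings could be read from the environment and the script could refuse to start when any are missing. This is a sketch under assumptions, not the skill's code: the variable names come from the guidance above, while LangfuseConfig and load_langfuse_config are hypothetical names introduced here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LangfuseConfig:
    """Hypothetical container for the three Langfuse settings named above."""
    public_key: str
    secret_key: str
    host: str


def load_langfuse_config(env=None) -> LangfuseConfig:
    """Read Langfuse settings from the environment; never fall back to literals."""
    env = os.environ if env is None else env
    names = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return LangfuseConfig(
        public_key=env["LANGFUSE_PUBLIC_KEY"],
        secret_key=env["LANGFUSE_SECRET_KEY"],
        host=env["LANGFUSE_HOST"],
    )


# Usage with an explicit mapping; the values here are placeholders, not real keys.
config = load_langfuse_config(
    {
        "LANGFUSE_PUBLIC_KEY": "pk-placeholder",
        "LANGFUSE_SECRET_KEY": "sk-placeholder",
        "LANGFUSE_HOST": "https://langfuse.example.invalid",
    }
)
print(config.host)
```

Failing fast on a missing variable keeps the script from silently sending traces to a host the user never chose.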
Capability Analysis
Type: OpenClaw Skill
Name: llm-evaluator
Version: 1.0.0
The skill contains hardcoded sensitive credentials (LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY) within scripts/evaluator.py, which is a significant security vulnerability. Furthermore, the script automatically attempts to read the OPENROUTER_API_KEY from the user's workspace environment file (~/.openclaw/workspace/.env). While these behaviors are technically functional for the stated purpose of LLM evaluation, the inclusion of hardcoded secrets and the automated harvesting of API keys from the filesystem are high-risk practices.
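
A quick way to check a script for this kind of finding before running it is to scan the source for string literals assigned to secret-looking names. The sketch below is a rough heuristic under stated assumptions, not the marketplace's actual analyzer: the regular expression, the keyword list, and the name find_hardcoded_secrets are all illustrative, and the sample text stands in for a real file.

```python
import re

# Heuristic: an UPPER_CASE name containing KEY/SECRET/TOKEN/PASSWORD that is
# assigned a quoted string literal at the start of a line.
SECRET_ASSIGNMENT = re.compile(
    r"^\s*(?P<name>[A-Z0-9_]*(?:KEY|SECRET|TOKEN|PASSWORD)[A-Z0-9_]*)"
    r"\s*=\s*(?P<quote>[\"'])(?P<value>.+?)(?P=quote)",
    re.MULTILINE,
)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, variable_name) for each suspicious assignment."""
    findings = []
    for match in SECRET_ASSIGNMENT.finditer(source):
        # Count newlines before the variable name to get a 1-based line number.
        line_no = source.count("\n", 0, match.start("name")) + 1
        findings.append((line_no, match.group("name")))
    return findings


# Made-up sample source; the literal is a placeholder, not a real credential.
sample = (
    "import os\n"
    'LANGFUSE_SECRET_KEY = "sk-placeholder"\n'
    'API = os.environ["OPENROUTER_API_KEY"]\n'
)
print(find_hardcoded_secrets(sample))
```

Values read from os.environ are not flagged because the pattern only matches a quoted literal directly after the equals sign. A heuristic like this will miss obfuscated or lowercase names, so it complements a manual read of the script rather than replacing it.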
Capability Assessment
Purpose & Capability
Name/description (Langfuse + OpenRouter) matches the script's behavior: it evaluates traces and posts scores to Langfuse using an OpenRouter-backed judge model. However, the code embeds LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY values and a LANGFUSE_HOST that are not declared in requires.env or SKILL.md; shipping hardcoded service credentials is unexpected and disproportionate to the stated purpose.
Instruction Scope
SKILL.md directs running the included Python script, which is expected. The script also attempts to read a user-local file (~/.openclaw/workspace/.env) to find OPENROUTER_API_KEY if the env var isn't set; this config-file access is not declared in requires.config_paths and is an additional data access surface that users should be aware of.
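
For readers who want to see what an environment-first lookup with a file fallback looks like, here is a minimal sketch. It is not the skill's implementation: resolve_openrouter_key and parse_env_file are hypothetical names, and unlike the behavior described above, this version only reads a file when the caller passes its path explicitly.

```python
import os
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_openrouter_key(env=None, env_file=None):
    """Prefer the environment variable; read a file only if explicitly given."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY")
    if key:
        return key
    # Assumption: file access is opt-in, so nothing is read unless a path is passed.
    if env_file is not None and Path(env_file).is_file():
        return parse_env_file(Path(env_file)).get("OPENROUTER_API_KEY")
    return None


# With the variable set, no file is touched.
print(resolve_openrouter_key({"OPENROUTER_API_KEY": "placeholder-key"}))
```

Making the file path an explicit argument keeps the data access surface visible to whoever calls the function, which is the concern this section raises.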
Install Mechanism
No install spec (instruction-only with an included script). That keeps install risk low — nothing is downloaded or executed automatically beyond running the bundled Python script.
Credentials
The registry declares only OPENROUTER_API_KEY as required (which is appropriate). But the code embeds Langfuse public/secret keys and a Langfuse host URL; these are effectively credentials baked into the skill rather than requested from the environment. Also the script will make network calls to Langfuse and OpenRouter, which is expected but worth noting.
Persistence & Privilege
The skill does not request always:true and does not request system-wide persistence. It runs network operations and writes scores to Langfuse, which is consistent with its purpose and not an unusual privilege level.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install llm-evaluator
  3. After installation, invoke the skill by name or use /llm-evaluator
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
  • Initial release of the llm-evaluator skill.
  • Provides an LLM-as-a-Judge system for evaluating AI outputs using relevance, accuracy, hallucination, and helpfulness scores.
  • Integrates with Langfuse and uses GPT-5-nano for efficient automated judging.
  • Enables batch backfill scoring for historical traces and real-time evaluation of outputs.
  • Command-line interface for testing, scoring specific traces, and running backfills.
Metadata
Slug llm-evaluator
Version 1.0.0
License
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Llm Evaluator?

LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical trac... It is an AI Agent Skill for Claude Code / OpenClaw, with 375 downloads so far.

How do I install Llm Evaluator?

Run "/install llm-evaluator" in the OpenClaw or Claude Code chat to install it in one step. The skill declares OPENROUTER_API_KEY as a required environment variable, so set it before running the evaluator.

Is Llm Evaluator free?

Yes, Llm Evaluator is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Llm Evaluator support?

Llm Evaluator is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Llm Evaluator?

It is built and maintained by aiwithabidi (@aiwithabidi); the current version is v1.0.0.
