/install skylv-prompt-evaluation
Prompt Evaluation
Evaluate and benchmark AI prompts for quality, consistency, and performance. Score, compare, and optimize your prompts systematically.
Overview
Skylv Prompt Evaluation is a framework that helps agents measure prompt quality across five dimensions: clarity, specificity, robustness, cost-efficiency, and output consistency. It also compares prompt variants to find the best-performing version.
Capabilities
1. Quality Scoring
node evaluate.js score --prompt "Summarize the article" --dimensions clarity,specificity,robustness
node evaluate.js score --prompt-file ./prompts/ --output scores.json
Scores each prompt from 0 to 10 on clarity, specificity, robustness, and cost-efficiency.
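One way to work with the per-dimension scores is to fold them into a single number. The sketch below is an assumption for illustration only: the `overallScore` helper and its weights are hypothetical and are not part of evaluate.js. It shows a weighted mean over 0-10 dimension scores.

```javascript
// Hypothetical helper: combines per-dimension scores (each 0-10) into one
// weighted mean. Neither the function nor the weights come from evaluate.js.
function overallScore(scores, weights = {}) {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const [dimension, score] of Object.entries(scores)) {
    if (!Number.isFinite(score) || score < 0 || score > 10) {
      throw new RangeError(`Score for "${dimension}" must be between 0 and 10`);
    }
    const weight = weights[dimension] ?? 1; // unlisted dimensions weigh 1
    weightedSum += score * weight;
    totalWeight += weight;
  }
  if (totalWeight === 0) throw new Error("No scores to combine");
  return weightedSum / totalWeight;
}

// Example: robustness counts double.
const overallExample = overallScore(
  { clarity: 8, specificity: 6, robustness: 7, "cost-efficiency": 9 },
  { robustness: 2 }
);
console.log(overallExample.toFixed(2)); // prints "7.40"
```

A single number is convenient for ranking variants, but it hides which dimension is weak, so keeping the per-dimension scores alongside it is usually worthwhile.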
2. A/B Comparison
node evaluate.js compare --prompt-a "Summarize" --prompt-b "Write a 3-bullet summary" --trials 50
node evaluate.js compare --config ab-test-config.json
Runs statistical A/B tests between prompt variants and reports whether the difference is statistically significant.
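The source does not say which statistical test the compare command uses. As a minimal sketch of what a significance analysis can look like, the code below runs a two-sided two-proportion z-test. It assumes each trial is reduced to a pass/fail outcome, which is an assumption made here; `twoProportionZTest` and `normalCdf` are hypothetical names.

```javascript
// Standard normal CDF via the Abramowitz-Stegun approximation of erf (7.1.26).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided two-proportion z-test. "Success" is an assumed pass/fail check
// per trial (for example, the output matched an expected format).
function twoProportionZTest(successA, trialsA, successB, trialsB, alpha = 0.05) {
  const rateA = successA / trialsA;
  const rateB = successB / trialsB;
  const pooled = (successA + successB) / (trialsA + trialsB);
  const stdErr = Math.sqrt(pooled * (1 - pooled) * (1 / trialsA + 1 / trialsB));
  if (stdErr === 0) {
    // Both variants always pass or always fail: no measurable difference.
    return { rateA, rateB, z: 0, pValue: 1, significant: false };
  }
  const z = (rateA - rateB) / stdErr;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { rateA, rateB, z, pValue, significant: pValue < alpha };
}

// 30 of 50 passes for prompt A versus 45 of 50 for prompt B.
const abOutcome = twoProportionZTest(30, 50, 45, 50);
console.log(abOutcome.z.toFixed(2), abOutcome.significant); // prints "-3.46 true"
```

The default `alpha` of 0.05 mirrors the `significanceLevel` value in the configuration example. With only 50 trials per variant, small differences in pass rate will rarely reach significance.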
3. Consistency Check
node evaluate.js consistency --prompt "Translate to French" --runs 100 --variance-threshold 0.15
node evaluate.js consistency --temperature 0.7 --top-p 0.9
Measures output consistency across multiple runs to find the most stable prompts.
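How the tool computes variance is not documented here. One plausible stand-in, sketched below under that caveat, is the mean pairwise Jaccard distance between the token sets of the outputs, compared against the threshold. The metric and the names `jaccardDistance` and `checkConsistency` are assumptions, not the tool's actual method.

```javascript
// Lowercased whitespace tokens as a set; a deliberately crude tokenizer.
function tokenSet(text) {
  return new Set(text.toLowerCase().split(/\s+/).filter(Boolean));
}

// Jaccard distance: 0 for identical token sets, 1 for disjoint ones.
function jaccardDistance(a, b) {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  for (const token of setA) if (setB.has(token)) shared += 1;
  return 1 - shared / (setA.size + setB.size - shared);
}

// Mean distance over all pairs of outputs, compared against a threshold
// (0.15 mirrors the --variance-threshold example; the metric is assumed).
function checkConsistency(outputs, threshold = 0.15) {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < outputs.length; i++) {
    for (let j = i + 1; j < outputs.length; j++) {
      total += jaccardDistance(outputs[i], outputs[j]);
      pairs += 1;
    }
  }
  const meanDistance = pairs === 0 ? 0 : total / pairs;
  return { meanDistance, stable: meanDistance <= threshold };
}

// Two identical outputs and one variant: mean distance is 1/3, so not stable.
const sampleRuns = ["Bonjour le monde", "Bonjour le monde", "Salut le monde"];
console.log(checkConsistency(sampleRuns));
```

Token overlap ignores word order and meaning, so two paraphrases can look unstable. An embedding-based similarity would be a natural replacement when semantic consistency matters more than surface form.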
4. Regression Testing
node evaluate.js regression --baseline v1.0 --current v1.1 --test-suite golden-set.jsonl
node evaluate.js regression --fail-on-degradation 5%
Detects quality regressions between prompt versions using golden test sets.
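The regression check can be pictured as comparing mean scores over the golden set and failing when the relative drop exceeds the threshold. The sketch below assumes higher scores are better and that each JSONL record carries `baseline` and `current` scores. Those field names, and the helper names, are hypothetical, since the real golden-set schema is not shown here.

```javascript
// Parses JSONL text: one JSON object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Compares mean scores and flags a regression when the relative drop exceeds
// the threshold. Accepts "5%" or 5; scores are assumed higher-is-better.
function checkRegression(baselineScores, currentScores, threshold = "5%") {
  const limit = parseFloat(threshold);
  const baselineMean = mean(baselineScores);
  const currentMean = mean(currentScores);
  const degradationPct = ((baselineMean - currentMean) / baselineMean) * 100;
  return { baselineMean, currentMean, degradationPct, failed: degradationPct > limit };
}

// Hypothetical golden-set records: the field names are assumptions.
const goldenSet = parseJsonl(
  '{"id": 1, "baseline": 8.0, "current": 7.0}\n' +
  '{"id": 2, "baseline": 9.0, "current": 8.5}\n'
);
const regressionReport = checkRegression(
  goldenSet.map((r) => r.baseline),
  goldenSet.map((r) => r.current)
);
console.log(regressionReport.degradationPct.toFixed(1), regressionReport.failed); // prints "8.8 true"
```

In a CI job, a `failed` result would typically be turned into a non-zero exit code so that a degraded prompt version blocks the merge.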
5. Cost Analysis
node evaluate.js cost --prompt "Long prompt..." --model gpt-4 --estimate-tokens
node evaluate.js cost --compare-prompts --output cost-report.csv
Estimates token usage and costs for different prompt variants and models.
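For a rough feel of how token and cost estimates relate, the sketch below uses the common rule of thumb of about four characters per token for English text. It is only an approximation: a real tokenizer for the target model will give different counts, the price is a caller-supplied placeholder rather than a real rate, and all function names are hypothetical.

```javascript
// Rough token estimate: about 4 characters per token for English text.
// A real tokenizer for the target model will give different counts.
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

// pricePer1kTokens is a placeholder supplied by the caller, not a real rate.
function estimateCost(prompt, pricePer1kTokens) {
  const tokens = estimateTokens(prompt);
  return { tokens, cost: (tokens / 1000) * pricePer1kTokens };
}

// Compare variants and sort cheapest first.
function compareCosts(prompts, pricePer1kTokens) {
  return prompts
    .map((prompt) => ({ prompt, ...estimateCost(prompt, pricePer1kTokens) }))
    .sort((a, b) => a.tokens - b.tokens);
}

const rankedByCost = compareCosts(["Write a 3-bullet summary", "Summarize"], 0.01);
console.log(rankedByCost.map((r) => `${r.tokens} tokens: ${r.prompt}`));
```

This only covers the prompt side. Output tokens often dominate the bill, so a fuller estimate would also account for expected response length.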
Configuration
{
  "evaluation": {
    "dimensions": ["clarity", "specificity", "robustness", "cost"],
    "scoringModel": "gpt-4",
    "abTest": {
      "trials": 50,
      "significanceLevel": 0.05
    },
    "consistency": {
      "runs": 100,
      "varianceThreshold": 0.15
    },
    "regression": {
      "degradationThreshold": "5%",
      "goldenSet": "./golden-set.jsonl"
    }
  }
}
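Before handing a configuration like this to the tool, it can help to sanity-check the values. The sketch below is a hypothetical validator. The ranges it enforces are assumptions about what seems sensible, not rules documented by the skill.

```javascript
// Returns a list of problems found in a config shaped like the one above.
// The checks are assumptions about sensible ranges, not rules from the tool.
function validateConfig(config) {
  const errors = [];
  const ev = config?.evaluation;
  if (!ev) return ['missing "evaluation" section'];

  if (!Array.isArray(ev.dimensions) || ev.dimensions.length === 0) {
    errors.push("dimensions must be a non-empty array");
  }
  const trials = ev.abTest?.trials;
  if (!Number.isInteger(trials) || trials <= 0) {
    errors.push("abTest.trials must be a positive integer");
  }
  const alpha = ev.abTest?.significanceLevel;
  if (!(typeof alpha === "number" && alpha > 0 && alpha < 1)) {
    errors.push("abTest.significanceLevel must be between 0 and 1");
  }
  const runs = ev.consistency?.runs;
  if (!Number.isInteger(runs) || runs <= 0) {
    errors.push("consistency.runs must be a positive integer");
  }
  const threshold = ev.regression?.degradationThreshold;
  if (typeof threshold !== "string" || !/^\d+(\.\d+)?%$/.test(threshold)) {
    errors.push('regression.degradationThreshold must look like "5%"');
  }
  return errors;
}

const sampleConfig = {
  evaluation: {
    dimensions: ["clarity", "specificity", "robustness", "cost"],
    scoringModel: "gpt-4",
    abTest: { trials: 50, significanceLevel: 0.05 },
    consistency: { runs: 100, varianceThreshold: 0.15 },
    regression: { degradationThreshold: "5%", goldenSet: "./golden-set.jsonl" },
  },
};
console.log(validateConfig(sampleConfig)); // prints []
```

Returning a list of messages instead of throwing on the first problem lets a caller report every issue in one pass.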
Use Cases
- Prompt Engineering: Systematically improve prompt quality
- Quality Assurance: Ensure prompts meet quality standards before production
- Cost Optimization: Find prompts that achieve goals with fewer tokens
- Version Control: Track prompt quality across versions
- Agent Tuning: Optimize agent system prompts for consistency
Installation
- Make sure OpenClaw is installed (local or Docker).
- Run the install command in chat: /install skylv-prompt-evaluation
- After installation, invoke the skill by name or use /skylv-prompt-evaluation
- Provide the required inputs per the skill's parameter spec and get structured output.
What is Skylv Prompt Evaluation?
Skylv Prompt Evaluation is an AI agent skill for Claude Code / OpenClaw that evaluates and benchmarks AI prompts for quality, consistency, and performance. Triggers: prompt evaluation, prompt testing, prompt quality, prompt benchmark, p... It has 47 downloads so far.
How do I install Skylv Prompt Evaluation?
Run "/install skylv-prompt-evaluation" in the OpenClaw or Claude Code chat to install it in one step. No extra setup is required.
Is Skylv Prompt Evaluation free?
Yes, Skylv Prompt Evaluation is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Skylv Prompt Evaluation support?
Skylv Prompt Evaluation is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Skylv Prompt Evaluation?
It is built and maintained by SKY-lv (@sky-lv); the current version is v1.0.0.