/install skylv-prompt-evaluation
Prompt Evaluation
Evaluate and benchmark AI prompts for quality, consistency, and performance. Score, compare, and optimize your prompts systematically.
Overview
A prompt evaluation framework that helps agents measure prompt quality across multiple dimensions: clarity, specificity, robustness, cost-efficiency, and output consistency. Compare prompt variants and find the optimal version.
Capabilities
1. Quality Scoring
node evaluate.js score --prompt "Summarize the article" --dimensions clarity,specificity,robustness
node evaluate.js score --prompt-file ./prompts/ --output scores.json
Scores prompts on clarity (0-10), specificity (0-10), robustness (0-10), and cost-efficiency (0-10).
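The per-dimension scores can be folded into one summary number when you need to rank prompts. The sketch below is not taken from evaluate.js: it assumes each dimension has already been scored on the 0-10 scale above, and the `overallScore` helper and its weights are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: combines per-dimension scores (0-10) into a weighted mean.
// The function name and the weighting scheme are assumptions, not part of evaluate.js.
function overallScore(scores, weights = {}) {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const [dimension, score] of Object.entries(scores)) {
    if (!Number.isFinite(score) || score < 0 || score > 10) {
      throw new RangeError(`Score for "${dimension}" must be a number between 0 and 10`);
    }
    const weight = weights[dimension] ?? 1; // dimensions without a weight count once
    weightedSum += score * weight;
    totalWeight += weight;
  }
  if (totalWeight <= 0) {
    throw new Error("At least one dimension with a positive weight is required");
  }
  return weightedSum / totalWeight;
}

// Example: robustness counts twice as much as the other dimensions.
const weighted = overallScore(
  { clarity: 8, specificity: 6, robustness: 7 },
  { robustness: 2 }
);
console.log(weighted.toFixed(2)); // 7.00
```

A weighted mean keeps the result on the same 0-10 scale as its inputs, which makes it easy to compare against the individual dimensions.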
2. A/B Comparison
node evaluate.js compare --prompt-a "Summarize" --prompt-b "Write a 3-bullet summary" --trials 50
node evaluate.js compare --config ab-test-config.json
Runs statistical A/B tests between prompt variants and reports whether the difference between them is significant.
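The source does not say which statistical test the compare command uses. As a rough illustration of significance analysis, the sketch below assumes each trial is judged pass or fail and applies a two-sided two-proportion z-test. The function names are hypothetical, and the error function uses the Abramowitz-Stegun approximation (formula 7.1.26).

```javascript
// Abramowitz-Stegun approximation (7.1.26) of the error function.
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Hypothetical helper: two-sided two-proportion z-test on pass counts.
// Assumes every trial of each prompt variant was labeled pass or fail.
function compareSuccessRates(successesA, trialsA, successesB, trialsB, significanceLevel = 0.05) {
  const rateA = successesA / trialsA;
  const rateB = successesB / trialsB;
  const pooled = (successesA + successesB) / (trialsA + trialsB);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / trialsA + 1 / trialsB));
  if (standardError === 0) {
    // Both variants always pass or always fail: no evidence of a difference.
    return { rateA, rateB, z: 0, pValue: 1, significant: false };
  }
  const z = (rateB - rateA) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { rateA, rateB, z, pValue, significant: pValue < significanceLevel };
}

// Example: 50 trials per variant, as in the command above.
const abResult = compareSuccessRates(30, 50, 42, 50);
console.log(abResult.significant, abResult.pValue.toFixed(4));
```

If outputs are scored on a continuous scale instead of pass/fail, a t-test on the per-trial scores would be the more natural choice.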
3. Consistency Check
node evaluate.js consistency --prompt "Translate to French" --runs 100 --variance-threshold 0.15
node evaluate.js consistency --temperature 0.7 --top-p 0.9
Measures output consistency across multiple runs to find the most stable prompts.
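How the variance figure is computed is not documented here. One plausible reading, sketched below as an assumption, treats it as the mean pairwise dissimilarity between outputs, using Jaccard similarity over word sets. The helper names `outputVariability` and `isConsistent` are hypothetical.

```javascript
// Splits text into a set of lowercase word tokens (letters and digits only).
function tokenSet(text) {
  return new Set(text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? []);
}

// Jaccard similarity: shared tokens divided by all distinct tokens.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 1; // two empty outputs are identical
  let shared = 0;
  for (const token of a) {
    if (b.has(token)) shared += 1;
  }
  return shared / (a.size + b.size - shared);
}

// Hypothetical metric: mean pairwise dissimilarity across all runs.
// 0 means every output used the same words, 1 means no words were shared.
function outputVariability(outputs) {
  if (outputs.length < 2) {
    throw new Error("Need at least two outputs to measure consistency");
  }
  const sets = outputs.map(tokenSet);
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < sets.length; i += 1) {
    for (let j = i + 1; j < sets.length; j += 1) {
      total += 1 - jaccard(sets[i], sets[j]);
      pairs += 1;
    }
  }
  return total / pairs;
}

// Assumed interpretation of --variance-threshold: pass when variability stays at or below it.
function isConsistent(outputs, varianceThreshold = 0.15) {
  return outputVariability(outputs) <= varianceThreshold;
}
```

Word-set overlap ignores word order and meaning, so it is only a cheap proxy. Embedding-based similarity would capture paraphrases better at a higher cost per run.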
4. Regression Testing
node evaluate.js regression --baseline v1.0 --current v1.1 --test-suite golden-set.jsonl
node evaluate.js regression --fail-on-degradation 5%
Detects quality regressions between prompt versions using golden test sets.
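The degradation check can be pictured as a comparison of average scores on the golden set. The sketch below is an assumption about how such a gate could work, not the tool's documented behavior: it treats the threshold as a relative drop in mean score, and `checkRegression` is a hypothetical name.

```javascript
// Converts a percentage string such as "5%" into a fraction (0.05).
function parsePercent(value) {
  const match = /^(\d+(?:\.\d+)?)%$/.exec(String(value).trim());
  if (!match) {
    throw new Error(`Expected a percentage such as "5%", got "${value}"`);
  }
  return Number(match[1]) / 100;
}

function mean(values) {
  if (values.length === 0) throw new Error("Cannot average an empty list");
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Hypothetical gate: fails when the current version's mean score drops by more
// than the threshold relative to the baseline (assumed interpretation of "5%").
function checkRegression(baselineScores, currentScores, threshold = "5%") {
  const baselineMean = mean(baselineScores);
  const currentMean = mean(currentScores);
  const degradation =
    baselineMean === 0 ? 0 : (baselineMean - currentMean) / baselineMean;
  return {
    baselineMean,
    currentMean,
    degradation,
    failed: degradation > parsePercent(threshold),
  };
}

// Example: scores for the same four golden-set cases under v1.0 and v1.1.
const gate = checkRegression([8, 8, 8, 8], [8, 7, 8, 7]);
console.log(gate.failed, gate.degradation); // true 0.0625
```

In a CI pipeline, a failed gate would typically translate into a non-zero exit code so the build stops before the new prompt version ships.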
5. Cost Analysis
node evaluate.js cost --prompt "Long prompt..." --model gpt-4 --estimate-tokens
node evaluate.js cost --compare-prompts --output cost-report.csv
Estimates token usage and costs for different prompt variants and models.
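For a back-of-the-envelope version of this estimate, the sketch below assumes roughly 4 characters per token, a common rule of thumb for English text that real tokenizers will deviate from. The price is a caller-supplied placeholder, not an actual rate for any model, and all function names are hypothetical.

```javascript
// Rough token estimate. Assumes about 4 characters per token; an actual
// tokenizer for the target model will give different counts.
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

// Estimates the input cost of one prompt. pricePerThousandTokens is supplied
// by the caller; look up the current rate for your model before relying on it.
function estimateCost(prompt, pricePerThousandTokens) {
  const tokens = estimateTokens(prompt);
  return { tokens, cost: (tokens / 1000) * pricePerThousandTokens };
}

// Ranks prompt variants from cheapest to most expensive.
function compareCosts(prompts, pricePerThousandTokens) {
  return prompts
    .map((prompt) => ({ prompt, ...estimateCost(prompt, pricePerThousandTokens) }))
    .sort((a, b) => a.tokens - b.tokens);
}

// Example with a placeholder price of 1 currency unit per 1,000 tokens.
for (const row of compareCosts(["Write a 3-bullet summary", "Summarize"], 1)) {
  console.log(row.prompt, row.tokens, row.cost);
}
```

This covers prompt (input) tokens only. Output tokens usually add to the bill and depend on how long the model's responses are.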
Configuration
{
  "evaluation": {
    "dimensions": ["clarity", "specificity", "robustness", "cost"],
    "scoringModel": "gpt-4",
    "abTest": {
      "trials": 50,
      "significanceLevel": 0.05
    },
    "consistency": {
      "runs": 100,
      "varianceThreshold": 0.15
    },
    "regression": {
      "degradationThreshold": "5%",
      "goldenSet": "./golden-set.jsonl"
    }
  }
}
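A configuration like this is easy to mistype, so it can help to sanity-check it before a long evaluation run. The validator below is a sketch under stated assumptions: the field names come from the configuration above, but the accepted ranges (positive integer counts, a significance level strictly between 0 and 1, a percentage string for the threshold) are guesses at sensible values, not rules documented by the tool.

```javascript
function isPositiveInteger(value) {
  return Number.isInteger(value) && value > 0;
}

// Hypothetical validator: returns a list of human-readable problems,
// or an empty list when the config passes every assumed check.
function validateConfig(config) {
  const evaluation = config?.evaluation;
  if (typeof evaluation !== "object" || evaluation === null) {
    return ['Missing top-level "evaluation" object'];
  }
  const errors = [];
  if (!Array.isArray(evaluation.dimensions) || evaluation.dimensions.length === 0) {
    errors.push("evaluation.dimensions must be a non-empty array");
  }
  const abTest = evaluation.abTest ?? {};
  const consistency = evaluation.consistency ?? {};
  const regression = evaluation.regression ?? {};
  if ("trials" in abTest && !isPositiveInteger(abTest.trials)) {
    errors.push("abTest.trials must be a positive integer");
  }
  if ("significanceLevel" in abTest &&
      !(abTest.significanceLevel > 0 && abTest.significanceLevel < 1)) {
    errors.push("abTest.significanceLevel must be between 0 and 1");
  }
  if ("runs" in consistency && !isPositiveInteger(consistency.runs)) {
    errors.push("consistency.runs must be a positive integer");
  }
  if ("varianceThreshold" in consistency && !(consistency.varianceThreshold >= 0)) {
    errors.push("consistency.varianceThreshold must be zero or greater");
  }
  if ("degradationThreshold" in regression &&
      !/^\d+(\.\d+)?%$/.test(String(regression.degradationThreshold))) {
    errors.push('regression.degradationThreshold must look like "5%"');
  }
  return errors;
}
```

Collecting all problems in a list, instead of throwing on the first one, lets you fix a broken file in a single pass.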
Use Cases
- Prompt Engineering: Systematically improve prompt quality
- Quality Assurance: Ensure prompts meet quality standards before production
- Cost Optimization: Find prompts that achieve goals with fewer tokens
- Version Control: Track prompt quality across versions
- Agent Tuning: Optimize agent system prompts for consistency
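To install and use the skill:
- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: /install skylv-prompt-evaluation
- Once installation finishes, invoke the Skill by name or trigger it with /skylv-prompt-evaluation
- Provide the required inputs as described in the Skill's parameter documentation to get structured output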
What is Skylv Prompt Evaluation?
Evaluate and benchmark AI prompts for quality, consistency, and performance. Triggers: prompt evaluation, prompt testing, prompt quality, prompt benchmark, p... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 47 downloads to date.
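How do I install Skylv Prompt Evaluation?
Run the command "/install skylv-prompt-evaluation" in the OpenClaw or Claude Code chat box for a one-step install. No additional configuration is required.
Is Skylv Prompt Evaluation free?
Yes. Skylv Prompt Evaluation is completely free and released under the MIT-0 license, so you can download, install, and use it freely.
Which platforms does Skylv Prompt Evaluation support?
Skylv Prompt Evaluation is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed Skylv Prompt Evaluation?
It is developed and maintained by SKY-lv (@sky-lv). The current version is v1.0.0.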