Skylv Prompt Evaluation

Name: Skylv Prompt Evaluation
Author: sky-lv

Author: SKY-lv · GitHub · v1.0.0 · MIT-0

cross-platform ✓ Security check passed

Install in OpenClaw

/install skylv-prompt-evaluation

Description

Evaluate and benchmark AI prompts for quality, consistency, and performance. Triggers: prompt evaluation, prompt testing, prompt quality, prompt benchmark, p...

Usage (SKILL.md)

Prompt Evaluation

Evaluate and benchmark AI prompts for quality, consistency, and performance. Score, compare, and optimize your prompts systematically.

Overview

A prompt evaluation framework that helps agents measure prompt quality across multiple dimensions: clarity, specificity, robustness, cost-efficiency, and output consistency. Compare prompt variants and find the optimal version.

Capabilities

1. Quality Scoring

node evaluate.js score --prompt "Summarize the article" --dimensions clarity,specificity,robustness
node evaluate.js score --prompt-file ./prompts/ --output scores.json

Scores each prompt from 0 to 10 on four dimensions: clarity, specificity, robustness, and cost-efficiency.

2. A/B Comparison

node evaluate.js compare --prompt-a "Summarize" --prompt-b "Write a 3-bullet summary" --trials 50
node evaluate.js compare --config ab-test-config.json

Runs statistical A/B tests between prompt variants and reports whether the difference is statistically significant.
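The listing does not say which significance test the skill uses, and `evaluate.js` itself was not available for review. As a rough illustration only, the sketch below assumes each trial is graded pass or fail and applies a two-sided two-proportion z-test. The function names `normalCdf` and `compareVariants` and the pass counts are hypothetical.

```javascript
// Hypothetical sketch, not the skill's evaluate.js.

// Standard normal CDF using the Abramowitz-Stegun 7.1.26 approximation of erf.
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
      t * (-0.284496736 +
        t * (1.421413741 +
          t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided two-proportion z-test on pass counts for variants A and B,
// assuming both variants ran the same number of trials.
function compareVariants(passesA, passesB, trials, significanceLevel = 0.05) {
  const rateA = passesA / trials;
  const rateB = passesB / trials;
  const pooled = (passesA + passesB) / (2 * trials);
  const stdErr = Math.sqrt(pooled * (1 - pooled) * (2 / trials));
  if (stdErr === 0) {
    // Both variants passed (or failed) every trial: no detectable difference.
    return { rateA, rateB, z: 0, pValue: 1, significant: false };
  }
  const z = (rateB - rateA) / stdErr;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { rateA, rateB, z, pValue, significant: pValue < significanceLevel };
}

// Made-up counts: A passes 30 of 50 trials, B passes 42 of 50.
console.log(compareVariants(30, 42, 50));
```

The default `significanceLevel` of 0.05 mirrors the value in the sample configuration. With continuous quality scores instead of pass/fail grades, a t-test would likely be a better fit.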

3. Consistency Check

node evaluate.js consistency --prompt "Translate to French" --runs 100 --variance-threshold 0.15
node evaluate.js consistency --temperature 0.7 --top-p 0.9

Measures output consistency across multiple runs to find the most stable prompts.
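How the skill defines variance is not stated in the listing. One plausible reading, sketched below under that assumption, is that each run is reduced to a numeric score (for example, similarity to a reference output on a 0 to 1 scale) and the population variance of those scores is compared against `--variance-threshold`. The helper names are hypothetical.

```javascript
// Hypothetical sketch, not the skill's evaluate.js.

// Population variance of a list of per-run scores.
function variance(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
}

// A prompt counts as stable when the variance of its run scores
// does not exceed the threshold.
function checkConsistency(runScores, varianceThreshold = 0.15) {
  if (runScores.length === 0) {
    throw new Error("at least one run is required");
  }
  const v = variance(runScores);
  return { variance: v, stable: v <= varianceThreshold };
}

// Made-up scores from five runs of the same prompt.
console.log(checkConsistency([0.92, 0.88, 0.95, 0.9, 0.91]));
```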

4. Regression Testing

node evaluate.js regression --baseline v1.0 --current v1.1 --test-suite golden-set.jsonl
node evaluate.js regression --fail-on-degradation 5%

Detects quality regressions between prompt versions using golden test sets.
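The listing does not specify whether the `5%` threshold is relative or absolute. The sketch below assumes a relative drop in the mean score over the golden set, with per-case scores already computed for both versions. All function names are hypothetical.

```javascript
// Hypothetical sketch, not the skill's evaluate.js.

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Parse a threshold such as "5%" into a fraction (0.05).
function parsePercent(text) {
  const match = /^(\d+(?:\.\d+)?)%$/.exec(text.trim());
  if (!match) throw new Error(`invalid percentage: ${text}`);
  return Number(match[1]) / 100;
}

// Compare mean scores on the same golden set for two prompt versions.
function checkRegression(baselineScores, currentScores, threshold = "5%") {
  const baselineMean = mean(baselineScores);
  const currentMean = mean(currentScores);
  if (!(baselineMean > 0)) {
    throw new Error("baseline mean must be positive");
  }
  // Relative drop from the baseline; a negative value means an improvement.
  const degradation = (baselineMean - currentMean) / baselineMean;
  return {
    baselineMean,
    currentMean,
    degradation,
    failed: degradation > parsePercent(threshold),
  };
}

// Made-up per-case scores for a baseline and a current prompt version.
console.log(checkRegression([8, 9, 7, 8], [8, 8, 7, 7]));
```

In a CI setting, a caller could exit with a non-zero status when `failed` is true, which is presumably what a flag like `--fail-on-degradation` is meant to do.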

5. Cost Analysis

node evaluate.js cost --prompt "Long prompt..." --model gpt-4 --estimate-tokens
node evaluate.js cost --compare-prompts --output cost-report.csv

Estimates token usage and costs for different prompt variants and models.
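A rough cost estimate can be sketched without a tokenizer, assuming the common rule of thumb of about four characters per token for English text. Real tokenizers differ by model and language, so treat the result as an approximation. The function names and the `pricePer1kTokens` parameter are hypothetical; no actual model prices are assumed here.

```javascript
// Hypothetical sketch, not the skill's evaluate.js.

// Heuristic token count: roughly 4 characters per token for English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Estimated input cost for one prompt at a caller-supplied price.
function estimateCost(prompt, pricePer1kTokens) {
  const tokens = estimateTokens(prompt);
  return { tokens, cost: (tokens / 1000) * pricePer1kTokens };
}

// Rank prompt variants from cheapest to most expensive.
function compareCosts(prompts, pricePer1kTokens) {
  return prompts
    .map((prompt) => ({ prompt, ...estimateCost(prompt, pricePer1kTokens) }))
    .sort((a, b) => a.tokens - b.tokens);
}

// The price below is a placeholder, not a real rate.
console.log(compareCosts(["Write a 3-bullet summary", "Summarize"], 0.01));
```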

Configuration

{
  "evaluation": {
    "dimensions": ["clarity", "specificity", "robustness", "cost"],
    "scoringModel": "gpt-4",
    "abTest": {
      "trials": 50,
      "significanceLevel": 0.05
    },
    "consistency": {
      "runs": 100,
      "varianceThreshold": 0.15
    },
    "regression": {
      "degradationThreshold": "5%",
      "goldenSet": "./golden-set.jsonl"
    }
  }
}
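Since the package ships no loader for this configuration, the sketch below shows one way a caller might sanity-check it before use. The accepted dimension names are taken from the sample above, and the range checks are assumptions rather than documented rules. `validateConfig` is a hypothetical helper.

```javascript
// Hypothetical sketch, not part of the skill package.

// Dimension names as they appear in the sample configuration.
const KNOWN_DIMENSIONS = ["clarity", "specificity", "robustness", "cost"];

// Return a list of human-readable problems; an empty list means no issues found.
function validateConfig(config) {
  const errors = [];
  const evaluation = config.evaluation ?? {};

  for (const dimension of evaluation.dimensions ?? []) {
    if (!KNOWN_DIMENSIONS.includes(dimension)) {
      errors.push(`unknown dimension: ${dimension}`);
    }
  }

  const level = evaluation.abTest?.significanceLevel;
  if (level !== undefined && !(level > 0 && level < 1)) {
    errors.push("abTest.significanceLevel must be between 0 and 1");
  }

  const trials = evaluation.abTest?.trials;
  if (trials !== undefined && !(Number.isInteger(trials) && trials > 0)) {
    errors.push("abTest.trials must be a positive integer");
  }

  const runs = evaluation.consistency?.runs;
  if (runs !== undefined && !(Number.isInteger(runs) && runs > 0)) {
    errors.push("consistency.runs must be a positive integer");
  }

  return errors;
}

const sample = {
  evaluation: {
    dimensions: ["clarity", "specificity", "robustness", "cost"],
    abTest: { trials: 50, significanceLevel: 0.05 },
    consistency: { runs: 100, varianceThreshold: 0.15 },
  },
};
console.log(validateConfig(sample));
```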

Use Cases

Prompt Engineering: Systematically improve prompt quality
Quality Assurance: Ensure prompts meet quality standards before production
Cost Optimization: Find prompts that achieve goals with fewer tokens
Version Control: Track prompt quality across versions
Agent Tuning: Optimize agent system prompts for consistency

Safe Usage Advice

This skill appears benign as an instruction-only prompt-evaluation description. Before installing or using it, be aware that the reviewed package does not include the `evaluate.js` code shown in examples, and do not run any external evaluator or submit confidential prompts until you know exactly what code and model provider will process them.

Functional Analysis

Type: OpenClaw Skill
Name: skylv-prompt-evaluation
Version: 1.0.0

The skill bundle describes a prompt evaluation framework for benchmarking AI prompts. The documentation in SKILL.md and metadata in _meta.json outline standard CLI-based functionalities such as quality scoring, A/B testing, and cost analysis. There are no indicators of malicious intent, prompt injection attacks, or suspicious behaviors in the provided files.

Capability Assessment

ℹ Purpose & Capability

The stated purpose—evaluating, comparing, and optimizing prompts—is coherent and low-risk, but the advertised CLI capabilities are not verifiable because the package contains only SKILL.md.

✓ Instruction Scope

The instructions are user-directed examples for scoring, comparison, consistency, regression, and cost analysis; they do not override user intent, demand hidden behavior, or encourage destructive actions.

ℹ Install Mechanism

There is no install spec and no code, yet the documentation shows commands such as `node evaluate.js`; users would need a separate implementation that was not included in the reviewed artifacts.

ℹ Credentials

The examples read prompt files and golden test sets and write reports, which is proportionate for prompt evaluation, but users should avoid running it on confidential prompts unless they understand where evaluations are processed.

✓ Persistence & Privilege

The artifacts declare no credentials, required environment variables, background services, persistent memory, privileged paths, or autonomous long-running behavior.

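How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install skylv-prompt-evaluation
Once installation finishes, invoke the Skill by name or trigger it with /skylv-prompt-evaluation
Provide the required inputs as described in the Skill's parameter documentation to get structured output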

Version History

v1.0.0

Initial release of the prompt-evaluation skill.
- Evaluate and benchmark AI prompts for clarity, specificity, robustness, and cost-efficiency.
- Score prompts, compare variants with A/B tests, and measure output consistency.
- Run regression testing to detect quality changes across prompt versions.
- Estimate and compare token usage and cost for different prompts and models.
- Designed for prompt engineering, quality assurance, and cost optimization.

Metadata

Slug skylv-prompt-evaluation

Version 1.0.0

License MIT-0

Total installs 1

Current installs 1

Number of versions 1

FAQ