← 返回 Skills 市场
1970
总下载
1
收藏
0
当前安装
1
版本数
在 OpenClaw 中安装
/install azure-ai-evaluation-py
功能描述
Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, and custom evaluators.
Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics".
使用说明 (SKILL.md)
Azure AI Evaluation SDK for Python
Assess generative AI application performance with built-in and custom evaluators.
Installation
pip install azure-ai-evaluation
# With remote evaluation support
pip install azure-ai-evaluation[remote]
Environment Variables
# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://\x3Cresource>.openai.azure.com
AZURE_OPENAI_API_KEY=\x3Cyour-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
# For Foundry project integration
AIPROJECT_CONNECTION_STRING=\x3Cyour-connection-string>
Built-in Evaluators
Quality Evaluators (AI-Assisted)
from azure.ai.evaluation import (
GroundednessEvaluator,
RelevanceEvaluator,
CoherenceEvaluator,
FluencyEvaluator,
SimilarityEvaluator,
RetrievalEvaluator
)
# Initialize with Azure OpenAI model config
model_config = {
"azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
"api_key": os.environ["AZURE_OPENAI_API_KEY"],
"azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
}
groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)
Quality Evaluators (NLP-based)
from azure.ai.evaluation import (
F1ScoreEvaluator,
RougeScoreEvaluator,
BleuScoreEvaluator,
GleuScoreEvaluator,
MeteorScoreEvaluator
)
f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator()
bleu = BleuScoreEvaluator()
Safety Evaluators
from azure.ai.evaluation import (
ViolenceEvaluator,
SexualEvaluator,
SelfHarmEvaluator,
HateUnfairnessEvaluator,
IndirectAttackEvaluator,
ProtectedMaterialEvaluator
)
violence = ViolenceEvaluator(azure_ai_project=project_scope)
sexual = SexualEvaluator(azure_ai_project=project_scope)
Single Row Evaluation
from azure.ai.evaluation import GroundednessEvaluator
groundedness = GroundednessEvaluator(model_config)
result = groundedness(
query="What is Azure AI?",
context="Azure AI is Microsoft's AI platform...",
response="Azure AI provides AI services and tools."
)
print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")
Batch Evaluation with evaluate()
from azure.ai.evaluation import evaluate
result = evaluate(
data="test_data.jsonl",
evaluators={
"groundedness": groundedness,
"relevance": relevance,
"coherence": coherence
},
evaluator_config={
"default": {
"column_mapping": {
"query": "${data.query}",
"context": "${data.context}",
"response": "${data.response}"
}
}
}
)
print(result["metrics"])
Composite Evaluators
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator
# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)
# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope)
result = evaluate(
data="data.jsonl",
evaluators={
"qa": qa_evaluator,
"content_safety": safety_evaluator
}
)
Evaluate Application Target
from azure.ai.evaluation import evaluate
from my_app import chat_app # Your application
result = evaluate(
data="queries.jsonl",
target=chat_app, # Callable that takes query, returns response
evaluators={
"groundedness": groundedness
},
evaluator_config={
"default": {
"column_mapping": {
"query": "${data.query}",
"context": "${outputs.context}",
"response": "${outputs.response}"
}
}
}
)
Custom Evaluators
Code-Based
from azure.ai.evaluation import evaluator
@evaluator
def word_count_evaluator(response: str) -> dict:
return {"word_count": len(response.split())}
# Use in evaluate()
result = evaluate(
data="data.jsonl",
evaluators={"word_count": word_count_evaluator}
)
Prompt-Based
from azure.ai.evaluation import PromptChatTarget
class CustomEvaluator:
def __init__(self, model_config):
self.model = PromptChatTarget(model_config)
def __call__(self, query: str, response: str) -> dict:
prompt = f"Rate this response 1-5: Query: {query}, Response: {response}"
result = self.model.send_prompt(prompt)
return {"custom_score": int(result)}
Log to Foundry Project
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
project = AIProjectClient.from_connection_string(
conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
credential=DefaultAzureCredential()
)
result = evaluate(
data="data.jsonl",
evaluators={"groundedness": groundedness},
azure_ai_project=project.scope # Logs results to Foundry
)
print(f"View results: {result['studio_url']}")
Evaluator Reference
| Evaluator | Type | Metrics |
|---|---|---|
GroundednessEvaluator |
AI | groundedness (1-5) |
RelevanceEvaluator |
AI | relevance (1-5) |
CoherenceEvaluator |
AI | coherence (1-5) |
FluencyEvaluator |
AI | fluency (1-5) |
SimilarityEvaluator |
AI | similarity (1-5) |
RetrievalEvaluator |
AI | retrieval (1-5) |
F1ScoreEvaluator |
NLP | f1_score (0-1) |
RougeScoreEvaluator |
NLP | rouge scores |
ViolenceEvaluator |
Safety | violence (0-7) |
SexualEvaluator |
Safety | sexual (0-7) |
SelfHarmEvaluator |
Safety | self_harm (0-7) |
HateUnfairnessEvaluator |
Safety | hate_unfairness (0-7) |
QAEvaluator |
Composite | All quality metrics |
ContentSafetyEvaluator |
Composite | All safety metrics |
Best Practices
- Use composite evaluators for comprehensive assessment
- Map columns correctly — mismatched columns cause silent failures
- Log to Foundry for tracking and comparison across runs
- Create custom evaluators for domain-specific metrics
- Use NLP evaluators when you have ground truth answers
- Safety evaluators require Azure AI project scope
- Batch evaluation is more efficient than single-row loops
Reference Files
| File | Contents |
|---|---|
| references/built-in-evaluators.md | Detailed patterns for AI-assisted, NLP-based, and Safety evaluators with configuration tables |
| references/custom-evaluators.md | Creating code-based and prompt-based custom evaluators, testing patterns |
| scripts/run_batch_evaluation.py | CLI tool for running batch evaluations with quality, safety, and custom evaluators |
安全使用建议
This skill appears coherent with its stated purpose — it needs Azure/OpenAI endpoint and either an API key or DefaultAzureCredential, and optionally a Foundry connection string if you want safety logging. Before installing/using: 1) Confirm you trust the pip package 'azure-ai-evaluation' (review its upstream source) before pip installing. 2) Only run evaluations on datasets you control or have vetted, since data will be sent to your configured Azure OpenAI deployment. 3) Review any custom/prompt-based evaluators you add — they can send arbitrary text to the model, so avoid embedding secrets in evaluated data or prompts. 4) The documentation contains an example string used to demonstrate detecting prompt-injection — this is benign in context but be cautious when reusing prompts that include 'ignore previous instructions' patterns.
功能分析
Type: OpenClaw Skill
Name: azure-ai-evaluation-py
Version: 0.1.0
The skill bundle provides an Azure AI Evaluation SDK for Python, including a CLI tool for batch evaluations. All files (code and documentation) are clearly aligned with the stated purpose. The Python script (`scripts/run_batch_evaluation.py`) uses standard Azure SDKs (`azure.identity`, `azure.ai.projects`, `azure.ai.evaluation`) to connect to Azure services, consuming environment variables like `AZURE_OPENAI_ENDPOINT` and `AIPROJECT_CONNECTION_STRING` for its intended functionality, not for exfiltration. File system access is limited to reading input data and writing evaluation results, as expected for a CLI tool. The markdown files (`SKILL.md`, `references/*.md`) serve as documentation and do not contain any prompt injection attempts against the OpenClaw agent or instructions for malicious activities. No evidence of data exfiltration, unauthorized remote execution, persistence, or obfuscation was found.
能力评估
Purpose & Capability
The name/description match the included docs and CLI script: evaluating generative AI with built-in and custom evaluators. The environment variables mentioned (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT, AIPROJECT_CONNECTION_STRING) and imports (azure.ai.evaluation, azure.identity, azure.ai.projects) are appropriate and expected for the described functionality. No unrelated credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md and scripts limit actions to building evaluator instances, calling evaluate(), and reading user-supplied data files (JSONL). Examples include prompt-based evaluators that send prompts to Azure OpenAI models (expected). There are no instructions to read arbitrary system files or post data to unexpected third-party endpoints. Note: examples include a sample that contains the phrase 'ignore previous instructions' as part of demonstrating an IndirectAttackEvaluator — this is a documentation/example of prompt-injection detection, not an instruction to ignore agent constraints.
Install Mechanism
The skill is instruction-only and includes no platform install spec. SKILL.md recommends pip installing the 'azure-ai-evaluation' package and optional extras for remote evaluation; this is normal but means installation happens outside the platform. Verify the pip package provenance before installing to your environment.
Credentials
Requested environment variables are limited to Azure/OpenAI and Foundry (AIPROJECT_CONNECTION_STRING). Those are proportional to evaluating models and logging to a Foundry project. No unrelated secrets or broad system credentials are requested.
Persistence & Privilege
The skill does not request persistent presence (always:false) and contains no code that modifies other skills or global agent configuration. It does not require elevated privileges beyond normal network calls to Azure services.
如何使用
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install azure-ai-evaluation-py - 安装完成后,直接呼叫该 Skill 的名称或使用
/azure-ai-evaluation-py触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v0.1.0
Azure AI Evaluation SDK for Python - Initial Release
- Introduces built-in quality, safety, and composite evaluators for generative AI assessment.
- Supports both AI-assisted (OpenAI model) and NLP-based evaluators.
- Adds batch and single-row evaluation APIs with flexible data-column mapping.
- Enables custom evaluators via code or prompts.
- Integrates with Azure AI Foundry for evaluation result logging.
- Provides best practices, reference guides, and CLI tool for batch evaluation.
元数据
常见问题
Azure Ai Evaluation Py 是什么?
Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics". 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 1970 次。
如何安装 Azure Ai Evaluation Py?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install azure-ai-evaluation-py」即可一键安装,无需额外配置。
Azure Ai Evaluation Py 是免费的吗?
是的,Azure Ai Evaluation Py 完全免费(开源免费),可自由下载、安装和使用。
Azure Ai Evaluation Py 支持哪些平台?
Azure Ai Evaluation Py 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 Azure Ai Evaluation Py?
由 thegovind(@thegovind)开发并维护,当前版本 v0.1.0。
推荐 Skills