/install azure-ai-evaluation-py
Azure AI Evaluation SDK for Python
Assess generative AI application performance with built-in and custom evaluators.
Installation
pip install azure-ai-evaluation
# With remote evaluation support
pip install azure-ai-evaluation[remote]
Environment Variables
# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://\x3Cresource>.openai.azure.com
AZURE_OPENAI_API_KEY=\x3Cyour-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
# For Foundry project integration
AIPROJECT_CONNECTION_STRING=\x3Cyour-connection-string>
Built-in Evaluators
Quality Evaluators (AI-Assisted)
from azure.ai.evaluation import (
GroundednessEvaluator,
RelevanceEvaluator,
CoherenceEvaluator,
FluencyEvaluator,
SimilarityEvaluator,
RetrievalEvaluator
)
# Initialize with Azure OpenAI model config
model_config = {
"azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
"api_key": os.environ["AZURE_OPENAI_API_KEY"],
"azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
}
groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)
Quality Evaluators (NLP-based)
from azure.ai.evaluation import (
F1ScoreEvaluator,
RougeScoreEvaluator,
BleuScoreEvaluator,
GleuScoreEvaluator,
MeteorScoreEvaluator
)
f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator()
bleu = BleuScoreEvaluator()
Safety Evaluators
from azure.ai.evaluation import (
ViolenceEvaluator,
SexualEvaluator,
SelfHarmEvaluator,
HateUnfairnessEvaluator,
IndirectAttackEvaluator,
ProtectedMaterialEvaluator
)
violence = ViolenceEvaluator(azure_ai_project=project_scope)
sexual = SexualEvaluator(azure_ai_project=project_scope)
Single Row Evaluation
from azure.ai.evaluation import GroundednessEvaluator
groundedness = GroundednessEvaluator(model_config)
result = groundedness(
query="What is Azure AI?",
context="Azure AI is Microsoft's AI platform...",
response="Azure AI provides AI services and tools."
)
print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")
Batch Evaluation with evaluate()
from azure.ai.evaluation import evaluate
result = evaluate(
data="test_data.jsonl",
evaluators={
"groundedness": groundedness,
"relevance": relevance,
"coherence": coherence
},
evaluator_config={
"default": {
"column_mapping": {
"query": "${data.query}",
"context": "${data.context}",
"response": "${data.response}"
}
}
}
)
print(result["metrics"])
Composite Evaluators
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator
# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)
# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope)
result = evaluate(
data="data.jsonl",
evaluators={
"qa": qa_evaluator,
"content_safety": safety_evaluator
}
)
Evaluate Application Target
from azure.ai.evaluation import evaluate
from my_app import chat_app # Your application
result = evaluate(
data="queries.jsonl",
target=chat_app, # Callable that takes query, returns response
evaluators={
"groundedness": groundedness
},
evaluator_config={
"default": {
"column_mapping": {
"query": "${data.query}",
"context": "${outputs.context}",
"response": "${outputs.response}"
}
}
}
)
Custom Evaluators
Code-Based
from azure.ai.evaluation import evaluator
@evaluator
def word_count_evaluator(response: str) -> dict:
return {"word_count": len(response.split())}
# Use in evaluate()
result = evaluate(
data="data.jsonl",
evaluators={"word_count": word_count_evaluator}
)
Prompt-Based
from azure.ai.evaluation import PromptChatTarget
class CustomEvaluator:
def __init__(self, model_config):
self.model = PromptChatTarget(model_config)
def __call__(self, query: str, response: str) -> dict:
prompt = f"Rate this response 1-5: Query: {query}, Response: {response}"
result = self.model.send_prompt(prompt)
return {"custom_score": int(result)}
Log to Foundry Project
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
project = AIProjectClient.from_connection_string(
conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
credential=DefaultAzureCredential()
)
result = evaluate(
data="data.jsonl",
evaluators={"groundedness": groundedness},
azure_ai_project=project.scope # Logs results to Foundry
)
print(f"View results: {result['studio_url']}")
Evaluator Reference
| Evaluator | Type | Metrics |
|---|---|---|
GroundednessEvaluator |
AI | groundedness (1-5) |
RelevanceEvaluator |
AI | relevance (1-5) |
CoherenceEvaluator |
AI | coherence (1-5) |
FluencyEvaluator |
AI | fluency (1-5) |
SimilarityEvaluator |
AI | similarity (1-5) |
RetrievalEvaluator |
AI | retrieval (1-5) |
F1ScoreEvaluator |
NLP | f1_score (0-1) |
RougeScoreEvaluator |
NLP | rouge scores |
ViolenceEvaluator |
Safety | violence (0-7) |
SexualEvaluator |
Safety | sexual (0-7) |
SelfHarmEvaluator |
Safety | self_harm (0-7) |
HateUnfairnessEvaluator |
Safety | hate_unfairness (0-7) |
QAEvaluator |
Composite | All quality metrics |
ContentSafetyEvaluator |
Composite | All safety metrics |
Best Practices
- Use composite evaluators for comprehensive assessment
- Map columns correctly — mismatched columns cause silent failures
- Log to Foundry for tracking and comparison across runs
- Create custom evaluators for domain-specific metrics
- Use NLP evaluators when you have ground truth answers
- Safety evaluators require Azure AI project scope
- Batch evaluation is more efficient than single-row loops
Reference Files
| File | Contents |
|---|---|
| references/built-in-evaluators.md | Detailed patterns for AI-assisted, NLP-based, and Safety evaluators with configuration tables |
| references/custom-evaluators.md | Creating code-based and prompt-based custom evaluators, testing patterns |
| scripts/run_batch_evaluation.py | CLI tool for running batch evaluations with quality, safety, and custom evaluators |
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install azure-ai-evaluation-py - After installation, invoke the skill by name or use
/azure-ai-evaluation-py - Provide required inputs per the skill's parameter spec and get structured output
What is Azure Ai Evaluation Py?
Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics". It is an AI Agent Skill for Claude Code / OpenClaw, with 1970 downloads so far.
How do I install Azure Ai Evaluation Py?
Run "/install azure-ai-evaluation-py" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Azure Ai Evaluation Py free?
Yes, Azure Ai Evaluation Py is completely free (open-source). You can download, install and use it at no cost.
Which platforms does Azure Ai Evaluation Py support?
Azure Ai Evaluation Py is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created Azure Ai Evaluation Py?
It is built and maintained by thegovind (@thegovind); the current version is v0.1.0.