Chapter 75: Prompt Evaluation Framework: Automated Testing, Human Evaluation, and LLM-as-Judge
75.1 Why Systematic Evaluation Matters
"Is this prompt good?" is a question that cannot be reliably answered by intuition. Human cognitive biases, sample selection bias, and the inherent high variance of LLM outputs make subjective judgment deeply unreliable. Prompt engineering without a systematic evaluation framework is, at its core, navigating in the dark.
A complete prompt evaluation framework consists of three complementary layers:
- Automated Testing: Deterministic evaluation based on exact matching, regex, or schema validation. Fast, cheap, and suitable for continuous integration.
- Human Evaluation: Subjective quality judgments by human reviewers. High quality but expensive and non-scalable.
- LLM-as-Judge: Using a language model as a proxy for human evaluation. Combines the scale advantages of automation with judgment quality that approaches human-level.
These three are not substitutes for one another; they divide responsibility. Automated testing enforces basic quality gates. LLM-as-Judge covers subjective quality dimensions. Human evaluation serves as the final arbiter and gold-standard calibration source.
75.2 Automated Testing
75.2.1 Deterministic Evaluation Metrics
For tasks with unambiguous correct answers, deterministic evaluation is the most reliable method:
import re
import json


class AutomatedEvaluator:
    def exact_match(self, prediction: str, ground_truth: str) -> float:
        # Case-insensitive comparison after trimming surrounding whitespace
        pred = prediction.strip().lower()
        truth = ground_truth.strip().lower()
        return 1.0 if pred == truth else 0.0

    def contains_match(self, prediction: str, expected_phrases: list) -> float:
        # Fraction of expected phrases that appear in the prediction
        pred_lower = prediction.lower()
        matches = sum(1 for phrase in expected_phrases if phrase.lower() in pred_lower)
        return matches / len(expected_phrases) if expected_phrases else 0.0

    def json_schema_validation(self, prediction: str, schema: dict) -> float:
        # Lightweight check of required fields and basic types (not full JSON Schema)
        try:
            # Extract the outermost {...} span so surrounding prose is tolerated
            json_match = re.search(r'\{.*\}', prediction, re.DOTALL)
            if not json_match:
                return 0.0
            parsed = json.loads(json_match.group())
            required_fields = schema.get("required", [])
            fields_present = all(field in parsed for field in required_fields)
            type_correct = True
            for field, field_schema in schema.get("properties", {}).items():
                if field in parsed:
                    expected_type = field_schema.get("type")
                    if expected_type == "string" and not isinstance(parsed[field], str):
                        type_correct = False
                    elif expected_type == "number" and not isinstance(parsed[field], (int, float)):
                        type_correct = False
                    elif expected_type == "array" and not isinstance(parsed[field], list):
                        type_correct = False
            # Parseable JSON that violates the schema earns partial credit (0.5)
            return 1.0 if (fields_present and type_correct) else 0.5
        except json.JSONDecodeError:
            return 0.0

    def length_check(self, prediction: str, min_words: int = 0, max_words: int = 10**9) -> float:
        # Full score inside the range; proportional penalty outside it
        word_count = len(prediction.split())
        if min_words <= word_count <= max_words:
            return 1.0
        elif word_count < min_words:
            return word_count / min_words
        else:
            return max_words / word_count
75.2.2 Designing a Test Suite
from typing import Optional


class PromptTestSuite:
    def __init__(self, prompt: str, model: str = "claude-opus-4-5"):
        self.prompt = prompt
        self.model = model
        self.test_cases = []
        self.evaluator = AutomatedEvaluator()

    def add_test_case(self, input_text: str, expected_output: Optional[str] = None,
                      checks: Optional[list] = None, tags: Optional[list] = None):
        """
        checks: [
            {"type": "exact_match", "expected": "..."},
            {"type": "contains", "phrases": [...]},
            {"type": "json_schema", "schema": {...}},
            {"type": "regex", "pattern": "..."},
            {"type": "length", "min": 10, "max": 500},
            {"type": "custom", "fn": lambda output: float}
        ]
        """
        self.test_cases.append({
            "input": input_text,
            "expected": expected_output,
            "checks": checks or [],
            "tags": tags or []
        })

    def run(self, client, verbose: bool = False) -> dict:
        results = []
        for i, case in enumerate(self.test_cases):
            response = client.messages.create(
                model=self.model,
                max_tokens=1024,
                system=self.prompt,
                messages=[{"role": "user", "content": case["input"]}]
            )
            output = response.content[0].text
            check_scores = []
            for check in case["checks"]:
                # "expected" may be stored as None, so fall back to "" explicitly.
                # _run_check is not shown in this section; it is expected to
                # dispatch on check["type"] and return a score in [0, 1].
                score = self._run_check(check, output, case.get("expected") or "")
                check_scores.append(score)
            # A case with no checks counts as a pass
            case_score = sum(check_scores) / len(check_scores) if check_scores else 1.0
            results.append({
                "case_index": i,
                "output": output,
                "score": case_score,
                "passed": case_score >= 0.8
            })
            if verbose:
                status = "PASS" if results[-1]["passed"] else "FAIL"
                print(f"[{status}] Case {i+1}: {case_score:.2f}")
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        return {
            "total_cases": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": passed / total if total > 0 else 0,
            "avg_score": sum(r["score"] for r in results) / total if total > 0 else 0,
            "results": results
        }
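The suite above calls a `_run_check` method that this section never defines. A minimal sketch of what such a dispatcher could look like follows, written as a standalone function. The name `run_check`, the binary length scoring, and the clamping of custom scores are assumptions for illustration, and the `json_schema` type is left out to keep the sketch short.

```python
import re


def run_check(check: dict, output: str, expected: str = "") -> float:
    """Hypothetical dispatcher: map one check spec to a score in [0, 1]."""
    kind = check.get("type")
    if kind == "exact_match":
        target = check.get("expected", expected)
        return 1.0 if output.strip().lower() == target.strip().lower() else 0.0
    if kind == "contains":
        phrases = check.get("phrases", [])
        if not phrases:
            return 0.0
        hits = sum(1 for p in phrases if p.lower() in output.lower())
        return hits / len(phrases)
    if kind == "regex":
        return 1.0 if re.search(check["pattern"], output) else 0.0
    if kind == "length":
        # Simplification: binary pass/fail instead of a proportional penalty
        n_words = len(output.split())
        in_range = check.get("min", 0) <= n_words <= check.get("max", 10**9)
        return 1.0 if in_range else 0.0
    if kind == "custom":
        # Clamp so a misbehaving custom function cannot skew the case average
        return max(0.0, min(1.0, float(check["fn"](output))))
    raise ValueError(f"Unknown check type: {kind!r}")
```

Raising on an unknown type, rather than returning 0.0, makes a typo in a test definition fail loudly instead of silently lowering the pass rate.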
75.2.3 Prompt CI/CD Integration
# .github/workflows/prompt-ci.yml
name: Prompt CI
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'tests/prompt_tests/**'
jobs:
  test-prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Prompt Tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python tests/run_prompt_tests.py \
            --min-pass-rate 0.85 \
            --max-regression-rate 0.05
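The workflow invokes `tests/run_prompt_tests.py`, whose contents are not shown. The following sketch covers only the gating part of such a script. It assumes that `--max-regression-rate` means the largest tolerated drop in pass rate relative to a stored baseline; that reading, the `gate` helper, and the default values are illustrative, not taken from the source.

```python
import argparse


def parse_args(argv=None):
    """Parse the two thresholds passed by the CI workflow."""
    parser = argparse.ArgumentParser(description="Gate a prompt change on test results")
    parser.add_argument("--min-pass-rate", type=float, default=0.85)
    parser.add_argument("--max-regression-rate", type=float, default=0.05)
    return parser.parse_args(argv)


def gate(pass_rate: float, baseline_pass_rate: float,
         min_pass_rate: float, max_regression_rate: float) -> tuple:
    """Return (ok, reasons). All rates are fractions in [0, 1]."""
    reasons = []
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2f} < minimum {min_pass_rate:.2f}")
    # Assumed meaning of "regression": drop in pass rate versus the baseline run
    regression = baseline_pass_rate - pass_rate
    if regression > max_regression_rate:
        reasons.append(f"regression {regression:.2f} > allowed {max_regression_rate:.2f}")
    return (not reasons, reasons)
```

In a real script the result would typically end in `sys.exit(0 if ok else 1)`, since a non-zero exit code is what makes the GitHub Actions step fail.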
75.3 Human Evaluation
75.3.1 When Human Evaluation Is Irreplaceable
Automation cannot fully replace human evaluation. Human reviewers remain necessary for:
- Subjective quality dimensions: Does the style match the brand voice? Does the answer feel warm? Is the creativity sufficient?
- Establishing new baselines: Before LLM-as-Judge can reliably work on a new task, humans must build the gold standard
- Safety evaluation: Harmful content detection and bias assessment require trained human reviewers
- Professional domain accuracy: Medical, legal, and financial advice accuracy requires domain expert review
75.3.2 Annotation Guideline Design
Clear annotation guidelines are the foundation of consistent human evaluation:
## Customer Service Response Quality Guide
### Evaluation Dimensions
**1. Problem Resolution (1-5)**
- 5: Completely and accurately resolves the user's specific issue
- 4: Mostly resolves the issue with minor omissions
- 3: Partially resolves the issue; user likely needs follow-up
- 2: Directionally correct but insufficient to resolve the issue
- 1: Does not resolve the issue or is irrelevant
**2. Tone and Professionalism (1-5)**
- 5: Warm and professional; user feels valued
- 4: Professional and polite, no notable issues
- 3: Neutral, neither warm nor cold
- 2: Slightly curt or overly formal
- 1: Unprofessional or makes user uncomfortable
**3. Information Accuracy (1-5)**
- 5: All information completely accurate
- 4: Main information accurate; minor detail errors
- 3: Mostly accurate; one clear error
- 2: Multiple information errors
- 1: Severely incorrect or misleading
### Edge Case Handling
- When the user is emotionally distressed, weight tone score higher
- When the issue involves billing/refunds, weight accuracy score higher
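The guideline asks reviewers to weight tone or accuracy "higher" in certain contexts without saying by how much. One way to make that rule reproducible is to fix the weights in code. In the sketch below, the doubling factor, the dimension keys, and the function name are all assumptions chosen for illustration.

```python
def weighted_quality(scores: dict, distressed: bool = False, billing: bool = False) -> float:
    """Combine the three 1-5 dimension scores into one weighted average.

    scores: {"resolution": int, "tone": int, "accuracy": int}
    The 2x multipliers are illustrative, not values from the guideline.
    """
    weights = {"resolution": 1.0, "tone": 1.0, "accuracy": 1.0}
    if distressed:
        weights["tone"] *= 2.0      # emotionally distressed user: tone matters more
    if billing:
        weights["accuracy"] *= 2.0  # billing/refund issue: accuracy matters more
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight
```

Writing the weights down also lets different reviewers, or an LLM judge, apply the same edge-case rule consistently.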
75.3.3 Inter-Rater Agreement
import numpy as np
from itertools import combinations


def compute_inter_rater_agreement(ratings: dict) -> dict:
    """
    ratings: {"rater_a": [score_list], "rater_b": [score_list], ...}
    """
    rater_names = list(ratings.keys())
    agreements = {}
    for r1, r2 in combinations(rater_names, 2):
        scores1 = np.array(ratings[r1])
        scores2 = np.array(ratings[r2])
        # Pearson correlation; this is NaN if either rater gave a constant score
        correlation = np.corrcoef(scores1, scores2)[0, 1]
        mae = np.mean(np.abs(scores1 - scores2))
        exact_agreement = np.mean(scores1 == scores2)
        agreements[f"{r1}_vs_{r2}"] = {
            "correlation": float(correlation),
            "mae": float(mae),
            "exact_agreement": float(exact_agreement),
            # Cast to a plain bool so the result is JSON-serializable
            "acceptable": bool(correlation > 0.7 and mae < 0.8)
        }
    return agreements
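Correlation and mean absolute error do not correct for agreement that would occur by chance. For ordinal scales such as the 1-5 rubric above, quadratic weighted Cohen's kappa is a common complement. The pure-Python sketch below is an addition to the chapter's metrics, not part of its original function; the function name and the default rating range are assumptions.

```python
def quadratic_weighted_kappa(a: list, b: list, min_rating: int = 1, max_rating: int = 5) -> float:
    """Chance-corrected agreement between two raters on an ordinal scale."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if max_rating <= min_rating:
        raise ValueError("max_rating must be greater than min_rating")
    k = max_rating - min_rating + 1
    n = len(a)
    # Confusion matrix of observed rating pairs
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)  # larger disagreements cost more
            expected = hist_a[i] * hist_b[j] / n      # count expected if raters were independent
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0:
        # Both raters used one identical category; kappa is undefined,
        # treated here as perfect agreement by convention
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.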
75.4 LLM-as-Judge
75.4.1 Concept and Use Cases
LLM-as-Judge uses a language model, typically a more capable model such as Claude Opus, to evaluate another model's output quality.
Well-suited for:
- Tasks requiring subjective judgment with clear standards (writing fluency, logical coherence)
- Large-scale evaluation where human cost is prohibitive
- Rapid feedback loops during development iteration
Core challenges:
- Position bias: The model tends to rate the first or last option higher
- Verbosity bias: The model tends to prefer longer responses
- Self-enhancement bias: The model tends to favor outputs stylistically similar to its own
- Sycophancy bias: When pressed, the model tends to change its rating to please the questioner
75.4.2 Designing a High-Quality Judge Prompt
JUDGE_SYSTEM_PROMPT = """You are an objective, rigorous AI output quality reviewer.
Your evaluations must:
1. Be based on the provided evaluation criteria, not personal preferences
2. Maintain consistency: outputs of equal quality should receive equal scores
3. Provide traceable reasoning: every score must be backed by specific textual evidence
4. Resist the following biases:
   - Do not give higher scores simply because a response is longer
   - Do not let presentation order (A vs B shown first) influence your judgment
   - Do not let flattering tone override substance
Important: Read all outputs completely before beginning your evaluation."""


def llm_judge_single(task_description, user_input, model_output, evaluation_criteria) -> dict:
    # Assumes a module-level API client named `client` and a
    # `parse_judge_response` helper, both defined elsewhere.
    criteria_text = "\n".join([
        f"- {name} (weight {info['weight']}): {info['description']}"
        for name, info in evaluation_criteria.items()
    ])
    judge_prompt = f"""Evaluate the quality of the following AI response.
Task description: {task_description}
User input:
{user_input}
AI response:
{model_output}
Evaluation dimensions:
{criteria_text}
Output format:
<evaluation>
Analyze each dimension citing specific text as evidence
</evaluation>
<scores>
{{
{", ".join([f'"{name}": {{"score": 1-10, "reasoning": "..."}}' for name in evaluation_criteria])}
}}
</scores>
<overall>
{{
"weighted_score": weighted composite score (0-10),
"verdict": "excellent/good/acceptable/poor/unacceptable",
"key_strength": "most notable strength",
"key_weakness": "most important improvement area"
}}
</overall>"""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2000,
        system=JUDGE_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    return parse_judge_response(response.content[0].text)
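The function above relies on a `parse_judge_response` helper that the chapter does not show. A minimal sketch of one possible implementation follows. It assumes the judge followed the `<scores>` and `<overall>` format requested in the prompt, and it adds an optional `weights` argument (not in the original call) so the composite score can be recomputed locally instead of trusting the judge's arithmetic.

```python
import json
import re


def extract_tag_json(text: str, tag: str) -> dict:
    """Return the JSON object inside <tag>...</tag>, or {} if missing or invalid."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    if not match:
        return {}
    try:
        parsed = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def parse_judge_response(text: str, weights: dict = None) -> dict:
    """Hypothetical parser for the judge output format requested in the prompt."""
    scores = extract_tag_json(text, "scores")
    overall = extract_tag_json(text, "overall")
    result = {
        "scores": scores,
        "overall": overall,
        "parse_ok": bool(scores) and bool(overall),
    }
    if weights:
        # Use only dimensions that came back with a numeric score
        usable = {
            name: weight for name, weight in weights.items()
            if isinstance(scores.get(name), dict)
            and isinstance(scores[name].get("score"), (int, float))
        }
        total_weight = sum(usable.values())
        if total_weight > 0:
            result["weighted_score"] = sum(
                scores[name]["score"] * weight for name, weight in usable.items()
            ) / total_weight
    return result
```

Returning `parse_ok` rather than raising lets a batch run count malformed judge outputs and retry them instead of aborting.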
75.4.3 Pairwise Comparison
Pairwise comparison is often more reliable than absolute scoring: the judge only has to decide which response is better, which sidesteps the problem of calibrating a numeric scale.
import random


def llm_judge_pairwise(task_description, user_input, output_a, output_b,
                       evaluation_criteria, randomize_order=True) -> dict:
    swapped = False
    if randomize_order and random.random() > 0.5:
        output_a, output_b = output_b, output_a
        swapped = True
    criteria_text = "\n".join([
        f"- {name}: {info['description']}"
        for name, info in evaluation_criteria.items()
    ])
    judge_prompt = f"""Compare the following two AI responses and determine which is better.
Task: {task_description}
User input:
{user_input}
Response A:
{output_a}
Response B:
{output_b}
Evaluation dimensions:
{criteria_text}
Important:
- Do not favor longer responses simply for their length
- Base your judgment solely on content quality, ignoring presentation order
- Choose "tie" if quality is genuinely equal
Output:
<comparison>
Comparative analysis across each dimension
</comparison>
<verdict>
{{
"winner": "A" or "B" or "tie",
"confidence": "high/medium/low",
"reasoning": "core justification (1-2 sentences)"
}}
</verdict>"""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1500,
        system=JUDGE_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    result = parse_pairwise_response(response.content[0].text)
    # Restore original ordering if swapped
    if swapped and result.get("winner") in ["A", "B"]:
        result["winner"] = "B" if result["winner"] == "A" else "A"
    return result
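Randomizing the order spreads position bias across many comparisons but does not remove it from any single verdict. A stricter option is to run each pair in both orders and accept a winner only when the two verdicts agree. The sketch below shows that idea with a generic `judge` callable; the function name and the choice to record disagreements as a tie are assumptions.

```python
from typing import Callable


def judge_both_orders(judge: Callable[[str, str], str], output_a: str, output_b: str) -> dict:
    """Run a pairwise judge in both presentation orders.

    judge(first, second) must return "A" (first wins), "B" (second wins), or "tie".
    """
    forward = judge(output_a, output_b)
    reverse = judge(output_b, output_a)
    # Translate the reversed verdict back into the original A/B labels
    flip = {"A": "B", "B": "A", "tie": "tie"}
    reverse_mapped = flip.get(reverse, "tie")
    consistent = forward == reverse_mapped
    return {
        # Disagreement between orders is recorded as a tie (illustrative policy)
        "winner": forward if consistent else "tie",
        "consistent": consistent,
        "forward": forward,
        "reverse": reverse_mapped,
    }
```

This doubles the number of judge calls, so it fits best where individual verdicts matter, such as release decisions, rather than in bulk screening.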
75.4.4 Bias Calibration
def calibrate_judge_bias(judge_fn, calibration_set: list) -> dict:
    """
    calibration_set: [
        {
            "input": ...,
            "output_better": ...,  # human-confirmed better output
            "output_worse": ...,   # human-confirmed worse output
        }
    ]
    """
    position_a_correct = []
    position_b_correct = []
    for item in calibration_set:
        # Test 1: better output in position A
        result_a = judge_fn(item["input"], item["output_better"], item["output_worse"])
        position_a_correct.append(result_a.get("winner") == "A")
        # Test 2: better output in position B
        result_b = judge_fn(item["input"], item["output_worse"], item["output_better"])
        position_b_correct.append(result_b.get("winner") == "B")
    acc_a = sum(position_a_correct) / len(position_a_correct)
    acc_b = sum(position_b_correct) / len(position_b_correct)
    return {
        "accuracy_when_better_is_a": acc_a,
        "accuracy_when_better_is_b": acc_b,
        "position_bias": abs(acc_a - acc_b),
        "overall_accuracy": (acc_a + acc_b) / 2,
        "bias_detected": abs(acc_a - acc_b) > 0.1
    }
75.5 Three-Layer Evaluation Pipeline Architecture
Input test set
    ↓
[Layer 1] Automated Testing (format checks, basic correctness)
    │  Pass rate < 80%? → Alert, block deployment
    ↓
[Layer 2] LLM-as-Judge (subjective quality, multi-dimensional scoring)
    │  Score drops > 5%? → Trigger human review
    ↓
[Layer 3] Human Evaluation (edge cases, safety, novel scenarios)
    ↓
Evaluation report → Optimization decisions
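The two gates in this flow can be expressed as a small decision function. The sketch below assumes that "score drops > 5%" means a relative drop against a baseline judge score; the function name, return labels, and parameter names are illustrative.

```python
def evaluation_decision(auto_pass_rate: float, judge_score: float,
                        baseline_judge_score: float,
                        min_pass_rate: float = 0.80,
                        max_relative_drop: float = 0.05) -> str:
    """Return "block_deployment", "human_review", or "continue"."""
    # Layer 1 gate: automated tests must clear the minimum pass rate
    if auto_pass_rate < min_pass_rate:
        return "block_deployment"
    # Layer 2 gate: compare the judge score against the baseline (assumed relative drop)
    if baseline_judge_score > 0:
        drop = (baseline_judge_score - judge_score) / baseline_judge_score
        if drop > max_relative_drop:
            return "human_review"
    # No gate tripped: proceed to routine Layer 3 sampling and reporting
    return "continue"
```

Checking the automated gate first means a prompt that fails basic format checks never spends judge tokens or reviewer time.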
75.5.1 Metrics Dashboard
class EvaluationDashboard:
    def __init__(self):
        self.metrics_history = []

    def record_evaluation(self, prompt_version, auto_results, llm_results, timestamp):
        self.metrics_history.append({
            "version": prompt_version,
            "timestamp": timestamp,
            "auto_pass_rate": auto_results.get("pass_rate", 0),
            "llm_overall_score": llm_results.get("overall_score", 0),
            "llm_dimension_scores": llm_results.get("dimension_scores", {})
        })

    def detect_regression(self, current_version, baseline_version) -> dict:
        current = next((m for m in self.metrics_history if m["version"] == current_version), None)
        baseline = next((m for m in self.metrics_history if m["version"] == baseline_version), None)
        if not current or not baseline:
            return {"error": "Version not found"}
        return {
            "auto_pass_rate_change": current["auto_pass_rate"] - baseline["auto_pass_rate"],
            "llm_score_change": current["llm_overall_score"] - baseline["llm_overall_score"],
            "has_regression": (
                current["auto_pass_rate"] < baseline["auto_pass_rate"] - 0.05 or
                current["llm_overall_score"] < baseline["llm_overall_score"] - 0.3
            )
        }
75.6 Building and Maintaining Evaluation Datasets
High-quality test sets must be:
- Representative: Cover the major scenario distribution of the task
- Challenging: Include sufficient difficult cases and edge conditions
- Leakage-free: Test cases must not be used while optimizing the prompt, for example as few-shot examples or tuning data
- Versioned: The test set itself requires version control
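Two of these properties lend themselves to simple automated checks. The sketch below fingerprints a test set so that results can be tied to a specific version, and flags test inputs that appear verbatim in the prompt text. The function names, the 12-character hash prefix, and the assumption that each case is a dict with an `"input"` key are illustrative choices.

```python
import hashlib
import json


def dataset_fingerprint(test_cases: list) -> str:
    """Short, order-independent hash of the test inputs, usable as a version tag."""
    canonical = json.dumps(sorted(case["input"] for case in test_cases), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def find_leaked_inputs(prompt: str, test_cases: list) -> list:
    """Return test inputs that occur verbatim (ignoring case and spacing) in the prompt."""
    normalized_prompt = " ".join(prompt.lower().split())
    leaked = []
    for case in test_cases:
        normalized_input = " ".join(case["input"].lower().split())
        # Skip empty inputs, which would trivially match any prompt
        if normalized_input and normalized_input in normalized_prompt:
            leaked.append(case["input"])
    return leaked
```

Verbatim matching only catches the most direct form of leakage; paraphrased or partially copied examples would need a fuzzier comparison.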
Mine new test cases from production failures:
def mine_failure_cases_from_production(production_logs: list) -> list:
    new_cases = []
    for log in production_logs:
        if log.get("user_feedback") == "negative":
            new_cases.append({
                "input": log["input"],
                "output": log["output"],
                "issue": "user_negative_feedback",
                "source": "production",
                # Assumption: a human triage step records "review_status" on the log.
                # Without this field the filter below would discard every case.
                "review_status": log.get("review_status")
            })
    # Only confirmed issues graduate to the formal test set
    return [c for c in new_cases if c.get("review_status") == "confirmed_issue"]
Summary
A prompt evaluation framework is what turns prompt work from "feels good" into "measurable and improvable." Automated testing enforces functional consistency. Human evaluation establishes the gold standard for quality. LLM-as-Judge provides scalable subjective quality evaluation between the two.
In practice, a typical allocation is that automated testing covers roughly 70-80% of routine scenarios, LLM-as-Judge handles subjective dimensions, and human evaluation focuses on high-value safety reviews and novel scenarios. Bias awareness, calibration mechanisms, and continuously maintained test sets are what keep this system effective over time.