Chapter 75: Prompt Evaluation Framework: Automated Testing, Human Evaluation, and LLM-as-Judge

75.1 Why Systematic Evaluation Matters

"Is this prompt good?" is a question that cannot be reliably answered by intuition. Human cognitive biases, sample selection bias, and the inherent high variance of LLM outputs make subjective judgment deeply unreliable. Prompt engineering without a systematic evaluation framework is, at its core, navigating in the dark.

A complete prompt evaluation framework consists of three complementary layers:

Automated Testing: Deterministic evaluation based on exact matching, regex, or schema validation. Fast, cheap, and suitable for continuous integration.
Human Evaluation: Subjective quality judgments by human reviewers. High quality, but expensive and hard to scale.
LLM-as-Judge: Using a language model as a proxy for human evaluation. Combines the scale advantages of automation with judgment quality that can approach human level when the criteria are well specified.

These three are not substitutes for one another; they divide responsibility. Automated testing enforces basic quality gates. LLM-as-Judge covers subjective quality dimensions. Human evaluation serves as the final arbiter and gold-standard calibration source.

75.2 Automated Testing

75.2.1 Deterministic Evaluation Metrics

For tasks with unambiguous correct answers, deterministic evaluation is the most reliable method:

import re
import json

class AutomatedEvaluator:
    
    def exact_match(self, prediction: str, ground_truth: str) -> float:
        pred = prediction.strip().lower()
        truth = ground_truth.strip().lower()
        return 1.0 if pred == truth else 0.0
    
    def contains_match(self, prediction: str, expected_phrases: list) -> float:
        pred_lower = prediction.lower()
        matches = sum(1 for phrase in expected_phrases if phrase.lower() in pred_lower)
        return matches / len(expected_phrases) if expected_phrases else 0.0
    
    def json_schema_validation(self, prediction: str, schema: dict) -> float:
        try:
            json_match = re.search(r'\{.*\}', prediction, re.DOTALL)
            if not json_match:
                return 0.0
            parsed = json.loads(json_match.group())
            required_fields = schema.get("required", [])
            fields_present = all(field in parsed for field in required_fields)
            
            type_correct = True
            for field, field_schema in schema.get("properties", {}).items():
                if field in parsed:
                    expected_type = field_schema.get("type")
                    if expected_type == "string" and not isinstance(parsed[field], str):
                        type_correct = False
                    elif expected_type == "number" and not isinstance(parsed[field], (int, float)):
                        type_correct = False
                    elif expected_type == "array" and not isinstance(parsed[field], list):
                        type_correct = False
            
            return 1.0 if (fields_present and type_correct) else 0.5
        except json.JSONDecodeError:
            return 0.0
    
    def length_check(self, prediction: str, min_words: int = 0, max_words: int = 10**9) -> float:
        word_count = len(prediction.split())
        if min_words <= word_count <= max_words:
            return 1.0
        elif word_count < min_words:
            return word_count / min_words
        else:
            return max_words / word_count
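One caveat on json_schema_validation: the greedy pattern `\{.*\}` spans from the first `{` to the last `}` in the output, so it can fail when the model wraps the JSON in prose that itself contains braces. The following is a minimal sketch of an alternative that scans for the first parseable object using the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_first_json_object` is hypothetical and not part of the class above.

```python
import json
from typing import Optional

def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next candidate
        start = text.find("{", start + 1)
    return None
```

With this helper, an output such as `Here: {"a": 1} and then {oops}` still yields the leading object, whereas the greedy regex would capture the whole span and fail to parse.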

75.2.2 Designing a Test Suite

class PromptTestSuite:
    
    def __init__(self, prompt: str, model: str = "claude-opus-4-5"):
        self.prompt = prompt
        self.model = model
        self.test_cases = []
        self.evaluator = AutomatedEvaluator()
    
    def add_test_case(self, input_text: str, expected_output: str = None,
                      checks: list = None, tags: list = None):
        """
        checks: [
            {"type": "exact_match", "expected": "..."},
            {"type": "contains", "phrases": [...]},
            {"type": "json_schema", "schema": {...}},
            {"type": "regex", "pattern": "..."},
            {"type": "length", "min": 10, "max": 500},
            {"type": "custom", "fn": lambda output: float}
        ]
        """
        self.test_cases.append({
            "input": input_text,
            "expected": expected_output,
            "checks": checks or [],
            "tags": tags or []
        })
    
    def run(self, client, verbose: bool = False) -> dict:
        results = []
        
        for i, case in enumerate(self.test_cases):
            response = client.messages.create(
                model=self.model,
                max_tokens=1024,
                system=self.prompt,
                messages=[{"role": "user", "content": case["input"]}]
            )
            output = response.content[0].text
            
            check_scores = []
            for check in case["checks"]:
                score = self._run_check(check, output, case.get("expected", ""))
                check_scores.append(score)
            
            case_score = sum(check_scores) / len(check_scores) if check_scores else 1.0
            results.append({
                "case_index": i,
                "output": output,
                "score": case_score,
                "passed": case_score >= 0.8
            })
            
            if verbose:
                status = "PASS" if results[-1]["passed"] else "FAIL"
                print(f"[{status}] Case {i+1}: {case_score:.2f}")
        
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        return {
            "total_cases": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": passed / total if total > 0 else 0,
            "avg_score": sum(r["score"] for r in results) / total if total > 0 else 0,
            "results": results
        }
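The run method calls self._run_check, which is not shown in the listing. The sketch below is one possible dispatcher, written as a standalone function so it can be read and tested in isolation; inside the class it would become a method and could delegate to self.evaluator. It is an assumption about the intended behavior, not the original implementation: the length check here is pass/fail rather than partial credit, and the json_schema type is omitted for brevity.

```python
import re
from typing import Any, Callable, Dict

def run_check(check: Dict[str, Any], output: str, expected: str = "") -> float:
    """Score one check against a model output; returns a float in [0, 1]."""
    kind = check.get("type")
    if kind == "exact_match":
        # Fall back to the test case's expected output when the check has none
        target = check.get("expected", expected) or ""
        return 1.0 if output.strip().lower() == target.strip().lower() else 0.0
    if kind == "contains":
        phrases = check.get("phrases", [])
        if not phrases:
            return 0.0
        hits = sum(1 for p in phrases if p.lower() in output.lower())
        return hits / len(phrases)
    if kind == "regex":
        return 1.0 if re.search(check["pattern"], output) else 0.0
    if kind == "length":
        words = len(output.split())
        in_range = check.get("min", 0) <= words <= check.get("max", 10**9)
        return 1.0 if in_range else 0.0
    if kind == "custom":
        fn: Callable[[str], float] = check["fn"]
        return float(fn(output))
    raise ValueError(f"Unknown check type: {kind!r}")
```

Raising on an unknown type, rather than silently scoring 0, makes a typo in a test definition fail loudly instead of showing up as a prompt regression.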

75.2.3 Prompt CI/CD Integration

# .github/workflows/prompt-ci.yml
name: Prompt CI

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'tests/prompt_tests/**'

jobs:
  test-prompts:
    runs-on: ubuntu-latest
    steps:
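      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      # Assumes the test script depends only on the anthropic SDK
      - name: Install dependencies
        run: pip install anthropic
      - name: Run Prompt Tests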
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python tests/run_prompt_tests.py \
            --min-pass-rate 0.85 \
            --max-regression-rate 0.05
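The workflow invokes tests/run_prompt_tests.py with two thresholds, but the script itself is not shown. The sketch below illustrates how the gating logic might look. It assumes that "regression rate" means the drop in pass rate relative to a stored baseline; the `--pass-rate` and `--baseline-pass-rate` flags are hypothetical stand-ins for values a real script would compute by running the suite and loading the previous result.

```python
import argparse
import sys
from typing import List, Optional

def gate(pass_rate: float, baseline_pass_rate: float,
         min_pass_rate: float, max_regression_rate: float) -> List[str]:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    if pass_rate < min_pass_rate:
        failures.append(f"pass rate {pass_rate:.2f} is below minimum {min_pass_rate:.2f}")
    regression = baseline_pass_rate - pass_rate
    if regression > max_regression_rate:
        failures.append(f"pass rate regressed by {regression:.2f} (limit {max_regression_rate:.2f})")
    return failures

def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Gate a pull request on prompt test results")
    parser.add_argument("--min-pass-rate", type=float, default=0.85)
    parser.add_argument("--max-regression-rate", type=float, default=0.05)
    # Hypothetical inputs: a real script would run the suite and read a stored baseline
    parser.add_argument("--pass-rate", type=float, required=True)
    parser.add_argument("--baseline-pass-rate", type=float, required=True)
    args = parser.parse_args(argv)

    failures = gate(args.pass_rate, args.baseline_pass_rate,
                    args.min_pass_rate, args.max_regression_rate)
    for failure in failures:
        print(f"FAIL: {failure}", file=sys.stderr)
    # A non-zero exit code is what makes the CI job fail
    return 1 if failures else 0
```

A script built this way would end with `sys.exit(main())` so that the exit code reaches the CI runner.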

75.3 Human Evaluation

75.3.1 When Human Evaluation Is Irreplaceable

Human evaluation cannot be replaced by automation, especially for:

Subjective quality dimensions: Does the style match the brand voice? Does the answer feel warm? Is the creativity sufficient?
Establishing new baselines: Before LLM-as-Judge can reliably work on a new task, humans must build the gold standard
Safety evaluation: Harmful content detection, bias assessment — these require trained human reviewers
Professional domain accuracy: Medical, legal, and financial advice accuracy requires domain expert review

75.3.2 Annotation Guideline Design

Clear annotation guidelines are the foundation of consistent human evaluation:

## Customer Service Response Quality Guide

### Evaluation Dimensions

**1. Problem Resolution (1-5)**
- 5: Completely and accurately resolves the user's specific issue
- 4: Mostly resolves the issue with minor omissions
- 3: Partially resolves the issue; user likely needs follow-up
- 2: Directionally correct but insufficient to resolve the issue
- 1: Does not resolve the issue or is irrelevant

**2. Tone and Professionalism (1-5)**
- 5: Warm and professional; user feels valued
- 4: Professional and polite, no notable issues
- 3: Neutral, neither warm nor cold
- 2: Slightly curt or overly formal
- 1: Unprofessional or makes user uncomfortable

**3. Information Accuracy (1-5)**
- 5: All information completely accurate
- 4: Main information accurate; minor detail errors
- 3: Mostly accurate; one clear error
- 2: Multiple information errors
- 1: Severely incorrect or misleading

### Edge Case Handling
- When the user is emotionally distressed, weight tone score higher
- When the issue involves billing/refunds, weight accuracy score higher
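The edge-case rules say which dimension to weight higher but not by how much. One way to make them operational is a weighted mean whose weights depend on tags attached to each case. In the sketch below, the tag names, the dimension keys, and the doubling factor are illustrative assumptions, not values taken from the guide.

```python
from typing import Dict, Iterable

# Illustrative assumption: equal base weights, doubled for the emphasized dimension
BASE_WEIGHTS = {"resolution": 1.0, "tone": 1.0, "accuracy": 1.0}
EMPHASIS = {"distressed_user": "tone", "billing": "accuracy"}  # hypothetical tag names

def composite_score(scores: Dict[str, int], tags: Iterable[str] = ()) -> float:
    """Weighted mean of 1-5 dimension scores, emphasizing dimensions per case tag."""
    weights = dict(BASE_WEIGHTS)
    for tag in tags:
        dimension = EMPHASIS.get(tag)
        if dimension:
            weights[dimension] *= 2.0
    total_weight = sum(weights.values())
    return sum(scores[d] * w for d, w in weights.items()) / total_weight
```

For a response scored 5 on resolution, 2 on tone, and 5 on accuracy, the unweighted mean is 4.0, while tagging the case as `distressed_user` lowers the composite to 3.5, reflecting the extra weight on tone.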

75.3.3 Inter-Rater Agreement

import numpy as np
from itertools import combinations

def compute_inter_rater_agreement(ratings: dict) -> dict:
    """
    ratings: {"rater_a": [score_list], "rater_b": [score_list], ...}
    """
    rater_names = list(ratings.keys())
    agreements = {}
    
    for r1, r2 in combinations(rater_names, 2):
        scores1 = np.array(ratings[r1])
        scores2 = np.array(ratings[r2])
        # Pearson correlation; evaluates to NaN if either rater gave constant scores
        correlation = np.corrcoef(scores1, scores2)[0, 1]
        mae = np.mean(np.abs(scores1 - scores2))
        exact_agreement = np.mean(scores1 == scores2)

        agreements[f"{r1}_vs_{r2}"] = {
            "correlation": float(correlation),
            "mae": float(mae),
            "exact_agreement": float(exact_agreement),
            # bool() converts numpy.bool_ so the result is JSON-serializable
            "acceptable": bool(correlation > 0.7 and mae < 0.8)
        }
    
    return agreements
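Correlation and mean absolute error describe how closely two raters track each other, but neither corrects for agreement that would occur by chance. A common complement for ordinal scales such as the 1-5 rubric above is Cohen's kappa with quadratic weights. The sketch below is a pure-Python version; the function name and default scale bounds are assumptions.

```python
from typing import Sequence

def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             min_rating: int = 1, max_rating: int = 5) -> float:
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    if len(a) != len(b) or not a:
        raise ValueError("Both raters must score the same non-empty set of items")
    k = max_rating - min_rating + 1
    n = len(a)

    # Confusion matrix: observed[i][j] counts items rated i by rater a and j by rater b
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # larger gaps are penalized more
            disagreement_observed += weight * observed[i][j]
            # Expected count if the two raters were statistically independent
            disagreement_expected += weight * row_totals[i] * col_totals[j] / n
    if disagreement_expected == 0:
        # Kappa is undefined when both raters used one identical rating throughout;
        # treated here as full agreement
        return 1.0
    return 1.0 - disagreement_observed / disagreement_expected
```

A value of 1.0 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.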

75.4 LLM-as-Judge

75.4.1 Concept and Use Cases

LLM-as-Judge uses a language model — typically a more capable model such as Claude Opus — to evaluate another model's output quality.

Well-suited for:

Tasks requiring subjective judgment with clear standards (writing fluency, logical coherence)
Large-scale evaluation where human cost is prohibitive
Rapid feedback loops during development iteration

Core challenges:

Position bias — The model tends to rate the first or last option higher
Verbosity bias — The model tends to prefer longer responses
Self-enhancement bias — The model tends to favor outputs stylistically similar to its own
Sycophancy bias — When pressed, the model tends to change its rating to please the questioner
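Of these biases, verbosity bias is among the easiest to probe: on a sample where human reviewers see no link between length and quality, judge scores should not track word count either. The sketch below computes the Pearson correlation between the two. The function name is hypothetical, and what counts as a worrying correlation is left to the reader, since it depends on the task.

```python
from math import sqrt
from typing import Sequence

def length_score_correlation(responses: Sequence[str], judge_scores: Sequence[float]) -> float:
    """Pearson correlation between response word count and judge score."""
    if len(responses) != len(judge_scores) or len(responses) < 2:
        raise ValueError("Need at least two (response, score) pairs of equal length")
    lengths = [len(r.split()) for r in responses]
    mean_len = sum(lengths) / len(lengths)
    mean_score = sum(judge_scores) / len(judge_scores)

    cov = sum((l - mean_len) * (s - mean_score) for l, s in zip(lengths, judge_scores))
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    var_score = sum((s - mean_score) ** 2 for s in judge_scores)
    if var_len == 0 or var_score == 0:
        # Correlation is undefined without variation; report no signal
        return 0.0
    return cov / sqrt(var_len * var_score)
```

A strongly positive value on such a sample suggests the judge is rewarding length itself, which is a cue to tighten the judge prompt or to control for length when comparing outputs.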

75.4.2 Designing a High-Quality Judge Prompt

JUDGE_SYSTEM_PROMPT = """You are an objective, rigorous AI output quality reviewer.

Your evaluations must:
1. Be based on the provided evaluation criteria, not personal preferences
2. Maintain consistency: outputs of equal quality should receive equal scores
3. Provide traceable reasoning: every score must be backed by specific textual evidence
4. Resist the following biases:
   - Do not give higher scores simply because a response is longer
   - Do not let presentation order (A vs B shown first) influence your judgment
   - Do not let flattering tone override substance

Important: Read all outputs completely before beginning your evaluation."""

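# Assumes a module-level `client` (for example, anthropic.Anthropic()) and a
# `parse_judge_response` helper defined elsewhere.
def llm_judge_single(task_description, user_input, model_output, evaluation_criteria) -> dict: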
    criteria_text = "\n".join([
        f"- {name} (weight {info['weight']}): {info['description']}"
        for name, info in evaluation_criteria.items()
    ])
    
    judge_prompt = f"""Evaluate the quality of the following AI response.

Task description: {task_description}

User input:
{user_input}

AI response:
{model_output}

Evaluation dimensions:
{criteria_text}

Output format:

<evaluation>
Analyze each dimension citing specific text as evidence
</evaluation>

<scores>
{{
    {", ".join([f'"{name}": {{"score": 1-10, "reasoning": "..."}}' for name in evaluation_criteria])}
}}
</scores>

<overall>
{{
    "weighted_score": weighted composite score (0-10),
    "verdict": "excellent/good/acceptable/poor/unacceptable",
    "key_strength": "most notable strength",
    "key_weakness": "most important improvement area"
}}
</overall>"""
    
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2000,
        system=JUDGE_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    
    return parse_judge_response(response.content[0].text)
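llm_judge_single returns the result of parse_judge_response, which the listing does not define. The following is a minimal sketch of one possible parser. It assumes the judge follows the requested format and places valid JSON inside the `<scores>` and `<overall>` tags; sections that are missing or malformed come back as None instead of raising. The helper name `extract_tag_json` is hypothetical.

```python
import json
import re
from typing import Any, Dict, Optional

def extract_tag_json(text: str, tag: str) -> Optional[Dict[str, Any]]:
    """Parse the JSON object inside <tag>...</tag>, or return None if absent or invalid."""
    match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", text, re.DOTALL)
    if not match:
        return None
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

def parse_judge_response(text: str) -> Dict[str, Any]:
    """Collect the structured parts of a judge reply into one dict."""
    return {
        "scores": extract_tag_json(text, "scores"),
        "overall": extract_tag_json(text, "overall"),
        "raw": text,  # kept so malformed replies can be inspected later
    }
```

Returning None for a malformed section lets the caller decide whether to retry the judge call or discard the sample, rather than aborting a whole evaluation run on one bad reply.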

75.4.3 Pairwise Comparison

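Pairwise comparison is often more reliable than absolute scoring: the judge only has to decide which of two outputs is better, rather than map quality onto an absolute scale whose calibration can drift between runs.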

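# Assumes a module-level `client` (for example, anthropic.Anthropic()) and a
# `parse_pairwise_response` helper defined elsewhere.
def llm_judge_pairwise(task_description, user_input, output_a, output_b,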
                       evaluation_criteria, randomize_order=True) -> dict:
    import random
    
    swapped = False
    if randomize_order and random.random() > 0.5:
        output_a, output_b = output_b, output_a
        swapped = True
    
    criteria_text = "\n".join([
        f"- {name}: {info['description']}"
        for name, info in evaluation_criteria.items()
    ])
    
    judge_prompt = f"""Compare the following two AI responses and determine which is better.

Task: {task_description}

User input:
{user_input}

Response A:
{output_a}

Response B:
{output_b}

Evaluation dimensions:
{criteria_text}

Important:
- Do not favor longer responses simply for their length
- Base your judgment solely on content quality, ignoring presentation order
- Choose "tie" if quality is genuinely equal

Output:
<comparison>
Comparative analysis across each dimension
</comparison>

<verdict>
{{
    "winner": "A" or "B" or "tie",
    "confidence": "high/medium/low",
    "reasoning": "core justification (1-2 sentences)"
}}
</verdict>"""
    
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1500,
        system=JUDGE_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    
    result = parse_pairwise_response(response.content[0].text)
    
    # Restore original ordering if swapped
    if swapped and result.get("winner") in ["A", "B"]:
        result["winner"] = "B" if result["winner"] == "A" else "A"
    
    return result
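Randomizing the order averages position bias out over many comparisons, but any single verdict may still be driven by position. A stricter variant judges each pair in both orders and accepts a winner only when the two verdicts agree. The sketch below shows the idea; `judge_fn` is a hypothetical callable that takes the two outputs in display order and returns "A", "B", or "tie".

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "A", "B", or "tie" for (first, second)

def judge_both_orders(judge_fn: Judge, output_1: str, output_2: str) -> str:
    """Return "1", "2", or "tie"; disagreement between the two orders counts as a tie."""
    forward = judge_fn(output_1, output_2)   # output_1 is shown as Response A
    backward = judge_fn(output_2, output_1)  # output_1 is shown as Response B

    # Translate position labels back to the underlying outputs
    forward_winner = {"A": "1", "B": "2"}.get(forward, "tie")
    backward_winner = {"A": "2", "B": "1"}.get(backward, "tie")
    return forward_winner if forward_winner == backward_winner else "tie"
```

This doubles the number of judge calls, so it is a trade-off between cost and confidence in individual verdicts.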

75.4.4 Bias Calibration

def calibrate_judge_bias(judge_fn, calibration_set: list) -> dict:
    """
    calibration_set: [
        {
            "input": ...,
            "output_better": ...,   # human-confirmed better output
            "output_worse": ...,    # human-confirmed worse output
        }
    ]
    """
    position_a_correct = []
    position_b_correct = []
    
    for item in calibration_set:
        # Test 1: better output in position A
        result_a = judge_fn(item["input"], item["output_better"], item["output_worse"])
        position_a_correct.append(result_a.get("winner") == "A")
        
        # Test 2: better output in position B
        result_b = judge_fn(item["input"], item["output_worse"], item["output_better"])
        position_b_correct.append(result_b.get("winner") == "B")
    
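    # Guard against division by zero on an empty calibration set
    if not calibration_set:
        raise ValueError("calibration_set must contain at least one item")

    acc_a = sum(position_a_correct) / len(position_a_correct)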
    acc_b = sum(position_b_correct) / len(position_b_correct)
    
    return {
        "accuracy_when_better_is_a": acc_a,
        "accuracy_when_better_is_b": acc_b,
        "position_bias": abs(acc_a - acc_b),
        "overall_accuracy": (acc_a + acc_b) / 2,
        "bias_detected": abs(acc_a - acc_b) > 0.1
    }

75.5 Three-Layer Evaluation Pipeline Architecture

Input test set
      ↓
[Layer 1] Automated Testing (format checks, basic correctness)
      ↓ Pass rate < 80%? → Alert, block deployment
      ↓
[Layer 2] LLM-as-Judge (subjective quality, multi-dimensional scoring)
      ↓ Score drops > 5%? → Trigger human review
      ↓
[Layer 3] Human Evaluation (edge cases, safety, novel scenarios)
      ↓
Evaluation report → Optimization decisions
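The two gates in the diagram can be written as a small decision function. The 80% and 5% thresholds come from the diagram; the function name and return labels are hypothetical, and reading "score drops > 5%" as a relative drop against a baseline score is an assumption.

```python
def next_step(auto_pass_rate: float, judge_score: float, baseline_judge_score: float) -> str:
    """Map layer results to the action suggested by the pipeline diagram."""
    # Layer 1 gate: too many automated failures blocks the deployment outright
    if auto_pass_rate < 0.80:
        return "block_deployment"
    # Layer 2 gate: a relative drop in judge score above 5% escalates to humans
    if baseline_judge_score > 0:
        relative_drop = (baseline_judge_score - judge_score) / baseline_judge_score
        if relative_drop > 0.05:
            return "human_review"
    return "report"
```

The sketch covers only the two automatic gates; scheduled human review of edge cases, safety, and novel scenarios in Layer 3 would run regardless of its result.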

75.5.1 Metrics Dashboard

class EvaluationDashboard:
    def __init__(self):
        self.metrics_history = []
    
    def record_evaluation(self, prompt_version, auto_results, llm_results, timestamp):
        self.metrics_history.append({
            "version": prompt_version,
            "timestamp": timestamp,
            "auto_pass_rate": auto_results.get("pass_rate", 0),
            "llm_overall_score": llm_results.get("overall_score", 0),
            "llm_dimension_scores": llm_results.get("dimension_scores", {})
        })
    
    def detect_regression(self, current_version, baseline_version) -> dict:
        current = next((m for m in self.metrics_history if m["version"] == current_version), None)
        baseline = next((m for m in self.metrics_history if m["version"] == baseline_version), None)
        
        if not current or not baseline:
            return {"error": "Version not found"}
        
        return {
            "auto_pass_rate_change": current["auto_pass_rate"] - baseline["auto_pass_rate"],
            "llm_score_change": current["llm_overall_score"] - baseline["llm_overall_score"],
            "has_regression": (
                current["auto_pass_rate"] < baseline["auto_pass_rate"] - 0.05 or
                current["llm_overall_score"] < baseline["llm_overall_score"] - 0.3
            )
        }

75.6 Building and Maintaining Evaluation Datasets

High-quality test sets must be:

Representative: Cover the major scenario distribution of the task
Challenging: Include sufficient difficult cases and edge conditions
Leakage-free: Test cases must not be used while optimizing the prompt, for example as few-shot examples or as data for tuning it
Versioned: The test set itself requires version control
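The leakage requirement can be checked mechanically, at least for near-verbatim copies. The sketch below compares normalized test inputs against the examples used while tuning the prompt. Function names are hypothetical, and the normalization (lowercasing and collapsing whitespace) is a deliberately simple assumption that will not catch paraphrased duplicates.

```python
import hashlib
from typing import Iterable, List

def _fingerprint(text: str) -> str:
    """Hash of the text with case and whitespace differences removed."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_leaked_cases(test_inputs: Iterable[str], tuning_examples: Iterable[str]) -> List[str]:
    """Return test inputs that also appear among the examples used for prompt tuning."""
    seen = {_fingerprint(example) for example in tuning_examples}
    return [text for text in test_inputs if _fingerprint(text) in seen]
```

Running such a check whenever the test set or the prompt's examples change helps keep reported scores from being inflated by memorized cases.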

Mine new test cases from production failures:

def mine_failure_cases_from_production(production_logs: list) -> list:
    new_cases = []
    for log in production_logs:
        if log.get("user_feedback") == "negative":
            new_cases.append({
                "input": log["input"],
                "output": log["output"],
                "issue": "user_negative_feedback",
                "source": "production",
                # Assumes reviewed logs carry a "review_status" field;
                # unreviewed logs default to "pending"
                "review_status": log.get("review_status", "pending")
            })
    # Only confirmed issues graduate to the formal test set
    return [c for c in new_cases if c.get("review_status") == "confirmed_issue"]

Summary

A prompt evaluation framework is what turns prompt quality from "feels good" into something "measurable and improvable." Automated testing enforces functional consistency. Human evaluation establishes the gold standard for quality. LLM-as-Judge provides scalable subjective quality evaluation between the two.

In practice, a typical allocation is: automated testing covers 70-80% of routine scenarios, LLM-as-Judge handles subjective dimensions, and human evaluation focuses on high-value safety reviews and novel scenarios. Bias awareness, calibration mechanisms, and continuously maintained test sets are what keep the system effective over time.
