Chapter 67: Prompt Version Management: Prompt Engineering Practice Under GitOps Thinking

67.1 Prompts Are Code

In the early exploratory phase of AI applications, prompts were often treated as "black magic" — engineers would find effective prompt text through trial and error, then hardcode it in source files or scatter it across random configuration files, with little thought given to version control.

As AI applications mature toward production, the costs of this casual management approach become apparent. Their root cause is a failure to treat prompts as first-class citizens in software engineering.

This chapter systematically introduces how to bring GitOps thinking into prompt management, building a complete engineering system from version control to A/B testing, from CI/CD to automated evaluation.

67.2 Prompt File Organization Structure

Directory Layout Design

A mature prompt repository should have a clear directory structure:

prompts/
├── README.md
├── schemas/
│   ├── prompt_manifest.json
│   └── eval_config.json
├── features/
│   ├── customer_service/
│   │   ├── intent_classification/
│   │   │   ├── v1.0.0.yaml
│   │   │   ├── v1.1.0.yaml
│   │   │   └── current -> v1.1.0.yaml  # symlink
│   │   └── response_generation/
│   │       ├── v2.0.0.yaml
│   │       └── current -> v2.0.0.yaml
│   ├── code_review/
│   └── document_summary/
├── shared/
│   ├── persona_base.yaml
│   ├── safety_instructions.yaml
│   └── output_formats.yaml
└── experiments/
    └── code_review_v3_test.yaml
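To make the layout concrete, here is a minimal sketch of how application code might turn a prompt id and a version into a file path. The function name `resolve_prompt_path` and the `root` argument are hypothetical; the only things taken from the layout above are the `features/` directory, the `vX.Y.Z.yaml` naming, and the `current` symlink.

```python
from pathlib import Path


def resolve_prompt_path(root: Path, prompt_id: str, version: str = "current") -> Path:
    """Return the concrete versioned file for a prompt.

    prompt_id is a path-like id such as "customer_service/intent_classification".
    version is either "current" (the symlink) or an explicit version like "1.1.0".
    """
    prompt_dir = root / "features" / prompt_id
    name = "current" if version == "current" else f"v{version}.yaml"
    candidate = prompt_dir / name
    if not candidate.exists():
        raise FileNotFoundError(f"No prompt file for {prompt_id} at version {version}")
    # resolve() follows the `current` symlink to the real vX.Y.Z.yaml file
    return candidate.resolve()
```

Resolving the symlink at load time means logs and metrics can record the concrete version that served a request, rather than the ambiguous label "current".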

Prompt File Format

Use YAML to store prompts with complete metadata:

# features/customer_service/intent_classification/v1.1.0.yaml

metadata:
  id: "customer_service/intent_classification"
  version: "1.1.0"
  created_at: "2024-11-15T09:30:00Z"
  created_by: "[email protected]"
  description: "Classify user questions into predefined intent categories"
  changelog: |
    v1.1.0: Added RETURN_REQUEST intent category, improved ambiguity handling
    v1.0.0: Initial version with 7 intent categories
  tags:
    - "customer-service"
    - "classification"
    - "production"

model:
  id: "claude-3-5-haiku-20241022"
  max_tokens: 256
  temperature: 0.1    # low temperature for classification stability

prompt:
  system: |
    You are an intent classification assistant. Classify user customer service 
    inquiries into one of the following categories:
    
    - ORDER_STATUS: inquiries about order status and shipping
    - RETURN_REQUEST: requests for returns or refunds
    - PRODUCT_INQUIRY: product information and specification queries
    - COMPLAINT: complaints and dissatisfaction feedback
    - TECHNICAL_SUPPORT: technical issues and troubleshooting
    - ACCOUNT_ISSUE: account login and settings issues
    - OTHER: none of the above categories apply
    
    Output format:
    {
      "intent": "<category name>",
      "confidence": <float between 0-1>,
      "reason": "<one sentence explaining the classification>"
    }
    
    Output JSON only, with no other text.
  
  user_template: |
    User question: {{user_message}}

evaluation:
  test_cases_path: "tests/intent_classification_v1.yaml"
  metrics:
    - accuracy
    - avg_confidence
  pass_threshold:
    accuracy: 0.92

dependencies:
  - shared/safety_instructions.yaml
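A file format is only useful if malformed files are caught early. The sketch below checks an already-parsed prompt file (a plain `dict`, for example the result of `yaml.safe_load`) against the structure shown above. The function name `validate_prompt_config` and the specific rules are assumptions chosen for illustration, not a fixed schema.

```python
import re

SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+$")
REQUIRED_SECTIONS = ("metadata", "model", "prompt", "evaluation")


def validate_prompt_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            problems.append(f"missing section: {section}")

    version = str(config.get("metadata", {}).get("version", ""))
    if not SEMVER_RE.match(version):
        problems.append(f"version is not MAJOR.MINOR.PATCH: {version!r}")

    if not config.get("prompt", {}).get("system"):
        problems.append("prompt.system is empty")

    # Thresholds such as accuracy are fractions, so they must lie in [0, 1]
    thresholds = config.get("evaluation", {}).get("pass_threshold", {})
    for metric, value in thresholds.items():
        if not 0.0 <= value <= 1.0:
            problems.append(f"threshold for {metric} is outside [0, 1]: {value}")
    return problems
```

Returning a list instead of raising on the first problem lets a CI job report every issue in one run.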

67.3 Git Workflow Design

Branching Strategy

Borrowing from Git Flow, define a clear branching strategy for prompts:

main                    # Prompts used in production
├── develop             # Prompts under development
│   ├── feature/add-refund-intent
│   ├── fix/classification-bug
│   └── experiment/gpt4-comparison
└── hotfix/urgent-safety-fix

Commit Message Conventions

Define structured commit messages for prompt changes:

<type>(<scope>): <description>

[body]

[footer]

Types: feat, fix, refactor, perf (the change types listed in the PR template).

Example:

feat(intent_classification): add RETURN_REQUEST intent category

In v1.0.0, return-related inquiries were frequently misclassified as 
ORDER_STATUS, routing customers to the wrong handling team. This adds
a RETURN_REQUEST category and improves ambiguous case handling logic.

Test results: return scenario accuracy improved from 78% to 96%
Scope: only affects customer_service/intent_classification
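A convention like this is easier to keep when it is checked automatically, for example in a commit-msg hook. The following is a minimal sketch: the function `check_commit_message` is hypothetical, and the allowed types are assumed to be the four change types that appear in this chapter's PR template.

```python
import re

# Assumed set of types, taken from the PR template in this chapter; extend as needed.
ALLOWED_TYPES = ("feat", "fix", "refactor", "perf")
HEADER_RE = re.compile(r"^(?P<type>[a-z]+)\((?P<scope>[\w/\-]+)\): (?P<description>.+)$")


def check_commit_message(message: str) -> list[str]:
    """Return a list of convention violations; an empty list means the message passes."""
    lines = message.splitlines()
    if not lines:
        return ["empty commit message"]

    errors = []
    match = HEADER_RE.match(lines[0])
    if match is None:
        errors.append("header must look like '<type>(<scope>): <description>'")
    elif match.group("type") not in ALLOWED_TYPES:
        errors.append(f"unknown type: {match.group('type')}")

    # Git convention: the second line separates the header from the body
    if len(lines) > 1 and lines[1].strip():
        errors.append("leave a blank line between header and body")
    return errors
```

The example commit above passes this check, while a bare message such as "update prompt" is rejected because it has no type or scope.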

Pull Request Workflow

All prompt changes should go through the PR process:

## Prompt Change Description

**Change type**: [ ] feat [ ] fix [ ] refactor [ ] perf

**Affected Prompts**:
- `features/customer_service/intent_classification/v1.1.0.yaml`

**Motivation**:
> Explain why this change is needed

**Changes**:
> Describe what specifically changed

**Test Results**:
| Metric | Before | After |
|--------|--------|-------|
| Accuracy | 89% | 96% |
| Avg confidence | 0.81 | 0.87 |

**Rollback Plan**:
> How to quickly roll back if the new version has issues

**Checklist**:
- [ ] Version number updated
- [ ] Changelog updated
- [ ] Test cases added/updated
- [ ] Automated evaluation run
- [ ] Human review completed

67.4 Automated Evaluation Pipeline

Evaluation Framework Design

Automated evaluation is the core of prompt CI/CD. Every PR should trigger an evaluation pipeline:

import yaml
import json
from dataclasses import dataclass
from typing import List, Optional
import anthropic

@dataclass
class EvalCase:
    input: dict
    expected_output: str
    metadata: dict

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float
    actual_output: str
    expected_output: str
    latency_ms: float
    error: Optional[str] = None

class PromptEvaluator:
    def __init__(self, client: anthropic.Anthropic):
        self.client = client
    
    def run_eval(
        self,
        prompt_config: dict,
        test_cases: List[EvalCase],
        metric_functions: dict
    ) -> dict:
        results = []
        
        for i, case in enumerate(test_cases):
            result = self._run_single_case(prompt_config, case, i)
            results.append(result)
        
        summary = {}
        for metric_name, metric_fn in metric_functions.items():
            summary[metric_name] = metric_fn(results)
        
        thresholds = prompt_config.get("evaluation", {}).get("pass_threshold", {})
        passed = all(
            summary.get(metric, 0) >= threshold
            for metric, threshold in thresholds.items()
        )
        
        return {
            "passed": passed,
            "summary": summary,
            "details": [vars(r) for r in results],
            "total_cases": len(results),
            "passed_cases": sum(1 for r in results if r.passed)
        }
    
    def _run_single_case(self, prompt_config, case, case_index):
        import time
        
        user_message = self._render_template(
            prompt_config["prompt"]["user_template"],
            case.input
        )
        
        start_time = time.time()
        try:
            response = self.client.messages.create(
                model=prompt_config["model"]["id"],
                max_tokens=prompt_config["model"]["max_tokens"],
                temperature=prompt_config["model"].get("temperature", 0.7),
                system=prompt_config["prompt"]["system"],
                messages=[{"role": "user", "content": user_message}]
            )
            
            actual_output = response.content[0].text
            latency_ms = (time.time() - start_time) * 1000
            score = self._score_output(actual_output, case.expected_output)
            
            return EvalResult(
                case_id=f"case_{case_index}",
                passed=score >= 0.8,
                score=score,
                actual_output=actual_output,
                expected_output=case.expected_output,
                latency_ms=latency_ms
            )
        except Exception as e:
            return EvalResult(
                case_id=f"case_{case_index}",
                passed=False,
                score=0.0,
                actual_output="",
                expected_output=case.expected_output,
                latency_ms=(time.time() - start_time) * 1000,
                error=str(e)
            )
    
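    def _render_template(self, template: str, variables: dict) -> str:
        # Replace each {{key}} placeholder with the matching input value
        result = template
        for key, value in variables.items():
            result = result.replace(f"{{{{{key}}}}}", str(value))
        return result

    def _score_output(self, actual_output: str, expected_output: str) -> float:
        # Assumed scoring rule (the original does not define one): exact match
        # on the "intent" field for JSON output, else on the raw text.
        try:
            parsed = json.loads(actual_output)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict) and "intent" in parsed:
            return float(parsed["intent"] == expected_output)
        return float(actual_output.strip() == expected_output.strip())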
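`run_eval` expects `metric_functions` to map a metric name to a function over the list of results, but the chapter does not show any such function. Below is one possible pair, matching the `accuracy` and `avg_confidence` metrics named in the prompt file. The `Result` class is a hypothetical stand-in for `EvalResult` that carries only the two fields these metrics read, and the definition of `avg_confidence` is an assumption.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    """Minimal stand-in for EvalResult with only the fields the metrics need."""
    passed: bool
    actual_output: str


def accuracy(results: List[Result]) -> float:
    """Fraction of cases marked as passed."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.passed) / len(results)


def avg_confidence(results: List[Result]) -> float:
    """Mean of the "confidence" field over outputs that parse as JSON objects."""
    values = []
    for r in results:
        try:
            parsed = json.loads(r.actual_output)
        except json.JSONDecodeError:
            continue  # skip malformed outputs rather than failing the whole run
        if isinstance(parsed, dict) and isinstance(parsed.get("confidence"), (int, float)):
            values.append(float(parsed["confidence"]))
    return sum(values) / len(values) if values else 0.0


metric_functions = {"accuracy": accuracy, "avg_confidence": avg_confidence}
```

Because both functions read attributes by name, they work unchanged on real `EvalResult` objects.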

GitHub Actions CI/CD

# .github/workflows/prompt-eval.yml

name: Prompt Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main...HEAD can be diffed
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: pip install anthropic pyyaml pandas
      
      - name: Detect changed prompts
        id: changes
        run: |
          # Join file names with spaces so the value fits on a single output line
          CHANGED=$(git diff --name-only origin/main...HEAD -- 'prompts/**/*.yaml' | tr '\n' ' ')
          echo "changed_prompts=$CHANGED" >> "$GITHUB_OUTPUT"
      
      - name: Run evaluations
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python eval/run_ci_eval.py \
            --changed-prompts "${{ steps.changes.outputs.changed_prompts }}" \
            --output-file eval_results.json
      
      - name: Comment PR with results
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results.json'));
            
            let comment = '## Prompt Evaluation Results\n\n';
            
            for (const [promptId, result] of Object.entries(results)) {
              const status = result.passed ? 'PASSED' : 'FAILED';
              comment += `### ${promptId} - ${status}\n`;
              comment += `- Accuracy: ${(result.summary.accuracy * 100).toFixed(1)}%\n`;
              comment += `- Pass threshold: ${result.thresholds.accuracy * 100}%\n\n`;
            }
            
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
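The workflow calls `eval/run_ci_eval.py`, which the chapter does not show. The sketch below covers only its input handling: parsing the two flags used in the workflow and mapping changed file paths to prompt ids. It assumes the paths arrive as a whitespace-separated string and that prompt ids follow the `features/<area>/<name>` layout from section 67.2. All function names are hypothetical, and the evaluation step itself is left out.

```python
import argparse
from pathlib import PurePosixPath
from typing import List, Optional


def prompt_id_from_path(path: str) -> Optional[str]:
    """Map 'prompts/features/<area>/<name>/vX.Y.Z.yaml' to '<area>/<name>'."""
    parts = PurePosixPath(path).parts
    if len(parts) == 5 and parts[:2] == ("prompts", "features") and parts[4].endswith(".yaml"):
        return f"{parts[2]}/{parts[3]}"
    return None  # shared/ and experiments/ files are ignored in this sketch


def changed_prompt_ids(raw: str) -> List[str]:
    """Deduplicate and sort the prompt ids touched by a whitespace-separated path list."""
    ids = {prompt_id_from_path(p) for p in raw.split()}
    return sorted(i for i in ids if i is not None)


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Evaluate changed prompts")
    parser.add_argument("--changed-prompts", default="", help="whitespace-separated file paths")
    parser.add_argument("--output-file", required=True)
    return parser.parse_args(argv)
```

Deduplicating by prompt id means a PR that touches two version files of the same prompt triggers a single evaluation for it.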

67.5 A/B Testing Framework

Traffic Splitting Mechanism

In production, A/B testing lets you validate a new prompt's effectiveness gradually, on a controlled share of real traffic:

import hashlib
from datetime import datetime
from typing import Optional

class PromptABTestManager:
    """
    Stable hash-based user traffic splitting.
    Ensures the same user always sees the same version (sticky assignment).
    """
    
    def __init__(self, prompt_store, metrics_store=None):
        self.prompt_store = prompt_store
        self.metrics_store = metrics_store  # required by analyze_experiment
        self.experiments = {}
    
    def register_experiment(
        self,
        experiment_id: str,
        control_version: str,
        treatment_version: str,
        traffic_split: float = 0.1,
        metrics: list = None
    ):
        self.experiments[experiment_id] = {
            "control": control_version,
            "treatment": treatment_version,
            "traffic_split": traffic_split,
            "metrics": metrics or ["user_satisfaction", "accuracy"],
            "started_at": datetime.utcnow().isoformat()
        }
    
    def get_prompt_variant(
        self,
        prompt_id: str,
        user_id: str,
        experiment_id: Optional[str] = None
    ) -> tuple:
        """
        Returns (prompt_config, variant_name).
        variant_name: "control" or "treatment"
        """
        if not experiment_id or experiment_id not in self.experiments:
            return self.prompt_store.get_current(prompt_id), "control"
        
        experiment = self.experiments[experiment_id]
        
        hash_input = f"{user_id}:{experiment_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_value % 100) / 100
        
        if bucket < experiment["traffic_split"]:
            variant = "treatment"
            version = experiment["treatment"]
        else:
            variant = "control"
            version = experiment["control"]
        
        prompt = self.prompt_store.get_version(prompt_id, version)
        return prompt, variant
    
    def analyze_experiment(self, experiment_id: str) -> dict:
        control_metrics = self.metrics_store.aggregate(
            experiment_id=experiment_id, variant="control"
        )
        treatment_metrics = self.metrics_store.aggregate(
            experiment_id=experiment_id, variant="treatment"
        )
        
        return {
            "experiment_id": experiment_id,
            "sample_sizes": {
                "control": control_metrics["n"],
                "treatment": treatment_metrics["n"]
            },
            "metrics": {
                metric: {
                    "control": control_metrics[metric],
                    "treatment": treatment_metrics[metric],
                    "lift": (treatment_metrics[metric] - control_metrics[metric]) 
                             / control_metrics[metric]
                }
                for metric in self.experiments[experiment_id]["metrics"]
            }
        }
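The two properties claimed for the hash-based split, stickiness and an approximately correct traffic share, can be checked in isolation. This sketch lifts the bucketing logic from `get_prompt_variant` into a standalone function; `assign_variant` and the synthetic user ids are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, traffic_split: float) -> str:
    """Same scheme as get_prompt_variant: md5 of 'user:experiment' mapped to 100 buckets."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 100) / 100
    return "treatment" if bucket < traffic_split else "control"


# Sticky: repeated calls for one user always give the same answer.
first = assign_variant("user-42", "exp-1", 0.1)
assert all(assign_variant("user-42", "exp-1", 0.1) == first for _ in range(5))

# Roughly proportional: close to 10% of a large synthetic population gets the treatment.
users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(u, "exp-1", 0.1) == "treatment" for u in users) / len(users)
print(f"treatment share: {share:.3f}")
```

Including `experiment_id` in the hash input means a user's bucket in one experiment does not determine their bucket in another, so the same users are not always the ones exposed to new variants. With 100 buckets, the split can only be set in whole-percent steps.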

Canary Releases for High-Risk Changes

For high-risk prompt changes, a progressive canary release strategy is recommended:

class CanaryReleaseManager:
    """
    Progressive prompt release:
    1% → 5% → 20% → 50% → 100% traffic incrementally
    """
    
    CANARY_STAGES = [
        {"traffic_pct": 1,   "duration_hours": 2,  "error_threshold": 0.05},
        {"traffic_pct": 5,   "duration_hours": 4,  "error_threshold": 0.03},
        {"traffic_pct": 20,  "duration_hours": 8,  "error_threshold": 0.02},
        {"traffic_pct": 50,  "duration_hours": 12, "error_threshold": 0.02},
        {"traffic_pct": 100, "duration_hours": 0,  "error_threshold": 0.02},
    ]
    
    def advance_canary(self, prompt_id: str, deployment_id: str) -> dict:
        current_stage = self.get_current_stage(deployment_id)
        metrics = self.get_canary_metrics(deployment_id)
        
        if metrics["error_rate"] > current_stage["error_threshold"]:
            self.rollback(deployment_id)
            return {"action": "rollback", "reason": "error_rate_exceeded"}
        
        next_stage_idx = self.CANARY_STAGES.index(current_stage) + 1
        if next_stage_idx < len(self.CANARY_STAGES):
            next_stage = self.CANARY_STAGES[next_stage_idx]
            self.set_traffic(prompt_id, next_stage["traffic_pct"])
            return {"action": "advanced", "new_traffic_pct": next_stage["traffic_pct"]}
        
        return {"action": "complete", "message": "Full rollout complete"}
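`advance_canary` depends on helpers such as `get_current_stage` and `rollback` that are not shown. To illustrate the decision logic on its own, here is a simplified in-memory sketch. The `Deployment` class and `advance` function are hypothetical, the stage values mirror `CANARY_STAGES`, and the `duration_hours` dwell time is deliberately ignored.

```python
from dataclasses import dataclass

# (traffic_pct, error_threshold), mirroring CANARY_STAGES above
STAGES = [(1, 0.05), (5, 0.03), (20, 0.02), (50, 0.02), (100, 0.02)]


@dataclass
class Deployment:
    stage_idx: int = 0
    rolled_back: bool = False

    @property
    def traffic_pct(self) -> int:
        # A rolled-back deployment serves no traffic
        return 0 if self.rolled_back else STAGES[self.stage_idx][0]


def advance(deployment: Deployment, error_rate: float) -> str:
    """Apply one advance_canary decision given the observed error rate."""
    _, threshold = STAGES[deployment.stage_idx]
    if error_rate > threshold:
        deployment.rolled_back = True
        return "rollback"
    if deployment.stage_idx + 1 < len(STAGES):
        deployment.stage_idx += 1
        return "advanced"
    return "complete"


d = Deployment()
for rate in (0.01, 0.01, 0.01, 0.01, 0.01):
    print(advance(d, rate), d.traffic_pct)
```

A production version would also refuse to advance until the current stage has run for its configured duration; that check is omitted here to keep the sketch short.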

67.6 Prompt Registry

Centralized Prompt Storage

In production, all prompts should be managed through a unified registry rather than reading directly from Git repositories:

import redis
import yaml

class PromptRegistry:
    """
    Centralized prompt registry.
    - Caches hot prompts (Redis)
    - Supports version history queries
    - Provides change notifications (Webhook/message queue)
    """
    
    def __init__(self, git_backend, cache: redis.Redis):
        self.git = git_backend
        self.cache = cache
        self.TTL = 300  # 5-minute cache
    
    def get(self, prompt_id: str, version: str = "current") -> dict:
        cache_key = f"prompt:{prompt_id}:{version}"
        
        cached = self.cache.get(cache_key)
        if cached:
            return yaml.safe_load(cached)
        
        prompt = self.git.load(prompt_id, version)
        self.cache.setex(cache_key, self.TTL, yaml.dump(prompt))
        
        return prompt
    
    def publish(self, prompt_id: str, version: str, config: dict):
        """Publish a new version."""
        self.git.save(prompt_id, version, config)
        
        # Invalidate cache
        self.cache.delete(f"prompt:{prompt_id}:current")
        self.cache.delete(f"prompt:{prompt_id}:{version}")
        
        self._notify_subscribers(prompt_id, version)
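The read path in `get` is a read-through cache with a TTL, and `publish` relies on explicit invalidation. To test that behavior without a Redis server, one option is a small in-memory stand-in that offers the three calls used above (`get`, `setex`, `delete`). `TTLCache` and `read_through` are hypothetical names, and the injectable clock exists only to make expiry testable.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for the Redis calls used above (get/setex/delete)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None or entry[0] <= self._clock():
            self._items.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def setex(self, key, ttl_seconds, value):
        self._items[key] = (self._clock() + ttl_seconds, value)

    def delete(self, key):
        self._items.pop(key, None)


def read_through(cache, key, loader, ttl_seconds=300):
    """Return the cached value, calling loader() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.setex(key, ttl_seconds, value)
    return value
```

With a 300-second TTL, a stale `current` entry could otherwise be served for up to five minutes after a publish, which is why `publish` deletes the affected keys explicitly.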

67.7 Best Practices Summary

Golden Rules for Prompt Versioning

Never modify production versions directly. All modifications go through the PR process and are evaluated and reviewed before merging to main.

Use semantic versioning. Follow SemVer's MAJOR.MINOR.PATCH scheme, as in the v1.0.0 and v1.1.0 file names used throughout this chapter.

Preserve historical versions. Don't delete old versions. History is the foundation for rollbacks and a record of prompt evolution.

Keep test cases synchronized with prompts. When a prompt gains new capabilities, test cases must gain corresponding coverage.

Treat prompt changes as high-risk code changes. In code review, prompt changes require at least one person familiar with the business domain to review — not just technical review.
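For the semantic versioning rule, a few lines of code are enough to bump versions consistently and to find the newest file in a prompt directory. The helpers below are a hypothetical sketch; which kinds of prompt change count as major, minor, or patch is a team decision that the code leaves open.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse '1.1.0' or 'v1.1.0' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


def bump(version: str, level: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse_version(version)
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level}")


def latest(versions: List[str]) -> str:
    # Compare numerically: as plain strings, "1.10.0" would sort before "1.9.0"
    return max(versions, key=parse_version)
```

For example, `bump("1.1.0", "minor")` gives `"1.2.0"`, and `latest(["1.9.0", "1.10.0"])` gives `"1.10.0"`.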

Common Anti-Patterns

Anti-pattern 1: Storing prompts in a database without versioning. Prompts stored only in a database have no history, cannot be rolled back, and cannot be diffed. Use Git as the versioning store and treat databases only as a query cache.

Anti-pattern 2: Embedding business data in prompts. Hardcoding mutable business data like company names or product prices in prompts means business data changes require prompt version bumps. Use template variables to separate business data from prompt structure.

Anti-pattern 3: No rollback plan. Before every prompt deployment, explicitly document rollback steps and trigger criteria (under what conditions rollback should be initiated).
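The fix for anti-pattern 2 can be sketched in a few lines: business data is passed in as variables at render time, and rendering fails loudly if a variable is missing instead of sending a literal `{{placeholder}}` to the model. The function `render_strict`, the variable names, and the sample values are hypothetical.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_strict(template: str, variables: dict) -> str:
    """Fill {{name}} placeholders and raise if any variable is missing."""
    missing = sorted({name for name in PLACEHOLDER_RE.findall(template) if name not in variables})
    if missing:
        raise KeyError(f"missing template variables: {', '.join(missing)}")
    return PLACEHOLDER_RE.sub(lambda m: str(variables[m.group(1)]), template)


template = (
    "You are the support assistant for {{company_name}}. "
    "The return window is {{return_days}} days."
)
print(render_strict(template, {"company_name": "Example Corp", "return_days": 30}))
```

With this separation, changing the return window from 30 to 14 days is a data change and needs no new prompt version.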


Summary

Prompt engineering under GitOps thinking is the systematic application of software engineering best practices — version control, CI/CD, A/B testing, progressive releases — to the domain of prompt management.

The core conceptual shift is: prompts are not configuration parameters but part of software logic, deserving the same rigorous engineering controls as code. The short-term investment in building this system pays dividends continuously as AI applications mature — through faster iteration velocity, higher quality stability, and lower debugging costs.
