Chapter 67: Prompt Version Management: Prompt Engineering Practice Under GitOps Thinking
67.1 Prompts Are Code
In the early exploratory phase of AI applications, prompts were often treated as "black magic" — engineers would find effective prompt text through trial and error, then hardcode it in source files or scatter it across random configuration files, with little thought given to version control.
As AI applications mature toward production, the costs of this casual management approach become apparent:
- Someone modified prompt wording last week, performance degraded today, but no one remembers what changed
- Three team members each maintain different versions of the same feature's prompt
- After a new prompt is deployed, output quality degrades for certain use cases, but no one knows how to roll back
- No ability to A/B test old and new prompts to verify whether changes actually improve things
The root cause of these problems is not treating prompts as first-class citizens in software engineering.
This chapter systematically introduces how to bring GitOps thinking into prompt management, building a complete engineering system from version control to A/B testing, from CI/CD to automated evaluation.
67.2 Prompt File Organization Structure
Directory Layout Design
A mature prompt repository should have a clear directory structure:
prompts/
├── README.md
├── schemas/
│   ├── prompt_manifest.json
│   └── eval_config.json
├── features/
│   ├── customer_service/
│   │   ├── intent_classification/
│   │   │   ├── v1.0.0.yaml
│   │   │   ├── v1.1.0.yaml
│   │   │   └── current -> v1.1.0.yaml   # symlink
│   │   └── response_generation/
│   │       ├── v2.0.0.yaml
│   │       └── current -> v2.0.0.yaml
│   ├── code_review/
│   └── document_summary/
├── shared/
│   ├── persona_base.yaml
│   ├── safety_instructions.yaml
│   └── output_formats.yaml
└── experiments/
    └── code_review_v3_test.yaml
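The `current` symlink has to point at some version file, and choosing the newest one by name is easy to get wrong with plain string sorting. The following is a minimal sketch, assuming files are named `v<major>.<minor>.<patch>.yaml` as in the tree above; `latest_version` is a hypothetical helper, not part of any tool.

```python
import re
from typing import Iterable, Optional

# Matches file names such as "v1.10.0.yaml" (naming taken from the tree above).
_VERSION_FILE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)\.yaml$")


def latest_version(file_names: Iterable[str]) -> Optional[str]:
    """Return the file name with the highest semantic version, or None."""
    best_key, best_name = None, None
    for name in file_names:
        match = _VERSION_FILE.match(name)
        if match is None:
            continue  # skip "current", README.md, and anything else
        # Compare versions as integer tuples, not as strings.
        key = tuple(int(part) for part in match.groups())
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name
```

Comparing integer tuples avoids the lexicographic trap in which `v1.9.0.yaml` would sort after `v1.10.0.yaml`.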
Prompt File Format
Use YAML to store prompts with complete metadata:
# features/customer_service/intent_classification/v1.1.0.yaml
metadata:
  id: "customer_service/intent_classification"
  version: "1.1.0"
  created_at: "2024-11-15T09:30:00Z"
  created_by: "[email protected]"
  description: "Classify user questions into predefined intent categories"
  changelog: |
    v1.1.0: Added RETURN_REQUEST intent category, improved ambiguity handling
    v1.0.0: Initial version with 6 intent categories
  tags:
    - "customer-service"
    - "classification"
    - "production"
model:
  id: "claude-3-5-haiku-20241022"
  max_tokens: 256
  temperature: 0.1  # low temperature for classification stability
prompt:
  system: |
    You are an intent classification assistant. Classify user customer service
    inquiries into one of the following categories:
    - ORDER_STATUS: inquiries about order status and shipping
    - RETURN_REQUEST: requests for returns or refunds
    - PRODUCT_INQUIRY: product information and specification queries
    - COMPLAINT: complaints and dissatisfaction feedback
    - TECHNICAL_SUPPORT: technical issues and troubleshooting
    - ACCOUNT_ISSUE: account login and settings issues
    - OTHER: none of the above categories apply
    Output format:
    {
      "intent": "<category name>",
      "confidence": <float between 0-1>,
      "reason": "<one sentence explaining the classification>"
    }
    Output JSON only, with no other text.
  user_template: |
    User question: {{user_message}}
evaluation:
  test_cases_path: "tests/intent_classification_v1.yaml"
  metrics:
    - accuracy
    - avg_confidence
  pass_threshold:
    accuracy: 0.92
dependencies:
  - shared/safety_instructions.yaml
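A prompt file is only useful to tooling if its metadata is actually complete, so a small structural check before evaluation can catch mistakes early. The sketch below operates on the already-parsed dictionary (for example, the result of `yaml.safe_load`). The list of required fields is an assumption derived from the example file above, and `validate_prompt_config` is a hypothetical name.

```python
from typing import List

# Required keys, taken from the example file above; a real schema may differ.
REQUIRED_FIELDS = {
    "metadata": ["id", "version", "created_at", "created_by", "description"],
    "model": ["id", "max_tokens"],
    "prompt": ["system", "user_template"],
}


def validate_prompt_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for section, keys in REQUIRED_FIELDS.items():
        body = config.get(section)
        if not isinstance(body, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in body:
                problems.append(f"missing field: {section}.{key}")
    # Thresholds are fractions, so anything outside [0, 1] is likely a typo.
    thresholds = config.get("evaluation", {}).get("pass_threshold", {})
    for metric, value in thresholds.items():
        if not 0.0 <= value <= 1.0:
            problems.append(f"threshold out of range: {metric}={value}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a CI job report every issue in a single run.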
67.3 Git Workflow Design
Branching Strategy
Borrowing from Git Flow, define a clear branching strategy for prompts:
main                                  # Prompts used in production
├── develop                           # Prompts under development
│   ├── feature/add-refund-intent
│   ├── fix/classification-bug
│   └── experiment/gpt4-comparison
└── hotfix/urgent-safety-fix
Commit Message Conventions
Define structured commit messages for prompt changes:
<type>(<scope>): <description>
[body]
[footer]
Types:
- feat: New prompt or feature added
- fix: Bug fixed in prompt
- refactor: Prompt structure refactored (no behavioral change)
- perf: Performance optimization (e.g., token reduction)
- test: Test cases added or modified
- chore: Maintenance work (e.g., metadata updates)
Example:
feat(intent_classification): add RETURN_REQUEST intent category

In v1.0.0, return-related inquiries were frequently misclassified as
ORDER_STATUS, routing customers to the wrong handling team. This adds
a RETURN_REQUEST category and improves ambiguous case handling logic.

Test results: return scenario accuracy improved from 78% to 96%
Scope: only affects customer_service/intent_classification
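A convention like this tends to hold only when it is checked automatically, for example in a commit hook or a CI step. The following is a minimal sketch of such a check for the first line of a commit message. The regular expression and the name `check_commit_header` are assumptions for illustration, and the allowed scope characters are a guess that a team would adjust.

```python
import re

# Types listed in the convention above.
ALLOWED_TYPES = ("feat", "fix", "refactor", "perf", "test", "chore")

# <type>(<scope>): <description>
_HEADER = re.compile(
    r"^(?P<type>[a-z]+)\((?P<scope>[A-Za-z0-9_/-]+)\): (?P<description>\S.*)$"
)


def check_commit_header(header: str) -> bool:
    """Return True if the first line follows <type>(<scope>): <description>."""
    match = _HEADER.match(header)
    return match is not None and match.group("type") in ALLOWED_TYPES
```

Only the header is validated here; body and footer are free-form in the convention, so they are left unchecked.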
Pull Request Workflow
All prompt changes should go through the PR process:
## Prompt Change Description
**Change type**: [ ] feat [ ] fix [ ] refactor [ ] perf
**Affected Prompts**:
- `features/customer_service/intent_classification/v1.1.0.yaml`
**Motivation**:
> Explain why this change is needed
**Changes**:
> Describe what specifically changed
**Test Results**:
| Metric | Before | After |
|--------|--------|-------|
| Accuracy | 89% | 96% |
| Avg confidence | 0.81 | 0.87 |
**Rollback Plan**:
> How to quickly roll back if the new version has issues
**Checklist**:
- [ ] Version number updated
- [ ] Changelog updated
- [ ] Test cases added/updated
- [ ] Automated evaluation run
- [ ] Human review completed
67.4 Automated Evaluation Pipeline
Evaluation Framework Design
Automated evaluation is the core of prompt CI/CD. Every PR should trigger an evaluation pipeline:
import json
import time
from dataclasses import dataclass
from typing import List, Optional

import anthropic
import yaml


@dataclass
class EvalCase:
    input: dict
    expected_output: str
    metadata: dict


@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float
    actual_output: str
    expected_output: str
    latency_ms: float
    error: Optional[str] = None


class PromptEvaluator:
    def __init__(self, client: anthropic.Anthropic):
        self.client = client

    def run_eval(
        self,
        prompt_config: dict,
        test_cases: List[EvalCase],
        metric_functions: dict
    ) -> dict:
        # Run every test case against the model and collect per-case results.
        results = []
        for i, case in enumerate(test_cases):
            result = self._run_single_case(prompt_config, case, i)
            results.append(result)

        # Each metric function receives the full list of EvalResult objects.
        summary = {}
        for metric_name, metric_fn in metric_functions.items():
            summary[metric_name] = metric_fn(results)

        # The run passes only if every configured threshold is met.
        thresholds = prompt_config.get("evaluation", {}).get("pass_threshold", {})
        passed = all(
            summary.get(metric, 0) >= threshold
            for metric, threshold in thresholds.items()
        )
        return {
            "passed": passed,
            "summary": summary,
            "details": [vars(r) for r in results],
            "total_cases": len(results),
            "passed_cases": sum(1 for r in results if r.passed)
        }

    def _run_single_case(self, prompt_config, case, case_index):
        user_message = self._render_template(
            prompt_config["prompt"]["user_template"],
            case.input
        )
        start_time = time.time()
        try:
            response = self.client.messages.create(
                model=prompt_config["model"]["id"],
                max_tokens=prompt_config["model"]["max_tokens"],
                temperature=prompt_config["model"].get("temperature", 0.7),
                system=prompt_config["prompt"]["system"],
                messages=[{"role": "user", "content": user_message}]
            )
            actual_output = response.content[0].text
            latency_ms = (time.time() - start_time) * 1000
            score = self._score_output(actual_output, case.expected_output)
            return EvalResult(
                case_id=f"case_{case_index}",
                passed=score >= 0.8,
                score=score,
                actual_output=actual_output,
                expected_output=case.expected_output,
                latency_ms=latency_ms
            )
        except Exception as e:
            # An API or parsing failure counts as a failed case, not a crashed run.
            return EvalResult(
                case_id=f"case_{case_index}",
                passed=False,
                score=0.0,
                actual_output="",
                expected_output=case.expected_output,
                latency_ms=(time.time() - start_time) * 1000,
                error=str(e)
            )

    def _score_output(self, actual: str, expected: str) -> float:
        # Assumption: the scorer is not defined in the original text. Exact match
        # after trimming whitespace is a placeholder; use a task-specific scorer.
        return 1.0 if actual.strip() == expected.strip() else 0.0

    def _render_template(self, template: str, variables: dict) -> str:
        # Replace each {{key}} placeholder with its value.
        result = template
        for key, value in variables.items():
            result = result.replace(f"{{{{{key}}}}}", str(value))
        return result
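The evaluator takes a `metric_functions` dictionary but the text does not show what such functions look like. The sketch below gives two candidates matching the `accuracy` and `avg_confidence` metric names from the prompt file. It assumes each result exposes `passed` and `actual_output` attributes, as `EvalResult` does, and that the model output is the JSON object requested by the system prompt. The `sample` list is stand-in data for illustration only.

```python
import json
from types import SimpleNamespace


def accuracy(results) -> float:
    """Fraction of cases whose `passed` flag is True."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.passed) / len(results)


def avg_confidence(results) -> float:
    """Mean "confidence" value across outputs that parse as JSON objects."""
    values = []
    for r in results:
        try:
            parsed = json.loads(r.actual_output)
        except (TypeError, ValueError):
            continue  # unparseable output contributes nothing
        if isinstance(parsed, dict) and isinstance(parsed.get("confidence"), (int, float)):
            values.append(float(parsed["confidence"]))
    return sum(values) / len(values) if values else 0.0


# Keys match the metric names listed under `evaluation.metrics`.
METRIC_FUNCTIONS = {"accuracy": accuracy, "avg_confidence": avg_confidence}

# Stand-in results; real ones would come from the evaluator.
sample = [
    SimpleNamespace(passed=True, actual_output='{"intent": "COMPLAINT", "confidence": 0.9}'),
    SimpleNamespace(passed=False, actual_output='{"intent": "OTHER", "confidence": 0.5}'),
    SimpleNamespace(passed=False, actual_output=""),
]
```

Note that `avg_confidence` silently skips malformed outputs, so it should be read alongside `accuracy` rather than on its own.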
GitHub Actions CI/CD
# .github/workflows/prompt-eval.yml
name: Prompt Evaluation
on:
  pull_request:
    paths:
      - 'prompts/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed to post the results comment
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main...HEAD can be diffed
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install anthropic pyyaml pandas
      - name: Detect changed prompts
        id: changes
        run: |
          # Join file names with spaces; a multi-line value would corrupt GITHUB_OUTPUT
          CHANGED=$(git diff --name-only origin/main...HEAD -- 'prompts/**/*.yaml' | tr '\n' ' ')
          echo "changed_prompts=$CHANGED" >> "$GITHUB_OUTPUT"
      - name: Run evaluations
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          # Passed via env instead of inlined into the script, since file names are untrusted input
          CHANGED_PROMPTS: ${{ steps.changes.outputs.changed_prompts }}
        run: |
          python eval/run_ci_eval.py \
            --changed-prompts "$CHANGED_PROMPTS" \
            --output-file eval_results.json
      - name: Comment PR with results
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results.json', 'utf8'));
            let comment = '## Prompt Evaluation Results\n\n';
            // Assumes run_ci_eval.py writes `summary` and `thresholds` for each prompt
            for (const [promptId, result] of Object.entries(results)) {
              const status = result.passed ? 'PASSED' : 'FAILED';
              comment += `### ${promptId} - ${status}\n`;
              comment += `- Accuracy: ${(result.summary.accuracy * 100).toFixed(1)}%\n`;
              comment += `- Pass threshold: ${(result.thresholds.accuracy * 100).toFixed(1)}%\n\n`;
            }
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
67.5 A/B Testing Framework
Traffic Splitting Mechanism
In production, A/B testing allows gradual validation of new prompt effectiveness:
import hashlib
from datetime import datetime, timezone
from typing import Optional


class PromptABTestManager:
    """
    Stable hash-based user traffic splitting.
    Ensures the same user always sees the same version (sticky assignment).
    """

    def __init__(self, prompt_store, metrics_store=None):
        self.prompt_store = prompt_store
        # Assumption: analyze_experiment() needs a metrics backend exposing
        # aggregate(experiment_id=..., variant=...); the original never set one.
        self.metrics_store = metrics_store
        self.experiments = {}

    def register_experiment(
        self,
        experiment_id: str,
        control_version: str,
        treatment_version: str,
        traffic_split: float = 0.1,
        metrics: Optional[list] = None
    ):
        self.experiments[experiment_id] = {
            "control": control_version,
            "treatment": treatment_version,
            "traffic_split": traffic_split,
            "metrics": metrics or ["user_satisfaction", "accuracy"],
            "started_at": datetime.now(timezone.utc).isoformat()
        }

    def get_prompt_variant(
        self,
        prompt_id: str,
        user_id: str,
        experiment_id: Optional[str] = None
    ) -> tuple:
        """
        Returns (prompt_config, variant_name).
        variant_name: "control" or "treatment"
        """
        if not experiment_id or experiment_id not in self.experiments:
            return self.prompt_store.get_current(prompt_id), "control"

        experiment = self.experiments[experiment_id]
        # Hashing user and experiment together keeps assignment stable per user
        # while staying independent across experiments.
        hash_input = f"{user_id}:{experiment_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_value % 100) / 100  # 100 buckets, so 1% granularity

        if bucket < experiment["traffic_split"]:
            variant = "treatment"
            version = experiment["treatment"]
        else:
            variant = "control"
            version = experiment["control"]
        prompt = self.prompt_store.get_version(prompt_id, version)
        return prompt, variant

    def analyze_experiment(self, experiment_id: str) -> dict:
        control_metrics = self.metrics_store.aggregate(
            experiment_id=experiment_id, variant="control"
        )
        treatment_metrics = self.metrics_store.aggregate(
            experiment_id=experiment_id, variant="treatment"
        )
        return {
            "experiment_id": experiment_id,
            "sample_sizes": {
                "control": control_metrics["n"],
                "treatment": treatment_metrics["n"]
            },
            "metrics": {
                metric: {
                    "control": control_metrics[metric],
                    "treatment": treatment_metrics[metric],
                    # Relative lift; undefined when the control value is zero.
                    "lift": (
                        (treatment_metrics[metric] - control_metrics[metric])
                        / control_metrics[metric]
                        if control_metrics[metric] else None
                    )
                }
                for metric in self.experiments[experiment_id]["metrics"]
            }
        }
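The sticky-assignment idea can be isolated into a single pure function, which makes it easy to test without a prompt store. This is a sketch rather than the manager's own code: it swaps MD5 for SHA-256 and uses 10,000 buckets instead of 100 so that splits below 1% are expressible. Both choices, and the name `assign_variant`, are assumptions.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, traffic_split: float) -> str:
    """Deterministically map a user to "treatment" or "control"."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    # Reduce the hash to a value in [0, 1) with 0.01% granularity.
    bucket = (int(digest, 16) % 10_000) / 10_000
    return "treatment" if bucket < traffic_split else "control"
```

Because the result depends only on the inputs, a user keeps the same variant across requests and processes, and raising `traffic_split` later only moves users from control to treatment, never the other way.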
Canary Releases for High-Risk Changes
For high-risk prompt changes, a progressive canary release strategy is recommended:
class CanaryReleaseManager:
    """
    Progressive prompt release:
    1% → 5% → 20% → 50% → 100% traffic incrementally
    """
    CANARY_STAGES = [
        {"traffic_pct": 1, "duration_hours": 2, "error_threshold": 0.05},
        {"traffic_pct": 5, "duration_hours": 4, "error_threshold": 0.03},
        {"traffic_pct": 20, "duration_hours": 8, "error_threshold": 0.02},
        {"traffic_pct": 50, "duration_hours": 12, "error_threshold": 0.02},
        {"traffic_pct": 100, "duration_hours": 0, "error_threshold": 0.02},
    ]

    # get_current_stage, get_canary_metrics, rollback, and set_traffic are
    # assumed to be provided by the deployment backend; they are not shown here.
    def advance_canary(self, prompt_id: str, deployment_id: str) -> dict:
        current_stage = self.get_current_stage(deployment_id)
        metrics = self.get_canary_metrics(deployment_id)

        # Roll back as soon as the error rate exceeds the stage's threshold.
        if metrics["error_rate"] > current_stage["error_threshold"]:
            self.rollback(deployment_id)
            return {"action": "rollback", "reason": "error_rate_exceeded"}

        # The caller is responsible for waiting duration_hours before calling
        # this method; the elapsed time is not checked here.
        next_stage_idx = self.CANARY_STAGES.index(current_stage) + 1
        if next_stage_idx < len(self.CANARY_STAGES):
            next_stage = self.CANARY_STAGES[next_stage_idx]
            self.set_traffic(prompt_id, next_stage["traffic_pct"])
            return {"action": "advanced", "new_traffic_pct": next_stage["traffic_pct"]}
        return {"action": "complete", "message": "Full rollout complete"}
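Each stage in `CANARY_STAGES` carries a `duration_hours` value, which suggests a stage should be observed for that long before traffic increases. One way to express that gate is a pure decision function, sketched below. The three-way `"rollback"` / `"hold"` / `"advance"` result and the name `canary_decision` are assumptions for illustration.

```python
from datetime import datetime, timedelta


def canary_decision(stage: dict, stage_started_at: datetime,
                    error_rate: float, now: datetime) -> str:
    """Return "rollback", "hold", or "advance" for one canary stage."""
    if error_rate > stage["error_threshold"]:
        return "rollback"  # too many errors: stop regardless of elapsed time
    soak_time = timedelta(hours=stage["duration_hours"])
    if now - stage_started_at < soak_time:
        return "hold"  # healthy so far, but the observation window is not over
    return "advance"
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the decision deterministic and simple to unit test.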
67.6 Prompt Registry
Centralized Prompt Storage
In production, all prompts should be managed through a unified registry rather than reading directly from Git repositories:
import redis
import yaml


class PromptRegistry:
    """
    Centralized prompt registry.
    - Caches hot prompts (Redis)
    - Supports version history queries
    - Provides change notifications (Webhook/message queue)
    """

    def __init__(self, git_backend, cache: redis.Redis):
        self.git = git_backend
        self.cache = cache
        self.TTL = 300  # 5-minute cache

    def get(self, prompt_id: str, version: str = "current") -> dict:
        cache_key = f"prompt:{prompt_id}:{version}"
        cached = self.cache.get(cache_key)
        if cached is not None:
            return yaml.safe_load(cached)
        # Cache miss: load from Git, then cache the serialized prompt.
        prompt = self.git.load(prompt_id, version)
        self.cache.setex(cache_key, self.TTL, yaml.dump(prompt))
        return prompt

    def publish(self, prompt_id: str, version: str, config: dict):
        """Publish a new version."""
        self.git.save(prompt_id, version, config)
        # Invalidate cache so readers do not serve the old "current" until TTL expiry
        self.cache.delete(f"prompt:{prompt_id}:current")
        self.cache.delete(f"prompt:{prompt_id}:{version}")
        # _notify_subscribers is assumed to be implemented elsewhere (webhook or queue).
        self._notify_subscribers(prompt_id, version)
67.7 Best Practices Summary
Golden Rules for Prompt Versioning
Never modify production versions directly. All modifications go through the PR process and are evaluated and reviewed before merging to main.
Use semantic versioning. Follow SemVer:
- Patch (1.0.x): wording tweaks that don't affect output format
- Minor (1.x.0): new features or intent categories added
- Major (x.0.0): fundamental changes to output format or behavior
Preserve historical versions. Don't delete old versions. History is the foundation for rollbacks and a record of prompt evolution.
Keep test cases synchronized with prompts. When a prompt gains new capabilities, test cases must gain corresponding coverage.
Treat prompt changes as high-risk code changes. In code review, prompt changes require at least one person familiar with the business domain to review — not just technical review.
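The semantic versioning rule among the golden rules above is mechanical enough to automate. The following is a minimal sketch, assuming plain `MAJOR.MINOR.PATCH` strings without pre-release suffixes; `bump_version` is a hypothetical helper.

```python
def bump_version(version: str, change: str) -> str:
    """Bump a "MAJOR.MINOR.PATCH" string; `change` is "major", "minor", or "patch"."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"  # output format or behavior changed fundamentally
    if change == "minor":
        return f"{major}.{minor + 1}.0"  # new feature or intent category
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"  # wording tweak only
    raise ValueError(f"unknown change type: {change!r}")
```

For example, adding an intent category to version `1.0.0` is a minor change and yields `1.1.0`.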
Common Anti-Patterns
Anti-pattern 1: Storing prompts in a database without versioning. Database-stored prompts have no history, cannot be rolled back, cannot be diffed. Use Git as the versioning store; databases are only for query caching.
Anti-pattern 2: Embedding business data in prompts. Hardcoding mutable business data like company names or product prices in prompts means business data changes require prompt version bumps. Use template variables to separate business data from prompt structure.
Anti-pattern 3: No rollback plan. Before every prompt deployment, explicitly document rollback steps and trigger criteria (under what conditions rollback should be initiated).
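For anti-pattern 2, separating business data from prompt structure only helps if a missing variable fails loudly instead of leaving a raw placeholder in the prompt. The sketch below shows one way to do that for the `{{name}}` placeholder style used in this chapter. `render_strict` and the `company_name` variable are hypothetical.

```python
import re

# Matches {{name}} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_strict(template: str, variables: dict) -> str:
    """Fill {{name}} placeholders; raise KeyError if any variable is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    return _PLACEHOLDER.sub(substitute, template)


# Business data lives outside the versioned prompt text.
greeting = render_strict(
    "Welcome to {{company_name}} support.", {"company_name": "Acme"}
)
```

With this split, a company rename changes only the variables passed at runtime, and the prompt version stays untouched.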
Summary
Prompt engineering under GitOps thinking is the systematic application of software engineering best practices — version control, CI/CD, A/B testing, progressive releases — to the domain of prompt management.
The core conceptual shift is: prompts are not configuration parameters but part of software logic, deserving the same rigorous engineering controls as code. The short-term investment in building this system pays dividends continuously as AI applications mature — through faster iteration velocity, higher quality stability, and lower debugging costs.