Model Family Deep Dive: Opus 4.7 / Sonnet 4.6 / Haiku 4.5 Selection Decision Tree
Chapter 2: Model Selection Guide: Capability Boundaries and Cost Trade-offs for Opus, Sonnet, and Haiku
2.1 Why Model Selection Is an Architecture Decision
The instinct to use the most capable model is understandable but economically unsound in production. A mid-sized SaaS product routing all requests to claude-opus-4-6 can easily spend 10-20x more on model inference than a system that matches the model to the task. Model selection belongs in the architecture review, not in the configuration file.
The decision is a multi-dimensional optimization across:
- Task complexity: How much reasoning depth does this task actually require?
- Latency requirements: How long can the user wait?
- Throughput volume: How many requests per day?
- Error tolerance: Is a 5% error rate acceptable, or catastrophic?
- Cost budget: What is the maximum acceptable cost per 1,000 requests?
This chapter provides a systematic decision framework covering each dimension.
2.2 Quantifying Capability Boundaries
Multi-Step Reasoning: The Largest Gap
The most significant performance gap between the three tiers appears in tasks requiring extended chains of inference. Consider a representative task: given 500 lines of Python, identify thread-safety issues and explain why they trigger intermittently rather than on every run.
| Model | Typical Performance |
|---|---|
| Haiku | Finds obvious race conditions; explanation of intermittency is imprecise; complex lock-ordering issues may be missed |
| Sonnet | Finds most issues; provides reasonable intermittency analysis; occasional misses on subtle ABA problems |
| Opus | Systematically identifies all issues; precise trigger analysis including CPU cache-line and memory barrier considerations; prioritized recommendations |
Academic benchmarks quantify these gaps:
MMLU (knowledge breadth):
Opus: ~88%
Sonnet: ~83%
Haiku: ~75%
HumanEval (code generation):
Opus: ~75%
Sonnet: ~70%
Haiku: ~60%
MATH (mathematical reasoning):
Opus: ~60%
Sonnet: ~50%
Haiku: ~35%
The MATH benchmark shows the largest absolute gap: 25 percentage points between Opus and Haiku. Tasks requiring formal mathematical reasoning amplify model tier differences more than almost any other category.
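The gap arithmetic can be checked directly from the approximate scores listed above. This is a minimal sketch; the dictionary layout and the `tier_gap` helper are illustrative names, not part of any library.

```python
# Approximate benchmark scores quoted in this section (percent).
SCORES = {
    "MMLU": {"opus": 88, "sonnet": 83, "haiku": 75},
    "HumanEval": {"opus": 75, "sonnet": 70, "haiku": 60},
    "MATH": {"opus": 60, "sonnet": 50, "haiku": 35},
}


def tier_gap(benchmark: str, high: str = "opus", low: str = "haiku") -> int:
    """Absolute gap in percentage points between two tiers on one benchmark."""
    return SCORES[benchmark][high] - SCORES[benchmark][low]


gaps = {name: tier_gap(name) for name in SCORES}
largest = max(gaps, key=gaps.get)
print(gaps)     # {'MMLU': 13, 'HumanEval': 15, 'MATH': 25}
print(largest)  # MATH
```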
Instruction Following: A Smaller Gap
For tasks primarily about following complex formatting or structural instructions, the three tiers converge. A 30-rule system prompt governing output structure will typically see Haiku comply with 25-28 rules reliably. This means structured extraction, format conversion, and template-filling tasks often run well on Haiku.
Long-Context Retrieval: The "Lost in the Middle" Effect
All three models support 200K token context windows, but long-context recall quality varies. Empirical research has documented a consistent pattern: information placed in the middle of a long document is harder to retrieve accurately than information at the beginning or end.
Recall accuracy by document position (schematic):
Accuracy
100% |
 90% | ----------------              ------------------
 80% |                 ------------
 70% |
     +-------------------------------------------------- Position
       Document start                      Document end
Opus maintains roughly 85% accuracy at the middle-of-document position; Haiku drops to around 65%. If your task requires reliably retrieving information from the middle of a long document (a contract clause buried on page 40 of 80, for example), this gap is consequential.
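One way to check how much this effect matters for your own documents is a position sweep: plant a known fact (a "needle") at different relative positions in a long document and ask each model to retrieve it. The sketch below only builds the test documents; the helper name, the filler text, and the sample needle are all hypothetical.

```python
def build_needle_document(filler: list[str], needle: str, position: float) -> str:
    """Insert `needle` among `filler` paragraphs at a relative position in [0.0, 1.0]."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    index = round(position * len(filler))
    paragraphs = filler[:index] + [needle] + filler[index:]
    return "\n\n".join(paragraphs)


# Hypothetical filler and needle; in practice use real document text.
filler = [f"Filler paragraph {i}." for i in range(10)]
needle = "The termination notice period is 45 days."
docs = {pos: build_needle_document(filler, needle, pos) for pos in (0.0, 0.5, 1.0)}
```

Each generated document can then be sent with the same retrieval question, and accuracy compared across positions and tiers.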
2.3 Cost Analysis: Real Numbers for Real Scenarios
Cost calculations must be grounded in actual token consumption patterns, not the per-million-token rate card in isolation.
Scenario 1: Customer Service Chatbot
Assumptions:
- Per-turn token breakdown: user input 150 tokens + system prompt 500 tokens + conversation history 800 tokens = 1,450 input tokens
- Average output: 300 tokens
- Daily request volume: 10,000
# Daily cost by model tier
haiku_daily = (10_000 * 1_450 / 1_000_000 * 0.25) + (10_000 * 300 / 1_000_000 * 1.25)
# = $3.625 + $3.75 = $7.375 / day -> ~$2,700 / year
sonnet_daily = (10_000 * 1_450 / 1_000_000 * 3.00) + (10_000 * 300 / 1_000_000 * 15.00)
# = $43.50 + $45.00 = $88.50 / day -> ~$32,300 / year
opus_daily = (10_000 * 1_450 / 1_000_000 * 15.00) + (10_000 * 300 / 1_000_000 * 75.00)
# = $217.50 + $225.00 = $442.50 / day -> ~$161,500 / year
For a customer service bot where Haiku's error rate is within acceptable bounds, choosing Haiku saves about 98% of model costs compared to Opus ($7.375 vs. $442.50 per day).
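The same arithmetic generalizes to any scenario. The sketch below wraps it in a small helper; `daily_cost` and `RATES` are illustrative names, and the per-million-token rates are the ones used in this chapter's examples, so substitute the current rate card before relying on the output.

```python
def daily_cost(requests: int, input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
    """Daily cost in dollars; prices are dollars per million tokens."""
    input_cost = requests * input_tokens / 1_000_000 * input_price
    output_cost = requests * output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


# (input, output) rates per million tokens, as assumed in this chapter.
RATES = {"haiku": (0.25, 1.25), "sonnet": (3.00, 15.00), "opus": (15.00, 75.00)}

# Scenario 1: 10,000 requests/day, 1,450 input tokens, 300 output tokens.
costs = {tier: daily_cost(10_000, 1_450, 300, *prices) for tier, prices in RATES.items()}
savings_vs_opus = 1 - costs["haiku"] / costs["opus"]

for tier, cost in costs.items():
    print(f"{tier}: ${cost:,.2f}/day, ${cost * 365:,.0f}/year")
```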
Scenario 2: Code Review Tool
Assumptions:
- Per-review: system prompt 1,000 tokens + code 3,000 tokens + output 1,500 tokens
- Daily volume: 500 reviews
haiku_daily = (500 * 4_000 / 1e6 * 0.25) + (500 * 1_500 / 1e6 * 1.25) # $1.44
sonnet_daily = (500 * 4_000 / 1e6 * 3.00) + (500 * 1_500 / 1e6 * 15.00) # $17.25
opus_daily = (500 * 4_000 / 1e6 * 15.0) + (500 * 1_500 / 1e6 * 75.00) # $86.25
Code review is a case where quality matters more. If Sonnet achieves 95% of Opus's quality at 20% of the cost, Sonnet is typically the correct choice.
Scenario 3: Batch Content Generation
Assumptions:
- Per generation: system prompt 500 tokens + instructions 200 tokens + output 2,000 tokens
- Volume: 2,000 items/day, async batch (not real-time)
The Batch API provides a 50% cost reduction for asynchronous workloads:
# Sonnet with Batch API discount
sonnet_batch_input_price = 3.00 * 0.50 # $1.50/M tokens
sonnet_batch_output_price = 15.00 * 0.50 # $7.50/M tokens
sonnet_batch_daily = (2_000 * 700 / 1e6 * 1.50) + (2_000 * 2_000 / 1e6 * 7.50)
# = $2.10 + $30.00 = $32.10 / day -> ~$11,700 / year
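To actually use the discount, requests are submitted as a batch rather than one at a time. The sketch below assumes the Message Batches endpoint exposed by the Python SDK as `client.messages.batches.create`; the helper names, the `item-N` id scheme, and the sample prompt are hypothetical. Only the request builder runs here; `submit_batch` needs the `anthropic` package and an API key.

```python
def build_batch_requests(items: list[str], system_prompt: str,
                         model: str = "claude-sonnet-4-6",
                         max_tokens: int = 2_048) -> list[dict]:
    """Build one batch entry per item; custom_id lets results be matched back later."""
    return [
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "system": system_prompt,
                "messages": [{"role": "user", "content": item}],
            },
        }
        for i, item in enumerate(items)
    ]


def submit_batch(batch_requests: list[dict]) -> str:
    """Submit the batch and return its id (requires network access and credentials)."""
    import anthropic  # imported lazily so the builder works without the SDK installed

    client = anthropic.Anthropic()
    batch = client.messages.batches.create(requests=batch_requests)
    return batch.id


batch_requests = build_batch_requests(
    ["Write a product blurb for a desk lamp."],  # placeholder item
    system_prompt="You are a concise copywriter.",
)
```

Batch results arrive asynchronously, so this pattern only fits workloads that tolerate delayed completion, as in this scenario.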
2.4 A Five-Step Decision Framework
Step 1: Classify the Task
Map your task to one of these categories:
Task classification tree:
Does the task require chained reasoning (>3 logical steps)?
|-- Yes -> Does it involve math, formal logic, or complex code?
|   |-- Yes -> Opus (or Sonnet + Extended Thinking)
|   `-- No  -> Sonnet
`-- No  -> Is the task primarily about structured output (extraction, classification)?
    |-- Yes -> Haiku
    `-- No  -> Does it require high-quality writing or nuanced analysis?
        |-- Yes -> Sonnet
        `-- No  -> Haiku
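The tree translates directly into a small function, which makes the classification testable and easy to log. This is a sketch: the function and parameter names are illustrative, and answering each question for a given task is still a human (or classifier) judgment.

```python
def classify_task(chained_reasoning: bool,
                  formal_or_complex_code: bool = False,
                  structured_output: bool = False,
                  nuanced_writing: bool = False) -> str:
    """Walk the classification tree and return a starting tier."""
    if chained_reasoning:
        # "opus" here also stands for the Sonnet + Extended Thinking alternative.
        return "opus" if formal_or_complex_code else "sonnet"
    if structured_output:
        return "haiku"
    return "sonnet" if nuanced_writing else "haiku"


print(classify_task(chained_reasoning=True, formal_or_complex_code=True))  # opus
print(classify_task(chained_reasoning=False, structured_output=True))      # haiku
```

The result is a starting point for the later steps, not a final decision.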
Step 2: Evaluate Latency Requirements
| Latency budget | Recommended tier |
|---|---|
| < 1 second (real-time interaction) | Haiku |
| 1-5 seconds (normal conversational) | Haiku or Sonnet |
| 5-30 seconds (acceptable thinking time) | Sonnet or Opus |
| > 30 seconds (async / batch) | Any tier; optimize for cost |
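The table can be encoded as a lookup that returns the candidate tiers for a latency budget. This is a sketch with a hypothetical function name; the table does not say which row owns the exact boundaries (5s and 30s), so assigning them to the lower row is an assumption.

```python
def tiers_for_latency(budget_seconds: float) -> list[str]:
    """Return candidate tiers for a latency budget, following the table above."""
    if budget_seconds < 1:
        return ["haiku"]
    if budget_seconds <= 5:   # assumption: exactly 5s belongs to the 1-5s row
        return ["haiku", "sonnet"]
    if budget_seconds <= 30:  # assumption: exactly 30s belongs to the 5-30s row
        return ["sonnet", "opus"]
    return ["haiku", "sonnet", "opus"]  # async / batch: any tier, optimize for cost


print(tiers_for_latency(0.5))  # ['haiku']
print(tiers_for_latency(10))   # ['sonnet', 'opus']
```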
Step 3: Assess the Cost of Errors
| Error cost | Example | Strategy |
|---|---|---|
| Low (human can easily correct) | Draft generation, initial triage | Haiku |
| Medium (affects user experience) | Customer responses, code suggestions | Sonnet |
| High (impacts business decisions) | Contract analysis, financial summaries | Opus + human review |
| Very high (legal/safety-critical) | Compliance checking, medical triage | Opus + expert review |
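Encoding this table next to the routing logic keeps the review requirement from being forgotten when a task is onboarded. This is a minimal sketch; the `Strategy` type and the level keys are illustrative names.

```python
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    tier: str
    review: Optional[str]  # None means no mandatory review step


# Mirrors the error-cost table above.
ERROR_COST_STRATEGY = {
    "low": Strategy("haiku", None),
    "medium": Strategy("sonnet", None),
    "high": Strategy("opus", "human review"),
    "very_high": Strategy("opus", "expert review"),
}


def strategy_for(error_cost: str) -> Strategy:
    """Look up the tier and review requirement for an error-cost level."""
    try:
        return ERROR_COST_STRATEGY[error_cost]
    except KeyError:
        raise ValueError(f"unknown error cost level: {error_cost!r}") from None


print(strategy_for("high"))  # Strategy(tier='opus', review='human review')
```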
Step 4: Run Task-Specific A/B Tests
Do not rely on intuition. Build an evaluation set from real production samples and measure directly:
import anthropic
import json

client = anthropic.Anthropic()

def run_evaluation(
    test_cases: list[dict],
    models: list[str],
    system_prompt: str = ""
) -> dict[str, dict]:
    """
    Run identical test cases across multiple models and collect quality + cost metrics.
    test_cases: [{"prompt": str, "expected": str, "criteria": list[str]}]
    Returns per-model scores and token usage.
    """
    results = {
        model: {"scores": [], "input_tokens": 0, "output_tokens": 0}
        for model in models
    }
    for case in test_cases:
        for model in models:
            kwargs = {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": case["prompt"]}]
            }
            if system_prompt:
                kwargs["system"] = system_prompt
            resp = client.messages.create(**kwargs)
            output = resp.content[0].text
            results[model]["input_tokens"] += resp.usage.input_tokens
            results[model]["output_tokens"] += resp.usage.output_tokens
            score = grade_output(output, case["expected"], case["criteria"])
            results[model]["scores"].append(score)
    # Compute summary statistics
    for model in models:
        scores = results[model]["scores"]
        results[model]["mean_score"] = sum(scores) / len(scores) if scores else 0
    return results

def grade_output(output: str, expected: str, criteria: list[str]) -> float:
    """Use Opus as an evaluation judge."""
    judge_prompt = f"""
Grade the following response on a scale of 0-100.
Grading criteria: {json.dumps(criteria)}
Expected outcome: {expected}
Actual response: {output}
Return only an integer score with no explanation.
"""
    resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=10,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    try:
        return int(resp.content[0].text.strip())
    except ValueError:
        return 0
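The evaluation returns token counts rather than dollars, so a final step is needed to put quality and cost side by side. The sketch below assumes a results dictionary shaped like the one `run_evaluation` returns; the `summarize` helper is hypothetical, the sample numbers are placeholders for illustration only (not measured results), and the prices are the rates used in this chapter.

```python
def summarize(results: dict[str, dict],
              prices: dict[str, tuple[float, float]]) -> list[dict]:
    """Convert token usage to dollars and list models from cheapest to most expensive."""
    rows = []
    for model, r in results.items():
        in_price, out_price = prices[model]  # dollars per million tokens
        cost = r["input_tokens"] / 1e6 * in_price + r["output_tokens"] / 1e6 * out_price
        n = len(r["scores"])
        rows.append({
            "model": model,
            "mean_score": r["mean_score"],
            "total_cost": cost,
            "cost_per_case": cost / n if n else 0.0,
        })
    return sorted(rows, key=lambda row: row["total_cost"])


# Placeholder values for illustration only.
sample_results = {
    "claude-opus-4-6": {"scores": [95, 95], "mean_score": 95.0,
                        "input_tokens": 2_000, "output_tokens": 1_000},
    "claude-haiku-4-5-20251001": {"scores": [80, 90], "mean_score": 85.0,
                                  "input_tokens": 2_000, "output_tokens": 1_000},
}
sample_prices = {
    "claude-opus-4-6": (15.00, 75.00),
    "claude-haiku-4-5-20251001": (0.25, 1.25),
}
rows = summarize(sample_results, sample_prices)
```

Note that the judge calls in `grade_output` also consume Opus tokens; those are evaluation overhead and are not included in this per-model cost.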
Step 5: Monitor and Adjust in Production
Model performance drifts as your task distribution changes. Instrument your production system with these metrics:
Key metrics to track:
- User satisfaction (CSAT / thumbs up-down)
- Error / complaint rate
- Task completion rate
- p50 / p95 response latency
- Cost per request (rolling 7-day average)
Alert thresholds (example):
- Error rate > 5% -> evaluate upgrading model tier
- Cost growth > 20% month-over-month -> audit for prompt size bloat
- p95 latency > 5s -> investigate prompt length or consider tier downgrade
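The example thresholds can be expressed as a simple check that runs against aggregated metrics. This is a sketch: the function name and argument names are hypothetical, and how the metrics are collected depends on your monitoring stack.

```python
def check_alerts(error_rate: float, cost_growth_mom: float, p95_latency_s: float) -> list[str]:
    """Return the alerts triggered by the example thresholds (rates as fractions)."""
    alerts = []
    if error_rate > 0.05:
        alerts.append("error rate above 5%: evaluate upgrading model tier")
    if cost_growth_mom > 0.20:
        alerts.append("cost growth above 20% month-over-month: audit for prompt size bloat")
    if p95_latency_s > 5.0:
        alerts.append("p95 latency above 5s: investigate prompt length or consider tier downgrade")
    return alerts


for alert in check_alerts(error_rate=0.06, cost_growth_mom=0.10, p95_latency_s=6.2):
    print(alert)
```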
2.5 Hybrid Model Architectures
Production systems rarely use a single model. The most cost-effective architectures route different task types to different models.
The Router Pattern
User request
     |
     v
[Classifier: Haiku] (cheap, fast)
     |
     v
+--------------------------------------------------------+
| Simple lookup / format task      -> Haiku              |
| Moderate reasoning / generation  -> Sonnet             |
| Deep reasoning / complex code    -> Opus               |
| Tool-use with retrieval          -> Haiku + tool calls |
+--------------------------------------------------------+
import anthropic

client = anthropic.Anthropic()

ROUTING_PROMPT = """
Classify the complexity of the user's request. Reply with exactly one word:
- "simple": factual lookup, format conversion, classification
- "medium": reasoning up to 3-4 steps, code generation, summarization
- "complex": multi-step reasoning, architecture design, math proofs
User request: {message}
"""

MODEL_MAP = {
    "simple": "claude-haiku-4-5-20251001",
    "medium": "claude-sonnet-4-6",
    "complex": "claude-opus-4-6",
}

def route_and_complete(user_message: str, system: str = "") -> str:
    # Step 1: route (Haiku classifier)
    route_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user",
                   "content": ROUTING_PROMPT.format(message=user_message)}]
    )
    tier = route_resp.content[0].text.strip().lower()
    model = MODEL_MAP.get(tier, "claude-sonnet-4-6")
    # Step 2: generate (selected model)
    kwargs = {
        "model": model,
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": user_message}],
    }
    if system:
        kwargs["system"] = system
    resp = client.messages.create(**kwargs)
    print(f"[Routed to: {model}]")
    return resp.content[0].text
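The router above falls back to Sonnet whenever the classifier's reply is not an exact key in `MODEL_MAP`, so a reply such as `"Simple"` in quotes or `complex.` with a trailing period would silently take the fallback. A more tolerant parser is one option; the sketch below uses a hypothetical `parse_tier` helper and keeps the same "medium" fallback.

```python
VALID_TIERS = ("simple", "medium", "complex")


def parse_tier(raw: str, default: str = "medium") -> str:
    """Normalize classifier output; fall back to `default` unless exactly one label appears."""
    cleaned = raw.strip().lower()
    found = [tier for tier in VALID_TIERS if tier in cleaned]
    return found[0] if len(found) == 1 else default


print(parse_tier('"Simple"'))          # simple
print(parse_tier("complex."))          # complex
print(parse_tier("simple or medium"))  # medium (ambiguous, so use the default)
```

Logging every fallback is worthwhile: a high fallback rate usually means the routing prompt needs tightening.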
The Cascade Validation Pattern
For high-accuracy requirements, generate cheaply and validate expensively:
def cascade_generate(prompt: str) -> dict:
    """
    1. Haiku generates a draft.
    2. Opus validates. If it passes, return the draft.
    3. If it fails, Opus regenerates.
    """
    # Draft
    draft_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    draft = draft_resp.content[0].text
    # Validate
    val_resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=80,
        messages=[{"role": "user", "content": f"""
Is the following response accurate and complete for the given question?
Reply "PASS" or "FAIL: <reason>".
Question: {prompt}
Response: {draft}
"""}]
    )
    verdict = val_resp.content[0].text.strip()
    if verdict.upper().startswith("PASS"):
        return {"model": "haiku", "content": draft}
    # Regenerate with Opus
    final_resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "model": "opus",
        "content": final_resp.content[0].text,
        "fallback_reason": verdict
    }
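This pattern limits Opus generation to the cases where the Haiku draft fails validation. Opus still runs the validation call on every request, so the savings depend on a high pass rate and on validation being much cheaper than a full Opus generation (here the verdict is capped at 80 output tokens).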
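Because the Opus validator in `cascade_generate` runs on every request, the cascade is only cheaper than calling Opus directly while the draft failure rate stays below a break-even point. The sketch below estimates that point. All token counts are hypothetical, `wrapper_tokens` is an assumed allowance for the validation instructions, and the prices are the rates used in this chapter.

```python
HAIKU = (0.25, 1.25)   # (input, output) dollars per million tokens, as used in this chapter
OPUS = (15.00, 75.00)


def call_cost(input_tokens: int, output_tokens: int, prices: tuple[float, float]) -> float:
    """Dollar cost of a single call."""
    in_price, out_price = prices
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price


def cascade_costs(prompt_tokens: int, draft_tokens: int, verdict_tokens: int,
                  wrapper_tokens: int = 50) -> tuple[float, float]:
    """Return (fixed cost per request, extra cost when the draft fails)."""
    draft = call_cost(prompt_tokens, draft_tokens, HAIKU)
    validate = call_cost(prompt_tokens + draft_tokens + wrapper_tokens, verdict_tokens, OPUS)
    regenerate = call_cost(prompt_tokens, draft_tokens, OPUS)
    return draft + validate, regenerate


def expected_cascade_cost(fixed: float, regenerate: float, fail_rate: float) -> float:
    """Expected cost per request: draft and validation always, regeneration on failure."""
    return fixed + fail_rate * regenerate


fixed, opus_only = cascade_costs(prompt_tokens=500, draft_tokens=400, verdict_tokens=20)
expected = expected_cascade_cost(fixed, opus_only, fail_rate=0.2)
break_even_fail_rate = 1 - fixed / opus_only  # cascade is cheaper below this failure rate
```

With these assumed token counts the cascade remains cheaper than Opus-only until a little over half of the drafts fail; longer prompts raise the validation cost and lower that threshold.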
2.6 Extended Thinking: When to Enable It
claude-opus-4-6 and claude-sonnet-4-6 support Extended Thinking, which allocates a token budget for the model to reason through a problem before generating the final response. Thinking tokens are billed at output prices.
Enable Extended Thinking When:
| Task | Recommended? |
|---|---|
| Mathematical proofs | Strongly yes |
| Complex debugging | Yes |
| Multi-factor decision analysis | Yes |
| Simple Q&A | No (adds latency and cost) |
| Format conversion | No |
| Creative writing | Usually not |
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16_000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8_000  # start conservative, increase if quality insufficient
    },
    messages=[{
        "role": "user",
        "content": "Prove that for any positive integer n, n^5 - n is divisible by 30."
    }]
)
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking: {len(block.thinking)} chars]")
    elif block.type == "text":
        print(f"[Answer]\n{block.text}")
Cost warning: budget_tokens=8000 with Opus means a worst-case additional cost of 8000 / 1M * $75 = $0.60 per request for thinking alone. Start with 2,000-4,000 tokens and increase only if the quality improvement justifies it.
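The worst-case figure is easy to tabulate for several budgets before choosing one. This is a minimal sketch with a hypothetical helper name; the $75 per million output tokens is the Opus rate assumed in this chapter.

```python
def thinking_cost_ceiling(budget_tokens: int, output_price_per_mtok: float) -> float:
    """Worst-case extra cost per request if the full thinking budget is consumed."""
    return budget_tokens / 1_000_000 * output_price_per_mtok


OPUS_OUTPUT_PRICE = 75.00  # dollars per million output tokens, as used in this chapter

ceilings = {budget: thinking_cost_ceiling(budget, OPUS_OUTPUT_PRICE)
            for budget in (2_000, 4_000, 8_000)}
for budget, cost in ceilings.items():
    print(f"budget_tokens={budget}: up to ${cost:.2f} per request")
```

Multiplying the ceiling by daily request volume gives the upper bound on what enabling thinking can add to the bill.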
2.7 Common Decision Mistakes
Mistake 1: Defaulting to the Newest Model
A newer or larger model is not automatically better for your specific task. At the rates used in this chapter, claude-haiku-4-5-20251001 costs one-sixtieth as much per token as claude-opus-4-6, and for simple extraction the quality difference is often negligible.
Mistake 2: Ignoring Compounding Latency in Agent Pipelines
In a 10-step agentic workflow, per-call latency adds up across sequential calls. If Opus averages 8 seconds per call and Sonnet averages 3 seconds, the 10-step pipeline takes 80s vs 30s. For interactive agentic applications, this is a dealbreaker.
Mistake 3: Trusting Academic Benchmarks Over Business Tests
MMLU and HumanEval measure general academic performance. Your task distribution may look nothing like those benchmarks. Building a 50-100 item evaluation set from real production samples is one of the highest-ROI investments in model selection.
Mistake 4: Skipping Prompt Optimization Before Upgrading
Before escalating to a more expensive model, verify that your prompt is well-structured. A thoughtfully designed Sonnet prompt often outperforms a poorly designed Opus prompt, at one-fifth the cost.
Summary
Model selection is an architecture decision with direct cost and quality implications. The core principles:
- Haiku for high-throughput, low-latency, and error-tolerant tasks
- Sonnet as the sensible default for most production workloads
- Opus reserved for tasks where deep reasoning demonstrably improves outcomes and errors are costly
- Hybrid architectures (router + cascade) are usually the most economical path to high quality
- Measure before deciding: build task-specific evaluation sets and let data drive the choice
The next chapter moves from architecture into hands-on practice: obtaining an API key, understanding rate limits, installing the SDK, and making your first successful request.