Chapter 2

Model Family Deep Dive: Opus 4.7 / Sonnet 4.6 / Haiku 4.5 Selection Decision Tree

Chapter 2: Model Selection Guide: Capability Boundaries and Cost Trade-offs for Opus, Sonnet, and Haiku

2.1 Why Model Selection Is an Architecture Decision

The instinct to use the most capable model is understandable but economically unsound in production. A mid-sized SaaS product routing all requests to claude-opus-4-6 can easily spend 10–20x more on model inference than a system that matches the model to the task. Model selection belongs in the architecture review, not in the configuration file.

The decision is a multi-dimensional optimization across:

Task complexity: How much reasoning depth does this task actually require?
Latency requirements: How long can the user wait?
Throughput volume: How many requests per day?
Error tolerance: Is a 5% error rate acceptable, or catastrophic?
Cost budget: What is the maximum acceptable cost per 1,000 requests?

This chapter provides a systematic decision framework covering each dimension.

2.2 Quantifying Capability Boundaries

Multi-Step Reasoning: The Largest Gap

The most significant performance gap between the three tiers appears in tasks requiring extended chains of inference. Consider a representative task: given 500 lines of Python, identify thread-safety issues and explain why they trigger intermittently rather than on every run.

Model	Typical Performance
Haiku	Finds obvious race conditions; explanation of intermittency is imprecise; complex lock-ordering issues may be missed
Sonnet	Finds most issues; provides reasonable intermittency analysis; occasional misses on subtle ABA problems
Opus	Systematically identifies all issues; precise trigger analysis including CPU cache-line and memory barrier considerations; prioritized recommendations

Academic benchmarks quantify these gaps:

MMLU (knowledge breadth):
  Opus:   ~88%
  Sonnet: ~83%
  Haiku:  ~75%

HumanEval (code generation):
  Opus:   ~75%
  Sonnet: ~70%
  Haiku:  ~60%

MATH (mathematical reasoning):
  Opus:   ~60%
  Sonnet: ~50%
  Haiku:  ~35%

The MATH benchmark shows the largest absolute gap—25 percentage points between Opus and Haiku. Tasks requiring formal mathematical reasoning amplify model tier differences more than almost any other category.
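The gap comparison can be checked directly from the approximate scores quoted above. The following is a minimal sketch; the dictionary layout and the `tier_gaps` helper are illustrative names, not part of any SDK.

```python
# Approximate benchmark scores quoted in this section (percent).
SCORES = {
    "MMLU": {"opus": 88, "sonnet": 83, "haiku": 75},
    "HumanEval": {"opus": 75, "sonnet": 70, "haiku": 60},
    "MATH": {"opus": 60, "sonnet": 50, "haiku": 35},
}


def tier_gaps(scores: dict) -> dict:
    """Return the Opus-minus-Haiku gap, in percentage points, per benchmark."""
    return {name: s["opus"] - s["haiku"] for name, s in scores.items()}


gaps = tier_gaps(SCORES)
widest = max(gaps, key=gaps.get)  # benchmark with the largest absolute gap
print(gaps, widest)
```

Swapping in scores measured on your own evaluation set shows which task categories separate the tiers most for your workload.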

Instruction Following: A Smaller Gap

For tasks primarily about following complex formatting or structural instructions, the three tiers converge. A 30-rule system prompt governing output structure will typically see Haiku comply with 25–28 rules reliably. This means structured extraction, format conversion, and template-filling tasks often run well on Haiku.

Long-Context Retrieval: The "Lost in the Middle" Effect

All three models support 200K token context windows, but long-context recall quality varies. Empirical research has documented a consistent pattern: information placed in the middle of a long document is harder to retrieve accurately than information at the beginning or end.

Recall accuracy by document position:

Accuracy
  100% ┤
   90% ┤ ████████████████              ██████████████████
   80% ┤                  ████████████
   70% ┤
       └────────────────────────────────────────────────→ Position
        Document start                          Document end

Opus maintains roughly 85% accuracy at the middle-of-document position; Haiku drops to around 65%. If your task requires reliably retrieving information from the middle of a long document (a contract clause buried on page 40 of 80, for example), this gap is consequential.

2.3 Cost Analysis: Real Numbers for Real Scenarios

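Cost calculations must be grounded in actual token consumption patterns, not the per-million-token rate card in isolation. The scenarios below assume input/output rates of $0.25/$1.25 (Haiku), $3.00/$15.00 (Sonnet), and $15.00/$75.00 (Opus) per million tokens; substitute the current rate card before reusing the figures.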

Scenario 1: Customer Service Chatbot

Assumptions:

Per-turn token breakdown: user input 150 tokens + system prompt 500 tokens + conversation history 800 tokens = 1,450 input tokens
Average output: 300 tokens
Daily request volume: 10,000

# Daily cost by model tier

haiku_daily = (10_000 * 1_450 / 1_000_000 * 0.25) + (10_000 * 300 / 1_000_000 * 1.25)
# = $3.625 + $3.75 = $7.375 / day → ~$2,700 / year

sonnet_daily = (10_000 * 1_450 / 1_000_000 * 3.00) + (10_000 * 300 / 1_000_000 * 15.00)
# = $43.50 + $45.00 = $88.50 / day → ~$32,300 / year

opus_daily = (10_000 * 1_450 / 1_000_000 * 15.00) + (10_000 * 300 / 1_000_000 * 75.00)
# = $217.50 + $225.00 = $442.50 / day → ~$161,500 / year

For a customer service bot where Haiku's error rate is within acceptable bounds, the choice saves ~98% of model costs compared to Opus ($7.375 vs $442.50 per day).
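The per-tier arithmetic above generalizes to a small helper. This is a sketch only: `PRICES` holds the example rates used in this scenario (USD per million tokens), and `daily_cost` is a hypothetical name.

```python
# Example (input, output) rates in USD per million tokens, as used in this chapter.
PRICES = {
    "haiku": (0.25, 1.25),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}


def daily_cost(requests: int, input_tokens: int, output_tokens: int,
               tier: str, prices: dict = PRICES) -> float:
    """Daily spend for one tier, given per-request token counts."""
    in_price, out_price = prices[tier]
    input_cost = requests * input_tokens / 1_000_000 * in_price
    output_cost = requests * output_tokens / 1_000_000 * out_price
    return input_cost + output_cost


# Scenario 1: 10,000 requests/day, 1,450 input and 300 output tokens each.
costs = {tier: daily_cost(10_000, 1_450, 300, tier) for tier in PRICES}
savings_vs_opus = 1 - costs["haiku"] / costs["opus"]
print(costs, f"{savings_vs_opus:.1%}")
```

Keeping the rates in one table makes it easy to rerun every scenario when the rate card changes.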

Scenario 2: Code Review Tool

Assumptions:

Per-review: system prompt 1,000 tokens + code 3,000 tokens + output 1,500 tokens
Daily volume: 500 reviews

haiku_daily  = (500 * 4_000 / 1e6 * 0.25) + (500 * 1_500 / 1e6 * 1.25)  # $1.44
sonnet_daily = (500 * 4_000 / 1e6 * 3.00) + (500 * 1_500 / 1e6 * 15.00) # $17.25
opus_daily   = (500 * 4_000 / 1e6 * 15.0) + (500 * 1_500 / 1e6 * 75.00) # $86.25

Code review is a case where quality matters more. If Sonnet achieves 95% of Opus's quality at 20% of the cost, Sonnet is typically the correct choice.

Scenario 3: Batch Content Generation

Assumptions:

Per generation: system prompt 500 tokens + instructions 200 tokens + output 2,000 tokens
Volume: 2,000 items/day, async batch (not real-time)

Batch API provides 50% cost reduction for asynchronous workloads:

# Sonnet with Batch API discount
sonnet_batch_input_price  = 3.00 * 0.50   # $1.50/M tokens
sonnet_batch_output_price = 15.00 * 0.50  # $7.50/M tokens

sonnet_batch_daily = (2_000 * 700 / 1e6 * 1.50) + (2_000 * 2_000 / 1e6 * 7.50)
# = $2.10 + $30.00 = $32.10 / day → ~$11,700 / year
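To see what the discount is worth, it helps to compute the standard and batch cost side by side. A minimal sketch, assuming the 50% discount applies uniformly to input and output tokens as described above; `batch_savings` is an illustrative helper name.

```python
def batch_savings(items: int, input_tokens: int, output_tokens: int,
                  in_price: float, out_price: float,
                  discount: float = 0.50) -> tuple[float, float]:
    """Return (standard_daily_cost, batch_daily_cost) in USD."""
    standard = items * (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    batch = standard * (1 - discount)
    return standard, batch


# Scenario 3 on Sonnet: 2,000 items/day, 700 input and 2,000 output tokens each.
standard, batch = batch_savings(2_000, 700, 2_000, 3.00, 15.00)
print(f"standard=${standard:.2f}/day, batch=${batch:.2f}/day")
```

Output tokens dominate this workload, so trimming the requested output length is worth examining alongside the batch discount.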

2.4 A Five-Step Decision Framework

Step 1: Classify the Task

Map your task to one of these categories:

Task classification tree:

Does the task require chained reasoning (>3 logical steps)?
├─ Yes → Does it involve math, formal logic, or complex code?
│         ├─ Yes → Opus (or Sonnet + Extended Thinking)
│         └─ No  → Sonnet
└─ No  → Is the task primarily about structured output (extraction, classification)?
          ├─ Yes → Haiku
          └─ No  → Does it require high-quality writing or nuanced analysis?
                    ├─ Yes → Sonnet
                    └─ No  → Haiku
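The tree above translates directly into a function. This is a sketch under the assumption that each question has already been answered as a boolean; the function and parameter names are illustrative.

```python
def classify_tier(chained_reasoning: bool,
                  formal_or_complex_code: bool = False,
                  structured_output: bool = False,
                  nuanced_writing: bool = False) -> str:
    """Walk the task classification tree and return a starting tier."""
    if chained_reasoning:
        # More than three logical steps: Opus for math, formal logic, complex code.
        return "opus" if formal_or_complex_code else "sonnet"
    if structured_output:
        # Extraction and classification tasks.
        return "haiku"
    # Remaining tasks: Sonnet only when writing quality or nuance matters.
    return "sonnet" if nuanced_writing else "haiku"


print(classify_tier(chained_reasoning=True, formal_or_complex_code=True))
print(classify_tier(chained_reasoning=False, structured_output=True))
```

Treat the result as a starting point for the evaluation in Step 4 rather than a final answer.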

Step 2: Evaluate Latency Requirements

Latency budget	Recommended tier
< 1 second (real-time interaction)	Haiku
1–5 seconds (normal conversational)	Haiku or Sonnet
5–30 seconds (acceptable thinking time)	Sonnet or Opus
> 30 seconds (async / batch)	Any tier; optimize for cost
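The latency table can be encoded as a lookup. A minimal sketch: the table does not say which bucket the boundary values belong to, so assigning exactly 5 s and exactly 30 s to the lower bucket is an assumption made here.

```python
def tiers_for_latency(budget_seconds: float) -> list[str]:
    """Return the candidate tiers for a given latency budget."""
    if budget_seconds < 1:
        return ["haiku"]                      # real-time interaction
    if budget_seconds <= 5:
        return ["haiku", "sonnet"]            # normal conversational
    if budget_seconds <= 30:
        return ["sonnet", "opus"]             # acceptable thinking time
    return ["haiku", "sonnet", "opus"]        # async / batch: optimize for cost


print(tiers_for_latency(0.5), tiers_for_latency(3), tiers_for_latency(60))
```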

Step 3: Assess the Cost of Errors

Error cost	Example	Strategy
Low (human can easily correct)	Draft generation, initial triage	Haiku
Medium (affects user experience)	Customer responses, code suggestions	Sonnet
High (impacts business decisions)	Contract analysis, financial summaries	Opus + human review
Very high (legal/safety-critical)	Compliance checking, medical triage	Opus + expert review
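One way to make this table enforceable in code is an enum plus a lookup, so the review requirement travels with the tier choice. The names `ErrorCost`, `STRATEGY`, and `strategy_for` below are hypothetical.

```python
from enum import Enum
from typing import Optional


class ErrorCost(Enum):
    LOW = "low"              # human can easily correct
    MEDIUM = "medium"        # affects user experience
    HIGH = "high"            # impacts business decisions
    VERY_HIGH = "very_high"  # legal or safety-critical


# (tier, required review step) per error-cost level, mirroring the table above.
STRATEGY = {
    ErrorCost.LOW: ("haiku", None),
    ErrorCost.MEDIUM: ("sonnet", None),
    ErrorCost.HIGH: ("opus", "human review"),
    ErrorCost.VERY_HIGH: ("opus", "expert review"),
}


def strategy_for(cost: ErrorCost) -> tuple[str, Optional[str]]:
    """Return the tier and any mandatory review step for an error-cost level."""
    return STRATEGY[cost]


print(strategy_for(ErrorCost.HIGH))
```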

Step 4: Run Task-Specific A/B Tests

Do not rely on intuition. Build an evaluation set from real production samples and measure directly:

import anthropic
import json

client = anthropic.Anthropic()

def run_evaluation(
    test_cases: list[dict],
    models: list[str],
    system_prompt: str = ""
) -> dict[str, dict]:
    """
    Run identical test cases across multiple models and collect quality + cost metrics.
    
    test_cases: [{"prompt": str, "expected": str, "criteria": list[str]}]
    Returns per-model scores and token usage.
    """
    results = {
        model: {"scores": [], "input_tokens": 0, "output_tokens": 0}
        for model in models
    }
    
    for case in test_cases:
        for model in models:
            kwargs = {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": case["prompt"]}]
            }
            if system_prompt:
                kwargs["system"] = system_prompt
            
            resp = client.messages.create(**kwargs)
            output = resp.content[0].text
            results[model]["input_tokens"] += resp.usage.input_tokens
            results[model]["output_tokens"] += resp.usage.output_tokens
            
            score = grade_output(output, case["expected"], case["criteria"])
            results[model]["scores"].append(score)
    
    # Compute summary statistics
    for model in models:
        scores = results[model]["scores"]
        results[model]["mean_score"] = sum(scores) / len(scores) if scores else 0
    
    return results

def grade_output(output: str, expected: str, criteria: list[str]) -> float:
    """Use Opus as an evaluation judge."""
    judge_prompt = f"""
    Grade the following response on a scale of 0-100.
    
    Grading criteria: {json.dumps(criteria)}
    Expected outcome: {expected}
    Actual response: {output}
    
    Return only an integer score with no explanation.
    """
    resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=10,
        messages=[{"role": "user", "content": judge_prompt}]
    )
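    try:
        # Clamp to the 0-100 scale and return a float, matching the annotation.
        return float(max(0, min(100, int(resp.content[0].text.strip()))))
    except ValueError:
        # Unparseable judge output is scored as 0; log these cases if they recur.
        return 0.0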
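The token counts collected by `run_evaluation` can be turned into a cost figure to set against the quality score. The sketch below works on a dictionary of the same shape without calling the API; the sample scores and token totals are made up for illustration, and `summarize` is a hypothetical helper.

```python
def summarize(results: dict, prices: dict) -> dict:
    """Combine mean score and token cost per model.

    results: {model: {"scores": [...], "input_tokens": int, "output_tokens": int}}
    prices:  {model: (input_usd_per_million, output_usd_per_million)}
    """
    summary = {}
    for model, r in results.items():
        in_price, out_price = prices[model]
        cost = (r["input_tokens"] * in_price + r["output_tokens"] * out_price) / 1_000_000
        n = len(r["scores"])
        summary[model] = {
            "mean_score": sum(r["scores"]) / n if n else 0.0,
            "total_cost": cost,
            "cost_per_case": cost / n if n else 0.0,
        }
    return summary


# Made-up sample data in the shape run_evaluation returns.
sample = {
    "haiku": {"scores": [70, 80], "input_tokens": 2_000, "output_tokens": 1_000},
    "sonnet": {"scores": [85, 95], "input_tokens": 2_000, "output_tokens": 1_000},
}
rates = {"haiku": (0.25, 1.25), "sonnet": (3.00, 15.00)}  # chapter's example rates
report = summarize(sample, rates)
print(report)
```

Reading score and cost per case together shows whether a quality gain is worth the price difference for this specific task.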

Step 5: Monitor and Adjust in Production

Model performance drifts as your task distribution changes. Instrument your production system with these metrics:

Key metrics to track:
- User satisfaction (CSAT / thumbs up-down)
- Error / complaint rate
- Task completion rate
- p50 / p95 response latency
- Cost per request (rolling 7-day average)

Alert thresholds (example):
- Error rate > 5%  → evaluate upgrading model tier
- Cost growth > 20% month-over-month → audit for prompt size bloat
- p95 latency > 5s → investigate prompt length or consider tier downgrade
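The example thresholds can be expressed as a small check that a monitoring job runs on each reporting window. A sketch only: the function name and the assumption that rates arrive as fractions (0.05 for 5%) are illustrative.

```python
def check_alerts(error_rate: float, cost_growth_mom: float,
                 p95_latency_s: float) -> list[str]:
    """Return the follow-up actions triggered by the example thresholds."""
    actions = []
    if error_rate > 0.05:
        actions.append("evaluate upgrading model tier")
    if cost_growth_mom > 0.20:
        actions.append("audit for prompt size bloat")
    if p95_latency_s > 5:
        actions.append("investigate prompt length or consider tier downgrade")
    return actions


print(check_alerts(error_rate=0.07, cost_growth_mom=0.10, p95_latency_s=6.2))
```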

2.5 Hybrid Model Architectures

Production systems rarely use a single model. The most cost-effective architectures route different task types to different models.

The Router Pattern

User request
     ↓
[Classifier: Haiku]  (cheap, fast)
     ↓
┌─────────────────────────────────────────────┐
│ Simple lookup / format task → Haiku          │
│ Moderate reasoning / generation → Sonnet     │
│ Deep reasoning / complex code → Opus         │
│ Tool-use with retrieval → Haiku + tool calls │
└─────────────────────────────────────────────┘

import anthropic

client = anthropic.Anthropic()

ROUTING_PROMPT = """
Classify the complexity of the user's request. Reply with exactly one word:
- "simple"  — factual lookup, format conversion, classification
- "medium"  — reasoning up to 3-4 steps, code generation, summarization
- "complex" — multi-step reasoning, architecture design, math proofs

User request: {message}
"""

MODEL_MAP = {
    "simple":  "claude-haiku-4-5-20251001",
    "medium":  "claude-sonnet-4-6",
    "complex": "claude-opus-4-6",
}

def route_and_complete(user_message: str, system: str = "") -> str:
    # Step 1: route (Haiku classifier)
    route_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user",
                   "content": ROUTING_PROMPT.format(message=user_message)}]
    )
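    # Normalize the label: the classifier may add quotes or trailing punctuation.
    tier = route_resp.content[0].text.strip().strip("\"'.").lower()
    # Fall back to Sonnet if the classifier returns an unexpected label.
    model = MODEL_MAP.get(tier, "claude-sonnet-4-6")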

    # Step 2: generate (selected model)
    kwargs = {
        "model": model,
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": user_message}],
    }
    if system:
        kwargs["system"] = system

    resp = client.messages.create(**kwargs)
    print(f"[Routed to: {model}]")
    return resp.content[0].text
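Whether a router pays off depends on the traffic mix and on the overhead of the extra classification call. The sketch below estimates the blended per-request cost; the 70/25/5 mix, the 200-token routing prompt, and the helper names are all assumptions for illustration, and the rates are this chapter's example figures.

```python
# Example (input, output) rates in USD per million tokens, as used in this chapter.
PRICES = {"haiku": (0.25, 1.25), "sonnet": (3.00, 15.00), "opus": (15.00, 75.00)}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request on the given tier."""
    in_price, out_price = PRICES[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def routed_cost(mix: dict, input_tokens: int, output_tokens: int,
                router_in: int = 200, router_out: int = 5) -> float:
    """Expected per-request cost: Haiku routing call plus the share-weighted tiers."""
    router = request_cost("haiku", router_in, router_out)
    blended = sum(share * request_cost(tier, input_tokens, output_tokens)
                  for tier, share in mix.items())
    return router + blended


# Hypothetical traffic mix; measure your own before relying on the result.
mix = {"haiku": 0.70, "sonnet": 0.25, "opus": 0.05}
blended = routed_cost(mix, input_tokens=1_450, output_tokens=300)
all_opus = request_cost("opus", 1_450, 300)
print(f"routed=${blended:.6f}, all-opus=${all_opus:.6f}")
```

The routing call is cheap relative to the generation step, so the savings are driven almost entirely by how much traffic the classifier can keep off the top tier.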

The Cascade Validation Pattern

For high-accuracy requirements, generate cheaply and validate expensively:

def cascade_generate(prompt: str) -> dict:
    """
    1. Haiku generates a draft.
    2. Opus validates. If it passes, return the draft.
    3. If it fails, Opus regenerates.
    """
    # Draft
    draft_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    draft = draft_resp.content[0].text

    # Validate
    val_resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=80,
        messages=[{"role": "user", "content": f"""
Is the following response accurate and complete for the given question?
Reply "PASS" or "FAIL: <reason>".

Question: {prompt}
Response: {draft}
"""}]
    )
    verdict = val_resp.content[0].text.strip()

    if verdict.upper().startswith("PASS"):
        return {"model": "haiku", "content": draft}

    # Regenerate with Opus
    final_resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "model": "opus",
        "content": final_resp.content[0].text,
        "fallback_reason": verdict
    }

This pattern reduces Opus usage to only the cases where Haiku's quality is demonstrably insufficient.
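The cascade is only cheaper than calling Opus directly when enough drafts pass validation, because every request pays for the validation call. The following sketch estimates the expected cost per request; the token counts (500-token prompt, 400-token answer, 20-token verdict) are hypothetical, the validation prompt's own wording is ignored, and the rates are this chapter's example figures.

```python
HAIKU = (0.25, 1.25)   # example (input, output) USD per million tokens
OPUS = (15.00, 75.00)


def call_cost(prices: tuple, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def cascade_cost(pass_rate: float, prompt: int = 500,
                 answer: int = 400, verdict: int = 20) -> float:
    """Expected cost: Haiku draft + Opus validation + Opus retry on failure."""
    draft = call_cost(HAIKU, prompt, answer)
    validate = call_cost(OPUS, prompt + answer, verdict)  # validator reads prompt + draft
    retry = call_cost(OPUS, prompt, answer)
    return draft + validate + (1 - pass_rate) * retry


opus_direct = call_cost(OPUS, 500, 400)
for rate in (0.2, 0.5, 0.8):
    print(rate, round(cascade_cost(rate), 6), cascade_cost(rate) < opus_direct)
```

Under these assumed token counts the cascade breaks even at a pass rate of roughly 42%; below that, sending the request straight to Opus is cheaper.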

2.6 Extended Thinking: When to Enable It

claude-opus-4-6 and claude-sonnet-4-6 support Extended Thinking, which allocates a token budget for the model to reason through a problem before generating the final response. Thinking tokens are billed at output prices.

Enable Extended Thinking When:

Task	Recommended?
Mathematical proofs	Strongly yes
Complex debugging	Yes
Multi-factor decision analysis	Yes
Simple Q&A	No — adds latency and cost
Format conversion	No
Creative writing	Usually not

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16_000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8_000   # illustrative cap on thinking tokens; must be less than max_tokens
    },
    messages=[{
        "role": "user",
        "content": "Prove that for any positive integer n, n^5 - n is divisible by 30."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking: {len(block.thinking)} chars]")
    elif block.type == "text":
        print(f"[Answer]\n{block.text}")

Cost warning: budget_tokens=8000 with Opus means a worst-case additional cost of 8000 / 1M * $75 = $0.60 per request for thinking alone. Start with 2,000–4,000 tokens and increase only if the quality improvement justifies it.
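The worst-case figure follows from the budget and the output rate, so it is easy to tabulate for a few candidate budgets. A minimal sketch, using the $75 per million output-token rate from the warning above; `max_thinking_cost` is an illustrative name.

```python
def max_thinking_cost(budget_tokens: int, output_price_per_million: float) -> float:
    """Upper bound on per-request thinking cost if the whole budget is used."""
    return budget_tokens / 1_000_000 * output_price_per_million


budgets = [2_000, 4_000, 8_000]
worst_case = {b: max_thinking_cost(b, 75.00) for b in budgets}
print(worst_case)
```

Actual spend is usually lower, because the model does not have to use the full budget on every request.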

2.7 Common Decision Mistakes

Mistake 1: Defaulting to the Newest Model

A higher version number does not mean better performance on your specific task. claude-haiku-4-5-20251001 may be 60x more cost-effective than claude-opus-4-6 for simple extraction, with negligible quality difference.

Mistake 2: Ignoring Compounding Latency in Agent Pipelines

In a 10-step agentic workflow where each call waits on the previous one, per-call latency adds up. If Opus averages 8 seconds per call and Sonnet averages 3 seconds, the 10-step pipeline takes 80s vs 30s. For interactive agentic applications, that difference is a dealbreaker.
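A quick estimate makes the trade-off concrete, including mixed pipelines that reserve the slower tier for a few steps. This sketch assumes strictly sequential calls and uses the average latencies quoted above; the 8+2 split is a hypothetical example.

```python
# Average per-call latency in seconds, as quoted in the example above.
LATENCY = {"sonnet": 3.0, "opus": 8.0}


def pipeline_latency(steps: list[str]) -> float:
    """Total wall-clock time for sequential calls, one tier name per step."""
    return sum(LATENCY[tier] for tier in steps)


all_opus = pipeline_latency(["opus"] * 10)
all_sonnet = pipeline_latency(["sonnet"] * 10)
mixed = pipeline_latency(["sonnet"] * 8 + ["opus"] * 2)  # hypothetical split
print(all_opus, all_sonnet, mixed)
```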

Mistake 3: Trusting Academic Benchmarks Over Business Tests

MMLU and HumanEval measure general academic performance. Your task distribution may look nothing like those benchmarks. Building a 50-100 item evaluation set from real production samples is one of the highest-ROI investments in model selection.

Mistake 4: Skipping Prompt Optimization Before Upgrading

Before escalating to a more expensive model, verify that your prompt is well-structured. A thoughtfully designed Sonnet prompt often outperforms a poorly designed Opus prompt—at one-fifth the cost.

Summary

Model selection is an architecture decision with direct cost and quality implications. The core principles:

Haiku for high-throughput, low-latency, and error-tolerant tasks
Sonnet as the sensible default for most production workloads
Opus reserved for tasks where deep reasoning demonstrably improves outcomes and errors are costly
Hybrid architectures (router + cascade) are usually the most economical path to high quality
Measure before deciding: build task-specific evaluation sets and let data drive the choice

The next chapter moves from architecture into hands-on practice: obtaining an API key, understanding rate limits, installing the SDK, and making your first successful request.
