Chapter 2

Model Family Deep Dive: Opus 4.7 / Sonnet 4.6 / Haiku 4.5 Selection Decision Tree

Chapter 2: Model Selection Guide: Capability Boundaries and Cost Trade-offs for Opus, Sonnet, and Haiku

2.1 Why Model Selection Is an Architecture Decision

The instinct to use the most capable model is understandable but economically unsound in production. A mid-sized SaaS product routing all requests to claude-opus-4-6 can easily spend 10–20x more on model inference than a system that matches the model to the task. Model selection belongs in the architecture review, not in the configuration file.

The decision is a multi-dimensional optimization across capability, cost, latency, and the cost of errors.

This chapter provides a systematic decision framework covering each dimension.

2.2 Quantifying Capability Boundaries

Multi-Step Reasoning: The Largest Gap

The most significant performance gap between the three tiers appears in tasks requiring extended chains of inference. Consider a representative task: given 500 lines of Python, identify thread-safety issues and explain why they trigger intermittently rather than on every run.

Model  | Typical performance
-------+------------------------------------------------------------
Haiku  | Finds obvious race conditions; explanation of intermittency is imprecise; complex lock-ordering issues may be missed
Sonnet | Finds most issues; provides reasonable intermittency analysis; occasional misses on subtle ABA problems
Opus   | Systematically identifies all issues; precise trigger analysis including CPU cache-line and memory barrier considerations; prioritized recommendations

Academic benchmarks quantify these gaps:

MMLU (knowledge breadth):
  Opus:   ~88%
  Sonnet: ~83%
  Haiku:  ~75%

HumanEval (code generation):
  Opus:   ~75%
  Sonnet: ~70%
  Haiku:  ~60%

MATH (mathematical reasoning):
  Opus:   ~60%
  Sonnet: ~50%
  Haiku:  ~35%

The MATH benchmark shows the largest absolute gap: 25 percentage points between Opus and Haiku, versus 15 on HumanEval and 13 on MMLU. Tasks requiring formal mathematical reasoning amplify model tier differences more than almost any other category.
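The gap arithmetic is easy to check. The sketch below uses only the approximate scores listed above; the dictionary layout and the function names are illustrative.

```python
# Approximate benchmark scores (percent), copied from the figures above.
SCORES = {
    "MMLU":      {"opus": 88, "sonnet": 83, "haiku": 75},
    "HumanEval": {"opus": 75, "sonnet": 70, "haiku": 60},
    "MATH":      {"opus": 60, "sonnet": 50, "haiku": 35},
}

def absolute_gaps(scores):
    """Opus-minus-Haiku gap in percentage points, per benchmark."""
    return {name: s["opus"] - s["haiku"] for name, s in scores.items()}

def relative_gaps(scores):
    """The same gap expressed as a fraction of the Haiku score."""
    return {name: (s["opus"] - s["haiku"]) / s["haiku"] for name, s in scores.items()}

gaps = absolute_gaps(SCORES)
largest = max(gaps, key=gaps.get)  # benchmark with the widest absolute gap
print(gaps, "largest:", largest)
```

Measured relative to Haiku's own score, the MATH gap is also the widest (roughly 71%), so the conclusion does not depend on which way you express it.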

Instruction Following: A Smaller Gap

For tasks primarily about following complex formatting or structural instructions, the three tiers converge. A 30-rule system prompt governing output structure will typically see Haiku comply with 25–28 rules reliably. This means structured extraction, format conversion, and template-filling tasks often run well on Haiku.

Long-Context Retrieval: The "Lost in the Middle" Effect

All three models support 200K token context windows, but long-context recall quality varies. Empirical research has documented a consistent pattern: information placed in the middle of a long document is harder to retrieve accurately than information at the beginning or end.

Recall accuracy by document position:

Accuracy
  100% ┤
   90% ┤ ████████████████              ██████████████████
   80% ┤                  ████████████
   70% ┤
       └────────────────────────────────────────────────→ Position
        Document start                          Document end

Opus maintains roughly 85% accuracy at the middle-of-document position; Haiku drops to around 65%. If your task requires reliably retrieving information from the middle of a long document (a contract clause buried on page 40 of 80, for example), this gap is consequential.
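If your code controls the order in which passages are placed in the prompt, one possible mitigation is to put the most important passages near the start and end, leaving the least important in the middle. This is a sketch of that idea and not something the figures above measure; `reorder_for_edges` is a hypothetical helper, and it assumes the passages arrive already ranked from most to least relevant.

```python
def reorder_for_edges(chunks_by_relevance):
    """Place the highest-ranked chunks at the edges of the prompt.

    Input is ordered most-relevant first. Even-indexed chunks fill the
    front, odd-indexed chunks fill the back in reverse, so the lowest-ranked
    chunks end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["a", "b", "c", "d", "e"]  # "a" is most relevant, "e" least
print(reorder_for_edges(ranked))
```

Whether this helps on your documents is an empirical question; measure recall before and after on a sample of real queries.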

2.3 Cost Analysis: Real Numbers for Real Scenarios

Cost calculations must be grounded in actual token consumption patterns, not the per-million-token rate card in isolation.

Scenario 1: Customer Service Chatbot

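Assumptions (taken from the calculation below):
- 10,000 requests per day
- 1,450 input tokens per request
- 300 output tokens per request
- Prices in USD per million tokens (input / output), as assumed throughout this chapter: Haiku 0.25 / 1.25, Sonnet 3.00 / 15.00, Opus 15.00 / 75.00. Check these against the current rate card before budgeting.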

# Daily cost by model tier

haiku_daily = (10_000 * 1_450 / 1_000_000 * 0.25) + (10_000 * 300 / 1_000_000 * 1.25)
# = $3.625 + $3.75 = $7.375 / day → ~$2,700 / year

sonnet_daily = (10_000 * 1_450 / 1_000_000 * 3.00) + (10_000 * 300 / 1_000_000 * 15.00)
# = $43.50 + $45.00 = $88.50 / day → ~$32,300 / year

opus_daily = (10_000 * 1_450 / 1_000_000 * 15.00) + (10_000 * 300 / 1_000_000 * 75.00)
# = $217.50 + $225.00 = $442.50 / day → ~$161,500 / year

For a customer service bot where Haiku's error rate is within acceptable bounds, choosing Haiku cuts model costs by roughly 98% compared to Opus ($7.375 vs. $442.50 per day) and by roughly 92% compared to Sonnet.
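The three calculations above share one formula, which can be factored into a helper. A minimal sketch, assuming the per-million-token rates used in this chapter; `Pricing`, `PRICES`, and `daily_cost` are illustrative names, not part of any SDK.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    """USD per million tokens."""
    input_per_m: float
    output_per_m: float

# Rates assumed in this chapter; verify against the current rate card.
PRICES = {
    "haiku":  Pricing(0.25, 1.25),
    "sonnet": Pricing(3.00, 15.00),
    "opus":   Pricing(15.00, 75.00),
}

def daily_cost(requests, input_tokens, output_tokens, pricing):
    """Daily USD cost for a fixed per-request token profile."""
    input_cost = requests * input_tokens / 1_000_000 * pricing.input_per_m
    output_cost = requests * output_tokens / 1_000_000 * pricing.output_per_m
    return input_cost + output_cost

for tier, pricing in PRICES.items():
    cost = daily_cost(10_000, 1_450, 300, pricing)
    print(f"{tier:<7} ${cost:>8.2f}/day  ${cost * 365:>10,.0f}/year")
```

Keeping prices in one table means a rate change is a one-line edit, and the same function can be reused for any scenario by changing the request count and token profile.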

Scenario 2: Code Review Tool

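Assumptions (taken from the calculation below):
- 500 review requests per day
- 4,000 input tokens per request
- 1,500 output tokens per request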

haiku_daily  = (500 * 4_000 / 1e6 * 0.25) + (500 * 1_500 / 1e6 * 1.25)  # $1.44
sonnet_daily = (500 * 4_000 / 1e6 * 3.00) + (500 * 1_500 / 1e6 * 15.00) # $17.25
opus_daily   = (500 * 4_000 / 1e6 * 15.0) + (500 * 1_500 / 1e6 * 75.00) # $86.25

Code review is a case where quality matters more. If Sonnet achieves 95% of Opus's quality at 20% of the cost, Sonnet is typically the correct choice.

Scenario 3: Batch Content Generation

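Assumptions (taken from the calculation below):
- 2,000 generation requests per day
- 700 input tokens per request
- 2,000 output tokens per request

The Batch API provides a 50% cost reduction for asynchronous workloads: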

# Sonnet with Batch API discount
sonnet_batch_input_price  = 3.00 * 0.50   # $1.50/M tokens
sonnet_batch_output_price = 15.00 * 0.50  # $7.50/M tokens

sonnet_batch_daily = (2_000 * 700 / 1e6 * 1.50) + (2_000 * 2_000 / 1e6 * 7.50)
# = $2.10 + $30.00 = $32.10 / day → ~$11,700 / year
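To claim the batch discount, requests are submitted together through the Message Batches endpoint rather than one at a time. The sketch below builds the request entries and shows the submission call. The `custom_id` naming scheme and both helper names are hypothetical, the model ID is the one used elsewhere in this chapter, and `client` is assumed to be an `anthropic.Anthropic()` instance. Polling for and retrieving results is not shown.

```python
def build_batch_requests(prompts, model, max_tokens=2_000):
    """Turn a list of prompts into Message Batches request entries."""
    return [
        {
            # custom_id lets you match each result back to its prompt later.
            "custom_id": f"article-{i:05d}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]

def submit_batch(client, requests):
    """Submit the batch; results become available asynchronously."""
    return client.messages.batches.create(requests=requests)

requests = build_batch_requests(
    ["Write a product description for a kettle."],
    model="claude-sonnet-4-6",
)
print(requests[0]["custom_id"])
```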

2.4 A Five-Step Decision Framework

Step 1: Classify the Task

Map your task to one of these categories:

Task classification tree:

Does the task require chained reasoning (>3 logical steps)?
├─ Yes → Does it involve math, formal logic, or complex code?
│         ├─ Yes → Opus (or Sonnet + Extended Thinking)
│         └─ No  → Sonnet
└─ No  → Is the task primarily about structured output (extraction, classification)?
          ├─ Yes → Haiku
          └─ No  → Does it require high-quality writing or nuanced analysis?
                    ├─ Yes → Sonnet
                    └─ No  → Haiku
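The tree above translates directly into a function, which is handy if you want the classification recorded in code review or config. A minimal sketch; the function name and the boolean parameter names are illustrative, and answering the four questions for a given task is still a human judgment.

```python
def classify_task(chained_reasoning, formal_or_complex_code=False,
                  structured_output=False, nuanced_writing=False):
    """Return the starting tier suggested by the classification tree."""
    if chained_reasoning:
        # More than three logical steps: Opus for math, formal logic, or
        # complex code (Sonnet + Extended Thinking is the listed alternative).
        return "opus" if formal_or_complex_code else "sonnet"
    if structured_output:
        # Extraction and classification tasks.
        return "haiku"
    # Remaining tasks: Sonnet only if writing quality or nuance matters.
    return "sonnet" if nuanced_writing else "haiku"

print(classify_task(chained_reasoning=True, formal_or_complex_code=True))
print(classify_task(chained_reasoning=False, structured_output=True))
```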

Step 2: Evaluate Latency Requirements

Latency budget                          | Recommended tier
----------------------------------------+----------------------------
< 1 second (real-time interaction)      | Haiku
1–5 seconds (normal conversational)     | Haiku or Sonnet
5–30 seconds (acceptable thinking time) | Sonnet or Opus
> 30 seconds (async / batch)            | Any tier; optimize for cost
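The latency table can be applied as a filter on candidate tiers. A sketch with a hypothetical function name; the table leaves the exact boundaries ambiguous, so this version assumes a budget of exactly 5 seconds falls in the lower band and exactly 30 seconds in the 5–30 band.

```python
def tiers_for_latency(budget_seconds):
    """Return the tiers the latency table recommends for a given budget."""
    if budget_seconds < 1:
        return ["haiku"]                 # real-time interaction
    if budget_seconds <= 5:
        return ["haiku", "sonnet"]       # normal conversational
    if budget_seconds <= 30:
        return ["sonnet", "opus"]        # acceptable thinking time
    return ["haiku", "sonnet", "opus"]   # async / batch: choose on cost

for budget in (0.5, 3, 20, 120):
    print(budget, tiers_for_latency(budget))
```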

Step 3: Assess the Cost of Errors

Error cost                        | Example                                | Strategy
----------------------------------+----------------------------------------+---------------------
Low (human can easily correct)    | Draft generation, initial triage       | Haiku
Medium (affects user experience)  | Customer responses, code suggestions   | Sonnet
High (impacts business decisions) | Contract analysis, financial summaries | Opus + human review
Very high (legal/safety-critical) | Compliance checking, medical triage    | Opus + expert review
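One way to combine this table with a tier chosen on other grounds is to treat the error-cost strategy as a minimum. That interpretation is an assumption of this sketch, not something the table states; `ErrorCost`, `STRATEGY`, and `apply_error_floor` are hypothetical names.

```python
from enum import Enum

class ErrorCost(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    VERY_HIGH = "very_high"

# (minimum tier, required review step) per error-cost level, from the table.
STRATEGY = {
    ErrorCost.LOW:       ("haiku", None),
    ErrorCost.MEDIUM:    ("sonnet", None),
    ErrorCost.HIGH:      ("opus", "human review"),
    ErrorCost.VERY_HIGH: ("opus", "expert review"),
}

TIER_ORDER = ["haiku", "sonnet", "opus"]  # cheapest to most capable

def apply_error_floor(proposed_tier, error_cost):
    """Raise the proposed tier to the error-cost minimum if it is lower."""
    floor, review = STRATEGY[error_cost]
    tier = max(proposed_tier, floor, key=TIER_ORDER.index)
    return tier, review

print(apply_error_floor("haiku", ErrorCost.HIGH))
```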

Step 4: Run Task-Specific A/B Tests

Do not rely on intuition. Build an evaluation set from real production samples and measure directly:

import anthropic
import json

client = anthropic.Anthropic()

def run_evaluation(
    test_cases: list[dict],
    models: list[str],
    system_prompt: str = ""
) -> dict[str, dict]:
    """
    Run identical test cases across multiple models and collect quality + cost metrics.
    
    test_cases: [{"prompt": str, "expected": str, "criteria": list[str]}]
    Returns per-model scores and token usage.
    """
    results = {
        model: {"scores": [], "input_tokens": 0, "output_tokens": 0}
        for model in models
    }
    
    for case in test_cases:
        for model in models:
            kwargs = {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": case["prompt"]}]
            }
            if system_prompt:
                kwargs["system"] = system_prompt
            
            resp = client.messages.create(**kwargs)
            output = resp.content[0].text
            results[model]["input_tokens"] += resp.usage.input_tokens
            results[model]["output_tokens"] += resp.usage.output_tokens
            
            score = grade_output(output, case["expected"], case["criteria"])
            results[model]["scores"].append(score)
    
    # Compute summary statistics
    for model in models:
        scores = results[model]["scores"]
        results[model]["mean_score"] = sum(scores) / len(scores) if scores else 0
    
    return results

def grade_output(output: str, expected: str, criteria: list[str]) -> float:
    """Use Opus as an evaluation judge."""
    judge_prompt = f"""
    Grade the following response on a scale of 0-100.
    
    Grading criteria: {json.dumps(criteria)}
    Expected outcome: {expected}
    Actual response: {output}
    
    Return only an integer score with no explanation.
    """
    resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=10,
        messages=[{"role": "user", "content": judge_prompt}]
    )
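    try:
        score = float(resp.content[0].text.strip())
    except ValueError:
        # Unparseable judge output is scored 0, which pulls the mean down;
        # count these cases separately when interpreting results.
        return 0.0
    # Clamp in case the judge returns a value outside the requested range.
    return max(0.0, min(100.0, score))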
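Because `run_evaluation` returns token totals alongside scores, quality and cost can be put side by side. A sketch, assuming you supply the per-million-token prices yourself; `summarize` and the placeholder model names are hypothetical, and the sample numbers are made up purely to show the shape of the data.

```python
def summarize(results, prices):
    """Combine scores and token usage into per-model quality/cost figures.

    results: the structure returned by run_evaluation
    prices:  {model: (input_price, output_price)} in USD per million tokens
    """
    summary = {}
    for model, r in results.items():
        in_price, out_price = prices[model]
        cost = (r["input_tokens"] / 1e6 * in_price
                + r["output_tokens"] / 1e6 * out_price)
        n = len(r["scores"])
        summary[model] = {
            "mean_score": sum(r["scores"]) / n if n else 0.0,
            "cost_usd": cost,
            "cost_per_case": cost / n if n else 0.0,
        }
    return summary

# Placeholder data, not real measurements.
sample = {"model-a": {"scores": [80, 90],
                      "input_tokens": 2_000_000, "output_tokens": 1_000_000}}
print(summarize(sample, {"model-a": (3.00, 15.00)}))
```

The token totals cover only the candidate model's calls; the judge's own usage is a separate evaluation expense.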

Step 5: Monitor and Adjust in Production

Model performance drifts as your task distribution changes. Instrument your production system with these metrics:

Key metrics to track:
- User satisfaction (CSAT / thumbs up-down)
- Error / complaint rate
- Task completion rate
- p50 / p95 response latency
- Cost per request (rolling 7-day average)

Alert thresholds (example):
- Error rate > 5%  → evaluate upgrading model tier
- Cost growth > 20% month-over-month → audit for prompt size bloat
- p95 latency > 5s → investigate prompt length or consider tier downgrade
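The example thresholds can be encoded as a small check that runs against your metrics store. A sketch; the function name and argument names are hypothetical, and rates are expressed as fractions (0.05 means 5%).

```python
def check_alerts(error_rate, cost_growth_mom, p95_latency_s):
    """Return the actions triggered by the example alert thresholds."""
    alerts = []
    if error_rate > 0.05:
        alerts.append("error rate: evaluate upgrading model tier")
    if cost_growth_mom > 0.20:
        alerts.append("cost growth: audit for prompt size bloat")
    if p95_latency_s > 5.0:
        alerts.append("p95 latency: investigate prompt length or consider tier downgrade")
    return alerts

for message in check_alerts(error_rate=0.07, cost_growth_mom=0.10, p95_latency_s=6.2):
    print(message)
```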

2.5 Hybrid Model Architectures

Production systems rarely use a single model. The most cost-effective architectures route different task types to different models.

The Router Pattern

User request
     ↓
[Classifier: Haiku]  (cheap, fast)
     ↓
┌─────────────────────────────────────────────┐
│ Simple lookup / format task → Haiku          │
│ Moderate reasoning / generation → Sonnet     │
│ Deep reasoning / complex code → Opus         │
│ Tool-use with retrieval → Haiku + tool calls │
└─────────────────────────────────────────────┘

import anthropic

client = anthropic.Anthropic()

ROUTING_PROMPT = """
Classify the complexity of the user's request. Reply with exactly one word:
- "simple"  — factual lookup, format conversion, classification
- "medium"  — reasoning up to 3-4 steps, code generation, summarization
- "complex" — multi-step reasoning, architecture design, math proofs

User request: {message}
"""

MODEL_MAP = {
    "simple":  "claude-haiku-4-5-20251001",
    "medium":  "claude-sonnet-4-6",
    "complex": "claude-opus-4-6",
}

def route_and_complete(user_message: str, system: str = "") -> str:
    # Step 1: route (Haiku classifier)
    route_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user",
                   "content": ROUTING_PROMPT.format(message=user_message)}]
    )
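    # Normalize the reply: the model may add quotes or trailing punctuation.
    tier = route_resp.content[0].text.strip().lower().strip("\"'.")
    # Unrecognized labels fall back to Sonnet, the middle tier.
    model = MODEL_MAP.get(tier, "claude-sonnet-4-6")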

    # Step 2: generate (selected model)
    kwargs = {
        "model": model,
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": user_message}],
    }
    if system:
        kwargs["system"] = system

    resp = client.messages.create(**kwargs)
    print(f"[Routed to: {model}]")
    return resp.content[0].text
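To estimate what routing saves, weight each tier's per-request cost by the share of traffic it receives. The sketch below assumes a hypothetical traffic mix (70% simple, 25% medium, 5% complex), the 1,450-input / 300-output token profile from Scenario 1 for every tier, and this chapter's assumed prices. None of these are measured values, and the helper names are illustrative.

```python
# Assumed USD prices per million tokens (input, output) for the tier behind each label.
PRICES = {"simple": (0.25, 1.25), "medium": (3.00, 15.00), "complex": (15.00, 75.00)}

def request_cost(input_tokens, output_tokens, prices):
    """USD cost of one request at the given token counts."""
    in_price, out_price = prices
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

def blended_cost(mix, per_request, router_overhead=0.0):
    """Expected cost per request for a routing mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return router_overhead + sum(share * per_request[label]
                                 for label, share in mix.items())

per_request = {label: request_cost(1_450, 300, p) for label, p in PRICES.items()}
mix = {"simple": 0.70, "medium": 0.25, "complex": 0.05}  # hypothetical mix
routed = blended_cost(mix, per_request)
all_opus = per_request["complex"]
print(f"routed ${routed:.5f}  all-Opus ${all_opus:.5f}  ratio {routed / all_opus:.1%}")
```

Under these assumptions the routed system costs roughly 11% of the all-Opus baseline before the classifier call is counted; pass that call's cost as `router_overhead` to include it.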

The Cascade Validation Pattern

For high-accuracy requirements, generate cheaply and validate expensively:

def cascade_generate(prompt: str) -> dict:
    """
    1. Haiku generates a draft.
    2. Opus validates. If it passes, return the draft.
    3. If it fails, Opus regenerates.
    """
    # Draft
    draft_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    draft = draft_resp.content[0].text

    # Validate
    val_resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=80,
        messages=[{"role": "user", "content": f"""
Is the following response accurate and complete for the given question?
Reply "PASS" or "FAIL: <reason>".

Question: {prompt}
Response: {draft}
"""}]
    )
    verdict = val_resp.content[0].text.strip()

    if verdict.upper().startswith("PASS"):
        return {"model": "haiku", "content": draft}

    # Regenerate with Opus
    final_resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "model": "opus",
        "content": final_resp.content[0].text,
        "fallback_reason": verdict
    }

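This pattern limits Opus generation to the cases where Haiku's draft fails validation. Opus still reads every draft as the validator, so the savings come from avoided Opus output tokens, and they shrink as the failure rate rises.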
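Whether the cascade is cheaper than calling Opus directly depends on how often the draft passes. The expected cost per request is draft + validation + (1 - pass rate) × regeneration. The sketch below works through that with hypothetical token counts (a 500-token prompt, a 400-token answer, about 950 validator input tokens and 10 validator output tokens) at this chapter's assumed prices; the function names are illustrative.

```python
def call_cost(input_tokens, output_tokens, in_price, out_price):
    """USD cost of one call; prices are per million tokens."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

def cascade_expected_cost(pass_rate, draft, validate, regen):
    """Expected cost per request: always draft and validate, sometimes regenerate."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return draft + validate + (1.0 - pass_rate) * regen

def break_even_pass_rate(draft, validate, regen):
    """Pass rate above which the cascade beats always calling the large model."""
    return (draft + validate) / regen

# Hypothetical token counts; prices as assumed in this chapter.
draft = call_cost(500, 400, 0.25, 1.25)      # Haiku draft
validate = call_cost(950, 10, 15.00, 75.00)  # Opus reads prompt + draft
regen = call_cost(500, 400, 15.00, 75.00)    # Opus regenerates

print(f"break-even pass rate: {break_even_pass_rate(draft, validate, regen):.1%}")
print(f"cost at 80% pass rate: ${cascade_expected_cost(0.8, draft, validate, regen):.6f}")
print(f"always-Opus cost:      ${regen:.6f}")
```

With these numbers the cascade only pays off when more than about 42% of drafts pass, because validation input is billed at Opus rates on every request. Measure your actual pass rate before adopting the pattern.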

2.6 Extended Thinking: When to Enable It

claude-opus-4-6 and claude-sonnet-4-6 support Extended Thinking, which allocates a token budget for the model to reason through a problem before generating the final response. Thinking tokens are billed at output prices.

Enable Extended Thinking When:

Task                           | Recommended?
-------------------------------+------------------------------
Mathematical proofs            | Strongly yes
Complex debugging              | Yes
Multi-factor decision analysis | Yes
Simple Q&A                     | No (adds latency and cost)
Format conversion              | No
Creative writing               | Usually not

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16_000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8_000   # start conservative, increase if quality insufficient
    },
    messages=[{
        "role": "user",
        "content": "Prove that for any positive integer n, n^5 - n is divisible by 30."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking: {len(block.thinking)} chars]")
    elif block.type == "text":
        print(f"[Answer]\n{block.text}")

Cost warning: budget_tokens=8000 with Opus means a worst-case additional cost of 8000 / 1M * $75 = $0.60 per request for thinking alone. Start with 2,000–4,000 tokens and increase only if the quality improvement justifies it.
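The same arithmetic generalizes to any budget and request volume. A small sketch using the $75 per million output tokens assumed for Opus in this chapter; the function name and the 1,000-requests-per-day figure are illustrative.

```python
def worst_case_thinking_cost(budget_tokens, output_price_per_m):
    """Upper bound on per-request thinking cost if the full budget is used."""
    return budget_tokens / 1_000_000 * output_price_per_m

OPUS_OUTPUT_PRICE = 75.0  # USD per million tokens, as assumed in this chapter

for budget in (2_000, 4_000, 8_000):
    per_request = worst_case_thinking_cost(budget, OPUS_OUTPUT_PRICE)
    per_day = per_request * 1_000  # hypothetical volume
    print(f"budget={budget:>5}  ${per_request:.2f}/request  ${per_day:,.0f}/day at 1,000 requests")
```

This is an upper bound: the model may use fewer thinking tokens than the budget allows, and you are billed for what it uses.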

2.7 Common Decision Mistakes

Mistake 1: Defaulting to the Newest Model

A higher version number does not mean better performance on your specific task. claude-haiku-4-5-20251001 may be 60x more cost-effective than claude-opus-4-6 for simple extraction, with negligible quality difference.

Mistake 2: Ignoring Compounding Latency in Agent Pipelines

In a 10-step agentic workflow where each step waits for the previous one, per-call latency adds up. If Opus averages 8 seconds per call and Sonnet averages 3 seconds, the 10-step pipeline takes 80s vs 30s. For interactive agentic applications, this is a dealbreaker.
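Summing per-step latency also makes it easy to price a mixed pipeline, where only the steps that need it run on the slower tier. A sketch using the average latencies quoted above and assuming strictly sequential calls; the function name and the 8-plus-2 split are illustrative.

```python
# Average seconds per call, from the example above.
LATENCY_S = {"opus": 8.0, "sonnet": 3.0}

def pipeline_latency(steps, latency_s=LATENCY_S):
    """Total latency of a sequential pipeline, given the tier used at each step."""
    return sum(latency_s[tier] for tier in steps)

all_opus = pipeline_latency(["opus"] * 10)
all_sonnet = pipeline_latency(["sonnet"] * 10)
mixed = pipeline_latency(["sonnet"] * 8 + ["opus"] * 2)  # Opus only where needed
print(all_opus, all_sonnet, mixed)
```

Moving just two of ten steps to Opus adds 10 seconds to the all-Sonnet pipeline, which is often a better trade than moving all ten.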

Mistake 3: Trusting Academic Benchmarks Over Business Tests

MMLU and HumanEval measure general academic performance. Your task distribution may look nothing like those benchmarks. Building a 50-100 item evaluation set from real production samples is one of the highest-ROI investments in model selection.

Mistake 4: Skipping Prompt Optimization Before Upgrading

Before escalating to a more expensive model, verify that your prompt is well-structured. A thoughtfully designed Sonnet prompt often outperforms a poorly designed Opus prompt, at one-fifth the cost.


Summary

Model selection is an architecture decision with direct cost and quality implications. The core principles:

  1. Haiku for high-throughput, low-latency, and error-tolerant tasks
  2. Sonnet as the sensible default for most production workloads
  3. Opus reserved for tasks where deep reasoning demonstrably improves outcomes and errors are costly
  4. Hybrid architectures (router + cascade) are usually the most economical path to high quality
  5. Measure before deciding: build task-specific evaluation sets and let data drive the choice

The next chapter moves from architecture into hands-on practice: obtaining an API key, understanding rate limits, installing the SDK, and making your first successful request.
