Extended Thinking Deep Dive: budget_tokens, display Modes and Multi-Turn Propagation Mechanics

Chapter 15: Extended Thinking: Enabling Deep Reasoning, Budget Control, and Token Economics

15.1 What Is Extended Thinking?

Extended Thinking is a Claude capability that lets the model reason through a problem in a private "scratchpad" before producing its final answer. The scratchpad content — called a thinking block — is returned to the developer alongside the text response, but is typically not displayed to end users.

Conceptually, it mirrors what a thoughtful human does when facing a hard problem: writing out intermediate steps, exploring multiple angles, noticing errors, and correcting course — before committing to a final answer.

Where Extended Thinking shines

Multi-step math and logic — proofs, algebraic derivations, combinatorics
Software architecture decisions — weighing design trade-offs with many interdependencies
Strategic analysis — game theory, competitive strategy, multi-variable optimization
Research synthesis — integrating evidence from multiple sources into a coherent conclusion
Complex debugging — identifying root causes in intricate systems

Where it adds little value

Simple factual lookups ("What is the capital of France?")
Pure format conversion (translation, reformatting)
High-throughput, cost-sensitive batch workloads
Tasks where latency is critical (streaming chat with sub-second TTFT requirements)

15.2 Basic Usage: Enabling Extended Thinking

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",   # Best reasoning performance
    max_tokens=16000,           # Must be large enough for thinking + output
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max tokens the model may spend thinking
    },
    messages=[{
        "role": "user",
        "content": "Prove that the sum 1+2+3+...+n equals n(n+1)/2 for all positive integers n."
    }]
)

# The response contains two content block types
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking]\n{block.thinking}\n")
    elif block.type == "text":
        print(f"[Answer]\n{block.text}")

The relationship between `max_tokens` and `budget_tokens`

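Getting this relationship wrong is the single most common configuration mistake. As a sizing rule:

max_tokens ≥ budget_tokens + expected_output_tokens

budget_tokens: maximum tokens the model may spend in the thinking block (the API requires at least 1,024)
max_tokens: hard cap on all output tokens combined (thinking + text)
If max_tokens is not greater than budget_tokens, the API rejects the request with a validation error; if it is only slightly greater, the final answer may be cut off with stop_reason "max_tokens"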

# WRONG: max_tokens too small
bad = {
    "max_tokens": 1024,      # Smaller than budget_tokens alone!
    "thinking": {"type": "enabled", "budget_tokens": 10000}
}

# CORRECT: leave room for both thinking and final answer
good = {
    "max_tokens": 16000,     # 10000 for thinking + 6000 for output
    "thinking": {"type": "enabled", "budget_tokens": 10000}
}
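The sizing rule can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: check_thinking_config is a hypothetical helper, the 1,024-token minimum budget is taken from the API documentation, and the default answer headroom of 1,024 tokens is an assumption you should tune.

```python
MIN_BUDGET_TOKENS = 1024  # assumption: the documented minimum thinking budget


def check_thinking_config(max_tokens: int, budget_tokens: int,
                          expected_output_tokens: int = 1024) -> list[str]:
    """Return a list of problems; an empty list means the sizes look sane."""
    problems = []
    if budget_tokens < MIN_BUDGET_TOKENS:
        problems.append(
            f"budget_tokens={budget_tokens} is below the minimum of {MIN_BUDGET_TOKENS}"
        )
    if budget_tokens >= max_tokens:
        # The thinking budget alone would consume the whole output cap.
        problems.append("budget_tokens must be smaller than max_tokens")
    elif max_tokens < budget_tokens + expected_output_tokens:
        # Valid request, but little room is left for the final answer.
        problems.append(
            f"only {max_tokens - budget_tokens} tokens left for the answer; "
            f"expected about {expected_output_tokens}"
        )
    return problems


print(check_thinking_config(max_tokens=1024, budget_tokens=10000))
print(check_thinking_config(max_tokens=16000, budget_tokens=10000))
```

The first call reports a problem and the second returns an empty list, mirroring the WRONG and CORRECT configurations above.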

15.3 Budget Control

How budget affects quality

budget_tokens is an upper bound, not a minimum. The model decides how much of the budget to actually use based on the problem's complexity:

Simple problems (even with a large budget): the model may use only 200–500 tokens of thinking
Complex problems with a generous budget: the model uses the budget fully for deeper exploration
Complex problems with a tight budget: the model does its best within the constraint, but quality may suffer

def solve_at_budget(problem: str, budget: int) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=budget + 4096,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": problem}]
    )

    thinking_text = ""
    answer = ""
    for block in response.content:
        if block.type == "thinking":
            thinking_text = block.thinking
        elif block.type == "text":
            answer = block.text

    return {
        "budget": budget,
        "thinking_chars": len(thinking_text),
        "output_tokens": response.usage.output_tokens,
        "answer": (answer[:200] + "...") if len(answer) > 200 else answer
    }

problem = "How many ways can 8 queens be placed on an 8×8 chessboard so no two queens attack each other?"
for budget in [1024, 5000, 10000, 20000]:   # 1,024 is the smallest budget the API accepts
    r = solve_at_budget(problem, budget)
    print(f"Budget {budget:6d}: {r['thinking_chars']:6d} thinking chars, {r['output_tokens']} output tokens")

Recommended budget settings

Task type	budget_tokens	max_tokens
Simple reasoning (a few steps)	1,024–3,000	4,000–6,000
Moderate complexity (math, logic)	5,000–10,000	12,000–16,000
Complex analysis (architecture, strategy)	10,000–20,000	20,000–30,000
Research-level problems	20,000–50,000	60,000+
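One way to apply the table is a small lookup that turns a task type into request parameters. This is a sketch under stated assumptions: the tier names and the thinking_kwargs helper are hypothetical, each tier uses the upper end of the table's budget range, and the research tier uses 60,000 as its max_tokens because the table gives only a floor.

```python
# Upper end of each range in the table; tier names are illustrative.
BUDGET_TIERS = {
    "simple":   {"budget_tokens": 3_000,  "max_tokens": 6_000},
    "moderate": {"budget_tokens": 10_000, "max_tokens": 16_000},
    "complex":  {"budget_tokens": 20_000, "max_tokens": 30_000},
    "research": {"budget_tokens": 50_000, "max_tokens": 60_000},
}


def thinking_kwargs(task_type: str) -> dict:
    """Build the max_tokens and thinking arguments for a task tier."""
    if task_type not in BUDGET_TIERS:
        raise ValueError(f"unknown task type: {task_type!r}")
    tier = BUDGET_TIERS[task_type]
    return {
        "max_tokens": tier["max_tokens"],
        "thinking": {"type": "enabled", "budget_tokens": tier["budget_tokens"]},
    }


print(thinking_kwargs("moderate"))
```

The returned dictionary can be unpacked into a request, for example client.messages.create(model=..., messages=..., **thinking_kwargs("moderate")).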

Extending output beyond 8K with betas

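# Use the output-128k beta for very large outputs.
# betas is accepted by the beta Messages namespace (client.beta.messages),
# not by client.messages. Check the API docs to confirm this beta applies
# to the model you call.
# Streaming is used because the SDK may refuse non-streaming requests
# with a max_tokens this large.
with client.beta.messages.stream(
    model="claude-opus-4-6",
    max_tokens=100000,
    betas=["output-128k-2025-02-19"],   # Extends output ceiling to 128K tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 80000
    },
    messages=[{"role": "user", "content": "Conduct a comprehensive architectural analysis of..."}]
) as stream:
    response = stream.get_final_message()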

15.4 Token Economics

How thinking tokens are billed

Thinking block tokens are charged at the output token rate, not the input rate:

Total cost = input_tokens × input_price + (thinking_tokens + text_tokens) × output_price

Example using the rates this chapter assumes for claude-opus-4-6 ($15/MTok input, $75/MTok output); check the current pricing page before relying on them:

def calculate_cost(
    input_tokens: int,
    thinking_tokens: int,
    output_tokens: int,
    model: str = "claude-opus-4-6"
) -> dict:
    PRICES = {
        "claude-opus-4-6": {"in": 15.0, "out": 75.0},
        "claude-sonnet-4-6": {"in": 3.0, "out": 15.0},
    }
    p = PRICES[model]

    in_cost = (input_tokens / 1e6) * p["in"]
    think_cost = (thinking_tokens / 1e6) * p["out"]
    out_cost = (output_tokens / 1e6) * p["out"]
    total = in_cost + think_cost + out_cost

    return {
        "input_cost": round(in_cost, 6),
        "thinking_cost": round(think_cost, 6),
        "output_cost": round(out_cost, 6),
        "total_cost": round(total, 6),
        "thinking_share": f"{think_cost / total * 100:.1f}%"
    }

# A request with 500 input tokens, 8000 thinking tokens, 500 output tokens
print(calculate_cost(500, 8000, 500))
# {'input_cost': 0.0075, 'thinking_cost': 0.6, 'output_cost': 0.0375,
#  'total_cost': 0.645, 'thinking_share': '93.0%'}

Thinking tokens typically account for 85–95% of total cost on Extended Thinking requests. Budget control is cost control.
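Because thinking dominates the bill, the cost formula can be inverted to find the largest budget that fits a per-request spending cap. The sketch below is illustrative only: max_thinking_budget is a hypothetical helper, the prices are passed in rather than hard-coded, and the 1,024-token floor is the documented minimum budget.

```python
import math


def max_thinking_budget(cost_cap_usd: float, input_tokens: int, answer_tokens: int,
                        in_price_per_mtok: float, out_price_per_mtok: float,
                        min_budget: int = 1024) -> int:
    """Largest budget_tokens that keeps worst-case cost under the cap.

    Returns 0 when even the minimum budget would exceed the cap,
    which signals that thinking should be left disabled.
    """
    # Cost that is spent regardless of thinking: the prompt and the final answer.
    fixed_cost = (input_tokens * in_price_per_mtok
                  + answer_tokens * out_price_per_mtok) / 1e6
    remaining = cost_cap_usd - fixed_cost
    # Thinking tokens are billed at the output rate.
    budget = math.floor(remaining / out_price_per_mtok * 1e6)
    return budget if budget >= min_budget else 0


# 50-cent cap, 500 prompt tokens, 500 answer tokens, $15/$75 per MTok
print(max_thinking_budget(0.50, 500, 500, 15.0, 75.0))
```

This is a worst-case bound: since budget_tokens is an upper limit, the actual cost is often lower.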

Tiered complexity routing

def adaptive_solve(client, problem: str, complexity: str = "auto") -> str:
    """Route to the right model and thinking budget based on complexity."""

    if complexity == "auto":
        # Quick probe with Haiku to estimate complexity
        probe = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"Rate complexity (simple/medium/complex), one word:\n{problem}"
            }]
        )
        complexity = probe.content[0].text.strip().lower()
        if complexity not in ("simple", "medium", "complex"):
            complexity = "medium"

    configs = {
        "simple": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 1024,
            "thinking": None
        },
        "medium": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 8000,
            "thinking": {"type": "enabled", "budget_tokens": 5000}
        },
        "complex": {
            "model": "claude-opus-4-6",
            "max_tokens": 20000,
            "thinking": {"type": "enabled", "budget_tokens": 15000}
        }
    }
    cfg = configs[complexity]

    kwargs = {
        "model": cfg["model"],
        "max_tokens": cfg["max_tokens"],
        "messages": [{"role": "user", "content": problem}]
    }
    if cfg["thinking"]:
        kwargs["thinking"] = cfg["thinking"]

    response = client.messages.create(**kwargs)
    return next((b.text for b in response.content if b.type == "text"), "")
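The probe's reply is free text, so an answer such as "Complex." or "I'd say simple" would fail an exact string comparison and fall through to the default tier. A more forgiving parser is one option; normalize_complexity below is a hypothetical helper, not something the router above already uses.

```python
import re

VALID_LEVELS = ("simple", "medium", "complex")


def normalize_complexity(raw: str, default: str = "medium") -> str:
    """Return the first valid level named in the reply, or the default."""
    # Split into lowercase words so punctuation and surrounding text are ignored.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in VALID_LEVELS:
            return word
    return default


for reply in ["Complex.", "  Medium\n", "I'd say simple", "hard"]:
    print(repr(reply), "->", normalize_complexity(reply))
```

Matching whole words means a reply like "complexity: unknown" still falls back to the default rather than being read as "complex".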

15.5 Working with Thinking Content

Accessing thinking blocks

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "What's the optimal strategy in this scenario?"}]
)

for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
        print(f"(Signature: {block.signature[:20]}...)")
    elif block.type == "text":
        print("=== FINAL ANSWER ===")
        print(block.text)

Multi-turn conversations preserving thinking context

def multi_turn_thinking(client, initial_problem: str) -> None:
    """Maintain thinking context across conversation turns."""

    messages = [{"role": "user", "content": initial_problem}]

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        messages=messages
    )

    # Preserve ALL content blocks (including thinking) in history
    messages.append({
        "role": "assistant",
        "content": response.content   # Include thinking blocks
    })

    # Follow-up question
    messages.append({
        "role": "user",
        "content": "What edge cases should we consider based on your analysis?"
    })

    response2 = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=messages
    )

    for block in response2.content:
        if block.type == "text":
            print(block.text)

Key rule: When including a previous assistant turn with thinking blocks in the conversation history, the thinking blocks must be preserved exactly as returned. They cannot be edited or reordered, because the API uses each block's signature field to verify that the thinking content is unaltered.
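In practice this means keeping two views of each assistant turn: the untouched block list that goes back into the history, and a text-only view for the user interface. The sketch below uses plain dictionaries to stand in for response blocks; visible_text and append_assistant_turn are hypothetical helper names.

```python
def _field(block, name):
    """Read a field from either a dict block or an SDK-style object."""
    if isinstance(block, dict):
        return block.get(name)
    return getattr(block, name, None)


def visible_text(content) -> str:
    """Join the text blocks for display without touching the original list."""
    return "".join(
        _field(b, "text") or "" for b in content if _field(b, "type") == "text"
    )


def append_assistant_turn(messages: list, content) -> None:
    """Store the turn with every block, thinking included, exactly as returned."""
    messages.append({"role": "assistant", "content": content})


# Stand-in for response.content; a real signature is an opaque string.
content = [
    {"type": "thinking", "thinking": "Check n=1, then induct.", "signature": "sig-placeholder"},
    {"type": "text", "text": "The formula holds by induction."},
]
messages = [{"role": "user", "content": "Prove the sum formula."}]
append_assistant_turn(messages, content)

print(visible_text(content))            # what the end user sees
print(len(messages[-1]["content"]))     # both blocks are kept in history
```

Filtering happens only on the display path, so the history never holds a modified copy of a thinking block.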

Streaming with thinking

with client.messages.stream(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Solve this complex problem..."}]
) as stream:
    current_type = None

    for event in stream:
        if event.type == "content_block_start":
            current_type = event.content_block.type
            if current_type == "thinking":
                print("\n[Thinking...]\n", flush=True)
            elif current_type == "text":
                print("\n[Answer]\n", flush=True)

        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
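When the deltas need to be stored rather than printed, the same event loop can accumulate them by type. The sketch below is an assumption-laden illustration: collect_stream is a hypothetical helper, and SimpleNamespace objects stand in for real stream events so that the logic can be exercised without a network call.

```python
from types import SimpleNamespace


def collect_stream(events) -> dict:
    """Accumulate thinking and text deltas from an iterable of stream events."""
    parts = {"thinking": [], "text": []}
    for event in events:
        if getattr(event, "type", None) != "content_block_delta":
            continue  # start/stop events carry no delta text
        delta = event.delta
        if delta.type == "thinking_delta":
            parts["thinking"].append(delta.thinking)
        elif delta.type == "text_delta":
            parts["text"].append(delta.text)
        # Other delta types (for example signature_delta) are ignored here.
    return {kind: "".join(chunks) for kind, chunks in parts.items()}


def _delta(kind, **fields):
    """Build a fake content_block_delta event for the demo."""
    return SimpleNamespace(type="content_block_delta",
                           delta=SimpleNamespace(type=kind, **fields))


fake_events = [
    SimpleNamespace(type="content_block_start"),
    _delta("thinking_delta", thinking="Try n=1. "),
    _delta("thinking_delta", thinking="Induct."),
    _delta("signature_delta", signature="sig"),
    _delta("text_delta", text="Proved."),
]
result = collect_stream(fake_events)
print(result)
```

With a live stream, the events iterable would be the stream object from the with block above.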

15.6 Extended Thinking with Tool Use

Extended Thinking and tool calling work together, enabling the model to reason about which tools to call and how to interpret their results:

import json

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "query_database",
    "description": "Query a database and return results.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "SQL query to execute"}
        },
        "required": ["sql"]
    }
}]

def research_agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]

    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=16000,
            thinking={"type": "enabled", "budget_tokens": 8000},
            tools=tools,
            messages=messages
        )

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses or response.stop_reason == "end_turn":
            return next((b.text for b in response.content if b.type == "text"), "")

        messages.append({"role": "assistant", "content": response.content})
        tool_results = []

        for tu in tool_uses:
            # Simulate database query
            result = {"rows": [{"count": 42}], "query": tu.input["sql"]}
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": json.dumps(result)
            })

        messages.append({"role": "user", "content": tool_results})
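The loop above simulates every tool call inline. A real agent usually dispatches on the tool name and reports failures back to the model instead of crashing. The sketch below shows one possible shape: build_tool_results and the handlers registry are hypothetical, the tool_use blocks are faked with SimpleNamespace, and is_error is the tool_result field used to flag a failed call.

```python
import json
from types import SimpleNamespace


def build_tool_results(tool_uses, handlers: dict) -> list[dict]:
    """Run each requested tool and wrap the outcome as a tool_result block."""
    results = []
    for tu in tool_uses:
        try:
            handler = handlers.get(tu.name)
            if handler is None:
                raise ValueError(f"unknown tool: {tu.name}")
            payload, is_error = json.dumps(handler(**tu.input)), False
        except Exception as exc:  # report the failure to the model
            payload, is_error = str(exc), True
        results.append({
            "type": "tool_result",
            "tool_use_id": tu.id,
            "content": payload,
            "is_error": is_error,
        })
    return results


# Simulated handler and tool_use blocks, mirroring the example above.
handlers = {"query_database": lambda sql: {"rows": [{"count": 42}], "query": sql}}
tool_uses = [
    SimpleNamespace(id="toolu_1", name="query_database",
                    input={"sql": "SELECT COUNT(*) FROM users"}),
    SimpleNamespace(id="toolu_2", name="missing_tool", input={}),
]
results = build_tool_results(tool_uses, handlers)
print(results)
```

The returned list can be appended as the content of the next user message, exactly as tool_results is in the loop above.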

15.7 Constraints and Common Errors

Constraint 1: temperature must be 1

Extended Thinking requires the default temperature of 1. Setting any other value raises an error.

# WRONG
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    temperature=0.5,    # Not allowed with thinking enabled
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
)
# anthropic.BadRequestError: temperature must be 1 when thinking is enabled

Constraint 2: no prefill

# WRONG: cannot use assistant prefill with extended thinking
messages = [
    {"role": "user", "content": "Solve this problem."},
    {"role": "assistant", "content": "The answer is"}   # Not allowed
]

# CORRECT: no prefill
messages = [{"role": "user", "content": "Solve this problem."}]
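Constraints 1 and 2 can both be caught before the request leaves your process. The following is a rough sketch: thinking_preflight is a hypothetical helper that inspects the keyword arguments you intend to pass, and it covers only the two constraints described so far.

```python
def thinking_preflight(kwargs: dict) -> list[str]:
    """Return constraint violations for a request that enables thinking."""
    problems = []
    thinking = kwargs.get("thinking") or {}
    if thinking.get("type") != "enabled":
        return problems  # the constraints apply only when thinking is on

    # Constraint 1: temperature must stay at its default of 1.
    if kwargs.get("temperature", 1) != 1:
        problems.append("temperature must be 1 when thinking is enabled")

    # Constraint 2: a trailing assistant message is a prefill.
    messages = kwargs.get("messages") or []
    if messages and messages[-1].get("role") == "assistant":
        problems.append("assistant prefill is not allowed when thinking is enabled")
    return problems


request = {
    "temperature": 0.5,
    "thinking": {"type": "enabled", "budget_tokens": 10000},
    "messages": [
        {"role": "user", "content": "Solve this problem."},
        {"role": "assistant", "content": "The answer is"},
    ],
}
for problem in thinking_preflight(request):
    print("-", problem)
```

Failing fast locally saves a round trip and gives a clearer message than a generic bad-request error.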

Constraint 3: thinking blocks must precede text blocks

When appending an assistant turn to the message history, thinking blocks must come before text blocks, exactly as the API returned them.

def validate_content_order(blocks: list) -> bool:
    """Check that thinking blocks appear before text blocks."""
    seen_text = False
    for block in blocks:
        if hasattr(block, "type"):
            if block.type == "text":
                seen_text = True
            elif block.type == "thinking" and seen_text:
                return False   # thinking after text: invalid
    return True

15.8 Benchmarking Thinking Quality

import time

def benchmark(problems: list[dict], budgets: tuple[int, ...] = (0, 2000, 8000, 20000)) -> dict:
    """
    Measure accuracy and cost at different thinking budgets.
    problems: list of {"problem": str, "expected": str}
    """
    results = {}

    for budget in budgets:
        correct, total_cost, total_time = 0, 0.0, 0.0

        for item in problems:
            t0 = time.time()
            kwargs = {
                "model": "claude-opus-4-6",
                "max_tokens": (budget + 2048) if budget > 0 else 2048,
                "messages": [{"role": "user", "content": item["problem"]}]
            }
            if budget > 0:
                kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}

            r = client.messages.create(**kwargs)
            elapsed = time.time() - t0

            answer = next((b.text for b in r.content if b.type == "text"), "")
            if item["expected"].lower() in answer.lower():
                correct += 1

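            # usage.output_tokens already includes thinking tokens;
            # 15 and 75 are the per-MTok Opus rates assumed in 15.4.
            cost = (r.usage.input_tokens * 15 + r.usage.output_tokens * 75) / 1e6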
            total_cost += cost
            total_time += elapsed

        n = len(problems)
        results[budget] = {
            "accuracy": f"{correct/n:.0%}",
            "avg_cost_usd": round(total_cost / n, 5),
            "avg_latency_s": round(total_time / n, 2)
        }

    return results
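The benchmark output can then drive a budget choice, for example the cheapest budget that meets an accuracy target. The sketch below assumes the result shape returned by benchmark above; cheapest_budget is a hypothetical helper, and the sample numbers are placeholders made up for illustration, not measurements.

```python
def cheapest_budget(results: dict, min_accuracy: float):
    """Return the lowest-cost budget whose accuracy meets the target, or None."""
    candidates = [
        (stats["avg_cost_usd"], budget)
        for budget, stats in results.items()
        # accuracy is stored as a string such as "90%"
        if float(stats["accuracy"].rstrip("%")) / 100 >= min_accuracy
    ]
    return min(candidates)[1] if candidates else None


# Placeholder data in the shape benchmark() returns; NOT real measurements.
sample_results = {
    0:     {"accuracy": "60%", "avg_cost_usd": 0.01, "avg_latency_s": 1.0},
    2000:  {"accuracy": "80%", "avg_cost_usd": 0.12, "avg_latency_s": 6.0},
    8000:  {"accuracy": "90%", "avg_cost_usd": 0.45, "avg_latency_s": 20.0},
    20000: {"accuracy": "90%", "avg_cost_usd": 0.98, "avg_latency_s": 45.0},
}
print(cheapest_budget(sample_results, min_accuracy=0.90))
```

Keep in mind that substring matching on the expected answer is a coarse accuracy measure, so small differences between budgets may not be meaningful.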

Summary

Extended Thinking is Claude's primary mechanism for tackling problems that require sustained multi-step reasoning. Key takeaways:

max_tokens must exceed budget_tokens — the most common configuration error
Thinking tokens are billed at output prices — they typically represent 85–95% of total cost
Temperature must be 1, and assistant prefill is not allowed, when thinking is enabled
Multi-turn conversations should preserve thinking blocks for context continuity
Use tiered routing — reserve Extended Thinking for genuinely complex tasks; use Haiku without thinking for simple ones
betas=["output-128k-2025-02-19"] extends the output ceiling to 128K tokens for ultra-deep analysis
