Chapter 15

Extended Thinking Deep Dive: budget_tokens, display Modes and Multi-Turn Propagation Mechanics

Chapter 15: Extended Thinking: Enabling Deep Reasoning, Budget Control, and Token Economics

15.1 What Is Extended Thinking?

Extended Thinking is a Claude capability that lets the model reason through a problem in a private "scratchpad" before producing its final answer. The scratchpad content, called a thinking block, is returned to the developer alongside the text response but is typically not displayed to end users.

Conceptually, it mirrors what a thoughtful human does when facing a hard problem: writing out intermediate steps, exploring multiple angles, noticing errors, and correcting course, all before committing to a final answer.

Where Extended Thinking shines

Where it adds little value

15.2 Basic Usage: Enabling Extended Thinking

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",   # Best reasoning performance
    max_tokens=16000,           # Must be large enough for thinking + output
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max tokens the model may spend thinking
    },
    messages=[{
        "role": "user",
        "content": "Prove that the sum 1+2+3+...+n equals n(n+1)/2 for all positive integers n."
    }]
)

# The response contains two content block types
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking]\n{block.thinking}\n")
    elif block.type == "text":
        print(f"[Answer]\n{block.text}")

The relationship between max_tokens and budget_tokens

This is the single most common configuration mistake:

max_tokens ≥ budget_tokens + expected_output_tokens
# WRONG: max_tokens too small
bad = {
    "max_tokens": 1024,      # Smaller than budget_tokens alone!
    "thinking": {"type": "enabled", "budget_tokens": 10000}
}

# CORRECT: leave room for both thinking and final answer
good = {
    "max_tokens": 16000,     # 10000 for thinking + 6000 for output
    "thinking": {"type": "enabled", "budget_tokens": 10000}
}
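The inequality above can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: `check_token_budget` and `MIN_BUDGET_TOKENS` are hypothetical names, the 1,024-token floor is assumed from the API documentation, and `expected_output_tokens` is whatever estimate the caller supplies.

```python
MIN_BUDGET_TOKENS = 1024  # assumed minimum thinking budget accepted by the API


def check_token_budget(max_tokens: int, budget_tokens: int,
                       expected_output_tokens: int = 0) -> list[str]:
    """Return a list of configuration problems; an empty list means OK."""
    problems = []
    if budget_tokens < MIN_BUDGET_TOKENS:
        problems.append(
            f"budget_tokens={budget_tokens} is below the minimum of {MIN_BUDGET_TOKENS}"
        )
    if budget_tokens >= max_tokens:
        # The thinking budget alone already consumes the whole response limit.
        problems.append("budget_tokens must be smaller than max_tokens")
    elif max_tokens < budget_tokens + expected_output_tokens:
        # Valid for the API, but the answer may be cut off.
        problems.append("max_tokens leaves too little room for the expected answer")
    return problems


print(check_token_budget(1024, 10000))                                 # the WRONG config
print(check_token_budget(16000, 10000, expected_output_tokens=6000))   # the CORRECT config
```

Returning a list rather than raising lets a caller log every problem at once, which is convenient when the configuration is assembled from several sources.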

15.3 Budget Control

How budget affects quality

budget_tokens is an upper bound, not a minimum. The model decides how much of the budget to actually use based on the problem's complexity:

def solve_at_budget(problem: str, budget: int) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=budget + 4096,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": problem}]
    )

    thinking_text = ""
    answer = ""
    for block in response.content:
        if block.type == "thinking":
            thinking_text = block.thinking
        elif block.type == "text":
            answer = block.text

    return {
        "budget": budget,
        "thinking_chars": len(thinking_text),
        "output_tokens": response.usage.output_tokens,
        "answer": answer[:200] + "..." if len(answer) > 200 else answer
    }

problem = "How many ways can 8 queens be placed on an 8×8 chessboard so no two queens attack each other?"
for budget in [1024, 5000, 10000, 20000]:  # 1024 is the minimum accepted budget_tokens
    r = solve_at_budget(problem, budget)
    print(f"Budget {budget:6d}: {r['thinking_chars']:6d} thinking chars, {r['output_tokens']} output tokens")

| Task type | budget_tokens | max_tokens |
|---|---|---|
| Simple reasoning (a few steps) | 1,024–3,000 | 4,000–6,000 |
| Moderate complexity (math, logic) | 5,000–10,000 | 12,000–16,000 |
| Complex analysis (architecture, strategy) | 10,000–20,000 | 20,000–30,000 |
| Research-level problems | 20,000–50,000 | 60,000+ |

Extending output beyond 8K with betas

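# Use the output-128k beta for very large outputs.
# The betas parameter is accepted on the beta namespace (client.beta.messages).
# Beta availability is model-specific, so confirm that your model supports it.
response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=100000,
    betas=["output-128k-2025-02-19"],   # Extends output ceiling to 128K tokens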
    thinking={
        "type": "enabled",
        "budget_tokens": 80000
    },
    messages=[{"role": "user", "content": "Conduct a comprehensive architectural analysis of..."}]
)

15.4 Token Economics

How thinking tokens are billed

Thinking block tokens are charged at the output token rate, not the input rate:

Total cost = input_tokens × input_price + (thinking_tokens + text_tokens) × output_price

Example with claude-opus-4-6, assuming $15/MTok input and $75/MTok output (rates vary by model and change over time, so check the current pricing page):

def calculate_cost(
    input_tokens: int,
    thinking_tokens: int,
    output_tokens: int,
    model: str = "claude-opus-4-6"
) -> dict:
    PRICES = {
        "claude-opus-4-6": {"in": 15.0, "out": 75.0},
        "claude-sonnet-4-6": {"in": 3.0, "out": 15.0},
    }
    p = PRICES[model]

    in_cost = (input_tokens / 1e6) * p["in"]
    think_cost = (thinking_tokens / 1e6) * p["out"]
    out_cost = (output_tokens / 1e6) * p["out"]
    total = in_cost + think_cost + out_cost

    return {
        "input_cost": round(in_cost, 6),
        "thinking_cost": round(think_cost, 6),
        "output_cost": round(out_cost, 6),
        "total_cost": round(total, 6),
        "thinking_share": f"{think_cost / total * 100:.1f}%"
    }

# A request with 500 input tokens, 8000 thinking tokens, 500 output tokens
print(calculate_cost(500, 8000, 500))
# {'input_cost': 0.0075, 'thinking_cost': 0.6, 'output_cost': 0.0375,
#  'total_cost': 0.645, 'thinking_share': '93.0%'}

In the example above, thinking accounts for 93% of the total. On Extended Thinking requests with short prompts and short answers, a share of 85–95% is typical, so budget control is cost control.
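Because thinking and answer tokens are both billed as output, and together they cannot exceed max_tokens, a per-request spending ceiling can be computed before the call is made. The sketch below is illustrative only: `worst_case_cost` is a hypothetical helper, and the prices passed in are the same assumed rates used in the example above.

```python
def worst_case_cost(input_tokens: int, max_tokens: int,
                    in_price: float, out_price: float) -> float:
    """Upper bound in USD for one request.

    Prices are USD per million tokens. The output-side charge is capped by
    max_tokens because thinking and answer tokens share that limit.
    """
    return (input_tokens * in_price + max_tokens * out_price) / 1_000_000


# Assumed illustrative prices: $15/MTok input, $75/MTok output.
for max_tokens in (4000, 16000, 30000):
    cap = worst_case_cost(500, max_tokens, 15.0, 75.0)
    print(f"max_tokens={max_tokens:6d}  worst case ${cap:.4f}")
```

Actual spend is usually lower, since the model may stop well short of the limit. The bound is mainly useful for setting alerts or rejecting configurations that could exceed a per-request cap.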

Tiered complexity routing

def adaptive_solve(client, problem: str, complexity: str = "auto") -> str:
    """Route to the right model and thinking budget based on complexity."""

    if complexity == "auto":
        # Quick probe with Haiku to estimate complexity
        probe = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"Rate complexity (simple/medium/complex), one word:\n{problem}"
            }]
        )
        complexity = probe.content[0].text.strip().lower()
        if complexity not in ("simple", "medium", "complex"):
            complexity = "medium"

    configs = {
        "simple": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 1024,
            "thinking": None
        },
        "medium": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 8000,
            "thinking": {"type": "enabled", "budget_tokens": 5000}
        },
        "complex": {
            "model": "claude-opus-4-6",
            "max_tokens": 20000,
            "thinking": {"type": "enabled", "budget_tokens": 15000}
        }
    }
    cfg = configs[complexity]

    kwargs = {
        "model": cfg["model"],
        "max_tokens": cfg["max_tokens"],
        "messages": [{"role": "user", "content": problem}]
    }
    if cfg["thinking"]:
        kwargs["thinking"] = cfg["thinking"]

    response = client.messages.create(**kwargs)
    return next((b.text for b in response.content if b.type == "text"), "")
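The probe step in the router compares the model's reply against three exact labels, so a reply such as "Complex." or "I'd say medium" falls through to the default. One way to make that step more forgiving is sketched below. `normalize_complexity` is a hypothetical helper that is not part of the original router, and scanning for the first matching word is an assumption about how the probe tends to answer.

```python
import re

VALID_LEVELS = ("simple", "medium", "complex")


def normalize_complexity(raw: str, default: str = "medium") -> str:
    """Map a free-form probe reply onto one of the three routing labels."""
    # Lowercase and split into alphabetic words so punctuation is ignored.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in VALID_LEVELS:
            return word
    return default  # nothing recognizable: fall back to the middle tier


print(normalize_complexity("Complex."))
print(normalize_complexity("not sure"))
```

In `adaptive_solve`, this could replace the `strip().lower()` call and the membership check that follows it.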

15.5 Working with Thinking Content

Accessing thinking blocks

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "What's the optimal strategy in this scenario?"}]
)

for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
        print(f"(Signature: {block.signature[:20]}...)")
    elif block.type == "text":
        print("=== FINAL ANSWER ===")
        print(block.text)

Multi-turn conversations preserving thinking context

def multi_turn_thinking(client, initial_problem: str) -> None:
    """Maintain thinking context across conversation turns."""

    messages = [{"role": "user", "content": initial_problem}]

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        messages=messages
    )

    # Preserve ALL content blocks (including thinking) in history
    messages.append({
        "role": "assistant",
        "content": response.content   # Include thinking blocks
    })

    # Follow-up question
    messages.append({
        "role": "user",
        "content": "What edge cases should we consider based on your analysis?"
    })

    response2 = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=messages
    )

    for block in response2.content:
        if block.type == "text":
            print(block.text)

Key rule: When including a previous assistant turn with thinking blocks in the conversation history, preserve the thinking blocks exactly as returned. They cannot be modified, and the API uses their signature field to verify that they are unaltered.
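When conversation history is stored outside the process (for example as JSON), the blocks have to be serialized without altering them. The following is a minimal sketch under stated assumptions: `block_to_dict` and `assistant_turn` are hypothetical helpers, the field names mirror the attributes used in the examples above, and the `SimpleNamespace` objects stand in for real SDK blocks so the snippet runs without an API call.

```python
from types import SimpleNamespace


def block_to_dict(block) -> dict:
    """Copy the fields that must be sent back, leaving their values untouched."""
    if block.type == "thinking":
        return {"type": "thinking", "thinking": block.thinking,
                "signature": block.signature}
    if block.type == "text":
        return {"type": "text", "text": block.text}
    if block.type == "tool_use":
        return {"type": "tool_use", "id": block.id,
                "name": block.name, "input": block.input}
    # Refuse unknown types rather than silently dropping a block.
    raise ValueError(f"unhandled block type: {block.type}")


def assistant_turn(blocks) -> dict:
    """Build an assistant message that keeps every block in its original order."""
    return {"role": "assistant", "content": [block_to_dict(b) for b in blocks]}


# Stand-in blocks for demonstration; real ones come from response.content.
fake_blocks = [
    SimpleNamespace(type="thinking", thinking="step 1 ...", signature="sig-abc"),
    SimpleNamespace(type="text", text="The answer is 42."),
]
print(assistant_turn(fake_blocks))
```

Raising on unknown block types is a deliberate choice here: a dropped block is harder to diagnose than an explicit error at serialization time.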

Streaming with thinking

with client.messages.stream(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Solve this complex problem..."}]
) as stream:
    current_type = None

    for event in stream:
        if event.type == "content_block_start":
            current_type = event.content_block.type
            if current_type == "thinking":
                print("\n[Thinking...]\n", flush=True)
            elif current_type == "text":
                print("\n[Answer]\n", flush=True)

        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
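If the thinking and the answer need to be kept rather than printed, the same event loop can accumulate the deltas into separate buffers. This is a sketch only: `collect_stream` is a hypothetical helper, and the `SimpleNamespace` events below imitate the event shapes used in the loop above so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(events) -> dict:
    """Accumulate thinking and text deltas from an iterable of stream events."""
    thinking_parts, text_parts = [], []
    for event in events:
        if event.type != "content_block_delta":
            continue  # start/stop events carry no text
        if event.delta.type == "thinking_delta":
            thinking_parts.append(event.delta.thinking)
        elif event.delta.type == "text_delta":
            text_parts.append(event.delta.text)
        # Other delta types (such as signature deltas) are ignored here.
    return {"thinking": "".join(thinking_parts), "text": "".join(text_parts)}


def _delta(kind: str, **fields) -> SimpleNamespace:
    """Build a fake content_block_delta event for the demonstration."""
    return SimpleNamespace(type="content_block_delta",
                           delta=SimpleNamespace(type=kind, **fields))


fake_events = [
    SimpleNamespace(type="content_block_start"),
    _delta("thinking_delta", thinking="Try n = 1. "),
    _delta("thinking_delta", thinking="Then induct."),
    _delta("signature_delta", signature="sig"),
    _delta("text_delta", text="Proof: "),
    _delta("text_delta", text="by induction."),
]
print(collect_stream(fake_events))
```

With a live stream, the `stream` object from the `with` block would be passed in place of `fake_events`.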

15.6 Extended Thinking with Tool Use

Extended Thinking and tool calling work together, enabling the model to reason about which tools to call and how to interpret their results:

import json

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "query_database",
    "description": "Query a database and return results.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "SQL query to execute"}
        },
        "required": ["sql"]
    }
}]

def research_agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]

    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=16000,
            thinking={"type": "enabled", "budget_tokens": 8000},
            tools=tools,
            messages=messages
        )

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses or response.stop_reason == "end_turn":
            return next((b.text for b in response.content if b.type == "text"), "")

        messages.append({"role": "assistant", "content": response.content})
        tool_results = []

        for tu in tool_uses:
            # Simulate database query
            result = {"rows": [{"count": 42}], "query": tu.input["sql"]}
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": json.dumps(result)
            })

        messages.append({"role": "user", "content": tool_results})
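The loop above simulates every tool call inline. In a real agent the tool-execution step is usually factored out so that unknown tools and handler failures are reported back to the model instead of crashing the loop. A minimal sketch follows: `run_tools` and the `handlers` mapping are hypothetical names, and the `SimpleNamespace` objects stand in for the `tool_use` blocks the API returns.

```python
import json
from types import SimpleNamespace


def run_tools(tool_uses, handlers: dict) -> list[dict]:
    """Execute each requested tool and build the matching tool_result blocks."""
    results = []
    for tu in tool_uses:
        result = {"type": "tool_result", "tool_use_id": tu.id}
        handler = handlers.get(tu.name)
        if handler is None:
            result.update(content=f"Unknown tool: {tu.name}", is_error=True)
        else:
            try:
                result["content"] = json.dumps(handler(**tu.input))
            except Exception as exc:  # report the failure to the model
                result.update(content=str(exc), is_error=True)
        results.append(result)
    return results


def fake_query_database(sql: str) -> dict:
    """Stand-in for a real database call."""
    return {"rows": [{"count": 42}], "query": sql}


calls = [
    SimpleNamespace(id="tu_1", name="query_database", input={"sql": "SELECT COUNT(*) FROM t"}),
    SimpleNamespace(id="tu_2", name="missing_tool", input={}),
]
print(run_tools(calls, {"query_database": fake_query_database}))
```

The returned list can be appended as the content of the next user message, in the same way `tool_results` is used in `research_agent`.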

15.7 Constraints and Common Errors

Constraint 1: temperature must be 1

Extended Thinking requires the default temperature of 1. Setting any other value raises an error.

# WRONG
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    temperature=0.5,    # Not allowed with thinking enabled
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
)
# anthropic.BadRequestError: temperature must be 1 when thinking is enabled

Constraint 2: no prefill

# WRONG: cannot use assistant prefill with extended thinking
messages = [
    {"role": "user", "content": "Solve this problem."},
    {"role": "assistant", "content": "The answer is"}   # Not allowed
]

# CORRECT: no prefill
messages = [{"role": "user", "content": "Solve this problem."}]
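Constraints 1 and 2 can both be caught locally before the request reaches the API. The sketch below assumes request arguments are assembled as a `kwargs` dict, as in the routing example earlier. `thinking_request_problems` is a hypothetical helper, and it detects prefill only in its simplest form, where the last message has the assistant role.

```python
def thinking_request_problems(kwargs: dict) -> list[str]:
    """Return constraint violations for a request with thinking enabled."""
    thinking = kwargs.get("thinking") or {}
    if thinking.get("type") != "enabled":
        return []  # the constraints only apply when thinking is on

    problems = []
    if kwargs.get("temperature", 1) != 1:
        problems.append("temperature must be left at 1 when thinking is enabled")

    messages = kwargs.get("messages", [])
    if messages and messages[-1]["role"] == "assistant":
        problems.append("assistant prefill is not allowed when thinking is enabled")
    return problems


request = {
    "thinking": {"type": "enabled", "budget_tokens": 10000},
    "temperature": 0.5,
    "messages": [
        {"role": "user", "content": "Solve this problem."},
        {"role": "assistant", "content": "The answer is"},
    ],
}
for problem in thinking_request_problems(request):
    print(problem)
```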

Constraint 3: thinking blocks must precede text blocks

When appending an assistant turn to the message history, thinking blocks must come before text blocks, exactly as the API returned them.

def validate_content_order(blocks: list) -> bool:
    """Check that thinking blocks appear before text blocks."""
    seen_text = False
    for block in blocks:
        if hasattr(block, "type"):
            if block.type == "text":
                seen_text = True
            elif block.type == "thinking" and seen_text:
                return False   # thinking after text: invalid
    return True

15.8 Benchmarking Thinking Quality

import time

def benchmark(problems: list[dict], budgets: tuple[int, ...] = (0, 2000, 8000, 20000)) -> dict:
    """
    Measure accuracy and cost at different thinking budgets.
    problems: list of {"problem": str, "expected": str}
    """
    results = {}

    for budget in budgets:
        correct, total_cost, total_time = 0, 0.0, 0.0

        for item in problems:
            t0 = time.time()
            kwargs = {
                "model": "claude-opus-4-6",
                "max_tokens": (budget + 2048) if budget > 0 else 2048,
                "messages": [{"role": "user", "content": item["problem"]}]
            }
            if budget > 0:
                kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}

            r = client.messages.create(**kwargs)
            elapsed = time.time() - t0

            answer = next((b.text for b in r.content if b.type == "text"), "")
            if item["expected"].lower() in answer.lower():
                correct += 1

            cost = (r.usage.input_tokens * 15 + r.usage.output_tokens * 75) / 1e6
            total_cost += cost
            total_time += elapsed

        n = len(problems)
        results[budget] = {
            "accuracy": f"{correct/n:.0%}",
            "avg_cost_usd": round(total_cost / n, 5),
            "avg_latency_s": round(total_time / n, 2)
        }

    return results
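Once `benchmark` has produced its results dictionary, a common next step is to pick the cheapest budget that still meets an accuracy target. The sketch below assumes the result shape returned above (accuracy as a percentage string, cost as a float). `cheapest_budget_meeting` is a hypothetical helper, and the sample numbers are invented purely to exercise the function; they are not measurements.

```python
def cheapest_budget_meeting(results: dict, min_accuracy_pct: int):
    """Return the lowest-cost budget whose accuracy meets the target, or None."""
    candidates = []
    for budget, stats in results.items():
        accuracy_pct = int(stats["accuracy"].rstrip("%"))  # "80%" -> 80
        if accuracy_pct >= min_accuracy_pct:
            candidates.append((stats["avg_cost_usd"], budget))
    # Tuples sort by cost first, so min() picks the cheapest qualifying budget.
    return min(candidates)[1] if candidates else None


# Invented sample data for illustration only, not real benchmark output.
sample_results = {
    0:     {"accuracy": "60%", "avg_cost_usd": 0.01, "avg_latency_s": 2.0},
    2000:  {"accuracy": "80%", "avg_cost_usd": 0.12, "avg_latency_s": 6.0},
    8000:  {"accuracy": "90%", "avg_cost_usd": 0.40, "avg_latency_s": 15.0},
    20000: {"accuracy": "90%", "avg_cost_usd": 0.85, "avg_latency_s": 30.0},
}
print(cheapest_budget_meeting(sample_results, 90))
```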

Summary

Extended Thinking is Claude's primary mechanism for tackling problems that require sustained multi-step reasoning. Key takeaways:

  1. max_tokens must exceed budget_tokens; getting this wrong is the most common configuration error
  2. Thinking tokens are billed at output prices, and they typically represent 85–95% of total cost
  3. Temperature must be 1, and prefill is not allowed, when thinking is enabled
  4. Multi-turn conversations should preserve thinking blocks for context continuity
  5. Use tiered routing: reserve Extended Thinking for genuinely complex tasks; use Haiku without thinking for simple ones
  6. betas=["output-128k-2025-02-19"] extends the output ceiling to 128K tokens for ultra-deep analysis