Extended Thinking Deep Dive: budget_tokens, display Modes and Multi-Turn Propagation Mechanics
Chapter 15: Extended Thinking: Enabling Deep Reasoning, Budget Control, and Token Economics
15.1 What Is Extended Thinking?
Extended Thinking is a Claude capability that lets the model reason through a problem in a private "scratchpad" before producing its final answer. The scratchpad content — called a thinking block — is returned to the developer alongside the text response, but is typically not displayed to end users.
Conceptually, it mirrors what a thoughtful human does when facing a hard problem: writing out intermediate steps, exploring multiple angles, noticing errors, and correcting course — before committing to a final answer.
Where Extended Thinking shines
- Multi-step math and logic — proofs, algebraic derivations, combinatorics
- Software architecture decisions — weighing design trade-offs with many interdependencies
- Strategic analysis — game theory, competitive strategy, multi-variable optimization
- Research synthesis — integrating evidence from multiple sources into a coherent conclusion
- Complex debugging — identifying root causes in intricate systems
Where it adds little value
- Simple factual lookups ("What is the capital of France?")
- Pure format conversion (translation, reformatting)
- High-throughput, cost-sensitive batch workloads
- Tasks where latency is critical (streaming chat with sub-second TTFT requirements)
15.2 Basic Usage: Enabling Extended Thinking
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-6",  # Best reasoning performance
    max_tokens=16000,  # Must be large enough for thinking + output
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max tokens the model may spend thinking
    },
    messages=[{
        "role": "user",
        "content": "Prove that the sum 1+2+3+...+n equals n(n+1)/2 for all positive integers n."
    }]
)
# The response contains two content block types
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking]\n{block.thinking}\n")
    elif block.type == "text":
        print(f"[Answer]\n{block.text}")
The relationship between max_tokens and budget_tokens
This is the single most common configuration mistake:
max_tokens ≥ budget_tokens + expected_output_tokens
- budget_tokens: maximum tokens the model may spend in the thinking block (the API's minimum is 1,024)
- max_tokens: hard cap on all output tokens combined (thinking + text)
- If max_tokens is not greater than budget_tokens, the API rejects the request; if it is only slightly greater, the response can be truncated before the final answer is complete
# WRONG: max_tokens must be greater than budget_tokens, so the API rejects this
bad = {
    "max_tokens": 1024,  # Smaller than budget_tokens alone!
    "thinking": {"type": "enabled", "budget_tokens": 10000}
}
# CORRECT: leave room for both thinking and final answer
good = {
    "max_tokens": 16000,  # 10000 for thinking + 6000 for output
    "thinking": {"type": "enabled", "budget_tokens": 10000}
}
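The sizing rule above can be checked locally before a request is sent. What follows is a minimal sketch rather than anything the SDK provides: check_thinking_config and its expected_output parameter are hypothetical names, and the 1,024-token floor is the API's documented minimum budget at the time of writing.

```python
MIN_BUDGET_TOKENS = 1024  # assumed API minimum for budget_tokens


def check_thinking_config(max_tokens: int, budget_tokens: int,
                          expected_output: int = 0) -> list[str]:
    """Return a list of problems with the configuration; empty means it looks fine."""
    problems = []
    if budget_tokens < MIN_BUDGET_TOKENS:
        problems.append(
            f"budget_tokens {budget_tokens} is below the minimum of {MIN_BUDGET_TOKENS}"
        )
    if max_tokens <= budget_tokens:
        # The API requires budget_tokens to be strictly less than max_tokens.
        problems.append("max_tokens must be greater than budget_tokens")
    elif max_tokens < budget_tokens + expected_output:
        # Accepted by the API, but the final answer may be cut off.
        problems.append(
            "max_tokens leaves less room than expected_output for the final answer"
        )
    return problems
```

Calling check_thinking_config(1024, 10000) flags the "bad" configuration shown earlier, while check_thinking_config(16000, 10000, 6000) returns an empty list.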
15.3 Budget Control
How budget affects quality
budget_tokens is an upper bound, not a minimum. The model decides how much of the budget to actually use based on the problem's complexity:
- Simple problems (even with a large budget): the model may use only 200–500 tokens of thinking
- Complex problems with a generous budget: the model may use most or all of the budget for deeper exploration
- Complex problems with a tight budget: the model does its best within the constraint, but quality may suffer
def solve_at_budget(problem: str, budget: int) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=budget + 4096,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": problem}]
    )
    thinking_text = ""
    answer = ""
    for block in response.content:
        if block.type == "thinking":
            thinking_text = block.thinking
        elif block.type == "text":
            answer = block.text
    return {
        "budget": budget,
        "thinking_chars": len(thinking_text),
        "output_tokens": response.usage.output_tokens,
        "answer": (answer[:200] + "...") if len(answer) > 200 else answer
    }
problem = "How many ways can 8 queens be placed on an 8×8 chessboard so no two queens attack each other?"
# 1024 is the smallest budget the API accepts
for budget in [1024, 5000, 10000, 20000]:
    r = solve_at_budget(problem, budget)
    print(f"Budget {budget:6d}: {r['thinking_chars']:6d} thinking chars, {r['output_tokens']} output tokens")
Recommended budget settings
| Task type | budget_tokens | max_tokens |
|---|---|---|
| Simple reasoning (a few steps) | 1,024–3,000 | 4,000–6,000 |
| Moderate complexity (math, logic) | 5,000–10,000 | 12,000–16,000 |
| Complex analysis (architecture, strategy) | 10,000–20,000 | 20,000–30,000 |
| Research-level problems | 20,000–50,000 | 60,000+ |
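The table can be encoded as a small lookup so that request parameters come from one place. This is only a sketch: RECOMMENDED, thinking_kwargs, and the effort parameter are hypothetical names, the task-type keys are shorthand for the table rows, and the max_tokens values take the upper end of each range (60,000 for the open-ended last row).

```python
# (budget_low, budget_high, max_tokens) per task type, taken from the table above.
# The lowest budget is 1,024 because the API rejects smaller values.
RECOMMENDED = {
    "simple": (1_024, 3_000, 6_000),
    "moderate": (5_000, 10_000, 16_000),
    "complex": (10_000, 20_000, 30_000),
    "research": (20_000, 50_000, 60_000),
}


def thinking_kwargs(task_type: str, effort: float = 0.5) -> dict:
    """Build max_tokens/thinking kwargs; effort 0.0 picks the low end, 1.0 the high end."""
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be between 0.0 and 1.0")
    low, high, max_tokens = RECOMMENDED[task_type]
    budget = round(low + effort * (high - low))  # linear interpolation within the range
    return {
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }
```

The result can be unpacked straight into a request, for example client.messages.create(model=..., messages=..., **thinking_kwargs("complex")).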
Extending output beyond 8K with betas
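# Use the output-128k beta for very large outputs.
# betas= is accepted by client.beta.messages, not by client.messages.
# The flag was introduced for Claude 3.7 Sonnet; confirm it applies to your model.
# Requests this large should be streamed, so the stream helper is used here.
with client.beta.messages.stream(
    model="claude-opus-4-6",
    max_tokens=100000,
    betas=["output-128k-2025-02-19"],  # Extends output ceiling to 128K tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 80000
    },
    messages=[{"role": "user", "content": "Conduct a comprehensive architectural analysis of..."}]
) as stream:
    response = stream.get_final_message()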
15.4 Token Economics
How thinking tokens are billed
Thinking block tokens are charged at the output token rate, not the input rate:
Total cost = input_tokens × input_price + (thinking_tokens + text_tokens) × output_price
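Example with claude-opus-4-6, assuming illustrative prices of $15/MTok input and $75/MTok output (prices differ by model and change over time, so check the current pricing page):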
def calculate_cost(
    input_tokens: int,
    thinking_tokens: int,
    output_tokens: int,
    model: str = "claude-opus-4-6"
) -> dict:
    # USD per million tokens; illustrative values, verify against current pricing
    PRICES = {
        "claude-opus-4-6": {"in": 15.0, "out": 75.0},
        "claude-sonnet-4-6": {"in": 3.0, "out": 15.0},
    }
    p = PRICES[model]
    in_cost = (input_tokens / 1e6) * p["in"]
    think_cost = (thinking_tokens / 1e6) * p["out"]
    out_cost = (output_tokens / 1e6) * p["out"]
    total = in_cost + think_cost + out_cost
    return {
        "input_cost": round(in_cost, 6),
        "thinking_cost": round(think_cost, 6),
        "output_cost": round(out_cost, 6),
        "total_cost": round(total, 6),
        "thinking_share": f"{think_cost / total * 100:.1f}%"
    }
# A request with 500 input tokens, 8000 thinking tokens, 500 output tokens
print(calculate_cost(500, 8000, 500))
# {'input_cost': 0.0075, 'thinking_cost': 0.6, 'output_cost': 0.0375,
#  'total_cost': 0.645, 'thinking_share': '93.0%'}
When the prompt and the final answer are short, as in the example above, thinking tokens dominate the bill (93% of the total here). Budget control is cost control.
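The cost formula can also be inverted: given a spending cap per request, how many thinking tokens are affordable? The sketch below assumes the same formula as above; max_thinking_tokens and its parameters are hypothetical names, and prices are passed in explicitly (USD per million tokens) rather than looked up.

```python
def max_thinking_tokens(cost_cap_usd: float, input_tokens: int, text_tokens: int,
                        in_price: float, out_price: float) -> int:
    """Largest thinking-token count that keeps a request at or under cost_cap_usd.

    Prices are in USD per million tokens. Returns 0 if the cap cannot even
    cover the input and the final answer.
    """
    # Money left once the input has been paid for.
    remaining = cost_cap_usd - (input_tokens / 1e6) * in_price
    # Thinking and text are both billed at the output rate.
    affordable_output = int(remaining / out_price * 1e6)
    return max(0, affordable_output - text_tokens)
```

A result below 1,024 means the cap is too tight for Extended Thinking at all, since that is the smallest budget the API accepts; in that case the request is better sent without thinking.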
Tiered complexity routing
def adaptive_solve(client, problem: str, complexity: str = "auto") -> str:
    """Route to the right model and thinking budget based on complexity."""
    if complexity == "auto":
        # Quick probe with Haiku to estimate complexity
        probe = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"Rate complexity (simple/medium/complex), one word:\n{problem}"
            }]
        )
        # Normalize the label: the probe may add capitals or a trailing period
        complexity = probe.content[0].text.strip().lower().rstrip(".")
        if complexity not in ("simple", "medium", "complex"):
            complexity = "medium"
    configs = {
        "simple": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 1024,
            "thinking": None
        },
        "medium": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 8000,
            "thinking": {"type": "enabled", "budget_tokens": 5000}
        },
        "complex": {
            "model": "claude-opus-4-6",
            "max_tokens": 20000,
            "thinking": {"type": "enabled", "budget_tokens": 15000}
        }
    }
    cfg = configs[complexity]
    kwargs = {
        "model": cfg["model"],
        "max_tokens": cfg["max_tokens"],
        "messages": [{"role": "user", "content": problem}]
    }
    if cfg["thinking"]:
        kwargs["thinking"] = cfg["thinking"]
    response = client.messages.create(**kwargs)
    return next((b.text for b in response.content if b.type == "text"), "")
15.5 Working with Thinking Content
Accessing thinking blocks
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "What's the optimal strategy in this scenario?"}]
)
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
        print(f"(Signature: {block.signature[:20]}...)")
    elif block.type == "text":
        print("=== FINAL ANSWER ===")
        print(block.text)
Multi-turn conversations preserving thinking context
def multi_turn_thinking(client, initial_problem: str) -> None:
    """Maintain thinking context across conversation turns."""
    messages = [{"role": "user", "content": initial_problem}]
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        messages=messages
    )
    # Preserve ALL content blocks (including thinking) in history
    messages.append({
        "role": "assistant",
        "content": response.content  # Include thinking blocks
    })
    # Follow-up question
    messages.append({
        "role": "user",
        "content": "What edge cases should we consider based on your analysis?"
    })
    response2 = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=messages
    )
    for block in response2.content:
        if block.type == "text":
            print(block.text)
Key rule: When including a previous assistant turn with thinking blocks in the conversation history, the thinking blocks must be preserved exactly as returned. They cannot be modified, because the API uses their signature field to verify that they are unaltered.
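If history is stored or post-processed between turns (for example, serialized to JSON and reloaded), a quick check can confirm that the thinking blocks survived untouched. The sketch below works on plain dicts, which assumes the SDK's content blocks were converted with something like block.model_dump(); thinking_blocks_intact is a hypothetical helper name.

```python
THINKING_TYPES = ("thinking", "redacted_thinking")


def thinking_blocks_intact(original: list[dict], replayed: list[dict]) -> bool:
    """True if the thinking blocks in `replayed` match `original` exactly, in order."""
    def thinking_only(blocks: list[dict]) -> list[dict]:
        # Keep only the blocks that carry reasoning; text and tool blocks are ignored.
        return [b for b in blocks if b.get("type") in THINKING_TYPES]

    return thinking_only(original) == thinking_only(replayed)
```

The comparison covers every field of each thinking block, including signature, so trimming the text, dropping a block, or reordering blocks all cause it to return False.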
Streaming with thinking
with client.messages.stream(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Solve this complex problem..."}]
) as stream:
    current_type = None
    for event in stream:
        if event.type == "content_block_start":
            current_type = event.content_block.type
            if current_type == "thinking":
                print("\n[Thinking...]\n", flush=True)
            elif current_type == "text":
                print("\n[Answer]\n", flush=True)
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
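When the streamed output needs to be stored rather than printed, the same event handling can accumulate thinking and answer text separately. This sketch relies only on the event fields used in the loop above; collect_stream and fake_delta are hypothetical helpers, and fake_delta exists purely so the logic can be exercised without a network call.

```python
from types import SimpleNamespace


def collect_stream(events) -> tuple[str, str]:
    """Accumulate (thinking, answer) text from an iterable of stream events."""
    thinking_parts, text_parts = [], []
    for event in events:
        if event.type != "content_block_delta":
            continue  # start/stop events carry no text
        if event.delta.type == "thinking_delta":
            thinking_parts.append(event.delta.thinking)
        elif event.delta.type == "text_delta":
            text_parts.append(event.delta.text)
        # Other delta types (such as signature deltas) are ignored here.
    return "".join(thinking_parts), "".join(text_parts)


def fake_delta(delta_type: str, **fields) -> SimpleNamespace:
    """Build a stand-in for a content_block_delta event (test helper, not an SDK type)."""
    return SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type=delta_type, **fields),
    )
```

Inside a real with client.messages.stream(...) as stream: block, the call would be thinking, answer = collect_stream(stream).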
15.6 Extended Thinking with Tool Use
Extended Thinking and tool calling work together, enabling the model to reason about which tools to call and how to interpret their results:
import json
import anthropic
client = anthropic.Anthropic()
tools = [{
    "name": "query_database",
    "description": "Query a database and return results.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "SQL query to execute"}
        },
        "required": ["sql"]
    }
}]
def research_agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=16000,
            thinking={"type": "enabled", "budget_tokens": 8000},
            tools=tools,
            messages=messages
        )
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses or response.stop_reason == "end_turn":
            return next((b.text for b in response.content if b.type == "text"), "")
        # Keep the full assistant turn, thinking blocks included, in the history
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for tu in tool_uses:
            # Simulate database query
            result = {"rows": [{"count": 42}], "query": tu.input["sql"]}
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": json.dumps(result)
            })
        messages.append({"role": "user", "content": tool_results})
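The agent above returns a canned result for every query. One way to replace the simulation is to run the SQL against a real connection and report failures back to the model instead of raising. The sketch below uses the standard-library sqlite3 module as a stand-in for whatever database is actually in use; run_query_tool is a hypothetical helper, and the result shape mirrors the tool_result blocks built in the loop above.

```python
import json
import sqlite3


def run_query_tool(conn: sqlite3.Connection, tool_use_id: str, sql: str) -> dict:
    """Execute one query and wrap the outcome as a tool_result content block."""
    try:
        cursor = conn.execute(sql)
        # description is None for statements that return no rows
        columns = [d[0] for d in cursor.description] if cursor.description else []
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": json.dumps({"rows": rows}),
        }
    except sqlite3.Error as exc:
        # Report the failure to the model so it can correct the query
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"SQL error: {exc}",
            "is_error": True,
        }
```

In the loop this would replace the simulated result with tool_results.append(run_query_tool(conn, tu.id, tu.input["sql"])). Model-written SQL should only ever run against a read-only or sandboxed connection.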
15.7 Constraints and Common Errors
Constraint 1: temperature must be 1
Extended Thinking requires the default temperature of 1. Setting any other value raises an error.
# WRONG
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    temperature=0.5,  # Not allowed with thinking enabled
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
)
# Raises anthropic.BadRequestError (HTTP 400): temperature must be 1 when thinking is enabled
Constraint 2: no prefill
# WRONG: cannot use assistant prefill with extended thinking
messages = [
    {"role": "user", "content": "Solve this problem."},
    {"role": "assistant", "content": "The answer is"}  # Not allowed
]
# CORRECT: no prefill
messages = [{"role": "user", "content": "Solve this problem."}]
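Both constraints so far can be caught locally before a request leaves the process. The following is a minimal sketch, assuming the request is assembled as a plain dict of keyword arguments; thinking_conflicts is a hypothetical helper name, and it treats any conversation that ends on an assistant message as a prefill.

```python
def thinking_conflicts(request: dict) -> list[str]:
    """List settings in `request` that conflict with enabled extended thinking."""
    thinking = request.get("thinking") or {}
    if thinking.get("type") != "enabled":
        return []  # constraints only apply when thinking is on
    problems = []
    # Constraint 1: temperature must be left at its default of 1
    if request.get("temperature", 1) != 1:
        problems.append("temperature must be 1 when thinking is enabled")
    # Constraint 2: the conversation must not end with an assistant prefill
    messages = request.get("messages") or []
    if messages and messages[-1].get("role") == "assistant":
        problems.append("assistant prefill is not allowed when thinking is enabled")
    return problems
```

A caller might raise or log when the returned list is non-empty, then pass the same dict on with client.messages.create(**request).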
Constraint 3: thinking blocks must precede text blocks
When appending an assistant turn to the message history, thinking blocks must come before text blocks, exactly as the API returned them.
def validate_content_order(blocks: list) -> bool:
    """Check that thinking blocks appear before text blocks."""
    seen_text = False
    for block in blocks:
        # Accept both SDK objects and plain dicts loaded from stored history
        block_type = block.get("type") if isinstance(block, dict) else getattr(block, "type", None)
        if block_type == "text":
            seen_text = True
        elif block_type == "thinking" and seen_text:
            return False  # thinking after text: invalid
    return True
15.8 Benchmarking Thinking Quality
import time
def benchmark(problems: list[dict], budgets: tuple[int, ...] = (0, 2000, 8000, 20000)) -> dict:
    """
    Measure accuracy and cost at different thinking budgets.
    problems: list of {"problem": str, "expected": str}
    A budget of 0 means thinking is disabled for that run.
    """
    results = {}
    for budget in budgets:
        correct, total_cost, total_time = 0, 0.0, 0.0
        for item in problems:
            t0 = time.time()
            kwargs = {
                "model": "claude-opus-4-6",
                "max_tokens": (budget + 2048) if budget > 0 else 2048,
                "messages": [{"role": "user", "content": item["problem"]}]
            }
            if budget > 0:
                kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}
            r = client.messages.create(**kwargs)
            elapsed = time.time() - t0
            answer = next((b.text for b in r.content if b.type == "text"), "")
            # Substring match is a crude correctness check
            if item["expected"].lower() in answer.lower():
                correct += 1
            # usage.output_tokens already includes thinking tokens
            cost = (r.usage.input_tokens * 15 + r.usage.output_tokens * 75) / 1e6
            total_cost += cost
            total_time += elapsed
        n = len(problems)
        results[budget] = {
            "accuracy": f"{correct/n:.0%}",
            "avg_cost_usd": round(total_cost / n, 5),
            "avg_latency_s": round(total_time / n, 2)
        }
    return results
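Once benchmark has produced its per-budget results, a natural next step is to pick the cheapest budget that still meets an accuracy target. The sketch below assumes the exact result shape returned above (accuracy as a percentage string, avg_cost_usd as a float); cheapest_budget is a hypothetical helper name.

```python
from typing import Optional


def cheapest_budget(results: dict, min_accuracy: float) -> Optional[int]:
    """Return the budget with the lowest average cost whose accuracy meets the target.

    min_accuracy is a fraction (0.9 means 90%). Returns None if no budget qualifies.
    """
    qualifying = [
        (stats["avg_cost_usd"], budget)
        for budget, stats in results.items()
        # "accuracy" is stored as a string such as "90%"
        if float(stats["accuracy"].rstrip("%")) / 100 >= min_accuracy
    ]
    # Tuples sort by cost first, then by budget, so ties go to the smaller budget
    return min(qualifying)[1] if qualifying else None
```

For example, cheapest_budget(benchmark(problems), 0.9) returns the least expensive budget that reached 90% accuracy on the supplied problems, or None if none did.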
Summary
Extended Thinking is Claude's primary mechanism for tackling problems that require sustained multi-step reasoning. Key takeaways:
- max_tokens must exceed budget_tokens: getting this wrong is the most common configuration error
- Thinking tokens are billed at output prices and can account for most of a request's cost (93% in the worked example)
- Temperature must be 1 and prefill is not allowed when thinking is enabled
- Multi-turn conversations should preserve thinking blocks for context continuity
- Use tiered routing: reserve Extended Thinking for genuinely complex tasks; use Haiku without thinking for simple ones
- betas=["output-128k-2025-02-19"] extends the output ceiling to 128K tokens for ultra-deep analysis