Claude API Complete Guide — First Request to Production
Chapter 13: Claude API Complete Practical Guide — First Request to Production
Learning goals for this chapter: run your first API request within 5 minutes; understand the real effect of every Messages API parameter; implement streaming output in Python and TypeScript; master the complete Tool Use loop; use Prompt Caching to cut the input cost of repeated content by up to ~90%; write production-grade error handling with exponential backoff.
Quick Start: First Request in 5 Minutes
pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-api03-..."
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY automatically
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a code review expert. Be concise and technical.",
    messages=[
        {"role": "user", "content": "Review this code:\n```python\ndef divide(a, b):\n    return a / b\n```"}
    ]
)
print(message.content[0].text)
print(f"Tokens: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
Messages API Parameters — What Each One Actually Does
| Parameter | Common Values | Effect | If Omitted |
|---|---|---|---|
| model | claude-sonnet-4-6 | Determines intelligence level and cost | Required; the request errors without it |
| max_tokens | 1024–4096 | Caps output length; the reply is cut off if the cap is hit | Required; the request errors without it |
| temperature | 0 (code) / 0.7 (creative) | Output randomness; 0 gives the most repeatable output, though not a strict guarantee of identical replies | Defaults to 1.0, which is a poor fit for code tasks |
| system | Role definition and rules | Constrains model behavior throughout | Model replies without persona or constraints |
| top_p | Usually leave as-is | Alternative randomness control | Defaults to 1.0; adjust either this or temperature, not both |
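The `max_tokens` row can be checked in code: when the cap is hit, the response's `stop_reason` is `"max_tokens"` instead of `"end_turn"`. The following is a minimal sketch; the helper name `was_truncated` is hypothetical, and stand-in objects replace a real response so the example runs without an API key.

```python
from types import SimpleNamespace


def was_truncated(message) -> bool:
    """Return True when the reply stopped because it hit max_tokens."""
    # stop_reason is "end_turn" for a natural finish, "max_tokens" when capped.
    return getattr(message, "stop_reason", None) == "max_tokens"


# Stand-in objects exposing the same attribute a real response has.
capped = SimpleNamespace(stop_reason="max_tokens")
finished = SimpleNamespace(stop_reason="end_turn")

print(was_truncated(capped))    # True
print(was_truncated(finished))  # False
```

In a real call you would pass the object returned by `client.messages.create(...)` and, on `True`, either raise `max_tokens` or ask the model to continue.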
Temperature by task type: code generation → 0; analysis/explanation → 0.3; creative writing → 0.7–1.0. Avoid high temperature for code tasks: the added sampling variance makes output harder to reproduce and more likely to contain errors.
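One way to keep this guidance in a single place is a small lookup that feeds the request arguments. This is only a sketch: the task labels, `TEMPERATURE_BY_TASK`, `pick_temperature`, and `build_request` are hypothetical names, and the values simply mirror the guidance above.

```python
# Hypothetical task labels; values follow the per-task guidance above.
TEMPERATURE_BY_TASK = {
    "code": 0.0,
    "analysis": 0.3,
    "creative": 0.7,
}


def pick_temperature(task: str, default: float = 0.3) -> float:
    """Look up a temperature for a task label, falling back to a middle value."""
    return TEMPERATURE_BY_TASK.get(task.lower(), default)


def build_request(task: str, prompt: str) -> dict:
    """Assemble keyword arguments suitable for client.messages.create(**kwargs)."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "temperature": pick_temperature(task),
        "messages": [{"role": "user", "content": prompt}],
    }


kwargs = build_request("code", "Write a binary search in Python")
print(kwargs["temperature"])
```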
Streaming Output
import anthropic
client = anthropic.Anthropic()
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Write a complete FastAPI CRUD endpoint"}]
) as stream:
    # Print each text fragment as soon as it arrives
    for text in stream.text_stream:
        print(text, end="", flush=True)
    # The accumulated message carries the final token usage
    final = stream.get_final_message()
    print(f"\n{final.usage.input_tokens} in, {final.usage.output_tokens} out")
// src/app/api/generate/route.ts
import Anthropic from "@anthropic-ai/sdk";
import { NextRequest } from "next/server";
const client = new Anthropic();
export async function POST(req: NextRequest) {
  const { message } = await req.json();
  const stream = await client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages: [{ role: "user", content: message }],
  });
  const readableStream = new ReadableStream({
    async start(controller) {
      // Forward only text deltas to the HTTP response
      for await (const chunk of stream) {
        if (
          chunk.type === "content_block_delta" &&
          chunk.delta.type === "text_delta"
        ) {
          controller.enqueue(new TextEncoder().encode(chunk.delta.text));
        }
      }
      controller.close();
    },
  });
  return new Response(readableStream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
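The route above filters raw stream events by type before forwarding text. The same filter can be sketched in Python over stand-in event objects, which makes the logic easy to check offline. The event and delta type names come from the code above; `iter_text` and `fake_events` are hypothetical and only imitate the shape of real events.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_text(events: Iterable) -> Iterator[str]:
    """Yield only the text fragments from a sequence of stream events."""
    for event in events:
        if event.type != "content_block_delta":
            continue  # skip message_start, message_stop, and similar events
        if event.delta.type == "text_delta":
            yield event.delta.text


# Stand-in events shaped like the ones the route above inspects.
fake_events = [
    SimpleNamespace(type="message_start"),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="Hello, ")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="input_json_delta", partial_json="{")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="world")),
    SimpleNamespace(type="message_stop"),
]

print("".join(iter_text(fake_events)))  # Hello, world
```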
Tool Use: Complete Working Example
Tool Use lets Claude call functions you define — the foundation of any Agent. Flow: you define tool schemas → Claude decides which tools to call → you execute → Claude uses results to give a final reply.
import anthropic
import json
client = anthropic.Anthropic()
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    },
    {
        "name": "query_database",
        "description": "Query order information from the database",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "status": {"type": "string", "enum": ["pending", "completed", "cancelled"]}
            }
        }
    }
]
def handle_tool_call(tool_name: str, tool_input: dict) -> dict:
    # Stub implementations; replace with real lookups
    if tool_name == "get_weather":
        return {"temperature": 25, "condition": "Sunny", "city": tool_input["city"]}
    elif tool_name == "query_database":
        # order_id is optional in the schema, so .get() avoids a KeyError
        return {"order_id": tool_input.get("order_id"), "status": "completed", "amount": 199.00}
    return {}
def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )
        if response.stop_reason == "tool_use":
            # Process ALL tool_use blocks: Claude may call multiple tools at once
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = handle_tool_call(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result)
                    })
            # Echo the assistant turn, then send every result in one user turn
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        else:
            # Join all text blocks; the first block is not guaranteed to be text
            return "".join(b.text for b in response.content if b.type == "text")
print(chat_with_tools("What's the weather in Tokyo? Also check order ORD-12345."))
Claude may call multiple tools in one response: a single response can contain multiple tool_use blocks. The code above iterates over all of them, which is required. If you handle only the first block, the remaining calls are left without a matching tool_result and the next request is rejected.
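The dispatch step can also report failures back to Claude instead of raising. Below is a sketch of that variation, not the chapter's own handler: `TOOL_HANDLERS` and `build_tool_results` are hypothetical names, the blocks are stand-ins shaped like `response.content`, and `is_error` is the tool_result field used to flag a failed call.

```python
import json
from types import SimpleNamespace


def get_weather(city: str, unit: str = "celsius") -> dict:
    # Placeholder data, like the stub handler in the example above.
    return {"city": city, "temperature": 25, "unit": unit}


# Hypothetical registry mapping tool names to Python callables.
TOOL_HANDLERS = {"get_weather": get_weather}


def build_tool_results(content_blocks) -> list:
    """Turn every tool_use block into a matching tool_result block."""
    results = []
    for block in content_blocks:
        if block.type != "tool_use":
            continue  # text blocks can sit alongside tool calls
        result = {"type": "tool_result", "tool_use_id": block.id}
        try:
            handler = TOOL_HANDLERS[block.name]
            result["content"] = json.dumps(handler(**block.input))
        except Exception as exc:  # unknown tool or bad arguments
            result["content"] = f"{type(exc).__name__}: {exc}"
            result["is_error"] = True
        results.append(result)
    return results


# Stand-in blocks shaped like response.content.
blocks = [
    SimpleNamespace(type="text", text="Let me check."),
    SimpleNamespace(type="tool_use", id="toolu_1", name="get_weather",
                    input={"city": "Tokyo"}),
    SimpleNamespace(type="tool_use", id="toolu_2", name="missing_tool", input={}),
]

for item in build_tool_results(blocks):
    print(item)
```

Returning an error result keeps the conversation valid (every tool_use id still gets a reply) and gives the model a chance to retry with different arguments.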
Prompt Caching — Cut Costs by 90%
When you include the same large content in every request (system prompts, reference docs, codebase context), Prompt Caching stores that content server-side. Cache hits are billed at roughly 10% of normal input token price.
import anthropic
client = anthropic.Anthropic()
long_codebase_context = "...(thousands of lines of project files)..."
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code review expert. Only flag real bugs.",
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": long_codebase_context,
            "cache_control": {"type": "ephemeral"}  # caches the prefix up to and including this block
        }
    ],
    messages=[{"role": "user", "content": "Review the latest commit to auth.py"}]
)
usage = response.usage
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")  # cache write: billed at about 1.25x the normal input rate
print(f"Cache read tokens: {usage.cache_read_input_tokens}")  # ~90% cheaper on hits
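Real savings example: a 5,000-token system prompt at Sonnet pricing ($3/MTok input) costs $0.015 per request uncached and about $0.0015 on a cache hit, so each hit saves ~$0.0135. At 100 requests/day, that's $1.35/day saved, roughly $40/month. The request that writes the cache pays a premium rather than a discount, and the cache TTL is ~5 minutes, so high-frequency usage benefits most.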
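The arithmetic behind that example can be reproduced in a few lines. This sketch assumes every request is a cache hit, a 30-day month, and a 90% discount on cached input; it ignores the cost of the request that writes the cache. The function name `cache_savings` is hypothetical.

```python
def cache_savings(prompt_tokens: int, input_price_per_mtok: float,
                  hit_discount: float = 0.90) -> float:
    """Dollars saved on one cache hit versus sending the prompt uncached."""
    full_cost = prompt_tokens / 1_000_000 * input_price_per_mtok
    return full_cost * hit_discount


per_hit = cache_savings(5_000, 3.00)  # Sonnet input price used in this chapter
per_day = per_hit * 100               # 100 requests per day
per_month = per_day * 30              # assumed 30-day month

print(f"per hit:   ${per_hit:.4f}")    # $0.0135
print(f"per day:   ${per_day:.2f}")    # $1.35
print(f"per month: ${per_month:.2f}")  # $40.50
```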
Production Error Handling with Retry
import anthropic
import time
from anthropic import APIStatusError, APIConnectionError, RateLimitError
client = anthropic.Anthropic()
def call_claude_with_retry(messages: list, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages
            )
            return response.content[0].text
        except RateLimitError:
            # Caught before APIStatusError because RateLimitError is a subclass of it
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # exponential backoff: 1s, 2s, 4s, ... (no sleep after the last attempt)
                time.sleep(wait_time)
            else:
                raise
        except APIConnectionError:
            # Network-level failure: short fixed pause, then retry
            if attempt < max_retries - 1:
                time.sleep(1)
            else:
                raise
        except APIStatusError as e:
            if e.status_code >= 500:
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                    continue
            raise  # other 4xx client errors (and the final 5xx): don't retry, fix the request
    raise RuntimeError(f"Failed after {max_retries} retries")
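A common refinement to fixed exponential delays is to add random jitter, so that many clients that failed at the same moment do not all retry in lockstep. The sketch below shows a "full jitter" delay calculation as a variation on the code above, not the chapter's own method. `backoff_delay`, its parameters, and the 30-second cap are assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0, ceiling)


# Seeded generator so the demo is repeatable; production code can omit rng.
demo_rng = random.Random(42)
for attempt in range(4):
    delay = backoff_delay(attempt, rng=demo_rng)
    ceiling = min(30.0, 2 ** attempt)
    print(f"attempt {attempt}: ceiling {ceiling:.0f}s, chose {delay:.2f}s")
```

To use it, replace `time.sleep(2 ** attempt)` with `time.sleep(backoff_delay(attempt))`. The SDK client also accepts a `max_retries` argument for its built-in retries, which may be enough for simpler cases.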
Model Selection and Pricing (2025)
| Model | Input Price | Output Price | Best For |
|---|---|---|---|
| claude-haiku-4-5 | $1/MTok | $5/MTok | Classification, quick Q&A, bulk processing, intent detection |
| claude-sonnet-4-6 | $3/MTok | $15/MTok | Complex code, deep analysis, primary workhorse model |
| claude-opus-4-6 | $5/MTok | $25/MTok | Hardest reasoning, architecture design, highest-accuracy tasks |
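Per-request cost follows directly from the table: multiply each token count by its per-million price. Here is a small sketch using the Sonnet row ($3 in, $15 out); the function name `estimate_cost` and the token counts are hypothetical, and in practice the counts would come from `message.usage`.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Estimated request cost in dollars; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Sonnet row from the table: $3/MTok input, $15/MTok output.
cost = estimate_cost(input_tokens=2_000, output_tokens=500,
                     input_price=3.0, output_price=15.0)
print(f"${cost:.4f}")  # $0.0135
```

Output tokens cost five times as much as input tokens in every row, so capping `max_tokens` sensibly often matters more than trimming the prompt.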
Chapter Key Points
- model and max_tokens are required. temperature defaults to 1.0, which is a poor fit for code generation; set it to 0 for the most repeatable output.
- Streaming key events: listen for content_block_delta events whose delta type is text_delta. Python uses the .stream() context manager; TypeScript uses for await.
- Tool Use requires a loop: Claude may call tools multiple times before giving a final reply. One response can contain multiple tool_use blocks, so always iterate over all of them.
- Prompt Caching only saves money on large repeated content. Add cache_control only to content that is identical across requests. TTL is ~5 minutes; low-frequency use gets little benefit.
- Retry logic must distinguish error types. RateLimitError and 5xx → exponential backoff retry. Other 4xx client errors → don't retry, fix the request.