Chapter 13: Claude API Complete Practical Guide — First Request to Production

Learning goals for this chapter: run your first API request within 5 minutes; understand the real effect of every Messages API parameter; implement streaming output in Python and TypeScript; master the complete Tool Use loop; use Prompt Caching to cut the cost of repeated input tokens by up to about 90%; write production-grade error handling with exponential backoff.

Quick Start: First Request in 5 Minutes

pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-api03-..."

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY automatically

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a code review expert. Be concise and technical.",
    messages=[
        {"role": "user", "content": "Review this code:\n```python\ndef divide(a, b):\n    return a / b\n```"}
    ]
)

print(message.content[0].text)
print(f"Tokens: {message.usage.input_tokens} in, {message.usage.output_tokens} out")

Messages API Parameters — What Each One Actually Does

Parameter | Common values | Effect | If omitted
----------|---------------|--------|-----------
model | claude-sonnet-4-6 | Determines capability level and cost | Required; the request errors without it
max_tokens | 1024–4096 | Caps output length; output is cut off (stop_reason "max_tokens") if the cap is hit | Required; the request errors without it
temperature | 0 (code) / 0.7 (creative) | Output randomness; 0 is the most consistent setting but is not guaranteed to be fully deterministic | Defaults to 1.0, which is too variable for most code tasks
system | Role definition and rules | Constrains model behavior throughout the conversation | Model replies without persona or constraints
top_p | Usually leave as-is | Alternative randomness control (nucleus sampling) | Left to the API default; adjust either this or temperature, not both
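
Because max_tokens cuts output off rather than failing the request, it can help to check stop_reason before trusting a reply. The sketch below assumes a response object with the stop_reason and content attributes used elsewhere in this chapter; the helper name text_or_raise and the stand-in object are illustrative, not part of the SDK.

```python
from types import SimpleNamespace

def text_or_raise(message) -> str:
    """Return the concatenated text of a response, refusing truncated output."""
    if message.stop_reason == "max_tokens":
        raise RuntimeError("response hit max_tokens and was cut off")
    # Join every text block instead of assuming content[0] is text
    return "".join(b.text for b in message.content if b.type == "text")

# Stand-in with the same attributes as an SDK response (illustration only)
fake = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="ok")],
)
print(text_or_raise(fake))  # ok
```

With a real client, the same helper would be called on the return value of client.messages.create(...).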

Temperature by task type: code generation → 0; analysis/explanation → 0.3; creative writing → 0.7–1.0. Avoid high temperature for code tasks: output varies more from run to run and is more likely to contain errors.
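
One way to apply this guidance consistently is a small lookup that builds the request arguments. This is a minimal sketch: the task labels, the TEMPERATURE_BY_TASK table, and the request_params helper are hypothetical names chosen for illustration, and 0.7 is picked from the low end of the creative range.

```python
# Hypothetical mapping that mirrors the task-type guidance above
TEMPERATURE_BY_TASK = {
    "code": 0.0,
    "analysis": 0.3,
    "creative": 0.7,
}

def request_params(task: str, prompt: str,
                   model: str = "claude-sonnet-4-6",
                   max_tokens: int = 1024) -> dict:
    """Build keyword arguments for client.messages.create."""
    if task not in TEMPERATURE_BY_TASK:
        raise ValueError(f"unknown task type: {task!r}")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": TEMPERATURE_BY_TASK[task],
        "messages": [{"role": "user", "content": prompt}],
    }

params = request_params("code", "Write a function that reverses a string")
print(params["temperature"])  # 0.0
```

The resulting dict can be passed as client.messages.create(**params).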

Streaming Output

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Write a complete FastAPI CRUD endpoint"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

final = stream.get_final_message()
print(f"\n{final.usage.input_tokens} in, {final.usage.output_tokens} out")

// src/app/api/generate/route.ts
import Anthropic from "@anthropic-ai/sdk";
import { NextRequest } from "next/server";

const client = new Anthropic();

export async function POST(req: NextRequest) {
  const { message } = await req.json();

  const stream = await client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages: [{ role: "user", content: message }],
  });

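  const encoder = new TextEncoder(); // reuse one encoder for every chunk

  const readableStream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          // Forward only text deltas; other event types carry metadata
          if (
            chunk.type === "content_block_delta" &&
            chunk.delta.type === "text_delta"
          ) {
            controller.enqueue(encoder.encode(chunk.delta.text));
          }
        }
        controller.close();
      } catch (err) {
        // Surface upstream API failures to the client instead of hanging
        controller.error(err);
      }
    },
  });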

  return new Response(readableStream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
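
The TypeScript handler filters raw events by type. The same filter can be sketched in Python as a generator over any iterable of events. The function name iter_text_deltas and the stand-in events are illustrative assumptions; only the event and delta type strings come from the code above.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator

def iter_text_deltas(events: Iterable) -> Iterator[str]:
    """Yield only the text carried by content_block_delta / text_delta events."""
    for event in events:
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            yield event.delta.text

# Stand-in events shaped like the streamed ones (illustration only)
fake_events = [
    SimpleNamespace(type="message_start"),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="Hel")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="lo")),
    SimpleNamespace(type="message_stop"),
]
print("".join(iter_text_deltas(fake_events)))  # Hello
```

For plain text output, the text_stream iterator shown in the Python example is simpler; a filter like this is mainly useful when other event types also need handling.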

Tool Use: Complete Working Example

Tool Use lets Claude call functions you define — the foundation of any Agent. Flow: you define tool schemas → Claude decides which tools to call → you execute → Claude uses results to give a final reply.

import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    },
    {
        "name": "query_database",
        "description": "Query order information from the database",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "status": {"type": "string", "enum": ["pending", "completed", "cancelled"]}
            }
        }
    }
]

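def handle_tool_call(tool_name: str, tool_input: dict) -> dict:
    # Stub implementations; replace with real lookups
    if tool_name == "get_weather":
        return {"temperature": 25, "condition": "Sunny", "city": tool_input["city"]}
    elif tool_name == "query_database":
        # order_id is not marked required in the schema, so don't assume it is present
        return {"order_id": tool_input.get("order_id"), "status": "completed", "amount": 199.00}
    # Tell the model the tool is unknown instead of returning an empty result
    return {"error": f"Unknown tool: {tool_name}"}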

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

        if response.stop_reason == "tool_use":
            # Process ALL tool_use blocks — Claude may call multiple tools at once
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = handle_tool_call(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result)
                    })

            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
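        else:
            # Final reply: join text blocks instead of assuming content[0] is text
            return "".join(block.text for block in response.content if block.type == "text")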

print(chat_with_tools("What's the weather in Tokyo? Also check order ORD-12345."))

Claude may call multiple tools in one response: a single response can contain multiple tool_use blocks, and the code above iterates over all of them. Processing only the first block fails in multi-tool scenarios, because the API expects a tool_result for every tool_use block in the follow-up user message.
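
The result-building step can be pulled out into a helper that answers every tool_use block and reports handler failures back to the model instead of crashing the loop. This is a sketch under assumptions: build_tool_results and demo_handler are hypothetical names, and the stand-in blocks only imitate the shape of SDK content blocks.

```python
import json
from types import SimpleNamespace
from typing import Callable

def build_tool_results(content, handler: Callable[[str, dict], dict]) -> list:
    """Return one tool_result dict per tool_use block, in order."""
    results = []
    for block in content:
        if block.type != "tool_use":
            continue  # skip text blocks that accompany the tool calls
        try:
            payload = json.dumps(handler(block.name, block.input))
            is_error = False
        except Exception as exc:  # pass the failure to the model
            payload = f"{type(exc).__name__}: {exc}"
            is_error = True
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": payload,
            "is_error": is_error,
        })
    return results

# Stand-in blocks shaped like SDK content blocks (illustrative data only)
content = [
    SimpleNamespace(type="text", text="Checking both."),
    SimpleNamespace(type="tool_use", id="toolu_a", name="get_weather",
                    input={"city": "Tokyo"}),
    SimpleNamespace(type="tool_use", id="toolu_b", name="query_database",
                    input={}),
]

def demo_handler(name: str, tool_input: dict) -> dict:
    if name == "get_weather":
        return {"city": tool_input["city"], "temperature": 25}
    raise KeyError("order_id")  # simulate a failing tool

results = build_tool_results(content, demo_handler)
print([r["tool_use_id"] for r in results])  # ['toolu_a', 'toolu_b']
print(results[1]["is_error"])  # True
```

Inside chat_with_tools, the returned list would be appended as the content of the next user message, exactly as tool_results is today.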

Prompt Caching — Cut Costs by 90%

When you include the same large content in every request (system prompts, reference docs, codebase context), Prompt Caching stores that content server-side. Cache hits are billed at roughly 10% of normal input token price.

import anthropic

client = anthropic.Anthropic()

long_codebase_context = "...(thousands of lines of project files)..."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code review expert. Only flag real bugs."
        },
        {
            "type": "text",
            "text": long_codebase_context,
            # One breakpoint on the last block caches the whole prefix, including the block above
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Review the latest commit to auth.py"}]
)

usage = response.usage
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")  # cache write on a miss: billed above the base input rate (about 1.25x for the 5-minute cache)
print(f"Cache read tokens: {usage.cache_read_input_tokens}")          # ~90% cheaper on hits

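Real savings example: a 5,000-token system prompt at Sonnet pricing ($3/MTok input) costs $0.015 per request uncached and about $0.0015 on a cache hit, so each hit saves ~$0.014. At 100 requests/day, that's up to $1.35/day saved, about $40/month, assuming nearly every request hits the cache. The default cache TTL is 5 minutes and is refreshed on each hit, so high-frequency usage benefits most; requests spaced further apart pay the cache-write price each time.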
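
The arithmetic behind this example can be checked with a few lines. The sketch below assumes cache reads cost 10% of the base input price, as stated above, and that cache writes cost 1.25x the base price; the write multiplier is an assumption to verify against current pricing, and cache_savings is a hypothetical helper name.

```python
def cache_savings(prompt_tokens: int, price_per_mtok: float,
                  hits: int, writes: int,
                  read_multiplier: float = 0.1,
                  write_multiplier: float = 1.25) -> float:
    """Dollars saved versus sending the same prompt uncached every time."""
    base = prompt_tokens / 1_000_000 * price_per_mtok  # uncached cost per request
    uncached = (hits + writes) * base
    cached = hits * base * read_multiplier + writes * base * write_multiplier
    return uncached - cached

# 100 requests/day, every one a cache hit (the best case from the text)
print(round(cache_savings(5_000, 3.0, hits=100, writes=0), 4))  # 1.35

# Same day, but one request has to write the cache first
print(round(cache_savings(5_000, 3.0, hits=99, writes=1), 4))   # 1.3328
```

When most requests arrive more than 5 minutes apart, writes dominate and the result turns negative, which is why low-frequency workloads see little benefit.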

Production Error Handling with Retry

import anthropic
import time
from anthropic import APIStatusError, APIConnectionError, RateLimitError

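client = anthropic.Anthropic(max_retries=0)  # the SDK retries on its own by default; disable that so this loop is the only retry layer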

def call_claude_with_retry(messages: list, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages
            )
            return response.content[0].text

        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # exponential backoff: 1s, then 2s (with max_retries=3)
                time.sleep(wait_time)
            else:
                raise

        except APIConnectionError:
            if attempt < max_retries - 1:
                time.sleep(1)
            else:
                raise

        except APIStatusError as e:
            if e.status_code >= 500:
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                    continue
            raise  # 4xx client errors: don't retry, fix the request

    raise RuntimeError(f"Failed after {max_retries} retries")
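
The retry function above waits a fixed 2 ** attempt seconds. A common refinement that is not part of the original code is to add random jitter and a cap, so many clients that were rate-limited together do not all retry at the same instant. The sketch below shows only the delay schedule; backoff_delays, base, and cap are hypothetical names and defaults.

```python
import random
from typing import Optional

def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> list:
    """Sleep durations between attempts, using capped exponential backoff with full jitter."""
    if rng is None:
        rng = random.Random()
    delays = []
    # There is one sleep between each pair of attempts, so max_retries - 1 in total
    for attempt in range(max_retries - 1):
        ceiling = min(cap, base * 2 ** attempt)   # 1s, 2s, 4s, ... up to cap
        delays.append(rng.uniform(0, ceiling))    # full jitter: anywhere in [0, ceiling]
    return delays

delays = backoff_delays(3, rng=random.Random(0))  # seeded for a repeatable demo
print(len(delays))  # 2
```

In call_claude_with_retry, each time.sleep(2 ** attempt) would become a sleep on the matching entry of this schedule.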

Model Selection and Pricing (2025)

Model | Input price | Output price | Best for
------|-------------|--------------|---------
claude-haiku-4-5 | $1/MTok | $5/MTok | Classification, quick Q&A, bulk processing, intent detection
claude-sonnet-4-6 | $3/MTok | $15/MTok | Complex code, deep analysis, primary workhorse model
claude-opus-4-6 | $5/MTok | $25/MTok | Hardest reasoning, architecture design, highest-accuracy tasks

Prices are per million tokens (MTok).
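
Per-request cost follows directly from the usage counts printed in the quick start. The sketch below takes the prices as arguments rather than hard-coding them; request_cost is a hypothetical helper, and the example uses the Sonnet row ($3 input, $15 output per MTok).

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float,
                 output_price_per_mtok: float) -> float:
    """Dollar cost of one request, given token counts and per-MTok prices."""
    total = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return total / 1_000_000

# 2,000 input tokens and 500 output tokens at the Sonnet prices
cost = request_cost(2_000, 500, 3.0, 15.0)
print(f"${cost:.4f}")  # $0.0135
```

With a real response, the two token counts come from message.usage.input_tokens and message.usage.output_tokens.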

Chapter Key Points

  1. model and max_tokens are required. temperature defaults to 1.0, which is too variable for code generation; set it to 0 for the most consistent output.
  2. Streaming key events: listen for content_block_delta with text_delta type. Python uses the .stream() context manager; TypeScript uses for await.
  3. Tool Use requires a loop: Claude may call tools multiple times before giving a final reply. One response can contain multiple tool_use blocks — always iterate over all of them.
  4. Prompt Caching only saves money on large repeated content. Add cache_control only to content that is identical across requests. TTL is ~5 minutes; low-frequency use gets little benefit.
  5. Retry logic must distinguish error types. RateLimitError and 5xx → exponential backoff retry. 4xx client errors → don't retry, fix the request.