Chapter 3: API Quick Start: Authentication, Rate Limits, SDK Installation, and Your First Request

3.1 Obtaining an API Key

Before making your first Claude API call, you need an API key. The process:

Go to console.anthropic.com
Create an account and verify your email
Navigate to API Keys in the left sidebar
Click Create Key, give it a descriptive name (e.g., production-chatbot, dev-testing)
Copy the key immediately—it is shown only once

Secure API Key Management

An API key grants full access to your account, including spending your credit balance. Never:

Hardcode the key in source files
Commit files containing the key to any version control repository
Include the key in client-side code (browser JavaScript, mobile apps)

The correct approach is environment variables:

# Linux / macOS
export ANTHROPIC_API_KEY="sk-ant-api03-..."

# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-api03-..."

For local development, a .env file with python-dotenv is convenient:

# .env (never commit this file)
ANTHROPIC_API_KEY=sk-ant-api03-...

from dotenv import load_dotenv
load_dotenv()  # loads .env into environment before creating the client

For production, use a secrets management service: AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, or equivalent. The key should never appear in plaintext in any configuration file that might end up in a repository.

Key Format

Anthropic API keys begin with sk-ant-api03- followed by approximately 90 random characters. If you see a different prefix, you may be looking at a legacy key format or a key from a different service.
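
A fail-fast check at startup can surface a missing or mistyped key before the first request is sent. The following is a minimal sketch, not part of the SDK: `require_api_key` and `mask_key` are hypothetical helper names, and the loose `sk-ant-` prefix test is an assumption based on the key format described above.

```python
import os
from typing import Mapping, Optional

EXPECTED_PREFIX = "sk-ant-"  # assumption: loose check based on the format above


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return ANTHROPIC_API_KEY from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    if not key.startswith(EXPECTED_PREFIX):
        raise RuntimeError("ANTHROPIC_API_KEY does not look like an Anthropic key")
    return key


def mask_key(key: str) -> str:
    """Return a log-safe form showing only the first 10 and last 4 characters."""
    return f"{key[:10]}...{key[-4:]}" if len(key) > 20 else "***"
```

Logging `mask_key(key)` instead of the key itself keeps the secret out of log files while still letting you tell keys apart.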

3.2 Understanding Rate Limits

The API enforces rate limits along several independent dimensions. Exceeding any one of them returns an HTTP 429 error.

Rate Limit Dimensions

RPM (Requests Per Minute): Maximum number of API calls per minute
Input TPM (Input Tokens Per Minute): Maximum input tokens per minute
Output TPM (Output Tokens Per Minute): Maximum output tokens per minute
TPD (Tokens Per Day): Daily token limit (applies to some account tiers)

Approximate limits for claude-sonnet-4-6 by account tier (verify current values in the Anthropic documentation):

Tier     RPM     Input TPM    Output TPM
──────   ─────   ──────────   ──────────
Tier 1   50      40,000       8,000
Tier 2   1,000   80,000       16,000
Tier 3   2,000   160,000      32,000
Tier 4   4,000   400,000      80,000

Moving to a higher tier requires adding a payment method and meeting Anthropic's current advancement criteria. The requirements and timing change over time, so check the Console for what applies to your account.

Reading Rate Limit Headers

Every API response includes headers telling you your current consumption:

anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 847
anthropic-ratelimit-requests-reset: 2024-01-15T10:31:00Z
anthropic-ratelimit-tokens-limit: 80000
anthropic-ratelimit-tokens-remaining: 52340
anthropic-ratelimit-tokens-reset: 2024-01-15T10:30:30Z
retry-after: 30

These headers let you implement proactive throttling rather than relying purely on reactive retry logic. After a 429, the retry-after value gives the number of seconds to wait before trying again.
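
As one way to act on these headers, the sketch below computes how long to pause when the remaining token budget runs low. It works on any mapping of header names to values, so it makes no assumption about how you obtain them (the Python SDK's README describes a `with_raw_response` accessor; check it for your version). `seconds_until_reset` and the `min_remaining_tokens` threshold are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str],
                        now: Optional[datetime] = None,
                        min_remaining_tokens: int = 2000) -> float:
    """Return how long to pause before the next request (0.0 if no pause is needed)."""
    now = now or datetime.now(timezone.utc)
    remaining = headers.get("anthropic-ratelimit-tokens-remaining")
    reset = headers.get("anthropic-ratelimit-tokens-reset")
    if remaining is None or reset is None:
        return 0.0  # no information available, so do not throttle
    if int(remaining) >= min_remaining_tokens:
        return 0.0  # enough budget left for another request
    # Timestamps end in "Z"; fromisoformat() needs an explicit offset before Python 3.11
    reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
    return max(0.0, (reset_at - now).total_seconds())
```

A caller would sleep for the returned number of seconds before issuing the next request, which avoids spending a request on a near-certain 429.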

Exponential Backoff for 429 Errors

import time
import anthropic
from anthropic import RateLimitError

def call_with_retry(client: anthropic.Anthropic, max_retries: int = 5, **kwargs):
    """
    Wraps client.messages.create() with exponential backoff on rate limit errors.
    Wait sequence with the default max_retries=5: 1.0s, 2.1s, 4.2s, 8.3s
    (the fifth consecutive failure is re-raised without waiting).
    """
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Give up after max_retries attempts
            wait = (2 ** attempt) + (0.1 * attempt)
            print(f"Rate limit hit; retrying in {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
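
The `0.1 * attempt` term above is deterministic, so many clients that fail at the same moment will also retry at the same moment. A common variation, sometimes called "full jitter", draws the delay at random from the backoff interval and never waits less than the server's retry-after value. The sketch below shows only the delay calculation; `backoff_delay` and its parameters are hypothetical names, not SDK functions.

```python
import random
from typing import Optional


def backoff_delay(attempt: int,
                  retry_after: Optional[float] = None,
                  base: float = 1.0,
                  cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))   # exponential growth, capped
    delay = rng.uniform(0.0, ceiling)           # full jitter: anywhere in [0, ceiling]
    if retry_after is not None:
        delay = max(delay, retry_after)         # never retry sooner than the server asked
    return delay
```

In `call_with_retry`, the line computing `wait` could be swapped for a call to this function. Note also that the SDK client accepts a `max_retries` argument with its own built-in retry behavior, so check whether that already covers your case before layering another loop on top.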

Token Budget Management for High Concurrency

In concurrent workloads, TPM limits are often hit before RPM limits. A sliding-window token budget manager prevents wasted retries:

import threading
import time
from collections import deque

class TokenBudgetManager:
    """
    Sliding-window token budget manager.
    Prevents requests that would exceed the per-minute token limit.
    """

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.window: deque[tuple[float, int]] = deque()
        self._lock = threading.Lock()

    def _prune(self) -> int:
        cutoff = time.time() - 60.0
        while self.window and self.window[0][0] < cutoff:
            self.window.popleft()
        return sum(tokens for _, tokens in self.window)

    def can_proceed(self, estimated_tokens: int) -> bool:
        with self._lock:
            used = self._prune()
            return used + estimated_tokens <= self.limit

    def record(self, token_count: int):
        with self._lock:
            self.window.append((time.time(), token_count))

    def wait_for_budget(self, estimated_tokens: int, timeout: float = 120.0):
        start = time.time()
        while not self.can_proceed(estimated_tokens):
            if time.time() - start > timeout:
                raise TimeoutError("Timed out waiting for token budget")
            time.sleep(1.0)
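
One caveat with the class above: can_proceed() and record() are separate calls, so two threads can both pass the check before either records its usage. A possible variation, sketched here under the assumption that reserving the estimate up front is acceptable, performs the check and the bookkeeping under a single lock. `ReservingTokenBudget` and `try_reserve` are hypothetical names.

```python
import threading
import time
from collections import deque


class ReservingTokenBudget:
    """Sliding-window budget whose check and record happen under one lock."""

    def __init__(self, tokens_per_minute: int, window_seconds: float = 60.0):
        self.limit = tokens_per_minute
        self.window_seconds = window_seconds
        self._events: deque = deque()   # (monotonic timestamp, tokens) pairs
        self._used = 0                  # running total of tokens in the window
        self._lock = threading.Lock()

    def try_reserve(self, tokens: int) -> bool:
        """Atomically claim `tokens` from the window; return False if they do not fit."""
        with self._lock:
            cutoff = time.monotonic() - self.window_seconds
            # Drop entries that have aged out of the window
            while self._events and self._events[0][0] < cutoff:
                self._used -= self._events.popleft()[1]
            if self._used + tokens > self.limit:
                return False
            self._events.append((time.monotonic(), tokens))
            self._used += tokens
            return True
```

Using time.monotonic() rather than time.time() keeps the window stable if the system clock is adjusted. The trade-off is that the reservation uses an estimate, so the window may drift from the token counts the API actually reports.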

3.3 Installing the SDK

Python SDK

pip install anthropic

# With package managers
poetry add anthropic
uv add anthropic

Recommended version pinning in pyproject.toml:

[tool.poetry.dependencies]
python = "^3.9"
anthropic = "^0.34.0"   # for 0.x versions the caret allows patch updates only (>=0.34.0,<0.35.0)

The SDK's core dependencies are intentionally lightweight, chiefly httpx for HTTP, pydantic for data validation, and typing-extensions for backported type hints.

TypeScript / Node.js SDK

npm install @anthropic-ai/sdk
# or
yarn add @anthropic-ai/sdk
# or
pnpm add @anthropic-ai/sdk

The TypeScript SDK ships with full type definitions, so all request and response fields are typed and show up in IDE autocompletion. A typical package.json entry looks like this (for 0.x versions, the caret range accepts patch updates only):

{
  "dependencies": {
    "@anthropic-ai/sdk": "^0.26.0"
  }
}

Direct HTTP (No SDK)

For environments where neither Python nor Node.js is available, the REST API is callable with any HTTP client:

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'

Besides content-type: application/json, two Anthropic-specific headers are required on every request:

x-api-key: Your API key
anthropic-version: The API version string, currently 2023-06-01
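
For comparison, the same request can be assembled with Python's standard library alone. This is a sketch rather than a recommended client: `build_messages_request` is a hypothetical helper that only builds the request object, and the commented lines at the end show how it might be sent.

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"


def build_messages_request(api_key: str, prompt: str,
                           model: str = "claude-sonnet-4-6",
                           max_tokens: int = 1024) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


# Sending it requires network access and a valid key:
# with urllib.request.urlopen(build_messages_request(key, "Hello, Claude!")) as resp:
#     print(json.load(resp)["content"][0]["text"])
```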

3.4 Your First Request

Python

import anthropic

# Client reads ANTHROPIC_API_KEY from the environment automatically
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain quantum entanglement in one sentence."}
    ]
)

print(message.content[0].text)
# → Quantum entanglement is a phenomenon where two or more particles become
#   correlated such that measuring the state of one instantly determines
#   the state of the other, regardless of the distance between them.

Understanding the Response Object

print(message)
# Message(
#   id='msg_01XFDUDYJgAACzvnptvVoYEL',
#   type='message',
#   role='assistant',
#   content=[
#     TextBlock(text='Quantum entanglement is...', type='text')
#   ],
#   model='claude-sonnet-4-6',
#   stop_reason='end_turn',
#   stop_sequence=None,
#   usage=Usage(input_tokens=14, output_tokens=47)
# )

# Key field access patterns
text          = message.content[0].text           # response text
input_tokens  = message.usage.input_tokens         # tokens consumed by input
output_tokens = message.usage.output_tokens        # tokens consumed by output
stop_reason   = message.stop_reason               # e.g. 'end_turn', 'max_tokens', 'stop_sequence', 'tool_use'
model_used    = message.model                     # actual model version that served the request

stop_reason is important for production code:

end_turn: Model finished naturally
max_tokens: Response was cut off at your max_tokens limit; raise the limit or send a follow-up request asking the model to continue
stop_sequence: A stop sequence you specified was encountered
tool_use: The model is requesting a tool call (only when tools are supplied)
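
One way to make the stop_reason check hard to forget is to route all text extraction through a small helper. The sketch below uses a stand-in object with the same attribute names as the SDK's Message so that it runs offline; `extract_text` and `TruncatedResponseError` are hypothetical names, and raising on truncation is only one possible policy.

```python
from types import SimpleNamespace


class TruncatedResponseError(RuntimeError):
    """Raised when the response stopped at max_tokens (hypothetical helper type)."""


def extract_text(message) -> str:
    """Join all text blocks, refusing to silently return truncated output."""
    if message.stop_reason == "max_tokens":
        raise TruncatedResponseError("response was cut off at max_tokens")
    return "".join(block.text for block in message.content if block.type == "text")


# Stand-in with the same attribute names as the SDK's Message object
demo = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="Hello"),
             SimpleNamespace(type="text", text=", world")],
)
print(extract_text(demo))  # → Hello, world
```

Joining every text block, rather than reading content[0] only, also covers responses that contain more than one block.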

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads process.env.ANTHROPIC_API_KEY

async function main() {
  const message = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [
      { role: "user", content: "Explain quantum entanglement in one sentence." }
    ],
  });

  // TypeScript knows content[0] can be TextBlock or ToolUseBlock
  const block = message.content[0];
  if (block.type === "text") {
    console.log(block.text);
  }

  console.log(
    `Tokens: ${message.usage.input_tokens} input, ${message.usage.output_tokens} output`
  );
}

main();

3.5 Streaming Responses

For chat interfaces and other interactive use cases, streaming delivers tokens as they are generated rather than waiting for the complete response.

Python Streaming

import anthropic

client = anthropic.Anthropic()

# Recommended: use the context manager
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()  # newline after completion

    # Retrieve the final message (with full usage stats) while the stream is still open
    final = stream.get_final_message()

print(f"Usage: {final.usage.input_tokens} in / {final.usage.output_tokens} out")

For more granular event handling:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Explain TCP's three-way handshake."}]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
        elif event.type == "message_delta":
            # contains stop_reason and usage when the message completes
            pass
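
The event loop above can be factored into a function that accumulates the text and records why the message ended. The sketch below is illustrative: `collect_stream` is a hypothetical helper, and it is exercised with stand-in event objects that mirror the attribute names used in the loop (with stop_reason read from the message_delta event's delta), so it runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def collect_stream(events: Iterable) -> Tuple[str, Optional[str]]:
    """Accumulate text deltas and capture the final stop_reason."""
    parts = []
    stop_reason = None
    for event in events:
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            parts.append(event.delta.text)
        elif event.type == "message_delta":
            stop_reason = event.delta.stop_reason
    return "".join(parts), stop_reason


# Stand-in events shaped like the ones handled in the loop above
fake_events = [
    SimpleNamespace(type="message_start"),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="SYN, ")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="SYN-ACK, ACK")),
    SimpleNamespace(type="message_delta",
                    delta=SimpleNamespace(stop_reason="end_turn")),
]
print(collect_stream(fake_events))  # → ('SYN, SYN-ACK, ACK', 'end_turn')
```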

TypeScript Streaming

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function streamExample() {
  const stream = client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Write a haiku about the ocean." }],
  });

  for await (const chunk of stream) {
    if (
      chunk.type === "content_block_delta" &&
      chunk.delta.type === "text_delta"
    ) {
      process.stdout.write(chunk.delta.text);
    }
  }

  const final = await stream.finalMessage();
  console.log(`\nTokens: ${final.usage.input_tokens} in / ${final.usage.output_tokens} out`);
}

streamExample();

3.6 Adding a System Prompt

The system prompt sets Claude's behavior, persona, and constraints for the entire conversation. It is passed as a top-level system parameter, separate from the messages array.

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior backend engineer specializing in distributed systems.

When answering questions:
1. Always provide working code examples in Python
2. Explain the trade-offs of each approach
3. Call out common pitfalls explicitly
4. If you are uncertain about something, say so clearly

Format your responses with Markdown. Use code blocks with language identifiers."""

def ask_technical_question(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

answer = ask_technical_question("What are the trade-offs between optimistic and pessimistic locking?")
print(answer)
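
Because the system prompt lives outside the messages array, multi-turn code has to keep the two apart: the conversation history grows, while the system prompt stays fixed. A minimal sketch of that separation follows; `build_request` is a hypothetical helper that only assembles keyword arguments for client.messages.create() and makes no API call.

```python
from typing import Dict, List


def build_request(system_prompt: str,
                  history: List[Dict[str, str]],
                  question: str) -> dict:
    """Assemble keyword arguments for client.messages.create()."""
    for turn in history:
        # The messages array carries only user and assistant turns
        if turn["role"] not in ("user", "assistant"):
            raise ValueError(f"unsupported role in messages: {turn['role']!r}")
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 2048,
        "system": system_prompt,  # top-level parameter, not a message
        "messages": [*history, {"role": "user", "content": question}],
    }
```

The result can be passed as `client.messages.create(**build_request(...))`. Rejecting a "system" role early gives a clearer error than a failed request.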

3.7 Complete Error Handling

Production code must handle all failure modes. The SDK exposes structured exception classes:

import anthropic
from anthropic import (
    AuthenticationError,
    BadRequestError,
    RateLimitError,
    APIConnectionError,
    APITimeoutError,
    InternalServerError,
    UnprocessableEntityError,
    APIError,
)

def robust_call(client: anthropic.Anthropic, **kwargs):
    try:
        return client.messages.create(**kwargs)

    except AuthenticationError as e:
        # HTTP 401: invalid or expired API key
        raise RuntimeError("Invalid API key. Check ANTHROPIC_API_KEY.") from e

    except BadRequestError as e:
        # HTTP 400: invalid request parameters
        raise ValueError(f"Bad request: {e.message}") from e

    except UnprocessableEntityError as e:
        # HTTP 422: the request was understood but could not be processed
        # Do NOT retry unchanged; the request itself needs to change
        raise ValueError(f"Unprocessable request: {e.message}") from e

    except RateLimitError:
        # HTTP 429 — implement retry logic (see section 3.2)
        raise

    except APITimeoutError:
        # Request timed out — safe to retry
        raise

    except APIConnectionError:
        # Network failure — safe to retry after checking connectivity
        raise

    except InternalServerError:
        # HTTP 5xx: Anthropic server error, safe to retry with backoff
        raise

    except APIError as e:
        # Catch-all for any other API error; not every APIError subclass
        # carries a status code, so read it defensively
        status = getattr(e, "status_code", "unknown")
        raise RuntimeError(f"API error {status}: {e.message}") from e

Retry decision table:

Error                      Status    Retryable   Action
────────────────────────   ───────   ─────────   ──────────────────────────────
AuthenticationError        401       No          Fix API key
BadRequestError            400       No          Fix request parameters
UnprocessableEntityError   422       No          Modify request content
NotFoundError              404       No          Check model ID
RateLimitError             429       Yes         Wait retry-after, then backoff
InternalServerError        500/529   Yes         Exponential backoff, max 3x
APITimeoutError            n/a       Yes         Retry with longer timeout
APIConnectionError         n/a       Yes         Retry after connectivity check
3.8 HTTP Client Configuration

Custom Timeouts

import anthropic
import httpx

client = anthropic.Anthropic(
    timeout=httpx.Timeout(
        connect=5.0,    # Max time to establish the connection
        read=120.0,     # Max time to wait for each chunk of response data
        write=10.0,     # Max time to wait for each chunk of request data to be sent
        pool=5.0        # Max time to acquire a connection from the pool
    )
)

The default read timeout is 600 seconds (10 minutes), which accommodates long Extended Thinking responses. For Haiku-based systems processing short prompts, reducing this to 30–60 seconds helps surface timeouts faster.

Proxy Support

import anthropic
import httpx

client = anthropic.Anthropic(
    http_client=httpx.Client(proxy="http://your-proxy.example.com:8080")
)

Connection Pool Tuning for High Concurrency

import anthropic
import httpx

client = anthropic.Anthropic(
    http_client=httpx.Client(
        limits=httpx.Limits(
            max_connections=100,
            max_keepalive_connections=20,
            keepalive_expiry=30.0
        )
    )
)

3.9 Async Client

For asyncio-based applications (FastAPI, aiohttp, etc.), use the async client to avoid blocking the event loop:

import asyncio
import anthropic

async def main():
    client = anthropic.AsyncAnthropic()  # note: AsyncAnthropic, not Anthropic

    message = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain async I/O in Python."}]
    )
    print(message.content[0].text)

asyncio.run(main())

FastAPI integration example:

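from fastapi import FastAPI
from pydantic import BaseModel
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()  # one shared client instance

class ChatRequest(BaseModel):
    message: str  # sent in the JSON body: {"message": "..."}

@app.post("/chat")
async def chat(request: ChatRequest) -> dict:
    # A bare `message: str` parameter would be read from the query string;
    # a Pydantic model makes FastAPI read it from the request body instead.
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": request.message}]
    )
    return {
        "reply": response.content[0].text,
        "tokens": {
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        }
    }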

Important: Create the AsyncAnthropic client once at module level and reuse it. Instantiating a new client per request wastes connection pool resources.
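
A shared async client makes it easy to launch many requests at once, which in turn makes it easy to exceed the rate limits from section 3.2. One common safeguard is to cap the number of in-flight calls with asyncio.Semaphore. The sketch below is illustrative: `gather_bounded` is a hypothetical helper, and `fake_call` stands in for a real AsyncAnthropic request so that the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_bounded(prompts: List[str],
                         call: Callable[[str], Awaitable[str]],
                         max_concurrency: int = 5) -> List[str]:
    """Run `call` for every prompt with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> str:
        async with semaphore:          # blocks while the cap is reached
            return await call(prompt)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(p) for p in prompts))


async def fake_call(prompt: str) -> str:
    """Stand-in for an AsyncAnthropic request so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    print(asyncio.run(gather_bounded(["a", "b", "c"], fake_call, max_concurrency=2)))
    # → ['A', 'B', 'C']
```

In a real application, `call` would wrap `client.messages.create(...)` and return the response text. The concurrency cap complements, rather than replaces, retry handling for 429 errors.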

Summary

This chapter covered the complete path from zero to a working API integration:

API key security: Always use environment variables; never hardcode or commit keys
Rate limits: Independent request (RPM) and token (TPM) limits; use exponential backoff for 429 errors
SDK installation: pip install anthropic for Python; npm install @anthropic-ai/sdk for TypeScript
First request: A single messages.create() call in Python; understand stop_reason, usage, and content in the response
Streaming: Use the stream() context manager for real-time output
Error handling: Distinguish retryable from non-retryable errors; never retry 401/400/422
Async: Use AsyncAnthropic in asyncio applications; create one shared instance per process

The next chapter moves into prompt engineering—how to structure system prompts, user messages, and context to maximize output quality.
