Token Economics: Precise Calculation and Cost Estimation for Input/Output/Thinking/Cache Tokens
Chapter 3: API Quick Start: Authentication, Rate Limits, SDK Installation, and Your First Request
3.1 Obtaining an API Key
Before making your first Claude API call, you need an API key. The process:
- Go to console.anthropic.com
- Create an account and verify your email
- Navigate to API Keys in the left sidebar
- Click Create Key and give it a descriptive name (e.g., production-chatbot, dev-testing)
- Copy the key immediately; it is shown only once
Secure API Key Management
An API key grants full access to your account, including spending your credit balance. Never:
- Hardcode the key in source files
- Commit files containing the key to any version control repository
- Include the key in client-side code (browser JavaScript, mobile apps)
The correct approach is environment variables:
# Linux / macOS
export ANTHROPIC_API_KEY="sk-ant-api03-..."
# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-api03-..."
For local development, a .env file with python-dotenv is convenient:
# .env (never commit this file)
ANTHROPIC_API_KEY=sk-ant-api03-...
from dotenv import load_dotenv
load_dotenv() # loads .env into environment before creating the client
For production, use a secrets management service: AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, or equivalent. The key should never appear in plaintext in any configuration file that might end up in a repository.
Key Format
Anthropic API keys begin with sk-ant-api03- followed by approximately 90 random characters. If you see a different prefix, you may be looking at a legacy key format or a key from a different service.
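As a minimal sketch of the advice above, the helper below reads the key from the environment, fails fast when it is missing, and warns on an unexpected prefix. The names `require_api_key` and `mask_key` are hypothetical, and the prefix check assumes only the format described in this section, which may not cover every key type.

```python
import os
import warnings

# Assumption: the prefix described in this section; other key formats may exist.
EXPECTED_PREFIX = "sk-ant-api03-"


def require_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the key from the environment or raise a clear error at startup."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or load it from a secrets manager")
    if not key.startswith(EXPECTED_PREFIX):
        warnings.warn(f"{env_var} does not start with {EXPECTED_PREFIX!r}; check that it is the right key")
    return key


def mask_key(key: str, visible: int = 4) -> str:
    """Render a key for logs: keep the leading characters and the last few, hide the rest."""
    if len(key) <= len(EXPECTED_PREFIX) + visible:
        return "***"
    return f"{key[:len(EXPECTED_PREFIX)]}...{key[-visible:]}"
```

Calling `require_api_key()` once at startup surfaces a missing key immediately instead of on the first request, and `mask_key()` keeps the full secret out of log output.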
3.2 Understanding Rate Limits
The API enforces rate limits along several independent dimensions. Exceeding any one of them returns an HTTP 429 error.
Rate Limit Dimensions
- RPM (Requests Per Minute): Maximum number of API calls per minute
- Input TPM (Input Tokens Per Minute): Maximum input tokens per minute
- Output TPM (Output Tokens Per Minute): Maximum output tokens per minute
- TPD (Tokens Per Day): Daily token limit (applies to some account tiers)
Approximate limits for claude-sonnet-4-6 by account tier (verify current values in the Anthropic documentation):
Tier      RPM      Input TPM    Output TPM
------    -----    ---------    ----------
Tier 1    50       40,000       8,000
Tier 2    1,000    80,000       16,000
Tier 3    2,000    160,000      32,000
Tier 4    4,000    400,000      80,000
Upgrading tiers requires adding a payment method and completing Anthropic's review process (typically 24 to 48 hours).
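Because the limits are independent, the one that binds depends on your request shape. The arithmetic can be sketched as follows; the function name `max_requests_per_minute` and the per-request token averages are illustrative assumptions, and the tier figures are the approximate ones from the table.

```python
def max_requests_per_minute(rpm: int, input_tpm: int, output_tpm: int,
                            avg_input_tokens: int, avg_output_tokens: int) -> int:
    """Sustainable requests per minute: whichever limit runs out first wins."""
    by_input = input_tpm // avg_input_tokens
    by_output = output_tpm // avg_output_tokens
    return min(rpm, by_input, by_output)


# Tier 1 figures from the table above; the token averages are made up for illustration.
print(max_requests_per_minute(50, 40_000, 8_000,
                              avg_input_tokens=2_000, avg_output_tokens=300))
# 20
```

Here the input token limit (40,000 / 2,000 = 20 requests) is reached well before the 50 RPM cap, so the RPM figure alone overstates the usable throughput.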
Reading Rate Limit Headers
API responses include headers reporting your current consumption (retry-after accompanies 429 responses):
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 847
anthropic-ratelimit-requests-reset: 2024-01-15T10:31:00Z
anthropic-ratelimit-tokens-limit: 80000
anthropic-ratelimit-tokens-remaining: 52340
anthropic-ratelimit-tokens-reset: 2024-01-15T10:30:30Z
retry-after: 30
These headers let you implement proactive throttling rather than relying purely on reactive retry logic.
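A sketch of proactive throttling based on these headers is shown below. It assumes the headers arrive as a mapping with lowercase keys, and `seconds_to_pause` and `min_remaining_tokens` are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def seconds_to_pause(headers: Mapping[str, str], min_remaining_tokens: int,
                     now: Optional[datetime] = None) -> float:
    """Return how long to wait before the next request, or 0.0 if enough budget is left."""
    now = now or datetime.now(timezone.utc)
    remaining = int(headers.get("anthropic-ratelimit-tokens-remaining", min_remaining_tokens))
    if remaining >= min_remaining_tokens:
        return 0.0
    reset_raw = headers.get("anthropic-ratelimit-tokens-reset")
    if reset_raw is None:
        return 0.0
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalize it.
    reset_at = datetime.fromisoformat(reset_raw.replace("Z", "+00:00"))
    return max(0.0, (reset_at - now).total_seconds())
```

With the Python SDK, one way to reach the headers is the raw-response accessor (`client.messages.with_raw_response.create(...)`, then `.headers` and `.parse()`); check the SDK documentation for the version you have installed.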
Exponential Backoff for 429 Errors
import time

import anthropic
from anthropic import RateLimitError


def call_with_retry(client: anthropic.Anthropic, max_retries: int = 5, **kwargs):
    """
    Wraps client.messages.create() with exponential backoff on rate limit errors.
    With the default max_retries=5, the waits between attempts are
    1s, 2.1s, 4.2s, 8.3s; the fifth failure is re-raised without waiting.
    """
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Give up after max_retries attempts
            # Exponential growth plus a small linear offset (deterministic, not random jitter)
            wait = (2 ** attempt) + (0.1 * attempt)
            print(f"Rate limit hit; retrying in {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
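The wait formula above is deterministic, so many clients that fail together also retry together. One common variation, sketched here as an assumption rather than as the SDK's behavior, adds random jitter, caps the delay, and never retries sooner than a server-provided retry-after value. The name `compute_delay` and its defaults are illustrative.

```python
import random
from typing import Callable, Optional


def compute_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    backoff = min(cap, base * (2 ** attempt))
    # Scale by a random factor in [0.5, 1.0) so concurrent clients spread out.
    jittered = backoff * (0.5 + 0.5 * rng())
    # Never retry sooner than the server asked for.
    if retry_after is not None:
        return max(jittered, retry_after)
    return jittered
```

The `rng` parameter exists only to make the function easy to test; in normal use the default `random.random` is sufficient.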
Token Budget Management for High Concurrency
In concurrent workloads, TPM limits are often hit before RPM limits. A sliding-window token budget manager prevents wasted retries:
import threading
import time
from collections import deque


class TokenBudgetManager:
    """
    Sliding-window token budget manager.
    Prevents requests that would exceed the per-minute token limit.
    """

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.window: deque[tuple[float, int]] = deque()
        self._lock = threading.Lock()

    def _prune(self) -> int:
        cutoff = time.time() - 60.0
        while self.window and self.window[0][0] < cutoff:
            self.window.popleft()
        return sum(tokens for _, tokens in self.window)

    def can_proceed(self, estimated_tokens: int) -> bool:
        with self._lock:
            used = self._prune()
            return used + estimated_tokens <= self.limit

    def record(self, token_count: int):
        with self._lock:
            self.window.append((time.time(), token_count))

    def wait_for_budget(self, estimated_tokens: int, timeout: float = 120.0):
        start = time.time()
        while not self.can_proceed(estimated_tokens):
            if time.time() - start > timeout:
                raise TimeoutError("Timed out waiting for token budget")
            time.sleep(1.0)
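In the manager above, checking the budget and recording usage are separate calls, so two threads can both pass the check before either records. A variant that reserves tokens atomically might look like the sketch below; `ReservingTokenBudget`, `try_reserve`, and the injectable `clock` parameter are hypothetical names introduced for illustration.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, Tuple


class ReservingTokenBudget:
    """Sliding-window budget where checking and recording happen under one lock."""

    def __init__(self, tokens_per_minute: int,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = tokens_per_minute
        self._clock = clock
        self._window: Deque[Tuple[float, int]] = deque()
        self._lock = threading.Lock()

    def try_reserve(self, estimated_tokens: int) -> bool:
        """Atomically reserve tokens; return False if the current window is too full."""
        with self._lock:
            now = self._clock()
            # Drop entries that are 60 seconds old or older.
            while self._window and self._window[0][0] <= now - 60.0:
                self._window.popleft()
            used = sum(tokens for _, tokens in self._window)
            if used + estimated_tokens > self.limit:
                return False
            self._window.append((now, estimated_tokens))
            return True
```

A caller would loop on `try_reserve()` with a short sleep, then send the request once it returns True. The reservation uses the estimate, so actual usage reported by the API can differ from what the window holds.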
3.3 Installing the SDK
Python SDK
pip install anthropic
# With package managers
poetry add anthropic
uv add anthropic
Recommended version pinning in pyproject.toml:
[tool.poetry.dependencies]
python = "^3.9"
anthropic = "^0.34.0" # for 0.x versions the caret allows patch updates only (>=0.34.0, <0.35.0)
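Caret constraints treat the leftmost non-zero version component as the one that signals breaking changes, which matters for a 0.x package. The helper below is a small illustration of that rule for three-part versions; `caret_bounds` is a hypothetical name, not part of Poetry.

```python
from typing import Tuple


def caret_bounds(version: str) -> Tuple[str, str]:
    """Return (inclusive lower bound, exclusive upper bound) for ^MAJOR.MINOR.PATCH."""
    major, minor, patch = (int(part) for part in version.split("."))
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return version, ".".join(str(n) for n in upper)


print(caret_bounds("0.34.0"))  # ('0.34.0', '0.35.0')
print(caret_bounds("1.4.2"))   # ('1.4.2', '2.0.0')
```

For a 0.x SDK this means new minor releases are not picked up automatically, so the pin has to be raised deliberately.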
The SDK keeps its dependency footprint small; core dependencies include httpx for HTTP, pydantic for data validation, and typing-extensions for backported type hints.
TypeScript / Node.js SDK
npm install @anthropic-ai/sdk
# or
yarn add @anthropic-ai/sdk
# or
pnpm add @anthropic-ai/sdk
The TypeScript SDK ships with full type definitions. In a TypeScript project, all request and response fields are fully typed and show up in IDE autocompletion.
{
"dependencies": {
"@anthropic-ai/sdk": "^0.26.0"
}
}
Direct HTTP (No SDK)
For environments where neither Python nor Node.js is available, the REST API is callable with any HTTP client:
curl https://api.anthropic.com/v1/messages \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "content-type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "Hello, Claude!"}
]
}'
Two Anthropic-specific headers are required on every request, in addition to content-type: application/json:
- x-api-key: Your API key
- anthropic-version: The API version string, currently 2023-06-01
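For comparison with the curl call, the same request can be assembled with Python's standard library alone. This is a sketch that builds the request without sending it; `build_messages_request` is a hypothetical helper name, and the URL, headers, and body mirror the curl example above.

```python
import json
import urllib.request


def build_messages_request(api_key: str, prompt: str,
                           model: str = "claude-sonnet-4-6",
                           max_tokens: int = 1024) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl call above."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://api.anthropic.com/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=...)` and decode the JSON body of the response. Keeping construction separate from sending makes the headers and payload easy to inspect in tests.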
3.4 Your First Request
Python
import anthropic
# Client reads ANTHROPIC_API_KEY from the environment automatically
client = anthropic.Anthropic()
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=[
{"role": "user", "content": "Explain quantum entanglement in one sentence."}
]
)
print(message.content[0].text)
# → Quantum entanglement is a phenomenon where two or more particles become
# correlated such that measuring the state of one instantly determines
# the state of the other, regardless of the distance between them.
Understanding the Response Object
print(message)
# Message(
# id='msg_01XFDUDYJgAACzvnptvVoYEL',
# type='message',
# role='assistant',
# content=[
# TextBlock(text='Quantum entanglement is...', type='text')
# ],
# model='claude-sonnet-4-6',
# stop_reason='end_turn',
# stop_sequence=None,
# usage=Usage(input_tokens=14, output_tokens=47)
# )
# Key field access patterns
text = message.content[0].text # response text
input_tokens = message.usage.input_tokens # tokens consumed by input
output_tokens = message.usage.output_tokens # tokens consumed by output
stop_reason = message.stop_reason # e.g. 'end_turn', 'max_tokens', 'stop_sequence', or 'tool_use'
model_used = message.model # actual model version that served the request
stop_reason is important for production code:
- end_turn: Model finished naturally
- max_tokens: Response was cut off at your max_tokens limit; you may need to increase it or paginate
- stop_sequence: A stop sequence you specified was encountered
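One way to act on stop_reason is to refuse to return a truncated answer silently. The sketch below uses a stand-in object shaped like the Message shown earlier so that it runs without an API call; `TruncatedResponseError` and `text_or_raise` are hypothetical names.

```python
from types import SimpleNamespace


class TruncatedResponseError(RuntimeError):
    """Raised when the model stopped because it hit max_tokens."""


def text_or_raise(message) -> str:
    """Join all text blocks, raising instead of returning a cut-off answer."""
    if message.stop_reason == "max_tokens":
        raise TruncatedResponseError(
            "Response hit max_tokens; raise the limit or continue the turn"
        )
    return "".join(block.text for block in message.content if block.type == "text")


# Stand-in for a real Message object (assumption: same attribute names as above).
fake = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="Quantum entanglement is...")],
)
print(text_or_raise(fake))  # Quantum entanglement is...
```

Filtering on `block.type == "text"` also avoids assuming that `content[0]` is always a text block.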
TypeScript
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic(); // reads process.env.ANTHROPIC_API_KEY
async function main() {
const message = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
messages: [
{ role: "user", content: "Explain quantum entanglement in one sentence." }
],
});
// TypeScript knows content[0] can be TextBlock or ToolUseBlock
const block = message.content[0];
if (block.type === "text") {
console.log(block.text);
}
console.log(
`Tokens: ${message.usage.input_tokens} input, ${message.usage.output_tokens} output`
);
}
main();
3.5 Streaming Responses
For chat interfaces and other interactive use cases, streaming delivers tokens as they are generated rather than waiting for the complete response.
Python Streaming
import anthropic

client = anthropic.Anthropic()

# Recommended: use the context manager
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print() # newline after completion

    # Retrieve the final message with full usage stats
    final = stream.get_final_message()
    print(f"\nUsage: {final.usage.input_tokens} in / {final.usage.output_tokens} out")
For more granular event handling:
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Explain TCP's three-way handshake."}]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
        elif event.type == "message_delta":
            # contains stop_reason and usage when the message completes
            pass
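The event loop above can be factored into a function that accumulates text and captures the stop reason. The sketch below runs against stand-in event objects rather than a live stream; `collect_stream` is a hypothetical name, and the event shapes are assumed to match the ones handled in the loop above.

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def collect_stream(events: Iterable) -> Tuple[str, Optional[str]]:
    """Accumulate text deltas and capture stop_reason from the message_delta event."""
    parts = []
    stop_reason = None
    for event in events:
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            parts.append(event.delta.text)
        elif event.type == "message_delta":
            stop_reason = getattr(event.delta, "stop_reason", None)
    return "".join(parts), stop_reason


# Stand-in events so the sketch runs offline.
fake_events = [
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="SYN, ")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="SYN-ACK, ACK")),
    SimpleNamespace(type="message_delta",
                    delta=SimpleNamespace(stop_reason="end_turn")),
]
print(collect_stream(fake_events))  # ('SYN, SYN-ACK, ACK', 'end_turn')
```

Inside a real `with client.messages.stream(...) as stream:` block, the same function would be called as `collect_stream(stream)`.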
TypeScript Streaming
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
async function streamExample() {
const stream = client.messages.stream({
model: "claude-sonnet-4-6",
max_tokens: 1024,
messages: [{ role: "user", content: "Write a haiku about the ocean." }],
});
for await (const chunk of stream) {
if (
chunk.type === "content_block_delta" &&
chunk.delta.type === "text_delta"
) {
process.stdout.write(chunk.delta.text);
}
}
const final = await stream.finalMessage();
console.log(`\nTokens: ${final.usage.input_tokens} in / ${final.usage.output_tokens} out`);
}
streamExample();
3.6 Adding a System Prompt
The system prompt sets Claude's behavior, persona, and constraints for the entire conversation. It is passed as a top-level system parameter, separate from the messages array.
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = """You are a senior backend engineer specializing in distributed systems.
When answering questions:
1. Always provide working code examples in Python
2. Explain the trade-offs of each approach
3. Call out common pitfalls explicitly
4. If you are uncertain about something, say so clearly
Format your responses with Markdown. Use code blocks with language identifiers."""
def ask_technical_question(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text
answer = ask_technical_question("What are the trade-offs between optimistic and pessimistic locking?")
print(answer)
3.7 Complete Error Handling
Production code must handle all failure modes. The SDK exposes structured exception classes:
import anthropic
from anthropic import (
    AuthenticationError,
    BadRequestError,
    RateLimitError,
    APIConnectionError,
    APITimeoutError,
    InternalServerError,
    UnprocessableEntityError,
    APIError,
)


def robust_call(client: anthropic.Anthropic, **kwargs):
    try:
        return client.messages.create(**kwargs)
    except AuthenticationError:
        # HTTP 401: invalid or expired API key
        raise RuntimeError("Invalid API key. Check ANTHROPIC_API_KEY.")
    except BadRequestError as e:
        # HTTP 400: invalid request parameters
        raise ValueError(f"Bad request: {e.message}") from e
    except UnprocessableEntityError as e:
        # HTTP 422: the request was well-formed but could not be processed
        # Do NOT retry; the content itself needs to change
        raise ValueError(f"Unprocessable request: {e.message}") from e
    except RateLimitError:
        # HTTP 429: implement retry logic (see section 3.2)
        raise
    except APITimeoutError:
        # Request timed out: safe to retry
        # (must come before APIConnectionError, which it subclasses)
        raise
    except APIConnectionError:
        # Network failure: safe to retry after checking connectivity
        raise
    except InternalServerError:
        # HTTP 5xx: Anthropic server error, safe to retry with backoff
        raise
    except APIError as e:
        # Catch-all for any other API error; not every APIError carries a status code
        status = getattr(e, "status_code", "unknown")
        raise RuntimeError(f"API error {status}: {e.message}") from e
Retry decision table:
Error                       Status     Retryable    Action
------------------------    -------    ---------    ------------------------------
AuthenticationError         401        No           Fix API key
BadRequestError             400        No           Fix request parameters
UnprocessableEntityError    422        No           Modify request content
NotFoundError               404        No           Check model ID
RateLimitError              429        Yes          Wait retry-after, then backoff
InternalServerError         500/529    Yes          Exponential backoff, max 3x
APITimeoutError             n/a        Yes          Retry with longer timeout
APIConnectionError          n/a        Yes          Retry after connectivity check
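The table can be condensed into a small predicate on the HTTP status code. This is a sketch that mirrors the rows above; `is_retryable` and `NON_RETRYABLE` are hypothetical names, and `None` stands for errors that carry no HTTP status (timeouts and connection failures).

```python
from typing import Optional

# Status codes from the table above that should never be retried unchanged.
NON_RETRYABLE = {400, 401, 404, 422}


def is_retryable(status_code: Optional[int]) -> bool:
    """Return True when a request that failed with this status is worth retrying."""
    if status_code is None:
        return True  # timeout or connection failure: no HTTP response was received
    if status_code in NON_RETRYABLE:
        return False
    # Rate limits and server-side errors are transient; anything else is not retried.
    return status_code == 429 or status_code >= 500


for code in (401, 429, 529, None):
    print(code, is_retryable(code))
```

Unknown 4xx codes fall through to "not retryable", which is the conservative default: retrying a request the server has rejected rarely helps.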
3.8 HTTP Client Configuration
Custom Timeouts
import anthropic
import httpx
client = anthropic.Anthropic(
    timeout=httpx.Timeout(
        connect=5.0, # Time to establish the connection
        read=120.0, # Maximum wait for each chunk of response data
        write=10.0, # Time to send each chunk of the request body
        pool=5.0 # Time to acquire a connection from the pool
    )
)
The default read timeout is 600 seconds (10 minutes), which accommodates long Extended Thinking responses. For Haiku-based systems processing short prompts, reducing this to 30 to 60 seconds helps surface timeouts faster.
Proxy Support
import anthropic
import httpx
client = anthropic.Anthropic(
http_client=httpx.Client(proxy="http://your-proxy.example.com:8080")
)
Connection Pool Tuning for High Concurrency
import anthropic
import httpx
client = anthropic.Anthropic(
http_client=httpx.Client(
limits=httpx.Limits(
max_connections=100,
max_keepalive_connections=20,
keepalive_expiry=30.0
)
)
)
3.9 Async Client
For asyncio-based applications (FastAPI, aiohttp, etc.), use the async client to avoid blocking the event loop:
import asyncio
import anthropic


async def main():
    client = anthropic.AsyncAnthropic() # note: AsyncAnthropic, not Anthropic
    message = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain async I/O in Python."}]
    )
    print(message.content[0].text)


asyncio.run(main())
FastAPI integration example:
from fastapi import FastAPI
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic() # one shared client instance


@app.post("/chat")
async def chat(message: str) -> dict:
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": message}]
    )
    return {
        "reply": response.content[0].text,
        "tokens": {
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        }
    }
Important: Create the AsyncAnthropic client once at module level and reuse it. Instantiating a new client per request wastes connection pool resources.
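A shared async client makes it easy to launch many requests at once, which can run straight into the rate limits from section 3.2. One way to bound the number of in-flight calls is a semaphore, sketched below with a stand-in coroutine in place of the real API call; `run_bounded` and `fake_call` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_bounded(prompts: Sequence[str],
                      call: Callable[[str], Awaitable[str]],
                      max_concurrency: int = 5) -> List[str]:
    """Run `call` for every prompt with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await call(prompt)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_call(prompt: str) -> str:
    # Stand-in for `await client.messages.create(...)` so the sketch runs offline.
    await asyncio.sleep(0.01)
    return prompt.upper()


print(asyncio.run(run_bounded(["a", "b", "c"], fake_call, max_concurrency=2)))
# ['A', 'B', 'C']
```

In a real application, `call` would wrap the shared AsyncAnthropic client, and `max_concurrency` would be chosen from the RPM and TPM limits of your tier.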
Summary
This chapter covered the complete path from zero to a working API integration:
- API key security: Always use environment variables; never hardcode or commit keys
- Rate limits: Request (RPM) and token (TPM) limits are enforced independently; use exponential backoff for 429 errors
- SDK installation: pip install anthropic for Python; npm install @anthropic-ai/sdk for TypeScript
- First request: Five lines in Python; understand stop_reason, usage, and content in the response
- Streaming: Use the stream() context manager for real-time output
- Error handling: Distinguish retryable from non-retryable errors; never retry 401/400/422
- Async: Use AsyncAnthropic in asyncio applications; create one shared instance per process
The next chapter moves into prompt engineering: how to structure system prompts, user messages, and context to maximize output quality.