Seven-Language SDK Complete Guide: Python / TypeScript / Java / Go / C# / Ruby / PHP
Chapter 5: Understanding Tokens: Billing, Context Window Size, and Long-Text Strategies
5.1 What Tokens Are: From Characters to Semantic Units
Tokens are the atomic unit of computation for large language models—not characters, not words, but something in between. Understanding them is essential for cost control, prompt design, and long-document handling.
Token Fundamentals
Claude uses a subword tokenizer similar to Byte Pair Encoding (BPE). The examples below are illustrative; exact splits vary by tokenizer and model:
English:
"the" → 1 token
"running" → 1 token
"unbelievable" → 3 tokens (un + believ + able)
" Hello" → 1 token (the leading space is part of the token)
Chinese:
"你好" → 2 tokens (roughly 1–2 tokens per character)
"人工智能" → ~4–6 tokens
"量子纠缠" → ~3–4 tokens
Code:
"def" → 1 token
"class MyClass:" → ~5 tokens
"{" → 1 token
Rules of thumb:
- English: ~4 characters per token, or ~0.75 words per token
- Chinese: ~1–2 tokens per character (varies by tokenizer)
- Code: ~3–4 characters per token
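These rules of thumb can be turned into a quick offline estimator. The sketch below is a character-class heuristic, not a tokenizer: the function name and both ratio parameters are assumptions chosen for illustration, with the Chinese default taken from the 你好 → 2 tokens example earlier in this section.

```python
import math

def rough_token_estimate(
    text: str,
    chars_per_token: float = 4.0,       # assumed ratio for English-like text
    tokens_per_cjk_char: float = 1.5,   # assumed ratio for Chinese characters
) -> int:
    """Ballpark token count from character classes; not a real tokenizer."""
    # Count characters in the main CJK Unified Ideographs block.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return math.ceil(other / chars_per_token + cjk * tokens_per_cjk_char)

print(rough_token_estimate("The quick brown fox jumps over the lazy dog"))  # → 11
print(rough_token_estimate("你好"))  # → 3
```

Use an estimate like this only for budgeting; anything that affects billing or context limits should be measured.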
Measuring Token Counts via the API
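The API provides a count_tokens endpoint that measures a request with the model's own tokenizer, so no character heuristics are needed. Treat the result as a close estimate; the count billed for the actual request can differ by a small amount: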
import anthropic

client = anthropic.Anthropic()

def count_tokens(text: str, model: str = "claude-sonnet-4-6") -> int:
    response = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}]
    )
    return response.input_tokens

# Measure a complete prompt including system prompt
def count_full_prompt_tokens(system: str, messages: list[dict]) -> int:
    response = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        system=system,
        messages=messages
    )
    return response.input_tokens

# Example
texts = [
    "Hello, world!",
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "The quick brown fox jumps over the lazy dog",
]
for t in texts:
    n = count_tokens(t)
    print(f"{n:4d} tokens | {len(t):4d} chars | ratio {len(t)/n:.1f} | {t[:50]}")
Offline Estimation with tiktoken
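For quick estimates without an API call, OpenAI's tiktoken with the cl100k_base encoding gives a rough approximation. It is not Claude's tokenizer, so treat the result as a ballpark figure and calibrate it against count_tokens on your own data: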
# pip install tiktoken
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def estimate_tokens(text: str) -> int:
    return len(_enc.encode(text))
5.2 Billing Model In Depth
Input vs Output Tokens
Every API request is billed on two dimensions, with output priced roughly 5× higher:
claude-sonnet-4-6:
Input: $3.00 / million tokens
Output: $15.00 / million tokens
What counts as input tokens:
- System prompt
- All prior user and assistant messages
- Current user message
- Tool definitions (when using tool use)
- Images (converted to tokens by resolution)
What counts as output tokens:
- All text the model generates
- Tool call arguments
- Extended Thinking content (thinking blocks)
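To make the two billing dimensions concrete, here is a minimal cost calculator. It is a sketch that hard-codes the claude-sonnet-4-6 prices listed above; the function and constant names are hypothetical, and real prices should be checked against the current price list.

```python
# Prices per million tokens, copied from the claude-sonnet-4-6 figures above.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00

def estimate_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000

# 2,000 input tokens and 500 output tokens
print(f"${estimate_request_cost(2_000, 500):.4f}")  # → $0.0135
```

In this example the 500 output tokens cost more than the 2,000 input tokens, which is why trimming output length often saves more than trimming the prompt.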
Image Token Costs
Images are converted to tokens based on their resolution:
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """
    Estimate input tokens for an image.
    Uses the approximation from Anthropic's vision documentation:
    tokens ≈ (width_px * height_px) / 750, applied after any downscaling.
    Treat the result as a ballpark figure; response.usage has the real count.
    """
    MAX_SIZE = 1568  # long-edge limit before the image is downscaled
    if width > MAX_SIZE or height > MAX_SIZE:
        scale = MAX_SIZE / max(width, height)
        width = round(width * scale)
        height = round(height * scale)
    return math.ceil(width * height / 750)

print(estimate_image_tokens(800, 600))    # → 640
print(estimate_image_tokens(1920, 1080))  # → 1844 (scaled to 1568×882 first; the API may downscale further)
Prompt Caching: 90% Discount on Repeated Content
Anthropic's Prompt Caching feature lets you mark content blocks as cacheable. On subsequent requests within a 5-minute window, those cached tokens are billed at only 10% of the normal input price:
import anthropic

client = anthropic.Anthropic()

LONG_DOCUMENTATION = "..."  # 2,000+ tokens of static reference material

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code assistant." + LONG_DOCUMENTATION,
            "cache_control": {"type": "ephemeral"},  # mark for caching
        }
    ],
    messages=[{"role": "user", "content": "Explain async/await."}]
)

# First request: the marked prefix is written to the cache (billed at 125% of the input price)
# Subsequent requests within 5 min: cached tokens billed at 10%
print(response.usage)
# Usage(
#     input_tokens=45,
#     output_tokens=312,
#     cache_creation_input_tokens=2058,  # written to cache (125% of input price)
#     cache_read_input_tokens=0
# )
Best use cases for caching:
- Large system prompts (>1,000 tokens) shared across many requests
- RAG reference documents reused across multiple turns
- Early conversation history in long multi-turn sessions
Cost calculation for cached content:
Cache write: $3.75 / M tokens (Sonnet) = 125% of normal input price
Cache read: $0.30 / M tokens (Sonnet) = 10% of normal input price
Break-even: caching pays off from the first cache read (one write + one read = 1.35× the normal input price, versus 2.00× for two uncached requests)
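The cache arithmetic can be worked through for any request count. The sketch below uses the Sonnet prices listed above and assumes every request lands inside the cache window; the function name and the 2,000-token prefix are illustrative.

```python
BASE = 3.00          # USD per million input tokens (Sonnet figure above)
CACHE_WRITE = 3.75   # 125% of BASE
CACHE_READ = 0.30    # 10% of BASE

def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """USD cost of sending the same prefix `requests` times within one cache window."""
    if requests <= 0:
        return 0.0
    if not cached:
        per_mtok = BASE * requests
    else:
        # One cache write on the first request, then a cheap read on each later one.
        per_mtok = CACHE_WRITE + CACHE_READ * (requests - 1)
    return prefix_tokens * per_mtok / 1_000_000

for n in (1, 2, 10):
    plain = prefix_cost(2_000, n, cached=False)
    cached = prefix_cost(2_000, n, cached=True)
    print(f"{n:2d} requests: uncached ${plain:.4f} | cached ${cached:.4f}")
```

With these prices a prefix that is sent only once costs 25% more when cached, while a prefix that is reused even once already costs less than two uncached sends.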
5.3 The Context Window: Capability and Limits
What 200K Tokens Can Hold
Claude's 200K token context window is large by any measure:
200,000 tokens can hold approximately:
- 150,000 English words (a full-length novel)
- 100,000 Chinese characters
- 10,000 lines of code (depending on density)
- 400–500 pages of PDF text
- 150 standard-resolution images
But "can hold" is not the same as "can reliably process." The context window is a technical ceiling, not a quality guarantee.
The "Lost in the Middle" Effect
Empirical research demonstrates that retrieval accuracy for information placed in the middle of a long context is lower than for information at the beginning or end:
Approximate middle-of-context retrieval accuracy by context length:
| Context length | Middle-section accuracy |
|---|---|
| 5K tokens | ~95% |
| 20K tokens | ~90% |
| 50K tokens | ~85% |
| 100K tokens | ~80% |
| 200K tokens | ~70% |
A drop of about 25 percentage points (from ~95% at 5K tokens to ~70% at 200K) is non-trivial for tasks requiring precise information retrieval (e.g., "What does clause 23 of the contract say?").
Practical Window Sizing
| Task type | Recommended effective window |
|---|---|
| Single-document summarization | Full 200K is fine |
| Multi-document Q&A | < 100K |
| Precise information retrieval | < 50K or use RAG |
| Code analysis | < 50K or process in chunks |
| Multi-turn conversation history | Keep < 20K; compress regularly |
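The table can be encoded as a small lookup that suggests what to do when a prompt exceeds the recommended budget. This is a sketch only: the task keys, the function name, and the returned suggestions are hypothetical labels for the rows above, and the budgets are treated as upper bounds.

```python
# Budgets in tokens, taken from the table above.
RECOMMENDED_BUDGET = {
    "single_doc_summary": 200_000,
    "multi_doc_qa": 100_000,
    "precise_retrieval": 50_000,
    "code_analysis": 50_000,
    "chat_history": 20_000,
}

def plan_for(task: str, prompt_tokens: int) -> str:
    """Suggest how to handle a prompt of `prompt_tokens` for a given task type."""
    budget = RECOMMENDED_BUDGET[task]
    if prompt_tokens <= budget:
        return "send as-is"
    if task == "chat_history":
        return "compress older turns"
    if task in ("precise_retrieval", "multi_doc_qa"):
        return "retrieve relevant passages (RAG)"
    return "split into chunks"

print(plan_for("precise_retrieval", 80_000))  # → retrieve relevant passages (RAG)
print(plan_for("chat_history", 5_000))        # → send as-is
```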
5.4 Long-Text Processing Strategies
Strategy 1: Chunking
Split long documents into overlapping chunks, process each, then combine:
import anthropic

client = anthropic.Anthropic()

def chunk_text(text: str, chunk_chars: int = 50_000, overlap_chars: int = 500) -> list[str]:
    """
    Split text into overlapping chunks, breaking at sentence boundaries.
    chunk_chars: target chunk size in characters (~12,500 tokens for English)
    overlap_chars: overlap between adjacent chunks for continuity
    """
    if not 0 <= overlap_chars < chunk_chars // 2:
        raise ValueError("overlap_chars must be less than half of chunk_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        if end < len(text):
            # Prefer a sentence boundary, searching only the second half of the
            # window so every chunk still makes forward progress
            for sep in ['. ', '.\n', '\n\n', '\n']:
                boundary = text.rfind(sep, start + chunk_chars // 2, end)
                if boundary != -1:
                    end = boundary + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk emitted; stepping back by the overlap would loop forever
        start = end - overlap_chars
    return chunks
def summarize_long_document(document: str, question: str | None = None) -> str:
    """
    Two-pass summarization:
    Pass 1: Haiku summarizes each chunk cheaply
    Pass 2: Sonnet synthesizes the chunk summaries into a final answer
    """
    chunks = chunk_text(document)

    # Pass 1: map
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        focus = f" Focus especially on content relevant to: {question}" if question else ""
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            messages=[{"role": "user",
                       "content": f"Summarize the key points of this excerpt "
                                  f"(part {i+1}/{len(chunks)}).{focus}\n\n{chunk}"}]
        )
        chunk_summaries.append(resp.content[0].text)

    # Pass 2: reduce
    combined = "\n\n---\n\n".join(
        f"[Part {i+1}]\n{s}" for i, s in enumerate(chunk_summaries)
    )
    final_prompt = "Synthesize the following section summaries into a cohesive overall summary."
    if question:
        final_prompt += f" Also answer this specific question: {question}"
    final_prompt += f"\n\n{combined}"
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{"role": "user", "content": final_prompt}]
    )
    return resp.content[0].text
Strategy 2: RAG (Retrieval-Augmented Generation)
For precision-retrieval tasks, RAG is more reliable than cramming everything into the context window:
from dataclasses import dataclass, field
import anthropic

@dataclass
class Chunk:
    text: str
    source: str = ""
    metadata: dict = field(default_factory=dict)

class SimpleRAG:
    """
    Minimal RAG implementation. For production, replace _score() with
    embedding-based similarity using pgvector, Pinecone, or similar.
    """
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.chunks: list[Chunk] = []

    def ingest(self, text: str, chunk_size: int = 800, source: str = ""):
        step = chunk_size - 100  # 100-char overlap
        for i in range(0, len(text), step):
            self.chunks.append(Chunk(text=text[i:i+chunk_size], source=source))

    def _score(self, query: str, chunk: Chunk) -> float:
        # Fraction of query words that also appear in the chunk
        q_words = set(query.lower().split())
        c_words = set(chunk.text.lower().split())
        return len(q_words & c_words) / max(len(q_words), 1)

    def query(self, question: str, top_k: int = 4) -> str:
        ranked = sorted(self.chunks, key=lambda c: self._score(question, c), reverse=True)
        context_blocks = "\n\n---\n\n".join(
            f"[Source: {c.source or 'document'}]\n{c.text}"
            for c in ranked[:top_k]
        )
        prompt = f"""Answer the question using only the provided references.
If the references don't contain enough information, say so explicitly.
<references>
{context_blocks}
</references>
Question: {question}"""
        resp = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return resp.content[0].text
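The word-overlap _score() above splits on whitespace, so a query word such as "issued?" never matches "issued" in a chunk. One step up that still needs no external service is cosine similarity over term-frequency vectors. The sketch below is a stand-in for illustration, not the embedding-based similarity the class docstring recommends for production; the function names and sample passages are hypothetical.

```python
import math
import re
from collections import Counter

def _term_counts(text: str) -> Counter:
    """Lowercased word counts; punctuation is dropped."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_score(query: str, passage: str) -> float:
    """Cosine similarity between term-frequency vectors, in [0, 1]."""
    q, p = _term_counts(query), _term_counts(passage)
    dot = sum(q[t] * p[t] for t in q.keys() & p.keys())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in p.values()))
    return dot / norm if norm else 0.0

passages = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
]
best = max(passages, key=lambda p: cosine_score("When are refunds issued?", p))
print(best)  # → Refunds are issued within 14 days of purchase.
```

Because the score is normalized by passage length, long chunks that merely repeat common words do not automatically outrank short, focused ones.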
Strategy 3: Sliding Window for Sequential Processing
For tasks that need to process a document linearly (e.g., annotating, extracting a running log):
def sliding_window_process(
    text: str,
    task_instruction: str,
    window_tokens: int = 30_000,
    step_tokens: int = 20_000,
) -> list[str]:
    """
    Process text in overlapping windows.
    window_tokens and step_tokens are approximate (using char/4 heuristic).
    """
    window_chars = window_tokens * 4
    step_chars = step_tokens * 4
    results = []
    pos = 0
    while pos < len(text):
        window = text[pos: pos + window_chars]
        is_first = pos == 0
        continuation_note = "" if is_first else " (continuing from a previous section)"
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=800,
            messages=[{"role": "user",
                       "content": f"{task_instruction}{continuation_note}\n\n{window}"}]
        )
        results.append(resp.content[0].text)
        if pos + window_chars >= len(text):
            break  # this window already reached the end of the text
        pos += step_chars
    return results
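It can help to see which character ranges the windows cover before spending any tokens. The helper below is a hypothetical companion that reproduces only the offset arithmetic of the loop above (window size, step, and the stop condition), with no API calls.

```python
def window_spans(total_chars: int, window_chars: int, step_chars: int) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for each overlapping window."""
    if not 0 < step_chars <= window_chars:
        raise ValueError("step_chars must be positive and no larger than window_chars")
    spans = []
    pos = 0
    while pos < total_chars:
        end = min(pos + window_chars, total_chars)
        spans.append((pos, end))
        if end == total_chars:
            break  # the last window reached the end of the text
        pos += step_chars
    return spans

# Default sizes from above: 30,000-token windows and 20,000-token steps at ~4 chars per token
print(window_spans(250_000, 120_000, 80_000))
# → [(0, 120000), (80000, 200000), (160000, 250000)]
```

Each window overlaps the previous one by window_chars - step_chars characters, which is 40,000 characters in this example.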
Strategy 4: Hierarchical Summarization (Map-Reduce)
def hierarchical_summarize(document: str, target_words: int = 500) -> str:
    """
    Map phase: Haiku creates short summaries of each chunk (cheap)
    Reduce phase: Sonnet synthesizes into a final summary (quality)
    """
    CHUNK_CHARS = 20_000  # ~5,000 tokens
    chunks = [document[i:i+CHUNK_CHARS] for i in range(0, len(document), CHUNK_CHARS)]

    # Map
    chunk_summaries = []
    for chunk in chunks:
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{"role": "user",
                       "content": f"Summarize the key points in 3–4 sentences:\n\n{chunk}"}]
        )
        chunk_summaries.append(resp.content[0].text)

    # Reduce
    combined = "\n\n".join(f"• {s}" for s in chunk_summaries)
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=target_words * 2,  # rough token estimate from target word count
        messages=[{"role": "user",
                   "content": f"Synthesize these section summaries into a single "
                              f"coherent ~{target_words}-word summary:\n\n{combined}"}]
    )
    return resp.content[0].text
5.5 Token Cost Optimization Techniques
Tip 1: Compress System Prompts
❌ Verbose (~45 tokens):
"You are a very helpful, friendly, and professional AI assistant whose job is
to assist users in solving various problems. You always provide accurate,
detailed, and useful information in a polite and respectful manner."
✅ Compressed (~12 tokens):
"You are a professional technical assistant. Provide accurate, concise answers."
Rule: every word in the system prompt is paid for on every request. Remove any phrase that doesn't change behavior.
Tip 2: Compress Conversation History
Long conversations accumulate history that consumes tokens on every turn:
def compress_history(
    messages: list[dict],
    max_history_tokens: int = 8_000,
) -> list[dict]:
    """
    When history exceeds the token limit, summarize the older turns.
    Preserves the most recent 6 messages (3 turns) verbatim.
    """
    # Rough token estimate: total chars / 4
    total_chars = sum(len(m.get("content", "") or "") for m in messages)
    if total_chars // 4 <= max_history_tokens:
        return messages
    recent = messages[-6:]
    old = messages[:-6]
    if not old:
        return recent
    # Truncate each older message to 200 characters before summarizing
    history_text = "\n".join(
        f"{'User' if m['role'] == 'user' else 'Assistant'}: "
        f"{str(m.get('content', ''))[:200]}..."
        for m in old
    )
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": f"Summarize this conversation in 3 sentences:\n\n{history_text}"}]
    )
    summary = resp.content[0].text
    return [
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]\n\nContinuing our conversation."},
        {"role": "assistant", "content": "Understood. I have the context from our earlier discussion."},
        *recent,
    ]
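The chars / 4 estimate above assumes every message's content is a plain string. When content is a list of blocks, len() returns the number of blocks rather than the number of characters. A sketch of an estimator that handles both shapes follows; the function name is hypothetical, and non-text blocks such as images or tool results are deliberately not counted.

```python
def estimate_history_tokens(messages: list[dict], chars_per_token: float = 4.0) -> int:
    """Rough token estimate that also counts text inside content-block lists."""
    total_chars = 0
    for message in messages:
        content = message.get("content") or ""
        if isinstance(content, str):
            total_chars += len(content)
            continue
        # Content given as a list of blocks: count the text blocks only.
        for block in content:
            if isinstance(block, dict) and block.get("type") == "text":
                total_chars += len(block.get("text", ""))
    return int(total_chars / chars_per_token)

history = [
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": [{"type": "text", "text": "y" * 800}]},
]
print(estimate_history_tokens(history))  # → 300
```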
Tip 3: Right-Size max_tokens
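max_tokens is a hard cap on output length, not a target: you are billed only for the tokens actually generated, and the model does not see the value or adjust its answer to fit it. Right-sizing it still matters. A tight cap bounds the worst-case cost and latency of a runaway response, while a cap that is too low cuts the output off mid-answer (the response then reports stop_reason "max_tokens").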
MAX_TOKENS_BY_TASK = {
    "classification": 20,
    "json_extraction": 500,
    "short_summary": 200,
    "long_summary": 800,
    "code_review": 600,
    "code_generation": 2_000,
    "explanation": 400,
    "simple_qa": 150,
    "detailed_qa": 700,
}

def get_max_tokens(task: str, fallback: int = 1_024) -> int:
    return MAX_TOKENS_BY_TASK.get(task, fallback)
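When a cap from the table above turns out to be too low, the response is cut off, and the Messages API reports this with a stop_reason of "max_tokens". A minimal sketch for detecting that case follows. The helper name is hypothetical, and the SimpleNamespace objects are stand-ins shaped like an API response so the example runs without a network call.

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """True when generation stopped because it hit the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"

# Stand-in objects shaped like a Messages API response (only stop_reason is used).
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")
print(was_truncated(complete), was_truncated(cut_off))  # → False True
```

Logging how often each task type is truncated is a cheap way to tune the per-task caps over time.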
Tip 4: Use Batch API for Non-Real-Time Work
The Batch API provides a 50% discount for asynchronous workloads with up to 24-hour processing windows:
import anthropic

client = anthropic.Anthropic()

# Placeholder inputs; replace with the texts you want processed
items_to_process = ["First text to process", "Second text to process"]

# Submit a batch of requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 500,
                "messages": [{"role": "user", "content": item_text}]
            }
        }
        for i, item_text in enumerate(items_to_process)
    ]
)
print(f"Batch ID: {batch.id}")
print(f"Use client.messages.batches.retrieve('{batch.id}') to check status")
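Submitting is only half of the workflow; the results have to be collected once processing ends. The sketch below polls the batch and gathers the text of successful entries. The helper name and polling interval are assumptions, and it relies on the Python SDK's batches.retrieve() and batches.results() calls and on the processing_status, custom_id, and result.type fields as I understand them, so verify the shapes against the current SDK reference.

```python
import time

def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0) -> dict[str, str]:
    """Poll until the batch ends, then map custom_id -> output text for successes."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)

    outputs = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            outputs[entry.custom_id] = entry.result.message.content[0].text
        # Entries with any other result type (for example errored or expired) are skipped here.
    return outputs
```

Called as wait_for_batch(client, batch.id), it blocks until the batch finishes; since batches can take up to 24 hours, a production system would more likely run the check from a scheduled job than from a blocking loop.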
Summary
Token economics are foundational to building cost-effective systems with the Claude API:
- Token measurement: ~4 characters per English token; roughly 1–2 tokens per Chinese character; use count_tokens for precision, tiktoken for rough offline estimates
- Billing structure: output tokens cost 5× as much as input tokens; Extended Thinking tokens bill at output prices
- Prompt caching: 90% discount on repeated input content; critical for high-frequency workloads with large system prompts
- Context window reality: 200K is the technical limit; middle-of-document retrieval accuracy degrades meaningfully above 50K tokens
- Long-text strategies:
  - Chunking: general-purpose; good for summarization
  - RAG: best for precise retrieval queries
  - Sliding window: good for sequential linear processing
  - Map-Reduce: best for high-quality summaries of very long documents
- Cost optimization: compress system prompts, compress conversation history, right-size max_tokens, use Batch API for async workloads
The next chapter covers response format control—how to reliably get structured JSON, use XML tags, and build robust parsers that handle the cases where the model doesn't follow the format exactly.