Seven-Language SDK Complete Guide: Python / TypeScript / Java / Go / C# / Ruby / PHP

Chapter 5: Understanding Tokens: Billing, Context Window Size, and Long-Text Strategies

5.1 What Tokens Are: From Characters to Semantic Units

Tokens are the atomic unit of computation for large language models—not characters, not words, but something in between. Understanding them is essential for cost control, prompt design, and long-document handling.

Token Fundamentals

Claude uses a tokenizer similar to Byte Pair Encoding (BPE). The practical intuitions:

English:
  "the"          → 1 token
  "running"      → 1 token
  "unbelievable" → 3 tokens (un + believ + able)
  " Hello"       → 1 token (the leading space is part of the token)

Chinese:
  "你好"          → 2 tokens (roughly 1–2 tokens per character)
  "人工智能"       → ~4–6 tokens
  "量子纠缠"       → ~3–4 tokens

Code:
  "def"          → 1 token
  "class MyClass:" → ~5 tokens
  "{"            → 1 token

Rules of thumb:
  English: ~4 characters per token
  Chinese: roughly 1–2 tokens per character

Measuring Token Counts via the API

The API provides a count_tokens endpoint for exact measurement—no estimation needed:

import anthropic

client = anthropic.Anthropic()

def count_tokens(text: str, model: str = "claude-sonnet-4-6") -> int:
    response = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}]
    )
    return response.input_tokens

# Measure a complete prompt including system prompt
def count_full_prompt_tokens(system: str, messages: list[dict]) -> int:
    response = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        system=system,
        messages=messages
    )
    return response.input_tokens

# Example
texts = [
    "Hello, world!",
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "The quick brown fox jumps over the lazy dog",
]
for t in texts:
    n = count_tokens(t)
    print(f"{n:4d} tokens | {len(t):4d} chars | ratio {len(t)/n:.1f} | {t[:50]}")
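
A measured count is most useful when it is checked against the context window before the request is sent. The sketch below is one way to do that. The helper names and the 200,000-token default are assumptions for illustration, not part of the SDK; the input count would come from a function such as count_full_prompt_tokens above.

```python
# Hypothetical helpers (not part of the SDK): compare a measured prompt size
# against an assumed 200,000-token context window.
CONTEXT_WINDOW = 200_000  # assumption; check the limit for the model you use


def fits_in_context(input_tokens: int, max_tokens: int,
                    context_window: int = CONTEXT_WINDOW) -> bool:
    """Return True if the prompt plus the reserved output budget fits."""
    return input_tokens + max_tokens <= context_window


def remaining_budget(input_tokens: int, max_tokens: int,
                     context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left after the prompt and reserved output (negative if over)."""
    return context_window - input_tokens - max_tokens
```

Reserving max_tokens in the check matters because the output has to fit in the same window as the input.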

Offline Estimation with tiktoken

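For quick estimates without an API call, OpenAI's tiktoken library with the cl100k_base encoding gives a rough approximation. It is a different tokenizer from Claude's, so treat the result as a ballpark figure (the ~10% error often quoted is not guaranteed) and use count_tokens when accuracy matters: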

# pip install tiktoken
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def estimate_tokens(text: str) -> int:
    return len(_enc.encode(text))
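
When neither the API nor tiktoken is available, the character-based intuitions from section 5.1 can be turned into a dependency-free fallback. This is a minimal sketch: the function name is hypothetical, and the ratios (about 4 characters per token for non-CJK text, about 1.5 tokens per Chinese character) are assumed midpoints, so the result is only a coarse estimate.

```python
import math


def estimate_tokens_heuristic(text: str) -> int:
    """Estimate tokens from character counts alone (assumed ratios, not exact)."""
    # Characters in the CJK Unified Ideographs block: assume ~1.5 tokens each
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # Everything else: assume ~4 characters per token
    other = len(text) - cjk
    return math.ceil(cjk * 1.5 + other / 4)
```

Rounding up keeps the estimate on the conservative side, which is usually the safer direction for budget checks.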

5.2 Billing Model In Depth

Input vs Output Tokens

Every API request is billed on two dimensions, with output priced at 5× the input rate:

claude-sonnet-4-6:
  Input:  $3.00 / million tokens
  Output: $15.00 / million tokens

What counts as input tokens:
  - System prompt
  - All prior user and assistant messages
  - Current user message
  - Tool definitions (when using tool use)
  - Images (converted to tokens by resolution)

What counts as output tokens:
  - All text the model generates
  - Tool call arguments
  - Extended Thinking content (thinking blocks)
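
The two rates above combine into a per-request cost with simple arithmetic. The sketch below assumes the Sonnet prices listed in this section (verify them before relying on the numbers); the function name is hypothetical.

```python
# Prices per million tokens, copied from the figures above.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the rates above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# A 2,000-token prompt with a 500-token reply:
print(f"${request_cost(2_000, 500):.4f}")  # → $0.0135
```

In this example the 500 output tokens cost more than the 2,000 input tokens, which is why limiting output length is often the first place to look for savings.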

Image Token Costs

Images are converted to tokens based on their resolution:

import math

def estimate_image_tokens(width: int, height: int) -> int:
    """
    Roughly estimate input tokens for an image.
    Anthropic's documentation gives the approximation
    tokens ≈ (width px × height px) / 750, applied after any downscaling.
    """
    MAX_SIZE = 1568  # images with a longer edge than this are scaled down
    if width > MAX_SIZE or height > MAX_SIZE:
        scale = MAX_SIZE / max(width, height)
        width = round(width * scale)
        height = round(height * scale)

    return math.ceil(width * height / 750)

print(estimate_image_tokens(800, 600))    # → 640
# Upper-bound estimate: very large images may be scaled down further
print(estimate_image_tokens(1920, 1080))  # → 1844

Prompt Caching: 90% Discount on Repeated Content

Anthropic's Prompt Caching feature lets you mark content blocks as cacheable. On subsequent requests within a 5-minute window, those cached tokens are billed at only 10% of the normal input price:

import anthropic

client = anthropic.Anthropic()

LONG_DOCUMENTATION = "..."  # 2,000+ tokens of static reference material

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            # Separate the instruction from the reference text
            "text": "You are a code assistant.\n\n" + LONG_DOCUMENTATION,
            "cache_control": {"type": "ephemeral"},  # mark for caching
        }
    ],
    messages=[{"role": "user", "content": "Explain async/await."}]
)

# First request: writes the cache (billed at 125% of the input price)
# Subsequent requests within 5 min: cached tokens billed at 10%
print(response.usage)
# Usage(
#   input_tokens=45,
#   output_tokens=312,
#   cache_creation_input_tokens=2058,  # written to cache (125% of input price)
#   cache_read_input_tokens=0
# )

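Best use cases for caching: high-frequency workloads that resend the same large system prompt or static reference material on every request, with requests arriving close enough together to stay within the cache window.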

Cost calculation for cached content:

Cache write: $3.75 / M tokens (Sonnet) = 125% of normal input price
Cache read:  $0.30 / M tokens (Sonnet) = 10% of normal input price

Break-even: the cache pays for itself on the first read. One write plus one read costs $4.05 / M tokens, versus $6.00 / M for two uncached requests.
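
The effect of the write premium and the read discount can be checked with a few lines of arithmetic. This sketch uses the three Sonnet prices listed above and assumes every request arrives within the cache window; the function name is hypothetical.

```python
INPUT_PRICE = 3.00        # $/M tokens, normal input (from the figures above)
CACHE_WRITE_PRICE = 3.75  # $/M tokens, the first request writes the cache
CACHE_READ_PRICE = 0.30   # $/M tokens, later requests read it


def prefix_cost(prefix_tokens: int, n_requests: int, cached: bool) -> float:
    """Dollar cost of sending the same prefix n_requests times."""
    if n_requests <= 0:
        return 0.0
    if cached:
        # One write, then (n - 1) reads, assuming no cache expiry in between
        per_million = CACHE_WRITE_PRICE + (n_requests - 1) * CACHE_READ_PRICE
    else:
        per_million = n_requests * INPUT_PRICE
    return prefix_tokens * per_million / 1_000_000


for n in (1, 2, 10):
    plain = prefix_cost(2_000, n, cached=False)
    cached = prefix_cost(2_000, n, cached=True)
    print(f"{n:2d} requests: uncached ${plain:.4f}, cached ${cached:.4f}")
```

A single request is slightly more expensive with caching, so marking content as cacheable only helps when at least one follow-up request reuses it before the cache expires.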

5.3 The Context Window: Capability and Limits

What 200K Tokens Can Hold

Claude's 200K token context window is large by any measure:

200,000 tokens can hold approximately:
  - 150,000 English words (a full-length novel)
  - 100,000 Chinese characters
  - 10,000 lines of code (depending on density)
  - 400–500 pages of PDF text
  - 150 standard-resolution images

But "can hold" is not the same as "can reliably process." The context window is a technical ceiling, not a quality guarantee.

The "Lost in the Middle" Effect

Empirical research demonstrates that retrieval accuracy for information placed in the middle of a long context is lower than for information at the beginning or end:

Approximate retrieval accuracy for middle-section content, by context length:

Context length   Middle-section accuracy
   5K tokens     ~95%
  20K tokens     ~90%
  50K tokens     ~85%
 100K tokens     ~80%
 200K tokens     ~70%

A drop from ~95% to ~70% (about 25 percentage points) at 200K is non-trivial for tasks requiring precise information retrieval (e.g., "What does clause 23 of the contract say?").

Practical Window Sizing

Task type                        Recommended effective window
Single-document summarization    Full 200K is fine
Multi-document Q&A               < 100K
Precise information retrieval    < 50K or use RAG
Code analysis                    < 50K or process in chunks
Multi-turn conversation history  Keep < 20K; compress regularly
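
These recommendations can be encoded as a lookup so that a pipeline checks the context size before choosing between a direct call and chunking or RAG. This is only a sketch: the task keys and the function name are hypothetical, and the ceilings are the guideline values from this section, not hard limits.

```python
# Guideline ceilings from this section; None means the full window is acceptable.
RECOMMENDED_MAX_TOKENS = {
    "single_doc_summary": None,
    "multi_doc_qa": 100_000,
    "precise_retrieval": 50_000,
    "code_analysis": 50_000,
    "conversation_history": 20_000,
}


def within_recommended_window(task: str, context_tokens: int) -> bool:
    """Return True if context_tokens is under the guideline for this task."""
    limit = RECOMMENDED_MAX_TOKENS[task]
    return limit is None or context_tokens < limit
```

A False result is a signal to switch to one of the long-text strategies in section 5.4 rather than an error.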

5.4 Long-Text Processing Strategies

Strategy 1: Chunking

Split long documents into overlapping chunks, process each, then combine:

import anthropic

client = anthropic.Anthropic()

def chunk_text(text: str, chunk_chars: int = 50_000, overlap_chars: int = 500) -> list[str]:
    """
    Split text into overlapping chunks, breaking at sentence boundaries.
    chunk_chars: target chunk size in characters (~12,500 tokens for English)
    overlap_chars: overlap between adjacent chunks for continuity
    """
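    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        if end < len(text):
            # Prefer a sentence boundary in the second half of the window,
            # so a chunk never shrinks to a tiny fragment
            for sep in ['. ', '.\n', '\n\n', '\n']:
                boundary = text.rfind(sep, start + chunk_chars // 2, end)
                if boundary != -1:
                    end = boundary + len(sep)
                    break

        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk emitted; without this the loop never ends
        # Step back by the overlap, but always move forward
        start = max(end - overlap_chars, start + 1)

    return chunks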


def summarize_long_document(document: str, question: str | None = None) -> str:
    """
    Two-pass summarization:
    Pass 1: Haiku summarizes each chunk cheaply
    Pass 2: Sonnet synthesizes the chunk summaries into a final answer
    """
    chunks = chunk_text(document)

    # Pass 1 — map
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        focus = f" Focus especially on content relevant to: {question}" if question else ""
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            messages=[{"role": "user",
                       "content": f"Summarize the key points of this excerpt "
                                  f"(part {i+1}/{len(chunks)}).{focus}\n\n{chunk}"}]
        )
        chunk_summaries.append(resp.content[0].text)

    # Pass 2 — reduce
    combined = "\n\n---\n\n".join(
        f"[Part {i+1}]\n{s}" for i, s in enumerate(chunk_summaries)
    )
    final_prompt = "Synthesize the following section summaries into a cohesive overall summary."
    if question:
        final_prompt += f" Also answer this specific question: {question}"
    final_prompt += f"\n\n{combined}"

    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{"role": "user", "content": final_prompt}]
    )
    return resp.content[0].text

Strategy 2: RAG (Retrieval-Augmented Generation)

For precision-retrieval tasks, RAG is more reliable than cramming everything into the context window:

from dataclasses import dataclass, field
import anthropic

@dataclass
class Chunk:
    text: str
    source: str = ""
    metadata: dict = field(default_factory=dict)

class SimpleRAG:
    """
    Minimal RAG implementation. For production, replace _score() with
    embedding-based similarity using pgvector, Pinecone, or similar.
    """

    def __init__(self):
        self.client = anthropic.Anthropic()
        self.chunks: list[Chunk] = []

    def ingest(self, text: str, chunk_size: int = 800, source: str = ""):
        step = chunk_size - 100  # 100-char overlap
        for i in range(0, len(text), step):
            self.chunks.append(Chunk(text=text[i:i+chunk_size], source=source))

    def _score(self, query: str, chunk: Chunk) -> float:
        q_words = set(query.lower().split())
        c_words = set(chunk.text.lower().split())
        return len(q_words & c_words) / max(len(q_words), 1)

    def query(self, question: str, top_k: int = 4) -> str:
        ranked = sorted(self.chunks, key=lambda c: self._score(question, c), reverse=True)
        context_blocks = "\n\n---\n\n".join(
            f"[Source: {c.source or 'document'}]\n{c.text}"
            for c in ranked[:top_k]
        )
        prompt = f"""Answer the question using only the provided references.
If the references don't contain enough information, say so explicitly.

<references>
{context_blocks}
</references>

Question: {question}"""

        resp = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return resp.content[0].text
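
Because retrieval quality determines what the model sees, it can help to inspect the ranking step on its own before spending tokens on generation. The sketch below reproduces the keyword-overlap idea from _score as standalone functions; the function names and the sample contract snippets are hypothetical.

```python
def overlap_score(query: str, text: str) -> float:
    """Fraction of query words that also appear in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / max(len(q_words), 1)


def top_chunks(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks by keyword overlap, best first."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_k]


# Hypothetical contract snippets, purely for illustration
chunks = [
    "Clause 22 covers payment terms and invoicing.",
    "Clause 23 covers termination and notice periods.",
    "Appendix A lists the contact details of both parties.",
]
best = top_chunks("what does clause 23 say about termination", chunks, top_k=1)
print(best[0])  # → Clause 23 covers termination and notice periods.
```

Whitespace splitting leaves punctuation attached to words ("periods." does not match "periods"), which is one reason the class docstring suggests embedding-based similarity for production use.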

Strategy 3: Sliding Window for Sequential Processing

For tasks that need to process a document linearly (e.g., annotating, extracting a running log):

def sliding_window_process(
    text: str,
    task_instruction: str,
    window_tokens: int = 30_000,
    step_tokens: int = 20_000,
) -> list[str]:
    """
    Process text in overlapping windows.
    window_tokens and step_tokens are approximate (using char/4 heuristic).
    """
    window_chars = window_tokens * 4
    step_chars = step_tokens * 4
    results = []
    pos = 0

    while pos < len(text):
        window = text[pos: pos + window_chars]
        is_first = pos == 0
        continuation_note = "" if is_first else " (continuing from a previous section)"

        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=800,
            messages=[{"role": "user",
                       "content": f"{task_instruction}{continuation_note}\n\n{window}"}]
        )
        results.append(resp.content[0].text)

        if pos + window_chars >= len(text):
            break
        pos += step_chars

    return results
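
Before running the loop against the API, it can be useful to see how many windows a document will produce and where they overlap. The helper below mirrors the offset logic of sliding_window_process without making any requests; its name is hypothetical.

```python
def window_spans(total_chars: int, window_chars: int,
                 step_chars: int) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for each window."""
    if window_chars <= 0 or step_chars <= 0:
        raise ValueError("window_chars and step_chars must be positive")
    spans = []
    pos = 0
    while pos < total_chars:
        end = min(pos + window_chars, total_chars)
        spans.append((pos, end))
        if end >= total_chars:
            break  # the final window reached the end of the text
        pos += step_chars
    return spans


print(window_spans(100, 40, 30))  # → [(0, 40), (30, 70), (60, 100)]
```

With the defaults above (30,000-token windows, 20,000-token steps, 4 characters per token), a 300,000-character document yields four windows, so four API calls.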

Strategy 4: Hierarchical Summarization (Map-Reduce)

def hierarchical_summarize(document: str, target_words: int = 500) -> str:
    """
    Map phase:  Haiku creates short summaries of each chunk (cheap)
    Reduce phase: Sonnet synthesizes into a final summary (quality)
    """
    CHUNK_CHARS = 20_000   # ~5,000 tokens

    chunks = [document[i:i+CHUNK_CHARS] for i in range(0, len(document), CHUNK_CHARS)]

    # Map
    chunk_summaries = []
    for chunk in chunks:
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{"role": "user",
                       "content": f"Summarize the key points in 3–4 sentences:\n\n{chunk}"}]
        )
        chunk_summaries.append(resp.content[0].text)

    # Reduce
    combined = "\n\n".join(f"• {s}" for s in chunk_summaries)
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=target_words * 2,  # rough token estimate from target word count
        messages=[{"role": "user",
                   "content": f"Synthesize these section summaries into a single "
                               f"coherent ~{target_words}-word summary:\n\n{combined}"}]
    )
    return resp.content[0].text
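
The function above reduces in a single step, which assumes all chunk summaries fit comfortably in one prompt. For very long documents, one option is to group the summaries into batches, reduce each batch, and repeat until a single summary remains. The grouping step might look like the sketch below; the helper name and the character budget are assumptions, and this is not part of the function above.

```python
def batch_summaries(summaries: list[str], max_chars: int) -> list[list[str]]:
    """Group consecutive summaries so each group stays within max_chars."""
    batches: list[list[str]] = []
    current: list[str] = []
    size = 0
    for s in summaries:
        # Start a new batch when adding s would exceed the budget
        if current and size + len(s) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(s)  # an oversized summary still gets its own batch
        size += len(s)
    if current:
        batches.append(current)
    return batches
```

Each batch would then go through the same reduce prompt, and the outputs would be batched again if they are still too long. Keeping the summaries in document order preserves the narrative flow across levels.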

5.5 Token Cost Optimization Techniques

Tip 1: Compress System Prompts

❌ Verbose (~45 tokens):
"You are a very helpful, friendly, and professional AI assistant whose job is
to assist users in solving various problems. You always provide accurate,
detailed, and useful information in a polite and respectful manner."

✅ Compressed (~12 tokens):
"You are a professional technical assistant. Provide accurate, concise answers."

Rule: every word in the system prompt is paid for on every request. Remove any phrase that doesn't change behavior.
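
The savings are small per request but scale linearly with traffic. The sketch below makes the arithmetic explicit, assuming the Sonnet input price from section 5.2; the function name and the traffic figures are hypothetical.

```python
INPUT_PRICE_PER_M = 3.00  # $/M input tokens, from section 5.2


def monthly_prompt_savings(tokens_saved: int, requests_per_day: int,
                           days: int = 30) -> float:
    """Dollars saved per period by trimming tokens_saved from every request."""
    total_tokens = tokens_saved * requests_per_day * days
    return total_tokens * INPUT_PRICE_PER_M / 1_000_000


# Trimming 50 tokens from a prompt used 100,000 times a day:
print(monthly_prompt_savings(50, 100_000))  # → 450.0
```

If the system prompt is served from the prompt cache, the saving per token is smaller, since cached reads are already discounted.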

Tip 2: Compress Conversation History

Long conversations accumulate history that consumes tokens on every turn:

def compress_history(
    messages: list[dict],
    max_history_tokens: int = 8_000,
) -> list[dict]:
    """
    When history exceeds the token limit, summarize the older turns.
    Preserves the most recent 6 messages (3 turns) verbatim.
    """
    # Rough token estimate: total chars / 4
    total_chars = sum(len(m.get("content", "") or "") for m in messages)
    if total_chars // 4 <= max_history_tokens:
        return messages

    recent = messages[-6:]
    old = messages[:-6]
    if not old:
        return recent

    history_text = "\n".join(
        f"{'User' if m['role'] == 'user' else 'Assistant'}: "
        f"{str(m.get('content', ''))[:200]}..."
        for m in old
    )

    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": f"Summarize this conversation in 3 sentences:\n\n{history_text}"}]
    )
    summary = resp.content[0].text

    return [
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]\n\nContinuing our conversation."},
        {"role": "assistant", "content": "Understood. I have the context from our earlier discussion."},
        *recent,
    ]

Tip 3: Right-Size max_tokens

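max_tokens is a hard cap on output length, not a target. You are billed only for the tokens actually generated, so an unused allowance costs nothing, and the value is a limit on generation rather than an instruction to the model. Right-sizing still matters: a tight cap bounds the worst-case cost and latency of a runaway response, while a cap that is too low truncates the output (the response then reports stop_reason "max_tokens").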

MAX_TOKENS_BY_TASK = {
    "classification":    20,
    "json_extraction":   500,
    "short_summary":     200,
    "long_summary":      800,
    "code_review":       600,
    "code_generation":   2_000,
    "explanation":       400,
    "simple_qa":         150,
    "detailed_qa":       700,
}

def get_max_tokens(task: str, fallback: int = 1_024) -> int:
    return MAX_TOKENS_BY_TASK.get(task, fallback)
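
Tight caps carry the risk of cutting a response off mid-sentence, so it is worth checking for truncation after each call. The Messages API reports this through the stop_reason field; the helper name below is hypothetical, and the stand-in object is only there so the sketch runs without a network call.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True if generation stopped because it hit the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stand-in for illustration; a real API response exposes the same field.
fake_response = SimpleNamespace(stop_reason="max_tokens")
print(was_truncated(fake_response))  # → True
```

A truncated result is a signal to raise the cap for that task type or to ask for a shorter answer in the prompt.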

Tip 4: Use Batch API for Non-Real-Time Work

The Batch API provides a 50% discount for asynchronous workloads with up to 24-hour processing windows:

import anthropic

client = anthropic.Anthropic()

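# Hypothetical input list; replace with your own texts
items_to_process = ["First document text", "Second document text"]

# Submit a batch of requests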
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 500,
                "messages": [{"role": "user", "content": item_text}]
            }
        }
        for i, item_text in enumerate(items_to_process)
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Use client.messages.batches.retrieve('{batch.id}') to check status")
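
Submitting the batch is only half of the workflow; the results have to be collected once processing ends. The sketch below assumes the Python SDK's batches.retrieve and batches.results methods and the processing_status and result.type fields as described in the Batch API documentation; check them against your SDK version. The helper names and the polling interval are hypothetical, and the client is passed in as a parameter.

```python
import time


def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0):
    """Poll until the batch reports processing_status == "ended"."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        time.sleep(poll_seconds)


def collect_text(client, batch_id: str) -> dict[str, str]:
    """Map custom_id to the first text block of every succeeded request."""
    outputs: dict[str, str] = {}
    for entry in client.messages.batches.results(batch_id):
        # Skip requests that errored, expired, or were canceled
        if entry.result.type == "succeeded":
            outputs[entry.custom_id] = entry.result.message.content[0].text
    return outputs
```

Keying the output by custom_id matters because results are not guaranteed to come back in submission order.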

Summary

Token economics are foundational to building cost-effective systems with the Claude API:

  1. Token measurement: ~4 chars per English token; ~1–2 tokens per Chinese character; use count_tokens for precision, tiktoken for rough offline estimates
  2. Billing structure: Output tokens are 5× the price of input tokens; Extended Thinking tokens bill at output prices
  3. Prompt caching: 90% discount on repeated input content; critical for high-frequency workloads with large system prompts
  4. Context window reality: 200K is the technical limit; middle-of-document retrieval accuracy degrades meaningfully above 50K tokens
  5. Long-text strategies:
    • Chunking: general-purpose; good for summarization
    • RAG: best for precise retrieval queries
    • Sliding window: good for sequential linear processing
    • Map-Reduce: best for high-quality summaries of very long documents
  6. Cost optimization: compress system prompts, compress conversation history, right-size max_tokens, use Batch API for async workloads

The next chapter covers response format control—how to reliably get structured JSON, use XML tags, and build robust parsers that handle the cases where the model doesn't follow the format exactly.
