Chapter 5: Understanding Tokens: Billing, Context Window Size, and Long-Text Strategies

5.1 What Tokens Are: From Characters to Semantic Units

Tokens are the atomic unit of computation for large language models: not characters, not words, but something in between. Understanding them is essential for cost control, prompt design, and long-document handling.

Token Fundamentals

Claude uses a tokenizer similar to Byte Pair Encoding (BPE). The practical intuitions:

English:
  "the"          → 1 token
  "running"      → 1 token
  "unbelievable" → 3 tokens (un + believ + able)
  " Hello"       → 1 token (the leading space is part of the token)

Chinese:
  "你好"          → 2 tokens (roughly 1-2 tokens per character)
  "人工智能"       → ~4-6 tokens
  "量子纠缠"       → ~3-4 tokens

Code:
  "def"          → 1 token
  "class MyClass:" → ~5 tokens
  "{"            → 1 token

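Rules of thumb:
  - English: ~4 characters per token
  - Chinese: roughly 1-2 tokens per character
  - Code: keywords and single punctuation marks are usually 1 token each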
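
The ratios above can be turned into a quick offline estimator. The sketch below is a heuristic only, assuming ~4 characters per token for non-Chinese text and 1.5 tokens per Chinese character (the midpoint of the 1-2 range); the function name rough_token_estimate is hypothetical, and the result is not a substitute for an exact count.

```python
import math

def rough_token_estimate(text: str) -> int:
    """Heuristic token estimate; rounds up so budgets err on the safe side."""
    # Count characters in the main CJK Unified Ideographs range
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    # Assumed ratios: 4 chars per token for other text, 1.5 tokens per CJK char
    return math.ceil(other / 4 + cjk * 1.5)

print(rough_token_estimate("The quick brown fox jumps over the lazy dog"))  # 11
print(rough_token_estimate("人工智能"))  # 6
```

Rounding up is a deliberate choice: when the estimate feeds a budget check, overestimating slightly is cheaper than overflowing a limit.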

Measuring Token Counts via the API

The API provides a count_tokens endpoint for exact measurement, so no estimation is needed:

import anthropic

client = anthropic.Anthropic()

def count_tokens(text: str, model: str = "claude-sonnet-4-6") -> int:
    response = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}]
    )
    return response.input_tokens

# Measure a complete prompt including system prompt
def count_full_prompt_tokens(system: str, messages: list[dict]) -> int:
    response = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        system=system,
        messages=messages
    )
    return response.input_tokens

# Example
texts = [
    "Hello, world!",
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "The quick brown fox jumps over the lazy dog",
]
for t in texts:
    n = count_tokens(t)
    print(f"{n:4d} tokens | {len(t):4d} chars | ratio {len(t)/n:.1f} | {t[:50]}")

Offline Estimation with tiktoken

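For quick estimates without an API call, OpenAI's tiktoken library with the cl100k_base encoding approximates Claude's token counts to within about 10%. It is a different tokenizer from Claude's, so treat the result as a ballpark figure and use count_tokens when the exact number matters: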

# pip install tiktoken
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def estimate_tokens(text: str) -> int:
    return len(_enc.encode(text))

5.2 Billing Model In Depth

Input vs Output Tokens

Every API request is billed on two dimensions, input and output, with output priced at 5× the input rate:

claude-sonnet-4-6:
  Input:  $3.00 / million tokens
  Output: $15.00 / million tokens

What counts as input tokens:
  - System prompt
  - All prior user and assistant messages
  - Current user message
  - Tool definitions (when using tool use)
  - Images (converted to tokens by resolution)

What counts as output tokens:
  - All text the model generates
  - Tool call arguments
  - Extended Thinking content (thinking blocks)
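
To make the two billing dimensions concrete, here is a minimal cost calculator. It is a sketch that assumes the Sonnet prices quoted above; the Pricing class and request_cost function are illustrative names, not part of the SDK.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float = 3.00    # USD per million input tokens (quoted above)
    output_per_mtok: float = 15.00  # USD per million output tokens (quoted above)

def request_cost(input_tokens: int, output_tokens: int,
                 pricing: Pricing = Pricing()) -> float:
    """Return the cost of one request in USD."""
    return (input_tokens * pricing.input_per_mtok
            + output_tokens * pricing.output_per_mtok) / 1_000_000

# 2,000 input tokens + 500 output tokens
print(f"${request_cost(2_000, 500):.4f}")  # $0.0135
```

In this example the 500 output tokens cost more than the 2,000 input tokens ($0.0075 vs $0.0060), which is why trimming output length often matters more than trimming the prompt.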

Image Token Costs

Images are converted to tokens based on their resolution:

import math

def estimate_image_tokens(width: int, height: int) -> int:
    """
    Estimate input tokens for an image.
    Anthropic's documented approximation is tokens ≈ (width × height) / 750,
    applied after the image is downscaled to fit the size limits.
    """
    MAX_SIZE = 1568     # longest edge before the image is downscaled
    MAX_TOKENS = 1600   # approximate per-image ceiling after downscaling
    if max(width, height) > MAX_SIZE:
        scale = MAX_SIZE / max(width, height)
        width = int(width * scale)
        height = int(height * scale)

    return min(math.ceil(width * height / 750), MAX_TOKENS)

print(estimate_image_tokens(800, 600))    # → 640
print(estimate_image_tokens(1920, 1080))  # → 1600 (capped)

Prompt Caching: 90% Discount on Repeated Content

Anthropic's Prompt Caching feature lets you mark content blocks as cacheable. On subsequent requests within a 5-minute window, those cached tokens are billed at only 10% of the normal input price:

import anthropic

client = anthropic.Anthropic()

LONG_DOCUMENTATION = "..."  # 2,000+ tokens of static reference material

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code assistant.\n\n" + LONG_DOCUMENTATION,
            "cache_control": {"type": "ephemeral"},  # mark for caching
        }
    ],
    messages=[{"role": "user", "content": "Explain async/await."}]
)

# First request: writes the cached block, billed at the cache-write rate (125% of input price)
# Subsequent requests within 5 min: cached tokens billed at 10%
print(response.usage)
# Usage(
#   input_tokens=45,
#   output_tokens=312,
#   cache_creation_input_tokens=2058,  # written to cache (billed at 125%)
#   cache_read_input_tokens=0
# )

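Best use cases for caching: high-frequency workloads that resend the same large system prompt or static reference material on every request.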

Cost calculation for cached content:

Cache write: $3.75 / M tokens (Sonnet) = 125% of normal input price
Cache read:  $0.30 / M tokens (Sonnet) = 10% of normal input price

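Break-even: the cache pays for itself on the first read. Two uncached requests cost 2.00× the input price; one cache write plus one cache read costs 1.25× + 0.10× = 1.35×.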
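
The write and read rates above can be checked with a few lines of arithmetic. The sketch below assumes every request lands inside the same cache window and that only the shared prefix is being compared; prompt_cost is a hypothetical helper name.

```python
def prompt_cost(prefix_tokens: int, requests: int, cached: bool,
                input_price: float = 3.00,       # USD per million tokens
                write_multiplier: float = 1.25,  # cache write = 125% of input
                read_multiplier: float = 0.10) -> float:  # cache read = 10%
    """Cost in USD of sending the same prefix `requests` times."""
    if requests <= 0:
        return 0.0
    per_token = input_price / 1_000_000
    if not cached:
        return prefix_tokens * per_token * requests
    # One cache write on the first request, then cache reads for the rest
    units = write_multiplier + read_multiplier * (requests - 1)
    return prefix_tokens * per_token * units

for n in (1, 2, 10):
    plain = prompt_cost(2_000, n, cached=False)
    cache = prompt_cost(2_000, n, cached=True)
    print(f"{n:2d} requests: uncached ${plain:.4f} | cached ${cache:.4f}")
```

With a single request the cached path costs more (the write premium is never recovered); from the second request onward it is cheaper, and at 10 requests the cached prefix costs roughly a fifth of the uncached one.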

5.3 The Context Window: Capability and Limits

What 200K Tokens Can Hold

Claude's 200K token context window is large by any measure:

200,000 tokens can hold approximately:
  - 150,000 English words (a full-length novel)
  - 100,000 Chinese characters
  - 10,000 lines of code (depending on density)
  - 400-500 pages of PDF text
  - 150 standard-resolution images

But "can hold" is not the same as "can reliably process." The context window is a technical ceiling, not a quality guarantee.

The "Lost in the Middle" Effect

Empirical research demonstrates that retrieval accuracy for information placed in the middle of a long context is lower than for information at the beginning or end:

Approximate retrieval accuracy for information in the middle of the context, by context length:

Context length   Middle-section accuracy
   5K tokens     ~95%
  20K tokens     ~90%
  50K tokens     ~85%
 100K tokens     ~80%
 200K tokens     ~70%

A drop from ~95% to ~70% (25 percentage points) at 200K is non-trivial for tasks requiring precise information retrieval (e.g., "What does clause 23 of the contract say?").

Practical Window Sizing

Task type                         Recommended effective window
Single-document summarization     Full 200K is fine
Multi-document Q&A                < 100K
Precise information retrieval     < 50K or use RAG
Code analysis                     < 50K or process in chunks
Multi-turn conversation history   Keep < 20K; compress regularly
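
The table can be encoded as a small lookup so that a pipeline decides up front whether an input should be split or routed to retrieval. This is only a sketch: the task keys and the needs_splitting helper are hypothetical names, and the thresholds are the recommendations from the table, not hard limits.

```python
# Recommended effective windows from the table above, in tokens
EFFECTIVE_WINDOW = {
    "single_doc_summary": 200_000,
    "multi_doc_qa": 100_000,
    "precise_retrieval": 50_000,
    "code_analysis": 50_000,
    "conversation_history": 20_000,
}

def needs_splitting(task: str, estimated_tokens: int) -> bool:
    """True when the input reaches the recommended window for this task type."""
    return estimated_tokens >= EFFECTIVE_WINDOW[task]

print(needs_splitting("precise_retrieval", 80_000))   # True
print(needs_splitting("single_doc_summary", 80_000))  # False
```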

5.4 Long-Text Processing Strategies

Strategy 1: Chunking

Split long documents into overlapping chunks, process each, then combine:

import anthropic

client = anthropic.Anthropic()

def chunk_text(text: str, chunk_chars: int = 50_000, overlap_chars: int = 500) -> list[str]:
    """
    Split text into overlapping chunks, breaking at sentence boundaries.
    chunk_chars: target chunk size in characters (~12,500 tokens for English)
    overlap_chars: overlap between adjacent chunks for continuity
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        if end < len(text):
            # Prefer a sentence boundary, searching only the second half of the
            # window so a chunk never shrinks below chunk_chars // 2
            for sep in ['. ', '.\n', '\n\n', '\n']:
                boundary = text.rfind(sep, start + chunk_chars // 2, end)
                if boundary != -1:
                    end = boundary + len(sep)
                    break

        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk emitted; stepping back by the overlap would loop forever
        # Step back by the overlap, but always move forward by at least one character
        start = max(end - overlap_chars, start + 1)

    return chunks


def summarize_long_document(document: str, question: str | None = None) -> str:
    """
    Two-pass summarization:
    Pass 1: Haiku summarizes each chunk cheaply
    Pass 2: Sonnet synthesizes the chunk summaries into a final answer
    """
    chunks = chunk_text(document)

    # Pass 1: map
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        focus = f" Focus especially on content relevant to: {question}" if question else ""
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            messages=[{"role": "user",
                       "content": f"Summarize the key points of this excerpt "
                                  f"(part {i+1}/{len(chunks)}).{focus}\n\n{chunk}"}]
        )
        chunk_summaries.append(resp.content[0].text)

    # Pass 2: reduce
    combined = "\n\n---\n\n".join(
        f"[Part {i+1}]\n{s}" for i, s in enumerate(chunk_summaries)
    )
    final_prompt = "Synthesize the following section summaries into a cohesive overall summary."
    if question:
        final_prompt += f" Also answer this specific question: {question}"
    final_prompt += f"\n\n{combined}"

    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{"role": "user", "content": final_prompt}]
    )
    return resp.content[0].text
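
Before running a two-pass summarization it helps to know how many API calls it will make. The helper below is a rough sketch, assuming fixed-size chunks with the same defaults as chunk_text; estimate_chunk_count is a hypothetical name, and because breaking at sentence boundaries shortens chunks slightly, the real count can be a little higher.

```python
import math

def estimate_chunk_count(total_chars: int, chunk_chars: int = 50_000,
                         overlap_chars: int = 500) -> int:
    """Approximate number of overlapping chunks for a text of total_chars."""
    if overlap_chars >= chunk_chars:
        raise ValueError("overlap_chars must be smaller than chunk_chars")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_chars:
        return 1
    # Each chunk after the first advances by (chunk_chars - overlap_chars)
    stride = chunk_chars - overlap_chars
    return math.ceil((total_chars - overlap_chars) / stride)

n = estimate_chunk_count(400_000)
print(f"{n} chunks -> {n + 1} API calls (one per chunk, plus one reduce call)")
```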

Strategy 2: RAG (Retrieval-Augmented Generation)

For precision-retrieval tasks, RAG is more reliable than cramming everything into the context window:

from dataclasses import dataclass, field
import anthropic

@dataclass
class Chunk:
    text: str
    source: str = ""
    metadata: dict = field(default_factory=dict)

class SimpleRAG:
    """
    Minimal RAG implementation. For production, replace _score() with
    embedding-based similarity using pgvector, Pinecone, or similar.
    """

    def __init__(self):
        self.client = anthropic.Anthropic()
        self.chunks: list[Chunk] = []

    def ingest(self, text: str, chunk_size: int = 800, source: str = ""):
        step = chunk_size - 100  # 100-char overlap
        for i in range(0, len(text), step):
            self.chunks.append(Chunk(text=text[i:i+chunk_size], source=source))

    def _score(self, query: str, chunk: Chunk) -> float:
        q_words = set(query.lower().split())
        c_words = set(chunk.text.lower().split())
        return len(q_words & c_words) / max(len(q_words), 1)

    def query(self, question: str, top_k: int = 4) -> str:
        ranked = sorted(self.chunks, key=lambda c: self._score(question, c), reverse=True)
        context_blocks = "\n\n---\n\n".join(
            f"[Source: {c.source or 'document'}]\n{c.text}"
            for c in ranked[:top_k]
        )
        prompt = f"""Answer the question using only the provided references.
If the references don't contain enough information, say so explicitly.

<references>
{context_blocks}
</references>

Question: {question}"""

        resp = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return resp.content[0].text

Strategy 3: Sliding Window for Sequential Processing

For tasks that need to process a document linearly (e.g., annotating, extracting a running log):

def sliding_window_process(
    text: str,
    task_instruction: str,
    window_tokens: int = 30_000,
    step_tokens: int = 20_000,
) -> list[str]:
    """
    Process text in overlapping windows.
    window_tokens and step_tokens are approximate (using char/4 heuristic).
    """
    window_chars = window_tokens * 4
    step_chars = step_tokens * 4
    results = []
    pos = 0

    while pos < len(text):
        window = text[pos: pos + window_chars]
        is_first = pos == 0
        continuation_note = "" if is_first else " (continuing from a previous section)"

        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=800,
            messages=[{"role": "user",
                       "content": f"{task_instruction}{continuation_note}\n\n{window}"}]
        )
        results.append(resp.content[0].text)

        if pos + window_chars >= len(text):
            break
        pos += step_chars

    return results
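
To see how the windows overlap without making any API calls, the offsets can be computed separately. The sketch below mirrors the loop above; window_spans is a hypothetical helper, and the example uses the default sizes converted to characters (30,000 and 20,000 tokens × 4).

```python
def window_spans(total_chars: int, window_chars: int,
                 step_chars: int) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for each sliding window."""
    if window_chars <= 0 or step_chars <= 0:
        raise ValueError("window_chars and step_chars must be positive")
    spans = []
    pos = 0
    while pos < total_chars:
        end = min(pos + window_chars, total_chars)
        spans.append((pos, end))
        if end == total_chars:
            break  # the final window reached the end of the text
        pos += step_chars
    return spans

print(window_spans(300_000, 120_000, 80_000))
# [(0, 120000), (80000, 200000), (160000, 280000), (240000, 300000)]
```

Adjacent windows share window_chars - step_chars characters (40,000 here). If the step is larger than the window, parts of the text are skipped entirely, so keep the step at or below the window size.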

Strategy 4: Hierarchical Summarization (Map-Reduce)

def hierarchical_summarize(document: str, target_words: int = 500) -> str:
    """
    Map phase:  Haiku creates short summaries of each chunk (cheap)
    Reduce phase: Sonnet synthesizes into a final summary (quality)
    """
    CHUNK_CHARS = 20_000   # ~5,000 tokens

    chunks = [document[i:i+CHUNK_CHARS] for i in range(0, len(document), CHUNK_CHARS)]

    # Map
    chunk_summaries = []
    for chunk in chunks:
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{"role": "user",
                       "content": f"Summarize the key points in 3-4 sentences:\n\n{chunk}"}]
        )
        chunk_summaries.append(resp.content[0].text)

    # Reduce
    combined = "\n\n".join(f"• {s}" for s in chunk_summaries)
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=target_words * 2,  # rough token estimate from target word count
        messages=[{"role": "user",
                   "content": f"Synthesize these section summaries into a single "
                               f"coherent ~{target_words}-word summary:\n\n{combined}"}]
    )
    return resp.content[0].text
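
For very long documents the combined chunk summaries may themselves be too long for a single reduce call. One option is to reduce in levels: group the summaries under a size budget, summarize each group, and repeat until one group remains. The sketch below covers only the grouping step, assuming a character budget; group_by_budget is a hypothetical helper.

```python
def group_by_budget(summaries: list[str], max_chars: int) -> list[list[str]]:
    """Pack consecutive summaries into groups of at most max_chars characters.

    A single summary longer than max_chars gets a group of its own.
    """
    groups: list[list[str]] = []
    current: list[str] = []
    size = 0
    for s in summaries:
        if current and size + len(s) > max_chars:
            groups.append(current)   # budget exceeded: close the current group
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        groups.append(current)
    return groups

parts = ["a" * 40, "b" * 40, "c" * 40]
print([len(g) for g in group_by_budget(parts, max_chars=100)])  # [2, 1]
```

Keeping summaries in document order within each group preserves the narrative flow that the final synthesis depends on.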

5.5 Token Cost Optimization Techniques

Tip 1: Compress System Prompts

❌ Verbose (~80 tokens):
"You are a very helpful, friendly, and professional AI assistant whose job is
to assist users in solving various problems. You always provide accurate,
detailed, and useful information in a polite and respectful manner."

✅ Compressed (~12 tokens):
"You are a professional technical assistant. Provide accurate, concise answers."

Rule: every word in the system prompt is paid for on every request. Remove any phrase that doesn't change behavior.
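
Because the system prompt is resent on every request, small savings multiply. As a worked example using the approximate figures above (80 tokens down to 12) and the Sonnet input price of $3.00 per million tokens, with a request volume chosen purely for illustration:

```python
def daily_savings_usd(tokens_before: int, tokens_after: int,
                      requests_per_day: int,
                      input_price_per_mtok: float = 3.00) -> float:
    """USD saved per day by shortening a prompt that is sent on every request."""
    saved_tokens = (tokens_before - tokens_after) * requests_per_day
    return saved_tokens * input_price_per_mtok / 1_000_000

# 68 tokens saved per request at an assumed 1,000,000 requests per day
print(daily_savings_usd(80, 12, 1_000_000))  # 204.0
```

At low volume the saving is negligible, so this optimization matters mainly for high-traffic endpoints.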

Tip 2: Compress Conversation History

Long conversations accumulate history that consumes tokens on every turn:

def compress_history(
    messages: list[dict],
    max_history_tokens: int = 8_000,
) -> list[dict]:
    """
    When history exceeds the token limit, summarize the older turns.
    Preserves the most recent 6 messages (3 turns) verbatim.
    """
    # Rough token estimate: total chars / 4
    total_chars = sum(len(m.get("content", "") or "") for m in messages)
    if total_chars // 4 <= max_history_tokens:
        return messages

    recent = messages[-6:]
    old = messages[:-6]
    if not old:
        return recent

    history_text = "\n".join(
        f"{'User' if m['role'] == 'user' else 'Assistant'}: "
        f"{str(m.get('content', ''))[:200]}..."
        for m in old
    )

    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": f"Summarize this conversation in 3 sentences:\n\n{history_text}"}]
    )
    summary = resp.content[0].text

    return [
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]\n\nContinuing our conversation."},
        {"role": "assistant", "content": "Understood. I have the context from our earlier discussion."},
        *recent,
    ]
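
One caveat with the character-based estimate: the Messages API also accepts content as a list of content blocks, in which case len() counts blocks rather than characters. A slightly more careful estimator, sketched here with hypothetical helper names, counts only the text inside text blocks and keeps the same chars / 4 heuristic:

```python
def message_chars(message: dict) -> int:
    """Characters of text in a message whose content is a string or a block list."""
    content = message.get("content") or ""
    if isinstance(content, str):
        return len(content)
    # List of content blocks: count only the text blocks
    return sum(len(block.get("text", ""))
               for block in content
               if isinstance(block, dict) and block.get("type") == "text")

def estimate_history_tokens(messages: list[dict]) -> int:
    """Rough token estimate for a conversation: total text chars / 4."""
    return sum(message_chars(m) for m in messages) // 4

history = [
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": [{"type": "text", "text": "b" * 40}]},
]
print(estimate_history_tokens(history))  # 20
```

Images and tool results are ignored here, so histories that contain them need an exact count from count_tokens instead.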

Tip 3: Right-Size max_tokens

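You are billed only for tokens actually generated, so an oversized max_tokens costs nothing by itself. It is a hard cap rather than a hint, though: the model does not see the value, and a response that reaches the cap is cut off with stop_reason set to "max_tokens". Right-sizing it bounds the worst-case cost and latency when a response runs long.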

MAX_TOKENS_BY_TASK = {
    "classification":    20,
    "json_extraction":   500,
    "short_summary":     200,
    "long_summary":      800,
    "code_review":       600,
    "code_generation":   2_000,
    "explanation":       400,
    "simple_qa":         150,
    "detailed_qa":       700,
}

def get_max_tokens(task: str, fallback: int = 1_024) -> int:
    return MAX_TOKENS_BY_TASK.get(task, fallback)
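
Tight limits make truncation more likely, so it is worth checking for it. When a response reaches the cap, the API reports stop_reason as "max_tokens". The check below is a minimal sketch; was_truncated is a hypothetical helper, and the SimpleNamespace object only stands in for a real response from client.messages.create().

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """True when generation stopped because it hit the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"

# Stand-in objects for illustration only
cut_off = SimpleNamespace(stop_reason="max_tokens")
complete = SimpleNamespace(stop_reason="end_turn")
print(was_truncated(cut_off))   # True
print(was_truncated(complete))  # False
```

A truncated classification label or JSON object is usually unusable, so a common response is to retry that request with a larger limit and log how often it happens per task type.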

Tip 4: Use Batch API for Non-Real-Time Work

The Batch API provides a 50% discount for asynchronous workloads with up to 24-hour processing windows:

import anthropic

client = anthropic.Anthropic()

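# Placeholder inputs; replace with your own texts
items_to_process = ["First document text", "Second document text"]

# Submit a batch of requests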
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 500,
                "messages": [{"role": "user", "content": item_text}]
            }
        }
        for i, item_text in enumerate(items_to_process)
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Check status with client.messages.batches.retrieve('{batch.id}')")
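
Submitting is only half of the workflow; the results have to be collected once processing ends. The sketch below polls the batch and gathers the text of the succeeded items. It takes the client as a parameter; wait_for_batch, the polling interval, and the injectable sleep function are illustrative choices, and errored or expired items are simply skipped here.

```python
import time

def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0,
                   sleep=time.sleep) -> dict[str, str]:
    """Poll until the batch ends, then map custom_id -> response text."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        sleep(poll_seconds)  # batches can take up to 24 hours; poll sparingly

    outputs: dict[str, str] = {}
    for entry in client.messages.batches.results(batch_id):
        # Each entry carries the custom_id supplied at submission time
        if entry.result.type == "succeeded":
            outputs[entry.custom_id] = entry.result.message.content[0].text
    return outputs

# Usage (requires a real client and batch ID):
# results = wait_for_batch(client, batch.id)
# print(results["item-0"])
```

Results are not guaranteed to come back in submission order, which is why they are keyed by custom_id rather than collected into a list.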

Summary

Token economics are foundational to building cost-effective systems with the Claude API:

  1. Token measurement: ~4 chars per English token; ~1-2 tokens per Chinese character; use count_tokens for precision, tiktoken for rough offline estimates
  2. Billing structure: Output tokens are 5× the price of input tokens; Extended Thinking tokens bill at output prices
  3. Prompt caching: 90% discount on repeated input content; critical for high-frequency workloads with large system prompts
  4. Context window reality: 200K is the technical limit; middle-of-document retrieval accuracy degrades meaningfully above 50K tokens
  5. Long-text strategies:
    • Chunking: general-purpose; good for summarization
    • RAG: best for precise retrieval queries
    • Sliding window: good for sequential linear processing
    • Map-Reduce: best for high-quality summaries of very long documents
  6. Cost optimization: compress system prompts, compress conversation history, right-size max_tokens, use Batch API for async workloads

The next chapter covers response format control: how to reliably get structured JSON, use XML tags, and build robust parsers that handle the cases where the model doesn't follow the format exactly.
