Chapter 13: Multimodal Input: Best Practices for Images, PDFs, and Document Understanding

13.1 Claude's Multimodal Capabilities

Starting with the Claude 3 series, Claude can process visual information alongside text. PDF support was added later as a beta feature. Understanding the technical specifications and limits of each media type is essential for building reliable multimodal applications.

Supported media types

Type	MIME type	Delivery	Size limit
JPEG	image/jpeg	base64 / URL	5 MB per image
PNG	image/png	base64 / URL	5 MB per image
GIF	image/gif	base64 / URL	5 MB (static only)
WebP	image/webp	base64 / URL	5 MB per image
PDF	application/pdf	base64	32 MB, up to 100 pages

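Up to 20 images per API call are supported. PDF support was introduced as a beta, enabled with the betas parameter on client.beta.messages.create (the examples in 13.5 use betas=["pdfs-2024-09-25"]). Later API releases may accept PDFs without the flag, so check the current documentation for your model.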
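The limits in the table can be checked locally before a request is built. The following is a minimal sketch, assuming 1 MB means 1024 × 1024 bytes and that the file extension reliably indicates the type; check_media and MEDIA_LIMITS are hypothetical names, not part of the SDK.

```python
from pathlib import Path

# Limits copied from the table above; adjust them if the API limits change.
MB = 1024 * 1024
MEDIA_LIMITS = {
    ".jpg": ("image/jpeg", 5 * MB),
    ".jpeg": ("image/jpeg", 5 * MB),
    ".png": ("image/png", 5 * MB),
    ".gif": ("image/gif", 5 * MB),
    ".webp": ("image/webp", 5 * MB),
    ".pdf": ("application/pdf", 32 * MB),
}

def check_media(filename: str, size_bytes: int) -> str:
    """Return the MIME type, or raise ValueError if unsupported or too large."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MEDIA_LIMITS:
        raise ValueError(f"Unsupported media type: {suffix or filename}")
    mime, limit = MEDIA_LIMITS[suffix]
    if size_bytes > limit:
        raise ValueError(
            f"{filename} is {size_bytes / MB:.1f} MB; limit is {limit // MB} MB"
        )
    return mime

print(check_media("scan.PNG", 2 * MB))  # image/png
```

Rejecting a file locally is cheaper than uploading several megabytes only to receive an API error. The PDF page limit is not checked here because it requires opening the file.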

Image token cost

Images consume tokens roughly in proportion to their pixel area, which makes image size a critical cost factor:

Small image (200×200px): ~54 tokens
Standard image (1000×1000px): ~1,334 tokens
Large image (1092×1092px): ~1,590 tokens

These figures follow the approximation in Anthropic's vision documentation: tokens ≈ (width × height) / 750. At the time of writing, the API also downscales any image whose long edge exceeds 1568px, or whose estimate exceeds roughly 1,600 tokens, before the model sees it, so oversized uploads add latency without adding usable detail. Pre-scaling images is one of the most effective cost optimizations for visual workloads.

13.2 Sending Images: base64 vs URL

Method 1: base64 encoding

import anthropic, base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(path: str) -> tuple[str, str]:
    """Return (base64_data, mime_type) for an image file."""
    p = Path(path)
    mime_map = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp"
    }
    mime = mime_map.get(p.suffix.lower(), "image/jpeg")
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return data, mime

data, mime = encode_image("screenshot.png")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": mime, "data": data}
            },
            {
                "type": "text",
                "text": "Describe the UI components visible in this screenshot."
            }
        ]
    }]
)
print(response.content[0].text)

Method 2: URL reference

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {
                "type": "text",
                "text": "Analyze this chart and extract the key data points and trends."
            }
        ]
    }]
)

Choosing between URL and base64:

URL: Simpler for images already publicly accessible; Anthropic's servers fetch the image
base64: Works for private or local images; adds ~33% payload size; no network dependency at call time
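The choice between the two delivery methods can be wrapped in one helper. This is a sketch under stated assumptions: image_block and base64_overhead are hypothetical helper names, and the default media type of image/png is only a placeholder that callers should override for other formats.

```python
import base64
from pathlib import Path

def image_block(location: str, media_type: str = "image/png") -> dict:
    """Build an image content block from a URL or a local file path."""
    if location.startswith(("http://", "https://")):
        # Remote image: the API fetches it, so nothing is encoded locally.
        return {"type": "image", "source": {"type": "url", "url": location}}
    raw = Path(location).read_bytes()
    data = base64.standard_b64encode(raw).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }

def base64_overhead(n_bytes: int) -> float:
    """Ratio of base64 length to raw length for n_bytes of input."""
    encoded_len = 4 * ((n_bytes + 2) // 3)  # 4 output chars per 3 input bytes
    return encoded_len / n_bytes

print(image_block("https://example.com/chart.png")["source"]["type"])  # url
print(round(base64_overhead(3_000_000), 3))  # 1.333
```

The second function shows where the ~33% figure comes from: base64 emits four characters for every three input bytes.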

13.3 Multi-Image Analysis

Comparing multiple images

def compare_designs(image_paths: list[str]) -> str:
    """Compare multiple design mockups side by side."""
    content = []

    for i, path in enumerate(image_paths):
        data, mime = encode_image(path)
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": mime, "data": data}
        })
        content.append({"type": "text", "text": f"Above is Design Option {i+1}."})

    content.append({
        "type": "text",
        "text": ("Compare these design options across three dimensions: "
                 "UX clarity, visual hierarchy, and interaction design. "
                 "Give a recommendation with reasoning.")
    })

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

def analyze_sequence(frame_paths: list[str], question: str) -> str:
    """Analyze an image sequence (e.g., video keyframes). Max 20 images."""
    if len(frame_paths) > 20:
        raise ValueError("Maximum 20 images per call")

    content = []
    for path in frame_paths:
        data, mime = encode_image(path)
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": mime, "data": data}
        })
    content.append({"type": "text", "text": question})

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
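When a video yields more keyframes than the per-call limit, one option is to sample them evenly rather than raise an error. The sketch below assumes evenly spaced frames are representative of the sequence; sample_evenly is a hypothetical helper, not part of the SDK.

```python
def sample_evenly(items: list[str], limit: int = 20) -> list[str]:
    """Pick at most `limit` items, evenly spaced, keeping the first and last."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    n = len(items)
    if n <= limit:
        return list(items)
    if limit == 1:
        return [items[0]]
    step = (n - 1) / (limit - 1)  # fractional stride across the index range
    return [items[round(i * step)] for i in range(limit)]

frames = [f"frame_{i:03d}.jpg" for i in range(100)]
picked = sample_evenly(frames)
print(len(picked), picked[0], picked[-1])  # 20 frame_000.jpg frame_099.jpg
```

The result can be passed straight to analyze_sequence. For footage where the action is concentrated in a short span, scene-change detection would likely pick better frames than uniform sampling.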

13.4 Image Optimization to Reduce Token Cost

import base64
import io

from PIL import Image

def optimize_image(
    path: str,
    max_dimension: int = 1024,
    quality: int = 85
) -> tuple[str, str]:
    """Resize and compress an image before sending to reduce token cost."""
    with Image.open(path) as img:
        if img.mode not in ("RGB", "L"):
            img = img.convert("RGB")

        w, h = img.size
        if max(w, h) > max_dimension:
            ratio = max_dimension / max(w, h)
            img = img.resize((int(w * ratio), int(h * ratio)), Image.LANCZOS)

        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality, optimize=True)
        buf.seek(0)
        data = base64.standard_b64encode(buf.read()).decode("utf-8")

    return data, "image/jpeg"

def estimate_tokens(width: int, height: int) -> int:
    """Estimate image tokens with Anthropic's documented approximation:
    tokens ≈ (width * height) / 750. Treat the result as a rough guide."""
    return round(width * height / 750)

# Examples
print(estimate_tokens(512, 512))    # ~350 tokens
print(estimate_tokens(1024, 1024))  # ~1398 tokens
print(estimate_tokens(2048, 2048))  # ~5592 by the formula; the API downscales images this large first

Rule of thumb: For most comprehension tasks (describing content, answering questions), 1024px on the longest edge is sufficient. For OCR, 1500–2048px. For chart data extraction, 800–1200px.
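The rule of thumb can be turned into a small lookup that computes target dimensions before resizing. This is a sketch only: the task labels and the choice of the upper end of each range are assumptions, and target_size is a hypothetical helper.

```python
# Longest-edge targets from the rule of thumb above (upper end of each range).
TASK_MAX_DIMENSION = {"general": 1024, "ocr": 2048, "chart": 1200}

def target_size(width: int, height: int, task: str = "general") -> tuple[int, int]:
    """Return (width, height) scaled so the longest edge fits the task's target."""
    max_dimension = TASK_MAX_DIMENSION[task]
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height  # never upscale
    ratio = max_dimension / longest
    return max(1, round(width * ratio)), max(1, round(height * ratio))

print(target_size(4000, 3000, "general"))  # (1024, 768)
print(target_size(4000, 3000, "ocr"))      # (2048, 1536)
print(target_size(800, 600, "chart"))      # (800, 600)
```

The resulting longest edge can be passed as max_dimension to optimize_image.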

13.5 PDF Document Processing

Enabling and using PDF support

import anthropic, base64

client = anthropic.Anthropic()

def read_pdf(path: str) -> str:
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

pdf_data = read_pdf("annual_report_2024.pdf")

response = client.beta.messages.create(  # the betas parameter is accepted on client.beta.messages
    model="claude-opus-4-6",
    max_tokens=4096,
    betas=["pdfs-2024-09-25"],   # Required to enable PDF support
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                }
            },
            {
                "type": "text",
                "text": "Extract the key financial metrics from this annual report: revenue, profit, and YoY growth rates."
            }
        ]
    }]
)
print(response.content[0].text)

Structured PDF extraction

def extract_pdf_sections(pdf_path: str, sections: list[str]) -> dict:
    pdf_data = read_pdf(pdf_path)
    section_list = "\n".join(f"- {s}" for s in sections)

    response = client.beta.messages.create(  # the betas parameter is accepted on client.beta.messages
        model="claude-opus-4-6",
        max_tokens=8192,
        betas=["pdfs-2024-09-25"],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_data
                        }
                    },
                    {
                        "type": "text",
                        "text": f"""Extract these sections from the document:
{section_list}

For each section provide: title, summary, and key data points.
Output as JSON."""
                    }
                ]
            },
            {"role": "assistant", "content": "{"}
        ]
    )

    import json
    return json.loads("{" + response.content[0].text)

def batch_pdf_analysis(pdf_dir: str, prompt: str) -> dict[str, str]:
    """Process all PDFs in a directory."""
    from pathlib import Path

    results = {}
    for pdf_path in Path(pdf_dir).glob("*.pdf"):
        size_mb = pdf_path.stat().st_size / (1024 * 1024)
        if size_mb > 32:
            results[pdf_path.name] = f"SKIPPED: too large ({size_mb:.1f} MB)"
            continue

        try:
            pdf_data = read_pdf(str(pdf_path))
            r = client.beta.messages.create(  # the betas parameter is accepted on client.beta.messages
                model="claude-opus-4-6",
                max_tokens=2048,
                betas=["pdfs-2024-09-25"],
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "document", "source": {"type": "base64",
                            "media_type": "application/pdf", "data": pdf_data}},
                        {"type": "text", "text": prompt}
                    ]
                }]
            )
            results[pdf_path.name] = r.content[0].text
        except Exception as e:
            results[pdf_path.name] = f"ERROR: {e}"

    return results

13.6 Table and Chart Extraction

Extracting tables from images

import json

def extract_table(image_path: str) -> list[dict]:
    """Extract tabular data from an image into a list of row dicts."""
    data, mime = encode_image(image_path)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": (
                    "Extract the table data as a JSON array. "
                    "Each row is an object with column names as keys. "
                    "Use numeric types for numeric values."
                )}
            ]
        }, {"role": "assistant", "content": "["}]
    )

    raw = "[" + response.content[0].text.rstrip()
    if not raw.endswith("]"):
        raw = raw.rstrip(",") + "]"
    return json.loads(raw)

def analyze_chart(image_path: str) -> dict:
    """Analyze a chart image and recover the underlying data."""
    data, mime = encode_image(image_path)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": (
                    "Analyze this chart. Return JSON with: "
                    "chart_type, x_axis, y_axis, data_points (list of {label, value}), "
                    "trends, and key_insight."
                )}
            ]
        }, {"role": "assistant", "content": "{"}]
    )

    return json.loads("{" + response.content[0].text)
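The prefill pattern used in these functions fails if the model appends any text after the closing brace or bracket. A more tolerant parse, sketched below with the standard library's json.JSONDecoder.raw_decode, decodes the first complete JSON value and ignores whatever follows. parse_prefilled_json is a hypothetical helper name.

```python
import json

def parse_prefilled_json(prefill: str, completion: str):
    """Join the prefill and the completion, then decode the first JSON value.

    Anything after the first complete JSON value (for example a closing
    remark from the model) is ignored. Truncated output still raises
    json.JSONDecodeError.
    """
    text = (prefill + completion).lstrip()
    value, _end = json.JSONDecoder().raw_decode(text)
    return value

reply = '"chart_type": "bar", "trends": []}\nHope this helps!'
print(parse_prefilled_json("{", reply))
# {'chart_type': 'bar', 'trends': []}
```

If the response was cut off by max_tokens, the decoder still raises an error, which is usually preferable to silently accepting partial data.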

13.7 OCR and Form Extraction

def ocr_image(image_path: str) -> str:
    """Transcribe all text from an image, preserving layout."""
    data, mime = optimize_image(image_path, max_dimension=2048)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": (
                    "Transcribe all text from this image exactly as it appears, "
                    "preserving line breaks and paragraph structure. "
                    "Output only the transcribed text, no commentary."
                )}
            ]
        }]
    )
    return response.content[0].text

def extract_form_fields(image_path: str) -> dict:
    """Extract form field names and their filled values."""
    data, mime = encode_image(image_path)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": "Extract all form fields and their values as JSON. Key = field label, value = filled content."}
            ]
        }, {"role": "assistant", "content": "{"}]
    )

    return json.loads("{" + response.content[0].text)

13.8 Production Application Examples

Contract review system

def review_contract(pdf_path: str) -> dict:
    """Intelligent contract review: extract clauses, identify risks."""
    pdf_data = read_pdf(pdf_path)

    response = client.beta.messages.create(  # the betas parameter is accepted on client.beta.messages
        model="claude-opus-4-6",
        max_tokens=8192,
        betas=["pdfs-2024-09-25"],
        system="You are an experienced contract attorney specializing in identifying risk clauses.",
        messages=[{
            "role": "user",
            "content": [
                {"type": "document", "source": {
                    "type": "base64", "media_type": "application/pdf", "data": pdf_data}},
                {"type": "text", "text": """Review this contract and return JSON with:
- parties: contract parties
- subject: what the contract covers
- term: duration
- key_clauses: list of important clauses with summaries
- risks: list of {clause, risk_level (high/medium/low), description}
- unfair_terms: any one-sided provisions
- recommendations: suggested modifications
- overall_assessment: brief evaluation"""}
            ]
        }, {"role": "assistant", "content": "{"}]
    )

    return json.loads("{" + response.content[0].text)

E-commerce image quality check

def check_product_image(image_path: str) -> dict:
    """Score product image quality for e-commerce listing approval."""
    data, mime = encode_image(image_path)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": """Score this product image (each criterion out of 10):
- sharpness
- background (white/clean = 10)
- product completeness
- lighting
- composition

Return JSON: scores (dict), total_score (out of 50), issues (list), recommendations (list), approved (true if total_score >= 40, i.e. 80% of the maximum)"""}
            ]
        }, {"role": "assistant", "content": "{"}]
    )

    return json.loads("{" + response.content[0].text)
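Because the approval decision is simple arithmetic, it may be safer to recompute it locally than to trust the model's own total. The sketch below assumes the model returns one score per criterion under the key names shown, which is an assumption about the output format. finalize_scores and the 40-of-50 cutoff (80%) are hypothetical.

```python
# Assumed key names for the five criteria; the model's actual keys may differ.
CRITERIA = (
    "sharpness", "background", "product_completeness", "lighting", "composition",
)

def finalize_scores(scores: dict, min_total: int = 40) -> dict:
    """Recompute the total and the approval flag from per-criterion scores."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"Missing scores: {', '.join(missing)}")
    # Clamp each score into the 0-10 range in case the model overshoots.
    clamped = {c: min(10, max(0, scores[c])) for c in CRITERIA}
    total = sum(clamped.values())
    return {
        "scores": clamped,
        "total_score": total,
        "max_score": 10 * len(CRITERIA),
        "approved": total >= min_total,
    }

result = finalize_scores({
    "sharpness": 9, "background": 8, "product_completeness": 9,
    "lighting": 7, "composition": 8,
})
print(result["total_score"], result["approved"])  # 41 True
```

Keeping the threshold in code also means it can be tuned without changing the prompt.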

13.9 Best Practices and Gotchas

Image handling

Resolution vs cost tradeoff:
- General understanding: 1024px max
- OCR/text extraction: 1500–2048px
- Chart data extraction: 800–1200px
Format recommendations:
- Photos: JPEG quality=85
- Screenshots, diagrams: PNG
- Scanned documents: JPEG quality=90
Model selection: Use claude-opus-4-6 for complex document understanding. For simple classification (is this a photo of food / not food), claude-haiku-4-5-20251001 is far cheaper and usually sufficient.
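The model-selection guidance above can be expressed as a small routing function. This is a sketch: the model names are the ones used in this chapter, while the task labels and the pick_model helper are hypothetical.

```python
# Task labels are illustrative; extend the set to match your own workload.
SIMPLE_TASKS = {"classification", "moderation", "tagging"}

def pick_model(task: str) -> str:
    """Route simple, high-volume visual tasks to the cheaper model."""
    if task in SIMPLE_TASKS:
        return "claude-haiku-4-5-20251001"
    return "claude-opus-4-6"

print(pick_model("classification"))   # claude-haiku-4-5-20251001
print(pick_model("contract_review"))  # claude-opus-4-6
```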

PDF handling

Page limits: Max 100 pages. For longer documents, use pdfplumber or pypdf (the maintained successor to PyPDF2) to extract the relevant pages before sending.
Scanned PDFs: Claude understands native digital PDFs better than scanned ones. If the PDF lacks a text layer, pre-processing with an OCR tool improves accuracy.
File size: 32 MB limit. Reduce file size with Ghostscript (gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook ...) for oversized scanned PDFs.

def is_scanned_pdf(path: str) -> bool:
    """Return True if the PDF likely lacks a text layer."""
    try:
        import pdfplumber
        with pdfplumber.open(path) as pdf:
            text = "".join(
                (p.extract_text() or "") for p in pdf.pages[:3]
            )
        return len(text.strip()) < 100
    except Exception:
        return True
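For documents over the page limit, one approach is to split the page range into chunks and send each chunk as a separate request. The sketch below only computes the ranges. Writing each range out as its own PDF is left to a PDF library, and page_ranges is a hypothetical helper.

```python
def page_ranges(total_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]

print(page_ranges(250))  # [(1, 100), (101, 200), (201, 250)]
```

Splitting on fixed page counts can cut a section in half, so for structured documents it may be better to split on chapter boundaries when they are known.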

Summary

Multimodal input makes Claude significantly more capable for real-world document workflows. Key points:

Images can be delivered via base64 or public URL; max 20 per call
PDF support requires betas=["pdfs-2024-09-25"]; up to 32 MB / 100 pages
Image tokens scale with resolution, so pre-scale images before sending to control costs
Combine prefill ({ or [) with visual analysis to get clean JSON output
Use claude-opus-4-6 for complex document comprehension; claude-haiku-4-5-20251001 for high-volume simple visual classification
