Chapter 13: Multimodal Input: Best Practices for Images, PDFs, and Document Understanding
13.1 Claude's Multimodal Capabilities
Starting with the Claude 3 series, Claude can process visual information alongside text. PDF support was added later as a beta feature. Understanding the technical specifications and limits of each media type is essential for building reliable multimodal applications.
Supported media types
| Type | MIME type | Delivery | Size limit |
|---|---|---|---|
| JPEG | image/jpeg | base64 / URL | 5 MB per image |
| PNG | image/png | base64 / URL | 5 MB per image |
| GIF | image/gif | base64 / URL | 5 MB (static only) |
| WebP | image/webp | base64 / URL | 5 MB per image |
| PDF | application/pdf | base64 | 32 MB, up to 100 pages |
Up to 20 images per API call are supported. PDF support requires the betas parameter.
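The limits in the table can be checked locally before any request is sent. The following is a minimal sketch, assuming the limits listed above are current; `check_media`, `MEDIA_LIMITS`, and `MAX_IMAGES_PER_CALL` are hypothetical names introduced here for illustration, not part of the Anthropic SDK.

```python
from pathlib import Path

# Limits copied from the table above; adjust them if the published limits change.
MB = 1024 * 1024
MEDIA_LIMITS = {
    ".jpg": ("image/jpeg", 5 * MB),
    ".jpeg": ("image/jpeg", 5 * MB),
    ".png": ("image/png", 5 * MB),
    ".gif": ("image/gif", 5 * MB),
    ".webp": ("image/webp", 5 * MB),
    ".pdf": ("application/pdf", 32 * MB),
}
MAX_IMAGES_PER_CALL = 20


def check_media(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, mime_type_or_reason) for a file name and its size in bytes."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MEDIA_LIMITS:
        return False, f"unsupported type: {suffix or 'no extension'}"
    mime, limit = MEDIA_LIMITS[suffix]
    if size_bytes > limit:
        return False, f"too large: {size_bytes / MB:.1f} MB > {limit // MB} MB"
    return True, mime


print(check_media("screenshot.png", 2 * MB))  # (True, 'image/png')
print(check_media("scan.tiff", 1 * MB))       # (False, 'unsupported type: .tiff')
print(check_media("report.pdf", 40 * MB))     # (False, 'too large: 40.0 MB > 32 MB')
```

Rejecting a file locally is cheaper than discovering the problem from an API error, and the returned MIME type can be passed straight into the `media_type` field of a content block.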
Image token cost
Images consume tokens in proportion to their pixel dimensions, which makes image size a critical cost factor:
- Small image (200×200px): ~54 tokens
- Standard image (1000×1000px): ~1,334 tokens
- Large image (4000×4000px): generally downscaled by the API before processing, so the extra pixels add upload size and latency without adding detail
Anthropic's vision documentation gives the approximation tokens ≈ (width px × height px) / 750, and notes that images whose long edge exceeds 1568px are scaled down before processing. Pre-scaling images is one of the most effective cost optimizations for visual workloads.
13.2 Sending Images: base64 vs URL
Method 1: base64 encoding
import base64
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

def encode_image(path: str) -> tuple[str, str]:
    """Return (base64_data, mime_type) for an image file."""
    p = Path(path)
    mime_map = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp"
    }
    # Unknown extensions fall back to JPEG
    mime = mime_map.get(p.suffix.lower(), "image/jpeg")
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return data, mime

data, mime = encode_image("screenshot.png")
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": mime, "data": data}
            },
            {
                "type": "text",
                "text": "Describe the UI components visible in this screenshot."
            }
        ]
    }]
)
print(response.content[0].text)
Method 2: URL reference
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {
                "type": "text",
                "text": "Analyze this chart and extract the key data points and trends."
            }
        ]
    }]
)
Choosing between URL and base64:
- URL: Simpler for images already publicly accessible; Anthropic's servers fetch the image
- base64: Works for private or local images; adds ~33% payload size; no network dependency at call time
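The choice between the two delivery methods can be wrapped in one helper so that call sites do not care where an image lives. This is a sketch under stated assumptions: `build_image_block` is a hypothetical helper, and it treats any string as a URL and any bytes object as raw image data. The last lines show where the ~33% base64 overhead comes from.

```python
import base64
from typing import Union


def build_image_block(source: Union[str, bytes], media_type: str = "image/png") -> dict:
    """Return an image content block: URL source for http(s) strings, base64 for bytes."""
    if isinstance(source, str):
        if not source.startswith(("http://", "https://")):
            raise ValueError("string sources must be http(s) URLs")
        return {"type": "image", "source": {"type": "url", "url": source}}
    data = base64.standard_b64encode(source).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


# base64 emits 4 output characters for every 3 input bytes, hence the ~33% overhead.
raw = bytes(3000)
encoded_len = len(base64.standard_b64encode(raw))
print(encoded_len, encoded_len / len(raw))
```

The returned dict has the same shape as the image blocks used in both methods above, so it can be placed directly in a message's `content` list.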
13.3 Multi-Image Analysis
Comparing multiple images
def compare_designs(image_paths: list[str]) -> str:
    """Compare multiple design mockups side by side."""
    content = []
    for i, path in enumerate(image_paths):
        data, mime = encode_image(path)
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": mime, "data": data}
        })
        # Label each image so the model can refer to it by number
        content.append({"type": "text", "text": f"Above is Design Option {i+1}."})
    content.append({
        "type": "text",
        "text": ("Compare these design options across three dimensions: "
                 "UX clarity, visual hierarchy, and interaction design. "
                 "Give a recommendation with reasoning.")
    })
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

def analyze_sequence(frame_paths: list[str], question: str) -> str:
    """Analyze an image sequence (e.g., video keyframes). Max 20 images."""
    if len(frame_paths) > 20:
        raise ValueError("Maximum 20 images per call")
    content = []
    for path in frame_paths:
        data, mime = encode_image(path)
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": mime, "data": data}
        })
    content.append({"type": "text", "text": question})
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
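When a sequence has more frames than the per-call limit enforced above, one option is to split it into consecutive batches and call `analyze_sequence` once per batch. The helper below is a minimal sketch; `chunk_frames` is a hypothetical name, and the default of 20 simply mirrors the limit stated in this chapter.

```python
def chunk_frames(frame_paths: list[str], max_per_call: int = 20) -> list[list[str]]:
    """Split a list of frame paths into consecutive batches of at most max_per_call."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    return [
        frame_paths[i:i + max_per_call]
        for i in range(0, len(frame_paths), max_per_call)
    ]


# Hypothetical file names for illustration.
frames = [f"frame_{i:03d}.jpg" for i in range(45)]
batches = chunk_frames(frames)
print([len(b) for b in batches])  # [20, 20, 5]
```

Each batch is analyzed independently, so questions that depend on the whole sequence may need a final pass that summarizes the per-batch answers.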
13.4 Image Optimization to Reduce Token Cost
from PIL import Image
import io

def optimize_image(
    path: str,
    max_dimension: int = 1024,
    quality: int = 85
) -> tuple[str, str]:
    """Resize and compress an image before sending to reduce token cost."""
    with Image.open(path) as img:
        # JPEG cannot store alpha or palette modes, so normalize to RGB
        if img.mode not in ("RGB", "L"):
            img = img.convert("RGB")
        w, h = img.size
        # Only downscale; never enlarge a small image
        if max(w, h) > max_dimension:
            ratio = max_dimension / max(w, h)
            img = img.resize((int(w * ratio), int(h * ratio)), Image.LANCZOS)
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality, optimize=True)
    buf.seek(0)
    data = base64.standard_b64encode(buf.read()).decode("utf-8")
    return data, "image/jpeg"
def estimate_tokens(width: int, height: int) -> int:
    """Estimate image tokens with Anthropic's documented approximation: width * height / 750."""
    return round(width * height / 750)

# Examples (images with a long edge above 1568px are generally downscaled by the
# API first, so the last figure overstates the real cost)
print(estimate_tokens(512, 512))    # ~350 tokens
print(estimate_tokens(1024, 1024))  # ~1398 tokens
print(estimate_tokens(2048, 2048))  # ~5592 tokens before downscaling
Rule of thumb: For most comprehension tasks (describing content, answering questions), 1024px on the longest edge is sufficient. For OCR, use 1500–2048px. For chart data extraction, use 800–1200px.
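The rule of thumb can be turned into a lookup that computes target dimensions before any resizing is done. This is a sketch only: the task names and `scaled_size` are hypothetical, and the targets use the upper end of each range quoted above.

```python
# Longest-edge targets taken from the rule of thumb (upper end of each range).
TARGET_LONGEST_EDGE = {
    "general": 1024,
    "ocr": 2048,
    "chart": 1200,
}


def scaled_size(width: int, height: int, task: str = "general") -> tuple[int, int]:
    """Return (width, height) scaled down so the longest edge fits the task's target."""
    target = TARGET_LONGEST_EDGE[task]
    longest = max(width, height)
    if longest <= target:
        return width, height  # never upscale
    ratio = target / longest
    return max(1, round(width * ratio)), max(1, round(height * ratio))


print(scaled_size(4000, 3000, "general"))  # (1024, 768)
print(scaled_size(4000, 3000, "ocr"))      # (2048, 1536)
print(scaled_size(800, 600, "chart"))      # (800, 600)
```

The resulting longest edge can be passed as `max_dimension` to `optimize_image`, which keeps the per-task choice in one place instead of scattering magic numbers across call sites.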
13.5 PDF Document Processing
Enabling and using PDF support
import base64

import anthropic

client = anthropic.Anthropic()

def read_pdf(path: str) -> str:
    """Return the PDF file's contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

pdf_data = read_pdf("annual_report_2024.pdf")
# In the Python SDK, the betas argument is accepted by client.beta.messages.create
response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    betas=["pdfs-2024-09-25"],  # Required to enable PDF support
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                }
            },
            {
                "type": "text",
                "text": "Extract the key financial metrics from this annual report: revenue, profit, and YoY growth rates."
            }
        ]
    }]
)
print(response.content[0].text)
Structured PDF extraction
import json

def extract_pdf_sections(pdf_path: str, sections: list[str]) -> dict:
    """Extract the named sections from a PDF and return them as parsed JSON."""
    pdf_data = read_pdf(pdf_path)
    section_list = "\n".join(f"- {s}" for s in sections)
    # In the Python SDK, the betas argument is accepted by client.beta.messages.create
    response = client.beta.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        betas=["pdfs-2024-09-25"],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_data
                        }
                    },
                    {
                        "type": "text",
                        "text": f"""Extract these sections from the document:
{section_list}
For each section provide: title, summary, and key data points.
Output as JSON."""
                    }
                ]
            },
            # Prefill the opening brace so the reply continues as a JSON object
            {"role": "assistant", "content": "{"}
        ]
    )
    # The reply omits the prefilled "{", so add it back before parsing
    return json.loads("{" + response.content[0].text)
def batch_pdf_analysis(pdf_dir: str, prompt: str) -> dict[str, str]:
    """Process all PDFs in a directory."""
    from pathlib import Path
    results = {}
    for pdf_path in Path(pdf_dir).glob("*.pdf"):
        size_mb = pdf_path.stat().st_size / (1024 * 1024)
        if size_mb > 32:
            results[pdf_path.name] = f"SKIPPED: too large ({size_mb:.1f} MB)"
            continue
        try:
            pdf_data = read_pdf(str(pdf_path))
            # In the Python SDK, the betas argument is accepted by client.beta.messages.create
            r = client.beta.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                betas=["pdfs-2024-09-25"],
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "document", "source": {"type": "base64",
                            "media_type": "application/pdf", "data": pdf_data}},
                        {"type": "text", "text": prompt}
                    ]
                }]
            )
            results[pdf_path.name] = r.content[0].text
        except Exception as e:
            # Record the failure and keep going with the remaining files
            results[pdf_path.name] = f"ERROR: {e}"
    return results
13.6 Table and Chart Extraction
Extracting tables from images
import json

def extract_table(image_path: str) -> list[dict]:
    """Extract tabular data from an image into a list of row dicts."""
    data, mime = encode_image(image_path)
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": (
                    "Extract the table data as a JSON array. "
                    "Each row is an object with column names as keys. "
                    "Use numeric types for numeric values."
                )}
            ]
        }, {"role": "assistant", "content": "["}]
    )
    # Re-attach the prefilled "[" and close the array if the reply stopped early
    raw = "[" + response.content[0].text.rstrip()
    if not raw.endswith("]"):
        raw = raw.rstrip(",") + "]"
    return json.loads(raw)

def analyze_chart(image_path: str) -> dict:
    """Analyze a chart image and recover the underlying data."""
    data, mime = encode_image(image_path)
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": (
                    "Analyze this chart. Return JSON with: "
                    "chart_type, x_axis, y_axis, data_points (list of {label, value}), "
                    "trends, and key_insight."
                )}
            ]
        }, {"role": "assistant", "content": "{"}]
    )
    return json.loads("{" + response.content[0].text)
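The functions above rebuild JSON by gluing the prefill character back onto the completion, which fails if the model appends any commentary after the closing bracket. One way to tolerate that, sketched here with the standard library only, is `json.JSONDecoder.raw_decode`, which parses the first complete JSON value and ignores whatever follows. `parse_prefilled_json` is a hypothetical helper, and the sample completions are invented for illustration.

```python
import json


def parse_prefilled_json(prefill: str, completion: str):
    """Parse prefill + completion as JSON, ignoring text after the first complete value."""
    text = (prefill + completion).lstrip()
    # raw_decode returns (value, index_where_parsing_stopped)
    value, _end = json.JSONDecoder().raw_decode(text)
    return value


# Hypothetical completions, as they might arrive after prefilling "{" or "[".
print(parse_prefilled_json("{", '"chart_type": "bar", "trends": []}\n\nLet me know if...'))
print(parse_prefilled_json("[", '{"city": "Oslo", "rank": 1}]'))
```

A truncated reply (for example, one cut off by `max_tokens`) still raises `json.JSONDecodeError`, which is usually preferable to silently returning partial data.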
13.7 OCR and Form Extraction
def ocr_image(image_path: str) -> str:
    """Transcribe all text from an image, preserving layout."""
    # OCR benefits from higher resolution than general comprehension
    data, mime = optimize_image(image_path, max_dimension=2048)
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": (
                    "Transcribe all text from this image exactly as it appears, "
                    "preserving line breaks and paragraph structure. "
                    "Output only the transcribed text, no commentary."
                )}
            ]
        }]
    )
    return response.content[0].text

def extract_form_fields(image_path: str) -> dict:
    """Extract form field names and their filled values."""
    data, mime = encode_image(image_path)
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": "Extract all form fields and their values as JSON. Key = field label, value = filled content."}
            ]
        }, {"role": "assistant", "content": "{"}]
    )
    return json.loads("{" + response.content[0].text)
13.8 Production Application Examples
Contract review system
def review_contract(pdf_path: str) -> dict:
    """Intelligent contract review: extract clauses, identify risks."""
    pdf_data = read_pdf(pdf_path)
    # In the Python SDK, the betas argument is accepted by client.beta.messages.create
    response = client.beta.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        betas=["pdfs-2024-09-25"],
        system="You are an experienced contract attorney specializing in identifying risk clauses.",
        messages=[{
            "role": "user",
            "content": [
                {"type": "document", "source": {
                    "type": "base64", "media_type": "application/pdf", "data": pdf_data}},
                {"type": "text", "text": """Review this contract and return JSON with:
- parties: contract parties
- subject: what the contract covers
- term: duration
- key_clauses: list of important clauses with summaries
- risks: list of {clause, risk_level (high/medium/low), description}
- unfair_terms: any one-sided provisions
- recommendations: suggested modifications
- overall_assessment: brief evaluation"""}
            ]
        }, {"role": "assistant", "content": "{"}]
    )
    return json.loads("{" + response.content[0].text)
E-commerce image quality check
def check_product_image(image_path: str) -> dict:
    """Score product image quality for e-commerce listing approval."""
    data, mime = encode_image(image_path)
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": mime, "data": data}},
                # Five criteria at 10 points each give a maximum of 50,
                # so the 80% approval threshold is 40 points
                {"type": "text", "text": """Score this product image (each criterion out of 10):
- sharpness
- background (white/clean = 10)
- product completeness
- lighting
- composition
Return JSON: scores (dict), total_score (out of 50), issues (list), recommendations (list), approved (total_score >= 40)"""}
            ]
        }, {"role": "assistant", "content": "{"}]
    )
    return json.loads("{" + response.content[0].text)
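Because the approval decision matters for listings, it may be safer to recompute the total and the pass/fail flag locally rather than trust the model's arithmetic. The sketch below assumes the approval threshold is meant as 80% of the maximum possible score; `evaluate_scores` is a hypothetical helper and the sample scores are invented.

```python
def evaluate_scores(scores: dict[str, float], max_per_criterion: int = 10,
                    pass_ratio: float = 0.8) -> dict:
    """Recompute the total and approval flag from per-criterion scores."""
    total = sum(scores.values())
    max_total = max_per_criterion * len(scores)
    return {
        "total_score": total,
        "max_score": max_total,
        "approved": max_total > 0 and total >= pass_ratio * max_total,
    }


# Hypothetical scores for illustration only.
sample = {"sharpness": 9, "background": 8, "completeness": 10, "lighting": 7, "composition": 8}
print(evaluate_scores(sample))
# {'total_score': 42, 'max_score': 50, 'approved': True}
```

Applied to the `scores` dict returned by `check_product_image`, this overrides any inconsistent `total_score` or `approved` value in the model's reply.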
13.9 Best Practices and Gotchas
Image handling
- Resolution vs cost tradeoff:
  - General understanding: 1024px max
  - OCR/text extraction: 1500–2048px
  - Chart data extraction: 800–1200px
- Format recommendations:
  - Photos: JPEG quality=85
  - Screenshots, diagrams: PNG
  - Scanned documents: JPEG quality=90
- Model selection: Use `claude-opus-4-6` for complex document understanding. For simple classification (is this a photo of food or not), `claude-haiku-4-5-20251001` is far cheaper and usually sufficient.
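The format recommendations above can be encoded as a small lookup that yields save options for Pillow. This is a sketch, not a definitive policy: the content-kind names and `save_options` are hypothetical, and the returned dict is meant to be splatted into `Image.save` as `img.save(buf, **opts)`.

```python
# Settings copied from the format recommendations above.
FORMAT_BY_CONTENT = {
    "photo": {"format": "JPEG", "quality": 85},
    "screenshot": {"format": "PNG"},
    "diagram": {"format": "PNG"},
    "scan": {"format": "JPEG", "quality": 90},
}
MIME_BY_FORMAT = {"JPEG": "image/jpeg", "PNG": "image/png"}


def save_options(content_kind: str) -> tuple[dict, str]:
    """Return (keyword arguments for PIL's Image.save, MIME type) for a content kind."""
    try:
        opts = dict(FORMAT_BY_CONTENT[content_kind])  # copy so callers can modify it
    except KeyError:
        raise ValueError(f"unknown content kind: {content_kind!r}") from None
    return opts, MIME_BY_FORMAT[opts["format"]]


print(save_options("photo"))       # ({'format': 'JPEG', 'quality': 85}, 'image/jpeg')
print(save_options("screenshot"))  # ({'format': 'PNG'}, 'image/png')
```

Returning the MIME type alongside the save options keeps the `media_type` field of the content block in sync with the bytes that were actually encoded.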
PDF handling
- Page limits: Max 100 pages. For longer documents, use `pdfplumber` or `PyPDF2` to extract relevant pages before sending.
- Scanned PDFs: Claude understands native digital PDFs better than scanned ones. If the PDF lacks a text layer, pre-processing with an OCR tool improves accuracy.
- File size: 32 MB limit. Reduce file size with Ghostscript (`gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook ...`) for oversized scanned PDFs.
def is_scanned_pdf(path: str) -> bool:
    """Return True if the PDF likely lacks a text layer."""
    try:
        import pdfplumber
        with pdfplumber.open(path) as pdf:
            # Sample the first three pages; scanned PDFs yield little or no text
            text = "".join(
                (p.extract_text() or "") for p in pdf.pages[:3]
            )
        return len(text.strip()) < 100
    except Exception:
        # Treat unreadable files (or a missing pdfplumber) as scanned
        return True
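For the page-limit point above, a long document has to be split into ranges that each fit under the limit before a PDF library extracts them. The helper below only computes the ranges and leaves the extraction to whichever library is in use; `page_ranges` is a hypothetical name, and the default of 100 mirrors the limit stated in this chapter.

```python
def page_ranges(total_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges, each at most max_pages long."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(page_ranges(250))  # [(1, 100), (101, 200), (201, 250)]
print(page_ranges(80))   # [(1, 80)]
```

Splitting on fixed page counts can cut a section in half, so for structured documents it may be better to split at chapter boundaries when those are known.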
Summary
Multimodal input makes Claude significantly more capable for real-world document workflows. Key points:
- Images can be delivered via base64 or public URL; max 20 per call
- PDF support requires `betas=["pdfs-2024-09-25"]`; up to 32 MB / 100 pages
- Image tokens scale with resolution, so pre-scale before sending to control costs
- Combine prefill (`{` or `[`) with visual analysis to get clean JSON output
- Use `claude-opus-4-6` for complex document comprehension; `claude-haiku-4-5-20251001` for high-volume simple visual classification