Chapter 13: Multimodal Input: Best Practices for Images, PDFs, and Document Understanding

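13.1 Overview of Claude's Multimodal Capabilities

Claude accepts several media types as input, not just text. Image understanding has been available since the Claude 3 family; PDF support was added later as a dedicated feature. Knowing the technical limits and best practices for each input type is essential for building a reliable multimodal application.

Supported media types

Type	MIME type	Transfer method	Size limit
JPEG image	image/jpeg	base64 / URL	5 MB per image
PNG image	image/png	base64 / URL	5 MB per image
GIF image	image/gif	base64 / URL	5 MB per image (static)
WebP image	image/webp	base64 / URL	5 MB per image
PDF document	application/pdf	base64	32 MB max, up to 100 pages

A single request can carry multiple images. This chapter uses 20 images per request as its working limit; Anthropic's documentation has listed a higher limit for API requests than for the claude.ai interface, so check the current vision documentation before relying on an exact number. PDF support was introduced behind the pdfs-2024-09-25 beta flag and has since become generally available, so document blocks can be sent through the standard Messages API without a beta flag.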
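
The limits in the table can be checked locally before any bytes are uploaded. The following is a minimal sketch, assuming the limits apply to the raw file size; the function name check_attachment and the constants are illustrative and not part of the Anthropic SDK.

```python
from pathlib import Path

# Limits taken from the table above (assumed to apply to the raw file size)
MAX_IMAGE_BYTES = 5 * 1024 * 1024
MAX_PDF_BYTES = 32 * 1024 * 1024

MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".pdf": "application/pdf",
}

def check_attachment(path: str) -> str:
    """Return the media type for a file, or raise ValueError if it breaks a limit."""
    p = Path(path)
    media_type = MEDIA_TYPES.get(p.suffix.lower())
    if media_type is None:
        raise ValueError(f"Unsupported file type: {p.suffix!r}")
    limit = MAX_PDF_BYTES if media_type == "application/pdf" else MAX_IMAGE_BYTES
    size = p.stat().st_size
    if size > limit:
        raise ValueError(f"{p.name} is {size} bytes; the limit is {limit}")
    return media_type
```

Running a check like this first turns an API-side rejection into a local, descriptive error.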

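Token cost of images

Images consume tokens according to their pixel dimensions, a cost factor every multimodal application has to account for. Anthropic's vision documentation gives the approximation tokens ≈ (width × height) / 750:

Small image (200×200 px): about 54 tokens
Standard image (1000×1000 px): about 1,334 tokens
Large image (4000×4000 px): downscaled before processing, so roughly 1,600 tokens rather than the 21,000+ the raw dimensions would suggest

Claude does not split images into a fixed tile grid. Instead, on most current models, an image whose long edge exceeds 1568 px (or that would cost more than about 1,600 tokens) is scaled down, preserving aspect ratio, before it is processed. Oversized images therefore add upload time and latency without adding detail, so cost-sensitive applications should resize images before sending them.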

13.2 Image Input: base64 and URL

Option 1: base64 encoding

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(image_path: str) -> tuple[str, str]:
    """Encode an image file as base64 and return (base64 string, MIME type)."""
    path = Path(image_path)

    mime_map = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp"
    }

    # Fail loudly on unsupported extensions instead of mislabeling them as JPEG
    suffix = path.suffix.lower()
    if suffix not in mime_map:
        raise ValueError(f"Unsupported image type: {suffix or '(no extension)'}")
    mime_type = mime_map[suffix]

    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")

    return data, mime_type

# Send a single image
image_data, mime_type = encode_image("screenshot.png")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": mime_type,
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the contents of this screenshot and identify the UI components in it."
                }
            ]
        }
    ]
)

print(response.content[0].text)

Option 2: URL reference

# Send an image by URL (the image must be publicly accessible)
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Analyze this chart and extract the key data and trends."
                }
            ]
        }
    ]
)

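Choosing between URL and base64:

URL: simpler, and a good fit for images that already have a public URL. Anthropic's servers fetch the image themselves, so the URL must be reachable and safe to expose.
base64: does not depend on network reachability, so it suits private or local images. The payload is larger (about 33% more bytes).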
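
The choice above can be folded into one helper that emits the right source block for either kind of input. This is a sketch under assumptions: build_image_block is a hypothetical name, and anything that does not start with http:// or https:// is treated as a local file path.

```python
import base64
from pathlib import Path

# Extensions this sketch accepts for local files
_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def build_image_block(source: str) -> dict:
    """Build an image content block from a public URL or a local file path."""
    if source.startswith(("http://", "https://")):
        # The server fetches the image, so the request body stays small
        return {"type": "image", "source": {"type": "url", "url": source}}

    path = Path(source)
    media_type = _MEDIA_TYPES.get(path.suffix.lower())
    if media_type is None:
        raise ValueError(f"Unsupported image type: {path.suffix!r}")
    # base64 turns every 3 bytes into 4 characters, hence the ~33% overhead
    data = base64.standard_b64encode(path.read_bytes()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dictionary can be placed directly in the content list of a user message.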

13.3 Multi-Image Analysis

Comparing multiple images

def compare_ui_designs(design_paths: list[str]) -> str:
    """Compare several UI design options."""
    
    content = []
    
    for i, path in enumerate(design_paths):
        data, mime_type = encode_image(path)
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": mime_type,
                "data": data
            }
        })
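        content.append({
            "type": "text",
            "text": f"The image above is design option {i+1}"
        })
    
    content.append({
        "type": "text",
        "text": "Compare the design options above. Analyze the strengths and weaknesses of each in terms of user experience, visual hierarchy, and interaction design, then give a recommendation."
    })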
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}]
    )
    
    return response.content[0].text

# Analyze an image sequence (such as video frames)
def analyze_image_sequence(frame_paths: list[str], question: str) -> str:
    """Analyze a sequence of images, such as video keyframes."""
    
    if len(frame_paths) > 20:
        raise ValueError("At most 20 images are supported")
    
    content = []
    for path in frame_paths:
        data, mime = encode_image(path)
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": mime, "data": data}
        })
    
    content.append({"type": "text", "text": question})
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": content}]
    )
    
    return response.content[0].text
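
When a sequence has more frames than one request should carry, it can be split into consecutive groups and each group analyzed separately. A minimal sketch, assuming the chapter's working limit of 20 images per request; chunk_frames is an illustrative helper name.

```python
from typing import Iterator, Sequence

def chunk_frames(
    frame_paths: Sequence[str],
    max_per_request: int = 20,
) -> Iterator[list[str]]:
    """Yield consecutive groups of at most max_per_request frame paths."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    for start in range(0, len(frame_paths), max_per_request):
        yield list(frame_paths[start:start + max_per_request])
```

Each yielded group fits within the per-request check used by the sequence analysis function above; how to merge the per-group answers depends on the question being asked.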

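13.4 Image Optimization: Reducing Token Consumption

For cost-sensitive applications, preprocessing images before sending them can substantially reduce token consumption:

import base64
import io

from PIL import Image

def optimize_image_for_claude(
    image_path: str,
    max_dimension: int = 1024,
    quality: int = 85
) -> tuple[str, str]:
    """
    Shrink an image before sending it:
    - Cap the longest side (this is what lowers the token count)
    - Re-encode as JPEG at the given quality (usually smaller than PNG;
      this lowers the payload size, not the token count)
    """
    
    with Image.open(image_path) as img:
        # Convert to RGB: JPEG cannot store RGBA or palette modes
        # (any transparency is discarded)
        if img.mode not in ("RGB", "L"):
            img = img.convert("RGB")
        
        # Cap the longest side, preserving the aspect ratio
        orig_w, orig_h = img.size
        if max(orig_w, orig_h) > max_dimension:
            ratio = max_dimension / max(orig_w, orig_h)
            new_w = int(orig_w * ratio)
            new_h = int(orig_h * ratio)
            img = img.resize((new_w, new_h), Image.LANCZOS)
        
        # Save as JPEG into an in-memory buffer
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=quality, optimize=True)
        buffer.seek(0)
        
        data = base64.standard_b64encode(buffer.read()).decode("utf-8")
    
    return data, "image/jpeg"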

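import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate image tokens with the documented approximation
    tokens ~= (width * height) / 750, after server-side downscaling."""
    # Images whose long edge exceeds 1568 px are scaled down first
    long_edge = max(width, height)
    if long_edge > 1568:
        scale = 1568 / long_edge
        width, height = int(width * scale), int(height * scale)
    # Images above roughly 1,600 tokens are scaled down further, so cap there
    return min(math.ceil(width * height / 750), 1600)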
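
The resize target can also be derived from a token budget instead of being picked by hand. The sketch below assumes the approximation tokens ≈ (width × height) / 750 from Anthropic's vision documentation; max_dimension_for_budget is a hypothetical helper, and its result is meant to be passed as the long-edge limit to a resize step.

```python
import math

PIXELS_PER_TOKEN = 750  # approximation; verify against the current vision docs

def max_dimension_for_budget(width: int, height: int, token_budget: int) -> int:
    """Return the largest long-edge length (px) whose estimate fits token_budget."""
    if width <= 0 or height <= 0 or token_budget <= 0:
        raise ValueError("width, height and token_budget must be positive")
    current = width * height / PIXELS_PER_TOKEN
    if current <= token_budget:
        return max(width, height)  # already within budget, no resize needed
    # Tokens scale with area, so the linear scale factor is the square root
    scale = math.sqrt(token_budget / current)
    return max(1, math.floor(max(width, height) * scale))
```

For a 2000×1000 px image and a 500-token budget this gives a long edge of 866 px, that is, about 866×433 px after an aspect-preserving resize.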

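13.5 PDF Document Processing

Sending a PDF

PDF support originally had to be enabled with the pdfs-2024-09-25 beta flag. It is now generally available, so a PDF is sent as a document content block through the standard Messages API. Note that client.messages.create does not accept a betas argument; features that are still in beta go through client.beta.messages.create instead.

import anthropic
import base64

client = anthropic.Anthropic()

def read_pdf_file(pdf_path: str) -> str:
    """Read a PDF file and return its base64 encoding."""
    with open(pdf_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

# Analyze a PDF document
pdf_data = read_pdf_file("annual_report_2024.pdf")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,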
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {
                    "type": "text",
                    "text": "Extract the key financial metrics from this annual report, including revenue, profit, and growth rate."
                }
            ]
        }
    ]
)

print(response.content[0].text)

Referencing and pinpointing PDF content

def extract_pdf_sections(pdf_path: str, sections: list[str]) -> dict[str, str]:
    """Extract the content of specific sections from a PDF."""
    
    pdf_data = read_pdf_file(pdf_path)
    sections_query = "\n".join(f"- {s}" for s in sections)
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_data
                        }
                    },
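                    {
                        "type": "text",
                        "text": f"""Extract the content of the following sections:
{sections_query}

For each section, provide:
1. The section title
2. A summary of the main content
3. Key data or conclusions

Output JSON only."""
                    }
                ]
            },
            # Prefill the opening brace so the reply continues a JSON object.
            # Some newer models may reject a prefilled final assistant turn; if the
            # request fails with a 400 error, drop this message and parse the JSON
            # from the full reply instead.
            {"role": "assistant", "content": "{"}
        ]
    )
    
    import json
    # The prefilled "{" is not echoed in the reply, so add it back before parsing
    return json.loads("{" + response.content[0].text)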
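
Concatenating the prefill and the reply works only when the model returns nothing but JSON. A more forgiving parser, sketched below, scans for the first decodable JSON value and ignores surrounding prose or code fences; parse_json_reply is a hypothetical helper, and it relies on json.JSONDecoder.raw_decode from the standard library.

```python
import json

def parse_json_reply(text: str, prefill: str = "") -> object:
    """Return the first JSON object or array found in prefill + text."""
    candidate = prefill + text
    decoder = json.JSONDecoder()
    for index, char in enumerate(candidate):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one value starting at index and ignores the rest
            value, _end = decoder.raw_decode(candidate, index)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from here; try the next bracket
    raise ValueError("No JSON object or array found in the reply")
```

With a prefill, call it as parse_json_reply(reply_text, prefill="{"); without one, pass the reply text alone.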

Batch PDF processing

from pathlib import Path

def batch_process_pdfs(
    pdf_dir: str,
    task_prompt: str
) -> dict[str, str]:
    """Process every PDF in a directory, one request per file."""
    
    results = {}
    pdf_files = sorted(Path(pdf_dir).glob("*.pdf"))
    
    print(f"Found {len(pdf_files)} PDF files")
    
    for pdf_path in pdf_files:
        try:
            # Check the file size; base64 adds about 33%, so files close to
            # the limit may still be rejected
            file_size_mb = pdf_path.stat().st_size / (1024 * 1024)
            if file_size_mb > 32:
                print(f"Skipping {pdf_path.name}: file too large ({file_size_mb:.1f}MB)")
                continue
            
            pdf_data = read_pdf_file(str(pdf_path))
            
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "document",
                                "source": {
                                    "type": "base64",
                                    "media_type": "application/pdf",
                                    "data": pdf_data
                                }
                            },
                            {"type": "text", "text": task_prompt}
                        ]
                    }
                ]
            )
            
            results[pdf_path.name] = response.content[0].text
            print(f"Done: {pdf_path.name}")
            
        except Exception as e:
            results[pdf_path.name] = f"Processing failed: {e}"
            print(f"Error {pdf_path.name}: {e}")
    
    return results
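
The loop above handles one file at a time. Because each request spends most of its time waiting on the network, a thread pool can overlap several of them. The following is a sketch, not a tuned implementation: process_concurrently and its worker callback are illustrative names, and max_workers should stay small enough to respect your account's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

def process_concurrently(
    names: list[str],
    worker: Callable[[str], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run worker(name) for every name in parallel and collect results by name."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure and keep going, like the sequential loop
                results[name] = f"Processing failed: {exc}"
    return results
```

Here worker would wrap the per-file request, taking a file path and returning the reply text.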

13.6 Combined Image and Text Analysis

Extracting table data from images

def extract_table_from_image(image_path: str) -> list[dict]:
    """Extract table data from an image and convert it to a structured format."""
    
    data, mime = encode_image(image_path)
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": mime, "data": data}
                    },
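                    {
                        "type": "text",
                        "text": """Extract the table data from the image.
Output a JSON array in which each row is an object, with column names as keys and cell contents as values.
Use numbers rather than strings for numeric fields."""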
                    }
                ]
            },
            {"role": "assistant", "content": "["}
        ]
    )
    
    import json
    raw = "[" + response.content[0].text.rstrip()
    if not raw.endswith("]"):
        raw += "]"
    return json.loads(raw)
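
Once the table is a list of dictionaries, it can be written out with the standard csv module. A minimal sketch, assuming rows may have different keys (the model can omit empty cells); rows_to_csv is an illustrative helper name.

```python
import csv
import io

def rows_to_csv(rows: list[dict]) -> str:
    """Serialize a list of row dictionaries to CSV text."""
    # Collect column names in first-seen order across all rows
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # cells missing from a row are written as empty
    return buffer.getvalue()
```

The result can be saved to a file or loaded into a spreadsheet for a manual check against the source image.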

Chart analysis and data recovery

def analyze_chart(image_path: str) -> dict:
    """Analyze a chart image and extract its data and trends."""
    
    data, mime = encode_image(image_path)
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": mime, "data": data}
                    },
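                    {
                        "type": "text",
                        "text": """Analyze this chart:
1. Chart type (bar chart, line chart, pie chart, etc.)
2. What the X axis and Y axis represent
3. Data points (recover the values as precisely as possible)
4. Main trends and outliers
5. Key conclusions

Output in JSON format."""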
                    }
                ]
            },
            {"role": "assistant", "content": "{"}
        ]
    )
    
    import json
    return json.loads("{" + response.content[0].text)

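13.7 OCR and Text Recognition

def ocr_image(image_path: str, language: str = "zh+en") -> str:
    """Transcribe the text in an image (OCR)."""
    
    # On most current models the API downscales images whose long edge exceeds
    # 1568 px, so a larger upload would not add detail
    data, mime = optimize_image_for_claude(image_path, max_dimension=1568)
    
    lang_hint = {
        "zh": "mostly Chinese",
        "en": "mostly English",
        "zh+en": "a mix of Chinese and English",
        "ja": "Japanese",
    }.get(language, language)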
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": mime, "data": data}
                    },
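                    {
                        "type": "text",
                        "text": f"The text in this image is {lang_hint}. Transcribe all of the text in the image in full, preserving the original layout (lines and paragraphs). Output only the text, with no explanation or description."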
                    }
                ]
            }
        ]
    )
    
    return response.content[0].text

def extract_form_data(image_path: str) -> dict:
    """Extract fields and their values from an image of a form."""
    
    data, mime = encode_image(image_path)
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": mime, "data": data}
                    },
                    {
                        "type": "text",
                        "text": "Identify every field in the form and the value filled in for it. Output JSON with field names as keys and the filled-in values as values."
                    }
                ]
            },
            {"role": "assistant", "content": "{"}
        ]
    )
    
    import json
    return json.loads("{" + response.content[0].text)

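13.8 Practical Case Studies

Case 1: Intelligent document review system

def review_contract_document(pdf_path: str) -> dict:
    """Review a contract and identify its key clauses and risk points."""
    
    pdf_data = read_pdf_file(pdf_path)
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        system="You are a professional contract review lawyer who specializes in identifying risky and one-sided clauses in contracts.",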
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_data
                        }
                    },
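                    {
                        "type": "text",
                        "text": """Review this contract and provide:
1. Basic contract information (parties, subject matter, amount, term)
2. A summary of the key clauses
3. Risk points (labeled high, medium, or low)
4. One-sided clauses, if any
5. Clauses you recommend revising
6. An overall assessment

Output structured JSON."""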
                    }
                ]
            },
            {"role": "assistant", "content": "{"}
        ]
    )
    
    import json
    return json.loads("{" + response.content[0].text)

Case 2: E-commerce product image quality check

def check_product_image_quality(image_path: str) -> dict:
    """Check the quality of a product image and give listing recommendations."""
    
    data, mime = encode_image(image_path)
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": mime, "data": data}
                    },
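                    {
                        "type": "text",
                        "text": """As an e-commerce image quality reviewer, evaluate this product image:

Scoring criteria (10 points each):
- Image sharpness
- Background treatment (extra credit for white or solid-color backgrounds)
- Completeness of the product display
- Lighting and shadows
- Angle and composition

Output JSON containing: scores (the score for each criterion), total_score (out of 50), issues (list of problems), recommendations (suggested improvements), approved (true if total_score is at least 40, i.e. 80% of the maximum)"""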
                    }
                ]
            },
            {"role": "assistant", "content": "{"}
        ]
    )
    
    import json
    return json.loads("{" + response.content[0].text)
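
Arithmetic in a model-generated report is worth recomputing rather than trusting. The sketch below re-derives total_score and approved from the individual scores; it assumes each criterion is scored from 0 to 10 and that the pass mark is 80% of the maximum. recompute_verdict is a hypothetical helper, not part of any SDK.

```python
def recompute_verdict(
    report: dict,
    max_per_item: int = 10,
    pass_ratio: float = 0.8,
) -> dict:
    """Return a copy of report with total_score and approved recomputed."""
    scores = report.get("scores") or {}
    if not scores:
        raise ValueError("report has no scores")
    for name, value in scores.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= max_per_item:
            raise ValueError(f"Invalid score for {name!r}: {value!r}")

    total = sum(scores.values())
    # Pass when the total reaches pass_ratio of the maximum possible score
    approved = total >= pass_ratio * max_per_item * len(scores)
    return {**report, "total_score": total, "approved": approved}
```

Any disagreement between the recomputed values and the ones the model returned is a useful signal to log.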

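13.9 Best Practices Summary

Image processing recommendations

Resolution trade-offs: most understanding tasks do not need images larger than 1024 px on the long edge. OCR tasks can go up to about 1500 px; on most current models the API downscales anything beyond 1568 px, so larger uploads gain nothing. 800-1200 px is usually enough for chart analysis.
Format selection:
- Photos: JPEG, quality=85
- Screenshots and charts: PNG (keeps edges sharp)
- Document scans: JPEG, quality=90
Cost control for batch processing: use claude-haiku-4-5-20251001 for simple image classification tasks and claude-opus-4-6 for tasks that need deep understanding.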
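
The format recommendations above can be captured as a small lookup table so that every upload path applies them consistently. This is only a sketch: the kind labels ("photo", "screenshot", "chart", "scan") and the names EncodingChoice and choose_encoding are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EncodingChoice:
    format: str             # format name as understood by an image library
    quality: Optional[int]  # JPEG quality; None for lossless formats

# Mirrors the format recommendations listed above
ENCODING_BY_KIND = {
    "photo": EncodingChoice("JPEG", 85),
    "screenshot": EncodingChoice("PNG", None),
    "chart": EncodingChoice("PNG", None),
    "scan": EncodingChoice("JPEG", 90),
}

def choose_encoding(kind: str) -> EncodingChoice:
    """Look up the recommended encoding for a kind of image."""
    try:
        return ENCODING_BY_KIND[kind]
    except KeyError:
        raise ValueError(f"Unknown image kind: {kind!r}") from None
```

The returned format and quality can then be passed to whatever image library performs the re-encoding.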

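PDF processing recommendations

Page count: a PDF can have at most 100 pages. For longer documents, extract the relevant pages first with pypdf (the successor to PyPDF2) or pdfplumber.
File size: the limit is 32 MB. Scanned PDFs tend to be large; consider compressing them with Ghostscript.
Scanned versus digital PDFs: Claude understands digitally native PDFs better than scanned ones; for scanned PDFs, consider running OCR as a preprocessing step.

# Check whether a PDF is a scanned copy
def is_scanned_pdf(pdf_path: str) -> bool:
    """Heuristic: treat a PDF as scanned if its first pages have almost no text layer."""
    # Imported outside the try block so that a missing install raises
    # ImportError instead of being reported as "scanned"
    import pdfplumber

    try:
        with pdfplumber.open(pdf_path) as pdf:
            text = ""
            for page in pdf.pages[:3]:  # check the first 3 pages
                text += page.extract_text() or ""
    except Exception:
        # Unreadable or corrupt file: fall back to treating it as scanned
        return True
    # Very little extracted text suggests a scanned document
    return len(text.strip()) < 100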
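
For documents over the 100-page limit, one option is to process them in consecutive page ranges. The sketch below only computes the ranges; extracting those pages into separate files is left to a PDF library. page_ranges is an illustrative helper name.

```python
def page_ranges(total_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Split 1..total_pages into 1-based inclusive (first, last) ranges."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (first, min(first + max_pages - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages)
    ]
```

A 250-page report yields three ranges: pages 1-100, 101-200, and 201-250.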

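Summary

Claude's multimodal capabilities make image and document understanding first-class features of the API:

Images can be passed as base64 or by URL; this chapter works with up to 20 per request
PDFs are sent as document content blocks (originally behind the pdfs-2024-09-25 beta flag), up to 32 MB and 100 pages
Image tokens scale with resolution, so resizing images before sending them can significantly reduce cost
Prefill combined with document analysis steers the output toward structured JSON, on models that accept assistant prefill
claude-opus-4-6 performs best on complex document understanding tasks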
