
Extract PDF Text

Name: Extract PDF Text
Author: ivangdavila

By Iván · GitHub · v1.0.2

Platforms: linux, darwin, win32 · ✓ Security check passed

Total downloads: 1433

Current installs: 9

Versions: 3

Install in OpenClaw

/install extract-pdf-text

Description

Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.

Usage Instructions (SKILL.md)

When to Use

Agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.

Quick Reference

Topic	File
Code examples	`examples.md`
OCR setup	`ocr.md`
Troubleshooting	`troubleshooting.md`

Core Rules

1. Install PyMuPDF First

pip install PyMuPDF

Import as fitz (historical name):

import fitz  # PyMuPDF

2. Basic Text Extraction

import fitz

doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()

3. Pick the Right Method

PDF Type	Method
Text-based	`page.get_text()` — fast, accurate
Scanned	OCR with pytesseract — slower
Mixed	Check each page, use OCR when needed

4. Check for Text Before OCR

def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
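
The same check extends to mixed documents, where each page needs its own decision. The sketch below is one way to plan a method per page. `plan_methods` and its `min_chars` parameter are illustrative names, not part of PyMuPDF, and the 50-character threshold is carried over from `needs_ocr` as a heuristic. The function accepts any iterable of objects with a `get_text()` method, so an open document can be passed in directly.

```python
def plan_methods(pages, min_chars=50):
    """Return (page_number, method) pairs, where method is "text" or "ocr".

    `pages` is any iterable of objects exposing get_text(), such as an
    open PyMuPDF document. The threshold is a heuristic, not a guarantee.
    """
    plan = []
    for number, page in enumerate(pages, start=1):
        extracted = page.get_text().strip()
        method = "text" if len(extracted) >= min_chars else "ocr"
        plan.append((number, method))
    return plan


# Usage (assumes PyMuPDF is installed and document.pdf exists):
#   import fitz
#   with fitz.open("document.pdf") as doc:
#       print(plan_methods(doc))
```

Pages marked "ocr" are only candidates: a nearly blank text page would be flagged too, so the threshold may need tuning per document set.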

5. Handle Errors Gracefully

try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass, then authenticate.
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")

Extraction Traps

Trap	What Happens	Fix
OCR on text PDF	Slow + worse accuracy	Check `get_text()` first
Forget to close doc	Memory leak	Use `with` or `doc.close()`
Assume page order	Wrong reading flow	Use `sort=True` in get_text()
Ignore encoding	Garbled characters	PyMuPDF handles UTF-8
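
Two of the fixes in the table can be combined in a few lines. This is a minimal sketch: `extract_in_reading_order` is a hypothetical helper name, and it relies only on `get_text(sort=True)`, which orders the output by position on the page. Position-based sorting can still interleave multi-column layouts, so the result is an approximation of reading order.

```python
def extract_in_reading_order(doc):
    """Join the text of every page, with each page sorted by position."""
    return "\n".join(page.get_text(sort=True) for page in doc)


# Usage (assumes PyMuPDF is installed and document.pdf exists).
# The `with` block closes the document even if extraction raises:
#   import fitz
#   with fitz.open("document.pdf") as doc:
#       text = extract_in_reading_order(doc)
```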

Scope

This skill provides instructions for using PyMuPDF to extract PDF text.

This skill ONLY:

Gives code examples for PyMuPDF
Explains OCR setup when needed
Troubleshoots common issues

This skill NEVER:

Accesses files without user request
Sends data externally
Modifies original PDFs

Security & Privacy

All processing is local:

PyMuPDF runs entirely on your machine
No external API calls
No data leaves your system

Output Formats

Plain Text

text = page.get_text()

Structured (dict)

blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
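
Span sizes are useful for more than printing. As a hedged illustration, the sketch below groups span text by font size so that the largest sizes (often headings) come first. `spans_by_size` is a hypothetical helper, and it assumes only the keys used in the loop above: `blocks`, `type`, `lines`, `spans`, `text` and `size`.

```python
def spans_by_size(page_dict):
    """Group span texts by rounded font size, largest size first."""
    groups = {}
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks, which have no lines
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"].strip()
                if text:  # ignore whitespace-only spans
                    groups.setdefault(round(span["size"], 1), []).append(text)
    return dict(sorted(groups.items(), reverse=True))


# Usage (assumes `page` is a PyMuPDF page):
#   sizes = spans_by_size(page.get_text("dict"))
```

Treating the largest size as a heading is a guess that depends on the document's typography, so the grouping is best used as a starting point.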

JSON

import json
data = page.get_text("json")
parsed = json.loads(data)
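
The parsed JSON follows the same layout as the dict output, so it can be inspected the same way. The sketch below counts text and image blocks in the JSON string. `count_blocks` is an illustrative name, and it assumes block `type` 0 means text and any other value means an image.

```python
import json


def count_blocks(json_text):
    """Count text and image blocks in the string from page.get_text("json")."""
    counts = {"text": 0, "image": 0}
    for block in json.loads(json_text).get("blocks", []):
        kind = "text" if block.get("type") == 0 else "image"
        counts[kind] += 1
    return counts


# Usage (assumes `page` is a PyMuPDF page):
#   print(count_blocks(page.get_text("json")))
```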

Full Example

import fitz

def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    doc = fitz.open(path)
    results = []
    
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        
        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"
        
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
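
A caller will usually want to know which pages were flagged. The sketch below reads the dictionary returned by `extract_pdf` and lists those pages. `pages_needing_ocr` is a hypothetical helper, and it assumes the result shape shown above (`content` entries with `page` and `method` keys).

```python
def pages_needing_ocr(result):
    """Return the 1-based page numbers that extract_pdf flagged "needs_ocr"."""
    return [
        entry["page"]
        for entry in result["content"]
        if entry["method"] == "needs_ocr"
    ]


# Usage (assumes extract_pdf from the example above):
#   flagged = pages_needing_ocr(extract_pdf("document.pdf"))
#   print(f"{len(flagged)} page(s) may need OCR: {flagged}")
```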

Feedback

Useful? clawhub star extract-pdf-text
Stay updated: clawhub sync

Security Recommendations

This skill is an offline how-to for using PyMuPDF and (optionally) Tesseract OCR — it appears internally consistent. Before using: (1) install Python packages in a virtualenv to avoid system-wide changes; (2) install Tesseract from your OS package manager if you need OCR; (3) review example code if you plan to copy/paste (the examples open files you provide and include illustrative hardcoded passwords — do not ship secrets in code); (4) treat PDFs you process as potentially hostile content (always run on trusted hosts or sandboxes if files come from untrusted sources). If you need confirmation of any hidden behavior, request a version that includes runnable code for review (this skill is instruction-only, so nothing executes automatically).

Functional Analysis

Type: OpenClaw Skill
Name: extract-pdf-text
Version: 1.0.2

The OpenClaw AgentSkills bundle 'extract-pdf-text' is benign. All files (`_meta.json`, `SKILL.md`, `examples.md`, `ocr.md`, `troubleshooting.md`) consistently provide instructions and code examples for local PDF text extraction using PyMuPDF and Tesseract OCR. The `SKILL.md` explicitly states that the skill 'NEVER accesses files without user request, sends data externally, or modifies original PDFs', and the provided code adheres to this. There is no evidence of data exfiltration, malicious execution, persistence mechanisms, obfuscation, or prompt injection attempts against the agent. All operations are local and aligned with the stated purpose.

Capability Assessment

✓ Purpose & Capability

Name/description (extract text, parse tables/forms, support OCR) match the instructions and examples, which use PyMuPDF and optionally pytesseract/Tesseract. Required binary (python3) and pip install guidance for PyMuPDF are appropriate.

✓ Instruction Scope

SKILL.md and included docs only show local operations (opening files, rendering pages, OCR). Examples reference opening user-supplied PDF paths and handling passwords for encrypted PDFs — all within the stated scope. No instructions to read unrelated system files, transmit data externally, or modify other skills.

✓ Install Mechanism

This is instruction-only (no install spec that downloads arbitrary artifacts). The metadata suggests installing PyMuPDF via pip and the docs recommend installing Tesseract from common package managers — standard, low-risk guidance. No obscure URLs or archive extraction are present.

✓ Credentials

The skill requests no environment variables, credentials, or config paths. Examples do show authenticating password-protected PDFs (example passwords are demonstrative only). There are no unrelated credential requests.

✓ Persistence & Privilege

Skill is not always-enabled, is user-invocable, and is instruction-only (no code persisted or installed by the skill). It doesn't request persistent system presence or modify other skills or system-wide settings.

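How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install extract-pdf-text
Once installed, call the Skill by name or trigger it with /extract-pdf-text
Provide the inputs described in the Skill's parameters to get structured output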

Version History

v1.0.2

Remove internal build file that was accidentally included

v1.0.1

- Clarified focus on using PyMuPDF (fitz) for PDF text extraction.
- Added new code examples and guidance in a dedicated `examples.md` file.
- Removed separate documentation for API, CLI, and output formats to simplify usage.
- Streamlined quick reference and troubleshooting sections.
- Updated security and privacy notes for clarity.

v1.0.0

Extract PDF Text v1.0.0
- Initial release.
- Extracts text from PDFs, supporting both text-based and scanned documents with OCR.
- Handles tables, forms, and complex layouts; preserves document structure if needed.
- Offers multiple output formats including plain text, JSON, and Markdown.
- Provides CLI and Python API interfaces with batch and page range support.
- All processing is local for security and privacy: no uploads or external calls required.

Metadata

Slug extract-pdf-text

Version 1.0.2

License: none listed

Total installs 9

Current installs 9

Versions 3

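FAQ

What is Extract PDF Text?

Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 1433 downloads to date.

How do I install Extract PDF Text?

Run the command "/install extract-pdf-text" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is required.

Is Extract PDF Text free?

Yes. Extract PDF Text is completely free (free and open source) and can be downloaded, installed, and used without restriction.

Which platforms does Extract PDF Text support?

Extract PDF Text is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (linux, darwin, win32).

Who developed Extract PDF Text?

It is developed and maintained by Iván (@ivangdavila); the current version is v1.0.2.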