/install extract-pdf-text
When to Use
Agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.
Quick Reference
| Topic | File |
|---|---|
| Code examples | examples.md |
| OCR setup | ocr.md |
| Troubleshooting | troubleshooting.md |
Core Rules
1. Install PyMuPDF First
pip install PyMuPDF
Import as fitz (historical name):
import fitz # PyMuPDF
2. Basic Text Extraction
import fitz  # PyMuPDF
doc = fitz.open("document.pdf")
text = ""
# Iterating a document yields its pages in order
for page in doc:
    text += page.get_text()
doc.close()
3. Pick the Right Method
| PDF Type | Method |
|---|---|
| Text-based | page.get_text() — fast, accurate |
| Scanned | OCR with pytesseract — slower |
| Mixed | Check each page, use OCR when needed |
4. Check for Text Before OCR
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
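The same threshold check can be applied page by page to route a mixed document. The following is a minimal sketch, assuming the page texts have already been collected with page.get_text(). The name `plan_extraction` and the `min_chars` parameter are illustrative and not part of PyMuPDF, and the default of 50 characters is just the heuristic cutoff used above, not a tuned value.

```python
def plan_extraction(page_texts, min_chars=50):
    """Return (page_number, method) pairs: "text" if enough text, else "ocr"."""
    plan = []
    for number, text in enumerate(page_texts, start=1):
        # Whitespace-only output counts as "no text", matching needs_ocr()
        method = "text" if len(text.strip()) >= min_chars else "ocr"
        plan.append((number, method))
    return plan


# Hand-written stand-ins for what page.get_text() might return
sample = ["A" * 200, "   \n", "Page 3 of 3"]
print(plan_extraction(sample))
```

Pages that carry only a short footer or page number fall below the cutoff and get routed to OCR, which is usually what you want for scanned pages with a stamped page label.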
5. Handle Errors Gracefully
try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass, then authenticate
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
Extraction Traps
| Trap | What Happens | Fix |
|---|---|---|
| OCR on text PDF | Slow + worse accuracy | Check get_text() first |
| Forget to close doc | Memory leak | Use a with block or call doc.close() |
| Assume text order | Wrong reading flow | Use sort=True in get_text() |
| Ignore encoding | Garbled characters | get_text() returns Unicode str; write output with encoding="utf-8" |
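Several of these fixes can be combined in one small helper pair. This is a sketch under stated assumptions: `extract_sorted` and `write_text` are hypothetical names, and the file names in the usage comment are placeholders. It relies on fitz.open() working as a context manager and on the sort keyword of get_text().

```python
from pathlib import Path


def extract_sorted(pdf_path):
    """Return one string per page, sorted top-to-bottom, then left-to-right."""
    import fitz  # PyMuPDF; imported here so write_text works without it

    # The with block closes the document even if extraction raises
    with fitz.open(pdf_path) as doc:
        return [page.get_text(sort=True) for page in doc]


def write_text(out_path, page_texts):
    """Write page texts to a file, always encoded as UTF-8."""
    Path(out_path).write_text("\n".join(page_texts), encoding="utf-8")


# Usage (file names are placeholders):
# write_text("document.txt", extract_sorted("document.pdf"))
```

Passing the encoding explicitly avoids depending on the platform default, which is a common source of garbled characters on Windows.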
Scope
This skill provides instructions for using PyMuPDF to extract PDF text.
This skill ONLY:
- Gives code examples for PyMuPDF
- Explains OCR setup when needed
- Troubleshoots common issues
This skill NEVER:
- Accesses files without user request
- Sends data externally
- Modifies original PDFs
Security & Privacy
All processing is local:
- PyMuPDF runs entirely on your machine
- No external API calls
- No data leaves your system
Output Formats
Plain Text
text = page.get_text()
Structured (dict)
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
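One use for the span sizes is a rough guess at headings. The sketch below assumes that the largest font on a page marks a title, which will not hold for every document. `largest_spans` is a hypothetical helper, and the sample dictionary is hand-built to mimic only the keys used here, not the full output of page.get_text("dict").

```python
def largest_spans(page_dict, top_n=1):
    """Return text of spans whose font size is among the top_n largest sizes."""
    spans = [
        span
        for block in page_dict["blocks"]
        if block["type"] == 0  # skip image blocks
        for line in block["lines"]
        for span in line["spans"]
    ]
    sizes = sorted({span["size"] for span in spans}, reverse=True)[:top_n]
    return [span["text"] for span in spans if span["size"] in sizes]


# Hand-built sample with the same nesting: blocks -> lines -> spans
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Annual Report", "size": 18.0}]}]},
        {"type": 1},  # stand-in for an image block
        {"type": 0, "lines": [{"spans": [{"text": "Body text.", "size": 10.0}]}]},
    ]
}
print(largest_spans(sample))
```

With a real page, pass page.get_text("dict") in place of the sample.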
JSON
import json
data = page.get_text("json")
parsed = json.loads(data)
Full Example
import fitz

def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    doc = fitz.open(path)
    results = []
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
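Because extract_pdf only flags scanned pages and does not run OCR on them, a caller will usually want the list of flagged pages. A small sketch, assuming the result shape shown above. `pages_needing_ocr` is a hypothetical helper and the sample result is hand-built, not real output.

```python
def pages_needing_ocr(result):
    """Return the 1-based page numbers that extract_pdf flagged as needs_ocr."""
    return [p["page"] for p in result["content"] if p["method"] == "needs_ocr"]


# Hand-built result in the same shape extract_pdf returns
sample_result = {
    "pages": 3,
    "content": [
        {"page": 1, "text": "Plenty of text " * 10, "method": "text"},
        {"page": 2, "text": "", "method": "needs_ocr"},
        {"page": 3, "text": "", "method": "needs_ocr"},
    ],
    "word_count": 30,
}
todo = pages_needing_ocr(sample_result)
print(f"{len(todo)} of {sample_result['pages']} pages need OCR: {todo}")
```

The returned page numbers can then be handed to whatever OCR step ocr.md describes.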
Feedback
- Useful? clawhub star extract-pdf-text
- Stay updated: clawhub sync
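- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: /install extract-pdf-text
- Once installed, call the Skill by name or trigger it with /extract-pdf-text
- Provide the inputs described in the Skill's parameter notes to get structured output
What is Extract PDF Text?
Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 1433 downloads to date.
How do I install Extract PDF Text?
Run the command "/install extract-pdf-text" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is Extract PDF Text free?
Yes. Extract PDF Text is completely free (free and open source) and can be downloaded, installed, and used freely.
Which platforms does Extract PDF Text support?
Extract PDF Text is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed (linux, darwin, win32).
Who developed Extract PDF Text?
It is developed and maintained by Iván (@ivangdavila); the current version is v1.0.2.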