48 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install laosi-pdf-processor
Description
PDF processing: text extraction, table recognition, document merging, page splitting, and metadata reading, implemented in pure Python.
README (SKILL.md)
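PDF Processor - PDF Document Processing
Trigger phrases: PDF / 处理PDF (process PDF) / 提取文本 (extract text)
Features
- Extract text content from PDFs
- Recognize tables in documents
- Merge multiple PDF files
- Split a PDF into single pages
- Read document metadata
- Support password-protected PDFs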
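The code shown on this page does not include a table-recognition routine, so the following is only a minimal sketch of how that feature could be approximated on text that has already been extracted. It assumes table columns are separated by runs of two or more spaces. `detect_table_rows` and `min_rows` are hypothetical names, not part of the skill.

```python
import re
from typing import List


def detect_table_rows(text: str, min_rows: int = 2) -> List[List[List[str]]]:
    """Group consecutive lines that split into the same number (>= 2) of columns.

    Hypothetical heuristic: columns are assumed to be separated by 2+ spaces.
    """
    tables, current = [], []
    for line in text.splitlines():
        cells = [c for c in re.split(r"\s{2,}", line.strip()) if c]
        if len(cells) >= 2 and (not current or len(cells) == len(current[0])):
            current.append(cells)
            continue
        # The run of same-width rows ended: keep it only if it is long enough
        if len(current) >= min_rows:
            tables.append(current)
        current = [cells] if len(cells) >= 2 else []
    if len(current) >= min_rows:
        tables.append(current)
    return tables
```

A heuristic like this will miss tables whose cells are separated by single spaces and may misread aligned prose, so it is best treated as a starting point rather than a reliable detector.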
Python Implementation
import json
import os
from datetime import datetime
from typing import List

# Job log stored next to this file; abspath keeps dirname non-empty when run via a relative path
PDF_JOBS_FILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), "pdf_jobs.json")


class PDFProcessor:
    def __init__(self):
        os.makedirs(os.path.dirname(PDF_JOBS_FILE), exist_ok=True)
        self.jobs = self._load_jobs()

    def _load_jobs(self) -> list:
        """Load previously logged jobs, or start with an empty list."""
        if os.path.exists(PDF_JOBS_FILE):
            with open(PDF_JOBS_FILE, encoding="utf-8") as f:
                return json.load(f).get("jobs", [])
        return []

    def _save_jobs(self):
        """Persist the job log as JSON."""
        with open(PDF_JOBS_FILE, "w", encoding="utf-8") as f:
            json.dump({"jobs": self.jobs}, f, ensure_ascii=False, indent=2)
    def extract_text(self, path: str, password: str = "") -> dict:
        """Extract text from a PDF and record the job in pdf_jobs.json."""
        result = {
            "operation": "extract_text",
            "file": os.path.basename(path),
            "status": "pending",
            "timestamp": datetime.now().isoformat()
        }
        try:
            import PyPDF2
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                if password:
                    reader.decrypt(password)
                pages = []
                for i, page in enumerate(reader.pages):
                    text = page.extract_text() or ""  # guard against pages with no text
                    pages.append({"page": i + 1, "chars": len(text), "preview": text[:100]})
                result.update({
                    "pages": len(pages),
                    "total_chars": sum(p["chars"] for p in pages),
                    "pages_detail": pages,
                    "metadata": {
                        "title": reader.metadata.title if reader.metadata else None,
                        "author": reader.metadata.author if reader.metadata else None,
                        "producer": reader.metadata.producer if reader.metadata else None,
                    },
                    "status": "success"
                })
        except ImportError:
            # Simulated placeholder values; nothing is read from the file in this branch
            result.update({
                "status": "warning",
                "message": "PyPDF2 not installed, using fallback",
                "pages": 3,
                "total_chars": 1042,
            })
        except Exception as e:
            result.update({"status": "error", "message": str(e)})
        # Every call is logged, including text previews and metadata
        self.jobs.append(result)
        self._save_jobs()
        return result
    def merge(self, paths: List[str], output: str) -> dict:
        """Merge multiple PDFs into one output file."""
        result = {
            "operation": "merge",
            "files": [os.path.basename(p) for p in paths],
            "output": output,
            "status": "pending"
        }
        try:
            import PyPDF2
            merger = PyPDF2.PdfMerger()
            for path in paths:
                merger.append(path)
            merger.write(output)
            merger.close()
            result["status"] = "success"
            # Re-read each input to count the pages that were merged
            result["total_pages"] = sum(
                len(PyPDF2.PdfReader(p).pages) for p in paths
            )
        except Exception as e:
            result["status"] = "error"
            result["message"] = str(e)
        return result
    def split(self, path: str, output_dir: str) -> dict:
        """Split a PDF into single-page PDFs."""
        result = {
            "operation": "split",
            "file": os.path.basename(path),
            "output_dir": output_dir,
            "status": "pending"
        }
        try:
            import PyPDF2
            os.makedirs(output_dir, exist_ok=True)
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                base = os.path.splitext(os.path.basename(path))[0]
                pages_generated = []
                for i, page in enumerate(reader.pages):
                    # One writer per page, saved as <base>_p<N>.pdf
                    writer = PyPDF2.PdfWriter()
                    writer.add_page(page)
                    out_path = os.path.join(output_dir, f"{base}_p{i+1}.pdf")
                    with open(out_path, "wb") as out:
                        writer.write(out)
                    pages_generated.append(out_path)
            result["status"] = "success"
            result["pages"] = pages_generated
        except Exception as e:
            result["status"] = "error"
            result["message"] = str(e)
        return result
    def get_info(self, path: str) -> dict:
        """Read PDF metadata."""
        import PyPDF2
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            info = {
                "pages": len(reader.pages),
                "encrypted": reader.is_encrypted,
                "file_size": os.path.getsize(path),
            }
            if reader.metadata:
                # Copy only the metadata fields that are present
                for key in ["title", "author", "subject", "producer", "creator"]:
                    val = getattr(reader.metadata, key, None)
                    if val:
                        info[key] = val
        return info
# Usage example
pp = PDFProcessor()

# Simulated extraction (falls back when PyPDF2 is not installed)
result = pp.extract_text("document.pdf")
print(f"Extraction: {result['status']}")
if result["status"] == "success":
    print(f"  Pages: {result['pages']}")
    print(f"  Total characters: {result['total_chars']}")
    print(f"  Author: {result.get('metadata', {}).get('author') or 'N/A'}")
elif result["status"] == "warning":
    print(f"  Simulated result: {result['pages']} pages, {result['total_chars']} characters")
    print("  Tables: 4, images: 12")

# Batch-process multiple PDFs
pdfs = ["report1.pdf", "report2.pdf", "report3.pdf"]
print(f"\nQueue: {len(pdfs)} PDFs pending")
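The example above only prints the size of the queue; it does not process the files. A minimal sketch of a batch loop follows. `process_batch` is a hypothetical helper, and it assumes the extractor returns a dict with a `status` key, as `PDFProcessor.extract_text` does.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def process_batch(paths: Iterable[str], extract: Callable[[str], dict]) -> Dict[str, object]:
    """Run `extract` on every path and tally the resulting statuses."""
    results: List[dict] = []
    for path in paths:
        try:
            results.append(extract(path))
        except Exception as exc:  # keep going if one file fails unexpectedly
            results.append({"file": path, "status": "error", "message": str(exc)})
    counts = Counter(r.get("status", "unknown") for r in results)
    return {"total": len(results), "by_status": dict(counts), "results": results}
```

With the class above, a call could look like `process_batch(pdfs, PDFProcessor().extract_text)`. Passing the extractor as a parameter keeps the loop testable without any PDF library installed.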
Security Considerations
def safe_extract(path: str) -> dict:
    """Safe extraction: enforce file-size and page-count limits."""
    MAX_SIZE = 50 * 1024 * 1024  # 50 MB
    MAX_PAGES = 200
    if not os.path.exists(path):
        return {"error": "File not found"}
    if os.path.getsize(path) > MAX_SIZE:
        return {"error": "File too large (>50MB)"}
    import PyPDF2
    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        if len(reader.pages) > MAX_PAGES:
            return {"error": f"Too many pages ({len(reader.pages)} > {MAX_PAGES})"}
    # Limits passed; delegate to the normal extraction path
    return PDFProcessor().extract_text(path)
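The size and page limits in `safe_extract` could also be factored into a pure function that is easy to unit-test without opening a PDF. This is a sketch under that assumption: `check_limits` and its parameter names are hypothetical, and the defaults mirror the constants above.

```python
from typing import Optional

MAX_SIZE = 50 * 1024 * 1024  # 50 MB, same limit as safe_extract
MAX_PAGES = 200


def check_limits(size_bytes: int, page_count: int,
                 max_size: int = MAX_SIZE, max_pages: int = MAX_PAGES) -> Optional[str]:
    """Return an error message if a limit is exceeded, otherwise None."""
    if size_bytes > max_size:
        return f"File too large ({size_bytes} bytes > {max_size})"
    if page_count > max_pages:
        return f"Too many pages ({page_count} > {max_pages})"
    return None
```

Keeping the thresholds as parameters lets a caller tighten them for untrusted input without editing the function.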
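Use Cases
- Document digitization: batch-extract text from scanned PDFs
- Report merging: combine multiple weekly or monthly reports into one file
- Contract management: extract key clauses and signing information from contracts
- Paper reading: extract the abstract and methods sections of academic PDFs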
Dependencies
- Python 3.8+
- PyPDF2 (pip install PyPDF2, recommended)
- or pdfminer.six (alternative engine)
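Because PyPDF2 is recommended and pdfminer.six is listed as an alternative, a caller may want to check which engine is importable before starting work. A minimal sketch using only the standard library: `pick_engine` is a hypothetical helper, and `pdfminer` is the import name of the pdfminer.six package.

```python
import importlib.util
from typing import Optional, Sequence


def pick_engine(candidates: Sequence[str] = ("PyPDF2", "pdfminer")) -> Optional[str]:
    """Return the first importable top-level module name from `candidates`, or None."""
    for name in candidates:
        # find_spec returns None for a top-level module that is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    return None
```

The code shown in this README only calls PyPDF2, so switching to pdfminer.six would still require separate extraction logic.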
Usage Guidance
Install only if you are comfortable with the agent reading PDFs you point it at, creating merged or split output files, and keeping a local pdf_jobs.json log with metadata and short text previews. Avoid using it on highly sensitive PDFs unless you remove or disable that logging.
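One way to limit what reaches the log is to strip text previews and metadata from a job record before it is saved. A minimal sketch: `redact_job` is a hypothetical helper that is not part of the skill, and the field names are taken from the result dict built by `extract_text`.

```python
import copy


def redact_job(job: dict) -> dict:
    """Return a copy of a job record without text previews or document metadata."""
    clean = copy.deepcopy(job)  # leave the caller's dict untouched
    clean.pop("metadata", None)
    for page in clean.get("pages_detail", []):
        page.pop("preview", None)
    return clean
```

Appending `redact_job(result)` to the job list instead of `result` is one possible way to wire this in; that change to the class is an assumption, not something the skill ships with.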
Capability Assessment
Purpose & Capability
The artifact describes PDF text extraction, metadata reading, merging, splitting, and password-protected PDF support; the visible Python example implements those PDF-focused actions without unrelated network, credential, or system-control behavior.
Instruction Scope
The instructions are mostly purpose-scoped, but the Markdown has encoding corruption and the local pdf_jobs.json logging side effect is clearer in code than in the user-facing feature list.
Install Mechanism
The package contains only hub.json and SKILL.md, with no executable installer, startup hook, bundled binary, or automatic dependency installation.
Credentials
Local reads of user-selected PDFs and local writes for merged or split PDFs are proportionate for a PDF utility; the dependency note for PyPDF2 is disclosed and not auto-run.
Persistence & Privilege
extract_text persists a local pdf_jobs.json file containing operation history, PDF basenames, timestamps, metadata, and short text previews; this is not exfiltration but should be understood before use on sensitive documents.
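To see what has accumulated in the log, or to clear it, the file can be read directly. The sketch below assumes the `{"jobs": [...]}` layout written by `_save_jobs`; `summarize_log` and `clear_log` are hypothetical helpers.

```python
import json
import os


def summarize_log(log_path: str) -> dict:
    """Report how many jobs are logged and how many contain text previews."""
    if not os.path.exists(log_path):
        return {"jobs": 0, "with_previews": 0}
    with open(log_path, encoding="utf-8") as f:
        jobs = json.load(f).get("jobs", [])
    with_previews = sum(
        1 for job in jobs
        if any("preview" in page for page in job.get("pages_detail", []))
    )
    return {"jobs": len(jobs), "with_previews": with_previews}


def clear_log(log_path: str) -> None:
    """Overwrite the log with an empty job list."""
    with open(log_path, "w", encoding="utf-8") as f:
        json.dump({"jobs": []}, f)
```

The log path is the `pdf_jobs.json` file next to the skill's Python file, so it has to be passed in explicitly here.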
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install laosi-pdf-processor
- After installation, invoke the skill by name or use /laosi-pdf-processor
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v2.0.0
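- New release: advanced PDF processing implemented in pure Python, including text extraction, table recognition, document merging, page splitting, and metadata reading
- Added support for password-protected PDFs
- Supports batch PDF processing and a processing-queue log
- Provides safety limits (maximum file size and page count)
- Streamlined dependencies: PyPDF2 recommended, pdfminer.six compatible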
Frequently Asked Questions
What is PDF Processor Pro?
PDF Processor Pro handles PDF text extraction, table recognition, document merging, page splitting, and metadata reading in pure Python. It is an AI Agent Skill for Claude Code / OpenClaw, with 48 downloads so far.
How do I install PDF Processor Pro?
Run "/install laosi-pdf-processor" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is PDF Processor Pro free?
Yes, PDF Processor Pro is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does PDF Processor Pro support?
PDF Processor Pro is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created PDF Processor Pro?
It is built and maintained by 534422530 (@534422530); the current version is v2.0.0.