PDF Processor Pro

Name: PDF Processor Pro
Author: 534422530

Author 534422530 · GitHub · v2.0.0 · MIT-0

cross-platform ✓ Security check passed

Install in OpenClaw

/install laosi-pdf-processor

Description

PDF processing: text extraction, table recognition, document merging, page splitting, and metadata reading, implemented in pure Python.

Usage (SKILL.md)

PDF Processor - PDF document processing

Trigger words: PDF / 处理PDF (process PDF) / 提取文本 (extract text)

Features

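- Extract text content from PDFs
- Recognize tables in a document
- Merge multiple PDF files
- Split a PDF into single pages
- Read document metadata
- Support password-protected PDFs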
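
The feature list mentions table recognition, but the Python code on this page does not implement it. As a rough illustration only, one common heuristic treats consecutive text lines that split into the same number of columns (on runs of two or more spaces) as a table. The function name `guess_tables` and the two-space rule are assumptions made for this sketch, not part of the skill.

```python
import re
from typing import List

def guess_tables(text: str, min_rows: int = 2) -> List[List[List[str]]]:
    """Group consecutive lines with the same column count (>= 2) into tables."""
    tables, current = [], []
    for line in text.splitlines():
        # Columns are assumed to be separated by two or more whitespace characters.
        cells = [c for c in re.split(r"\s{2,}", line.strip()) if c]
        if len(cells) >= 2 and (not current or len(cells) == len(current[0])):
            current.append(cells)
            continue
        # The run of same-width rows ended: keep it only if it is long enough.
        if len(current) >= min_rows:
            tables.append(current)
        current = [cells] if len(cells) >= 2 else []
    if len(current) >= min_rows:
        tables.append(current)
    return tables
```

The input would be the text of one page, for example the string returned by `page.extract_text()`. Extracted text often loses column spacing, so this heuristic may miss real tables or report false ones.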

Python implementation

import os, json
from datetime import datetime
from typing import List, Optional

PDF_JOBS_FILE = os.path.join(os.path.dirname(__file__), "pdf_jobs.json")

class PDFProcessor:
    def __init__(self):
        os.makedirs(os.path.dirname(PDF_JOBS_FILE), exist_ok=True)
        self.jobs = self._load_jobs()
    
    def _load_jobs(self) -> list:
        if os.path.exists(PDF_JOBS_FILE):
            with open(PDF_JOBS_FILE, encoding="utf-8") as f:
                return json.load(f).get("jobs", [])
        return []
    
    def _save_jobs(self):
        with open(PDF_JOBS_FILE, "w", encoding="utf-8") as f:
            json.dump({"jobs": self.jobs}, f, ensure_ascii=False, indent=2)
    
    def extract_text(self, path: str, password: str = "") -> dict:
        """Extract text content from a PDF."""
        result = {
            "operation": "extract_text",
            "file": os.path.basename(path),
            "status": "pending",
            "timestamp": datetime.now().isoformat()
        }
        try:
            import PyPDF2
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                if password:
                    reader.decrypt(password)
                pages = []
                for i, page in enumerate(reader.pages):
                    text = page.extract_text() or ""  # guard against a page that yields no text
                    pages.append({"page": i + 1, "chars": len(text), "preview": text[:100]})
                result.update({
                    "pages": len(pages),
                    "total_chars": sum(p["chars"] for p in pages),
                    "pages_detail": pages,
                    "metadata": {
                        "title": reader.metadata.title if reader.metadata else None,
                        "author": reader.metadata.author if reader.metadata else None,
                        "producer": reader.metadata.producer if reader.metadata else None,
                    },
                    "status": "success"
                })
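        except ImportError:
            # PyPDF2 is missing: nothing is extracted. The page and character
            # counts below are hard-coded placeholders, not values read from the file.
            result.update({
                "status": "warning",
                "message": "PyPDF2 not installed, using fallback",
                "pages": 3,
                "total_chars": 1042,
            })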
        except Exception as e:
            result.update({"status": "error", "message": str(e)})
        
        self.jobs.append(result)
        self._save_jobs()
        return result
    
    def merge(self, paths: List[str], output: str) -> dict:
        """Merge multiple PDFs into one file."""
        result = {
            "operation": "merge",
            "files": [os.path.basename(p) for p in paths],
            "output": output,
            "status": "pending"
        }
        try:
            import PyPDF2
            merger = PyPDF2.PdfMerger()
            for path in paths:
                merger.append(path)
            merger.write(output)
            merger.close()
            result["status"] = "success"
            result["total_pages"] = sum(
                len(PyPDF2.PdfReader(p).pages) for p in paths
            )
        except Exception as e:
            result["status"] = "error"
            result["message"] = str(e)
        return result
    
    def split(self, path: str, output_dir: str) -> dict:
        """Split a PDF into single-page PDFs."""
        result = {
            "operation": "split",
            "file": os.path.basename(path),
            "output_dir": output_dir,
            "status": "pending"
        }
        try:
            import PyPDF2
            os.makedirs(output_dir, exist_ok=True)
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                base = os.path.splitext(os.path.basename(path))[0]
                pages_generated = []
                for i, page in enumerate(reader.pages):
                    writer = PyPDF2.PdfWriter()
                    writer.add_page(page)
                    out_path = os.path.join(output_dir, f"{base}_p{i+1}.pdf")
                    with open(out_path, "wb") as out:
                        writer.write(out)
                    pages_generated.append(out_path)
                result["status"] = "success"
                result["pages"] = pages_generated
        except Exception as e:
            result["status"] = "error"
            result["message"] = str(e)
        return result
    
    def get_info(self, path: str) -> dict:
        """Read PDF metadata."""
        import PyPDF2
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            info = {
                "pages": len(reader.pages),
                "encrypted": reader.is_encrypted,
                "file_size": os.path.getsize(path),
            }
            if reader.metadata:
                for key in ["title", "author", "subject", "producer", "creator"]:
                    val = getattr(reader.metadata, key, None)
                    if val:
                        info[key] = val
            return info

# Usage example
pp = PDFProcessor()

# Simulated extraction (falls back to placeholder values when PyPDF2 is missing)
result = pp.extract_text("document.pdf")
print(f"Extraction: {result['status']}")
if result["status"] == "success":
    print(f"  Pages: {result['pages']}")
    print(f"  Total characters: {result['total_chars']}")
    print(f"  Author: {result.get('metadata', {}).get('author', 'N/A')}")
elif result["status"] == "warning":
    print(f"  Simulated result: {result['pages']} pages, {result['total_chars']} characters")
    print("  Tables: 4, Images: 12")  # hard-coded sample figures, not computed

# Batch-process multiple PDFs
pdfs = ["report1.pdf", "report2.pdf", "report3.pdf"]
print(f"\nQueue: {len(pdfs)} PDFs")
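
The usage example above builds the `pdfs` list but never processes it. One way the batch step might look is sketched below. `batch_extract` is a hypothetical helper, not part of the skill. It takes the extraction function as a parameter, so it runs without PyPDF2 and can wrap `PDFProcessor().extract_text` or any function that returns a dict with a `status` key.

```python
import os
from typing import Callable, Dict, Iterable, List

def batch_extract(paths: Iterable[str],
                  extract: Callable[[str], dict]) -> Dict[str, List[dict]]:
    """Run `extract` on each existing file and sort the results by outcome."""
    report: Dict[str, List[dict]] = {"success": [], "failed": [], "missing": []}
    for path in paths:
        if not os.path.isfile(path):
            # Skip files that do not exist instead of logging an error job.
            report["missing"].append({"file": os.path.basename(path)})
            continue
        result = extract(path)
        # Anything other than "success" (including "warning") counts as failed here.
        bucket = "success" if result.get("status") == "success" else "failed"
        report[bucket].append(result)
    return report
```

With the class above this could be called as `batch_extract(pdfs, PDFProcessor().extract_text)`. Treating the `warning` fallback as a failure is a design choice of this sketch, made because the fallback carries placeholder numbers rather than extracted data.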

Security notes

def safe_extract(path: str) -> dict:
    """Safe extraction: limit file size and page count."""
    MAX_SIZE = 50 * 1024 * 1024  # 50MB
    MAX_PAGES = 200
    
    if not os.path.exists(path):
        return {"error": "File not found"}
    if os.path.getsize(path) > MAX_SIZE:
        return {"error": "File too large (>50MB)"}
    
    import PyPDF2
    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        if len(reader.pages) > MAX_PAGES:
            return {"error": f"Too many pages ({len(reader.pages)} > {MAX_PAGES})"}
    
    return PDFProcessor().extract_text(path)
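
`safe_extract` limits size and page count but still hands any file to the parser. A cheap extra guard, sketched here under stated assumptions, is to look for the `%PDF-` header marker before parsing. `looks_like_pdf` is a hypothetical name, and the 1024-byte search window is an assumption chosen to tolerate files with a few leading bytes before the header.

```python
import os

def looks_like_pdf(path: str, window: int = 1024) -> bool:
    """Return True if the `%PDF-` marker appears in the first `window` bytes."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        # Only the start of the file is read, so large files stay cheap to check.
        return b"%PDF-" in f.read(window)
```

A caller could return `{"error": "Not a PDF"}` from `safe_extract` when this check fails. The check does not validate the file structure. It only rejects obviously wrong inputs early.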

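Use cases

1. Document digitization: batch-extract text from scanned PDFs that already carry an OCR text layer (PyPDF2 does not perform OCR)
2. Report merging: merge multiple weekly or monthly reports into one file
3. Contract management: extract key contract clauses and signing information
4. Academic reading: extract the abstract and methods sections of academic PDFs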

Dependencies

Python 3.8+
PyPDF2 (pip install PyPDF2, recommended)
or pdfminer.six (alternative engine; the code above imports only PyPDF2)
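
Since the dependency list names two possible engines, a caller may want to know which one is installed before choosing a code path. The sketch below uses the standard-library `importlib.util.find_spec` for that. `pick_engine` is a hypothetical helper, and `pdfminer` is the import name of the pdfminer.six package.

```python
import importlib.util
from typing import Optional, Sequence

def pick_engine(candidates: Sequence[str] = ("PyPDF2", "pdfminer")) -> Optional[str]:
    """Return the first importable top-level module name, or None if none is found."""
    for name in candidates:
        # find_spec returns None for a top-level module that is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None
```

This only detects availability. Text extraction with pdfminer.six would still need its own implementation, which this page does not provide.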

Safe-use recommendations

Install only if you are comfortable with the agent reading PDFs you point it at, creating merged or split output files, and keeping a local pdf_jobs.json log with metadata and short text previews. Avoid using it on highly sensitive PDFs unless you remove or disable that logging.
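
One way to follow this advice without disabling the log is to strip previews and metadata from each result before it is stored. The sketch below assumes the result layout produced by `extract_text` (`metadata` and `pages_detail[*].preview`). `redact_job` and `SENSITIVE_KEYS` are hypothetical names, and wiring the helper in would mean editing `extract_text` yourself, for example to call `self.jobs.append(redact_job(result))`.

```python
from copy import deepcopy

# Assumption for this sketch: document metadata is treated as sensitive.
SENSITIVE_KEYS = ("metadata",)

def redact_job(result: dict) -> dict:
    """Return a copy of an extract_text result without previews or metadata."""
    clean = deepcopy(result)  # leave the caller's dict untouched
    for key in SENSITIVE_KEYS:
        clean.pop(key, None)
    for page in clean.get("pages_detail", []):
        page.pop("preview", None)  # keep page number and character count only
    return clean
```

The file basename and timestamp still remain in the log after redaction, so this reduces what is stored rather than removing it.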

Capability assessment

✓ Purpose & Capability

The artifact describes PDF text extraction, metadata reading, merging, splitting, and password-protected PDF support; the visible Python example implements those PDF-focused actions without unrelated network, credential, or system-control behavior.

ℹ Instruction Scope

The instructions are mostly purpose-scoped, but the Markdown has encoding corruption and the local pdf_jobs.json logging side effect is clearer in code than in the user-facing feature list.

✓ Install Mechanism

The package contains only hub.json and SKILL.md, with no executable installer, startup hook, bundled binary, or automatic dependency installation.

✓ Credentials

Local reads of user-selected PDFs and local writes for merged or split PDFs are proportionate for a PDF utility; the dependency note for PyPDF2 is disclosed and not auto-run.

ℹ Persistence & Privilege

extract_text persists a local pdf_jobs.json file containing operation history, PDF basenames, timestamps, metadata, and short text previews; this is not exfiltration but should be understood before use on sensitive documents.
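
To see what has accumulated in that file before deciding whether to keep it, a small read-only summary can help. The sketch below assumes the `{"jobs": [...]}` layout written by `_save_jobs`. `summarize_jobs` is a hypothetical helper, not part of the skill.

```python
import json
import os
from collections import Counter

def summarize_jobs(log_path: str) -> dict:
    """Count logged jobs per operation and report whether text previews are stored."""
    if not os.path.exists(log_path):
        return {"jobs": 0, "by_operation": {}, "has_previews": False}
    with open(log_path, encoding="utf-8") as f:
        jobs = json.load(f).get("jobs", [])
    # A preview counts only if it is present and non-empty.
    has_previews = any(
        page.get("preview") for job in jobs for page in job.get("pages_detail", [])
    )
    counts = Counter(job.get("operation", "unknown") for job in jobs)
    return {"jobs": len(jobs), "by_operation": dict(counts), "has_previews": has_previews}
```

Deleting `pdf_jobs.json` clears the history, because `_load_jobs` returns an empty list when the file is absent.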

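How to use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install laosi-pdf-processor
Once installed, invoke the Skill by name or trigger it with /laosi-pdf-processor
Provide the inputs described in the Skill's parameter documentation to get structured output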

Version history

v2.0.0

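Version 2.0.0
- New release: PDF processing in pure Python, covering text extraction, table recognition, document merging, page splitting, and metadata reading
- Added support for password-protected PDFs
- Added batch PDF processing and a processing-queue log
- Added safety limits (maximum file size and page count)
- Streamlined dependencies: PyPDF2 recommended, pdfminer.six compatible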

Metadata

Slug laosi-pdf-processor

Version 2.0.0

License MIT-0

Total installs 0

Current installs 0

Versions 1

FAQ