
PDF Processor Pro

by 534422530 · GitHub ↗ · v2.0.0 · MIT-0
cross-platform ✓ Security Clean
48 Downloads · 0 Stars · 0 Active Installs · 1 Versions
Install in OpenClaw
/install laosi-pdf-processor
Description
PDF processing: text extraction, table recognition, document merging, page splitting, and metadata reading, implemented in pure Python
README (SKILL.md)

PDF Processor - PDF document processing

Trigger words: PDF / 处理PDF (process PDF) / 提取文本 (extract text)

Features

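  • Extract text content from PDFs
  • Recognize tables in documents
  • Merge multiple PDF files
  • Split a PDF into single pages
  • Read document metadata
  • Support password-protected PDFs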

Python implementation

import os, json
from datetime import datetime
from typing import List, Optional

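# abspath() keeps dirname() non-empty when the script is started via a relative path
PDF_JOBS_FILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), "pdf_jobs.json")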

class PDFProcessor:
    def __init__(self):
        os.makedirs(os.path.dirname(PDF_JOBS_FILE), exist_ok=True)
        self.jobs = self._load_jobs()
    
    def _load_jobs(self) -> list:
        if os.path.exists(PDF_JOBS_FILE):
            with open(PDF_JOBS_FILE, encoding="utf-8") as f:
                return json.load(f).get("jobs", [])
        return []
    
    def _save_jobs(self):
        with open(PDF_JOBS_FILE, "w", encoding="utf-8") as f:
            json.dump({"jobs": self.jobs}, f, ensure_ascii=False, indent=2)
    
    def extract_text(self, path: str, password: str = "") -> dict:
        """Extract the text content of a PDF."""
        result = {
            "operation": "extract_text",
            "file": os.path.basename(path),
            "status": "pending",
            "timestamp": datetime.now().isoformat()
        }
        try:
            import PyPDF2
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
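                if reader.is_encrypted:
                    # decrypt() can raise on an unencrypted file, so check first;
                    # an empty password is tried as-is
                    reader.decrypt(password)
                pages = []
                for i, page in enumerate(reader.pages):
                    text = page.extract_text() or ""  # guard against pages with no text
                    pages.append({"page": i + 1, "chars": len(text), "preview": text[:100]})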
                result.update({
                    "pages": len(pages),
                    "total_chars": sum(p["chars"] for p in pages),
                    "pages_detail": pages,
                    "metadata": {
                        "title": reader.metadata.title if reader.metadata else None,
                        "author": reader.metadata.author if reader.metadata else None,
                        "producer": reader.metadata.producer if reader.metadata else None,
                    },
                    "status": "success"
                })
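        except ImportError:
            # PyPDF2 is missing: return simulated placeholder numbers,
            # which are not derived from the actual file.
            result.update({
                "status": "warning",
                "message": "PyPDF2 not installed, using fallback",
                "pages": 3,
                "total_chars": 1042,
            })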
        except Exception as e:
            result.update({"status": "error", "message": str(e)})
        
        self.jobs.append(result)
        self._save_jobs()
        return result
    
    def merge(self, paths: List[str], output: str) -> dict:
        """Merge multiple PDFs into one file."""
        result = {
            "operation": "merge",
            "files": [os.path.basename(p) for p in paths],
            "output": output,
            "status": "pending"
        }
        try:
            import PyPDF2
            merger = PyPDF2.PdfMerger()
            for path in paths:
                merger.append(path)
            merger.write(output)
            merger.close()
            result["status"] = "success"
            result["total_pages"] = sum(
                len(PyPDF2.PdfReader(p).pages) for p in paths
            )
        except Exception as e:
            result["status"] = "error"
            result["message"] = str(e)
        return result
    
    def split(self, path: str, output_dir: str) -> dict:
        """Split a PDF into single-page PDFs."""
        result = {
            "operation": "split",
            "file": os.path.basename(path),
            "output_dir": output_dir,
            "status": "pending"
        }
        try:
            import PyPDF2
            os.makedirs(output_dir, exist_ok=True)
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                base = os.path.splitext(os.path.basename(path))[0]
                pages_generated = []
                for i, page in enumerate(reader.pages):
                    writer = PyPDF2.PdfWriter()
                    writer.add_page(page)
                    out_path = os.path.join(output_dir, f"{base}_p{i+1}.pdf")
                    with open(out_path, "wb") as out:
                        writer.write(out)
                    pages_generated.append(out_path)
                result["status"] = "success"
                result["pages"] = pages_generated
        except Exception as e:
            result["status"] = "error"
            result["message"] = str(e)
        return result
    
    def get_info(self, path: str) -> dict:
        """Read PDF metadata."""
        import PyPDF2
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            info = {
                "pages": len(reader.pages),
                "encrypted": reader.is_encrypted,
                "file_size": os.path.getsize(path),
            }
            if reader.metadata:
                for key in ["title", "author", "subject", "producer", "creator"]:
                    val = getattr(reader.metadata, key, None)
                    if val:
                        info[key] = val
            return info

# Usage example
pp = PDFProcessor()

# Extraction (returns simulated fallback values when PyPDF2 is not installed)
result = pp.extract_text("document.pdf")
print(f"Extraction: {result['status']}")
if result["status"] == "success":
    print(f"  Pages: {result['pages']}")
    print(f"  Total chars: {result['total_chars']}")
    # The author key is always present but may be None, so fall back with `or`
    print(f"  Author: {result.get('metadata', {}).get('author') or 'N/A'}")
elif result["status"] == "warning":
    print(f"  Simulated result: {result['pages']} pages, {result['total_chars']} chars")
    print("  Tables: 4, images: 12")  # hard-coded sample values, not computed

# Batch-process multiple PDFs
pdfs = ["report1.pdf", "report2.pdf", "report3.pdf"]
print(f"\nPending queue: {len(pdfs)} PDFs")
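
The usage example above builds the `pdfs` list but never processes it. A minimal sketch of one way the batch step could look is shown here; `process_batch` and `fake_extract` are hypothetical names that are not part of the skill, and `fake_extract` only stands in for `PDFProcessor().extract_text` so the sketch runs without PyPDF2.

```python
from typing import Callable, Dict, List


def process_batch(paths: List[str], extract: Callable[[str], dict]) -> Dict[str, list]:
    """Run `extract` on every path and group the results by their status."""
    grouped: Dict[str, list] = {"success": [], "warning": [], "error": []}
    for path in paths:
        try:
            result = extract(path)
        except Exception as exc:
            # One failing file should not stop the rest of the queue
            result = {"status": "error", "message": str(exc)}
        result.setdefault("file", path)
        grouped.setdefault(result.get("status", "error"), []).append(result)
    return grouped


def fake_extract(path: str) -> dict:
    """Hypothetical stand-in for PDFProcessor().extract_text (assumption)."""
    if path.endswith("2.pdf"):
        raise OSError("cannot open file")
    return {"file": path, "status": "success", "pages": 1}


summary = process_batch(["report1.pdf", "report2.pdf", "report3.pdf"], fake_extract)
print({status: len(items) for status, items in summary.items()})
```

In real use, passing `PDFProcessor().extract_text` as the `extract` argument should work the same way, since that method already returns a dict with a `status` key.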

Security considerations

def safe_extract(path: str) -> dict:
    """Safe extraction: enforce limits on file size and page count."""
    MAX_SIZE = 50 * 1024 * 1024  # 50MB
    MAX_PAGES = 200
    
    if not os.path.exists(path):
        return {"error": "File not found"}
    if os.path.getsize(path) > MAX_SIZE:
        return {"error": "File too large (>50MB)"}
    
    import PyPDF2
    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        if len(reader.pages) > MAX_PAGES:
            return {"error": f"Too many pages ({len(reader.pages)} > {MAX_PAGES})"}
    
    return PDFProcessor().extract_text(path)
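
`safe_extract` needs PyPDF2 before it can reject anything beyond the size limit. As a rough complement, the sketch below runs stdlib-only checks before a parser is involved. `precheck_pdf` is a hypothetical helper, and searching the first 1024 bytes for the `%PDF-` marker is a lenient choice made for this sketch rather than something the skill does.

```python
import os
import tempfile

MAX_SIZE = 50 * 1024 * 1024  # same 50 MB limit that safe_extract uses


def precheck_pdf(path: str, max_size: int = MAX_SIZE) -> dict:
    """Cheap checks to run before handing a file to a PDF parser."""
    if not os.path.isfile(path):
        return {"ok": False, "error": "File not found"}
    if os.path.getsize(path) > max_size:
        return {"ok": False, "error": "File too large"}
    with open(path, "rb") as f:
        header = f.read(1024)
    # Assumption of this sketch: accept the marker anywhere in the first 1024 bytes
    if b"%PDF-" not in header:
        return {"ok": False, "error": "Not a PDF (missing %PDF- marker)"}
    return {"ok": True}


# Demonstration on throwaway files
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(bad, "wb") as f:
        f.write(b"plain text, not a PDF")
    good_result = precheck_pdf(good)
    bad_result = precheck_pdf(bad)
    missing_result = precheck_pdf(os.path.join(tmp, "missing.pdf"))

print(good_result, bad_result, missing_result)
```

A file that passes these checks can still be malformed, so the page-count limit in `safe_extract` remains useful afterwards.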

Use cases

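  1. Document digitization: batch-extract text from scanned PDFs
  2. Report merging: merge multiple weekly/monthly reports into one
  3. Contract management: extract key contract terms and signing information
  4. Paper reading: extract the abstract and methods sections of academic PDFs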

Dependencies

  • Python 3.8+
  • PyPDF2 (pip install PyPDF2, recommended)
  • or pdfminer.six (alternative engine)
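
The dependency list names two engines, but the code shown only imports PyPDF2. A small sketch of how a caller might detect which engine is installed before choosing a code path follows; `pick_engine` is a hypothetical helper, not part of the skill.

```python
import importlib.util
from typing import Optional


def pick_engine() -> Optional[str]:
    """Return the name of the first available PDF engine, or None."""
    if importlib.util.find_spec("PyPDF2") is not None:
        return "PyPDF2"
    # The pdfminer.six distribution is imported under the module name "pdfminer"
    if importlib.util.find_spec("pdfminer") is not None:
        return "pdfminer.six"
    return None


engine = pick_engine()
print(engine or "no PDF engine installed")
```

Checking up front like this would let a caller report a missing dependency clearly instead of relying on the simulated fallback in `extract_text`.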
Usage Guidance
Install only if you are comfortable with the agent reading PDFs you point it at, creating merged or split output files, and keeping a local pdf_jobs.json log with metadata and short text previews. Avoid using it on highly sensitive PDFs unless you remove or disable that logging.
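One way to follow this advice without giving up the job history is to strip the sensitive fields before a record is saved. The sketch below is an assumption about how that could be done: `redact_job` and `SENSITIVE_KEYS` are hypothetical names, and the sample record only mirrors the shape that `extract_text` builds.

```python
import copy

SENSITIVE_KEYS = ("metadata",)  # top-level keys to drop (assumption of this sketch)


def redact_job(job: dict) -> dict:
    """Return a copy of an extract_text result that is safer to persist."""
    clean = copy.deepcopy(job)  # leave the caller's result untouched
    for key in SENSITIVE_KEYS:
        clean.pop(key, None)
    for page in clean.get("pages_detail", []):
        page.pop("preview", None)  # drop the 100-character text preview
    return clean


sample = {
    "operation": "extract_text",
    "file": "contract.pdf",
    "status": "success",
    "pages": 1,
    "pages_detail": [{"page": 1, "chars": 42, "preview": "Confidential ..."}],
    "metadata": {"title": None, "author": "A. Author", "producer": None},
}
redacted = redact_job(sample)
print(redacted)
```

Applying `redact_job` to each entry inside `_save_jobs` would be one place to wire this in, so the caller still receives the full result while the file on disk keeps only counts and file names.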
Capability Assessment
Purpose & Capability
The artifact describes PDF text extraction, metadata reading, merging, splitting, and password-protected PDF support; the visible Python example implements those PDF-focused actions without unrelated network, credential, or system-control behavior.
Instruction Scope
The instructions are mostly purpose-scoped, but the Markdown has encoding corruption and the local pdf_jobs.json logging side effect is clearer in code than in the user-facing feature list.
Install Mechanism
The package contains only hub.json and SKILL.md, with no executable installer, startup hook, bundled binary, or automatic dependency installation.
Credentials
Local reads of user-selected PDFs and local writes for merged or split PDFs are proportionate for a PDF utility; the dependency note for PyPDF2 is disclosed and not auto-run.
Persistence & Privilege
extract_text persists a local pdf_jobs.json file containing operation history, PDF basenames, timestamps, metadata, and short text previews; this is not exfiltration but should be understood before use on sensitive documents.
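To see what has already been persisted, the log can be inspected and reset with the standard library alone. This is a sketch under the assumption that the file keeps the `{"jobs": [...]}` shape written by `_save_jobs`; `summarize_log` and `clear_log` are hypothetical helpers, and the demonstration runs on a temporary file instead of a real log.

```python
import json
import os
import tempfile


def summarize_log(log_path: str) -> dict:
    """Describe what a pdf_jobs.json file holds without echoing its text."""
    if not os.path.exists(log_path):
        return {"jobs": 0, "files": [], "has_previews": False}
    with open(log_path, encoding="utf-8") as f:
        jobs = json.load(f).get("jobs", [])
    return {
        "jobs": len(jobs),
        "files": sorted({job.get("file", "?") for job in jobs}),
        "has_previews": any(
            "preview" in page
            for job in jobs
            for page in job.get("pages_detail", [])
        ),
    }


def clear_log(log_path: str) -> None:
    """Overwrite the log with an empty job list, the shape _load_jobs expects."""
    with open(log_path, "w", encoding="utf-8") as f:
        json.dump({"jobs": []}, f)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "pdf_jobs.json")
    with open(log_path, "w", encoding="utf-8") as f:
        json.dump({"jobs": [{
            "operation": "extract_text",
            "file": "contract.pdf",
            "pages_detail": [{"page": 1, "chars": 42, "preview": "Confidential"}],
        }]}, f)
    before = summarize_log(log_path)
    clear_log(log_path)
    after = summarize_log(log_path)

print(before)
print(after)
```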
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install laosi-pdf-processor
  3. After installation, invoke the skill by name or use /laosi-pdf-processor
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v2.0.0
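Version 2.0.0
  • New release: advanced PDF processing in pure Python, including text extraction, table recognition, document merging, page splitting, and metadata reading
  • Added support for password-protected PDFs
  • Supports batch PDF processing and a processing-queue log
  • Provides safety limits (maximum file size and page count)
  • Streamlined dependencies: PyPDF2 recommended, compatible with pdfminer.six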
Metadata
Slug laosi-pdf-processor
Version 2.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is PDF Processor Pro?

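PDF Processor Pro handles PDF processing: text extraction, table recognition, document merging, page splitting, and metadata reading, implemented in pure Python. It is an AI Agent Skill for Claude Code / OpenClaw, with 48 downloads so far.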

How do I install PDF Processor Pro?

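Run "/install laosi-pdf-processor" in the OpenClaw or Claude Code chat to install it in one step. The PyPDF2 dependency is not installed automatically, so install it separately (pip install PyPDF2) to get real extraction results.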

Is PDF Processor Pro free?

Yes, PDF Processor Pro is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does PDF Processor Pro support?

PDF Processor Pro is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created PDF Processor Pro?

It is built and maintained by 534422530 (@534422530); the current version is v2.0.0.
