kaiyuelv

Pdf Intelligence Suite

Author: Lv Lancer · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 223
Favorites: 0
Current installs: 1
Versions: 1
Install in OpenClaw
/install pdf-intelligence-suite
Description
PDF Intelligence Suite - Text extraction, table recognition, OCR, PDF to Word/Excel conversion
Usage Instructions (SKILL.md)

PDF Intelligence Suite


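Chinese Description

Overview

PDF Intelligence Suite is a PDF document processing toolkit that provides text extraction, table recognition, OCR, and format conversion in one place.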

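Features

  • 📄 Text Extraction: Extract plain or structured text from PDFs, with support for several layout-analysis modes
  • 📊 Table Recognition: Automatically detect tables in PDFs and extract them as structured data (CSV/Excel)
  • 🔍 OCR Recognition: Recognize text in scanned documents and image-based PDFs, with multi-language support
  • 🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
  • ✂️ Page Operations: Merge, split, rotate, delete pages
  • 🔒 Security: Encryption, decryption, watermarking, digital signatures
  • 📝 Metadata Management: Read and modify PDF document properties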

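Tech Stack

  • PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
  • pdfplumber: Advanced text and table extraction with precise positioning
  • camelot-py: Dedicated table recognition engine
  • pytesseract: OCR text recognition (requires a Tesseract installation)
  • pdf2image: PDF to image conversion
  • reportlab: PDF generation and editing
  • Pillow: Image processing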

Directory Structure

pdf-intelligence-suite/
├── SKILL.md              # This file
├── README.md             # Usage documentation
├── requirements.txt      # Dependency declarations
├── setup.py              # Installation configuration
├── src/
│   └── pdf_intelligence_suite/
│       ├── __init__.py
│       ├── extractor.py      # Text extraction module
│       ├── tables.py         # Table recognition module
│       ├── ocr.py            # OCR module
│       ├── converter.py      # Format conversion module
│       ├── manipulator.py    # Page operations module
│       ├── security.py       # Security module
│       └── utils.py          # Utility functions
├── examples/
│   └── basic_usage.py    # Usage examples
└── tests/
    └── test_pdf_suite.py # Unit tests

Quick Start

from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor

# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")

# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")

# OCR recognition (Simplified Chinese plus English)
ocr = OCRProcessor(lang='chi_sim+eng')
text = ocr.process("scanned.pdf")

Installation

pip install -r requirements.txt

# Install the Tesseract OCR engine with Chinese language data (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra

# macOS
brew install tesseract tesseract-lang

# Windows: download the installer from https://github.com/UB-Mannheim/tesseract/wiki

English Description

Overview

PDF Intelligence Suite is a powerful PDF document processing toolkit providing one-stop services for text extraction, table recognition, OCR, format conversion, and more.

Features

  • 📄 Text Extraction: Extract plain or structured text from PDFs with layout analysis
  • 📊 Table Recognition: Automatically detect and extract tables as structured data (CSV/Excel)
  • 🔍 OCR Recognition: Recognize text in scanned documents and image-based PDFs, multi-language support
  • 🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
  • ✂️ Page Operations: Merge, split, rotate, delete pages
  • 🔒 Security: Encryption, decryption, watermarking, digital signatures
  • 📝 Metadata: Read and modify PDF document properties
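
Page operations such as split or delete need some way to say which pages are meant. The following is a minimal sketch, assuming a hypothetical page-spec string such as "1-3,5" (1-based, inclusive). `parse_page_spec` is an illustrative helper and is not part of the suite's documented API.

```python
def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec like "1-3,5" into sorted zero-based page indices.

    Hypothetical helper for illustration; the suite's own API may differ.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that fall outside the document or run backwards.
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        indices.update(range(start - 1, end))
    return sorted(indices)


if __name__ == "__main__":
    print(parse_page_spec("1-3,5", page_count=6))
```

Returning zero-based indices keeps the result directly usable with Python list indexing, which is how most PDF libraries expose pages.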

Tech Stack

  • PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
  • pdfplumber: Advanced text and table extraction with precise positioning
  • camelot-py: Professional table recognition engine
  • pytesseract: OCR text recognition (requires Tesseract installation)
  • pdf2image: PDF to image conversion
  • reportlab: PDF generation and editing
  • Pillow: Image processing
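
The stack names pdfplumber for table extraction, and the feature list mentions CSV output. Below is a minimal sketch of that path, assuming tables arrive in the shape pdfplumber's `page.extract_tables()` returns: a list of rows whose cells are strings or `None`. The names `table_to_csv` and `first_table_as_csv` are illustrative, not the suite's API, and pdfplumber is imported lazily so the CSV part runs without it.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize one table (a list of rows) to CSV text; None cells become ""."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buffer.getvalue()


def first_table_as_csv(pdf_path, page_number=0):
    """Extract the first table on one page with pdfplumber, or return None."""
    import pdfplumber  # third-party; imported here so table_to_csv has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        tables = pdf.pages[page_number].extract_tables()
    return table_to_csv(tables[0]) if tables else None


if __name__ == "__main__":
    sample = [["Item", "Qty"], ["Widget, large", "3"], ["Bolt", None]]
    print(table_to_csv(sample), end="")
```

Using the `csv` module rather than joining on commas means cells that themselves contain commas or quotes are escaped correctly.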

Quick Start

from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor

# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")

# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")

# OCR recognition
ocr = OCRProcessor(lang='eng')
text = ocr.process("scanned.pdf")
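
The quick start treats text extraction and OCR as separate calls. One common way to combine them is to try the embedded text layer first and fall back to OCR only when it comes back nearly empty, as it does for scanned PDFs. This is a sketch under that assumption: `extract_with_fallback` and the `min_chars` threshold are hypothetical, and the two callables stand in for `extractor.extract_text` and `ocr.process` from the example above.

```python
from typing import Callable, Optional


def extract_with_fallback(
    pdf_path: str,
    extract_text: Callable[[str], Optional[str]],
    run_ocr: Callable[[str], Optional[str]],
    min_chars: int = 20,
) -> tuple[str, str]:
    """Return (text, source), where source is "text-layer" or "ocr".

    OCR is slower than reading an embedded text layer, so it only runs when
    the text layer yields fewer than min_chars non-whitespace characters.
    """
    text = extract_text(pdf_path) or ""
    if len("".join(text.split())) >= min_chars:
        return text, "text-layer"
    return run_ocr(pdf_path) or "", "ocr"


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any PDF libraries.
    result = extract_with_fallback(
        "scanned.pdf",
        extract_text=lambda path: "  \n",
        run_ocr=lambda path: "Recognized text from the page image",
    )
    print(result)
```

Passing the extractors in as callables keeps the decision logic testable without real PDFs or a Tesseract install.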

Installation

pip install -r requirements.txt

# Install Tesseract OCR engine (Ubuntu/Debian)
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
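
pip installs only the Python wrappers: pytesseract calls the `tesseract` executable, and pdf2image relies on poppler's `pdftoppm`. A small pre-flight sketch using only the standard library follows. The tool list is an assumption based on the dependencies named above and may be incomplete.

```python
import shutil

# Executables the Python wrappers call at runtime (assumed list, not exhaustive).
REQUIRED_TOOLS = {
    "tesseract": "needed by pytesseract for OCR",
    "pdftoppm": "part of poppler, needed by pdf2image",
}


def missing_system_tools(tools=REQUIRED_TOOLS):
    """Return the names of required executables not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_system_tools():
        print(f"missing: {name} ({REQUIRED_TOOLS[name]})")
```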

License

MIT License

Author

ClawHub Skills Collection

Security Usage Recommendations
This package looks like a straightforward local PDF processing library, and its files (extractor, ocr, converter, manipulator) match the documented features. Before installing it or running it on sensitive documents:
  1. Review the omitted or truncated files (security.py, utils.py, tables.py and any remaining code) for network calls or unexpected file access; these files were not fully shown.
  2. Run the package in an isolated environment (VM or container), because several dependencies require native system packages (Tesseract, poppler, Ghostscript) and heavy Python packages.
  3. Note the small packaging inconsistency: setup.py references pdf_intelligence_suite.cli:main but cli.py isn't in the manifest, so the console script may not work until that is fixed.
  4. If you will process confidential PDFs, verify the behavior of security.py (encryption/decryption) and any logging or network functionality to ensure nothing is transmitted externally.
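The first recommendation, reviewing files for network calls or subprocess use, can be partly automated. Here is a rough sketch that walks a source tree with the standard `ast` module and lists imports of modules commonly used for networking or process spawning. The watch list is an assumption, and a clean result is not proof of safety because dynamic imports are not detected.

```python
import ast
from pathlib import Path

# Top-level modules worth a manual look (assumed watch list, extend as needed).
WATCHED = {"socket", "http", "urllib", "requests", "ftplib", "smtplib", "subprocess"}


def flagged_imports(source: str) -> set[str]:
    """Return watched top-level modules imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        found.update(n.split(".")[0] for n in names if n.split(".")[0] in WATCHED)
    return found


def scan_tree(root: str) -> dict[str, set[str]]:
    """Map each .py file under root to the watched modules it imports."""
    report = {}
    for path in Path(root).rglob("*.py"):
        hits = flagged_imports(path.read_text(encoding="utf-8", errors="replace"))
        if hits:
            report[str(path)] = hits
    return report


if __name__ == "__main__":
    print(scan_tree("src"))
```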
Functional Analysis
Type: OpenClaw Skill
Name: pdf-intelligence-suite
Version: 1.0.0
The PDF Intelligence Suite is a comprehensive and well-structured toolkit for PDF processing, including text extraction, OCR, table recognition, and format conversion. The code utilizes standard, reputable libraries such as PyPDF2, pdfplumber, camelot-py, and pytesseract. A thorough review of the source code, documentation (SKILL.md and README.md), and dependencies shows no signs of malicious intent, data exfiltration, or prompt injection. The functionality is transparent, well-documented, and aligns strictly with the stated purpose of document automation and processing.
Capability Assessment
Purpose & Capability
Name, README, SKILL.md, requirements, and the shown source files (extractor, ocr, converter, manipulator, etc.) are coherent: the requested libraries (PyPDF2, pdfplumber, pytesseract, pdf2image, python-docx, openpyxl, reportlab, Pillow, camelot) match the described features (text extraction, table recognition, OCR, conversion, page manipulation, security). No unrelated cloud credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md gives concrete install and usage steps (pip install -r requirements.txt, install system Tesseract/poppler), and the runtime examples and APIs operate only on local PDF files. The instructions do not ask the agent to read unrelated host files, access external endpoints, or exfiltrate environment variables.
Install Mechanism
There is no special install spec (the skill relies on pip requirements and system packages). This is low risk in terms of arbitrary downloads, but the dependency list includes system-level components (Tesseract, poppler) and heavy Python packages (camelot, opencv, pdf2image) that require native libraries; the README documents those needs. Minor inconsistency: setup.py defines a console entry_point 'pdf-suite=pdf_intelligence_suite.cli:main' but no cli.py was listed in the manifest, which may be a packaging oversight (not necessarily malicious).
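The entry-point inconsistency noted above can be checked mechanically: split the console-script spec into module and function, then see whether the module's file exists in the package. This sketch uses the spec string quoted in the review and assumes the `src/` layout from the directory tree. `entry_point_module_path` and `entry_point_exists` are illustrative helpers.

```python
from pathlib import Path, PurePosixPath


def entry_point_module_path(spec: str) -> PurePosixPath:
    """Map 'name=package.module:func' to the relative path 'package/module.py'."""
    _, target = spec.split("=", 1)
    module, _, _func = target.strip().partition(":")
    return PurePosixPath(*module.split(".")).with_suffix(".py")


def entry_point_exists(spec: str, src_root: str = "src") -> bool:
    """True if the entry point's module exists as a file or package under src_root."""
    relative = entry_point_module_path(spec)
    as_module = Path(src_root, *relative.parts)
    as_package = as_module.with_suffix("") / "__init__.py"
    return as_module.is_file() or as_package.is_file()


if __name__ == "__main__":
    spec = "pdf-suite=pdf_intelligence_suite.cli:main"
    print(entry_point_module_path(spec), entry_point_exists(spec))
```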
Credentials
The skill does not declare required environment variables or credentials. The README notes optional TESSDATA_PREFIX for nonstandard Tesseract installs (reasonable). No environment variables named SECRET/TOKEN/KEY are requested and the code shown does not read unrelated env vars.
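When `TESSDATA_PREFIX` is set, a quick check that the language files for an OCR `lang` string are present can save a confusing runtime error. This sketch assumes Tesseract 4 or later, where the variable points at the directory holding the `<lang>.traineddata` files (older releases expected its parent directory). `missing_languages` is a hypothetical helper.

```python
import os
from pathlib import Path


def missing_languages(lang: str, tessdata_dir=None) -> list[str]:
    """Return language codes from a spec like 'chi_sim+eng' with no traineddata file.

    tessdata_dir defaults to the TESSDATA_PREFIX environment variable; if neither
    is available the check cannot run and a ValueError is raised.
    """
    directory = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not directory:
        raise ValueError("pass tessdata_dir or set TESSDATA_PREFIX")
    codes = [code for code in lang.split("+") if code]
    return [c for c in codes if not Path(directory, f"{c}.traineddata").is_file()]


if __name__ == "__main__":
    try:
        print(missing_languages("chi_sim+eng"))
    except ValueError as error:
        print(f"skipped: {error}")
```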
Persistence & Privilege
The skill does not request always:true and has default invocation privileges. It does not attempt to modify other skills or system-wide agent configuration in the reviewed files.
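How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install pdf-intelligence-suite
  3. Once installed, invoke the Skill by name or trigger it with /pdf-intelligence-suite
  4. Provide the inputs required by the Skill's parameter description to get structured output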
Version History
v1.0.0
PDF Intelligence Suite 1.0.0 initial release:
  • Provides text extraction, table recognition (CSV/Excel), and OCR for PDFs.
  • Supports PDF to Word/Excel/image conversion.
  • Enables PDF page merging, splitting, rotating, deletion, and security features (encryption, decryption, watermark, digital signature).
  • Includes metadata read and edit capabilities.
  • Built with PyPDF2, pdfplumber, camelot-py, pytesseract, pdf2image, reportlab, and Pillow.
  • Example code and installation instructions included.
Metadata
Slug: pdf-intelligence-suite
Version: 1.0.0
License: MIT-0
Total installs: 1
Current installs: 1
Version count: 1
FAQ

What is Pdf Intelligence Suite?

PDF Intelligence Suite - Text extraction, table recognition, OCR, PDF to Word/Excel conversion. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 223 total downloads so far.

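How do I install Pdf Intelligence Suite?

Run the command "/install pdf-intelligence-suite" in the OpenClaw or Claude Code chat box for one-click installation. No extra OpenClaw configuration is needed, although the OCR features still require the Tesseract engine described under Installation.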

Is Pdf Intelligence Suite free?

Yes. Pdf Intelligence Suite is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Pdf Intelligence Suite support?

Pdf Intelligence Suite is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Pdf Intelligence Suite?

It is developed and maintained by Lv Lancer (@kaiyuelv). The current version is v1.0.0.
