
Pdf Intelligence Suite

Name: Pdf Intelligence Suite
Author: kaiyuelv

Author: Lv Lancer · GitHub · v1.0.0 · MIT-0

cross-platform ✓ Security check passed

Total downloads: 223

Install in OpenClaw

/install pdf-intelligence-suite

Description

PDF Intelligence Suite - Text extraction, table recognition, OCR, PDF to Word/Excel conversion

Usage instructions (SKILL.md)

PDF Intelligence Suite

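Chinese Description

Overview

PDF Intelligence Suite is a powerful PDF document processing toolkit that provides text extraction, table recognition, OCR, format conversion, and more in one place.

Features

📄 Text Extraction: Extract plain or structured text from PDFs, with support for several layout-analysis modes
📊 Table Recognition: Automatically detect tables in PDFs and extract them as structured data (CSV/Excel)
🔍 OCR Recognition: Recognize text in scanned and image-based PDFs, with multi-language support
🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, and more
✂️ Page Operations: Merge, split, rotate, and delete pages
🔒 Security: Encryption, decryption, watermarking, digital signatures
📝 Metadata Management: Read and modify PDF document properties

Tech Stack

PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
pdfplumber: Advanced text and table extraction with precise positioning
camelot-py: Dedicated table recognition engine
pytesseract: OCR text recognition (requires Tesseract to be installed)
pdf2image: PDF to image conversion
reportlab: PDF generation and editing
Pillow: Image processing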

Directory Structure

pdf-intelligence-suite/
├── SKILL.md              # This file
├── README.md             # Usage documentation
├── requirements.txt      # Dependency declarations
├── setup.py              # Installation configuration
├── src/
│   └── pdf_intelligence_suite/
│       ├── __init__.py
│       ├── extractor.py      # Text extraction module
│       ├── tables.py         # Table recognition module
│       ├── ocr.py            # OCR module
│       ├── converter.py      # Format conversion module
│       ├── manipulator.py    # Page operations module
│       ├── security.py       # Security module
│       └── utils.py          # Utility functions
├── examples/
│   └── basic_usage.py    # Usage examples
└── tests/
    └── test_pdf_suite.py # Unit tests

Quick Start

from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor

# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")

# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")

# OCR recognition (Simplified Chinese plus English)
ocr = OCRProcessor(lang='chi_sim+eng')
text = ocr.process("scanned.pdf")

Installation

pip install -r requirements.txt

# Install the Tesseract OCR engine (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra

# macOS
brew install tesseract tesseract-lang

# Windows: download the installer from https://github.com/UB-Mannheim/tesseract/wiki

English Description

Overview

PDF Intelligence Suite is a powerful PDF document processing toolkit providing one-stop services for text extraction, table recognition, OCR, format conversion, and more.

Features

📄 Text Extraction: Extract plain or structured text from PDFs with layout analysis
📊 Table Recognition: Automatically detect and extract tables as structured data (CSV/Excel)
🔍 OCR Recognition: Recognize text in scanned documents and image-based PDFs, multi-language support
🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
✂️ Page Operations: Merge, split, rotate, delete pages
🔒 Security: Encryption, decryption, watermarking, digital signatures
📝 Metadata: Read and modify PDF document properties
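
To illustrate the page operations listed above, here is a minimal sketch of copying selected pages into a new file with PyPDF2, the library the tech stack names for basic operations. It assumes PyPDF2 2.0 or later (PdfReader and PdfWriter). The helper names parse_page_ranges and extract_pages are hypothetical and are not taken from this skill's manipulator.py, which is not shown here.

```python
from typing import List


def parse_page_ranges(spec: str, page_count: int) -> List[int]:
    """Turn a 1-based spec such as "1-3,5" into sorted zero-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        indices.update(range(start - 1, end))
    return sorted(indices)


def extract_pages(src_path: str, dst_path: str, spec: str) -> None:
    """Copy the selected pages of src_path into a new PDF at dst_path."""
    # Imported lazily so the range parser stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)
```

A call such as extract_pages("report.pdf", "excerpt.pdf", "1-3,5") would write pages 1, 2, 3, and 5 to a new file. Keeping the range parsing separate from the PyPDF2 calls makes the input validation easy to check on its own.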

Tech Stack

PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
pdfplumber: Advanced text and table extraction with precise positioning
camelot-py: Professional table recognition engine
pytesseract: OCR text recognition (requires Tesseract installation)
pdf2image: PDF to image conversion
reportlab: PDF generation and editing
Pillow: Image processing
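
As a rough idea of how the pdfplumber layer of this stack can be used directly, the sketch below pulls per-page text and tables from a text-based PDF and serializes each table to CSV. The function names rows_to_csv and extract_text_and_tables are hypothetical, and this is not the skill's own PDFExtractor or TableExtractor implementation, which may work differently.

```python
import csv
import io
from typing import List, Optional, Sequence, Tuple


def rows_to_csv(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Serialize table rows to CSV text, writing empty cells (None) as ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def extract_text_and_tables(pdf_path: str) -> Tuple[List[str], List[str]]:
    """Return (page_texts, tables_as_csv) for a text-based PDF."""
    # Imported lazily so rows_to_csv works without pdfplumber installed.
    import pdfplumber

    page_texts: List[str] = []
    tables_csv: List[str] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Guard against a missing text layer on scanned pages.
            page_texts.append(page.extract_text() or "")
            for table in page.extract_tables():
                tables_csv.append(rows_to_csv(table))
    return page_texts, tables_csv
```

Scanned pages have no text layer, so this path yields empty strings for them; those pages need the OCR route instead.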

Quick Start

from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor

# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")

# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")

# OCR recognition
ocr = OCRProcessor(lang='eng')
text = ocr.process("scanned.pdf")
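
For readers who want to see what an OCR pass over a scanned PDF can look like underneath, here is a minimal sketch built on pdf2image and pytesseract from the tech stack. The names build_lang, ocr_images, and ocr_pdf are hypothetical; how OCRProcessor is actually implemented is not shown in this document. Tesseract accepts several language codes joined with "+", which is the form the Chinese quick start uses ('chi_sim+eng').

```python
from typing import Callable, Iterable, List


def build_lang(codes: Iterable[str]) -> str:
    """Join Tesseract language codes, e.g. ["chi_sim", "eng"] -> "chi_sim+eng"."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_images(images: Iterable, recognize: Callable[..., str], lang: str) -> str:
    """Run recognize(image, lang=lang) on each page image and join the results."""
    pages: List[str] = [recognize(image, lang=lang).strip() for image in images]
    return "\n\n".join(pages)


def ocr_pdf(pdf_path: str, codes: Iterable[str] = ("eng",), dpi: int = 300) -> str:
    """Rasterize a scanned PDF and OCR every page."""
    # Lazy imports: pdf2image needs poppler, pytesseract needs the tesseract binary.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_images(images, pytesseract.image_to_string, build_lang(codes))
```

Passing the recognizer in as a parameter keeps the page loop testable without Tesseract installed. The 300 dpi default is an assumption; lower values run faster but tend to reduce accuracy on small print.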

Installation

pip install -r requirements.txt

# Install Tesseract OCR engine (Ubuntu/Debian)
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
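
Because pip cannot install the native tools, a quick check that they are on PATH can save a confusing runtime error later. The sketch below is one way to do that. The tool list is an assumption based on what pytesseract and pdf2image are documented to call (the tesseract and pdftoppm executables), not on this skill's code, and find_missing_tools is a hypothetical helper name.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Executables the Python packages shell out to (assumed, see the note above).
REQUIRED_TOOLS: Dict[str, str] = {
    "tesseract": "pytesseract (OCR)",
    "pdftoppm": "pdf2image (part of poppler)",
}


def find_missing_tools(
    tools: Dict[str, str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a message for each required executable that is not on PATH."""
    return [
        f"{name} not found on PATH, needed by {purpose}"
        for name, purpose in tools.items()
        if which(name) is None
    ]


if __name__ == "__main__":
    for problem in find_missing_tools():
        print(problem)
```

Running the script prints one line per missing tool and nothing when everything is found.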

License

MIT License

Author

ClawHub Skills Collection

Security Recommendations

This package looks like a straightforward local PDF processing library, and its files (extractor, ocr, converter, manipulator) match the documented features. Before installing it or running it on sensitive documents:
1) Review the omitted or truncated files (security.py, utils.py, tables.py, and any remaining code) for network calls or unexpected file access; these files were not fully shown.
2) Run the package in an isolated environment (VM or container), because several dependencies require native system packages (Tesseract, poppler, Ghostscript) and heavy Python packages.
3) Note a small packaging inconsistency: setup.py references pdf_intelligence_suite.cli:main, but cli.py is not in the manifest, so the console script may not work until this is fixed.
4) If you will process confidential PDFs, verify the behavior of security.py (encryption/decryption) and any logging or network functionality to make sure nothing is transmitted externally.
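
One way to start the file review recommended in point 1 is a static scan for imports of networking or process-spawning modules. The sketch below uses only the standard library. The module list in FLAGGED_MODULES is an illustrative assumption, and flagged_imports and scan_package are hypothetical helper names.

```python
import ast
from pathlib import Path
from typing import Dict, List, Set

# Modules treated as worth a closer look; adjust to taste (illustrative list).
FLAGGED_MODULES: Set[str] = {
    "socket", "subprocess", "urllib", "http", "requests", "ftplib", "smtplib",
}


def flagged_imports(source: str) -> List[str]:
    """Return the sorted top-level flagged modules that the source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) stay inside the package.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in FLAGGED_MODULES:
                found.add(top_level)
    return sorted(found)


def scan_package(root: str) -> Dict[str, List[str]]:
    """Map each .py file under root to its flagged imports, skipping clean files."""
    report: Dict[str, List[str]] = {}
    for path in sorted(Path(root).rglob("*.py")):
        hits = flagged_imports(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

A hit is a prompt for manual review, not proof of a problem, and a clean result does not rule out dynamic imports or calls made inside third-party dependencies.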

Functional Analysis

Type: OpenClaw Skill
Name: pdf-intelligence-suite
Version: 1.0.0

The PDF Intelligence Suite is a comprehensive and well-structured toolkit for PDF processing, including text extraction, OCR, table recognition, and format conversion. The code uses standard, reputable libraries such as PyPDF2, pdfplumber, camelot-py, and pytesseract. A review of the source code, documentation (SKILL.md and README.md), and dependencies shows no signs of malicious intent, data exfiltration, or prompt injection. The functionality is transparent, well documented, and consistent with the stated purpose of document automation and processing.

Capability Assessment

✓ Purpose & Capability

Name, README, SKILL.md, requirements, and the shown source files (extractor, ocr, converter, manipulator, etc.) are coherent: the requested libraries (PyPDF2, pdfplumber, pytesseract, pdf2image, python-docx, openpyxl, reportlab, Pillow, camelot) match the described features (text extraction, table recognition, OCR, conversion, page manipulation, security). No unrelated cloud credentials, binaries, or config paths are requested.

✓ Instruction Scope

SKILL.md gives concrete install and usage steps (pip install -r requirements.txt, install system Tesseract/poppler), and the runtime examples and APIs operate only on local PDF files. The instructions do not ask the agent to read unrelated host files, access external endpoints, or exfiltrate environment variables.

ℹ Install Mechanism

There is no special install spec (the skill relies on pip requirements and system packages). This is low risk in terms of arbitrary downloads, but the dependency list includes system-level components (Tesseract, poppler) and heavy Python packages (camelot, opencv, pdf2image) that require native libraries; the README documents those needs. Minor inconsistency: setup.py defines a console entry_point 'pdf-suite=pdf_intelligence_suite.cli:main' but no cli.py was listed in the manifest, which may be a packaging oversight (not necessarily malicious).
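
The entry-point inconsistency noted above can be checked mechanically after installation. The sketch below tests whether the module half of a "package.module:function" target can be located. It uses only importlib from the standard library, and entry_point_module_exists is a hypothetical helper name.

```python
import importlib.util


def entry_point_module_exists(target: str) -> bool:
    """Check that the module part of a "package.module:function" target can be found."""
    module_name, _, _function = target.partition(":")
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of the module is itself missing.
        return False
```

With the package installed, entry_point_module_exists("pdf_intelligence_suite.cli:main") would be expected to return False if cli.py is indeed missing from the distribution. The check only locates the module; it does not confirm that a main function exists inside it.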

✓ Credentials

The skill does not declare required environment variables or credentials. The README notes optional TESSDATA_PREFIX for nonstandard Tesseract installs (reasonable). No environment variables named SECRET/TOKEN/KEY are requested and the code shown does not read unrelated env vars.

✓ Persistence & Privilege

The skill does not request always:true and has default invocation privileges. It does not attempt to modify other skills or system-wide agent configuration in the reviewed files.

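How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install pdf-intelligence-suite
After installation, call the Skill by name or trigger it with /pdf-intelligence-suite
Provide the required inputs according to the Skill's parameter description to get structured output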

Version History

v1.0.0

PDF Intelligence Suite 1.0.0 initial release:
- Provides text extraction, table recognition (CSV/Excel), and OCR for PDFs.
- Supports PDF to Word/Excel/image conversion.
- Enables PDF page merging, splitting, rotating, deletion, and security features (encryption, decryption, watermark, digital signature).
- Includes metadata read and edit capabilities.
- Built with PyPDF2, pdfplumber, camelot-py, pytesseract, pdf2image, reportlab, and Pillow.
- Example code and installation instructions included.

Metadata

Slug pdf-intelligence-suite

Version 1.0.0

License MIT-0

Total installs 1

Current installs 1

Version count 1

FAQ