
Pdf Intelligence Suite

Name: Pdf Intelligence Suite
Author: kaiyuelv

by Lv Lancer · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

223 Downloads

Install in OpenClaw

/install pdf-intelligence-suite

Description

PDF Intelligence Suite - Text extraction, table recognition, OCR, PDF to Word/Excel conversion

README (SKILL.md)

PDF Intelligence Suite

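Chinese Description

Overview

PDF Intelligence Suite is a powerful PDF document processing toolkit that provides one-stop text extraction, table recognition, OCR, and format conversion.

Features

📄 Text Extraction: Extract plain or structured text from PDFs, with support for multiple layout analysis modes
📊 Table Recognition: Automatically detect tables in PDFs and extract them as structured data (CSV/Excel)
🔍 OCR Recognition: Recognize text in scanned and image-based PDFs, with multi-language support
🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
✂️ Page Operations: Merge, split, rotate, and delete pages
🔒 Security: Encryption, decryption, watermarking, digital signatures
📝 Metadata Management: Read and modify PDF document properties

Tech Stack

PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
pdfplumber: Advanced text and table extraction with precise positioning
camelot-py: Professional table recognition engine
pytesseract: OCR text recognition (requires Tesseract to be installed)
pdf2image: PDF to image conversion
reportlab: PDF generation and editing
Pillow: Image processing
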
Directory Structure

pdf-intelligence-suite/
├── SKILL.md              # This file
├── README.md             # Usage documentation
├── requirements.txt      # Dependency declarations
├── setup.py              # Installation configuration
├── src/
│   └── pdf_intelligence_suite/
│       ├── __init__.py
│       ├── extractor.py      # Text extraction module
│       ├── tables.py         # Table recognition module
│       ├── ocr.py            # OCR module
│       ├── converter.py      # Format conversion module
│       ├── manipulator.py    # Page operations module
│       ├── security.py       # Security module
│       └── utils.py          # Utility functions
├── examples/
│   └── basic_usage.py    # Usage examples
└── tests/
    └── test_pdf_suite.py # Unit tests

Quick Start

from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor

# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")

# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")

# OCR recognition (Simplified Chinese plus English)
ocr = OCRProcessor(lang='chi_sim+eng')
text = ocr.process("scanned.pdf")

Installation

pip install -r requirements.txt

# Install the Tesseract OCR engine with Chinese language data (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra

# macOS
brew install tesseract tesseract-lang

# Windows: Download the installer from https://github.com/UB-Mannheim/tesseract/wiki

English Description

Overview

PDF Intelligence Suite is a powerful PDF document processing toolkit providing one-stop services for text extraction, table recognition, OCR, format conversion, and more.

Features

📄 Text Extraction: Extract plain or structured text from PDFs with layout analysis
📊 Table Recognition: Automatically detect and extract tables as structured data (CSV/Excel)
🔍 OCR Recognition: Recognize text in scanned documents and image-based PDFs, multi-language support
🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
✂️ Page Operations: Merge, split, rotate, delete pages
🔒 Security: Encryption, decryption, watermarking, digital signatures
📝 Metadata: Read and modify PDF document properties

Tech Stack

PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
pdfplumber: Advanced text and table extraction with precise positioning
camelot-py: Professional table recognition engine
pytesseract: OCR text recognition (requires Tesseract installation)
pdf2image: PDF to image conversion
reportlab: PDF generation and editing
Pillow: Image processing
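
As a rough illustration of how the table recognition feature could sit on top of pdfplumber, the sketch below turns extracted tables into CSV text. It is not the suite's actual tables.py, which was not shown; the helper names rows_to_csv and tables_as_csv are hypothetical. The pdfplumber calls used (pdfplumber.open, pdf.pages, page.extract_tables) are part of that library's public API.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize one extracted table (a list of rows) to CSV text.

    pdfplumber reports empty cells as None; they are written as empty strings.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def tables_as_csv(pdf_path):
    """Return one CSV string per table found in the PDF (hypothetical helper)."""
    # Third-party dependency, imported lazily so rows_to_csv works without it.
    import pdfplumber

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                results.append(rows_to_csv(table))
    return results
```

Keeping the CSV serialization separate from the PDF access makes the formatting step easy to check without a sample document.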

Quick Start

from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor

# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")

# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")

# OCR recognition
ocr = OCRProcessor(lang='eng')
text = ocr.process("scanned.pdf")
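
The listing does not show how PDFExtractor.extract_text is implemented. As a minimal sketch, assuming it wraps pdfplumber, page-by-page extraction might look like the following; join_page_text and extract_text_sketch are hypothetical names, not part of the suite.

```python
def join_page_text(pages, separator="\n\n"):
    """Concatenate the text of each page, skipping pages that yield no text.

    `pages` is any iterable of objects with an extract_text() method,
    such as the pages of a pdfplumber document.
    """
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:
            parts.append(text.strip())
    return separator.join(parts)


def extract_text_sketch(pdf_path):
    """Open a PDF with pdfplumber and return the text of all pages (hypothetical helper)."""
    # Third-party dependency, imported lazily so join_page_text works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return join_page_text(pdf.pages)
```

A page that yields no text is often a scanned image, which is the case the OCR path is meant to cover.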

Installation

pip install -r requirements.txt

# Install Tesseract OCR engine (Ubuntu/Debian)
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
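
After following the steps above, a quick check can confirm that the native tools are visible to Python. This is a sketch using only the standard library, and check_ocr_environment is a hypothetical helper. It assumes the tesseract binary is what pytesseract needs and that pdftoppm (from poppler) is what pdf2image needs, which matches how those libraries are normally set up.

```python
import os
import shutil


def check_ocr_environment():
    """Report where the native OCR tools are found, or None if they are missing."""
    return {
        "tesseract": shutil.which("tesseract"),   # OCR engine used by pytesseract
        "pdftoppm": shutil.which("pdftoppm"),     # poppler tool used by pdf2image
        "TESSDATA_PREFIX": os.environ.get("TESSDATA_PREFIX"),  # optional override
    }


if __name__ == "__main__":
    for name, value in check_ocr_environment().items():
        print(f"{name}: {value or 'not found'}")
```

TESSDATA_PREFIX is only needed for nonstandard Tesseract installs, so a missing value there is not an error by itself.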

License

MIT License

Author

ClawHub Skills Collection

Usage Guidance

This package looks like a straightforward local PDF processing library, and its files (extractor, ocr, converter, manipulator) match the documented features. Before installing it or running it on sensitive documents:
1) Review the omitted or truncated files (security.py, utils.py, tables.py, and any remaining code) for network calls or unexpected file access; these files were not fully shown.
2) Run the package in an isolated environment (VM or container), because several dependencies require native system packages (Tesseract, poppler, Ghostscript) and heavy Python packages.
3) Note the small packaging inconsistency: setup.py references pdf_intelligence_suite.cli:main, but cli.py is not in the manifest, so the console script may not work until this is fixed.
4) If you will process confidential PDFs, verify the behavior of security.py (encryption/decryption) and any logging or network functionality to ensure nothing is transmitted externally.
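
One way to start the review of unseen files for network or subprocess usage is a static scan of their imports. The sketch below uses only the standard library; the SUSPECT_MODULES list and the find_suspect_imports name are illustrative choices, not an established audit standard.

```python
import ast

# Illustrative list of modules worth a closer look during a manual review.
SUSPECT_MODULES = {"socket", "http", "urllib", "requests", "subprocess", "ftplib", "smtplib"}


def find_suspect_imports(source):
    """Return the sorted top-level modules from SUSPECT_MODULES that the source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in SUSPECT_MODULES:
                found.add(top_level)
    return sorted(found)
```

To apply it, read each file under src/pdf_intelligence_suite/ and pass its text to find_suspect_imports. The scan only sees static import statements, so dynamic imports or calls such as os.system still need a manual read.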

Capability Analysis

Type: OpenClaw Skill
Name: pdf-intelligence-suite
Version: 1.0.0

The PDF Intelligence Suite is a comprehensive and well-structured toolkit for PDF processing, including text extraction, OCR, table recognition, and format conversion. The code utilizes standard, reputable libraries such as PyPDF2, pdfplumber, camelot-py, and pytesseract. A thorough review of the source code, documentation (SKILL.md and README.md), and dependencies shows no signs of malicious intent, data exfiltration, or prompt injection. The functionality is transparent, well-documented, and aligns strictly with the stated purpose of document automation and processing.

Capability Assessment

✓ Purpose & Capability

Name, README, SKILL.md, requirements, and the shown source files (extractor, ocr, converter, manipulator, etc.) are coherent: the requested libraries (PyPDF2, pdfplumber, pytesseract, pdf2image, python-docx, openpyxl, reportlab, Pillow, camelot) match the described features (text extraction, table recognition, OCR, conversion, page manipulation, security). No unrelated cloud credentials, binaries, or config paths are requested.

✓ Instruction Scope

SKILL.md gives concrete install and usage steps (pip install -r requirements.txt, install system Tesseract/poppler), and the runtime examples and APIs operate only on local PDF files. The instructions do not ask the agent to read unrelated host files, access external endpoints, or exfiltrate environment variables.

ℹ Install Mechanism

There is no special install spec (the skill relies on pip requirements and system packages). This is low risk in terms of arbitrary downloads, but the dependency list includes system-level components (Tesseract, poppler) and heavy Python packages (camelot, opencv, pdf2image) that require native libraries; the README documents those needs. Minor inconsistency: setup.py defines a console entry_point 'pdf-suite=pdf_intelligence_suite.cli:main' but no cli.py was listed in the manifest, which may be a packaging oversight (not necessarily malicious).
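
The entry-point inconsistency noted above can be checked after installation with a small standard-library sketch. The function name entry_point_resolves is hypothetical; the target string format "module:attribute" is the one used by setuptools console scripts.

```python
import importlib


def entry_point_resolves(target):
    """Return True if a 'module:attribute' target can be imported and is callable."""
    module_name, _, attribute = target.partition(":")
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers a missing module such as an absent cli.py.
        return False
    return callable(getattr(module, attribute, None))
```

Calling entry_point_resolves("pdf_intelligence_suite.cli:main") in the environment where the skill is installed would return False if cli.py is indeed missing from the package.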

✓ Credentials

The skill does not declare required environment variables or credentials. The README notes optional TESSDATA_PREFIX for nonstandard Tesseract installs (reasonable). No environment variables named SECRET/TOKEN/KEY are requested and the code shown does not read unrelated env vars.

✓ Persistence & Privilege

The skill does not request always:true and has default invocation privileges. It does not attempt to modify other skills or system-wide agent configuration in the reviewed files.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install pdf-intelligence-suite
After installation, invoke the skill by name or use /pdf-intelligence-suite
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

PDF Intelligence Suite 1.0.0 initial release:
- Provides text extraction, table recognition (CSV/Excel), and OCR for PDFs.
- Supports PDF to Word/Excel/image conversion.
- Enables PDF page merging, splitting, rotating, deletion, and security features (encryption, decryption, watermark, digital signature).
- Includes metadata read and edit capabilities.
- Built with PyPDF2, pdfplumber, camelot-py, pytesseract, pdf2image, reportlab, and Pillow.
- Example code and installation instructions included.

Metadata

Slug pdf-intelligence-suite

Version 1.0.0

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 1

Frequently Asked Questions

What is Pdf Intelligence Suite?

PDF Intelligence Suite is a PDF processing toolkit offering text extraction, table recognition, OCR, and PDF to Word/Excel conversion. It is an AI Agent Skill for Claude Code / OpenClaw, with 223 downloads so far.

How do I install Pdf Intelligence Suite?

Run "/install pdf-intelligence-suite" in the OpenClaw or Claude Code chat to install it in one step. OCR features additionally require the system Tesseract engine described in the README.

Is Pdf Intelligence Suite free?

Yes, Pdf Intelligence Suite is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Pdf Intelligence Suite support?

Pdf Intelligence Suite is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Pdf Intelligence Suite?

It is built and maintained by Lv Lancer (@kaiyuelv); the current version is v1.0.0.
