kaiyuelv

Pdf Intelligence Suite

by Lv Lancer · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
Downloads: 223
Stars: 0
Active Installs: 1
Versions: 1
Install in OpenClaw
/install pdf-intelligence-suite
Description
PDF Intelligence Suite - Text extraction, table recognition, OCR, PDF to Word/Excel conversion, and more
README (SKILL.md)

PDF Intelligence Suite


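Chinese Description

Overview

PDF Intelligence Suite is a PDF document processing toolkit that provides text extraction, table recognition, OCR, format conversion, and related functions in one package.

Features

  • 📄 Text Extraction: Extract plain or structured text from PDFs, with support for several layout-analysis modes
  • 📊 Table Recognition: Automatically detect tables in PDFs and extract them as structured data (CSV/Excel)
  • 🔍 OCR Recognition: Recognize text in scanned and image-based PDFs, with multi-language support
  • 🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
  • ✂️ Page Operations: Merge, split, rotate, delete pages
  • 🔒 Security: Encryption, decryption, watermarking, digital signatures
  • 📝 Metadata Management: Read and modify PDF document properties

Tech Stack

  • PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
  • pdfplumber: Advanced text and table extraction with precise positioning
  • camelot-py: Dedicated table recognition engine
  • pytesseract: OCR text recognition (requires Tesseract to be installed)
  • pdf2image: PDF to image conversion
  • reportlab: PDF generation and editing
  • Pillow: Image processing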

Directory Structure

pdf-intelligence-suite/
├── SKILL.md              # This file
├── README.md             # Usage documentation
├── requirements.txt      # Dependency list
├── setup.py              # Installation configuration
├── src/
│   └── pdf_intelligence_suite/
│       ├── __init__.py
│       ├── extractor.py      # Text extraction module
│       ├── tables.py         # Table recognition module
│       ├── ocr.py            # OCR module
│       ├── converter.py      # Format conversion module
│       ├── manipulator.py    # Page operations module
│       ├── security.py       # Security module
│       └── utils.py          # Utility functions
├── examples/
│   └── basic_usage.py    # Usage examples
└── tests/
    └── test_pdf_suite.py # Unit tests

Quick Start

from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor

# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")

# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")

# OCR recognition (Simplified Chinese plus English)
ocr = OCRProcessor(lang='chi_sim+eng')
text = ocr.process("scanned.pdf")

Installation

pip install -r requirements.txt

# Install the Tesseract OCR engine with Chinese language data (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra

# macOS
brew install tesseract tesseract-lang

# Windows: download the installer from https://github.com/UB-Mannheim/tesseract/wiki

English Description

Overview

PDF Intelligence Suite is a PDF document processing toolkit that covers text extraction, table recognition, OCR, format conversion, and more in one package.

Features

  • 📄 Text Extraction: Extract plain or structured text from PDFs with layout analysis
  • 📊 Table Recognition: Automatically detect and extract tables as structured data (CSV/Excel)
  • 🔍 OCR Recognition: Recognize text in scanned documents and image-based PDFs, multi-language support
  • 🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
  • ✂️ Page Operations: Merge, split, rotate, delete pages
  • 🔒 Security: Encryption, decryption, watermarking, digital signatures
  • 📝 Metadata: Read and modify PDF document properties

Tech Stack

  • PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
  • pdfplumber: Advanced text and table extraction with precise positioning
  • camelot-py: Professional table recognition engine
  • pytesseract: OCR text recognition (requires Tesseract installation)
  • pdf2image: PDF to image conversion
  • reportlab: PDF generation and editing
  • Pillow: Image processing

Quick Start

from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor

# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")

# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")

# OCR recognition
ocr = OCRProcessor(lang='eng')
text = ocr.process("scanned.pdf")
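
The extraction and OCR steps above can be combined so that OCR runs only on pages without a usable text layer. The following is a minimal sketch that calls pdfplumber, pdf2image, and pytesseract directly rather than through the suite; `needs_ocr`, `extract_with_fallback`, and the 20-character threshold are hypothetical names and values chosen for illustration, not part of the suite's API.

```python
def needs_ocr(page_text, min_chars=20):
    """Return True when a page's text layer is missing or too short to trust.

    The threshold is an assumption; tune it for your documents.
    """
    return len((page_text or "").strip()) < min_chars


def extract_with_fallback(pdf_path, lang="eng", min_chars=20):
    """Extract text page by page, falling back to OCR for image-only pages."""
    # Imported lazily so needs_ocr() works without these packages installed.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if needs_ocr(text, min_chars):
                # Render only this page to an image, then run Tesseract on it.
                image = convert_from_path(
                    pdf_path, first_page=number, last_page=number
                )[0]
                text = pytesseract.image_to_string(image, lang=lang)
            pages.append(text or "")
    return "\n".join(pages)
```

Rendering one page at a time keeps memory use low on long scans, at the cost of invoking poppler once per OCR page.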

Installation

pip install -r requirements.txt

# Install Tesseract OCR engine (Ubuntu/Debian)
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
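
Because OCR depends on a system-level Tesseract install, it can help to confirm the external tools are on `PATH` before running the suite. This is a small sketch using only the standard library. The executable names are assumptions: `tesseract` for Tesseract, `pdftoppm` for poppler (which pdf2image uses), and `gs` for Ghostscript. Poppler and Ghostscript are mentioned in the usage guidance further down this page, not in the install steps above, and on Windows the Ghostscript executable usually has a different name.

```python
import shutil

# Hypothetical mapping of tool label -> executable name to look up on PATH.
REQUIRED_TOOLS = {
    "tesseract": "tesseract",
    "poppler": "pdftoppm",
    "ghostscript": "gs",
}


def check_system_tools(tools=REQUIRED_TOOLS):
    """Return {label: True/False} depending on whether each executable is on PATH."""
    return {label: shutil.which(exe) is not None for label, exe in tools.items()}


if __name__ == "__main__":
    for label, found in check_system_tools().items():
        print(f"{label}: {'found' if found else 'missing'}")
```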

License

MIT License

Author

ClawHub Skills Collection

Usage Guidance
This package looks like a straightforward local PDF processing library, and its files (extractor, ocr, converter, manipulator) match the documented features. Before installing it or running it on sensitive documents:
  1. Review the omitted or truncated files (security.py, utils.py, tables.py, and any remaining code) for network calls or unexpected file access; these files were not fully shown.
  2. Run the package in an isolated environment (VM or container), because several dependencies require native system packages (Tesseract, poppler, Ghostscript) and heavy Python packages.
  3. Note the small packaging inconsistency: setup.py references pdf_intelligence_suite.cli:main, but cli.py is not in the manifest, so the console script may not work until this is fixed.
  4. If you will process confidential PDFs, verify the behavior of security.py (encryption/decryption) and any logging or network functionality to ensure that nothing is transmitted externally.
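The review for network or subprocess usage suggested above can be partly automated. The sketch below uses only the standard library to list imports of modules that commonly indicate network or process access. The `SUSPECT_MODULES` set and both function names are assumptions made for illustration. A static import scan of this kind does not catch dynamic imports or code inside third-party dependencies, so it narrows a manual review rather than replacing it.

```python
import ast
from pathlib import Path

# Hypothetical watch list; extend it to match your own threat model.
SUSPECT_MODULES = {"socket", "subprocess", "requests", "urllib", "http", "ftplib", "smtplib"}


def find_suspect_imports(source):
    """Return the sorted top-level module names imported in `source` that are on the watch list."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in SUSPECT_MODULES:
                found.add(root)
    return sorted(found)


def scan_package(package_dir):
    """Map each .py file under `package_dir` to its suspect imports, skipping clean files."""
    report = {}
    for path in sorted(Path(package_dir).rglob("*.py")):
        hits = find_suspect_imports(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

For example, `scan_package("src/pdf_intelligence_suite")` would report any of the listed modules imported by security.py, utils.py, or tables.py.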
Capability Analysis
Type: OpenClaw Skill
Name: pdf-intelligence-suite
Version: 1.0.0
The PDF Intelligence Suite is a comprehensive and well-structured toolkit for PDF processing, including text extraction, OCR, table recognition, and format conversion. The code utilizes standard, reputable libraries such as PyPDF2, pdfplumber, camelot-py, and pytesseract. A thorough review of the source code, documentation (SKILL.md and README.md), and dependencies shows no signs of malicious intent, data exfiltration, or prompt injection. The functionality is transparent, well-documented, and aligns strictly with the stated purpose of document automation and processing.
Capability Assessment
Purpose & Capability
Name, README, SKILL.md, requirements, and the shown source files (extractor, ocr, converter, manipulator, etc.) are coherent: the requested libraries (PyPDF2, pdfplumber, pytesseract, pdf2image, python-docx, openpyxl, reportlab, Pillow, camelot) match the described features (text extraction, table recognition, OCR, conversion, page manipulation, security). No unrelated cloud credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md gives concrete install and usage steps (pip install -r requirements.txt, install system Tesseract/poppler), and the runtime examples and APIs operate only on local PDF files. The instructions do not ask the agent to read unrelated host files, access external endpoints, or exfiltrate environment variables.
Install Mechanism
There is no special install spec (the skill relies on pip requirements and system packages). This is low risk in terms of arbitrary downloads, but the dependency list includes system-level components (Tesseract, poppler) and heavy Python packages (camelot, opencv, pdf2image) that require native libraries; the README documents those needs. Minor inconsistency: setup.py defines a console entry_point 'pdf-suite=pdf_intelligence_suite.cli:main' but no cli.py was listed in the manifest, which may be a packaging oversight (not necessarily malicious).
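The entry-point inconsistency noted above can be checked after installation without running the console script. This is a minimal sketch using `importlib.util.find_spec`; the helper names are hypothetical. It only confirms that the target module can be located, not that the `main` attribute exists inside it.

```python
import importlib.util


def module_exists(dotted_name):
    """Return True if `dotted_name` can be located by the import system."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError when a parent package is missing.
        return False


def entry_point_target_exists(entry_point):
    """Check the module part of a 'name=package.module:attr' entry-point string."""
    _, _, target = entry_point.partition("=")
    module_name = target.split(":")[0].strip()
    return module_exists(module_name)


if __name__ == "__main__":
    spec = "pdf-suite=pdf_intelligence_suite.cli:main"
    print(spec, "->", "ok" if entry_point_target_exists(spec) else "module not found")
```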
Credentials
The skill does not declare required environment variables or credentials. The README notes optional TESSDATA_PREFIX for nonstandard Tesseract installs (reasonable). No environment variables named SECRET/TOKEN/KEY are requested and the code shown does not read unrelated env vars.
Persistence & Privilege
The skill does not request always:true and has default invocation privileges. It does not attempt to modify other skills or system-wide agent configuration in the reviewed files.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install pdf-intelligence-suite
  3. After installation, invoke the skill by name or use /pdf-intelligence-suite
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
PDF Intelligence Suite 1.0.0 initial release:
  • Provides text extraction, table recognition (CSV/Excel), and OCR for PDFs.
  • Supports PDF to Word/Excel/image conversion.
  • Enables PDF page merging, splitting, rotating, deletion, and security features (encryption, decryption, watermark, digital signature).
  • Includes metadata read and edit capabilities.
  • Built with PyPDF2, pdfplumber, camelot-py, pytesseract, pdf2image, reportlab, and Pillow.
  • Example code and installation instructions included.
Metadata
Slug pdf-intelligence-suite
Version 1.0.0
License MIT-0
All-time Installs 1
Active Installs 1
Total Versions 1
Frequently Asked Questions

What is Pdf Intelligence Suite?

PDF Intelligence Suite provides text extraction, table recognition, OCR, PDF to Word/Excel conversion, and more. It is an AI Agent Skill for Claude Code / OpenClaw, with 223 downloads so far.

How do I install Pdf Intelligence Suite?

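Run "/install pdf-intelligence-suite" in the OpenClaw or Claude Code chat to install the skill in one step. OCR features additionally require the Tesseract engine to be installed on the system, as described in the README's Installation section.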

Is Pdf Intelligence Suite free?

Yes, Pdf Intelligence Suite is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Pdf Intelligence Suite support?

Pdf Intelligence Suite is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Pdf Intelligence Suite?

It is built and maintained by Lv Lancer (@kaiyuelv); the current version is v1.0.0.
