michael-laffin

PDF Text Extractor

Author: Michael-laffin · GitHub ↗ · v1.0.0
cross-platform ⚠ suspicious
Total downloads: 12518
Favorites: 20
Current installs: 139
Versions: 1
Install in OpenClaw
/install pdf-text-extractor
Description
Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
Usage (SKILL.md)

PDF-Text-Extractor - Extract Text from PDFs

Vernox Utility Skill - Perfect for document digitization.

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. Supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

Features

✅ Text Extraction

  • Extract text from PDFs without external tools
  • Support for both text-based and scanned PDFs
  • Preserve document structure and formatting
  • Fast extraction (milliseconds for text-based)

✅ OCR Support

  • Use Tesseract.js for scanned documents
  • Support multiple languages (English, Spanish, French, German)
  • Configurable OCR quality/speed
  • Fallback to text extraction when possible

✅ Batch Processing

  • Process multiple PDFs at once
  • Batch extraction for document workflows
  • Progress tracking for large files
  • Error handling and retry logic

✅ Output Options

  • Plain text output
  • JSON output with metadata
  • Markdown conversion
  • HTML output (preserving links)

✅ Utility Features

  • Page-by-page extraction
  • Character/word counting
  • Language detection
  • Metadata extraction (author, title, creation date)

Installation

clawhub install pdf-text-extractor

Quick Start

Extract Text from PDF

const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});

console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);

Batch Extract Multiple PDFs

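const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});

// extractBatch returns an object; the per-file results are in batch.results
console.log(`Extracted ${batch.successCount} of ${batch.results.length} PDFs`);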

Extract with OCR

const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

// OCR will be used (scanned document detected)

Tool Functions

extractText

Extract text content from a single PDF file.

Parameters:

  • pdfPath (string, required): Path to PDF file
  • options (object, optional): Extraction options
    • outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
    • ocr (boolean): Enable OCR for scanned docs
    • language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
    • preserveFormatting (boolean): Keep headings/structure
    • minConfidence (number): Minimum OCR confidence score (0-100)
    • ocrQuality (string): 'low' | 'medium' | 'high' (OCR quality/speed tradeoff)

Returns:

  • text (string): Extracted text content
  • pages (number): Number of pages processed
  • wordCount (number): Total word count
  • charCount (number): Total character count
  • language (string): Detected language
  • metadata (object): PDF metadata (title, author, creation date)
  • method (string): 'text' or 'ocr' (extraction method)
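
As a minimal sketch of how a caller might check an options object against the values documented above before calling extractText: the helper name validateOptions and its return shape are hypothetical and not part of the skill.

```javascript
// Hypothetical helper (not part of the skill): validates an options object
// against the values documented for extractText.
const OUTPUT_FORMATS = ['text', 'json', 'markdown', 'html'];
const OCR_LANGUAGES = ['eng', 'spa', 'fra', 'deu'];

function validateOptions(options = {}) {
  const errors = [];
  const { outputFormat, ocr, language, preserveFormatting, minConfidence } = options;

  if (outputFormat !== undefined && !OUTPUT_FORMATS.includes(outputFormat)) {
    errors.push(`outputFormat must be one of: ${OUTPUT_FORMATS.join(', ')}`);
  }
  if (ocr !== undefined && typeof ocr !== 'boolean') {
    errors.push('ocr must be a boolean');
  }
  if (language !== undefined && !OCR_LANGUAGES.includes(language)) {
    errors.push(`language must be one of: ${OCR_LANGUAGES.join(', ')}`);
  }
  if (preserveFormatting !== undefined && typeof preserveFormatting !== 'boolean') {
    errors.push('preserveFormatting must be a boolean');
  }
  if (
    minConfidence !== undefined &&
    (!Number.isFinite(minConfidence) || minConfidence < 0 || minConfidence > 100)
  ) {
    errors.push('minConfidence must be a number between 0 and 100');
  }
  // Collect every problem rather than stopping at the first one.
  return { valid: errors.length === 0, errors };
}

console.log(validateOptions({ outputFormat: 'pdf', minConfidence: 120 }));
```

Returning all errors at once, instead of throwing on the first, lets a batch workflow report every misconfigured field in a single pass.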

extractBatch

Extract text from multiple PDF files at once.

Parameters:

  • pdfFiles (array, required): Array of PDF file paths
  • options (object, optional): Same as extractText

Returns:

  • results (array): Array of extraction results
  • totalPages (number): Total pages across all PDFs
  • successCount (number): Successfully extracted
  • failureCount (number): Failed extractions
  • errors (array): Error details for failures
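
The return shape above can be illustrated with a small aggregation sketch. The helper summarizeBatch and the per-file outcome objects are hypothetical. The sketch also assumes that results holds one entry per input file, which is what the successCount/results.length output in the batch example suggests but the documentation does not state.

```javascript
// Hypothetical helper: folds per-file outcomes into the documented
// extractBatch return shape. Each outcome is assumed to look like
// { file, ok: true, result } or { file, ok: false, error }.
function summarizeBatch(outcomes) {
  const summary = {
    results: [],
    totalPages: 0,
    successCount: 0,
    failureCount: 0,
    errors: [],
  };

  for (const outcome of outcomes) {
    if (outcome.ok) {
      summary.results.push({ file: outcome.file, ...outcome.result });
      summary.totalPages += outcome.result.pages;
      summary.successCount += 1;
    } else {
      // Assumption: failed files still get a placeholder entry in results.
      summary.results.push({ file: outcome.file, error: outcome.error });
      summary.errors.push({ file: outcome.file, message: outcome.error });
      summary.failureCount += 1;
    }
  }
  return summary;
}

const summary = summarizeBatch([
  { file: 'a.pdf', ok: true, result: { text: 'alpha', pages: 3 } },
  { file: 'b.pdf', ok: false, error: 'Invalid PDF' },
  { file: 'c.pdf', ok: true, result: { text: 'gamma', pages: 5 } },
]);
console.log(`${summary.successCount}/${summary.results.length} succeeded, ${summary.totalPages} pages`);
```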

countWords

Count words in extracted text.

Parameters:

  • text (string, required): Text to count
  • options (object, optional):
    • minWordLength (number): Minimum characters per word (default: 3)
    • excludeNumbers (boolean): Don't count numbers as words
    • countByPage (boolean): Return word count per page

Returns:

  • wordCount (number): Total word count
  • charCount (number): Total character count
  • pageCounts (array): Word count per page
  • averageWordsPerPage (number): Average words per page
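
To make the counting options concrete, here is an illustrative re-implementation. It is not the skill's actual code. It assumes words are whitespace-separated tokens, that excludeNumbers drops purely numeric tokens, and that pages are separated by form-feed characters; none of these details are specified above.

```javascript
// Illustrative sketch only, not the skill's implementation.
// Assumption: pages are separated by form-feed characters ("\f").
function countWords(text, options = {}) {
  const { minWordLength = 3, excludeNumbers = false, countByPage = false } = options;
  const isNumber = (token) => /^\d+([.,]\d+)*$/.test(token);

  // Count whitespace-separated tokens that pass the length and number filters.
  const countIn = (chunk) =>
    chunk
      .split(/\s+/)
      .filter((token) => token.length > 0 && token.length >= minWordLength)
      .filter((token) => !(excludeNumbers && isNumber(token)))
      .length;

  const result = { wordCount: countIn(text), charCount: text.length };
  if (countByPage) {
    const pages = text.split('\f');
    result.pageCounts = pages.map(countIn);
    result.averageWordsPerPage = result.wordCount / pages.length;
  }
  return result;
}

console.log(countWords('The cat sat on a mat in 2024', { excludeNumbers: true }).wordCount);
```

Note that with the documented default of minWordLength: 3, short words such as "on" and "a" are not counted, so totals will be lower than a plain word count.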

detectLanguage

Detect the language of extracted text.

Parameters:

  • text (string, required): Text to analyze
  • minConfidence (number): Minimum confidence for detection

Returns:

  • language (string): Detected language code
  • languageName (string): Full language name
  • confidence (number): Confidence score (0-100)
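
The documentation does not say how detection works. As a rough illustration of the return shape only, the sketch below uses a simple stopword-counting heuristic for the four listed language codes. The heuristic, the word lists, the positional minConfidence argument, and the choice to return null below the threshold are all assumptions.

```javascript
// Illustrative stopword heuristic, not the skill's actual detector.
const STOPWORDS = {
  eng: { name: 'English', words: ['the', 'and', 'is', 'of', 'to', 'in', 'that', 'with'] },
  spa: { name: 'Spanish', words: ['el', 'la', 'los', 'que', 'y', 'en', 'es', 'con'] },
  fra: { name: 'French', words: ['le', 'les', 'et', 'est', 'dans', 'que', 'des', 'avec'] },
  deu: { name: 'German', words: ['der', 'die', 'das', 'und', 'ist', 'nicht', 'mit', 'ein'] },
};

function detectLanguage(text, minConfidence = 0) {
  // Split into lowercase runs of letters (Unicode-aware).
  const tokens = text.toLowerCase().match(/\p{L}+/gu) || [];
  let best = { language: null, languageName: null, hits: 0 };
  let totalHits = 0;

  for (const [code, { name, words }] of Object.entries(STOPWORDS)) {
    const hits = tokens.filter((token) => words.includes(token)).length;
    totalHits += hits;
    if (hits > best.hits) best = { language: code, languageName: name, hits };
  }

  // Confidence = share of all stopword hits that belong to the winning language.
  const confidence = totalHits === 0 ? 0 : Math.round((best.hits / totalHits) * 100);
  if (best.language === null || confidence < minConfidence) {
    return { language: null, languageName: null, confidence };
  }
  return { language: best.language, languageName: best.languageName, confidence };
}

console.log(detectLanguage('The cat is in the house and the dog is with me'));
```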

Use Cases

Document Digitization

  • Convert paper documents to digital text
  • Process invoices and receipts
  • Digitize contracts and agreements
  • Archive physical documents

Content Analysis

  • Extract text for analysis tools
  • Prepare content for LLM processing
  • Clean up scanned documents
  • Parse PDF-based reports

Data Extraction

  • Extract data from PDF reports
  • Parse tables from PDFs
  • Pull structured data
  • Automate document workflows

Text Processing

  • Prepare content for translation
  • Clean up OCR output
  • Extract specific sections
  • Search within PDF content

Performance

Text-Based PDFs

  • Speed: ~100ms for 10-page PDF
  • Accuracy: 100% (exact text)
  • Memory: ~10MB for typical document

OCR Processing

  • Speed: ~1-3s per page (high quality)
  • Accuracy: 85-95% (depends on scan quality)
  • Memory: ~50-100MB peak during OCR
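
For planning purposes, the approximate figures above can be turned into a back-of-the-envelope estimate. This sketch assumes processing time scales linearly with page count, which the documentation does not claim; the function name estimateDurationMs is hypothetical.

```javascript
// Rough estimate built only from the approximate figures quoted above.
// Assumption: time scales linearly with the number of pages.
const MS_PER_PAGE = {
  text: { min: 10, max: 10 },    // ~100 ms for a 10-page PDF
  ocr: { min: 1000, max: 3000 }, // ~1-3 s per page at high quality
};

function estimateDurationMs(pages, method) {
  if (!Object.prototype.hasOwnProperty.call(MS_PER_PAGE, method)) {
    throw new Error(`Unknown method: ${method}`);
  }
  const rate = MS_PER_PAGE[method];
  return { min: pages * rate.min, max: pages * rate.max };
}

console.log(estimateDurationMs(10, 'text')); // { min: 100, max: 100 }
console.log(estimateDurationMs(10, 'ocr'));  // { min: 10000, max: 30000 }
```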

Technical Details

PDF Parsing

  • Uses native PDF.js library
  • Extracts text layer directly (no OCR needed)
  • Preserves document structure
  • Handles password-protected PDFs

OCR Engine

  • Tesseract.js under the hood
  • Supports 100+ languages
  • Adjustable quality/speed tradeoff
  • Confidence scoring for accuracy

Dependencies

  • ZERO external dependencies
  • Uses Node.js built-in modules only
  • PDF.js included in skill
  • Tesseract.js bundled

Error Handling

Invalid PDF

  • Clear error message
  • Suggest fix (check file format)
  • Skip to next file in batch

OCR Failure

  • Report confidence score
  • Suggest rescan at higher quality
  • Fallback to basic extraction

Memory Issues

  • Stream processing for large files
  • Progress reporting
  • Graceful degradation

Configuration

Edit config.json:

{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
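
A minimal sketch of how a partial config.json could be layered over the defaults shown above. The functions mergeConfig and loadConfig are hypothetical, and the fallback to defaults when the file is missing is an assumption rather than documented behavior.

```javascript
const fs = require('fs');

// Defaults mirror the config.json shown above.
const DEFAULT_CONFIG = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium', languages: ['eng', 'spa', 'fra', 'deu'] },
  output: { defaultFormat: 'text', preserveFormatting: true, includeMetadata: true },
  batch: { maxConcurrent: 3, timeoutSeconds: 30 },
};

// Recursively merge plain objects; arrays and primitives in `overrides`
// replace the default value outright.
function mergeConfig(defaults, overrides) {
  const merged = { ...defaults };
  for (const [key, value] of Object.entries(overrides || {})) {
    const base = defaults[key];
    const bothPlainObjects =
      value !== null && typeof value === 'object' && !Array.isArray(value) &&
      base !== null && typeof base === 'object' && !Array.isArray(base);
    merged[key] = bothPlainObjects ? mergeConfig(base, value) : value;
  }
  return merged;
}

// Hypothetical loader: falls back to the defaults when the file is missing.
function loadConfig(path = './config.json') {
  if (!fs.existsSync(path)) return mergeConfig(DEFAULT_CONFIG, {});
  return mergeConfig(DEFAULT_CONFIG, JSON.parse(fs.readFileSync(path, 'utf8')));
}

const config = mergeConfig(DEFAULT_CONFIG, { ocr: { quality: 'high' } });
console.log(config.ocr.quality, config.ocr.defaultLanguage, config.batch.maxConcurrent);
```

Merging per section means a user file that only sets ocr.quality keeps the remaining OCR, output, and batch defaults.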

Examples

Extract from Invoice

const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."

Extract from Scanned Contract

const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."

Batch Process Documents

const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);

Troubleshooting

OCR Not Working

  • Check if PDF is truly scanned (not text-based)
  • Try different quality settings (low/medium/high)
  • Ensure language matches document
  • Check image quality of scan

Extraction Returns Empty

  • PDF may be image-only
  • OCR failed with low confidence
  • Try different language setting
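
One way to spot the image-only case from the caller's side is to check how much visible text came back per page. The helper below is a hypothetical sketch that uses only the text and pages fields documented for extractText; the threshold of 20 characters per page is an arbitrary assumption.

```javascript
// Hypothetical check, using only fields documented in the extractText result.
// The default threshold of 20 characters per page is an arbitrary assumption.
function looksImageOnly(result, minCharsPerPage = 20) {
  if (!result.pages) return false; // nothing to judge for a zero-page result
  const visibleChars = result.text.replace(/\s/g, '').length;
  return visibleChars / result.pages < minCharsPerPage;
}

console.log(looksImageOnly({ text: '', pages: 5 }));              // true
console.log(looksImageOnly({ text: 'x'.repeat(500), pages: 2 })); // false
```

A result flagged this way probably has no usable text layer, so the output should not be trusted as a complete extraction.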

Slow Processing

  • Large PDF takes longer
  • Reduce quality for speed
  • Process in smaller batches

Tips

Best Results

  • Use text-based PDFs when possible (faster, 100% accurate)
  • High-quality scans for OCR (300 DPI+)
  • Clean background before scanning
  • Use correct language setting

Performance Optimization

  • Batch processing for multiple files
  • Disable OCR for text-based PDFs
  • Lower OCR quality for speed when acceptable

Roadmap

  • PDF/A support
  • Advanced OCR pre-processing
  • Table extraction from OCR
  • Handwriting OCR
  • PDF form field extraction
  • Batch language detection
  • Confidence scoring visualization

License

MIT


Extract text from PDFs. Fast, accurate, zero dependencies. 🔮

Security Recommendations
Review before installing if your workflow depends on scanned-document OCR or audit confidence. This skill should be treated as embedded-text PDF extraction only; do not rely on its OCR claims or the method field as proof that OCR ran. Only point it at PDFs you are comfortable having read into the agent context.
Functional Analysis
Type: OpenClaw Skill
Name: pdf-text-extractor
Version: 1.0.0

The skill is designed to extract text from PDF files and includes basic text processing utilities. The code primarily uses `pdfjs-dist` for PDF parsing and `fs.readFileSync` to read the PDF and its own configuration. There is no evidence of data exfiltration, malicious execution, persistence mechanisms, or prompt injection attempts against the agent in the `SKILL.md` or `README.md` files.

A notable discrepancy is that the OCR functionality advertised in `SKILL.md` and `README.md` is not implemented in `index.js`, which only extracts embedded text from PDFs. This is a functional flaw, not a security vulnerability. The dependencies listed in `package-lock.json` include some deprecated packages, but these are common in Node.js projects and do not indicate intentional malicious behavior by the skill author.
Capability Assessment
Purpose & Capability
Core embedded PDF text extraction is purpose-aligned, but the artifacts repeatedly claim Tesseract/OCR support for scanned documents while index.js only uses pdfjs-dist text extraction and sets method to 'ocr' based on the input option rather than actual processing.
Instruction Scope
The runtime interface takes caller-supplied PDF file paths and batch lists, which is expected for this tool but can expose sensitive document text and metadata to the agent context.
Install Mechanism
The package has a normal npm dependency on pdfjs-dist and no first-party install hook, but the documentation's zero-dependency and bundled OCR claims are inaccurate; the lockfile also includes optional dependency install behavior from transitive packages.
Credentials
Local file reads are proportionate to PDF extraction, and review found no network calls, credential access, broad local indexing, unrelated filesystem mutation, or command execution in the skill implementation.
Persistence & Privilege
No startup hooks, background workers, persistence mechanisms, privilege escalation, destructive actions, or credential/session use were found.
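How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install pdf-text-extractor
  3. Once installed, invoke the Skill by name or trigger it with /pdf-text-extractor
  4. Provide the required inputs described in the Skill's parameter documentation to get structured output
Version History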
v1.0.0
Initial release: Extract text from PDFs with OCR support for digitizing documents
Metadata
Slug: pdf-text-extractor
Version: 1.0.0
License
Total installs: 140
Current installs: 139
Versions: 1
FAQ

What is PDF Text Extractor?

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 12518 total downloads to date.

How do I install PDF Text Extractor?

Run the command "/install pdf-text-extractor" in an OpenClaw or Claude Code chat to install it in one step; no additional configuration is needed.

Is PDF Text Extractor free?

Yes. PDF Text Extractor is completely free (free and open source) and can be downloaded, installed, and used freely.

Which platforms does PDF Text Extractor support?

PDF Text Extractor is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed PDF Text Extractor?

It is developed and maintained by Michael-laffin (@michael-laffin); the current version is v1.0.0.
