doc-extract-filter

by bigclawd · GitHub ↗ · v1.1.1 · MIT-0
cross-platform ✓ Security Clean
181 downloads · 1 star · 0 active installs · 4 versions
Install in OpenClaw
/install doc-extract-filter
Description
Extracts text from PDF, Word, and Excel files and filters it by keyword, returning either the full text or the filtered text.
README (SKILL.md)

doc-extract-filter

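Metadata

Basic information

  • name: doc-extract-filter
  • description: File-processing skill that supports text extraction from multiple file formats, keyword/regex filtering, exclusion filtering, and batch file processing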
  • version: 1.1.1
  • author: file-agent team
  • license: MIT-0

OpenClaw configuration

{
  "name": "doc-extract-filter",
  "description": "File-processing skill that supports text extraction from multiple file formats, keyword/regex filtering, exclusion filtering, and batch file processing",
  "version": "1.1.1",
  "author": "file-agent team",
  "license": "MIT-0",
  "type": "tool",
  "entry_point": "scripts/doc-extract-filter.py",
  "parameters": {
    "file_path": {
      "type": "string",
      "description": "File path",
      "required": false
    },
    "action": {
      "type": "string",
      "description": "Action type: extract or filter",
      "required": true
    },
    "keywords": {
      "type": "array",
      "description": "Keyword list (filter action only)",
      "required": false
    },
    "regex": {
      "type": "string",
      "description": "Regular expression pattern (filter action only)",
      "required": false
    },
    "enable_ocr": {
      "type": "boolean",
      "description": "Enable OCR support (for scanned PDFs)",
      "required": false
    },
    "exclude_keywords": {
      "type": "array",
      "description": "List of keywords to exclude (filter action only)",
      "required": false
    },
    "exclude_regex": {
      "type": "string",
      "description": "Regular expression pattern to exclude (filter action only)",
      "required": false
    },
    "context_length": {
      "type": "integer",
      "description": "Context length (default 50 characters)",
      "required": false
    },
    "filter_level": {
      "type": "string",
      "description": "Filter level: line (by line) or paragraph (by paragraph)",
      "required": false
    },
    "batch": {
      "type": "boolean",
      "description": "Enable batch processing mode",
      "required": false
    },
    "input_dir": {
      "type": "string",
      "description": "Input folder path for batch processing",
      "required": false
    },
    "file_paths": {
      "type": "array",
      "description": "List of files for batch processing",
      "required": false
    },
    "output_dir": {
      "type": "string",
      "description": "Output directory for batch results",
      "required": false
    },
    "merge_results": {
      "type": "boolean",
      "description": "Whether to merge the results of all files into a single JSON file",
      "required": false
    }
  }
}

CoPaw configuration

name: doc-extract-filter
description: File-processing skill that supports text extraction from multiple file formats, keyword/regex filtering, exclusion filtering, and batch file processing
version: 1.1.1
author: file-agent team
license: MIT-0
type: tool
entry_point: scripts/doc-extract-filter.py
parameters:
  file_path:
    type: string
    description: File path
    required: false
  action:
    type: string
    description: "Action type: extract or filter"
    required: true
  keywords:
    type: array
    description: Keyword list (filter action only)
    required: false
  regex:
    type: string
    description: Regular expression pattern (filter action only)
    required: false
  enable_ocr:
    type: boolean
    description: Enable OCR support (for scanned PDFs)
    required: false
  exclude_keywords:
    type: array
    description: List of keywords to exclude (filter action only)
    required: false
  exclude_regex:
    type: string
    description: Regular expression pattern to exclude (filter action only)
    required: false
  context_length:
    type: integer
    description: Context length (default 50 characters)
    required: false
  filter_level:
    type: string
    description: "Filter level: line (by line) or paragraph (by paragraph)"
    required: false
  batch:
    type: boolean
    description: Enable batch processing mode
    required: false
  input_dir:
    type: string
    description: Input folder path for batch processing
    required: false
  file_paths:
    type: array
    description: List of files for batch processing
    required: false
  output_dir:
    type: string
    description: Output directory for batch results
    required: false
  merge_results:
    type: boolean
    description: Whether to merge the results of all files into a single JSON file
    required: false
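
The parameter spec above can be checked on the caller's side before the skill is invoked. The following is a minimal sketch, not part of the skill itself: the helper name validate_params is hypothetical, the type mapping is copied from the configuration, and the allowed values for action and filter_level are taken from the parameter descriptions.

```python
from typing import Any, Dict, List

# Parameter names and types copied from the configuration above.
PARAM_TYPES = {
    "file_path": str, "action": str, "keywords": list, "regex": str,
    "enable_ocr": bool, "exclude_keywords": list, "exclude_regex": str,
    "context_length": int, "filter_level": str, "batch": bool,
    "input_dir": str, "file_paths": list, "output_dir": str,
    "merge_results": bool,
}
REQUIRED = {"action"}  # the only parameter marked required: true


def validate_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    problems = []
    for name in sorted(REQUIRED - set(params)):
        problems.append(f"missing required parameter: {name}")
    for name, value in params.items():
        expected = PARAM_TYPES.get(name)
        if expected is None:
            problems.append(f"unknown parameter: {name}")
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        elif not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        ):
            problems.append(f"{name} should be {expected.__name__}")
    if params.get("action") not in (None, "extract", "filter"):
        problems.append("action must be 'extract' or 'filter'")
    if params.get("filter_level") not in (None, "line", "paragraph"):
        problems.append("filter_level must be 'line' or 'paragraph'")
    return problems


if __name__ == "__main__":
    print(validate_params({"action": "filter", "keywords": ["invoice"]}))
```

Checking types up front gives a clearer message than a failure deep inside extraction, but the skill's own validation rules may differ from this sketch.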

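Release notes

  • Version 1.1.1: added formats, improved compatibility, and enhanced filtering
    • Added extraction support for CSV, Markdown (.md), and WPS (.wps/.et) files
    • Fixed extraction issues with merged Excel cells, scanned PDFs, and Word documents that mix text and images
    • Added the optional --enable-ocr parameter for lightweight OCR extraction from scanned PDFs
    • Added automatic format detection, so the user no longer has to specify the file type
    • Added the --exclude-keywords/--exclude-regex parameters for excluding specified content
    • Added the --context-length N parameter, which returns the context around filter results (default 50 characters)
    • Added the --filter-level (line/paragraph) parameter for filtering by line or by paragraph
    • In batch mode, the enhanced filtering applies automatically and filter details are kept per file
    • Added the dependencies tesseract (optional) and python-markdown to requirements.txt, marked as optional
  • Version 1.1.0: added batch file processing
  • Version 1.0.3: added regular expression filtering
  • Version 1.0.2: removed unused dependencies and streamlined the project structure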
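
To make the --context-length option concrete, here is a minimal sketch of one plausible reading: each match is returned together with up to N characters on either side. The function name matches_with_context is hypothetical, and the skill's actual windowing rules are not documented here, so treat this as an illustration only.

```python
import re
from typing import List


def matches_with_context(text: str, pattern: str, context_length: int = 50) -> List[str]:
    """Return each regex match padded with up to context_length characters per side."""
    snippets = []
    for match in re.finditer(pattern, text):
        # Clamp the window so it never runs past either end of the text.
        start = max(0, match.start() - context_length)
        end = min(len(text), match.end() + context_length)
        snippets.append(text[start:end])
    return snippets


if __name__ == "__main__":
    sample = "abcdef 2024-01-15 ghijkl"
    print(matches_with_context(sample, r"\d{4}-\d{2}-\d{2}", context_length=3))
```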

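Usage

Features

  • extract: extracts the text content of a file; multiple file formats are supported
  • filter: extracts the text of a file and keeps the content that contains the given keywords or matches the regular expression; exclusion filtering is supported
  • batch: processes multiple files in one run, either by traversing a folder or from a list of files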
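
The filter behaviour described above (include by keyword or regex, then drop anything matching an exclusion) can be sketched as follows. This is an illustration under assumptions, not the skill's implementation: the function name filter_text is hypothetical, keywords are matched as plain substrings, and paragraphs are assumed to be separated by blank lines.

```python
import re
from typing import List, Optional, Sequence


def filter_text(
    text: str,
    keywords: Sequence[str] = (),
    regex: Optional[str] = None,
    exclude_keywords: Sequence[str] = (),
    exclude_regex: Optional[str] = None,
    filter_level: str = "line",
) -> List[str]:
    """Keep the lines (or paragraphs) that match an include rule and no exclude rule."""
    if filter_level == "paragraph":
        units = re.split(r"\n\s*\n", text)  # assumption: blank lines separate paragraphs
    else:
        units = text.splitlines()

    kept = []
    for unit in units:
        included = any(k in unit for k in keywords) or (
            regex is not None and re.search(regex, unit) is not None
        )
        excluded = any(k in unit for k in exclude_keywords) or (
            exclude_regex is not None and re.search(exclude_regex, unit) is not None
        )
        if included and not excluded:
            kept.append(unit.strip())
    return kept


if __name__ == "__main__":
    doc = "unit price: 10\nsubtotal: 20\nunit price and subtotal: 30\n\ntotal: 50"
    print(filter_text(doc, keywords=["unit price"], exclude_keywords=["subtotal"]))
```

Applying the exclusion after the inclusion test means an excluded term always wins, which is the usual expectation for an exclude option.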

Invocation

Calling from the CLI

# Process a single file
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "extract"
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --keywords "keyword1,keyword2"
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --regex "\d{4}-\d{2}-\d{2}"

# Extract a scanned PDF (OCR enabled)
python scripts/doc-extract-filter.py --file_path "path/to/scanned.pdf" --action "extract" --enable-ocr

# Filter while excluding specified content
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --keywords "keyword" --exclude-keywords "excluded-term"

# Set the context length and the filter level
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --keywords "keyword" --context-length 100 --filter-level "paragraph"

# Batch processing: folder path
python scripts/doc-extract-filter.py --batch --input-dir "path/to/folder" --action "extract" --output-dir "batch-results"

# Batch processing: file list
python scripts/doc-extract-filter.py --batch --file-paths "path/to/file1.pdf,path/to/file2.docx" --action "extract" --output-dir "batch-results"

# Batch processing with merged results
python scripts/doc-extract-filter.py --batch --input-dir "path/to/folder" --action "extract" --output-dir "batch-results" --merge-results

# Batch filtering
python scripts/doc-extract-filter.py --batch --input-dir "path/to/folder" --action "filter" --keywords "keyword" --output-dir "batch-results"
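
When the CLI is driven from another program, building the argument list in code sidesteps shell quoting problems with regex backslashes. The sketch below uses only the flags shown in the commands above; the helper names build_command and run_skill are hypothetical, and the format of the script's standard output is not assumed.

```python
import subprocess
import sys
from typing import List, Optional, Sequence

SCRIPT = "scripts/doc-extract-filter.py"  # entry point named in the configuration


def build_command(
    file_path: str,
    action: str,
    keywords: Sequence[str] = (),
    regex: Optional[str] = None,
    enable_ocr: bool = False,
) -> List[str]:
    """Assemble the argument list for a single-file run."""
    cmd = [sys.executable, SCRIPT, "--file_path", file_path, "--action", action]
    if keywords:
        cmd += ["--keywords", ",".join(keywords)]  # the CLI takes a comma-separated list
    if regex:
        cmd += ["--regex", regex]
    if enable_ocr:
        cmd.append("--enable-ocr")
    return cmd


def run_skill(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command without a shell and capture its output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    print(build_command("path/to/file.pdf", "filter", regex=r"\d{4}-\d{2}-\d{2}"))
```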

Calling from Python

from scripts.doc_extract_filter import DocExtractFilter

# Extract text
result = DocExtractFilter.process("path/to/file.pdf", "extract")

# Extract a scanned PDF (OCR enabled)
result = DocExtractFilter.process("path/to/scanned.pdf", "extract", enable_ocr=True)

# Filter by keywords
result = DocExtractFilter.process("path/to/file.pdf", "filter", ["keyword1", "keyword2"])

# Filter while excluding specified content
result = DocExtractFilter.process("path/to/file.pdf", "filter", ["keyword"], exclude_keywords=["excluded-term"])

# Set the context length and the filter level
result = DocExtractFilter.process("path/to/file.pdf", "filter", ["keyword"], context_length=100, filter_level="paragraph")

# Filter with a regular expression (raw string, so \d is not treated as a string escape)
result = DocExtractFilter.process("path/to/file.pdf", "filter", regex_pattern=r"\d{4}-\d{2}-\d{2}")

# Batch processing: folder path
result = DocExtractFilter.batch_process(
    input_dir="path/to/folder",
    action="extract",
    output_dir="batch-results"
)

# Batch processing: file list
result = DocExtractFilter.batch_process(
    file_paths=["path/to/file1.pdf", "path/to/file2.docx"],
    action="extract",
    output_dir="batch-results"
)

# Batch processing with merged results
result = DocExtractFilter.batch_process(
    input_dir="path/to/folder",
    action="extract",
    output_dir="batch-results",
    merge_results=True
)
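
Note that the entry point is named scripts/doc-extract-filter.py, and a file name containing hyphens cannot be imported with a plain import statement. If the package does not also ship an importable doc_extract_filter module, loading the file by path is one workaround. The sketch below uses only the standard library; the helper name load_module_from_path is hypothetical, and whether DocExtractFilter is defined in that file is an assumption.

```python
import importlib.util
from pathlib import Path


def load_module_from_path(path, module_name="doc_extract_filter"):
    """Load a Python source file as a module, even if its file name has hyphens."""
    spec = importlib.util.spec_from_file_location(module_name, Path(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


# Usage (assumes the class is defined in the entry-point script):
# module = load_module_from_path("scripts/doc-extract-filter.py")
# DocExtractFilter = module.DocExtractFilter
```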

Return format

{
  "success": true,
  "data": {
    "text": "extracted text",
    "filtered_text": "filtered text"
  },
  "error": ""
}

The filtered_text field is returned only for the filter action.

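Error handling

  • File not found: returns an error message
  • Unsupported file type: returns an error message
  • Operation failed: returns an error message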
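
A caller will usually want to turn a failed result into an exception rather than inspect the dictionary by hand. The following is a minimal sketch that relies only on the success, data, and error fields shown in the return format; the names SkillError and unwrap are hypothetical.

```python
from typing import Any, Dict


class SkillError(RuntimeError):
    """Raised when the result dictionary reports a failure."""


def unwrap(result: Dict[str, Any], action: str = "extract") -> str:
    """Return the text for a successful result, or raise SkillError with the message."""
    if not result.get("success"):
        raise SkillError(result.get("error") or "unknown error")
    data = result.get("data") or {}
    # filtered_text is present only for the filter action.
    key = "filtered_text" if action == "filter" else "text"
    return data.get(key, "")


if __name__ == "__main__":
    ok = {"success": True, "data": {"text": "hello"}, "error": ""}
    print(unwrap(ok))
```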

Installation and testing

Installation

  1. Copy the doc-extract-filter directory into the skills directory of OpenClaw/CoPaw
  2. Run pip install -r requirements.txt to install the dependencies

Testing

Use docs/test.pdf to test the skill:

# Test text extraction
python scripts/doc-extract-filter.py --file_path "docs/test.pdf" --action "extract"

# Test keyword filtering (单价 = unit price, 小计 = subtotal, 总金额 = total amount)
python scripts/doc-extract-filter.py --file_path "docs/test.pdf" --action "filter" --keywords "单价,小计,总金额"

# Test exclusion filtering
python scripts/doc-extract-filter.py --file_path "docs/test.pdf" --action "filter" --keywords "单价" --exclude-keywords "小计"

Standalone operation

doc-extract-filter now contains all the core code it needs and can run standalone, without depending on an external src directory.

Usage Guidance
This skill is internally consistent and appears to do what it claims, but it runs code on your agent and will read/write any file paths you pass to it. Before installing or invoking: (1) ensure the Python environment has the required packages (requirements.txt) and tesseract if you need OCR; (2) avoid pointing it at sensitive system or credential directories, because it will traverse directories you give it in batch mode; (3) run it on non-sensitive sample files first to confirm behavior and output formats; (4) if you need stronger isolation, run the skill in a sandboxed environment or container.
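
Before pointing batch mode at a folder, it can help to preview which files are likely to be picked up. The sketch below is a standalone helper, not the skill's own traversal logic: the name preview_batch is hypothetical, the extension list is an assumption based on the formats named in this document (with .docx and .xlsx assumed for Word and Excel), and recursing into subfolders is a deliberately conservative guess.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed extensions for the formats this document lists.
ASSUMED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".md", ".wps", ".et"}


def preview_batch(input_dir: str, extensions: Iterable[str] = ASSUMED_EXTENSIONS) -> List[Path]:
    """List files under input_dir whose suffix is in the assumed extension set."""
    wanted = {ext.lower() for ext in extensions}
    root = Path(input_dir)
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in wanted
    )


if __name__ == "__main__":
    for path in preview_batch("."):
        print(path)
```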
Capability Analysis
Type: OpenClaw Skill
Name: doc-extract-filter
Version: 1.1.1
The doc-extract-filter skill bundle is a legitimate document processing utility designed for text extraction and keyword/regex filtering across various formats (PDF, Word, Excel, CSV, Markdown, and WPS). The implementation in scripts/doc-extract-filter.py and the core/ directory uses standard, well-known libraries like PyPDF2, python-docx, and openpyxl. There is no evidence of data exfiltration, malicious command execution, or prompt injection attempts; all file system operations are localized to the user-specified paths for input and output.
Capability Assessment
Purpose & Capability
Name/description (extract/filter text from documents) matches the included code: extractor, filter, converter, and utils implement extraction, keyword/regex filtering, batch processing and result export. Declared CLI/API parameters align with implementation.
Instruction Scope
SKILL.md and entry script instruct the agent to read specified files or directories, extract text, filter matches, and optionally write JSON/text outputs. The instructions do not request unrelated data, secrets, or remote endpoints. Note: batch mode traverses directories and will process any supported files accessible to the running agent—this is expected but relevant for sensitive directories.
Install Mechanism
This is instruction/code-based (no install spec). A requirements.txt is provided but there is no automated installer; the runtime environment must have the listed Python packages. OCR functionality additionally requires system tesseract and pdf2image/Pillow; missing optional dependencies are handled in code (falls back to non-OCR extraction).
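
Since there is no automated installer, it can be useful to check the OCR prerequisites named above before passing --enable-ocr. This is a minimal sketch using only the standard library; the function names are hypothetical, and it only checks that the tesseract executable and the pdf2image and Pillow packages can be found, not that they work.

```python
import importlib.util
import shutil
from typing import Dict


def ocr_prerequisites() -> Dict[str, bool]:
    """Report whether each OCR prerequisite named in this document can be located."""
    return {
        # System binary: looked up on PATH.
        "tesseract": shutil.which("tesseract") is not None,
        # Python packages: located without importing them.
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        "Pillow": importlib.util.find_spec("PIL") is not None,
    }


def ocr_available() -> bool:
    """True only if every prerequisite was found."""
    return all(ocr_prerequisites().values())


if __name__ == "__main__":
    for name, found in ocr_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```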
Credentials
The skill does not request environment variables, credentials, or config paths. All I/O is local-file-based as described. There are no requests for unrelated service keys or tokens.
Persistence & Privilege
Skill is not marked always:true and does not modify other skills or system-wide agent settings. It performs file reads/writes within the paths provided by the caller, which is appropriate for its purpose.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install doc-extract-filter
  3. After installation, invoke the skill by name or use /doc-extract-filter
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.1.1
Version 1.1.1 introduces major compatibility and feature enhancements:
  • Added support for CSV, Markdown (.md), and WPS (.wps/.et) file extraction.
  • Improved extraction for merged Excel cells, scanned PDFs (OCR support), and Word documents with images.
  • Introduced --enable-ocr parameter for lightweight OCR on scanned PDFs.
  • Added automatic file type detection and new exclusion filter options (--exclude-keywords, --exclude-regex).
  • Supports context extraction with --context-length and filtering by line or paragraph (--filter-level).
  • Enhanced batch processing with detailed results and merged output options.
  • Updated dependencies, including optional support for tesseract and python-markdown.
v1.0.3
  • Added support for regex-based filtering of extracted text.
  • Updated configuration to include a "regex" parameter for filter operations.
  • Updated documentation and usage examples to reflect new regex filtering capability.
v1.0.2
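  • Bumped the version number to 1.0.2.
  • Removed unused dependencies.
  • Streamlined the project structure.
  • Updated the version number in the OpenClaw and CoPaw configurations.
  • Added a "Release notes" section that spells out the changes.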
v1.0.0
Initial release of doc-extract-filter:
  • Supports text extraction and keyword filtering for PDF, Word, and Excel files.
  • Provides both CLI and Python API usage.
  • Returns extraction results and error details in a structured JSON format.
  • Includes OpenClaw and CoPaw configuration for integration.
  • Can be installed directly and run independently without external dependencies.
Metadata
Slug doc-extract-filter
Version 1.1.1
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 4
Frequently Asked Questions

What is doc-extract-filter?

It extracts text from PDF, Word, and Excel files and filters it by keyword, returning either the full text or the filtered text. It is an AI Agent Skill for Claude Code / OpenClaw, with 181 downloads so far.

How do I install doc-extract-filter?

Run "/install doc-extract-filter" in the OpenClaw or Claude Code chat to install it. The Python packages listed in requirements.txt must be available in the runtime environment, and OCR additionally requires tesseract.

Is doc-extract-filter free?

Yes, doc-extract-filter is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does doc-extract-filter support?

doc-extract-filter is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created doc-extract-filter?

It is built and maintained by bigclawd (@bigclawd); the current version is v1.1.1.
