bigclawd

doc-extract-filter

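Author: bigclawd · GitHub ↗ · v1.1.1 · MIT-0
cross-platform ✓ Security check passed
181 total downloads · 1 favorite · 0 current installs · 4 versions
Install in OpenClaw
/install doc-extract-filter
Description
Extracts text from PDF, Word, and Excel files and filters it by keyword, returning either the full text or the filtered text.
Usage (SKILL.md)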

doc-extract-filter

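Metadata

Basic information

  • name: doc-extract-filter
  • description: File-processing skill that supports text extraction from multiple file formats, keyword/regular-expression filtering, exclusion filtering, and batch file processing
  • version: 1.1.1
  • author: file-agent team
  • license: MIT-0

OpenClaw configuration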

{
  "name": "doc-extract-filter",
  "description": "File-processing skill that supports text extraction from multiple file formats, keyword/regular-expression filtering, exclusion filtering, and batch file processing",
  "version": "1.1.1",
  "author": "file-agent team",
  "license": "MIT-0",
  "type": "tool",
  "entry_point": "scripts/doc-extract-filter.py",
  "parameters": {
    "file_path": {
      "type": "string",
      "description": "File path",
      "required": false
    },
    "action": {
      "type": "string",
      "description": "Action type: extract or filter",
      "required": true
    },
    "keywords": {
      "type": "array",
      "description": "Keyword list (filter action only)",
      "required": false
    },
    "regex": {
      "type": "string",
      "description": "Regular-expression pattern (filter action only)",
      "required": false
    },
    "enable_ocr": {
      "type": "boolean",
      "description": "Enable OCR support (for scanned PDFs)",
      "required": false
    },
    "exclude_keywords": {
      "type": "array",
      "description": "Exclusion keyword list (filter action only)",
      "required": false
    },
    "exclude_regex": {
      "type": "string",
      "description": "Exclusion regular-expression pattern (filter action only)",
      "required": false
    },
    "context_length": {
      "type": "integer",
      "description": "Context length (default: 50 characters)",
      "required": false
    },
    "filter_level": {
      "type": "string",
      "description": "Filter level: line (by line) or paragraph (by paragraph)",
      "required": false
    },
    "batch": {
      "type": "boolean",
      "description": "Enable batch processing mode",
      "required": false
    },
    "input_dir": {
      "type": "string",
      "description": "Input folder path for batch processing",
      "required": false
    },
    "file_paths": {
      "type": "array",
      "description": "File list for batch processing",
      "required": false
    },
    "output_dir": {
      "type": "string",
      "description": "Output directory for batch results",
      "required": false
    },
    "merge_results": {
      "type": "boolean",
      "description": "Whether to merge the results of all files into a single JSON file",
      "required": false
    }
  }
}

CoPaw configuration

name: doc-extract-filter
description: File-processing skill that supports text extraction from multiple file formats, keyword/regular-expression filtering, exclusion filtering, and batch file processing
version: 1.1.1
author: file-agent team
license: MIT-0
type: tool
entry_point: scripts/doc-extract-filter.py
parameters:
  file_path:
    type: string
    description: File path
    required: false
  action:
    type: string
    description: "Action type: extract or filter"
    required: true
  keywords:
    type: array
    description: Keyword list (filter action only)
    required: false
  regex:
    type: string
    description: Regular-expression pattern (filter action only)
    required: false
  enable_ocr:
    type: boolean
    description: Enable OCR support (for scanned PDFs)
    required: false
  exclude_keywords:
    type: array
    description: Exclusion keyword list (filter action only)
    required: false
  exclude_regex:
    type: string
    description: Exclusion regular-expression pattern (filter action only)
    required: false
  context_length:
    type: integer
    description: "Context length (default: 50 characters)"
    required: false
  filter_level:
    type: string
    description: "Filter level: line (by line) or paragraph (by paragraph)"
    required: false
  batch:
    type: boolean
    description: Enable batch processing mode
    required: false
  input_dir:
    type: string
    description: Input folder path for batch processing
    required: false
  file_paths:
    type: array
    description: File list for batch processing
    required: false
  output_dir:
    type: string
    description: Output directory for batch results
    required: false
  merge_results:
    type: boolean
    description: Whether to merge the results of all files into a single JSON file
    required: false
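
The two configurations mark only action as required, yet the parameter descriptions imply a few cross-field rules. The sketch below shows one way a caller might check a parameter dictionary before invoking the skill. The function name validate_params and the specific rules (filter needs keywords or regex; batch mode needs input_dir or file_paths) are assumptions made for illustration, not behavior confirmed by the skill's code.

```python
from typing import Any

VALID_ACTIONS = {"extract", "filter"}
VALID_FILTER_LEVELS = {"line", "paragraph"}


def validate_params(params: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: list[str] = []

    action = params.get("action")
    if action not in VALID_ACTIONS:
        problems.append("action must be 'extract' or 'filter'")

    # Assumption: a filter call needs at least one inclusion criterion.
    if action == "filter" and not (params.get("keywords") or params.get("regex")):
        problems.append("filter needs keywords or regex")

    level = params.get("filter_level")
    if level is not None and level not in VALID_FILTER_LEVELS:
        problems.append("filter_level must be 'line' or 'paragraph'")

    # Assumption: batch mode reads input_dir or file_paths,
    # while single-file mode reads file_path.
    if params.get("batch"):
        if not (params.get("input_dir") or params.get("file_paths")):
            problems.append("batch mode needs input_dir or file_paths")
    elif not params.get("file_path"):
        problems.append("file_path is required outside batch mode")

    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.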

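Changelog

  • Version 1.1.1: added format support, improved compatibility, and enhanced filtering
    • Added extraction support for CSV, Markdown (.md), and WPS (.wps/.et) files
    • Fixed extraction issues with merged Excel cells, scanned PDFs, and Word documents that mix text and images
    • Added the optional --enable-ocr parameter for lightweight OCR extraction from scanned PDFs
    • Added smart format detection that identifies the file type automatically, so the user does not need to specify it
    • Added the --exclude-keywords/--exclude-regex parameters for excluding specified content
    • Added the --context-length N parameter, which returns context around each filtered result (default: 50 characters)
    • Added the --filter-level (line/paragraph) parameter for filtering by line or by paragraph
    • In batch mode, the enhanced filtering logic adapts automatically, and filter details are kept per file
    • Added tesseract (optional) and python-markdown as dependencies, listed in requirements.txt and marked as optional
  • Version 1.1.0: added batch file processing
  • Version 1.0.3: added regular-expression filtering
  • Version 1.0.2: removed unused dependencies and optimized the project structure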
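
The changelog mentions automatic file-type detection without saying how it works. A common approach, sketched here purely as an assumption, is to trust a recognizable file signature first and fall back to the extension. The names BY_SUFFIX and detect_format are hypothetical, and the suffix table is only inferred from the formats this document lists.

```python
from pathlib import Path

# Assumption: suffix table inferred from the formats named in this document.
BY_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "word",
    ".xlsx": "excel",
    ".csv": "csv",
    ".md": "markdown",
    ".wps": "wps",
    ".et": "wps",
}


def detect_format(file_name: str, head: bytes = b"") -> str:
    """Guess a format from the leading bytes first, then from the suffix."""
    # PDF files start with the signature "%PDF-".
    if head.startswith(b"%PDF-"):
        return "pdf"
    suffix = Path(file_name).suffix.lower()
    return BY_SUFFIX.get(suffix, "unknown")
```

A caller would typically pass the first few bytes of the file as head, so that a PDF with a wrong or missing extension is still recognized.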

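Usage

Features

  • extract: extracts the text content of a file; multiple file formats are supported
  • filter: extracts the text of a file and keeps the content that contains the given keywords or matches the given regular expression; exclusion filters are supported
  • batch: processes multiple files in one run, from either a folder traversal or an explicit file list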
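
To make the filter semantics concrete, here is a minimal line-level sketch. It assumes that a line is kept when it matches any keyword or the regular expression, and that any exclusion match then removes it; the skill's actual precedence rules are not documented here. The function filter_lines is hypothetical and is not part of the skill's API.

```python
import re
from typing import Iterable, Optional


def filter_lines(
    text: str,
    keywords: Iterable[str] = (),
    regex: Optional[str] = None,
    exclude_keywords: Iterable[str] = (),
    exclude_regex: Optional[str] = None,
) -> list[str]:
    """Keep lines that match any inclusion rule and no exclusion rule."""
    keywords = list(keywords)
    exclude_keywords = list(exclude_keywords)
    include_re = re.compile(regex) if regex else None
    exclude_re = re.compile(exclude_regex) if exclude_regex else None

    kept = []
    for line in text.splitlines():
        included = any(k in line for k in keywords) or bool(
            include_re and include_re.search(line)
        )
        excluded = any(k in line for k in exclude_keywords) or bool(
            exclude_re and exclude_re.search(line)
        )
        # Assumption: exclusion always wins over inclusion.
        if included and not excluded:
            kept.append(line)
    return kept
```

With no inclusion rule at all, this sketch returns an empty list rather than the whole text.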

Invocation

CLI

# Process a single file
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "extract"
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --keywords "keyword1,keyword2"
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --regex "\d{4}-\d{2}-\d{2}"

# Extract a scanned PDF (OCR enabled)
python scripts/doc-extract-filter.py --file_path "path/to/scanned.pdf" --action "extract" --enable-ocr

# Filter and exclude specified content
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --keywords "keyword" --exclude-keywords "excluded-word"

# Set the context length and the filter level
python scripts/doc-extract-filter.py --file_path "path/to/file.pdf" --action "filter" --keywords "keyword" --context-length 100 --filter-level "paragraph"

# Batch processing: folder path
python scripts/doc-extract-filter.py --batch --input-dir "path/to/folder" --action "extract" --output-dir "batch-results"

# Batch processing: file list
python scripts/doc-extract-filter.py --batch --file-paths "path/to/file1.pdf,path/to/file2.docx" --action "extract" --output-dir "batch-results"

# Batch processing with merged results
python scripts/doc-extract-filter.py --batch --input-dir "path/to/folder" --action "extract" --output-dir "batch-results" --merge-results

# Batch filtering
python scripts/doc-extract-filter.py --batch --input-dir "path/to/folder" --action "filter" --keywords "keyword" --output-dir "batch-results"
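
The --context-length and --filter-level options are easier to reason about with a small model of what they might do. The sketch below assumes that the context length counts characters on each side of a match and that paragraphs are separated by blank lines; neither detail is confirmed by this document. Both function names are hypothetical.

```python
import re


def match_with_context(text: str, keyword: str, context_length: int = 50) -> list[str]:
    """Return each keyword occurrence with up to context_length characters per side."""
    if not keyword:
        return []
    snippets = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        lo = max(0, start - context_length)
        hi = min(len(text), end + context_length)
        snippets.append(text[lo:hi])
        start = text.find(keyword, end)
    return snippets


def split_units(text: str, filter_level: str = "line") -> list[str]:
    """Split text into the units to filter: lines, or blank-line-separated paragraphs."""
    if filter_level == "paragraph":
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [line for line in text.splitlines() if line.strip()]
```

Under these assumptions, paragraph-level filtering returns whole paragraphs that contain a match, which usually gives more readable results for prose than single lines.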

Python API

from scripts.doc_extract_filter import DocExtractFilter

# Extract text
result = DocExtractFilter.process("path/to/file.pdf", "extract")

# Extract a scanned PDF (OCR enabled)
result = DocExtractFilter.process("path/to/scanned.pdf", "extract", enable_ocr=True)

# Filter by keywords
result = DocExtractFilter.process("path/to/file.pdf", "filter", ["keyword1", "keyword2"])

# Filter and exclude specified content
result = DocExtractFilter.process("path/to/file.pdf", "filter", ["keyword"], exclude_keywords=["excluded-word"])

# Set the context length and the filter level
result = DocExtractFilter.process("path/to/file.pdf", "filter", ["keyword"], context_length=100, filter_level="paragraph")

# Filter with a regular expression (raw string, so the backslashes reach the regex engine unchanged)
result = DocExtractFilter.process("path/to/file.pdf", "filter", regex_pattern=r"\d{4}-\d{2}-\d{2}")

# Batch processing: folder path
result = DocExtractFilter.batch_process(
    input_dir="path/to/folder",
    action="extract",
    output_dir="batch-results"
)

# Batch processing: file list
result = DocExtractFilter.batch_process(
    file_paths=["path/to/file1.pdf", "path/to/file2.docx"],
    action="extract",
    output_dir="batch-results"
)

# Batch processing with merged results
result = DocExtractFilter.batch_process(
    input_dir="path/to/folder",
    action="extract",
    output_dir="batch-results",
    merge_results=True
)
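
Batch mode accepts either a folder or an explicit file list. As a rough illustration of how the inputs might be gathered, the sketch below walks a folder recursively and keeps files with a supported suffix. Whether the real traversal is recursive, and which suffixes it accepts, are assumptions; collect_files and SUPPORTED_SUFFIXES are hypothetical names.

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumption: suffix list inferred from the formats named in this document.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".csv", ".md", ".wps", ".et"}


def collect_files(
    input_dir: Optional[str] = None,
    file_paths: Optional[Iterable[str]] = None,
) -> list[Path]:
    """Collect supported files from a folder (recursively) and/or an explicit list."""
    found: list[Path] = []
    if input_dir:
        # sorted() makes the processing order deterministic across runs.
        for path in sorted(Path(input_dir).rglob("*")):
            if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
                found.append(path)
    # Explicit paths are taken as given and left for the extractor to validate.
    for raw in file_paths or []:
        found.append(Path(raw))
    return found
```

Because a traversal like this picks up every supported file under the folder, it is worth pointing batch mode at a dedicated directory rather than a home or system directory.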

Return format

{
  "success": true,
  "data": {
    "text": "extracted text",
    "filtered_text": "filtered text"
  },
  "error": ""
}

The filtered_text field is returned only by the filter action.
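
A caller will usually want to turn this envelope into either a string or an exception. The helper below is a minimal sketch of that idea. SkillError and unwrap are hypothetical names, and the sketch assumes that a failed call sets success to false and fills in error, which the format above suggests but does not state.

```python
class SkillError(RuntimeError):
    """Hypothetical exception raised when the envelope reports a failure."""


def unwrap(result: dict, action: str = "extract") -> str:
    """Return the text for the given action, or raise SkillError on failure."""
    if not result.get("success"):
        raise SkillError(result.get("error") or "unknown error")
    data = result.get("data") or {}
    # filter results are read from filtered_text, extract results from text.
    key = "filtered_text" if action == "filter" else "text"
    return data.get(key, "")
```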

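Error handling

  • File not found: returns an error message
  • Unsupported file type: returns an error message
  • Operation failed: returns an error message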
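
The first two cases can be checked before any extraction starts. The sketch below shows one way to produce a failure result in the documented return shape. The exact messages, the empty data object on failure, and the suffix list are all assumptions, and check_input and error_result are hypothetical names.

```python
from pathlib import Path

# Assumption: suffix list inferred from the formats named in this document.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".csv", ".md", ".wps", ".et"}


def error_result(message: str) -> dict:
    """Build a failure envelope in the documented return shape."""
    return {"success": False, "data": {}, "error": message}


def check_input(file_path: str) -> dict:
    """Return a failure envelope for the first problem found, else a success envelope."""
    path = Path(file_path)
    if not path.is_file():
        return error_result(f"File not found: {file_path}")
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        return error_result(f"Unsupported file type: {path.suffix or '(none)'}")
    return {"success": True, "data": {}, "error": ""}
```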

Installation and testing

Installation

  1. Copy the doc-extract-filter directory into the skills directory of OpenClaw/CoPaw
  2. Run pip install -r requirements.txt to install the dependencies

Testing

Use the docs/test.pdf file to test the skill:

# Test text extraction
python scripts/doc-extract-filter.py --file_path "docs/test.pdf" --action "extract"

# Test keyword filtering
python scripts/doc-extract-filter.py --file_path "docs/test.pdf" --action "filter" --keywords "单价,小计,总金额"

# Test exclusion filtering
python scripts/doc-extract-filter.py --file_path "docs/test.pdf" --action "filter" --keywords "单价" --exclude-keywords "小计"

Standalone operation

doc-extract-filter now bundles all of the core code it needs and runs standalone, with no dependency on an external src directory.

Security recommendations
This skill is internally consistent and appears to do what it claims, but it runs code on your agent and will read/write any file paths you pass to it. Before installing or invoking: (1) ensure the Python environment has the required packages (requirements.txt), plus tesseract if you need OCR; (2) avoid pointing it at sensitive system or credential directories, because batch mode traverses the directories you give it; (3) run it on non-sensitive sample files first to confirm behavior and output formats; (4) if you need stronger isolation, run the skill in a sandboxed environment or container.
Functional analysis
Type: OpenClaw Skill
Name: doc-extract-filter
Version: 1.1.1
The doc-extract-filter skill bundle is a legitimate document processing utility designed for text extraction and keyword/regex filtering across various formats (PDF, Word, Excel, CSV, Markdown, and WPS). The implementation in scripts/doc-extract-filter.py and the core/ directory uses standard, well-known libraries like PyPDF2, python-docx, and openpyxl. There is no evidence of data exfiltration, malicious command execution, or prompt injection attempts; all file system operations are localized to the user-specified paths for input and output.
Capability assessment
Purpose & Capability
Name/description (extract/filter text from documents) matches the included code: extractor, filter, converter, and utils implement extraction, keyword/regex filtering, batch processing and result export. Declared CLI/API parameters align with implementation.
Instruction Scope
SKILL.md and entry script instruct the agent to read specified files or directories, extract text, filter matches, and optionally write JSON/text outputs. The instructions do not request unrelated data, secrets, or remote endpoints. Note: batch mode traverses directories and will process any supported files accessible to the running agent—this is expected but relevant for sensitive directories.
Install Mechanism
This is instruction/code-based (no install spec). A requirements.txt is provided but there is no automated installer; the runtime environment must have the listed Python packages. OCR functionality additionally requires system tesseract and pdf2image/Pillow; missing optional dependencies are handled in code (falls back to non-OCR extraction).
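
The fallback described above can be pictured as a dependency probe that runs before OCR is attempted. This is a minimal sketch and not the skill's code. The module name pytesseract is an assumption (the text names only tesseract, pdf2image, and Pillow), and ocr_available and choose_extractor are hypothetical names.

```python
import importlib.util
import shutil


def ocr_available(
    modules=("pytesseract", "pdf2image", "PIL"),
    binary="tesseract",
) -> bool:
    """True only if every Python module is importable and the binary is on PATH."""
    has_modules = all(importlib.util.find_spec(name) is not None for name in modules)
    return has_modules and shutil.which(binary) is not None


def choose_extractor(enable_ocr: bool) -> str:
    """Pick 'ocr' only when it was requested and its dependencies are present."""
    return "ocr" if enable_ocr and ocr_available() else "text"
```

Probing with find_spec avoids importing heavy packages just to learn whether they exist.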
Credentials
The skill does not request environment variables, credentials, or config paths. All I/O is local-file-based as described. There are no requests for unrelated service keys or tokens.
Persistence & Privilege
Skill is not marked always:true and does not modify other skills or system-wide agent settings. It performs file reads/writes within the paths provided by the caller, which is appropriate for its purpose.
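How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install doc-extract-filter
  3. After installation, call the skill by name or trigger it with /doc-extract-filter
  4. Provide the required inputs according to the skill's parameter descriptions to get structured output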
Version history
v1.1.1
Version 1.1.1 introduces major compatibility and feature enhancements:
  • Added support for CSV, Markdown (.md), and WPS (.wps/.et) file extraction.
  • Improved extraction for merged Excel cells, scanned PDFs (OCR support), and Word documents with images.
  • Introduced --enable-ocr parameter for lightweight OCR on scanned PDFs.
  • Added automatic file type detection and new exclusion filter options (--exclude-keywords, --exclude-regex).
  • Supports context extraction with --context-length and filtering by line or paragraph (--filter-level).
  • Enhanced batch processing with detailed results and merged output options.
  • Updated dependencies, including optional support for tesseract and python-markdown.
v1.0.3
  • Added support for regex-based filtering of extracted text.
  • Updated configuration to include a "regex" parameter for filter operations.
  • Updated documentation and usage examples to reflect new regex filtering capability.
v1.0.2
  • Bumped the version number to 1.0.2.
  • Removed unused dependencies.
  • Optimized the project structure.
  • Updated the version number in the OpenClaw and CoPaw configurations.
  • Added a changelog section that spells out what changed.
v1.0.0
Initial release of doc-extract-filter:
  • Supports text extraction and keyword filtering for PDF, Word, and Excel files.
  • Provides both CLI and Python API usage.
  • Returns extraction results and error details in a structured JSON format.
  • Includes OpenClaw and CoPaw configuration for integration.
  • Can be installed directly and run independently without external dependencies.
Metadata
Slug: doc-extract-filter
Version: 1.1.1
License: MIT-0
Total installs: 0
Current installs: 0
Version count: 4
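FAQ

What is doc-extract-filter?

It extracts text from PDF, Word, and Excel files and filters it by keyword, returning either the full text or the filtered text. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 181 total downloads so far.

How do I install doc-extract-filter?

Run the command "/install doc-extract-filter" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is doc-extract-filter free?

Yes. doc-extract-filter is completely free and released under the MIT-0 license, so you may download, install, and use it freely.

Which platforms does doc-extract-filter support?

doc-extract-filter is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops doc-extract-filter?

It is developed and maintained by bigclawd (@bigclawd); the current version is v1.1.1.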
