docx-pdf-knowledge-parser

Author kaiasdobi · GitHub · v1.0.1 · MIT-0
cross-platform ✓ Security check passed
153 total downloads · 0 favorites · 0 current installs · 2 versions
Install in OpenClaw
/install docx-pdf-knowledge-parser
Description
Parse local `.docx` and `.pdf` files into structured knowledge artifacts with detailed reports, tracking successes, failures, and summaries without auto-writ...
Usage instructions (SKILL.md)

name: docx-pdf-knowledge-parser
description: parse local docx and pdf files into report-first knowledge artifacts. use when chatgpt needs to extract text from uploaded or locally available attachments, generate ingest-report.md, kb-items.jsonl, failed-items.jsonl, and memory.candidate.md without directly writing memory.md.

Docx PDF Knowledge Parser

Use this skill to turn local or uploaded .docx and .pdf files into structured, reviewable knowledge outputs.

What this skill does

  • Accept local or already-available .docx and .pdf files.
  • Classify files into parseable, manual-review, or failed.
  • Parse .docx and .pdf in v1.0.
  • Produce report-first outputs instead of writing MEMORY.md directly.
  • Preserve failures and uncertainty instead of guessing content.

Supported v1.0 scope

Inputs

  • Local .docx file path
  • Local .pdf file path
  • A batch of local .docx and .pdf files in one directory

Parsing

  • .docx
  • .pdf

Outputs

  • ingest-report.md
  • kb-items.jsonl
  • failed-items.jsonl
  • MEMORY.candidate.md

Required behavior

  1. Only process files that are already available locally or have already been provided to the runtime.
  2. Do not claim file content was learned unless text was actually extracted.
  3. Default to report-first. Do not write MEMORY.md in v1.0.
  4. Record every failed file with a concrete reason.
  5. Prefer plain-text summaries over complex cards when reporting progress.

File routing rules

Parseable

Treat these as parseable in v1.0:

  • .docx
  • .pdf

Manual-review

Route here when the file is out of scope or low-confidence in v1.0:

  • .pptx
  • images
  • scans with no extractable text
  • archives
  • unusual file types

Failed

Route here when the file cannot be opened, parsed, or extracted successfully.
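
The routing rules above can be sketched as two small functions. This is a minimal illustration, not the code shipped in run.py: the function names are hypothetical, and sending a file with empty extracted text to manual-review is one reading of the "scans with no extractable text" rule.

```python
from pathlib import Path
from typing import Optional

# Extensions the skill treats as parseable in v1.0.
PARSEABLE_SUFFIXES = {".docx", ".pdf"}


def route_by_extension(path: str) -> str:
    """Initial route, decided from the file extension alone (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    # Anything outside .docx/.pdf (.pptx, images, archives, ...) is out of scope.
    return "parseable" if suffix in PARSEABLE_SUFFIXES else "manual-review"


def route_after_extraction(text: Optional[str], error: Optional[Exception]) -> str:
    """Final route once extraction has been attempted (hypothetical helper)."""
    if error is not None:
        # The file could not be opened, parsed, or extracted.
        return "failed"
    if not text or not text.strip():
        # Assumption: no extractable text (e.g. a scanned PDF) goes to manual review.
        return "manual-review"
    return "parseable"
```

Splitting the decision in two keeps the extension check cheap and lets the final route depend on what extraction actually produced rather than on the filename.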

Standard workflow

  1. Resolve input type.
    • Single file path -> process one file
    • Directory path -> enumerate supported files
  2. Create a batch record.
    • Generate batch_id
    • Record started_at
  3. Build a manifest.
    • File name
    • File path
    • File type
    • Route decision
  4. Attempt extraction.
    • .docx -> use parsers/parse_docx.py
    • .pdf -> use parsers/parse_pdf.py
  5. Produce structured outputs.
    • success -> append to kb-items.jsonl
    • failure -> append to failed-items.jsonl
  6. Summarize the batch.
    • Write ingest-report.md
    • Write MEMORY.candidate.md
  7. Finish the batch.
    • Record finished_at
    • Never auto-write MEMORY.md
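
Steps 1 to 3 of the workflow could look roughly like the sketch below. The function names, the UUID-based batch_id, the UTC ISO 8601 timestamps, and the non-recursive directory listing are all assumptions made for illustration; the actual run.py may differ.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path

PARSEABLE_SUFFIXES = {".docx", ".pdf"}


def resolve_inputs(input_path: str) -> list:
    """Step 1: a single file yields one entry; a directory is listed (non-recursively)."""
    path = Path(input_path)
    if path.is_file():
        return [path]
    return sorted(child for child in path.iterdir() if child.is_file())


def build_manifest(files: list) -> list:
    """Step 3: record name, path, type, and route decision for every file."""
    manifest = []
    for file in files:
        suffix = file.suffix.lower()
        manifest.append({
            "file_name": file.name,
            "file_path": str(file),
            "file_type": suffix.lstrip("."),
            "route": "parseable" if suffix in PARSEABLE_SUFFIXES else "manual-review",
        })
    return manifest


def start_batch(input_path: str) -> dict:
    """Step 2 plus step 3: create the batch record and attach the manifest."""
    return {
        "batch_id": uuid.uuid4().hex,  # assumption: any unique string would do
        "started_at": datetime.now(timezone.utc).isoformat(),
        "manifest": build_manifest(resolve_inputs(input_path)),
    }
```

Extraction and output writing (steps 4 to 7) would then iterate over the manifest entries whose route is "parseable".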

Output contracts

kb-items.jsonl

Write one JSON object per successfully extracted knowledge item with at least:

  • batch_id
  • source_file
  • source_path
  • file_type
  • topic
  • content_type
  • summary
  • extracted_at
  • confidence
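
As an illustration of this contract, the sketch below builds one record with the listed fields and appends it as a single JSON line. The helper names and the value formats (a float confidence, an ISO 8601 timestamp) are assumptions; the contract above only fixes the field names.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

KB_REQUIRED_FIELDS = (
    "batch_id", "source_file", "source_path", "file_type", "topic",
    "content_type", "summary", "extracted_at", "confidence",
)


def make_kb_item(batch_id, source_path, topic, content_type, summary, confidence):
    """Build one knowledge item (hypothetical helper; value formats are assumed)."""
    path = Path(source_path)
    return {
        "batch_id": batch_id,
        "source_file": path.name,
        "source_path": str(source_path),
        "file_type": path.suffix.lower().lstrip("."),
        "topic": topic,
        "content_type": content_type,
        "summary": summary,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "confidence": confidence,
    }


def append_jsonl(path, record, required=KB_REQUIRED_FIELDS):
    """Append one JSON object per line, refusing records with missing fields."""
    missing = [field for field in required if field not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Opening the file in append mode means several items from the same batch accumulate in one file, one object per line.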

failed-items.jsonl

Write one JSON object per failed file with at least:

  • batch_id
  • source_file
  • source_path
  • file_type
  • failure_reason
  • error_detail
  • suggested_action
  • failed_at
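
One way to satisfy "a concrete reason" is to derive the record from the exception that stopped extraction. The sketch below is hypothetical: the reason codes and suggested actions are invented for illustration and are not taken from the skill's code.

```python
from datetime import datetime, timezone
from pathlib import Path


def make_failed_item(batch_id: str, source_path: str, error: Exception) -> dict:
    """Turn an extraction error into a failed-items.jsonl record (hypothetical helper)."""
    # Illustrative mapping from exception type to reason code and next step.
    if isinstance(error, FileNotFoundError):
        reason, action = "file_not_found", "Check the path and rerun the batch."
    elif isinstance(error, PermissionError):
        reason, action = "permission_denied", "Fix file permissions and rerun."
    else:
        reason, action = "parse_error", "Open the file manually or route it to review."
    path = Path(source_path)
    return {
        "batch_id": batch_id,
        "source_file": path.name,
        "source_path": str(source_path),
        "file_type": path.suffix.lower().lstrip("."),
        "failure_reason": reason,
        # Keep the raw exception text so the failure is reported, not guessed at.
        "error_detail": f"{type(error).__name__}: {error}",
        "suggested_action": action,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
```

For example, `make_failed_item("b1", "/docs/x.pdf", FileNotFoundError("missing"))` yields a record whose failure_reason is "file_not_found" and whose error_detail is "FileNotFoundError: missing".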

MEMORY.candidate.md

Include:

  • batch header (batch_id, started_at, finished_at, source_directory or source_file)
  • grouped knowledge summaries
  • source references
  • confidence notes
  • items needing review

ingest-report.md

Include:

  1. Batch summary
  2. Input scope
  3. File counts and routing counts
  4. Successful extraction summary
  5. Failures and risks
  6. Recommended next actions
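
A report with these six sections could be rendered from the batch data as in the sketch below. The Markdown layout, the function name, and the shape of the `batch` dictionary (including its `source` key) are assumptions; only the section order comes from the list above.

```python
def render_ingest_report(batch, kb_items, failed_items, manual_review_paths):
    """Render ingest-report.md as a string with the six sections in order."""
    parsed_files = sorted({item["source_path"] for item in kb_items})
    lines = [
        "# Ingest report", "",
        "## 1. Batch summary",
        f"batch_id: {batch['batch_id']}",
        f"started_at: {batch['started_at']}",
        f"finished_at: {batch['finished_at']}", "",
        "## 2. Input scope",
        batch["source"], "",
        "## 3. File counts and routing counts",
        f"parsed: {len(parsed_files)}",
        f"manual-review: {len(manual_review_paths)}",
        f"failed: {len(failed_items)}", "",
        "## 4. Successful extraction summary",
    ]
    lines += [f"- {i['source_file']}: {i['summary']}" for i in kb_items] or ["- none"]
    lines += ["", "## 5. Failures and risks"]
    lines += [f"- {f['source_file']}: {f['failure_reason']}" for f in failed_items] or ["- none"]
    lines += ["", "## 6. Recommended next actions"]
    actions = [f"- {f['source_file']}: {f['suggested_action']}" for f in failed_items]
    actions += [f"- Review manually: {path}" for path in manual_review_paths]
    lines += actions or ["- none"]
    return "\n".join(lines) + "\n"
```

Returning a string rather than writing the file directly keeps the renderer easy to test; the caller decides where ingest-report.md is written.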

Safety rules

  • Never invent text that was not extracted.
  • If parsing fails, say so plainly and log it.
  • Treat filenames as hints only, never as proof of document contents.
  • Keep sensitive data out of MEMORY.candidate.md unless the workflow explicitly allows it.

Included files

  • run.py: minimal batch runner for local testing
  • parsers/parse_docx.py: docx text extraction helper
  • parsers/parse_pdf.py: pdf text extraction helper
  • references/output_examples.md: sample output shapes and field guidance
  • README.md: setup and usage notes
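
The parser helpers themselves are not reproduced on this page. The sketch below shows what such helpers might look like, assuming the python-docx and pypdf packages that the listing's review names as requirements. The function names are hypothetical, and the shipped parsers/parse_docx.py and parsers/parse_pdf.py may work differently.

```python
from pathlib import Path


def extract_docx_text(path: str) -> str:
    """Join the non-empty paragraphs of a .docx file (requires python-docx)."""
    from docx import Document  # imported lazily so the module loads without it
    document = Document(str(path))
    return "\n".join(p.text for p in document.paragraphs if p.text.strip())


def extract_pdf_text(path: str) -> str:
    """Join the extractable text of each PDF page (requires pypdf)."""
    from pypdf import PdfReader  # imported lazily so the module loads without it
    reader = PdfReader(str(path))
    pages = [(page.extract_text() or "") for page in reader.pages]
    # A scanned, image-only PDF typically yields an empty string here.
    return "\n".join(text for text in pages if text.strip())


def extract_text(path: str) -> str:
    """Dispatch on extension; anything else is out of scope in v1.0."""
    suffix = Path(path).suffix.lower()
    if suffix == ".docx":
        return extract_docx_text(path)
    if suffix == ".pdf":
        return extract_pdf_text(path)
    raise ValueError(f"unsupported file type: {suffix or '(none)'}")
```

An empty return value is a signal for the caller to route the file away from kb-items.jsonl instead of guessing at its content, in line with the safety rules above.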
Safe usage recommendations
This skill appears to be what it says: a local batch parser for .docx and .pdf files. Before running it, (1) ensure the --input-dir contains only files you want parsed (it will read and extract text from each .docx/.pdf it finds); (2) be aware extracted text and summary files (including MEMORY.candidate.md) will be written in plaintext to --output-dir — avoid writing to a shared or sensitive location; (3) install the two Python dependencies (python-docx, pypdf) in a controlled environment; (4) the README/metadata mentions Feishu but no network connector or credentials are included — adding Feishu integration would require extra code/credentials; and (5) if you need OCR for image-based PDFs, this version will mark them as failed and recommend manual/OCR workflows. No network exfiltration or credential use was found.
Functional analysis
Type: OpenClaw Skill
Name: docx-pdf-knowledge-parser
Version: 1.0.1
The skill bundle is a legitimate utility designed to parse local .docx and .pdf files into structured knowledge artifacts and reports. The Python scripts (run.py, parsers/parse_docx.py, and parsers/parse_pdf.py) use standard libraries to extract text and generate local output files without any network activity, credential access, or suspicious execution patterns. The instructions in SKILL.md are well-defined and include safety-oriented constraints, such as requiring manual review before updating the agent's long-term memory.
Capability assessment
Purpose & Capability
The name/description (parse local .docx/.pdf into report-first outputs) matches the code and SKILL.md. The included parsers and run.py implement the declared behavior. Mentions of Feishu in README/agent metadata are informational for future connectors but do not imply hidden Feishu integration.
Instruction Scope
SKILL.md explicitly limits processing to local/already-available files and the code follows that. Be aware the tool will iterate all files in the provided input directory and will attempt to parse any .docx/.pdf it finds — so the operator must ensure the input directory contains only files intended for ingestion to avoid accidental parsing of sensitive documents.
Install Mechanism
There is no install spec; requirements.txt lists python-docx and pypdf which are appropriate for the task. No downloads from arbitrary URLs or extract operations are present.
Credentials
The skill requests no environment variables, no credentials, and no config paths. The code does not reference any secrets or external services; the lack of credentials is consistent with an offline/local parsing utility.
Persistence & Privilege
always is false and the skill does not attempt to modify other skills or global agent settings. It writes output files only to the user-specified output directory (kb-items.jsonl, failed-items.jsonl, ingest-report.md, MEMORY.candidate.md).
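How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install docx-pdf-knowledge-parser
  3. Once installation finishes, invoke the Skill by name or trigger it with /docx-pdf-knowledge-parser
  4. Provide the required inputs according to the Skill's parameter description to get structured output.
Version history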
v1.0.1
  • Added metadata file (_meta.json) for the skill.
  • No changes to core functionality or documentation.
v1.0.0
Initial release of docx-pdf-knowledge-parser.
  • Parses local `.docx` and `.pdf` files into structured, report-first knowledge artifacts.
  • Supports batch extraction, generating `ingest-report.md`, `kb-items.jsonl`, `failed-items.jsonl`, and `MEMORY.candidate.md`.
  • Clearly routes files as parseable, for manual review, or failed, with detailed logging of failures.
  • Does not write to `MEMORY.md` directly; focuses on reviewable, auditable outputs.
  • Includes helper scripts and sample outputs for local testing and integration.
Metadata
Slug docx-pdf-knowledge-parser
Version 1.0.1
License MIT-0
Total installs 0
Current installs 0
Version count 2
FAQ

What is docx-pdf-knowledge-parser?

Parse local `.docx` and `.pdf` files into structured knowledge artifacts with detailed reports, tracking successes, failures, and summaries without auto-writ... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 153 total downloads to date.

How do I install docx-pdf-knowledge-parser?

Run the command "/install docx-pdf-knowledge-parser" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

Is docx-pdf-knowledge-parser free?

Yes. docx-pdf-knowledge-parser is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does docx-pdf-knowledge-parser support?

docx-pdf-knowledge-parser is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed docx-pdf-knowledge-parser?

It is developed and maintained by kaiasdobi (@kaiasdobi); the current version is v1.0.1.
