Hwp Extract Pipeline
Author: developheo · v1.0.0 · MIT-0
Total downloads: 152 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install hwp-extract-pipeline
Description
HWP/HWPX/PDF extraction pipeline: attempt hwp-reader, then pyhwp, then OCR, with safe fallbacks. Use when agent needs reliable text extraction from Korean HW...
Usage (SKILL.md)
hwp-extract-pipeline
A simple HWP/HWPX/PDF extraction pipeline skill. Its core goal is to reliably convert locally stored announcement documents (Hangul files) to text and return the result as JSON.
Quick usage
- Script: scripts/extract_hwp.py
- Input: a local file path (e.g. /home/vorox/.openclaw/agents/nalda-mail-opt/data/<PBLN_ID>/getImageFile.do)
- Output: JSON on standard output, also saved in the data folder as <id>_extracted.json
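The output convention above (JSON on stdout plus an `<id>_extracted.json` file) could be sketched roughly as follows. This is an illustration only, not the code in scripts/extract_hwp.py: the function name `emit_result` and the field names `id` and `text` are assumptions, since the listing does not show the actual JSON schema.

```python
import json
import sys
from pathlib import Path


def emit_result(doc_id: str, text: str, out_dir: Path) -> Path:
    """Print the result as JSON and save the same payload as <id>_extracted.json."""
    # Hypothetical field names; the real script's schema is not shown in the listing.
    payload = {"id": doc_id, "text": text}
    # ensure_ascii=False keeps Korean text readable instead of \uXXXX escapes.
    encoded = json.dumps(payload, ensure_ascii=False, indent=2)
    sys.stdout.write(encoded + "\n")
    out_path = out_dir / f"{doc_id}_extracted.json"
    out_path.write_text(encoded, encoding="utf-8")
    return out_path
```

Writing the file explicitly as UTF-8 avoids depending on the platform's default encoding, which matters for Hangul content.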
Priority (fallback order)
- Call hwp-reader (when the external skill can be invoked)
- pyhwp-based extraction (venv)
- System OCR (poppler + tesseract); may require system-level installation
- strings-based fallback
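The fallback order above amounts to trying each extractor in turn and keeping the first non-empty result. A minimal sketch of that pattern, assuming each stage is wrapped as a function that takes a path and returns text; `run_pipeline` and the result keys are hypothetical names, not taken from the skill's script:

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each extractor takes a file path and returns text, or None/"" if it cannot help.
Extractor = Callable[[str], Optional[str]]


def run_pipeline(path: str, extractors: List[Tuple[str, Extractor]]) -> Dict:
    """Try each (name, extractor) in order and return the first non-empty result."""
    errors: Dict[str, str] = {}
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # a failing stage must not abort the whole pipeline
            errors[name] = str(exc)
            continue
        if text and text.strip():
            return {"method": name, "text": text, "errors": errors}
        errors[name] = "empty result"
    # Every stage failed or returned nothing.
    return {"method": None, "text": "", "errors": errors}
```

Recording which stage succeeded, and why earlier stages failed, makes it easier to tell a clean extraction from a last-resort `strings` dump.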
Reference documents
- scripts/README.md (brief usage examples and integration notes)
Security recommendations
This skill appears to do exactly what it describes: extract text from local HWP/HWPX/PDF files and save JSON output. Before installing or using it, consider: (1) it may execute a local 'hwp-reader' binary or a Python interpreter from ~/.openclaw/venv, so ensure those binaries are trusted (an attacker controlling the working directory or venv could cause execution of malicious code); (2) it writes <id>_extracted.json into the current directory and creates a short-lived temp script when using pyhwp; (3) OCR is mentioned but not implemented in the script (system OCR tools are not invoked here). If you will run this on untrusted files or in multi-tenant environments, run it in an isolated container or sandbox and verify that any helper binaries (hwp-reader, venv python) come from trusted sources.
Functional analysis
Type: OpenClaw Skill
Name: hwp-extract-pipeline
Version: 1.0.0
The skill bundle provides a legitimate utility for extracting text from Korean HWP and HWPX documents using a multi-stage fallback pipeline (hwp-reader, pyhwp, and system strings). The core logic in `scripts/extract_hwp.py` uses safe subprocess execution and standard library functions for ZIP parsing and XML processing. While the script lacks sanitization for the `--id` parameter used in the output filename (a potential path traversal vulnerability), there is no evidence of malicious intent, data exfiltration, or unauthorized remote execution.
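One way the `--id` path traversal concern could be addressed is to validate the id before building the output filename. The sketch below is a suggestion under stated assumptions, not the script's current behavior: the helper name `safe_output_path` and the allowed character set are hypothetical choices.

```python
import re
from pathlib import Path

# Assumed policy: ids may contain only letters, digits, dot, underscore, hyphen.
_SAFE_ID = re.compile(r"[A-Za-z0-9._-]+")


def safe_output_path(doc_id: str, out_dir: Path) -> Path:
    """Build <id>_extracted.json inside out_dir, rejecting ids that could escape it."""
    if not _SAFE_ID.fullmatch(doc_id) or doc_id in {".", ".."}:
        raise ValueError(f"unsafe id: {doc_id!r}")
    base = out_dir.resolve()
    candidate = (base / f"{doc_id}_extracted.json").resolve()
    # Defense in depth: the resolved file must sit directly inside out_dir.
    if candidate.parent != base:
        raise ValueError(f"unsafe id: {doc_id!r}")
    return candidate
```

With this check, an id such as `../../etc/cron.d/job` is rejected instead of being joined into a path outside the working directory.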
Capability assessment
Purpose & Capability
Name/description match the included script: the code implements a pipeline (hwp-reader -> pyhwp -> HWPX parsing -> strings) to extract text from local HWP/HWPX/PDF files. No unrelated capabilities or extra credentials are requested.
Instruction Scope
SKILL.md and the script restrict operations to local files and produce JSON output. The script will execute local helper binaries (hwp-reader if present), may run the provided or detected Python venv to import pyhwp, reads zip/XML inside HWPX, and calls the system 'strings' binary as a fallback. It writes <id>_extracted.json to the current working directory and creates a short-lived temp extractor script when invoking pyhwp. These behaviors are expected for this purpose but are worth noting because the skill executes local binaries and writes files.
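The HWPX step mentioned above (reading ZIP/XML with the standard library) might look roughly like the sketch below. It assumes that body text lives in `Contents/section*.xml` parts and that text runs are elements whose local tag name is `t`; both are assumptions about the HWPX layout rather than details confirmed by this listing, and `extract_hwpx_text` is a hypothetical name.

```python
import zipfile
import xml.etree.ElementTree as ET


def extract_hwpx_text(source) -> str:
    """Collect text runs from the section XML parts of an HWPX (ZIP) container."""
    lines = []
    with zipfile.ZipFile(source) as archive:
        # Assumption: body text is stored in Contents/section*.xml parts.
        # Names are sorted lexicographically, so section10 sorts before section2.
        names = sorted(
            n for n in archive.namelist()
            if n.startswith("Contents/section") and n.endswith(".xml")
        )
        for name in names:
            root = ET.fromstring(archive.read(name))
            for elem in root.iter():
                # Compare the local tag name so no namespace URI is hard-coded.
                if elem.tag.rsplit("}", 1)[-1] == "t" and elem.text:
                    lines.append(elem.text)
    return "\n".join(lines)
```

`source` can be a file path or a file-like object. For untrusted documents, a hardened XML parser is worth considering, since `xml.etree.ElementTree` is not designed to resist maliciously constructed input.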
Install Mechanism
No install spec; this is an instruction + script bundle only. Nothing is downloaded or extracted from external URLs and no packages are installed by the skill itself.
Credentials
The skill declares no environment variables or credentials. Runtime behavior inspects ~/.openclaw/venv and the current working directory for helper binaries, which is reasonable for locating a venv or workspace-provided hwp-reader binary.
Persistence & Privilege
always is false and the skill does not request persistent system-wide changes or modify other skills. It writes output files to the working directory only (no system config changes).
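How to use
- Make sure OpenClaw is installed (local or Docker deployment).
- Enter the install command in the chat box: /install hwp-extract-pipeline
- Once installation finishes, invoke the skill by name or trigger it with /hwp-extract-pipeline
- Provide the required inputs as described in the skill's parameter documentation to get structured output.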
Version history
v1.0.0
Initial release of hwp-extract-pipeline.
- Provides robust extraction of text from HWP/HWPX/PDF (including scanned) files using a prioritized fallback pipeline.
- Supports extraction via hwp-reader, pyhwp, OCR (poppler+tesseract), and strings as last resort.
- Outputs extracted text in JSON format to stdout and as a file.
- Accepts local file paths as input for automated processing.
- Documentation and example usage available in scripts/README.md.
FAQ
What is Hwp Extract Pipeline?
HWP/HWPX/PDF extraction pipeline: attempt hwp-reader, then pyhwp, then OCR, with safe fallbacks. Use when agent needs reliable text extraction from Korean HW... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 152 total downloads to date.
How do I install Hwp Extract Pipeline?
Run the command "/install hwp-extract-pipeline" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.
Is Hwp Extract Pipeline free?
Yes. Hwp Extract Pipeline is completely free and released under the MIT-0 license; you can download, install, and use it freely.
Which platforms does Hwp Extract Pipeline support?
Hwp Extract Pipeline is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.
Who developed Hwp Extract Pipeline?
It is developed and maintained by developheo (@heoboong); the current version is v1.0.0.