heoboong

Hwp Extract Pipeline

by developheo · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
Downloads 152
Stars 0
Active Installs 0
Versions 1
Install in OpenClaw
/install hwp-extract-pipeline
Description
HWP/HWPX/PDF extraction pipeline: attempt hwp-reader, then pyhwp, then OCR, with safe fallbacks. Use when agent needs reliable text extraction from Korean HW...
README (SKILL.md)

hwp-extract-pipeline

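A simple HWP/HWPX/PDF extraction pipeline skill. Its core goal is to reliably convert locally stored announcement documents (Hangul word-processor files) into text and return the result as JSON.

Quick usage

  • Script to run: scripts/extract_hwp.py
  • Input: a local file path (e.g. /home/vorox/.openclaw/agents/nalda-mail-opt/data/<PBLN_ID>/getImageFile.do)
  • Output: JSON on standard output, also saved to the data folder as <id>_extracted.json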

Priority (fallback order)

  1. Call hwp-reader (when the external skill can be invoked)
  2. Extraction with pyhwp (from the venv)
  3. System OCR (poppler + tesseract); may require system-level installation
  4. strings-based fallback
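The fallback order above can be sketched as a small driver that tries each stage in turn and keeps the first non-empty result. This is a minimal illustration, not the code in scripts/extract_hwp.py: the function name `run_pipeline`, the `(name, callable)` stage format, and the shape of the returned dict are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical stage signature: takes a file path, returns text or None.
Extractor = Callable[[str], Optional[str]]


def run_pipeline(path: str, stages: Sequence[Tuple[str, Extractor]]) -> dict:
    """Try each stage in order and return the first non-empty result."""
    errors: Dict[str, str] = {}
    for name, extract in stages:
        try:
            text = extract(path)
        except Exception as exc:
            # A failing stage must not abort the pipeline; record and move on.
            errors[name] = str(exc)
            continue
        if text and text.strip():
            return {"method": name, "text": text, "errors": errors}
        errors[name] = "empty result"
    # Every stage failed or returned nothing.
    return {"method": None, "text": "", "errors": errors}
```

A caller would pass the stages in priority order, for example `[("hwp-reader", f1), ("pyhwp", f2), ("strings", f3)]`, and can inspect `errors` to see why earlier stages were skipped.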

Reference documents

  • scripts/README.md (brief usage examples and integration notes)
Usage Guidance
This skill appears to do exactly what it describes: extract text from local HWP/HWPX/PDF files and save JSON output. Before installing or using it, consider: (1) it may execute a local 'hwp-reader' binary or a Python interpreter from ~/.openclaw/venv, so ensure those binaries are trusted (an attacker who controls the working directory or the venv could cause malicious code to run); (2) it writes <id>_extracted.json into the current directory and creates a short-lived temp script when using pyhwp; (3) OCR is mentioned but not implemented in the script (system OCR tools are not invoked here). If you will run this on untrusted files or in multi-tenant environments, run it in an isolated container or sandbox and verify that any helper binaries (hwp-reader, venv python) come from trusted sources.
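One way to reduce the risk in point (1) is to resolve helper binaries only from directories you explicitly trust, rather than from the working directory or the ambient PATH. The sketch below assumes a hypothetical helper `resolve_trusted` and a caller-supplied list of trusted directories; neither is part of the skill.

```python
import os
import shutil
from typing import Iterable, Optional


def resolve_trusted(name: str, trusted_dirs: Iterable[str]) -> Optional[str]:
    """Return the absolute path of `name` if it is an executable inside one
    of `trusted_dirs`, otherwise None."""
    # Reject names with a directory component such as "./hwp-reader";
    # shutil.which would otherwise check that path directly.
    if os.path.basename(name) != name:
        return None
    search_path = os.pathsep.join(os.path.abspath(d) for d in trusted_dirs)
    found = shutil.which(name, path=search_path)
    return os.path.abspath(found) if found else None
```

The returned absolute path can then be passed to `subprocess.run` as the first element of an argument list, so the binary actually executed is the one that was checked.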
Capability Analysis
Type: OpenClaw Skill
Name: hwp-extract-pipeline
Version: 1.0.0
The skill bundle provides a legitimate utility for extracting text from Korean HWP and HWPX documents using a multi-stage fallback pipeline (hwp-reader, pyhwp, and system strings). The core logic in `scripts/extract_hwp.py` uses safe subprocess execution and standard library functions for ZIP parsing and XML processing. While the script lacks sanitization for the `--id` parameter used in the output filename (a potential path traversal vulnerability), there is no evidence of malicious intent, data exfiltration, or unauthorized remote execution.
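The path traversal concern around `--id` could be addressed by validating the value before it is used in a filename. The following is a sketch under assumptions, not a patch to the script: the function name `output_path_for`, the allowed character set, and the 128-character limit are illustrative choices.

```python
import re
from pathlib import Path

# Assumed policy: ASCII letters, digits, "_", "." and "-", starting with
# a letter or digit, at most 128 characters.
_SAFE_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]{0,127}")


def output_path_for(doc_id: str, out_dir: str = ".") -> Path:
    """Build <out_dir>/<doc_id>_extracted.json, rejecting unsafe ids."""
    if not _SAFE_ID.fullmatch(doc_id) or ".." in doc_id:
        raise ValueError(f"unsafe id: {doc_id!r}")
    base = Path(out_dir).resolve()
    target = (base / f"{doc_id}_extracted.json").resolve()
    # Defense in depth: the resolved file must sit directly in out_dir.
    if target.parent != base:
        raise ValueError(f"id escapes output directory: {doc_id!r}")
    return target
```

With this check, an id such as `../../etc/cron.d/x` raises `ValueError` instead of producing a path outside the output directory.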
Capability Assessment
Purpose & Capability
Name/description match the included script: the code implements a pipeline (hwp-reader -> pyhwp -> HWPX parsing -> strings) to extract text from local HWP/HWPX/PDF files. No unrelated capabilities or extra credentials are requested.
Instruction Scope
SKILL.md and the script restrict operations to local files and produce JSON output. The script will execute local helper binaries (hwp-reader if present), may run the provided or detected Python venv to import pyhwp, reads zip/XML inside HWPX, and calls the system 'strings' binary as a fallback. It writes <id>_extracted.json to the current working directory and creates a short-lived temp extractor script when invoking pyhwp. These behaviors are expected for this purpose but are worth noting because the skill executes local binaries and writes files.
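For readers unfamiliar with the HWPX step: an HWPX file is a ZIP container whose body text lives in XML parts. The sketch below shows the general idea using only the standard library. It assumes the section files are named `Contents/section<N>.xml` and that text runs are stored in elements whose local name is `t`; it is not the script's actual implementation, and the function name `extract_hwpx_text` is made up for the example. For untrusted input, a hardened XML parser such as defusedxml would be a safer choice than `xml.etree`.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Assumed layout: body sections are stored as Contents/section<N>.xml.
_SECTION = re.compile(r"Contents/section(\d+)\.xml")


def _local(tag) -> str:
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def extract_hwpx_text(source) -> str:
    """Collect text runs from every section part, in numeric section order.

    `source` may be a file path or a binary file-like object.
    """
    chunks = []
    with zipfile.ZipFile(source) as zf:
        sections = []
        for name in zf.namelist():
            m = _SECTION.fullmatch(name)
            if m:
                sections.append((int(m.group(1)), name))
        for _, name in sorted(sections):
            root = ET.fromstring(zf.read(name))
            for elem in root.iter():
                # Assumption: text runs are elements with local name "t".
                if _local(elem.tag) == "t" and elem.text:
                    chunks.append(elem.text)
    return "\n".join(chunks)
```

This version emits one line per text run, so a paragraph split into several runs appears on several lines; grouping runs by paragraph would be a natural refinement.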
Install Mechanism
No install spec; this is an instruction + script bundle only. Nothing is downloaded or extracted from external URLs and no packages are installed by the skill itself.
Credentials
The skill declares no environment variables or credentials. Runtime behavior inspects ~/.openclaw/venv and the current working directory for helper binaries, which is reasonable for locating a venv or workspace-provided hwp-reader binary.
Persistence & Privilege
always is false and the skill does not request persistent system-wide changes or modify other skills. It writes output files to the working directory only (no system config changes).
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install hwp-extract-pipeline
  3. After installation, invoke the skill by name or use /hwp-extract-pipeline
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of hwp-extract-pipeline.
  • Provides robust extraction of text from HWP/HWPX/PDF (including scanned) files using a prioritized fallback pipeline.
  • Supports extraction via hwp-reader, pyhwp, OCR (poppler+tesseract), and strings as last resort.
  • Outputs extracted text in JSON format to stdout and as a file.
  • Accepts local file paths as input for automated processing.
  • Documentation and example usage available in scripts/README.md.
Metadata
Slug hwp-extract-pipeline
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Hwp Extract Pipeline?

HWP/HWPX/PDF extraction pipeline: attempt hwp-reader, then pyhwp, then OCR, with safe fallbacks. Use when agent needs reliable text extraction from Korean HW... It is an AI Agent Skill for Claude Code / OpenClaw, with 152 downloads so far.

How do I install Hwp Extract Pipeline?

Run "/install hwp-extract-pipeline" in the OpenClaw or Claude Code chat to install it in one step; no extra setup is required.

Is Hwp Extract Pipeline free?

Yes, Hwp Extract Pipeline is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Hwp Extract Pipeline support?

Hwp Extract Pipeline is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Hwp Extract Pipeline?

It is built and maintained by developheo (@heoboong); the current version is v1.0.0.
