pdf-ocr-layout

Name: pdf-ocr-layout
Author: baokui

Description

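A multimodal deep document-parsing tool built on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V.

Use when:
- You need high-precision extraction of tables from documents (PDF/images), converted to Markdown.
- You need figures and charts automatically cropped from document pages and saved as separate files.
- You need deep semantic understanding of the extracted figures and charts (visual analysis with GLM-4.6V).
- You need logical analysis of the extracted table data (text analysis with GLM-4.7).

Core architecture:
1. Visual extraction: GLM-OCR
2. Semantic understanding: GLM-4.7 (plain text/tables) + GLM-4.6V (multimodal/images)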

Usage Guidance

This package appears to implement the advertised OCR + GLM analysis pipeline, but before installing you should:
- Verify the source: there is no homepage or repository listed. Prefer code from a known source if you will send sensitive documents.
- Expect document data to be transmitted to Zhipu's API (the scripts Base64-encode images and send full page Markdown/context). Do NOT run on private/sensitive documents unless you're comfortable with that external transmission and the API provider's data retention policy.
- Fix/confirm dependencies: SKILL.md lists 'zhipuai' but the code imports 'zai' (from zai import ZhipuAiClient). Confirm the correct client package and install it in a controlled environment (virtualenv/container).
- Registry metadata mismatch: the manifest claims no required env vars, but the scripts require ZHIPU_API_KEY. Treat ZHIPU_API_KEY as mandatory and do not place sensitive credentials in shared environments.
- If you need higher assurance, ask the publisher for:
  1) source repository or release page
  2) exact Python package name for the Zhipu client and installation instructions
  3) confirmation of what data is sent to the API and the provider's retention/privacy terms

Given these inconsistencies and the fact that your documents will be sent to an external API, proceed only after clarifying the above or run the skill in an isolated environment with non-sensitive test files.

Capability Analysis

Type: OpenClaw Skill
Name: pdf-ocr-layout
Version: 1.0.2

The skill is classified as suspicious due to potential prompt injection vulnerabilities against the backend LLMs (GLM-4.7, GLM-4.6V) and potential path traversal vulnerabilities.

The `script/glm_understanding.py` script embeds content derived from the input document (`full_markdown_context`, `detected_title`) directly into the LLM prompts without sanitization, which could allow a malicious input document to inject instructions to the backend models.

Additionally, the scripts perform file system operations using `file_path` and `output_dir` (e.g., in `script/glm_ocr_extract.py`), which, while necessary for functionality, could be exploited for path traversal if the OpenClaw agent is tricked into providing malicious paths.

There is no evidence of intentional malicious behavior such as data exfiltration or backdoor installation; the identified issues are vulnerabilities rather than deliberate malice.
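The two weaknesses described above can be illustrated with a minimal sketch. It does not reproduce the skill's code: `confine_path` and `wrap_untrusted` are hypothetical helpers, and the `<document>` tag name is an arbitrary choice. The first rejects paths that resolve outside a chosen base directory; the second fences document-derived text so a prompt can refer to it as data rather than as instructions.

```python
from pathlib import Path


def confine_path(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir and reject anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not inside base.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate


def wrap_untrusted(text: str, tag: str = "document") -> str:
    """Fence document-derived text between an opening and a closing tag."""
    closing = f"</{tag}>"
    # Neutralise any closing tag inside the text so it cannot end the fence early.
    safe = text.replace(closing, f"<\\/{tag}>")
    return f"<{tag}>\n{safe}\n{closing}"
```

Delimiting untrusted text reduces the risk of prompt injection but does not eliminate it; output from the backend models should still be treated as untrusted.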

Capability Assessment

ℹ Purpose & Capability

The code and SKILL.md implement a PDF/image layout extraction step plus LLM/VLM analysis against Zhipu models (GLM-OCR, GLM-4.7, GLM-4.6V), which matches the skill's description. However, the registry metadata lists no required environment variables or primary credential, while the SKILL.md and code both require a ZHIPU_API_KEY (an inconsistency).

⚠ Instruction Scope

The runtime instructions and included scripts load arbitrary input files, Base64-encode images, and send the file contents and the page's full Markdown context to the Zhipu API for analysis. That behavior is coherent for this tool, but it means the user's document contents will be transmitted to an external service; the instructions do not document any privacy or retention terms, or any opt-out. Also, the SKILL.md instructs users to set ZHIPU_API_KEY, but the registry metadata does not declare it.
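To make concrete what "Base64-encode images" implies for privacy, here is a small sketch. The exact request format the scripts use is not shown in this listing, so the data-URL shape and both function names are assumptions for illustration only. The point is that Base64 is a reversible encoding, not encryption: whoever receives the payload can recover the full image.

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Return a data URL holding the full image, Base64-encoded."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def decode_data_url(data_url: str) -> bytes:
    """Recover the original bytes from a data URL built by to_data_url."""
    _, encoded = data_url.split(",", 1)
    return base64.b64decode(encoded)
```

Round-tripping any byte string through these two functions returns the original bytes unchanged, which is why non-sensitive test files are the safer choice for a first run.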

ℹ Install Mechanism

There is no install spec (instruction-only), so nothing is auto-downloaded, which lowers install risk. However, the code depends on Python packages and a client library: SKILL.md lists 'zhipuai' as a dependency but the code imports 'zai' (from zai import ZhipuAiClient), a mismatch that will cause an import failure at runtime unless clarified. The required Python libraries (pillow, beautifulsoup4) are reasonable for OCR/cropping, but the missing or ambiguous client package is a concern.
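One way to check which of the two module names is actually importable in a given environment, without importing or running either package, is sketched below. The helper name `find_available` is hypothetical; the candidate names come from the mismatch described above, and which distribution provides each module is not confirmed by this listing.

```python
import importlib.util


def find_available(candidates=("zai", "zhipuai")):
    """Return the candidate top-level modules that are importable here."""
    return [
        name
        for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    found = find_available()
    print("importable client modules:", found or "none")
```

Running this inside a fresh virtualenv or container after installing the dependencies shows whether the `from zai import ZhipuAiClient` line in the scripts can succeed.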

⚠ Credentials

At runtime the scripts require a single credential env var ZHIPU_API_KEY to call the remote API — that is proportionate to the stated function. The problem is the registry metadata lists 'Required env vars: none' and 'Primary credential: none', which is inconsistent and could mislead users about what secrets are needed. No other unrelated credentials are requested.
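Because the registry metadata does not declare the credential, a fail-fast check before any document is processed can be useful. The following is a minimal sketch, not the skill's own code; `require_api_key` is a hypothetical helper, and only the variable name ZHIPU_API_KEY comes from the listing.

```python
import os


def require_api_key(env=None) -> str:
    """Return ZHIPU_API_KEY from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("ZHIPU_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ZHIPU_API_KEY is not set; export it before running the scripts."
        )
    return key
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.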

✓ Persistence & Privilege

The skill does not request permanent/always-on privileges, does not modify other skills, and uses normal file I/O within the provided output directory. There is no 'always: true' or other excessive privilege requested.

Version History

v1.0.2

- Added a Chinese documentation file `SKILL_zh.md` for the skill.
- No changes to core code or functionality.

v1.0.1

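- Renamed the skill from "pdf-ocr-layout-understanding" to "pdf-ocr-layout".
- Improved the document structure: the script stage is now split into a clearer extraction stage and understanding stage.
- Changed the explanatory content from list format to hierarchical headings for easier reading.
- Rewrote each stage description from paragraphs into itemized points that highlight the core workflow and key points.
- Unified formatting and improved the readability of the parameter tables and return-data description; no changes to core functionality.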

v1.0.0

- Initial release of pdf-ocr-layout-understanding: multimodal document parsing tool.
- High-precision extraction of tables (to Markdown) and automatic cropping of figures/charts from PDFs and images.
- Deep semantic analysis: uses GLM-4.7 for logical interpretation of tables and GLM-4.6V for visual understanding of figures.
- Returns a structured JSON report with bounding boxes, extracted content, and in-depth semantic insights.
- Supports CLI pipeline for PDF/image file input and output to designated directory.
- Requires ZHIPU_API_KEY, Python 3.8+, and dependencies: zhipuai, pillow, beautifulsoup4.

Metadata

Slug pdf-ocr-layout

Version 1.0.2

License —

All-time Installs 8

Active Installs 6

Total Versions 3

Frequently Asked Questions

What is pdf-ocr-layout?

pdf-ocr-layout is a multimodal deep document-parsing tool built on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V. It extracts tables from documents (PDF/images) into Markdown, crops figures and charts from pages into separate files, and analyzes the results: GLM-OCR handles visual extraction, GLM-4.7 handles text and table analysis, and GLM-4.6V handles visual analysis of figures and charts. It is an AI Agent Skill for Claude Code / OpenClaw, with 1470 downloads so far.

How do I install pdf-ocr-layout?

Run "/install pdf-ocr-layout" in the OpenClaw or Claude Code chat to install it in one step. To run the scripts you also need ZHIPU_API_KEY set and the Python dependencies installed.

Is pdf-ocr-layout free?

Yes, pdf-ocr-layout is completely free (open-source). You can download, install and use it at no cost.

Which platforms does pdf-ocr-layout support?

pdf-ocr-layout is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created pdf-ocr-layout?

It is built and maintained by baokui (@baokui); the current version is v1.0.2.
