vision ocr

Name: vision ocr
Author: zhangxusong637

Description

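Recognizes images and PDF documents by calling the OCR and multimodal services you have configured, outputs the result as Markdown, and can optionally send it to Feishu. Suited to screenshots, scans, tables, receipts, and technical documents.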

Usage Guidance

This package appears coherent for OCR/PDF tasks, but check these before enabling:
- Do not run node index.js --update-config (or update-config.js) unless you intend to persist VISION_* tokens into the skill's config.json; stored tokens will reside on disk and should be protected by file-system permissions.
- Keep VISION_RESOLVE_OPENCLAW_SESSION and VISION_ALLOW_REMOTE_INPUT disabled (default false) unless you explicitly need CLI-session recovery or remote URL downloads; enabling them grants the skill access to OPENCLAW_* env and ~/.openclaw/runtime.json or allows downloading remote attachments.
- If you enable automatic Feishu sending, ensure the optional feishu-send-files integration is trustworthy and that you really want results sent to chat targets discovered from context; otherwise keep auto-send off and use --no-send-to-feishu.
- Review the local config.json and remove any secrets from the repository before sharing.
If you need higher assurance, audit the remaining parts of index.js (network calls to multimodal baseUrl) to confirm it only talks to configured endpoints and does not leak content elsewhere.
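As a rough illustration of the file-permission advice above, the following sketch checks whether a file is accessible to anyone other than its owner and tightens it if so. It assumes a POSIX file system; the function names `is_owner_only` and `restrict_to_owner` are hypothetical and are not part of the skill.

```python
import os
import stat


def is_owner_only(path: str) -> bool:
    """Return True if neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def restrict_to_owner(path: str) -> None:
    """Set the file to read/write for the owner only (mode 0o600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

Calling `restrict_to_owner` on the skill's config.json after persisting tokens would be one way to apply this; where that file lives depends on your installation.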

Capability Analysis

Type: OpenClaw Skill
Name: vision-ocr
Version: 1.1.2

The vision-ocr skill provides document and PDF OCR capabilities with Feishu integration, requiring high-risk operations such as executing external processes (index.js uses execFileSync to run pdf-helper.py and feishu-send-files), downloading remote files, and reading sensitive session data from ~/.openclaw/runtime.json. While the code includes security measures like SSRF protection (isPrivateIpAddress) and file size limits, the ability to fetch remote content and access session tokens represents a significant attack surface. No evidence of intentional malice was found, but the high-privilege nature of its operations and the handling of sensitive documents justify a suspicious classification.
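To make the SSRF check mentioned above more concrete, here is a minimal sketch of how such a filter is commonly written. It is not the skill's actual isPrivateIpAddress implementation (which lives in index.js and is not shown here); the names `is_private_ip_address` and `is_url_allowed`, and the injectable `resolve` parameter, are assumptions for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_private_ip_address(host: str) -> bool:
    """Return True if host is an IP literal in a non-public range."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; the caller must resolve it first
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_url_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Allow only http(s) URLs whose host resolves solely to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, None)
    except OSError:
        return False  # unresolvable hosts are rejected
    if not infos:
        return False
    # Drop any IPv6 scope suffix (e.g. "%eth0") before classifying the address.
    addresses = [info[4][0].split("%")[0] for info in infos]
    return not any(is_private_ip_address(addr) for addr in addresses)
```

A check like this validates the address at resolution time only; if the download later resolves the host name again, the result may differ, so connecting to the already-validated address is the safer design.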

Capability Assessment

ℹ Purpose & Capability

The skill's name/description (OCR + PDF → Markdown, optional Feishu send) matches the included code and config (index.js, pdf-helper.py, config.example.json). Skill.json correctly lists Node/Python and VISION_* envs needed for OCR/multimodal services. Minor inconsistency: the top-level Registry metadata reported no required binaries/envs, while the packaged skill declares node/python and VISION_* envs — likely a metadata omission but worth noting.

✓ Instruction Scope

SKILL.md and code limit actions to OCR, PDF → images conversion, optional multimodal model calls, and optional Feishu file sending. Reading of OpenClaw session info (OPENCLAW_* env or ~/.openclaw/runtime.json) and remote-attachment downloading are gated behind explicit CLI flags or env toggles (VISION_RESOLVE_OPENCLAW_SESSION, VISION_ALLOW_REMOTE_INPUT, --resolve-openclaw-session, --allow-remote-input). The skill inspects common message fields to locate local file paths or download URLs; this is appropriate for its stated purpose but does mean it can read any path provided in message context.
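The opt-in gating described above can be sketched as follows. This is an illustrative pattern rather than the skill's own code: the helper names and the set of accepted truthy strings are assumptions, while the flag and variable names in the usage note come from the text.

```python
import os


def env_flag(name: str, env=None) -> bool:
    """Treat an environment variable as enabled only for explicit truthy values."""
    env = os.environ if env is None else env
    # Assumed truthy spellings; anything else, including unset, means disabled.
    return env.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def feature_enabled(env_name: str, cli_flag: str, argv, env=None) -> bool:
    """A feature is on only if its CLI flag is present or its env toggle is truthy."""
    return cli_flag in argv or env_flag(env_name, env)
```

For example, `feature_enabled("VISION_ALLOW_REMOTE_INPUT", "--allow-remote-input", sys.argv[1:])` would stay false unless the user opts in through either route, which keeps the default behaviour local-only.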

✓ Install Mechanism

No network download/install step embedded in the package; it's an instruction/code-only skill. Dependencies (PyMuPDF / Python / Node.js) are standard and expected for PDF → image conversion and runtime. No suspicious remote download URLs or extract steps found in packaging.

ℹ Credentials

Required environment variables (VISION_IMAGEOCR_*, VISION_MULTIMODAL_*, VISION_AUTO_SEND_TO_FEISHU, etc.) are proportional to OCR/multimodal and Feishu features. Caveat: update-config.js can write VISION_* values into a local config.json in the skill directory — this will persist tokens on disk. Reading OPENCLAW_* env and ~/.openclaw/runtime.json only occurs when explicit session-resolve flags are enabled; that behavior is documented but should be treated as sensitive since it can expose session identifiers.

✓ Persistence & Privilege

The skill is not force-included (always: false) and does not attempt to modify global OpenClaw config; update-config.js only writes a local config.json in the skill directory. It does spawn a Python helper (pdf-helper.py) via exec semantics, which is expected for PDF processing.
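The skill is described as launching its helper from Node.js; the sketch below shows the same no-shell pattern in Python purely for illustration. The function name `run_helper` and its limits are hypothetical. Passing arguments as a list means they are never interpreted by a shell, so a file name containing `;` or spaces stays a single argument.

```python
import subprocess
import sys


def run_helper(args, timeout_s: float = 60.0,
               max_output_bytes: int = 10 * 1024 * 1024) -> str:
    """Run the current Python interpreter with args (no shell) and return stdout."""
    result = subprocess.run(
        [sys.executable, *args],  # argument list, so no shell parsing occurs
        capture_output=True,
        timeout=timeout_s,        # raises subprocess.TimeoutExpired on overrun
        check=True,               # raises CalledProcessError on non-zero exit
    )
    # This size check runs after the process has exited; it only bounds what
    # the caller accepts and does not stop the child from producing output.
    if len(result.stdout) > max_output_bytes:
        raise ValueError("helper output exceeds the configured limit")
    return result.stdout.decode("utf-8")
```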

Version History

v1.1.2

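1.1.2 security-hardening release:
- Attachment downloads now apply SSRF filtering (downloads from localhost and private network ranges are blocked).
- Downloads support a timeout and a maximum file size limit.
- The PDF helper invocation gains maxBuffer and PYTHON_PATH configuration options for better compatibility.
- Revision notes and documentation updated to match.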
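A maximum download size such as the one introduced in 1.1.2 is typically enforced while streaming rather than after the fact. The following is a minimal sketch of that idea, not the skill's implementation; `read_capped` and `DownloadTooLarge` are hypothetical names.

```python
import io


class DownloadTooLarge(Exception):
    """Raised when a stream yields more bytes than the configured limit."""


def read_capped(stream, max_bytes: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read a binary stream in chunks, aborting once max_bytes is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise DownloadTooLarge(f"stream exceeded {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

The `stream` argument can be any object with a `read(n)` method, for instance the response returned by `urllib.request.urlopen(url, timeout=10)`, where the `timeout` argument covers the timeout side of the same change.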

v1.1.1

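[1.1.1] - 2026-03-31
Changed
- Revision notes made explicit: automatic preprocessing of large images, unified base64 cleanup, a fixed pdf-helper.py, and remote attachments and OpenClaw session recovery changed to explicit opt-in.
Fixed
- The release package now includes the correct pdf-helper.py, avoiding the instability caused by generating the script dynamically at runtime.

[1.1.0] - 2026-03-30
Added
- Initial features include image/PDF OCR, Markdown output, optional Feishu sending, multimodal integration, page-by-page PDF processing, remote input permission control, and more.

vision-ocr 1.0.0 initial release
- First release of the vision-ocr skill for recognizing images and PDF documents.
- Calls the user's configured OCR and multimodal services and outputs results in Markdown format.
- Supports many document scenarios, including screenshots, scans, tables, receipts, and technical documents.
- Feishu integration: recognition results can optionally be sent to a Feishu conversation (off by default).
- Added pdf-helper.py to assist with PDF document processing.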

v1.1.0

vision-ocr 1.0.0
- Initial release of the vision-ocr skill for OpenClaw.
- Supports image and PDF recognition, outputs structured Markdown results.
- Integrates with OCR and multimodal services as configured.
- Feishu (Lark) sending is supported optionally; results stay local if not enabled.
- Designed for screenshots, scans, tables, receipts, technical docs, and handwritten notes.
- Provides clear workflow examples and configuration guidance for both CLI and Feishu integration scenarios.

v1.0.9

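vision-ocr v1.0.9: first release
- Recognizes images and PDF documents and automatically outputs structured Markdown (supports complex structures mixed with HTML tables).
- Integrates with a Feishu bot and can automatically send OCR results to the current conversation (must be explicitly enabled).
- Optimized recognition flow for multiple scenarios: documents, scans, receipts, technical documents, and handwritten content are each matched automatically to the best strategy.
- PDFs support page-by-page OCR and viewing basic information.
- Clarified the main-session/sub-session invocation constraints and improved the integration and usage notes.
- Flexible configuration: supports environment variables and local configuration, and is compatible with multiple OCR and multimodal API services.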

v1.0.8

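vision-ocr 1.0.8
- Added config.json support, allowing a local configuration file to be used directly.
- Restored automatic OpenClaw session detection and automatic sending to the current Feishu conversation.
- Removed run-as-bot.js and CHANGELOG.md, trimming redundant scripts and log locations.
- Supports OCR-only mode, automatic routing across image categories, automatic download and recognition of local and remote attachments, and automatic optimization for code screenshots.
- Added a diagnostic summary of the bot pipeline to help troubleshoot integration issues.
- Improved the documentation, clarifying the prerequisites for automatic sending and the configuration notes.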

v1.0.7

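Added generic detection of placeholder/invalid content, with automatic switching to image description mode; added the --skip-ocr parameter; improved the placeholder detection logic.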

v1.0.6

Unified the version number across all files to 1.0.6.

v1.0.5

Fixed content errors in version 1.0.4; unified the version number across all files to 1.0.4.

v1.0.4

Fixed content errors in version 1.0.3; unified the version number across all files to 1.0.3.

v1.0.3

Unified the version number to 1.0.3, fixing version inconsistencies across all files.

v1.0.2

Fix: Remove real open_id from code examples in SKILL.md

v1.0.1

Update with fixed file permissions

v1.0.0

Initial release: supports image/PDF OCR, multimodal integration, optimized handwriting recognition, and automatic sending to Feishu.

v2.5.0

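vision-ocr v2.5.0
- OCR and large-model processing are combined in a single pass, giving a simpler flow and better performance.
- Multi-level integration removed, significantly reducing API calls.
- Improved caching: both the Base64 encoding and the recognition results are cached by file hash, so repeated files return within seconds.
- PDF files support streaming processing, with temporary files cleaned up immediately after recognition.
- Supports automatic type detection, user confirmation, recognition quality validation, and automatic retry.
- Recognition results can be sent to Feishu automatically; added performance statistics and a richer set of command-line parameters.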
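The hash-keyed caching mentioned in the v2.5.0 notes can be pictured with a small sketch. The notes do not say which hash or storage the skill uses, so SHA-256 and the in-memory dictionary here are assumptions, as are the names `file_sha256` and `ResultCache`.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ResultCache:
    """Cache recognition results keyed by file content, not by file name."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, path: str, recognize):
        """Return the cached result for this content, calling recognize on a miss."""
        key = file_sha256(path)
        if key not in self._store:
            self._store[key] = recognize(path)
        return self._store[key]
```

Keying on content means a renamed or re-uploaded copy of the same file hits the cache, while any change to the bytes produces a new key and a fresh recognition run.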

Metadata

Slug vision-ocr

Version 1.1.2

License MIT-0

All-time Installs 2

Active Installs 1

Total Versions 14

Frequently Asked Questions

What is vision ocr?

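It recognizes images and PDF documents by calling the OCR and multimodal services you have configured, outputs the result as Markdown, and can optionally send it to Feishu. It is suited to screenshots, scans, tables, receipts, and technical documents. It is an AI Agent Skill for Claude Code / OpenClaw, with 465 downloads so far.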

How do I install vision ocr?

Run "/install vision-ocr" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is vision ocr free?

Yes, vision ocr is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does vision ocr support?

vision ocr is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created vision ocr?

It is built and maintained by zhangxusong637 (@zhangxusong637); the current version is v1.1.2.
