Description

Trigger when: (1) User wants to extract text, tables, formulas, or structured data from images/PDFs/scanned documents, (2) User mentions "OCR", "文字识别", "文档解析...

README (SKILL.md)

OpenClaw Skill: glmocr

Name: GLM-OCR-SDK
Author: jaredforreal

Parses documents (images, PDFs, scans) via the GLM-OCR SDK.

📌 On-demand: This skill requires only ZHIPU_API_KEY in the environment. No YAML config files or GPU needed.

⚡ Quick Start

# Install
pip install glmocr

# Set API key (once)
export ZHIPU_API_KEY=sk-xxx
# or add to .env file in working directory:
echo "ZHIPU_API_KEY=sk-xxx" >> .env

# One-liner (Python)
import glmocr
result = glmocr.parse("document.pdf")
print(result.markdown_result)
print(result.to_dict())

# CLI — pass API key directly (no env setup needed)
glmocr parse image.png --api-key sk-xxx

# Or load from a specific .env file
glmocr parse image.png --env-file /path/to/.env

# Or rely on env var / auto-discovered .env (set once, then omit)
glmocr parse image.png
glmocr parse ./scans/ --output ./output/ --stdout

Configuration Priority

Constructor kwargs  >  os.environ  >  .env file  >  config.yaml  >  built-in defaults

Agents override everything via constructor kwargs or env vars — no YAML editing needed.
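
The precedence chain above can be pictured with `collections.ChainMap`, which returns the value from the first mapping that defines a key. This is only a sketch of the lookup order, not the SDK's actual implementation; `resolve_settings` and the sample dictionaries are hypothetical.

```python
from collections import ChainMap

# Illustrative defaults only; the SDK's real built-in defaults may differ.
DEFAULTS = {"mode": "maas", "timeout": 600, "log_level": "INFO"}


def resolve_settings(kwargs=None, environ=None, dotenv=None, yaml_cfg=None):
    """Merge settings so that earlier sources win over later ones.

    Order: constructor kwargs > os.environ > .env file > config.yaml > defaults.
    """
    layers = [kwargs or {}, environ or {}, dotenv or {}, yaml_cfg or {}, DEFAULTS]
    return dict(ChainMap(*layers))


settings = resolve_settings(
    kwargs={"timeout": 1200},
    environ={"log_level": "DEBUG", "timeout": 900},
    yaml_cfg={"mode": "selfhosted", "log_level": "WARNING"},
)
print(settings)
```

Here the constructor value wins for `timeout`, the environment wins for `log_level`, and the YAML value for `mode` only survives because no higher layer sets it.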

Key Environment Variables

Variable	Description	Example
`ZHIPU_API_KEY`	API key (required for MaaS)	`sk-abc123`
`GLMOCR_MODEL`	Model name	`glm-ocr`
`GLMOCR_TIMEOUT`	Request timeout (seconds)	`600`
`GLMOCR_ENABLE_LAYOUT`	Layout detection on/off	`true`
`GLMOCR_LOG_LEVEL`	`DEBUG` / `INFO` / `WARNING` / `ERROR`	`INFO`

Python API

Convenience function (single call)

import glmocr

# Single file → PipelineResult
result = glmocr.parse("invoice.png")

# Multiple files → list[PipelineResult]
results = glmocr.parse(["page1.png", "page2.png", "report.pdf"])

Class-based (multiple calls / resource reuse)

from glmocr import GlmOcr

parser = GlmOcr(api_key="sk-xxx")   # mode auto-set to "maas"
parser.close()                      # call .close() when not using `with`

parser = GlmOcr(mode="maas")        # reads ZHIPU_API_KEY from env
parser.close()

# Preferred: use as a context manager so the parser is closed automatically
with GlmOcr(api_key="sk-xxx") as parser:
    result = parser.parse("document.png")
    print(result.markdown_result)

Constructor Parameters

Parameter	Type	Description
`api_key`	`str`	API key. Providing this auto-enables MaaS mode.
`api_url`	`str`	Override MaaS endpoint URL
`model`	`str`	Model name override
`timeout`	`int`	Request timeout in seconds (default: 600)
`enable_layout`	`bool`	Enable layout detection
`log_level`	`str`	Logging level

Working with `PipelineResult`

Fields

result.markdown_result    # str — full document as Markdown
result.json_result        # list[list[dict]] — structured regions per page
result.original_images    # list[str] — absolute paths of input images

`json_result` structure

List of pages → list of regions per page:

[
  [
    {
      "index": 0,
      "label": "title",
      "content": "Annual Report 2024",
      "bbox_2d": [100, 50, 900, 120]
    },
    {
      "index": 1,
      "label": "table",
      "content": "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |",
      "bbox_2d": [100, 140, 900, 400]
    }
  ]
]

Bounding boxes (bbox_2d): [x1, y1, x2, y2] normalised to 0–1000 scale.
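
To draw or crop a region, the normalised box has to be mapped back to pixels. The following is a minimal sketch, assuming x values are scaled against the image width and y values against the image height; `bbox_to_pixels` is a hypothetical helper, not part of the SDK.

```python
def bbox_to_pixels(bbox_2d, image_width, image_height, scale=1000):
    """Convert [x1, y1, x2, y2] on a 0-1000 scale to integer pixel coordinates."""
    x1, y1, x2, y2 = bbox_2d
    return (
        round(x1 * image_width / scale),
        round(y1 * image_height / scale),
        round(x2 * image_width / scale),
        round(y2 * image_height / scale),
    )


# The title box from the sample, on a hypothetical 2000 x 3000 pixel page image
title_box = [100, 50, 900, 120]
print(bbox_to_pixels(title_box, image_width=2000, image_height=3000))
# → (200, 150, 1800, 360)
```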

Region labels: title, text, table, figure, formula, header, footer, page_number, reference, seal

Serialization

# Dict (JSON-serializable, for passing to other tools)
d = result.to_dict()
# Keys: json_result, markdown_result, original_images, usage (MaaS), data_info (MaaS)

# JSON string
json_str = result.to_json()                 # pretty-printed, ensure_ascii=False
json_str = result.to_json(indent=None)      # compact single line

# Save to disk: writes <stem>/<stem>.json + <stem>/<stem>.md + layout_vis/
result.save(output_dir="./output")
result.save(output_dir="./output", save_layout_visualization=False)

Error Handling

The SDK does not raise on MaaS errors — check to_dict() for an "error" key:

result = parser.parse("image.png")
d = result.to_dict()
if "error" in d:
    # Handle failure
    print("OCR failed:", d["error"])
else:
    print(d["markdown_result"])
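
Because failures are reported inside the dict rather than raised, callers that prefer exceptions can wrap the check. A minimal sketch operating on the output of `to_dict()`; `OcrError` and `require_success` are hypothetical names, and the sample dicts are made up for illustration.

```python
class OcrError(RuntimeError):
    """Raised when a result dict carries an "error" key."""


def require_success(result_dict):
    """Return the Markdown text, or raise OcrError if the result reports a failure."""
    if "error" in result_dict:
        raise OcrError(f"OCR failed: {result_dict['error']}")
    return result_dict.get("markdown_result", "")


ok = {"markdown_result": "# Annual Report 2024", "json_result": []}
failed = {"error": "request timed out"}

print(require_success(ok))
try:
    require_success(failed)
except OcrError as exc:
    print(exc)
```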

CLI Reference

Agent-preferred interface: use the CLI for most operations. Set ZHIPU_API_KEY in env once, then invoke as needed.

Supported input formats: .jpg, .jpeg, .png, .bmp, .gif, .webp, .pdf

Basic usage

# Parse a single file → saves to ./output/<stem>/
# MaaS mode is the default; ZHIPU_API_KEY must be set (or use --api-key)
glmocr parse image.png

# Pass API key directly without any env setup
glmocr parse image.png --api-key sk-xxx

# Parse a directory → saves each file to ./output/<stem>/
glmocr parse ./scans/

# Use self-hosted vLLM/SGLang instead of cloud
glmocr parse image.png --mode selfhosted

# Specify output directory
glmocr parse image.png --output ./results/

Read results in the terminal (agent-friendly)

# Print Markdown + JSON to stdout (and still save to disk)
glmocr parse image.png --stdout

# Print to stdout ONLY — do not write any files
glmocr parse image.png --stdout --no-save

# JSON only (no Markdown output)
glmocr parse image.png --stdout --json-only

# Pipe JSON into jq for structured extraction
glmocr parse image.png --stdout --json-only --no-save | jq '.[0] | map(select(.label=="table"))'
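
Where `jq` is not available, the same selection can be done in Python. This sketch assumes, as the `jq` filter above implies, that the JSON-only output is the pages → regions list; `tables_on_first_page` is a hypothetical helper and the sample string stands in for the CLI output.

```python
import json


def tables_on_first_page(json_text):
    """Mirror of the jq filter: keep regions labelled "table" on the first page."""
    pages = json.loads(json_text)
    if not pages:
        return []
    return [region for region in pages[0] if region.get("label") == "table"]


# Stand-in for text captured from the CLI's stdout
sample = json.dumps(
    [[
        {"index": 0, "label": "title", "content": "Annual Report 2024"},
        {"index": 1, "label": "table", "content": "| Q1 | Q2 |"},
    ]]
)
tables = tables_on_first_page(sample)
print(len(tables), tables[0]["index"])
```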

Save control

# Skip layout visualization images (faster, smaller output)
glmocr parse image.png --no-layout-vis

# Parse and save only JSON + Markdown, skip layout vis
glmocr parse image.png --no-layout-vis --output ./results/

Batch processing

# All images in a folder
glmocr parse ./invoice_scans/ --output ./parsed/ --no-layout-vis

# With progress visible in logs
glmocr parse ./docs/ --output ./parsed/ --log-level INFO

Debugging

glmocr parse image.png --log-level DEBUG

Full flag reference

Flag	Default	Description
`--api-key / -k`	env var	API key for MaaS mode (overrides `ZHIPU_API_KEY`)
`--mode`	`maas`	`maas` (cloud, default) or `selfhosted` (local GPU)
`--env-file`	auto	Path to `.env` file (default: auto-discover from cwd)
`--output / -o`	`./output`	Output directory
`--stdout`	off	Print JSON + Markdown to stdout
`--no-save`	off	Skip writing files (use with `--stdout`)
`--json-only`	off	stdout JSON only, no Markdown
`--no-layout-vis`	off	Skip layout visualization images
`--config / -c`	none	Path to YAML config override
`--log-level`	`INFO`	`DEBUG` / `INFO` / `WARNING` / `ERROR`

Typical Agent Workflow

receive document path / URL
       │
       ▼
glmocr.parse(path)            ← single call, handles PDF/image
       │
       ▼
result.to_dict()              ← safe to pass as tool output
       │
       ├── markdown_result    → hand to LLM for reading / summarization
       └── json_result        → structured extraction (tables, formulas, regions by label)

Filter by label

import glmocr

result = glmocr.parse("report.png")
regions = result.json_result[0]  # first page

tables = [r for r in regions if r["label"] == "table"]
formulas = [r for r in regions if r["label"] == "formula"]
body_text = [r for r in regions if r["label"] == "text"]
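
Once table regions are filtered out, their `content` can be split into rows. The sketch below assumes the content is a simple Markdown pipe table like the one in the `json_result` sample; it does not handle escaped pipes, and `parse_markdown_table` is a hypothetical helper.

```python
def parse_markdown_table(content):
    """Split a Markdown pipe table into a list of rows, each a list of cell strings."""
    rows = []
    for line in content.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header separator row such as |---|---|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


table_region = {"label": "table", "content": "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |"}
rows = parse_markdown_table(table_region["content"])
print(rows)
# → [['Q1', 'Q2'], ['120', '145']]
```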

Multi-page PDF → iterate pages

from glmocr import GlmOcr

with GlmOcr(api_key="sk-xxx") as parser:
    result = parser.parse("document.pdf")   # all pages in one PipelineResult
    for page_idx, page_regions in enumerate(result.json_result):
        print(f"Page {page_idx + 1}: {len(page_regions)} regions")
        for region in page_regions:
            print(f"  [{region['label']}] {region['content'][:60]}")
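
The per-page loop and the label filter can be combined into one helper that searches every page. A minimal sketch over data shaped like `json_result`; `collect_regions` is a hypothetical name and the sample pages are made up for illustration.

```python
def collect_regions(json_result, label):
    """Return (page_number, region) pairs for every region with the given label."""
    matches = []
    for page_number, page_regions in enumerate(json_result, start=1):
        for region in page_regions:
            if region.get("label") == label:
                matches.append((page_number, region))
    return matches


# Made-up two-page result, shaped like result.json_result
sample_json_result = [
    [
        {"index": 0, "label": "title", "content": "Annual Report 2024"},
        {"index": 1, "label": "table", "content": "| Q1 | Q2 |"},
    ],
    [
        {"index": 0, "label": "text", "content": "Revenue grew in both quarters."},
        {"index": 1, "label": "table", "content": "| Q3 | Q4 |"},
    ],
]

for page_number, region in collect_regions(sample_json_result, "table"):
    print(page_number, region["content"])
```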

Programmatic config (no env vars)

from glmocr.config import GlmOcrConfig

cfg = GlmOcrConfig.from_env(
    api_key="sk-xxx",
    mode="maas",
    timeout=600,
    log_level="DEBUG",
)

Output Directory Layout

After result.save(output_dir):

output_dir/
  <image_stem>/
    <image_stem>.json         ← structured regions
    <image_stem>.md           ← full Markdown (with cropped figure images)
    imgs/                     ← cropped figures referenced in Markdown
    layout_vis/               ← layout detection overlay images (if enabled)
      <image_stem>.jpg
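
A downstream tool can compute where the saved files should be without listing the directory. This sketch only mirrors the layout shown above with `pathlib`; `expected_outputs` is a hypothetical helper, and it assumes the stem is the input file name without its extension.

```python
from pathlib import Path


def expected_outputs(output_dir, input_path):
    """Build the paths that the layout above implies for one input file."""
    stem = Path(input_path).stem
    base = Path(output_dir) / stem
    return {
        "json": base / f"{stem}.json",
        "markdown": base / f"{stem}.md",
        "figures": base / "imgs",
        "layout_vis": base / "layout_vis",
    }


paths = expected_outputs("./output", "scans/invoice.png")
print(paths["json"].as_posix())
# → output/invoice/invoice.json
```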

Common Pitfalls

ZHIPU_API_KEY not set: SDK defaults to MaaS mode. Without a key, parse() will fail with a clear error message and quick-fix instructions. Set via export ZHIPU_API_KEY=sk-xxx, add to a .env file, or pass --api-key sk-xxx to the CLI.
Large PDFs: Default timeout is 600s. For very long documents increase with timeout=1200.
result.json_result is a string: Happens when the model returns malformed JSON. The SDK preserves the raw string — parse or log it manually.
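
For the last pitfall, a defensive wrapper can try to decode the raw string and fall back to returning it for logging. A minimal sketch; `coerce_json_result` is a hypothetical helper, not part of the SDK.

```python
import json


def coerce_json_result(json_result):
    """Return (pages, raw). pages is None when the raw string cannot be decoded."""
    if not isinstance(json_result, str):
        return json_result, None
    try:
        return json.loads(json_result), None
    except json.JSONDecodeError:
        # Keep the raw text so the caller can log or inspect it
        return None, json_result


pages, raw = coerce_json_result('[[{"label": "text", "content": "hello"}]]')
print(pages, raw)

pages_bad, raw_bad = coerce_json_result('[[{"label": "text", "content": ')
print(pages_bad, raw_bad)
```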

Usage Guidance

This skill appears to do what it says (OCR via Zhipu) and legitimately needs a ZHIPU_API_KEY. Before installing or using it, consider the following:

1) Do not pass API keys inline on the CLI (process lists and logs can leak them); prefer environment variables set in a controlled place.
2) Avoid pointing --env-file to system or user-wide files that contain unrelated secrets.
3) Verify the glmocr package source (PyPI package and GitHub repo or release tag) before pip installing; review its code or pinned release hash if you handle sensitive documents.
4) Confirm the default MaaS endpoint and avoid overriding api_url unless you trust the target: an api_url override could be used to send your documents and API key to an arbitrary server.
5) Review Zhipu's data retention/privacy policy before uploading confidential documents.

If you want higher assurance, provide the glmocr package code or the exact pip package version/sha so it can be audited; without that, exercise caution when processing sensitive data.
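
One way to follow the advice about inline keys when launching the CLI from Python is to hand the key to the child process through its environment, which keeps it out of the argument list. A minimal sketch; `build_glmocr_invocation` is a hypothetical helper, it uses only flags documented in the flag reference, and the actual `subprocess.run` call is left commented out.

```python
import os


def build_glmocr_invocation(input_path, api_key, base_env=None):
    """Return (argv, env) with the key in env, never on the command line."""
    env = dict(os.environ if base_env is None else base_env)
    env["ZHIPU_API_KEY"] = api_key
    argv = ["glmocr", "parse", input_path, "--stdout", "--json-only", "--no-save"]
    return argv, env


argv, env = build_glmocr_invocation("image.png", "sk-xxx", base_env={"PATH": "/usr/bin"})
print(argv)

# To actually run it (requires the glmocr CLI to be installed):
# import subprocess
# completed = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
```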

Capability Analysis

Type: OpenClaw Skill
Name: glmocr-sdk
Version: 1.0.4

The 'glmocr-sdk' skill bundle provides documentation and instructions for an AI agent to perform OCR tasks using the legitimate GLM-OCR SDK and Zhipu's cloud API. The SKILL.md file contains standard usage examples for both Python and CLI interfaces, requires a standard API key (ZHIPU_API_KEY), and lacks any indicators of malicious intent, unauthorized data access, or harmful prompt-injection instructions.

Capability Tags

requires-sensitive-credentials

Capability Assessment

✓ Purpose & Capability

Name/description, required primaryEnv (ZHIPU_API_KEY), and the SDK/CLI usage in SKILL.md are consistent: this is an OCR SDK that calls a MaaS API. Requiring the ZHIPU_API_KEY is proportionate to the stated purpose.

⚠ Instruction Scope

SKILL.md instructs the agent to pip install glmocr, set/export ZHIPU_API_KEY, invoke CLI with --api-key or --env-file, and allows constructor/CLI overrides (api_url, model, timeout). Two issues increase risk: (1) the api_url constructor parameter lets calls be redirected to an arbitrary endpoint (potential exfiltration) and (2) the CLI supports loading an arbitrary .env file path or passing the API key on the command line (which can expose secrets in process lists or logs). These behaviors expand the surface beyond a simple 'call Zhipu' flow and are not limited by the metadata.

ℹ Install Mechanism

The skill package is instruction-only (no install spec), but SKILL.md tells users/agents to run `pip install glmocr`. That means an external PyPI package will be installed at runtime; the package code is not included here and was not scanned. Installing uninspected packages is a moderate risk—acceptable for many cases but worth auditing the package/release before use.

ℹ Credentials

Only ZHIPU_API_KEY is declared as required, which matches the service. SKILL.md also documents optional GLMOCR_* env vars (timeouts, logging) that weren't explicitly declared—this is minor. However the CLI/constructor options (api_key inline, --env-file, api_url override) allow the agent to read arbitrary env files or submit data to non-standard endpoints; these capabilities increase the potential for accidental or malicious credential/data exposure.

✓ Persistence & Privilege

always:false; instruction-only skill with no install-time persistence or system-wide config modification requested. The skill does not request elevated persistent privileges.

Version History

v1.0.4

glmocr-sdk 1.0.4
- Updated environment requirements: the `python` binary is no longer explicitly required in the metadata.
- No functional or interface changes; documentation and usage remain unchanged.

v1.0.3

- Skill name changed from "glmocr" to "glmocr-sdk".
- No code or functional changes detected in this version.
- Documentation and usage instructions remain the same.

v1.0.2

- Switched required environment variable from GLMOCR_API_KEY to ZHIPU_API_KEY.
- Updated all setup instructions, usage examples, and documentation to reflect the new ZHIPU_API_KEY naming.
- CLI, Python API, and config table references now use ZHIPU_API_KEY as the primary way to provide the API key.
- No functional or code logic changes detected; environment variable naming is the only update.

v1.0.1

Version 1.0.1
- Added OpenClaw metadata block to the SKILL.md, including required environment variable (GLMOCR_API_KEY), binary dependency (python), primary environment variable designation, emoji, and homepage link.
- No code changes or functional updates. Documentation only.

v1.0.0

glm-ocr-sdk v1.0.0
- Initial release of the glmocr skill for document OCR and parsing via the GLM-OCR SDK.
- Supports extracting structured text, tables, formulas, and regions from images, PDFs, and scanned documents.
- Returns results as structured JSON (with labeled regions and bounding boxes) and Markdown.
- Offers simple Python and CLI interfaces; only requires setting GLMOCR_API_KEY (no YAML or GPU needed).
- Not intended for real-time camera feeds, audio transcription, or processing non-document images.

Metadata

Slug glmocr-sdk

Version 1.0.4

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 5

Frequently Asked Questions

What is GLM-OCR-SDK?

GLM-OCR-SDK is an AI Agent Skill for Claude Code / OpenClaw that extracts text, tables, formulas, and structured data from images, PDFs, and scanned documents. It has 435 downloads so far.

How do I install GLM-OCR-SDK?

Run "/install glmocr-sdk" in the OpenClaw or Claude Code chat to install it in one step. Using the skill still requires the glmocr package (pip install glmocr) and a ZHIPU_API_KEY, as described in Quick Start.

Is GLM-OCR-SDK free?

Yes, GLM-OCR-SDK is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does GLM-OCR-SDK support?

GLM-OCR-SDK is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created GLM-OCR-SDK?

It is built and maintained by Jared Wen (@jaredforreal); the current version is v1.0.4.
