← Back to Skills Marketplace
jaredforreal

GLM-V-Caption

by Jared Wen · GitHub ↗ · v1.0.3 · MIT-0
cross-platform ✓ Security Clean
539
Downloads
1
Stars
0
Active Installs
4
Versions
Install in OpenClaw
/install glmv-caption
Description
Generate captions (descriptions) for images, videos, and documents using ZhiPu GLM-V multimodal model series. Use this skill whenever the user wants to descr...
README (SKILL.md)

GLM-V Caption Skill

Generate captions for images, videos, and documents using the ZhiPu GLM-V multimodal model.

When to Use

  • Describe, caption, summarize, or interpret image/video/document content
  • User mentions "describe this image", "caption", "summarize this video", "图片描述", "视频摘要", "文档解读", "看图说话"
  • Extract visual or textual information from media files
  • Compare multiple images
  • User provides an image/video/file and asks what's in it

Supported Input Types

Type Formats Max Size Max Count Base64
Image jpg, png, jpeg 5MB / 6000×6000px 50
Video mp4, mkv, mov 200MB
File pdf, docx, txt, xlsx, pptx, jsonl 50

⚠️ file_url cannot mix with image_url or video_url in the same request. ⚠️ Videos and files only support URLs — local paths and base64 are NOT supported (images only).

Resource Links

Resource Link
Get API Key https://bigmodel.cn/usercenter/proj-mgmt/apikeys
API Docs Chat Completions / 对话补全

Prerequisites

API Key Setup / API Key 配置(Required / 必需)

This script reads the key from the ZHIPU_API_KEY environment variable and shares it with other Zhipu skills. 脚本通过 ZHIPU_API_KEY 环境变量获取密钥,与其他智谱技能共用同一个 key。

Get Key / 获取 Key: Visit Zhipu Open Platform API Keys / 智谱开放平台 API Keys to create or copy your key.

Setup options / 配置方式(任选一种):

  1. OpenClaw config (recommended) / OpenClaw 配置(推荐): Set in openclaw.json under skills.entries.glmv-caption.env:

    "glmv-caption": { "enabled": true, "env": { "ZHIPU_API_KEY": "你的密钥" } }
    
  2. Shell environment variable / Shell 环境变量: Add to ~/.zshrc:

    export ZHIPU_API_KEY="你的密钥"
    
  3. .env file / .env 文件: Create .env in this skill directory:

    ZHIPU_API_KEY=你的密钥
    

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔

  1. ONLY use GLM-V API — Execute the script python scripts/glmv_caption.py
  2. NEVER caption media yourself — Do NOT try to describe content using built-in vision or any other method
  3. NEVER offer alternatives — Do NOT suggest "I can try to describe it" or similar
  4. IF API fails — Display the error message and STOP immediately
  5. NO fallback methods — Do NOT attempt captioning any other way

📋 Output Display Rules (MANDATORY)

After running the script, you must show the full raw output to the user exactly as returned. Do not summarize, truncate, or only say "generated". Users need the original model output to evaluate quality.

  • Image captioning: show the full caption text
  • Multiple images: show each image result
  • Video/files: show the full understanding result
  • If token usage is included, you may optionally display it

How to Use

Caption an Image

python scripts/glmv_caption.py --images "https://example.com/photo.jpg"
python scripts/glmv_caption.py --images /path/to/photo.png

Caption Multiple Images

python scripts/glmv_caption.py --images img1.jpg img2.png "https://example.com/img3.jpg"

Caption a Video

python scripts/glmv_caption.py --videos "https://example.com/clip.mp4"

Caption a Document

python scripts/glmv_caption.py --files "https://example.com/report.pdf"
python scripts/glmv_caption.py --files "https://example.com/doc1.docx" "https://example.com/doc2.txt"

Custom Prompt

python scripts/glmv_caption.py --images photo.jpg --prompt "Describe the architecture style in detail"

Save Result

python scripts/glmv_caption.py --images photo.jpg --output result.json

Thinking Mode

python scripts/glmv_caption.py --images photo.jpg --thinking

CLI Reference

python {baseDir}/scripts/glmv_caption.py (--images IMG [IMG...] | --videos VID [VID...] | --files FILE [FILE...]) [OPTIONS]
Parameter Required Description
--images, -i One of Image paths or URLs (supports multiple, base64 OK)
--videos, -v One of Video paths or URLs (supports multiple, mp4/mkv/mov)
--files, -f One of Document paths or URLs (supports multiple, pdf/docx/txt/xlsx/pptx/jsonl)
--prompt, -p No Custom prompt (default: "请详细描述这张图片的内容" / "Please describe this image in detail")
--model, -m No Model name (default: glm-4.6v)
--temperature, -t No Sampling temperature 0-1 (default: 0.8)
--top-p No Nucleus sampling 0.01-1.0 (default: 0.6)
--max-tokens No Max output tokens (default: 1024, max 32768)
--thinking No Enable thinking/reasoning mode
--output, -o No Save result JSON to file
--pretty No Pretty-print JSON output
--stream No Enable streaming output

Note: --images, --videos, and --files are mutually exclusive per API limits.

Response Format

{
  "success": true,
  "caption": "A landscape photo showing a mountain range at sunset...",
  "usage": {
    "prompt_tokens": 128,
    "completion_tokens": 256,
    "total_tokens": 384
  }
}

Key fields:

  • success — whether the request succeeded
  • caption — the generated caption text
  • usage — token usage statistics
  • warning — present when content was blocked by safety review
  • error — error details on failure

Error Handling

API key not configured:

ZHIPU_API_KEY not configured. Get your API key at: https://bigmodel.cn/usercenter/proj-mgmt/apikeys

→ Show exact error to user, guide them to configure

Authentication failed (401/403): API key invalid/expired → reconfigure

Rate limit (429): Quota exhausted → inform user to wait

File not found: Local file missing → check path

Content filtered: warning field present → content blocked by safety review

Usage Guidance
This skill is coherent for its purpose but sends media and requests to Zhipu's API (open.bigmodel.cn) using your ZHIPU_API_KEY. Only install if you trust the provider with any images, videos, or documents you will submit. Avoid sending sensitive or private content unless you accept that it will be processed remotely. When configuring the key, prefer storing it in your agent's configuration (openclaw.json) rather than leaving secrets in world‑readable files. Review the included scripts/glmv_caption.py to confirm logging or additional network behavior meets your policies, and ensure your environment/network rules allow outbound HTTPS to bigmodel.cn before enabling the skill.
Capability Analysis
Type: OpenClaw Skill Name: glmv-caption Version: 1.0.3 The glmv-caption skill is a legitimate tool designed to interface with the ZhiPu GLM-V multimodal API for generating descriptions of images, videos, and documents. The implementation in `scripts/glmv_caption.py` follows standard practices for API integration, including secure handling of the `ZHIPU_API_KEY` environment variable and validation of local file inputs before processing. The instructions in `SKILL.md` are appropriately restrictive to ensure the AI agent utilizes the provided tool rather than hallucinating content, and the network communication is limited to the official ZhiPu API endpoint (open.bigmodel.cn).
Capability Tags
requires-sensitive-credentials
Capability Assessment
Purpose & Capability
Name/description ask for GLM‑V captioning and the code + SKILL.md require ZHIPU_API_KEY and call Zhipu's chat/completions API. The required environment variable and network calls align with the stated purpose.
Instruction Scope
SKILL.md instructs the agent to always run the provided script and forbids any local/fallback captioning; it also mandates showing the full raw model output. These constraints are unusual but consistent with forcing use of the external GLM‑V service. Be aware that local images (when provided) are converted to base64 and uploaded to the third‑party API — this is necessary for remote captioning but is potential data exposure.
Install Mechanism
No install spec or remote downloads; the skill is instruction‑only with one included Python script. No evidence of unexpected installers or external archive fetches.
Credentials
Only one credential is required (ZHIPU_API_KEY), which matches the integration. Note that providing the API key and using the script will send user media (images as base64 and URLs for videos/files) to bigmodel.cn — that data will leave the agent environment and be processed by the provider.
Persistence & Privilege
always is false and the skill does not request changing other skills or system configuration. It suggests storing the API key in OpenClaw config or .env, which is normal for credentialed skills.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install glmv-caption
  3. After installation, invoke the skill by name or use /glmv-caption
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.3
No user-visible changes in this version. - No file changes detected. - Behavior, features, and documentation remain unchanged from previous version.
v1.0.2
## glmv-caption 1.0.2 Changelog - No file changes were detected in this version. - Documentation and usage instructions remain the same. - No new features, bug fixes, or behavioral adjustments reported for this release.
v1.0.1
- Added OpenClaw metadata: environment variable requirements, primary environment variable, binary dependencies, emoji, and homepage link. - No functional or behavioral changes to the skill logic or user interface. - Documentation remains unchanged except for new metadata fields in SKILL.md.
v1.0.0
Initial release of the GLM-V Caption skill. - Generate detailed captions for images, videos, and documents using ZhiPu GLM-V multimodal models. - Supports various input types: images (jpg/png/jpeg, URL/local/base64), videos (mp4/mkv/mov, URL only), and files (pdf/docx/txt/xlsx/pptx/jsonl, URL only). - Includes flexible CLI usage with support for custom prompts, multiple inputs, and output options (JSON, pretty-print, save to file). - Provides clear rules for API key setup, input restrictions, and displaying model output to users. - Comprehensive error handling and response structure ensures transparent feedback on quota, authentication, safety review, and input validity.
Metadata
Slug glmv-caption
Version 1.0.3
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 4
Frequently Asked Questions

What is GLM-V-Caption?

Generate captions (descriptions) for images, videos, and documents using ZhiPu GLM-V multimodal model series. Use this skill whenever the user wants to descr... It is an AI Agent Skill for Claude Code / OpenClaw, with 539 downloads so far.

How do I install GLM-V-Caption?

Run "/install glmv-caption" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is GLM-V-Caption free?

Yes, GLM-V-Caption is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does GLM-V-Caption support?

GLM-V-Caption is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created GLM-V-Caption?

It is built and maintained by Jared Wen (@jaredforreal); the current version is v1.0.3.

💬 Comments