Description

A skill that uses GLM-V native grounding capabilities for coordinate conversion, bounding-box visualization, and more. GLM-V native grounding can locate any...

README (SKILL.md)

GLMV-Grounding Skill

Name: GLM-V-Grounding
Author: jaredforreal

Extract and visualize grounding results produced by GLM-V. Depending on the user prompt, grounding coordinates in model outputs may appear in different forms, including 2D bounding boxes, Objects Detection JSON, 2D points, 3D bounding boxes, and target-tracking JSON.

Note: GLM-V outputs coordinates x and y as relative values in the range 0-1000, normalized from the pixel coordinates x_pixel and y_pixel by the image width W and height H: x = round(x_pixel / W * 1000), y = round(y_pixel / H * 1000). The origin of the pixel coordinate system is the top-left corner.

Note: If the prompt does not explicitly specify a grounding format (for example, "find the location of xxx" or "draw a box around xxx"), treat the request as 2D bounding boxes by default.

When to use

Use GLM-V to ground targets in images: obtain grounding results in an image for any prompt-described target, with output formats such as 2D bounding box (default), 2D points, and 3D bounding box.
Use GLM-V to track targets in videos: obtain tracking results in a video for any prompt-described target, with output format like {"0": [{"label": ..., "bbox_2d": ...}, ...], ...}.
Use utility functions for extraction, conversion, and visualization: extract coordinates, points, and JSON from natural text; normalize and de-normalize coordinates; visualize boxes, points, 3D boxes, and video tracking results.

Set up your API key

Configure ZHIPU_API_KEY to call the GLM-V API.

Get your API key: https://www.bigmodel.cn/usercenter/proj-mgmt/apikeys
Configure it with:

python scripts/config_setup.py setup --api-key YOUR_KEY
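
Once the key is configured, a script can read it from the environment. The sketch below is only an illustration, not the skill's own loader: `load_settings` is a hypothetical helper, and it assumes the two variables documented under Security & Transparency (`ZHIPU_API_KEY`, plus the optional `GLM_GROUNDING_TIMEOUT` in seconds with a default of 60).

```python
import os

DEFAULT_TIMEOUT_SECONDS = 60.0  # documented default for GLM_GROUNDING_TIMEOUT


def load_settings(env=None):
    """Hypothetical helper: return (api_key, timeout_seconds) from an env mapping."""
    env = os.environ if env is None else env

    api_key = env.get("ZHIPU_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("ZHIPU_API_KEY is not set")

    # The timeout is optional; fall back to the documented default.
    raw_timeout = env.get("GLM_GROUNDING_TIMEOUT", "").strip()
    timeout = float(raw_timeout) if raw_timeout else DEFAULT_TIMEOUT_SECONDS
    if timeout <= 0:
        raise ValueError("GLM_GROUNDING_TIMEOUT must be positive")
    return api_key, timeout
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise without touching the real environment.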

Security & Transparency

Primary API key env: ZHIPU_API_KEY (required).
Timeout env: GLM_GROUNDING_TIMEOUT (optional, seconds, default 60).
API endpoint: fixed to official Zhipu Chat Completions endpoint in CLI implementation.
No dynamic key name switching: the skill expects ZHIPU_API_KEY consistently.
URL/local file handling: the skill can read local files or fetch user-provided URLs for processing/visualization; URL inputs are restricted to public http/https targets (localhost/private network targets are rejected).
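
To make the URL restriction concrete, here is a minimal sketch of a public-URL check. It is not the skill's actual validator: the function names and the injectable `resolver` parameter are assumptions for illustration, and the exact set of rejected address classes in the real code may differ.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def resolve_host(host):
    """Return the IP address strings that `host` resolves to (performs a DNS lookup)."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_public_url(url, resolver=resolve_host):
    """Hypothetical check: True only for http/https URLs whose host resolves to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = resolver(parsed.hostname)
    except OSError:  # includes socket.gaierror for unresolvable hosts
        return False
    if not addresses:
        return False
    for address in addresses:
        ip = ipaddress.ip_address(address.split("%")[0])  # drop any IPv6 scope id
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified):
            return False
    return True
```

Every resolved address is checked, so a hostname that maps to both a public and a private address is rejected.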

Runtime Dependencies

Install dependencies before use:

pip install -r scripts/requirements.txt

Main packages used by this skill:

requests
Pillow
opencv-python
numpy
matplotlib
decord

System dependency for video visualization:

ffmpeg

General workflow

	Input (image or video + Prompt)
		|
		▼
	Run glm_grounding_cli.py to get grounding results (natural language)
		|
		▼
	Return results (grounding results, visualized image or video)

How to Use

Run glm_grounding_cli.py to get grounding results

Ground any target in an image

python scripts/glm_grounding_cli.py --image-url "URL provided by user" --prompt "description of target for grounding"

Track any target in a video

python scripts/glm_grounding_cli.py --video-url "/path/to/video.mp4" --prompt "description of target for tracking" --visualize --visualization-dir "./vis"
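
When calling the CLI from Python rather than a shell, passing the arguments as a list avoids quoting problems with prompts that contain spaces or quotes. The sketch below uses only the flags shown in the commands above; `build_cli_args` and `run_cli` are hypothetical helper names, and the assumption that the CLI prints its result to standard output is not confirmed by this document.

```python
import subprocess
import sys

CLI_PATH = "scripts/glm_grounding_cli.py"


def build_cli_args(prompt, image_url=None, video_url=None, visualization_dir=None):
    """Hypothetical helper: build the argument list for glm_grounding_cli.py."""
    if (image_url is None) == (video_url is None):
        raise ValueError("pass exactly one of image_url or video_url")
    args = [sys.executable, CLI_PATH]
    if image_url is not None:
        args += ["--image-url", image_url]
    else:
        args += ["--video-url", video_url]
    args += ["--prompt", prompt]
    if visualization_dir is not None:
        # Mirrors the documented pairing of --visualize with --visualization-dir.
        args += ["--visualize", "--visualization-dir", visualization_dir]
    return args


def run_cli(**kwargs):
    """Run the CLI and return whatever it writes to standard output (assumed)."""
    completed = subprocess.run(
        build_cli_args(**kwargs), capture_output=True, text=True, check=True
    )
    return completed.stdout
```

For example, `build_cli_args("the red car", video_url="/path/to/video.mp4", visualization_dir="./vis")` yields the tracking command with each value as its own argument.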

Reply with grounding results

After receiving a grounding prompt from the user, your direct reply should be natural language that includes grounding coordinates. Coordinates $x$ and $y$ are relative values in [0, 1000], computed as:

$$ x = \operatorname{round}(x_{pixel} / W \times 1000), \qquad y = \operatorname{round}(y_{pixel} / H \times 1000) $$

where $x_{pixel}, y_{pixel}$ are pixel coordinates with origin (0, 0) at the top-left corner of the image, and W/H are the image width/height.
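
The conversion in both directions can be sketched in a few lines. These helpers are illustrative and are not the skill's own utility functions; the names are assumptions. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ slightly from other rounding conventions.

```python
def normalize_point(x_pixel, y_pixel, width, height):
    """Map pixel coordinates to the relative 0-1000 range."""
    return round(x_pixel / width * 1000), round(y_pixel / height * 1000)


def denormalize_point(x, y, width, height):
    """Map relative 0-1000 coordinates back to pixel coordinates."""
    return round(x * width / 1000), round(y * height / 1000)


def denormalize_box(box, width, height):
    """Convert a relative [x1, y1, x2, y2] box to pixel coordinates."""
    x1, y1, x2, y2 = box
    return [
        *denormalize_point(x1, y1, width, height),
        *denormalize_point(x2, y2, width, height),
    ]
```

For a 1920x1080 image, the pixel point (960, 540) normalizes to (500, 500), and the relative box [100, 200, 300, 400] maps back to the pixel box [192, 216, 576, 432].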

Unless otherwise specified, grounding results should use the following Python data formats:

2D bounding boxes: [[x1, y1, x2, y2], ...], extracted grounding result is a list of boxes, each box has 4 coordinate values
2D points: [[x, y], ...], extracted grounding result is a list of points, each point has 2 coordinate values
2D polygon: [[x1, y1], [x2, y2], ...], extracted grounding result is a polygon coordinate list, each vertex has 2 coordinate values
3D bounding boxes: [{"bbox_3d":[x_center, y_center, z_center, x_size, y_size, z_size, roll, pitch, yaw],"label":"category"}, ...], extracted grounding result is a JSON list where each object contains a category label and one 3D box with 9 values (center, size, and rotation)
Objects Detection JSON: [{'label': 'category', 'bbox_2d': [x1, y1, x2, y2]}, ...], extracted grounding result is a JSON list where each object contains a category label and one box
Video Objects Tracking JSON: {0: [{'label': 'car-1', 'bbox_2d': [1,2,3,4]}, {'label': 'car-2', 'bbox_2d': [2,3,4,5]}], 1: [{'label': 'car-2', 'bbox_2d': [4,5,6,7]}, {'label': 'person-1', 'bbox_2d': [10,20,30,40]}]}, extracted grounding result is a JSON object whose keys are video frame indices and values are lists of JSON objects, each containing a category label and one 2D box
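
As an illustration of working with the Video Objects Tracking format above, the following sketch regroups a frame-indexed result into one trajectory per label. `tracks_by_label` is a hypothetical helper, not part of the skill; it accepts frame keys given either as integers or as strings such as "0".

```python
from collections import defaultdict


def tracks_by_label(mot):
    """Group tracking results by label: {label: [(frame_index, bbox_2d), ...]}."""
    tracks = defaultdict(list)
    # Visit frames in numeric order, whether keys are 0 or "0".
    for frame in sorted(mot, key=int):
        for obj in mot[frame]:
            tracks[obj["label"]].append((int(frame), list(obj["bbox_2d"])))
    return dict(tracks)


example = {
    0: [{"label": "car-1", "bbox_2d": [1, 2, 3, 4]},
        {"label": "car-2", "bbox_2d": [2, 3, 4, 5]}],
    1: [{"label": "car-2", "bbox_2d": [4, 5, 6, 7]},
        {"label": "person-1", "bbox_2d": [10, 20, 30, 40]}],
}
print(tracks_by_label(example)["car-2"])  # [(0, [2, 3, 4, 5]), (1, [4, 5, 6, 7])]
```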

Python example


# 1. User grounding request and your reply
image="https://example.com/image.jpg"
prompt="Please box all people wearing Santa hats in the image and tell me their coordinates. Use red boxes, line thickness 3, and label format 'SantaHat-i'."

# 2. Get grounding results
python scripts/glm_grounding_cli.py --image-url "$image" --prompt "$prompt" --visualize --visualization-dir "./vis"

#  {
#         "ok": True,
#         "grounding_result": [[100, 200, 300, 400], [500, 600, 700, 800]],
#         "visualizations_result": (
#             {"visualized_image": "./vis/image_vis.jpg"}
#         ),
#         "raw_result": "1. Person 1: box [100, 200, 300, 400]\n2. Person 2: box [500, 600, 700, 800]. The box format is [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.",
#         "error": None,
#         "source": source,
#     }

Utility function quick reference

| Function | Purpose |
| --- | --- |
| `parse_coordinates_from_response(response_str, coords_type='bbox', init_context_window=2000, max_context_window=-1)` | Parse and extract all coordinate results from model responses (supports 2D bbox, point, polygon) |
| `parse_3d_boxes_from_response(response_str, max_context_window=-1)` | Parse and extract all 3D boxes and labels from model responses (strict and loose matching) |
| `parse_detection_from_response(response_str, max_context_window=-1)` | Parse and extract all 2D detection results from model responses (Objects Detection JSON format) |
| `parse_mot_from_response(response_str, max_context_window=-1)` | Parse and extract all video object tracking results from model responses (Video Objects Tracking JSON format) |
| `visualize_boxes(img_path=None, img_bytes=None, boxes=[], labels=None, renormalize=False, save_path=None, return_b64=False, save_optimized=True, **kwargs)` | Draw 2D boxes on images with labels, custom colors, and line thickness |
| `visualize_points(img_path=None, img_bytes=None, points=[], labels=None, renormalize=False, diameters=None, save_path=None, return_b64=False, save_optimized=True, distinct_colors=False, colors=None)` | Draw points on images with labels, custom size, and colors |
| `visualize_3d_boxes_glmv_simple(image_path, cam_params, bbox_3d_list, image_bytes=None, coord_format='xyzwhlpyr', save_path=None, save_optimized=False, return_b64=False, **kwargs)` | Draw projected 3D boxes on images using camera intrinsics (supports rotation and multiple coordinate formats) |
| `visualize_mot(video_path=None, video_bytes=None, mot_js=None, renormalize=False, save_path=None, return_b64=False, distinct_colors=True, **kwargs)` | Draw Video Objects Tracking boxes on each video frame with labels |
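
To give a feel for what coordinate extraction from a natural-language reply involves, here is a deliberately simplified sketch. It is not the implementation of `parse_coordinates_from_response`; the `extract_boxes` name and the regular expression are assumptions. It finds bracketed lists of exactly four numbers and evaluates them with `ast.literal_eval` instead of `eval`.

```python
import ast
import re

# A bracketed list of exactly four (possibly negative or decimal) numbers.
_NUMBER = r"-?\d+(?:\.\d+)?"
_BOX_PATTERN = re.compile(
    r"\[\s*" + _NUMBER + r"(?:\s*,\s*" + _NUMBER + r"){3}\s*\]"
)


def extract_boxes(text):
    """Hypothetical helper: return every [x1, y1, x2, y2] numeric list found in text."""
    return [ast.literal_eval(match.group(0)) for match in _BOX_PATTERN.finditer(text)]


reply = (
    "1. Person 1: box [100, 200, 300, 400]\n"
    "2. Person 2: box [500, 600, 700, 800]. "
    "The box format is [x1, y1, x2, y2]."
)
print(extract_boxes(reply))  # [[100, 200, 300, 400], [500, 600, 700, 800]]
```

The symbolic `[x1, y1, x2, y2]` in the reply is skipped because it contains no numbers, and `ast.literal_eval` only accepts Python literals, so arbitrary expressions in model output are never executed.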

Common errors

Coordinate values exceed 1000: if extracted coordinate values are greater than 1000, the model may have produced unnormalized coordinates due to prompt effects. Extract the target phrase from the user request (for example, "people wearing Santa hats"), then query the model again and explicitly require output coordinates to be relative values normalized to 0-1000 based on image size (for example, "Please box all people wearing Santa hats in the image and tell me their coordinates. Ensure the output coordinates are relative values normalized to 0-1000 based on image size.").
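
The check-and-retry step described above could look like the following sketch. Both helper names are hypothetical, and the retry wording simply reuses the example prompt from this section.

```python
NORMALIZATION_HINT = (
    "Ensure the output coordinates are relative values "
    "normalized to 0-1000 based on image size."
)


def has_out_of_range(boxes, upper=1000):
    """Return True if any coordinate in any box is greater than `upper`."""
    return any(value > upper for box in boxes for value in box)


def build_retry_prompt(target):
    """Build a follow-up prompt that explicitly requests 0-1000 normalized output."""
    return (
        f"Please box all {target} in the image and tell me their coordinates. "
        f"{NORMALIZATION_HINT}"
    )
```

If `has_out_of_range` returns True for the extracted boxes, query the model again with `build_retry_prompt("people wearing Santa hats")` in place of the original prompt.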

Usage Guidance

This skill appears to do what it says: it sends user-provided images/videos to the GLM-V service (Zhipu) using the ZHIPU_API_KEY and can save visualizations locally. Consider:

(1) Any media you supply will be transmitted to the external model provider (Zhipu). Avoid sending private or sensitive images you would not want processed by that service.
(2) The skill will create a .env file in its directory to store your API key. Keep that directory private and add .env to .gitignore if you put the repo under version control.
(3) The code resolves hostnames to block private IPs, which causes DNS lookups. If you are in a sensitive network environment, be aware of that network activity.
(4) Dependencies are installed from PyPI (standard risk) and ffmpeg is a system dependency for video output.

If you trust the upstream GLM-V provider and are comfortable sending media to their API, the package is coherent; otherwise do not install or run it and avoid providing an API key.

Capability Analysis

Type: OpenClaw Skill
Name: glmv-grounding
Version: 1.0.5

The glmv-grounding skill is a well-structured tool for image and video grounding using the Zhipu GLM-V API. The implementation in glm_grounding_cli.py includes proactive security measures, such as a fixed official API endpoint to prevent credential exfiltration and a robust URL validator (_is_public_url) that resolves hostnames to block access to private, loopback, and reserved IP addresses (mitigating SSRF). Data parsing from the AI model is performed using ast.literal_eval rather than unsafe eval, and the overall logic is consistent with the stated purpose of coordinate visualization and tracking.

Capability Tags

crypto
requires-sensitive-credentials

Capability Assessment

✓ Purpose & Capability

Name/description (grounding, coordinate conversion, visualization, tracking) aligns with the included scripts (CLI, utils for boxes/3D/video/detection) and the declared env var ZHIPU_API_KEY. Required binaries and config paths are minimal and expected for image/video processing.

✓ Instruction Scope

SKILL.md instructs to run the provided CLI and to install the listed Python deps. The code only reads user-supplied images/videos (local files or public http/https URLs), validates/blocks localhost and private IPs for URL inputs, loads/writes a local .env for the API key, and posts requests to a fixed Zhipu Chat Completions endpoint — all within the stated scope.

✓ Install Mechanism

No installers or remote executable downloads are embedded. The skill expects pip install -r scripts/requirements.txt (standard PyPI packages). That is proportional for a Python visualization/vision tool and uses well-known packages listed in requirements.txt.

✓ Credentials

Only ZHIPU_API_KEY (primary credential) and an optional GLM_GROUNDING_TIMEOUT are requested. These map directly to calling the GLM-V API and controlling request timeouts. The config_setup script writes a local, skill-scoped .env file, which is expected for storing the API key.

✓ Persistence & Privilege

always is false and the skill does not request system-wide or other-skills credentials. It writes/reads a .env within its own skill directory (normal for CLI tools). It does not modify other skills or global agent settings.

Version History

v1.0.5

Version 1.0.5 of glmv-grounding
- No code or documentation changes detected in this release.
- Functionality and usage remain unchanged from the previous version.

v1.0.4

No changes detected in this release.
- Version updated to 1.0.4 with no file modifications or content changes.
- Existing features and documentation remain unchanged.

v1.0.3

No visible file changes detected in this version.
- No code or documentation updates present.
- Version and internal metadata remain unchanged.

v1.0.2

- Added clarification that URL inputs are restricted to public http/https targets; localhost and private network URLs are now rejected for improved security.
- Updated metadata with a "source" field in addition to "homepage".
- No code or functionality changes detected.

v1.0.1

glmv-grounding 1.0.1
- Added scripts/requirements.txt specifying Python package dependencies.
- Documented required system dependency (ffmpeg) and main Python dependencies in SKILL.md.
- Declared support for new environment variable GLM_GROUNDING_TIMEOUT and required binary ffmpeg in metadata.
- Clarified installation and runtime setup instructions in SKILL.md.

v1.0.0

GLMV-Grounding Skill v1.0.0
- Initial release providing GLM-V native grounding capabilities for images and videos.
- Supports coordinate extraction, conversion, and visualization in multiple formats (2D bounding boxes, 2D points, polygons, 3D boxes, and tracking JSON).
- Coordinates are normalized to the 0–1000 range based on image size.
- Includes utility functions for parsing model responses and visualizing results.
- Requires configuration of ZHIPU_API_KEY for API access.

Metadata

Slug glmv-grounding

Version 1.0.5

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 6

Frequently Asked Questions

What is GLM-V-Grounding?

A skill that uses GLM-V native grounding capabilities for coordinate conversion, bounding-box visualization, and more. GLM-V native grounding can locate any... It is an AI Agent Skill for Claude Code / OpenClaw, with 401 downloads so far.

How do I install GLM-V-Grounding?

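Run "/install glmv-grounding" in the OpenClaw or Claude Code chat to install it in one step. Calling the GLM-V API still requires configuring ZHIPU_API_KEY and installing the runtime dependencies described in the README.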

Is GLM-V-Grounding free?

Yes, GLM-V-Grounding is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does GLM-V-Grounding support?

GLM-V-Grounding is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created GLM-V-Grounding?

It is built and maintained by Jared Wen (@jaredforreal); the current version is v1.0.5.
