Name: GLMV-Grounding Skill
Author: zai-org

GLMV-Grounding Skill

A skill that uses GLM-V native grounding capabilities for coordinate conversion, bounding-box visualization, and more. GLM-V native grounding can locate any target specified by the prompt in an image and output relative coordinates normalized to 0-1000 based on image size. Coordinate formats include 2D bounding box (default), 2D points, and 3D bounding box. GLM-V also supports spatiotemporal localization and tracking of multiple prompt-specified targets in videos, outputting 2D bounding boxes per second.

zai-org, 2,274 stars, 30.03.2026

Categories: Machine Learning

Extract and visualize grounding results produced by GLM-V. Depending on the user prompt, grounding coordinates in model outputs may appear in different forms, including 2D bounding boxes, Objects Detection JSON, 2D points, 3D bounding boxes, and target-tracking JSON.
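As a rough illustration of what extraction from free text involves, the sketch below pulls 2D boxes out of a response string. It assumes boxes appear as bracketed lists of four integers such as `[x1, y1, x2, y2]`, which the source does not confirm. `extract_boxes` is a hypothetical helper, not the skill's own `parse_coordinates_from_response`, which likely handles more formats.

```python
import re

# Assumption: a box is written as a bracketed list of exactly four
# non-negative integers, e.g. [120, 45, 870, 990].
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(response_str):
    """Return every four-integer bracketed list found in the text, in order."""
    boxes = []
    for match in BOX_PATTERN.finditer(response_str):
        box = [int(value) for value in match.groups()]
        # Keep only boxes whose values lie in the 0-1000 normalized range.
        if all(0 <= value <= 1000 for value in box):
            boxes.append(box)
    return boxes
```

Lists with a different number of elements, or with values outside 0-1000, are skipped rather than repaired, so a caller can tell "no box found" apart from a malformed one.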

Note: GLM-V outputs relative coordinates. x and y are the pixel coordinates x_pixel and y_pixel normalized by the image width W and height H to the range 0-1000, i.e., `x = round(x_pixel / W * 1000)` and `y = round(y_pixel / H * 1000)`. The origin of the pixel coordinate system is the top-left corner.

Note: If the prompt does not explicitly specify a grounding format (for example, "find the location of xxx" or "draw a box around xxx"), treat the request as 2D bounding boxes by default.
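The normalization described in the note, and its inverse, can be sketched in a few lines of Python. The function names are hypothetical, and the `[x1, y1, x2, y2]` box order is an assumption rather than something the source states.

```python
def normalize_point(x_pixel, y_pixel, width, height):
    """Map pixel coordinates to the 0-1000 relative range."""
    return round(x_pixel / width * 1000), round(y_pixel / height * 1000)


def denormalize_box(box, width, height):
    """Map a normalized box back to pixel coordinates.

    Assumption: the box is ordered [x1, y1, x2, y2], with the origin
    at the top-left corner of the image.
    """
    x1, y1, x2, y2 = box
    return [
        round(x1 / 1000 * width),
        round(y1 / 1000 * height),
        round(x2 / 1000 * width),
        round(y2 / 1000 * height),
    ]
```

For a 1920x1080 image, `normalize_point(960, 270, 1920, 1080)` gives `(500, 250)`, and `denormalize_box([500, 250, 750, 500], 1920, 1080)` gives `[960, 270, 1440, 540]`. Because of rounding, a round trip may differ from the original pixel value by a pixel or so.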

When to use

Use GLM-V to ground targets in images: obtain grounding results in an image for any prompt-described target, with output formats such as 2D bounding box (default), 2D points, and 3D bounding box.
Use GLM-V to track targets in videos: obtain tracking results in a video for any prompt-described target, with output format like {"0": [{"label": ..., "bbox_2d": ...}, ...], ...}.
Use utility functions for extraction, conversion, and visualization: extract coordinates, points, and JSON from natural text; normalize and de-normalize coordinates; visualize boxes, points, 3D boxes, and video tracking results.
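To make the tracking format concrete, here is a minimal sketch that reads a tracking result of the shape shown above and converts each box to pixel coordinates. It assumes the top-level keys are integer time indices written as strings and that `bbox_2d` is a normalized `[x1, y1, x2, y2]` list. `tracking_to_pixels` and the `bbox_pixel` key are hypothetical names, not part of the skill.

```python
import json


def tracking_to_pixels(mot_json_str, width, height):
    """Convert normalized tracking boxes to pixel boxes, keyed by integer index."""
    raw = json.loads(mot_json_str)
    result = {}
    for key, objects in raw.items():
        converted = []
        for obj in objects:
            # Assumption: bbox_2d is [x1, y1, x2, y2] in the 0-1000 range.
            x1, y1, x2, y2 = obj["bbox_2d"]
            converted.append({
                "label": obj["label"],
                "bbox_pixel": [
                    round(x1 / 1000 * width),
                    round(y1 / 1000 * height),
                    round(x2 / 1000 * width),
                    round(y2 / 1000 * height),
                ],
            })
        result[int(key)] = converted
    return result
```

Converting the keys to integers makes it straightforward to sort the entries in temporal order before drawing them onto video frames.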

Utility function quick reference

Function	Purpose
`parse_coordinates_from_response(response_str, coords_type='bbox', init_context_window=2000, max_context_window=-1)`	Parse and extract all coordinate results from model responses (supports 2D bbox, point, polygon)
`parse_3d_boxes_from_response(response_str, max_context_window=-1)`	Parse and extract all 3D boxes and labels from model responses (strict and loose matching)
`parse_detection_from_response(response_str, max_context_window=-1)`	Parse and extract all 2D detection results from model responses (Objects Detection JSON format)
`parse_mot_from_response(response_str, max_context_window=-1)`	Parse and extract all video object tracking results from model responses (Video Objects Tracking JSON format)
`visualize_boxes(img_path=None, img_bytes=None, boxes=[], labels=None, renormalize=False, save_path=None, return_b64=False, save_optimized=True, **kwargs)`	Draw 2D boxes on images with labels, custom colors, and line thickness
`visualize_points(img_path=None, img_bytes=None, points=[], labels=None, renormalize=False, diameters=None, save_path=None, return_b64=False, save_optimized=True, distinct_colors=False, colors=None)`	Draw points on images with labels, custom size, and colors
`visualize_3d_boxes_glmv_simple(image_path, cam_params, bbox_3d_list, image_bytes=None, coord_format='xyzwhlpyr', save_path=None, save_optimized=False, return_b64=False, **kwargs)`	Draw projected 3D boxes on images using camera intrinsics (supports rotation and multiple coordinate formats)
`visualize_mot(video_path=None, video_bytes=None, mot_js=None, renormalize=False, save_path=None, return_b64=False, distinct_colors=True, **kwargs)`	Draw Video Objects Tracking boxes on each video frame with labels
