Trigger when: (1) the user wants to extract text, tables, formulas, or structured data from images, PDFs, or scanned documents; (2) the user mentions "OCR", "文字识别" (text recognition), or "文档解析" (document parsing); (3) the user has a document (screenshot, scanned page, invoice, paper, whiteboard photo) and needs its content in structured form; (4) the user asks to parse, digitize, or extract content from a visual document. Invokes the GLM-OCR SDK (pip install glmocr) to parse documents via Zhipu's cloud API. No GPU required. Returns structured JSON (regions with labels and bounding boxes) and Markdown. The agent can operate entirely via the CLI; no YAML files are needed. NOT for: real-time camera feeds, audio transcription, or non-document images (photos, illustrations).
Parses documents (images, PDFs, scans) via the GLM-OCR SDK.
📌 On-demand: This skill requires only `ZHIPU_API_KEY` in the environment. No YAML config files or GPU needed.
# Install
pip install glmocr
# Set API key (once)
export ZHIPU_API_KEY=sk-xxx
# or add to .env file in working directory:
echo "ZHIPU_API_KEY=sk-xxx" >> .env
# One-liner
import glmocr
result = glmocr.parse("document.pdf")
print(result.markdown_result)
print(result.to_dict())
# CLI — pass API key directly (no env setup needed)
glmocr parse image.png --api-key sk-xxx
# Or load from a specific .env file
glmocr parse image.png --env-file /path/to/.env
# Or rely on env var / auto-discovered .env (set once, then omit)
glmocr parse image.png
glmocr parse ./scans/ --output ./output/ --stdout
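An agent that shells out to the CLI can assemble the invocations above programmatically. The following is a minimal sketch that uses only the flags shown above (`--api-key`, `--env-file`, `--output`, `--stdout`). The helper names `build_glmocr_command` and `run_glmocr` are hypothetical and not part of the SDK.

```python
import subprocess
from typing import List, Optional


def build_glmocr_command(
    source: str,
    output_dir: Optional[str] = None,
    api_key: Optional[str] = None,
    env_file: Optional[str] = None,
    stdout: bool = False,
) -> List[str]:
    """Assemble a `glmocr parse` argv list (hypothetical helper, not SDK code)."""
    cmd = ["glmocr", "parse", source]
    if api_key:
        cmd += ["--api-key", api_key]
    if env_file:
        cmd += ["--env-file", env_file]
    if output_dir:
        cmd += ["--output", output_dir]
    if stdout:
        cmd.append("--stdout")
    return cmd


def run_glmocr(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command and capture its output; needs the glmocr CLI on PATH."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with paths that contain spaces. Omitting both `api_key` and `env_file` leaves key discovery to the environment variable or an auto-discovered `.env` file, as described above.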
Configuration priority, highest first: constructor kwargs > os.environ > .env file > config.yaml > built-in defaults
Agents override everything via constructor kwargs or env vars — no YAML editing needed.
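To make the priority order concrete, here is an illustrative sketch of a "first source wins" lookup built on `collections.ChainMap`. It is not the SDK's actual implementation: the function `resolve_settings` is hypothetical, and it assumes every source has already been normalised to the same key names.

```python
from collections import ChainMap
from typing import Any, Mapping


def resolve_settings(
    kwargs: Mapping[str, Any],
    environ: Mapping[str, Any],
    dotenv: Mapping[str, Any],
    yaml_config: Mapping[str, Any],
    defaults: Mapping[str, Any],
) -> dict:
    """Merge setting sources; earlier arguments win over later ones."""
    # Treat kwargs left as None as "not provided" so they do not mask lower layers.
    provided = {k: v for k, v in kwargs.items() if v is not None}
    # ChainMap searches its mappings left to right and returns the first hit.
    return dict(ChainMap(provided, environ, dotenv, yaml_config, defaults))


settings = resolve_settings(
    kwargs={"timeout": 30, "model": None},
    environ={"timeout": 120, "model": "glm-ocr"},
    dotenv={"log_level": "DEBUG"},
    yaml_config={"log_level": "INFO", "enable_layout": False},
    defaults={"timeout": 600, "enable_layout": True, "log_level": "INFO"},
)
```

Here `timeout` comes from the kwargs, `model` from the environment, `log_level` from the `.env` layer, and `enable_layout` from the YAML layer, because each is the highest-priority source that defines the key.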
| Variable | Description | Example |
|---|---|---|
| `ZHIPU_API_KEY` | API key (required for MaaS) | `sk-abc123` |
| `GLMOCR_MODEL` | Model name | `glm-ocr` |
| `GLMOCR_TIMEOUT` | Request timeout (seconds) | `600` |
| `GLMOCR_ENABLE_LAYOUT` | Layout detection on/off | `true` |
| `GLMOCR_LOG_LEVEL` | DEBUG / INFO / WARNING / ERROR | `INFO` |
import glmocr
# Single file → PipelineResult
result = glmocr.parse("invoice.png")
# Multiple files → list[PipelineResult]
results = glmocr.parse(["page1.png", "page2.png", "report.pdf"])
from glmocr import GlmOcr
parser = GlmOcr(api_key="sk-xxx") # mode auto-set to "maas"
parser = GlmOcr(mode="maas") # reads ZHIPU_API_KEY from env
# Always use as context manager or call .close()
with GlmOcr(api_key="sk-xxx") as parser:
    result = parser.parse("document.png")
    print(result.markdown_result)
parser.close() # if not using `with`
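When a `with` block is not convenient, a `try`/`finally` wrapper gives the same guarantee that `.close()` runs. The sketch below relies only on the `.parse()` and `.close()` methods shown above. The helper `parse_all` and its `make_parser` factory argument are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, List


def parse_all(make_parser: Callable[[], Any], paths: List[str]) -> list:
    """Parse each path with one parser and close it even if a call raises."""
    parser = make_parser()
    try:
        return [parser.parse(path) for path in paths]
    finally:
        # Runs on both success and failure, so the parser is always released.
        parser.close()
```

With the SDK installed, a call might look like `parse_all(lambda: GlmOcr(api_key="sk-xxx"), ["page1.png", "page2.png"])`.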
| Parameter | Type | Description |
|---|---|---|
| `api_key` | str | API key. Providing this auto-enables MaaS mode. |
| `api_url` | str | Override MaaS endpoint URL |
| `model` | str | Model name override |
| `timeout` | int | Request timeout in seconds (default: 600) |
| `enable_layout` | bool | Enable layout detection |
| `log_level` | str | Logging level |
PipelineResult
result.markdown_result  # str: full document as Markdown
result.json_result      # list[list[dict]]: structured regions per page
result.original_images  # list[str]: absolute paths of input images
json_result structure
List of pages → list of regions per page:
[
[
{
"index": 0,
"label": "title",
"content": "Annual Report 2024",
"bbox_2d": [100, 50, 900, 120]
},
{
"index": 1,
"label": "table",
"content": "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |",
"bbox_2d": [100, 140, 900, 400]
}
]
]
Bounding boxes (`bbox_2d`): `[x1, y1, x2, y2]`, normalised to a 0–1000 scale.
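To crop or draw a region on the original image, the normalised box has to be mapped back to pixels. The sketch below assumes that x values are scaled against the image width and y values against the image height; the source does not spell this out. The helper `bbox_to_pixels` is a hypothetical name, not an SDK function.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox_2d: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Map [x1, y1, x2, y2] on a 0-1000 scale to pixel coordinates."""
    x1, y1, x2, y2 = bbox_2d
    # Assumption: each axis is normalised independently (x by width, y by height).
    return (
        round(x1 * width / 1000),
        round(y1 * height / 1000),
        round(x2 * width / 1000),
        round(y2 * height / 1000),
    )


# Title box from the example above on a 2000 x 3000 pixel page.
title_box = bbox_to_pixels([100, 50, 900, 120], width=2000, height=3000)
# title_box == (200, 150, 1800, 360)
```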
Region labels: title, text, table, figure, formula, header, footer, page_number, reference, seal
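A common follow-up is to pull out every region of one type, for example all tables. The sketch below walks the pages-then-regions nesting described above and groups regions by their `label`. The helper `regions_by_label` and the added `page` key are illustrative, not part of the SDK output.

```python
from collections import defaultdict
from typing import Dict, List


def regions_by_label(json_result: List[List[dict]]) -> Dict[str, List[dict]]:
    """Group regions from all pages by label, recording a 0-based page index."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for page_index, page in enumerate(json_result):
        for region in page:
            # Copy the region and attach the page it came from.
            grouped[region["label"]].append({**region, "page": page_index})
    return dict(grouped)


sample = [
    [
        {"index": 0, "label": "title", "content": "Annual Report 2024",
         "bbox_2d": [100, 50, 900, 120]},
        {"index": 1, "label": "table",
         "content": "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |",
         "bbox_2d": [100, 140, 900, 400]},
    ]
]

tables = regions_by_label(sample).get("table", [])
```

Using `.get("table", [])` keeps the caller safe when a document contains no regions with that label.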
# Dict (JSON-serializable, for passing to other tools)
d = result.to_dict()
# Keys: json_result, markdown_result, original_images, usage (MaaS), data_info (MaaS)
# JSON string
json_str = result.to_json() # pretty-printed, ensure_ascii=False
json_str = result.to_json(indent=None) # compact single line
# Save to disk: writes <stem>/<stem>.json + <stem>/<stem>.md + layout_vis/
result.save(output_dir="./output")
result.save(output_dir="./output", save_layout_visualization=False)
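To locate the saved files afterwards, the paths can be derived from the input name. This sketch assumes the `<stem>/<stem>.json` and `<stem>/<stem>.md` layout noted in the comment above. The helper `expected_outputs` is hypothetical, and it leaves out `layout_vis/` because the source does not say exactly where that directory sits.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(input_path: str, output_dir: str) -> Dict[str, Path]:
    """Predict where save() writes the JSON and Markdown for one input file."""
    stem = Path(input_path).stem           # "scans/invoice.png" -> "invoice"
    base = Path(output_dir) / stem         # one sub-directory per input file
    return {
        "json": base / f"{stem}.json",
        "markdown": base / f"{stem}.md",
    }


paths = expected_outputs("scans/invoice.png", "./output")
```

Checking `paths["json"].exists()` after `result.save(...)` is a cheap way to confirm the assumed layout matches the installed SDK version.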
The SDK does not raise on MaaS errors — check to_dict() for an "error" key:
result = parser.parse("image.png")
d = result.to_dict()
if "error" in d:
    # Handle failure
    print("OCR failed:", d["error"])
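For batch runs, the same `"error"` key check can separate successes from failures so that one bad page does not stop the rest. The sketch below operates on the dicts returned by `to_dict()`. The helper `split_results` is a hypothetical name, and the sample dicts only mimic the keys listed above.

```python
from typing import Iterable, List, Tuple


def split_results(result_dicts: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Partition to_dict() outputs into (succeeded, failed) by the "error" key."""
    succeeded: List[dict] = []
    failed: List[dict] = []
    for d in result_dicts:
        (failed if "error" in d else succeeded).append(d)
    return succeeded, failed


# Illustrative stand-ins for to_dict() output, not real API responses.
batch = [
    {"markdown_result": "# Annual Report 2024", "json_result": [[]]},
    {"error": "request timed out"},
]
ok, bad = split_results(batch)
```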