Name: Video Understanding
Author: openclaw

Video Understanding (Gemini)

Analyze videos using Google Gemini's multimodal video understanding. Supports 1000+ video sources via yt-dlp.

Requirements

By default, returns structured JSON with the following fields:

transcript — Verbatim transcript with [MM:SS] timestamps
description — Visual description (people, setting, UI, text on screen, flow)
summary — 2-3 sentence summary
duration_seconds — Estimated duration
speakers — Identified speakers
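
As an illustration of consuming this output, here is a minimal Python sketch that checks for the five documented fields and converts the `[MM:SS]` transcript timestamps to seconds. The helper names (`parse_result`, `timestamps_in_seconds`), the value types (for example, `speakers` as a list), and the sample payload are assumptions made for the example, not actual tool output.

```python
import json
import re

# Field names taken from the list above; their value types are assumed.
EXPECTED_FIELDS = ("transcript", "description", "summary",
                   "duration_seconds", "speakers")

# Matches timestamps of the form [MM:SS], e.g. [01:30].
TIMESTAMP = re.compile(r"\[(\d{1,3}):(\d{2})\]")


def parse_result(raw):
    """Parse JSON text and verify that every documented field is present."""
    data = json.loads(raw)
    missing = [name for name in EXPECTED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def timestamps_in_seconds(transcript):
    """Return each [MM:SS] timestamp in the transcript as whole seconds."""
    return [int(m) * 60 + int(s) for m, s in TIMESTAMP.findall(transcript)]


# Made-up sample payload, for illustration only.
sample = json.dumps({
    "transcript": "[00:05] Hello everyone. [01:30] Let's begin.",
    "description": "A presenter standing in front of a whiteboard.",
    "summary": "A short introduction to the topic.",
    "duration_seconds": 95,
    "speakers": ["Presenter"],
})

result = parse_result(sample)
marks = timestamps_in_seconds(result["transcript"])
```

Checking for missing fields up front may be useful because model-generated JSON can omit keys, and the `--raw` and `-p` options change the output shape entirely.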

| Flag | Description | Default |
| --- | --- | --- |
| `-q` / `--question` | Question to answer (added to default fields) | none |
| `-p` / `--prompt` | Override entire prompt (ignores `-q`) | structured JSON |
| `-m` / `--model` | Gemini model | `gemini-2.5-flash` |
| `-o` / `--output` | Save output to file | stdout |
| `--keep` | Keep downloaded video file | false |
| `--download-only` | Download only, skip analysis | false |
| `--max-size` | Max file size in MB | 500 |
| `--raw` | Raw text output instead of JSON | false |
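
To make the flag defaults and the `-p` over `-q` precedence concrete, the following Python sketch mirrors the table with `argparse`. It is not the tool's actual implementation: the positional `source` argument, the `DEFAULT_PROMPT` text, and the `choose_prompt` helper are hypothetical and exist only for this example.

```python
import argparse

# Placeholder text; the tool's real default prompt is not shown here.
DEFAULT_PROMPT = "Return structured JSON with the default fields."


def build_parser():
    """Build a parser whose flags and defaults mirror the table above."""
    parser = argparse.ArgumentParser(description="Video analysis CLI (sketch)")
    # Assumed positional argument: a video URL or local file path.
    parser.add_argument("source")
    parser.add_argument("-q", "--question", default=None)
    parser.add_argument("-p", "--prompt", default=None)
    parser.add_argument("-m", "--model", default="gemini-2.5-flash")
    parser.add_argument("-o", "--output", default=None)  # None means stdout
    parser.add_argument("--keep", action="store_true")
    parser.add_argument("--download-only", action="store_true")
    parser.add_argument("--max-size", type=int, default=500)  # megabytes
    parser.add_argument("--raw", action="store_true")
    return parser


def choose_prompt(args):
    """Apply the documented precedence: -p replaces the prompt and ignores -q."""
    if args.prompt is not None:
        return args.prompt
    if args.question is not None:
        # How the real tool merges the question is not documented; assumed here.
        return f"{DEFAULT_PROMPT} Also answer: {args.question}"
    return DEFAULT_PROMPT


args = build_parser().parse_args(
    ["https://example.com/video", "-q", "Who is speaking?", "-p", "Custom prompt"]
)
prompt = choose_prompt(args)
```

With both `-q` and `-p` supplied, `choose_prompt` returns the custom prompt unchanged, matching the "ignores -q" note in the table.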