Understand local video files and screen recordings by extracting timestamped key frames into reusable workspace artifacts before analysis, QA, or editing. Use when the task depends on what is visibly happening in a `.mp4`, `.mov`, `.mkv`, or similar video, especially before making editing decisions, reviewing a half-finished cut, or reasoning about pacing, proof, and readable on-screen states.
Use this skill to build a compact storyboard of the source video before making claims about it.
For editing work, source-video understanding comes before timeline surgery.
If timeline preview tools are available, use them too. Do not let timeline inspection replace source-video inspection when the source file is available.
Read `artifacts/video-understanding/<video-stem>-<short-hash>/semantic-storyboard.txt` first. Fall back to `storyboard.txt` and `storyboard.md` if you need the OCR-backed raw frame map.

Example:
```sh
python nanobot/skills/video-understanding/scripts/extract_keyframes.py \
  /absolute/path/to/input.mp4 \
  --output-dir artifacts/video-understanding/input-frames
```
Use this script as the default entry point. Only drop to custom OCR or image-processing experiments when the extractor clearly failed and you have a specific reason.
The script writes:
- `manifest.json` with source metadata, timestamps, and reasons
- `semantic-storyboard.txt` as the main text artifact for smaller models
- `semantic-storyboard.md` as the richer semantic beat map
- `storyboard.txt` as the compact text-only version for smaller models
- `storyboard.md` with a scan-friendly index plus OCR snippets when available
- `frames/*.jpg` with timestamped key frames

Prefer reusing an existing storyboard when the source path, file size, mtime, and extraction parameters still match. The extractor handles this automatically unless `--force` is used.
This keeps the frame directory stable so later editing passes can refer back to the same visual evidence.
Use the bundled extractor instead of blindly sampling every second.
The extractor combines:

- scene-change detection to pick visually distinct frames
- time-based checkpoints so quiet stretches are still covered
- OCR when `tesseract` is available, so readable UI text appears in the storyboard

Default output location should usually be:
```
artifacts/video-understanding/<video-stem>-<short-hash>/
```
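One plausible way the `<video-stem>-<short-hash>` name could be derived is hashing the source path. This is an assumption for illustration; the extractor's actual hash inputs may also include size or mtime:

```python
import hashlib
from pathlib import Path

def default_output_dir(video_path, root="artifacts/video-understanding"):
    """Derive <video-stem>-<short-hash>/ under the artifacts root.

    Hashing the absolute path keeps reruns on the same file pointing
    at the same cache directory.
    """
    p = Path(video_path)
    short_hash = hashlib.sha1(str(p.resolve()).encode()).hexdigest()[:8]
    return Path(root) / f"{p.stem}-{short_hash}"
```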
After extraction:
- Read `semantic-storyboard.txt` first for the semantic map of what is happening
- Use `storyboard.txt` for the OCR-backed raw frame map
- Use `storyboard.md` when you want frame-by-frame detail

In your reasoning, map:
- beats: hook, setup, waiting, proof, payoff, CTA
- actions: keep, speed up, trim, hold longer

This beat map should drive editing decisions.
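The beat-to-action mapping above can be sketched as a small lookup table. The pairings below are illustrative defaults, not rules the extractor enforces:

```python
# Illustrative beat -> action defaults; adjust per project.
DEFAULT_ACTIONS = {
    "hook": "keep",
    "setup": "trim",
    "waiting": "speed up",
    "proof": "hold longer",
    "payoff": "keep",
    "cta": "keep",
}

def plan_edits(beats):
    """Map (timestamp, beat) pairs to editing actions."""
    return [(ts, beat, DEFAULT_ACTIONS.get(beat, "review"))
            for ts, beat in beats]
```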
Useful flags:

- `--output-dir <path>` to control the cache location
- `--scene-threshold <float>` to be more or less sensitive to visual changes
- `--max-gap-seconds <float>` to force additional checkpoints across quiet stretches
- `--min-spacing-seconds <float>` to avoid near-duplicate frames
- `--max-frames <int>` to cap the storyboard size
- `--no-ocr` to skip text extraction when you only want visual frames
- `--no-semantic` to skip semantic summarization
- `--semantic-model <name>` to choose the semantic model
- `--semantic-max-frames <int>` to cap the number of frames used for semantic summarization
- `--force` to rebuild a storyboard even if the cache still matches
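A sketch of driving the extractor programmatically with these flags. The flag names are the ones listed above; the script path matches the earlier example, and the `max_frames=40` default here is illustrative:

```python
import sys

EXTRACTOR = "nanobot/skills/video-understanding/scripts/extract_keyframes.py"

def build_extractor_command(video, output_dir, max_frames=40,
                            scene_threshold=None, no_ocr=False, force=False):
    """Assemble an extract_keyframes.py invocation.

    Run the result with subprocess.run(cmd, check=True).
    """
    cmd = [sys.executable, EXTRACTOR, str(video),
           "--output-dir", str(output_dir),
           "--max-frames", str(max_frames)]
    if scene_threshold is not None:
        cmd += ["--scene-threshold", str(scene_threshold)]
    if no_ocr:
        cmd.append("--no-ocr")
    if force:
        cmd.append("--force")
    return cmd
```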