Full-stack video production assistant. Analyzes video content visually (Gemini), generates transcriptions/SRT subtitles, plans and creates motion graphics (Remotion), generates B-roll images/videos, produces timeline XMLs for Premiere/DaVinci. Downloads YouTube videos with yt-dlp. Use for: video analysis, visual analysis, describe video, what's in this video, transcription, subtitles, motion graphics, B-roll, shorts, timeline XML, clip cutting, silence removal, After Effects, Premiere Pro, DaVinci Resolve, YouTube download. Keywords: video edit, ffmpeg, remotion, after effects, premiere, davinci, shorts, subtitles, motion graphics, clip, render, transcribe, xml, timeline, b-roll, talking head, analyze, yt-dlp, youtube, download, gemini, vision
The agent has built-in vision for images. For videos, always use Gemini via Kolbo MCP.
| Media type | Action |
|---|---|
| Image (jpg, png, etc.) | Agent reads it directly — no upload needed |
| Video — "analyze", "describe", "what's in this?", "what prompts?", file path with no instruction | upload_media → chat_send_message + Gemini |
| Transcription — "transcribe", "subtitles", "SRT", "what's being said", "captions" | transcribe_audio only |
| Both visual + transcript | Run both |
Never use ffmpeg to extract frames for analysis. Never use local Ollama/vision models. Commit to the right action — do not ask the user. Wait for chat_send_message to return before proceeding — it polls until done (up to 2 min). Do NOT fall back to ffmpeg or any other approach if it takes time.
Once `kolbo auth login` is done, these are available as MCP tools — use them directly without any Python/API key setup:
| Tool | Use |
|---|---|
upload_media | Upload local file to Kolbo CDN → get stable public URL |
chat_send_message | Send message + media_urls array to Gemini for visual analysis |
transcribe_audio | Transcribe audio/video to text + SRT (ElevenLabs Scribe) |
generate_image | Generate B-roll images |
generate_video | Generate B-roll videos |
generate_video_from_image | Animate a still into video |
generate_music | Generate background music |
generate_speech | TTS for voiceover |
generate_sound | Sound effects |
list_models | Browse available models by type |
check_credits | Check remaining Kolbo credit balance |
Step 1 is NOT optional. You cannot skip upload_media or construct the URL yourself.
Step 1: upload_media({ source: "/absolute/path/to/video.mp4" })
→ Returns: { url, thumbnail_url, ... }
→ Save the "url" field — this is the CDN URL you will pass to Gemini
→ NEVER use thumbnail_url (it's a JPG preview, not the video)
Step 2: chat_send_message({
message: "Describe this video in detail. What is shown?",
media_urls: ["<url from step 1>"] ← must be an array, must be the "url" field
})
→ returns: { content: "..." }
❌ Common mistakes that break video analysis:
- Skipping upload_media and passing a local file path to chat_send_message — local paths don't work.
- Passing a .txt URL as the media_urls value — Gemini needs the actual video CDN URL.
- Passing thumbnail_url instead of url from the upload_media response.
- Running transcribe_audio first and then passing its output URL as the video — transcription gives text, not video.

Omit model — Smart Select detects video/audio and auto-routes to Gemini.
Sessions do NOT remember media between messages. On retry: reuse the same CDN url from step 1 (no re-upload needed) but always pass media_urls again.
Batch analysis (many videos): Pass model: "gemini-3.1-flash-lite-preview" explicitly for cheaper bulk runs.
For YouTube videos — download first with yt-dlp (see below), then follow steps 1–2 above.
Input: local video / YouTube URL / uploaded file
→ [DEFAULT] Visual Analysis: upload_media → chat_send_message (Gemini)
→ [EXPLICIT REQUEST] Transcription: transcribe_audio → SRT / text
→ [EDITING] FFmpeg: cut, silence removal, 9:16 conversion
→ [MOTION GRAPHICS] Remotion: compositions, captions, B-roll
→ Output: Premiere XML / DaVinci EDL / MP4s / SRT
| Service | Use |
|---|---|
| Kolbo MCP (upload_media + chat_send_message) | Primary — visual video/image analysis via Gemini |
| Kolbo MCP (transcribe_audio) | Primary — transcription, word-level SRT, multilingual |
| yt-dlp | Download YouTube/social media videos |
| FFmpeg | Local video editing, cutting, silence removal, format conversion |
| Remotion Lambda | Cloud render motion graphics |
| fal.ai (MCP) | Image & video B-roll generation |
| ElevenLabs | TTS, voice cloning, SFX (via Kolbo MCP generate_speech) |
| Suno | Background music (via Kolbo MCP generate_music) |
Kolbo MCP tools need no API keys — auth is handled by `kolbo auth login`. FFmpeg/yt-dlp need to be installed locally on the machine.
Download video from YouTube, TikTok, Instagram, Twitter, etc.:
# Best quality MP4
yt-dlp -f "bestvideo[height<=1080][ext=mp4]+bestaudio/best" \
--merge-output-format mp4 \
-o "%(id)s.%(ext)s" <url>
# With subtitles
yt-dlp -f "bestvideo[height<=1080][ext=mp4]+bestaudio/best" \
--write-auto-sub --sub-lang en --convert-subs srt \
--merge-output-format mp4 \
-o "%(id)s.%(ext)s" <url>
# Audio only (for transcription)
yt-dlp -f "bestaudio" --extract-audio --audio-format mp3 -o "%(id)s.%(ext)s" <url>
After download → upload to Kolbo CDN with upload_media → analyze visually with chat_send_message.
FFmpeg tips:
- Create working directories with tempfile.mkdtemp() first (handles spaces in paths)
- Use `\pos()` for RTL rendering
- Encode with `-c:v libx264 -crf 18 -c:a aac -b:a 128k`
- Silence removal: `silencedetect=noise=-35dB:d=0.4` → trim+concat → `atempo=1.14`

Use ElevenLabs Scribe for word-level SRT with speaker diarization:
import requests
def transcribe(audio_path, api_key, language="he"):
with open(audio_path, "rb") as f:
response = requests.post(
"https://api.elevenlabs.io/v1/speech-to-text",
headers={"xi-api-key": api_key},
files={"file": f},
data={"model_id": "scribe_v1", "language_code": language,
"timestamps_granularity": "word", "diarize": True}
)
return response.json()
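To turn the word-level result into an SRT file, a hedged sketch follows — it assumes a Scribe-style `words` array of `{"text", "start", "end"}` dicts (seconds), which you should verify against the actual response shape; the `words_to_srt` helper name is illustrative, not a library function:

```python
def words_to_srt(words, max_words=6):
    """Group word-level timestamps into SRT cues of up to max_words words each."""
    def ts(t):
        # SRT timestamp: HH:MM:SS,mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"

    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        cues.append(
            f"{i // max_words + 1}\n"
            f"{ts(chunk[0]['start'])} --> {ts(chunk[-1]['end'])}\n"
            + " ".join(w["text"] for w in chunk) + "\n"
        )
    return "\n".join(cues)
```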
9:16 vertical conversion with blurred background:
filter_complex = (
"[0:v]split[bg][fg];"
"[bg]scale=1080:1920:force_original_aspect_ratio=increase,"
"crop=1080:1920,gblur=sigma=40[blurred];"
"[fg]scale=1080:1920:force_original_aspect_ratio=decrease,"
"pad=1080:1920:(ow-iw)/2:(oh-ih)/2:color=black@0[front];"
"[blurred][front]overlay=0:0"
)
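A sketch of wiring that filter graph into a full ffmpeg invocation — the `build_vertical_cmd` helper name is mine, and the encode settings reuse the `-crf 18` / AAC defaults from the tips above:

```python
def build_vertical_cmd(input_path, output_path):
    """Build the ffmpeg argv for a blurred-background 9:16 conversion."""
    filter_complex = (
        "[0:v]split[bg][fg];"
        "[bg]scale=1080:1920:force_original_aspect_ratio=increase,"
        "crop=1080:1920,gblur=sigma=40[blurred];"
        "[fg]scale=1080:1920:force_original_aspect_ratio=decrease,"
        "pad=1080:1920:(ow-iw)/2:(oh-ih)/2:color=black@0[front];"
        "[blurred][front]overlay=0:0"
    )
    return [
        "ffmpeg", "-y", "-i", input_path,
        "-filter_complex", filter_complex,
        "-c:v", "libx264", "-crf", "18",
        "-c:a", "aac", "-b:a", "128k",
        output_path,
    ]

# Run with: subprocess.run(build_vertical_cmd("in.mp4", "shorts/out.mp4"), check=True)
```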
Silence detection:
import re
import subprocess

def detect_silence(video_path, noise_db=-35, duration=0.4):
    result = subprocess.run([
        "ffmpeg", "-i", video_path,
        "-af", f"silencedetect=noise={noise_db}dB:d={duration}",
        "-f", "null", "-"
    ], capture_output=True, text=True)
    # silencedetect logs to stderr, e.g. "silence_start: 1.23" / "silence_end: 4.56"
    starts = [float(v) for v in re.findall(r"silence_start: ([\d.]+)", result.stderr)]
    ends = [float(v) for v in re.findall(r"silence_end: ([\d.]+)", result.stderr)]
    return list(zip(starts, ends))
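Those (silence_start, silence_end) pairs can then be inverted into the segments to keep before trim+concat — a minimal sketch (the `keep_segments` helper name is mine):

```python
def keep_segments(silences, total_duration):
    """Invert (silence_start, silence_end) pairs into (start, end) intervals to keep."""
    segments, cursor = [], 0.0
    for start, end in silences:
        if start > cursor:
            segments.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_duration:
        segments.append((cursor, total_duration))
    return segments
```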
For comprehensive RTL subtitle handling, load the subtitle-production skill — it contains full patterns for:
- ASS `Encoding=177` for Hebrew (and the ~0.74 font scale factor)
- `direction: rtl` and all the flip rules
- the `geq` filter

CRITICAL: Any inline ASS tag (`\c`, `\K`, `\1c`, etc.) between RTL words breaks Unicode bidi in libass — words render LTR. Use separate Dialogue lines per word instead.
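A sketch of that per-word workaround, assuming Scribe-style word dicts with `text`/`start`/`end` fields in seconds — `per_word_dialogue` is an illustrative helper, not a library function:

```python
def per_word_dialogue(words, style="Default"):
    """Emit one ASS Dialogue line per word, avoiding inline tags that break RTL bidi."""
    def ts(t):
        # ASS timestamp: H:MM:SS.cc
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h)}:{int(m):02d}:{s:05.2f}"

    return "\n".join(
        f"Dialogue: 0,{ts(w['start'])},{ts(w['end'])},{style},,0,0,0,,{w['text']}"
        for w in words
    )
```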
For Remotion RTL layout rules (padding flips, transform-origin, gradient direction), load the typography-video skill.
For motion graphics rendering, use the remotion-best-practices skill for detailed Remotion patterns.
For cloud rendering via Remotion Lambda:
npx remotion lambda render <serve-url> <composition-id> --out output.mp4
def generate_premiere_xml(clips, output_path, fps=30):
# Generate FCP7 XML compatible with Premiere Pro
...
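A minimal FCP7 (xmeml) sketch of that generator. The clip dicts with `path`, `start`, `end` (frame numbers) are my assumed input shape, and a production timeline needs more metadata (rate/duration on file elements, audio tracks) — verify the output imports cleanly before relying on it:

```python
from xml.sax.saxutils import escape

def generate_premiere_xml(clips, output_path, fps=30):
    """Write a minimal FCP7 xmeml sequence; clips = [{"path", "start", "end"}] in frames."""
    items, timeline = [], 0
    for i, c in enumerate(clips):
        dur = c["end"] - c["start"]
        items.append(
            f'<clipitem id="clip-{i}">'
            f'<name>{escape(c["path"].rsplit("/", 1)[-1])}</name>'
            f"<start>{timeline}</start><end>{timeline + dur}</end>"
            f'<in>{c["start"]}</in><out>{c["end"]}</out>'
            f'<file id="file-{i}"><pathurl>file://{escape(c["path"])}</pathurl></file>'
            f"</clipitem>"
        )
        timeline += dur  # clips are laid back-to-back on one video track
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<xmeml version="4"><sequence><name>timeline</name>'
        f"<rate><timebase>{fps}</timebase><ntsc>FALSE</ntsc></rate>"
        f"<media><video><track>{''.join(items)}</track></video></media>"
        "</sequence></xmeml>"
    )
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(xml)
```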
Organize outputs per project:
<project>/
├── raw/ # original footage
├── transcripts/ # SRT, word-level JSON
├── clips/ # cut segments
├── shorts/ # 9:16 vertical versions
├── b-roll/ # generated B-roll images/videos
├── motion/ # Remotion compositions
└── export/ # final deliverables + XML timelines
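A small helper to scaffold that layout (the `init_project` name is mine):

```python
from pathlib import Path

def init_project(root):
    """Create the standard per-project directory layout described above."""
    for d in ["raw", "transcripts", "clips", "shorts", "b-roll", "motion", "export"]:
        Path(root, d).mkdir(parents=True, exist_ok=True)
```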
Before writing a new script, ask the user if they already have one for the task — they may have existing tools for clipping, silence removal, or subtitle burning.