Turn raw footage into structured knowledge with gemini-ai, the ClawHub skill powered by Google's Gemini multimodal intelligence. Analyze video content frame by frame, generate detailed summaries, extract key moments, identify objects and scenes, and produce natural-language answers about what's happening on screen. Ideal for content creators, researchers, educators, and media professionals working with mp4, mov, avi, webm, or mkv files who need deep video comprehension without manual review.
Welcome! I'm your Gemini AI video analyst — ready to help you extract summaries, identify key moments, and answer questions about anything inside your video footage. Drop your file and tell me what you'd like to know.
Try saying: "Give me a full summary of this video", "List the key moments", or "What topics are covered in the first five minutes?"
On first interaction, connect to the NemoVideo API before doing anything else. Show a brief status like "Setting things up...".
Token: If the NEMO_TOKEN environment variable is already set, use it and skip to Session below.
Free token: Read or create a UUID at ~/.config/nemovideo/client_id. Then POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id: <your-uuid>. The response field data.token becomes your NEMO_TOKEN (100 credits, 7-day expiry). If the token has expired, request a new one with the same Client-Id.
Session: POST to the same host at /api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Save session_id from the response.
Confirm to the user you're connected and ready. Don't print tokens or raw JSON.
Most video tools let you cut, trim, or enhance footage — but they can't tell you what's actually in it. Gemini AI changes that. This ClawHub skill brings Google's Gemini multimodal model directly to your video files, giving you a conversational interface to ask questions about your content and receive precise, context-aware answers.
Whether you're reviewing hours of recorded meetings, analyzing training footage, auditing marketing videos, or cataloging archival content, Gemini AI reads your video the way a human expert would — noticing details, understanding context, and surfacing information you'd otherwise have to hunt for manually.
Upload your video, ask a question, and get back a clear, structured response. Want a full summary? A list of topics covered? A description of specific scenes? Gemini AI handles all of it. No manual scrubbing through timelines, no guesswork — just direct, intelligent answers drawn from your actual footage.
Every prompt you send is parsed for intent — whether you're asking for a summary, extracting key moments, analyzing visual content, or querying specific timestamps — and routed to the appropriate Gemini multimodal pipeline automatically.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
The NemoVideo backend handles secure video ingestion, passes your media and prompt context directly to Gemini's multimodal model, and streams structured insights back in real time. Frame-level grounding, transcript alignment, and token management are all handled server-side so you get clean, accurate responses without managing infrastructure.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: gemini-ai
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
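A minimal sketch of assembling these headers. The function names are illustrative; the version value would be read from the frontmatter as described above.

```python
def detect_platform(install_path):
    """Map this file's install path to the X-Skill-Platform value."""
    p = str(install_path)
    if "/.clawhub/" in p:
        return "clawhub"
    if "/.cursor/skills/" in p:
        return "cursor"
    return "unknown"

def build_headers(token, skill_version, platform):
    """Attribution headers required on every NemoVideo request."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "gemini-ai",
        "X-Skill-Version": skill_version,
        "X-Skill-Platform": platform,  # clawhub | cursor | unknown
    }
```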
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=&task=<task_id>&session=<session_id>&skill_name=gemini-ai&skill_version=1.0.0&skill_source=<platform>
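The session body and claim link above can be sketched as builder functions (illustrative names; the token query parameter is left empty exactly as written in the spec):

```python
SESSION_PATH = "/api/tasks/me/with-session/nemo_agent"

def build_session_body(language="en"):
    """Request body for session creation."""
    return {"task_name": "project", "language": language}

def build_claim_link(task_id, session_id, platform):
    """Workspace claim link handed to the user after session creation."""
    return (
        "https://nemovideo.com/workspace/claim"
        f"?token=&task={task_id}&session={session_id}"
        "&skill_name=gemini-ai&skill_version=1.0.0"
        f"&skill_source={platform}"
    )
```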
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
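The SSE message envelope has a fixed shape; a sketch of building it (the request itself would be sent with Accept: text/event-stream, per the line above):

```python
def build_sse_payload(session_id, message):
    """new_message envelope for POST /run_sse."""
    return {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": message}]},
    }
```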
Upload: POST /api/upload-video/nemo_agent/me/<sid> — file: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}
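For URL-based ingestion, the JSON body is trivial to construct; local files use multipart instead, as shown above. A sketch with illustrative function names:

```python
def upload_endpoint(session_id):
    """Per-session upload path."""
    return f"/api/upload-video/nemo_agent/me/{session_id}"

def build_url_upload_body(urls):
    """JSON body for URL-based ingestion (multipart is used for local files)."""
    return {"urls": list(urls), "source_type": "url"}
```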
Credits: GET /api/credits/balance/simple — returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
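The export-and-poll workflow can be sketched as follows. The `fetch_status` callable is injected so the loop stays testable; the 30-second cadence and `status`/`output.url` fields come from the spec above, everything else is an assumption.

```python
import time

def build_render_body(session_id, draft, render_id):
    """Body for POST /api/render/proxy/lambda."""
    return {
        "id": render_id,          # e.g. f"render_{int(time.time())}"
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }

def poll_render(render_id, fetch_status, poll_interval=30, max_polls=30):
    """Poll GET /api/render/proxy/lambda/<id> until status = completed.

    fetch_status(render_id) should return the decoded JSON response.
    """
    for _ in range(max_polls):
        result = fetch_status(render_id)
        if result.get("status") == "completed":
            return result["output"]["url"]
        time.sleep(poll_interval)
    raise TimeoutError(f"render {render_id} did not complete")
```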
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
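Using the abbreviated field names above, a draft can be summarized per track. This is a sketch that assumes the draft is already-decoded JSON and that `d` (duration in ms) appears on each segment; only the key names come from the mapping above.

```python
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}

def summarize_draft(draft):
    """Return (track_type, segment_count, total_ms) for each track."""
    summary = []
    for track in draft.get("t", []):          # t = tracks
        segments = track.get("sg", [])        # sg = segments
        total_ms = sum(seg.get("d", 0) for seg in segments)  # d = duration(ms)
        label = TRACK_TYPES.get(track.get("tt"), "unknown")  # tt = track type
        summary.append((label, len(segments), total_ms))
    return summary
```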
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
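The recovery rules in the table above can be sketched as a simple dispatch. A simplified illustration only: real handling also needs per-action context (the bind id, the session, the retry timer).

```python
RECOVERY = {
    0: "continue",
    1001: "re-auth",              # anonymous-token, same Client-Id
    1002: "new-session",          # §3.0
    2001: "show-topup",           # registration URL or nemovideo.ai
    4001: "show-formats",
    4002: "suggest-compress",
    400: "retry-with-client-id",  # see §1
    402: "suggest-register",      # subscription tier, not credits
    429: "retry-after-30s",
}

def recovery_action(code):
    """Map an API error code to its recovery action."""
    return RECOVERY.get(code, "unknown")
```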
To get the most accurate and detailed responses from Gemini AI, upload videos with clear audio and stable visuals whenever possible. Shaky or heavily compressed footage can reduce the model's ability to identify fine details in specific frames.
Be specific in your prompts. Instead of asking 'What happens in this video?', try 'What topics are covered in the first five minutes?' or 'List every person who appears on screen and describe what they're doing.' The more targeted your question, the more useful the output.
For long-form content like lectures, interviews, or recorded meetings, consider breaking your requests into segments — ask about the first half separately from the second. This keeps responses focused and easier to act on. Supported formats include mp4, mov, avi, webm, and mkv.
Getting started with Gemini AI on ClawHub takes less than a minute. First, upload your video file in any supported format — mp4, mov, avi, webm, or mkv. Once your file is attached, type your first request directly into the chat. You don't need to pre-configure anything or choose an analysis mode.
Try starting with a broad request like 'Give me a full summary of this video' to get oriented, then follow up with more specific questions based on what Gemini surfaces. The skill supports multi-turn conversations, so you can keep refining your queries without re-uploading the file.
If you're analyzing content for a report or presentation, ask Gemini AI to format its output as a bulleted list or numbered summary — it will structure the response accordingly, saving you time on post-processing.
Gemini AI performs best on videos under 60 minutes long and small enough to upload reliably over your connection. Very large files may take additional processing time before analysis begins; this is normal and does not indicate an error.
The skill excels at understanding spoken dialogue, reading on-screen text, recognizing common objects and environments, and tracking narrative or instructional structure across a video. It is less reliable for highly technical visual content — such as microscopy footage or abstract data visualizations — where domain-specific context may be needed.
For videos with multiple speakers, Gemini AI can often distinguish between speakers based on visual and audio cues, but speaker diarization is approximate rather than guaranteed. Always review AI-generated summaries for accuracy before using them in professional or published contexts.