Use the Gemini API (Nano Banana image generation, Veo video generation, Gemini TTS speech generation, and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".
This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates: image generation and editing (Nano Banana), image understanding, video generation (Veo), video understanding, speech generation (TTS), and audio understanding.
Convention: this Skill follows the official Google Gen AI SDK (Node.js/REST) as its primary reference; currently only Node.js and REST examples are provided. If your project already wraps another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.
npm install @google/genai
REST: curl; if you need to parse image Base64, install jq (optional).
Auth: set GEMINI_API_KEY and send it as the x-goog-api-key: $GEMINI_API_KEY header.
Two ways to pass media input:
- Inline (embedded bytes/Base64)
- Files API (upload then reference): files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable), then reference via file_data / file_uri in generateContent.
Engineering suggestion: implement ensure_file_uri() so that when a file exceeds a threshold (for example a 10-15 MB warning) or is reused, you automatically route through the Files API; see the sketch below.
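A minimal sketch of that helper, in the same Node.js style as the templates below; the function name, threshold constant, and mime-type handling are illustrative, not part of the SDK:
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const INLINE_LIMIT_BYTES = 10 * 1024 * 1024; // illustrative 10 MB threshold
// Return either an inline Base64 part or a Files API reference part.
async function ensureFileUri(path, mimeType) {
  if (fs.statSync(path).size < INLINE_LIMIT_BYTES) {
    return { inlineData: { mimeType, data: fs.readFileSync(path).toString("base64") } };
  }
  // Large (or reused) files: upload once, then reference by URI.
  const uploaded = await ai.files.upload({ file: path });
  return { fileData: { mimeType: uploaded.mimeType ?? mimeType, fileUri: uploaded.uri } };
}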
Output handling:
- Images come back as inline_data (Base64) in response parts; in the SDK use part.as_image() or decode the Base64 and save as PNG/JPG.
- TTS audio comes back as raw PCM: save as .pcm or wrap it into .wav (commonly 24 kHz, 16-bit, mono).
Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.
Default model choices:
- gemini-2.5-flash-image (Nano Banana) for image generation and editing.
- gemini-3-flash-preview for image, video, and audio understanding (choose stronger models as needed for quality/cost).
- veo-3.1-generate-preview (generates 8-second videos and can natively generate audio).
- gemini-2.5-flash-preview-tts (native TTS, currently in preview).
Image generation (Nano Banana): SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
model: "gemini-2.5-flash-image",
contents:
"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
if (part.text) console.log(part.text);
if (part.inlineData?.data) {
fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
}
}
REST (with imageConfig) minimal template
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" -H "x-goog-api-key: $GEMINI_API_KEY" -H "Content-Type: application/json" -d '{
"contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
"generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
}'
REST image parsing (Base64 decode)
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
| jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' \
| base64 --decode > out.png
# macOS can use: base64 -D > out.png
Use case: given an input image, add/remove/modify elements, change the style, adjust color grading, and so on.
SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const prompt =
"Add a nano banana on the table, keep lighting consistent, cinematic tone.";
const imageBase64 = fs.readFileSync("input.png").toString("base64");
const response = await ai.models.generateContent({
model: "gemini-2.5-flash-image",
contents: [
{ text: prompt },
{ inlineData: { mimeType: "image/png", data: imageBase64 } },
],
});
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
if (part.inlineData?.data) {
fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
}
}
Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").
To output mixed "text + image" results, set responseModalities to ["TEXT", "IMAGE"] (in generationConfig for REST, or config in the SDK).
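A minimal sketch of that chat-based iteration, assuming the SDK's ai.chats.create / chat.sendMessage interface; the prompts are illustrative:
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const chat = ai.chats.create({
  model: "gemini-2.5-flash-image",
  config: { responseModalities: ["TEXT", "IMAGE"] },
});
// Turn 1: generate a base image.
await chat.sendMessage({ message: "A nano banana dessert on a marble table, soft light." });
// Turn 2: edit one element only; the chat history keeps the rest consistent.
const second = await chat.sendMessage({ message: "Keep the composition, but make the plate ceramic blue." });
for (const part of second.candidates?.[0]?.content?.parts ?? []) {
  if (part.inlineData?.data) fs.writeFileSync("variant.png", Buffer.from(part.inlineData.data, "base64"));
}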
You can set these in generationConfig.imageConfig (REST) or the SDK config:
- aspectRatio: e.g. 16:9, 1:1.
- imageSize: e.g. 2K, 4K (higher resolution is usually slower/more expensive, and model support can vary).
Image understanding: SDK (Node.js) minimal template (inline bytes)
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const imageBase64 = fs.readFileSync("image.jpg").toString("base64");
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: [
{ inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
{ text: "Caption this image, and list any visible brands." },
],
});
console.log(response.text);
Files API variant (upload then reference)
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "image.jpg" });
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: createUserContent([
createPartFromUri(uploaded.uri, uploaded.mimeType),
"Caption this image.",
]),
});
console.log(response.text);
Append multiple images as multiple Part entries in the same contents; you can mix uploaded references and inline bytes.
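For example, a sketch that mixes one uploaded reference with one inline image in a single request (file names and prompt are illustrative):
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "reference.jpg" }); // larger or reused image via Files API
const inlineBase64 = fs.readFileSync("candidate.png").toString("base64"); // small image inline
const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    { inlineData: { mimeType: "image/png", data: inlineBase64 } },
    "Compare these two images and describe the differences.",
  ]),
});
console.log(response.text);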
Video generation (Veo): SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const prompt =
"A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.";
let operation = await ai.models.generateVideos({
model: "veo-3.1-generate-preview",
prompt,
config: { resolution: "1080p" },
});
while (!operation.done) {
await new Promise((resolve) => setTimeout(resolve, 10_000));
operation = await ai.operations.getVideosOperation({ operation });
}
const video = operation.response?.generatedVideos?.[0]?.video;
if (!video) throw new Error("No video returned");
await ai.files.download({ file: video, downloadPath: "out.mp4" });
Key point: Veo REST uses :predictLongRunning to return an operation name, then poll GET /v1beta/{operation_name}; once done, download from the video URI in the response.
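A rough fetch-based sketch of that REST flow (Node 18+ global fetch; the instances-style request body follows the predictLongRunning pattern, and the exact field names inside the finished operation's response should be checked against the docs rather than taken from this sketch):
const BASE = "https://generativelanguage.googleapis.com/v1beta";
const headers = {
  "x-goog-api-key": process.env.GEMINI_API_KEY,
  "Content-Type": "application/json",
};
// Kick off generation; the response carries an operation name to poll.
const start = await fetch(`${BASE}/models/veo-3.1-generate-preview:predictLongRunning`, {
  method: "POST",
  headers,
  body: JSON.stringify({ instances: [{ prompt: "A cat astronaut walking on the moon" }] }),
});
const { name } = await start.json();
// Poll the operation until done, then inspect its response for the video URI.
let op = { done: false };
while (!op.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  op = await (await fetch(`${BASE}/${name}`, { headers })).json();
}
console.log(JSON.stringify(op.response, null, 2)); // contains the generated video URI(s)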
- aspectRatio: "16:9" or "9:16"
- resolution: "720p" | "1080p" | "4k" (higher resolutions are usually slower/more expensive)
Polling fallback (with timeout/backoff) pseudocode
const deadline = Date.now() + 300_000; // 5 min
let sleepMs = 2000;
while (!operation.done && Date.now() < deadline) {
await new Promise((resolve) => setTimeout(resolve, sleepMs));
sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000);
operation = await ai.operations.getVideosOperation({ operation });
}
if (!operation.done) throw new Error("video generation timed out");
Video understanding: SDK (Node.js) minimal template
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp4" });
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: createUserContent([
createPartFromUri(uploaded.uri, uploaded.mimeType),
"Summarize this video. Provide timestamps for key events.",
]),
});
console.log(response.text);
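Uploaded video usually needs a short server-side processing step before it can be referenced; a small sketch of waiting for the file to become ACTIVE, assuming the Files API exposes the file's name and state via ai.files.get (the poll interval is illustrative):
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
// Poll the Files API until the uploaded file finishes processing.
async function waitUntilActive(name) {
  let file = await ai.files.get({ name });
  while (file.state === "PROCESSING") {
    await new Promise((resolve) => setTimeout(resolve, 5_000));
    file = await ai.files.get({ name });
  }
  if (file.state !== "ACTIVE") throw new Error(`File not usable: ${file.state}`);
  return file;
}
// Usage: const uploaded = await ai.files.upload({ file: "sample.mp4" });
//        await waitUntilActive(uploaded.name);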
Speech generation (TTS): SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
model: "gemini-2.5-flash-preview-tts",
contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }],
config: {
responseModalities: ["AUDIO"],
speechConfig: {
voiceConfig: {
prebuiltVoiceConfig: { voiceName: "Kore" },
},
},
},
});
const data =
response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (!data) throw new Error("No audio returned");
fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));
Requirements:
- For multi-speaker output, use multiSpeakerVoiceConfig (see the sketch below).
- voiceName supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore).
- Provide controllable directions for style, pace, accent, etc., but avoid over-constraining the delivery.
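A sketch of a two-speaker request using multiSpeakerVoiceConfig; the speaker names, voices, and script are illustrative, so verify the exact config fields against the current docs:
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [{ parts: [{ text: "TTS the following conversation:\nHost: Welcome back!\nGuest: Glad to be here." }] }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: [
          { speaker: "Host", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } } },
          { speaker: "Guest", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Puck" } } },
        ],
      },
    },
  },
});
const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (data) fs.writeFileSync("dialog.pcm", Buffer.from(data, "base64"));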
Audio understanding: SDK (Node.js) minimal template
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp3" });
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: createUserContent([
"Describe this audio clip.",
createPartFromUri(uploaded.uri, uploaded.mimeType),
]),
});
console.log(response.text);