Generates videos using Grok (preferred) and Google Veo 3.1 models. Supports text-to-video, image-to-video, first-last-frame, video extension, and reference images. Keywords: AI video generation, text-to-video, image-to-video, first-last-frame generation, video extension.
CRITICAL: Always prefer Grok unless the task requires Veo-specific features.
| Scenario | Use Model | Reason |
|---|---|---|
| Video ≤ 15s (text-to-video / image-to-video) | Grok (preferred) | Grok supports 1-15s directly, faster and cost-effective |
| Video 16-36s (continuous) | Veo | Requires Video Extension, Grok max is 15s |
| First-last-frame storyboard | Veo | Grok does not support first-last-frame |
| Reference images (style consistency) | Veo (generate-preview) | Grok does not support reference images |
| User explicitly requests Veo | Veo | User preference |
| User explicitly requests Grok | Grok | User preference |
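The selection table above can be sketched as a small decision function. This is a minimal illustration, not part of the skill: the function name `select_model` and its parameters are hypothetical, and the rule that Veo-only features override an explicit user preference for Grok is an assumption, since the table does not say how to resolve that conflict.

```python
def select_model(duration_s, first_last_frame=False, reference_images=False, user_choice=None):
    """Return "grok" or "veo" following the selection table.

    user_choice is None, "grok", or "veo" (hypothetical parameter).
    """
    # Features that only Veo offers according to the table:
    # durations above 15s (via extension), first-last-frame, reference images.
    needs_veo = first_last_frame or reference_images or duration_s > 15
    if needs_veo:
        # Assumption: a required Veo-only feature wins over a Grok preference.
        return "veo"
    if user_choice in ("grok", "veo"):
        return user_choice
    # Default: always prefer Grok.
    return "grok"


if __name__ == "__main__":
    print(select_model(10))                       # short clip, no special features
    print(select_model(24))                       # longer than Grok's 15s limit
    print(select_model(8, user_choice="veo"))     # explicit user request
```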
CRITICAL: Video prompts MUST be generated in:
| Model | Speed | Quality | Max Duration | Use Case |
|---|---|---|---|---|
| grok-imagine-video | Fast | Good | 15s | Default - Most use cases |
| Parameter | Values | Default | Description |
|---|---|---|---|
| duration | 1-15 | - | Video duration in seconds |
| resolution | 480p, 720p | 720p | Video resolution |
| aspectRatio | 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3 | 9:16 | Video aspect ratio |
| imageUrl | string | - | Reference image URL for image-to-video |
| Mode | Parameters | Description |
|---|---|---|
| Text-to-video | prompt only | Generate video from text description |
| Image-to-video | prompt + imageUrl | Generate video using a reference image |
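As a rough sketch of how the Grok parameters and modes above fit together, the helper below validates the inputs and assembles a parameter dictionary. The helper name `build_grok_request` is hypothetical, and treating `duration` as a required integer is an assumption (the table lists no default and does not state the type); the actual `generateVideoWithGrok` tool may accept inputs differently.

```python
GROK_RESOLUTIONS = {"480p", "720p"}
GROK_ASPECT_RATIOS = {"1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3"}


def build_grok_request(prompt, duration, resolution="720p", aspect_ratio="9:16", image_url=None):
    """Validate Grok parameters and return them as a dict (illustrative only)."""
    if not prompt:
        raise ValueError("prompt is required in both modes")
    # Assumption: duration is a whole number of seconds in the range 1-15.
    if not isinstance(duration, int) or not 1 <= duration <= 15:
        raise ValueError("duration must be an integer from 1 to 15")
    if resolution not in GROK_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if aspect_ratio not in GROK_ASPECT_RATIOS:
        raise ValueError(f"unsupported aspectRatio: {aspect_ratio}")

    request = {
        "prompt": prompt,
        "duration": duration,
        "resolution": resolution,
        "aspectRatio": aspect_ratio,
    }
    # Supplying imageUrl switches from text-to-video to image-to-video.
    if image_url is not None:
        request["imageUrl"] = image_url
    return request


if __name__ == "__main__":
    print(build_grok_request("a cat walking on a beach", 5))
```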
1. Call generateVideoWithGrok with prompt and optional parameters
2. Poll getGrokVideoStatus every 30 seconds

Use Veo when you need: video extension, first-last-frame, or reference images.
| Model | Speed | Quality | Video Extension | Reference Images | Use Case |
|---|---|---|---|---|---|
| veo-3.1-fast-generate-preview | Fast | Good | Yes | No | Default Veo - Most use cases |
| veo-3.1-generate-preview | Slow | Better | Yes | Yes (max 3) | Best quality + all features |
| veo-3.1-fast-generate-001 | Fast | Good | No | No | Simple generation (no extension) |
| veo-3.1-generate-001 | Slow | Better | No | No | Higher quality (no extension) |
Veo Model Selection:
- veo-3.1-fast-generate-preview (supports video extension)

| Parameter | Values | Default | Description |
|---|---|---|---|
| duration | 4, 6, 8 | 8 | Video duration in seconds |
| resolution | 720p, 1080p, 4k | 720p | Video resolution (1080p/4K takes longer) |
| aspectRatio | 16:9, 9:16 | 9:16 | Video aspect ratio (vertical by default for social media) |
| negativePrompt | string | - | What to exclude from the video |
| seed | number | - | Random seed for reproducibility |
Notes:
[Subject & Background] + [Action] + [Style] + [Camera] + [Atmosphere] + [Audio]
Specify your main focus (object, person, animal) and environmental context.
Examples:
Describe what the subject is doing: walking, running, transforming, etc.
Examples:
Add aesthetic direction using keywords.
Common Styles:
Control perspective and movement.
Camera Angles:
Camera Movements:
Specify lighting and color mood.
Examples:
Veo 3.1 natively generates synchronized audio. Include audio descriptions in your prompt.
Use quotation marks for specific speech.
Example:
A man in a suit stands at a podium, speaking confidently: "Welcome to the future of technology."
一位穿西装的男子站在讲台上,自信地说:"欢迎来到科技的未来。"
Describe sounds explicitly.
Examples:
Describe environmental soundscapes.
Examples:
"busy city street with traffic and chatter" / "繁忙的城市街道,车流和人声"
"peaceful forest with birds chirping" / "宁静的森林,鸟鸣声"
"ocean waves crashing on the shore" / "海浪拍打海岸"
Modes are auto-detected based on parameters:
| Mode | Parameters | Models Supported |
|---|---|---|
| Text-to-video | prompt only | All |
| Image-to-video | prompt + image | All |
| First-last-frame | prompt + image + lastFrame | All |
| Video extension | prompt + video | Preview models only |
| Reference images | prompt + referenceImages | veo-3.1-generate-preview only |
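The auto-detection in the table above could look roughly like the sketch below. The function names, the mode labels, and the precedence order when several parameters are present are assumptions made for illustration; the real tool may resolve ambiguous inputs differently. The model checks come from the "Models Supported" column.

```python
PREVIEW_MODELS = {"veo-3.1-fast-generate-preview", "veo-3.1-generate-preview"}


def detect_veo_mode(params):
    """Infer the Veo generation mode from which parameters are present."""
    if not params.get("prompt"):
        raise ValueError("prompt is required in every mode")
    # Assumed precedence: video, then referenceImages, then frames.
    if params.get("video"):
        return "video-extension"
    if params.get("referenceImages"):
        return "reference-images"
    if params.get("lastFrame"):
        if not params.get("image"):
            # Assumption: a last frame without a first frame is invalid.
            raise ValueError("lastFrame requires image (the first frame)")
        return "first-last-frame"
    if params.get("image"):
        return "image-to-video"
    return "text-to-video"


def mode_supported(mode, model):
    """Check the 'Models Supported' column for a detected mode."""
    if mode == "video-extension":
        return model in PREVIEW_MODELS
    if mode == "reference-images":
        return model == "veo-3.1-generate-preview"
    return True


if __name__ == "__main__":
    mode = detect_veo_mode({"prompt": "a sunrise", "image": "first.png", "lastFrame": "last.png"})
    print(mode, mode_supported(mode, "veo-3.1-fast-generate-001"))
```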
Extend existing Veo-generated videos.
IMPORTANT: Each extension returns a COMPLETE video (initial + all extensions combined). NO concatenation needed - the API automatically merges the extended content into a single video file.
- Requires a preview model (veo-3.1-generate-preview, veo-3.1-fast-generate-preview)
- Call generateVideoWithVeo with video parameter (prefer GCS URI gs://bucket/path for efficiency)

Note: When extending videos, prefer a GCS URI over an HTTP URL for better performance (no download/base64 conversion needed).
- Preview models only (veo-3.1-generate-preview, veo-3.1-fast-generate-preview)

Grok workflow:
1. Call generateVideoWithGrok with prompt and optional parameters
2. Poll getGrokVideoStatus every 30 seconds

Veo workflow:
1. Call generateVideoWithVeo with prompt and optional parameters
2. Poll getVeoVideoStatus every 30 seconds

First-last-frame workflow:
1. Generate the keyframes with generateImage
2. Call generateVideoWithVeo with image (first frame) and lastFrame
3. Poll getVeoVideoStatus until completed

CRITICAL: For videos ≤ 15 seconds, ALWAYS prefer Grok direct generation. Only use Veo Extension for videos longer than 15 seconds.
| Target Duration | Recommended Approach |
|---|---|
| ≤ 15 seconds | Grok direct generation (preferred, faster and cost-effective) |
| 16-36 seconds | Veo Video Extension (dynamic initial + N extensions) |
| > 36 seconds | First-Last-Frame Storyboard |
Dynamic Initial Duration Calculation:
Parameters:
Calculation Formula:
n = ⌈(target - 8) / 7⌉

Reference Table:
| Target | Initial | Extensions | Actual |
|---|---|---|---|
| 18s | 4s | 2 | 18s |
| 22s | 8s | 2 | 22s |
| 27s | 6s | 3 | 27s |
| 29s | 8s | 3 | 29s |
| 36s | 8s | 4 | 36s |
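The formula and reference table can be reproduced with a short planner. This is a sketch under stated assumptions: the 7-second extension length is inferred from the formula and the table rows (for example 36 = 8 + 4 × 7), the allowed initial durations are taken from the Veo `duration` values (4, 6, 8), and rounding the initial clip up to the next allowed value when the target does not land exactly is an assumption. The function name `plan_extension` is hypothetical.

```python
import math

EXTENSION_SECONDS = 7        # inferred from the formula and reference table
ALLOWED_INITIAL = (4, 6, 8)  # Veo duration values


def plan_extension(target_s):
    """Return initial duration, extension count, and actual length for a target."""
    if not 16 <= target_s <= 36:
        raise ValueError("video extension covers targets from 16 to 36 seconds")
    # n = ceil((target - 8) / 7), as in the calculation formula.
    n = math.ceil((target_s - 8) / EXTENSION_SECONDS)
    remainder = target_s - n * EXTENSION_SECONDS
    # Assumption: round the initial clip up to the next allowed duration.
    initial = next(d for d in ALLOWED_INITIAL if d >= remainder)
    return {
        "initial": initial,
        "extensions": n,
        "actual": initial + n * EXTENSION_SECONDS,
    }


if __name__ == "__main__":
    for target in (18, 22, 27, 29, 36):
        print(target, plan_extension(target))
```

For a 22-second target this yields an 8-second initial video followed by 2 extensions, matching the table row.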
generateVideoWithVeo(video=<previous_video_gcs_uri_or_url>)

IMPORTANT: The Gemini video extension API returns a complete video file containing the initial video plus all extensions merged together. You do NOT need to concatenate segments - just use the final extended video directly.
Note: Prefer using GCS URI (gs://bucket/path) for video extension - it's more efficient than HTTP URL.
Example: Target 22 seconds
Note: Extension uses the video URL directly - no need to call uploadAndGetVid.
CRITICAL: This workflow has strict sequential dependencies. Do NOT skip steps.
Use Veo's first-last-frame for precise visual control in multi-segment videos.
Frame Sharing Principle: N video segments require N+1 keyframes
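The frame sharing principle amounts to pairing each keyframe with its successor, so that the last frame of one segment is the first frame of the next. A minimal sketch, with a hypothetical helper name and placeholder frame identifiers:

```python
def pair_keyframes(keyframes):
    """Turn N+1 keyframes into N (image, lastFrame) pairs for N segments."""
    if len(keyframes) < 2:
        raise ValueError("at least two keyframes are needed for one segment")
    # Each interior keyframe appears twice: as a lastFrame, then as an image.
    return list(zip(keyframes, keyframes[1:]))


if __name__ == "__main__":
    for image, last_frame in pair_keyframes(["frame1", "frame2", "frame3"]):
        print(f"image={image}, lastFrame={last_frame}")
```

Three keyframes give two segments, with `frame2` shared between them.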
Workflow (MUST follow in order):
Step 1: Generate ALL keyframes FIRST
- Use generateImage
- Pass imageUrls for style consistency

Step 2: Generate videos with shared frames
- generateVideoWithVeo(image=frame1, lastFrame=frame2)
- generateVideoWithVeo(image=frame2, lastFrame=frame3) ← frame2 shared!

Step 3: Concatenate
- Read the editing-videos skill, then use submitDirectEditTask to combine all segments

For continuous long videos without cuts.
Workflow:
generateVideoWithVeo(video=<previous_video_gcs_uri_or_url>)

Note: Prefer using GCS URI (gs://bucket/path) for video extension - it's more efficient than HTTP URL.
Limitations:
When concatenating multiple generated video segments, use submitDirectEditTask with the Track structure.
- Read the editing-videos skill first
- Use uploadAndGetVid to get VID

Step 1: Upload all video segments
For each segment URL:
uploadAndGetVid(url) → vid://segment_N, duration_N, width, height
Step 2: Build Track structure
CRITICAL: Choose the correct method based on how videos were generated:
⚠️ MUST use direct concatenation. DO NOT add transitions.
When videos are generated using Veo first-last-frame with shared keyframes:
{
  "Canvas": { "Width": 1920, "Height": 1080 },
  "Track": [[
    { "Type": "video", "Source": "vid://segment_1", "TargetTime": [0, 8000] },
    { "Type": "video", "Source": "vid://segment_2", "TargetTime": [8000, 16000] },
    { "Type": "video", "Source": "vid://segment_3", "TargetTime": [16000, 24000] }
  ]]
}
Why no transitions? The shared keyframes between segments already ensure seamless visual continuity.
When videos are independently generated (not using shared keyframes), transitions can help smooth the connection:
{
  "Canvas": { "Width": 1920, "Height": 1080 },
  "Track": [[
    {
      "Type": "video",
      "Source": "vid://segment_1",
      "TargetTime": [0, 8000],
      "Extra": [{ "Type": "transition", "Source": "1182376", "Duration": 500 }]
    },
    {
      "Type": "video",
      "Source": "vid://segment_2",
      "TargetTime": [7500, 15500],
      "Extra": [{ "Type": "transition", "Source": "1182376", "Duration": 500 }]
    },
    { "Type": "video", "Source": "vid://segment_3", "TargetTime": [15000, 23000] }
  ]]
}
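Both Track examples follow the same arithmetic: each clip starts where the previous one ends, minus the transition duration when a transition is attached. The sketch below reproduces the two JSON examples from segment durations. The helper name `build_track` and its parameters are hypothetical, and the overlap rule is inferred from the example values rather than from documentation of submitDirectEditTask.

```python
def build_track(segments, transition_ms=0, transition_source=None):
    """Build the Track array from (source, duration_ms) pairs.

    With transition_ms == 0 the clips are placed back to back (shared keyframes).
    With transition_ms > 0 every clip except the last gets a transition and the
    next clip starts transition_ms earlier, as in the second example.
    """
    clips = []
    start = 0
    for index, (source, duration_ms) in enumerate(segments):
        clip = {
            "Type": "video",
            "Source": source,
            "TargetTime": [start, start + duration_ms],
        }
        is_last = index == len(segments) - 1
        if transition_ms > 0 and not is_last:
            clip["Extra"] = [{
                "Type": "transition",
                "Source": transition_source,
                "Duration": transition_ms,
            }]
            # Overlap the next clip with the tail of this one.
            start += duration_ms - transition_ms
        else:
            start += duration_ms
        clips.append(clip)
    # Track is a list of tracks; this sketch uses a single track.
    return [clips]


if __name__ == "__main__":
    segs = [("vid://segment_1", 8000), ("vid://segment_2", 8000), ("vid://segment_3", 8000)]
    print(build_track(segs))
    print(build_track(segs, transition_ms=500, transition_source="1182376"))
```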
Key Points:
Step 3: Submit and poll
submitDirectEditTask(Canvas, Track) → taskId
wait 90 seconds
poll getVideoEditTaskStatus(taskId) every 30 seconds until completed
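The wait-then-poll pattern above can be sketched as a generic loop. Everything here is illustrative: `get_status` is a stand-in callable for getVideoEditTaskStatus, the assumption that it returns a plain status string, the `"failed"` status value, and the `max_polls` cap are not taken from the source. Only the 90-second initial wait, the 30-second interval, and the `"completed"` end state come from the steps above.

```python
import time


def wait_for_edit_task(get_status, task_id, initial_wait_s=90, interval_s=30,
                       max_polls=40, sleep=time.sleep):
    """Wait, then poll get_status(task_id) until it reports "completed"."""
    sleep(initial_wait_s)
    for _ in range(max_polls):
        status = get_status(task_id)
        if status == "completed":
            return status
        if status == "failed":  # assumed failure status
            raise RuntimeError(f"edit task {task_id} failed")
        sleep(interval_s)
    raise TimeoutError(f"edit task {task_id} did not complete after {max_polls} polls")


if __name__ == "__main__":
    # Demo with a fake status source and no real sleeping.
    fake = iter(["processing", "completed"])
    print(wait_for_edit_task(lambda task_id: next(fake), "task-1", sleep=lambda s: None))
```

Injecting `sleep` keeps the loop testable without real delays.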
MANDATORY: After generating videos, you MUST output ALL video URLs clearly.
**Generated Video**:

Video URL: url
All N video segments completed!
| # | Status | Preview | URL |
|---|--------|---------|-----|
| 1 | ✅ |  | url1 |
| 2 | ✅ |  | url2 |
| ... | ... | ... | ... |
**All Video URLs**:
1. url1
2. url2
...
**Final Video** (concatenated from N segments):

Final Video URL: final_url
IMPORTANT: Never end a video generation task without explicitly listing all video URLs.