Generate and align AI voiceover tracks for video edits using Gemini TTS. Covers single-speaker narration, multi-speaker dialogue, voice selection, style prompting, and timeline integration.
For basic voiceover, call generate_speech with just the text:
generate_speech(text="Welcome to our product tour.")
Defaults: voice Kore, model gemini-2.5-flash-preview-tts, output saved to public/assets/tts/tts-<timestamp>.wav.
Pick a voice that fits the tone. Here are the top picks by use case:
| Use Case | Recommended Voices |
|---|---|
| Neutral narration | Kore (Firm), Charon (Informative), Schedar (Even), Sadaltager (Knowledgeable) |
| Energetic / upbeat | Puck (Upbeat), Fenrir (Excitable), Laomedeia (Upbeat) |
| Warm storytelling | Sulafat (Warm), Achird (Friendly), Vindemiatrix (Gentle) |
| Serious / authoritative | Kore (Firm), Orus (Firm), Alnilam (Firm), Gacrux (Mature) |
| Soft / calm | Achernar (Soft), Enceladus (Breathy) |
| Casual / conversational | Zubenelgenubi (Casual), Callirrhoe (Easy-going), Umbriel (Easy-going) |
Full 30-voice list → references/voices.md
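The use-case table can also be kept as a small lookup so a script can pick a voice from a label. This is a minimal sketch: the `RECOMMENDED_VOICES` dict, its short keys, and the `pick_voice` helper are hypothetical names for illustration and are not part of the tool. The fallback to Kore mirrors the documented default voice.

```python
# Hypothetical lookup built from the use-case table; keys are illustrative.
RECOMMENDED_VOICES = {
    "neutral": ["Kore", "Charon", "Schedar", "Sadaltager"],
    "energetic": ["Puck", "Fenrir", "Laomedeia"],
    "warm": ["Sulafat", "Achird", "Vindemiatrix"],
    "serious": ["Kore", "Orus", "Alnilam", "Gacrux"],
    "soft": ["Achernar", "Enceladus"],
    "casual": ["Zubenelgenubi", "Callirrhoe", "Umbriel"],
}

DEFAULT_VOICE = "Kore"  # the documented default voice


def pick_voice(use_case: str, index: int = 0) -> str:
    """Return the index-th recommended voice for a use case, else the default."""
    voices = RECOMMENDED_VOICES.get(use_case.strip().lower(), [])
    if 0 <= index < len(voices):
        return voices[index]
    return DEFAULT_VOICE
```

The result can be passed as `voice_name`; trying a second or third pick is just a matter of raising `index`.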
Use style_prompt for delivery control:
generate_speech(
    text="This is the moment everything changed.",
    voice_name="Sulafat",
    style_prompt="Speak slowly and dramatically, like a documentary narrator building suspense."
)
Style prompts control tone, pacing, accent, and emotion. Keep them short and clear for simple tasks.
For two-speaker conversations, use generate_speech with speakers:
generate_speech(
    text="""Alex: Hey, did you see the new release?
Sam: Yeah, it looks incredible!""",
    speakers=[
        {"name": "Alex", "voice_name": "Kore"},
        {"name": "Sam", "voice_name": "Puck"}
    ]
)
- The speakers list must exactly match the names used in the text.
- Multi-speaker output requires gemini-2.5-flash-preview-tts.

Full multi-speaker details → references/multi-speaker.md
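Because the names in the speakers list have to match the labels in the text exactly, a quick pre-flight check can catch typos and case mismatches before a request is made. The sketch below assumes every dialogue line starts with `Name:`. `check_speakers` is a hypothetical helper, not part of the tool.

```python
import re


def check_speakers(text: str, speakers: list[dict]) -> list[str]:
    """Return a list of mismatches; an empty list means the names line up."""
    # Assumption: each dialogue line begins with "Name:" (case-sensitive).
    labels = re.findall(r"^[ \t]*([^:\n]+):", text, flags=re.MULTILINE)
    used = {label.strip() for label in labels}
    declared = {s["name"] for s in speakers}

    problems = []
    for name in sorted(used - declared):
        problems.append(f"'{name}' appears in text but is not in speakers")
    for name in sorted(declared - used):
        problems.append(f"'{name}' is in speakers but never speaks")
    return problems
```

Note that any line with a colon before its first line break is read as a speaker label, so narration lines such as "Note: ..." would be reported as unknown speakers.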
1. Call generate_speech(...) → saves WAV to public/assets/tts/.
2. Call get_asset_info(...) to get exact duration.
3. Place the audio on the timeline with <Sequence> + <Audio> components.
4. Review with inspect_asset(...): check pronunciation, pacing, emotional fit.

generate_speech(
    text: str,                                    # Required. The narration text.
    output_path: str = None,                      # Optional. e.g. "public/assets/tts/intro.wav"
    voice_name: str = "Kore",                     # Optional. Any of the 30 prebuilt voices.
    style_prompt: str = "",                       # Optional. Natural-language delivery instructions.
    model: str = "gemini-2.5-flash-preview-tts",  # Optional. Or "gemini-2.5-pro-preview-tts".
    speakers: list = None                         # Optional. For multi-speaker. List of {name, voice_name}.
)
Returns: { success, path, duration_seconds, sample_rate_hz, channels, size_bytes, ... }
Output is always 24kHz mono WAV (PCM 16-bit).
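Since the output is a plain PCM WAV, its duration can be cross-checked locally and converted to a frame count for timeline placement. The sketch below uses only the Python standard library. `wav_duration_seconds` and `duration_in_frames` are hypothetical helpers, and the 30 fps default is an assumption about the composition rather than something this tool defines. It is a sanity check, not a replacement for get_asset_info.

```python
import math
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration from the WAV header: number of frames / sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_in_frames(seconds: float, fps: int = 30) -> int:
    """Round up so the clip is never cut short by a partial frame."""
    return math.ceil(seconds * fps)
```

For example, a 2-second clip at the assumed 30 fps needs a 60-frame sequence. Rounding up instead of to the nearest frame trades at most one frame of trailing silence for never truncating the audio.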
| Model | Single | Multi | Notes |
|---|---|---|---|
| gemini-2.5-flash-preview-tts | ✓ | ✓ | Default. Fast. Use for most tasks. |
| gemini-2.5-pro-preview-tts | ✓ | ✗ | Higher quality single-speaker only. |
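The support table reduces to a simple rule, sketched below. `choose_model` and its `prefer_quality` flag are hypothetical names for illustration. The model strings are the two listed in the table.

```python
FLASH = "gemini-2.5-flash-preview-tts"  # default; single- and multi-speaker
PRO = "gemini-2.5-pro-preview-tts"      # single-speaker only, per the table


def choose_model(multi_speaker: bool, prefer_quality: bool = False) -> str:
    """Pick a model name consistent with the support table."""
    if multi_speaker:
        # The table lists multi-speaker support for the flash model only.
        return FLASH
    return PRO if prefer_quality else FLASH
```

The returned string can be passed as the `model` argument. With no flags set it yields the documented default.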
For simple tasks, a one-line style_prompt is enough:
- "Read in a calm, professional tone."
- "Excited sports announcer style."
- "Whisper softly, like telling a secret."

For complex performances, structure the prompt with these elements:
Full prompting guide with examples → references/prompting-style-and-flow.md
| Situation | Reference to Read |
|---|---|
| Selecting a voice or hearing all options | references/voices.md |
| Multi-speaker dialogue setup | references/multi-speaker.md |
| Complex style / accent / pacing control | references/prompting-style-and-flow.md |
| Non-English narration or language questions | references/languages.md |
| Model selection or API constraints | references/models-and-limits.md |
| Chunking, duration fitting, Remotion placement | references/timeline-integration.md |
| Simple one-line TTS | No references needed — use the info above |