Generate character voices using TTS, voice cloning, and lip-sync tools. Supports Chatterbox, F5-TTS, TTS Audio Suite, RVC, and ElevenLabs. Use when creating speech audio for characters or syncing audio to video.
Creates character voices through TTS/voice cloning and synchronizes them with generated video.
VOICE REQUEST
|
|-- Have reference audio of target voice?
| |-- Yes (5+ seconds) → Chatterbox (MIT, paralinguistic tags)
| |-- Yes (10-15 seconds) → F5-TTS (fastest zero-shot)
| |-- Yes (10+ minutes) → RVC training (highest fidelity)
| |-- Yes (any length, budget) → ElevenLabs (production quality)
|
|-- No reference audio?
| |-- Need emotion control → IndexTTS-2 (8-emotion vectors)
| |-- Need multi-language → TTS Audio Suite (23 languages)
| |-- Need voice design → ElevenLabs Voice Design (describe voice)
| |-- Quick prototype → Any TTS with default voice
|
|-- Need multi-speaker dialog?
| |-- Chatterbox (4 voices) or TTS Audio Suite (character switching)
|
|-- Need lip-sync?
| |-- Best accuracy → Wav2Lip + CodeFormer
| |-- Need head movement → SadTalker
| |-- Full expression control → LivePortrait
| |-- Unlimited length → InfiniteTalk
Chatterbox
Strengths: MIT license, preferred over ElevenLabs in 63.8% of blind tests, clones from a 5-second sample, emotion control, sub-200ms latency.
Paralinguistic tags:
[laugh] [chuckle] [sigh] [gasp] [cough] [clear throat]
[whisper] [excited] [sad] [angry] [surprised]
Key parameter: exaggeration (0.25-2.0) controls expressiveness.
Limit: 40-second generation cap. Split longer content.
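A simple way to respect the 40-second cap is to split scripts at sentence boundaries using a rough duration estimate. The ~15 characters-per-second speaking rate is an assumption, not part of Chatterbox; tune it for your voice and pacing:

```python
import re

CHARS_PER_SECOND = 15   # assumed speaking rate; adjust per voice
MAX_SECONDS = 40        # Chatterbox generation cap

def split_for_tts(text: str, max_chars: int = CHARS_PER_SECOND * MAX_SECONDS) -> list[str]:
    """Split text at sentence boundaries into chunks under max_chars.
    A single sentence longer than max_chars is kept whole."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```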
F5-TTS
Strengths: Fastest zero-shot cloning, works from samples under 15 seconds, MIT license, multi-language.
Requirements: Each reference clip must be a .wav file paired with a .txt file containing its matching transcription.
Languages: English, German, Spanish, French, Japanese, Hindi, Thai, Portuguese.
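Since a missing transcription fails the run, a quick pre-flight check of the reference directory helps. The directory layout here is an assumption; point it at wherever you keep reference audio:

```python
from pathlib import Path

def check_reference_pairs(ref_dir: str) -> tuple[list[Path], list[Path]]:
    """Return (wavs with a matching .txt transcription, wavs missing one)."""
    complete, missing = [], []
    for wav in sorted(Path(ref_dir).glob("*.wav")):
        # F5-TTS expects foo.wav + foo.txt with the same stem.
        (complete if wav.with_suffix(".txt").exists() else missing).append(wav)
    return complete, missing
```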
TTS Audio Suite
Strengths: Unified multi-engine platform, 23 languages, character switching.
Special features:
[CharacterName] tags for character switching
[de:Alice], [fr:Bob] for per-character language selection
[pause:1s] for timed pauses
Integrates: F5-TTS, Chatterbox, Higgs Audio 2, VibeVoice, IndexTTS-2, RVC.
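A dialog script combining these tags might look like the following (character names and lines are illustrative):

```
[Alice] Ready when you are. [pause:1s]
[Bob] Give me a second.
[de:Alice] Of course, no problem.
[fr:Bob] Alright, let's go.
```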
IndexTTS-2
Strengths: 8-emotion vector control with per-segment parameters.
Emotions: happy, angry, sad, surprised, afraid, disgusted, calm, melancholic.
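One way to build per-segment emotion parameters is as a normalized weight over the eight emotions. The dict-of-weights representation below is an assumption for illustration; consult the engine's documentation for its actual parameter format:

```python
EMOTIONS = ("happy", "angry", "sad", "surprised",
            "afraid", "disgusted", "calm", "melancholic")

def emotion_vector(**weights: float) -> dict[str, float]:
    """Normalized weights over the 8 emotions; unset emotions get 0."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {unknown}")
    total = sum(weights.values()) or 1.0
    return {e: weights.get(e, 0.0) / total for e in EMOTIONS}
```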
RVC (Retrieval-based Voice Conversion)
Use case: Train a model on the target voice (10+ minutes of audio), then convert any TTS output into that voice.
Pipeline: Text → Any TTS → Base Audio → RVC Model → Character Voice
Training: 300-500 epochs, RMVPE feature extraction.
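The pipeline above is a simple composition of two stages. The function names here (base_tts, rvc_convert) are placeholders, not real APIs; wire in your actual TTS engine and trained RVC model:

```python
# Stub sketch of the Text -> TTS -> RVC pipeline (placeholder functions).

def base_tts(text: str, out_path: str = "base.wav") -> str:
    """Placeholder: synthesize text with any TTS, return the audio path."""
    return out_path

def rvc_convert(wav_path: str, model: str) -> str:
    """Placeholder: run a trained RVC model over the base audio."""
    return wav_path.replace("base", model)

def character_line(text: str, character_model: str) -> str:
    """Chain the two stages: any TTS voice in, character voice out."""
    return rvc_convert(base_tts(text), model=character_model)
```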
ElevenLabs
Tiers:
For each character, establish a voice profile in projects/{project}/characters/{name}/profile.yaml:
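A profile might look like the following sketch; the field names are illustrative assumptions, not a fixed schema:

```yaml
# projects/{project}/characters/{name}/profile.yaml
name: alice
engine: chatterbox                  # chatterbox | f5-tts | rvc | elevenlabs
reference_audio: refs/alice.wav
language: en
exaggeration: 0.7                   # Chatterbox expressiveness, 0.25-2.0
emotion: {calm: 0.6, happy: 0.4}    # IndexTTS-2-style weights, if used
rvc_model: null                     # path to a trained RVC model, if any
```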