XTTS v2 Constraints

Mono input only

XTTS expects mono audio for the reference/speaker WAV. Always downmix stereo to mono before passing the file to XTTS. With librosa (the call below also resamples to 22050 Hz):

import librosa
audio, sr = librosa.load(reference_path, sr=22050, mono=True)

Or with ffmpeg:

ffmpeg -i stereo.wav -ac 1 mono.wav
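For environments where neither librosa nor ffmpeg is available, the downmix can be sketched with the standard library alone. This is a minimal illustration, not the worker's actual code: the function name `downmix_wav_to_mono` is hypothetical, it assumes 16-bit PCM input, and it downmixes by averaging the channels of each frame.

```python
import array
import sys
import wave


def downmix_wav_to_mono(src_path: str, dst_path: str) -> int:
    """Write a mono copy of a 16-bit PCM WAV and return the source channel count.

    Hypothetical helper: assumes 16-bit samples and averages all channels.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM input")

    samples = array.array("h")
    samples.frombytes(raw)
    # WAV data is little-endian; array.array uses the machine's native order.
    if sys.byteorder == "big":
        samples.byteswap()

    if n_channels > 1:
        # Samples are interleaved per frame (L, R, L, R, ...); average each frame.
        mono = array.array(
            "h",
            (
                sum(samples[i:i + n_channels]) // n_channels
                for i in range(0, len(samples), n_channels)
            ),
        )
    else:
        mono = samples

    # Convert back to little-endian before writing on big-endian machines.
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())

    return n_channels
```

The returned channel count can double as a guard: a caller can log whenever a reference file arrived with more than one channel. Unlike the librosa call, this sketch does not resample.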

Voice reference from longest segments

Build the reference WAV by concatenating the longest speech segments until their total duration reaches at least 8 seconds (REFERENCE_AUDIO_MIN_SECONDS in config). This gives XTTS enough of the speaker's voice to clone from:

segments_sorted = sorted(segments, key=lambda s: s.end - s.start, reverse=True)
selected = []
total_duration = 0.0
for seg in segments_sorted:
    selected.append(seg)
    total_duration += seg.end - seg.start
    if total_duration >= settings.REFERENCE_AUDIO_MIN_SECONDS:
        break
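The selection loop above stops at picking segments; the concatenation step and the case where the recording has less speech than the threshold are left implicit. The following is a minimal sketch of both, under stated assumptions: the `Segment` dataclass, the function names, and the module-level `REFERENCE_AUDIO_MIN_SECONDS` constant are stand-ins for whatever the project actually uses (the source only shows `.start`, `.end`, and `settings.REFERENCE_AUDIO_MIN_SECONDS`), and the audio is assumed to be an already-loaded mono sample sequence.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Stand-in for settings.REFERENCE_AUDIO_MIN_SECONDS (8 seconds per the text).
REFERENCE_AUDIO_MIN_SECONDS = 8.0


@dataclass
class Segment:
    """Hypothetical segment shape: start and end times in seconds."""
    start: float
    end: float


def pick_reference_segments(
    segments: Sequence[Segment],
    min_seconds: float = REFERENCE_AUDIO_MIN_SECONDS,
) -> Tuple[List[Segment], float]:
    """Take the longest segments first until min_seconds is reached.

    Returns the chosen segments and their total duration, so the caller
    can detect a shortfall (total < min_seconds) instead of missing it.
    """
    ordered = sorted(segments, key=lambda s: s.end - s.start, reverse=True)
    selected: List[Segment] = []
    total = 0.0
    for seg in ordered:
        selected.append(seg)
        total += seg.end - seg.start
        if total >= min_seconds:
            break
    return selected, total


def concat_segments(
    samples: Sequence[float], sample_rate: int, selected: Sequence[Segment]
) -> List[float]:
    """Slice each selected segment out of the mono samples and join them."""
    out: List[float] = []
    for seg in selected:
        lo = int(round(seg.start * sample_rate))
        hi = int(round(seg.end * sample_rate))
        out.extend(samples[lo:hi])
    return out
```

A caller might use it like this:

```python
selected, total = pick_reference_segments(segments)
if total < REFERENCE_AUDIO_MIN_SECONDS:
    print(f"only {total:.1f}s of speech available for the reference")
reference = concat_segments(audio, sr, selected)
```

The sketch concatenates in longest-first order, matching the selection loop. Sorting `selected` by `start` before concatenating would keep the clips in chronological order; the source does not say which order is intended.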

Worker Audio Processing

XTTS v2 Constraints

Mono input only

Voice reference from longest segments

Atempo Time-Stretching

Audio Ducking

Demucs Vocal Separation

ffmpeg Operations Reference

Audio Quality Rules
