This skill should be used when writing audio processing code in the worker — XTTS voice cloning constraints, atempo stretch clamping, reference audio preparation, Demucs vocal separation, audio ducking, or ffmpeg audio/video operations.
XTTS expects mono audio for the reference/speaker WAV. Always downmix stereo to mono before passing it to the model:
import librosa
audio, sr = librosa.load(reference_path, sr=22050, mono=True)
Or with ffmpeg:
ffmpeg -i stereo.wav -ac 1 mono.wav
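As a guard before handing a file to XTTS, the channel count can be checked with the standard-library wave module. This is a minimal sketch: the helper name is_mono_wav is hypothetical, and it assumes the reference is a plain PCM WAV file (the wave module does not read compressed formats).

```python
import wave


def is_mono_wav(path: str) -> bool:
    """Return True if the PCM WAV file at `path` has exactly one channel."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1
```

If the check fails, downmix with one of the two commands above before using the file as a reference.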
Build the reference WAV by concatenating the longest speech segments until total duration ≥ 8 seconds (REFERENCE_AUDIO_MIN_SECONDS from config). This gives XTTS enough voice sample to clone from:
segments_sorted = sorted(segments, key=lambda s: s.end - s.start, reverse=True)
selected = []
total_duration = 0.0
for seg in segments_sorted:
    selected.append(seg)
    total_duration += seg.end - seg.start
    if total_duration >= settings.REFERENCE_AUDIO_MIN_SECONDS:
        break
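The selection loop above can be wrapped in a small function for testing. This is a sketch under assumptions: the Segment dataclass and the name select_reference_segments are hypothetical stand-ins for whatever segment type the worker uses, and the 8.0 default mirrors REFERENCE_AUDIO_MIN_SECONDS.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical stand-in for the worker's segment type.
    start: float  # seconds
    end: float    # seconds


def select_reference_segments(segments, min_seconds=8.0):
    """Pick the longest segments until their total duration reaches min_seconds.

    Returns (selected_segments, total_duration). The total may be below
    min_seconds if the input does not contain enough speech.
    """
    ordered = sorted(segments, key=lambda s: s.end - s.start, reverse=True)
    selected = []
    total = 0.0
    for seg in ordered:
        if total >= min_seconds:
            break
        selected.append(seg)
        total += seg.end - seg.start
    return selected, total


segments = [Segment(0.0, 3.0), Segment(5.0, 10.0), Segment(12.0, 16.0)]
selected, total = select_reference_segments(segments)
```

Returning the total lets the caller decide what to do when the available speech is shorter than the minimum.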
TTS clips must be stretched to match the original segment duration. Clamp the tempo ratio so that extreme stretching does not make the speech unintelligible:
ATEMPO_MIN = 0.75 # from config
ATEMPO_MAX = 1.50
ratio = tts_duration / segment_duration
ratio = max(ATEMPO_MIN, min(ATEMPO_MAX, ratio))
ffmpeg atempo filter (values outside [0.5, 2.0] require chaining):
# For ratios in [0.75, 1.5] a single atempo filter suffices:
ffmpeg_filter = f"atempo={ratio:.4f}"
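With the clamp in place, chaining is never needed. If the bounds were ever widened past [0.5, 2.0], the ratio could be split into several atempo stages whose product equals the requested tempo. The sketch below illustrates that idea; the names clamp_ratio and atempo_chain are hypothetical, not part of the worker.

```python
def clamp_ratio(ratio: float, lo: float = 0.75, hi: float = 1.50) -> float:
    """Clamp a tempo ratio into [lo, hi]."""
    return max(lo, min(hi, ratio))


def atempo_chain(ratio: float) -> str:
    """Build a comma-separated atempo chain with every stage inside [0.5, 2.0]."""
    if ratio <= 0:
        raise ValueError("tempo ratio must be positive")
    factors = []
    # Peel off 2.0x stages while the remaining ratio is too large.
    while ratio > 2.0:
        factors.append(2.0)
        ratio /= 2.0
    # Peel off 0.5x stages while the remaining ratio is too small.
    while ratio < 0.5:
        factors.append(0.5)
        ratio /= 0.5
    factors.append(ratio)
    return ",".join(f"atempo={f:.4f}" for f in factors)
```

For example, a requested tempo of 3.0 becomes a 2.0 stage followed by a 1.5 stage, while any clamped ratio yields a single stage.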
After stretching, hard-trim the clip to the exact segment duration and apply a 50ms fade-out to prevent audio bleed into the next segment:
# ffmpeg trim + afade
"-af", f"atempo={ratio:.4f},atrim=duration={segment_duration},afade=t=out:st={segment_duration - 0.05}:d=0.05"
During dubbed speech segments, the background (no-vocals) track is ducked to a configurable level, then restored in silence:
Implementation: an ffmpeg volume filter with per-segment timeline expressions, or sidechaincompress.
The duck level is set in config.py (e.g. DUCK_LEVEL_DB = -12).
Demucs vocal separation uses the htdemucs model with --two-stems vocals:
demucs --two-stems vocals --name htdemucs -o {output_dir} {input_file}
Produces:
vocals.wav: isolated speech
no_vocals.wav: background music/noise
Both are uploaded to MinIO after separation. URLs are cached in MongoDB by the API so subsequent re-dubs skip Demucs entirely.
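The no_vocals.wav track is the background that gets ducked under dubbed speech. One way to express that with the ffmpeg volume filter is a single timeline (enable) expression covering all speech spans. This is a sketch, not the worker's actual implementation: build_duck_filter is a hypothetical helper, and the -12 default mirrors the DUCK_LEVEL_DB example. The gain changes abruptly at span boundaries; smoothing would need extra work.

```python
def build_duck_filter(speech_spans, duck_level_db=-12.0):
    """Return an ffmpeg audio filter that lowers volume during speech spans.

    speech_spans is an iterable of (start_seconds, end_seconds) tuples.
    Outside those spans the timeline expression is false, so the filter is
    bypassed and the background plays at its original level.
    """
    spans = list(speech_spans)
    if not spans:
        return "anull"  # pass-through filter: nothing to duck
    windows = "+".join(
        f"between(t,{start:.3f},{end:.3f})" for start, end in spans
    )
    return f"volume=volume={duck_level_db:g}dB:enable='{windows}'"


duck_filter = build_duck_filter([(1.0, 2.5), (4.0, 5.25)])
```

The resulting string can be passed to ffmpeg with -af when processing the background track.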
Extract audio from video:
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 44100 audio.wav
Mux dubbed audio back into video:
ffmpeg -i original_video.mp4 -i dubbed_audio.wav -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 output.mp4
Convert to mono 22050 Hz (XTTS reference):
ffmpeg -i input.wav -ac 1 -ar 22050 reference_mono.wav
Always copy the video stream (-c:v copy); never re-encode video unnecessarily.
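When these commands are run from Python, building them as argument lists avoids shell quoting problems with file paths. The helpers below are a sketch with hypothetical names. The -y flag (overwrite output without prompting) is an addition not present in the commands above.

```python
def extract_audio_cmd(video_path, audio_path, sample_rate=44100):
    """ffmpeg argv that extracts 16-bit PCM audio from a video file."""
    return [
        "ffmpeg", "-y",           # -y is an assumption: overwrite existing output
        "-i", video_path,
        "-vn",                    # drop the video stream
        "-acodec", "pcm_s16le",
        "-ar", str(sample_rate),
        audio_path,
    ]


def mux_cmd(video_path, dubbed_audio_path, output_path):
    """ffmpeg argv that muxes dubbed audio into the video without re-encoding video."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", dubbed_audio_path,
        "-c:v", "copy",           # copy the video stream as-is
        "-c:a", "aac",
        "-map", "0:v:0",          # video from the first input
        "-map", "1:a:0",          # audio from the second input
        output_path,
    ]
```

Either list can be executed with subprocess.run(cmd, check=True), which raises CalledProcessError if ffmpeg exits with a non-zero status.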