Generate speech audio from text using Qwen3 TTS, or clone a voice from reference audio. Triggered when the user wants to convert text to speech, generate audio, read text aloud, or clone/mimic a voice. Supports multiple speakers, English and Chinese, and emotion/style control.
Generate speech audio from text, or clone a voice from a reference audio file.
Binaries:
- {baseDir}/scripts/tts: text-to-speech generation with named speakers.
- {baseDir}/scripts/voice_clone: voice cloning from a reference audio file.

Models:
- {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice: named-speaker TTS (0.6B parameters).
- {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base: voice cloning from reference audio (0.6B parameters).

Pre-packaged reference audio files for voice cloning are available at {baseDir}/scripts/reference_audio/. Each speaker has two files:
- {baseDir}/scripts/reference_audio/<speaker_name>.wav: reference audio (mono 24kHz 16-bit WAV)
- {baseDir}/scripts/reference_audio/<speaker_name>.txt: transcript of the reference audio

Available reference speakers: trump, elon_musk.
When to use each binary:
- tts: when the user wants to generate speech from text using a named speaker (Vivian, Ryan, etc.). Supports English and Chinese.
- voice_clone: when the user wants to clone a specific voice from a reference audio file and generate new speech in that voice. If the user asks to clone a voice by speaker name (e.g., "speak like Trump", "use Elon Musk's voice"), check {baseDir}/scripts/reference_audio/ for a matching <speaker_name>.wav and <speaker_name>.txt pair, and use ICL mode with both files.

On Linux, the binaries require libtorch shared libraries. Set the library path before running any command:
export LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH
On macOS, no environment setup is needed (the binaries use the MLX backend). All commands below show the macOS form. On Linux, either export LD_LIBRARY_PATH once as shown above or prefix each command with LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH.
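The platform difference can be hidden behind a small wrapper. The following is a minimal sketch, not part of the skill itself: `run_tts_binary` and the `BASE_DIR` variable (standing in for {baseDir}) are hypothetical names introduced here for illustration.

```shell
# BASE_DIR is a hypothetical variable standing in for {baseDir}.
BASE_DIR="${BASE_DIR:-.}"

# Run the given command, adding the libtorch library path only on Linux.
# On other platforms (for example macOS) the command runs unchanged.
run_tts_binary() {
  if [ "$(uname -s)" = "Linux" ]; then
    # Prepend the libtorch directory; keep any existing value after a colon.
    LD_LIBRARY_PATH="$BASE_DIR/scripts/libtorch/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
  else
    "$@"
  fi
}
```

With this in place, a call such as `run_tts_binary "$BASE_DIR/scripts/tts" ...` takes the same form on both platforms.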
Generate speech audio from text with a named speaker.
{baseDir}/scripts/tts \
{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
"<text>" \
<speaker> \
<language>
| Parameter | Required | Description |
|---|---|---|
| model_path | Yes | Path to the model directory |
| text | Yes | The text to synthesize as speech |
| speaker | Yes | Speaker name (see Available Speakers below) |
| language | Yes | english or chinese |
Available speakers: Vivian, Serena, Ryan, Aiden, Uncle_fu, Ono_anna, Sohee, Eric, Dylan.
Generates output.wav (24kHz mono WAV) in the current working directory.
{baseDir}/scripts/tts \
{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
"Hello! Welcome to the Qwen3 text-to-speech system." \
Vivian \
english
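A thin wrapper can apply the defaults this document names (Vivian and english) and reject unknown values before the binary is invoked. This is a sketch under stated assumptions: `speak`, `is_known_speaker`, `BASE_DIR` and `TTS_BIN` are hypothetical names, and it assumes the binary expects speaker names spelled exactly as listed.

```shell
# BASE_DIR is a hypothetical variable standing in for {baseDir}.
BASE_DIR="${BASE_DIR:-.}"

# Succeeds only for the nine speaker names listed in this document.
is_known_speaker() {
  case "$1" in
    Vivian|Serena|Ryan|Aiden|Uncle_fu|Ono_anna|Sohee|Eric|Dylan) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: speak TEXT [SPEAKER] [LANGUAGE]
# The body runs in a subshell, so its variables do not leak into the caller.
speak() (
  text=${1:-}
  speaker=${2:-Vivian}
  language=${3:-english}

  if [ -z "$text" ]; then
    echo "speak: text is required" >&2
    return 2
  fi
  if ! is_known_speaker "$speaker"; then
    echo "speak: unknown speaker '$speaker'" >&2
    return 2
  fi
  case "$language" in
    english|chinese) ;;
    *) echo "speak: language must be english or chinese" >&2; return 2 ;;
  esac

  # TTS_BIN is a hypothetical override, useful for dry runs (TTS_BIN=echo).
  "${TTS_BIN:-$BASE_DIR/scripts/tts}" \
    "$BASE_DIR/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice" \
    "$text" "$speaker" "$language"
)
```

For example, `speak "Hello there"` would run the binary with Vivian and english, while `speak "Hello there" Ryan chinese` passes both values through unchanged.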
Clone a voice from a reference audio file using ICL (In-Context Learning). This encodes the reference audio into codec tokens and conditions generation on both the speaker embedding and the reference audio/text transcript, producing high-fidelity voice cloning.
Both a reference audio file and its transcript text are required.
{baseDir}/scripts/voice_clone \
{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
<reference_audio.wav> \
"<text>" \
<language> \
"<reference_text>"
| Parameter | Required | Description |
|---|---|---|
| model_path | Yes | Path to the Base model directory |
| reference_audio | Yes | Path to reference WAV file (mono 24kHz 16-bit) |
| text | Yes | The text to synthesize in the cloned voice |
| language | Yes | english or chinese |
| reference_text | Yes | Transcript of the reference audio |
The reference audio must be a mono 24kHz 16-bit WAV file. Convert from other formats with ffmpeg:
ffmpeg -i input.m4a -ac 1 -ar 24000 -sample_fmt s16 reference.wav
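Before cloning, it can help to confirm that a file already meets the mono 24kHz 16-bit requirement. The sketch below reads the fields straight from the WAV header using only POSIX tools. It assumes the canonical layout in which the fmt chunk immediately follows the RIFF/WAVE header (channels at byte 22, sample rate at byte 24, bits per sample at byte 34); files with extra chunks before fmt would be misread. `read_le_uint` and `check_reference_wav` are hypothetical helper names.

```shell
# Print the unsigned little-endian integer stored in file $1
# at byte offset $2, spanning $3 bytes.
read_le_uint() (
  value=0
  multiplier=1
  # od prints each byte as a decimal number, least significant byte first.
  for byte in $(od -An -tu1 -j "$2" -N "$3" "$1" 2>/dev/null); do
    value=$((value + byte * multiplier))
    multiplier=$((multiplier * 256))
  done
  echo "$value"
)

# Succeed only if file $1 looks like a mono 24kHz 16-bit WAV.
check_reference_wav() (
  file=$1
  riff=$(dd if="$file" bs=1 count=4 2>/dev/null)
  wave=$(dd if="$file" bs=1 skip=8 count=4 2>/dev/null)
  if [ "$riff" != "RIFF" ] || [ "$wave" != "WAVE" ]; then
    echo "check_reference_wav: $file is not a RIFF/WAVE file" >&2
    return 1
  fi

  channels=$(read_le_uint "$file" 22 2)
  rate=$(read_le_uint "$file" 24 4)
  bits=$(read_le_uint "$file" 34 2)
  if [ "$channels" -eq 1 ] && [ "$rate" -eq 24000 ] && [ "$bits" -eq 16 ]; then
    return 0
  fi
  echo "check_reference_wav: found ${channels} channel(s), ${rate} Hz, ${bits}-bit" >&2
  return 1
)
```

If `check_reference_wav reference.wav` fails, the ffmpeg command above converts the file to the required format.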
Generates output_voice_clone.wav (24kHz mono WAV) in the current working directory.
{baseDir}/scripts/voice_clone \
{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
reference.wav \
"This is a voice cloning test with in-context learning." \
english \
"The transcript of what was said in the reference audio."
1. Determine which binary to use:
   - Text-to-speech with a named speaker: use tts.
   - Cloning from a user-provided reference audio file: use voice_clone.
   - A request for a named voice (e.g., Trump or Elon Musk): use voice_clone with the pre-packaged reference audio.
2. For tts: Identify the text, speaker name, and language from the user's request. Default to Vivian and english if not specified.
3. For voice_clone with a named reference speaker:
   - Look for {baseDir}/scripts/reference_audio/<speaker_name>.wav and {baseDir}/scripts/reference_audio/<speaker_name>.txt.
   - Read the transcript from the .txt file.
   - Use ICL mode with the .wav file and the transcript text.
4. For voice_clone with a user-provided audio file: Ensure the reference audio is a mono 24kHz 16-bit WAV. Convert if needed using ffmpeg. Ask the user for the transcript of the reference audio.
5. Run the appropriate binary using the full paths to the binaries and model directories. On Linux, prefix with LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH.
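The named-speaker path (find the .wav and .txt pair, read the transcript, run voice_clone) can be sketched as one function. The names `clone_named_speaker`, `BASE_DIR` and `VOICE_CLONE_BIN` are hypothetical, and the normalization of a spoken name to a file name (lowercase, spaces to underscores) is an assumption based on the two packaged speakers, trump and elon_musk.

```shell
# BASE_DIR is a hypothetical variable standing in for {baseDir}.
BASE_DIR="${BASE_DIR:-.}"

# Usage: clone_named_speaker SPEAKER TEXT LANGUAGE
# The body runs in a subshell, so its variables do not leak into the caller.
clone_named_speaker() (
  # Assumed naming rule: "Elon Musk" becomes elon_musk.
  speaker=$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]' | tr ' ' '_')
  text=${2:-}
  language=${3:-}

  ref_dir="$BASE_DIR/scripts/reference_audio"
  wav="$ref_dir/$speaker.wav"
  txt="$ref_dir/$speaker.txt"

  # Both files are required for ICL mode.
  if [ ! -f "$wav" ] || [ ! -f "$txt" ]; then
    echo "clone_named_speaker: no reference pair for '$speaker' in $ref_dir" >&2
    return 1
  fi

  ref_text=$(cat "$txt")

  # Argument order follows the voice_clone usage documented here:
  # model, reference audio, text, language, reference transcript.
  # VOICE_CLONE_BIN is a hypothetical override for dry runs.
  "${VOICE_CLONE_BIN:-$BASE_DIR/scripts/voice_clone}" \
    "$BASE_DIR/scripts/models/Qwen3-TTS-12Hz-0.6B-Base" \
    "$wav" "$text" "$language" "$ref_text"
)
```

A call such as `clone_named_speaker "Elon Musk" "Hello world" english` would then resolve elon_musk.wav and elon_musk.txt, failing with a message if either file is absent.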
If the user says "Say hello world in Trump's voice":
# Read the transcript
REF_TEXT=$(cat {baseDir}/scripts/reference_audio/trump.txt)
# Run voice clone with ICL mode
{baseDir}/scripts/voice_clone \
{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
{baseDir}/scripts/reference_audio/trump.wav \
"Hello world" \
english \
"$REF_TEXT"
The output WAV file will be in the current working directory:
- tts produces output.wav
- voice_clone produces output_voice_clone.wav

Inform the user of the output file path.
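Because the file names are fixed, the absolute path to report can be derived from the binary that was run. The helper below is a small sketch, and `output_path_for` is a hypothetical name. It also checks that the file exists, so a failed run is not reported as a success.

```shell
# Usage: output_path_for tts|voice_clone
# Prints the absolute path of the expected output file in the current directory.
output_path_for() (
  case "${1:-}" in
    tts) name=output.wav ;;
    voice_clone) name=output_voice_clone.wav ;;
    *) echo "output_path_for: unknown binary '${1:-}'" >&2; return 2 ;;
  esac

  if [ ! -f "$name" ]; then
    echo "output_path_for: $name not found in $(pwd)" >&2
    return 1
  fi
  printf '%s/%s\n' "$(pwd)" "$name"
)
```

Since each run writes to the same file name, moving or renaming the result before the next run is a reasonable precaution if earlier outputs need to be kept.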