Use when you need local neural speech generation or voice cloning with the MOSS-TTS, OmniVoice, or VoxCPM2 engines, especially when choosing between stable long-form TTS, tag-driven voice design, and GPU-first expressive synthesis.
This skill provides access to three local TTS engines for speech generation and voice cloning.
OmniVoice and VoxCPM2 are installed in the local speech generation environment at:
%USERPROFILE%\Downloads\speech-gen\venv
Prefer that interpreter explicitly for local runs:
%USERPROFILE%\Downloads\speech-gen\venv\Scripts\python.exe generate_omnivoice.py ...
%USERPROFILE%\Downloads\speech-gen\venv\Scripts\python.exe generate_voxcpm.py ...
%USERPROFILE%\Downloads\speech-gen\venv\Scripts\python.exe -m voxcpm.cli ...
The wrapper scripts such as generate_voxcpm.py and generate_omnivoice.py live in:
%USERPROFILE%\Downloads\speech-gen
If a task runs from another workspace, do not assume a per-project .\venv exists. The .\venv\Scripts\python.exe commands in this document are relative to the speech-gen directory above. Use the known environment unless the user provides a different one.
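A minimal sketch of resolving the shared interpreter from any workspace, assuming the install location stated above. The helper names `speech_gen_python` and `speech_gen_script` are hypothetical and not part of the wrappers; `PureWindowsPath` is used so the path is built with Windows separators regardless of where the helper runs.

```python
import os
from pathlib import PureWindowsPath


def speech_gen_root(env=None):
    """Return the speech-gen directory under %USERPROFIL%\\Downloads.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    home = env.get("USERPROFILE")
    if not home:
        raise RuntimeError("USERPROFILE is not set; pass the interpreter path explicitly")
    return PureWindowsPath(home) / "Downloads" / "speech-gen"


def speech_gen_python(env=None):
    """Return the interpreter inside the shared venv (hypothetical helper)."""
    return speech_gen_root(env) / "venv" / "Scripts" / "python.exe"


def speech_gen_script(name, env=None):
    """Return the full path of a wrapper script such as generate_voxcpm.py."""
    return speech_gen_root(env) / name
```

Passing both paths explicitly to the interpreter avoids depending on the current directory having its own .\venv.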
For cloning, treat the clone CLI path as the default, not as an optional upgrade.
Do not fall back to voice-design descriptors such as male calm or female assertive if a cloned voice is the goal.
MOSS-TTS:
python generate_moss.py --text "Your text" --output "out.wav" [--reference "ref.wav"]
OmniVoice:
.\venv\Scripts\python.exe generate_omnivoice.py --text "Your text" --instruct "tags" --output "out.wav"
VoxCPM2:
Plain TTS:
.\venv\Scripts\python.exe generate_voxcpm.py --text "Your text" --output "out.wav"
Basic cloning:
.\venv\Scripts\python.exe generate_voxcpm.py --text "Your text" --reference "ref.wav" --output "out.wav"
Default cloning path when the user has a speech sample and exact transcription:
.\venv\Scripts\python.exe -m voxcpm.cli clone ^
--text "Final target text" ^
--prompt-audio "voice_sample.wav" ^
--prompt-text "Exact transcript of voice_sample.wav" ^
--reference-audio "voice_sample.wav" ^
--output "out.wav"
Use this mode by default for cloning whenever the user can provide the exact transcript of the sample audio. This was the best-performing local workflow for Polish cloning in practice.
Preferred PowerShell invocation for reliable local runs:
$text = @'
Final target text
'@
$prompt = @'
Exact transcript of the sample audio
'@
& 'C:\path\to\venv\Scripts\python.exe' -m voxcpm.cli clone `
--text $text `
--prompt-audio 'C:\path\to\voice_sample.wav' `
--prompt-text $prompt `
--reference-audio 'C:\path\to\voice_sample.wav' `
--output 'C:\path\to\out.wav'
Use this PowerShell here-string pattern when the text is multiline, when it contains punctuation-heavy Polish narration, or when reproducing a known-good local command exactly matters more than routing the text through intermediate files.
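When the call is made from Python rather than a shell, the same quoting problem can be sidestepped by passing an argument list. The sketch below only assembles the flags shown in the commands above; `build_clone_command` and `run_clone` are hypothetical helper names, and using one file for both `--prompt-audio` and `--reference-audio` mirrors the examples here rather than any documented requirement.

```python
import subprocess


def build_clone_command(python_exe, text, sample_audio, sample_transcript, output):
    """Assemble the argv list for the transcript-guided clone call.

    No extra tuning flags are added, so the CLI runs with its own defaults.
    """
    if not sample_transcript.strip():
        raise ValueError("sample_transcript must be the exact words spoken in the sample")
    return [
        str(python_exe), "-m", "voxcpm.cli", "clone",
        "--text", text,
        "--prompt-audio", str(sample_audio),
        "--prompt-text", sample_transcript,
        "--reference-audio", str(sample_audio),
        "--output", str(output),
    ]


def run_clone(python_exe, text, sample_audio, sample_transcript, output):
    """Run the clone command; a list argv needs no shell quoting or escaping."""
    cmd = build_clone_command(python_exe, text, sample_audio, sample_transcript, output)
    return subprocess.run(cmd, check=True)
```

Multiline or punctuation-heavy text is passed through unchanged as a single argument, which is the same property the here-string gives in PowerShell.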
Optional micro-cue pattern for cloned speech:
$text = @'
(gentle wrap-up)Final target text
'@
Use this only as a very soft, short parenthetical cue at the start of the text when the caller explicitly wants boundary-sensitive prosody nudges. Treat it as a subtle delivery hint, not full voice design.
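As an illustration of the cue format, the sketch below prepends a short parenthetical cue with no space before the text, matching the example above. The function name `with_micro_cue` and the three-word limit are assumptions chosen to keep cues short; they are not rules taken from VoxCPM2.

```python
def with_micro_cue(text, cue, max_words=3):
    """Prefix `text` with a short parenthetical delivery cue.

    The max_words limit is an illustrative guard, not an upstream constraint.
    """
    cue = cue.strip()
    if not cue:
        return text  # no cue requested, leave the text untouched
    if "(" in cue or ")" in cue:
        raise ValueError("pass the cue without parentheses")
    if len(cue.split()) > max_words:
        raise ValueError("cue is too long to act as a subtle delivery hint")
    return f"({cue}){text}"
```

For example, `with_micro_cue("Final target text", "gentle wrap-up")` yields the string used in the here-string above.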
Prepared sample commands without execution:
.\venv\Scripts\python.exe generate_voxcpm_samples.py
Valid English Tags (Strict):
american accent, australian accent, british accent, canadian accent, chinese accent, indian accent, japanese accent, korean accent, portuguese accent, russian accent, female, male, child, teenager, young adult, middle-aged, elderly, very high pitch, high pitch, moderate pitch, low pitch, very low pitch, whisper.
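Because the tag list is strict, it can help to check an instruct string before calling generate_omnivoice.py. The sketch below assumes tags are passed as a comma-separated string and compared exactly as written; the separator and the helper name `find_invalid_tags` are assumptions, not documented OmniVoice behavior.

```python
VALID_TAGS = frozenset({
    "american accent", "australian accent", "british accent", "canadian accent",
    "chinese accent", "indian accent", "japanese accent", "korean accent",
    "portuguese accent", "russian accent",
    "female", "male",
    "child", "teenager", "young adult", "middle-aged", "elderly",
    "very high pitch", "high pitch", "moderate pitch", "low pitch", "very low pitch",
    "whisper",
})


def find_invalid_tags(instruct):
    """Return the tags in `instruct` that are not in the strict list.

    Matching is exact (case-sensitive) after trimming surrounding whitespace.
    """
    tags = [tag.strip() for tag in instruct.split(",") if tag.strip()]
    return [tag for tag in tags if tag not in VALID_TAGS]
```

An empty result means every tag is on the list; anything returned should be replaced or dropped before the call.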
generate_voxcpm.py only exposes plain TTS and basic --reference cloning.
For the full clone workflow, call the CLI module directly: .\venv\Scripts\python.exe -m voxcpm.cli ...
Use prompt-audio + prompt-text + reference-audio as the default clone path when the user can provide a transcript of the sample.
prompt-text should be the exact spoken words from the sample audio, not a paraphrase.
Fall back to basic --reference cloning only if the user explicitly approves that downgrade.
On the voxcpm.cli clone path, do not add --normalize, --control, --no-optimize, custom --cfg-value, or custom --inference-timesteps unless the user explicitly requests experimentation.
For micro-cues, prefer short boundary cues such as clear start, steady continuation, slight contrast, or gentle wrap-up, over strong mood or persona descriptions.
When text needs normalization, try --normalize first so wetext can expand the text automatically.
Normalize by hand only when wetext normalization is unavailable, insufficient, or the user explicitly wants hand-normalized phrasing.
optimize=False, load_denoiser=False, cfg_value=2.0, inference_timesteps=8.
5247 MiB allocated and ~5.6-5.7 GiB reserved.
VoxCPM2 is available through the voxcpm package and the local wrapper generate_voxcpm.py.
The preferred path is the clone workflow that uses prompt audio and a transcript.
When the user supplies a transcript, use python -m voxcpm.cli clone instead of the wrapper unless they explicitly ask for a fallback.
The wrapper's cloning flag is --reference and maps to the upstream reference_wav_path API.
generate_voxcpm_samples.py prepares the numbered sample set voxcpm_1.wav, voxcpm_2.wav, voxcpm_3.wav, and voxcpm_clone_test.wav.
temp\voxcpm_clone_reference.wav.
moss_local_config.yaml.
MOSS-TTS: models/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf
OmniVoice: drbaph/OmniVoice-bf16
VoxCPM2: openbmb/VoxCPM2
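The clone-path rules in this document (transcript-guided voxcpm.cli clone by default, basic --reference cloning only as an approved downgrade) can be read as a small decision function. This is a sketch of that reading, not code from the wrappers; the function name, the mode labels, and the "ask_user" outcome for the case the rules leave open are all assumptions.

```python
def choose_clone_mode(has_exact_transcript, downgrade_approved=False):
    """Pick a VoxCPM2 cloning path from what the user has provided.

    Returns one of three illustrative labels:
      "cli_clone"          python -m voxcpm.cli clone with prompt audio + transcript
      "wrapper_reference"  generate_voxcpm.py --reference (basic cloning)
      "ask_user"           no transcript and no approval yet, so ask before running
    """
    if has_exact_transcript:
        return "cli_clone"
    if downgrade_approved:
        return "wrapper_reference"
    return "ask_user"
```

The transcript check comes first, so an approved downgrade never overrides the default path when a transcript is available.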