Skill file

Tts

Name: Tts
Author: detoix

Use when you need local neural speech generation or voice cloning with the MOSS-TTS, OmniVoice, or VoxCPM2 engines, especially when choosing between stable long-form TTS, tag-driven voice design, and GPU-first expressive synthesis.

detoix · 0 stars · April 13, 2026

Occupation
Category: Machine Learning

Skill content

MOSS, OmniVoice & VoxCPM2 TTS

This skill provides access to three local TTS engines for speech generation and voice cloning.

Environment Note

OmniVoice and VoxCPM2 are installed in the local speech generation environment at:

%USERPROFILE%\Downloads\speech-gen\venv

Prefer that interpreter explicitly for local runs:

%USERPROFILE%\Downloads\speech-gen\venv\Scripts\python.exe generate_omnivoice.py ...
%USERPROFILE%\Downloads\speech-gen\venv\Scripts\python.exe generate_voxcpm.py ...
%USERPROFILE%\Downloads\speech-gen\venv\Scripts\python.exe -m voxcpm.cli ...

The wrapper scripts such as generate_voxcpm.py and generate_omnivoice.py live in:

%USERPROFILE%\Downloads\speech-gen

If a task is being run from another workspace, do not assume a per-project .\venv exists. Use the known environment unless the user provides a different one.
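The interpreter rule above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the function name `speech_gen_python` and its `override` parameter are hypothetical, and the only path it knows is the one stated in this note.

```python
import os
from pathlib import PureWindowsPath


def speech_gen_python(env=None, override=None):
    """Return the interpreter to use for local TTS runs.

    Hypothetical helper: prefers a user-supplied interpreter, otherwise
    falls back to the known speech-gen environment. It never assumes a
    per-project .\\venv in the current workspace.
    """
    if override:
        # The user provided a different environment, so use it as given.
        return PureWindowsPath(override)
    env = os.environ if env is None else env
    home = env.get("USERPROFILE")
    if not home:
        raise RuntimeError("USERPROFILE is not set; ask the user for the interpreter path")
    # Known location from the environment note above.
    return PureWindowsPath(home, "Downloads", "speech-gen", "venv", "Scripts", "python.exe")
```

Using `PureWindowsPath` keeps the path formatting Windows-style even if the helper is imported on another platform for testing.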


speech-gen

Using MOSS-TTS

python generate_moss.py --text "Your text" --output "out.wav" [--reference "ref.wav"]

Using OmniVoice

.\venv\Scripts\python.exe generate_omnivoice.py --text "Your text" --instruct "tags" --output "out.wav"

Using VoxCPM2

.\venv\Scripts\python.exe generate_voxcpm.py --text "Your text" --output "out.wav"

.\venv\Scripts\python.exe generate_voxcpm.py --text "Your text" --reference "ref.wav" --output "out.wav"

.\venv\Scripts\python.exe -m voxcpm.cli clone ^
  --text "Final target text" ^
  --prompt-audio "voice_sample.wav" ^
  --prompt-text "Exact transcript of voice_sample.wav" ^
  --reference-audio "voice_sample.wav" ^
  --output "out.wav"

$text = @'
Final target text
'@
$prompt = @'
Exact transcript of the sample audio
'@
& 'C:\path\to\venv\Scripts\python.exe' -m voxcpm.cli clone `
  --text $text `
  --prompt-audio 'C:\path\to\voice_sample.wav' `
  --prompt-text $prompt `
  --reference-audio 'C:\path\to\voice_sample.wav' `
  --output 'C:\path\to\out.wav'
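When the clone command is launched from a script rather than typed into a shell, passing the arguments as a list sidesteps the quoting and line-continuation differences between the Command Prompt and PowerShell forms above. The sketch below is one way to do that. The helper name `build_clone_argv` is hypothetical; the flags are only the ones shown in the commands above, and defaulting `reference_audio` to the prompt audio mirrors those commands rather than any documented requirement.

```python
def build_clone_argv(python_exe, text, prompt_audio, prompt_text, output,
                     reference_audio=None):
    """Build the argument list for the voxcpm.cli clone call (hypothetical helper)."""
    if not prompt_text.strip():
        # The default clone path needs the exact transcript of the sample.
        raise ValueError("prompt_text must be the exact transcript of the sample audio")
    return [
        python_exe, "-m", "voxcpm.cli", "clone",
        "--text", text,
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        # The commands above pass the same sample for both audio flags.
        "--reference-audio", reference_audio or prompt_audio,
        "--output", output,
    ]


# Usage (not executed here):
#   import subprocess
#   subprocess.run(build_clone_argv(...), check=True)
```

No tuning flags are added, which matches the guidance to keep a known-good clone command unchanged.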

$text = @'
(gentle wrap-up)Final target text
'@
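The parenthetical cue in the example above can also be attached programmatically. The following is a small sketch under stated assumptions: the helper name `with_cue` and the three-word limit are illustrative choices, not something the skill or VoxCPM2 defines.

```python
def with_cue(text, cue=None, max_words=3):
    """Prefix text with a short parenthetical cue, e.g. "(gentle wrap-up)".

    Hypothetical helper. max_words is an assumed limit that keeps cues
    short; tune or drop it as needed.
    """
    if not cue:
        return text
    # Accept the cue with or without its surrounding parentheses.
    cleaned = cue.strip().strip("()").strip()
    if not cleaned:
        return text
    if len(cleaned.split()) > max_words:
        raise ValueError("cue is too long; keep cues very short and sparse")
    return f"({cleaned}){text}"
```

Calling `with_cue("Final target text", "gentle wrap-up")` produces the same string as the here-string above. Passing no cue returns the text unchanged, which is the safer choice when a cue makes the output sound less like the speaker.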

.\venv\Scripts\python.exe generate_voxcpm_samples.py

VoxCPM2 Practical Guidance

The local wrapper generate_voxcpm.py only exposes plain TTS and basic --reference cloning.
For cloning, prefer the upstream CLI entrypoint: .\venv\Scripts\python.exe -m voxcpm.cli ...
Treat prompt-audio + prompt-text + reference-audio as the default clone path when the user can provide a transcript of the sample.
prompt-text should be the exact spoken words from the sample audio, not a paraphrase.
If the user wants cloning and does not provide a transcript, ask for it explicitly before continuing.
Only fall back to basic --reference cloning if the user explicitly approves that downgrade.
Avoid adding style instructions if the goal is to preserve the sample's original speaking style as closely as possible.
When reproducing a known-good clone command, do not add extra tuning flags unless the user explicitly asks for them.
For the default voxcpm.cli clone path, do not add --normalize, --control, --no-optimize, custom --cfg-value, or custom --inference-timesteps unless the user explicitly requests experimentation.
If the caller supplies a parenthetical cue at the start of the text, treat it as a micro-prosody nudge only.
Keep such cues very short and sparse so the model stays close to the cloned speaker identity.
Prefer cues about discourse position or transition, such as clear start, steady continuation, slight contrast, or gentle wrap-up, over strong mood or persona descriptions.
If a cue makes the output sound less like the speaker, remove it rather than strengthening it.
Prefer handling text normalization at the TTS stage rather than pushing engine-specific rewrites upstream into the scriptwriter skill.
If the narration contains digits, dates, abbreviations, passwords, or mixed-language tokens and the raw clone sounds garbled, test --normalize first so wetext can expand the text automatically.
Only manually rewrite the target text when wetext normalization is unavailable, insufficient, or the user explicitly wants hand-normalized phrasing.
Use the local low-VRAM defaults first when staying in the wrapper: optimize=False, load_denoiser=False, cfg_value=2.0, inference_timesteps=8.
Plain English and Polish TTS were verified to fit in VRAM at roughly 5247 MiB allocated and ~5.6-5.7 GiB reserved.
Start with plain TTS before cloning. Cloning is the next likely point of VRAM failure.
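The cloning and normalization rules above amount to a small decision procedure, sketched here in Python. All names (`choose_clone_path`, `may_need_normalize`, `LOW_VRAM_DEFAULTS`) are hypothetical, and the regular expression is a rough heuristic for "digits or abbreviations", not how wetext decides what to expand. The defaults dictionary only restates the values listed above; how the wrapper accepts them is not specified here.

```python
import re

# Low-VRAM wrapper defaults as listed in the guidance above.
LOW_VRAM_DEFAULTS = {
    "optimize": False,
    "load_denoiser": False,
    "cfg_value": 2.0,
    "inference_timesteps": 8,
}


def choose_clone_path(has_transcript, fallback_approved=False):
    """Pick the cloning route according to the guidance above."""
    if has_transcript:
        # Default path: prompt-audio + prompt-text + reference-audio.
        return "prompt_clone"
    if fallback_approved:
        # Basic --reference cloning, only after explicit user approval.
        return "basic_reference"
    return "ask_for_transcript"


# Assumed heuristic: any digit, or a run of two or more capital letters.
_RISKY_TOKENS = re.compile(r"\d|\b[A-Z]{2,}\b")


def may_need_normalize(text):
    """Return True if the text contains tokens worth testing with --normalize."""
    return bool(_RISKY_TOKENS.search(text))
```

A caller would try the raw clone first and reach for `--normalize` only when `may_need_normalize` is true and the output actually sounds garbled.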
