MiniMax Voice Maker

Professional text-to-speech skill with emotion detection, voice cloning, and audio processing capabilities powered by MiniMax Voice API and FFmpeg.

Capabilities

| Area | Features |
| --- | --- |
| TTS | Sync (HTTP/WebSocket), async (long text), streaming |
| Segment-based | Multi-voice, multi-emotion synthesis from segments.json, auto merge |
| Voice | Cloning (10s–5min), design (text prompt), management |
| Audio | Format conversion, merge, normalize, trim, remove silence (FFmpeg) |

File structure:

mmVoice_Maker/
├── SKILL.md                       # This overview
├── mmvoice.py                     # CLI tool (recommended for Agents)
├── check_environment.py           # Environment verification
├── requirements.txt
├── scripts/                       # Entry: scripts/__init__.py
│   ├── utils.py                   # Config, data classes
│   ├── sync_tts.py                # HTTP/WebSocket TTS
│   ├── async_tts.py               # Long text TTS
│   ├── segment_tts.py             # Segment-based TTS (multi-voice, multi-emotion)
│   ├── voice_clone.py             # Voice cloning
│   ├── voice_design.py            # Voice design
│   ├── voice_management.py        # List/delete voices
│   └── audio_processing.py        # FFmpeg audio tools
└── reference/                     # Load as needed
    ├── cli-guide.md               # CLI usage guide
    ├── getting-started.md         # Setup and quick test
    ├── tts-guide.md               # Sync/async TTS workflows
    ├── voice-guide.md             # Clone/design/manage
    ├── audio-guide.md             # Audio processing
    ├── script-examples.md         # Runnable code snippets
    ├── troubleshooting.md         # Common issues
    ├── api_documentation.md       # Complete API reference
    └── voice_catalog.md           # Voice selection guide


    [
      {
        "text": "The scientist explained, \"The results show significant improvement in all test groups.\"",
        "role": "narrator",
        "voice_id": "",
        "emotion": ""
      },
      {
        "text": "According to the latest report, the economy has grown by 3% this quarter.",
        "role": "narrator",
        "voice_id": "",
        "emotion": ""
      }
    ]

**Note:** In the preliminary `segments.json`:

- Fill in the `text` field with segment content
- Fill in the `role` field to identify the character (narrator, male_character, female_character, host, guest, etc.)
- Leave `voice_id` empty (to be filled in Step 2.2)
- Leave `emotion` empty for speech-2.8 models

**Step 2.2: Voice Selection**

After segmenting and labeling roles, analyze all detected characters in your text. Consult [voice_catalog.md](reference/voice_catalog.md) **Section 1 "How to Choose a Voice"** to match voices to characters.

**⚠️ CRITICAL: Follow the two-step selection process below**

**Path A: Professional domains (Story/Narration, News/Announcements, Documentary):**

If the content belongs to one of these three professional domains, prioritize selecting from the recommended voices in **voice_catalog.md Section 2.1** (filter by scenario + gender). These voices are specifically optimized for their professional use cases.

**Path B: All other scenarios:**

Select from **voice_catalog.md Section 2.2**, following this strict priority hierarchy:

1. **First: Match Gender** (non-negotiable): Male characters MUST use male voices, female characters MUST use female voices
2. **Second: Match Language**: The voice MUST match the content language (Chinese content → Chinese voice, Korean content → Korean voice, English content → English voice, etc.). Never assign a voice from the wrong language.
3. **Third: Match Age**: Determine the age group (Children / Youth / Adult / Elderly / Professional) and select from the corresponding subsection in Section 2.2
4. **Fourth: Match Personality & Role**: Choose the best fit based on personality traits, tone, and character role

**Voice Selection Decision Tree:**

**Step 2.3: Emotions Segmentation** *(For non-2.8 series models only)*

For models other than speech-2.8 series, analyze emotions in your segments:

- For **long segments**, split further based on **emotional transitions**
- Add appropriate **emotion tags** to each segment
- Refer to Section 3 in [text-processing.md](reference/text-processing.md) for emotion tags and examples
- Skip this step for speech-2.8 models (emotion is auto-matched)

**Emotion Tags:**

- For speech-2.6 series (speech-2.6-hd and speech-2.6-turbo): happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper
- For older models: happy, sad, angry, fearful, disgusted, surprised, calm (7 emotions)

**Step 2.4: Check and Post-processing**

Finally, review and optimize your script:

- Verify segment length limits (async TTS ≤1,000,000 characters)
- Clean up conversational text (remove speaker names if needed)
- Ensure consistency in voice and emotion tags
- **Critical check for multi-voice content**: For audiobooks, multi-voice fiction, or content where dialogue is presented from a first-person perspective, verify that narration and dialogue mixed in the same sentence are properly split.

**When splitting IS needed (first-person dialogue in fiction/audiobooks):**

Example: `"John asked, 'Where are you going?'"` should be split into:

- Segment 1: `"John asked, "` uses the narrator voice (describes who is speaking)
- Segment 2: `"Where are you going?"` uses the character's voice (actual dialogue in first-person)

This ensures proper voice differentiation: descriptive narration uses the narrator's voice, while the character's spoken words use the character's designated voice.

**When splitting is NOT needed (third-person quotes in podcast/documentary/news):**

In podcasts, documentaries, or news reports, quoted speech is typically presented in third-person narrative style: the speaker's words are being reported, not performed. Keep these as one segment with the narrator's voice and remove the speaker's name at the beginning:

- `"Welcome to our show."` → narrator voice, remove the speaker's name (like "The host said:") at the beginning
- `"According to experts, 'This technology represents a significant breakthrough.'"` → keep as one segment (narrator voice)
- `"Scientists noted, 'The experimental results exceeded our expectations.'"` → keep as one segment (narrator voice)

**If the split is missing**: Go back to Step 2.1 and ensure dialogue portions are separated from narration with appropriate role labels.

**Create segments.json:** After completing all 4 sub-steps, save the final `segments.json` to `<cwd>/audio/segments.json`.

### Step 2.5: Generate Preview for User Confirmation (Highly Recommended)

**For multi-voice content (audiobooks, dramas, etc.), always generate a preview first.** This saves time and prevents waste when voice selections need adjustment.

**How to generate a preview:**

1. Create a smaller segments file with 10-20 representative segments (include all characters)
2. Generate the preview audio
3. Ask user to listen and confirm voice choices

**Preview segments.json example:**

```json
[
  {"text": "Narration opening...", "role": "narrator", "voice_id": "...", "emotion": ""},
  {"text": "Male character speaks...", "role": "male_character", "voice_id": "...", "emotion": ""},
  {"text": "Female character speaks...", "role": "female_character", "voice_id": "...", "emotion": ""},
  {"text": "More dialogue...", "role": "...", "voice_id": "...", "emotion": ""}
]
```
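
Choosing the representative segments for a preview could be sketched as below. This is an illustration only, not part of `mmvoice.py`: the function name `build_preview` and the selection strategy (first segment of every role, then fill the remaining slots in document order) are assumptions. It expects the list-of-dicts layout of `segments.json` with a `role` key.

```python
def build_preview(segments, limit=20):
    """Pick up to `limit` segments so that every role appears at least once.

    Hypothetical helper: if there are more roles than `limit`, covering all
    roles takes priority and the result may exceed `limit`.
    """
    chosen = set()
    seen_roles = set()
    # Pass 1: keep the first segment of each role so all characters are heard.
    for i, seg in enumerate(segments):
        if seg["role"] not in seen_roles:
            seen_roles.add(seg["role"])
            chosen.add(i)
    # Pass 2: fill the remaining slots in document order.
    for i in range(len(segments)):
        if len(chosen) >= limit:
            break
        chosen.add(i)
    # Return the picks in their original order.
    return [segments[i] for i in sorted(chosen)]
```

The returned list can be written out with `json.dump` as a separate preview file, leaving the full `segments.json` untouched.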

| Character | Wrong Voice | Correct Voice |
| --- | --- | --- |
| 唐三藏 (male monk) | `female-yujie` ❌ | `Chinese (Mandarin)_Gentleman` ✅ |
| 林黛玉 (female) | `male-qn-badao` ❌ | `female-shaonv` ✅ |
| 曹操 (male warlord) | `female-chengshu` ❌ | `Chinese (Mandarin)_Unrestrained_Young_Man` ✅ |
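
The gender rule illustrated by this table could be checked mechanically before synthesis. The sketch below is hedged: the `VOICE_GENDER` lookup and the `gender_mismatches` helper are hypothetical, and the lookup holds only the voice IDs named in the table, with genders as the table implies them. A real check would draw on the full voice catalog.

```python
# Hypothetical lookup built from the voice IDs in the table above.
VOICE_GENDER = {
    "female-yujie": "female",
    "female-shaonv": "female",
    "female-chengshu": "female",
    "male-qn-badao": "male",
    "Chinese (Mandarin)_Gentleman": "male",
    "Chinese (Mandarin)_Unrestrained_Young_Man": "male",
}

def gender_mismatches(assignments):
    """Return (character, voice_id) pairs that break the gender rule.

    `assignments` is a list of (character, character_gender, voice_id) tuples.
    Voices missing from the lookup are reported too, because they cannot be
    verified.
    """
    problems = []
    for character, gender, voice_id in assignments:
        if VOICE_GENDER.get(voice_id) != gender:
            problems.append((character, voice_id))
    return problems
```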

| Scenario | Description | Segments | Voice Selection |
| --- | --- | --- | --- |
| Single Voice | User needs one voice for the entire content. Segment only by length (≤1,000,000 chars per segment). | Split by length only | One voice_id for all segments |
| Multi-Voice | Multiple characters/speakers, each with different voice. Segment by speaker/role changes. | Split by logical unit (speaker, dialogue, etc.) | Different voice_id per role |
| Podcast/Interview | Host and guest speakers with distinct voices. | Split by speaker | Voice per host/guest |
| Audiobook/Fiction | Narrator and character voices. | Split by narration vs. dialogue | Voice per narrator/character |
| Documentary | Mostly narration with occasional quotes. | Keep as one segment | Single narrator voice |
| Report/Announcement | Formal content with consistent tone. | Keep as one segment | Professional voice |
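
For the Single Voice scenario, splitting by length alone might look like the following sketch. The function `split_by_length` is hypothetical, and preferring a sentence terminator as the cut point is an assumption of this example rather than documented behavior of the skill. The default limit is the 1,000,000-character figure from the table.

```python
def split_by_length(text, max_chars=1_000_000):
    """Split `text` into chunks of at most `max_chars` characters.

    Hypothetical helper: cuts after the last sentence terminator inside each
    window when one exists, otherwise cuts hard at the limit. Whitespace is
    preserved, so joining the chunks reproduces the input.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look back for the last sentence terminator inside the window.
            cut = max(text.rfind(p, start, end) for p in ".!?。！？")
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks
```

Each chunk would then become one segment carrying the same `voice_id`.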

| Use case | Example | Split strategy |
| --- | --- | --- |
| Single Voice | Long article, news piece, announcement | Split by length (≤1,000,000 chars), same voice for all |
| Podcast/Interview | "Host: Welcome to the show. Guest: Thank you for having me." | Split by speaker |
| Documentary narration | "The scientist explained, 'The results are promising.'" | Keep as one segment (narrator voice) |
| Audiobook/Fiction | "'Who's there?' she whispered." | Split: "'Who's there?'" should be in character voice, while "she whispered." should be in narrator's voice |
| Report | "According to the report, the economy is growing." | Keep as one segment |
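
The Audiobook/Fiction row can be illustrated with a small splitter. This is a simplified sketch: `split_dialogue` is a hypothetical helper, and it recognizes only double-quoted speech (straight or curly), because single-quoted dialogue is ambiguous with apostrophes such as the one in "Who's". The output dictionaries follow the `segments.json` field layout.

```python
import re

# Matches a span in straight double quotes or in curly double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"|“([^”]+)”')

def split_dialogue(text, character_role):
    """Split a sentence into narrator and character segments.

    Hypothetical helper: quoted spans go to `character_role`, everything
    else goes to the narrator. voice_id and emotion are left empty.
    """
    def seg(body, role):
        return {"text": body, "role": role, "voice_id": "", "emotion": ""}

    segments = []
    pos = 0
    for m in QUOTE_RE.finditer(text):
        narration = text[pos:m.start()].strip()
        if narration:
            segments.append(seg(narration, "narrator"))
        spoken = m.group(1) or m.group(2)
        segments.append(seg(spoken.strip(), character_role))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(seg(tail, "narrator"))
    return segments
```

Per the table, this kind of split applies to fiction only; documentary and report quotes stay in a single narrator segment.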

| Model | Emotion Validation |
| --- | --- |
| speech-2.8-hd/turbo | Emotion can be empty (auto emotion matching) |
| speech-2.6-hd/turbo | All 9 emotions supported |
| Older models | happy, sad, angry, fearful, disgusted, surprised, calm (7 emotions) |
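
The rules in this table could be expressed as a small check. This is a sketch, not the actual validation performed by the skill's `validate` command. `check_emotion` is a hypothetical name, and two behaviors are assumptions because the table does not specify them: a non-empty tag on a speech-2.8 model is checked against the nine-tag set, and an empty tag on any other model is treated as invalid.

```python
# The seven tags listed for older models.
BASE_EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm"}
# The two extra tags listed for the speech-2.6 series.
EXTRA_EMOTIONS = {"fluent", "whisper"}

def check_emotion(model, emotion):
    """Return True if `emotion` is acceptable for `model` under the table's rules."""
    if model.startswith("speech-2.8"):
        # Empty is allowed (auto emotion matching). Checking non-empty tags
        # against the nine-tag set is an assumption of this sketch.
        return emotion == "" or emotion in BASE_EMOTIONS | EXTRA_EMOTIONS
    if model.startswith("speech-2.6"):
        return emotion in BASE_EMOTIONS | EXTRA_EMOTIONS
    # Any other model name is treated as an older model.
    return emotion in BASE_EMOTIONS
```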

| Document | Content for the Agent |
| --- | --- |
| reference/cli-guide.md | All CLI commands (`validate`, `generate`, `tts`, `clone`, `design`, `list-voices`, `merge`, `convert`, `check-env`) with options and examples. Use for correct CLI invocation. |
| reference/getting-started.md | Environment setup (venv, `pip install`, FFmpeg), `MINIMAX_VOICE_API_KEY`, basic synthesis test. Use for first-time setup or "env not working". |
| reference/tts-guide.md | Sync TTS (short text), async TTS (long text), streaming TTS, multi-segment production. Use for sync/async/streaming logic and parameters. |
| reference/voice-guide.md | Voice cloning (quick, high-quality with prompt audio, step-by-step), voice design, voice management. Use for custom voice creation flows. |
| reference/audio-guide.md | Format conversion, merging (including crossfade and fallback), normalization, trimming, optimization. Use for merge/convert/normalize behavior and options. |
| reference/script-examples.md | Copy-paste runnable examples for sync TTS, async TTS, segment-based TTS, audio processing, voice clone/design/management. Use for quick Python snippets. |
| reference/troubleshooting.md | Environment (API key, FFmpeg), API errors, segment-based TTS, audio, voice. Use when an error message or unexpected behavior appears. |
| reference/api_documentation.md | Full API reference: config, sync/async TTS, emotion parameter, segment-based TTS, voice clone/design/management, audio processing, common parameters, error handling. Use for exact function signatures and parameter details. |
| reference/voice_catalog.md | System voices list (male/female/beta), selection guide, voice parameters, custom voices, voice IDs. Use to choose or look up `voice_id`. |

Main Workflow Guideline (Text to Speech)

Step 1: Verify environment

Step 2: Decision and Pre-processing

Step 3: Present plan to user for confirmation

Step 4: Validate segments.json (model, emotion, voice_id validation)

Step 5: Generate and merge audio

Step 6: Confirm and cleanup

Other Usage

Voice creation (clone / design / list)

Text-to-speech (sync / async)

Audio processing (merge / convert / normalize)

Segment-based TTS (main workflow)

Reference documents (on-demand)

Important notes

Requirements

Limits and constraints

Special features

Troubleshooting
