MiniMax multimodal model skill — create voice, music, video, and images with MiniMax AI: TTS (text-to-speech, voice cloning, voice design, multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame, subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character reference), and media processing (convert, concat, trim, extract). Use when the user mentions MiniMax, multimodal generation, speech/music/video/image AI, MiniMax APIs, or FFmpeg workflows alongside MiniMax outputs.
Generate voice, music, video, and image content via MiniMax APIs — the unified entry for MiniMax multimodal use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.
All generated files MUST be saved to ~/.openclaw/openclaw-data/multimodal/outputs/ (for persistent storage). Every script call MUST include an explicit --output / -o argument pointing to this location. Never omit the output argument or rely on script defaults.
Rules:
- Ensure the output directory exists first: `mkdir -p ~/.openclaw/openclaw-data/multimodal/outputs/`
- Always pass an explicit output path, e.g. `--output ~/.openclaw/openclaw-data/multimodal/outputs/video.mp4`
- Do NOT `cd` into the skill directory to run scripts — run from the agent's working directory using the full script path
- Write temporary files to `~/.openclaw/openclaw-data/multimodal/outputs/tmp/` and clean up afterwards with `rm -rf ~/.openclaw/openclaw-data/multimodal/outputs/tmp`
- Install dependencies: `brew install ffmpeg jq` (macOS) or `apt install ffmpeg jq` (Linux)
bash scripts/check_environment.sh
No Python or pip required — all scripts are pure bash using curl, ffmpeg, jq, and xxd.
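As a quick sanity check before running anything, the tool requirement can be sketched as a small PATH lookup. This is illustrative only — `missing_tools` is not part of the skill's scripts, and `check_environment.sh` remains the authoritative check:

```shell
# Sketch: report which required CLI tools are absent from PATH.
# missing_tools is an illustrative helper, not part of the skill's scripts.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
# Usage: missing_tools curl ffmpeg jq xxd   # prints nothing when all are installed
```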
MiniMax provides two service endpoints for different regions. Set MINIMAX_API_HOST before running any script:
| Region | Platform URL | API Host Value |
|---|---|---|
| China Mainland | https://platform.minimaxi.com | https://api.minimaxi.com |
| Global | https://platform.minimax.io | https://api.minimax.io |
# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"
# or Global
export MINIMAX_API_HOST="https://api.minimax.io"
IMPORTANT — When API Host is missing:
Before running any script, check if MINIMAX_API_HOST is set in the environment. If it is NOT configured:
1. Ask the user which region they use: `https://api.minimaxi.com` (China Mainland) or `https://api.minimax.io` (Global).
2. Instruct them to run `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the global variant) in their terminal, or add it to their shell profile (~/.zshrc / ~/.bashrc) for persistence.

Set the MINIMAX_API_KEY environment variable before running any script:
export MINIMAX_API_KEY="your-api-key-here"
The key starts with sk-api- or sk-cp- and can be obtained from https://platform.minimaxi.com (China Mainland) or https://platform.minimax.io (Global).
IMPORTANT — When API Key is missing:
Before running any script, check if MINIMAX_API_KEY is set in the environment. If it is NOT configured:
1. Ask the user for their API key.
2. Instruct them to run `export MINIMAX_API_KEY="sk-..."` in their terminal, or add it to their shell profile (~/.zshrc / ~/.bashrc) for persistence.

IMPORTANT — Always respect the user's plan limits before generating content. If the user's quota is exhausted or insufficient, warn them before proceeding.
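The host and key checks can be sketched together as one pre-flight guard. `check_minimax_env` is an illustrative name, not a script in this skill — the skill's own scripts do their own validation:

```shell
# Sketch: fail fast when either required variable is unset, before calling any script.
# check_minimax_env is illustrative; the skill's scripts perform their own checks.
check_minimax_env() {
  [ -n "${MINIMAX_API_HOST:-}" ] || { echo "MINIMAX_API_HOST is not set" >&2; return 1; }
  [ -n "${MINIMAX_API_KEY:-}" ]  || { echo "MINIMAX_API_KEY is not set" >&2; return 1; }
}
```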
| Capability | Starter | Plus | Max |
|---|---|---|---|
| M2.7 (chat) | 600 req/5h | 1,500 req/5h | 4,500 req/5h |
| Speech 2.8 | — | 4,000 chars/day | 11,000 chars/day |
| image-01 | — | 50 images/day | 120 images/day |
| Hailuo-2.3-Fast 768P 6s | — | — | 2 videos/day |
| Hailuo-2.3 768P 6s | — | — | 2 videos/day |
| Music-2.5 | — | — | 4 songs/day (≤5 min each) |
| Capability | Plus-HS | Max-HS | Ultra-HS |
|---|---|---|---|
| M2.7-highspeed (chat) | 1,500 req/5h | 4,500 req/5h | 30,000 req/5h |
| Speech 2.8 | 9,000 chars/day | 19,000 chars/day | 50,000 chars/day |
| image-01 | 100 images/day | 200 images/day | 800 images/day |
| Hailuo-2.3-Fast 768P 6s | — | 3 videos/day | 5 videos/day |
| Hailuo-2.3 768P 6s | — | 3 videos/day | 5 videos/day |
| Music-2.5 | — | 7 songs/day (≤5 min each) | 15 songs/day (≤5 min each) |
Key quota constraints:
| Capability | Description | Entry point |
|---|---|---|
| TTS | Text-to-speech synthesis with multiple voices and emotions | scripts/tts/generate_voice.sh |
| Voice Cloning | Clone a voice from an audio sample (10s–5min) | scripts/tts/generate_voice.sh clone |
| Voice Design | Create a custom voice from a text description | scripts/tts/generate_voice.sh design |
| Music Generation | Generate songs with lyrics or instrumental tracks | scripts/music/generate_music.sh |
| Image Generation | Text-to-image, image-to-image with character reference | scripts/image/generate_image.sh |
| Video Generation | Text-to-video, image-to-video, subject reference, templates | scripts/video/generate_video.sh |
| Long Video | Multi-scene chained video with crossfade transitions | scripts/video/generate_long_video.sh |
| Media Tools | Audio/video format conversion, concatenation, trimming, extraction | scripts/media_tools.sh |
Entry point: scripts/tts/generate_voice.sh
| User intent | Approach |
|---|---|
| Single voice / no multi-character need | tts command — generate the entire text in one call |
| Multiple characters / narrator + dialogue | generate command with segments.json |
Default behavior: When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the tts command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline — just pass the full text to tts in one call.
Only use the multi-segment generate command when the content involves multiple characters or requires distinct voices.
bash scripts/tts/generate_voice.sh tts "Hello world" -o ~/.openclaw/openclaw-data/multimodal/outputs/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o ~/.openclaw/openclaw-data/multimodal/outputs/hello_cn.mp3
Complete workflow — follow ALL steps in order:
Use the generate command — it reads segments.json, generates audio for EACH segment via the TTS API, then merges them into a single output file with crossfade.

# Step 1: Write segments.json to ~/.openclaw/openclaw-data/multimodal/outputs/
# (use the Write tool to create ~/.openclaw/openclaw-data/multimodal/outputs/segments.json)
# Step 2: Generate audio from segments.json — this is the CRITICAL step
# It generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate ~/.openclaw/openclaw-data/multimodal/outputs/segments.json \
-o ~/.openclaw/openclaw-data/multimodal/outputs/output.mp3 --crossfade 200
Do NOT skip Step 2. Writing segments.json alone does nothing — you MUST run the generate command to actually produce audio.
# List all available voices
bash scripts/tts/generate_voice.sh list-voices
# Voice cloning (from audio sample, 10s–5min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice
# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator
bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o ~/.openclaw/openclaw-data/multimodal/outputs/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o ~/.openclaw/openclaw-data/multimodal/outputs/output.mp3
| Model | Notes |
|---|---|
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous gen, manual emotion |
| speech-2.6-turbo | Previous gen, faster |
Default crossfade between segments: 200ms (--crossfade 200).
[
{ "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
{ "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]
Leave emotion empty for speech-2.8 models (auto-matched from text).
When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.
Rule: Narration and dialogue are ALWAYS separate segments.
A sentence like "Tom said: The weather is great today!" must be split into two segments:
"Tom said:""The weather is great today!"Example — Audiobook with narrator + 2 characters:
[
{ "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
{ "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
{ "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]
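A quick structural check of segments.json can catch malformed files before the generate step. This is a sketch against the schema shown in the examples above; `validate_segments` is an illustrative helper, not part of the skill's scripts:

```shell
# Sketch: verify segments.json is a non-empty array whose entries all carry
# "text" and "voice_id" (schema as in the examples above).
# validate_segments is an illustrative helper, not part of the skill's scripts.
validate_segments() {
  jq -e 'type == "array" and length > 0 and all(.[]; has("text") and has("voice_id"))' "$1" >/dev/null
}
```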
Key principles:
"He said:" is narrator, the quoted content is the characterEntry point: scripts/music/generate_music.sh
| Scenario | Mode | Action |
|---|---|---|
| BGM for video / voice / podcast | Instrumental (default) | Use --instrumental directly, do NOT ask user |
| User explicitly asks to "create music" / "make a song" | Ask user first | Ask whether they want instrumental or with lyrics |
When adding background music to video or voice content, always default to instrumental mode (--instrumental). Do not ask the user — BGM should never have vocals competing with the main content.
When the user explicitly asks to create/generate music as the primary task, ask them whether they want an instrumental track or a song with lyrics.
# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
--instrumental \
--prompt "ambient electronic, atmospheric" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/ambient.mp3 --download
# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
--lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
--prompt "indie folk, melancholic" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/song.mp3 --download
# With style fields
bash scripts/music/generate_music.sh \
--lyrics "[verse]\nLyrics here" \
--genre "pop" --mood "upbeat" --tempo "fast" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/pop_track.mp3 --download
Default model: music-2.5
music-2.5 does not support --instrumental directly. When instrumental music is needed, the script automatically applies a workaround:
- sets the lyrics to `[intro] [outro]` (empty structural tags, no actual vocals)
- appends "pure music, no lyrics" to the prompt

This produces instrumental-style output without requiring manual intervention. You can always use --instrumental and the script handles the rest.
Entry point: scripts/image/generate_image.sh
Model: image-01 — photorealistic image generation from text prompts, with optional character reference for image-to-image.
| User intent | Mode |
|---|---|
| Generate image from text description (default) | t2i — text-to-image |
| Generate image with a character reference photo (keep same person) | i2i — image-to-image |
Default behavior: When the user asks to generate/create an image without mentioning a reference photo, use t2i mode (default). Only use i2i mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.
Do NOT always default to 1:1. Analyze the user's request and choose the most appropriate aspect ratio:
| User intent / context | Recommended ratio | Resolution |
|---|---|---|
| Avatar, icon, social media profile pic | 1:1 | 1024×1024 |
| Landscape, banner, desktop wallpaper | 16:9 | 1280×720 |
| Classic photo, traditional ratio | 4:3 | 1152×864 |
| Photography, magazine cover | 3:2 | 1248×832 |
| Portrait photo, poster | 2:3 | 832×1248 |
| Tall poster, book cover | 3:4 | 864×1152 |
| Phone wallpaper, social story/reel | 9:16 | 720×1280 |
| Panoramic, cinematic ultrawide | 21:9 | 1344×576 |
| Unspecified / ambiguous | 1:1 | 1024×1024 |
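The ratio choice can be sketched as a simple keyword lookup. This is illustrative only — real requests need fuller intent analysis, and `pick_ratio` is not part of the skill's scripts:

```shell
# Sketch: map a coarse intent keyword to an aspect ratio, following the table above.
# pick_ratio is an illustrative helper, not part of the skill's scripts.
pick_ratio() {
  case "$1" in
    avatar|icon|profile)        echo "1:1"  ;;
    landscape|banner)           echo "16:9" ;;
    photography|magazine)       echo "3:2"  ;;
    poster|portrait)            echo "2:3"  ;;
    phone|story|reel)           echo "9:16" ;;
    panoramic|cinematic)        echo "21:9" ;;
    *)                          echo "1:1"  ;;  # ambiguous -> default
  esac
}
# e.g. --aspect-ratio "$(pick_ratio banner)"
```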
| User intent | Count (-n) |
|---|---|
| Default / single image request | 1 (default) |
| User says "a few" / "several" | 3 |
| User asks for "variations" / "options" | 3–4 |
| User specifies an exact count | Use the specified number (1–9) |
# Basic text-to-image
bash scripts/image/generate_image.sh \
--prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
-o ~/.openclaw/openclaw-data/multimodal/outputs/cat.png
# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
--prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
--aspect-ratio 16:9 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/landscape.png
# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
--prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
--aspect-ratio 9:16 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/wallpaper.png
# Multiple variations
bash scripts/image/generate_image.sh \
--prompt "Abstract geometric art, vibrant colors" \
-n 3 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/art.png
# With prompt optimizer
bash scripts/image/generate_image.sh \
--prompt "A man standing on Venice Beach, 90s documentary style" \
--aspect-ratio 16:9 --prompt-optimizer \
-o ~/.openclaw/openclaw-data/multimodal/outputs/beach.png
# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
--prompt "Product photo of a luxury watch on marble surface" \
--width 1024 --height 768 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/watch.png
Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).
# Character reference — place same person in a new scene
bash scripts/image/generate_image.sh \
--mode i2i \
--prompt "A girl looking into the distance from a library window, warm afternoon light" \
--ref-image face.jpg \
--aspect-ratio 16:9 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/girl_library.png
# Multiple character variations
bash scripts/image/generate_image.sh \
--mode i2i \
--prompt "A woman in a red dress at a gala event, elegant, cinematic" \
--ref-image face.jpg -n 3 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/gala.png
| Ratio | Resolution | Best for |
|---|---|---|
| 1:1 | 1024×1024 | Default, avatars, icons, social media |
| 16:9 | 1280×720 | Landscape, banner, desktop wallpaper |
| 4:3 | 1152×864 | Classic photo, presentations |
| 3:2 | 1248×832 | Photography, magazine layout |
| 2:3 | 832×1248 | Portrait photo, poster |
| 3:4 | 864×1152 | Book cover, tall poster |
| 9:16 | 720×1280 | Phone wallpaper, social story/reel |
| 21:9 | 1344×576 | Ultra-wide panoramic, cinematic |
| Option | Description |
|---|---|
| --prompt TEXT | Image description, max 1500 chars (required) |
| --aspect-ratio RATIO | Aspect ratio (see table above). Infer from user context |
| --width PX / --height PX | Custom size, 512–2048, must be a multiple of 8, both required together. Overridden by --aspect-ratio if both set |
| -n N | Number of images to generate, 1–9 (default 1) |
| --seed N | Random seed for reproducibility. Same seed + same params → similar results |
| --prompt-optimizer | Enable automatic prompt optimization by the API |
| --ref-image FILE | Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB) |
| --no-download | Print image URLs instead of downloading files |
| --aigc-watermark | Add AIGC watermark to generated images |
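The custom width/height constraints (512–2048, multiple of 8, as listed above) can be checked up front. A sketch — `valid_dim` is an illustrative helper, not part of the skill's scripts:

```shell
# Sketch: validate a custom --width/--height value per the constraints above
# (512-2048 inclusive and a multiple of 8).
# valid_dim is an illustrative helper, not part of the skill's scripts.
valid_dim() {
  [ "$1" -ge 512 ] && [ "$1" -le 2048 ] && [ $(( $1 % 8 )) -eq 0 ]
}
```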
| User intent | Script to use |
|---|---|
| Default / no special request | scripts/video/generate_video.sh (single segment, 6s, 768P) |
| User explicitly asks for "long video", "multi-scene", "story", or duration > 10s | scripts/video/generate_long_video.sh (multi-segment) |
Default behavior: Always use single-segment generate_video.sh with duration 6s and resolution 768P unless the user explicitly asks for a long video or multi-scene video. Do NOT automatically split into multiple segments — a single 6s video is the standard output. Only use generate_long_video.sh when the user clearly needs multi-scene or longer content.
Entry point (single video): scripts/video/generate_video.sh
Entry point (long/multi-scene): scripts/video/generate_long_video.sh
Supported resolutions and durations by model:
| Model | Resolution | Duration |
|---|---|---|
| MiniMax-Hailuo-2.3 | 768P only | 6s or 10s |
| MiniMax-Hailuo-2.3-Fast | 768P only | 6s or 10s |
| MiniMax-Hailuo-02 | 512P, 768P (default) | 6s or 10s |
| T2V-01 / T2V-01-Director | 720P | 6s only |
| I2V-01 / I2V-01-Director / I2V-01-live | 720P | 6s only |
| S2V-01 (ref) | 720P | 6s only |
Key rules:
Before calling any video generation script, you MUST optimize the user's prompt by reading and applying references/video-prompt-guide.md. Never pass the user's raw description directly as --prompt.
Optimization steps:
Apply the Professional Formula: Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere
"A puppy in a park""A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"Add camera instructions using [指令] syntax: [推进], [拉远], [跟随], [固定], [左摇], etc.
Include aesthetic details: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)
Keep to 1-2 key actions for 6-10 second videos — do not overcrowd with events
For i2v mode (image-to-video): Focus prompt on movement and change only, since the image already establishes the visual. Do NOT re-describe what's in the image.
"A lake with mountains" (just repeating the image)"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"For multi-segment long videos: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.
# Text-to-video (default: 6s, 768P)
bash scripts/video/generate_video.sh \
--mode t2v \
--prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/puppy.mp4
# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
--mode i2v \
--prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
--first-frame photo.jpg \
--output ~/.openclaw/openclaw-data/multimodal/outputs/animated.mp4
# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
--mode sef \
--first-frame start.jpg --last-frame end.jpg \
--output ~/.openclaw/openclaw-data/multimodal/outputs/transition.mp4
# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
--mode ref \
--prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
--subject-image face.jpg \
--duration 6 \
--output ~/.openclaw/openclaw-data/multimodal/outputs/person.mp4
Multi-scene long videos chain segments together: the first segment generates via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. Default is 6 seconds per segment.
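The chaining step presumably amounts to grabbing each segment's final frame to seed the next i2v call; a plain-ffmpeg sketch of that operation (assumed behavior — the script's internals may differ, and `extract_last_frame` is an illustrative helper):

```shell
# Sketch: extract the final frame of a finished segment so it can seed the next
# segment's i2v call. extract_last_frame is illustrative, not a skill script.
extract_last_frame() {
  # -sseof -0.1 seeks to 0.1s before the end of the input; keep a single frame
  ffmpeg -y -sseof -0.1 -i "$1" -frames:v 1 -q:v 2 "$2"
}
```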
Workflow:
- Segment 1: generated via text-to-video from the first scene prompt
- Segment 2+: the previous segment's last frame becomes the first_frame_image; the prompt describes motion and change from that ending state

Prompt rules for each segment:
# Example: 3-segment story with optimized per-segment prompts (default: 6s/segment, 768P)
bash scripts/video/generate_long_video.sh \
--scenes \
"A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
"The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
"The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
--music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/long_video.mp4
# With custom settings
bash scripts/video/generate_long_video.sh \
--scenes "Scene 1 prompt" "Scene 2 prompt" \
--segment-duration 6 \
--resolution 768P \
--crossfade 0.5 \
--music-prompt "calm ambient background music" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/long_video.mp4
bash scripts/video/add_bgm.sh \
--video input.mp4 \
--generate-bgm --instrumental \
--music-prompt "soft piano background" \
--bgm-volume 0.3 \
--output ~/.openclaw/openclaw-data/multimodal/outputs/output_with_bgm.mp4
bash scripts/video/generate_template_video.sh \
--template-id 392753057216684038 \
--media photo.jpg \
--output ~/.openclaw/openclaw-data/multimodal/outputs/template_output.mp4
| Mode | Default Model | Default Duration | Default Resolution | Notes |
|---|---|---|---|---|
| t2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest text-to-video |
| i2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest image-to-video |
| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame |
| ref | S2V-01 | 6s | 720P | Subject reference, 6s only |
Entry point: scripts/media_tools.sh
Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via MiniMax API.
# Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)
bash scripts/media_tools.sh convert-video input.webm -o output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o output.mov
# With quality / resolution / fps options
bash scripts/media_tools.sh convert-video input.mp4 -o output.mp4 \
--crf 18 --preset medium --resolution 1920x1080 --fps 30
# Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)
bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
bash scripts/media_tools.sh convert-audio input.mp3 -o output.flac \
--bitrate 320k --sample-rate 48000 --channels 2
# Concatenate with crossfade transition (default 0.5s)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o merged.mp4
# Hard cut (no crossfade)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4 --crossfade 0
# Simple concatenation
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3
# With crossfade
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3 --crossfade 1
# Extract as mp3
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3
# Extract as wav with higher bitrate
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.wav --bitrate 320k
# Trim by start/end time (seconds)
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 5 --end 15
# Trim by start + duration
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 10 --duration 8
# Mix audio with existing video audio
bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4 \
--volume 0.3 --fade-in 2 --fade-out 3
# Replace original audio entirely
bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o output.mp4 \
--replace
bash scripts/media_tools.sh probe input.mp4
scripts/
├── check_environment.sh # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh # Audio/video conversion, concat, trim, extract
├── tts/
│ └── generate_voice.sh # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│ └── generate_music.sh # Music generation CLI
├── image/
│ └── generate_image.sh # Image generation CLI (2 modes: t2i, i2i)
└── video/
├── generate_video.sh # Video generation CLI (4 modes: t2v, i2v, sef, ref)
├── generate_long_video.sh # Multi-scene long video
├── generate_template_video.sh # Template-based video
└── add_bgm.sh # Background music overlay
Read these for detailed API parameters, voice catalogs, and prompt engineering: