MiniMax multimodal model skill — create voice, music, video, and images with MiniMax AI: TTS (text-to-speech, voice cloning, voice design, multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame, subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character reference), and media processing (convert, concat, trim, extract). Use when the user mentions MiniMax, multimodal generation, speech/music/video/image AI, MiniMax APIs, or FFmpeg workflows alongside MiniMax outputs.
Generate voice, music, video, and image content via MiniMax APIs — the unified entry for MiniMax multimodal use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.
All generated files MUST be saved to ~/.openclaw/openclaw-data/multimodal/outputs/ (for persistent storage). Every script call MUST include an explicit --output / -o argument pointing to this location. Never omit the output argument or rely on script defaults.
Rules:
- Ensure the output directory exists first: `mkdir -p ~/.openclaw/openclaw-data/multimodal/outputs/`
- Always pass an explicit output path, e.g. `--output ~/.openclaw/openclaw-data/multimodal/outputs/video.mp4`
- Do NOT `cd` into the skill directory to run scripts — run from the agent's working directory using the full script path
- Write temporary files to `~/.openclaw/openclaw-data/multimodal/outputs/tmp/` and clean up afterwards with `rm -rf ~/.openclaw/openclaw-data/multimodal/outputs/tmp`
- Install dependencies: `brew install ffmpeg jq` (macOS) or `apt install ffmpeg jq` (Linux)
bash scripts/check_environment.sh
No Python or pip required — all scripts are pure bash using curl, ffmpeg, jq, and xxd.
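As a quick sanity check before running anything, the tool requirement can be sketched as a small PATH lookup. This is illustrative only — `missing_tools` is not part of the skill's scripts, and `check_environment.sh` remains the authoritative check:

```shell
# Sketch: report which required CLI tools are absent from PATH.
# missing_tools is an illustrative helper, not part of the skill's scripts.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
# Usage: missing_tools curl ffmpeg jq xxd   # prints nothing when all are installed
```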
MiniMax provides two service endpoints for different regions. Set MINIMAX_API_HOST before running any script:
| Region | Platform URL | API Host Value |
|---|---|---|
| China Mainland | https://platform.minimaxi.com | https://api.minimaxi.com |
| Global | https://platform.minimax.io | https://api.minimax.io |
# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"
# or Global
export MINIMAX_API_HOST="https://api.minimax.io"
IMPORTANT — When API Host is missing:
Before running any script, check if MINIMAX_API_HOST is set in the environment. If it is NOT configured:
1. Ask the user which region they use: `https://api.minimaxi.com` (China Mainland) or `https://api.minimax.io` (Global).
2. Instruct them to run `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the global variant) in their terminal, or add it to their shell profile (~/.zshrc / ~/.bashrc) for persistence.

Set the MINIMAX_API_KEY environment variable before running any script:
export MINIMAX_API_KEY="your-api-key-here"
The key starts with sk-api- or sk-cp- and can be obtained from https://platform.minimaxi.com (China Mainland) or https://platform.minimax.io (Global).
IMPORTANT — When API Key is missing:
Before running any script, check if MINIMAX_API_KEY is set in the environment. If it is NOT configured:
1. Ask the user for their API key.
2. Instruct them to run `export MINIMAX_API_KEY="sk-..."` in their terminal, or add it to their shell profile (~/.zshrc / ~/.bashrc) for persistence.

IMPORTANT — Always respect the user's plan limits before generating content. If the user's quota is exhausted or insufficient, warn them before proceeding.
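The host and key checks can be sketched together as one pre-flight guard. `check_minimax_env` is an illustrative name, not a script in this skill — the skill's own scripts do their own validation:

```shell
# Sketch: fail fast when either required variable is unset, before calling any script.
# check_minimax_env is illustrative; the skill's scripts perform their own checks.
check_minimax_env() {
  [ -n "${MINIMAX_API_HOST:-}" ] || { echo "MINIMAX_API_HOST is not set" >&2; return 1; }
  [ -n "${MINIMAX_API_KEY:-}" ]  || { echo "MINIMAX_API_KEY is not set" >&2; return 1; }
}
```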
| Capability | Starter | Plus | Max |
|---|---|---|---|
| M2.7 (chat) | 600 req/5h | 1,500 req/5h | 4,500 req/5h |
| Speech 2.8 | — | 4,000 chars/day | 11,000 chars/day |
| image-01 | — | 50 images/day | 120 images/day |
| Hailuo-2.3-Fast 768P 6s | — | — | 2 videos/day |
| Hailuo-2.3 768P 6s | — | — | 2 videos/day |
| Music-2.5 | — | — | 4 songs/day (≤5 min each) |
| Capability | Plus-HS | Max-HS | Ultra-HS |
|---|---|---|---|
| M2.7-highspeed (chat) | 1,500 req/5h | 4,500 req/5h | 30,000 req/5h |
| Speech 2.8 | 9,000 chars/day | 19,000 chars/day | 50,000 chars/day |
| image-01 | 100 images/day | 200 images/day | 800 images/day |
| Hailuo-2.3-Fast 768P 6s | — | 3 videos/day | 5 videos/day |
| Hailuo-2.3 768P 6s | — | 3 videos/day | 5 videos/day |
| Music-2.5 | — | 7 songs/day (≤5 min each) | 15 songs/day (≤5 min each) |
Key quota constraints:
| Capability | Description | Entry point |
|---|---|---|
| TTS | Text-to-speech synthesis with multiple voices and emotions | scripts/tts/generate_voice.sh |
| Voice Cloning | Clone a voice from an audio sample (10s–5min) | scripts/tts/generate_voice.sh clone |
| Voice Design | Create a custom voice from a text description | scripts/tts/generate_voice.sh design |
| Music Generation | Generate songs with lyrics or instrumental tracks | scripts/music/generate_music.sh |
| Image Generation | Text-to-image, image-to-image with character reference | scripts/image/generate_image.sh |
| Video Generation | Text-to-video, image-to-video, subject reference, templates | scripts/video/generate_video.sh |
| Long Video | Multi-scene chained video with crossfade transitions | scripts/video/generate_long_video.sh |
| Media Tools | Audio/video format conversion, concatenation, trimming, extraction | scripts/media_tools.sh |
Entry point: scripts/tts/generate_voice.sh
| User intent | Approach |
|---|---|
| Single voice / no multi-character need | tts command — generate the entire text in one call |
| Multiple characters / narrator + dialogue | generate command with segments.json |
Default behavior: When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the tts command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline — just pass the full text to tts in one call.
Only use the multi-segment generate command when the content involves multiple characters or requires distinct voices.
bash scripts/tts/generate_voice.sh tts "Hello world" -o ~/.openclaw/openclaw-data/multimodal/outputs/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o ~/.openclaw/openclaw-data/multimodal/outputs/hello_cn.mp3
Complete workflow — follow ALL steps in order:
Use the generate command — it reads segments.json, generates audio for EACH segment via the TTS API, then merges them into a single output file with crossfade.

# Step 1: Write segments.json to ~/.openclaw/openclaw-data/multimodal/outputs/
# (use the Write tool to create ~/.openclaw/openclaw-data/multimodal/outputs/segments.json)
# Step 2: Generate audio from segments.json — this is the CRITICAL step
# It generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate ~/.openclaw/openclaw-data/multimodal/outputs/segments.json \
-o ~/.openclaw/openclaw-data/multimodal/outputs/output.mp3 --crossfade 200
Do NOT skip Step 2. Writing segments.json alone does nothing — you MUST run the generate command to actually produce audio.
# List all available voices
bash scripts/tts/generate_voice.sh list-voices
# Voice cloning (from audio sample, 10s–5min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice
# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator
bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o ~/.openclaw/openclaw-data/multimodal/outputs/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o ~/.openclaw/openclaw-data/multimodal/outputs/output.mp3
| Model | Notes |
|---|---|
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous gen, manual emotion |
| speech-2.6-turbo | Previous gen, faster |
Default crossfade between segments: 200ms (--crossfade 200).
[
{ "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
{ "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]
Leave emotion empty for speech-2.8 models (auto-matched from text).
When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.
Rule: Narration and dialogue are ALWAYS separate segments.
A sentence like "Tom said: The weather is great today!" must be split into two segments:
"Tom said:""The weather is great today!"Example — Audiobook with narrator + 2 characters:
[
{ "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
{ "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
{ "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]
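A quick structural check of segments.json can catch malformed files before the generate step. This is a sketch against the schema shown in the examples above; `validate_segments` is an illustrative helper, not part of the skill's scripts:

```shell
# Sketch: verify segments.json is a non-empty array whose entries all carry
# "text" and "voice_id" (schema as in the examples above).
# validate_segments is an illustrative helper, not part of the skill's scripts.
validate_segments() {
  jq -e 'type == "array" and length > 0 and all(.[]; has("text") and has("voice_id"))' "$1" >/dev/null
}
```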
Key principles:
"He said:" is narrator, the quoted content is the characterEntry point: scripts/music/generate_music.sh
| Scenario | Mode | Action |
|---|---|---|
| BGM for video / voice / podcast | Instrumental (default) | Use --instrumental directly, do NOT ask user |
| User explicitly asks to "create music" / "make a song" | Ask user first | Ask whether they want instrumental or with lyrics |
When adding background music to video or voice content, always default to instrumental mode (--instrumental). Do not ask the user — BGM should never have vocals competing with the main content.
When the user explicitly asks to create/generate music as the primary task, ask them whether they want an instrumental track or a song with lyrics.
# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
--instrumental \
--prompt "ambient electronic, atmospheric" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/ambient.mp3 --download
# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
--lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
--prompt "indie folk, melancholic" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/song.mp3 --download
# With style fields
bash scripts/music/generate_music.sh \
--lyrics "[verse]\nLyrics here" \
--genre "pop" --mood "upbeat" --tempo "fast" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/pop_track.mp3 --download
Default model: music-2.5
music-2.5 does not support --instrumental directly. When instrumental music is needed, the script automatically applies a workaround:
- sets the lyrics to `[intro] [outro]` (empty structural tags, no actual vocals)
- appends "pure music, no lyrics" to the prompt

This produces instrumental-style output without requiring manual intervention. You can always use --instrumental and the script handles the rest.
Entry point: scripts/image/generate_image.sh
Model: image-01 — photorealistic image generation from text prompts, with optional character reference for image-to-image.
| User intent | Mode |
|---|---|
| Generate image from text description (default) | t2i — text-to-image |
| Generate image with a character reference photo (keep same person) | i2i — image-to-image |
Default behavior: When the user asks to generate/create an image without mentioning a reference photo, use t2i mode (default). Only use i2i mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.
Do NOT always default to 1:1. Analyze the user's request and choose the most appropriate aspect ratio:
| User intent / context | Recommended ratio | Resolution |
|---|---|---|
| Avatar, icon, social media profile pic | 1:1 | 1024×1024 |
| Landscape, banner, desktop wallpaper | 16:9 | 1280×720 |
| Classic photo, traditional ratio | 4:3 | 1152×864 |
| Photography, magazine cover | 3:2 | 1248×832 |
| Portrait photo, poster | 2:3 | 832×1248 |
| Tall poster, book cover | 3:4 | 864×1152 |
| Phone wallpaper, social story/reel | 9:16 | 720×1280 |
| Panoramic, cinematic ultrawide | 21:9 | 1344×576 |
| Unspecified / ambiguous | 1:1 | 1024×1024 |
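The ratio choice can be sketched as a simple keyword lookup. This is illustrative only — real requests need fuller intent analysis, and `pick_ratio` is not part of the skill's scripts:

```shell
# Sketch: map a coarse intent keyword to an aspect ratio, following the table above.
# pick_ratio is an illustrative helper, not part of the skill's scripts.
pick_ratio() {
  case "$1" in
    avatar|icon|profile)        echo "1:1"  ;;
    landscape|banner)           echo "16:9" ;;
    photography|magazine)       echo "3:2"  ;;
    poster|portrait)            echo "2:3"  ;;
    phone|story|reel)           echo "9:16" ;;
    panoramic|cinematic)        echo "21:9" ;;
    *)                          echo "1:1"  ;;  # ambiguous -> default
  esac
}
# e.g. --aspect-ratio "$(pick_ratio banner)"
```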
| User intent | Count (-n) |
|---|---|
| Default / single image request | 1 (default) |
| User says "a few" / "several" | 3 |
| User asks for "variations" / "options" | 3–4 |
| User specifies an exact count | Use the specified number (1–9) |
# Basic text-to-image
bash scripts/image/generate_image.sh \
--prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
-o ~/.openclaw/openclaw-data/multimodal/outputs/cat.png
# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
--prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
--aspect-ratio 16:9 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/landscape.png
# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
--prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
--aspect-ratio 9:16 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/wallpaper.png
# Multiple variations
bash scripts/image/generate_image.sh \
--prompt "Abstract geometric art, vibrant colors" \
-n 3 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/art.png
# With prompt optimizer
bash scripts/image/generate_image.sh \
--prompt "A man standing on Venice Beach, 90s documentary style" \
--aspect-ratio 16:9 --prompt-optimizer \
-o ~/.openclaw/openclaw-data/multimodal/outputs/beach.png
# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
--prompt "Product photo of a luxury watch on marble surface" \
--width 1024 --height 768 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/watch.png
Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).
# Character reference — place same person in a new scene
bash scripts/image/generate_image.sh \
--mode i2i \
--prompt "A girl looking into the distance from a library window, warm afternoon light" \
--ref-image face.jpg \
--aspect-ratio 16:9 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/girl_library.png
# Multiple character variations
bash scripts/image/generate_image.sh \
--mode i2i \
--prompt "A woman in a red dress at a gala event, elegant, cinematic" \
--ref-image face.jpg -n 3 \
-o ~/.openclaw/openclaw-data/multimodal/outputs/gala.png
| Ratio | Resolution | Best for |
|---|---|---|
| 1:1 | 1024×1024 | Default, avatars, icons, social media |
| 16:9 | 1280×720 | Landscape, banner, desktop wallpaper |
| 4:3 | 1152×864 | Classic photo, presentations |
| 3:2 | 1248×832 | Photography, magazine layout |
| 2:3 | 832×1248 | Portrait photo, poster |
| 3:4 | 864×1152 | Book cover, tall poster |
| 9:16 | 720×1280 | Phone wallpaper, social story/reel |
| 21:9 | 1344×576 | Ultra-wide panoramic, cinematic |
| Option | Description |
|---|---|
| --prompt TEXT | Image description, max 1500 chars (required) |
| --aspect-ratio RATIO | Aspect ratio (see table above). Infer from user context |
| --width PX / --height PX | Custom size, 512–2048, must be a multiple of 8, both required together. Overridden by --aspect-ratio if both set |
| -n N | Number of images to generate, 1–9 (default 1) |
| --seed N | Random seed for reproducibility. Same seed + same params → similar results |
| --prompt-optimizer | Enable automatic prompt optimization by the API |
| --ref-image FILE | Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB) |
| --no-download | Print image URLs instead of downloading files |
| --aigc-watermark | Add AIGC watermark to generated images |
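The custom width/height constraints (512–2048, multiple of 8, as listed above) can be checked up front. A sketch — `valid_dim` is an illustrative helper, not part of the skill's scripts:

```shell
# Sketch: validate a custom --width/--height value per the constraints above
# (512-2048 inclusive and a multiple of 8).
# valid_dim is an illustrative helper, not part of the skill's scripts.
valid_dim() {
  [ "$1" -ge 512 ] && [ "$1" -le 2048 ] && [ $(( $1 % 8 )) -eq 0 ]
}
```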
| User intent | Script to use |
|---|---|
| Default / no special request | scripts/video/generate_video.sh (single segment, 6s, 768P) |
| User explicitly asks for "long video", "multi-scene", "story", or duration > 10s | scripts/video/generate_long_video.sh (multi-segment) |
Default behavior: Always use single-segment generate_video.sh with duration 6s and resolution 768P unless the user explicitly asks for a long video or multi-scene video. Do NOT automatically split into multiple segments — a single 6s video is the standard output. Only use generate_long_video.sh when the user clearly needs multi-scene or longer content.
Entry point (single video): scripts/video/generate_video.sh
Entry point (long/multi-scene): scripts/video/generate_long_video.sh
Supported resolutions and durations by model:
| Model | Resolution | Duration |
|---|---|---|
| MiniMax-Hailuo-2.3 | 768P only | 6s or 10s |
| MiniMax-Hailuo-2.3-Fast | 768P only | 6s or 10s |
| MiniMax-Hailuo-02 | 512P, 768P (default) | 6s or 10s |
| T2V-01 / T2V-01-Director | 720P | 6s only |
| I2V-01 / I2V-01-Director / I2V-01-live | 720P | 6s only |
| S2V-01 (ref) | 720P | 6s only |
Key rules:
Before calling any video generation script, you MUST optimize the user's prompt by reading and applying references/video-prompt-guide.md. Never pass the user's raw description directly as --prompt.
Optimization steps:
Apply the Professional Formula: Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere
"A puppy in a park""A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"Add camera instructions using [指令] syntax: [推进], [拉远], [跟随], [固定], [左摇], etc.
Include aesthetic details: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)
Keep to 1-2 key actions for 6-10 second videos — do not overcrowd with events
For i2v mode (image-to-video): Focus prompt on movement and change only, since the image already establishes the visual. Do NOT re-describe what's in the image.
"A lake with mountains" (just repeating the image)"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"For multi-segment long videos: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.
# Text-to-video (default: 6s, 768P)
bash scripts/video/generate_video.sh \
--mode t2v \
--prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/puppy.mp4
# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
--mode i2v \
--prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
--first-frame photo.jpg \
--output ~/.openclaw/openclaw-data/multimodal/outputs/animated.mp4
# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
--mode sef \
--first-frame start.jpg --last-frame end.jpg \
--output ~/.openclaw/openclaw-data/multimodal/outputs/transition.mp4
# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
--mode ref \
--prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
--subject-image face.jpg \
--duration 6 \
--output ~/.openclaw/openclaw-data/multimodal/outputs/person.mp4
Multi-scene long videos chain segments together: the first segment generates via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. Default is 6 seconds per segment.
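The chaining step presumably amounts to grabbing each segment's final frame to seed the next i2v call; a plain-ffmpeg sketch of that operation (assumed behavior — the script's internals may differ, and `extract_last_frame` is an illustrative helper):

```shell
# Sketch: extract the final frame of a finished segment so it can seed the next
# segment's i2v call. extract_last_frame is illustrative, not a skill script.
extract_last_frame() {
  # -sseof -0.1 seeks to 0.1s before the end of the input; keep a single frame
  ffmpeg -y -sseof -0.1 -i "$1" -frames:v 1 -q:v 2 "$2"
}
```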
Workflow:
- Segment 1: generated via text-to-video from the first scene prompt
- Segment 2+: the previous segment's last frame becomes the first_frame_image; the prompt describes motion and change from that ending state

Prompt rules for each segment:
# Example: 3-segment story with optimized per-segment prompts (default: 6s/segment, 768P)
bash scripts/video/generate_long_video.sh \
--scenes \
"A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
"The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
"The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
--music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/long_video.mp4
# With custom settings
bash scripts/video/generate_long_video.sh \
--scenes "Scene 1 prompt" "Scene 2 prompt" \
--segment-duration 6 \
--resolution 768P \
--crossfade 0.5 \
--music-prompt "calm ambient background music" \
--output ~/.openclaw/openclaw-data/multimodal/outputs/long_video.mp4
bash scripts/video/add_bgm.sh \
--video input.mp4 \
--generate-bgm --instrumental \
--music-prompt "soft piano background" \
--bgm-volume 0.3 \
--output ~/.openclaw/openclaw-data/multimodal/outputs/output_with_bgm.mp4
bash scripts/video/generate_template_video.sh \
--template-id 392753057216684038 \
--media photo.jpg \
--output ~/.openclaw/openclaw-data/multimodal/outputs/template_output.mp4
| Mode | Default Model | Default Duration | Default Resolution | Notes |
|---|---|---|---|---|
| t2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest text-to-video |
| i2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest image-to-video |
| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame |
| ref | S2V-01 | 6s | 720P | Subject reference, 6s only |
Entry point: scripts/media_tools.sh
Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via MiniMax API.
# Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)
bash scripts/media_tools.sh convert-video input.webm -o output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o output.mov
# With quality / resolution / fps options
bash scripts/media_tools.sh convert-video input.mp4 -o output.mp4 \
--crf 18 --preset medium --resolution 1920x1080 --fps 30
# Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)
bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
bash scripts/media_tools.sh convert-audio input.mp3 -o output.flac \
--bitrate 320k --sample-rate 48000 --channels 2
# Concatenate with crossfade transition (default 0.5s)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o merged.mp4
# Hard cut (no crossfade)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4 --crossfade 0
# Simple concatenation
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3
# With crossfade
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3 --crossfade 1
# Extract as mp3
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3
# Extract as wav with higher bitrate
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.wav --bitrate 320k
# Trim by start/end time (seconds)
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 5 --end 15
# Trim by start + duration
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 10 --duration 8
# Mix audio with existing video audio
bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4 \
--volume 0.3 --fade-in 2 --fade-out 3
# Replace original audio entirely
bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o output.mp4 \
--replace
bash scripts/media_tools.sh probe input.mp4
scripts/
├── check_environment.sh # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh # Audio/video conversion, concat, trim, extract
├── tts/
│ └── generate_voice.sh # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│ └── generate_music.sh # Music generation CLI
├── image/
│ └── generate_image.sh # Image generation CLI (2 modes: t2i, i2i)
└── video/
├── generate_video.sh # Video generation CLI (4 modes: t2v, i2v, sef, ref)
├── generate_long_video.sh # Multi-scene long video
├── generate_template_video.sh # Template-based video
└── add_bgm.sh # Background music overlay
Read these for detailed API parameters, voice catalogs, and prompt engineering: