MUST read this skill BEFORE entering generate mode for music tasks. Covers prompt crafting framework, structure syntax, and multi-clip strategy.
When crafting music generation prompts, adopt the mindset of a world-class music arranger and producer. Think holistically about the entire piece — its emotional arc, sonic palette, and structural flow — before writing the prompt. Your goal is to translate the user's vision into a single, cohesive musical blueprint that the model can execute in one take whenever possible.
Each call to the music generation tool produces a single audio file with a maximum duration of ~184 seconds (approx. 3 minutes).
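The decision to use single-call vs multi-clip is based SOLELY on duration: a request that fits within the ~184-second limit uses a single call, and a request that exceeds it uses the multi-clip strategy.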
Good prompts are descriptive and clear. The prompt must be a single, continuous text string. Construct your prompt by combining the following 9 dimensions. Be descriptive and specific, using adjectives and adverbs to paint a clear sonic picture.
electronic dance, classical, jazz, ambient, 8-bit, cinematic, lo-fi
fast tempo, slow ballad, 120 BPM, driving beat, syncopated rhythm, gentle waltz
in D minor key, in the key of C major
energetic, melancholy, peaceful, tense
piano, synthesizer, acoustic guitar, string orchestra, electronic drums
sparse arrangement, dense layers, warm dark tones, bright crisp tones
starts with a solo piano, then strings enter, crescendo into a powerful chorus. For more precise control, you can optionally use timestamp cues [mm:ss - mm:ss] and Intensity parameters Intensity: X/10 (Level); see the Detailed Structure Example below.
starts with a solo piano, then strings enter at the halfway point, solo piano from 0-8s, strings enter at 8s, drums join at 16s
rain falling, city nightlife, underwater feel, large hall reverb, tight room reverb, wide stereo image, intimate close-mic feel, distant and far away, as if playing in the next room
high-quality production, clean mix, vintage recording, raw demo feel, studio dry sound, live concert hall recording, outdoor open-air feel
Note on Vocals: The model supports vocal generation for songs. If the user explicitly wants background music without vocals, you MUST append "Instrumental only, no vocals" to the prompt.
To ensure the model accurately follows your instructions, especially regarding duration, always structure your prompt in this specific order:
Instrumental only, no vocals. Create a 60-second track at 80 BPM.
The feeling is nostalgic, introspective, and atmospheric. The sound should be centered around a warm Fender Rhodes...
[0:00 - 0:12] Intro... [0:48 - 1:00] Outro...
When the user requests a specific structure or duration, use the Arrangement/Structure dimension to write a detailed script using timestamps and intensity markers.
Example:
Instrumental only, no vocals. Create a 60-second track at 80 BPM. The feeling is nostalgic, introspective, and atmospheric - a warm, comforting melancholy with a soft, minor-key feel. The sound should be centered around a warm, slightly overdriven Fender Rhodes and soft, ethereal synth pads. The rhythm is a minimalist, laid-back drum beat with a relaxed, human feel. Weave subtle atmospheric textures, like soft static or room tone, through the entire track for texture.
[0:00 - 0:12] Intro: Begin atmospherically with just the Fender Rhodes playing soft, hazy chords. Drench it in warm reverb and introduce a light atmospheric texture. The mood is like a memory coming into focus. Intensity: 1/10 (Very Low)
[0:12 - 0:24] Verse 1: The laid-back drum beat enters with a simple kick and snare. A soft, ethereal synth pad swells in the background. A clean, subtle sub-bass joins, adding depth. The Rhodes melody becomes slightly more defined, following a simple, melancholic progression. Intensity: 3/10 (Low)
[0:24 - 0:36] Build: The groove deepens as a gentle, syncopated hi-hat is added. A simple, memorable lead melody appears, played on a warm, rounded synth. This section should feel like the gentle peak of the track's focus, with a chord progression that builds a sense of hopeful tension. Intensity: 5/10 (Medium)
[0:36 - 0:48] Chorus: Gracefully pull back the intensity. The synth lead melody fades out, returning focus to the core Rhodes groove and the drums. This gives the track space to breathe, resolving the tension from the build. Intensity: 4/10 (Medium-Low)
[0:48 - 1:00] Outro: The drums and bass drop out completely. The track fades out leaving only the Rhodes playing spacious chords, the lingering synth pad, and the persistent atmospheric texture. Intensity: 2/10 (Very Low)
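The ordering shown in the example (constraints and duration first, then the overall description, then the timestamped script) can be assembled mechanically. The following is a minimal sketch, not part of the music generation tool: `Section`, `mmss`, and `build_prompt` are hypothetical names introduced here for illustration, and the intensity label is passed in by the caller rather than derived from the number.

```python
from dataclasses import dataclass


@dataclass
class Section:
    start: int      # section start, in seconds
    end: int        # section end, in seconds
    name: str       # e.g. "Intro"
    direction: str  # what should happen musically in this section
    intensity: int  # 1-10, used in the "Intensity: X/10" marker
    level: str      # human-readable label, e.g. "Very Low"


def mmss(seconds: int) -> str:
    """Format seconds as m:ss, matching the timestamp cues in the example."""
    return f"{seconds // 60}:{seconds % 60:02d}"


def build_prompt(constraints: str, overview: str, sections: list[Section]) -> str:
    """Join constraints, overview, and timestamped sections, in that order."""
    cues = [
        f"[{mmss(s.start)} - {mmss(s.end)}] {s.name}: {s.direction} "
        f"Intensity: {s.intensity}/10 ({s.level})"
        for s in sections
    ]
    # The prompt must be a single continuous string, so join with spaces.
    return " ".join([constraints, overview, *cues])
```

Keeping the sections as data makes it easy to check that the cues are contiguous and end at the requested duration before the prompt is sent.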
To exclude elements from the music, state them directly in your main prompt using explicit negative phrasing.
Instead of a separate negative prompt such as negative_prompt: "drums, fast tempo", write "Ensure there are no drums or percussion. Avoid fast tempos." or "A drumless, percussion-free ambient track..."
Categories of elements commonly excluded:
no drums, no percussion, no vocals
no complex melodies, no sudden dynamic changes, no fast runs
avoid dark mood, avoid aggressive energy
To emulate specific parameter controls, use these prompt translations:
sparse arrangement, minimal layers, lots of space between notes
dense, busy arrangement with many overlapping layers and fills
bright, crisp tones, emphasizing high frequencies and presence
Add "Ensure there are no drums, no percussion, no beat, no rhythm section" to the prompt
Only bass and drums, rhythm section only. No melody, no chords, no harmony, no piano, no guitar, no strings, no synth pads.
When the user's request exceeds the single-call limit (~184 seconds), generate multiple independent clips and concatenate them. Think like a professional arranger to make them sound like a cohesive song: plan the entire song structure first, then write prompts for each clip.
Plan the song's progression (e.g., Intro → Verse → Chorus → Bridge → Outro). Determine which musical elements define the song's core identity (DNA) and which elements drive the narrative forward. Divide the total duration into logical chunks of up to ~180s each.
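The duration rule and the chunking step can be sketched as follows. This is an illustration under stated assumptions: `plan_clips` and the two constants are hypothetical names, and the equal-length split is a simplification. In practice the boundaries should follow the song's sections rather than an even division.

```python
import math

SINGLE_CALL_LIMIT = 184  # approximate per-call maximum, in seconds
MAX_CHUNK = 180          # planning ceiling for each clip in multi-clip mode


def plan_clips(total_seconds: float) -> list[float]:
    """Return one duration if a single call suffices, else equal chunks <= MAX_CHUNK."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if total_seconds <= SINGLE_CALL_LIMIT:
        return [total_seconds]  # single-call mode
    count = math.ceil(total_seconds / MAX_CHUNK)
    return [total_seconds / count] * count  # multi-clip mode
```

For a 300-second request this yields two clips of 150 seconds each. The sketch ignores the small overlap consumed by crossfades when the clips are later joined.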
Category A: Always Lock (identical across all clips). These elements are the song's DNA. Changing them will cause jarring transitions.
Category B: Default Lock, Intentional Vary. These elements are usually locked, but can be changed if the arrangement plan specifically calls for it.
Category C: Should Vary (different across clips). These elements drive the song's narrative.
The core principle is Precise State Alignment: use timestamp cues to ensure the musical state at the end of one clip exactly matches the musical state at the beginning of the next clip. This means matching instrumentation, energy level, and dynamic intensity.
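One way to check a multi-clip plan against this principle before writing the prompts is to record the intended state at each clip's first and last timestamp cue and compare neighbours. The sketch below is hypothetical: `State`, `ClipPlan`, and `boundary_mismatches` are names introduced here, and the state is reduced to instrumentation plus a 1-10 intensity as a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    instruments: frozenset[str]  # instruments sounding at the boundary
    intensity: int               # 1-10, as in the Intensity: X/10 cues


@dataclass(frozen=True)
class ClipPlan:
    name: str
    start: State  # planned state at the clip's first timestamp cue
    end: State    # planned state at the clip's last timestamp cue


def boundary_mismatches(clips: list[ClipPlan]) -> list[str]:
    """List every boundary where a clip's end state differs from the next start."""
    problems = []
    for prev, nxt in zip(clips, clips[1:]):
        if prev.end != nxt.start:
            problems.append(f"{prev.name} -> {nxt.name}")
    return problems
```

An empty result means every clip is planned to end in the same state in which the next one begins. It says nothing about whether the generated audio actually follows the plan.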
Avoiding Instrumentation & Loudness Jumps:
intimate, gentle arrangement vs full, powerful arrangement).
Prompt design rules for each clip type:
Use ffmpeg via the shell tool to apply crossfades between the clips. With precise state alignment via timestamps, shorter crossfades (0.5-1s) are usually sufficient. Calculate the crossfade duration to align with the beat grid based on the BPM (e.g., at 120 BPM, 1 beat = 0.5s, so a 2-beat crossfade = 1.0s).
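The beat arithmetic above is 60 / BPM seconds per beat. A minimal sketch of deriving the crossfade length and building the corresponding command follows. It assumes ffmpeg's `acrossfade` audio filter with triangular (linear) fade curves. The function names and file names are hypothetical, and the curve choice is an assumption, not something the tool requires.

```python
import shlex


def crossfade_seconds(bpm: float, beats: int = 2) -> float:
    """Length of `beats` beats at `bpm`, in seconds (60 / bpm per beat)."""
    return beats * 60.0 / bpm


def crossfade_command(first: str, second: str, out: str,
                      bpm: float, beats: int = 2) -> str:
    """Build an ffmpeg command that joins two clips with a beat-aligned crossfade."""
    d = crossfade_seconds(bpm, beats)
    args = [
        "ffmpeg", "-i", first, "-i", second,
        # acrossfade overlaps the end of the first input with the start of the second
        "-filter_complex", f"acrossfade=d={d:.3f}:c1=tri:c2=tri",
        out,
    ]
    return shlex.join(args)  # shell-safe string for the shell tool
```

At 120 BPM with the default two beats, `crossfade_seconds` returns 1.0, matching the example in the text. For more than two clips, the same command can be applied pairwise, feeding each output into the next join.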