Just drag your footage and drop in a track — this skill walks you through adding music to a video step by step, whether you're layering a background score, syncing beats to cuts, or fading audio in and out at precise timestamps. It handles common scenarios like muting original audio, mixing multiple tracks, and matching music duration to video length. Built for creators who want clean audio-video sync without fumbling through a timeline editor. Works with MP4, MOV, and most standard formats you'd actually use.
Send me your video file and audio track and I'll guide you through adding music to your video with precise sync and fade settings. No files yet? Just describe your footage length, format, and what kind of music you're adding.
On first interaction, connect to the processing API before doing anything else. Show a brief status like "Setting things up...".
Token: If the NEMO_TOKEN environment variable is already set, use it and skip token generation — go straight to session creation.
Free token: Generate a UUID as client identifier, then POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id: <uuid>. The response field data.token becomes your NEMO_TOKEN (100 credits, 7-day expiry).
Session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Save session_id from the response.
Confirm to the user you're connected and ready. Don't print tokens or raw JSON.
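The token flow above can be sketched as follows. Endpoint, header, and response field names are taken from this document; error handling is deliberately minimal and the real client may differ:

```python
# Sketch of the setup flow: reuse NEMO_TOKEN if present, otherwise
# request a free anonymous token with a fresh client UUID.
import os
import uuid
import json
import urllib.request

API_BASE = "https://mega-api-prod.nemovideo.ai"

def get_token():
    """Return NEMO_TOKEN from the environment, or request a free one."""
    token = os.environ.get("NEMO_TOKEN")
    if token:
        return token
    client_id = str(uuid.uuid4())  # client identifier for X-Client-Id
    req = urllib.request.Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["data"]["token"]  # 100 credits, 7-day expiry
```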
Adding music to a video sounds simple until you're staring at a timeline trying to figure out why your audio cuts off three seconds early or why the beat drop hits during a title card instead of the action shot. This skill takes the guesswork out of that process.
You describe what you want — a mellow background track from the 30-second mark, a fade-out in the last five seconds, or a full audio replacement — and get back clear, executable instructions or direct output depending on your setup. It's built around real editing decisions: trim points, volume curves, crossfades, and loop handling when your music is shorter than your clip.
Whether you're putting together a travel reel, a product walkthrough, a wedding highlight, or a short film, the approach stays the same: your footage, your chosen track, synced the way you actually intended. No guessing at keyframes, no hunting through menus — just a clean audio layer that fits your video like it was always supposed to be there.
When you submit a music-to-video request, ClawHub parses your source footage and audio track metadata to route the job to the appropriate sync engine based on file format, duration, and beat-matching requirements.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
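As a rough sketch, the routing table above can be expressed as a keyword matcher. The keywords and section labels mirror the table; the substring-matching strategy is an assumption, and the "user sends file" trigger is omitted since it isn't text-based:

```python
# Minimal keyword router mirroring the table above. Anything unmatched
# falls through to the SSE path (§3.1).
ROUTES = [
    (("export", "导出", "download", "send me the video"), "§3.5 Export", True),
    (("credits", "积分", "balance", "余额"), "§3.3 Credits", True),
    (("status", "状态", "show tracks"), "§3.4 State", True),
    (("upload", "上传"), "§3.2 Upload", True),
]

def route(message):
    """Return (section, skip_sse) for a user message."""
    text = message.lower()
    for keywords, section, skip_sse in ROUTES:
        if any(k in text for k in keywords):
            return section, skip_sse
    return "§3.1 SSE", False  # everything else goes through SSE
```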
ClawHub's cloud processing backend handles audio-to-video synchronization by analyzing BPM, waveform peaks, and keyframe timestamps to align your music track with precision cuts. Rendered output is encoded server-side and returned as a merged container file, preserving your original video codec while embedding the mixed audio stream.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: how-to-add-music-to-video
- X-Skill-Version: from frontmatter `version`
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers cause export to fail with 402.
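A hypothetical helper that assembles these headers. The header names and platform-detection rule come from this document; the frontmatter version is passed in rather than parsed here:

```python
# Build the required attribution headers from a token, a frontmatter
# version string, and the skill's install path.
from pathlib import Path

def build_headers(token, version, install_path):
    path = str(Path(install_path))
    if "/.clawhub/" in path:
        platform = "clawhub"
    elif "/.cursor/skills/" in path:
        platform = "cursor"
    else:
        platform = "unknown"
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "how-to-add-music-to-video",
        "X-Skill-Version": version,
        "X-Skill-Platform": platform,
    }
```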
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid> — file: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple — returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
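The export endpoint's start-then-poll pattern can be sketched like this. The paths and payload shape come from the list above; the `post`/`get` callables are stand-ins for real HTTP calls so the control flow stays visible:

```python
# Sketch of the export flow (§3.5): start a render, then poll every
# 30 s until status == "completed", and return the download URL.
import time

def export_video(session_id, draft, post, get, poll_interval=30):
    """post/get are injected HTTP callables returning parsed JSON."""
    render_id = f"render_{int(time.time())}"
    post("/api/render/proxy/lambda", {
        "id": render_id,
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    })
    while True:
        status = get(f"/api/render/proxy/lambda/{render_id}")
        if status["status"] == "completed":
            return status["output"]["url"]  # download link
        time.sleep(poll_interval)
```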
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
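An illustrative event loop covering the table and the no-text fallback above. The event shape (`{"type": ..., "text": ...}`) is an assumption, not the actual SSE payload format:

```python
# Consume SSE events: forward text, swallow tool events, emit a
# progress note every 2 minutes, and signal a state-poll fallback
# when the stream closes without any text.
import time

def consume_stream(events, now=time.monotonic):
    """Yield user-facing messages from an iterable of SSE events."""
    saw_text = False
    last_note = now()
    for event in events:
        kind = event.get("type")
        if kind == "text":
            saw_text = True
            yield event["text"]            # present to user (after §4 translation)
        elif kind in ("tool_call", "tool_result"):
            continue                        # process internally, don't forward
        elif now() - last_note >= 120:      # heartbeat / empty data
            last_note = now()
            yield "⏳ Still working..."
    if not saw_text:
        yield "CHECK_STATE"                 # caller should poll session state
```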
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)
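A sketch of turning the compact draft fields into a track summary of this shape. Only the documented fields (t, tt, sg, d) are used; the exact nesting of segments inside tracks is inferred and may differ:

```python
# Summarize a draft's tracks using the field mapping: t=tracks,
# tt=track type, sg=segments, d=duration in ms.
TRACK_TYPES = {0: "Video", 1: "Audio", 7: "Text"}

def summarize_draft(draft):
    lines = []
    for i, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "Unknown")
        total_ms = sum(seg.get("d", 0) for seg in track.get("sg", []))
        lines.append(f"{i}. {kind}: {total_ms / 1000:g}s")
    return lines
```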
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up credits in your account" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
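The error table above can be condensed into a dispatcher. Messages are summarized from the table; the action tags are hypothetical names, and the actual recovery steps (re-auth, retry) are left to the caller:

```python
# Map API error codes to an (action, message) pair per the table.
ERROR_ACTIONS = {
    0: ("ok", "Success — continue"),
    1001: ("reauth", "Bad/expired token — request a new anonymous token"),
    1002: ("new_session", "Session not found — create a new session (§3.0)"),
    2001: ("credits", "No credits — show registration URL or top-up prompt"),
    4001: ("formats", "Unsupported file — show supported formats"),
    4002: ("shrink", "File too large — suggest compressing or trimming"),
    400: ("client_id", "Missing X-Client-Id — generate one and retry (§1)"),
    402: ("upgrade", "Export blocked on free plan — register or upgrade"),
    429: ("retry", "Rate limited — retry once in 30 s"),
}

def handle_error(code):
    return ERROR_ACTIONS.get(code, ("unknown", f"Unhandled error code {code}"))
```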
Always work with a copy of your original video file before adding or replacing audio. It takes thirty seconds and saves you from re-exporting if something goes wrong with the mix.
For background music, keep the volume between 10% and 25% of the original audio level if dialogue or natural sound is present. Letting music compete with speech is one of the most common amateur mistakes in video production — it makes everything harder to follow and cheapens the result regardless of how good the footage is.
When choosing where to cut your music track, listen for a natural downbeat or the end of a musical phrase rather than cutting at an arbitrary timestamp. A song that ends on a resolved chord or a natural pause feels intentional; one that cuts mid-melody feels broken.
Finally, export your finished video and play it back on a different device — ideally one with smaller speakers like a phone. What sounds balanced on studio monitors or headphones often has too much bass or too little presence on mobile, which is where most people will actually watch your content.
The most common reason people look up how to add music to video is a simple one: they have footage that feels flat without sound. That covers a wide range of actual projects — social media reels where silence kills engagement, YouTube intros that need an energy boost, corporate walkthroughs where background music sets a professional tone, or personal projects like slideshows and travel montages.
Beyond basic background music, this skill handles more specific scenarios: replacing original audio entirely when on-location sound is unusable, syncing a specific beat drop to a visual cut, adding multiple tracks across different segments of a longer video, or trimming a song to fit a strict time limit without it sounding chopped.
It's also useful when you're working with platform-specific constraints — like keeping a video under 60 seconds for Instagram Reels while still fitting a full musical phrase, or ensuring your audio doesn't get flagged by using properly timed royalty-free tracks. Whatever the context, the goal is the same: audio that feels intentional, not accidental.
Start by knowing three things about your project: the length of your video, the format (MP4, MOV, AVI, etc.), and whether you want to keep the original audio or replace it entirely. These details determine every step that follows.
Next, pick your music track and note its duration. If the track is shorter than your video, you'll need to decide whether to loop it, use a different segment of the track, or let the video run silent after the music ends. If it's longer, you'll trim it — and you want to trim at a natural pause or phrase end, not mid-beat.
Once you have both files ready, describe your sync points: where should the music start, should it fade in, and how should it end? A hard cut at the last frame or a gradual fade-out over 3-5 seconds are the two most common choices. Feed that information into this skill and you'll get a precise, step-by-step plan — including the exact tool commands or timeline adjustments needed to execute it cleanly.
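The timing decisions in the steps above boil down to simple arithmetic. As a worked sketch (function name and return shape are illustrative, not part of this skill's API):

```python
# Given video length, music length, a start offset into the track, and
# a fade-out duration, compute how many loops are needed and where the
# fade-out begins. All values are in seconds.
import math

def plan_audio(video_s, music_s, music_start_s=0.0, fade_out_s=4.0):
    usable = music_s - music_start_s      # music left after the start offset
    if usable <= 0:
        raise ValueError("music_start_s is past the end of the track")
    loops = math.ceil(video_s / usable)   # 1 means no looping needed
    fade_out_start = max(0.0, video_s - fade_out_s)
    return {
        "loops": loops,
        "trim_to_s": video_s,             # audio is cut to the video length
        "fade_out_start_s": fade_out_start,
    }
```

For example, a 60-second video with a 45-second track needs one loop, with a 5-second fade-out starting at the 55-second mark.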