Add B-roll footage and Remotion animations to your edited video. Auto-detects format. Creates the final composed video.
Today: !date +%Y-%m-%d
Takes an edited video (from /video-edit) and composes the final output by inserting Remotion animations and real B-roll at the right moments. This is the LAST step before /video-subtitle.
Pipeline: Record -> /video-edit -> /video-animate -> /video-finalize -> /video-subtitle
Read [active-business]/brand.md (Section 4: Brand Voice) for brand style.
Read [active-business]/lessons.md for any video finalization lessons.

Before running this skill, these must exist:
- The edited video from /video-edit
- broll-map.md from /video-animate

Ask for the edited video path if not provided as argument.
ffprobe -v quiet -print_format json -show_streams "[input_video]"
Parse width and height to determine the format: width greater than height is landscape, height greater than width is vertical, equal is square.
Tell user: "Detected [format] video ([width]x[height]). I'll optimize for [format]."
Get specs: resolution, fps, duration, codec. All output must match these specs.
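The detection step can be sketched in Python (the helper names are illustrative, not part of the skill; only `ffprobe` on PATH is assumed):

```python
import json
import subprocess

def classify_format(width, height):
    """Map dimensions to the three supported formats."""
    if width > height:
        return "landscape"
    if height > width:
        return "vertical"
    return "square"

def probe_video(path):
    """Read the first video stream's specs via ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    stream = next(
        s for s in json.loads(result.stdout)["streams"]
        if s["codec_type"] == "video"
    )
    return {
        "width": stream["width"],
        "height": stream["height"],
        "codec": stream["codec_name"],
        "format": classify_format(stream["width"], stream["height"]),
    }
```

All later per-format settings (buffers, gap thresholds, pacing) key off the returned `format` value.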
Check Remotion: Look for system/tools/video-animations/node_modules. If not installed, skip animation insertions and only do real B-roll.
Check for existing transcription file (from /video-edit). If not found:
1. Check for [video]-transcript.json in the same folder
2. If not found:
a. Try: whisper --help (local openai-whisper)
b. If not available: check for ELEVENLABS_API_KEY
c. If neither: "Transcription needed but no tool found. Install whisper: pip install openai-whisper or set ELEVENLABS_API_KEY."
3. Run: whisper [input] --model base --output_format json --word_timestamps True
4. Save to [video]-transcript.json
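The fallback chain in step 2 can be expressed as one selector (a minimal sketch; the function name and injectable arguments are assumptions added for testability):

```python
import os
import shutil

def pick_transcriber(whisper_on_path=None, api_key=None):
    """Choose a transcription backend following the fallback order above.
    Arguments default to the live environment but can be injected for tests."""
    if whisper_on_path is None:
        whisper_on_path = shutil.which("whisper") is not None
    if api_key is None:
        api_key = os.environ.get("ELEVENLABS_API_KEY")
    if whisper_on_path:
        return "whisper"
    if api_key:
        return "elevenlabs"
    raise RuntimeError(
        "Transcription needed but no tool found. Install whisper: "
        "pip install openai-whisper or set ELEVENLABS_API_KEY."
    )
```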
Parse broll-map.md to get the animation placement plan. Each entry has: a script moment, an animation name, a dark/light variant, and a duration.
Match each broll-map entry to a transcript timestamp by fuzzy-matching the script moment text against the transcription.
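One way to sketch the fuzzy match using the word timestamps from the transcript JSON (the function name and the `(word, start_seconds)` tuple shape are assumptions):

```python
import difflib

def find_moment(transcript_words, moment_text, window=8):
    """Locate a script moment in word-level transcript output.
    transcript_words: list of (word, start_seconds) pairs.
    Returns (start_time, match_score); start_time is None if no window fits."""
    target = moment_text.lower()
    best_score, best_time = 0.0, None
    for i in range(len(transcript_words) - window + 1):
        chunk = " ".join(w for w, _ in transcript_words[i:i + window]).lower()
        score = difflib.SequenceMatcher(None, chunk, target).ratio()
        if score > best_score:
            best_score, best_time = score, transcript_words[i][1]
    return best_time, best_score
```

A low best score signals the script moment was cut during editing, so the entry should be flagged for the user rather than placed blindly.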
Scan the transcript for "show moments" where B-roll would cover important visual content. Two detection layers:
Explicit callouts: phrases like "as you can see", "look at this", "here on my screen".
Demonstration language: phrases like "let me show you", "I'll walk you through", "watch what happens".
Scan for tool names and screen indicators. When 2+ mentions cluster within 30 seconds, flag as a demonstration zone.
Tool names: Claude, ChatGPT, GPT, Cursor, VS Code, Chrome, Safari, browser, terminal, dashboard, Notion, Google Docs, spreadsheet, Figma, Canva, Facebook Ads Manager, Meta, Google Ads, Analytics, Stripe, Zapier, Make
Screen indicators: screen, tab, window, sidebar, menu, button, dropdown, settings, interface, prompt, output, response, result
Rule: If a tool name AND a screen indicator appear within 10 seconds of each other, treat it as a demonstration zone.
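The tool-plus-indicator rule can be sketched as a pure function (a minimal sketch; the word sets below are a subset of the full lists above for brevity, and the function name is illustrative):

```python
TOOL_NAMES = {"claude", "chatgpt", "cursor", "terminal", "figma", "stripe"}
SCREEN_INDICATORS = {"screen", "tab", "window", "button", "settings", "output"}

def demo_zone_times(words, max_gap=10.0):
    """Flag demonstration zones: a tool name and a screen indicator
    within max_gap seconds of each other. words: list of (word, t) pairs."""
    tools = [t for w, t in words if w.lower().strip(".,!?") in TOOL_NAMES]
    screens = [t for w, t in words if w.lower().strip(".,!?") in SCREEN_INDICATORS]
    zones = []
    for tool_t in tools:
        for screen_t in screens:
            if abs(tool_t - screen_t) <= max_gap:
                zones.append((min(tool_t, screen_t), max(tool_t, screen_t)))
    return sorted(set(zones))
```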
| Detection type | Before buffer | After buffer |
|---|---|---|
| Verbal cue (explicit callout) | 3 seconds | 8 seconds |
| Verbal cue (demonstration language) | 3 seconds | 6 seconds |
| Tool/app context cluster | 5 seconds | 12 seconds |
- [SAFE] flag -- always insert animation regardless of show-moment zone
- [NEVER] flag -- skip animation regardless
- [FORCE-TIMESTAMP:MM:SS] -- insert at exact timestamp

Pattern: dark -> light -> dark -> light -> ...
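The flag rules combine with the show-moment zones into one resolver (a minimal sketch; the function name and the encoding of flags as a set of strings are assumptions):

```python
def resolve_insertion(timestamp, flags, zones):
    """Apply placement flags against detected show-moment zones.
    timestamp: planned insertion point in seconds; flags: set of strings,
    e.g. {"SAFE"} or {"FORCE-TIMESTAMP:02:15"}; zones: list of (start, end)
    protected spans. Returns the final timestamp, or None to skip."""
    for flag in flags:
        if flag.startswith("FORCE-TIMESTAMP:"):
            mm, ss = flag.split(":")[1:]
            return int(mm) * 60 + int(ss)
    if "NEVER" in flags:
        return None
    in_zone = any(start <= timestamp <= end for start, end in zones)
    if in_zone and "SAFE" not in flags:
        return None
    return timestamp
```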
Context-driven overrides:
| Setting | Landscape | Vertical | Square |
|---|---|---|---|
| Max animation duration | 8s | 6s | 7s |
| Max talking head gap | 25s | 15s | 20s |
| Real B-roll clip length | 3-5s | 2-4s | 3-4s |
If an animation exceeds the max, trim from the BACK (keep the opening).
Two independent methods, results merged:
Method 1: Visual Noun Matching (highest priority). Scan transcript for concrete visual nouns matching real footage in the B-roll catalog (if one exists). Visual noun matches override gap rules and can replace planned animations.
Method 2: Gap Filling. After placing animations and visual noun matches, scan for remaining gaps:
| Gap Size | Landscape | Vertical | Square |
|---|---|---|---|
| Mandatory fill | >30s | >20s | >25s |
| Recommended fill | 20-30s | 12-20s | 15-25s |
| Skip | <20s | <12s | <15s |
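The gap table can be sketched as a classifier over the already-covered spans (a minimal sketch; the thresholds mirror the table above, and the helper names are illustrative):

```python
GAP_THRESHOLDS = {          # (mandatory_over, recommended_from), in seconds
    "landscape": (30, 20),
    "vertical": (20, 12),
    "square": (25, 15),
}

def classify_gaps(covered, duration, fmt):
    """covered: sorted (start, end) spans already holding B-roll or animations.
    Returns (gap_start, gap_end, label) for each talking-head stretch."""
    mandatory_over, recommended_from = GAP_THRESHOLDS[fmt]
    edges = [(0.0, 0.0)] + sorted(covered) + [(duration, duration)]
    gaps = []
    for (_, prev_end), (next_start, _) in zip(edges, edges[1:]):
        size = next_start - prev_end
        if size > mandatory_over:
            label = "mandatory"
        elif size >= recommended_from:
            label = "recommended"
        else:
            label = "skip"
        gaps.append((prev_end, next_start, label))
    return gaps
```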
For each gap, use semantic segment matching:
Selection priority:
Present ALL B-roll decisions via AskUserQuestion BEFORE rendering:
"Here's the full B-roll plan. Approve or adjust?"
Animations (from broll-map):
1. [timestamp] -- [AnimationName] (dark/light, Xs) -- "[script moment]"
Real B-roll (from media library):
A. [timestamp] -- [filename] (Xs) -- "[why: visual noun match / gap fill]"
Conflicts (real footage vs animation at same moment):
At [timestamp]: Animation [X] OR real clip [Y]?
Options: Approve plan / Adjust specific items / Skip all real B-roll
Create the final edit decision list (EDL):
Rules:
Landscape:
| Section | B-roll frequency |
|---|---|
| First 30s | Every 3-5s |
| Minutes 1-3 | Every 10-15s |
| Minutes 3-8 | Every 15-20s |
| After 8 min | Every 20-30s |
Vertical:
| Section | B-roll frequency |
|---|---|
| First 5s | Hook -- can use B-roll immediately |
| Seconds 5-15 | Every 3-5s |
| Seconds 15-30 | Every 5-8s |
| After 30s | Every 8-12s |
Square: Between landscape and vertical pacing.
For all formats:
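The approved plan can be turned into the segment list the render step consumes (a minimal sketch; the tuple shape for insertions and the dict keys match the `build_ffmpeg` function below, but `build_segments` itself is an illustrative helper):

```python
def build_segments(insertions, duration):
    """insertions: sorted (start, end, input_index, broll_start, broll_end)
    tuples on the edited-video timeline. Returns alternating talk/broll
    segment dicts covering the full duration."""
    segments, cursor = [], 0.0
    for start, end, input_index, broll_start, broll_end in insertions:
        if start > cursor:
            segments.append({"type": "talk", "start": cursor, "end": start})
        segments.append({
            "type": "broll", "start": start, "end": end,
            "input_index": input_index,
            "broll_start": broll_start, "broll_end": broll_end,
        })
        cursor = end
    if cursor < duration:
        segments.append({"type": "talk", "start": cursor, "end": duration})
    return segments
```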
Build ffmpeg filter_complex with multi-input sources.
```python
def build_ffmpeg(segments, source_video, broll_files, output_file,
                 source_fps, output_w, output_h):
    # Input 0 is the edited talking-head video; inputs 1..N are B-roll files.
    inputs = ["-i", source_video]
    for bf in broll_files:
        inputs.extend(["-i", bf])

    filter_parts = []
    concat_inputs = []
    for i, seg in enumerate(segments):
        if seg['type'] == 'talk':
            # Talking-head segment: cut video and audio from the source.
            filter_parts.append(
                f"[0:v]trim=start={seg['start']:.3f}:end={seg['end']:.3f},"
                f"setpts=PTS-STARTPTS,fps={source_fps},setsar=1:1[v{i}];"
            )
            filter_parts.append(
                f"[0:a]atrim=start={seg['start']:.3f}:end={seg['end']:.3f},"
                f"asetpts=PTS-STARTPTS[a{i}];"
            )
        else:
            # B-roll segment: video from the B-roll input, scaled to the
            # output resolution; audio stays the source narration so speech
            # continues under the inserted footage.
            inp_idx = seg['input_index']
            filter_parts.append(
                f"[{inp_idx}:v]trim=start={seg['broll_start']:.3f}:end={seg['broll_end']:.3f},"
                f"setpts=PTS-STARTPTS,scale={output_w}:{output_h}:flags=lanczos,"
                f"fps={source_fps},setsar=1:1[v{i}];"
            )
            filter_parts.append(
                f"[0:a]atrim=start={seg['start']:.3f}:end={seg['end']:.3f},"
                f"asetpts=PTS-STARTPTS[a{i}];"
            )
        concat_inputs.append(f"[v{i}][a{i}]")

    # Concatenate all segments in timeline order.
    n = len(segments)
    filter_complex = "\n".join(filter_parts)
    filter_complex += f"\n{''.join(concat_inputs)}concat=n={n}:v=1:a=1[outv][outa]"

    # Write the filtergraph to a file to avoid shell argument-length limits.
    filter_file = output_file.replace('.mp4', '-filter.txt')
    with open(filter_file, 'w') as f:
        f.write(filter_complex)

    cmd = [
        "ffmpeg", "-y", *inputs,
        "-filter_complex_script", filter_file,
        "-map", "[outv]", "-map", "[outa]",
        "-c:v", "libx264", "-preset", "medium", "-crf", "18",
        "-c:a", "aac", "-b:a", "320k",
        "-movflags", "+faststart",
        output_file
    ]
    return cmd
```
Run the ffmpeg command. Save output as {original-filename}-final.mp4 in the same directory.
# Check specs match source
ffprobe -v quiet -print_format json -show_streams -show_format output.mp4
# Duration should match edited video (+/- 0.5s)
# The final video should contain more scene changes than the edited video, since each insertion adds cuts
ffmpeg -i output.mp4 -vf "select='gt(scene,0.15)',showinfo" -vsync vfr -f null /dev/null 2>&1 | grep showinfo | wc -l
| Metric | Landscape | Vertical | Square |
|---|---|---|---|
| Resolution | 1920x1080 | 1080x1920 | 1080x1080 |
| Bitrate | 8-12 Mbps | 6-10 Mbps | 6-10 Mbps |
| FPS | Matches source | Matches source | Matches source |
| Duration | Same as edited (+/- 0.5s) | Same | Same |
After each finalize run, append used clips to a usage log in the project folder:
## [date] -- [video title]
- clip-name-1.mp4 (at 2:34)
- animation-name.mp4 (at 5:12)
When selecting clips for future videos, check the last 3 entries. Deprioritize (don't block) recently used clips.
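Parsing the usage log for the deprioritization check can be sketched as (a minimal sketch assuming the log format shown above; the function name is illustrative):

```python
def recently_used(log_text, n_entries=3):
    """Collect clip filenames from the last n_entries log sections.
    Each section starts with '## [date] -- [title]' and lists clips as
    '- name.mp4 (at M:SS)'."""
    entries, current = [], None
    for line in log_text.splitlines():
        if line.startswith("## "):
            current = []
            entries.append(current)
        elif line.startswith("- ") and current is not None:
            # Strip the '(at M:SS)' suffix, keep just the filename.
            current.append(line[2:].split(" (")[0])
    return {clip for entry in entries[-n_entries:] for clip in entry}
```

Clips in the returned set are deprioritized, not blocked, when picking footage for the next video.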
Ask: "Want me to run a quality check on the final video?"
If yes:
Finalized: [filename]
Duration: [X]m [Y]s
Format: [landscape/vertical/square] ([WxH])
B-roll insertions: [N] animations, [M] real footage clips
Skipped: [K] animations (show moments)
Variant pattern: dark/light/dark/light...
Output: [path to final file]
| Setting | Landscape | Vertical | Square |
|---|---|---|---|
| Output | 1920x1080 | 1080x1920 | 1080x1080 |
| CRF | 18 | 18 | 18 |
| Audio | AAC 320kbps | AAC 320kbps | AAC 320kbps |
| Max B-roll | 8s | 6s | 7s |
| Max talk gap | 25s | 15s | 20s |
| Gap fill (mandatory) | >30s | >20s | >25s |
| Gap fill (recommended) | 20-30s | 12-20s | 15-25s |
| Variant pattern | dark/light alternating | dark/light alternating | dark/light alternating |
| Show-moment buffer (verbal) | 3s before, 6-8s after | 3s before, 6-8s after | 3s before, 6-8s after |
| Show-moment buffer (tool) | 5s before, 12s after | 5s before, 12s after | 5s before, 12s after |
Want to change anything? If you give feedback, I'll apply it and add a lesson to your brand's lessons.md so I remember next time.