Add B-roll footage and Remotion animations to your edited video. Auto-detects format. Creates the final composed video.
Today: !date +%Y-%m-%d
Takes an edited video (from /video-edit) and composes the final output by inserting Remotion animations and real B-roll at the right moments. This is the LAST step before /video-subtitle.
Pipeline: Record -> /video-edit -> /video-animate -> /video-finalize -> /video-subtitle
Read [active-business]/brand.md (Section 4: Brand Voice) for brand style.
Read [active-business]/lessons.md for any video finalization lessons.

Before running this skill, these must exist:
- The edited video from /video-edit
- broll-map.md from /video-animate

Ask for the edited video path if not provided as argument.
ffprobe -v quiet -print_format json -show_streams "[input_video]"
Parse width and height to determine the format: width greater than height is landscape, height greater than width is vertical, equal is square.
Tell user: "Detected [format] video ([width]x[height]). I'll optimize for [format]."
Get specs: resolution, fps, duration, codec. All output must match these specs.
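The detection step can be sketched in Python (the helper names are illustrative, not part of the skill; only `ffprobe` on PATH is assumed):

```python
import json
import subprocess

def classify_format(width, height):
    """Map dimensions to the three supported formats."""
    if width > height:
        return "landscape"
    if height > width:
        return "vertical"
    return "square"

def probe_video(path):
    """Read the first video stream's specs via ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    stream = next(
        s for s in json.loads(result.stdout)["streams"]
        if s["codec_type"] == "video"
    )
    return {
        "width": stream["width"],
        "height": stream["height"],
        "codec": stream["codec_name"],
        "format": classify_format(stream["width"], stream["height"]),
    }
```

All later per-format settings (buffers, gap thresholds, pacing) key off the returned `format` value.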
Check Remotion: Look for system/tools/video-animations/node_modules. If not installed, skip animation insertions and only do real B-roll.
Check for existing transcription file (from /video-edit). If not found:
1. Check for [video]-transcript.json in the same folder
2. If not found:
a. Try: whisper --help (local openai-whisper)
b. If not available: check for ELEVENLABS_API_KEY
c. If neither: "Transcription needed but no tool found. Install whisper: pip install openai-whisper or set ELEVENLABS_API_KEY."
3. Run: whisper [input] --model base --output_format json --word_timestamps True
4. Save to [video]-transcript.json
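The fallback chain in step 2 can be expressed as one selector (a minimal sketch; the function name and injectable arguments are assumptions added for testability):

```python
import os
import shutil

def pick_transcriber(whisper_on_path=None, api_key=None):
    """Choose a transcription backend following the fallback order above.
    Arguments default to the live environment but can be injected for tests."""
    if whisper_on_path is None:
        whisper_on_path = shutil.which("whisper") is not None
    if api_key is None:
        api_key = os.environ.get("ELEVENLABS_API_KEY")
    if whisper_on_path:
        return "whisper"
    if api_key:
        return "elevenlabs"
    raise RuntimeError(
        "Transcription needed but no tool found. Install whisper: "
        "pip install openai-whisper or set ELEVENLABS_API_KEY."
    )
```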
Parse broll-map.md to get the animation placement plan. Each entry has: a script moment, an animation name, a dark/light variant, and a duration.
Match each broll-map entry to a transcript timestamp by fuzzy-matching the script moment text against the transcription.
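One way to sketch the fuzzy match using the word timestamps from the transcript JSON (the function name and the `(word, start_seconds)` tuple shape are assumptions):

```python
import difflib

def find_moment(transcript_words, moment_text, window=8):
    """Locate a script moment in word-level transcript output.
    transcript_words: list of (word, start_seconds) pairs.
    Returns (start_time, match_score); start_time is None if no window fits."""
    target = moment_text.lower()
    best_score, best_time = 0.0, None
    for i in range(len(transcript_words) - window + 1):
        chunk = " ".join(w for w, _ in transcript_words[i:i + window]).lower()
        score = difflib.SequenceMatcher(None, chunk, target).ratio()
        if score > best_score:
            best_score, best_time = score, transcript_words[i][1]
    return best_time, best_score
```

A low best score signals the script moment was cut during editing, so the entry should be flagged for the user rather than placed blindly.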
Scan the transcript for "show moments" where B-roll would cover important visual content. Two detection layers:
Explicit callouts: phrases like "as you can see", "look at this", "here on my screen".
Demonstration language: phrases like "let me show you", "I'll walk you through", "watch what happens".
Scan for tool names and screen indicators. When 2+ mentions cluster within 30 seconds, flag as a demonstration zone.
Tool names: Claude, ChatGPT, GPT, Cursor, VS Code, Chrome, Safari, browser, terminal, dashboard, Notion, Google Docs, spreadsheet, Figma, Canva, Facebook Ads Manager, Meta, Google Ads, Analytics, Stripe, Zapier, Make
Screen indicators: screen, tab, window, sidebar, menu, button, dropdown, settings, interface, prompt, output, response, result
Rule: If a tool name AND a screen indicator appear within 10 seconds of each other, treat it as a demonstration zone.
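The tool-plus-indicator rule can be sketched as a pure function (a minimal sketch; the word sets below are a subset of the full lists above for brevity, and the function name is illustrative):

```python
TOOL_NAMES = {"claude", "chatgpt", "cursor", "terminal", "figma", "stripe"}
SCREEN_INDICATORS = {"screen", "tab", "window", "button", "settings", "output"}

def demo_zone_times(words, max_gap=10.0):
    """Flag demonstration zones: a tool name and a screen indicator
    within max_gap seconds of each other. words: list of (word, t) pairs."""
    tools = [t for w, t in words if w.lower().strip(".,!?") in TOOL_NAMES]
    screens = [t for w, t in words if w.lower().strip(".,!?") in SCREEN_INDICATORS]
    zones = []
    for tool_t in tools:
        for screen_t in screens:
            if abs(tool_t - screen_t) <= max_gap:
                zones.append((min(tool_t, screen_t), max(tool_t, screen_t)))
    return sorted(set(zones))
```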
| Detection type | Before buffer | After buffer |
|---|---|---|
| Verbal cue (explicit callout) | 3 seconds | 8 seconds |
| Verbal cue (demonstration language) | 3 seconds | 6 seconds |
| Tool/app context cluster | 5 seconds | 12 seconds |
- [SAFE] flag -- always insert animation regardless of show-moment zone
- [NEVER] flag -- skip animation regardless
- [FORCE-TIMESTAMP:MM:SS] -- insert at exact timestamp

Pattern: dark -> light -> dark -> light -> ...
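The flag rules combine with the show-moment zones into one resolver (a minimal sketch; the function name and the encoding of flags as a set of strings are assumptions):

```python
def resolve_insertion(timestamp, flags, zones):
    """Apply placement flags against detected show-moment zones.
    timestamp: planned insertion point in seconds; flags: set of strings,
    e.g. {"SAFE"} or {"FORCE-TIMESTAMP:02:15"}; zones: list of (start, end)
    protected spans. Returns the final timestamp, or None to skip."""
    for flag in flags:
        if flag.startswith("FORCE-TIMESTAMP:"):
            mm, ss = flag.split(":")[1:]
            return int(mm) * 60 + int(ss)
    if "NEVER" in flags:
        return None
    in_zone = any(start <= timestamp <= end for start, end in zones)
    if in_zone and "SAFE" not in flags:
        return None
    return timestamp
```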
Context-driven overrides:
| Setting | Landscape | Vertical | Square |
|---|---|---|---|
| Max animation duration | 8s | 6s | 7s |
| Max talking head gap | 25s | 15s | 20s |
| Real B-roll clip length | 3-5s | 2-4s | 3-4s |
If an animation exceeds the max, trim from the BACK (keep the opening).
Two independent methods, results merged:
Method 1: Visual Noun Matching (highest priority). Scan transcript for concrete visual nouns matching real footage in the B-roll catalog (if one exists). Visual noun matches override gap rules and can replace planned animations.
Method 2: Gap Filling. After placing animations and visual noun matches, scan for remaining gaps:
| Gap Size | Landscape | Vertical | Square |
|---|---|---|---|
| Mandatory fill | >30s | >20s | >25s |
| Recommended fill | 20-30s | 12-20s | 15-25s |
| Skip | <20s | <12s | <15s |
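The gap table can be sketched as a classifier over the already-covered spans (a minimal sketch; the thresholds mirror the table above, and the helper names are illustrative):

```python
GAP_THRESHOLDS = {          # (mandatory_over, recommended_from), in seconds
    "landscape": (30, 20),
    "vertical": (20, 12),
    "square": (25, 15),
}

def classify_gaps(covered, duration, fmt):
    """covered: sorted (start, end) spans already holding B-roll or animations.
    Returns (gap_start, gap_end, label) for each talking-head stretch."""
    mandatory_over, recommended_from = GAP_THRESHOLDS[fmt]
    edges = [(0.0, 0.0)] + sorted(covered) + [(duration, duration)]
    gaps = []
    for (_, prev_end), (next_start, _) in zip(edges, edges[1:]):
        size = next_start - prev_end
        if size > mandatory_over:
            label = "mandatory"
        elif size >= recommended_from:
            label = "recommended"
        else:
            label = "skip"
        gaps.append((prev_end, next_start, label))
    return gaps
```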
For each gap, use semantic segment matching:
Selection priority:
Present ALL B-roll decisions via AskUserQuestion BEFORE rendering:
"Here's the full B-roll plan. Approve or adjust?"
Animations (from broll-map):
1. [timestamp] -- [AnimationName] (dark/light, Xs) -- "[script moment]"
Real B-roll (from media library):
A. [timestamp] -- [filename] (Xs) -- "[why: visual noun match / gap fill]"
Conflicts (real footage vs animation at same moment):
At [timestamp]: Animation [X] OR real clip [Y]?
Options: Approve plan / Adjust specific items / Skip all real B-roll
Create the final edit decision list (EDL):
Rules:
Landscape:
| Section | B-roll frequency |
|---|---|
| First 30s | Every 3-5s |
| Minutes 1-3 | Every 10-15s |
| Minutes 3-8 | Every 15-20s |
| After 8 min | Every 20-30s |
Vertical:
| Section | B-roll frequency |
|---|---|
| First 5s | Hook -- can use B-roll immediately |
| Seconds 5-15 | Every 3-5s |
| Seconds 15-30 | Every 5-8s |
| After 30s | Every 8-12s |
Square: Between landscape and vertical pacing.
For all formats:
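The approved plan can be turned into the segment list the render step consumes (a minimal sketch; the tuple shape for insertions and the dict keys match the `build_ffmpeg` function below, but `build_segments` itself is an illustrative helper):

```python
def build_segments(insertions, duration):
    """insertions: sorted (start, end, input_index, broll_start, broll_end)
    tuples on the edited-video timeline. Returns alternating talk/broll
    segment dicts covering the full duration."""
    segments, cursor = [], 0.0
    for start, end, input_index, broll_start, broll_end in insertions:
        if start > cursor:
            segments.append({"type": "talk", "start": cursor, "end": start})
        segments.append({
            "type": "broll", "start": start, "end": end,
            "input_index": input_index,
            "broll_start": broll_start, "broll_end": broll_end,
        })
        cursor = end
    if cursor < duration:
        segments.append({"type": "talk", "start": cursor, "end": duration})
    return segments
```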
Build ffmpeg filter_complex with multi-input sources.
```python
def build_ffmpeg(segments, source_video, broll_files, output_file,
                 source_fps, output_w, output_h):
    # Input 0 is the edited talking-head video; inputs 1..N are B-roll files.
    inputs = ["-i", source_video]
    for bf in broll_files:
        inputs.extend(["-i", bf])

    filter_parts = []
    concat_inputs = []
    for i, seg in enumerate(segments):
        if seg['type'] == 'talk':
            # Talking-head segment: cut video and audio from the source.
            filter_parts.append(
                f"[0:v]trim=start={seg['start']:.3f}:end={seg['end']:.3f},"
                f"setpts=PTS-STARTPTS,fps={source_fps},setsar=1:1[v{i}];"
            )
            filter_parts.append(
                f"[0:a]atrim=start={seg['start']:.3f}:end={seg['end']:.3f},"
                f"asetpts=PTS-STARTPTS[a{i}];"
            )
        else:
            # B-roll segment: video from the B-roll input, scaled to the
            # output resolution; audio stays the source narration so speech
            # continues under the inserted footage.
            inp_idx = seg['input_index']
            filter_parts.append(
                f"[{inp_idx}:v]trim=start={seg['broll_start']:.3f}:end={seg['broll_end']:.3f},"
                f"setpts=PTS-STARTPTS,scale={output_w}:{output_h}:flags=lanczos,"
                f"fps={source_fps},setsar=1:1[v{i}];"
            )
            filter_parts.append(
                f"[0:a]atrim=start={seg['start']:.3f}:end={seg['end']:.3f},"
                f"asetpts=PTS-STARTPTS[a{i}];"
            )
        concat_inputs.append(f"[v{i}][a{i}]")

    # Concatenate all segments in timeline order.
    n = len(segments)
    filter_complex = "\n".join(filter_parts)
    filter_complex += f"\n{''.join(concat_inputs)}concat=n={n}:v=1:a=1[outv][outa]"

    # Write the filtergraph to a file to avoid shell argument-length limits.
    filter_file = output_file.replace('.mp4', '-filter.txt')
    with open(filter_file, 'w') as f:
        f.write(filter_complex)

    cmd = [
        "ffmpeg", "-y", *inputs,
        "-filter_complex_script", filter_file,
        "-map", "[outv]", "-map", "[outa]",
        "-c:v", "libx264", "-preset", "medium", "-crf", "18",
        "-c:a", "aac", "-b:a", "320k",
        "-movflags", "+faststart",
        output_file
    ]
    return cmd
```
Run the ffmpeg command. Save output as {original-filename}-final.mp4 in the same directory.
# Check specs match source
ffprobe -v quiet -print_format json -show_streams -show_format output.mp4
# Duration should match edited video (+/- 0.5s)
# The final video should contain more scene changes than the edited video, since each insertion adds cuts
ffmpeg -i output.mp4 -vf "select='gt(scene,0.15)',showinfo" -vsync vfr -f null /dev/null 2>&1 | grep showinfo | wc -l
| Metric | Landscape | Vertical | Square |
|---|---|---|---|
| Resolution | 1920x1080 | 1080x1920 | 1080x1080 |
| Bitrate | 8-12 Mbps | 6-10 Mbps | 6-10 Mbps |
| FPS | Matches source | Matches source | Matches source |
| Duration | Same as edited (+/- 0.5s) | Same | Same |
After each finalize run, append used clips to a usage log in the project folder:
## [date] -- [video title]
- clip-name-1.mp4 (at 2:34)
- animation-name.mp4 (at 5:12)
When selecting clips for future videos, check the last 3 entries. Deprioritize (don't block) recently used clips.
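Parsing the usage log for the deprioritization check can be sketched as (a minimal sketch assuming the log format shown above; the function name is illustrative):

```python
def recently_used(log_text, n_entries=3):
    """Collect clip filenames from the last n_entries log sections.
    Each section starts with '## [date] -- [title]' and lists clips as
    '- name.mp4 (at M:SS)'."""
    entries, current = [], None
    for line in log_text.splitlines():
        if line.startswith("## "):
            current = []
            entries.append(current)
        elif line.startswith("- ") and current is not None:
            # Strip the '(at M:SS)' suffix, keep just the filename.
            current.append(line[2:].split(" (")[0])
    return {clip for entry in entries[-n_entries:] for clip in entry}
```

Clips in the returned set are deprioritized, not blocked, when picking footage for the next video.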
Ask: "Want me to run a quality check on the final video?"
If yes:
Finalized: [filename]
Duration: [X]m [Y]s
Format: [landscape/vertical/square] ([WxH])
B-roll insertions: [N] animations, [M] real footage clips
Skipped: [K] animations (show moments)
Variant pattern: dark/light/dark/light...
Output: [path to final file]
| Setting | Landscape | Vertical | Square |
|---|---|---|---|
| Output | 1920x1080 | 1080x1920 | 1080x1080 |
| CRF | 18 | 18 | 18 |
| Audio | AAC 320kbps | AAC 320kbps | AAC 320kbps |
| Max B-roll | 8s | 6s | 7s |
| Max talk gap | 25s | 15s | 20s |
| Gap fill (mandatory) | >30s | >20s | >25s |
| Gap fill (recommended) | 20-30s | 12-20s | 15-25s |
| Variant pattern | dark/light alternating | dark/light alternating | dark/light alternating |
| Show-moment buffer (verbal) | 3s before, 6-8s after | 3s before, 6-8s after | 3s before, 6-8s after |
| Show-moment buffer (tool) | 5s before, 12s after | 5s before, 12s after | 5s before, 12s after |
Want to change anything? If you give feedback, I'll apply it and add a lesson to your brand's lessons.md so I remember next time.