Edit any video by conversation. Transcribe, cut, color grade, generate overlay animations, burn subtitles — for talking heads, montages, tutorials, travel, interviews. No presets, no menus. Ask questions, confirm the plan, execute, iterate, persist. Production-correctness rules are hard; everything else is artistic freedom.
takes_packed.md). Everything else — filler tagging, retake detection, shot classification, emphasis scoring — you derive at decision time.These are the things where deviation produces silent failures or broken output. They are not taste, they are correctness. Memorize them.
-c copy concat, not single-pass filtergraph. Otherwise you double-encode every segment when overlays are added.afade=t=in:st=0:d=0.03,afade=t=out:st={dur-0.03}:d=0.03). Otherwise audible pops at every cut.setpts=PTS-STARTPTS+T/TB to shift the overlay's frame 0 to its window start. Otherwise you see the middle of the animation during the overlay window.output_time = word.start - segment_start + segment_offset. Otherwise captions misalign after segment concat.Agent tool; total wall time ≈ slowest one.<videos_dir>/edit/. Never write inside the video-use/ project directory.Everything else in this document is a worked example. Deviate whenever the material calls for it.
The skill lives in video-use/. User footage lives wherever they put it. All session outputs go into <videos_dir>/edit/.
<videos_dir>/
├── <source files, untouched>
└── edit/
├── project.md ← memory; appended every session
├── takes_packed.md ← phrase-level transcripts, the LLM's primary reading view
├── edl.json ← cut decisions
├── transcripts/<name>.json ← cached raw Scribe JSON
├── animations/slot_<id>/ ← per-animation source + render + reasoning
├── clips_graded/ ← per-segment extracts with grade + fades
├── master.srt ← output-timeline subtitles
├── downloads/ ← yt-dlp outputs
├── verify/ ← debug frames / timeline PNGs
├── preview.mp4
└── final.mp4
ELEVENLABS_API_KEY in .env at project root or env. Ask and write .env if missing.ffmpeg + ffprobe on PATH.pip install -e ..yt-dlp, manim, Remotion installed only on first use.skills/manim-video/. Read its SKILL.md when building a Manim slot.transcribe.py <video> — single-file Scribe call. --num-speakers N optional. Cached.transcribe_batch.py <videos_dir> — 4-worker parallel transcription. Use for multi-take.pack_transcripts.py --edit-dir <dir> — transcripts/*.json → takes_packed.md (phrase-level, break on silence ≥ 0.5s).timeline_view.py <video> <start> <end> — filmstrip + waveform PNG. On-demand visual drill-down. Not a scan tool — use it at decision points, not constantly.render.py <edl.json> -o <out> — per-segment extract → concat → overlays (PTS-shifted) → subtitles LAST. --preview for 720p fast. --build-subtitles to generate master.srt inline.grade.py <in> -o <out> — ffmpeg filter chain grade. Presets + --filter '<raw>' for custom.For animations, create <edit>/animations/slot_<id>/ with Bash and spawn a sub-agent via the Agent tool.
Inventory. ffprobe every source. transcribe_batch.py on the directory. pack_transcripts.py to produce takes_packed.md. Sample one or two timeline_views for a visual first impression.
Pre-scan for problems. One pass over takes_packed.md to note verbal slips, obvious mis-speaks, or phrasings to avoid. Plain list, feed into the editor brief.
Converse. Describe what you see in plain English. Ask questions shaped by the material. Collect: content type, target length/aspect, aesthetic/brand direction, pacing feel, must-preserve moments, must-cut moments, animation and grade preferences, subtitle needs. Do not use a fixed checklist — the right questions are different every time.
Propose strategy. 4–8 sentences: shape, take choices, cut direction, animation plan, grade direction, subtitle style, length estimate. Wait for confirmation.
Execute. Produce edl.json via the editor sub-agent brief. Drill into timeline_view at ambiguous moments. Build animations in parallel sub-agents. Apply grade per-segment. Compose via render.py.
Preview. render.py --preview.
Self-eval (before showing the user). Run timeline_view on the rendered output (not the sources) at every cut boundary (±1.5s window). Check each image for:
Also sample: first 2s, last 2s, and 2–3 mid-points — check grade consistency, subtitle readability, overall coherence. Run ffprobe on the output to verify duration matches the EDL expectation.
If anything fails: fix → re-render → re-eval. Cap at 3 self-eval passes — if issues remain after 3, flag them to the user rather than looping forever. Only present the preview once the self-eval passes.
Iterate + persist. Natural-language feedback, re-plan, re-render. Never re-transcribe. Final render on confirmation. Append to project.md.
(laughs), (sighs), (applause) mark beats. Extend past them.pack_transcripts.py reads all transcripts/*.json and produces one markdown file where each take is a list of phrase-level lines, each prefixed with its [start-end] time range. Phrases break on any silence ≥ 0.5s OR speaker change. This is the artifact the editor sub-agent reads to pick cuts — it gives word-boundary precision from text alone at 1/10 the tokens of raw JSON.
Example line:
## C0103 (duration: 43.0s, 8 phrases)
[002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
[006.08-006.74] S0 We fixed this.
When the task is "pick the best take of each beat across many clips," spawn a dedicated sub-agent with a brief shaped like this. The structure is load-bearing; the pitch-shape example is not.
You are editing a <type> video. Pick the best take of each beat and
assemble them chronologically by beat, not by source clip order.
INPUTS:
- takes_packed.md (time-annotated phrase-level transcripts of all takes)
- Product/narrative context: <2 sentences from the user>
- Speaker(s): <name, role, delivery style note>
- Expected structure: <pick an archetype or invent one>
- Verbal slips to avoid: <list from the pre-scan pass>
- Target runtime: <seconds>
Common structural archetypes (pick, adapt, or invent):
- Tech launch / demo: HOOK → PROBLEM → SOLUTION → BENEFIT → EXAMPLE → CTA
- Tutorial: INTRO → SETUP → STEPS → GOTCHAS → RECAP
- Interview: (QUESTION → ANSWER → FOLLOWUP) repeat
- Travel / event: ARRIVAL → HIGHLIGHTS → QUIET MOMENTS → DEPARTURE
- Documentary: THESIS → EVIDENCE → COUNTERPOINT → CONCLUSION
- Music / performance: INTRO → VERSE → CHORUS → BRIDGE → OUTRO
- Or invent your own.
RULES:
- Start/end times must fall on word boundaries from the transcript.
- Pad cut boundaries (working window 30–200ms).
- Prefer silences ≥ 400ms as cut targets.
- Unavoidable slips are kept if no better take exists. Note them in "reason".
- If over budget, revise: drop a beat or trim tails. Report total and self-correct.
OUTPUT (JSON array, no prose):
[{"source": "C0103", "start": 2.42, "end": 6.85, "beat": "HOOK",
"quote": "...", "reason": "..."}, ...]
Return the final EDL and a one-line total runtime check.
Your job is to reason about the image, not apply a preset. Look at a frame (via timeline_view), decide what's wrong, adjust one thing, look again.
Mental model is ASC CDL. Per channel: out = (in * slope + offset) ** power, then global saturation. slope → highlights, offset → shadows, power → midtones.
Example filter chains (grade.py has --list-presets; use them as starting points or mix your own):
warm_cinematic — retro/technical, subtle teal/orange split, desaturated. Shipped in a real launch video. Safe for talking heads.neutral_punch — minimal corrective: contrast bump + gentle S-curve. No hue shifts.none — straight copy. Default when the user hasn't asked.For anything else — portraiture, nature, product, music video, documentary — invent your own chain. grade.py --filter '<raw ffmpeg>' accepts any filter string.
Hard rules: apply per-segment during extraction (not post-concat, which re-encodes twice). Never go aggressive without testing skin tones.
Subtitles have three dimensions worth reasoning about: chunking (1/2/3/sentence per line), case (UPPER/Title/Natural), and placement (margin from bottom). The right combo depends on content.
Worked styles — pick, adapt, or invent:
bold-overlay — short-form tech launch, fast-paced social. 2-word chunks, UPPERCASE, break on punctuation, Helvetica 18 Bold, white-on-outline, MarginV=35. render.py ships with this as SUB_FORCE_STYLE.
FontName=Helvetica,FontSize=18,Bold=1,
PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,BackColour=&H00000000,
BorderStyle=1,Outline=2,Shadow=0,
Alignment=2,MarginV=35
natural-sentence (if you invent this mode) — narrative, documentary, education. 4–7 word chunks, sentence case, break on natural pauses, MarginV=60–80, larger font for readability, slightly wider max-width. No shipped force_style — design one if you need it.
Invent a third style if neither fits. Hard rules: subtitles LAST (Rule 1), output-timeline offsets (Rule 5).
Animations match the content and the brand. Get the palette, font, and visual language from the conversation — never assume a default. If the user hasn't told you, propose a palette in the strategy phase and wait for confirmation before building anything.
Tool options:
skills/manim-video/SKILL.md and its references for depth.None is mandatory. Invent hybrids if useful (e.g., PIL background with a Remotion layer on top).
Duration rules of thumb, context-dependent:
narration_length + 1s (universal).Animation payoff timing (rule for sync-to-narration): get the payoff word's timestamp. Start the overlay reveal_duration seconds earlier so the landing frame coincides with the spoken payoff word. Without this sync the animation feels disconnected.
Easing (universal — never linear, it looks robotic):
def ease_out_cubic(t): return 1 - (1 - t) ** 3
def ease_in_out_cubic(t):
if t < 0.5: return 4 * t ** 3
return 1 - (-2 * t + 2) ** 3 / 2
ease_out_cubic for single reveals (slow landing). ease_in_out_cubic for continuous draws.
Typing text anchor trick: center on the FULL string's width, not the partial-string width — otherwise text slides left during reveal.
Example palette (the launch video — one aesthetic among infinite):
(10, 10, 10) near-black#FF5A00 / (255, 90, 0) orange(110, 110, 110) dim gray/System/Library/Fonts/Menlo.ttc (index 1)This is one style. If the brand is warm and serif, use that. If it's colorful and playful, use that. If the user handed you a style guide, follow it. If they didn't, propose one and confirm.
Parallel sub-agent brief — each animation is one sub-agent spawned via the Agent tool. Each prompt is self-contained (sub-agents have no parent context). Include:
<edit>/animations/slot_<id>/render.mp4)One sub-agent = one file (unique filenames, parallel agents don't overwrite each other).
Match the source unless the user asked for something specific. Common targets: 1920×1080@24 cinematic, 1920×1080@30 screen content, 1080×1920@30 vertical social, 3840×2160@24 4K cinema, 1080×1080@30 square. render.py defaults the scale to 1080p from any source; pass --filter or edit the extract command for other targets. Worth asking the user which delivery format matters.
{
"version": 1,
"sources": {"C0103": "/abs/path/C0103.MP4", "C0108": "/abs/path/C0108.MP4"},
"ranges": [
{"source": "C0103", "start": 2.42, "end": 6.85,
"beat": "HOOK", "quote": "...", "reason": "Cleanest delivery, stops before slip at 38.46."},
{"source": "C0108", "start": 14.30, "end": 28.90,
"beat": "SOLUTION", "quote": "...", "reason": "Only take without the false start."}
],
"grade": "warm_cinematic",
"overlays": [
{"file": "edit/animations/slot_1/render.mp4", "start_in_output": 0.0, "duration": 5.0}
],
"subtitles": "edit/master.srt",
"total_duration_s": 87.4
}
grade is a preset name or raw ffmpeg filter. overlays are rendered animation clips. subtitles is optional and applied LAST.
project.mdAppend one section per session at <edit>/project.md:
## Session N — YYYY-MM-DD
**Strategy:** one paragraph describing the approach
**Decisions:** take choices, cuts, grades, animations + why
**Reasoning log:** one-line rationale for non-obvious decisions
**Outstanding:** deferred items
On startup, read project.md if it exists and summarize the last session in one sentence before asking whether to continue.
Things that consistently fail regardless of style: