Expert knowledge for AI video clipping — yt-dlp downloading, whisper transcription, SRT generation, and ffmpeg processing
All tools (ffmpeg, ffprobe, yt-dlp, whisper) use identical CLI flags on Windows, macOS, and Linux. The differences are only in shell syntax:
| Feature | macOS / Linux | Windows (cmd.exe) |
|---|---|---|
| Suppress stderr | 2>/dev/null | 2>NUL |
| Filter output | \| grep pattern | \| findstr pattern |
| Delete files | rm file1 file2 | del file1 file2 |
| Null output device | -f null - | -f null - (same) |
| ffmpeg subtitle paths | subtitles=clip.srt | subtitles=clip.srt (relative OK, absolute needs C\\:/path) |
IMPORTANT: ffmpeg filter paths (-vf "subtitles=...") always need forward slashes. On Windows with absolute paths, escape the colon: subtitles=C\\:/Users/me/clip.srt
Prefer the file_write tool over shell echo/heredoc for creating SRT/text files, since quoting and heredoc syntax differ between shells.
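The path rule above (forward slashes, escaped drive colon) can be applied programmatically. The following is a minimal sketch, not an official ffmpeg helper: `ffmpeg_filter_path` and `subtitles_filter` are hypothetical names, and the sketch only handles the drive-letter colon described above. It does not escape other characters that are special inside filter strings, such as quotes or commas.

```python
def ffmpeg_filter_path(path: str) -> str:
    """Rewrite a path into the form recommended above for ffmpeg filters."""
    # Forward slashes are accepted on every platform inside a filter string.
    converted = path.replace("\\", "/")
    # A Windows drive prefix such as "C:" needs its colon escaped. The doubled
    # backslash mirrors the subtitles=C\\:/Users/me/clip.srt form shown above.
    if len(converted) >= 2 and converted[0].isalpha() and converted[1] == ":":
        converted = converted[0] + "\\\\:" + converted[2:]
    return converted


def subtitles_filter(srt_path: str) -> str:
    """Build the value passed to -vf for burning in subtitles."""
    return "subtitles=" + ffmpeg_filter_path(srt_path)


print(subtitles_filter(r"C:\Users\me\clip.srt"))  # absolute Windows path
print(subtitles_filter("clip.srt"))               # relative path is left as-is
```

Relative paths pass through unchanged, which matches the "relative OK" note in the table.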
# Best video up to 1080p + best audio, merged
yt-dlp -f "bv[height<=1080]+ba/b[height<=1080]" --restrict-filenames -o "source.%(ext)s" "URL"
# 720p max (smaller, faster)
yt-dlp -f "bv[height<=720]+ba/b[height<=720]" --restrict-filenames -o "source.%(ext)s" "URL"
# Audio only (for transcription-only workflows)
yt-dlp -x --audio-format wav --restrict-filenames -o "audio.%(ext)s" "URL"
# Get full metadata as JSON (duration, title, chapters, available subs)
yt-dlp --dump-json "URL"
# Key fields: duration, title, description, chapters, subtitles, automatic_captions
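As an illustration of how the key fields listed above might be read, here is a small sketch that summarizes a metadata document. The function name `summarize_metadata` and the sample values are hypothetical; only the field names come from the list above, and real output contains many more keys.

```python
import json


def summarize_metadata(raw_json: str, lang: str = "en") -> dict:
    """Pick out the fields that matter when planning clips."""
    info = json.loads(raw_json)
    chapters = info.get("chapters") or []  # may be missing or null
    return {
        "title": info.get("title"),
        "duration_secs": info.get("duration"),
        "chapter_count": len(chapters),
        # Manual subtitles are usually preferable to auto-generated ones.
        "has_manual_subs": lang in (info.get("subtitles") or {}),
        "has_auto_subs": lang in (info.get("automatic_captions") or {}),
    }


# Made-up sample standing in for real `yt-dlp --dump-json` output.
sample = json.dumps({
    "title": "Example talk",
    "duration": 1834,
    "chapters": None,
    "subtitles": {},
    "automatic_captions": {"en": []},
})
summary = summarize_metadata(sample)
print(summary)
```

In a real pipeline the input string would typically be the stdout of `yt-dlp --dump-json "URL"`, captured for example with `subprocess.run(..., capture_output=True, text=True)`.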
# Download auto-generated subtitles in json3 format (word-level timing)
yt-dlp --write-auto-subs --sub-lang en --sub-format json3 --skip-download --restrict-filenames -o "source" "URL"
# Download manual subtitles if available
yt-dlp --write-subs --sub-lang en --sub-format srt --skip-download --restrict-filenames -o "source" "URL"
# List available subtitle languages
yt-dlp --list-subs "URL"
- --restrict-filenames: safe ASCII filenames (no spaces/special chars); important on all platforms
- --no-playlist: download a single video even if the URL is in a playlist
- -o "template.%(ext)s": output template (%(ext)s expands to the downloaded file's extension)
- --cookies-from-browser chrome: use browser cookies for age-restricted content
- --extract-audio / -x: extract audio only
- --audio-format wav: convert audio to wav (for whisper)
# Extract mono 16kHz WAV (whisper's preferred input format)
ffmpeg -i source.mp4 -vn -ar 16000 -ac 1 -y audio.wav
# Standard transcription with word-level timestamps
whisper audio.wav --model small --output_format json --word_timestamps true --language en
# Faster alternative (same flags, up to ~4x faster)
whisper-ctranslate2 audio.wav --model small --output_format json --word_timestamps true --language en
| Model | VRAM | Speed | Quality | Use When |
|---|---|---|---|---|
| tiny | ~1GB | Fastest | Rough | Quick previews, testing pipeline |
| base | ~1GB | Fast | OK | Short clips, clear speech |
| small | ~2GB | Moderate | Good | Default — best balance |
| medium | ~5GB | Slow | Better | Important content, accented speech |
| large-v3 | ~10GB | Slowest | Best | Final production, multiple languages |
Note: On macOS Apple Silicon, consider mlx-whisper as a faster native alternative.
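One way to turn the table above into a decision rule is to pick the largest model whose approximate VRAM figure fits the available memory. This is only a sketch: `largest_model_that_fits` is a hypothetical helper, and the numbers are the rough estimates from the table, not measured requirements.

```python
from typing import Optional

# (model name, approximate VRAM in GB), ordered from smallest to largest.
# Figures are the rough estimates from the table above.
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large-v3", 10)]


def largest_model_that_fits(vram_gb: float) -> Optional[str]:
    """Return the largest model whose estimated VRAM need is within vram_gb."""
    fitting = [name for name, needed in MODELS if needed <= vram_gb]
    return fitting[-1] if fitting else None


print(largest_model_that_fits(4))   # a 4 GB GPU
print(largest_model_that_fits(12))  # a 12 GB GPU
```

Quality needs may still justify a smaller model than the one that fits, for example `tiny` when only testing the pipeline.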
{
"text": "full transcript text...",
"segments": [
{
"id": 0,
"start": 0.0,
"end": 4.52,
"text": " Hello everyone, welcome back.",
"words": [
{"word": " Hello", "start": 0.0, "end": 0.32, "probability": 0.95},
{"word": " everyone,", "start": 0.32, "end": 0.78, "probability": 0.91},
{"word": " welcome", "start": 0.78, "end": 1.14, "probability": 0.98},
{"word": " back.", "start": 1.14, "end": 1.52, "probability": 0.97}
]
}
]
}
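To show how the structure above can be consumed, here is a minimal sketch that walks `segments[].words[]` and flags words whose `probability` is below 0.5. The helper names are hypothetical, and the sample reuses two words from the example plus one made-up low-confidence word for illustration.

```python
def iter_words(result: dict):
    """Yield every word entry across all segments, in order."""
    for segment in result.get("segments", []):
        # "words" is only present when word timestamps were requested.
        for word in segment.get("words", []):
            yield word


def low_confidence_words(result: dict, threshold: float = 0.5) -> list:
    """Return (text, start, end) for words whose probability is below threshold."""
    return [
        (w["word"].strip(), w["start"], w["end"])
        for w in iter_words(result)
        if w.get("probability", 1.0) < threshold
    ]


sample = {
    "segments": [{
        "id": 0, "start": 0.0, "end": 4.52,
        "words": [
            {"word": " Hello", "start": 0.0, "end": 0.32, "probability": 0.95},
            {"word": " everyone,", "start": 0.32, "end": 0.78, "probability": 0.91},
            # Made-up entry to demonstrate the low-confidence case.
            {"word": " uh", "start": 1.52, "end": 1.70, "probability": 0.31},
        ],
    }]
}
flagged = low_confidence_words(sample)
print(flagged)
```

Note that whisper word strings carry a leading space, so the sketch strips them before reporting.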
- segments[].words[] gives word-level timing when --word_timestamps true is set
- probability indicates confidence (< 0.5 = likely wrong)
YouTube json3 auto-subtitle structure:
{
"events": [
{
"tStartMs": 1230,
"dDurationMs": 5000,
"segs": [
{"utf8": "hello ", "tOffsetMs": 0},
{"utf8": "world ", "tOffsetMs": 200},
{"utf8": "how ", "tOffsetMs": 450},
{"utf8": "are you", "tOffsetMs": 700}
]
}
]
}
For each event and each segment within it:
word_start_ms = event.tStartMs + seg.tOffsetMs
word_start_secs = word_start_ms / 1000.0
word_text = seg.utf8.trim()
Events without segs are line breaks or formatting; skip them.
Events with segs containing only "\n" are newlines; skip them.
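The extraction rules above can be sketched in Python as follows. `json3_words` is a hypothetical helper name. Treating a missing `tOffsetMs` as 0 is an assumption, since the example above always includes the field.

```python
def json3_words(data: dict) -> list:
    """Return (start_secs, text) pairs following the rules described above."""
    words = []
    for event in data.get("events", []):
        segs = event.get("segs")
        if not segs:
            continue  # no segs: line break or formatting event
        if all(seg.get("utf8", "") == "\n" for seg in segs):
            continue  # segs holding only "\n": newline event
        for seg in segs:
            text = seg.get("utf8", "").strip()
            if not text:
                continue  # whitespace-only piece inside a normal event
            # Assumption: a missing tOffsetMs means the word starts with the event.
            start_ms = event.get("tStartMs", 0) + seg.get("tOffsetMs", 0)
            words.append((start_ms / 1000.0, text))
    return words


sample = {
    "events": [
        {"tStartMs": 0, "dDurationMs": 1000},  # formatting event, no segs
        {"tStartMs": 1230, "dDurationMs": 5000, "segs": [
            {"utf8": "hello ", "tOffsetMs": 0},
            {"utf8": "world ", "tOffsetMs": 200},
            {"utf8": "how ", "tOffsetMs": 450},
            {"utf8": "are you", "tOffsetMs": 700},
        ]},
        {"tStartMs": 6230, "dDurationMs": 10, "segs": [{"utf8": "\n"}]},  # newline
    ]
}
words = json3_words(sample)
print(words)
```

The result is a flat, time-ordered word list that can then be grouped into subtitle cues.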