Converts audio and video files into text, subtitles, or captions using OpenAI Whisper CLI. Supports multiple models for accuracy/speed tradeoff, language detection, and output formats (txt/srt/vtt/json). Triggers: user asks to transcribe audio, transcribe video, extract speech from media, generate subtitles, create captions, or build searchable transcripts.
Meeting recordings, interview videos, and webinars contain valuable information locked in audio/video format. Transcription extracts this as searchable text for documentation, accessibility, and downstream processing (summarization, citation, captioning). Whisper CLI provides fast, accurate offline transcription without API rate limits or per-minute costs.
Transcribe video and audio files using OpenAI Whisper CLI.
Whisper must be installed:
pip install openai-whisper
Verify installation:
whisper --help
cd "{baseDir}/Videos"
Standard transcription (recommended defaults):
whisper "filename.mp4" --model medium --language en --output_format txt
Example with full path:
whisper "{baseDir}/Videos/my_video.mp4" --model medium --language en --output_format txt
Whisper writes the transcript to the current working directory, using the source file's base name:

my_video.mp4 → my_video.txt

To verify:
ls -la *.txt
cat "my_video.txt"
Or use the Read tool to view in Claude.
| Task | Command |
|---|---|
| Transcribe to text | whisper "file.mp4" --model medium --language en --output_format txt |
| Generate subtitles | whisper "file.mp4" --model medium --language en --output_format srt |
| Web captions (VTT) | whisper "file.mp4" --model medium --language en --output_format vtt |
| All formats at once | whisper "file.mp4" --model medium --language en --output_format all |
| Model | Speed | Accuracy | VRAM | When to Use |
|---|---|---|---|---|
| tiny | ~1 min/10 min | Lower | ~1GB | Quick test, draft (accuracy sacrifice acceptable) |
| base | ~2 min/10 min | Moderate | ~1GB | Clear speech, low background noise |
| small | ~4 min/10 min | Good | ~2GB | General use, balanced cost/accuracy |
| medium | ~8 min/10 min | High | ~5GB | Default: interviews, meetings, professional content |
| large | ~15 min/10 min | Highest | ~10GB | Accents, poor audio, critical accuracy required |
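The VRAM figures above can drive automatic model selection. A minimal sketch, assuming you already know the available VRAM in GB (the `choose_model` helper name is illustrative; the thresholds come from the tables in this document):

```python
def choose_model(vram_gb: float) -> str:
    """Pick a Whisper model size from available GPU VRAM (GB),
    following the VRAM thresholds documented above."""
    if vram_gb >= 10:
        return "large"   # ~10GB: highest accuracy
    if vram_gb >= 4:
        return "medium"  # ~5GB peak; recommended default down to 4GB
    if vram_gb >= 2:
        return "small"   # ~2GB: balanced fallback
    return "base"        # ~1GB: acceptable for clear speech

print(choose_model(6))  # medium
```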
| Language | Code |
|---|---|
| English | en |
| Dutch | nl |
| German | de |
| French | fr |
| Spanish | es |
| Auto-detect | (omit --language flag) |
cd "{baseDir}/Videos/"
whisper "CEO_Interview.mp4" --model medium --language en --output_format txt
cat "CEO_Interview.txt"
whisper "Presentation_2025.mp4" --model medium --language en --output_format srt
Output: Presentation_2025.srt (can be imported into video editors)
whisper "Teammeeting_Jan.mp4" --model medium --language nl --output_format txt
whisper "3hour_webinar.mp4" --model tiny --language en --output_format txt
(Use tiny for speed when accuracy is less critical)
whisper "International_Panel.mp4" --model large --output_format txt
(Omit --language to auto-detect, use large for best accuracy)
Save to specific directory:
whisper "video.mp4" --model medium --language en --output_format txt --output_dir "/path/to/output/"
Save next to source file:
cd "/path/to/source/folder" && whisper "video.mp4" --model medium --language en --output_format txt
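When scripting batches, it can help to assemble the CLI invocation programmatically instead of hand-writing each command. A sketch using only the flags shown in this document (the `build_cmd` helper is illustrative, not part of Whisper):

```python
from typing import Optional

def build_cmd(media: str, model: str = "medium",
              language: Optional[str] = "en",
              fmt: str = "txt",
              output_dir: Optional[str] = None) -> list[str]:
    """Assemble a whisper CLI invocation as an argument list.
    Pass language=None to let Whisper auto-detect."""
    cmd = ["whisper", media, "--model", model, "--output_format", fmt]
    if language is not None:
        cmd += ["--language", language]  # omit flag entirely for auto-detect
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd

# Execute with subprocess.run(build_cmd(...), check=True) once Whisper is installed.
print(" ".join(build_cmd("video.mp4", output_dir="/path/to/output/")))
```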
| Problem | Solution |
|---|---|
| CUDA out of memory | Use smaller model: --model small or --model base |
| Very slow | Use --model tiny or --model base |
| Poor accuracy | Use --model large and/or specify correct --language |
| Output in wrong folder | Use --output_dir or cd to target folder first |
| Command not found | Run pip install openai-whisper |
For long videos, run in background:
whisper "long_video.mp4" --model medium --language en --output_format txt &
Check if still running:
ps aux | grep whisper
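Instead of grepping `ps`, a script can simply poll for the transcript file that the background run will eventually produce. A sketch, assuming the output path is known in advance (the helper name, timeout, and poll interval are illustrative):

```python
import os
import time

def wait_for_transcript(path: str, timeout: float = 3600,
                        poll: float = 5) -> bool:
    """Poll until the transcript file appears or the timeout expires.
    Returns True if the file exists within the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll)
    return False

# Example: wait_for_transcript("long_video.txt", timeout=2 * 3600)
```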
| Setting | Value | Rationale |
|---|---|---|
| Default Model | medium | Balances accuracy (roughly 90-95% word-level accuracy on clear speech) with reasonable speed (~8 min per 10 min of video) and VRAM (~5GB typical) without extreme overhead |
| Default Language | en | Assumes English content; auto-detect (omitting --language) adds ~5-10% processing overhead but handles mixed-language audio |
| VRAM Threshold - Switch to small | 4GB available | medium requires ~5GB peak; below 4GB, the small model is recommended to avoid OOM |
| VRAM Threshold - Switch to base | 2GB available | Extreme resource constraint; base still provides acceptable accuracy for clear speech |
| Processing Speed Ratio | ~1 min Whisper per 10 min video | Approximate for the medium model on a modern GPU; actual time varies with audio quality, sampling rate, and hardware |
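The per-model speed ratios in the model table above translate into a rough runtime estimate. A sketch (the ratios are the approximate minutes of processing per 10 minutes of media from that table; real times vary widely with hardware):

```python
# Approximate minutes of processing per 10 minutes of media,
# taken from the model comparison table above.
SPEED_RATIO = {"tiny": 1, "base": 2, "small": 4, "medium": 8, "large": 15}

def estimate_minutes(media_minutes: float, model: str = "medium") -> float:
    """Rough transcription time estimate; actual speed depends on
    hardware (CPU vs GPU) and audio quality."""
    return media_minutes * SPEED_RATIO[model] / 10

print(estimate_minutes(180, "tiny"))  # 3-hour webinar on tiny: ~18.0 minutes
```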
whisper --help
# Should print usage info; if "command not found", run: pip install openai-whisper
whisper "{baseDir}/test_sample.mp4" --model tiny --language en --output_format txt
# Rationale: Use tiny model for quick validation (30 seconds max processing)
cat "{baseDir}/test_sample.txt"
# Verify output is readable text with minimal formatting errors
Test each output format on same file:
whisper "{baseDir}/test_sample.mp4" --model small --output_format txt
whisper "{baseDir}/test_sample.mp4" --model small --output_format srt
whisper "{baseDir}/test_sample.mp4" --model small --output_format vtt
# Verify .txt is plain text, .srt has numbered cues with timestamps like 00:00:00,000 --> 00:00:05,000, and .vtt starts with a WEBVTT header
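A quick way to sanity-check the .srt output is to match its cue timing lines. A small sketch (the regex and helper name are illustrative):

```python
import re

# SRT cue timing lines look like: 00:00:00,000 --> 00:00:05,000
SRT_TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")

def looks_like_srt_timing(line: str) -> bool:
    """True if the line is a well-formed SRT cue timing line."""
    return bool(SRT_TIMING.match(line.strip()))

print(looks_like_srt_timing("00:00:00,000 --> 00:00:05,000"))  # True
print(looks_like_srt_timing("WEBVTT"))                         # False
```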
whisper "{baseDir}/multilingual_sample.mp4" --output_format txt
# Omit --language flag to test auto-detection
# Whisper prints the detected language (e.g. "Detected language: English") before transcribing; the json output format also records it
After each transcription: