Single-stage Whisper transcription pipeline: ffmpeg and faster-whisper GPU inference run in one Modal container.

Pipeline code is bundled at ./transcribe.py and ./src/. After installing with npx skills add, it runs from any directory.

Workflow

1. Prepare slug and identify files

Slug = task identifier (volume directory name). Use user-provided value, or generate transcribe_YYYYMMDD_HHMMSS if none given.
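The slug rule above can be sketched as a small helper. This is a minimal illustration, not the bundled pipeline's code: the function name make_slug, the whitespace stripping, and the use of local time are all assumptions.

```python
from datetime import datetime
from typing import Optional


def make_slug(user_value: Optional[str] = None, now: Optional[datetime] = None) -> str:
    """Return the user-provided slug, or a timestamped default.

    Hypothetical helper: the name and the whitespace handling are assumptions.
    """
    if user_value and user_value.strip():
        return user_value.strip()
    # Fall back to transcribe_YYYYMMDD_HHMMSS (local time is an assumption).
    moment = now or datetime.now()
    return moment.strftime("transcribe_%Y%m%d_%H%M%S")
```

Passing `now` explicitly keeps the helper deterministic in tests; in normal use it can be omitted.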

Directory input? Scan for audio/video (.m4a, .mp3, .mp4, .wav, .flac, .ogg, .aac, .mov, .avi), list with index, ask user to confirm selection.

Specific files? Use directly, no listing needed.
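For directory input, the scan-and-list step might look like the sketch below. The extension set comes from the list above; the helper names, the non-recursive scan, and the case-insensitive suffix match are assumptions.

```python
from pathlib import Path
from typing import List

# Extensions taken from the list in this section.
MEDIA_EXTENSIONS = {".m4a", ".mp3", ".mp4", ".wav", ".flac", ".ogg", ".aac", ".mov", ".avi"}


def find_media_files(directory: str) -> List[Path]:
    """Return audio/video files directly inside `directory`, sorted by path.

    Assumption: the scan is non-recursive and matches suffixes case-insensitively.
    """
    root = Path(directory)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )


def format_listing(files: List[Path]) -> str:
    """Render a 1-based indexed listing for the user to confirm."""
    return "\n".join(f"{i}. {p.name}" for i, p in enumerate(files, start=1))
```

The indexed listing is what gets shown to the user before asking which files to transcribe.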

2. Upload to volume

Ensure volume exists (idempotent):
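As a rough sketch of this step, the helper below only builds the Modal CLI invocations (modal volume create, then modal volume put for each file) without running them. The volume name transcribe-data, the remote layout /<slug>/<filename>, and the function name are hypothetical, and how modal volume create behaves when the volume already exists should be checked against your Modal CLI version.

```python
from pathlib import Path
from typing import List

# Hypothetical volume name; substitute the one your setup expects.
VOLUME_NAME = "transcribe-data"


def build_volume_commands(slug: str, files: List[str], volume: str = VOLUME_NAME) -> List[List[str]]:
    """Build (but do not run) the CLI commands for creating the volume and uploading files.

    Assumption: each file is uploaded to /<slug>/<filename> on the volume.
    """
    commands = [["modal", "volume", "create", volume]]
    for local_path in files:
        remote_path = f"/{slug}/{Path(local_path).name}"
        commands.append(["modal", "volume", "put", volume, local_path, remote_path])
    return commands
```

Each returned list can be passed to subprocess.run once the names match your environment.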

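Model Options

RTF is the real-time factor (processing time divided by audio duration) on an L4 GPU; lower is faster.

Model	RTF (L4)	Memory	Accuracy
tiny	~0.03x	~1GB	Low
base	~0.06x	~1GB	Medium
small	~0.09x	~2GB	Good
medium	~0.13x	~5GB	High
large-v3	~0.19x	~6GB	Highest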
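The RTF column can be turned into a rough processing-time estimate by multiplying it by the audio duration. The sketch below uses the approximate values from the table; the function name is hypothetical, and the estimate ignores container startup, model loading, and upload time.

```python
# Approximate real-time factors on an L4 GPU, copied from the table above.
RTF_ON_L4 = {
    "tiny": 0.03,
    "base": 0.06,
    "small": 0.09,
    "medium": 0.13,
    "large-v3": 0.19,
}


def estimate_processing_seconds(model: str, audio_seconds: float) -> float:
    """Estimate inference time as RTF x audio duration (inference only)."""
    return RTF_ON_L4[model] * audio_seconds


# One hour of audio with large-v3: 0.19 * 3600 s, roughly 11.4 minutes.
one_hour_large = estimate_processing_seconds("large-v3", 3600)
```

By the same arithmetic, one hour of audio with tiny comes to about 108 seconds.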

Speech Transcribe

Workflow

1. Prepare slug and identify files

2. Upload to volume

3. Run pipeline

4. Download results

5. Clean up

6. Report

Setup

Model Options

Error Handling
