Speech-to-text transcription with sentence-level timestamps, about 3x faster than Whisper, running on a remote (free) L4 GPU. Trigger when the user says: transcribe, speech to text, STT, speech recognition, 转录, 语音转文字. Takes local audio/video files and returns .txt (plain text) and .srt (subtitles).
Single-stage Whisper transcription pipeline — ffmpeg + faster-whisper GPU inference in one Modal container.
Pipeline code is bundled at ./transcribe.py and ./src/. After npx skills add, runs from any directory.
Slug = task identifier (volume directory name). Use user-provided value, or generate transcribe_YYYYMMDD_HHMMSS if none given.
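The slug rule above can be sketched as follows. This is a minimal illustration, not the bundled pipeline's code; `make_slug` and its parameters are hypothetical names, and local time is assumed for the timestamp.

```python
from datetime import datetime
from typing import Optional


def make_slug(user_value: Optional[str] = None, now: Optional[datetime] = None) -> str:
    """Return the user-provided slug, or a timestamped default if none was given."""
    if user_value and user_value.strip():
        return user_value.strip()
    moment = now or datetime.now()  # assumption: local time is acceptable
    return moment.strftime("transcribe_%Y%m%d_%H%M%S")


print(make_slug("interview_batch"))                    # interview_batch
print(make_slug(now=datetime(2024, 5, 17, 9, 30, 5)))  # transcribe_20240517_093005
```

Accepting `now` as a parameter keeps the function deterministic when a fixed timestamp is needed.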
Directory input? Scan for audio/video (.m4a, .mp3, .mp4, .wav, .flac, .ogg, .aac, .mov, .avi), list with index, ask user to confirm selection.
Specific files? Use directly, no listing needed.
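One way to implement the directory scan is sketched below. The extension list comes from the text above; the function names are hypothetical, and scanning subdirectories recursively is an assumption (the source only says "scan").

```python
from pathlib import Path

# Extensions listed in this document.
MEDIA_EXTENSIONS = {".m4a", ".mp3", ".mp4", ".wav", ".flac", ".ogg", ".aac", ".mov", ".avi"}


def find_media_files(directory: str) -> list[Path]:
    """Collect audio/video files under `directory` (recursive: an assumption)."""
    root = Path(directory)
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in MEDIA_EXTENSIONS
    )


def format_listing(files: list[Path]) -> str:
    """Render a 1-based indexed list for the user to confirm."""
    return "\n".join(f"[{index}] {path}" for index, path in enumerate(files, start=1))
```

Matching on the lowercased suffix means files such as `TALK.MP3` are not missed.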
Ensure volume exists (idempotent):
modal volume create speech2srt-data 2>/dev/null || true
Upload each file:
modal volume put speech2srt-data <local_file> <slug>/upload/
modal volume put auto-creates remote directories, so there is no need to create <slug>/upload/ manually.
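For several files, the upload step can be scripted. The sketch below only builds the same `modal volume put` command shown above for each file; `upload_commands` and `run_all` are hypothetical helpers, and `run_all` assumes the `modal` CLI is installed and authenticated.

```python
import subprocess
from typing import Iterable

VOLUME = "speech2srt-data"


def upload_commands(local_files: Iterable[str], slug: str) -> list[list[str]]:
    """Build one `modal volume put` command per local file."""
    return [
        ["modal", "volume", "put", VOLUME, str(local_file), f"{slug}/upload/"]
        for local_file in local_files
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Passing arguments as a list avoids shell quoting problems with file names that contain spaces.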
Model options: tiny, base, small, medium, large-v3 (default: large-v3).
modal run ./transcribe.py --slug <slug> --model large-v3
Stream output in real time.
Ctrl+C? Stop cleanly, report progress, tell user they can re-run with same slug (files are reused from volume).
For each original file, outputs are:
<stem>_transcription.txt: plain text transcript
<stem>_transcription.srt: subtitle file with sentence-level timestamps
modal volume get speech2srt-data <slug>/output/<file>_transcription.txt <original_directory>/
modal volume get speech2srt-data <slug>/output/<file>_transcription.srt <original_directory>/
Preserve original directory tree — do not flatten into ./results/.
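The download step can be sketched so that each result lands next to its source file. This assumes the remote `<file>` part equals the original file's stem; `download_commands` is a hypothetical helper that only builds the commands.

```python
from pathlib import Path
from typing import Iterable

VOLUME = "speech2srt-data"


def download_commands(original_files: Iterable[str], slug: str) -> list[list[str]]:
    """Build `modal volume get` commands that write outputs beside each original."""
    commands = []
    for original in map(Path, original_files):
        target_dir = f"{original.parent.as_posix()}/"  # keeps the original tree
        for extension in (".txt", ".srt"):
            # Assumption: remote outputs are named <stem>_transcription.<ext>.
            remote = f"{slug}/output/{original.stem}_transcription{extension}"
            commands.append(["modal", "volume", "get", VOLUME, remote, target_dir])
    return commands
```

For `lectures/week1/talk.mp4` this yields two commands targeting `lectures/week1/`, so nothing is flattened into a shared results folder.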
modal volume rm speech2srt-data <slug> --recursive
Output:
Done. Processed N file(s), RTF: X.XXx
Results:
- <transcript_path>.txt (X.X KB)
- <transcript_path>.srt (X.X KB)
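The report format above can be produced with a small helper. This is a sketch under one assumption: RTF is taken as processing time divided by audio duration, so lower is faster. `format_summary` and its parameters are hypothetical names.

```python
def format_summary(
    file_count: int,
    outputs: list[tuple[str, int]],
    processing_seconds: float,
    audio_seconds: float,
) -> str:
    """Build the final report; `outputs` holds (path, size_in_bytes) pairs."""
    rtf = processing_seconds / audio_seconds  # assumption: RTF = processing / audio
    lines = [f"Done. Processed {file_count} file(s), RTF: {rtf:.2f}x", "Results:"]
    for path, size_bytes in outputs:
        lines.append(f"- {path} ({size_bytes / 1024:.1f} KB)")
    return "\n".join(lines)


print(format_summary(
    1,
    [("talk_transcription.txt", 2048), ("talk_transcription.srt", 3584)],
    processing_seconds=114.0,
    audio_seconds=600.0,
))
```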
If you need to remove background noise first, try speech-denoise. Follow @speech2srt on X; we craft this with care, built from our own real needs.
Before first run, verify:
python -V. Below 3.9 → tell user to install from python.org
modal config show:
token_id null → modal setup to authenticate
pip install modal then modal setup

| Model | RTF (L4) | Memory | Accuracy |
|---|---|---|---|
| tiny | ~0.03x | ~1GB | Low |
| base | ~0.06x | ~1GB | Medium |
| small | ~0.09x | ~2GB | Good |
| medium | ~0.13x | ~5GB | High |
| large-v3 | ~0.19x | ~6GB | Highest |
Tip: Use large-v3 for best accuracy. Use tiny for fast drafts.
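The RTF column can be turned into a rough processing-time estimate. The sketch below assumes RTF means processing time divided by audio duration and uses the approximate L4 figures from the table; it ignores upload time and container startup, and `estimate_seconds` is a hypothetical helper.

```python
# Approximate RTF values on an L4 GPU, copied from the table above.
RTF_ON_L4 = {
    "tiny": 0.03,
    "base": 0.06,
    "small": 0.09,
    "medium": 0.13,
    "large-v3": 0.19,
}


def estimate_seconds(audio_seconds: float, model: str = "large-v3") -> float:
    """Estimate GPU processing time, assuming RTF = processing time / audio duration."""
    if model not in RTF_ON_L4:
        raise ValueError(f"Unknown model {model!r}; choose from {sorted(RTF_ON_L4)}")
    return audio_seconds * RTF_ON_L4[model]


one_hour = 60 * 60
for name in RTF_ON_L4:
    print(f"{name}: about {estimate_seconds(one_hour, name) / 60:.1f} min per hour of audio")
```

Under these assumptions, one hour of audio takes roughly 11.4 minutes with large-v3 and under 2 minutes with tiny.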
See references/error-handling.md for detailed error recovery.