Speech-to-text transcription using multiple engines (Whisper, Google Speech, Azure, AssemblyAI). Record audio, transcribe files, real-time transcription, speaker diarization, timestamps, and multi-language support. Use for meeting transcription, voice notes, audio file processing, or accessibility features.
Comprehensive speech-to-text capabilities using multiple STT engines. Record audio, transcribe files, real-time processing, speaker identification, and multi-language support.
When asked to transcribe audio, first install the dependencies for the chosen engine:
Core (required):
pip install sounddevice soundfile numpy --break-system-packages
Whisper (OpenAI - local, free):
pip install openai-whisper --break-system-packages
# For faster processing with GPU:
pip install openai-whisper torch --break-system-packages
Google Speech (requires API key):
pip install google-cloud-speech --break-system-packages
Azure Speech (requires API key):
pip install azure-cognitiveservices-speech --break-system-packages
AssemblyAI (requires API key):
pip install assemblyai --break-system-packages
Optional enhancements:
pip install pydub webrtcvad --break-system-packages # Audio processing
pip install pyaudio --break-system-packages # Alternative audio backend
See reference/setup-guide.md for detailed installation.
| Engine | Cost | Speed | Quality | Features | Best For |
|---|---|---|---|---|---|
| Whisper | Free | Medium | High | Multilingual, local | Privacy, offline, free |
| Google | Pay-per-use | Fast | High | Punctuation, diarization | Real-time, accuracy |
| Azure | Pay-per-use | Fast | High | Translation, custom | Enterprise integration |
| AssemblyAI | Pay-per-use | Medium | Very High | Diarization, sentiment | Analysis, insights |
Simple recording:
# Record 30 seconds
python scripts/record_audio.py --duration 30 --output recording.wav
# Record until stopped (Ctrl+C)
python scripts/record_audio.py --output recording.wav
# Record with voice activity detection
python scripts/record_audio.py --vad --output recording.wav
Advanced recording:
# Choose microphone
python scripts/list_devices.py # List available mics
python scripts/record_audio.py --device 1 --output recording.wav
# Specify quality
python scripts/record_audio.py \
--sample-rate 48000 \
--channels 2 \
--output recording.wav
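Before transcribing, it can be worth confirming the recording actually has the requested format. A dependency-free check using the stdlib `wave` module (the helper name is illustrative, not one of the skill's scripts):

```python
import wave

def wav_properties(path):
    """Return (sample_rate, channels, duration_seconds) of a WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration = w.getnframes() / rate
    return rate, channels, duration
```

After `record_audio.py --sample-rate 48000 --channels 2`, this should report a 48000 Hz stereo file.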
Using Whisper (local, free):
# Basic transcription
python scripts/transcribe_whisper.py --file recording.wav
# Choose model size (tiny, base, small, medium, large)
python scripts/transcribe_whisper.py \
--file recording.wav \
--model medium
# With timestamps
python scripts/transcribe_whisper.py \
--file recording.wav \
--timestamps \
--output transcript.json
# Multiple languages
python scripts/transcribe_whisper.py \
--file recording.wav \
--language es # Spanish
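Internally, `transcribe_whisper.py` presumably wraps the openai-whisper Python API. A minimal sketch (the function name and defaults below are illustrative, not the script's actual interface):

```python
def transcribe_file(path, model_size="base", language=None, task="transcribe"):
    """Minimal Whisper transcription sketch (requires: pip install openai-whisper)."""
    import whisper  # imported lazily so this module loads without the package

    model = whisper.load_model(model_size)  # tiny / base / small / medium / large
    result = model.transcribe(path, language=language, task=task)
    # result["text"] holds the full transcript; result["segments"] carries
    # per-segment start/end timestamps and text.
    return result
```

Passing `task="translate"` gives the translate-to-English behavior used later in this document.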
Using Google Cloud:
# Export API key
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
# Transcribe
python scripts/transcribe_google.py \
--file recording.wav \
--language en-US
# With speaker diarization
python scripts/transcribe_google.py \
--file recording.wav \
--diarization \
--speakers 2
Using Azure:
# Set credentials
export AZURE_SPEECH_KEY="your-key"
export AZURE_SPEECH_REGION="westus"
# Transcribe
python scripts/transcribe_azure.py --file recording.wav
# Real-time
python scripts/transcribe_azure_realtime.py --microphone
Using AssemblyAI:
# Set API key
export ASSEMBLYAI_API_KEY="your-key"
# Transcribe with features
python scripts/transcribe_assemblyai.py \
--file recording.wav \
--diarization \
--sentiment \
--topics
Stream from microphone:
# Whisper streaming (chunked)
python scripts/stream_whisper.py
# Google streaming
python scripts/stream_google.py
# Azure continuous recognition
python scripts/stream_azure.py
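Whisper has no true streaming mode, so chunked streaming (as in `stream_whisper.py`, presumably) amounts to buffering microphone samples into fixed-length windows and transcribing each window as it fills. The windowing itself is plain Python (a sketch; chunk length is an assumption):

```python
def chunk_stream(samples, chunk_seconds=5, sample_rate=16000):
    """Yield successive fixed-length windows from a flat sample sequence."""
    size = chunk_seconds * sample_rate
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        if chunk:
            yield chunk
```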
Plain text:
python scripts/transcribe_whisper.py --file audio.wav --output transcript.txt
JSON with metadata:
python scripts/transcribe_whisper.py \
--file audio.wav \
--format json \
--output transcript.json
# Output includes:
# - Text segments
# - Timestamps
# - Confidence scores
# - Language detection
SRT subtitles:
python scripts/transcribe_whisper.py \
--file video.mp4 \
--format srt \
--output subtitles.srt
VTT subtitles:
python scripts/transcribe_whisper.py \
--file video.mp4 \
--format vtt \
--output subtitles.vtt
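Both subtitle formats can be generated straight from the JSON segments. A minimal sketch of the timestamp math (SRT separates milliseconds with a comma, VTT with a dot; helper names are illustrative):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 3.5 -> "00:00:03,500"."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render whisper-style segments ({start, end, text}) as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```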
Scenario: Record and transcribe meeting with speaker labels
# 1. Record meeting
python scripts/record_audio.py \
--output meeting.wav \
--vad # Stop on silence
# 2. Transcribe with speaker diarization
python scripts/transcribe_google.py \
--file meeting.wav \
--diarization \
--speakers 4 \
--output meeting.json
# 3. Format for readability
python scripts/format_transcript.py \
--input meeting.json \
--format markdown \
--output meeting.md
# Result: Formatted transcript with speaker labels and timestamps
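The formatting step can be pictured as a simple render over diarized segments, merging consecutive turns by the same speaker. The field names below (`speaker`, `start`, `text`) are assumptions about `format_transcript.py`'s input, not its documented schema:

```python
def to_markdown(segments):
    """Render diarized segments as markdown, grouping consecutive
    turns by the same speaker (segment fields are assumed)."""
    lines, last_speaker = [], None
    for seg in segments:
        speaker = seg.get("speaker", "Speaker")
        if speaker != last_speaker:
            lines.append(f"\n**{speaker}** ({seg['start']:.1f}s):")
            last_speaker = speaker
        lines.append(seg["text"].strip())
    return "\n".join(lines).strip()
```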
Scenario: Quick voice note → markdown document
# Record voice note
python scripts/quick_note.py
# (Records audio, transcribes with Whisper, saves as markdown)
# Output: voice-note-2025-01-20-14-30.md
Scenario: Transcribe multiple audio files
# Batch process folder
python scripts/batch_transcribe.py \
--input ./recordings/ \
--output ./transcripts/ \
--engine whisper \
--model base
# Progress shown for each file
Scenario: Generate subtitles for video
# Extract audio from video
python scripts/extract_audio.py --video lecture.mp4 --output audio.wav
# Generate subtitles
python scripts/transcribe_whisper.py \
--file audio.wav \
--format srt \
--output lecture.srt
# Embed in video (requires ffmpeg)
python scripts/embed_subtitles.py \
--video lecture.mp4 \
--subtitles lecture.srt \
--output lecture-subbed.mp4
Scenario: Transcribe and translate
# Transcribe Spanish audio
python scripts/transcribe_whisper.py \
--file spanish-audio.wav \
--language es \
--output transcript-es.txt
# Translate to English
python scripts/transcribe_whisper.py \
--file spanish-audio.wav \
--task translate \
--output transcript-en.txt
| Model | Parameters | Size | Relative speed | VRAM | Accuracy |
|---|---|---|---|---|---|
| tiny | 39M | ~75MB | ~32x | ~1GB | Good |
| base | 74M | ~142MB | ~16x | ~1GB | Better |
| small | 244M | ~466MB | ~6x | ~2GB | Great |
| medium | 769M | ~1.5GB | ~2x | ~5GB | Excellent |
| large | 1550M | ~2.9GB | 1x | ~10GB | Best |
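Model choice can be automated from the VRAM column above. A small helper (the numbers mirror the table and are approximate; the function is illustrative, not one of the skill's scripts):

```python
# Approximate VRAM requirements from the table above, in GB
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
ORDER = ["tiny", "base", "small", "medium", "large"]

def largest_model_for(vram_gb):
    """Pick the most accurate Whisper model that fits in the given VRAM."""
    fitting = [m for m in ORDER if VRAM_GB[m] <= vram_gb]
    return fitting[-1] if fitting else None
```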
Recommendation:
- tiny or base (fast, good enough)
- small or medium (balanced)
- large (best accuracy, slower)
- medium or large
- tiny or base

Whisper supports 99+ languages:
# Common languages
en # English
es # Spanish
fr # French
de # German
it # Italian
pt # Portuguese
nl # Dutch
pl # Polish
ru # Russian
ja # Japanese
ko # Korean
zh # Chinese
ar # Arabic
hi # Hindi
Full list: reference/language-codes.md
Identify who said what:
# Google (best diarization)
python scripts/transcribe_google.py \
--file meeting.wav \
--diarization \
--speakers 3 # Hint: 3 speakers expected
# AssemblyAI
python scripts/transcribe_assemblyai.py \
--file meeting.wav \
--diarization
# Output format:
# Speaker 1: Hello everyone, let's begin
# Speaker 2: Thanks for joining
# Speaker 1: Today's agenda includes...
Post-process with names:
python scripts/label_speakers.py \
--transcript meeting.json \
--labels "Alice,Bob,Charlie" \
--output meeting-labeled.txt
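A sketch of the labeling step, assuming the generic "Speaker N" tags shown above. The real script reads JSON, presumably; this operates on plain text for brevity:

```python
def label_speakers(text, labels):
    """Replace generic "Speaker N" tags with real names, in order.

    `labels` is a comma-separated string as passed on the command line,
    e.g. "Alice,Bob,Charlie" maps Speaker 1 -> Alice, Speaker 2 -> Bob, ...
    """
    for i, name in enumerate(labels.split(","), start=1):
        text = text.replace(f"Speaker {i}", name.strip())
    return text
```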
Enhance audio quality:
# Reduce noise
python scripts/denoise_audio.py \
--input noisy.wav \
--output clean.wav
# Normalize volume
python scripts/normalize_audio.py \
--input quiet.wav \
--output normalized.wav
# Convert format
python scripts/convert_audio.py \
--input audio.m4a \
--output audio.wav
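Peak normalization itself needs only the standard library. A rough sketch of what `normalize_audio.py` might do for 16-bit PCM WAV (the function and `peak` default are assumptions):

```python
import array
import wave

def normalize_wav(in_path, out_path, peak=0.9):
    """Scale 16-bit PCM samples so the loudest one hits `peak` of full scale."""
    with wave.open(in_path, "rb") as w:
        params = w.getparams()
        assert params.sampwidth == 2, "sketch handles 16-bit PCM only"
        samples = array.array("h", w.readframes(params.nframes))
    loudest = max(1, max(abs(s) for s in samples))
    gain = peak * 32767 / loudest
    scaled = array.array(
        "h", (int(max(-32768, min(32767, s * gain))) for s in samples)
    )
    with wave.open(out_path, "wb") as w:
        w.setparams(params)
        w.writeframes(scaled.tobytes())
```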
Transcript with timestamps:
{
"segments": [
{
"start": 0.0,
"end": 3.5,
"text": "Welcome to today's meeting.",
"confidence": 0.95
},
{
"start": 3.5,
"end": 7.2,
"text": "Let's review the quarterly results.",
"confidence": 0.92
}
]
}
Search by timestamp:
# Find text at specific time
python scripts/find_at_time.py \
--transcript meeting.json \
--time "5:30" # 5 minutes 30 seconds
# Extract time range
python scripts/extract_range.py \
--transcript meeting.json \
--start "2:00" \
--end "5:00" \
--output excerpt.txt
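Both utilities hinge on turning "M:SS" / "H:MM:SS" strings into seconds before searching the segment list. A sketch (function names are illustrative):

```python
def parse_timestamp(ts):
    """Convert "5:30" or "1:05:30" into seconds (330 and 3930 respectively)."""
    seconds = 0
    for part in ts.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def segments_at(segments, ts):
    """Return the segments whose [start, end] span covers the given time."""
    t = parse_timestamp(ts)
    return [s for s in segments if s["start"] <= t <= s["end"]]
```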
Per hour of audio:
Free tiers:
Recording:
- record_audio.py - Record from microphone
- list_devices.py - List audio devices
- test_microphone.py - Test mic input

Transcription:
- transcribe_whisper.py - Whisper transcription
- transcribe_google.py - Google Cloud STT
- transcribe_azure.py - Azure Speech
- transcribe_assemblyai.py - AssemblyAI

Real-time:
- stream_whisper.py - Whisper streaming
- stream_google.py - Google streaming
- stream_azure.py - Azure continuous

Processing:
- batch_transcribe.py - Batch processing
- format_transcript.py - Format output
- extract_audio.py - Extract from video
- denoise_audio.py - Noise reduction

Utilities:
- quick_note.py - Record + transcribe
- label_speakers.py - Add speaker names
- find_at_time.py - Search by timestamp
- convert_audio.py - Format conversion

"No module named 'whisper'"
pip install openai-whisper --break-system-packages
"Microphone not working"
# List devices
python scripts/list_devices.py
# Test specific device
python scripts/test_microphone.py --device 1
"Out of memory" (Whisper)
# Use smaller model
python scripts/transcribe_whisper.py --file audio.wav --model tiny
# Or process in chunks
python scripts/transcribe_chunked.py --file large-audio.wav
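Chunked processing amounts to splitting the file into overlapping windows so each fits in memory. The window arithmetic (chunk and overlap sizes below are illustrative, not `transcribe_chunked.py`'s actual defaults):

```python
def chunk_ranges(duration, chunk=600.0, overlap=5.0):
    """Split `duration` seconds into (start, end) windows with a small
    overlap so words at the boundaries are not lost."""
    ranges, start = [], 0.0
    while start < duration:
        end = min(start + chunk, duration)
        ranges.append((start, end))
        if end >= duration:
            break
        start = end - overlap
    return ranges
```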
"Poor transcription quality"
# Clean up the audio first, then retry with a larger model
python scripts/denoise_audio.py --input audio.wav --output clean.wav
python scripts/transcribe_whisper.py --file clean.wav --model medium
# If the language is misdetected, set it explicitly with --language
"API authentication failed"
# Google
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
# Azure
export AZURE_SPEECH_KEY="your-key"
export AZURE_SPEECH_REGION="region"
# AssemblyAI
export ASSEMBLYAI_API_KEY="your-key"
See examples/ for complete workflows: