Local speech-to-text using NVIDIA Parakeet TDT (NeMo). 600M-param multilingual ASR with automatic punctuation/capitalization, word-level timestamps, and ~3380x realtime speed on GPU. Supports 25 European languages with auto-detection, long-form audio up to 3 hours, streaming output, SRT/VTT subtitles, batch processing, and URL/YouTube input.
Local speech-to-text using NVIDIA's Parakeet TDT models via the NeMo toolkit. The default parakeet-tdt-0.6b-v3 is a 600-million-parameter multilingual ASR model that delivers state-of-the-art accuracy with automatic punctuation and capitalization, word-level timestamps, and a remarkable ~3380× realtime inference speed on GPU.
Only needs ~2GB VRAM to run — plenty of room on an RTX 3070 (8GB).
Use this skill when you need:
- Long-form audio transcription (`--long-form`)
- Live/progressive output (`--streaming`)

Trigger phrases: "transcribe with parakeet", "nemo transcribe", "nvidia speech to text", "parakeet transcribe", "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video", "transcribe in [language]", "European language transcription", "best accuracy transcription", "high accuracy speech to text", "transcribe long audio", "transcribe lecture", "transcribe meeting", "word timestamps"
| Feature | Parakeet | faster-whisper |
|---|---|---|
| Accuracy | ✅ Best (6.34% avg WER) | Good (distil: 7.08% WER) |
| Speed | ✅ ~3380× realtime | ~20× realtime |
| Auto punctuation | ✅ Built-in | ❌ Requires post-processing |
| Languages | 25 European | ✅ 99+ worldwide |
| Diarization | ❌ Not supported | ✅ pyannote speaker ID |
| Chapters/search | ❌ Not supported | ✅ Chapter detection, search |
| Output formats | text, JSON, SRT, VTT | ✅ text, JSON, SRT, VTT, ASS, LRC, TTML, CSV, TSV, HTML |
| Translation | ❌ Not supported | ✅ Any language → English |
| VRAM usage | ~2GB | ~1.5GB (distil) |
| Long audio | ✅ Up to 3 hours | Limited by VRAM |
| Streaming | ✅ Chunked inference | ✅ Segment streaming |
| Noise handling | Good (built-in robustness) | ✅ --denoise, --normalize |
| Filler removal | ❌ | ✅ --clean-filler |
Rule of thumb: use Parakeet for best accuracy and speed on European languages; use faster-whisper when you need diarization, translation, filler removal, or languages outside the 25 supported.
⚠️ Agent guidance — keep invocations minimal:
CORE RULE: default command (./scripts/transcribe audio.wav) is the fastest path — add flags only when the user explicitly asks for that capability.
Transcription:
- `--timestamps` if the user asks for word-level or segment-level timestamps (auto-enabled for srt, vtt, json formats)
- `--format srt`/`vtt` if the user asks for subtitles/captions in that format
- `--format json` if the user wants structured/programmatic output
- `--long-form` if the audio is longer than 24 minutes
- `--streaming` if the user wants live/progressive output for long files
- `-l`/`--language CODE` if the user specifies a language (auto-detection is usually fine)
- `--model` if the user wants a specific model variant (v2, 1.1b)
- `--device cpu` if GPU is not available or the user requests CPU
- `--batch-size N` if the user reports OOM errors
- `--max-words-per-line` or `--max-chars-per-line` for subtitle readability on long segments

Batch processing:
- `--skip-existing` when resuming interrupted batch jobs
- `-o <dir>` (directory, not file)

Output format for agent relay:
- Write output to a file with `-o`; tell the user the output path, never paste raw subtitle content

When NOT to use:
- `--diarize` (speaker diarization is not supported; use faster-whisper)
- `--translate` (translation is not supported; use faster-whisper)

| Task | Command | Notes |
|---|---|---|
| Basic transcription | ./scripts/transcribe audio.wav | Auto-punctuated, GPU-accelerated |
| Transcribe MP3 | ./scripts/transcribe audio.mp3 | Auto-converts via ffmpeg |
| SRT subtitles | ./scripts/transcribe audio.wav --format srt -o subs.srt | Timestamps auto-enabled |
| VTT subtitles | ./scripts/transcribe audio.wav --format vtt -o subs.vtt | WebVTT format |
| Word timestamps | ./scripts/transcribe audio.wav --timestamps --format json -o out.json | Word/segment/char level |
| JSON output | ./scripts/transcribe audio.wav --format json -o result.json | Full metadata + timestamps |
| Long audio (>24 min) | ./scripts/transcribe lecture.wav --long-form | Local attention, up to ~3 hours |
| Streaming output | ./scripts/transcribe audio.wav --streaming | Print segments as transcribed |
| Specify language | ./scripts/transcribe audio.wav -l fr | Validate language (v3 auto-detects) |
| YouTube/URL | ./scripts/transcribe https://youtube.com/watch?v=... | Auto-downloads via yt-dlp |
| Batch process | ./scripts/transcribe *.wav -o ./transcripts/ | Output to directory |
| Batch with skip | ./scripts/transcribe *.wav --skip-existing -o ./out/ | Resume interrupted batches |
| English-only model | ./scripts/transcribe audio.wav -m nvidia/parakeet-tdt-0.6b-v2 | v2 English-only |
| Larger model | ./scripts/transcribe audio.wav -m nvidia/parakeet-tdt-1.1b | 1.1B param English model |
| CPU mode | ./scripts/transcribe audio.wav --device cpu | If no GPU available |
| Quiet mode | ./scripts/transcribe audio.wav -q | Suppress progress messages |
| Subtitle word wrapping | ./scripts/transcribe audio.wav --format srt --max-words-per-line 8 -o subs.srt | Split long subtitle cues |
| Char-based wrapping | ./scripts/transcribe audio.wav --format srt --max-chars-per-line 42 -o subs.srt | Character limit per line |
| Show version | ./scripts/transcribe --version | Print NeMo version |
| System check | ./setup.sh --check | Verify GPU, Python, NeMo, ffmpeg |
| Upgrade NeMo | ./setup.sh --update | Upgrade without full reinstall |
| Model | Params | Languages | Speed | Use Case |
|---|---|---|---|---|
| nvidia/parakeet-tdt-0.6b-v3 | 600M | 25 EU languages | ~3380× RT | Default, best multilingual |
| nvidia/parakeet-tdt-0.6b-v2 | 600M | English only | ~3380× RT | English-only, slightly better English WER |
| nvidia/parakeet-tdt-1.1b | 1.1B | English only | Slower | Maximum English accuracy |
Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Ukrainian (uk)
Language is auto-detected — no prompting or configuration needed. The model identifies the language from the audio content automatically.
| Benchmark | WER |
|---|---|
| Open ASR Leaderboard (avg) | 6.34% |
| LibriSpeech test-clean | 1.93% |
| LibriSpeech test-other | 3.59% |
| AMI | 11.31% |
| GigaSpeech | 9.59% |
| SPGI Speech | 3.97% |
| TED-LIUM v3 | 2.75% |
| VoxPopuli | 6.14% |
# Full install (creates venv, installs NeMo + CUDA PyTorch)
./setup.sh
# Verify installation
./setup.sh --check
# Upgrade NeMo toolkit
./setup.sh --update
Requirements:
| Platform | Acceleration | Notes |
|---|---|---|
| Linux + NVIDIA GPU | CUDA | ~3380× realtime 🚀 |
| WSL2 + NVIDIA GPU | CUDA | ~3380× realtime 🚀 |
| Linux (no GPU) | CPU | Functional but slow |
| macOS | ❌ | NeMo has limited macOS support |
VRAM: The 600M model only needs ~2GB VRAM to load. An RTX 3070 (8GB) has plenty of headroom.
The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available.
If setup didn't detect your GPU, manually install PyTorch with CUDA:
# For CUDA 12.x
uv pip install --python ./venv/bin/python torch torchaudio --index-url https://download.pytorch.org/whl/cu121
# Basic transcription (auto-punctuated, auto-capitalized)
./scripts/transcribe audio.wav
# Transcribe an MP3 (auto-converts to WAV via ffmpeg)
./scripts/transcribe recording.mp3
# SRT subtitles
./scripts/transcribe audio.wav --format srt -o subtitles.srt
# WebVTT subtitles
./scripts/transcribe audio.wav --format vtt -o subtitles.vtt
# Full JSON with word/segment/char timestamps
./scripts/transcribe audio.wav --timestamps --format json -o result.json
# Transcribe from YouTube URL
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ
# Long lecture (>24 min, up to 3 hours)
./scripts/transcribe lecture.wav --long-form
# Streaming mode (print segments as they're transcribed)
./scripts/transcribe audio.wav --streaming
# Batch process a directory
./scripts/transcribe ./recordings/ -o ./transcripts/
# Batch with glob, skip already-done files
./scripts/transcribe *.wav --skip-existing -o ./transcripts/
# Use English-only v2 model
./scripts/transcribe audio.wav --model nvidia/parakeet-tdt-0.6b-v2
# JSON output with metadata
./scripts/transcribe audio.wav --format json -o result.json
# Specify expected language (validation only; v3 auto-detects)
./scripts/transcribe audio.wav --language fr
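When scripting these invocations from Python, a small command builder keeps flags consistent. This is a sketch only: it just assembles the argument list using the flags documented above, and assumes `./scripts/transcribe` is run from the repo root.

```python
import shlex

def build_transcribe_cmd(path, fmt="text", output=None, extra_flags=()):
    """Assemble a ./scripts/transcribe invocation as an argument list.

    Illustrative only; pass the result to subprocess.run(cmd, check=True)
    to actually execute it.
    """
    cmd = ["./scripts/transcribe", path, "--format", fmt]
    cmd.extend(extra_flags)
    if output is not None:
        cmd.extend(["-o", output])
    return cmd

cmd = build_transcribe_cmd("lecture.wav", fmt="srt",
                           output="subs.srt", extra_flags=["--long-form"])
print(shlex.join(cmd))
# ./scripts/transcribe lecture.wav --format srt --long-form -o subs.srt
```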
Input:
AUDIO Audio file(s), directory, glob pattern, or URL
Native: .wav, .flac (16kHz mono preferred)
Converts via ffmpeg: .mp3, .m4a, .mp4, .mkv, .ogg, .webm, .aac, .wma, .avi, .opus
URLs auto-download via yt-dlp (YouTube, direct links, etc.)
Model & Language:
-m, --model NAME NeMo ASR model (default: nvidia/parakeet-tdt-0.6b-v3)
Also: nvidia/parakeet-tdt-0.6b-v2 (English), nvidia/parakeet-tdt-1.1b (larger)
-l, --language CODE Expected language code, e.g. en, es, fr (v3 auto-detects if omitted)
Used for validation; does not force the model
Output Format:
-f, --format FMT text | json | srt | vtt (default: text)
--timestamps Enable word/segment/char timestamps (auto-enabled for srt, vtt, json)
--max-words-per-line N For SRT/VTT, split segments into sub-cues of at most N words
--max-chars-per-line N For SRT/VTT, split lines so each fits within N characters
Takes priority over --max-words-per-line when both are set
-o, --output PATH Output file or directory (directory for batch mode)
Long-Form & Streaming:
--long-form Enable local attention for audio >24 min (up to ~3 hours)
Changes attention model to rel_pos_local_attn with [256,256] context
--streaming Print segments as they are transcribed (chunked inference)
Inference:
--batch-size N Inference batch size (default: 32; reduce if OOM)
Batch Processing:
--skip-existing Skip files whose output already exists (batch mode)
Device:
--device DEV auto | cpu | cuda (default: auto)
-q, --quiet Suppress progress and status messages
Utility:
--version Print version info and exit
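The subtitle line-wrapping options can be pictured as greedy word packing: fill a line until adding the next word would exceed the limit, then start a new line. The sketch below illustrates `--max-chars-per-line` behavior under that assumption; it is not the tool's actual implementation.

```python
def wrap_words(words, max_chars):
    """Greedily pack words into lines of at most max_chars characters.

    Illustrative sketch of --max-chars-per-line; a single word longer
    than the limit still gets its own line.
    """
    lines, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and len(candidate) > max_chars:
            lines.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines

print(wrap_words("Hello, welcome to the meeting.".split(), 15))
# ['Hello, welcome', 'to the meeting.']
```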
Clean, automatically punctuated and capitalized transcript:
Hello, welcome to the meeting. Today we'll discuss the quarterly results
and our plans for next quarter.
JSON (`--format json`)

Full metadata including segments, word/char timestamps, and performance stats:
{
"file": "audio.wav",
"text": "Hello, welcome to the meeting.",
"duration": 600.5,
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Hello, welcome to the meeting.",
"words": [
{"word": "Hello,", "start": 0.0, "end": 0.4},
{"word": "welcome", "start": 0.5, "end": 0.9},
{"word": "to", "start": 0.95, "end": 1.1},
{"word": "the", "start": 1.15, "end": 1.3},
{"word": "meeting.", "start": 1.35, "end": 2.0}
]
}
],
"words": [...],
"chars": [...],
"stats": {
"processing_time": 0.18,
"realtime_factor": 3380.0
}
}
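Downstream tooling can consume this JSON directly. Here is a short sketch, assuming the field layout shown above, that pulls out the transcript and a flat word timeline; in a real run, replace `sample` with the contents of your `result.json`.

```python
import json

# Sample mirroring the JSON structure documented above.
sample = """
{
  "file": "audio.wav",
  "text": "Hello, welcome to the meeting.",
  "duration": 600.5,
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "Hello, welcome to the meeting.",
     "words": [{"word": "Hello,", "start": 0.0, "end": 0.4},
               {"word": "welcome", "start": 0.5, "end": 0.9}]}
  ],
  "stats": {"processing_time": 0.18, "realtime_factor": 3380.0}
}
"""

result = json.loads(sample)
# Flatten the per-segment word lists into one timeline.
words = [w for seg in result["segments"] for w in seg["words"]]

print(result["text"])
for w in words:
    print(f'{w["start"]:6.2f} {w["end"]:6.2f}  {w["word"]}')
```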
SRT (`--format srt`)

Standard subtitle format for video players:
1