Extract, transcribe, and summarize audio or video files using OpenAI Whisper. Use this skill whenever the user wants to transcribe audio or video, extract what was said in a recording, get a transcript of a meeting/interview/lecture/podcast, or generate a summary of spoken content. Also trigger when the user mentions files like .mp3, .mp4, .wav, .m4a, .ogg, .flac, .webm, .mkv, .mov and wants text out of them. Generates a .md file with an AI summary followed by the full literal transcript. Also triggers on Spanish phrases: "transcribir", "transcripción", "extraer audio", "qué dice este audio", "transcribir reunión", "transcribir entrevista", "pasar audio a texto", "resumir grabación", "transcribir este video", "extraer texto de audio", "transcript", "whisper", "grabar y resumir".
Transcribe audio or video with Whisper, then produce a .md file containing an AI summary
followed by the complete literal transcript.
Before doing anything else, verify that Whisper is installed:
whisper --help > /dev/null 2>&1 && echo "OK" || echo "NOT FOUND"
If not found, tell the user to run:
pipx install openai-whisper
brew install ffmpeg # if ffmpeg is missing
Then stop and wait — do not proceed until Whisper is available.
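The availability check above can be sketched as a small helper that covers both dependencies at once; this is a minimal sketch assuming a POSIX shell, and the function name check is illustrative:

```shell
# check <command>: report whether a command is on PATH.
# Uses command -v, which is POSIX and works in both bash and zsh.
check() {
  command -v "$1" >/dev/null 2>&1 && echo "$1: OK" || echo "$1: NOT FOUND"
}

check whisper   # the transcription CLI itself
check ffmpeg    # the decoding backend Whisper relies on
```

Checking ffmpeg alongside whisper catches the most common failure mode early: whisper installed but unable to open media files.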
Ask the following in a single message if the user has not already provided them. Never ask more than once, and never ask for things already mentioned in the conversation.
Required:
- Language of the recording: Spanish, English, Portuguese. If unsure, say "auto-detect" and Whisper will figure it out (slower).

Optional (ask only if not obvious):
- Model: default medium. Options: tiny (fastest, least accurate), base, small, medium (recommended balance), large-v3 (most accurate, ~3 GB download). Ask if the user cares about speed vs. accuracy.
- Output location for the .md file. Default: same directory as the audio file.

Wait for the user's answers before proceeding to Step 2.
Run Whisper on the provided file. Use the --output_format json flag to capture word-level
timing and text cleanly, and --output_dir to control where the raw output goes.
whisper "<file_path>" --model <model> --language <language_code_or_auto> --output_format json --output_dir /tmp/whisper-extract-temp
Important: Always emit this command as a single line; never split it with backslash continuations. A trailing space after a backslash is not a line continuation in zsh: it becomes an escaped space that Whisper receives as a second (empty) file path, causing ffmpeg to fail with "Error opening input file .".
Language codes: es for Spanish, en for English, pt for Portuguese, fr for French, etc.
For auto-detect, omit --language entirely.
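If the command ever does need to be wrapped for readability (for example inside a script), a shell array sidesteps the trailing-backslash trap entirely: the array can span lines with no continuations, and a path containing spaces stays a single argument. A minimal sketch, where the file path is a hypothetical placeholder:

```shell
# Build the whisper invocation as an array. The array spans lines without
# any backslash continuations, so the zsh escaped-space trap cannot occur.
# "/path/to/recording.mp4" is a placeholder path.
file="/path/to/recording.mp4"
cmd=(whisper "$file"
     --model medium
     --language es
     --output_format json
     --output_dir /tmp/whisper-extract-temp)

# Execute it with: "${cmd[@]}"
printf '%s\n' "${cmd[@]}"
```

This assumes bash-style arrays (bash or zsh); the single-line form in the step above remains the simplest option when emitting the command directly.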
If the file is large (over 1 hour), Whisper will take several minutes. Tell the user:
"Starting transcription — this may take a few minutes depending on file length and model."
After the command completes, read the JSON output from /tmp/whisper-extract-temp/ and extract
the text field. This is the full raw transcript.
If Whisper fails (file not found, unsupported format, ffmpeg missing), report the exact error and suggest a fix before continuing.
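Extracting the text field can be sketched as a small helper that delegates JSON parsing to python3 rather than grep/sed, which would break on escaped quotes inside the transcript. The function name extract_text is illustrative, and the example file name assumes Whisper's convention of naming the JSON after the input file:

```shell
# extract_text <json_file>: print the "text" field from Whisper's JSON
# output. python3 handles escaping correctly; jq would also work if installed.
extract_text() {
  python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))["text"])' "$1"
}

# Usage (Whisper names the JSON after the input file):
# extract_text /tmp/whisper-extract-temp/recording.json
```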
Given the full transcript text and the recording context provided by the user, produce a structured summary. Write it in the output language chosen in Step 1.
The summary must cover the essential content of the recording.
Keep the summary concise: aim for 150-300 words. Do not pad it.
Construct and save the output Markdown file.
Use this pattern: YYYY-MM-DD-<slugified-context>.md
Examples:
- 2026-04-14-team-meeting-q3-roadmap.md
- 2026-04-14-candidate-interview-backend.md
- 2026-04-14-lecture-clean-architecture.md

If today's date is available in context, use it. Otherwise, use the file's modification date via stat, or omit the date prefix and use the slugified context alone.
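The slug construction can be sketched as follows. This is a deliberately simple ASCII-only slugifier (accented characters become hyphens rather than being transliterated), and the function name slugify is illustrative:

```shell
# slugify <text>: lowercase ASCII letters, collapse every run of other
# characters into a single hyphen, and trim leading/trailing hyphens.
slugify() {
  printf '%s' "$1" |
    tr '[:upper:]' '[:lower:]' |
    sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# Build the output file name from today's date and the user's context.
context="Team Meeting Q3 Roadmap"
echo "$(date +%F)-$(slugify "$context").md"
```

date +%F produces the YYYY-MM-DD prefix; when using the modification-date fallback, note that stat flags differ between Linux and macOS.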
Use this exact template:
---