Turn YouTube videos into dependable markdown transcripts and polished summaries — even when caption coverage is messy. This skill works with manual closed captions (CC), auto-generated subtitles, or no usable subtitles at all by using subtitle-first extraction with local Whisper fallback. Supports private/restricted videos via cookies, batch processing, transcript cleanup, language backfill, source-language or user-selected summary language, and end-to-end completion reporting. Ideal for YouTube research, technical walkthroughs, founder content, tutorials, private/internal uploads, and batch video summarization workflows.
The YouTube summarizer that still works when captions are broken, missing, or inconsistent.
Outputs: raw markdown transcript + polished markdown summary + session-ready result block.
Unlike caption-only tools, this skill still works when subtitles are missing by falling back to local Whisper transcription.
Generate a raw transcript markdown file and a polished summary markdown file from one or more YouTube videos.
This skill is self-contained. It does not require any other YouTube summarizer skill or prior workflow context.
For a fresh macOS setup, new users should be able to copy-paste the following exactly:
brew install yt-dlp ffmpeg whisper-cpp
MODELS_DIR="$HOME/.openclaw/workspace"
MODEL_PATH="$MODELS_DIR/ggml-medium.bin"
mkdir -p "$MODELS_DIR"
if [ ! -f "$MODEL_PATH" ]; then
  curl -L https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.bin \
    -o "$MODEL_PATH.part" && mv "$MODEL_PATH.part" "$MODEL_PATH"
else
  echo "Model already exists at $MODEL_PATH; leaving it unchanged."
fi
command -v python3 yt-dlp ffmpeg whisper-cli
ls -lh "$MODEL_PATH"
What this does:
- Installs yt-dlp, ffmpeg, and whisper-cli
- Downloads the ggml-medium model into ~/.openclaw/workspace unless it is already there
- Does not modify ~/.openclaw/openclaw.json or any other OpenClaw config file
If you want to store models elsewhere, pass --models-dir /path/to/models when running the workflow.
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID"
This creates a dedicated per-video folder, writes the raw transcript markdown, creates the summary placeholder markdown, and prints JSON describing the outputs plus the exact follow-up commands/prompts needed to finish the summary step.
Important: the workflow script alone is not the finished deliverable. The current OpenClaw session must still:
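- Backfill the transcript language if it is still unknown
- Replace the placeholder Summary.md with a real polished summary
- Run scripts/complete_youtube_summary.py to validate/finalize the result
To write the summary in a specific language:
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
--summary-language zh-CN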
For private or restricted videos, pass cookies:
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
--cookies /path/to/cookies.txt
or
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
--cookies-from-browser chrome
For batch processing, see references/batch-input-format.md for the input format, then run:
python3 scripts/run_youtube_workflow.py --batch-file ./youtube-urls.txt
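The invocations above differ only in which flags are passed, so a small wrapper can assemble them in one place. The following is a minimal sketch, not part of the skill: `build_workflow_command` is a hypothetical helper, and it assumes that the two cookie options are mutually exclusive and that the flags can be combined freely with either a single URL or `--batch-file`. Only flags that appear in this document are used.

```python
from typing import List, Optional

WORKFLOW_SCRIPT = "scripts/run_youtube_workflow.py"


def build_workflow_command(
    url: Optional[str] = None,
    batch_file: Optional[str] = None,
    summary_language: Optional[str] = None,
    cookies: Optional[str] = None,
    cookies_from_browser: Optional[str] = None,
    models_dir: Optional[str] = None,
) -> List[str]:
    """Return an argv list for the workflow script (hypothetical helper)."""
    # Assumption: a run targets either one URL or one batch file, never both.
    if (url is None) == (batch_file is None):
        raise ValueError("pass exactly one of url or batch_file")
    # Assumption: the two cookie sources are alternatives, not combinable.
    if cookies and cookies_from_browser:
        raise ValueError("pass only one of cookies or cookies_from_browser")

    cmd = ["python3", WORKFLOW_SCRIPT]
    if batch_file is not None:
        cmd += ["--batch-file", batch_file]
    else:
        cmd.append(url)
    if summary_language:
        cmd += ["--summary-language", summary_language]
    if cookies:
        cmd += ["--cookies", cookies]
    if cookies_from_browser:
        cmd += ["--cookies-from-browser", cookies_from_browser]
    if models_dir:
        cmd += ["--models-dir", models_dir]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` so the JSON printed by the workflow can be read from standard output.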
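This skill is designed to keep working across the messy reality of YouTube: videos with manual closed captions, videos with only auto-generated subtitles, and videos with no usable subtitles at all.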
That makes it materially more reliable than caption-only workflows. It works well for caption-rich videos, caption-poor videos, and private/internal uploads where subtitle coverage is inconsistent.
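The subtitle-first, Whisper-fallback idea can be pictured as a simple decision function. This is an illustrative sketch only: `TranscriptSource` and `choose_transcript_source` are hypothetical names, and the preference for manual captions over auto-generated subtitles is an assumption rather than something the workflow script is confirmed to do.

```python
from enum import Enum


class TranscriptSource(Enum):
    MANUAL_CAPTIONS = "manual_captions"
    AUTO_SUBTITLES = "auto_subtitles"
    LOCAL_WHISPER = "local_whisper"


def choose_transcript_source(
    has_manual_captions: bool, has_auto_subtitles: bool
) -> TranscriptSource:
    """Pick where the transcript text should come from (hypothetical helper)."""
    # Assumption: human-written captions are preferred when they exist.
    if has_manual_captions:
        return TranscriptSource.MANUAL_CAPTIONS
    # Otherwise use auto-generated subtitles if YouTube provides them.
    if has_auto_subtitles:
        return TranscriptSource.AUTO_SUBTITLES
    # No usable subtitles: fall back to local Whisper transcription.
    return TranscriptSource.LOCAL_WHISPER
```

A caption-only tool would stop at the second branch; the third branch is what keeps the workflow usable when subtitle coverage is missing.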
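Core capabilities:
- Subtitle-first extraction with local Whisper fallback
- Private/restricted video support via cookies
- Batch processing
- Transcript cleanup
- Language backfill
- Source-language or user-selected summary language
- End-to-end completion reporting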
For each video, create exactly one dedicated output folder containing these final deliverables:
- SANITIZED_VIDEO_NAME_transcript_raw.md
- SANITIZED_VIDEO_NAME_Summary.md
By default, delete only the known intermediate media, subtitle, and WAV files created by the workflow. Do not wipe unrelated files that may already exist in the per-video folder.
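One way to honor the cleanup rule above is to delete only paths the workflow explicitly recorded as its own, rather than globbing the folder by extension. The sketch below assumes such a record exists as a list of paths; `remove_intermediates` is a hypothetical helper and does not reflect how the actual script tracks its files.

```python
from pathlib import Path
from typing import Iterable, List


def remove_intermediates(created: Iterable[Path]) -> List[Path]:
    """Delete only the files listed in `created`; return those actually removed.

    Anything else in the per-video folder is left untouched, including files
    that share an extension with the intermediates.
    """
    removed: List[Path] = []
    for path in created:
        # Skip paths that were never written or were already cleaned up.
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Because the function never lists the directory, a pre-existing file in the same folder cannot be deleted by accident.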
Verify these tools exist before running the workflow:
- yt-dlp
- ffmpeg
- whisper-cli
- python3
The workflow also requires a supported Whisper ggml model file in the configured models directory.
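The preflight check described above could be scripted roughly as follows. This is a sketch under stated assumptions: `missing_tools` and `model_available` are hypothetical helpers, and the `<model name>.bin` file naming is inferred from the `ggml-medium.bin` download in the setup commands.

```python
import shutil
from pathlib import Path
from typing import Iterable, List

REQUIRED_TOOLS = ("yt-dlp", "ffmpeg", "whisper-cli", "python3")


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def model_available(models_dir: Path, model_name: str = "ggml-medium") -> bool:
    """Check for the model file, assuming it is stored as <model_name>.bin."""
    return (models_dir / f"{model_name}.bin").is_file()
```

A caller might stop early with a clear message when `missing_tools()` is non-empty or `model_available(Path.home() / ".openclaw" / "workspace")` is false, instead of failing partway through a download.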
Use these scripts directly:
- scripts/run_youtube_workflow.py: main deterministic workflow for metadata, download/subtitles, transcription, placeholder summary creation, cleanup, and workflow metadata emission
- scripts/backfill_detected_language.py: update transcript_raw.md, Summary.md, and workflow metadata after the current session LLM decides the major transcript language
- scripts/complete_youtube_summary.py: validate that Summary.md is no longer a placeholder, optionally backfill language, compute the final end-to-end timing report for one item, and emit a session-ready result block
- scripts/normalize_transcript_text.py: convert raw timestamped transcript text into cleaner summary input without modifying the raw transcript file
- scripts/finalize_youtube_summary.py: lower-level timing helper used by the completion flow
- scripts/prepare_video_paths.py: derive sanitized folder and output file paths from a title and video ID
Useful references:
- references/detailed-workflow.md: full operational workflow, completion rules, batch guidance, naming rules, and practical notes
- references/summary-template.md: required structure and writing rules for the final Summary.md
- references/session-output-template.md: required user-facing output format to return to the current OpenClaw session after completion
- references/batch-input-format.md: input format for queue / batch processing
Defaults:
- Output root: ~/Downloads
- Whisper model: ggml-medium
- Supported Whisper models: ggml-base, ggml-small, ggml-medium
- Summary language: source
At a high level, the skill does this:
- Write SANITIZED_VIDEO_NAME_transcript_raw.md
- Create SANITIZED_VIDEO_NAME_Summary.md as a placeholder
- Run scripts/complete_youtube_summary.py to validate completion, backfill language if needed, and emit a session-ready result block
For a normal end-to-end request, completion means all of the following are true:
- If the transcript language was initially unknown, the language was backfilled into both markdown files
- scripts/complete_youtube_summary.py was run successfully
If the workflow script succeeded but the summary/completion step did not happen yet, describe the state as partial/in-progress rather than complete.
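The distinction between complete and partial can be made explicit with a small status function. The sketch below is illustrative: `ItemState` and `describe_state` are hypothetical names, the fields are a simplified reading of the criteria in this document, and the `"failed"` label is an assumption for the case where the workflow script itself did not succeed.

```python
from dataclasses import dataclass


@dataclass
class ItemState:
    workflow_succeeded: bool           # run_youtube_workflow.py finished cleanly
    summary_is_placeholder: bool       # Summary.md still holds placeholder text
    language: str                      # detected language, or "unknown"
    completion_script_succeeded: bool  # complete_youtube_summary.py ran cleanly


def describe_state(state: ItemState) -> str:
    """Report an item as complete only when every completion criterion holds."""
    if not state.workflow_succeeded:
        return "failed"  # assumption: label chosen for this sketch
    done = (
        not state.summary_is_placeholder
        and state.language != "unknown"
        and state.completion_script_succeeded
    )
    return "complete" if done else "partial/in-progress"
```

For example, an item whose transcript was written but whose Summary.md is still a placeholder is reported as `partial/in-progress`, which matches the wording the session should use when describing it.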
Read these as needed:
- references/detailed-workflow.md when you need the full implementation contract, batch guidance, naming rules, cleanup rules, timing flow, or debugging details
- references/summary-template.md before writing the final polished Summary.md
- references/session-output-template.md before returning the final user-facing per-video result block
- references/batch-input-format.md when handling --batch-file
This skill is optimized for dependable end-to-end output, not just quick transcript extraction: