Summarize video/audio content from multiple platforms into structured notes with keyframe screenshots. Supports Bilibili (B站), YouTube, Douyin (抖音), Xiaohongshu (小红书), TikTok, and 1800+ sites via yt-dlp. Triggers on video URLs from any of these platforms, or on Chinese requests such as '总结视频' (summarize the video), '视频笔记' (video notes), or '视频内容' (video content). Detects URLs containing bilibili.com, youtube.com, youtu.be, douyin.com, xiaohongshu.com, tiktok.com, and more.
Extract subtitles/transcripts from video platforms and generate structured notes. No login or cookies required.
| Platform | URL patterns | Extraction method |
|---|---|---|
| Bilibili (B站) | bilibili.com/video/, b23.tv/, BV* | Public API (WBI signing) |
| YouTube | youtube.com/watch, youtu.be/, youtube.com/shorts/ | youtube-transcript-api |
| Douyin (抖音) | douyin.com/, v.douyin.com/ | yt-dlp |
| Xiaohongshu (小红书) | xiaohongshu.com/, xhslink.com/ | yt-dlp |
| TikTok | tiktok.com/ | yt-dlp |
| Any other | Any URL supported by yt-dlp (1800+ sites) | yt-dlp |
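As a sketch of how the platform routing in the table above could work: the helper below maps a URL to a platform name by pattern matching. The regexes are assumptions derived from the "URL patterns" column, not the actual logic inside video_subtitle.py.

```python
import re

# Hypothetical patterns mirroring the table's "URL patterns" column.
PLATFORM_PATTERNS = [
    ("bilibili", r"bilibili\.com/video/|b23\.tv/|BV[0-9A-Za-z]+"),
    ("youtube", r"youtube\.com/watch|youtu\.be/|youtube\.com/shorts/"),
    ("douyin", r"douyin\.com/|v\.douyin\.com/"),
    ("xiaohongshu", r"xiaohongshu\.com/|xhslink\.com/"),
    ("tiktok", r"tiktok\.com/"),
]

def detect_platform(url: str) -> str:
    """Return the first matching platform, or 'generic' for any other site."""
    for name, pattern in PLATFORM_PATTERNS:
        if re.search(pattern, url):
            return name
    return "generic"  # handled by yt-dlp (1800+ sites)
```

Anything that falls through to "generic" is handed to yt-dlp, as the last table row describes.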
Ensure Python 3 is available:
python --version
Before running the extraction script, check and install required dependencies based on the video platform:
For Bilibili videos -- no extra dependencies needed (uses Python stdlib only).
For YouTube videos:
pip install youtube-transcript-api
For Douyin, Xiaohongshu, TikTok, or any other platform:
pip install yt-dlp
Optional -- Whisper transcription (for videos without subtitles):
For API mode:
pip install openai
For local mode:
pip install faster-whisper
Then configure config.json (see Whisper Setup below).
Only install what is needed. If the user provides a Bilibili URL, skip dependency installation entirely.
Check the video URL to determine the platform. If it's NOT a Bilibili URL, ensure the required package is installed:
pip install youtube-transcript-api yt-dlp
If these are already installed, skip this step.
Run the extraction script with the video URL:
python "<skill_path>/video_subtitle.py" "<VIDEO_URL>"
Replace <skill_path> with the absolute path to this skill's directory, and <VIDEO_URL> with the video URL provided by the user.
The script automatically detects the platform and chooses the matching extraction method from the table above, then outputs a JSON object to stdout containing:
- title - Video title
- author - Uploader name
- duration - Video duration
- description - Video description
- platform - Detected platform: bilibili, youtube, douyin, xiaohongshu, tiktok, or generic
- url - Canonical video URL
- source - Extraction method: subtitle, ai_conclusion, transcript_api, yt_dlp_subs, whisper_local, whisper_api
- subtitle_text - Full subtitle/transcript text (if available)
- frames - Array of keyframe screenshots (if ffmpeg is installed and extract_frames is enabled in config.json). Each frame has path (absolute file path) and timestamp (e.g. "01:30")
- error - Error message (if extraction failed)

If the output contains an error field, report the error message to the user instead of producing a summary.
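A minimal sketch of consuming the script's stdout. The payload below is an illustrative sample, not real output; the field names follow the output description above.

```python
import json

# Illustrative sample of what the script prints to stdout.
sample_stdout = json.dumps({
    "title": "Demo video",
    "platform": "bilibili",
    "source": "subtitle",
    "subtitle_text": "hello world",
    "frames": [{"path": "/tmp/frame_01.jpg", "timestamp": "01:30"}],
})

data = json.loads(sample_stdout)
if "error" in data:
    # Extraction failed; surface the message instead of summarizing.
    raise RuntimeError(f"extraction failed: {data['error']}")

subtitle_text = data.get("subtitle_text", "")
frames = data.get("frames", [])
```

Checking for the error key before reading subtitle_text mirrors the error-handling rule above.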
When subtitle text is successfully extracted, summarize it into the BibiGPT-style format below. Use the EXACT structure, including emojis, section naming, and formatting.
If frames array is present in the JSON output, embed one screenshot per content section using the Read tool to view the frame image file, then insert it with markdown image syntax right after the section header. Match frames to sections by timestamp order -- assign one frame per section sequentially. If there are more sections than frames, some sections will have no image. If there are more frames than sections, distribute evenly.
# AI 一键总结:[{title}]({url})
# 🤖 {title} — 通俗解释
### 🏷️ {Section 1 Title}

- {Key point from transcript, preserving original examples and analogies}
- {Another key point}
* {Sub-point or example}
* {Sub-point or example}
### 💡 {Section 2 Title}

- {Content organized by topic}
- {Preserve vivid analogies from the video}
### 🧠 {Section 3 Title}

- {Continue grouping content logically}
### 🍎 {Section 4 Title}

- {More content sections as needed}
(... more sections using rotating emojis: 🏷️ 💡 🧠 🍎 🔑 🔢 🧱 🎯 ...)
### Summary
- {One paragraph summarizing the entire video content concisely}
### Highlights
* 🧠 {Highlight 1 with emoji} [#tag1] [#tag2] [#tag3]
* 🔪 {Highlight 2 with emoji} [#tag1] [#tag2] [#tag3]
* 🧮 {Highlight 3 with emoji} [#tag1] [#tag2] [#tag3]
* 🔢 {Highlight 4 with emoji} [#tag1] [#tag2] [#tag3]
* 🧱 {Highlight 5 with emoji} [#tag1] [#tag2] [#tag3]
[#tag1] [#tag2] [#tag3] [#tag4] [#tag5]
### Questions
* {Thought-provoking question 1 related to the video content}
* {Thought-provoking question 2 that extends the topic}
Guidelines for summarization:
- Use - for main bullet points and * (with 4-space indent) for sub-points/examples.
- Each section should have at most one image, placed directly after the ### header line.
- Use the absolute path value from the frames array as-is.

For videos without subtitles, Whisper can transcribe the audio. Edit config.json in this skill's directory:
Option A -- OpenAI Whisper API (fast, requires API key, costs ~$0.006/min):
{
"whisper_mode": "api",
"openai_api_key": "sk-your-key-here",
"language": "zh"
}
Option B -- Local faster-whisper (free, requires model download ~1-3GB):
{
"whisper_mode": "local",
"whisper_model": "base",
"language": "zh"
}
Model sizes: tiny (fast, less accurate) / base (balanced) / small / medium / large (slow, most accurate).
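As a sketch of how a script might branch on these settings: the field names below come from the config examples above, but the dispatch itself is an assumption, not the actual video_subtitle.py code.

```python
import json

# Parse a config.json payload shaped like Option B above.
config = json.loads('{"whisper_mode": "local", "whisper_model": "base", "language": "zh"}')

if config.get("whisper_mode") == "api":
    # Would call the OpenAI transcription API with the configured key.
    backend = f"OpenAI Whisper API (lang={config['language']})"
elif config.get("whisper_mode") == "local":
    # Would load a local faster-whisper model of the configured size.
    backend = f"faster-whisper model '{config['whisper_model']}' (lang={config['language']})"
else:
    backend = "no transcription fallback"
```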
These platforms require browser cookies for yt-dlp to access video content. The script tries --cookies-from-browser automatically, but on Windows with Chrome 127+ this often fails due to DPAPI encryption.
Recommended: export cookies.txt manually
1. Visit douyin.com (or xiaohongshu.com) in your browser (login is NOT required, just visit the page).
2. Export the site cookies and save them as cookies.txt in this skill's directory (<skill_path>/cookies.txt).

The script will automatically detect and use this file for Douyin/Xiaohongshu/TikTok requests.
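The lookup order described above can be sketched as follows. --cookies and --cookies-from-browser are real yt-dlp flags, but this helper itself is hypothetical: prefer a manually exported cookies.txt, else fall back to reading cookies from the browser.

```python
import os

def cookie_args(skill_path: str) -> list:
    """Build the yt-dlp cookie arguments, preferring a manual cookies.txt."""
    cookies_file = os.path.join(skill_path, "cookies.txt")
    if os.path.exists(cookies_file):
        return ["--cookies", cookies_file]
    # May fail on Windows with Chrome 127+ due to DPAPI encryption.
    return ["--cookies-from-browser", "chrome"]
```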
The script can extract keyframe screenshots from videos to embed in the summary. This requires ffmpeg to be installed.
To enable/disable, edit config.json:
{
"extract_frames": true,
"frames_per_video": 6
}
- extract_frames: true (default) to capture keyframes, false to skip
- frames_per_video: number of evenly-spaced frames to extract (default 6)

Screenshots are cached in the screenshots/ directory. If ffmpeg is not installed, frame extraction is silently skipped.
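As a sketch of computing evenly spaced capture points for ffmpeg to match frames_per_video: the exact spacing rule used by the script is an assumption (here, the very start and end of the video are skipped).

```python
def frame_timestamps(duration_seconds: float, frames_per_video: int = 6) -> list:
    """Return MM:SS timestamps evenly spaced through the video's duration."""
    step = duration_seconds / (frames_per_video + 1)
    points = [step * (i + 1) for i in range(frames_per_video)]
    return [f"{int(t // 60):02d}:{int(t % 60):02d}" for t in points]
```

For a 7-minute (420 s) video with the default of 6 frames, this yields one capture point per minute.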