learnFromVideo: create comprehensive, report-style learning notes from any video. Use this skill whenever the user shares a video URL or file and wants notes, a summary, a study guide, or learning material from it. Also triggers when the user says 'learn from this video', 'learnFromVideo', 'create notes from this video', 'make notes for this video', 'I watched this video and need notes', 'summarize this video in detail', 'take notes from this lecture', 'notes from this tutorial', 'I don't have time to watch this video', or shares one or more video URLs/files and asks for any kind of written output about the content. This skill handles single videos and multiple videos on the same topic, producing a professional Word document (.docx) report with full detail — not a brief summary, but a thorough report capturing everything spoken AND shown in the video, including diagrams, workflows, code, and architecture recreated as Mermaid flowcharts.
Create professional, report-style learning notes from any video. The user provides one or more video URLs or local video files, and you produce a comprehensive Word document that captures EVERYTHING — what was spoken AND what was shown on screen — combined together so the reader understands the complete picture as if they watched the video themselves.
When people find great learning videos online, they often don't have time to watch them fully. A transcript alone doesn't capture the full picture — it misses diagrams, architecture flows, code shown on screen, and visual explanations. This skill bridges that gap by combining the transcript WITH actual screenshots from the video to produce notes so thorough that reading them is as good as watching the video.
IMPORTANT: This skill does NOT follow a rigid chapter-based template. Every video is different. The report should read like a natural, flowing document where spoken explanations, screenshots, recreated diagrams, and code appear together, in the order the video presents them.
Think of it this way: if the speaker says "here's how the architecture works" and shows a diagram, the report should explain what they said AND show the reconstructed diagram RIGHT THERE — not 20 pages later in a "diagrams chapter."
This skill supports two modes. Default is Detailed Mode unless the user explicitly requests fast/quick.
Triggers when user says "quick notes", "fast summary", "brief notes", or "just the highlights".
Triggers for all other requests, or when user says "detailed", "comprehensive", or "full notes".
This skill uses a 5-agent pipeline organized in 3 phases. Agents 2, 3, and 4 run in PARALLEL after Agent 1 completes.
┌─────────────────┐
│ Agent 1: │
│ Transcript │
│ Analyst │
└────────┬────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Agent 2: │ │ Agent 3: │ │ Agent 4: │
│ Screenshot │ │ Code │ │ Visual │
│ Extractor │ │ Specialist │ │ Content │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└────────────────┼────────────────┘
▼
┌──────────────────┐
│ Agent 5: │
│ Document │
│ Assembler │
└──────────────────┘
This agent runs FIRST because everything else depends on its timestamp analysis.
Steps:
Set PATH for yt-dlp:
export PATH="$HOME/.local/bin:$PATH"
Run the bundled transcript fetcher:
pip install youtube-transcript-api --break-system-packages 2>/dev/null
python <skill-path>/scripts/fetch_transcript.py "VIDEO_URL"
If the skill files are not in the expected path, check these locations:
/tmp/learn-from-video/scripts/fetch_transcript.py
If the script fails (video has no captions, is private, etc.), tell the user and ask them to paste the transcript manually. Then proceed with the pasted text.
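As a sketch of the URL handling the fetcher needs (a hypothetical helper, not the bundled script itself), the video ID can be pulled from the common YouTube URL shapes:

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Pull the video ID from common YouTube URL forms (watch, youtu.be, embed)."""
    parsed = urlparse(url)
    if parsed.hostname and parsed.hostname.endswith("youtu.be"):
        return parsed.path.lstrip("/")          # short-link form: youtu.be/<id>
    qs = parse_qs(parsed.query)
    if "v" in qs:
        return qs["v"][0]                       # standard form: watch?v=<id>
    parts = [p for p in parsed.path.split("/") if p]
    return parts[-1] if parts else None         # embed/shorts style paths
```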
Analyze the ENTIRE transcript and identify every important timestamp. Tag each with a content type:
[CODE timestamp_start-timestamp_end] — code shown in editor/terminal
[DIAGRAM timestamp] — architecture, flowchart, or diagram shown
[SLIDE timestamp] — slide with text/bullet points
[UI timestamp] — app interface, dashboard, website shown
[TERMINAL timestamp] — command line output, running commands
[DATA timestamp] — tables, charts, statistics, benchmarks
[KEY_CONCEPT timestamp] — important concept being explained with visual
Signal phrases to detect: "as you can see here", "let me show you", "on the screen", "this diagram shows".
Produce a structured JSON output:
{
"video_id": "abc123",
"title": "Video Title",
"duration": "12:34",
"thematic_outline": [
{ "theme": "Introduction", "start": "0:00", "end": "1:30" },
{ "theme": "Core Concept", "start": "1:30", "end": "5:00" }
],
"key_timestamps": [
{ "time": "3:45", "seconds": 225, "type": "CODE", "description": "Shows Express route handler", "duration_hint": "3:45-4:12" },
{ "time": "7:20", "seconds": 440, "type": "DIAGRAM", "description": "System architecture overview" }
],
"transcript_text": "full transcript with timestamps..."
}
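The "time"/"seconds" pair in this JSON implies a timestamp conversion; a minimal sketch, assuming MM:SS or HH:MM:SS strings:

```python
def to_seconds(ts: str) -> int:
    """Convert "MM:SS" or "HH:MM:SS" into total seconds (e.g. "3:45" -> 225)."""
    total = 0
    for part in ts.split(":"):
        total = total * 60 + int(part)  # shift previous units up by one place
    return total
```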
For multiple videos, run for each URL and process all transcripts.
Downloads the video and extracts frames based on Agent 1's timestamp analysis.
Steps:
Set PATH and install yt-dlp:
export PATH="$HOME/.local/bin:$PATH"
pip install yt-dlp --break-system-packages 2>/dev/null
Download the video at 720p (critical for code readability):
yt-dlp -f "best[height<=720][ext=mp4]" -o "/tmp/lfv-screenshots/video.mp4" "VIDEO_URL"
Fallback chain if download fails:
yt-dlp -f "best[height<=720][ext=mp4]" (default — good for code readability)
yt-dlp -f "best[height<=480][ext=mp4]" (fallback if 720p is too large or unavailable)
python -m yt_dlp -f "best[height<=720][ext=mp4]" (fallback if the yt-dlp binary is not in PATH)
For local video files, skip the download and use the file directly.
Extract frames using ADAPTIVE intervals based on Agent 1's content type tags:
For CODE timestamps: every 3-5 seconds within the code window
# Example: code shown 3:45-4:12, extract at 3s intervals
for i in $(seq 225 3 252); do
ffmpeg -ss $i -i /tmp/lfv-screenshots/video.mp4 -frames:v 1 -q:v 2 \
/tmp/lfv-screenshots/code_$(printf "%04d" $i).jpg 2>/dev/null
done
For DIAGRAM/SLIDE timestamps: one frame at the start + one at the end
ffmpeg -ss $START -i /tmp/lfv-screenshots/video.mp4 -frames:v 1 -q:v 2 \
/tmp/lfv-screenshots/diagram_$(printf "%04d" $START).jpg 2>/dev/null
For UI/DEMO timestamps: every 10-15 seconds
For normal talking sections (no visual content tagged): every 30-45 seconds as safety net
Regular interval safety net: Also extract at 30-second intervals throughout, to catch anything Agent 1 missed.
Deduplicate: Compare consecutive frames and skip near-identical ones (same slide shown for 2+ minutes).
Delete the video file after extraction to save disk space.
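The adaptive interval rules above can be sketched as a small timestamp planner (the exact step values are one choice within the stated ranges):

```python
def extraction_times(start_s: int, end_s: int, content_type: str) -> list:
    """Return the seconds at which to grab frames for one tagged window."""
    if content_type in ("DIAGRAM", "SLIDE"):
        # one frame at the start plus one at the end
        return [start_s] if end_s <= start_s else [start_s, end_s]
    steps = {"CODE": 3, "TERMINAL": 3, "UI": 10, "DATA": 10}
    step = steps.get(content_type, 30)  # untagged talking sections: 30 s safety net
    return list(range(start_s, end_s + 1, step))
```

Each returned second would then feed a single `ffmpeg -ss` call as shown above.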
Produce a manifest JSON:
{
"video_id": "abc123",
"download_quality": "720p",
"total_frames": 45,
"frames": [
{ "filename": "frame_0225.jpg", "timestamp": "3:45", "seconds": 225, "type": "CODE", "source": "targeted" },
{ "filename": "frame_0250.jpg", "timestamp": "4:10", "seconds": 250, "type": "CODE", "source": "interval" }
],
"deduplicated": 8,
"video_deleted": true
}
This agent specializes in extracting, completing, and explaining code from screenshots. This is one of the most valuable features of the skill — readers get working code they can actually use.
Receives: All frames tagged as [CODE] or [TERMINAL] from Agent 2's manifest, plus the transcript context from Agent 1.
Multi-Pass Extraction Process:
First Pass — Extract: Read each code screenshot and transcribe EVERY visible line of code exactly as shown.
Gap Detection: If code appears cut off or partially scrolled, flag the gap and look to neighboring frames of the same block for the missing lines.
Second Pass — Combine: Merge code fragments from multiple screenshots of the same code block into one complete block.
Third Pass — Complete: Using the transcript context (what the speaker was explaining) AND the agent's own knowledge, complete any partial code into a working example.
Mark sources clearly:
// === FROM VIDEO [3:45] ===
const app = express();
app.get('/api', handler);
// === FROM VIDEO [3:52] — scrolled down ===
function handler(req, res) {
res.json({ status: 'ok' });
}
// === ADDED FOR COMPLETENESS ===
import express from 'express';
app.listen(3000);
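The Second Pass merge can be approximated by joining fragments on their longest line overlap; a sketch of one hypothetical helper, assuming fragments arrive in scroll order:

```python
def merge_fragments(first: str, second: str) -> str:
    """Join two code fragments, dropping lines duplicated by scrolling overlap."""
    a, b = first.splitlines(), second.splitlines()
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:  # tail of the first frame == head of the second
            return "\n".join(a + b[k:])
    return "\n".join(a + b)  # no overlap detected: simple concatenation
```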
Add explanations: For each significant code block, explain what it does, why the speaker uses this pattern, and how a reader would apply it.
Produces:
{
"code_blocks": [
{
"id": "code_01",
"language": "javascript",
"filename": "server.js",
"timestamp_range": "3:45-4:12",
"source_frames": ["frame_0225.jpg", "frame_0228.jpg", "frame_0231.jpg"],
"raw_captured": "// lines exactly as seen in video",
"completed_code": "// full working code with FROM VIDEO and ADDED FOR COMPLETENESS markers",
"explanation": "This code sets up an Express.js server with...",
"patterns": ["middleware pattern", "error handling"],
"completeness": "partial_completed"
}
]
}
Processes all non-code visual content: diagrams, slides, UI screenshots, data tables.
Receives: All frames tagged as [DIAGRAM], [SLIDE], [UI], [DATA] from Agent 2's manifest.
For each screenshot:
Smart Screenshot Selection — decide embed vs. skip:
EMBED a screenshot if it shows: a diagram or architecture flow, a meaningful UI state, a data chart or table, or a slide with substantive content.
SKIP embedding if: the frame shows only the speaker, duplicates an already-embedded visual, or carries no information beyond the surrounding text.
Produces:
{
"visuals": [
{
"id": "vis_01",
"type": "DIAGRAM",
"timestamp": "7:20",
"source_frame": "frame_0440.jpg",
"text_extracted": "All text visible in the screenshot",
"mermaid_code": "graph LR\n A[Client] --> B[API]",
"description": "System architecture showing three-tier design",
"embed_recommended": true
},
{
"id": "vis_02",
"type": "SLIDE",
"timestamp": "2:00",
"source_frame": "frame_0120.jpg",
"text_extracted": "Title: Key Concepts\n- Point 1\n- Point 2",
"embed_recommended": true
}
]
}
Receives outputs from ALL 4 agents and builds the final Word document.
Steps:
Estimate document size before generating:
~2KB per text paragraph
~50KB per embedded screenshot (compressed JPG)
~1KB per code block
~0.5KB per table
Expected size = (paragraphs × 2) + (screenshots × 50) + (code_blocks × 1) + (tables × 0.5) KB
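That formula, directly as code:

```python
def estimate_size_kb(paragraphs: int, screenshots: int, code_blocks: int, tables: int) -> float:
    """Rough .docx size estimate (KB) using the per-element weights above."""
    return paragraphs * 2 + screenshots * 50 + code_blocks * 1 + tables * 0.5
```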
Log this estimate so the user knows what to expect.
Read the docx skill for document creation rules:
Read the docx SKILL.md for formatting instructions
Read references/report_structure.md for formatting guidelines and docx-js code patterns.
Install docx-js (local, NOT global):
npm install docx
Build the complete Word document by:
combining narration, code blocks, and visuals in the order the video presents them, embedding screenshots where embed_recommended: true.
Fixed elements (always include): a manual table of contents, an executive summary, and the adaptive core content sections.
TOC Implementation — Use MANUAL TOC:
Do NOT rely on the docx-js TableOfContents widget — it creates an empty placeholder that only populates when opened in Microsoft Word and manually updated. Instead, create a MANUAL table of contents by listing the section headings as Paragraph elements with page references. This works reliably across all Word processors (Word, Google Docs, LibreOffice).
// Manual TOC entry example
new Paragraph({
children: [
new TextRun({ text: "1. Executive Summary", size: 24 }),
new TextRun({ text: " ............................. ", size: 24, color: "999999" }),
new TextRun({ text: "3", size: 24 }),
]
})
The Adaptive Core Content:
Structure it however makes the most sense for THIS video: by concept, by workflow stage, or chronologically.
Within each content section, combine everything together: what the speaker said, the screenshots shown at that moment, recreated Mermaid diagrams, and any code with its explanation.
Embedding Actual Screenshots:
Use ImageRun from docx-js to embed screenshots directly in the document:
const { ImageRun, Paragraph, AlignmentType } = require('docx');
const fs = require('fs');
// Read the image file
const imageBuffer = fs.readFileSync('/tmp/lfv-screenshots/frame_0125.jpg');
// Create an image paragraph
new Paragraph({
children: [new ImageRun({
data: imageBuffer,
transformation: { width: 560, height: 315 }, // 16:9 aspect ratio
type: 'jpg', // REQUIRED: must specify image type
})],
alignment: AlignmentType.CENTER,
});
IMPORTANT ImageRun notes:
The type parameter is REQUIRED (use 'jpg' for JPEG, 'png' for PNG).
Read the image into a buffer with fs.readFileSync().
Save the generated .docx file to the user's workspace folder. Use a descriptive filename:
Single video: [VideoTitle]_LearnFromVideo_Notes.docx
Multiple videos: [Topic]_Combined_LearnFromVideo_Notes.docx
Sanitize the filename (remove special characters, limit to 80 chars).
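A minimal sanitizer consistent with those rules (the exact character whitelist is one reasonable choice, not a spec):

```python
import re

def sanitize_filename(title: str, suffix: str = "_LearnFromVideo_Notes.docx") -> str:
    """Strip special characters from a video title and cap the name at 80 chars."""
    stem = re.sub(r"[^A-Za-z0-9 _-]", "", title)         # drop special characters
    stem = re.sub(r"\s+", "_", stem.strip()) or "Video"  # spaces -> underscores
    return stem[: 80 - len(suffix)] + suffix
```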
Present the file link to the user with a brief note about what was captured (e.g., "Created 35-page report covering both videos with 15 embedded screenshots, 6 inline diagrams, 8 code blocks with analysis").
After generating the document, verify quality by reading back 3-5 random sections and checking that each screenshot sits next to the text it illustrates, code blocks carry their source markers, and nothing from the timestamp analysis was dropped.
When the user provides multiple video URLs, run the full pipeline for each video, then merge the results by theme rather than appending video after video.
Theme Identification and Merge Strategy:
Document structure for multi-video:
Run export PATH="$HOME/.local/bin:$PATH" before any yt-dlp command.
Use a local install (npm install docx in the working directory) instead of a global install. This is the recommended approach.
Copy skill files to a writable location (/tmp/learn-from-video/) before modifying.
These notes come from real-world experience running this skill:
yt-dlp installs to ~/.local/bin/ — ALWAYS run export PATH="$HOME/.local/bin:$PATH" before any yt-dlp command.
Copy skill files to /tmp/ for modification.
Use the --break-system-packages flag with pip installs to avoid venv errors.
Use ffmpeg -ss TIMESTAMP -i video.mp4 -frames:v 1 -q:v 2 output.jpg for individual frames.
Mark code provenance with FROM VIDEO and ADDED FOR COMPLETENESS markers.
Use npm install docx (local, not global) to avoid permission errors.
ImageRun requires the type parameter ('jpg' or 'png') — this is mandatory.
Build a manual TOC instead of the TableOfContents widget for cross-platform compatibility.
Use ShadingType.CLEAR not ShadingType.SOLID for cell shading (SOLID creates black backgrounds).
Set columnWidths on the Table AND width on each TableCell.
Use LevelFormat.BULLET for bullet lists (never unicode bullet characters).
Use WidthType.DXA not WidthType.PERCENTAGE for table widths.
Validate output with python <docx-skill-path>/scripts/office/validate.py output.docx.
For 20+ screenshots, launch 3 agents in parallel:
Each parallel agent returns its findings as { timestamp, content_type, text, code, diagrams }.
The document should be thorough enough that someone who reads it gets 90%+ of the value of watching the video.
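Splitting a frame manifest across 3 parallel agents can be sketched as contiguous, near-equal chunks, so each agent gets an ordered run of timestamps:

```python
def chunk_frames(frames: list, workers: int = 3) -> list:
    """Split an ordered frame list into `workers` contiguous, near-equal chunks."""
    base, extra = divmod(len(frames), workers)
    chunks, start = [], 0
    for i in range(workers):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(frames[start:start + size])
        start += size
    return chunks
```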
This skill includes an automated evaluation framework. See eval/eval.json for binary assertions and references/self_improve_prompt.md for the autonomous improvement loop.
Layer 1 — Skill Activation (Description): Tests whether Claude triggers the skill for the correct prompts and doesn't trigger for wrong ones.
Should trigger: "create notes from this video", "learn from this video", "summarize this lecture"
Should NOT trigger: "summarize this PDF", "write a report about AI", "take notes from this meeting"
Layer 2 — Output Quality (eval.json):
30 binary assertions across 3 test types (short tutorial, code-heavy video, multi-video).
Run the self-improvement loop from references/self_improve_prompt.md to autonomously improve this skill.
Every document includes: a manual table of contents, an executive summary, embedded screenshots, recreated Mermaid diagrams, and completed code blocks with source markers.