Summarize content from any source — URLs, local files, YouTube videos, and raw text. Use when the user asks to summarize a webpage, PDF, document, article, video, or any content. Supports multiple output formats (bullet points, executive summary, detailed notes) and configurable length. Can also extract raw content without summarization.
Summarize content from any source: URLs, local files, YouTube videos, clipboard text, and more. Flexible output formats with configurable depth and style.
No mandatory external dependencies for basic text summarization — the AI model handles it directly.
The agent should use available web browsing/fetching tools to retrieve URL content. If running in an environment with shell access:
# For advanced HTML parsing (optional)
pip install beautifulsoup4 requests
# For PDF text extraction (optional)
pip install PyPDF2
# or
pip install pdfplumber
If the content source is a YouTube URL, this skill delegates to the youtube-summarizer or bilibili-watcher skills if available. Otherwise, it uses:
pip install youtube-transcript-api
| Input Type | How to Provide | Notes |
|---|---|---|
| URL (webpage) | Paste the URL | HTML content extracted automatically |
| URL (YouTube) | Paste YouTube link | Transcript extracted via API |
| Local file (text) | File path | .txt, .md, .rst, .csv |
| Local file (PDF) | File path | Requires PyPDF2 or pdfplumber |
| Local file (HTML) | File path | Parsed with BeautifulSoup |
| Local file (DOCX) | File path | Requires python-docx |
| Raw text | Paste directly | Any length |
| Clipboard | "Summarize my clipboard" | If clipboard access available |
Determine what the user wants summarized and how to access it:
Input Analysis:
1. Is it a URL? → Fetch the content
2. Is it a file path? → Read the file
3. Is it raw text? → Use directly
4. Is it a YouTube link? → Extract transcript
5. Is it multiple sources? → Process each, then combine
URL Detection Patterns:
import re

def classify_input(text: str) -> str:
    """Classify the input type."""
    text = text.strip()
    # YouTube URLs
    youtube_pattern = r'(youtube\.com|youtu\.be|youtube\.com/shorts)'
    if re.search(youtube_pattern, text):
        return 'youtube'
    # Bilibili URLs
    if 'bilibili.com' in text or 'b23.tv' in text:
        return 'bilibili'
    # General URLs
    if re.match(r'https?://', text):
        return 'url'
    # File paths
    if any(text.endswith(ext) for ext in ['.pdf', '.txt', '.md', '.html', '.docx', '.rst', '.csv']):
        return 'file'
    # Raw text
    return 'text'
Use the available web fetching tools to retrieve and parse HTML content. Extract the main article text, removing navigation, ads, footers, and other boilerplate.
Key extraction goals:
from bs4 import BeautifulSoup
import requests

def extract_url_content(url: str) -> dict:
    """Extract main content from a URL."""
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (compatible; ContentSummarizer/1.0)'
    }, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Remove script, style, nav, footer, header, and aside elements
    for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        tag.decompose()
    # Try to find the main article content
    article = soup.find('article') or soup.find('main') or soup.find('body')
    title = soup.find('title')
    title_text = title.get_text().strip() if title else 'Untitled'
    return {
        'title': title_text,
        'text': article.get_text(separator='\n', strip=True) if article else '',
        'url': url
    }
from pathlib import Path

from bs4 import BeautifulSoup  # needed for the .html branch below

def extract_file_content(filepath: str) -> dict:
    """Extract text from various file formats."""
    path = Path(filepath)
    suffix = path.suffix.lower()
    if suffix in ('.txt', '.md', '.rst', '.csv'):
        text = path.read_text(encoding='utf-8')
        return {'title': path.name, 'text': text, 'format': suffix}
    elif suffix == '.pdf':
        return extract_pdf(filepath)
    elif suffix == '.html':
        text = path.read_text(encoding='utf-8')
        soup = BeautifulSoup(text, 'html.parser')
        for tag in soup(['script', 'style']):
            tag.decompose()
        return {
            'title': path.name,
            'text': soup.get_text(separator='\n', strip=True),
            'format': 'html'
        }
    elif suffix == '.docx':
        return extract_docx(filepath)
    else:
        # Try reading as plain text
        try:
            text = path.read_text(encoding='utf-8')
            return {'title': path.name, 'text': text, 'format': 'unknown'}
        except UnicodeDecodeError:
            raise ValueError(f"Cannot read binary file: {filepath}")
def extract_pdf(filepath: str) -> dict:
    """Extract text from PDF using available libraries."""
    # Prefer pdfplumber when it is installed
    try:
        import pdfplumber
        with pdfplumber.open(filepath) as pdf:
            pages = [page.extract_text() or '' for page in pdf.pages]
            return {
                'title': Path(filepath).name,
                'text': '\n\n'.join(pages),
                'format': 'pdf',
                'pages': len(pdf.pages)
            }
    except ImportError:
        pass
    # Fall back to PyPDF2
    try:
        from PyPDF2 import PdfReader
        reader = PdfReader(filepath)
        pages = [page.extract_text() or '' for page in reader.pages]
        return {
            'title': Path(filepath).name,
            'text': '\n\n'.join(pages),
            'format': 'pdf',
            'pages': len(reader.pages)
        }
    except ImportError:
        raise RuntimeError("Install pdfplumber or PyPDF2 to read PDFs: pip install pdfplumber")
def extract_docx(filepath: str) -> dict:
    """Extract text from DOCX files."""
    try:
        from docx import Document
        doc = Document(filepath)
        # Keep only paragraphs that contain visible text
        paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
        return {
            'title': Path(filepath).name,
            'text': '\n\n'.join(paragraphs),
            'format': 'docx'
        }
    except ImportError:
        raise RuntimeError("Install python-docx to read DOCX files: pip install python-docx")
Delegate to the youtube-summarizer skill or use youtube-transcript-api directly:
from youtube_transcript_api import YouTubeTranscriptApi

def extract_youtube_content(url: str) -> dict:
    """Extract transcript from YouTube video."""
    video_id = extract_video_id(url)  # See youtube-summarizer skill
    # Note: get_transcript is the static API of youtube-transcript-api releases before 1.0;
    # newer releases use an instance-based fetch() call, so check the installed version.
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en', 'zh-Hans', 'ja'])
    text = ' '.join(entry['text'] for entry in transcript)
    return {
        'title': f'YouTube Video {video_id}',
        'text': text,
        'format': 'youtube',
        'segments': transcript
    }
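The transcript function above calls `extract_video_id`, which this document leaves to the youtube-summarizer skill. As a rough stand-in, a helper along the following lines might work for the common URL shapes. It is a hypothetical sketch, not that skill's implementation, and the set of path prefixes it recognizes (`shorts`, `embed`, `live`) is an assumption.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the video ID from a YouTube URL, or None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or '').lower()
    segments = [s for s in parsed.path.split('/') if s]

    # Short links: https://youtu.be/<id>
    if host == 'youtu.be':
        return segments[0] if segments else None

    if host == 'youtube.com' or host.endswith('.youtube.com'):
        # Standard watch links: https://www.youtube.com/watch?v=<id>
        if parsed.path == '/watch':
            ids = parse_qs(parsed.query).get('v')
            return ids[0] if ids else None
        # Path-based links: /shorts/<id>, /embed/<id>, /live/<id>
        if len(segments) >= 2 and segments[0] in ('shorts', 'embed', 'live'):
            return segments[1]

    # Not a recognized YouTube URL (including URLs without a scheme)
    return None
```

Returning `None` rather than raising lets the caller fall back to treating the input as a general URL.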
Choose the output format based on user request or default to bullet points.
Best for: Quick scanning, team sharing, Slack/email updates.
# Summary: [Title]
**Source**: [URL or filename]
**Length**: ~X words / X pages / X minutes
## Key Points
• [Most important finding/conclusion]
• [Second key point]
• [Third key point]
• [Fourth key point — include specific numbers/data if available]
• [Fifth key point]
## Notable Details
• [Interesting data point or quote]
• [Counter-argument or limitation mentioned]
Prompt template:
Summarize the following content into 5-8 bullet points. Each bullet should:
- Be self-contained (understandable without reading the full text)
- Include specific numbers, names, or dates when relevant
- Be ordered by importance (most important first)
- Be concise (1-2 sentences max)
Content:
{content}
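If the prompt is assembled in code, the template above could be filled in roughly as follows. This is a minimal sketch: `build_bullet_prompt` and its `min_points` and `max_points` parameters are hypothetical names introduced here, with defaults taken from the "5-8" range in the template.

```python
BULLET_PROMPT = """Summarize the following content into {min_points}-{max_points} bullet points. Each bullet should:
- Be self-contained (understandable without reading the full text)
- Include specific numbers, names, or dates when relevant
- Be ordered by importance (most important first)
- Be concise (1-2 sentences max)

Content:
{content}"""


def build_bullet_prompt(content: str, min_points: int = 5, max_points: int = 8) -> str:
    """Fill the bullet-point template (hypothetical helper, not part of the skill)."""
    if not 1 <= min_points <= max_points:
        raise ValueError("expected 1 <= min_points <= max_points")
    # str.format only parses the template, so braces inside `content` are kept verbatim
    return BULLET_PROMPT.format(
        min_points=min_points, max_points=max_points, content=content
    )
```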
Best for: Leadership updates, decision-making, meeting prep.
# Executive Summary: [Title]
**Source**: [URL/file] | **Date**: [if available] | **Read time**: ~X min
## Bottom Line
[1-2 sentences: the single most important takeaway]
## Context
[2-3 sentences: why this matters, background]
## Key Findings
1. [Finding with supporting data]
2. [Finding with supporting data]
3. [Finding with supporting data]
## Implications
[What this means for the reader/team/organization]
## Recommended Actions
1. [Action item]
2. [Action item]
Prompt template:
Write an executive summary of the following content. Target audience: busy decision-makers
who need to understand the core message in under 2 minutes.
Structure:
1. Bottom Line (1-2 sentences — what's the one thing they need to know?)
2. Context (2-3 sentences — why does this matter?)
3. Key Findings (3-5 numbered points with data)
4. Implications (what this means going forward)
5. Recommended Actions (concrete next steps)
Content:
{content}
Best for: Research, studying, reference material.
# Detailed Notes: [Title]
**Source**: [URL/file]
**Summary date**: [today]
**Original length**: ~X words
## Overview
[3-5 sentence comprehensive overview]
## Section 1: [Topic]
[Detailed notes preserving key information, quotes, data]
- Sub-point with specifics
- Sub-point with specifics
## Section 2: [Topic]
[Detailed notes]
## Section 3: [Topic]
[Detailed notes]
## Key Quotes
> "[Exact quote]" — [Source/Author]
> "[Exact quote]" — [Source/Author]
## Data & Statistics
| Metric | Value | Context |
|---|---|---|
| [metric] | [value] | [context] |
## References & Links
- [Reference mentioned in the content]
Best for: Content extraction for downstream processing.
When the user says "just extract" or "don't summarize", return the raw extracted text in clean markdown format without any summarization or analysis:
# Extracted Content: [Title]
**Source**: [URL/file]
**Extracted**: [timestamp]
**Word count**: X
---
[Full extracted text in clean markdown]
User says: "Summarize https://example.com/article"
User says: "Summarize this PDF: /path/to/document.pdf"
User says: "Give me an executive summary of this article"
User provides multiple URLs/files:
User says: "Give me a 3-sentence summary" or "detailed 2000-word summary"
User says: "Just extract the text from this URL, don't summarize"
User shares a YouTube or Bilibili link:
When processing a summarization request, consider these adjustable parameters:
| Parameter | Options | Default |
|---|---|---|
| Format | bullet, executive, detailed, extract-only | bullet |
| Length | brief, short, medium, detailed | medium |
| Language | Output language code | Same as source |
| Focus | Specific topic/aspect to emphasize | None (general) |
| Audience | technical, general, executive, academic | general |
| Include quotes | yes/no | yes for detailed |
| Include data | yes/no | yes |
| Max points | Number of bullet points | 8 |
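One way to carry these parameters through code is a small options object whose defaults mirror the table. The sketch below is illustrative only: `SummaryOptions` and its field names are hypothetical, and it assumes quotes are off by default for every format except `detailed`, which the table implies but does not state.

```python
from dataclasses import dataclass
from typing import Optional

FORMATS = {'bullet', 'executive', 'detailed', 'extract-only'}
LENGTHS = {'brief', 'short', 'medium', 'detailed'}
AUDIENCES = {'technical', 'general', 'executive', 'academic'}


@dataclass(frozen=True)
class SummaryOptions:
    """Hypothetical container for the parameters in the table above."""
    format: str = 'bullet'
    length: str = 'medium'
    language: Optional[str] = None         # None means "same as source"
    focus: Optional[str] = None            # None means general summary
    audience: str = 'general'
    include_quotes: Optional[bool] = None  # None means "yes for detailed" (assumed: no otherwise)
    include_data: bool = True
    max_points: int = 8

    def __post_init__(self) -> None:
        # Reject values outside the option lists in the table
        if self.format not in FORMATS:
            raise ValueError(f"unknown format: {self.format!r}")
        if self.length not in LENGTHS:
            raise ValueError(f"unknown length: {self.length!r}")
        if self.audience not in AUDIENCES:
            raise ValueError(f"unknown audience: {self.audience!r}")
        if self.max_points < 1:
            raise ValueError("max_points must be at least 1")

    @property
    def quotes_enabled(self) -> bool:
        """Resolve the 'yes for detailed' default into a concrete boolean."""
        if self.include_quotes is None:
            return self.format == 'detailed'
        return self.include_quotes
```

For example, `SummaryOptions(format='executive', audience='executive')` keeps every other default from the table.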
Users can specify these naturally:
Problem: Many news sites and platforms require subscriptions or login.
Solutions:
Problem: Some pages load content dynamically via JavaScript, making simple HTTP requests return empty shells.
Solutions:
?format=text or similar URL parameters
Problem: Documents over 50,000 words may exceed model context limits.
Solutions:
def chunk_text(text: str, max_chars: int = 30000) -> list[str]:
    """Split text into manageable chunks at paragraph boundaries."""
    paragraphs = text.split('\n\n')
    chunks = []
    current = []
    current_len = 0
    for para in paragraphs:
        # Start a new chunk once adding this paragraph would exceed the limit
        if current_len + len(para) > max_chars and current:
            chunks.append('\n\n'.join(current))
            current = []
            current_len = 0
        current.append(para)
        current_len += len(para)
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
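Once a document is split into chunks, the chunked-processing-with-merge approach can be sketched as a two-pass reduction: summarize each chunk, then summarize the joined partial summaries. The function below is an assumption-laden sketch; `summarize_in_chunks` is a hypothetical name, and `summarize` stands for whatever callable sends text to the model and returns its summary.

```python
from typing import Callable, List


def summarize_in_chunks(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then merge the partial summaries in a final pass."""
    if not chunks:
        return ''
    # First pass: one summary per chunk
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        # A single chunk needs no merge step
        return partials[0]
    # Second pass: summarize the concatenated partial summaries
    return summarize('\n\n'.join(partials))
```

For very long inputs the joined partial summaries may themselves exceed the context limit, in which case the merge step would need to be applied recursively.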
Problem: User provides a file that's primarily images, charts, or scanned documents.
Solutions:
Problem: Files with unusual encodings (GB2312, Shift-JIS, etc.) may not parse correctly.
Solutions:
chardet library for automatic detection if available
def read_with_fallback(filepath: str) -> str:
    """Read file trying multiple encodings."""
    # latin-1 accepts any byte sequence, so it acts as a last resort and must stay last
    encodings = ['utf-8', 'utf-8-sig', 'gb2312', 'gbk', 'gb18030', 'shift-jis', 'latin-1']
    for enc in encodings:
        try:
            with open(filepath, 'r', encoding=enc) as f:
                return f.read()
        except (UnicodeDecodeError, UnicodeError):
            continue
    # Only reachable if latin-1 is removed from the list above
    raise ValueError(f"Cannot decode {filepath} with any known encoding")
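Where the chardet library is installed, detection can be tried before falling back to a fixed encoding. The sketch below assumes chardet's `detect` function, which returns a dictionary with `encoding` and `confidence` keys. The name `read_with_detection` and the `min_confidence` threshold of 0.5 are hypothetical choices, not values from this skill.

```python
from pathlib import Path


def read_with_detection(filepath: str, min_confidence: float = 0.5) -> str:
    """Decode a file using chardet's guess when available, else UTF-8 with replacement."""
    raw = Path(filepath).read_bytes()
    try:
        import chardet  # optional dependency
    except ImportError:
        chardet = None

    if chardet is not None:
        guess = chardet.detect(raw)
        encoding = guess.get('encoding')
        confidence = guess.get('confidence') or 0.0
        if encoding and confidence >= min_confidence:
            try:
                return raw.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                # Wrong guess or a codec Python does not know: fall through
                pass

    # Fallback: never fails, but undecodable bytes become U+FFFD
    return raw.decode('utf-8', errors='replace')
```

Detection on short inputs tends to be unreliable, so the confidence threshold is worth tuning against real files.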
Problem: Summaries may miss nuance, oversimplify, or hallucinate details.
Solutions:
Problem: Fetching many URLs quickly may trigger rate limits or blocks.
Solutions:
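One common mitigation is to space requests out so that consecutive fetches start at least a fixed interval apart. The sketch below is illustrative only: `fetch_politely`, its `min_interval` default of one second, and the injectable `sleep` and `clock` arguments are hypothetical names, and `fetch` stands for whatever function actually retrieves a URL.

```python
import time
from typing import Callable, Iterable, List, TypeVar

R = TypeVar('R')


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], R],
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[R]:
    """Call fetch(url) for each URL, starting requests at least min_interval seconds apart."""
    results: List[R] = []
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the interval since the previous request began
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results
```

Passing `sleep` and `clock` as arguments keeps the function testable without real delays.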
This skill works with any AI model capable of text summarization. The prompts and workflows are model-agnostic. For best results:
| Model Capability | Recommended Use |
|---|---|
| Large context window (100K+) | Full document summarization in one pass |
| Standard context (8K-32K) | Chunked processing with merge step |
| Fast inference | Batch processing of multiple sources |
| Multi-language | Cross-language summary generation |
The skill automatically adapts to the available model's capabilities: