Extract content from YouTube videos, web articles, PDFs, local files, podcasts, and tweets to structured Markdown
Extracts content from each supported source type and saves it as structured Markdown files.

Script: `/workspace/content_extractor.py`

Output is written to `./content/output/YYYY-MM-DD_title-slug/`, containing one of the following files depending on the source type: `transcript.md`, `article.md`, `paper.md`, `document.md`, `presentation.md`, `tweet.md`, or `podcast.md`. With `--summarize`, a `summary.md` is added alongside.

| Flag | Purpose |
|---|---|
| `--cookies-from-browser chrome` | Auth for member-only content |
| `--cookies FILE` | Auth via Netscape cookies.txt file |
| `--lang CODE` | Force subtitle language |
| `--prefer-auto` | Prefer auto-generated subs |
| `--no-chapters` | Flat transcript, no sections |
| `--include-description` | Add video/article description |
| `--dry-run` | Preview without downloading |
| `--overwrite` | Replace existing files |
| `--polish` | Clean up the extracted text with Claude |
| `--summarize` | Generate a Pyramid/SCQA summary |
| `--no-whisper` | Disable the Whisper audio fallback |
| `--whisper-model MODEL` | Whisper model size (default: `base`) |
| `--max-episodes N` | Max podcast episodes to extract |
| `--nitter-instance HOST` | Nitter instance for tweet extraction |
| `--no-speaker-notes` | Exclude PowerPoint speaker notes |
| `-o DIR` | Custom output directory |
| `-f FILE` | File containing a list of URLs |
When `--polish` is used:

- the raw extraction is kept as `{basename}.unpolished.md` in the output folder
- the polished version is saved as `{basename}.md`

When `--summarize` is used:

- the summary is written to `summary.md` in the same folder