Name: Content Collector
Author: goodjin

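Content Collector - URL Pool Content Collection Tool

Collects news, articles, guides, and reference material from a pool of URLs, with support for scheduled runs, deduplication, category management, and automatic classification.

Core Capabilities

Feature	Description
Scheduled collection	Runs automatically every day at 9:00 (requires launchd or cron to be configured)
On-demand collection	Collects immediately when triggered and prints the results
URL pool management	Organizes URLs by category; supports adding, removing, and listing
Smart matching	Matches a request to the URLs of the relevant categories
Automatic classification	Determines the content type after collection and files it under the matching category
Deduplication	Deduplicates by URL hash, checking the logs of the current and previous month
HTML report	Generates a styled briefing page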

Directory Structure

~/.claude/skills/content-collector/
├── config/
│   ├── sources.json      # URL pool configuration
│   └── email.json        # email configuration (optional)
├── logs/
│   ├── 2026-02.jsonl     # collection records, one file per month
│   └── 2026-01.jsonl
└── reports/
    └── 2026-02-21.html   # HTML report output

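URL Pool Categories

Category	Description	Examples
news	News	36氪, 虎嗅, 钛媒体
article	Articles and blogs	Hacker News, 掘金, V2EX
guide	Guides and tutorials	MDN, 阮一峰's blog
reference	Reference material	GitHub, Stack Overflow
life	Lifestyle and leisure	少数派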

Output directory layout

[current project]/
├── collected/
│   ├── news/           # news
│   │   └── 2026-02-21/
│   │       ├── articles/           # one standalone HTML file per article
│   │       │   ├── article-1.html
│   │       │   └── article-2.html
│   │       ├── index.md            # Markdown summary
│   │       └── index.html          # HTML briefing (titles are links)
│   ├── article/        # articles
│   ├── guide/          # guides
│   ├── reference/      # reference material
│   └── life/           # lifestyle
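
The layout above comes down to a little path arithmetic. A minimal sketch, assuming a hypothetical helper named `output_paths` (it is not part of the skill) and that the project root is passed in explicitly:

```python
from datetime import date
from pathlib import Path

# The five categories named in the URL pool.
CATEGORIES = ("news", "article", "guide", "reference", "life")


def output_paths(project_root, category, day):
    """Return the per-day directory, the articles directory and both index files."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    day_dir = Path(project_root) / "collected" / category / day.isoformat()
    return {
        "day_dir": day_dir,
        "articles_dir": day_dir / "articles",
        "index_md": day_dir / "index.md",
        "index_html": day_dir / "index.html",
    }


paths = output_paths("my-project", "news", date(2026, 2, 21))
print(paths["index_html"])
```

Creating the directories is left to the caller, for example with `paths["articles_dir"].mkdir(parents=True, exist_ok=True)`.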

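URL pool configuration (config/sources.json)

{
  "url_pool": {
    "news": {
      "name": "News",
      "description": "News sites and news platforms",
      "urls": [
        {"name": "36氪", "url": "https://36kr.com", "enabled": true}
      ]
    },
    "article": {
      "name": "Articles and blogs",
      "description": "Blog posts, columns, and commentary",
      "urls": [
        {"name": "Hacker News", "url": "https://news.ycombinator.com", "enabled": true}
      ]
    },
    "guide": {
      "name": "Guides and tutorials",
      "description": "Tutorials, learning guides, and introductory material",
      "urls": []
    },
    "reference": {
      "name": "Reference material",
      "description": "Technical documentation, API references, and official docs",
      "urls": []
    },
    "life": {
      "name": "Lifestyle and leisure",
      "description": "Lifestyle, hobbies, and interests",
      "urls": []
    }
  }
}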
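
The add, remove, and list operations on the URL pool can be sketched against this structure. The helper names below (`enabled_urls`, `add_url`, `remove_url`) are hypothetical, and treating a missing `enabled` key as `true` is an assumption; the skill itself may handle these cases differently.

```python
import json


def enabled_urls(pool, category=None):
    """Yield (category, name, url) for every enabled entry, optionally for one category."""
    for cat, group in pool["url_pool"].items():
        if category is not None and cat != category:
            continue
        for entry in group["urls"]:
            # Assumption: an entry without an "enabled" key counts as enabled.
            if entry.get("enabled", True):
                yield cat, entry["name"], entry["url"]


def add_url(pool, category, name, url):
    """Append a new enabled entry; return False if the URL is already in that category."""
    group = pool["url_pool"].get(category)
    if group is None:
        raise KeyError(f"unknown category: {category}")
    if any(entry["url"] == url for entry in group["urls"]):
        return False
    group["urls"].append({"name": name, "url": url, "enabled": True})
    return True


def remove_url(pool, name):
    """Remove every entry with the given name; return how many were removed."""
    removed = 0
    for group in pool["url_pool"].values():
        kept = [entry for entry in group["urls"] if entry["name"] != name]
        removed += len(group["urls"]) - len(kept)
        group["urls"] = kept
    return removed


sample = json.loads("""
{"url_pool": {
  "news": {"name": "News", "description": "", "urls": [
    {"name": "36氪", "url": "https://36kr.com", "enabled": true}]},
  "guide": {"name": "Guides and tutorials", "description": "", "urls": []}
}}
""")
add_url(sample, "guide", "MDN Web Docs", "https://developer.mozilla.org")
for row in enabled_urls(sample):
    print(row)
```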

Email configuration (config/email.json)

{
  "smtp_server": "smtp.gmail.com",
  "smtp_port": 587,
  "username": "[email protected]",
  "password": "app_password",
  "from": "[email protected]",
  "to": ["[email protected]"]
}
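
A configuration of this shape maps directly onto Python's standard `smtplib` and `email.message` modules. The sketch below is illustrative only: `build_message` and `send` are hypothetical helpers, the addresses are placeholders, and STARTTLS is assumed because port 587 is the usual submission port for it.

```python
import smtplib
from email.message import EmailMessage


def build_message(cfg, subject, html_body):
    """Build a multipart message with a plain-text fallback and an HTML part."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = cfg["from"]
    msg["To"] = ", ".join(cfg["to"])
    msg.set_content("This briefing is best viewed as HTML.")
    msg.add_alternative(html_body, subtype="html")
    return msg


def send(cfg, msg):
    """Send the message over SMTP with STARTTLS (assumed for port 587)."""
    with smtplib.SMTP(cfg["smtp_server"], cfg["smtp_port"]) as smtp:
        smtp.starttls()
        smtp.login(cfg["username"], cfg["password"])
        smtp.send_message(msg)


# Placeholder values; a real config/email.json supplies these fields.
cfg = {
    "smtp_server": "smtp.gmail.com",
    "smtp_port": 587,
    "username": "sender@example.com",
    "password": "app_password",
    "from": "sender@example.com",
    "to": ["reader@example.com"],
}
msg = build_message(cfg, "Content Collector briefing", "<h1>Briefing</h1>")
print(msg["To"])
# send(cfg, msg) would deliver it; it is not called here because it needs real credentials.
```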

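Collection Workflow

Step 1: Load configuration

Read ~/.claude/skills/content-collector/config/sources.json
Read ~/.claude/skills/content-collector/config/email.json  # if it exists

Step 3: Visit sites and collect

Method 1: Playwright MCP (recommended)

mcp__playwright__navigate -> open the page
mcp__playwright__screenshot -> take a screenshot to inspect the page
mcp__playwright__click -> click a link
mcp__playwright__evaluate -> extract content

Method 2: Playwright headed mode (scripted automation)

# Run the collection script (the user must be on hand to solve any CAPTCHA)
node scripts/collect-36kr-headed.js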

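CAPTCHA handling (important)

Detecting a CAPTCHA:
- The script automatically checks the page for a CAPTCHA container (#captcha_container, .captcha-wrapper, and similar selectors)
- When one is detected, collection pauses automatically

Prompting the user:

The terminal shows this message:

⚠️ CAPTCHA detected!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Please complete the slider verification manually in the browser
Once verification is complete, the browser continues automatically...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Waiting for verification:
- The browser stays open while the user completes the slider verification by hand
- The script detects when the CAPTCHA disappears and then resumes collection
- Timeout: 120 seconds

Timeout handling:
- If the user does not complete verification within 120 seconds, the script goes ahead and attempts collection anyway
- Several manual verifications may be needed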
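
The wait-and-timeout behaviour described above is a polling loop. A minimal sketch, assuming a hypothetical `wait_until_gone` helper and an `is_present` callback that wraps the real browser check (for example a selector lookup for `#captcha_container`); the actual script may be structured differently.

```python
import time


def wait_until_gone(is_present, timeout=120.0, poll=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until is_present() returns False.

    Returns True if the CAPTCHA cleared in time and False on timeout.
    clock and sleep are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout
    while is_present():
        if clock() >= deadline:
            return False
        sleep(poll)
    return True


# Simulated check: the CAPTCHA is visible for the first three polls, then gone.
answers = iter([True, True, True, False])
cleared = wait_until_gone(lambda: next(answers), timeout=120, poll=0)
print("resuming collection" if cleared else "timed out, attempting collection anyway")
```

Either outcome leads back to collection, which matches the timeout rule above: a `False` result only signals that another manual verification may be needed.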

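Full-content collection workflow

Fetching the article list:
- Visit the home page of the news source (for example 虎嗅 or 钛媒体)
- Parse the article titles and links

Collecting the full content of each article:

Visit each article's detail-page URL in turn

Extract the full content with Playwright:

mcp__playwright__navigate -> article URL
mcp__playwright__evaluate -> extract the full HTML of the article/main/content element

Save the full content to an HTML file:
- Title: <h1> or <title>
- Body: the full HTML of <article>, .content, .article-body, or main
- Images: keep the <img> tags, with src set to an absolute URL
- Keep the link to the original article for easy navigation

Links in the briefing point to local HTML:
- Title links in the briefing point to articles/xxx.html (a local file)
- Clicking one opens the full content rather than a summary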
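
Saved pages only render their images if relative `src` values are resolved against the article's URL. The sketch below does this with a regular expression and `urllib.parse.urljoin`; `absolutize_img_src` is a hypothetical helper, and the regex covers quoted `src` attributes only, so it is an illustration rather than a full HTML rewrite.

```python
import re
from urllib.parse import urljoin

# Matches the src attribute of an <img> tag. Requiring whitespace before "src"
# keeps attributes such as data-src untouched.
_IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2',
                      re.IGNORECASE | re.DOTALL)


def absolutize_img_src(html, page_url):
    """Rewrite every quoted <img src> in html to an absolute URL."""
    def fix(match):
        prefix, quote, src = match.group(1), match.group(2), match.group(3)
        return f"{prefix}{quote}{urljoin(page_url, src)}{quote}"
    return _IMG_SRC.sub(fix, html)


html = ('<article><img src="/img/a.png"><img alt="x" src="b.jpg">'
        '<img src="https://cdn.example.com/c.gif"></article>')
print(absolutize_img_src(html, "https://example.com/news/post.html"))
```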

Step 4: Deduplication check

# Read the logs for the current and previous month
Read ~/.claude/skills/content-collector/logs/2026-02.jsonl
Read ~/.claude/skills/content-collector/logs/2026-01.jsonl
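
The check against the current and previous month's logs can be sketched as follows. The helper names are hypothetical, and the hash function is an assumption: the source only shows a short `url_hash` value, so SHA-256 truncated to 12 hex characters stands in here.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def month_keys(today):
    """Return the 'YYYY-MM' keys for the current and the previous month."""
    if today.month == 1:
        prev_year, prev_month = today.year - 1, 12
    else:
        prev_year, prev_month = today.year, today.month - 1
    return [f"{today.year:04d}-{today.month:02d}", f"{prev_year:04d}-{prev_month:02d}"]


def url_hash(url):
    # Assumption: SHA-256 of the URL, truncated to 12 hex characters.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]


def seen_hashes(log_dir, today):
    """Collect every url_hash recorded in the two relevant monthly logs."""
    seen = set()
    for key in month_keys(today):
        path = Path(log_dir) / f"{key}.jsonl"
        if not path.exists():
            continue
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    seen.add(json.loads(line)["url_hash"])
    return seen


def is_duplicate(url, seen):
    return url_hash(url) in seen


log_dir = Path("~/.claude/skills/content-collector/logs").expanduser()
seen = seen_hashes(log_dir, date.today())
print(is_duplicate("https://example.com/post", seen))
```

Limiting the lookup to two monthly files keeps the check cheap, at the cost of letting an article older than that window be collected again.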

Step 7: Generate reports

Article page template (articles/*.html):

<!DOCTYPE html>
<html lang="zh-CN">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Article Title - Source</title>
  <style>
    * { box-sizing: border-box; margin: 0; padding: 0; }
    body {
      font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
      background: linear-gradient(180deg, #f5f7fa 0%, #e4e8ec 100%);
      min-height: 100vh;
      line-height: 2.0;
    }
    .container { max-width: 800px; margin: 0 auto; padding: 40px 20px; }
    .back-link {
      display: inline-flex;
      align-items: center;
      gap: 8px;
      margin-bottom: 20px;
      padding: 10px 20px;
      background: white;
      color: #667eea;
      text-decoration: none;
      border-radius: 25px;
      font-weight: 500;
      box-shadow: 0 2px 8px rgba(0,0,0,0.08);
      transition: all 0.3s ease;
    }
    .back-link:hover {
      transform: translateX(-4px);
      box-shadow: 0 4px 12px rgba(102, 126, 234, 0.2);
    }
    header {
      background: linear-gradient(135deg, #667eea 0%, #764ba2 50%, #6B8DD6 100%);
      color: white;
      padding: 36px 32px;
      border-radius: 16px 16px 0 0;
      box-shadow: 0 4px 20px rgba(102, 126, 234, 0.3);
    }
    h1 {
      font-size: 28px;
      font-weight: 700;
      line-height: 1.5;
      margin-bottom: 16px;
      text-shadow: 0 2px 4px rgba(0,0,0,0.1);
    }
    .meta {
      opacity: 0.95;
      font-size: 14px;
      display: flex;
      flex-wrap: wrap;
      gap: 16px;
    }
    .meta span {
      display: inline-flex;
      align-items: center;
      gap: 6px;
    }
    .content {
      background: white;
      padding: 40px 32px;
      border-radius: 0 0 16px 16px;
      box-shadow: 0 2px 16px rgba(0,0,0,0.06);
    }
    .content p {
      margin-bottom: 24px;
      text-indent: 2em;
      color: #333;
      font-size: 16px;
    }
    .content a {
      color: #667eea;
      text-decoration: none;
      font-weight: 500;
    }
    .content a:hover {
      text-decoration: underline;
    }
    .original-link {
      display: inline-block;
      margin-top: 30px;
      padding: 14px 28px;
      background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
      color: white;
      text-decoration: none;
      border-radius: 25px;
      font-weight: 500;
      transition: all 0.3s ease;
    }
    .original-link:hover {
      transform: translateY(-2px);
      box-shadow: 0 4px 16px rgba(102, 126, 234, 0.4);
    }
  </style>
</head>
<body>
  <div class="container">
    <a href="index.html" class="back-link">
      <span>←</span> Back to briefing
    </a>
    <header>
      <h1>Article Title</h1>
      <div class="meta">
        <span>📰 Source: Hacker News</span>
        <span>🏷️ Category: article</span>
        <span>🕐 Collected: 2026-02-21 09:00</span>
      </div>
    </header>
    <div class="content">
      <p>Article body...</p>
      <p>Article body...</p>
      <a href="original-url" target="_blank" class="original-link">🔗 Read the original →</a>
    </div>
  </div>
</body>
</html>

Markdown format:

# Article Title

## Metadata
- **Source**: Hacker News
- **Original**: https://...
- **Collected at**: 2026-02-21 09:00
- **Category**: article

## Body

[article content]

HTML briefing template (index.html):

<!DOCTYPE html>
<html lang="zh-CN">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Content Collection Briefing - February 21, 2026</title>
  <style>
    * { box-sizing: border-box; margin: 0; padding: 0; }
    body {
      font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
      background: linear-gradient(180deg, #f0f2f5 0%, #e8eaed 100%);
      min-height: 100vh;
      padding: 20px;
    }
    .container { max-width: 800px; margin: 0 auto; }
    header {
      background: linear-gradient(135deg, #667eea 0%, #764ba2 50%, #6B8DD6 100%);
      color: white;
      padding: 40px 30px;
      text-align: center;
      border-radius: 16px 16px 0 0;
      box-shadow: 0 4px 20px rgba(102, 126, 234, 0.3);
    }
    header h1 {
      font-size: 32px;
      font-weight: 700;
      margin-bottom: 12px;
      text-shadow: 0 2px 4px rgba(0,0,0,0.1);
    }
    header .subtitle {
      opacity: 0.95;
      font-size: 15px;
      font-weight: 500;
    }
    .section {
      background: white;
      padding: 24px;
      border-radius: 0 0 16px 16px;
      box-shadow: 0 2px 16px rgba(0,0,0,0.08);
    }
    .section h2 {
      color: #1a1a1a;
      font-size: 20px;
      font-weight: 600;
      margin-bottom: 20px;
      padding-bottom: 14px;
      border-bottom: 2px solid #667eea;
    }
    .article {
      position: relative; /* containing block for the absolutely positioned .title a::before overlay */
      padding: 20px 0;
      border-bottom: 1px solid #f0f0f0;
    }
    .article:last-child { border-bottom: none; }
    .article:hover {
      background: linear-gradient(90deg, rgba(102,126,234,0.03) 0%, transparent 100%);
      margin: 0 -12px;
      padding: 20px 12px;
      border-radius: 8px;
    }
    .tags { margin-bottom: 10px; }
    .category {
      display: inline-block;
      background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
      color: white;
      padding: 4px 12px;
      border-radius: 20px;
      font-size: 12px;
      font-weight: 500;
      margin-right: 8px;
    }
    .source {
      display: inline-block;
      background: #f5f5f5;
      color: #666;
      padding: 4px 12px;
      border-radius: 20px;
      font-size: 12px;
      font-weight: 500;
    }
    .title {
      font-size: 18px;
      font-weight: 600;
      line-height: 1.6;
      margin: 10px 0 8px;
    }
    .title a {
      color: #1a1a1a;
      text-decoration: none;
      transition: all 0.2s ease;
    }
    .title a:hover {
      color: #667eea;
      text-decoration: none;
    }
    .title a::before {
      content: '';
      position: absolute;
      width: 100%;
      height: 100%;
      top: 0;
      left: 0;
    }
    .meta {
      font-size: 13px;
      color: #999;
      margin: 8px 0;
    }
    .summary {
      color: #555;
      line-height: 1.8;
      font-size: 15px;
      margin-top: 10px;
    }
    footer {
      text-align: center;
      padding: 24px;
      color: #888;
      font-size: 13px;
    }
    footer a { color: #667eea; text-decoration: none; }
  </style>
</head>
<body>
  <div class="container">
    <header>
      <h1>📰 Content Collection Briefing</h1>
      <p class="subtitle">February 21, 2026 · Saturday · 12 articles</p>
    </header>

    <div class="section">
      <h2>🔥 Top Articles</h2>
      <div class="article">
        <div class="tags">
          <span class="category">article</span>
          <span class="source">Hacker News</span>
        </div>
        <div class="title">
          <a href="articles/abc123.html" title="Open the full article">Article Title</a>
        </div>
        <div class="meta">Source: site name · Published: 2026-02-21 10:30</div>
        <div class="summary">Article summary...</div>
      </div>
      <!-- more articles... -->
    </div>

    <footer>
      <p>Generated automatically by Content Collector · <a href="#">View more</a></p>
    </footer>
  </div>
</body>
</html>
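
Each `.article` entry in the briefing can be filled in from a collected item. A minimal sketch, assuming a hypothetical `render_entry` helper and that the local file name is the item's `url_hash` (the template's `articles/abc123.html` suggests this, but the source does not state it). Values are HTML-escaped so that titles containing markup cannot break the page.

```python
from html import escape

# Mirrors the .article markup of the briefing template.
ENTRY = """<div class="article">
  <div class="tags">
    <span class="category">{category}</span>
    <span class="source">{source}</span>
  </div>
  <div class="title">
    <a href="articles/{url_hash}.html">{title}</a>
  </div>
  <div class="summary">{summary}</div>
</div>"""


def render_entry(item):
    """Return the HTML for one briefing entry with every value escaped."""
    safe = {key: escape(str(value), quote=True) for key, value in item.items()}
    return ENTRY.format(**safe)


item = {
    "category": "article",
    "source": "Hacker News",
    "url_hash": "abc123",
    "title": "Parsing <script> tags safely",
    "summary": "Article summary...",
}
print(render_entry(item))
```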

Step 9: Write the log record

{"url_hash": "abc123", "url": "https://...", "title": "Article Title", "collected_at": "2026-02-21T09:00:00+08:00", "source": "Hacker News", "category": "article"}
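
A record in this shape can be appended to the monthly JSONL file as sketched below. `make_record` and `append_record` are hypothetical helpers, and the fixed UTC+08:00 offset is an assumption taken from the sample timestamp.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Assumption: timestamps use UTC+08:00, as in the sample record.
CST = timezone(timedelta(hours=8))


def make_record(url, url_hash, title, source, category, now=None):
    """Build one log record with the same keys as the sample line."""
    now = now or datetime.now(CST)
    return {
        "url_hash": url_hash,
        "url": url,
        "title": title,
        "collected_at": now.isoformat(timespec="seconds"),
        "source": source,
        "category": category,
    }


def append_record(log_dir, record):
    """Append the record to <log_dir>/<YYYY-MM>.jsonl and return that path."""
    month = record["collected_at"][:7]
    path = Path(log_dir) / f"{month}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


rec = make_record("https://example.com/post", "abc123", "Article Title",
                  "Hacker News", "article",
                  now=datetime(2026, 2, 21, 9, 0, tzinfo=CST))
print(json.dumps(rec, ensure_ascii=False))
```

Writing with `ensure_ascii=False` keeps Chinese titles readable in the log file.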

Scheduled Task Setup

macOS (launchd)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.user.content-collector</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/claude</string>
        <string>--skill</string>
        <string>content-collector</string>
        <string>定时采集</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>9</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>

User: "设置定时采集" (set up scheduled collection)
→ Generates the plist in ~/Library/LaunchAgents/
→ Runs launchctl load
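
The plist generation step can be sketched with Python's standard `plistlib`. `build_launch_agent` is a hypothetical helper; the label, program arguments, and schedule are copied from the plist above, and writing the file and running `launchctl load` are left out.

```python
import plistlib


def build_launch_agent(hour=9, minute=0):
    """Return the launchd job definition as a plain dictionary."""
    return {
        "Label": "com.user.content-collector",
        "ProgramArguments": [
            "/usr/local/bin/claude",
            "--skill",
            "content-collector",
            "定时采集",
        ],
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
    }


# plistlib.dumps returns XML bytes by default.
xml_bytes = plistlib.dumps(build_launch_agent())
print(xml_bytes.decode("utf-8"))
```

The bytes would then be written to a `.plist` file under `~/Library/LaunchAgents/` before the job is loaded.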

Linux (cron)

0 9 * * * /usr/local/bin/claude --skill content-collector "定时采集"

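User Commands

Command	Description
"收集新闻" (collect news)	Collects from the URLs in the news category
"收集AI文章" (collect AI articles)	Collects from the URLs in the article category
"采集 https://example.com" (collect a URL)	Collects only the given URL, bypassing the URL pool
"列出所有网址" (list all URLs)	Shows the URL pool, grouped by category
"添加网址 [分类] [名称] [URL]" (add URL [category] [name] [URL])	Adds a new URL to the given category
"删除网址 [名称]" (remove URL [name])	Removes a URL from the pool
"设置输出目录 [路径]" (set output directory [path])	Changes where reports are written
"设置定时采集" (set up scheduled collection)	Configures the scheduled task
"配置邮件" (configure email)	Sets the email delivery parameters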

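Output Examples

On-demand trigger

## Collection Complete

### Request Analysis
- Request type: article (articles and blogs)
- Matched category: article

### Collection Statistics
- New: 12
- Skipped as duplicates: 3
- Failed: 0

### Source Breakdown
| Source | Count |
|------|------|
| Hacker News | 5 |
| 掘金 | 4 |
| V2EX | 3 |

### Automatic Classification
- Newly classified: 2 URLs filed under article

### File Location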

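User-specified URL

## Collection Complete

### Collection Mode
- User-specified URL: https://example.com/article
- URL pool matching skipped

### Collection Statistics
- New: 1
- Automatically classified as: article

### File Location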

Scheduled trigger

Report location: ~/mynote/collected/news/2026-02-21/index.html
Email status: ✓ sent to [email protected]


Preset URL Pool

news (News)

Name	URL
36氪	https://36kr.com
虎嗅	https://www.huxiu.com
钛媒体	https://www.tmtpost.com
极客公园	https://www.geekpark.net

article (Articles and blogs)

Name	URL
Hacker News	https://news.ycombinator.com
Lobsters	https://lobste.rs
Reddit Programming	https://www.reddit.com/r/programming/
掘金	https://juejin.cn
V2EX	https://www.v2ex.com

guide (Guides and tutorials)

Name	URL
MDN Web Docs	https://developer.mozilla.org
阮一峰的网络日志	https://www.ruanyifeng.com/blog/

reference (Reference material)

Name	URL
GitHub Trending	https://github.com/trending
Stack Overflow	https://stackoverflow.com
