
Doc Importer

Name: Doc Importer
Author: wpsnote

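Batch-imports local documents into WPS Notes. It can scan an Obsidian Vault, SiYuan notes, WeChat Official Account archives, the Downloads directory, or any user-specified directory; it automatically detects HTML, Markdown, PDF, DOCX, PPTX, XLSX and other formats, converts them into content WPS Notes can read, and preserves images and rich-text formatting. Triggered when the user says things like "import documents into WPS Notes" (「导入文档到 WPS 笔记」), "import my Obsidian notes" (「把我的 Obsidian 笔记导入」), "import SiYuan notes" (「导入思源笔记」), "import the downloaded PDF/Word/PPT files into notes" (「把下载的 PDF/Word/PPT 导入笔记」), "import Official Account articles into notes" (「把公众号文章导入笔记」), or "sync local documents to WPS Notes" (「同步本地文档到 WPS 笔记」). Not intended for editing WPS Notes content directly, or for document format conversion that does not import into notes.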

wpsnote · 116 stars · 2026-04-02

Category: Documents

Skill content

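Doc Importer (文档导入器)

Batch-imports local documents (Obsidian, SiYuan, WeChat Official Account HTML, the Downloads directory, or any other directory) into WPS Notes.

Quick start

Prefer the wpsnote-cli script route; it is faster than stepping through MCP operations one at a time: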

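# Confirm the CLI connection works
wpsnote-cli status

# Import an entire directory in one go
python3 scripts/import_to_wps.py ~/Documents/MyVault

# Import only documents not imported before (deduplicated by title)
python3 scripts/import_to_wps.py ~/Downloads --resume

# Dry run (nothing is actually written)
python3 scripts/import_to_wps.py ~/Downloads --dry-run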

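Supported document sources

Source	Description	Typical directory layout
Obsidian Vault	Scans .md files; keeps wiki links, Callouts, and tags	`Vault/笔记名.md` + `attachments/`
SiYuan	Scans .sy files (JSON format) and extracts the Block Tree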


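Supported file formats

Format	Conversion method	Image handling	Rich text preserved
`.html`	BeautifulSoup parses inline styles	Local images as base64	✅ color/bold/headings
`.md` / `.markdown`	Parsed directly, converted to WPS XML	Local images as base64	Basic formatting
`.pdf`	pdfplumber extracts text + pdfimages extracts images	Embedded images extracted	Headings inferred
`.docx`	pandoc converts to Markdown; word/media/ is extracted	Extracted from the unpacked archive	Basic formatting
`.pptx`	markitdown extracts text + media unpacked	Extracted from the unpacked archive	Slide structure
`.xlsx`	Read with pandas, converted to a WPS table	No images	Table structure
`.txt`	Read directly	No images	None
`.sy`	JSON parsing of the SiYuan Block Tree	Images extracted from assets	Full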
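The format table can be read as a lookup from file extension to converter. The sketch below is only an illustration of that idea: the `CONVERTERS` labels and the `pick_converter` name are hypothetical and are not taken from the skill's scripts.

```python
from pathlib import Path

# Extension -> converter label, following the format table.
# The label strings are illustrative placeholders, not real function names.
CONVERTERS = {
    ".html": "beautifulsoup",
    ".md": "markdown",
    ".markdown": "markdown",
    ".pdf": "pdfplumber",
    ".docx": "pandoc",
    ".pptx": "markitdown",
    ".xlsx": "pandas",
    ".txt": "plain",
    ".sy": "siyuan-json",
}

def pick_converter(path):
    """Return the converter label for a file, or None if the format is unsupported."""
    return CONVERTERS.get(Path(path).suffix.lower())
```

Lower-casing the suffix means `Report.PDF` and `report.pdf` are treated the same; unsupported files return `None` so the caller can skip them.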

WPS API key limitations (must read)

1. get_note_outline returns only 100 blocks by default

def get_all_blocks(note_id):
    """Page through a note and return all of its blocks."""
    # The first page comes from outline (its blocks carry a preview field)
    r = cli(['outline', '--note_id', note_id])
    data = r.get('data') or {}
    total = data.get('block_count', 0)
    blocks = list(data.get('blocks', []))
    last_id = blocks[-1]['id'] if blocks else None

    # Beyond 100 blocks, keep reading with read-blocks from the last block seen
    while len(blocks) < total and last_id:
        r2 = cli(['read-blocks', '--note_id', note_id,
                  '--block_id', last_id, '--after', '100',
                  '--include_anchor', 'false'])
        new_blocks = (r2.get('data') or {}).get('blocks', [])
        if not new_blocks:
            break  # nothing more to read; avoid looping forever
        blocks.extend(new_blocks)
        last_id = new_blocks[-1]['id']

    return blocks

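2. insert_image can only insert into the foreground note in older WPS versions

wpsnote-cli --version  # check the installed version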

3. The list interface may truncate large responses

import subprocess

r = subprocess.run(['wpsnote-cli', 'list', '--limit', '100', '--json'],
                   capture_output=True, timeout=30)  # note: do not pass text=True
raw = r.stdout.decode('utf-8', errors='ignore')
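Decoding the bytes by hand matters because `text=True` decodes strictly and raises on a byte sequence cut off mid-character. A minimal sketch of the follow-up step, parsing the decoded output defensively, might look like this; `parse_cli_json` is a hypothetical helper name, and returning `None` on failure is an assumption about how the caller wants to handle truncation.

```python
import json

def parse_cli_json(raw_bytes):
    """Decode CLI stdout leniently and parse it as JSON; return None on failure."""
    # errors="ignore" drops bytes that do not form valid UTF-8 instead of raising
    text = raw_bytes.decode("utf-8", errors="ignore").strip()
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Truncated or otherwise malformed output: let the caller retry
        return None
```

A `None` result is a signal to retry, for example with a smaller `--limit`, rather than to treat the note list as empty.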

4. Anchor invalidation causes silent content loss

import time

def do_insert(note_id, content, get_anchor_fn, max_retries=4):
    """Insert content with retries, refreshing the anchor before each retry."""
    anchor = get_anchor_fn()
    for attempt in range(max_retries):
        if attempt > 0:
            time.sleep(1.5 * attempt)  # linear backoff: 1.5s, 3.0s, 4.5s
            anchor = get_anchor_fn()  # re-fetch the latest anchor
        res = batch_edit(note_id, [{
            'op': 'insert', 'anchor_id': anchor,
            'position': 'after', 'content': content
        }])
        if res.get('ok') is not False:
            return True
    return False  # give up only after max_retries failed attempts

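def get_last_block_id(note_id):
    """Return the ID of the last block listed by outline, or None for an empty note.

    outline returns only 100 blocks by default, so for longer notes
    take the last element of get_all_blocks(note_id) instead.
    """
    r = cli(['outline', '--note_id', note_id])
    blocks = (r.get('data') or {}).get('blocks', [])
    return blocks[-1]['id'] if blocks else None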

5. Stable parameters for batch writes

BATCH_SIZE = 4  # do not exceed 8, or anchors tend to become invalid
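One simple way to honor this batch size is to slice the pending operations into groups before sending them. The helper below is a sketch; the name `chunked` is not from the skill's scripts.

```python
def chunked(items, size=4):
    """Yield successive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each yielded slice can then be passed to one batch-edit call, with the anchor refreshed between calls.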

Full workflow

Step 1: Determine the directory to scan

# Obsidian Vault (macOS)
ls ~/Documents/ | grep -i obsidian
ls ~/Library/Mobile\ Documents/iCloud~md~obsidian/Documents/ 2>/dev/null

# SiYuan (macOS)
ls ~/Documents/SiYuan/ 2>/dev/null
ls ~/SiYuan/ 2>/dev/null

# WeChat Official Account archives (common directory names)
ls ~/Documents/ | grep -i "mp\|公众号\|推文\|文章"

# Downloads directory
ls ~/Downloads/ | grep -E "\.(pdf|docx|pptx|xlsx|md|html)$" | head -10

Step 2: Scan the document list

python3 scripts/scan_docs.py <directory> [--recursive] [--days N] [--source TYPE]

{
  "source_type": "wechat_mp",
  "root_path": "/Users/xxx/Documents/articles",
  "files": [
    {
      "path": "/Users/xxx/Documents/articles/article-name/原文.html",
      "rel_path": "article-name/原文.html",
      "title": "Article title",
      "publish_time": "2025-04-21 18:30",
      "size_bytes": 204800,
      "modified": "2025-04-22T10:00:00",
      "format": "html",
      "estimated_images": 12,
      "estimated_blocks": 180
    }
  ],
  "total": 71,
  "formats": {"html": 71}
}
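The `total` and `formats` fields in this output are simple aggregates over `files`. A sketch of how they could be derived follows; `summarize` is a hypothetical name, and the actual scan_docs.py may compute them differently.

```python
from collections import Counter

def summarize(files):
    """Build the total and per-format counts from a list of scanned file records."""
    formats = Counter(record["format"] for record in files)
    return {"total": len(files), "formats": dict(formats)}
```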

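Step 3: Show the file list and ask what to import

Found 71 files:
  - HTML: 71

File list:
 1. AutoGLM 发布之后，如今国产大模型终于长出了手。  (2025-03-31, 12 images)
 2. 你可能看不懂扣子空间为什么重要…                 (2025-04-21, 17 images)
 ... (truncate after 20 entries and report the total)

How would you like to import?
 [A] Import everything (71 files)
 [B] Pick manually (enter file numbers, e.g. 1,3,5-10)
 [S] Skip notes whose title already exists (deduplicated by note title)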
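The manual-selection syntax (`1,3,5-10`) needs to be expanded into concrete file numbers. A minimal sketch, assuming 1-based numbering as in the list shown to the user; `parse_selection` is a hypothetical helper, not the script's own function.

```python
def parse_selection(spec, total):
    """Expand a spec such as '1,3,5-10' into a sorted list of unique 1-based indices."""
    chosen = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
        else:
            lo = hi = int(part)
        if lo > hi or lo < 1 or hi > total:
            raise ValueError(f"selection out of range: {part!r}")
        chosen.update(range(lo, hi + 1))
    return sorted(chosen)
```

Rejecting out-of-range numbers up front avoids importing a different file than the one the user meant.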

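Step 4: Duplicate detection

def check_exists(title):
    """Return the existing note whose title matches exactly, or None."""
    # Search on the first 20 characters, then require an exact title match
    r = cli(['find', '--keyword', title[:20], '--limit', '5'])
    notes = (r.get('data') or {}).get('notes', [])
    return next((n for n in notes if n['title'] == title), None)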

The following notes already exist in WPS:
 - 《AutoGLM 发布之后…》 (last updated: 2025-05-01)

How should they be handled?
 [O] Overwrite  [S] Skip  [A] Append  [RA] Apply the same policy to all conflicts
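The "apply to all" option implies that one answer is remembered for later conflicts. The sketch below shows one way to model that. It is an assumption-heavy illustration: the `ConflictResolver` class, the `ask` callback, and the reply shape `RA S` (apply "skip" to every conflict) are all hypothetical.

```python
from typing import Callable, Optional

POLICIES = {"O": "overwrite", "S": "skip", "A": "append"}

class ConflictResolver:
    """Ask once per conflict until the user picks an apply-to-all answer."""

    def __init__(self, ask: Callable[[str], str], preset: Optional[str] = None):
        self._ask = ask        # callback that shows the prompt and returns the reply
        self._sticky = preset  # e.g. "skip" when a policy was fixed on the command line

    def resolve(self, title: str) -> str:
        if self._sticky:
            return self._sticky
        reply = self._ask(title).strip().upper()
        apply_all = reply.startswith("RA")
        if apply_all:
            reply = reply[2:].strip()  # assumed reply shape: "RA S"
        if reply not in POLICIES:
            raise ValueError(f"unrecognized choice: {reply!r}")
        policy = POLICIES[reply]
        if apply_all:
            self._sticky = policy  # remembered for every later conflict
        return policy
```

Passing `preset="skip"` gives the non-interactive behavior, where no prompt is ever shown.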

Step 5: Convert and import

# 1. Parse the HTML and extract text segments and images
segments = html_to_segments(html_path, img_dir)
# segments format: [('xml', '<p>text</p>'), ('img', Path('图片/image_001.jpg')), ...]

# 2. Create the note
note_id = create_note(title)

# 3. Write the title line and the meta line (time, tags)
write_header(note_id, title, publish_time, tag)

# 4. Batch-write the body (images are written as placeholders first)
write_content_with_placeholders(note_id, segments)

# 5. Page through the note to find every placeholder block_id (using get_all_blocks)
ph_map = find_placeholders(note_id)

# 6. Insert the real images one by one, replacing the placeholders
for idx, img_path in img_list:
    insert_image(note_id, ph_map[idx]['block_id'], img_path)
    delete_placeholder(note_id, ph_map[idx]['block_id'])

# WeChat Official Account: compact time + tag
f'<p>{publish_time} | <tag id="{tag_id}">#推文</tag></p>'

# Generic documents: full meta blockquote
"""
<blockquote>
  <p>📄 <strong>Source</strong>: {rel_path}</p>
  <p>🕒 <strong>Modified</strong>: {modified_time}</p>
  <p>🔄 <strong>Imported</strong>: {import_time}</p>
</blockquote>
"""
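Values such as relative paths can contain `&` or `<`, which would break the XML if interpolated as-is. A minimal sketch of rendering the blockquote with escaping follows; `build_meta_blockquote` is a hypothetical helper, and it omits the emoji prefixes for brevity.

```python
from html import escape

def build_meta_blockquote(fields):
    """Render (label, value) pairs as one <blockquote>, escaping both parts."""
    rows = [
        f"  <p><strong>{escape(label)}</strong>: {escape(str(value))}</p>"
        for label, value in fields
    ]
    return "<blockquote>\n" + "\n".join(rows) + "\n</blockquote>"
```

For example, `build_meta_blockquote([("Source", rel_path), ("Modified", modified_time)])` yields one `<p>` row per pair.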

Step 6: Progress report

[3/71] AutoGLM 发布之后…
  Parsed: 95 text segments, 12 images
  ✓ Note created: 501435173515
  ✓ Text written
  ✓ Images inserted: 12/12
  Elapsed: 8.3s

Progress: ████████░░░░  42% (30/71)  ~12 minutes remaining
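A progress line of this shape can be produced from three numbers: files done, files in total, and elapsed time. The sketch below uses a naive linear estimate for the remaining time; `render_progress` and its parameters are hypothetical, and the bar fill is proportional to the fraction done, so it may not match the sample above cell for cell.

```python
def render_progress(done, total, elapsed_s, width=12):
    """Return a one-line progress bar with a naive linear time estimate."""
    frac = done / total if total else 1.0
    filled = int(frac * width)
    bar = "█" * filled + "░" * (width - filled)
    if done:
        # Average seconds per finished file, times the files still to go
        remaining_min = elapsed_s / done * (total - done) / 60
        eta = f"~{remaining_min:.0f} min remaining"
    else:
        eta = "estimating..."
    return f"{bar}  {frac:.0%} ({done}/{total})  {eta}"
```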

CLI script mode (recommended)

# Full import
python3 scripts/import_to_wps.py ~/Documents/mp_format/历史推文

# Resume after an interruption (skips notes whose title already exists)
python3 scripts/import_to_wps.py ~/Documents/mp_format/历史推文 --resume

# Specify the source type
python3 scripts/import_to_wps.py ~/Documents/MyVault --source obsidian

# Import only PDF and Word files from the last 7 days
python3 scripts/import_to_wps.py ~/Downloads --days 7 --formats pdf,docx

# Add an extra tag
python3 scripts/import_to_wps.py ~/Downloads --tag "#项目A"

# Skip conflicts without prompting
python3 scripts/import_to_wps.py ~/Downloads --on-conflict skip

# Preselect files by number
python3 scripts/import_to_wps.py ~/Downloads --select 1,3,5-10

MCP step-by-step mode (when the CLI is unavailable)

# 1. Create the note
create_note(title="Document title")

# 2. Get the initial block ID
get_note_outline(note_id=note_id)

# 3. Write the content (in batches of 4 blocks)
batch_edit(note_id=note_id, operations=[
    {"op": "replace", "block_id": first_block_id, "content": "<h1>Title</h1>"},
    {"op": "insert", "anchor_id": first_block_id, "position": "after",
     "content": "<p>2025-04-21 18:30 | <tag>#推文</tag></p>"},
])

# 4. Insert images (newer WPS versions support inserting into background notes)
insert_image(note_id=note_id, anchor_id=placeholder_block_id,
             position="before", src="data:image/jpeg;base64,...")

Source-specific handling

WeChat Official Account HTML (原文.html)

<article-dir>/
  原文.html        ← full HTML with inline styles
  meta.json        ← {"title": "...", "publish_time": "2025-04-21 18:30", "url": "..."}
  图片/            ← local image files
    image_001.jpg
    image_002.jpg

# font-size >= 18px → <h2>
# font-weight: 700|bold → <strong>
# font-style: italic → <em>
# color: rgb(R,G,B) → <span fontColor="#<WPS preset color>"> (requires color mapping)
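The first three rules can be sketched as a small function over an inline `style` string. This is an illustration only: `style_to_tags` is a hypothetical name, only `px` font sizes are handled, and the color rule is left out because the WPS preset palette is not listed here.

```python
import re

def style_to_tags(style):
    """Map a CSS inline-style string to the wrapper tags named in the rules."""
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()

    tags = []
    size = re.match(r"(\d+(?:\.\d+)?)px$", props.get("font-size", ""))
    if size and float(size.group(1)) >= 18:
        tags.append("h2")
    if props.get("font-weight") in ("700", "bold"):
        tags.append("strong")
    if props.get("font-style") == "italic":
        tags.append("em")
    return tags
```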

Troubleshooting

# Correct approach (large images)
echo "data:image/jpeg;base64,$(base64 -i image.jpg)" > /tmp/img.txt
wpsnote-cli insert-image --note_id "$ID" --anchor_id "$BID" --position before --src_file /tmp/img.txt
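When the import runs from Python rather than the shell, the same data-URI file can be produced with the standard library. A minimal sketch: `write_data_uri` is a hypothetical helper, and guessing the MIME type from the file extension is an assumption of this example.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path

def write_data_uri(image_path):
    """Encode an image as a data URI, write it to a temp file, and return that file's path."""
    image_path = Path(image_path)
    mime = mimetypes.guess_type(image_path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(image_path.read_bytes()).decode("ascii")
    # delete=False keeps the file on disk so the CLI can read it afterwards
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
        fh.write(f"data:{mime};base64,{payload}")
        return Path(fh.name)
```

The returned path can be handed to `--src_file`; the caller is responsible for deleting the temp file once the image is inserted.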

pandoc not installed

brew install pandoc  # macOS

pdfplumber not installed

pip3 install pdfplumber

Scanned PDF has no text (OCR required)

pip3 install pytesseract pdf2image
brew install tesseract

Dependencies

Tool	Purpose	Install
`wpsnote-cli`	CLI access to WPS Notes (required)	See the wpsnote-cli documentation
`beautifulsoup4`	HTML parsing (Official Account articles / web pages)	`pip3 install beautifulsoup4`
`lxml`	Faster HTML parsing	`pip3 install lxml`
`pandoc`	DOCX → Markdown	`brew install pandoc`
`pdfplumber`	PDF text/table extraction	`pip3 install pdfplumber`
`pypdf`	Fallback for PDF image extraction	`pip3 install pypdf`
`markitdown`	PPTX → Markdown	`pip3 install "markitdown[pptx]"`
`pandas`	Reading XLSX	`pip3 install pandas openpyxl`
`pillow`	Image processing / base64 conversion	`pip3 install pillow`
`python-frontmatter`	YAML front matter parsing	`pip3 install python-frontmatter`
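Before a large import it can help to check which of these dependencies are actually present. The sketch below uses only the standard library; `missing_dependencies` is a hypothetical helper, and the import names in `PY_MODULES` are the usual ones for these packages rather than something this skill specifies.

```python
import importlib.util
import shutil

# pip package name -> module name used at import time
PY_MODULES = {
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "pdfplumber": "pdfplumber",
    "pypdf": "pypdf",
    "markitdown": "markitdown",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
    "pillow": "PIL",
    "python-frontmatter": "frontmatter",
}
BINARIES = ["wpsnote-cli", "pandoc"]

def missing_dependencies():
    """Return the names of Python packages and executables that cannot be found."""
    missing = [pkg for pkg, mod in PY_MODULES.items()
               if importlib.util.find_spec(mod) is None]
    missing += [name for name in BINARIES if shutil.which(name) is None]
    return missing
```

An empty list means everything in the table was found; otherwise the names map directly onto the install commands in the table.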

Doc Importer

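Doc Importer (文档导入器)

Quick start

Supported document sources

Supported file formats

WPS API key limitations (must read)

1. get_note_outline returns only 100 blocks by default

2. insert_image can only insert into the foreground note in older WPS versions

3. The list interface may truncate large responses

4. Anchor invalidation causes silent content loss

5. Stable parameters for batch writes

Full workflow

Step 1: Determine the directory to scan

Step 2: Scan the document list

Step 3: Show the file list and ask what to import

Step 4: Duplicate detection

Step 5: Convert and import

Step 6: Progress report

CLI script mode (recommended)

MCP step-by-step mode (when the CLI is unavailable)

Source-specific handling

WeChat Official Account HTML (原文.html)

Obsidian Vault

SiYuan (思源笔记)

Troubleshooting

get_note_outline returns only 100 blocks and later content is lost

Images show "failed to load" when clicked after insertion

insert_image reports IMAGE_FETCH_FAILED

Anchor invalidation truncates content

wpsnote-cli list returns garbled or truncated output

pandoc not installed

pdfplumber not installed

Scanned PDF has no text (OCR required)

Dependencies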
