Name: MinerU Document Explorer
Author: opendatalab

Group	Purpose	Tools
Retrieval	Find and fetch documents	`query`, `get`, `multi_get`, `status`
Deep Reading	Navigate within a document	`doc_toc`, `doc_read`, `doc_grep`, `doc_query`, `doc_elements`, `doc_links`
Knowledge Ingestion	Build wiki knowledge base	`wiki_ingest`, `doc_write`, `wiki_lint`, `wiki_log`, `wiki_index`

which qmd && qmd status

# Option A: npm (recommended)
npm install -g mineru-document-explorer

# Option B: from source
git clone https://github.com/opendatalab/MinerU-Document-Explorer.git
cd MinerU-Document-Explorer && bun install && bun link

python3 --version

python3 -c "import pymupdf; import docx; import pptx; print('All dependencies OK')"

pip install pymupdf python-docx python-pptx
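The one-line import check above stops at the first missing module and does not say which pip package to install. The sketch below reports every missing package, using only the standard library. The module-to-package mapping is taken from the commands above; the function name `missing_packages` is hypothetical and not part of qmd.

```python
import importlib.util

# Import name -> pip package name, as used in the commands above.
REQUIRED = {"pymupdf": "pymupdf", "docx": "python-docx", "pptx": "python-pptx"}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose import module cannot be found."""
    return [pkg for module, pkg in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages()
if missing:
    print("pip install " + " ".join(missing))
else:
    print("All dependencies OK")
```

The check only locates the modules; it does not import them, so it will not catch a package that is installed but broken.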

# Method A: Environment variable
export MINERU_API_KEY="your-key-here"

# Method B: Config file (~/.config/qmd/doc-reading.json)
mkdir -p ~/.config/qmd
cat > ~/.config/qmd/doc-reading.json << 'EOF'
{
  "docReading": {
    "providers": {
      "fullText": { "pdf": ["mineru_cloud", "pymupdf"] }
    },
    "credentials": {
      "mineru": { "api_key": "YOUR_API_KEY_HERE" }
    }
  }
}
EOF
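If the config file is generated by a script rather than a heredoc, something like the following sketch could be used. It writes the same JSON structure as above; the helper names `build_config` and `write_config` are hypothetical, and restricting the file mode is a precaution added here, not something qmd is known to require.

```python
import json
from pathlib import Path


def build_config(api_key):
    """Build the docReading config with the same provider list as the heredoc above."""
    return {
        "docReading": {
            "providers": {"fullText": {"pdf": ["mineru_cloud", "pymupdf"]}},
            "credentials": {"mineru": {"api_key": api_key}},
        }
    }


def write_config(path, api_key):
    """Write the config as JSON, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(build_config(api_key), indent=2) + "\n")
    path.chmod(0o600)  # the file holds a credential, so keep it owner-only
    return path
```

A typical call would be `write_config(Path.home() / ".config/qmd/doc-reading.json", os.environ["MINERU_API_KEY"])`, which reads the key from the environment instead of hard-coding it.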

pip install mineru-open-sdk

# Index a folder (adjust path to user's documents)
qmd collection add ~/Documents --name mydocs --mask '**/*.{md,pdf,docx,pptx}'

# Verify indexing worked
qmd status

# Test search (instant, no model downloads)
qmd search "test"

{ "mcpServers": { "qmd": { "command": "qmd", "args": ["mcp"] } } }

qmd mcp --http --daemon   # start the server first

{ "mcpServers": { "qmd": { "url": "http://localhost:8181/mcp" } } }
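The stdio and HTTP entries differ only in the server object. As a rough illustration, the sketch below generates either form; `mcp_config` is a hypothetical helper, the default URL is the one shown above, and where the resulting JSON must be saved depends on the MCP client in use.

```python
import json


def mcp_config(transport="stdio", url="http://localhost:8181/mcp"):
    """Return an mcpServers entry for qmd, for either stdio or HTTP transport."""
    if transport == "stdio":
        server = {"command": "qmd", "args": ["mcp"]}
    elif transport == "http":
        server = {"url": url}
    else:
        raise ValueError(f"unknown transport: {transport!r}")
    return {"mcpServers": {"qmd": server}}


print(json.dumps(mcp_config("stdio")))
print(json.dumps(mcp_config("http")))
```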

{
  "docReading": {
    "providers": {
      "fullText": { "pdf": ["mineru_cloud", "pymupdf"] },
      "toc":      { "pdf": ["native_bookmarks"] },
      "elements": { "docx": ["python_docx_local"], "pptx": ["python_pptx_local"] }
    },
    "credentials": {
      "mineru": {
        "api_key": "your-mineru-api-key",
        "api_url": "https://mineru.net/api/v4"
      },
      "openai": {
        "api_key": "your-openai-api-key",
        "base_url": "https://api.openai.com/v1"
      }
    }
  }
}

Capability	Provider	Requires
PDF full text	`pymupdf` (default)	`pip install pymupdf`
PDF full text	`mineru_cloud`	`pip install mineru-open-sdk` + API key
PDF full text	`mineru_local`	`pip install mineru-vl-utils[transformers]` + model
PDF TOC	`native_bookmarks` (default)	`pip install pymupdf`
PDF TOC	`gpt_pageindex`	`pip install tiktoken openai pyyaml` + API key
DOCX tables	`python_docx_local` (default)	`pip install python-docx`
PPTX tables	`python_pptx_local` (default)	`pip install python-pptx`

query({ "query": "how does authentication work" })

get({ "file": "#abc123" })

doc_toc({ "file": "#abc123" })
doc_read({ "file": "#abc123", "addresses": ["line:11-20", "line:31-49"] })

doc_toc({ "file": "papers/survey.pdf" })

doc_read({ "file": "papers/survey.pdf", "addresses": ["line:11-20", "line:45-60"] })

doc_grep({ "file": "papers/survey.pdf", "pattern": "attention mechanism" })

doc_query({ "file": "papers/survey.pdf", "query": "what evaluation metrics were used" })

doc_elements({ "file": "report.pdf", "element_types": ["table"], "query": "revenue" })

doc_toc("papers/survey.pdf")
  → sees section "3. Methodology" at line:45-80
doc_read("papers/survey.pdf", ["line:45-80"])
  → reads methodology section
doc_grep("papers/survey.pdf", "dataset")
  → finds mentions at line:62, line:78
doc_read("papers/survey.pdf", ["line:60-65", "line:76-80"])
  → reads specific paragraphs around dataset mentions
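The last step of the trace above turns `doc_grep` hits into small read windows. One possible way to compute those windows is sketched below: pad each hit with a few lines of context, clamp to the section bounds reported by `doc_toc`, and merge windows that overlap. The function name and the padding defaults are assumptions chosen to reproduce the example, not qmd behaviour.

```python
def addresses_around(hits, before=2, after=3, lo=1, hi=None):
    """Turn hit line numbers into merged "line:N-M" addresses, clamped to [lo, hi]."""
    ranges = []
    for line in sorted(hits):
        start = max(lo, line - before)
        end = line + after if hi is None else min(hi, line + after)
        if ranges and start <= ranges[-1][1] + 1:
            # Overlaps or touches the previous window: extend it instead of adding a new one.
            ranges[-1][1] = max(ranges[-1][1], end)
        else:
            ranges.append([start, end])
    return [f"line:{s}-{e}" for s, e in ranges]


# Hits at lines 62 and 78 inside the section spanning lines 45-80.
print(addresses_around([62, 78], lo=45, hi=80))  # ['line:60-65', 'line:76-80']
```

Merging keeps the `addresses` list short when several hits fall close together, which avoids reading the same lines twice.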

wiki_ingest({ "source": "mydocs/distributed-systems.md", "wiki_collection": "mywiki" })

doc_toc({ "file": "mydocs/distributed-systems.md" })
doc_read({ "file": "mydocs/distributed-systems.md", "addresses": ["line:11-30"] })

doc_write({
  "collection": "mywiki",
  "path": "concepts/cap-theorem.md",
  "content": "# CAP Theorem\n\n**Source:** [[sources/distributed-systems]]\n\n## Overview\n\nThe CAP theorem states that...\n\n## Connections\n- Related to [[concepts/consistency-models]]\n- See also [[concepts/consensus-algorithms]]",
  "title": "CAP Theorem",
  "source": "mydocs/distributed-systems.md"
})

wiki_lint({ "collection": "mywiki", "stale_days": 30 })

# Page Title

**Source:** [[sources/paper-name]]

## Key Points
- ...

## Connections
- Related to [[concepts/topic-a]]
- Extends [[concepts/topic-b]]
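When many pages are written, the `content` string for `doc_write` can be assembled from the template above instead of typed by hand. A minimal sketch, assuming wiki paths are given without the `.md` extension as in the examples; `render_wiki_page` is a hypothetical helper.

```python
def render_wiki_page(title, source, key_points, connections):
    """Render a wiki page in the template's layout.

    `source` and each connection target are wiki paths without the .md
    extension; `connections` is a list of (relation, target) pairs.
    """
    lines = [f"# {title}", "", f"**Source:** [[{source}]]", "", "## Key Points"]
    lines += [f"- {point}" for point in key_points]
    lines += ["", "## Connections"]
    lines += [f"- {relation} [[{target}]]" for relation, target in connections]
    return "\n".join(lines) + "\n"


page = render_wiki_page(
    "CAP Theorem",
    "sources/distributed-systems",
    ["First key point"],
    [("Related to", "concepts/consistency-models")],
)
print(page)
```

The returned string can be passed as the `content` argument of `doc_write`.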

wiki_lint({ "collection": "mywiki" })

wiki_log({ "since": "2025-01-01", "limit": 20 })

wiki_index({ "collection": "mywiki", "write": true })

`query` param	Type	Default	Description
`query`	string	—	Simple search (mutually exclusive with `searches`)
`searches`	array	—	Advanced: `[{type: "lex"
`intent`	string	—	Disambiguation context (steers ranking, not searched)
`collections`	string[]	all	Filter to specific collections
`limit`	number	10	Max results
`minScore`	number	0	Min relevance 0-1
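The constraints in the table above can be checked before a `query` call is sent. The sketch below is a client-side sanity check only: `validate_query_args` is a hypothetical helper, and the rule that at least one of `query` or `searches` must be present is an assumption (the table only states that they are mutually exclusive).

```python
def validate_query_args(args):
    """Check a `query` argument dict against the parameter table; return a list of problems."""
    problems = []
    has_query = "query" in args
    has_searches = "searches" in args
    if has_query and has_searches:
        problems.append("`query` and `searches` are mutually exclusive")
    if not has_query and not has_searches:
        # Assumption: one of the two is required.
        problems.append("provide `query` or `searches`")
    limit = args.get("limit", 10)
    if not isinstance(limit, int) or limit < 1:
        problems.append("`limit` must be a positive integer")
    min_score = args.get("minScore", 0)
    if not isinstance(min_score, (int, float)) or not 0 <= min_score <= 1:
        problems.append("`minScore` must be between 0 and 1")
    collections = args.get("collections")
    if collections is not None and not all(isinstance(c, str) for c in collections):
        problems.append("`collections` must be a list of strings")
    return problems


print(validate_query_args({"query": "how does authentication work", "limit": 5}))
```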

`multi_get` param	Type	Default	Description
`pattern`	string	—	Glob, comma-separated paths, or comma-separated globs
`maxLines`	number	—	Max lines per file
`maxBytes`	number	10240	Skip files larger than this
`lineNumbers`	boolean	false	Add line numbers

`doc_write` param	Type	Description
`collection`	string	Target collection name
`path`	string	Relative path (e.g. `"concepts/topic.md"`)
`content`	string	Full markdown content
`title`	string	Optional: document title
`source`	string	Optional: source path for provenance

`wiki_log` param	Type	Default	Description
`since`	string	—	ISO date filter (e.g. `"2025-01-01"`)
`operation`	string	—	Filter: `"ingest"`, `"update"`, `"lint"`, `"query"`, `"index"`
`limit`	number	20	Max entries
`format`	string	`"markdown"`	`"markdown"` or `"json"`

START
  │
  ├─ "What's indexed?" → status
  │
  ├─ "Find documents about X" → query
  │     Next: get (small docs) or doc_toc → doc_read (large docs)
  │
  ├─ "Get this specific file" → get (path or #docid)
  │     ⚠ For large docs: use doc_toc + doc_read instead
  │
  ├─ "Get several files" → multi_get (glob or comma-list)
  │
  ├─ "Read section of a large doc" → doc_toc → doc_read
  │
  ├─ "Find keyword in one doc" → doc_grep → doc_read
  │
  ├─ "Conceptual search in one doc" → doc_query → doc_read
  │
  ├─ "Extract tables/figures" → doc_elements
  │
  ├─ "What links to this page?" → doc_links
  │
  ├─ "Build wiki from source" → wiki_ingest → doc_read → doc_write
  │
  └─ "Check wiki health" → wiki_lint

Problem	Cause	Fix
"Document not found"	Wrong path or missing collection prefix	Check "Did you mean?" suggestions in error; use `status` to see collections
"No results found"	Query too specific or wrong collection	Try simpler keywords; omit `collections` to search all; check `status`
"No vector embeddings" warning	Embeddings not generated	Tell the user to run `qmd embed` (one-time, downloads ~2GB models)
`get` returns too much text	Document is large	Use `doc_toc` → `doc_read` for targeted sections
`doc_read` returns empty	No addresses provided or wrong format	Get addresses from `doc_toc`, `doc_grep`, or `doc_query` first
Slow first query (~5-15s)	LLM models loading	Normal at MCP startup; subsequent queries are fast. The CLI reloads the models on every invocation.
PDF/DOCX/PPTX not working	Missing Python dependencies	Follow Playbook 0 to check and install: `python3 -c "import pymupdf; import docx; import pptx"`, then `pip install pymupdf python-docx python-pptx`
Wiki page has broken links	Target page doesn't exist	Create the missing page with `doc_write`, or fix the `[[wikilink]]`
Stale wiki pages	Source document updated after wiki page written	Run `wiki_lint` to detect; re-read source with `doc_read` and update
`multi_get` returns no files	Pattern doesn't match any indexed files	Check exact collection names via `status`; try broader glob

qmd status                                # Index health
qmd query "question"                      # Hybrid search (recommended)
qmd search "keywords"                     # BM25 only (fast, no LLM)
qmd get "#abc123"                         # Get by docid
qmd get "docs/readme.md:100" -l 50        # Line slice
qmd multi-get "journals/2026-*.md" -l 40  # Glob batch
qmd multi-get "a.md, b.md, c.md"          # Comma-separated
qmd doc-toc "paper.pdf"                   # Document TOC
qmd doc-read "paper.pdf" "line:45-120"    # Read section
qmd doc-grep "report.md" "revenue"        # Search in document
qmd mcp                                   # MCP server (stdio)
qmd mcp --http --daemon                   # MCP server (HTTP, background)

# Install
npm install -g mineru-document-explorer

# Python dependencies for PDF/DOCX/PPTX (required for binary formats)
pip install pymupdf python-docx python-pptx

# Optional: MinerU Cloud for high-quality PDF (scanned docs, complex layouts)
pip install mineru-open-sdk
export MINERU_API_KEY="your-key"  # get from https://mineru.net

# Index documents
qmd collection add ~/notes --name notes
qmd collection add ~/papers --name papers --mask '**/*.{md,pdf,docx,pptx}'

# Verify
qmd status
qmd search "test query"  # instant, no model download

# Optional: enable semantic search (downloads ~2GB models on first run)
qmd embed

{ "mcpServers": { "qmd": { "command": "qmd", "args": ["mcp"] } } }

{ "mcpServers": { "qmd": { "url": "http://localhost:8181/mcp" } } }

qmd skill install              # install to current project
qmd skill install --global     # install globally

I want to...	Tool	Example
Search across all docs	`query`	`{ "query": "authentication flow" }`
Get a specific file	`get`	`{ "file": "#abc123" }` or `{ "file": "docs/readme.md" }`
Get multiple files	`multi_get`	`{ "pattern": "docs/*.md" }`
See document structure	`doc_toc`	`{ "file": "paper.pdf" }`
Read specific sections	`doc_read`	`{ "file": "paper.pdf", "addresses": ["page:3"] }`
Find keyword in a doc	`doc_grep`

Format	Meaning	Used by
`line:N` or `line:N-M`	Line or line range	Markdown
`page:N`	PDF page	PDF
`slide:N`	PPTX slide	PPTX
`section:N`	DOCX section	DOCX
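The address formats in the table above are regular enough to validate before calling `doc_read`. The sketch below parses an address into its kind and line or page bounds. It is conservative by assumption: ranges are accepted only for `line`, since the table lists a range form only there, and the tool itself may accept more. `Address` and `parse_address` are hypothetical names.

```python
import re
from typing import NamedTuple


class Address(NamedTuple):
    kind: str   # "line", "page", "slide", or "section"
    start: int
    end: int    # equal to start for single-item addresses


_ADDRESS = re.compile(r"^(line|page|slide|section):(\d+)(?:-(\d+))?$")


def parse_address(text):
    """Parse an address such as "line:11-20" or "page:3"; raise ValueError if malformed."""
    match = _ADDRESS.match(text)
    if match is None:
        raise ValueError(f"not a valid address: {text!r}")
    kind, start, end = match.group(1), int(match.group(2)), match.group(3)
    if end is not None and kind != "line":
        # Assumption: the table shows the N-M range form only for line addresses.
        raise ValueError(f"ranges are only listed for line addresses: {text!r}")
    end = start if end is None else int(end)
    if end < start:
        raise ValueError(f"range end before start: {text!r}")
    return Address(kind, start, end)


print(parse_address("line:11-20"))
print(parse_address("page:3"))
```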

Package	Format	What it does
`pymupdf`	PDF	Text extraction, bookmarks, page-level reading
`python-docx`	DOCX	Section extraction, table extraction
`python-pptx`	PPTX	Slide text, table extraction

Variable	Purpose
`MINERU_API_KEY`	MinerU Cloud PDF (auto-enables `mineru_cloud` provider)
`OPENAI_API_KEY`	GPT PageIndex (LLM-inferred TOC for PDFs)
`OPENAI_BASE_URL`	Custom OpenAI-compatible endpoint

MinerU Document Explorer

Quick Reference

Agent Principles

Key Concepts

Collections and File Paths

Document IDs (docid)

Addresses

Three Tool Groups (15 tools)

Playbook 0: First-Run Setup & Configuration

Step 1 — Check qmd is installed

Step 2 — Check Python for binary document support

Step 3 — Check and install Python packages

Step 4 — Ask about advanced PDF processing (optional)

Step 5 — Index documents and verify

Step 6 — Configure MCP server (for AI agent integration)

Configuration Reference

Playbook 1: Search & Answer a Question

Playbook 2: Deep-Read a Large Document

Playbook 3: Build Wiki from Sources

Playbook 4: Maintain Wiki Health

Tool Reference

Retrieval Tools

Deep Reading Tools

Knowledge Ingestion Tools

Decision Tree

Troubleshooting

CLI Reference (when MCP is not available)

Setup

MCP Configuration

Skill Installation

Sub-query type	Method	Best for
`lex`	BM25 keywords	Exact terms, names, `"quoted phrases"`, `-negation`
`vec`	Vector semantic	Natural language questions
`hyde`	Hypothetical answer	Write 50-100 words resembling the answer

`get` param	Type	Default	Description
`file`	string	—	Path, docid (#abc123), or path:line
`fromLine`	number	—	Start line (1-indexed)
`maxLines`	number	—	Max lines to return
`lineNumbers`	boolean	false	Add line numbers

`doc_read` param	Type	Default	Description
`file`	string	—	File path or docid
`addresses`	string[]	—	Addresses from doc_toc / doc_grep / doc_query
`max_tokens`	number	2000	Max tokens per section

`wiki_ingest` param	Type	Default	Description
`source`	string	—	Source file path or docid
`wiki_collection`	string	auto	Target wiki collection
`force`	boolean	false	Force re-ingest even if unchanged

`wiki_lint` param	Type	Default	Description
`collection`	string	—	Optional: limit to collection
`stale_days`	number	30	Days threshold for staleness

`wiki_index` param	Type	Default	Description
`collection`	string	—	Wiki collection to index
`write`	boolean	false	Write index.md to disk