Deep extraction pipeline for content research. Routes content types to the right
extraction tools — web articles, scholarly papers, PDFs, ebooks, code repos. Use when
asked to "research this topic", "extract from this source", "summarize this paper",
or "deep dive into this article". Maximizes signal density from any source format.
## Tool Inventory

### Discovery (find sources)

| Tool | What It Does | When To Use |
|---|---|---|
| `mcp__exa__web_search_exa` | Neural web search, content extraction | Blog posts, articles, general web research |
| `mcp__exa__get_code_context_exa` | Code docs, API references, library examples | Technical documentation, SDK references |
| `mcp__paper-search__search_arxiv` | Search arXiv by query | Academic papers — CS, ML, AI, math, physics |
| `mcp__paper-search__search_semantic` | Semantic Scholar search | Academic papers — broader coverage, citation data |
| `mcp__paper-search__search_crossref` | CrossRef search | DOI resolution, journal articles |
| `mcp__paper-search__search_pubmed` | PubMed search | Biomedical, health data, clinical research |
| `mcp__openalex__search_works` | OpenAlex (240M+ works) | Broad scholarly search, comprehensive coverage |
| `mcp__openalex__search_authors` | Author search and profiles | Finding an author's full body of work |
| `mcp__openalex__get_trending_topics` | Trending research topics | Landscape mapping, trend detection |
### Extraction (get content from sources)

| Tool | What It Does | When To Use |
|---|---|---|
| `mcp__arxiv-latex__get_paper_prompt` | Fetch LaTeX source from arXiv | Always use for arXiv papers — lossless math, tables, structure |
| `mcp__paper-search__read_arxiv` | Read arXiv paper content | Fallback if LaTeX source unavailable |
| `mcp__paper-search__download_arxiv` | Download arXiv PDF | When you need the PDF file itself |
| `summarize` CLI | Extract web page content | Blog posts, Substack, Medium articles |
| `pandoc` CLI | EPUB-to-markdown conversion | Best for EPUBs — direct, no dependencies, preserves structure |
| `pdftotext -layout` | PDF-to-text conversion | Fast fallback for PDFs when Marker stalls |
| `marker_single` CLI | PDF-to-markdown (ML-based) | High-fidelity PDFs with complex layout — but first run downloads ~1GB of models |
| Read tool (built-in) | Read PDF files directly | Quick look at short PDFs (<10 pages) |
| WebFetch | Fetch URL content | Direct URL access when `summarize` isn't needed |
### Citation & Network (map the landscape)

| Tool | What It Does | When To Use |
|---|---|---|
| `mcp__openalex__get_work_citations` | Papers that cite a given work | "Who built on this paper?" |
| `mcp__openalex__get_work_references` | Papers a given work cites | "What does this paper build on?" |
| `mcp__openalex__get_citation_network` | Full citation graph | Mapping a research area |
| `mcp__openalex__get_related_works` | Related works by topic | Finding adjacent research |
| `mcp__openalex__get_author` | Author profile + works | Deep dive on a specific researcher |
## Routing Decision Tree

```
Source type?
│
├─ arXiv paper (has arXiv ID like 2512.24601)
│   ├─ Get LaTeX source: mcp__arxiv-latex__get_paper_prompt
│   └─ Get citation network: mcp__openalex__get_work (by DOI) → get_work_citations
│
├─ Academic paper (non-arXiv, has DOI)
│   ├─ Find it: mcp__paper-search__search_semantic OR mcp__openalex__search_works
│   ├─ If PDF available: marker_single paper.pdf --output_dir /tmp/marker_out --output_format markdown
│   └─ Citation network: mcp__openalex__get_work → get_work_citations / get_work_references
│
├─ EPUB (ebook)
│   ├─ Convert: pandoc book.epub -t markdown --wrap=none -o output.md
│   ├─ If EPUB is an expanded directory: zip it first (zip -X0 out.epub mimetype && zip -Xr out.epub *)
│   └─ Note: pandoc is fast, dependency-free, and handles EPUB→md natively
│
├─ PDF (ebook, report, vendor doc)
│   ├─ Fast path: pdftotext -layout file.pdf output.md (always works, rough formatting)
│   ├─ High quality: marker_single file.pdf --output_dir /tmp/marker_out --output_format markdown
│   ├─ For complex tables/math: add --use_llm (requires GOOGLE_API_KEY or --claude_api_key)
│   ├─ ⚠️ First run: Marker downloads ~1GB of models — may appear to hang
│   └─ Read output: Read /tmp/marker_out/<filename>/<filename>.md
│
├─ Web article (Substack, Medium, blog)
│   ├─ Extract: summarize "<url>" --extract-only --json
│   └─ If paywall/JS-heavy: use Exa (has content extraction built in)
│
├─ GitHub repo
│   ├─ Code context: mcp__exa__get_code_context_exa
│   └─ Deep analysis: clone + scout/oracle agent with specific extraction brief
│
├─ Research landscape question ("what's the state of X?")
│   ├─ Academic: mcp__openalex__search_works + get_trending_topics
│   ├─ Industry: mcp__exa__web_search_exa
│   └─ Both: parallel agents — one academic, one industry
│
└─ "What cites this?" / "What does this build on?"
    └─ mcp__openalex__get_work_citations / get_work_references
```
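The expanded-EPUB branch above ends with a two-step `zip` invocation. The same invariant (the `mimetype` entry must come first in the archive and be stored uncompressed) can be sketched with Python's stdlib; this is a minimal sketch with hypothetical paths, equivalent in spirit to the `zip` commands, not a replacement for them:

```python
import zipfile
from pathlib import Path

def rezip_epub(src_dir: str, out_epub: str) -> None:
    """Re-pack an expanded EPUB directory into a valid .epub.
    The EPUB container format requires 'mimetype' to be the first
    archive entry, stored without compression (the same invariant
    as `zip -X0 out.epub mimetype && zip -Xr out.epub *`)."""
    src = Path(src_dir)
    out = Path(out_epub).resolve()
    with zipfile.ZipFile(out_epub, "w") as zf:
        # mimetype must be the first entry, with no compression
        zf.write(src / "mimetype", "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src).as_posix()
            # skip the mimetype (already written) and the output file itself
            if path.is_file() and rel != "mimetype" and path.resolve() != out:
                zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)
```

After re-zipping, the result can go straight into the `pandoc` conversion step.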
## Extraction Brief Templates

Quality of extraction depends on quality of the brief. Never say "summarize." Always specify what to extract.
### For Academic Papers

```
Read this paper. Extract:
1. Core claim (one sentence)
2. Methodology — what did they actually do?
3. Key findings — specific numbers, not descriptions
4. Limitations the authors acknowledge
5. Limitations they DON'T acknowledge
6. How this connects to [specific thesis/question]
7. Citation count and year (for recency weighting)
```
### For Industry Articles / Blog Posts

```
Read this article. Extract:
1. Core claim (one sentence)
2. Evidence quality — is this opinion, anecdote, or data?
3. Author's incentive structure — are they selling something?
4. Specific metrics or data points cited (with their sources)
5. How this connects to [specific thesis/question]
6. What's genuinely new vs. repackaged conventional wisdom?
```
### For Landscape Mapping (5+ sources)

```
Broad sweep first — for each source, extract in 3-5 lines:
1. Author and affiliation
2. Core claim
3. Evidence type (data/opinion/case study)
4. Cluster assignment (which theme does this belong to?)

Then identify: which clusters have the most signal? Which sources
disagree? Where are the gaps — questions nobody is asking?
```
### For Ebooks / Long PDFs

```
After converting with marker_single, extract:
1. Table of contents with chapter summaries (1-2 lines each)
2. Key frameworks or models introduced
3. Practitioner advice that's specific enough to act on
4. Claims that can be verified against other sources
5. Which chapters are most relevant to [specific question]?
```
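One way to keep these briefs reusable is to store them as parameterized strings, so an agent never receives a bare "summarize." This is a sketch; the dict keys and the `{focus}` placeholder are illustrative, not part of the pipeline:

```python
# Brief templates with a {focus} slot for the [specific thesis/question].
BRIEFS = {
    "academic": (
        "Read this paper. Extract:\n"
        "1. Core claim (one sentence)\n"
        "2. Methodology — what did they actually do?\n"
        "3. Key findings — specific numbers, not descriptions\n"
        "4. Limitations the authors acknowledge\n"
        "5. Limitations they DON'T acknowledge\n"
        "6. How this connects to {focus}\n"
        "7. Citation count and year (for recency weighting)\n"
    ),
    "industry": (
        "Read this article. Extract:\n"
        "1. Core claim (one sentence)\n"
        "2. Evidence quality — is this opinion, anecdote, or data?\n"
        "3. Author's incentive structure — are they selling something?\n"
        "4. Specific metrics or data points cited (with their sources)\n"
        "5. How this connects to {focus}\n"
        "6. What's genuinely new vs. repackaged conventional wisdom?\n"
    ),
}

def build_brief(kind: str, focus: str) -> str:
    """Fill the thesis/question slot before dispatching an extraction agent."""
    return BRIEFS[kind].format(focus=focus)
```

Filling the slot at dispatch time keeps the templates stable while the research focus changes per session.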
## Marker CLI Reference

```bash
# Basic conversion (no LLM, no API key needed)
marker_single /path/to/file.pdf \
  --output_dir /tmp/marker_out \
  --output_format markdown

# Maximum quality (uses Claude for complex tables/math)
marker_single /path/to/file.pdf \
  --output_dir /tmp/marker_out \
  --output_format markdown \
  --use_llm \
  --llm_service marker.services.claude.ClaudeService \
  --claude_api_key $ANTHROPIC_API_KEY

# Batch convert a directory
marker /path/to/pdfs/ \
  --output_dir /tmp/marker_out \
  --output_format markdown
```

Output location: `/tmp/marker_out/<filename>/<filename>.md`
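The output path convention is easy to get wrong when scripting around Marker. A tiny helper that encodes the `<out_dir>/<stem>/<stem>.md` layout documented above (a sketch of this skill's convention, not part of Marker's CLI):

```python
from pathlib import Path

def marker_output_path(pdf: str, out_dir: str = "/tmp/marker_out") -> Path:
    """Marker writes <out_dir>/<stem>/<stem>.md for each input file."""
    stem = Path(pdf).stem
    return Path(out_dir) / stem / f"{stem}.md"
```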
## Workflow: Scholarly Deep Read

Complete workflow for processing an academic paper:

1. **DISCOVER**
   - `mcp__paper-search__search_arxiv("context engineering semantic layer", max_results=10)`, or
   - `mcp__openalex__search_works("knowledge graph agent memory", per_page=10)`
2. **EVALUATE** (quick scan of abstracts/titles)
   - Which papers are worth deep reading?
   - Check: citation count, recency, author credibility, relevance to our thesis
3. **EXTRACT**
   - If arXiv: `mcp__arxiv-latex__get_paper_prompt(arxiv_id)`
   - If PDF: `marker_single paper.pdf --output_dir /tmp/marker_out --output_format markdown`
4. **READ** with an extraction brief
   - Oracle agent with a specific brief (see templates above), or
   - direct Read if the paper is short enough for context
5. **MAP CITATIONS**
   - `mcp__openalex__get_work_citations(work_id)` — who built on this?
   - `mcp__openalex__get_work_references(work_id)` — what foundation does this rest on?
6. **CONNECT**. How does this paper relate to:
   - our active beliefs (`beliefs.md`)?
   - our open questions (`reading-wants.md`)?
   - resources already in the library (`resources.yaml`)?
7. **STORE**
   - Update `beliefs.md` if evidence changes confidence
   - Update `reading-wants.md` if new threads emerge
   - Add to `resources.yaml` if it belongs in the library
## Workflow: Ebook Mining Pipeline

For extracting structured knowledge from books (EPUBs, PDFs):

1. **CONVERT**
   - EPUB → `pandoc book.epub -t markdown --wrap=none -o output.md`
   - PDF → `pdftotext -layout book.pdf output.md` (fast), or `marker_single book.pdf --output_dir /tmp/marker_out` (high quality, slow first run)
   - Expanded EPUB dir → zip first, then pandoc
2. **TRIAGE**
   - Check line counts (`wc -l *.md`) to gauge scope
   - Read the first 80 lines of each to see TOC/structure
   - Prioritize by relevance to the extraction goal
3. **PARALLEL EXTRACTION**
   - Dispatch scout agents (model: sonnet), one per book
   - Each agent gets a targeted extraction brief: what specific knowledge to extract (not "summarize"), how to organize the output (headers, bullet points), and what to flag (tensions, surprises, quotable insights)
4. **SYNTHESIS**
   - Opus reads all extraction outputs
   - Cross-references claims across books (where do authors agree? disagree?)
   - Produces final resource files organized by purpose, not by source
5. **INTEGRATION**
   - Wire resource files into the relevant skill
   - Update `reading-wants.md` and `beliefs.md` if findings shift understanding

Validated: 2026-02-25 — converted 6 books (45K lines) in parallel and dispatched 6 extraction agents simultaneously. pandoc for EPUBs, pdftotext for PDFs. Total conversion time: ~30 seconds.
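The TRIAGE step above can be sketched as a small helper that ranks converted books by size and grabs their opening lines (usually the TOC). A sketch assuming the converted books sit as `.md` files in one directory; the function name and return shape are mine:

```python
from pathlib import Path

def triage(md_dir: str, head_lines: int = 80) -> list[tuple[str, int, str]]:
    """Rank converted books for step 2 (TRIAGE).
    Returns (filename, line_count, head) tuples, largest first,
    where `head` is the first `head_lines` lines (typically the TOC),
    so an agent can prioritize before any deep reading."""
    results = []
    for md in Path(md_dir).glob("*.md"):
        lines = md.read_text(errors="ignore").splitlines()
        results.append((md.name, len(lines), "\n".join(lines[:head_lines])))
    return sorted(results, key=lambda r: -r[1])
```

The line counts mirror `wc -l *.md`; the head snippets replace reading each file manually.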
## Anti-Patterns

- **Don't OCR what has LaTeX source.** If it's on arXiv, use `arxiv-latex-mcp`. The source is right there.
- **Don't Read 200-page PDFs 20 pages at a time.** Convert with Marker first, then read the markdown.
- **Don't search Exa for academic papers.** Use paper-search-mcp or OpenAlex — they have structured metadata, citation counts, and DOIs.
- **Don't skip the citation graph.** A paper in isolation is half the picture. Who cites it tells you whether the field agreed.
- **Don't deep-read before triaging.** Broad sweep first (categorize into clusters), then go deep on the signal. Validated across 3+ sessions.
- **Don't trust abstracts.** An abstract is a sales pitch. The methodology section is where the truth lives.