Skill for extracting and summarising a single academic document (PDF, arxiv paper, blog post) into a structured Markdown file with exact quotes, paraphraseable claims, BibTeX, and metadata. This is a subagent skill — called by the summarizer-agent orchestrator for each document in a batch. Use whenever a single research document needs to be converted into a knowledge base entry.
You receive one document and produce one structured .md summary file.
You will be given:
research-kb/sources/[name].md
# Primary: layout-preserving extraction (handles two-column)
pdftotext -layout document.pdf /tmp/extracted.txt
# Check quality
head -50 /tmp/extracted.txt
# If garbled, fallback:
python3 -c "
from pypdf import PdfReader
r = PdfReader('document.pdf')
for i, p in enumerate(r.pages):
    print(f'--- PAGE {i+1} ---')
    print(p.extract_text())
"
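The quality check above is manual (`head -50`). A minimal Python sketch of automating the same fallback decision is given here. The `looks_garbled` heuristic, its 0.7 threshold, and all function names are assumptions for illustration and are not part of the skill. The heuristic treats mostly non-ASCII text as garbled, so it would need adjusting for non-English documents.

```python
import shutil
import subprocess
from pathlib import Path


def looks_garbled(text: str, min_ratio: float = 0.7) -> bool:
    """Assumed heuristic: text is garbled if it is empty or if fewer than
    `min_ratio` of its non-space characters are ASCII letters, digits,
    or common punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True
    punctuation = ".,;:!?()[]{}'\"-%/=+*<>"
    ok = sum(c.isascii() and (c.isalnum() or c in punctuation) for c in chars)
    return ok / len(chars) < min_ratio


def extract_with_pdftotext(pdf: Path) -> str:
    """Run `pdftotext -layout`, writing to stdout ('-') instead of a temp file."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def extract_with_pypdf(pdf: Path) -> str:
    """Fallback: page-by-page extraction with pypdf, with page markers."""
    from pypdf import PdfReader  # imported lazily; only needed for the fallback

    reader = PdfReader(str(pdf))
    parts = []
    for i, page in enumerate(reader.pages):
        parts.append(f"--- PAGE {i + 1} ---")
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


def extract_text(pdf: Path) -> str:
    """Try pdftotext first; fall back to pypdf if it is missing, fails, or looks garbled."""
    if shutil.which("pdftotext"):
        try:
            text = extract_with_pdftotext(pdf)
            if not looks_garbled(text):
                return text
        except subprocess.CalledProcessError:
            pass  # fall through to the pypdf fallback
    return extract_with_pypdf(pdf)
```

Calling `extract_text(Path("document.pdf"))` would return the layout-preserving text when it looks usable and the page-marked pypdf text otherwise.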
arxiv.org/abs/ page (cleaner than PDF extraction)
web_fetch to retrieve the page
Write the output file following this EXACT structure:
# [Full Title]
## Metadata
- **Authors**: [Full author list, comma-separated]
- **Year**: [YYYY]
- **Source**: [Journal / Conference / Blog name / "arxiv preprint"]
- **URL/DOI**: [Primary link]
- **BibTeX Key**: [firstauthorYYYY_shorttitle]
## BibTeX Entry
\```bibtex
@article{firstauthorYYYY_shorttitle,
  author = {Last, First and Last, First and ...},
  title = {Full Title Here},
  year = {YYYY},
  journal = {Venue},
  url = {https://...},
  eprint = {XXXX.XXXXX}, % arxiv only
  archivePrefix = {arXiv}, % arxiv only
  primaryClass = {cs.XX} % arxiv only
}
\```
Use `@inproceedings` for conferences, `@article` for journals, `@misc` for arxiv preprints and blog posts.
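As one way to apply the key pattern and the entry-type rule consistently, here is a minimal Python sketch. The helper names, the stop-word list, and the choice of the first non-stop-word of the title as `shorttitle` are assumptions; the skill itself only specifies the `firstauthorYYYY_shorttitle` shape.

```python
import re
import unicodedata

# Assumed stop-word list; words here are skipped when picking the short title.
STOP_WORDS = {"a", "an", "the", "on", "of", "for", "in", "to", "and", "with", "is", "are"}


def _slug(text: str) -> str:
    """Lowercase, strip accents, and keep only ASCII letters and digits."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_text.lower())


def make_bibtex_key(first_author_last_name: str, year: int, title: str) -> str:
    """Build a key shaped like firstauthorYYYY_shorttitle."""
    words = [_slug(w) for w in re.split(r"[\s\-:]+", title)]
    short = next((w for w in words if w and w not in STOP_WORDS), "untitled")
    return f"{_slug(first_author_last_name)}{year}_{short}"


def entry_type(source_kind: str) -> str:
    """Map a source kind to the BibTeX entry type named in the rule above."""
    mapping = {
        "conference": "@inproceedings",
        "journal": "@article",
        "arxiv": "@misc",
        "blog": "@misc",
    }
    return mapping[source_kind]


print(make_bibtex_key("Vaswani", 2017, "Attention Is All You Need"))  # vaswani2017_attention
print(entry_type("arxiv"))  # @misc
```

Stripping accents keeps keys ASCII-only, which avoids encoding problems in older BibTeX toolchains.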
## Abstract / TL;DR
[2-4 sentences: what the paper does, why it matters, main result]
## Key Ideas
1. **[Idea name]**: [1-2 sentence explanation]
2. **[Idea name]**: [1-2 sentence explanation]
...
[3-8 ideas]
## Methodology
[1-2 paragraphs: approach, architecture, experimental setup. Focus on what distinguishes this work.]
## Key Results
- [Result with specific numbers: "achieves 85.3 BLEU on WMT14 EN-DE"]
- [Result 2]
...
## Notable Quotes & Citable Passages
Extract 5-15 of the most important exact quotes. These are used downstream for citation matching and similarity scoring, so ACCURACY IS CRITICAL.
> "[EXACT quote from the paper, word for word]"
> — Section [X], Page [Y]
> "[Another exact quote]"
> — Section [X], Page [Y]
Selection criteria for quotes:
- Core claims and contributions
- Definitions of key terms
- Results statements with numbers
- Methodological descriptions
- Limitations acknowledged by authors
- Comparisons to prior work
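Because downstream matching depends on verbatim quotes, it may help to check each candidate quote against the extracted text before writing it out. The following is a minimal sketch, assuming that whitespace and line-break hyphenation are the only differences worth tolerating. `normalize` and `is_verbatim` are hypothetical helper names, and the sample text is invented for the example.

```python
import re


def normalize(text: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    # "architec-\nture" -> "architecture"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(quote: str, source_text: str) -> bool:
    """True if the quote appears word for word in the source after normalization."""
    return normalize(quote) in normalize(source_text)


# Invented sample text standing in for extracted PDF output.
sample = "Our method uses a simple architec-\nture that   relies only on\nsparse attention."

print(is_verbatim("a simple architecture that relies only on sparse attention", sample))  # True
print(is_verbatim("relies entirely on sparse attention", sample))  # False
```

One known limitation of this sketch: a genuinely hyphenated compound split across a line break (for example `self-` / `attention`) is rejoined without its hyphen, so such a quote would be reported as not found. That errs on the side of flagging a quote for manual review.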
## Paraphraseable Claims
5-20 key claims rewritten in neutral academic language. Each includes the original for similarity comparison.
1. **Claim**: [Your neutral paraphrase of the claim]
**Original**: "[Exact text from paper]" — Section [X], Page [Y]
**Context**: [When you'd cite this — e.g., "when discussing attention mechanisms"]
**Key terms**: [distinctive terms for search matching]
2. **Claim**: [Paraphrase]
**Original**: "[Exact text]" — Section [X], Page [Y]
**Context**: [Usage context]
**Key terms**: [terms]
## Definitions & Terminology
- **[Term]**: [Definition as used in this paper] — Section [X]
...
## Connections
- **Builds on**: [Key prior work this paper extends]
- **Compared against**: [Baselines and competitors]
- **Related to**: [Other papers on same topic, if known]
- **Extended by**: [Notable follow-up work, if known]
## Tags
`[tag1]` `[tag2]` `[tag3]` ...
Use lowercase, hyphenated tags: `transformer`, `self-attention`, `language-model`, `knowledge-graph`, `retrieval-augmented-generation`, `fine-tuning`, `benchmark`, `computer-vision`, `graph-neural-network`, etc.
Prefer specific tags: `retrieval-augmented-generation`, not just `nlp`. Aim for 4-8 tags per paper.
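The tag conventions can be applied mechanically. The following is a minimal Python sketch with hypothetical helper names (`normalize_tag`, `dedupe_tags`, `format_tag_line`); treating any run of non-alphanumeric characters as a hyphen is an assumption.

```python
import re


def normalize_tag(raw: str) -> str:
    """Lowercase and hyphenate, e.g. 'Graph Neural_Network' -> 'graph-neural-network'."""
    tag = re.sub(r"[^a-z0-9]+", "-", raw.strip().lower())
    return tag.strip("-")


def dedupe_tags(raw_tags: list[str]) -> list[str]:
    """Normalize tags and drop empties and duplicates, preserving order."""
    tags: list[str] = []
    for raw in raw_tags:
        tag = normalize_tag(raw)
        if tag and tag not in tags:
            tags.append(tag)
    return tags


def format_tag_line(tags: list[str]) -> str:
    """Render tags as space-separated inline code spans for the Tags section."""
    return " ".join(f"`{t}`" for t in tags)


tags = dedupe_tags(["Transformer", "Self Attention", "self-attention", "Language Model", "benchmark"])
print(format_tag_line(tags))  # `transformer` `self-attention` `language-model` `benchmark`
print(4 <= len(tags) <= 8)    # True: within the 4-8 target
```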