Read, summarize, search, and extract structured artifacts from PDF files, especially papers and reports. Use when the user wants a PDF converted before reading, needs figures or tables pulled out, wants image labels or captions, or needs cleaner Markdown/text than reading the PDF directly.
Use this skill for PDF-first work on papers and reports. The default path is Docling-first.
--fast only when the user explicitly prefers speed over fidelity.pdftotext -layout and continue with a degraded-quality warning.Run the extractor on a PDF:
python3 scripts/pdf_extract.py /path/to/paper.pdf
Extract figures with structural labels:
python3 scripts/pdf_extract.py /path/to/paper.pdf --extract-figures
Extract figures with richer descriptions instead of classifier labels:
python3 scripts/pdf_extract.py /path/to/paper.pdf --extract-figures --figure-label-mode description
Extract tables, agent-facing JSON, LaTeX, cleaned Markdown, CSV, and crop images:
python3 scripts/pdf_extract.py /path/to/paper.pdf --extract-tables
Extract both figures and tables:
python3 scripts/pdf_extract.py /path/to/paper.pdf --extract-figures --extract-tables
Use the faster MarkItDown path only when requested:
python3 scripts/pdf_extract.py /path/to/paper.pdf --fast
<basename>.docling.md: primary reading artifact from Docling.<basename>.docling.md now get a compact YAML artifact block immediately after the table with links to the extracted table sidecars.<basename>.layout.txt: fallback text artifact when Docling fails.<basename>.docling_artifacts/: extracted figure images referenced by Markdown.<basename>.figures.json: figure metadata, labels, captions, and image paths.<basename>.tables/: raw and cleaned table Markdown, CSV exports, and crop images.<basename>.tables/<table_nnn>.raw_docling.json: full Docling table object dump.<basename>.tables/<table_nnn>.agent.json: cell/span AST with provenance, regression semantics, and embedded LaTeX.<basename>.tables/<table_nnn>.html: Docling HTML rendering for the table.<basename>.tables/<table_nnn>.otsl.txt: Docling OTSL table structure dump.<basename>.tables/<table_nnn>.tex: synthesized canonical LaTeX table output.<basename>.tables.json: table metadata, paths, and verification flags.The script prints a JSON summary with the exact artifact paths to read next.
Maintainer benchmark fixtures and runners live under tests/pdf-reading/ rather than inside the installable skill.