Extract any PDF into structured Markdown + extracted PNG images with a YAML manifest. Use this skill whenever the user asks to extract, convert, digitize, or OCR a PDF document into markdown, or when they want to pull images/diagrams out of a PDF. Also use when the user says "convert this PDF", "extract the figures from this PDF", "make this PDF machine-readable", or similar requests involving PDF-to-text/image conversion. Do NOT use for ZAP Gymiprüfung exams — use gymiprufung-extract for those instead.
Convert any PDF document into structured Markdown with extracted diagram/figure images (PNG) and a YAML manifest linking everything together.
Required CLI tools (check before starting):
pdftoppm -v # from poppler-utils
convert -version # from imagemagick
If missing, tell the user to install: sudo apt install poppler-utils imagemagick
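The checks above can be wrapped in a small pre-flight sketch. The tool and package names come straight from the commands listed above; `check_tools` itself is a hypothetical helper:

```shell
#!/bin/sh
# Pre-flight: verify required CLI tools are on PATH.
# check_tools TOOL...  -> returns 0 if all present, 1 otherwise
check_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "Missing tools:$missing" >&2
    echo "Install with: sudo apt install poppler-utils imagemagick" >&2
    return 1
  fi
  return 0
}

# Non-fatal check so the sketch runs anywhere; abort here in real use.
check_tools pdftoppm convert || echo "Cannot extract until tools are installed." >&2
```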
The single biggest source of errors in PDF extraction is assuming page numbers match content structure. PDFs have cover pages, table-of-contents pages, multi-section pages, and appendices that shift everything.
You MUST build a page map before extracting anything. Read every page of the PDF to understand what content is on each page. This takes a few minutes but prevents hours of debugging wrong extractions.
Determine from the user's request or from $ARGUMENTS:
- The path of the PDF to extract
- The output directory. If the user hasn't specified one, create one next to the PDF (e.g., document_name/ alongside document_name.pdf).
Read the PDF page by page using the Read tool (it handles PDFs natively). For large PDFs, read in batches of 15-20 pages.
Read the PDF: pages "1-15"
Verify the actual page count. Don't trust content like slide footers ("1 of
8") or table-of-contents page ranges — these can be wrong. Read the last few
pages to confirm where the document actually ends. If pdftoppm fails on a page,
the PDF doesn't have that page.
For each page, record: the page number, which section or content it holds, and any figures or tables it contains.
Write the page map down before proceeding. Example:
Page 1: Title page
Page 2: Table of Contents
Page 3: Section 1 introduction (text only)
Page 4: Section 1 continued (FIGURE: bar chart of revenue)
Page 5: Section 2 (text + TABLE: quarterly results)
Page 6: Section 2 continued (FIGURE: process flow diagram)
Page 7: Appendix A (text only)
output_dir/
├── content.md # Full document content in Markdown
├── manifest.yaml # Structured metadata
├── extract_images.sh # Reproducible extraction script
└── img/ # Extracted figure PNGs
For longer documents with clear sections, you may split into multiple markdown
files (e.g., section_1.md, section_2.md) — use your judgment based on
document length and structure.
Read each page and convert to Markdown. General rules:
- Headings (`#`, `##`, `###`) matching the document's hierarchy
- Math as `$inline$` and `$$display$$`
- `---` between major sections if helpful
- Footnotes as `[^1]` or inline notes

Preserve the document's logical structure. Don't just dump raw text — organize it so a reader (human or AI) can navigate the content.
Math-heavy documents (academic papers, textbooks):
- Use `$inline$` for inline and `$$display$$` for display equations
- Use `\tag{N}` to match the source document's equation numbering
- Preserve special notation: `\Gamma`, `\beta`, `{}_2F_1`, etc.
- If the document has no figures, record `images: []` in the manifest

Only extract pages that contain actual figures, diagrams, charts, or images. Skip decorative elements, logos, and backgrounds unless the user specifically wants them.
The extraction process:
Extract full page as PNG at 300 dpi:
pdftoppm -png -r 300 -f {PAGE} -l {PAGE} "$PDF" temp{PAGE}
Use a unique prefix per page (e.g., temp2, temp3, temp6) instead of
a shared temp prefix. This prevents filename collisions if you extract
multiple pages before cleaning up, and makes the ls + convert commands
unambiguous.
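A sketch of this loop, using the two figure pages (4 and 6) from the example page map. The PDF path is hypothetical, and `DRY_RUN=1` prints each command instead of running `pdftoppm`, so the sketch is inspectable even before the tools are installed:

```shell
#!/bin/sh
PDF="document.pdf"   # hypothetical input path

# Build (and optionally run) the extraction command for one page.
# A unique temp$page prefix per page avoids filename collisions.
extract_page() {
  page=$1
  cmd="pdftoppm -png -r 300 -f $page -l $page \"$PDF\" temp$page"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"       # dry run: show the command
  else
    eval "$cmd"       # real run: produces temp${page}-*.png
  fi
}

DRY_RUN=1
for page in 4 6; do   # figure pages from the page map
  extract_page "$page"
done
```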
pdftoppm page padding: The output filename uses zero-padded page numbers, but the padding width depends on the total page count of the PDF:
- 1-9 pages: `temp-1.png`, `temp-2.png` (no padding)
- 10-99 pages: `temp-02.png`, `temp-14.png` (2-digit)
- 100-999 pages: `temp-002.png`, `temp-014.png` (3-digit)

Always check with `ls temp*.png` after extraction to see the actual filename
before writing the convert command. This is a common source of bugs — don't
guess the padding, verify it.
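Since the padding tracks the digit count of the PDF's total page count (as described above), a small helper can predict the filename before you `ls` to confirm. `ppm_name` is a hypothetical helper, not part of poppler, so still verify against the actual files:

```shell
#!/bin/sh
# Predict the filename pdftoppm will write, given the prefix, the page
# number, and the PDF's total page count (padding = digits in the total).
ppm_name() {  # ppm_name PREFIX PAGE TOTAL_PAGES
  prefix=$1; page=$2; total=$3
  width=${#total}                       # digits in the last page number
  printf "%s-%0${width}d.png\n" "$prefix" "$page"
}

ppm_name temp4 4 8     # 8-page PDF, no padding -> temp4-4.png
ppm_name temp4 4 14    # 14-page PDF, 2-digit   -> temp4-04.png
ppm_name temp4 4 250   # 250-page PDF, 3-digit  -> temp4-004.png
```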
Determine crop coordinates. Read the extracted full-page PNG to see exactly what's on it, then crop to the figure area:
convert temp-{PAGE_PADDED}.png -crop {W}x{H}+{X}+{Y} +repage output.png
At 300 dpi, common page sizes are:
- US Letter (8.5 × 11 in): 2550 × 3300 px
- A4 (210 × 297 mm): 2480 × 3508 px
Use this to estimate where the figure sits on the page. Always verify by reading the full-page PNG first.
Verify the crop. Read the cropped image to confirm:
If the crop cuts off content, make it LARGER, not smaller. Start generous (e.g., full width, generous height) and tighten later. A too-large crop with some whitespace is far better than a too-tight crop missing labels or legends.
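One way to produce the `{W}x{H}+{X}+{Y}` geometry is to work in inches from the page's top-left corner and convert at the target dpi. `crop_geom` is a hypothetical helper, not an ImageMagick feature:

```shell
#!/bin/sh
# crop_geom W_IN H_IN X_IN Y_IN DPI -> ImageMagick pixel geometry
crop_geom() {
  awk -v w="$1" -v h="$2" -v x="$3" -v y="$4" -v dpi="$5" \
    'BEGIN { printf "%dx%d+%d+%d\n", w*dpi, h*dpi, x*dpi, y*dpi }'
}

# Generous first pass: full US-letter width, 4in tall, starting 3in down.
crop_geom 8.5 4 0 3 300   # -> 2550x1200+0+900
# Then: convert temp-4.png -crop "$(crop_geom 8.5 4 0 3 300)" +repage out.png
```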
Naming convention for extracted images:
Examples: `fig1_revenue_chart.png`, `diagram_process_flow.png`

Use this three-part format to make images useful to both humans and AI:
<!-- image-ref: {image_id} | see manifest.yaml for full metadata -->
![{short description}](img/{image_filename}.png)

> **Figure {N}** | Type: {chart/diagram/photo/illustration} | Source: PDF p. {page}
>
> **Content:** {What the figure shows — one or two sentences}
>
> **Key elements:** {List the important visual elements}
>
> **Labels/annotations:** {Any text, numbers, or labels visible in the figure}
The blockquote captures everything an AI agent needs to understand the image without seeing the pixels. Be thorough — list every labeled element and important visual detail.
Create manifest.yaml with structured metadata:
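A minimal sketch of the manifest layout, reusing the figures from the example page map. The field names here are a suggested convention, not a fixed schema; adapt them to the document at hand:

```yaml
source_pdf: document_name.pdf
page_count: 7
content_files:
  - content.md
images:
  - id: fig1_revenue_chart
    file: img/fig1_revenue_chart.png
    source_page: 4
    type: chart
    description: Bar chart of revenue
  - id: fig2_process_flow
    file: img/fig2_process_flow.png
    source_page: 6
    type: diagram
    description: Process flow diagram
```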