Extract any PDF into structured Markdown + extracted PNG images with a YAML manifest. Use this skill whenever the user asks to extract, convert, digitize, or OCR a PDF document into markdown, or when they want to pull images/diagrams out of a PDF. Also use when the user says "convert this PDF", "extract the figures from this PDF", "make this PDF machine-readable", or similar requests involving PDF-to-text/image conversion. Do NOT use for ZAP Gymiprüfung exams — use gymiprufung-extract for those instead.
Convert any PDF document into structured Markdown with extracted diagram/figure images (PNG) and a YAML manifest linking everything together.
Required CLI tools (check before starting):
pdftoppm -v # from poppler-utils
convert -version # from imagemagick
If missing, tell the user to install: sudo apt install poppler-utils imagemagick
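The checks above can be wrapped in a small pre-flight sketch. The tool and package names come straight from the commands listed above; `check_tools` itself is a hypothetical helper:

```shell
#!/bin/sh
# Pre-flight: verify required CLI tools are on PATH.
# check_tools TOOL...  -> returns 0 if all present, 1 otherwise
check_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "Missing tools:$missing" >&2
    echo "Install with: sudo apt install poppler-utils imagemagick" >&2
    return 1
  fi
  return 0
}

# Non-fatal check so the sketch runs anywhere; abort here in real use.
check_tools pdftoppm convert || echo "Cannot extract until tools are installed." >&2
```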
The single biggest source of errors in PDF extraction is assuming page numbers match content structure. PDFs have cover pages, table-of-contents pages, multi-section pages, and appendices that shift everything.
You MUST build a page map before extracting anything. Read every page of the PDF to understand what content is on each page. This takes a few minutes but prevents hours of debugging wrong extractions.
Determine from the user's request or from $ARGUMENTS:
- The path of the PDF to extract
- The output directory. If the user hasn't specified one, create one next to the PDF (e.g., document_name/ alongside document_name.pdf).
Read the PDF page by page using the Read tool (it handles PDFs natively). For large PDFs, read in batches of 15-20 pages.
Read the PDF: pages "1-15"
Verify the actual page count. Don't trust content like slide footers ("1 of
8") or table-of-contents page ranges — these can be wrong. Read the last few
pages to confirm where the document actually ends. If pdftoppm fails on a page,
the PDF doesn't have that page.
For each page, record: the page number, which section or content it holds, and any figures or tables it contains.
Write the page map down before proceeding. Example:
Page 1: Title page
Page 2: Table of Contents
Page 3: Section 1 introduction (text only)
Page 4: Section 1 continued (FIGURE: bar chart of revenue)
Page 5: Section 2 (text + TABLE: quarterly results)
Page 6: Section 2 continued (FIGURE: process flow diagram)
Page 7: Appendix A (text only)
output_dir/
├── content.md # Full document content in Markdown
├── manifest.yaml # Structured metadata
├── extract_images.sh # Reproducible extraction script
└── img/ # Extracted figure PNGs
For longer documents with clear sections, you may split into multiple markdown
files (e.g., section_1.md, section_2.md) — use your judgment based on
document length and structure.
Read each page and convert to Markdown. General rules:
- Headings (`#`, `##`, `###`) matching the document's hierarchy
- Math as `$inline$` and `$$display$$`
- `---` between major sections if helpful
- Footnotes as `[^1]` or inline notes

Preserve the document's logical structure. Don't just dump raw text — organize it so a reader (human or AI) can navigate the content.
Math-heavy documents (academic papers, textbooks):
- Use `$inline$` for inline and `$$display$$` for display equations
- Use `\tag{N}` to match the source document's equation numbering
- Preserve special notation: `\Gamma`, `\beta`, `{}_2F_1`, etc.
- If the document has no figures, record `images: []` in the manifest

Only extract pages that contain actual figures, diagrams, charts, or images. Skip decorative elements, logos, and backgrounds unless the user specifically wants them.
The extraction process:
Extract full page as PNG at 300 dpi:
pdftoppm -png -r 300 -f {PAGE} -l {PAGE} "$PDF" temp{PAGE}
Use a unique prefix per page (e.g., temp2, temp3, temp6) instead of
a shared temp prefix. This prevents filename collisions if you extract
multiple pages before cleaning up, and makes the ls + convert commands
unambiguous.
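A sketch of this loop, using the two figure pages (4 and 6) from the example page map. The PDF path is hypothetical, and `DRY_RUN=1` prints each command instead of running `pdftoppm`, so the sketch is inspectable even before the tools are installed:

```shell
#!/bin/sh
PDF="document.pdf"   # hypothetical input path

# Build (and optionally run) the extraction command for one page.
# A unique temp$page prefix per page avoids filename collisions.
extract_page() {
  page=$1
  cmd="pdftoppm -png -r 300 -f $page -l $page \"$PDF\" temp$page"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"       # dry run: show the command
  else
    eval "$cmd"       # real run: produces temp${page}-*.png
  fi
}

DRY_RUN=1
for page in 4 6; do   # figure pages from the page map
  extract_page "$page"
done
```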
pdftoppm page padding: The output filename uses zero-padded page numbers, but the padding width depends on the total page count of the PDF:
- 1-9 pages: `temp-1.png`, `temp-2.png` (no padding)
- 10-99 pages: `temp-02.png`, `temp-14.png` (2-digit)
- 100-999 pages: `temp-002.png`, `temp-014.png` (3-digit)

Always check with `ls temp*.png` after extraction to see the actual filename
before writing the convert command. This is a common source of bugs — don't
guess the padding, verify it.
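Since the padding tracks the digit count of the PDF's total page count (as described above), a small helper can predict the filename before you `ls` to confirm. `ppm_name` is a hypothetical helper, not part of poppler, so still verify against the actual files:

```shell
#!/bin/sh
# Predict the filename pdftoppm will write, given the prefix, the page
# number, and the PDF's total page count (padding = digits in the total).
ppm_name() {  # ppm_name PREFIX PAGE TOTAL_PAGES
  prefix=$1; page=$2; total=$3
  width=${#total}                       # digits in the last page number
  printf "%s-%0${width}d.png\n" "$prefix" "$page"
}

ppm_name temp4 4 8     # 8-page PDF, no padding -> temp4-4.png
ppm_name temp4 4 14    # 14-page PDF, 2-digit   -> temp4-04.png
ppm_name temp4 4 250   # 250-page PDF, 3-digit  -> temp4-004.png
```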
Determine crop coordinates. Read the extracted full-page PNG to see exactly what's on it, then crop to the figure area:
convert temp-{PAGE_PADDED}.png -crop {W}x{H}+{X}+{Y} +repage output.png
At 300 dpi, common page sizes are:
- US Letter (8.5 × 11 in): 2550 × 3300 px
- A4 (210 × 297 mm): 2480 × 3508 px
Use this to estimate where the figure sits on the page. Always verify by reading the full-page PNG first.
Verify the crop. Read the cropped image to confirm:
If the crop cuts off content, make it LARGER, not smaller. Start generous (e.g., full width, generous height) and tighten later. A too-large crop with some whitespace is far better than a too-tight crop missing labels or legends.
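One way to produce the `{W}x{H}+{X}+{Y}` geometry is to work in inches from the page's top-left corner and convert at the target dpi. `crop_geom` is a hypothetical helper, not an ImageMagick feature:

```shell
#!/bin/sh
# crop_geom W_IN H_IN X_IN Y_IN DPI -> ImageMagick pixel geometry
crop_geom() {
  awk -v w="$1" -v h="$2" -v x="$3" -v y="$4" -v dpi="$5" \
    'BEGIN { printf "%dx%d+%d+%d\n", w*dpi, h*dpi, x*dpi, y*dpi }'
}

# Generous first pass: full US-letter width, 4in tall, starting 3in down.
crop_geom 8.5 4 0 3 300   # -> 2550x1200+0+900
# Then: convert temp-4.png -crop "$(crop_geom 8.5 4 0 3 300)" +repage out.png
```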
Naming convention for extracted images:
Examples: `fig1_revenue_chart.png`, `diagram_process_flow.png`

Use this three-part format to make images useful to both humans and AI:
<!-- image-ref: {image_id} | see manifest.yaml for full metadata -->
![{short description}](img/{image_filename}.png)

> **Figure {N}** | Type: {chart/diagram/photo/illustration} | Source: PDF p. {page}
>
> **Content:** {What the figure shows — one or two sentences}
>
> **Key elements:** {List the important visual elements}
>
> **Labels/annotations:** {Any text, numbers, or labels visible in the figure}
The blockquote captures everything an AI agent needs to understand the image without seeing the pixels. Be thorough — list every labeled element and important visual detail.
Create manifest.yaml with structured metadata:
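A minimal sketch of the manifest layout, reusing the figures from the example page map. The field names here are a suggested convention, not a fixed schema; adapt them to the document at hand:

```yaml
source_pdf: document_name.pdf
page_count: 7
content_files:
  - content.md
images:
  - id: fig1_revenue_chart
    file: img/fig1_revenue_chart.png
    source_page: 4
    type: chart
    description: Bar chart of revenue
  - id: fig2_process_flow
    file: img/fig2_process_flow.png
    source_page: 6
    type: diagram
    description: Process flow diagram
```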