Convert PDF files to Markdown using PyMuPDF4LLM. The converter is optimized for technical documentation (datasheets, hardware manuals, programming guides) that contains tables, diagrams, and code listings. Pass `--ocr` or `--scan` to force OCR for scanned documents. The opt-in `--auto-ocr` flag enables OCR only when the selected pages are image-only.

The converter also applies post-processing to improve:

- heading hierarchy
- visible contents-page handling
- tables and lists
- flattened preformatted listings
- extracted image path handling

Important behavior for agents:

- Prefer no OCR for born-digital PDFs.
- Use `--ocr` / `--scan` only when the PDF is scanned, image-only, or has a broken text layer.
- `--auto-ocr` enables OCR only for image-only pages.
- Built-in PDF outlines are preferred for heading structure; visible contents pages are the next fallback.
- Visible contents pages are usually removed from the final Markdown because Markdown readers already provide heading navigation.
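
The OCR guidance above can be read as a small decision rule. The following is a minimal sketch of that rule, not part of the converter itself: the `PdfTraits` class, its field names, and `choose_ocr_flags` are hypothetical names introduced for illustration, and only the flag strings come from this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PdfTraits:
    """Hypothetical summary of what an agent knows about a PDF before converting it."""
    scanned: bool = False                    # the document is a scan or fully image-only
    broken_text_layer: bool = False          # a text layer exists but is unusable
    may_have_image_only_pages: bool = False  # mostly born-digital, some pages may be images


def choose_ocr_flags(traits: PdfTraits) -> List[str]:
    """Return the OCR-related flags to pass, following the guidance above."""
    # Forced OCR is reserved for scans and broken text layers.
    if traits.scanned or traits.broken_text_layer:
        return ["--ocr"]
    # Opt-in OCR that only applies to image-only pages.
    if traits.may_have_image_only_pages:
        return ["--auto-ocr"]
    # Born-digital PDFs: prefer no OCR at all.
    return []
```

For example, `choose_ocr_flags(PdfTraits(scanned=True))` yields `["--ocr"]`, while a born-digital PDF with default traits yields an empty list.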

Repository-wide implementation details and limitations live in the root CLAUDE.md.

How to run

| Flag | Description |
| --- | --- |
| `-o`, `--output` | Output `.md` file path (default: next to the source PDF as `<name>.md`) |
| `--pages` | Page range, e.g. `1-50` (default: all pages) |
| `--ocr` / `--scan` | Force full-page OCR for scanned PDFs or broken text layers. Use when user says "scanned", "scan", or "ocr". |
| `--auto-ocr` | Enable OCR only when selected pages are image-only. |
| `--ocr-engine` | `auto` (default), `mac`, `rapidocr`, `tesseract` |
| `--langs` | Comma-separated language codes (default: `en`) |
| `--threads` | Compatibility flag kept for the skill interface; currently unused |
| `--skip-heading-pipeline` | Bypass heading reconstruction and heading-specific cleanup |
| `--skip-text-cleaning` | Bypass prose/table/bullet cleanup while keeping heading reconstruction |
| `--skip-all-cleanup` | Bypass all optional post-processing; image paths are still rewritten |
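
As an illustration of how the flags in the table combine, here is a minimal sketch that assembles an argument list for the converter. The script name `pdf_to_markdown.py`, the `python` launcher, and the positional PDF argument are assumptions made for this example; only the flag names and the allowed `--ocr-engine` values are taken from the table. The helper `build_command` is hypothetical.

```python
import shlex
from typing import List, Optional

# Hypothetical entry point; substitute the converter's real script path.
SCRIPT = "pdf_to_markdown.py"

# Engine names as listed in the flag table.
OCR_ENGINES = {"auto", "mac", "rapidocr", "tesseract"}


def build_command(
    pdf_path: str,
    output: Optional[str] = None,
    pages: Optional[str] = None,
    ocr: bool = False,
    auto_ocr: bool = False,
    ocr_engine: Optional[str] = None,
    langs: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an argv list from the documented flags (sketch only)."""
    # Assumption: the PDF path is passed as a positional argument.
    cmd = ["python", SCRIPT, pdf_path]
    if output:
        cmd += ["--output", output]
    if pages:
        cmd += ["--pages", pages]  # e.g. "1-50"
    if ocr:
        cmd.append("--ocr")
    if auto_ocr:
        cmd.append("--auto-ocr")
    if ocr_engine:
        if ocr_engine not in OCR_ENGINES:
            raise ValueError(f"unknown OCR engine: {ocr_engine}")
        cmd += ["--ocr-engine", ocr_engine]
    if langs:
        cmd += ["--langs", ",".join(langs)]  # comma-separated language codes
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line for a scanned manual, pages 1-50.
    print(shlex.join(build_command("manual.pdf", pages="1-50", ocr=True)))
```

Building the command as a list rather than a single string keeps paths with spaces intact when the list is handed to `subprocess.run`.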
