Name: Pdf Reader
Author: lgili

Overview

Use this skill to extract clean, structured content from PDF files for downstream processing — by AI agents, simulation tools, or human review. The primary goal is fidelity: preserve the original document structure (paragraphs, headings, tables, figures) and output in formats that are easy to consume programmatically.

Default stance:

Digital PDFs (text layer present) → direct extraction via pdfplumber; precise, no OCR needed.
Scanned PDFs (image-only pages) → flag as requiring OCR before extraction. Direct extraction returns empty or garbled text on scanned content; recommend Tesseract or Adobe Acrobat OCR as a preprocessing step.
Multi-column layouts need spatial sorting: pdfplumber exposes word-level bounding boxes that make this possible, but always verify the reading order in the output.
Tables must be formatted as GFM markdown tables or CSV — never as raw text with whitespace alignment.
Images should be saved as PNG at native resolution, one file per embedded image object, named by page and index.
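
The rules above can be sketched as a few small helpers. This is a minimal illustration and not the bundled scripts' actual implementation: the function names, the `min_chars` threshold, the midpoint column split, and the `page003_img01.png` naming pattern are all assumptions. Only `survey_pdf` needs pdfplumber, which it imports lazily.

```python
def classify_page(char_count, image_count, min_chars=1):
    """Label a page as 'digital', 'scanned', or 'empty'.

    min_chars is a hypothetical threshold; raise it if scanned pages
    carry a few stray characters such as stamped page numbers.
    """
    if char_count >= min_chars:
        return "digital"
    if image_count > 0:
        return "scanned"  # image-only page: flag for OCR before extraction
    return "empty"


def sort_words_two_columns(words, page_width, line_tol=3):
    """Order word boxes left column first, then top-to-bottom.

    Assumes a simple two-column page split at the midpoint, which is
    only a rough heuristic. Each word is a dict with 'x0', 'x1', and
    'top' keys, the shape pdfplumber's extract_words() returns.
    """
    mid = page_width / 2

    def key(word):
        center = (word["x0"] + word["x1"]) / 2
        column = 0 if center < mid else 1
        line = int(word["top"] // line_tol)  # bucket words into lines
        return (column, line, word["x0"])

    return sorted(words, key=key)


def image_filename(page_number, index):
    """Build a name from page and index (the exact pattern is an assumption)."""
    return f"page{page_number:03d}_img{index:02d}.png"


def survey_pdf(path):
    """Classify every page of a PDF using pdfplumber's chars and images."""
    import pdfplumber  # third-party dependency, imported lazily

    with pdfplumber.open(path) as pdf:
        return [
            classify_page(len(page.chars), len(page.images))
            for page in pdf.pages
        ]
```

The sorting helper does not replace the manual check of reading order: full-width headings, footnotes, and three-column pages all break a midpoint split.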

Core Workflow

Reference Guide

| Topic | Reference | Load when |
| --- | --- | --- |
| PDF format and text layer structure | `references/pdf-formats.md` | Encountering encoding issues, garbled text, wrong reading order, font-mapping failures |
| Table extraction strategies | `references/table-extraction.md` | Tuning pdfplumber table detection, handling borderless or merged-cell tables |
| Image extraction and color spaces | `references/image-extraction.md` | Handling CMYK, indexed, masked, or inline images; filtering decorative elements |

Bundled Scripts

| Script | Purpose | Usage |
| --- | --- | --- |
| `scripts/pdf_extract_text.py` | Extract all text and tables from a PDF as Markdown (or tables-only as CSV) | `python skills/pdf-reader/scripts/pdf_extract_text.py --file doc.pdf --output doc.md` |
| `scripts/pdf_extract_images.py` | Extract all embedded images from a PDF and save as PNG files | `python skills/pdf-reader/scripts/pdf_extract_images.py --file doc.pdf --output-dir ./images` |
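
To illustrate the Markdown and CSV table output that the text-extraction script is described as producing, here is a minimal sketch of both renderings. It is not the script's actual code: `table_to_gfm` and `table_to_csv` are hypothetical names, and the input is assumed to be a list of rows whose cells are strings or `None`, which is the shape pdfplumber's `extract_tables()` yields for each table.

```python
import csv
import io


def _clean(cell):
    """Normalize one cell: None becomes '', whitespace collapses, pipes are escaped."""
    text = "" if cell is None else str(cell)
    return " ".join(text.split()).replace("|", "\\|")


def table_to_gfm(rows):
    """Render rows as a GFM table, treating the first row as the header."""
    rows = [list(r) for r in rows if r]  # drop empty rows
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    # Pad short rows so every line has the same number of columns.
    padded = [[_clean(c) for c in r] + [""] * (width - len(r)) for r in rows]
    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


def table_to_csv(rows):
    """Render rows as CSV text; the csv module handles quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for r in rows:
        writer.writerow(["" if c is None else c for c in r])
    return buf.getvalue()
```

Treating the first row as the header is a simplification; tables without a header row or with merged cells need the tuning described in `references/table-extraction.md`.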
