PDF extraction specialist for text, tables, and embedded images. Converts digital PDFs to clean Markdown, GFM tables, and PNG exports. Activates when you say 'read this PDF', 'extract text from PDF', 'get tables from PDF', 'export images from PDF', 'convert PDF to markdown', or 'parse PDF document'.
Use this skill to extract clean, structured content from PDF files for downstream processing — by AI agents, simulation tools, or human review. The primary goal is fidelity: preserve the original document structure (paragraphs, headings, tables, figures) and output in formats that are easy to consume programmatically.
Default stance:
- Scanned PDFs need `tesseract` or Adobe Acrobat OCR as preprocessing.
- `page.extract_text()` returns a non-empty string → digital PDF.
- See `references/pdf-formats.md` for encoding issues and text layer structure.

Extract full text with tables.

- Run `scripts/pdf_extract_text.py --file doc.pdf --output doc.md`.
- Pages are separated by `---` dividers with a page number header.
- Use the `--two-column` flag for two-column layouts; words are sorted by column position before line assembly.
- See `references/table-extraction.md` for pdfplumber table strategy tuning.

Extract tables as CSV (standalone).

- Run `scripts/pdf_extract_text.py --tables-only --format csv --output-dir ./tables/`.
- Output files are named `page<N>_table<M>.csv`.

Extract images.

- Run `scripts/pdf_extract_images.py --file doc.pdf --output-dir ./images/`.
- Output files are named `page<N>_img<M>.png`.
- Use `--min-width` and `--min-height` to skip decorative elements (default: 32×32 px).
- See `references/image-extraction.md` for color space handling and filtering.

Validate output.
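As a minimal sketch of what validating output could involve, assuming it means checking that exported files follow the `page<N>_table<M>.csv` and `page<N>_img<M>.png` naming patterns and are non-empty. The helper name `validate_outputs` and the specific checks are assumptions for illustration, not part of the bundled scripts.

```python
import re
from pathlib import Path

# Naming patterns come from the workflow; the checks themselves are assumed.
TABLE_RE = re.compile(r"^page(\d+)_table(\d+)\.csv$")
IMAGE_RE = re.compile(r"^page(\d+)_img(\d+)\.png$")


def validate_outputs(directory, pattern):
    """Return a list of problems found among the files in `directory`.

    Hypothetical helper: flags files whose names do not match `pattern`
    and matching files that are empty.
    """
    problems = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        if not pattern.match(path.name):
            problems.append(f"unexpected name: {path.name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {path.name}")
    return problems
```

Calling `validate_outputs("./tables", TABLE_RE)` or `validate_outputs("./images", IMAGE_RE)` returns an empty list when every file looks plausible.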
| Topic | Reference | Load when |
|---|---|---|
| PDF format and text layer structure | references/pdf-formats.md | Encountering encoding issues, garbled text, wrong reading order, font-mapping failures |
| Table extraction strategies | references/table-extraction.md | Tuning pdfplumber table detection, handling borderless or merged-cell tables |
| Image extraction and color spaces | references/image-extraction.md | Handling CMYK, indexed, masked, or inline images; filtering decorative elements |
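The reading-order problem mentioned in the first row is what the `--two-column` flag addresses: words are sorted by column position before line assembly. The sketch below illustrates that idea under two assumptions: the columns split at the page midpoint, and words are dicts with `text`, `x0`, and `top` keys (the shape pdfplumber's `page.extract_words()` returns). The function names and the `line_tolerance` parameter are hypothetical; the actual script may work differently.

```python
def _join_line(words):
    """Join the words of one line in left-to-right order."""
    return " ".join(w["text"] for w in sorted(words, key=lambda w: w["x0"]))


def order_two_column(words, page_width, line_tolerance=3.0):
    """Assemble text lines for a two-column page: left column, then right.

    Assumption: the column boundary is the horizontal midpoint of the page.
    """
    midpoint = page_width / 2
    left = [w for w in words if w["x0"] < midpoint]
    right = [w for w in words if w["x0"] >= midpoint]

    lines = []
    for column in (left, right):
        current, current_top = [], None
        # Walk the column top to bottom, starting a new line whenever the
        # vertical position moves more than `line_tolerance` points.
        for word in sorted(column, key=lambda w: (w["top"], w["x0"])):
            if current and abs(word["top"] - current_top) > line_tolerance:
                lines.append(_join_line(current))
                current = []
            if not current:
                current_top = word["top"]
            current.append(word)
        if current:
            lines.append(_join_line(current))
    return lines
```

A fixed midpoint is the simplest possible split; pages with unequal column widths or full-width headings would need a gap-detection step instead.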
| Script | Purpose | Usage |
|---|---|---|
| scripts/pdf_extract_text.py | Extract all text and tables from a PDF as Markdown (or tables-only as CSV) | python skills/pdf-reader/scripts/pdf_extract_text.py --file doc.pdf --output doc.md |
| scripts/pdf_extract_images.py | Extract all embedded images from a PDF and save as PNG files | python skills/pdf-reader/scripts/pdf_extract_images.py --file doc.pdf --output-dir ./images |
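For the Markdown output, the table-to-GFM step might look like the following sketch. It assumes rows shaped like the ones pdfplumber's `page.extract_tables()` returns (lists of cell strings, with `None` for empty cells) and treats the first row as the header. `rows_to_gfm` is a hypothetical helper, not the script's actual code.

```python
def rows_to_gfm(rows):
    """Render a list of rows (first row = header) as a GFM table string."""

    def clean(cell):
        # Empty cells arrive as None; pipes and newlines would break a GFM row.
        text = "" if cell is None else str(cell)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Escaping `|` and flattening newlines keeps each source row on a single Markdown line, at the cost of losing in-cell line breaks.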
Dependencies: pip install pdfplumber pymupdf
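A dependency check that produces `pip install` instructions when something is missing could be sketched as follows. The mapping and function names are assumptions for illustration. PyMuPDF has traditionally been imported as `fitz`, which is why the import name and the pip package name are tracked separately.

```python
import importlib.util

# Import name -> pip package name (assumed mapping for the two dependencies).
REQUIRED = {"pdfplumber": "pdfplumber", "fitz": "pymupdf"}


def missing_packages(required=None):
    """Return pip package names whose import name cannot be found."""
    required = REQUIRED if required is None else required
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]


def install_hint(packages):
    """Format a pip command for the missing packages, or '' if none."""
    return f"pip install {' '.join(packages)}" if packages else ""


if __name__ == "__main__":
    hint = install_hint(missing_packages())
    if hint:
        print(f"Missing dependencies. Install with: {hint}")
```

`importlib.util.find_spec` only checks whether a module can be located, so the check stays cheap and does not import the libraries themselves.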
- … `pip install` instructions if missing.
- … `--pages` range is specified.
- … `--overwrite` flag.
- … `--min-width`/`--min-height` without reporting them in the summary.

For PDF extraction tasks, report:

- … `datasheet-intelligence` for curve digitization."
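The `--min-width`/`--min-height` filter, combined with reporting skipped images in the summary, can be illustrated with a small sketch. The record shape (dicts with `page`, `width`, `height`), the function name, and the choice to number only the kept images per page are all assumptions. Only the 32×32 px default and the `page<N>_img<M>.png` pattern come from the text above.

```python
def filter_images(images, min_width=32, min_height=32):
    """Split image records into (kept, skipped) by pixel size.

    Kept entries are (filename, record) pairs. Assumption: <M> in the
    filename counts kept images per page, starting at 1.
    """
    kept, skipped, per_page = [], [], {}
    for image in images:
        if image["width"] < min_width or image["height"] < min_height:
            skipped.append(image)  # retained so the summary can report them
            continue
        index = per_page.get(image["page"], 0) + 1
        per_page[image["page"]] = index
        kept.append((f"page{image['page']}_img{index}.png", image))
    return kept, skipped


def summarize(kept, skipped):
    """One-line summary that always states how many images were skipped."""
    return f"{len(kept)} images exported, {len(skipped)} skipped (below size threshold)"
```

Returning the skipped records instead of dropping them silently lets the caller include the count in whatever report format the task requires.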