Extract text, tables, and metadata from PDF documents. Parse scanned PDFs with OCR, analyze PDF structure, and convert PDFs to other formats. Use when the user needs to read, parse, or extract content from PDF files, even if they just say "get the text from this document."
Use this skill when the user needs to:
pip install pdfplumber pypdf pdf2image
For OCR on scanned PDFs:
pip install pytesseract Pillow
brew install tesseract poppler # macOS
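Both pytesseract and pdf2image call external programs, so a quick check that those binaries are on PATH can save a confusing failure later. The following is a minimal sketch using only the standard library; the helper name `missing_binaries` and the `REQUIRED_BINARIES` mapping are illustrative and not part of any of the libraries above.

```python
import shutil

# Executables this workflow shells out to, mapped to the system package
# that usually provides them (names as used in the brew command above).
REQUIRED_BINARIES = {
    "tesseract": "tesseract",  # invoked by pytesseract
    "pdftoppm": "poppler",     # invoked by pdf2image
}


def missing_binaries(which=shutil.which):
    """Return the sorted package names whose executables are not on PATH."""
    return sorted(
        {pkg for exe, pkg in REQUIRED_BINARIES.items() if which(exe) is None}
    )
```

Calling `missing_binaries()` before any OCR or image conversion gives a list that can be reported to the user; an empty list means both tools were found.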
Use pdfplumber as the default. It handles most PDFs well.
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        if text:
            print(f"--- Page {i + 1} ---")
            print(text)
        else:
            print(f"--- Page {i + 1}: no text (likely scanned) ---")
When extract_text() returns None or an empty string, the page has no text layer and is most likely image-based. Fall back to OCR:
from pdf2image import convert_from_path
import pytesseract

images = convert_from_path("scanned.pdf", dpi=300)
for i, img in enumerate(images):
    text = pytesseract.image_to_string(img)
    print(f"--- Page {i + 1} ---")
    print(text)
Use scripts/extract_text.py for automatic detection and OCR fallback.
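For illustration, the detect-then-fall-back idea can be sketched per page as below. This is not the contents of scripts/extract_text.py; the function names and the `min_chars` threshold are assumptions chosen for the example, and the threshold in particular may need tuning for real documents.

```python
def needs_ocr(text, min_chars=20):
    """Treat a page as scanned when extraction yields almost no text.

    min_chars is an illustrative threshold, not a library default.
    """
    return text is None or len(text.strip()) < min_chars


def extract_with_fallback(path, dpi=300):
    """Yield (page_number, text, used_ocr) for each page of the PDF."""
    # Imported lazily so needs_ocr() stays usable without the PDF libraries.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if not needs_ocr(text):
                yield number, text, False
                continue
            # Render only this page; first_page and last_page are 1-based.
            image = convert_from_path(
                path, dpi=dpi, first_page=number, last_page=number
            )[0]
            yield number, pytesseract.image_to_string(image), True
```

Deciding page by page means a mixed document (some digital pages, some scanned) only pays the OCR cost where it is needed.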
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        # extract_tables() returns one list of rows per table found on the page
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
For structured output, convert tables to pandas DataFrames:
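import pandas as pd
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
    if tables:
        # First row is used as the header, the remaining rows as data
        df = pd.DataFrame(tables[0][1:], columns=tables[0][0])
        # to_markdown() requires the optional "tabulate" package
        print(df.to_markdown(index=False))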
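Extracted tables are plain lists of rows, and in practice cells can come back as None, headers can repeat, and rows can be shorter than the header. A small normalization step before building a DataFrame can be sketched as follows; `table_to_records` is a hypothetical helper written for this example, not a pdfplumber or pandas function.

```python
def table_to_records(table):
    """Convert a list-of-rows table into a list of dicts keyed by header."""
    if not table:
        return []
    header, *rows = table

    names, seen = [], {}
    for index, cell in enumerate(header):
        # Empty or missing header cells get a positional placeholder name.
        name = (cell or "").strip() or f"column_{index + 1}"
        seen[name] = seen.get(name, 0) + 1
        # Suffix repeated names so no column is silently overwritten.
        names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")

    records = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        # Pad short rows so every record has every column.
        cells += [""] * (len(names) - len(cells))
        # Cells beyond the header width are dropped by zip().
        records.append(dict(zip(names, cells)))
    return records
```

The result can be passed straight to `pd.DataFrame(records)`, which avoids the duplicate-column and None-header problems that the header-row approach above can run into.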
from pypdf import PdfReader

reader = PdfReader("input.pdf")
meta = reader.metadata  # None if the PDF has no document information
print(f"Pages: {len(reader.pages)}")
if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Creator: {meta.creator}")
    print(f"Created: {meta.creation_date}")
from pdf2image import convert_from_path

images = convert_from_path("input.pdf", dpi=300)
for i, img in enumerate(images):
    img.save(f"page_{i + 1}.png", "PNG")
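Called as above, convert_from_path returns every page image in a single list, which can use a lot of memory at 300 dpi on long documents. One way around that, sketched here under the assumption that converting a fixed number of pages at a time is acceptable, is to use the `first_page` and `last_page` parameters. The helper names and the default `batch_size` are illustrative.

```python
def page_batches(total_pages, batch_size=10):
    """Yield 1-based inclusive (first, last) page ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def save_pages_in_batches(path, dpi=300, batch_size=10):
    """Write page_N.png files while holding only one batch of images in memory."""
    # Imported lazily so page_batches() is usable without the PDF libraries.
    from pdf2image import convert_from_path
    from pypdf import PdfReader

    total = len(PdfReader(path).pages)
    for first, last in page_batches(total, batch_size):
        images = convert_from_path(path, dpi=dpi, first_page=first, last_page=last)
        for number, image in enumerate(images, start=first):
            image.save(f"page_{number}.png", "PNG")
```

The output file names match the simpler loop above, so the two approaches are interchangeable for downstream steps.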
[...] pdf-service skill).
Check whether extract_text() returns None or "" and fall back to OCR.
pdf2image needs poppler: brew install poppler. Without it, convert_from_path fails with a cryptic error about pdftoppm.
[...] table_settings tuning.
For password-protected PDFs, pass the password to pypdf: PdfReader("file.pdf", password="secret"). An empty password ("") works for some "encrypted" PDFs that aren't truly locked.
After extraction: