"Convert PDF files to clean text. Handles both embedded-text PDFs and scanned/image PDFs via OCR. Use when the user wants to import, extract, or convert a PDF to text."
Check if PDF has embedded text
pdftotext <input.pdf> - | head -20
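One way to make this check scriptable is to count the non-whitespace characters that pdftotext returns. The sketch below is an illustration only: the helper name has_text and the default threshold of 50 characters are assumptions, not part of this skill.

```shell
# Succeeds when stdin holds at least MIN non-whitespace bytes.
# MIN defaults to 50, an arbitrary assumption; tune it for your documents.
has_text() (
    min=${1:-50}
    # [:space:] also covers the form feeds pdftotext emits between pages.
    count=$(LC_ALL=C tr -d '[:space:]' | wc -c | tr -d ' ')
    [ "$count" -ge "$min" ]
)

# Example use (requires pdftotext):
#   if pdftotext -f 1 -l 3 input.pdf - | has_text; then
#       echo "embedded text found"
#   else
#       echo "probably scanned"
#   fi
```

A PDF that contains only page images should fail this check, because pdftotext then prints little more than page breaks.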
If the PDF has embedded text, use pdftotext (fast, accurate); otherwise fall back to OCR.
Test scan first two pages and evaluate quality
pdftotext -f 1 -l 2 <input.pdf> -
Read the output and check for:
If quality is poor with pdftotext, try OCR on the same two pages:
pdftoppm -f 1 -l 2 -png <input.pdf> /tmp/test-scan
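# pdftoppm zero-pads page numbers in longer PDFs, so match the pages with a glob
for img in /tmp/test-scan-*.png; do tesseract "$img" - 2>/dev/null; done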
rm /tmp/test-scan-*.png
Compare both outputs. Report findings to user before proceeding with full extraction.
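To make the comparison less subjective, both outputs can be scored with a rough heuristic. The sketch below assumes that a higher share of ASCII letters and digits among the non-whitespace bytes indicates cleaner text. alnum_percent is a hypothetical helper, and the heuristic will undercount non-English text.

```shell
# Reads text on stdin and prints the integer percentage of non-whitespace
# bytes that are ASCII letters or digits (0 when there is no text at all).
alnum_percent() (
    text=$(cat)
    total=$(printf '%s' "$text" | LC_ALL=C tr -d '[:space:]' | wc -c | tr -d ' ')
    alnum=$(printf '%s' "$text" | LC_ALL=C tr -cd '[:alnum:]' | wc -c | tr -d ' ')
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo "$((alnum * 100 / total))"
    fi
)

# Example use (requires pdftotext):
#   pdftotext -f 1 -l 2 input.pdf - | alnum_percent
```

Scoring the pdftotext output and the OCR output the same way gives two numbers to report to the user alongside a manual read of the text.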
If neither produces clean output, visually inspect the pages:
pdftoppm -f 1 -l 2 -png -r 200 <input.pdf> /tmp/visual-check
Use the Read tool to view /tmp/visual-check-1.png and /tmp/visual-check-2.png (Claude vision will render them). For PDFs with ten or more pages, pdftoppm zero-pads the page number, so the files may be named /tmp/visual-check-01.png and so on. Compare what you see on the page to what the text extraction produced. Identify the specific issue (columns, watermarks, unusual fonts, embedded images of text, etc.) and recommend an extraction strategy before proceeding.
rm /tmp/visual-check-*.png
Extract text
Embedded text (preferred):
pdftotext <input.pdf> <output.txt>
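When no output path is supplied, the output name defaults to the input name with a .txt extension. A minimal sketch of that rule follows; default_output is a hypothetical helper, and appending .txt to inputs that lack a .pdf suffix is an assumption.

```shell
# Prints the default output path for a given input PDF path.
default_output() (
    input=$1
    case $input in
        # Strip a trailing .pdf or .PDF, then add .txt
        *.pdf|*.PDF) printf '%s.txt\n' "${input%.*}" ;;
        # Assumption: inputs without a .pdf suffix just get .txt appended
        *)           printf '%s.txt\n' "$input" ;;
    esac
)

# Example use:
#   output=${2:-$(default_output "$1")}
```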
Scanned/image PDF (OCR fallback):
ocrmypdf --force-ocr <input.pdf> /tmp/ocr-temp.pdf
pdftotext /tmp/ocr-temp.pdf <output.txt>
rm /tmp/ocr-temp.pdf
If ocrmypdf is not installed:
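# Convert pages to images, then OCR each page in order into one text file
# (tesseract accepts a single image per call, so loop over the pages)
pdftoppm -png <input.pdf> /tmp/pdf-page
for img in /tmp/pdf-page-*.png; do tesseract "$img" - 2>/dev/null; done > <output.txt>
rm /tmp/pdf-page-*.png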
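The tesseract fallback can also be wrapped in a function that works in a private scratch directory, so that concurrent runs do not overwrite each other's page images. This is a sketch under assumptions: ocr_fallback is a hypothetical name, and the 300 DPI render resolution is an assumed value rather than something this skill prescribes.

```shell
# OCR every page of a PDF with pdftoppm + tesseract into one text file.
# Usage: ocr_fallback input.pdf output.txt
ocr_fallback() (
    pdf=$1
    out=$2
    # Private scratch directory instead of fixed names under /tmp.
    workdir=$(mktemp -d) || exit 1
    status=0
    # -r 300 is an assumed resolution; pdftoppm renders at 150 DPI by default.
    if pdftoppm -r 300 -png "$pdf" "$workdir/page"; then
        : > "$out"
        # Page numbers are zero-padded, so the glob expands in page order.
        for img in "$workdir"/page-*.png; do
            [ -e "$img" ] || continue
            # "-" as the output base sends the recognized text to stdout.
            tesseract "$img" - 2>/dev/null >> "$out"
        done
    else
        status=1
    fi
    rm -rf "$workdir"
    exit "$status"
)
```

The function returns non-zero when rendering fails, which lets a caller report the error instead of leaving an empty output file unexplained.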
Report results
wc -l <output.txt>
pdfinfo <input.pdf> | grep Pages
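The two numbers above can be combined into a single summary line. The sketch below is illustrative: page_count and summarize are hypothetical helpers, and lines per page is only a rough sanity check, not a quality measure defined by this skill.

```shell
# Reads `pdfinfo` output on stdin and prints the value of its "Pages:" field.
page_count() (
    awk '$1 == "Pages:" { print $2; exit }'
)

# Prints a one-line summary from a page count and a line count.
summarize() (
    pages=${1:-0}
    lines=${2:-0}
    if [ "$pages" -gt 0 ]; then
        printf '%s lines from %s pages (about %s per page)\n' \
            "$lines" "$pages" "$((lines / pages))"
    else
        printf '%s lines, page count unknown\n' "$lines"
    fi
)

# Example use (requires pdfinfo):
#   summarize "$(pdfinfo input.pdf | page_count)" "$(wc -l < output.txt | tr -d ' ')"
```

A very low lines-per-page figure can hint that the extraction missed most of the content and is worth a second look.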
$ARGUMENTS[0]: path to input PDF (required)
$ARGUMENTS[1]: path to output txt file (optional, defaults to same name with .txt extension)
Requires at least one of:
pdftotext (from poppler-utils): for embedded text PDFs
ocrmypdf + pdftotext: for scanned PDFs
tesseract + pdftoppm: OCR fallback if ocrmypdf unavailable
Check availability before starting:
which pdftotext ocrmypdf tesseract pdftoppm 2>/dev/null
If any tools are missing, tell the user what to install:
sudo apt install poppler-utils (provides pdftotext and pdftoppm)
sudo apt install tesseract-ocr (provides tesseract)
pip install ocrmypdf (provides ocrmypdf)
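The availability check and the install hints can be tied together in a small script. This is a sketch only: missing_tools and install_hint are hypothetical helpers, and the hints simply repeat the apt and pip commands listed above, so they assume a Debian-style system.

```shell
# Prints each named tool that is not on PATH, one per line.
# `command -v` is the POSIX counterpart of `which`.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
)

# Maps a tool name to the install command given in this document.
install_hint() (
    case $1 in
        pdftotext|pdftoppm) echo "sudo apt install poppler-utils" ;;
        tesseract)          echo "sudo apt install tesseract-ocr" ;;
        ocrmypdf)           echo "pip install ocrmypdf" ;;
        *)                  echo "no install hint known for $1" ;;
    esac
)

# Example use:
#   for tool in $(missing_tools pdftotext ocrmypdf tesseract pdftoppm); do
#       echo "$tool is missing: $(install_hint "$tool")"
#   done
```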