PDF to Text Import

Workflow

Check if PDF has embedded text
```
pdftotext <input.pdf> - | head -20
```
- If output contains readable text → use pdftotext (fast, accurate)
- If output is empty or garbled → fall back to OCR
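
The readable-versus-garbled decision above can be roughly automated. The following is a minimal sketch, not a definitive test: the function name `looks_like_text`, the 50-character minimum, and the 70% threshold are all illustrative assumptions. It also counts only ASCII letters, digits, and punctuation as readable, so it will under-rate non-English documents.

```shell
# Heuristic sketch: succeed if stdin looks like readable text.
# Assumptions (not from the original workflow): at least 50 non-space bytes,
# and at least 70% of them ASCII letters, digits, or punctuation.
looks_like_text() {
  local sample total readable
  sample=$(LC_ALL=C tr -d '[:space:]')   # ignore whitespace so layout does not skew the ratio
  total=$(( $(printf '%s' "$sample" | wc -c) ))
  readable=$(( $(printf '%s' "$sample" | LC_ALL=C tr -cd '[:alnum:][:punct:]' | wc -c) ))
  # Short-circuit on tiny samples so an empty extraction never divides by zero
  [ "$total" -ge 50 ] && [ $((readable * 100 / total)) -ge 70 ]
}
```

One possible use is `pdftotext -l 2 input.pdf - | looks_like_text && echo "embedded text" || echo "needs OCR"`, where `input.pdf` stands in for the real file.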

Test-scan the first two pages and evaluate quality
```
pdftotext -f 1 -l 2 <input.pdf> -
```
Read the output and check for:
- Garbled characters, mojibake, or encoding issues
- Missing words, merged lines, or broken paragraphs
- Headers/footers bleeding into body text
- Table or column data mangled into single lines
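
Two of these checks can be partly mechanized before reading the output by eye. The sketch below counts Unicode replacement characters (U+FFFD, a common sign of encoding trouble) and suspiciously long lines (a hint that lines or table cells were merged). The function name `scan_quality` and the 200-character line limit are assumptions chosen for illustration. It does not replace reading the text.

```shell
# Sketch: summarize two rough quality signals for extracted text read from stdin.
# The 200-character limit for a "long" line is an arbitrary illustrative choice.
scan_quality() {
  local text bad long
  text=$(cat)
  # Count occurrences of U+FFFD (bytes EF BF BD); grep exits non-zero on no match, hence || true
  bad=$(printf '%s\n' "$text" | { LC_ALL=C grep -o "$(printf '\357\277\275')" || true; } | wc -l)
  # Count lines longer than 200 characters
  long=$(printf '%s\n' "$text" | awk 'length($0) > 200 { n++ } END { print n + 0 }')
  echo "replacement_chars=$((bad)) long_lines=$((long))"
}
```

For example, `pdftotext -f 1 -l 2 input.pdf - | scan_quality` prints both counts for the first two pages, where `input.pdf` is a placeholder.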

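If the pdftotext output shows any of the problems listed above, OCR the same two pages for comparison:
```
pdftoppm -f 1 -l 2 -png <input.pdf> /tmp/test-scan
# pdftoppm zero-pads page numbers in longer documents (e.g. test-scan-01.png), so match with a glob
for page in /tmp/test-scan-*.png; do tesseract "$page" - 2>/dev/null; done
rm /tmp/test-scan-*.png
```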

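For a visual check, render the same pages as 200 DPI images:
```
pdftoppm -f 1 -l 2 -png -r 200 <input.pdf> /tmp/visual-check
# Compare /tmp/visual-check-*.png against the extracted text, then clean up
rm /tmp/visual-check-*.png
```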

Extract text

Embedded text (preferred):

```
pdftotext <input.pdf> <output.txt>
```

Scanned/image PDF (OCR fallback):

```
ocrmypdf --force-ocr <input.pdf> /tmp/ocr-temp.pdf
pdftotext /tmp/ocr-temp.pdf <output.txt>
rm /tmp/ocr-temp.pdf
```

If ocrmypdf is not installed:

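```
# Convert pages to images, then OCR each page in order.
# tesseract accepts one input image per call, so loop instead of passing a glob.
pdftoppm -png <input.pdf> /tmp/pdf-page
for page in /tmp/pdf-page-*.png; do
  tesseract "$page" - 2>/dev/null
done > <output.txt>
rm /tmp/pdf-page-*.png
```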

Check which of the required tools are installed:
```
which pdftotext ocrmypdf tesseract pdftoppm 2>/dev/null
```
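
Because `which` prints only the tools it finds, it can be easier to list the ones that are absent. A small sketch of that idea follows. The helper name `missing_tools` is hypothetical, and it relies on the shell builtin `command -v` rather than `which`.

```shell
# Sketch: print the name of every argument that is not available on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v succeeds silently when the tool exists; otherwise report its name
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools pdftotext ocrmypdf tesseract pdftoppm` prints nothing when everything is installed. If only `ocrmypdf` is listed, the pdftoppm plus tesseract fallback still applies.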
