Extract PDF text content using shell tools or Python libraries when the read_file PDF handler fails
The read_file tool with filetype='pdf' can be unreliable for PDF text extraction.
Use run_shell with dedicated PDF extraction tools instead of relying on read_file for PDFs.
Basic extraction to a file:
pdftotext input.pdf output.txt
Or to extract to stdout:
pdftotext input.pdf -
With layout preservation:
pdftotext -layout input.pdf output.txt
Inspect PDF metadata with pdfinfo:
pdfinfo input.pdf
Useful for checking page count, dimensions, and PDF properties before extraction.
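The page count that pdfinfo reports can also be read programmatically. The following is a minimal Python sketch, assuming pdfinfo prints one `Key: value` pair per line and includes a `Pages` entry. The helper names `parse_pdfinfo` and `page_count` are hypothetical and not part of any library.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values containing colons survive
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            info[key.strip()] = value.strip()
    return info


def page_count(pdf_path: str) -> int:
    """Run pdfinfo (assumed to be installed) and return the page count."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return int(parse_pdfinfo(result.stdout)["Pages"])
```

Calling `page_count("input.pdf")` raises `subprocess.CalledProcessError` if pdfinfo exits with an error and `FileNotFoundError` if pdfinfo is not installed, so either case can be treated as "PDF not accessible".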
Using PyMuPDF in Python:
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
text = ""
for page in doc:
    text += page.get_text()  # append the plain text of each page
doc.close()
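Using pdfplumber, which can also pull out tables:
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()      # text of this page
        tables = page.extract_tables()  # list of tables, each a list of rows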
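The tables returned by `page.extract_tables()` are nested lists: one list per table, each holding rows of cell values, where an empty cell may come back as `None`. A minimal sketch of serializing one such table to CSV follows. The helper name `table_to_csv` is hypothetical.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows of cells) to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Replace missing cells (None) with empty strings
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

Inside the page loop this could be applied as `for table in tables: print(table_to_csv(table))`.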
Check PDF exists and is readable:
pdfinfo input.pdf 2>/dev/null || echo "PDF not accessible"
Extract text using pdftotext:
pdftotext -layout input.pdf - > extracted_text.txt
If pdftotext fails, try the Python fallback:
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for i, page in enumerate(doc):
    print(f"--- Page {i+1} ---")  # page separator for readability
    print(page.get_text())
doc.close()
Verify extraction succeeded:
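One way to perform this check is sketched below in Python. The threshold of 20 non-whitespace characters and the function name `looks_extracted` are assumptions chosen for illustration. The idea is that a failed extraction, or a scanned PDF with no text layer, tends to produce empty or whitespace-only output.

```python
def looks_extracted(text: str, min_chars: int = 20) -> bool:
    """Return True if text holds at least min_chars non-whitespace characters."""
    # split() with no arguments drops all whitespace, including the form
    # feeds that pdftotext emits between pages
    visible = "".join(text.split())
    return len(visible) >= min_chars
```

To apply it to the file written in the previous step, read `extracted_text.txt` and pass its contents to `looks_extracted`. A `False` result suggests trying the Python fallback.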
| Tool | Best For |
|---|---|
| pdftotext | Fast, simple text extraction |
| pdftotext -layout | Preserving spacing/formatting |
| PyMuPDF | Complex PDFs, programmatic access |
| pdfplumber | Tables and structured data |
# In your agent workflow, prefer this pattern:
result = run_shell(command="pdftotext document.pdf -", timeout=30)
if result.stdout and result.stdout.strip():
    # Non-empty output: treat the extraction as successful
    content = result.stdout
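The pattern above stops at pdftotext. A sketch of chaining it with the PyMuPDF fallback is given below, using Python's `subprocess` module in place of the agent's run_shell tool. All function names here are hypothetical, and the 30-second timeout simply mirrors the value used above.

```python
import subprocess


def extract_with_pdftotext(pdf_path: str) -> str:
    """Extract text with the pdftotext CLI; return "" on any failure."""
    try:
        result = subprocess.run(
            ["pdftotext", "-layout", pdf_path, "-"],
            capture_output=True, encoding="utf-8", errors="replace",
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""  # pdftotext is missing or took too long
    return result.stdout if result.returncode == 0 else ""


def extract_with_pymupdf(pdf_path: str) -> str:
    """Extract text with PyMuPDF; return "" if it is unavailable or fails."""
    try:
        import fitz  # PyMuPDF, imported lazily so it stays optional
        doc = fitz.open(pdf_path)
        try:
            return "".join(page.get_text() for page in doc)
        finally:
            doc.close()
    except Exception:
        return ""


def extract_text(pdf_path, extractors=(extract_with_pdftotext, extract_with_pymupdf)):
    """Return the first non-empty result from the extractors, tried in order."""
    for extractor in extractors:
        text = extractor(pdf_path)
        if text.strip():
            return text
    return ""
```

Passing the extractors as a parameter keeps the order easy to change, for example to put a pdfplumber-based extractor first for table-heavy documents.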