Use shell commands or Python libraries to extract PDF text when the read_file PDF handler fails
The read_file tool with filetype='pdf' often returns binary image data, errors, or unusable output when attempting to extract text from PDF documents. This makes it unreliable for structured data extraction tasks.
Use run_shell with command-line tools (pdftotext, pdfinfo) or execute_code_sandbox with Python libraries (PyMuPDF, pdfplumber) to extract PDF text content reliably.
# Extract all text to stdout
pdftotext input.pdf -
# Or extract to file
pdftotext input.pdf output.txt
cat output.txt
# Show document metadata such as the page count
pdfinfo input.pdf
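When the same extraction is driven from Python rather than typed at a shell, the command can be assembled as an argument list. The following is a minimal sketch, assuming Poppler's pdftotext is installed and on PATH. The helper names build_pdftotext_cmd and pdftotext_extract are hypothetical and not part of any tool API; -f, -l, and -layout are standard pdftotext options.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None, layout=False):
    """Assemble a pdftotext argument list that writes text to stdout."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    if layout:
        cmd.append("-layout")            # keep the physical page layout
    cmd += [pdf_path, "-"]               # "-" sends the text to stdout
    return cmd


def pdftotext_extract(pdf_path, **options):
    """Return the extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None
    try:
        proc = subprocess.run(
            build_pdftotext_cmd(pdf_path, **options),
            capture_output=True,
            text=True,
            check=True,
        )
    except subprocess.CalledProcessError:
        return None
    return proc.stdout
```

Returning None instead of raising lets a caller move on to another extraction method without wrapping every call in its own error handling.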
import fitz # PyMuPDF
doc = fitz.open("input.pdf")
text = ""
for page in doc:
    text += page.get_text()
print(text)
doc.close()
import pdfplumber
with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
        # For tables:
        # tables = page.extract_tables()
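For structured data, the nested lists returned by page.extract_tables() usually need some cleanup before use. The sketch below assumes the first row of each table is a header row, which will not hold for every PDF. The helper names table_to_records and extract_table_records are hypothetical.

```python
def table_to_records(table):
    """Convert one table (a list of rows) into dicts keyed by the header row."""
    if not table:
        return []
    # Cells may be None where pdfplumber found no text; treat those as empty.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_table_records(pdf_path):
    """Collect records from every table on every page of the PDF."""
    import pdfplumber  # imported lazily so table_to_records works without it

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records
```

Calling extract_table_records("input.pdf") would return one dict per table row. Tables without a header row need a different mapping.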
1. Attempt read_file with filetype='pdf' first, in case it works.
2. Check the output. If you receive binary image data, an error, or otherwise unusable output, continue to step 3.
3. Fall back to one of the extraction methods above:
   - pdftotext via run_shell for quick text extraction
   - pdfplumber via execute_code_sandbox for structured data/tables
   - PyMuPDF for complex layouts or when you need more control
4. Process the extracted text for your task.
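The workflow above can be sketched as a small fallback chain. This is an illustration under stated assumptions, not a definitive implementation: looks_unusable is a hypothetical heuristic whose thresholds (min_chars, min_printable_ratio) are arbitrary starting points, and each extractor is any callable that takes a path and returns text.

```python
def looks_unusable(text, min_chars=20, min_printable_ratio=0.9):
    """Heuristic: treat empty, very short, or mostly non-printable output as unusable."""
    if not text or len(text.strip()) < min_chars:
        return True
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) < min_printable_ratio


def extract_with_fallback(pdf_path, extractors):
    """Try each (name, callable) pair in order; return the first usable result.

    Returns (name, text) on success or (None, None) if every extractor fails.
    """
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            continue  # an error counts as a failed attempt; try the next method
        if not looks_unusable(text):
            return name, text
    return None, None
```

Passing the extractors as a list keeps the order explicit, so the cheapest method can come first and the more controllable ones later.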
# Via run_shell
result = run_shell(command="pdftotext document.pdf -")
pdf_text = result.stdout
# Via execute_code_sandbox
code = """
import pdfplumber
with pdfplumber.open("/path/to/document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
"""
result = execute_code_sandbox(code=code)
pdf_text = result.stdout
tesseract)