Reliably extract text from PDFs using pdftotext when standard file reading fails.
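Standard file reading tools (e.g., read_file) often fail to extract text from PDF documents. Instead of returning parsed text, they may return raw binary data (often containing null bytes such as \x00), base64-encoded content, or otherwise garbled output.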
This occurs because PDF is a complex binary format, not plain text. Attempts to parse PDFs with Python PDF libraries (like PyMuPDF) in sandboxed environments may also fail due to missing dependencies or environment restrictions.
Use the pdftotext command-line utility (part of poppler-utils) via run_shell. This tool is commonly pre-installed in Linux environments and reliably extracts text content from PDFs.
When attempting to read a PDF:
Try reading the file with read_file.
If the output contains null bytes (\x00), appears as base64, or is clearly binary/garbled, assume standard reading has failed.
Run the following shell command using run_shell:
pdftotext -layout -nopgbrk <file_path> -
-layout: Maintains the physical layout of the text (optional but recommended).
-nopgbrk: Prevents inserting form feed characters between pages.
-: Outputs content to stdout instead of creating a new file.
Capture the stdout from the shell command. This string is the extracted text.
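Where a plain Python interpreter is available instead of run_shell, the same command can be sketched with the standard subprocess module. This is a minimal sketch, not a definitive implementation: the helper names build_pdftotext_command and extract_pdf_text, the keep_layout switch, and the 60-second timeout are illustrative assumptions, and pdftotext is assumed to be on the PATH.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Return the pdftotext argument list (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # keep the physical layout of the text
    cmd.append("-nopgbrk")     # no form feed characters between pages
    cmd += [pdf_path, "-"]     # "-" sends the text to stdout
    return cmd


def extract_pdf_text(pdf_path, keep_layout=True, timeout=60):
    """Run pdftotext and return its stdout as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        encoding="utf-8",   # pdftotext writes UTF-8 by default
        errors="replace",   # do not crash on stray undecodable bytes
        timeout=timeout,    # assumed limit; tune for large documents
        check=True,         # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Passing the arguments as a list rather than a single shell string means file paths containing spaces need no extra quoting.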
Scenario: You need to read document.pdf.
Step 1: Attempt standard read
content = read_file("document.pdf")
if "\x00" in content or not content.strip():
    # Fallback needed
    pass
Step 2: Fallback to shell
result = run_shell("pdftotext -layout -nopgbrk document.pdf -")
text = result.stdout
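The check in Step 1 only covers null bytes and empty output. A slightly fuller heuristic for deciding when to fall back might look like the sketch below. The function name looks_like_failed_pdf_read and the 10% threshold are assumptions for illustration. The "%PDF-" and "JVBERi" checks are additions not taken from the steps above: they rely on PDF files starting with the "%PDF-" header, which begins with "JVBERi" when base64-encoded.

```python
def looks_like_failed_pdf_read(content, max_bad_ratio=0.10):
    """Heuristically decide whether a read returned raw PDF data, not text."""
    stripped = content.strip()
    if not stripped:
        return True  # empty or whitespace-only output
    if "\x00" in content:
        return True  # null bytes indicate binary data
    if stripped.startswith("%PDF-"):
        return True  # raw PDF header passed through unparsed
    if stripped.startswith("JVBERi"):
        return True  # base64 encoding of the "%PDF-" header
    # Count control characters and U+FFFD replacement characters,
    # which typically appear when binary data is decoded as text.
    bad = sum(
        1
        for ch in stripped
        if ch == "\ufffd" or not (ch.isprintable() or ch in "\n\r\t")
    )
    return bad / len(stripped) > max_bad_ratio
```

Because the last check uses str.isprintable rather than an ASCII whitelist, legitimate non-English text is not flagged as garbled.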
Requires pdftotext to be installed (usually via poppler-utils).
If pdftotext is not found, attempt to install it (apt-get install poppler-utils) if permissions allow, or notify the user.
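Availability can be checked before the fallback is attempted. The sketch below uses shutil.which from the Python standard library. The function name pdftotext_status and the wording of the message are illustrative assumptions, and the sketch deliberately does not run the installation itself.

```python
import shutil


def pdftotext_status(executable="pdftotext"):
    """Return (True, path) if the tool is on PATH, else (False, advice)."""
    path = shutil.which(executable)  # None when the executable is not found
    if path is None:
        return (
            False,
            "pdftotext not found: try 'apt-get install poppler-utils' "
            "if permissions allow, otherwise notify the user.",
        )
    return (True, path)


available, detail = pdftotext_status()
print(detail)
```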