PDF extraction with ordered tool chain: read_file, then run_shell/pdftotext, then execute_code_sandbox/PyMuPDF
This skill provides a robust workflow for acquiring PDF documents from web sources and extracting their text content, with a clearly ordered sequence of tool invocations to maximize success rate.
PDFs fetched from web sources often fail to extract cleanly: JavaScript redirects, corrupted files, missing tools, and inaccessible content are all common. This workflow addresses those failures with a strictly ordered fallback sequence that prioritizes shell-based tools over Python sandbox execution.
| Step | Tool | Method | Priority |
|---|---|---|---|
| 0 | read_file | Direct PDF text extraction | First attempt |
| 1 | run_shell | pdftotext command | Primary fallback (if Step 0 returns binary/fails) |
| 2 | execute_code_sandbox | PyMuPDF Python library | Secondary fallback (if Step 1 fails) |
| 3 | Domain knowledge | Manual content generation | Last resort |
Key principle: Always try shell tools (run_shell) before Python sandbox (execute_code_sandbox) when both are viable options. Shell execution is more reliable in constrained environments.
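The ordered fallback in the table can be sketched as a small driver that tries each extractor in turn and stops at the first one that yields non-empty text. This is a minimal illustration, not part of the skill itself: the function name `extract_with_fallbacks` and the `(name, extractor)` pair convention are assumptions made for the example, and each extractor stands in for a real tool call such as read_file, pdftotext, or PyMuPDF.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor takes a PDF path and returns extracted text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallbacks(
    pdf_path: str,
    extractors: Iterable[Tuple[str, Extractor]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, extractor) pair in order; return (name, text) of the first success."""
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception:
            # Treat any error as "this step failed" and move on to the next one.
            continue
        if text and text.strip():
            return name, text
    # Every step failed: the caller moves on to the last-resort step.
    return None, None
```

Passing the extractors as an ordered list keeps the priority explicit: shell-based steps are simply listed before sandbox-based ones.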
First, attempt to extract PDF text using the read_file tool. This is the simplest approach and handles many PDFs correctly:
read_file filetype="pdf" file_path="path/to/document.pdf"
Expected outcomes:
Critical: If read_file returns binary data (PNG/JPEG headers, raw PDF bytes), do NOT attempt to parse it manually. Immediately switch to shell-based pdftotext.
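One way to recognize the binary case is to check the start of the returned data for well-known file signatures. The sketch below assumes the read_file output is available as `bytes`; the helper name `looks_like_binary` and the 1024-byte sample size are illustrative choices, and the check is a heuristic rather than a complete binary detector.

```python
# Magic prefixes that indicate raw binary rather than extracted text.
BINARY_SIGNATURES = (
    b"%PDF-",              # raw PDF bytes
    b"\x89PNG\r\n\x1a\n",  # PNG header
    b"\xff\xd8\xff",       # JPEG header
)


def looks_like_binary(data: bytes, sample_size: int = 1024) -> bool:
    """Heuristically decide whether read_file output is binary instead of text."""
    sample = data[:sample_size]
    # Ignore leading whitespace before comparing against the known signatures.
    if sample.lstrip().startswith(BINARY_SIGNATURES):
        return True
    # NUL bytes almost never appear in genuinely extracted text.
    return b"\x00" in sample
```

If this returns `True`, the output should be discarded and the workflow should move straight to pdftotext.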
Many PDF hosting sites use JavaScript-based redirects or block automated requests. Use curl with a realistic browser user-agent:
curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o output.pdf "URL_HERE"
Key flags:
-L: Follow redirects
-A: Set user-agent header to mimic a real browser
-o: Specify output filename
Always validate that the downloaded file is actually a PDF before attempting extraction:
file output.pdf
Expected output should contain "PDF document". If not:
This step takes priority over Python-based extraction. If Step 0 failed or if you're working with a newly downloaded PDF, use run_shell with pdftotext before attempting any Python libraries:
pdftotext downloaded.pdf extracted.txt
Execute via run_shell:
run_shell command="pdftotext downloaded.pdf extracted.txt"
If pdftotext is not available, install it first:
# Debian/Ubuntu
apt-get update && apt-get install -y poppler-utils
# macOS
brew install poppler
# RHEL/CentOS
yum install -y poppler-utils
Why shell-first? Shell-based pdftotext is more reliable, faster, and avoids sandbox execution issues that can affect Python code execution in constrained environments.
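When the workflow is driven from a script rather than through run_shell directly, the same pdftotext call can be wrapped so that a missing tool or a failed extraction becomes a simple `False` instead of an exception. This is a sketch under assumptions: the function name `run_pdftotext`, the `binary` parameter, and the 120-second timeout are all illustrative and not part of the original workflow.

```python
import shutil
import subprocess


def run_pdftotext(pdf_path: str, txt_path: str, binary: str = "pdftotext") -> bool:
    """Run pdftotext and report success; never raises for a missing tool or bad PDF."""
    if shutil.which(binary) is None:
        return False  # tool not installed: caller should install it or fall back
    try:
        result = subprocess.run(
            [binary, pdf_path, txt_path],
            capture_output=True,
            timeout=120,  # assumed limit; tune for large documents
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # pdftotext exits with a non-zero status when it cannot open or read the PDF.
    return result.returncode == 0
```

A `False` result is the signal to install poppler-utils and retry, or to move on to the PyMuPDF fallback.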
Fall back to Python's PyMuPDF library via execute_code_sandbox only if run_shell with pdftotext fails or is unavailable:
import fitz  # PyMuPDF
doc = fitz.open("downloaded.pdf")
text = ""
for page in doc:
    text += page.get_text()  # plain-text extraction for each page
doc.close()
with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write(text)
Execute within execute_code_sandbox:
execute_code_sandbox code="<Python code above>"
Install if needed:
pip install pymupdf
Note: Some environments may experience execute_code_sandbox failures (unknown errors). This is why shell-based extraction (pdftotext via run_shell) must be attempted first.
If the PDF cannot be accessed or extracted after all attempts:
Example degradation note:
NOTE: Source document [URL] was inaccessible due to [reason].
Content below combines partial extraction with established domain knowledge
for [topic]. Verify against official sources when available.
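If the note is generated programmatically, the bracketed placeholders can be filled by a small helper. This is only a sketch: the function name `degradation_note` and its three parameters are hypothetical and simply mirror the [URL], [reason], and [topic] slots in the template above.

```python
def degradation_note(url: str, reason: str, topic: str) -> str:
    """Fill the degradation-note template with the source URL, failure reason, and topic."""
    return (
        f"NOTE: Source document {url} was inaccessible due to {reason}.\n"
        "Content below combines partial extraction with established domain knowledge\n"
        f"for {topic}. Verify against official sources when available."
    )
```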
#!/bin/bash
# pdf-extract-workflow.sh
PDF_URL="$1"
OUTPUT_PDF="downloaded.pdf"
OUTPUT_TXT="extracted.txt"
# Step 0/1: Download with browser user-agent
echo "Downloading PDF..."
curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$PDF_URL"
# Step 1: Verify file type
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: Downloaded file is not a valid PDF"
    echo "Attempting fallback extraction anyway..."
fi
# Step 2: Try pdftotext via shell (PRIMARY EXTRACTION)
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
        echo "Extraction successful with pdftotext"
        exit 0
    fi
fi
# Step 3: Fallback to PyMuPDF via Python sandbox (SECONDARY)
echo "Falling back to PyMuPDF..."
python3 << 'PYTHON_SCRIPT'
import fitz
import sys