PDF text extraction with tool cascade prioritizing shell pdftotext before Python fallback
This skill provides an optimized workflow for extracting text content from PDF documents (local files or downloaded URLs) using a prioritized tool cascade that favors shell-based extraction before falling back to Python libraries.
Analysis of execution patterns shows:
- read_file on PDFs sometimes returns binary/image data instead of text
- run_shell with pdftotext has a higher success rate and fewer sandbox errors
- execute_code_sandbox can fail with "unknown error" in constrained environments

Before beginning, identify your scenario:
| Scenario | Start Here | Skip |
|---|---|---|
| PDF already on local disk | Step 1 (Try read_file) | Shell download steps |
| PDF at a web URL | Shell download, then Step 1 | None |
| Need maximum reliability | Full cascade (all 3 tools) | None |
If your PDF is at a web URL, download it first with a browser user-agent:
curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o target.pdf "URL_HERE"
Key flags:
- -L: Follow redirects
- -A: Set user-agent header to mimic a real browser
- -o: Specify output filename

If you already have the PDF locally, skip to Step 1.
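A download can finish without error and still leave an HTML error page on disk instead of a PDF. The sketch below is one way to check before starting the cascade, assuming a valid PDF carries the `%PDF-` signature near the start of the file. The helper name `looks_like_pdf` and the 1024-byte probe window are illustrative choices, not part of this skill's tooling.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # signature expected at (or near) the start of a PDF


def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the PDF signature appears within the first probe_bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        head = fh.read(probe_bytes)
    return PDF_MAGIC in head
```

If this returns False for a freshly downloaded target.pdf, the download most likely fetched something other than the PDF, and retrying the download is more useful than moving on to extraction.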
First, attempt to extract text using the read_file tool:
read_file(filetype="pdf", file_path="target.pdf")
Evaluate the response:
| Response Type | Interpretation | Next Action |
|---|---|---|
| Clean readable text | Success | Proceed to content analysis |
| Binary data / PNG image / garbled | read_file returned raw data | Go to Step 2 immediately |
| Error / timeout | Tool failure | Go to Step 2 immediately |
Critical: If read_file returns binary image data or garbled content, do not retry read_file. Immediately proceed to Step 2.
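The decision table above can be approximated in code. This is a rough heuristic sketch, not how read_file itself reports results: the function name `classify_read_file_response`, the 0.85 printable-character threshold, and treating an empty response as an error are all assumptions made for illustration.

```python
def classify_read_file_response(response, min_printable_ratio=0.85):
    """Map a read_file result to 'text', 'binary', or 'error' (heuristic)."""
    if response is None:
        return "error"
    if isinstance(response, bytes):
        # PNG and raw PDF signatures mean the tool returned the file itself
        if response.startswith(b"\x89PNG") or response.startswith(b"%PDF-"):
            return "binary"
        response = response.decode("utf-8", errors="replace")
    stripped = response.strip()
    if not stripped:
        return "error"  # assumption: an empty result counts as a tool failure
    printable = sum(ch.isprintable() or ch in "\n\r\t" for ch in stripped)
    # U+FFFD marks bytes that failed to decode, so count it as non-printable
    printable -= stripped.count("\ufffd")
    ratio = printable / len(stripped)
    return "text" if ratio >= min_printable_ratio else "binary"
```

Only a `"text"` result counts as success. Both `"binary"` and `"error"` lead straight to Step 2 without retrying read_file.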
When read_file fails or returns binary data, use run_shell with pdftotext:
run_shell(command="pdftotext target.pdf output.txt")
Then read the extracted text:
read_file(filetype="txt", file_path="output.txt")
If pdftotext is not found, install it first:
run_shell(command="apt-get update && apt-get install -y poppler-utils")
# Or for macOS:
run_shell(command="brew install poppler")
Then retry:
run_shell(command="pdftotext target.pdf output.txt")
Verify extraction quality:
- output.txt exists and has content

If pdftotext is unavailable or produces poor results, use Python's PyMuPDF via execute_code_sandbox:
import fitz  # PyMuPDF

doc = fitz.open("target.pdf")
page_count = len(doc)  # read before close(); a closed document cannot be queried
text = ""
for page in doc:
    text += page.get_text()
doc.close()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
print(f"Extracted {len(text)} characters from {page_count} pages")
Execute via:
execute_code_sandbox(code="<python code above>")
Then read the result:
read_file(filetype="txt", file_path="output.txt")
Note: execute_code_sandbox may fail with "unknown error" in some environments. If this occurs, document the failure and proceed to Step 4.
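Because this step can fail in several ways (PyMuPDF not installed, unreadable file, a PDF with no text layer), it can help to wrap it so that every failure looks the same to the caller. The following is a minimal sketch under that assumption. `extract_with_pymupdf` is a hypothetical helper name, and returning `None` for any failure is a design choice made here, not behavior of PyMuPDF.

```python
def extract_with_pymupdf(pdf_path):
    """Return extracted text, or None if PyMuPDF is missing or extraction fails."""
    try:
        import fitz  # PyMuPDF; may not be installed in every sandbox
    except ImportError:
        return None
    try:
        doc = fitz.open(pdf_path)
        try:
            pages = [page.get_text() for page in doc]
        finally:
            doc.close()
    except Exception:
        # Any open or parse failure is treated as "this tool failed"
        return None
    text = "".join(pages)
    # A scanned PDF without a text layer yields only whitespace
    return text if text.strip() else None
```

A `None` result is the signal to record the failure and move on to Step 4.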
If all extraction methods fail:
Example degradation documentation:
EXTRACTION FAILURE REPORT:
- Source: [URL or file path]
- read_file: Returned binary/image data (no text extraction)
- run_shell/pdftotext: [Tool not available / produced garbled output / succeeded]
- execute_code_sandbox/PyMuPDF: [Failed with unknown error / succeeded]
NOTE: Content below combines partial extraction with established domain
knowledge for [topic]. Verify against official sources when available.
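A report in this shape can be assembled from the outcomes recorded during the cascade. The sketch below assumes the outcomes were collected in a dictionary keyed by tool name. The function `build_failure_report` and its parameters are illustrative and do not refer to an existing tool.

```python
def build_failure_report(source, outcomes, topic):
    """Format per-tool outcomes into the failure report layout shown above."""
    lines = ["EXTRACTION FAILURE REPORT:", f"- Source: {source}"]
    for tool, outcome in outcomes.items():
        lines.append(f"- {tool}: {outcome}")
    lines.append(
        "NOTE: Content below combines partial extraction with established domain"
    )
    lines.append(
        f"knowledge for {topic}. Verify against official sources when available."
    )
    return "\n".join(lines)
```

Dictionaries preserve insertion order in Python 3.7+, so the tools appear in the order they were tried.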
                PDF to Extract
                      │
                      ▼
              ┌───────────────┐
              │   read_file   │
              │   (primary)   │
              └───────┬───────┘
                      │
        ┌─────────────┼─────────────┐
        │             │             │
  Returns text  Returns binary  Error/timeout
      (✓)       / image data        │
        │             │             │
        ▼             ▼             ▼
     SUCCESS    ┌───────────────┐
                │   run_shell   │
                │   pdftotext   │
                └───────┬───────┘
                        │
                ┌───────┼───────┐
                │       │       │
            Succeeds   Not    Garbled
              (✓)     avail.  output
                │       │       │
                ▼       ▼       ▼
            SUCCESS ┌───────────────┐
                    │ execute_code  │
                    │   _sandbox    │
                    │   PyMuPDF     │
                    └───────┬───────┘
                            │
                    ┌───────┼───────┐
                    │       │       │
                Succeeds  Fails   Error
                  (✓)       │       │
                    │       ▼       │
                    ▼    Domain     │
                SUCCESS  Knowledge  │
                            │       │
                            └───────┘
                             FAILURE
                            DOCUMENTED
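The flow above amounts to "try each tool in order, stop at the first usable text, and keep a record of what failed". A minimal sketch of that control flow follows, assuming each tool is wrapped as a callable that takes a path and returns text. The name `run_cascade` and the log wording are hypothetical.

```python
def run_cascade(pdf_path, extractors):
    """Try each (name, extractor) pair in order; return (text, log).

    text is None when every extractor fails. log is a list of
    (name, outcome) tuples suitable for a failure report.
    """
    log = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:
            log.append((name, f"error: {exc}"))
            continue
        if text and text.strip():
            log.append((name, "succeeded"))
            return text, log
        log.append((name, "no usable text"))
    return None, log
```

When `text` comes back as `None`, the log already holds the per-tool outcomes needed to document the failure.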
#!/bin/bash
# pdf-extract-cascade.sh
# Implements the full tool cascade for PDF extraction
INPUT="$1"
OUTPUT_PDF="target.pdf"
OUTPUT_TXT="output.txt"
# Step 0: Handle URL vs local file
if [[ "$INPUT" =~ ^https?:// ]]; then
    echo "Downloading PDF from URL..."
    # -f makes curl exit non-zero on HTTP errors instead of saving the error page
    if ! curl -fL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$INPUT"; then
        echo "ERROR: Download failed: $INPUT"
        exit 1
    fi
else
    if [ ! -f "$INPUT" ]; then
        echo "ERROR: Local file not found: $INPUT"
        exit 1
    fi
    OUTPUT_PDF="$INPUT"
fi
# Step 1: Verify file type
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: File is not a valid PDF"
    file "$OUTPUT_PDF"
fi
# Step 2: Try pdftotext (shell-first approach)
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
        # -s is true only when the file exists and is non-empty
        if [ -s "$OUTPUT_TXT" ]; then
            echo "SUCCESS: Extraction completed with pdftotext"
            wc -l "$OUTPUT_TXT"
            exit 0
        fi
    fi
fi
# Step 3: Fallback to PyMuPDF
echo "Falling back to PyMuPDF..."
python3 << 'PYTHON_SCRIPT'
import fitz
import sys