Progressive tool-chain PDF extraction with explicit read_file, run_shell, and execute_code_sandbox sequencing
This skill provides a workflow for extracting text from PDF documents by running agent tools in a fixed order, with explicit fallbacks based on observed tool behavior.
read_file often returns binary/image data for PDFs, not extracted text. When this occurs, immediately escalate to run_shell with pdftotext before attempting Python-based extraction.
Before beginning, identify your scenario:
| Scenario | Start Here | Skip |
|---|---|---|
| PDF already on local disk | Step 1 (read_file attempt) | Download steps |
| PDF at a web URL | Download first, then Step 1 | None |
| PDF content already extracted | Step 4 (Quality verification) | Steps 1-3 |
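For the web URL scenario, the download step could look roughly like the sketch below. It is only an illustration: the helper names looks_like_pdf and download_pdf are hypothetical, and the header check assumes the file begins with the %PDF- marker, which some valid PDFs with leading junk bytes will not satisfy.

```python
import urllib.request
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # marker that normally opens a PDF file


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def download_pdf(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Download url to dest, refusing payloads that are not PDFs (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    if not looks_like_pdf(data):
        # Typical cause: an HTML error or login page served instead of the PDF
        raise ValueError(f"{url} did not return a PDF")
    path = Path(dest)
    path.write_bytes(data)
    return path
```

Checking the header before saving catches the common case where the server returns an error page, which would otherwise surface later as a confusing extraction failure.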
PDF extraction failures cascade when tool sequencing is unclear. This workflow reduces that risk by fixing the order of tools: read_file first, then run_shell with pdftotext, then execute_code_sandbox with PyMuPDF.
Always try the simplest approach first:
Tool: read_file
Path: document.pdf
Expected outcome: Extracted text content
Critical check: Examine the returned content:
Binary data indicators:
%PDF- header without text extraction
When read_file returns binary data, do NOT attempt execute_code_sandbox yet. Use run_shell immediately:
Tool: run_shell
Command: pdftotext document.pdf document.txt
If pdftotext is not available:
Tool: run_shell
Command: apt-get update && apt-get install -y poppler-utils && pdftotext document.pdf document.txt
Then read the extracted text:
Tool: read_file
Path: document.txt
Expected outcome: Clean text extraction
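Where the shell step is driven from a script rather than typed by hand, it might be sketched as below. This is a minimal sketch, assuming direct access to subprocess in place of the run_shell tool; build_pdftotext_cmd and run_pdftotext are hypothetical names, and -layout is an optional pdftotext flag that preserves the physical page layout.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str, txt_path: str, layout: bool = False) -> List[str]:
    """Build the argument list; passing a list avoids shell quoting problems."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def run_pdftotext(pdf_path: str, txt_path: str) -> Optional[str]:
    """Return the extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # caller should install poppler-utils or fall back
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, txt_path),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Returning None for both "tool missing" and "tool failed" keeps the caller's fallback logic simple; a fuller version would probably distinguish the two for the failure log.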
If this fails, first confirm that the file is really a PDF (run_shell: file document.pdf).
Only attempt this if Steps 1-2 fail:
Tool: execute_code_sandbox
Language: python
Code: |
  import fitz  # PyMuPDF

  try:
      doc = fitz.open("document.pdf")
      text = ""
      # Concatenate the plain text of every page
      for page in doc:
          text += page.get_text()
      doc.close()
      with open("document_pymupdf.txt", "w", encoding="utf-8") as f:
          f.write(text)
      print(f"SUCCESS: Extracted {len(text)} characters")
  except Exception as e:
      print(f"FAILED: {e}")
Then read the result:
Tool: read_file
Path: document_pymupdf.txt
Regardless of which method succeeded, verify extraction quality:
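One possible quality check is sketched below. The function name assess_extraction and both thresholds are assumptions chosen for illustration (the 100-character minimum mirrors the length check used in the orchestrator script); they are not fixed parts of this workflow and should be tuned per document type.

```python
def assess_extraction(text: str, min_chars: int = 100,
                      min_printable_ratio: float = 0.9) -> dict:
    """Compute simple heuristics for whether extracted text looks usable."""
    stripped = text.strip()
    total = len(stripped)
    # Count printable characters plus ordinary whitespace
    printable = sum(1 for c in stripped if c.isprintable() or c in "\n\r\t")
    ratio = printable / total if total else 0.0
    return {
        "chars": total,
        "words": len(stripped.split()),
        "printable_ratio": ratio,
        "ok": total >= min_chars and ratio >= min_printable_ratio,
    }
```

A result with "ok" set to False suggests moving to the next extraction method rather than working with the current output.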
If quality is poor:
If all extraction methods fail:
Example degradation note:
NOTE: Source document [path/URL] was inaccessible due to [specific tool failures].
Content below combines partial extraction with established domain knowledge
for [topic]. All claims verified against [alternative sources] where possible.
Tool Failure Log:
- read_file: Returned binary data (no text extraction)
- run_shell/pdftotext: Command not available in environment
- execute_code_sandbox/PyMuPDF: Sandbox execution failed with [error]
# pdf-extract-orchestrator.py
# Implements the progressive tool fallback pattern.
# read_file, run_shell, and execute_code_sandbox stand for the agent tools;
# they are not Python built-ins and must be provided by the host environment.
import shlex


def extract_pdf_text(pdf_path):
    """
    Progressive PDF extraction following tool precedence:
    1. read_file (quick check)
    2. run_shell + pdftotext (primary extraction)
    3. execute_code_sandbox + PyMuPDF (final fallback)
    """
    extraction_log = []

    # Step 1: Try read_file
    print("Step 1: Attempting read_file...")
    try:
        content = read_file(pdf_path)
        if is_binary_or_image_data(content):
            extraction_log.append("read_file: Returned binary data")
            # Proceed to Step 2
        else:
            extraction_log.append("read_file: Success")
            return content, extraction_log
    except Exception as e:
        extraction_log.append(f"read_file: Failed - {e}")

    # Step 2: Try run_shell with pdftotext
    print("Step 2: Attempting run_shell + pdftotext...")
    try:
        # Quote the path so spaces and shell metacharacters are handled safely
        run_shell(f"pdftotext {shlex.quote(pdf_path)} output.txt")
        content = read_file("output.txt")
        if content and len(content) > 100:
            extraction_log.append("run_shell/pdftotext: Success")
            return content, extraction_log
        else:
            extraction_log.append("run_shell/pdftotext: Empty extraction")
    except Exception as e:
        extraction_log.append(f"run_shell/pdftotext: Failed - {e}")

    # Step 3: Try execute_code_sandbox with PyMuPDF
    print("Step 3: Attempting execute_code_sandbox + PyMuPDF...")
    try:
        # {pdf_path!r} embeds the path as a valid Python string literal
        code = f"""
import fitz
doc = fitz.open({pdf_path!r})
text = ""
for page in doc:
    text += page.get_text()
doc.close()
print(text[:1000])  # Preview
"""
        result = execute_code_sandbox(language="python", code=code)
        extraction_log.append("execute_code_sandbox/PyMuPDF: Success")
        return result, extraction_log
    except Exception as e:
        extraction_log.append(f"execute_code_sandbox/PyMuPDF: Failed - {e}")

    # Step 4: All methods failed
    extraction_log.append("ALL METHODS FAILED - Escalate to domain knowledge")
    return None, extraction_log


def is_binary_or_image_data(content):
    """Detect if content is binary/image data rather than extracted text"""
    if not content:
        return True
    # Check for PDF header without text extraction
    if content.startswith("%PDF-"):
        return True
    # Check for high ratio of non-printable characters
    non_printable = sum(1 for c in content if ord(c) < 32 and c not in '\n\r\t')
    if len(content) > 0 and non_printable / len(content) > 0.1:
        return True
    return False
            ┌─────────────────┐
            │   Start: PDF    │
            │   Available?    │
            └────────┬────────┘
                     │
            ┌────────▼────────┐
            │     Step 1:     │
            │    read_file    │
            └────────┬────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
│ Text      │  │ Binary    │  │ Error/    │
│ Returned  │  │ Data      │  │ Not Found │
└─────┬─────┘  └─────┬─────┘  └─────┬─────┘
      │              │              │
      │         ┌────▼─────┐   ┌────▼─────┐
      │         │ Step 2:  │   │ Download │
      │         │ run_shell│   │ or Fix   │
      │         │ pdftotext│   │ Path     │
      │         └────┬─────┘   └──────────┘
      │              │
      │         ┌────▼─────┐
      │         │ Success? │
      │         └────┬─────┘
      │              │
┌─────▼─────┐  ┌─────▼─────┐
│ Yes       │  │ No        │
└─────┬─────┘  └─────┬─────┘
      │              │
      │         ┌────▼─────────┐
      │         │ Step 3:      │
      │         │ execute_     │
      │         │ code_sandbox │
      │         │ PyMuPDF      │
      │         └──────────────┘
      │
┌─────▼──────────────────┐
│ Step 4: Quality Check  │
│ Step 5: Document       │
│ Limitations            │
└────────────────────────┘
| Tool | Symptom | Cause | Solution |
|---|---|---|---|
| read_file | Binary PDF data | Tool doesn't extract PDF text | Escalate to run_shell immediately |
| read_file | PNG/JPEG data | PDF contains embedded images | Use OCR tools or request text version |
| run_shell | pdftotext not found | Tool not installed | Install poppler-utils first |
| run_shell | Empty output | Password-protected or image-only (scanned) PDF | Request an accessible version, or use OCR for scanned pages |
| execute_code_sandbox | Unknown error | Sandbox execution issue | Try run_shell alternative or document limitation |
| execute_code_sandbox | Import error | PyMuPDF not installed | Include pip install in script |
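The "Include pip install in script" remedy for import errors could be sketched as follows. This is a hedged illustration: ensure_module is a hypothetical helper, and it assumes the sandbox allows pip to reach a package index, which is not guaranteed in every environment.

```python
import importlib
import subprocess
import sys


def ensure_module(module_name: str, pip_name: str):
    """Import module_name, installing pip_name first if the import fails."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install with the same interpreter that is running this script
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

For the PyMuPDF fallback this would be called as fitz = ensure_module("fitz", "PyMuPDF") at the top of the sandbox script, before opening the document.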
This skill enhances pdf-download-extract-fallback by: