Fallback workflow for extracting text from PDFs when read_file returns binary data
Use this skill when read_file returns binary data or garbled content for PDF files instead of readable text. This workflow provides a reliable fallback using command-line PDF tools.
read_file with filetype: pdf returns binary data, unreadable characters, or errors
After attempting to read a PDF with read_file, check whether the output is binary or unreadable:
# Example of problematic output from read_file
%PDF-1.4
1 0 obj
<< /Type /Catalog ...
If the output looks like raw PDF structure or binary, proceed to Step 2.
Invoke shell_agent to extract text using pdftotext (preferred) or pdfplumber (Python fallback):
Task: Extract all text content from <filename.pdf> using pdftotext or pdfplumber.
Output the extracted text in readable format. If pdftotext is not available, use Python with pdfplumber library.
Example shell_agent invocation:
shell_agent task="Extract text from Move_Out_Inspection_Tracker.pdf using pdftotext. Save output to a .txt file and return the content."
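For reference, the command that shell_agent is being asked to run can be sketched in Python as below. This is a minimal sketch, not the skill's own implementation: the helper names build_pdftotext_command and run_pdftotext are hypothetical, and it assumes the pdftotext binary (from Poppler or Xpdf) is on PATH. Passing "-" as the output file makes pdftotext write to stdout.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # keep the physical layout (useful for tables)
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def run_pdftotext(pdf_path, keep_layout=True):
    """Return the extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # caller should fall back to pdfplumber
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # never raise on stray bytes in the output
    )
    if result.returncode != 0:
        return None
    return result.stdout
```

Returning None rather than raising keeps the "try the next tool" decision with the caller, which matches the fallback order described above.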
After extraction, validate that the content contains expected text patterns:
# Validation checklist
def validate_pdf_extraction(text, expected_patterns=None):
    checks = [
        bool(text.strip()),  # Not empty
        len(text) > 50,  # Has substantial content
        not text.startswith('%PDF'),  # Not raw PDF structure
    ]
    if expected_patterns:
        for pattern in expected_patterns:
            checks.append(pattern.lower() in text.lower())
    return all(checks)
Common expected patterns to check:
If validation fails:
Try alternative tool: If pdftotext failed, try pdfplumber:
shell_agent task="Extract text from <file.pdf> using Python pdfplumber library. Handle any encoding issues."
Try OCR fallback: For scanned PDFs:
shell_agent task="This PDF may be scanned. Use pytesseract or similar OCR tool to extract text from <file.pdf>."
Report specific error: Document what patterns were expected but not found.
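The three retry steps above could look roughly like this when written directly in Python. It is a sketch under stated assumptions: the function names are hypothetical, pdfplumber must be installed for the text-layer path, and the OCR path assumes pdf2image (which needs Poppler) and pytesseract (which needs the Tesseract binary). Third-party imports are deferred so the helpers load even when those packages are absent.

```python
def join_page_texts(page_texts):
    """Join per-page text, skipping pages that produced nothing."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_with_pdfplumber(pdf_path):
    """Extract the text layer page by page (assumes pdfplumber is installed)."""
    import pdfplumber  # deferred import: optional dependency

    with pdfplumber.open(pdf_path) as pdf:
        return join_page_texts(page.extract_text() for page in pdf.pages)


def extract_with_ocr(pdf_path, dpi=300):
    """OCR each rendered page (assumes pdf2image and pytesseract are installed)."""
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_page_texts(pytesseract.image_to_string(img) for img in images)


def missing_patterns(text, expected_patterns):
    """Return the expected patterns not found in text (case-insensitive)."""
    lowered = text.lower()
    return [p for p in expected_patterns if p.lower() not in lowered]
```

When every tool has been tried, the list returned by missing_patterns gives the specific detail to include in the error report.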
Once validated text is obtained:
# Complete extraction workflow
def extract_pdf_text_fallback(pdf_path, expected_patterns=None):
    """Extract text from PDF with fallback handling."""
    # Step 1: Try read_file first
    content = read_file(filetype="pdf", file_path=pdf_path)
    # Step 2: Check if binary/unreadable
    if is_binary_or_garbled(content):
        # Step 3: Use shell_agent fallback
        result = shell_agent(
            task=f"Extract all text from {pdf_path} using pdftotext. Return the text content."
        )
        content = result.stdout
    # Step 4: Validate
    if not validate_pdf_extraction(content, expected_patterns):
        # Try pdfplumber as secondary fallback
        result = shell_agent(
            task=f"Extract text from {pdf_path} using Python pdfplumber library."
        )
        content = result.stdout
    return content
def is_binary_or_garbled(text):
    """Check if text appears to be binary or unreadable."""
    if not text:
        return True
    if text.startswith('%PDF'):
        return True
    # Check for high ratio of non-printable characters.
    # Newlines, carriage returns, and tabs are normal in extracted text,
    # so they are not counted as non-printable.
    non_printable = sum(
        1 for c in text
        if ord(c) > 127 or (ord(c) < 32 and c not in '\n\r\t')
    )
    return non_printable / len(text) > 0.3