Name: Pdf Text Extraction Fallback 85d5ca
Author: HKUDS


# Example of problematic output from read_file
%PDF-1.4
1 0 obj
<< /Type /Catalog ...

Task: Extract all text content from <filename.pdf> using pdftotext or pdfplumber.
Output the extracted text in a readable format. If pdftotext is not available, use Python with the pdfplumber library.

shell_agent task="Extract text from Move_Out_Inspection_Tracker.pdf using pdftotext. Save output to a .txt file and return the content."
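
For reference, the command such a task typically boils down to can be sketched in Python. This is a minimal sketch, not the skill's actual implementation: the helper name `run_pdftotext` and its `tool` and `timeout` parameters are hypothetical, and it assumes the Poppler `pdftotext` binary is on the PATH. It returns `None` whenever the tool is missing or fails, so the caller can move on to the next fallback.

```python
import shutil
import subprocess
from typing import Optional


def run_pdftotext(pdf_path: str, tool: str = "pdftotext", timeout: int = 60) -> Optional[str]:
    """Return the text extracted by pdftotext, or None if the tool is missing or fails.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if shutil.which(tool) is None:
        return None  # tool not installed; caller should try another extractor
    try:
        proc = subprocess.run(
            # "-layout" keeps the physical layout; "-" sends the text to stdout
            [tool, "-layout", pdf_path, "-"],
            capture_output=True,
            encoding="utf-8",
            errors="replace",  # do not crash on unexpected bytes in the output
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None  # e.g. unreadable or missing file
    return proc.stdout
```

Returning `None` instead of raising keeps the fallback chain simple: any falsy result means "try the next tool".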

# Validation checklist
def validate_pdf_extraction(text, expected_patterns=None):
    checks = [
        bool(text.strip()),  # Not empty
        len(text) > 50,  # Has substantial content
        not text.startswith('%PDF'),  # Not raw PDF structure
    ]
    
    if expected_patterns:
        for pattern in expected_patterns:
            checks.append(pattern.lower() in text.lower())
    
    return all(checks)

Try alternative tool: If pdftotext failed, try pdfplumber:

shell_agent task="Extract text from <file.pdf> using Python pdfplumber library. Handle any encoding issues."

Try OCR fallback: For scanned PDFs:

shell_agent task="This PDF may be scanned. Use pytesseract or similar OCR tool to extract text from <file.pdf>."

Report specific error: Document what patterns were expected but not found.
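
One way to produce that report is sketched below. The helper names `missing_patterns` and `describe_failure` are hypothetical; the matching is the same case-insensitive substring check used in the validation checklist.

```python
from typing import Iterable, List


def missing_patterns(text: str, expected_patterns: Iterable[str]) -> List[str]:
    """Return the expected patterns that do not occur in text (case-insensitive)."""
    lowered = text.lower()
    return [p for p in expected_patterns if p.lower() not in lowered]


def describe_failure(tool: str, text: str, expected_patterns: Iterable[str]) -> str:
    """Build a one-line report naming the tool and the patterns it failed to produce."""
    missing = missing_patterns(text, expected_patterns)
    if not missing:
        return f"{tool}: all expected patterns found"
    return f"{tool}: {len(text)} chars extracted, missing patterns: {', '.join(missing)}"
```

Including the tool name and the extracted length in the message makes it easier to tell an empty extraction apart from one that returned text from the wrong document.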

# Complete extraction workflow
def extract_pdf_text_fallback(pdf_path, expected_patterns=None):
    """Extract text from PDF with fallback handling."""
    
    # Step 1: Try read_file first
    content = read_file(filetype="pdf", file_path=pdf_path)
    
    # Step 2: Check if binary/unreadable
    if is_binary_or_garbled(content):
        # Step 3: Use shell_agent fallback
        result = shell_agent(
            task=f"Extract all text from {pdf_path} using pdftotext. Return the text content."
        )
        content = result.stdout
        
        # Step 4: Validate
        if not validate_pdf_extraction(content, expected_patterns):
            # Try pdfplumber as secondary fallback
            result = shell_agent(
                task=f"Extract text from {pdf_path} using Python pdfplumber library."
            )
            content = result.stdout
    
    return content

def is_binary_or_garbled(text):
    """Check if text appears to be binary or unreadable."""
    if not text:
        return True
    if text.startswith('%PDF'):
        return True
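    # Check for a high ratio of control or replacement characters.
    # Ordinary whitespace (newlines, tabs) and legitimate non-ASCII text are not counted.
    non_printable = sum(
        1 for c in text
        if c == '\ufffd' or not (c.isprintable() or c.isspace())
    )
    return non_printable / len(text) > 0.3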
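
The workflow above delegates extraction to shell_agent and stops at pdfplumber. For readers who want to see what the pdfplumber and OCR steps might look like when run locally, here is a minimal sketch. It assumes the third-party packages `pdfplumber`, `pdf2image`, and `pytesseract` are installed, along with the Poppler and Tesseract binaries they rely on. `pdf2image` is an addition not named in the source, and the function names and the `min_chars_per_page` threshold are hypothetical.

```python
from typing import List


def extract_with_pdfplumber(pdf_path: str) -> List[str]:
    """Return one string per page; pages without a text layer yield ''."""
    import pdfplumber  # third-party, imported lazily so the helpers below work without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic with an assumed threshold: more than half the pages have almost no text."""
    if not page_texts:
        return True
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    return sparse > len(page_texts) / 2


def ocr_pdf(pdf_path: str, dpi: int = 300) -> List[str]:
    """Render each page to an image and run Tesseract on it."""
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract  # needs the Tesseract binary installed

    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def extract_pages(pdf_path: str) -> List[str]:
    """Try the text layer first and fall back to OCR for scanned documents."""
    pages = extract_with_pdfplumber(pdf_path)
    if needs_ocr(pages):
        pages = ocr_pdf(pdf_path)
    return pages
```

OCR is kept as the last step because it is slower and less accurate than reading an existing text layer.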

Pdf Text Extraction Fallback 85d5ca

PDF Text Extraction Fallback

When to Use

Step-by-Step Instructions

Step 1: Detect Binary/Unreadable PDF Output

Step 2: Use shell_agent with PDF Tools

Step 3: Validate Extracted Content

Step 4: Handle Extraction Failures

Step 5: Proceed with Data Processing

Code Example

Tips
