Skill Profile

Pdf Extract Progressive Tools

Name: Pdf Extract Progressive Tools
Author: HKUDS

Progressive tool-chain PDF extraction with explicit read_file, run_shell, and execute_code_sandbox sequencing

HKUDS · 5,421 stars · March 24, 2026

Profession
Category: Documents

Skill Content

PDF Text Extraction with Progressive Tool Fallback

This skill extracts text from PDF documents by trying agent tools in a fixed order: read_file first, then run_shell with pdftotext, then execute_code_sandbox with PyMuPDF. Each step falls back to the next when the observed tool result is unusable.

Critical Insight from Execution Data

read_file often returns binary/image data for PDFs, not extracted text. When this occurs, immediately escalate to run_shell with pdftotext before attempting Python-based extraction.
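One way to recognize this case early is to inspect the leading bytes of whatever read_file returned. The sketch below assumes the tool result is available as bytes or str; classify_payload is a hypothetical helper, not part of any tool API. The signatures it checks are the standard file headers for PDF, PNG, and JPEG.

```python
def classify_payload(data):
    """Return 'pdf', 'png', 'jpeg', 'empty', or 'text' for a tool result."""
    if not data:
        return "empty"
    # Normalise str to bytes so the signature checks work on either type.
    if isinstance(data, str):
        raw = data.encode("latin-1", errors="replace")
    else:
        raw = bytes(data)
    if raw.startswith(b"%PDF-"):
        return "pdf"    # raw PDF container, no text was extracted
    if raw.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"    # rendered page image
    if raw.startswith(b"\xff\xd8\xff"):
        return "jpeg"   # rendered page image
    return "text"
```

Any result other than "text" would be a signal to move on to run_shell with pdftotext.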

Entry Point: Determine Your Starting Point

Before beginning, identify your scenario:

Scenario	Start Here	Skip
PDF already on local disk	Step 1 (read_file attempt)	Download steps
PDF at a web URL	Download first, then Step 1	None
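For the web URL scenario, a minimal download sketch might look like the following. It uses only the Python standard library; download_pdf and its parameters are illustrative names, and the header check is an added safeguard rather than something this skill prescribes.

```python
import urllib.request

def download_pdf(url, dest_path, timeout=30):
    """Fetch url to dest_path and confirm the payload starts with the PDF header."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    if not data.startswith(b"%PDF-"):
        # An HTML error page or login wall can arrive instead of the document.
        raise ValueError(f"{url} did not return a PDF (first bytes: {data[:8]!r})")
    with open(dest_path, "wb") as f:
        f.write(data)
    return dest_path
```

Once the file is on disk, the workflow continues from Step 1 exactly as for a local PDF.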

Step 1: Attempt read_file First

Tool: read_file
Path: document.pdf

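Step 2: Escalate to run_shell with pdftotext

Tool: run_shell
Command: pdftotext document.pdf document.txt

If pdftotext is not found, install poppler-utils and retry:

Tool: run_shell
Command: apt-get update && apt-get install -y poppler-utils && pdftotext document.pdf document.txt

Then read the extracted text:

Tool: read_file
Path: document.txt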
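If the pdftotext step is driven from Python rather than through run_shell, the same check-then-run logic could be sketched as below. This is an assumption-laden illustration: pdftotext_extract and its binary parameter are hypothetical names, and the function simply returns None so the caller can decide whether to install poppler-utils or move on.

```python
import shutil
import subprocess

def pdftotext_extract(pdf_path, txt_path, binary="pdftotext"):
    """Run pdftotext if it is on PATH; return the extracted text or None."""
    if shutil.which(binary) is None:
        return None  # tool missing: install poppler-utils or fall back
    # Passing an argument list avoids shell quoting problems with odd paths.
    result = subprocess.run(
        [binary, pdf_path, txt_path], capture_output=True, text=True
    )
    if result.returncode != 0:
        return None  # for example a protected or damaged PDF
    with open(txt_path, encoding="utf-8", errors="replace") as f:
        return f.read()
```

A None result corresponds to the "pdftotext not found" or "Empty output" rows in the failure table.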

Step 3: Final Fallback to execute_code_sandbox with PyMuPDF

Tool: execute_code_sandbox
Language: python
Code: |
  import fitz  # PyMuPDF

  try:
      # The context manager closes the document even if extraction fails
      with fitz.open("document.pdf") as doc:
          text = "".join(page.get_text() for page in doc)

      with open("document_pymupdf.txt", "w", encoding="utf-8") as f:
          f.write(text)

      print(f"SUCCESS: Extracted {len(text)} characters")
  except Exception as e:
      print(f"FAILED: {e}")

Tool: read_file
Path: document_pymupdf.txt
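The sandbox step fails with an import error when PyMuPDF is absent. One possible guard, sketched here under the assumption that the sandbox allows pip installs, is to check for the module before importing it. ensure_module is a hypothetical helper name.

```python
import importlib
import importlib.util
import subprocess
import sys

def ensure_module(module_name, pip_name=None, install=True):
    """Return True if module_name is importable, installing it with pip if allowed."""
    if importlib.util.find_spec(module_name) is not None:
        return True
    if not install:
        return False
    completed = subprocess.run(
        [sys.executable, "-m", "pip", "install", pip_name or module_name],
        capture_output=True, text=True,
    )
    importlib.invalidate_caches()  # let the import system see the new package
    return (completed.returncode == 0
            and importlib.util.find_spec(module_name) is not None)
```

PyMuPDF is imported as fitz but published on PyPI as PyMuPDF, so the call in this workflow would be ensure_module("fitz", pip_name="PyMuPDF").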

Step 5: Graceful Degradation to Domain Knowledge

NOTE: Source document [path/URL] was inaccessible due to [specific tool failures].
Content below combines partial extraction with established domain knowledge 
for [topic]. All claims verified against [alternative sources] where possible.

Tool Failure Log:
- read_file: Returned binary data (no text extraction)
- run_shell/pdftotext: Command not available in environment
- execute_code_sandbox/PyMuPDF: Sandbox execution failed with [error]
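A note in this shape can be generated from the log entries collected during extraction. The sketch below is illustrative only; render_failure_note is a hypothetical helper and the wording of the first line is a shortened form of the template.

```python
def render_failure_note(source, extraction_log):
    """Format a list of log entries as a limitation note with a failure log."""
    lines = [
        f"NOTE: Source document {source} was inaccessible due to tool failures.",
        "",
        "Tool Failure Log:",
    ]
    # One bullet per recorded tool outcome, in the order the tools were tried.
    lines.extend(f"- {entry}" for entry in extraction_log)
    return "\n".join(lines)

note = render_failure_note(
    "report.pdf",
    ["read_file: Returned binary data",
     "run_shell/pdftotext: Command not available in environment"],
)
```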

Complete Tool Orchestration Script

# pdf-extract-orchestrator.py
# Implements the progressive tool fallback pattern.
# read_file, run_shell, and execute_code_sandbox are the agent tools named in
# this skill; they are assumed to be exposed as Python callables by the runtime.

import shlex

def extract_pdf_text(pdf_path):
    """
    Progressive PDF extraction following tool precedence:
    1. read_file (quick check)
    2. run_shell + pdftotext (primary extraction)
    3. execute_code_sandbox + PyMuPDF (final fallback)

    Returns (text or None, extraction_log).
    """
    extraction_log = []

    # Step 1: Try read_file
    print("Step 1: Attempting read_file...")
    try:
        content = read_file(pdf_path)
        if is_binary_or_image_data(content):
            extraction_log.append("read_file: Returned binary data")
            # Fall through to Step 2
        else:
            extraction_log.append("read_file: Success")
            return content, extraction_log
    except Exception as e:
        extraction_log.append(f"read_file: Failed - {e}")

    # Step 2: Try run_shell with pdftotext
    print("Step 2: Attempting run_shell + pdftotext...")
    try:
        # Quote the path so spaces or shell metacharacters cannot break the command
        run_shell(f"pdftotext {shlex.quote(pdf_path)} output.txt")
        content = read_file("output.txt")
        if content and len(content) > 100:
            extraction_log.append("run_shell/pdftotext: Success")
            return content, extraction_log
        else:
            extraction_log.append("run_shell/pdftotext: Empty extraction")
    except Exception as e:
        extraction_log.append(f"run_shell/pdftotext: Failed - {e}")

    # Step 3: Try execute_code_sandbox with PyMuPDF
    print("Step 3: Attempting execute_code_sandbox + PyMuPDF...")
    try:
        # {pdf_path!r} embeds the path as a valid Python string literal
        code = f"""
import fitz  # PyMuPDF
doc = fitz.open({pdf_path!r})
text = ""
for page in doc:
    text += page.get_text()
doc.close()
print(text)  # Full text, since the sandbox output is returned as the result
"""
        result = execute_code_sandbox(language="python", code=code)
        extraction_log.append("execute_code_sandbox/PyMuPDF: Success")
        return result, extraction_log
    except Exception as e:
        extraction_log.append(f"execute_code_sandbox/PyMuPDF: Failed - {e}")

    # Step 4: All methods failed
    extraction_log.append("ALL METHODS FAILED - Escalate to domain knowledge")
    return None, extraction_log

def is_binary_or_image_data(content):
    """Detect if content is binary/image data rather than extracted text"""
    if not content:
        return True
    # Tools may return bytes; decode leniently so the checks below work on str
    if isinstance(content, bytes):
        content = content.decode("latin-1")
    # A raw PDF header means the file was read but no text was extracted
    if content.startswith("%PDF-"):
        return True
    # Check for high ratio of non-printable characters
    non_printable = sum(1 for c in content if ord(c) < 32 and c not in '\n\r\t')
    return non_printable / len(content) > 0.1

Tool Precedence Decision Tree

                    ┌─────────────────┐
                    │  Start: PDF     │
                    │  Available?     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │   Step 1:       │
                    │   read_file     │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
        ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
        │  Text     │  │  Binary   │  │  Error/   │
        │  Returned │  │  Data     │  │  Not Found│
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              │         ┌────▼─────┐  ┌────▼─────┐
              │         │ Step 2:  │  │ Download │
              │         │ run_shell│  │ or Fix   │
              │         │ pdftotext│  │ Path     │
              │         └────┬─────┘  └──────────┘
              │              │
              │         ┌────▼─────┐
              │         │ Success? │
              │         └────┬─────┘
              │              │
        ┌─────▼─────┐  ┌─────▼─────┐
        │  Yes      │  │  No       │
        └─────┬─────┘  └─────┬─────┘
              │              │
              │         ┌────▼─────────┐
              │         │ Step 3:      │
              │         │ execute_     │
              │         │ code_sandbox │
              │         │ PyMuPDF      │
              │         └──────────────┘
              │
        ┌─────▼──────────────────┐
        │  Step 4: Quality Check │
        │  Step 5: Document      │
        │  Limitations           │
        └────────────────────────┘
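The quality check in the final box is not spelled out in this listing. A minimal sketch, reusing the 100-character and 10% thresholds that appear in the orchestration script, could look like this. check_extraction_quality and its parameter names are hypothetical, and the thresholds are assumptions to tune per document type.

```python
def check_extraction_quality(text, min_chars=100, max_control_ratio=0.1):
    """Return (ok, reasons) for a block of extracted text."""
    if not text or not text.strip():
        return False, ["empty extraction"]
    reasons = []
    if len(text) <= min_chars:
        reasons.append(f"only {len(text)} characters")
    # Form feeds are allowed because pdftotext uses them as page separators.
    control = sum(1 for c in text if ord(c) < 32 and c not in "\n\r\t\f")
    if control / len(text) > max_control_ratio:
        reasons.append("high ratio of control characters")
    return (not reasons), reasons
```

A failed check would route the result to the limitation note rather than presenting it as a clean extraction.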

Common Failure Modes by Tool

Tool	Symptom	Cause	Solution
read_file	Binary PDF data	Tool doesn't extract PDF text	Escalate to run_shell immediately
read_file	PNG/JPEG data	PDF contains embedded images	Use OCR tools or request text version
run_shell	pdftotext not found	Tool not installed	Install poppler-utils first
run_shell	Empty output	Password-protected PDF	Request accessible version
execute_code_sandbox	Unknown error	Sandbox execution issue	Try run_shell alternative or document limitation
execute_code_sandbox	Import error	PyMuPDF not installed	Include pip install in script
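The table above can also be kept as a lookup so that an agent resolves a symptom to its suggested action in one call. This is only a sketch: FAILURE_MODES and suggest_solution are hypothetical names, and the default text for unknown symptoms is an assumption.

```python
# (tool, symptom in lowercase) -> solution, copied from the table rows.
FAILURE_MODES = {
    ("read_file", "binary pdf data"): "Escalate to run_shell immediately",
    ("read_file", "png/jpeg data"): "Use OCR tools or request text version",
    ("run_shell", "pdftotext not found"): "Install poppler-utils first",
    ("run_shell", "empty output"): "Request accessible version",
    ("execute_code_sandbox", "unknown error"):
        "Try run_shell alternative or document limitation",
    ("execute_code_sandbox", "import error"): "Include pip install in script",
}

def suggest_solution(tool, symptom, default="Document the limitation"):
    """Look up the suggested action for a tool and symptom pair."""
    return FAILURE_MODES.get((tool, symptom.strip().lower()), default)
```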

Overview

Step-by-Step Instructions

Step 1: Attempt read_file First

Step 2: Escalate to run_shell with pdftotext

Step 3: Final Fallback to execute_code_sandbox with PyMuPDF

Step 4: Quality Verification

Step 5: Graceful Degradation to Domain Knowledge

Complete Tool Orchestration Script

Tool Precedence Decision Tree

Best Practices

Common Failure Modes by Tool

When to Use This Skill

Migration from Parent Skill
