Skill Profile

Pdf Extract Shell First

Name: Pdf Extract Shell First
Author: HKUDS

PDF text extraction using a tool cascade that prioritizes shell-based pdftotext before falling back to Python

HKUDS · 5,421 stars · March 24, 2026

Occupation
Category: Documents

Skill Content

PDF Extract with Shell-First Tool Cascade

This skill provides an optimized workflow for extracting text content from PDF documents (local files or downloaded URLs) using a prioritized tool cascade that favors shell-based extraction before falling back to Python libraries.

Why Shell-First?

Analysis of execution patterns shows:

read_file on PDFs sometimes returns binary or image data instead of text
run_shell with pdftotext has a higher success rate and fewer sandbox errors
execute_code_sandbox can fail with an "unknown error" in constrained environments
Shell tools, when available, are the more reliable option for PDF text extraction

Entry Point: Determine Your Starting Point

Before beginning, identify your scenario:

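Scenario	Start Here	Skip
PDF already on local disk	Step 1 (Try read_file)	Shell download steps
PDF at a web URL	Shell download, then Step 1	None
Need maximum reliability	Full cascade (all 3 tools)	None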


Complete Workflow

Step 0: Download PDF (URL Only)

curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o target.pdf "URL_HERE"
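A download can succeed at the HTTP level and still deliver an HTML error page instead of a PDF. The sketch below is one way to check for that before starting the cascade: it looks for the `%PDF-` header marker near the start of the file. The function name `looks_like_pdf` and the 1024-byte probe window are illustrative assumptions, not part of the skill.

```python
from pathlib import Path

# PDF files carry a "%PDF-" version marker in their header.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path: str, probe_bytes: int = 1024) -> bool:
    """Return True if the PDF header marker appears in the first bytes of the file."""
    candidate = Path(path)
    if not candidate.is_file():
        return False
    with candidate.open("rb") as handle:
        head = handle.read(probe_bytes)
    return PDF_MAGIC in head
```

If the check fails for `target.pdf`, re-running the download or inspecting the file by hand is likely more productive than moving on to Step 1.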

Step 1: Try read_file (Primary Attempt)

read_file(filetype="pdf", file_path="target.pdf")

Response Type	Interpretation	Next Action
Clean readable text	Success	Proceed to content analysis
Binary data / PNG image / garbled	`read_file` returned raw data	Go to Step 2 immediately
Error / timeout	Tool failure	Go to Step 2 immediately
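The table above can be approximated in code. This is a minimal sketch, assuming the tool result arrives as `str`, `bytes`, `None`, or an exception object; the real return type of `read_file` is not specified here, and the helper name `classify_response` and the 0.9 readability threshold are illustrative choices.

```python
def classify_response(payload) -> str:
    """Map a read_file-style result to 'text', 'binary', or 'error'."""
    if payload is None or isinstance(payload, Exception):
        return "error"
    if isinstance(payload, bytes):
        # Raw PDF or PNG signatures mean no text extraction happened.
        if payload.startswith((b"%PDF-", b"\x89PNG")):
            return "binary"
        try:
            payload = payload.decode("utf-8")
        except UnicodeDecodeError:
            return "binary"
    text = payload.strip()
    if not text:
        # Treat an empty result like a tool failure: nothing usable came back.
        return "error"
    if text.startswith("%PDF-"):
        return "binary"
    # Share of characters that are printable or whitespace (assumed threshold).
    readable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return "text" if readable / len(text) >= 0.9 else "binary"
```

Both `"binary"` and `"error"` lead to the same next action in the table (go to Step 2); only `"text"` proceeds to content analysis.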

Step 2: Use run_shell with pdftotext (Preferred Fallback)

run_shell(command="pdftotext target.pdf output.txt")

read_file(filetype="txt", file_path="output.txt")

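# If pdftotext is not installed, install Poppler first (Debian/Ubuntu):
run_shell(command="apt-get update && apt-get install -y poppler-utils")
# Or for macOS:
run_shell(command="brew install poppler")

# Then retry the extraction:
run_shell(command="pdftotext target.pdf output.txt")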
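Where a Python process drives the shell step directly, the "available / succeeded / failed" distinction can be made explicit. The following sketch assumes `pdftotext` is invoked as `pdftotext <input> <output>`, as in the commands above; the wrapper name `run_pdftotext`, its return strings, and the 60-second timeout are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def run_pdftotext(pdf_path: str, txt_path: str,
                  tool: str = "pdftotext", timeout: int = 60) -> str:
    """Return 'ok', 'unavailable', or 'failed' for one extraction attempt."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(tool) is None:
        return "unavailable"
    try:
        result = subprocess.run(
            [tool, pdf_path, txt_path],
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return "failed"
    output = Path(txt_path)
    # A zero exit code with an empty output file still counts as a failure.
    if result.returncode == 0 and output.is_file() and output.stat().st_size > 0:
        return "ok"
    return "failed"
```

An `"unavailable"` result is the cue to try the install commands above or to move on to Step 3; `"failed"` covers timeouts, non-zero exit codes, and empty output.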

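Step 3: Use execute_code_sandbox with PyMuPDF (Last Resort)

import fitz  # PyMuPDF

doc = fitz.open("target.pdf")
page_count = len(doc)  # read before closing; a closed document cannot be queried
# Collect text page by page and join once at the end
text = "".join(page.get_text() for page in doc)
doc.close()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
print(f"Extracted {len(text)} characters from {page_count} pages")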

execute_code_sandbox(code="<python code above>")

read_file(filetype="txt", file_path="output.txt")

Step 4: Graceful Degradation to Domain Knowledge

EXTRACTION FAILURE REPORT:
- Source: [URL or file path]
- read_file: Returned binary/image data (no text extraction)
- run_shell/pdftotext: [Tool not available / produced garbled output / succeeded]
- execute_code_sandbox/PyMuPDF: [Failed with unknown error / succeeded]

NOTE: Content below combines partial extraction with established domain 
knowledge for [topic]. Verify against official sources when available.
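The report template above can be filled in programmatically once the outcome of each tool is known. This is a rough sketch: the function name `build_failure_report` and the shape of the `outcomes` mapping (tool label to outcome string) are assumptions made for illustration.

```python
from typing import Dict


def build_failure_report(source: str, outcomes: Dict[str, str], topic: str) -> str:
    """Render the failure report template with one line per attempted tool."""
    lines = ["EXTRACTION FAILURE REPORT:", f"- Source: {source}"]
    # Dicts preserve insertion order, so tools appear in the order they were tried.
    lines += [f"- {tool}: {outcome}" for tool, outcome in outcomes.items()]
    lines += [
        "",
        "NOTE: Content below combines partial extraction with established domain",
        f"knowledge for {topic}. Verify against official sources when available.",
    ]
    return "\n".join(lines)
```

For example, passing `{"read_file": "Returned binary/image data (no text extraction)", "run_shell/pdftotext": "Tool not available"}` yields two bullet lines in cascade order, followed by the note.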

Tool Selection Decision Tree

                    PDF to Extract
                          │
                          ▼
                  ┌───────────────┐
                  │  read_file    │
                  │  (primary)    │
                  └───────┬───────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
     Returns text   Returns binary   Error/timeout
        (✓)         / image data         │
            │             │             │
            ▼             ▼             ▼
       SUCCESS    ┌───────────────┐
                  │ run_shell     │
                  │ pdftotext     │
                  └───────┬───────┘
                          │
                  ┌───────┼───────┐
                  │       │       │
             Succeeds  Not      Garbled
                (✓)   avail.    output
                  │       │       │
                  ▼       ▼       ▼
             SUCCESS ┌───────────────┐
                     │ execute_code  │
                     │ _sandbox      │
                     │ PyMuPDF       │
                     └───────┬───────┘
                             │
                     ┌───────┼───────┐
                     │       │       │
                Succeeds   Fails   Error
                   (✓)      │       │
                     │      ▼       │
                     ▼   Domain     │
                 SUCCESS  Knowledge │
                             │      │
                             └──────┘
                              FAILURE
                              DOCUMENTED
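The decision tree reduces to "try each tool in order, stop at the first one that yields usable text, and record what happened along the way." The sketch below expresses that control flow with plain callables; `run_cascade` and the extractor signature (path in, text or `None` out) are hypothetical and stand in for the actual `read_file`, `run_shell`, and `execute_code_sandbox` calls.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An extractor takes a PDF path and returns extracted text, or None/"" on failure.
Extractor = Callable[[str], Optional[str]]


def run_cascade(pdf_path: str,
                extractors: List[Tuple[str, Extractor]]
                ) -> Tuple[Optional[str], Dict[str, str]]:
    """Try extractors in order; return (text, per-tool outcomes)."""
    attempts: Dict[str, str] = {}
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # any tool error moves the cascade forward
            attempts[name] = f"error: {exc}"
            continue
        if text and text.strip():
            attempts[name] = "succeeded"
            return text, attempts
        attempts[name] = "no usable text"
    # Every tool failed: the caller can document the failure from `attempts`.
    return None, attempts
```

When the returned text is `None`, the `attempts` mapping holds the per-tool outcomes needed for the failure report in Step 4.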

Complete Automated Script

#!/bin/bash
# pdf-extract-cascade.sh
# Implements the full tool cascade for PDF extraction

INPUT="$1"
OUTPUT_PDF="target.pdf"
OUTPUT_TXT="output.txt"

# Step 0: Handle URL vs local file
if [[ "$INPUT" =~ ^https?:// ]]; then
    echo "Downloading PDF from URL..."
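    # -f makes curl exit non-zero on HTTP errors instead of saving the error page
    curl -fL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$INPUT" || {
        echo "ERROR: Download failed: $INPUT"
        exit 1
    }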
else
    if [ ! -f "$INPUT" ]; then
        echo "ERROR: Local file not found: $INPUT"
        exit 1
    fi
    OUTPUT_PDF="$INPUT"
fi

# Step 1: Verify file type
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: File is not a valid PDF"
    file "$OUTPUT_PDF"
fi

# Step 2: Try pdftotext (shell-first approach)
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
        if [ -s "$OUTPUT_TXT" ]; then
            echo "SUCCESS: Extraction completed with pdftotext"
            wc -l "$OUTPUT_TXT"
            exit 0
        fi
    fi
fi

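# Step 3: Fallback to PyMuPDF
# Assumption: the listing breaks off after the two imports; the body below is
# a sketch modeled on the Step 3 snippet, not the author's original code.
echo "Falling back to PyMuPDF..."
python3 - "$OUTPUT_PDF" "$OUTPUT_TXT" << 'PYTHON_SCRIPT'
import fitz
import sys

# Paths are passed as arguments because the quoted heredoc does not expand shell variables
pdf_path, txt_path = sys.argv[1], sys.argv[2]
try:
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    doc.close()
except Exception as exc:
    print(f"ERROR: PyMuPDF extraction failed: {exc}")
    sys.exit(1)

if not text.strip():
    print("ERROR: PyMuPDF produced no text")
    sys.exit(1)

with open(txt_path, "w", encoding="utf-8") as f:
    f.write(text)
print("SUCCESS: Extraction completed with PyMuPDF")
PYTHON_SCRIPT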


Pdf Extract Shell First

PDF Extract with Shell-First Tool Cascade

Why Shell-First?

Entry Point: Determine Your Starting Point

Complete Workflow

Step 0: Download PDF (URL Only)

Step 1: Try read_file (Primary Attempt)

Step 2: Use run_shell with pdftotext (Preferred Fallback)

Step 3: Use execute_code_sandbox with PyMuPDF (Last Resort)

Step 4: Graceful Degradation to Domain Knowledge

Tool Selection Decision Tree

Complete Automated Script

Related Skills

Feishu Doc

Summarize

Nano Pdf

Diffs

Customs Trade Compliance

Nutrient Document Processing