Skill Profile

Pdf Extract Ordered Fallback

Name: Pdf Extract Ordered Fallback
Author: HKUDS

PDF extraction with ordered tool chain: read_file, then run_shell/pdftotext, then execute_code_sandbox/PyMuPDF

HKUDS | 5,421 stars | March 24, 2026

Occupation
Category: Documents

Skill Content

PDF Download and Extract with Ordered Fallback

This skill provides a robust workflow for acquiring PDF documents from web sources and extracting their text content, with a clearly ordered sequence of tool invocations to maximize success rate.

Overview

PDFs fetched from the web often run into JavaScript redirects, corrupted files, missing tools, or inaccessible content. This workflow improves the odds of a successful extraction by following a strictly ordered fallback sequence that prefers shell-based tools over Python sandbox execution.
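The ordered fallback idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `extract_with_fallback` and the `(name, extractor)` pairs are hypothetical names, not part of the skill's tool API, and each extractor is assumed to take a file path and return text or `None`.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: an extractor takes a PDF path and returns text or None.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    pdf_path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each extractor in order; return (name, text) for the first that yields text."""
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception:
            # A failing tool is not fatal: move on to the next one in the chain.
            continue
        if text and text.strip():
            return name, text
    # Every extractor failed or returned empty output.
    return None, None
```

Passing the extractors in the order read_file, pdftotext, PyMuPDF would reproduce the priority described here; a result of `(None, None)` corresponds to the point where the workflow degrades to domain knowledge.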

Ordered Tool Chain Summary

Step	Tool	Method	Priority
0	read_file	Direct PDF text extraction	First attempt
1	run_shell	pdftotext command	Primary fallback (if Step 0 returns binary output or fails)
2	execute_code_sandbox	PyMuPDF	Secondary fallback


Step 0: Initial Extraction Attempt with read_file Tool

read_file filetype="pdf" file_path="path/to/document.pdf"
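The summary table says to fall back when Step 0 returns binary output or fails, but it does not say how to recognize binary output. One possible heuristic is sketched below. The function name, the 0.9 threshold, and the assumption that the read_file result arrives as a Python string are all illustrative and not taken from the skill.

```python
def looks_like_extracted_text(content: str, min_printable_ratio: float = 0.9) -> bool:
    """Heuristic: True if content looks like readable text rather than raw PDF bytes."""
    stripped = content.strip()
    if not stripped:
        return False
    # A raw PDF starts with its signature; that means nothing was extracted.
    if stripped.startswith("%PDF-"):
        return False
    # Count printable characters plus ordinary line breaks and tabs.
    printable = sum(ch.isprintable() or ch in "\n\r\t" for ch in stripped)
    return printable / len(stripped) >= min_printable_ratio
```

If this returns `False`, the workflow would move on to the shell-based steps.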

Step 1: Download PDF with Browser User-Agent

curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o downloaded.pdf "URL_HERE"

Step 2: Verify File Type Before Parsing

file downloaded.pdf
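Where the `file` utility is unavailable, a similar check can be approximated by reading the file signature directly. The sketch below assumes a well-formed PDF begins with the bytes `%PDF-`; `looks_like_pdf` is a hypothetical helper name, and this check is looser than what `file` performs.

```python
def looks_like_pdf(path: str) -> bool:
    """Return True if the file at path starts with the PDF signature '%PDF-'."""
    try:
        with open(path, "rb") as f:
            head = f.read(5)
    except OSError:
        # Missing or unreadable files are treated as "not a PDF".
        return False
    return head == b"%PDF-"
```

An HTML error page or a JavaScript redirect stub saved under a `.pdf` name fails this check, which is the case the verification step is meant to catch.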

Step 3: Primary Shell Extraction with pdftotext via run_shell

pdftotext downloaded.pdf extracted.txt

run_shell command="pdftotext downloaded.pdf extracted.txt"

# Debian/Ubuntu
apt-get update && apt-get install -y poppler-utils

# macOS  
brew install poppler

# RHEL/CentOS
yum install -y poppler-utils
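The pdftotext step and the install commands above imply three outcomes: the tool is missing, the tool ran and failed, or the tool succeeded. A minimal Python sketch that distinguishes them follows. `run_pdftotext` and its string return values are hypothetical, and the `binary` parameter exists only so the lookup can be pointed at a different executable name.

```python
import shutil
import subprocess


def run_pdftotext(pdf_path: str, txt_path: str, binary: str = "pdftotext") -> str:
    """Run pdftotext and report 'missing', 'failed', or 'ok'."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(binary) is None:
        return "missing"
    result = subprocess.run(
        [binary, pdf_path, txt_path], capture_output=True, text=True
    )
    return "ok" if result.returncode == 0 else "failed"
```

A `"missing"` result suggests installing Poppler with one of the commands above, while `"failed"` suggests moving on to the PyMuPDF fallback.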

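Step 4: Secondary Python Fallback with PyMuPDF via execute_code_sandbox

import fitz  # PyMuPDF

# Concatenate the text of every page in document order.
doc = fitz.open("downloaded.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()

# Write as UTF-8 so non-ASCII characters survive regardless of locale.
with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write(text)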

execute_code_sandbox code="<Python code above>"

pip install pymupdf

Step 5: Graceful Degradation to Domain Knowledge

NOTE: Source document [URL] was inaccessible due to [reason].
Content below combines partial extraction with established domain knowledge
for [topic]. Verify against official sources when available.
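The bracketed placeholders in the notice above can be filled programmatically. The helper below is a sketch; `degradation_notice` and its parameter names are illustrative and do not come from the skill.

```python
def degradation_notice(url: str, reason: str, topic: str) -> str:
    """Fill the NOTE template with the source URL, failure reason, and topic."""
    return (
        f"NOTE: Source document {url} was inaccessible due to {reason}.\n"
        "Content below combines partial extraction with established domain knowledge\n"
        f"for {topic}. Verify against official sources when available."
    )
```

Prepending the returned string to the generated content keeps the reader aware that the source PDF could not be fully read.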

Complete Workflow Script

#!/bin/bash
# pdf-extract-workflow.sh

PDF_URL="$1"
OUTPUT_PDF="downloaded.pdf"
OUTPUT_TXT="extracted.txt"

# Step 0/1: Download with browser user-agent
echo "Downloading PDF..."
curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$PDF_URL"

# Step 1: Verify file type
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: Downloaded file is not a valid PDF"
    echo "Attempting fallback extraction anyway..."
fi

# Step 2: Try pdftotext via shell (PRIMARY EXTRACTION)
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
        echo "Extraction successful with pdftotext"
        exit 0
    fi
fi

# Step 3: Fallback to PyMuPDF via Python sandbox (SECONDARY)
echo "Falling back to PyMuPDF..."
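python3 << 'PYTHON_SCRIPT'
import sys

# NOTE: the original script is cut off after its imports. The rest of this
# heredoc is a reconstruction based on the standalone PyMuPDF snippet in this skill.
try:
    import fitz  # PyMuPDF
except ImportError:
    sys.exit("PyMuPDF is not installed (pip install pymupdf)")

try:
    doc = fitz.open("downloaded.pdf")
    text = "".join(page.get_text() for page in doc)
    doc.close()
except Exception as exc:
    sys.exit(f"PyMuPDF extraction failed: {exc}")

with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write(text)
print("Extraction successful with PyMuPDF")
PYTHON_SCRIPT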

Step-by-Step Instructions

Step 0: Initial Extraction Attempt with read_file Tool

Step 1: Download PDF with Browser User-Agent

Step 2: Verify File Type Before Parsing

Step 3: Primary Shell Extraction with pdftotext via run_shell

Step 4: Secondary Python Fallback with PyMuPDF via execute_code_sandbox

Step 5: Graceful Degradation to Domain Knowledge

Complete Workflow Script
