Multi-fallback PDF text extraction with early failure detection and sequential tool fallbacks
When extracting text from PDFs (especially regulatory documents, handbooks, or protected content), single-method approaches often fail due to JavaScript protection, CORS restrictions, encoding issues, or corrupted downloads. This skill provides a robust workflow that detects failures early and falls back through extraction methods in sequence.
Before attempting extraction, validate the downloaded file:
# Download the PDF
curl -L -o document.pdf "$URL"

# Check file size (reject if < 1KB - likely an error page)
FILE_SIZE=$(stat -f%z document.pdf 2>/dev/null || stat -c%s document.pdf 2>/dev/null)
if [ "$FILE_SIZE" -lt 1024 ]; then
    echo "ERROR: File too small ($FILE_SIZE bytes) - likely not a valid PDF"
    # Check if it's an HTML error page
    head -c 200 document.pdf | grep -i "<html\|<!doctype\|error\|access denied" && \
        echo "Detected HTML error page instead of PDF"
    exit 1
fi

# Check PDF magic bytes
HEAD_BYTES=$(head -c 4 document.pdf)
if [ "$HEAD_BYTES" != "%PDF" ]; then
    echo "ERROR: File does not start with PDF magic bytes"
    head -c 100 document.pdf
    exit 1
fi
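Since the later fallbacks in the chain run in Python, the same early checks can also be expressed there before handing the file to a Python extractor. A minimal sketch; the function name and defaults are illustrative, not part of the skill:

```python
import os

def looks_like_valid_pdf(path, min_bytes=1024):
    """Early failure detection: minimum size plus the %PDF magic bytes."""
    if os.path.getsize(path) < min_bytes:
        return False  # likely an HTML error page or a truncated download
    with open(path, "rb") as f:
        head = f.read(4)
    return head == b"%PDF"
```

Rejecting tiny or non-PDF files up front avoids wasting time running every extractor against an HTML error page.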
# Try pdftotext first (fastest, most reliable for simple PDFs)
if command -v pdftotext &> /dev/null; then
    pdftotext -layout document.pdf output.txt 2>/dev/null
    if [ -s output.txt ]; then
        WORD_COUNT=$(wc -w < output.txt)
        if [ "$WORD_COUNT" -gt 50 ]; then
            echo "SUCCESS: pdftotext extracted $WORD_COUNT words"
            exit 0
        fi
    fi
fi
If pdftotext fails or extracts too little text, fall back to PyMuPDF, which handles more complex PDFs. This fallback runs in Python:

import fitz  # PyMuPDF: pip install pymupdf