Extract text from local PDFs using pdftotext or PyMuPDF via run_shell
Use this skill when you need to extract text from PDF files that exist locally on the filesystem, and read_file returns binary data instead of readable text.
read_file on PDFs returns binary/garbled data instead of text.
First, list directory contents to find all PDF files:
ls -la *.pdf
# or for recursive search
find . -name "*.pdf" -type f
Choose one of these methods based on available tools: pdftotext (from poppler-utils) on the command line, or PyMuPDF from Python.
# Extract single PDF
pdftotext input.pdf output.txt
# Batch extract all PDFs in directory
for pdf in *.pdf; do
    pdftotext "$pdf" "${pdf%.pdf}.txt"
done
python3 << 'EOF'
import fitz  # PyMuPDF
import glob
import os

for pdf_path in glob.glob("*.pdf"):
    doc = fitz.open(pdf_path)
    # Concatenate the text of every page
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    # Swap only the file extension, not a ".pdf" elsewhere in the name
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Extracted: {pdf_path} -> {txt_path}")
EOF
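Picking between the two methods can be automated. The following is a minimal sketch, assuming that pdftotext being on PATH and the fitz module being importable are good enough signals of availability; the function name pick_extractor is hypothetical.

```python
import importlib.util
import shutil


def pick_extractor():
    """Return the name of an extraction method that looks usable here, or None."""
    if shutil.which("pdftotext"):          # poppler-utils CLI found on PATH
        return "pdftotext"
    if importlib.util.find_spec("fitz"):   # PyMuPDF can be imported
        return "pymupdf"
    return None                            # neither tool is available
```

Calling pick_extractor() before extraction lets a script fall back to PyMuPDF when pdftotext is not installed, and report clearly when neither is present.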
Once extracted, use read_file to read the .txt files:
# Now you can read the text files normally
content = read_file(filetype="txt", file_path="document.txt")
Proceed with your analysis, summarization, or data extraction on the text content.
# Step 1: Find PDFs
ls -la *.pdf
# Step 2: Extract all PDFs to text
for pdf in *.pdf; do
    pdftotext "$pdf" "${pdf%.pdf}.txt"
done
# Step 3: Verify extraction
ls -la *.txt
Or as a Python script via run_shell:
python3 << 'SCRIPT'
import fitz, glob, os

for pdf in glob.glob("*.pdf"):
    doc = fitz.open(pdf)
    text = "".join(page.get_text() for page in doc)
    # Write next to the PDF, replacing only the extension
    with open(os.path.splitext(pdf)[0] + ".txt", "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Done: {pdf}")
SCRIPT
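Listing the .txt files only shows that they exist, not that they contain text. As a rough sketch of a stricter check, the hypothetical helper below flags PDFs whose .txt output is missing or holds only whitespace, which often suggests a scanned, image-only PDF. It assumes each .txt file sits next to its PDF, as in the examples above.

```python
from pathlib import Path


def find_unextracted(directory="."):
    """Return names of PDFs whose .txt output is missing or effectively empty."""
    suspect = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        if not txt.exists():
            suspect.append(pdf.name)
            continue
        # strip() also removes form feeds, so page breaks alone count as empty
        if not txt.read_text(encoding="utf-8", errors="replace").strip():
            suspect.append(pdf.name)
    return suspect
```

Any file this returns is a candidate for a second attempt with the other method, or for OCR.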
If pdftotext is missing: apt-get install poppler-utils, or use the PyMuPDF method.
For scanned (image-only) PDFs, which need OCR: pdftoppm + tesseract.
Don't: use read_file directly on PDFs for text extraction.
Do: use run_shell with pdftotext or PyMuPDF for reliable extraction.
Do: save output to .txt files for further processing.
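The pdftoppm + tesseract route for scanned PDFs can be sketched as follows. This is an illustration under stated assumptions, not a tested pipeline: it assumes both tools are installed and on PATH, that 300 DPI is an adequate rendering resolution, and that the file name contains no glob metacharacters. The function names render_command and ocr_pdf are hypothetical.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path, dpi=300):
    """Build the pdftoppm call that renders each page to <stem>-<n>.png."""
    pdf = Path(pdf_path)
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(pdf.with_suffix(""))]


def ocr_pdf(pdf_path, dpi=300):
    """Render pages to PNG, OCR each page, and write the combined text to <stem>.txt."""
    pdf = Path(pdf_path)
    subprocess.run(render_command(pdf, dpi), check=True)
    # Page numbers are zero-padded by pdftoppm, so a plain sort keeps page order
    pages = sorted(pdf.parent.glob(f"{pdf.stem}-*.png"))
    parts = []
    for page in pages:
        out_base = page.with_suffix("")  # tesseract appends .txt to this base
        subprocess.run(["tesseract", str(page), str(out_base)], check=True)
        parts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    pdf.with_suffix(".txt").write_text("".join(parts), encoding="utf-8")
```

A call such as ocr_pdf("scan.pdf") would leave scan.txt beside the PDF, in the same place the other methods write their output, along with the intermediate page images and per-page text files.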