Skill Profile

Local Pdf Extraction

Name: Local Pdf Extraction
Author: HKUDS

Extract text from local PDFs using pdftotext or PyMuPDF via run_shell

HKUDS · 5,421 stars · March 24, 2026

Occupation: Software Developer
Category: Documents

Skill Content

Local PDF Extraction Workflow

Use this skill when you need to extract text from PDF files on the local filesystem and read_file returns binary data instead of readable text.

When to Use

PDF files exist in the local workspace or known directories
read_file on PDFs returns binary/garbled data instead of text
You need to process PDF content for analysis, summarization, or data extraction
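The second condition can be confirmed before extraction by checking the file signature. The following is a minimal sketch, assuming the file carries the standard `%PDF-` header near its start; the helper name `looks_like_pdf` is illustrative and not part of the skill.

```python
def looks_like_pdf(path: str) -> bool:
    """Return True if the PDF signature appears within the first 1024 bytes."""
    with open(path, "rb") as f:
        # Some files carry a few junk bytes before the header, so search a
        # small leading window instead of requiring an exact prefix.
        return b"%PDF-" in f.read(1024)
```

A file that passes this check but still reads as garbled text through read_file is a reasonable candidate for the extraction steps below.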

Step-by-Step Instructions

Step 1: Locate PDF Files

First, list directory contents to find all PDF files:

ls -la *.pdf
# or for recursive search
find . -name "*.pdf" -type f
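The `find` pattern above is case-sensitive, so files named `REPORT.PDF` would be missed. As a hedged alternative, the search can be sketched in Python with a case-insensitive suffix check; the function name `find_pdfs` is an assumption made for illustration.

```python
from pathlib import Path

def find_pdfs(root: str = ".") -> list[Path]:
    """Recursively collect PDF files under root, ignoring extension case."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs(".")` returns `Path` objects that can be passed directly to either extraction method in Step 2.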

Step 2: Extract PDFs to Text

Method A: Using pdftotext (poppler-utils)

# Extract single PDF
pdftotext input.pdf output.txt

# Batch extract all PDFs in directory
for pdf in *.pdf; do
    pdftotext "$pdf" "${pdf%.pdf}.txt"
done
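The shell loop above assumes pdftotext is installed and reports nothing when a file fails. A minimal Python sketch of the same batch step, with an availability check and per-file error reporting, might look like this. The function names are hypothetical; `-layout` is an existing pdftotext option that preserves the physical layout of the page, which is left off by default here.

```python
import shutil
import subprocess
from pathlib import Path

def build_pdftotext_cmd(pdf: Path, layout: bool = False) -> list[str]:
    """Build the pdftotext argument list for one PDF, writing a sibling .txt file."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the original physical layout where possible
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd

def extract_all(directory: str = ".", layout: bool = False) -> list[Path]:
    """Run pdftotext on every *.pdf in directory and return the .txt files written."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install poppler-utils or use PyMuPDF")
    written = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        result = subprocess.run(
            build_pdftotext_cmd(pdf, layout), capture_output=True, text=True
        )
        if result.returncode == 0:
            written.append(pdf.with_suffix(".txt"))
        else:
            print(f"Failed: {pdf}: {result.stderr.strip()}")
    return written
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.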

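Method B: Using PyMuPDF (fitz) via Python

python3 << 'EOF'
import fitz  # PyMuPDF
import glob
import os

for pdf_path in glob.glob("*.pdf"):
    # Open each PDF and concatenate the text of all pages
    with fitz.open(pdf_path) as doc:
        text = "".join(page.get_text() for page in doc)

    # Swap only the final extension, so ".pdf" elsewhere in the name is untouched
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Extracted: {pdf_path} -> {txt_path}")
EOF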

Step 3: Read Extracted Text Files

# Now you can read the text files normally
content = read_file(filetype="txt", file_path="document.txt")

Complete Workflow Example

# Step 1: Find PDFs
ls -la *.pdf

# Step 2: Extract all PDFs to text
for pdf in *.pdf; do
    pdftotext "$pdf" "${pdf%.pdf}.txt"
done

# Step 3: Verify extraction
ls -la *.txt

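python3 << 'SCRIPT'
import fitz, glob, os  # fitz is PyMuPDF
for pdf in glob.glob("*.pdf"):
    with fitz.open(pdf) as doc:
        text = "".join(page.get_text() for page in doc)
    # Write UTF-8 explicitly so the output does not depend on the system locale
    with open(os.path.splitext(pdf)[0] + ".txt", "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Done: {pdf}")
SCRIPT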
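The verification step in the workflow only lists the `.txt` files, which does not show whether each one actually contains text. A hedged sketch of a stricter check follows, assuming each PDF is extracted to a sibling file with the same base name; `check_extraction` is an illustrative name.

```python
from pathlib import Path

def check_extraction(directory: str = ".") -> dict[str, str]:
    """Map each PDF name to 'ok', 'missing', or 'empty' based on its sibling .txt file."""
    status = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        if not txt.exists():
            status[pdf.name] = "missing"   # extraction never produced a file
        elif not txt.read_text(encoding="utf-8", errors="replace").strip():
            status[pdf.name] = "empty"     # file exists but holds only whitespace
        else:
            status[pdf.name] = "ok"
    return status
```

An `empty` result often indicates a scanned, image-only PDF with no text layer, in which case neither pdftotext nor `page.get_text()` has anything to extract.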

Local Pdf Extraction

Local PDF Extraction Workflow

When to Use

Step-by-Step Instructions

Step 1: Locate PDF Files

Step 2: Extract PDFs to Text

Method A: Using pdftotext (poppler-utils)

Method B: Using PyMuPDF (fitz) via Python

Step 3: Read Extracted Text Files

Step 4: Process Content

Complete Workflow Example

Troubleshooting

Key Takeaways

Related Skills

Feishu Doc

Summarize

Nano Pdf

Diffs

Customs Trade Compliance

Nutrient Document Processing