Skill Profile

Reliable Pdf Extraction

Name: Reliable Pdf Extraction
Author: HKUDS

Use shell commands or Python libraries to extract PDF text when the read_file PDF handler fails.

HKUDS · 5,421 stars · March 24, 2026

Occupation: Software Developer
Category: Documentation

Skill Content

Reliable PDF Text Extraction

Problem

The read_file tool with filetype='pdf' often returns binary image data, errors, or otherwise unusable output when asked to extract text from PDF documents, which makes it unreliable for structured data extraction tasks.

Solution

Use run_shell with command-line tools (pdftotext, pdfinfo) or execute_code_sandbox with Python libraries (PyMuPDF, pdfplumber) to extract PDF text content reliably.

Methods

Method 1: pdftotext (Recommended for simple extraction)

# Extract all text to stdout
pdftotext input.pdf -

# Or extract to file
pdftotext input.pdf output.txt
cat output.txt
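
When the same extraction has to be driven from Python rather than typed into a shell, the command can be assembled and run with subprocess. The following is a minimal sketch, not part of the original skill: the helper names build_pdftotext_cmd and run_pdftotext are hypothetical, and it assumes the Poppler pdftotext binary is on PATH. The -f, -l, and -layout options are standard pdftotext flags for first page, last page, and layout-preserving output.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(
    pdf_path: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    layout: bool = False,
) -> List[str]:
    """Assemble a pdftotext command line that writes the text to stdout."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [pdf_path, "-"]  # "-" sends the output to stdout
    return cmd


def run_pdftotext(pdf_path: str, **options) -> str:
    """Run pdftotext and return its stdout; raises if the tool is missing or fails."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, run_pdftotext("input.pdf", first_page=1, last_page=2) would return only the first two pages, which helps keep output small for long documents.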

Method 2: pdfinfo (For metadata)

pdfinfo input.pdf
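
pdfinfo prints one "Key: value" pair per line, so its output is easy to turn into a dictionary, for instance to read the page count before extracting text. The sketch below assumes that format; parse_pdfinfo is a hypothetical helper, and the sample string is illustrative rather than output from a real file.

```python
from typing import Dict


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Turn pdfinfo's "Key: value" lines into a dictionary of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Illustrative sample, not taken from a real document.
sample = """Title:          Quarterly Report
Pages:          12
CreationDate:   Mon Mar  4 10:12:00 2024
"""

info = parse_pdfinfo(sample)
page_count = int(info["Pages"])
print(page_count)
```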

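Method 3: Python with PyMuPDF (fitz)

import fitz  # PyMuPDF

# Open the document and concatenate the text of every page;
# the context manager closes the file even if extraction fails.
with fitz.open("input.pdf") as doc:
    text = "".join(page.get_text() for page in doc)
print(text)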
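
Pages that are scanned images have no text layer, so text extraction returns an empty or near-empty string for them. A small check like the one below can flag those pages before the text is used downstream. It is a sketch under assumptions: pages_with_little_text is a hypothetical helper, and the 20-character threshold is an arbitrary choice, not a value from the skill.

```python
from typing import List


def pages_with_little_text(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


# With PyMuPDF, the per-page texts could be collected like this:
#   with fitz.open("input.pdf") as doc:
#       page_texts = [page.get_text() for page in doc]
page_texts = ["A full paragraph of ordinary body text.", "", "  \n"]
print(pages_with_little_text(page_texts))
```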

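Method 4: Python with pdfplumber (Better for tables/structured data)

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        # extract_text() may return None for a page without text in some versions
        text = page.extract_text() or ""
        print(text)
        # For tables:
        # tables = page.extract_tables()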
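
page.extract_tables() returns each table as a list of rows, where every row is a list of cell strings and empty cells may be None. The sketch below converts one such table into a list of dictionaries keyed by the header row. It assumes the first row is the header, which is not always true for real PDFs; table_to_records is a hypothetical helper.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Convert a pdfplumber-style table into dicts, treating row 0 as the header."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells come back as None; normalise them to empty strings.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Usage with pdfplumber (inside the page loop):
#   for table in page.extract_tables():
#       records = table_to_records(table)
example = [["Item", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(table_to_records(example))
```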

Example Usage

# Via run_shell
result = run_shell(command="pdftotext document.pdf -")
pdf_text = result.stdout

# Via execute_code_sandbox
code = """
import pdfplumber
with pdfplumber.open("/path/to/document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
"""
result = execute_code_sandbox(code=code)
pdf_text = result.stdout
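
Because any single tool can be missing or can fail on a given file, the extraction methods can be chained so that the first one returning usable text wins. This is a sketch in plain Python rather than the run_shell / execute_code_sandbox tools; the function names are hypothetical, and it assumes pdftotext and PyMuPDF may or may not be installed.

```python
import shutil
import subprocess
from typing import Callable, Optional, Sequence


def via_pdftotext(path: str) -> str:
    """Extract text with the pdftotext CLI; raises if it is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def via_pymupdf(path: str) -> str:
    """Extract text with PyMuPDF; the import is deferred so a missing package fails only here."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc)


def extract_text(
    path: str,
    extractors: Sequence[Callable[[str], str]] = (via_pdftotext, via_pymupdf),
) -> Optional[str]:
    """Return the first non-empty result, or None if every extractor fails."""
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            continue  # tool missing or file unreadable: try the next method
        if text and text.strip():
            return text
    return None
```

Passing the extractors as a sequence keeps the order explicit and makes it easy to add a pdfplumber-based function as a third option.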

Related Skills

Feishu Doc

Summarize

Nano Pdf

Diffs

Customs Trade Compliance

Nutrient Document Processing