Skill file

Pdf Work

Name: Pdf Work
Author: Maggot4703

Use this skill when the user wants to open a PDF file and convert it to plain text (.txt). Triggers on requests like "convert PDF to text", "extract text from PDF", "read a PDF file", "turn this PDF into a text file", or "save PDF content as .txt". Handles both regular text-based PDFs and scanned/image-based PDFs using OCR. Use the existing `pdf` skill for other PDF operations (merge, split, create, fill forms, etc.).

Maggot4703 · 0 stars · 2026-03-31

Occupation
Category: Documents

Skill content

PDF to Text Conversion

Convert PDF files to .txt. Two paths:

Text-based PDFs — use pdfplumber (fast, layout-preserving)
Scanned / image-based PDFs — use pdf2image + pytesseract (OCR)

The bundled script scripts/pdf_to_txt.py handles both automatically.
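
The script's detection logic is not shown on this page. One plausible way to choose between the two paths, sketched here on the assumption that a scanned PDF yields little or no extractable text, is to read the text layer first and fall back to OCR when it is nearly empty. The names `needs_ocr` and `extract_page_texts` and the `min_chars_per_page` threshold are hypothetical and are not taken from scripts/pdf_to_txt.py.

```python
from typing import List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is scanned, given the text extracted from each page.

    Hypothetical heuristic: if the average number of non-whitespace characters
    per page is below the threshold, assume there is no usable text layer.
    """
    if not page_texts:
        return True  # nothing was extracted at all, so OCR is the only option left
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


def extract_page_texts(input_path: str) -> List[str]:
    """Read the text layer of every page with pdfplumber (imported lazily)."""
    import pdfplumber  # third-party; only needed when this function is called

    with pdfplumber.open(input_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]
```

A caller would evaluate `needs_ocr(extract_page_texts(path))` and pick the pdfplumber or the OCR path accordingly. For documents that mix typed and scanned pages, applying the same check page by page may work better than a single whole-file decision.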

Dependencies

pip install pdfplumber pypdf

# OCR support (for scanned PDFs)
pip install pdf2image pytesseract
sudo apt-get install poppler-utils tesseract-ocr  # Linux
# macOS: brew install poppler tesseract

Quick Start

# Single file — output written to input.txt
python scripts/pdf_to_txt.py document.pdf

# Custom output path
python scripts/pdf_to_txt.py document.pdf -o output.txt

# Batch: convert all PDFs in a directory
python scripts/pdf_to_txt.py ./my-pdfs/

# Force OCR even on text-based PDFs
python scripts/pdf_to_txt.py document.pdf --ocr
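
The argument handling implied by these commands can be sketched as follows. This is an illustration, not the bundled script's code: `build_parser` and `plan_outputs` are hypothetical names, and the sketch assumes that "input.txt" means the input's name with the extension swapped to .txt, that batch mode is non-recursive, and that each .txt is written next to its PDF.

```python
import argparse
from pathlib import Path
from typing import List, Optional, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Mirror the flags shown in the Quick Start commands."""
    parser = argparse.ArgumentParser(description="Convert PDF files to plain text.")
    parser.add_argument("input", help="a PDF file or a directory containing PDFs")
    parser.add_argument("-o", dest="output", default=None,
                        help="output .txt path (single file only)")
    parser.add_argument("--ocr", action="store_true",
                        help="force OCR even when a text layer exists")
    return parser


def plan_outputs(input_path: str, output: Optional[str] = None) -> List[Tuple[Path, Path]]:
    """Return (pdf, txt) pairs for a single file or a directory of PDFs."""
    src = Path(input_path)
    if src.is_dir():
        if output is not None:
            raise ValueError("a custom output path only makes sense for a single file")
        # Assumption: batch mode is non-recursive and writes next to each PDF.
        pdfs = sorted(p for p in src.iterdir() if p.suffix.lower() == ".pdf")
        return [(p, p.with_suffix(".txt")) for p in pdfs]
    # Single file: honour -o if given, otherwise swap the extension.
    return [(src, Path(output) if output else src.with_suffix(".txt"))]
```

Keeping the path planning separate from the conversion makes the batch behaviour easy to check without touching any PDF.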

Manual Python Usage

Text-based PDF

import pdfplumber

def pdf_to_txt(input_path: str, output_path: str) -> None:
    """Write the text layer of each page to output_path, one labeled block per page."""
    with pdfplumber.open(input_path) as pdf:
        with open(output_path, "w", encoding="utf-8") as out:
            for i, page in enumerate(pdf.pages, start=1):
                out.write(f"--- Page {i} ---\n")
                # Pages without a text layer (e.g. scans) may yield no text
                text = page.extract_text() or ""
                out.write(text + "\n\n")

Scanned PDF (OCR)

import pytesseract
from pdf2image import convert_from_path

def scanned_pdf_to_txt(input_path: str, output_path: str) -> None:
    """Rasterize each page to an image and run Tesseract OCR on it."""
    # Render at 300 DPI for better OCR accuracy (pdf2image defaults to 200)
    images = convert_from_path(input_path, dpi=300)
    with open(output_path, "w", encoding="utf-8") as out:
        for i, image in enumerate(images, start=1):
            out.write(f"--- Page {i} ---\n")
            text = pytesseract.image_to_string(image)
            out.write(text + "\n\n")

Notes

OCR quality depends on scan resolution; 300 DPI or higher is recommended.
For scanned PDFs in languages other than English, pass lang to Tesseract:
```
pytesseract.image_to_string(image, lang="fra")  # French
```

Password-protected PDFs: decrypt first with pypdf or qpdf, for example:

qpdf --password=SECRET --decrypt encrypted.pdf decrypted.pdf
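
The same decryption can be done from Python with pypdf. The following is a minimal sketch and not part of the bundled script; `decrypt_pdf` is a hypothetical helper name. Depending on the pypdf version, AES-encrypted files may additionally require the `cryptography` package.

```python
def decrypt_pdf(src_path: str, dst_path: str, password: str) -> None:
    """Write an unencrypted copy of src_path to dst_path using pypdf."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(src_path)
    if reader.is_encrypted:
        # decrypt() returns a falsy value when the password does not match
        if not reader.decrypt(password):
            raise ValueError("wrong password for " + src_path)

    # Copy the (now readable) pages into a fresh, unencrypted writer
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    with open(dst_path, "wb") as out:
        writer.write(out)
```

The decrypted copy can then be passed to either conversion path like any other PDF.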

For more advanced extraction (tables, structured data), see the `pdf` skill.
