Pdf Processing

Name: Pdf Processing
Author: anjijava16

Extract text, tables, and metadata from PDF documents. Parse scanned PDFs with OCR, analyze PDF structure, and convert PDFs to other formats. Use when the user needs to read, parse, or extract content from PDF files, even if they just say "get the text from this document."

anjijava16 · 0 stars · 2026-03-29

Occupation
Category: Documents

Skill content

When to use this skill

Use this skill when the user needs to:

Extract text or tables from PDF documents
Parse scanned/image-based PDFs using OCR
Read PDF metadata (author, title, page count, creation date)
Convert PDF pages to images
Analyze PDF structure (fonts, layout, embedded objects)

Prerequisites

pip install pdfplumber pypdf pdf2image

For OCR on scanned PDFs:

pip install pytesseract Pillow
brew install tesseract poppler  # macOS
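
pdf2image and pytesseract are thin wrappers around system executables, so a missing Poppler or Tesseract install only shows up when the first conversion or OCR call fails. A minimal sketch of a pre-flight check, assuming the executables are named `pdftoppm` and `tesseract` and are expected on the PATH; the helper name `missing_tools` is hypothetical:

```python
import shutil

# Executables the Python wrappers call under the hood (assumed default names).
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools: " + ", ".join(missing))
    else:
        print("All OCR prerequisites found.")
```

Running this before any OCR work turns a confusing mid-run failure into a clear message about which package to install.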

Text Extraction

Use pdfplumber as the default. It handles most PDFs well.

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        if text:
            print(f"--- Page {i + 1} ---")
            print(text)
        else:
            print(f"--- Page {i + 1}: no text (likely scanned) ---")

OCR Fallback for Scanned PDFs

from pdf2image import convert_from_path
import pytesseract

images = convert_from_path("scanned.pdf", dpi=300)
for i, img in enumerate(images):
    text = pytesseract.image_to_string(img)
    print(f"--- Page {i + 1} ---")
    print(text)
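
The two snippets above can be combined so that OCR runs only on pages where direct extraction comes back empty. This is a sketch under stated assumptions rather than the skill author's method: the `min_chars` threshold of 20 is an arbitrary heuristic, and the names `needs_ocr` and `extract_with_fallback` are hypothetical.

```python
def needs_ocr(text, min_chars=20):
    """Heuristic: treat a page as scanned when it yields almost no text."""
    return text is None or len(text.strip()) < min_chars


def extract_with_fallback(path, dpi=300, min_chars=20):
    """Return one text string per page, using OCR only where needed."""
    # Imported lazily so the heuristic above works without these packages.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    results = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if needs_ocr(text, min_chars):
                # Render just this page (pdf2image page numbers are 1-based).
                image = convert_from_path(
                    path, dpi=dpi, first_page=number, last_page=number
                )[0]
                text = pytesseract.image_to_string(image)
            results.append(text)
    return results
```

Rendering one page at a time keeps memory use low on long documents, at the cost of one Poppler call per scanned page.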

Table Extraction

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

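import pandas as pd
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
    if tables:
        # The first row is treated as the header; the remaining rows are data.
        df = pd.DataFrame(tables[0][1:], columns=tables[0][0])
        # to_markdown() requires the optional "tabulate" package.
        print(df.to_markdown(index=False))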
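
Rows returned by `extract_tables()` are lists of cell strings, and empty cells come back as `None`, so a raw header row can contain gaps or duplicates that make awkward DataFrame columns. A minimal cleanup sketch, assuming the first non-empty row is the header; the helper name `clean_table` and the `col_N` placeholder naming are hypothetical:

```python
def clean_table(rows):
    """Split raw rows into (header, body) with cells normalized to strings."""

    def clean(cell):
        # None becomes "", and line breaks inside a cell collapse to spaces.
        return "" if cell is None else " ".join(str(cell).split())

    cleaned = [[clean(cell) for cell in row] for row in rows]
    # Drop rows that are entirely empty after cleaning.
    cleaned = [row for row in cleaned if any(row)]
    if not cleaned:
        return [], []

    header, seen = [], {}
    for index, name in enumerate(cleaned[0]):
        name = name or f"col_{index}"  # placeholder for a blank header cell
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Suffix repeated names so every column label is distinct.
        header.append(name if count == 0 else f"{name}_{count}")
    return header, cleaned[1:]
```

With this in place, `header, body = clean_table(tables[0])` followed by `pd.DataFrame(body, columns=header)` avoids `None` column labels.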

PDF Metadata

from pypdf import PdfReader

reader = PdfReader("input.pdf")
meta = reader.metadata
print(f"Pages: {len(reader.pages)}")
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Creator: {meta.creator}")
print(f"Created: {meta.creation_date}")
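
`reader.metadata` is `None` when a PDF carries no document information dictionary, in which case the attribute access above raises `AttributeError`. A defensive sketch that copies the fields into a plain dict instead; the helper name `metadata_to_dict` is hypothetical, and the `except` branch assumes that a malformed value (such as an unparseable date) may raise on access:

```python
FIELDS = ("title", "author", "creator", "producer", "creation_date")


def metadata_to_dict(meta, fields=FIELDS):
    """Copy selected metadata attributes into a plain dict, tolerating gaps."""
    result = {}
    for name in fields:
        try:
            # getattr with a default also covers meta being None.
            result[name] = getattr(meta, name, None)
        except Exception:
            # Assumed failure mode: a malformed value that cannot be parsed.
            result[name] = None
    return result
```

Usage would look like `info = metadata_to_dict(PdfReader("input.pdf").metadata)`, after which every key is present and missing values are `None`.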

PDF to Images

from pdf2image import convert_from_path

images = convert_from_path("input.pdf", dpi=300)
for i, img in enumerate(images):
    img.save(f"page_{i + 1}.png", "PNG")
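
Calling `convert_from_path` on a whole file holds every rendered page in memory at once, which can be heavy at 300 dpi. One possible approach is to convert in page ranges using the `first_page` and `last_page` parameters. The sketch below assumes a batch size of 10 is acceptable; the names `page_batches` and `convert_in_batches` are hypothetical.

```python
def page_batches(total_pages, batch_size=10):
    """Yield inclusive, 1-based (first_page, last_page) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(path, dpi=300, batch_size=10):
    """Save every page as PNG while holding at most one batch in memory."""
    # Imported lazily so page_batches works without these packages.
    from pdf2image import convert_from_path
    from pypdf import PdfReader

    total = len(PdfReader(path).pages)
    for first, last in page_batches(total, batch_size):
        images = convert_from_path(
            path, dpi=dpi, first_page=first, last_page=last
        )
        for offset, img in enumerate(images):
            img.save(f"page_{first + offset}.png", "PNG")
```

Output file names match the unbatched loop above, so the two approaches are interchangeable for downstream steps.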
