Skill File

PDF Processing Guide

Name: PDF Processing Guide
Author: Hemkumar247

Use this skill when you need to read, inspect, or extract content from PDF files — especially when file content is NOT in your context and you need to read it from disk. Covers content inventory, text extraction, page rasterization for visual inspection, embedded image/attachment/table/form-field extraction, and choosing the right reading strategy for different document types (text-heavy, scanned, slide-decks, forms, data-heavy). Do NOT use this skill for PDF creation, form filling, merging, splitting, watermarking, or encryption — use the pdf skill instead.

Hemkumar247 | 0 stars | April 2, 2026

Category: Documents

Skill Content

Overview

This guide covers essential PDF reading operations using Python libraries and command-line tools. For advanced features (pypdfium2 rendering, pdfplumber table settings, OCR fallback, encrypted/corrupted PDF handling), see REFERENCE.md.

Reading & Inspecting PDFs

Before doing anything with a PDF, understand what you're working with.

Content inventory

Run a quick diagnostic first. For simple tasks ("summarize this document"), pdfinfo + a text sample may suffice. For anything involving figures, attachments, or extraction issues, run the full set:

# Always: page count, file size, PDF version, metadata
pdfinfo document.pdf

# Always: quick text extraction check — is this a text PDF or a scan?
pdftotext -f 1 -l 1 document.pdf - | head -20

# If figures/charts may matter:
pdfimages -list document.pdf

# If the PDF might contain embedded files (reports, portfolios):
pdfdetach -list document.pdf

# If text extraction looks garbled:
pdffonts document.pdf
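
One way to act on the quick text check above is to score the sample that `pdftotext` prints for the first page. The sketch below is a hypothetical heuristic, not part of poppler: the function name `classify_text_sample` and both thresholds (`min_chars`, `min_alnum_ratio`) are assumptions chosen for illustration and would need tuning on real documents.

```python
def classify_text_sample(sample: str, min_chars: int = 50,
                         min_alnum_ratio: float = 0.6) -> str:
    """Classify a first-page text sample as 'text', 'scan', or 'garbled'.

    Hypothetical heuristic; the thresholds are illustrative assumptions.
    """
    visible = [ch for ch in sample if not ch.isspace()]
    # Almost no extractable characters suggests a scanned (image-only) page.
    if len(visible) < min_chars:
        return "scan"
    # Plenty of characters but few letters or digits suggests a broken font
    # mapping; pdffonts is the next diagnostic to run in that case.
    alnum = sum(ch.isalnum() for ch in visible)
    if alnum / len(visible) < min_alnum_ratio:
        return "garbled"
    return "text"
```

A result of `"scan"` points toward rasterizing and OCR, while `"garbled"` points toward checking the fonts before trusting any extracted text.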

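Text extraction

from pypdf import PdfReader

reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text page by page; joining with newlines keeps the last word
# of one page from fusing with the first word of the next
text = "\n".join(page.extract_text() or "" for page in reader.pages)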

# Layout mode preserves spatial positioning
pdftotext -layout document.pdf output.txt

# Specific page range
pdftotext -f 1 -l 5 document.pdf output.txt
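
When the text is needed inside a Python script rather than in a file, the same `pdftotext` flags can be driven through `subprocess`. This is a minimal sketch that assumes `pdftotext` is installed and on `PATH`; the helper names `build_pdftotext_cmd` and `extract_text` are hypothetical and exist only for this example.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str, first: Optional[int] = None,
                        last: Optional[int] = None,
                        layout: bool = False) -> List[str]:
    """Build a pdftotext argument list that writes the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")        # preserve spatial positioning
    if first is not None:
        cmd += ["-f", str(first)]    # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]     # last page to convert
    # "-" as the output file sends the text to stdout instead of a file.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, **kwargs) -> str:
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, **kwargs),
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout
```

For example, `extract_text("document.pdf", first=1, last=5, layout=True)` would mirror the two commands above in a single call. Passing the arguments as a list avoids shell quoting problems with unusual filenames.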

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Visual inspection (rasterize pages)

# Rasterize a single page (page 3 here) at 150 DPI
pdftoppm -jpeg -r 150 -f 3 -l 3 document.pdf /tmp/page

# pdftoppm zero-pads the page number in the output filename to the number
# of digits in the TOTAL page count (e.g., page-03.jpg for a 50-page PDF,
# page-003.jpg for a 100- to 999-page PDF)
# Don't guess the filename; find it:
ls /tmp/page-*.jpg
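
The padding rule described in the comments above can be written down as a small function. This is a sketch under the assumption that the pad width equals the number of digits in the total page count, as stated above; the helper name `pdftoppm_output_name` is hypothetical, and listing the directory remains the safer check because behavior may differ between poppler versions.

```python
def pdftoppm_output_name(prefix: str, page: int, total_pages: int,
                         ext: str = "jpg") -> str:
    """Predict the file pdftoppm writes for one page.

    Assumes the page number is zero-padded to the number of digits
    in the document's total page count.
    """
    width = len(str(total_pages))
    return f"{prefix}-{page:0{width}d}.{ext}"
```

The total page count is the `Pages` value reported by `pdfinfo`, so the prediction can be made before rendering anything.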

Extracting embedded images

# List all embedded images with metadata (size, color, compression)
pdfimages -list document.pdf

# Extract all images as PNG
pdfimages -png document.pdf /tmp/img

# Extract from specific pages only (pages 3-5)
pdfimages -png -f 3 -l 5 document.pdf /tmp/img

# Extract in original format (JPEG stays JPEG, etc.)
pdfimages -all document.pdf /tmp/img

import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for page in doc:
    for img in page.get_images():
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:  # CMYK or other non-RGB
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"/tmp/img_{xref}.png")

Extracting file attachments

# List all attachments
pdfdetach -list document.pdf

# Extract all attachments to a directory
mkdir -p /tmp/attachments
pdfdetach -saveall -o /tmp/attachments/ document.pdf

# Extract a specific attachment by number (1-based index from -list output)
pdfdetach -save 1 -o /tmp/attachment.pdf document.pdf

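import os
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for name, content_list in reader.attachments.items():
    # Sanitize: the name comes from the PDF and may contain path components
    safe_name = os.path.basename(name) or "attachment"
    for i, content in enumerate(content_list):
        # Several attachments can share one name; prefix later ones
        # with an index so they do not overwrite the first
        out_name = safe_name if i == 0 else f"{i}_{safe_name}"
        with open(os.path.join("/tmp", out_name), "wb") as f:
            f.write(content)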
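
Because attachment names come from inside the PDF, they should be treated as untrusted input. The sanitization step above can be factored into a helper that is easy to test in isolation. This is a sketch: the function name `safe_attachment_path`, the `"attachment"` fallback name, and the index-prefix scheme are assumptions made for illustration.

```python
import os


def safe_attachment_path(name: str, out_dir: str, index: int = 0) -> str:
    """Return a path inside out_dir for an attachment name taken from a PDF."""
    # Treat backslashes as separators too, so Windows-style names are
    # reduced to their final component on any platform.
    base = os.path.basename(name.replace("\\", "/"))
    # Empty or dot-only names would resolve to a directory, not a file.
    if base in ("", ".", ".."):
        base = "attachment"
    # Distinguish multiple attachments that share the same name.
    if index > 0:
        base = f"{index}_{base}"
    return os.path.join(out_dir, base)
```

A name such as `../../etc/passwd` is reduced to `passwd` inside the output directory instead of escaping it.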

Extracting form field data

from pypdf import PdfReader

reader = PdfReader("form.pdf")

# Text input fields only:
fields = reader.get_form_text_fields()
for name, value in fields.items():
    print(f"{name}: {value}")

# All field types (checkboxes, radio buttons, dropdowns too):
all_fields = reader.get_fields() or {}
for name, field in all_fields.items():
    print(f"{name}: {field.get('/V', '')} (type: {field.get('/FT', '')})")

pdftk form.pdf dump_data_fields
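
The `/FT` values printed by the loop above are PDF field-type codes: `/Tx` (text), `/Btn` (buttons, including checkboxes and radio buttons), `/Ch` (choice lists and dropdowns), and `/Sig` (signatures). The sketch below turns a field mapping into a readable summary. The helper name `summarize_fields` and the label strings are hypothetical, and the function assumes only that each field behaves like a dictionary with optional `/FT` and `/V` keys, as in the code above.

```python
from typing import Any, Dict, Mapping

# Field-type codes from the PDF specification; the labels are this sketch's own.
FIELD_TYPE_NAMES = {
    "/Tx": "text",
    "/Btn": "button",
    "/Ch": "choice",
    "/Sig": "signature",
}


def summarize_fields(
    fields: Mapping[str, Mapping[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """Map each field name to a readable type label and its current value."""
    summary: Dict[str, Dict[str, Any]] = {}
    for name, field in fields.items():
        field_type = str(field.get("/FT", ""))
        summary[name] = {
            "type": FIELD_TYPE_NAMES.get(field_type, "unknown"),
            "value": field.get("/V", ""),
        }
    return summary
```

With pypdf this could be called as `summarize_fields(reader.get_fields() or {})`; plain dictionaries work equally well for testing.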

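Font diagnostics

# List the fonts used and whether each is embedded (run this when extracted text looks garbled)
pdffonts document.pdf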

Quick Reference

| Task | Best Tool | Command/Code |
| --- | --- | --- |
| Inspect PDF | poppler-utils | `pdfinfo`, `pdfimages -list`, `pdfdetach -list`, `pdffonts` |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract text (CLI) | pdftotext | `pdftotext -layout input.pdf output.txt` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| See page visually | pdftoppm | `pdftoppm -jpeg -r 150 -f N -l N input.pdf prefix` |
| Extract images | pdfimages | `pdfimages -png input.pdf prefix` |
| Extract attachments | pdfdetach | `pdfdetach -saveall -o /tmp/ input.pdf` |
| Read form fields | pypdf | `reader.get_fields()` |
| OCR scanned PDFs | pytesseract | Convert to image first |
