Skill Profile

Pdf Extract Create Workflow

Name: Pdf Extract Create Workflow
Author: HKUDS

Complete PDF lifecycle: download, extract, and generate structured documents with reportlab

HKUDS · 5,421 stars · March 24, 2026

Occupation
Category: Documents

Skill Content

PDF Extract and Create Workflow

This skill provides a complete PDF lifecycle workflow: acquiring PDF documents from web sources or local files, extracting their text content, and generating new structured PDFs from processed data, with fallback mechanisms at each stage.

Overview

When working with PDFs, you may need to:

Download PDFs from web sources (with anti-bot protection)
Extract text content from PDFs (with fallback strategies)
Generate new PDFs from processed data (with professional formatting)

The workflow improves the odds of success by trying extraction methods in order (pdftotext first, then PyMuPDF, then a clearly labeled fallback to domain knowledge) and by using reusable templates for generation.
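
The progressive fallback idea can be sketched as a small driver that tries each extraction strategy in turn. This is a minimal illustration, not the skill's own implementation: the name `extract_with_fallbacks` and the (name, callable) pairing are assumptions, and the real strategies (pdftotext, PyMuPDF) would be plugged in as callables.

```python
from typing import Callable, Iterable, Optional, Tuple

# A strategy is a (label, function) pair; the function takes a path and returns text.
Extractor = Tuple[str, Callable[[str], str]]


def extract_with_fallbacks(
    pdf_path: str, extractors: Iterable[Extractor]
) -> Tuple[Optional[str], str]:
    """Try each strategy in order and return (winning label, text).

    A strategy counts as failed if it raises or returns only whitespace.
    Returns (None, "") when every strategy fails.
    """
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # any failure moves on to the next strategy
            print(f"{name} failed: {exc}")
            continue
        if text and text.strip():
            return name, text
        print(f"{name} returned no text")
    return None, ""
```

A caller would pass the strategies in priority order, for example a pdftotext wrapper first and a PyMuPDF wrapper second, and treat a `None` label as the signal to fall back to the degradation note.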

Entry Point: Determine Your Starting Point

Before beginning, identify your scenario:

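Scenario	Start Here	Skip
PDF already on local disk	Step 2 (Verify File Type)	Step 1 (Download)
PDF at a web URL	Step 1 (Download)	None
Need to CREATE a PDF from data	Mode C (Generate)	Modes A & B
Need to extract AND create	Mode A/B → Mode C	None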


Mode A: Web URL Download

Step 1: Download PDF with Browser User-Agent

curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o output.pdf "URL_HERE"

curl -L \
  -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
  -H "Accept: application/pdf,*/*" \
  -H "Accept-Language: en-US,en;q=0.9" \
  -H "Connection: keep-alive" \
  -o output.pdf "URL_HERE"
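
Where curl is unavailable, roughly the same request can be made from Python's standard library. The sketch below is an assumption-laden equivalent, not part of the original skill: `build_pdf_request` and `download_pdf` are hypothetical helper names, and the header values simply mirror the curl commands above. `urllib` follows redirects by default, much like `curl -L`.

```python
import shutil
import urllib.request

# Same browser-like User-Agent string used in the curl example
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_pdf_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying browser-like headers."""
    headers = {
        "User-Agent": BROWSER_UA,
        "Accept": "application/pdf,*/*",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


def download_pdf(url: str, output_path: str, timeout: float = 30.0) -> str:
    """Stream the response body to output_path and return that path."""
    request = build_pdf_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(output_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return output_path
```

The downloaded file should still be checked before parsing, since a blocked request can return an HTML page with a 200 status.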

Mode B: Local File Processing & Extraction

Step 2: Verify File Type Before Parsing

file output.pdf
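
The same check can be done without the `file` utility by looking at the first bytes of the download. This is a minimal sketch with a hypothetical helper name (`looks_like_pdf`); it only tests for the `%PDF-` signature at the very start of the file and does not validate the rest of the document.

```python
def looks_like_pdf(path: str) -> bool:
    """Return True if the file begins with the PDF signature '%PDF-'."""
    with open(path, "rb") as f:
        header = f.read(5)
    return header == b"%PDF-"
```

A `False` result usually means the server returned an HTML error or challenge page that was saved under a `.pdf` name, in which case retrying the download with fuller headers is more useful than attempting extraction.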

Step 3: Primary Extraction with pdftotext

pdftotext output.pdf output.txt

# Debian/Ubuntu
apt-get update && apt-get install -y poppler-utils

# macOS
brew install poppler

# RHEL/CentOS
yum install -y poppler-utils

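Step 4: Fallback to PyMuPDF (fitz)

import fitz  # PyMuPDF

# Collect the text of every page, then join once at the end
doc = fitz.open("output.pdf")
pages = [page.get_text() for page in doc]
doc.close()

# Write as UTF-8 so non-ASCII characters survive on any platform
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("".join(pages))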

pip install pymupdf

Step 5: Graceful Degradation to Domain Knowledge

NOTE: Source document [URL] was inaccessible due to [reason].
Content below combines partial extraction with established domain knowledge 
for [topic]. Verify against official sources when available.
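
Filling in the note template programmatically keeps the wording consistent across documents. The helper below is a hypothetical convenience (`degradation_note` is not part of the original skill); it only substitutes the three bracketed placeholders.

```python
def degradation_note(url: str, reason: str, topic: str) -> str:
    """Return the disclosure note with URL, reason, and topic filled in."""
    return (
        f"NOTE: Source document {url} was inaccessible due to {reason}.\n"
        "Content below combines partial extraction with established domain knowledge\n"
        f"for {topic}. Verify against official sources when available."
    )
```

Placing the returned text at the top of the output makes it clear to readers which content did not come from the source document.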

Mode C: PDF Generation with ReportLab

Installation

pip install reportlab

Step C1: Basic Document Structure

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.lib.enums import TA_CENTER

def create_structured_pdf(output_path, title, sections):
    """
    Create a structured PDF with title page and sections.
    
    Args:
        output_path: Path for output PDF
        title: Document title
        sections: List of dicts with 'heading' and 'content' keys
    """
    doc = SimpleDocTemplate(
        output_path,
        pagesize=letter,
        rightMargin=72,
        leftMargin=72,
        topMargin=72,
        bottomMargin=72
    )
    
    styles = getSampleStyleSheet()
    story = []
    
    # Title Page
    title_style = ParagraphStyle(
        'CustomTitle',
        parent=styles['Heading1'],
        fontSize=24,
        alignment=TA_CENTER,
        spaceAfter=30
    )
    story.append(Paragraph(title, title_style))
    story.append(Spacer(1, 2*inch))
    story.append(PageBreak())
    
    # Content Sections
    heading_style = ParagraphStyle(
        'CustomHeading',
        parent=styles['Heading2'],
        fontSize=16,
        spaceBefore=12,
        spaceAfter=6
    )
    body_style = ParagraphStyle(
        'CustomBody',
        parent=styles['Normal'],
        fontSize=11,
        leading=14,
        spaceAfter=12
    )
    
    for section in sections:
        story.append(Paragraph(section['heading'], heading_style))
        # Handle long text by splitting into paragraphs
        for paragraph in section['content'].split('\n\n'):
            if paragraph.strip():
                story.append(Paragraph(paragraph, body_style))
        story.append(Spacer(1, 0.2*inch))
    
    doc.build(story)
    print(f"PDF created: {output_path}")
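
One caveat when feeding extracted text into the function above: ReportLab's `Paragraph` treats its input as XML-like markup, so a stray `&` or `<` in raw text can break rendering. A possible pre-processing step, sketched here with a hypothetical helper name (`to_paragraph_chunks`), splits on blank lines and escapes each chunk using the standard library.

```python
from typing import List
from xml.sax.saxutils import escape


def to_paragraph_chunks(raw_text: str) -> List[str]:
    """Split text on blank lines and escape &, <, > in each non-empty chunk."""
    chunks = []
    for chunk in raw_text.split("\n\n"):
        chunk = chunk.strip()
        if chunk:
            chunks.append(escape(chunk))
    return chunks
```

Each returned chunk can then be passed to `Paragraph(chunk, body_style)`. Text that intentionally contains ReportLab markup such as `<b>` should skip this step, since escaping would show the tags literally.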

Step C2: Adding Tables

from reportlab.platypus import Table, TableStyle
from reportlab.lib import colors

def create_table(data, col_widths=None):
    """
    Create a formatted table for PDF.
    
    Args:
        data: List of lists (rows x columns)
        col_widths: Optional list of column widths
    """
    table = Table(data, colWidths=col_widths)
    table.setStyle(TableStyle([
        # Header row
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'LEFT'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 12),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        # Data rows
        ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
        ('TEXTCOLOR', (0, 1), (-1, -1), colors.black),
        ('FONTNAME', (0, 1), (-1, -1), 'Helvetica'),
        ('FONTSIZE', (0, 1), (-1, -1), 10),
        # Grid
        ('GRID', (0, 0), (-1, -1), 1, colors.black),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]),
    ]))
    return table
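
`create_table` expects a list of lists with the header as the first row, while processed data often arrives as a list of dicts. A small adapter, shown here as an illustrative sketch (`records_to_table_data` is a hypothetical name), bridges the two shapes.

```python
from typing import Any, Dict, List, Sequence


def records_to_table_data(
    records: Sequence[Dict[str, Any]], columns: Sequence[str]
) -> List[List[str]]:
    """Return a header row followed by one row of strings per record.

    Missing keys become empty strings so every row has the same width.
    """
    rows = [list(columns)]
    for record in records:
        rows.append([str(record.get(col, "")) for col in columns])
    return rows
```

The result can be passed straight to `create_table(data)`; converting values to strings up front avoids surprises when cells contain numbers or `None`.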

Step C3: Multi-Section Document Template

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak, Table, TableStyle
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch  # needed for the Spacer heights around the table
from reportlab.lib.enums import TA_CENTER
from reportlab.lib import colors

def create_report_pdf(output_path, title, subtitle, sections, table_data=None):
    """
    Create a complete report PDF with title, sections, and optional tables.
    
    Args:
        output_path: Output PDF path
        title: Main title
        subtitle: Subtitle or date
        sections: List of {'heading': str, 'content': str} dicts
        table_data: Optional list of lists for tables
    """
    doc = SimpleDocTemplate(
        output_path,
        pagesize=letter,
        rightMargin=50,
        leftMargin=50,
        topMargin=50,
        bottomMargin=50
    )
    
    styles = getSampleStyleSheet()
    story = []
    
    # Title Page
    title_style = ParagraphStyle(
        'Title',
        parent=styles['Heading1'],
        fontSize=28,
        alignment=TA_CENTER,
        spaceAfter=20,
        fontName='Helvetica-Bold'
    )
    subtitle_style = ParagraphStyle(
        'Subtitle',
        parent=styles['Normal'],
        fontSize=14,
        alignment=TA_CENTER,
        spaceAfter=50,
        textColor=colors.darkgrey
    )
    
    story.append(Paragraph(title, title_style))
    story.append(Paragraph(subtitle, subtitle_style))
    story.append(PageBreak())
    
    # Content
    heading_style = ParagraphStyle(
        'SectionHeading',
        parent=styles['Heading2'],
        fontSize=16,
        spaceBefore=20,
        spaceAfter=10,
        fontName='Helvetica-Bold',
        textColor=colors.darkblue
    )
    body_style = ParagraphStyle(
        'Body',
        parent=styles['Normal'],
        fontSize=11,
        leading=15,
        spaceAfter=12
    )
    
    for i, section in enumerate(sections):
        story.append(Paragraph(section['heading'], heading_style))
        
        # Split content into paragraphs
        for para in section['content'].split('\n\n'):
            if para.strip():
                # Handle very long paragraphs
                story.append(Paragraph(para, body_style))
        
        # Add table after specific section if provided
        if table_data and i == 0:
            story.append(Spacer(1, 0.3*inch))
            table = Table(table_data)
            table.setStyle(TableStyle([
                ('BACKGROUND', (0, 0), (-1, 0), colors.darkblue),
                ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
                ('ALIGN', (0, 0), (-1, -1), 'LEFT'),
                ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
                ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
                ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]),
            ]))
            story.append(table)
            story.append(Spacer(1, 0.3*inch))
        
        if i < len(sections) - 1:
            story.append(PageBreak())
    
    doc.build(story)
    return output_path

# Example usage
if __name__ == "__main__":
    sections = [
        {
            'heading': 'Section 1: Overview',
            'content': 'This is the first section content...\n\nAdditional paragraph here.'
        },
        {
            'heading': 'Section 2: Details',
            'content': 'Detailed information goes here...'
        }
    ]
    
    table_data = [
        ['Header 1', 'Header 2', 'Header 3'],
        ['Row 1 Col 1', 'Row 1 Col 2', 'Row 1 Col 3'],
        ['Row 2 Col 1', 'Row 2 Col 2', 'Row 2 Col 3'],
    ]
    
    create_report_pdf(
        "output_report.pdf",
        "Report Title",
        "Generated: 2024",
        sections,
        table_data
    )

Step C4: Error Handling for PDF Generation

import os
import time
import traceback

def safe_pdf_generation(output_path, title, sections, subtitle="", max_retries=3):
    """
    Generate PDF with retry logic and error handling.

    Calls create_report_pdf() defined above. Returns True on success.
    After the final failed attempt, writes a plain-text fallback next to
    the intended PDF and returns False.
    """
    for attempt in range(max_retries):
        try:
            # create_report_pdf takes (output_path, title, subtitle, sections)
            create_report_pdf(output_path, title, subtitle, sections)
            # Verify file was created
            if os.path.exists(output_path) and os.path.getsize(output_path) > 0:
                print(f"✓ PDF generated successfully: {output_path}")
                return True
            raise RuntimeError("PDF file empty or not created")
        except Exception as e:
            print(f"Attempt {attempt + 1}/{max_retries} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(1)  # Brief delay before retry
            else:
                print(f"PDF generation failed after {max_retries} attempts")
                print(traceback.format_exc())
                # Fallback: create minimal text file with the same base name
                fallback_path = os.path.splitext(output_path)[0] + ".txt"
                with open(fallback_path, "w", encoding="utf-8") as f:
                    f.write(f"Title: {title}\n\n")
                    for section in sections:
                        f.write(f"{section['heading']}\n{section['content']}\n\n")
    return False

Complete Workflow Script (Handles Download, Extract, and Generate)

#!/bin/bash
# pdf-lifecycle-workflow.sh
# Handles URL downloads, local files, and PDF generation

INPUT="$1"
MODE="${2:-extract}"  # extract, generate, or both
OUTPUT_PDF="downloaded.pdf"
OUTPUT_TXT="extracted.txt"
OUTPUT_REPORT="generated_report.pdf"

if [[ "$MODE" == "generate" ]]; then
    echo "Mode: PDF Generation"
    python3 generate_pdf.py
    exit $?
fi

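if [[ "$INPUT" =~ ^https?:// ]]; then
    # Mode A: URL download (Step 1)
    PDF_URL="$INPUT"
    echo "Downloading PDF from URL..."
    # -f makes curl return a non-zero status on HTTP errors instead of saving the error page
    if ! curl -fL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$PDF_URL"; then
        echo "ERROR: Download failed: $PDF_URL"
        exit 1
    fi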
else
    # Mode B: Local file
    if [ ! -f "$INPUT" ]; then
        echo "ERROR: Local file not found: $INPUT"
        exit 1
    fi
    OUTPUT_PDF="$INPUT"
    echo "Using local file: $INPUT"
fi

# Step 2: Verify file type
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: File is not a valid PDF"
    echo "Attempting fallback extraction anyway..."
fi

# Step 3: Try pdftotext
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
        echo "Extraction successful with pdftotext"
        if [[ "$MODE" == "both" ]]; then
            echo "Proceeding to PDF generation..."
            python3 generate_pdf.py
        fi
        exit 0
    fi
fi

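# Step 4: Fallback to PyMuPDF
echo "Falling back to PyMuPDF..."
# The source is truncated after the two imports. The body below is a
# reconstruction based on the Step 4 snippet and should be read as an assumption.
# Paths are passed as arguments because the quoted heredoc does not expand shell variables.
python3 - "$OUTPUT_PDF" "$OUTPUT_TXT" << 'PYTHON_SCRIPT'
import fitz
import sys

pdf_path, txt_path = sys.argv[1], sys.argv[2]
try:
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    doc.close()
except Exception as exc:
    print(f"PyMuPDF extraction failed: {exc}", file=sys.stderr)
    sys.exit(1)

with open(txt_path, "w", encoding="utf-8") as f:
    f.write(text)
print("Extraction successful with PyMuPDF")
PYTHON_SCRIPT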


Pdf Extract Create Workflow

PDF Extract and Create Workflow

Overview

Entry Point: Determine Your Starting Point

Mode A: Web URL Download

Step 1: Download PDF with Browser User-Agent

Mode B: Local File Processing & Extraction

Step 2: Verify File Type Before Parsing

Step 3: Primary Extraction with pdftotext

Step 4: Fallback to PyMuPDF (fitz)

Step 5: Graceful Degradation to Domain Knowledge

Mode C: PDF Generation with ReportLab

Installation

Step C1: Basic Document Structure

Step C2: Adding Tables

Step C3: Multi-Section Document Template

Step C4: Error Handling for PDF Generation

Step C5: Best Practices for PDF Generation

Complete Workflow Script (Handles Download, Extract, and Generate)
