Skill Profile

Pdf Report Workflow Cli

Name: Pdf Report Workflow Cli
Author: HKUDS

Complete PDF workflow: verify, extract content, assemble reports, and generate output PDFs using command-line tools

HKUDS · 5,421 stars · March 24, 2026

Occupation
Category: Documents

Skill Content

PDF Report Generation with Command-Line Tools

When working with PDFs in minimal/containerized environments where Python libraries may be unavailable, this skill provides a complete end-to-end workflow: verify source PDFs, extract content, assemble structured reports, and generate final PDF output.

When to Use This Skill

Need to create new PDF reports from existing PDF sources
Need to verify, extract, and combine PDF content
Working in environments without Python PDF libraries
Building document assembly pipelines in CI/CD or containers

Complete Workflow Overview

[Source PDFs] → [Verify] → [Extract] → [Assemble] → [Generate Output PDF]

Phase 1: Tool Availability Check

# Core extraction tools (from poppler-utils)
which pdfinfo && echo "✓ pdfinfo available" || echo "✗ pdfinfo missing"
which pdftotext && echo "✓ pdftotext available" || echo "✗ pdftotext missing"

# PDF generation tools (choose based on availability)
which pdftk && echo "✓ pdftk available (PDF merging)" || true
which wkhtmltopdf && echo "✓ wkhtmltopdf available (HTML→PDF)" || true
which pandoc && echo "✓ pandoc available (document conversion)" || true
which enscript && echo "✓ enscript available (text→PS)" || true
which ps2pdf && echo "✓ ps2pdf available (PS→PDF)" || true
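
The same availability check can be scripted when the workflow is driven from Python. This is a minimal sketch using only the standard library; the split into `REQUIRED_TOOLS` and `OPTIONAL_TOOLS` and the helper names are assumptions made for illustration, mirroring the shell checks above.

```python
import shutil

# Tool names mirror the shell checks; the required/optional split is an assumption.
REQUIRED_TOOLS = ("pdfinfo", "pdftotext")
OPTIONAL_TOOLS = ("pdftk", "wkhtmltopdf", "pandoc", "enscript", "ps2pdf")


def check_tools(names):
    """Map each tool name to True if an executable with that name is on PATH."""
    return {name: shutil.which(name) is not None for name in names}


def missing_required(status, required=REQUIRED_TOOLS):
    """Return the required tools that were not found, in declaration order."""
    return [name for name in required if not status.get(name, False)]
```

Calling `missing_required(check_tools(REQUIRED_TOOLS + OPTIONAL_TOOLS))` returns an empty list when both poppler tools are installed, so the caller can decide whether to stop or continue.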

Install Required Tools

# Debian/Ubuntu - Core tools
apt-get update && apt-get install -y poppler-utils

# Optional: PDF generation tools (choose based on needs)
apt-get install -y pdftk          # PDF merging/manipulation
apt-get install -y wkhtmltopdf    # HTML to PDF
apt-get install -y pandoc         # Document conversion
apt-get install -y enscript ghostscript  # Text to PDF pipeline

# RHEL/CentOS/Fedora
yum install -y poppler-utils ghostscript
# or
dnf install -y poppler-utils ghostscript
# pdftk may not be in the default repositories of every release; install it separately if needed

# macOS (with Homebrew)
brew install poppler pdftk-java ghostscript  # Homebrew packages pdftk as pdftk-java
brew install --cask wkhtmltopdf  # command-line tool; the cask may be unavailable in newer Homebrew

Phase 2: Verify Source PDFs

Check Page Count and Metadata

# Verify each source PDF exists and is readable
for pdf in source1.pdf source2.pdf source3.pdf; do
    if [ ! -f "$pdf" ]; then
        echo "ERROR: $pdf not found"
        exit 1
    fi
    
    # Match only the "Pages:" field so a title containing "Pages" is not picked up
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    echo "$pdf: $pages pages"
done
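
`pdfinfo` prints one `Key: value` pair per line, so the page count can also be read by parsing that output directly. The sketch below assumes this line format; `SAMPLE` is an illustrative string, not output captured from a real file, and the function names are hypothetical.

```python
def parse_pdfinfo(output):
    """Parse pdfinfo text output ("Key:   value" per line) into a dict."""
    metadata = {}
    for line in output.splitlines():
        # Split on the first colon only, since values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


def page_count(output):
    """Return the page count as an int, or None if no valid Pages field exists."""
    pages = parse_pdfinfo(output).get("Pages", "")
    return int(pages) if pages.isdigit() else None


# Illustrative sample only (assumed layout of pdfinfo output).
SAMPLE = (
    "Title:          Case Guide\n"
    "CreationDate:   Tue Mar 24 10:15:00 2026\n"
    "Pages:          12\n"
)
```

Returning `None` instead of raising lets the caller treat a missing or malformed `Pages` field as a verification failure.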

Validate Content Presence

# Check that required sections exist in source documents
REQUIRED_TERMS=("checklist" "summary" "references")

for term in "${REQUIRED_TERMS[@]}"; do
    if pdftotext source.pdf - | grep -qi "$term"; then
        echo "✓ Found: $term"
    else
        echo "⚠ Missing: $term"
    fi
done
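
When the extracted text is already in memory, the same presence check avoids re-running `pdftotext` once per term. A minimal sketch, assuming a case-insensitive substring match is what the `grep -qi` loop intends; `find_missing_terms` is a hypothetical helper name.

```python
def find_missing_terms(text, required_terms):
    """Return the required terms that do not occur in text, ignoring case."""
    haystack = text.casefold()
    return [term for term in required_terms if term.casefold() not in haystack]
```

For example, `find_missing_terms(extracted, ["checklist", "summary", "references"])` returns only the terms that need a warning, in the order they were given.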

Phase 3: Extract Content from Source PDFs

Extract Text to Temporary Files

# Create working directory
WORK_DIR=$(mktemp -d)
echo "Working directory: $WORK_DIR"

# Extract text from each source PDF
for i in source1.pdf source2.pdf source3.pdf; do
    base=$(basename "$i" .pdf)
    pdftotext -layout "$i" "$WORK_DIR/${base}.txt"
    echo "Extracted: $i → ${base}.txt"
done

# Also extract metadata for reference
pdfinfo source1.pdf > "$WORK_DIR/source1_metadata.txt"

Extract Specific Sections (Optional)

# Extract only specific page ranges
pdftotext -f 1 -l 3 source1.pdf "$WORK_DIR/source1_pages1-3.txt"

# Extract and filter for specific content
pdftotext source1.pdf - | grep -A 10 "Summary" > "$WORK_DIR/summary_section.txt"
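
Page ranges can also be selected after a single full extraction, because `pdftotext` ends each page with a form feed character (`\f`) unless `-nopgbrk` is passed. The sketch below relies on that behaviour; `split_pages` and `page_range` are hypothetical helper names.

```python
def split_pages(text):
    """Split pdftotext output into pages using the form feed page separator."""
    pages = text.split("\f")
    # The form feed after the final page leaves an empty trailing element.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def page_range(text, first, last):
    """Return pages first..last (1-based, inclusive), like pdftotext -f/-l."""
    return split_pages(text)[first - 1:last]
```

This trades one extra pass over the text for not having to invoke `pdftotext` once per range.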

Phase 4: Assemble Report Content

Create Structured Report (Markdown Format)

REPORT_MD="$WORK_DIR/report.md"

# Unquoted delimiter so the $(date ...) substitution below is expanded
cat > "$REPORT_MD" << REPORT_HEADER
# New Case Creation Report

**Generated:** $(date '+%Y-%m-%d %H:%M:%S')

---

## Executive Summary

REPORT_HEADER

# Add content from extracted sources
echo "## Case Details" >> "$REPORT_MD"
echo "" >> "$REPORT_MD"
cat "$WORK_DIR/source1.txt" >> "$REPORT_MD"
echo "" >> "$REPORT_MD"

echo "## Supporting Documentation" >> "$REPORT_MD"
echo "" >> "$REPORT_MD"
cat "$WORK_DIR/source2.txt" >> "$REPORT_MD"
echo "" >> "$REPORT_MD"

echo "## References" >> "$REPORT_MD"
echo "" >> "$REPORT_MD"
cat "$WORK_DIR/source3.txt" >> "$REPORT_MD"

echo "Report assembled: $REPORT_MD"
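
Raw `pdftotext -layout` output may contain characters that a Markdown converter interprets as markup, and its column alignment is lost in a proportional font. One way to handle both, sketched here as an assumption rather than as part of the original workflow, is to wrap each extracted text in a fenced block before appending it to the report. `as_fenced_block` is a hypothetical helper.

```python
import re


def as_fenced_block(text):
    """Wrap extracted text in a Markdown fence longer than any backtick run in it."""
    # Form feeds mark page breaks in pdftotext output; turn them into newlines.
    cleaned = text.replace("\f", "\n").rstrip("\n")
    longest = max((len(run) for run in re.findall(r"`+", cleaned)), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}\n{cleaned}\n{fence}\n"
```

Because the fence is always longer than the longest backtick run in the text, the extracted content cannot close the block early.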

Create HTML Report (Alternative for Better Formatting)

REPORT_HTML="$WORK_DIR/report.html"

cat > "$REPORT_HTML" << 'HTML_HEADER'
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Case Creation Report</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 40px; }
        h1 { color: #333; border-bottom: 2px solid #333; }
        h2 { color: #666; }
        .section { margin: 20px 0; }
        .metadata { font-size: 0.9em; color: #888; }
    </style>
</head>
<body>
<h1>New Case Creation Report</h1>
<p class="metadata">Generated:
HTML_HEADER

date '+%Y-%m-%d %H:%M:%S' >> "$REPORT_HTML"

cat >> "$REPORT_HTML" << 'HTML_MIDDLE'
</p>

<div class="section">
<h2>Case Details</h2>
HTML_MIDDLE

# Convert text to HTML paragraphs (simple approach); escape &, <, > first
sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' \
    -e 's/^/<p>/' -e 's/$/<\/p>/' "$WORK_DIR/source1.txt" >> "$REPORT_HTML"

cat >> "$REPORT_HTML" << 'HTML_END'
</div>
</body>
</html>
HTML_END

echo "HTML report created: $REPORT_HTML"
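
The text-to-paragraph step can also be done with the standard library's `html.escape`, which keeps characters such as `&` and `<` in the extracted text from being read as markup. A minimal sketch, assuming blank lines and page breaks separate paragraphs; `text_to_html_paragraphs` is a hypothetical helper.

```python
import html


def text_to_html_paragraphs(text):
    """Escape extracted text and wrap each blank-line-separated block in <p> tags."""
    # Treat form feeds (page breaks) as paragraph boundaries too.
    blocks = [block.strip() for block in text.replace("\f", "\n\n").split("\n\n")]
    return "\n".join(f"<p>{html.escape(block)}</p>" for block in blocks if block)
```

Grouping by blank lines yields one paragraph per text block instead of one per line, which usually reads better once rendered.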

Phase 5: Generate Final PDF Output

Option A: Using pandoc (Recommended if available)

# Markdown to PDF
pandoc "$REPORT_MD" -o final_report.pdf \
    --pdf-engine=xelatex \
    -V geometry:margin=1in

# Or HTML to PDF (pandoc delegates rendering to an HTML-based engine)
pandoc "$REPORT_HTML" -o final_report.pdf \
    --pdf-engine=wkhtmltopdf  # or weasyprint

Option B: Using wkhtmltopdf (HTML to PDF)

wkhtmltopdf \
    --page-size A4 \
    --margin-top 20mm \
    --margin-bottom 20mm \
    --margin-left 15mm \
    --margin-right 15mm \
    "$REPORT_HTML" \
    final_report.pdf

Option C: Using enscript + ps2pdf (Text to PDF)

# Text to PostScript, then to PDF
enscript \
    --media=A4 \
    --font=Courier10 \
    --margins=20:20:20:20 \
    -o "$WORK_DIR/report.ps" \
    "$REPORT_MD"

# PostScript to PDF
ps2pdf "$WORK_DIR/report.ps" final_report.pdf

Option D: Merge Existing PDFs with pdftk

# If you have multiple PDFs to combine (not convert)
pdftk source1.pdf source2.pdf source3.pdf cat output combined_report.pdf

# Add bookmarks/outline
pdftk source1.pdf dump_data > "$WORK_DIR/bookmarks.txt"
# Edit bookmarks.txt, then:
pdftk source1.pdf update_info "$WORK_DIR/bookmarks.txt" output final_report.pdf
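
Choosing among the generation routes in this phase can be reduced to a small lookup over the tools that are installed. The sketch below is one possible ordering, assuming pandoc is preferred, then wkhtmltopdf, then the enscript and ps2pdf pair; the `PREFERENCE` table and `choose_generator` are hypothetical names. pdftk is left out because it merges existing PDFs rather than generating new content.

```python
# (label, tools that must all be present); the ordering is an assumption.
PREFERENCE = (
    ("pandoc", ("pandoc",)),
    ("wkhtmltopdf", ("wkhtmltopdf",)),
    ("enscript+ps2pdf", ("enscript", "ps2pdf")),
)


def choose_generator(available):
    """Return the label of the first route whose tools are all available, else None."""
    for label, needed in PREFERENCE:
        if all(tool in available for tool in needed):
            return label
    return None
```

The `available` set can be built with `shutil.which`. A pandoc result still depends on a PDF engine such as xelatex being installed, so a caller may want to fall through to the next route if pandoc fails.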

Complete End-to-End Example

#!/bin/bash
# pdf-report-generator.sh - Complete PDF report generation workflow

# Exit on errors, unset variables, and failures anywhere in a pipeline
set -euo pipefail

# Configuration
SOURCE_PDFS=("case_guide.pdf" "case_summary.pdf" "test_results.pdf")
OUTPUT_PDF="new_case_report.pdf"
WORK_DIR=$(mktemp -d)

echo "=== PDF Report Generation Workflow ==="
echo "Working directory: $WORK_DIR"

# Phase 1: Verify tools
echo -e "\n[Phase 1] Checking tools..."
for tool in pdfinfo pdftotext; do
    if ! command -v "$tool" &> /dev/null; then
        echo "ERROR: $tool not found. Install poppler-utils."
        exit 1
    fi
done

# Phase 2: Verify source PDFs
echo -e "\n[Phase 2] Verifying source PDFs..."
for pdf in "${SOURCE_PDFS[@]}"; do
    if [ ! -f "$pdf" ]; then
        echo "ERROR: Source PDF not found: $pdf"
        exit 1
    fi
    # Match only the "Pages:" field so a title containing "Pages" is not picked up
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    echo "✓ $pdf: $pages pages"
done

# Phase 3: Extract content
echo -e "\n[Phase 3] Extracting content..."
for i in "${!SOURCE_PDFS[@]}"; do
    pdf="${SOURCE_PDFS[$i]}"
    base=$(basename "$pdf" .pdf)
    pdftotext -layout "$pdf" "$WORK_DIR/${base}.txt"
    echo "✓ Extracted: $pdf"
done

# Phase 4: Assemble report
echo -e "\n[Phase 4] Assembling report..."
cat > "$WORK_DIR/report.md" << EOF
# New Case Creation Report

**Generated:** $(date '+%Y-%m-%d %H:%M:%S')

---

EOF

section_num=1
for pdf in "${SOURCE_PDFS[@]}"; do
    base=$(basename "$pdf" .pdf)
    cat >> "$WORK_DIR/report.md" << EOF

## Section $section_num: ${base//_/ }

EOF
    cat "$WORK_DIR/${base}.txt" >> "$WORK_DIR/report.md"
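    # Plain assignment always succeeds; ((var++)) returns non-zero when the old value is 0
    section_num=$((section_num + 1))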
done

echo "✓ Report assembled: $WORK_DIR/report.md"

# Phase 5: Generate PDF
echo -e "\n[Phase 5] Generating PDF..."
if command -v pandoc &> /dev/null; then
    pandoc "$WORK_DIR/report.md" -o "$OUTPUT_PDF" --pdf-engine=xelatex
    echo "✓ Generated with pandoc: $OUTPUT_PDF"
elif command -v enscript &> /dev/null; then
    enscript -o "$WORK_DIR/report.ps" "$WORK_DIR/report.md"
    ps2pdf "$WORK_DIR/report.ps" "$OUTPUT_PDF"
    echo "✓ Generated with enscript+ps2pdf: $OUTPUT_PDF"
else
    echo "⚠ No PDF generator available. Report saved as Markdown: $WORK_DIR/report.md"
    cp "$WORK_DIR/report.md" "./${OUTPUT_PDF%.pdf}.md"
fi

# Cleanup
echo -e "\n[Cleanup]"
echo "Working files preserved in: $WORK_DIR"
echo "=== Complete ==="

Python Integration Example

import shutil
import subprocess
import tempfile
from datetime import datetime
from pathlib import Path

class PDFReportGenerator:
    """Complete PDF workflow: verify, extract, assemble, generate"""

    def __init__(self, work_dir=None):
        self.work_dir = Path(work_dir) if work_dir else Path(tempfile.mkdtemp())
        self.extracted_files = []

    def verify_pdf(self, pdf_path):
        """Verify PDF exists and get metadata"""
        result = subprocess.run(
            ['pdfinfo', str(pdf_path)],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            raise ValueError(f"Cannot read PDF: {pdf_path}")

        # pdfinfo prints one "Key: value" pair per line
        metadata = {}
        for line in result.stdout.split('\n'):
            if ':' in line:
                key, val = line.split(':', 1)
                metadata[key.strip()] = val.strip()
        return metadata

    def extract_pdf(self, pdf_path, output_name=None):
        """Extract text from PDF"""
        if output_name is None:
            output_name = Path(pdf_path).stem + '.txt'

        output_path = self.work_dir / output_name
        subprocess.run(
            ['pdftotext', '-layout', str(pdf_path), str(output_path)],
            check=True
        )
        self.extracted_files.append(output_path)
        # pdftotext writes UTF-8 by default
        return output_path.read_text(encoding='utf-8')

    def assemble_report(self, sections, output_md=None):
        """Assemble extracted content into markdown report"""
        if output_md is None:
            output_md = self.work_dir / 'report.md'

        # Use datetime rather than shelling out to `date`
        created = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        with open(output_md, 'w', encoding='utf-8') as f:
            f.write("# Generated Report\n\n")
            f.write(f"**Created:** {created}\n\n---\n\n")

            for i, (title, content) in enumerate(sections, 1):
                f.write(f"## Section {i}: {title}\n\n{content}\n\n")

        return output_md

    def generate_pdf(self, input_file, output_pdf=None):
        """Generate PDF from markdown or HTML"""
        if output_pdf is None:
            output_pdf = self.work_dir / 'report.pdf'

        # Try pandoc first
        if self._command_exists('pandoc'):
            subprocess.run(
                ['pandoc', str(input_file), '-o', str(output_pdf), '--pdf-engine=xelatex'],
                check=True
            )
        # Fall back to enscript + ps2pdf (both are needed)
        elif self._command_exists('enscript') and self._command_exists('ps2pdf'):
            ps_file = self.work_dir / 'report.ps'
            subprocess.run(['enscript', '-o', str(ps_file), str(input_file)], check=True)
            subprocess.run(['ps2pdf', str(ps_file), str(output_pdf)], check=True)
        else:
            raise RuntimeError("No PDF generator available (need pandoc, or enscript with ps2pdf)")

        return output_pdf

    def _command_exists(self, cmd):
        # shutil.which works even where the `which` binary is not installed
        return shutil.which(cmd) is not None
    
    def full_workflow(self, source_pdfs, output_pdf):
        """Complete workflow: verify → extract → assemble → generate"""
        # Verify and extract
        sections = []
        for pdf in source_pdfs:
            self.verify_pdf(pdf)  # raises ValueError if the PDF cannot be read
            content = self.extract_pdf(pdf)
            sections.append((Path(pdf).stem, content))
        
        # Assemble and generate
        report_md = self.assemble_report(sections)
        self.generate_pdf(report_md, output_pdf)
        
        return Path(output_pdf)

# Usage example (guarded so that importing this module does not run the workflow)
if __name__ == "__main__":
    generator = PDFReportGenerator()
    final_pdf = generator.full_workflow(
        source_pdfs=['guide.pdf', 'summary.pdf', 'results.pdf'],
        output_pdf='final_report.pdf'
    )
    print(f"Report generated: {final_pdf}")

Tool Comparison

Tool	Purpose	Pros	Cons
pdfinfo	Metadata extraction	Fast, reliable	Metadata only
pdftotext	Text extraction	Preserves structure	Struggles with complex layouts
pandoc	Document conversion	Best quality, flexible	Requires LaTeX for PDF
wkhtmltopdf	HTML→PDF	Great styling support	Larger dependency
enscript	Text→PS	Minimal dependencies	Basic formatting
pdftk	PDF manipulation	Powerful merging	Not for content generation

Pdf Report Workflow Cli

PDF Report Generation with Command-Line Tools

When to Use This Skill

Complete Workflow Overview

Phase 1: Tool Availability Check

Install Required Tools

Phase 2: Verify Source PDFs

Check Page Count and Metadata

Validate Content Presence

Phase 3: Extract Content from Source PDFs

Extract Text to Temporary Files

Extract Specific Sections (Optional)

Phase 4: Assemble Report Content

Create Structured Report (Markdown Format)

Create HTML Report (Alternative for Better Formatting)

Phase 5: Generate Final PDF Output

Option A: Using pandoc (Recommended if available)

Option B: Using wkhtmltopdf (HTML to PDF)

Option C: Using enscript + ps2pdf (Text to PDF)

Option D: Merge Existing PDFs with pdftk

Complete End-to-End Example

Python Integration Example

Troubleshooting

Best Practices

Tool Comparison
