Complete PDF workflow: extract content from source PDFs and generate new PDF reports using command-line tools
This skill provides a complete end-to-end workflow for processing PDF documents: extracting content from source PDFs, assembling report data, and generating new PDF output files—all using command-line tools when Python libraries are unavailable.
pdfinfo - Extract PDF Metadata# Get full PDF info
pdfinfo document.pdf
# Get only page count
pdfinfo document.pdf | grep Pages
# Extract page count as a number
pdfinfo document.pdf | grep Pages | awk '{print $2}'
Key metadata fields:
Pages: Number of pages in the PDFTitle: Document titleAuthor: Document authorCreationDate: When the PDF was createdModDate: Last modification datepdftotext - Extract Text Content# Extract all text to stdout
pdftotext document.pdf -
# Extract text to a file
pdftotext document.pdf output.txt
# Extract text from specific page range
pdftotext -f 1 -l 3 document.pdf output.txt
# Preserve layout (rough formatting)
pdftotext -layout document.pdf output.txt
enscript + ps2pdf - Text to PDF (Recommended)Convert plain text to PDF via PostScript:
# Install if needed
apt-get install -y enscript ghostscript
# Convert text file to PDF
enscript -B -o output.ps input.txt && ps2pdf output.ps output.pdf
# One-liner
enscript -B input.txt -o - | ps2pdf - output.pdf
Options:
-B: No borders-o: Output file (- for stdout)-f: Font specification (e.g., -fCourier10)wkhtmltopdf - HTML to PDFConvert HTML to PDF with full formatting support:
# Install if needed
apt-get install -y wkhtmltopdf
# Convert HTML file to PDF
wkhtmltopdf input.html output.pdf
# Convert from stdin
echo "<html><body><h1>Report</h1></body></html>" | wkhtmltopdf - output.pdf
# With options for better quality
wkhtmltopdf --page-size A4 --margin-top 25mm input.html output.pdf
libreoffice - Document ConversionConvert various document formats to PDF:
# Install if needed
apt-get install -y libreoffice-writer
# Convert to PDF (headless mode)
libreoffice --headless --convert-to pdf input.docx
libreoffice --headless --convert-to pdf input.odt
libreoffice --headless --convert-to pdf input.txt
pandoc - Universal Document ConverterConvert between many formats including PDF:
# Install if needed
apt-get install -y pandoc texlive-latex-base
# Convert markdown to PDF
pandoc input.md -o output.pdf
# Convert text to PDF
pandoc input.txt -o output.pdf
# With custom template
pandoc input.md --template=template.tex -o output.pdf
pdftk - PDF ManipulationMerge, split, or modify existing PDFs:
# Install if needed
apt-get install -y pdftk
# Merge multiple PDFs
pdftk file1.pdf file2.pdf file3.pdf cat output merged.pdf
# Extract pages
pdftk input.pdf cat 1-3 output extracted.pdf
# Split into individual pages
pdftk input.pdf burst
# Check extraction tools
which pdfinfo || echo "pdfinfo not found"
which pdftotext || echo "pdftotext not found"
# Check generation tools (at least one should be available)
which enscript || echo "enscript not found"
which wkhtmltopdf || echo "wkhtmltopdf not found"
which libreoffice || echo "libreoffice not found"
which pandoc || echo "pandoc not found"
# Debian/Ubuntu - Full installation
apt-get update && apt-get install -y poppler-utils enscript ghostscript
# Or install wkhtmltopdf instead
apt-get install -y poppler-utils wkhtmltopdf
# Or install pandoc for markdown-based reports
apt-get install -y poppler-utils pandoc texlive-latex-base
# Create working directory
mkdir -p workdir/extracted
cd workdir
# Extract text from each source PDF
for pdf in ../source_pdfs/*.pdf; do
filename=$(basename "$pdf" .pdf)
pdftotext -layout "$pdf" "extracted/${filename}.txt"
echo "Extracted: $filename"
done
# Verify extraction
for txt in extracted/*.txt; do
lines=$(wc -l < "$txt")
echo "$txt: $lines lines"
done
# Create report from extracted content
cat > final_report.txt << 'EOF'
===========================================
NEW CASE CREATION REPORT
Generated: $(date)
===========================================
EOF
# Add sections from each source
echo "SECTION 1: CASE CREATION GUIDE" >> final_report.txt
echo "-------------------------------------------" >> final_report.txt
cat extracted/case_creation_guide.txt >> final_report.txt
echo "" >> final_report.txt
echo "SECTION 2: CASE DETAIL SUMMARY" >> final_report.txt
echo "-------------------------------------------" >> final_report.txt
cat extracted/case_detail_summary.txt >> final_report.txt
echo "" >> final_report.txt
echo "SECTION 3: PATERNITY TEST RESULTS" >> final_report.txt
echo "-------------------------------------------" >> final_report.txt
cat extracted/paternity_test_results.txt >> final_report.txt
echo "" >> final_report.txt
echo "SECTION 4: ORDER OF CHILD SUPPORT" >> final_report.txt
echo "-------------------------------------------" >> final_report.txt
cat extracted/order_of_child_support.txt >> final_report.txt
echo "" >> final_report.txt
echo "===========================================" >> final_report.txt
echo "END OF REPORT" >> final_report.txt
echo "===========================================" >> final_report.txt
Option A: Using enscript + ps2pdf
# Convert text to PDF
enscript -B -fCourier10 -o report.ps final_report.txt && ps2pdf report.ps final_report.pdf
# Or one-liner
enscript -B final_report.txt -o - | ps2pdf - final_report.pdf
# Verify output
pdfinfo final_report.pdf | grep Pages
Option B: Using wkhtmltopdf (with HTML formatting)
# Convert text to simple HTML
cat > final_report.html << 'EOF'
<!DOCTYPE html>
<html>
<head>
<style>
body { font-family: monospace; margin: 40px; }
h1 { text-align: center; }
.section { margin-top: 30px; }
</style>
</head>
<body>
EOF
# Add content (escape HTML special chars if needed)
sed 's/&/\&/g; s/</\</g; s/>/\>/g' final_report.txt | \
sed 's/^=\{30,\}/<h1>/; s/$/<\/h1>/; /^<h1>/!s/^/--<br>/; /^----/!s/$/<br>/' >> final_report.html
echo "</body></html>" >> final_report.html
# Convert to PDF
wkhtmltopdf --page-size A4 --margin-top 25mm final_report.html final_report.pdf
Option C: Using pandoc (markdown format)
# Create markdown version
cat > final_report.md << 'EOF'
# New Case Creation Report
*Generated: $(date)*
---
EOF
# Add sections with markdown formatting
for txt in extracted/*.txt; do
filename=$(basename "$txt" .txt)
echo "## $filename" >> final_report.md
echo "" >> final_report.md
cat "$txt" >> final_report.md
echo "" >> final_report.md
done
# Convert to PDF
pandoc final_report.md -o final_report.pdf
# Check PDF was created
if [ -f final_report.pdf ]; then
echo "✓ PDF created successfully"
# Verify page count
pages=$(pdfinfo final_report.pdf | grep Pages | awk '{print $2}')
echo " Pages: $pages"
# Verify file size
size=$(ls -lh final_report.pdf | awk '{print $5}')
echo " Size: $size"
# Verify content
if pdftotext final_report.pdf - | grep -q "CASE CREATION REPORT"; then
echo "✓ Content verified"
else
echo "✗ Content verification failed"
fi
else
echo "✗ PDF generation failed"
exit 1
fi
import subprocess
import os
from datetime import datetime
class PDFReportGenerator:
def __init__(self, workdir="workdir"):
self.workdir = workdir
os.makedirs(workdir, exist_ok=True)
os.makedirs(f"{workdir}/extracted", exist_ok=True)
def check_tools(self):
"""Check available tools and return best option"""
tools = {}
for tool in ['pdfinfo', 'pdftotext', 'enscript', 'ps2pdf',
'wkhtmltopdf', 'pandoc']:
result = subprocess.run(['which', tool],
capture_output=True, text=True)
tools[tool] = result.returncode == 0
return tools
def extract_pdf(self, pdf_path, output_txt=None):
"""Extract text from PDF"""
if output_txt is None:
output_txt = f"{self.workdir}/extracted/{os.path.basename(pdf_path).replace('.pdf', '.txt')}"
result = subprocess.run(
['pdftotext', '-layout', pdf_path, output_txt],
capture_output=True, text=True
)
if result.returncode != 0:
raise Exception(f"Extraction failed: {result.stderr}")
return output_txt
def generate_report_pdf(self, text_content, output_pdf):
"""Generate PDF from text content using best available tool"""
tools = self.check_tools()
# Write content to temp file
temp_txt = f"{self.workdir}/temp_report.txt"
with open(temp_txt, 'w') as f:
f.write(text_content)
if tools.get('enscript') and tools.get('ps2pdf'):
# Use enscript + ps2pdf
temp_ps = f"{self.workdir}/temp_report.ps"
subprocess.run(['enscript', '-B', '-fCourier10', '-o', temp_ps, temp_txt], check=True)
subprocess.run(['ps2pdf', temp_ps, output_pdf], check=True)
return output_pdf
elif tools.get('pandoc'):
# Use pandoc
temp_md = f"{self.workdir}/temp_report.md"
with open(temp_md, 'w') as f:
f.write(f"# Report\n\nGenerated: {datetime.now()}\n\n---\n\n")
f.write(text_content)
subprocess.run(['pandoc', temp_md, '-o', output_pdf], check=True)
return output_pdf
elif tools.get('wkhtmltopdf'):
# Use wkhtmltopdf
temp_html = f"{self.workdir}/temp_report.html"
with open(temp_html, 'w') as f:
f.write(f"""<!DOCTYPE html>
<html><head><style>body{{font-family:monospace;margin:40px;}}</style></head>
<body><pre>{text_content}</pre></body></html>""")
subprocess.run(['wkhtmltopdf', temp_html, output_pdf], check=True)
return output_pdf
else:
raise Exception("No PDF generation tools available")
def get_pdf_info(self, pdf_path):
"""Get PDF metadata"""
result = subprocess.run(['pdfinfo', pdf_path],
capture_output=True, text=True)
info = {}
for line in result.stdout.split('\n'):
if ':' in line:
key, value = line.split(':', 1)
info[key.strip()] = value.strip()
return info
def create_report_from_pdfs(self, source_pdfs, output_pdf, report_title="Report"):
"""Complete workflow: extract from multiple PDFs and create report"""
extracted_texts = []
# Extract from each source
for pdf in source_pdfs:
txt = self.extract_pdf(pdf)
with open(txt, 'r') as f:
content = f.read()
extracted_texts.append((os.path.basename(pdf), content))
# Assemble report
report_content = f"""{'='*50}
{report_title}
Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
{'='*50}
"""
for filename, content in extracted_texts:
report_content += f"\n{'-'*50}\n"
report_content += f"SOURCE: {filename}\n"
report_content += f"{'-'*50}\n\n"
report_content += content + "\n"
report_content += f"\n{'='*50}\nEND OF REPORT\n{'='*50}\n"
# Generate PDF
return self.generate_report_pdf(report_content, output_pdf)
# Usage example
if __name__ == "__main__":
generator = PDFReportGenerator()
source_pdfs = [
"source_pdfs/case_creation_guide.pdf",
"source_pdfs/case_detail_summary.pdf",
"source_pdfs/paternity_test_results.pdf",
"source_pdfs/order_of_child_support.pdf"
]
output = generator.create_report_from_pdfs(
source_pdfs,
"final_report.pdf",
"NEW CASE CREATION REPORT"
)
print(f"Report created: {output}")
print(f"Pages: {generator.get_pdf_info(output).get('Pages', 'Unknown')}")
| Task | Command Sequence |
|---|---|
| Extract single PDF | pdftotext -layout file.pdf output.txt |
| Extract multiple PDFs | for f in *.pdf; do pdftotext -layout "$f" "${f%.pdf}.txt"; done |
| Text to PDF (enscript) | enscript -B input.txt -o - | ps2pdf - output.pdf |
| Text to PDF (pandoc) | pandoc input.md -o output.pdf |
| Merge PDFs | pdftk file1.pdf file2.pdf cat output merged.pdf |
| Verify PDF | pdfinfo file.pdf | grep Pages |
| Check PDF content | pdftotext file.pdf - | grep -i "keyword" |
No PDF generation tools available
apt-get install enscript ghostscript (simplest)apt-get install pandoc texlive-latex-base (best formatting)apt-get install wkhtmltopdf (HTML support)enscript produces garbled output
file input.txt-r option for raw output-fCourier10 for fixed-width fontps2pdf produces large files
ps2pdf -dPDFSETTINGS=/ebook input.ps output.pdfps2pdf -dPDFSETTINGS=/screen input.ps output.pdf (smaller, lower quality)pdftotext returns empty output
pdftotext -layout for better extractionReport formatting looks poor
pandoc with markdown for better formattingwkhtmltopdf with HTML/CSS for full control-layout flag to pdftotext to preserve structurepdftotext -layout for better structure#!/bin/bash
# Quick PDF Report Generation Script
set -e
# Configuration
SOURCE_DIR="${1:-source_pdfs}"
OUTPUT_PDF="${2:-final_report.pdf}"
WORKDIR="workdir_$$"
# Setup
mkdir -p "$WORKDIR/extracted"
trap "rm -rf $WORKDIR" EXIT
# Check tools
for tool in pdfinfo pdftotext; do
if ! which "$tool" &>/dev/null; then
echo "ERROR: $tool not found. Install poppler-utils."
exit 1
fi
done
# Extract all source PDFs
echo "Extracting PDFs from $SOURCE_DIR..."
for pdf in "$SOURCE_DIR"/*.pdf; do
[ -f "$pdf" ] || continue
name=$(basename "$pdf" .pdf)
pdftotext -layout "$pdf" "$WORKDIR/extracted/${name}.txt"
echo " Extracted: $name"
done
# Assemble report
echo "Assembling report..."
{
echo "========================================"
echo "REPORT GENERATED: $(date)"
echo "========================================"
echo ""
for txt in "$WORKDIR/extracted"/*.txt; do
[ -f "$txt" ] || continue
name=$(basename "$txt" .txt)
echo "=== $name ==="
cat "$txt"
echo ""
done
echo "========================================"
echo "END OF REPORT"
} > "$WORKDIR/report.txt"
# Generate PDF
echo "Generating PDF..."
if which enscript &>/dev/null && which ps2pdf &>/dev/null; then
enscript -B "$WORKDIR/report.txt" -o - | ps2pdf - "$OUTPUT_PDF"
elif which pandoc &>/dev/null; then
pandoc "$WORKDIR/report.txt" -o "$OUTPUT_PDF"
else
echo "ERROR: No PDF generation tool available"
exit 1
fi
# Verify
echo "Verifying output..."
pages=$(pdfinfo "$OUTPUT_PDF" | grep Pages | awk '{print $2}')
echo "✓ Report created: $OUTPUT_PDF ($pages pages)"