Document manipulation toolkit for DOCX, PDF, PPTX, and XLSX files. Create, edit, extract, and convert documents programmatically.
Comprehensive toolkit for creating, editing, and manipulating documents across multiple formats including Word (DOCX), PDF, PowerPoint (PPTX), and Excel (XLSX). Use this agent for professional document processing, text extraction, tracked changes, and content manipulation.
Use this agent when:
A .docx file is a ZIP archive containing XML files and resources. Create, edit, or analyze Word documents using text extraction, raw XML access, or redlining workflows.
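Because a .docx file is an ordinary ZIP archive, its parts can be inspected with nothing beyond the Python standard library. The following is a minimal sketch, not part of the toolkit's own scripts; the helper names `list_docx_parts` and `read_docx_part` are hypothetical and chosen only for illustration.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all parts stored in a .docx (ZIP) archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_docx_part(path, part_name="word/document.xml"):
    """Return the raw XML of one part, decoded as UTF-8 text."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part_name).decode("utf-8")
```

Calling `list_docx_parts("path-to-file.docx")` on a typical document should include entries such as `word/document.xml`; the exact set of parts depends on the file.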
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
# Unpack a file
python ooxml/scripts/unpack.py <office_file> <output_directory>
Key file structures:
- word/document.xml - Main document contents
- word/comments.xml - Comments referenced in document.xml
- word/media/ - Embedded images and media files
- Tracked changes use <w:ins> (insertions) and <w:del> (deletions) tags

Use docx-js for creating documents from scratch:
- See docx-js.md for detailed syntax and examples

Use the Document library (Python) for editing:
- See ooxml.md for the Document library API
- python ooxml/scripts/unpack.py <office_file> <output_directory>
- python ooxml/scripts/pack.py <input_directory> <office_file>

CRITICAL: For complete tracked changes, implement ALL changes systematically.
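The `<w:ins>` and `<w:del>` tags listed under the key file structures can be counted to check how many tracked changes a document carries. This is a rough sketch using the standard library parser rather than the Document library; the function name `count_tracked_changes` is an assumption made for illustration, and for untrusted files the `defusedxml` package could be swapped in for `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the "w:" prefix in WordprocessingML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tracked_changes(document_xml):
    """Count <w:ins> and <w:del> elements in the text of word/document.xml."""
    root = ET.fromstring(document_xml)
    insertions = root.findall(f".//{{{W_NS}}}ins")
    deletions = root.findall(f".//{{{W_NS}}}del")
    return {"insertions": len(insertions), "deletions": len(deletions)}
```

The input is the XML text of `word/document.xml` from an unpacked document. The counts cover only these two tags, so other revision markup is not reflected.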
Batching Strategy: Group related changes into batches of 3-10 changes.
Principle: Minimal, Precise Edits
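One way to apply the batching strategy above is to sort the planned changes so that related ones sit next to each other, then slice them into fixed-size batches. This is only a sketch under assumptions: each change is taken to be a dict with a hypothetical `section` key, and "related" is interpreted as "in the same section".

```python
def batch_changes(changes, batch_size=5):
    """Group planned changes by section, then split them into batches."""
    if not 3 <= batch_size <= 10:
        raise ValueError("batch_size should stay within the suggested 3-10 range")
    # Sorting is stable, so changes within a section keep their original order.
    ordered = sorted(changes, key=lambda change: change["section"])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

The final batch may hold fewer than three changes when the total does not divide evenly; it could be merged into the previous batch if that matters for review.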
Workflow:
1. pandoc --track-changes=all path-to-file.docx -o current.md
2. Read ooxml.md and unpack the document
3. python ooxml/scripts/pack.py unpacked reviewed-document.docx
4. pandoc --track-changes=all reviewed-document.docx -o verification.md

# Convert DOCX to PDF
soffice --headless --convert-to pdf document.docx
# Convert PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page
from pypdf import PdfReader, PdfWriter
# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as output:
    writer.write(output)
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    # Write each page to its own single-page PDF
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.save()
# Extract text
pdftotext input.pdf output.txt
# Merge with qpdf
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
# Extract images
pdfimages -j input.pdf output_prefix
.pptx files are ZIP archives containing XML files for slides, layouts, themes, and media.
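Since a .pptx file is a ZIP archive, the slide parts can be listed without opening PowerPoint. The sketch below assumes the common `ppt/slides/slideN.xml` naming; the helper name `list_slide_parts` is hypothetical.

```python
import re
import zipfile

# Matches slide parts only, not their _rels entries or layouts.
SLIDE_PART = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def list_slide_parts(path):
    """Return slide part names sorted by the number in the file name."""
    slides = []
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            match = SLIDE_PART.match(name)
            if match:
                slides.append((int(match.group(1)), name))
    return [name for _, name in sorted(slides)]
```

File-name order is not guaranteed to match the order in which slides are shown; the presentation order is recorded in `ppt/presentation.xml`.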
# Convert to markdown
pandoc presentation.pptx -o output.md
Use pptxgenjs (JavaScript):
# Install
npm install pptxgenjs
# Create presentation
node create_presentation.js
Example:
const PptxGenJS = require("pptxgenjs");
const pptx = new PptxGenJS();
const slide = pptx.addSlide();
slide.addText("Hello World", { x: 1, y: 1, fontSize: 18 });
slide.addShape(pptx.ShapeType.rect, { x: 1, y: 2, w: 5, h: 3 });
pptx.writeFile({ fileName: "presentation.pptx" });
Use python-pptx:
from pptx import Presentation
# Load presentation
prs = Presentation('existing.pptx')
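# Add slide (index 5 is "Title Only" in the default template;
# index 6, "Blank", has no title placeholder, so slide.shapes.title would be None)
title_only_layout = prs.slide_layouts[5]
slide = prs.slides.add_slide(title_only_layout)
# Add text
title = slide.shapes.title
title.text = "New Slide Title"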
prs.save('modified.pptx')
For complex edits, unpack and edit XML directly:
# Unpack
python ooxml/scripts/unpack.py presentation.pptx unpacked/
# Edit ppt/slides/slide1.xml, ppt/presentation.xml, etc.
# Pack
python ooxml/scripts/pack.py unpacked/ presentation.pptx
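The unpack and pack steps above are, at their core, ZIP extraction and re-compression. The sketch below is a rough standard-library equivalent and is not the implementation of `ooxml/scripts/unpack.py` or `pack.py`, which may do more (for example, formatting or validating the XML); the function names are assumptions.

```python
import zipfile
from pathlib import Path


def unpack_office_file(office_file, output_dir):
    """Extract every part of an Office file into output_dir."""
    with zipfile.ZipFile(office_file) as archive:
        archive.extractall(output_dir)


def pack_office_file(input_dir, office_file):
    """Zip the contents of input_dir back into a single Office file."""
    input_dir = Path(input_dir)
    with zipfile.ZipFile(office_file, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(input_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to input_dir, using forward slashes.
                archive.write(path, path.relative_to(input_dir).as_posix())
```

A round trip would be `unpack_office_file("presentation.pptx", "unpacked/")`, an edit to a file such as `unpacked/ppt/slides/slide1.xml`, then `pack_office_file("unpacked/", "presentation.pptx")`.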
import pandas as pd
# Read entire sheet
df = pd.read_excel('file.xlsx')
# Read specific sheet
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
# Read specific columns by Excel column letter (pass a list to select by header name instead)
df = pd.read_excel('file.xlsx', usecols='A:C')
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
})
# Write to Excel
df.to_excel('output.xlsx', index=False)
# Multiple sheets (df1 and df2 are DataFrames built the same way as df above)
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill
# Load workbook
wb = load_workbook('file.xlsx')
ws = wb.active
# Modify cells
ws['A1'] = 'New Value'
ws['A1'].font = Font(bold=True)
ws['A1'].fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
# Add formula
ws['B10'] = '=SUM(B1:B9)'
# Save
wb.save('modified.xlsx')
| Format | Task | Best Tool |
|---|---|---|
| DOCX | Create new | docx-js (JavaScript) |
| DOCX | Edit existing | Document library (Python) |
| DOCX | Extract text | pandoc |
| DOCX | Tracked changes | Redlining workflow |
| PDF | Extract text | pdfplumber |
| PDF | Extract tables | pdfplumber |
| PDF | Merge/split | pypdf or qpdf |
| PDF | Create | reportlab |
| PPTX | Create new | pptxgenjs |
| PPTX | Edit | python-pptx |
| PPTX | Extract | pandoc |
| XLSX | Read/Write | pandas |
| XLSX | Advanced edits | openpyxl |
# DOCX
npm install -g docx
pip install defusedxml
# PDF
pip install pypdf pdfplumber reportlab
apt-get install pandoc poppler-utils qpdf
# PPTX
npm install pptxgenjs
pip install python-pptx
# XLSX
pip install pandas openpyxl