Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use when working with professional documents (.docx files) for creating new documents, modifying content, working with tracked changes, or adding comments.
A .docx file is essentially a ZIP archive containing XML files that you can read or edit.
Use text extraction or raw XML access
Use docx-js workflow
Convert to markdown using pandoc:
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
Needed for: comments, complex formatting, document structure, embedded media.
# Unpack a file
python ooxml/scripts/unpack.py <input.docx> <output_dir>
Key file structures:
word/document.xml - Main document contentsword/comments.xml - Comments referenced in document.xmlword/media/ - Embedded images and media files<w:ins> and <w:del> tagsUse docx-js (JavaScript/TypeScript):
import { Document, Paragraph, TextRun, Packer } from "docx";
const doc = new Document({
sections: [{
properties: {},
children: [
new Paragraph({
children: [new TextRun("Hello World")],
}),
],
}],
});
const buffer = await Packer.toBuffer(doc);
Use the Document library (Python):
python ooxml/scripts/unpack.py <input.docx> <output_dir>python ooxml/scripts/pack.py <unpacked_dir> <output.docx>For document review with tracked changes:
Principle: Minimal, Precise Edits Only mark text that actually changes. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]
Get markdown representation:
pandoc --track-changes=all path-to-file.docx -o current.md
Identify and group changes into batches of 3-10
Unpack the document:
python ooxml/scripts/unpack.py <input.docx> <output_dir>
Implement changes in batches using Document library
Pack the document:
python ooxml/scripts/pack.py unpacked reviewed-document.docx
Final verification:
pandoc --track-changes=all reviewed-document.docx -o verification.md
# Convert DOCX to PDF
soffice --headless --convert-to pdf document.docx
# Convert PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page