A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.
Reading or analyzing content: Use the "Text extraction" or "Raw XML access" sections below
Creating a new document: Use the "Creating a new Word document" workflow
Editing your own document with simple changes: Use the "Basic OOXML editing" workflow
Editing someone else's document: Use the "Redlining workflow" (recommended default)
Editing legal, academic, business, or government documents: Use the "Redlining workflow" (required)
If you just need to read the text contents of a document, you should convert the document to markdown using pandoc. Pandoc provides excellent support for preserving document structure and can show tracked changes:
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
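When the conversion is driven from a script, the same command can be assembled in Python. This is a minimal sketch that assumes pandoc is installed and on PATH; the helper names build_pandoc_command and convert_to_markdown are hypothetical, and only the three --track-changes values listed above are accepted.

```python
import subprocess

# The three modes named in the options comment above.
TRACK_CHANGES_MODES = {"accept", "reject", "all"}


def build_pandoc_command(docx_path, output_path, track_changes="all"):
    """Return the pandoc argument list for a .docx -> markdown conversion."""
    if track_changes not in TRACK_CHANGES_MODES:
        raise ValueError(f"unsupported --track-changes value: {track_changes!r}")
    return ["pandoc", f"--track-changes={track_changes}", docx_path, "-o", output_path]


def convert_to_markdown(docx_path, output_path, track_changes="all"):
    """Run pandoc; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(docx_path, output_path, track_changes), check=True)
```

Keeping the command builder separate from the call that runs it makes the argument list easy to inspect or log before anything is executed.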
You need raw XML access for: comments, complex formatting, document structure, embedded media, and metadata. For any of these features, you'll need to unpack a document and read its raw XML contents.
python ooxml/scripts/unpack.py <office_file> <output_directory>
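Because a .docx file is a ZIP archive, its parts can also be inspected without unpacking to disk. The sketch below uses only Python's standard zipfile module; the function names list_docx_parts and read_part are hypothetical, and this is not a description of what unpack.py does internally.

```python
import zipfile


def list_docx_parts(docx_file):
    """Return the names of the parts stored in a .docx (a ZIP archive).

    docx_file may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def read_part(docx_file, part_name="word/document.xml"):
    """Return the raw XML text of one part, decoded as UTF-8."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(part_name).decode("utf-8")
```

A quick listing like this is useful for checking whether a document has comments or embedded media before deciding to unpack it.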
word/document.xml - Main document contents
word/comments.xml - Comments referenced in document.xml
word/media/ - Embedded images and media files
Tracked changes use <w:ins> (insertions) and <w:del> (deletions) tags
The redlining workflow lets you plan comprehensive tracked changes using markdown before implementing them in OOXML. CRITICAL: For complete tracked changes, you must implement ALL changes systematically.
Batching Strategy: Group related changes into batches of 3-10 changes. This makes debugging manageable while maintaining efficiency. Test each batch before moving to the next.
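The batching strategy can be sketched as a small helper that splits an ordered list of planned changes into fixed-size groups. The name make_batches and the default size of 5 are assumptions for illustration; the 3-10 bound comes from the guidance above.

```python
def make_batches(changes, batch_size=5):
    """Split an ordered list of planned changes into consecutive batches.

    The final batch may hold fewer than batch_size items.
    """
    if not 3 <= batch_size <= 10:
        raise ValueError("batch_size should be between 3 and 10")
    return [changes[i:i + batch_size] for i in range(0, len(changes), batch_size)]
```

This only chunks by position. Grouping by section, type, or proximity still needs the list to be ordered that way first.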
Principle: Minimal, Precise Edits
When implementing tracked changes, only mark text that actually changes. Repeating unchanged text makes edits harder to review and appears unprofessional. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text by extracting the <w:r> element from the original and reusing it.
Example - Changing "30 days" to "60 days" in a sentence:
# BAD - Replaces entire sentence
'<w:del><w:r><w:delText>The term is 30 days.</w:delText></w:r></w:del><w:ins><w:r><w:t>The term is 60 days.</w:t></w:r></w:ins>'
# GOOD - Only marks what changed, preserves original <w:r> for unchanged text
'<w:r w:rsidR="00AB12CD"><w:t>The term is </w:t></w:r><w:del><w:r><w:delText>30</w:delText></w:r></w:del><w:ins><w:r><w:t>60</w:t></w:r></w:ins><w:r w:rsidR="00AB12CD"><w:t> days.</w:t></w:r>'
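The [unchanged text] + [deletion] + [insertion] + [unchanged text] split can be computed rather than worked out by hand. The sketch below compares the two sentences word by word and strips the shared prefix and suffix; split_minimal_edit is a hypothetical helper, not part of the Document Library, and it handles a single contiguous replacement only.

```python
import re


def split_minimal_edit(old, new):
    """Return (prefix, deleted, inserted, suffix) for one contiguous change.

    Comparison is word-level: whitespace runs and words are separate tokens,
    so "30" -> "60" is reported whole rather than as "3" -> "6".
    """
    old_tokens = re.findall(r"\s+|\S+", old)
    new_tokens = re.findall(r"\s+|\S+", new)
    shortest = min(len(old_tokens), len(new_tokens))

    # Count tokens shared at the start.
    p = 0
    while p < shortest and old_tokens[p] == new_tokens[p]:
        p += 1

    # Count tokens shared at the end, without overlapping the prefix.
    s = 0
    while s < shortest - p and old_tokens[-1 - s] == new_tokens[-1 - s]:
        s += 1

    prefix = "".join(old_tokens[:p])
    deleted = "".join(old_tokens[p:len(old_tokens) - s])
    inserted = "".join(new_tokens[p:len(new_tokens) - s])
    suffix = "".join(old_tokens[len(old_tokens) - s:])
    return prefix, deleted, inserted, suffix
```

For the example above, split_minimal_edit("The term is 30 days.", "The term is 60 days.") yields the four pieces used in the GOOD fragment: "The term is ", "30", "60", and " days.".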
Get markdown representation: Convert document to markdown with tracked changes preserved:
pandoc --track-changes=all path-to-file.docx -o current.md
Identify and group changes: Review the document and identify ALL changes needed, organizing them into logical batches:
Location methods (for finding changes in XML):
Batch organization (group 3-10 related changes per batch):
Read documentation and unpack:
Read ooxml.md (~600 lines) completely from start to finish. NEVER set any range limits when reading this file. Pay special attention to the "Document Library" and "Tracked Change Patterns" sections.
python ooxml/scripts/unpack.py <file.docx> <dir>
Implement changes in batches: Group changes logically (by section, by type, or by proximity) and implement them together in a single script. This approach keeps debugging manageable while maintaining efficiency.
Suggested batch groupings:
For each batch of related changes:
a. Map text to XML: Grep for text in word/document.xml to verify how text is split across <w:r> elements.
b. Create and run script: Use get_node to find nodes, implement changes, then doc.save(). See "Document Library" section in ooxml.md for patterns.
Note: Always grep word/document.xml immediately before writing a script to get current line numbers and verify text content. Line numbers change after each script run.
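As a complement to grep, the way a phrase is split across <w:r> elements can be listed programmatically. This sketch uses the standard library's xml.etree.ElementTree so that it runs anywhere; for untrusted documents, defusedxml.ElementTree offers the same fromstring call. The function name runs_containing is hypothetical and is not part of the Document Library.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the w: prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def runs_containing(xml_data, phrase):
    """Return, for each paragraph whose text contains phrase, its run texts.

    xml_data is the content of word/document.xml (bytes or str). Only <w:t>
    text is collected, so runs inside <w:del> contribute empty strings.
    """
    root = ET.fromstring(xml_data)
    matches = []
    for para in root.iter(f"{{{W}}}p"):
        runs = [
            "".join(t.text or "" for t in run.findall("w:t", NS))
            for run in para.iter(f"{{{W}}}r")
        ]
        if phrase in "".join(runs):
            matches.append(runs)
    return matches
```

A result such as [["The term is ", "30 days."]] shows that the phrase spans two runs, which is the detail to confirm before writing the edit script.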
Pack the document: After all batches are complete, convert the unpacked directory back to .docx:
python ooxml/scripts/pack.py unpacked reviewed-document.docx
Final verification: Do a comprehensive check of the complete document:
pandoc --track-changes=all reviewed-document.docx -o verification.md
grep "original phrase" verification.md # Should NOT find it
grep "replacement phrase" verification.md # Should find it
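The two grep checks can be repeated for a whole list of changes with a short script. This is a sketch under the same assumptions as the commands above: verification.md has already been produced by pandoc, and check_phrases is a hypothetical helper that does plain substring matching, exactly as grep does with a fixed string.

```python
def check_phrases(markdown_text, removed, added):
    """Return a list of problems; an empty list means every check passed.

    removed: phrases that should no longer appear in the text.
    added:   phrases that should now appear in the text.
    """
    problems = []
    for phrase in removed:
        if phrase in markdown_text:
            problems.append(f"still present: {phrase}")
    for phrase in added:
        if phrase not in markdown_text:
            problems.append(f"missing: {phrase}")
    return problems
```

Typical use is to read verification.md into a string, pass the phrases from each batch, and investigate anything the function reports.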
IMPORTANT: When generating code for DOCX operations:
Required dependencies (install if not available):
sudo apt-get install pandoc (for text extraction)
sudo apt-get install libreoffice (for PDF conversion)
pip install defusedxml (for secure XML parsing)
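Before starting a workflow, it can help to check which of these dependencies are already available. The sketch below is one way to do that from Python. The function name check_dependencies is hypothetical, and the LibreOffice executable names soffice and libreoffice are an assumption about how the package exposes its binary on your system.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which required tools appear to be installed."""
    return {
        # Command-line tools are looked up on PATH.
        "pandoc": shutil.which("pandoc") is not None,
        "libreoffice": any(
            shutil.which(name) is not None for name in ("soffice", "libreoffice")
        ),
        # Python packages are looked up without importing them.
        "defusedxml": importlib.util.find_spec("defusedxml") is not None,
    }
```

Any entry reported as False points to the corresponding install command above.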