Split large Word documents (.docx) into smaller chunks and reassemble them. Use when working with large legal documents, contracts, or agreements that need to be processed in smaller pieces, particularly when AI tools struggle with document size limits. Supports splitting by heading style (e.g., "Heading 1") or text pattern (e.g., "ARTICLE"), and merging edited chunks back together.
Split large .docx files into smaller documents by article/section, then reassemble after editing.
pip install python-docx lxml --break-system-packages
| Task | Command |
|---|---|
| List styles | python scripts/docx_splitter.py doc.docx --list-styles |
| Split by style | python scripts/docx_splitter.py doc.docx --split-on "Heading 1" |
| Split by pattern | python scripts/docx_splitter.py doc.docx --pattern "ARTICLE" |
| Merge chunks | python scripts/docx_merger.py ./chunks/ -o merged.docx |
Always check available styles first:
python scripts/docx_splitter.py document.docx --list-styles
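To make clear what a style listing tells you, here is a minimal sketch that tallies paragraph styles. It is not the code behind `--list-styles`; the function name `tally_styles` is hypothetical, and it only assumes that each paragraph exposes `.text` and `.style.name`, as python-docx paragraphs do.

```python
from collections import Counter


def tally_styles(paragraphs):
    """Count how many non-empty paragraphs use each style name.

    `paragraphs` is any iterable of objects exposing `.text` and
    `.style.name` (the shape python-docx paragraphs have).
    """
    counts = Counter()
    for para in paragraphs:
        if para.text.strip():  # ignore blank spacer paragraphs
            counts[para.style.name] += 1
    # Most frequent styles first; ties broken alphabetically for stable output
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

With python-docx installed, `tally_styles(Document("document.docx").paragraphs)` would give a quick view of which style (if any) marks the article headings. A style that appears only a handful of times is usually a better split point than one used on every paragraph.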
By heading style (preferred for well-formatted documents):
python scripts/docx_splitter.py document.docx --split-on "Heading 1" --output-dir ./chunks
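The idea behind splitting on a heading style can be sketched as a simple grouping pass. This is an illustration under assumptions, not the splitter's actual implementation: `split_by_style` is a hypothetical name, and the "Preamble" label is borrowed from the output naming this document describes.

```python
def split_by_style(paragraphs, split_style):
    """Group paragraphs into (title, paragraphs) chunks.

    A new chunk starts at every paragraph whose style name equals
    `split_style`. Anything before the first such paragraph is
    collected under the title "Preamble".
    """
    chunks = [("Preamble", [])]
    for para in paragraphs:
        if para.style.name == split_style:
            # The heading text becomes the chunk title
            chunks.append((para.text.strip(), []))
        chunks[-1][1].append(para)
    # Drop the preamble chunk if the document starts with a heading
    if not chunks[0][1]:
        chunks.pop(0)
    return chunks
```

Each heading paragraph stays inside the chunk it opens, so an edited chunk still carries its own title when it is merged back.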
By text pattern (for inconsistent styling):
python scripts/docx_splitter.py document.docx --pattern "ARTICLE" --output-dir ./chunks
python scripts/docx_splitter.py document.docx --pattern "^SECTION \d+" --output-dir ./chunks
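The two patterns above behave differently, which the following sketch illustrates. It assumes the pattern is applied with `re.search` to each paragraph's stripped text; that is an assumption about the script, and `find_split_points` is a hypothetical helper.

```python
import re


def find_split_points(texts, pattern):
    """Return the indices of paragraphs whose text matches `pattern`."""
    regex = re.compile(pattern)
    return [i for i, text in enumerate(texts) if regex.search(text.strip())]


texts = [
    "ARTICLE I",
    "as defined in ARTICLE I above",  # a cross-reference, not a heading
    "SECTION 1 Definitions",
    "See SECTION 2",                  # another cross-reference
]

print(find_split_points(texts, "ARTICLE"))        # [0, 1]
print(find_split_points(texts, r"^SECTION \d+"))  # [2]
```

Under that assumption, an unanchored pattern such as "ARTICLE" also matches cross-references in body text, while anchoring with `^` limits matches to paragraphs that begin with the keyword.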
| Option | Description |
|---|---|
| --split-on, -s | Style name (e.g., "Heading 1", "Heading 2") |
| --pattern, -p | Regex pattern (e.g., "ARTICLE", "^SECTION \d+") |
| --output-dir, -o | Output directory (default: <input>_split/) |
| --no-header | Exclude preamble content from each chunk |
Files are numbered for proper ordering:
document_split/
├── 00_Preamble.docx # Content before first split point
├── 01_ARTICLE_I.docx
├── 02_ARTICLE_II.docx
└── ...
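A naming scheme like the one shown can be sketched as follows. The exact sanitizing rules of the splitter are not documented here, so the character set, the `max_len` limit, and the "Section" fallback are all assumptions, and `chunk_filename` is a hypothetical name.

```python
import re


def chunk_filename(index, title, max_len=40):
    """Build a name such as '01_ARTICLE_I.docx' from a chunk index and title."""
    # Collapse every run of characters that is not a letter, digit,
    # or hyphen into a single underscore
    slug = re.sub(r"[^A-Za-z0-9-]+", "_", title).strip("_")
    # Truncate long titles; fall back to a generic name if nothing is left
    slug = slug[:max_len].rstrip("_") or "Section"
    return f"{index:02d}_{slug}.docx"


print(chunk_filename(0, "Preamble"))   # 00_Preamble.docx
print(chunk_filename(1, "ARTICLE I"))  # 01_ARTICLE_I.docx
```

The zero-padded two-digit prefix keeps chunks in document order under a plain alphabetical sort, at least for documents with fewer than 100 chunks.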
Reassemble edited chunks:
python scripts/docx_merger.py ./chunks/ -o final_document.docx
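When a directory is merged, the chunks have to be read back in their numbered order. The sketch below shows one way to do that by sorting on the numeric prefix. How the merger actually orders files is not stated here, so treat this as an assumption; `ordered_chunks` is a hypothetical helper.

```python
import re
from pathlib import Path


def ordered_chunks(paths):
    """Sort chunk paths by their leading number, then by file name."""
    def key(path):
        name = Path(path).name
        match = re.match(r"(\d+)_", name)
        # Files without a numeric prefix sort after all numbered chunks
        number = int(match.group(1)) if match else float("inf")
        return (number, name)

    return sorted(paths, key=key)
```

For example, `ordered_chunks(Path("./chunks").glob("*.docx"))` would return the preamble first and the highest-numbered article last. Comparing the prefix as an integer also keeps the order correct if a chunk is renamed without zero padding (for instance `2_...` and `10_...`).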
Merge specific files in order:
python scripts/docx_merger.py file1.docx file2.docx file3.docx -o merged.docx
| Option | Description |
|---|---|
| -o, --output | Output path (required) |
| --page-breaks | Add page breaks between sections |
| --include-duplicates | Keep duplicate preamble content |
# 1. Check document structure
python scripts/docx_splitter.py BigContract.docx --list-styles
# 2. Split by articles
python scripts/docx_splitter.py BigContract.docx --split-on "Heading 1" -o ./chunks
# 3. Edit individual chunks (with Harvey, Claude, or manually)
# 4. Reassemble
python scripts/docx_merger.py ./chunks/ -o BigContract_Final.docx
Use --pattern when heading styles are inconsistent (common in opposing counsel docs).