Extract text from DOCX files using a shell-first approach, with a Python zipfile fallback when the shell method fails
Extract text from Microsoft Word (.docx) files using a robust two-tier approach: shell-based extraction as the primary method, with Python zipfile fallback when shell commands fail or return no output.
- python-docx is not installed, but the zipfile module is available (standard library)
- The unzip command returns errors or no output

DOCX files are ZIP archives containing XML files. This skill provides two extraction methods:
- Shell: unzip -p + sed for fast extraction
- Python: the zipfile module for reliable extraction when the shell method fails

Check that the file exists:
ls -la document.docx
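Because a DOCX file is a ZIP archive, a file's existence alone does not guarantee that extraction will work. The sketch below is one possible pre-check, not part of the original skill: it tests whether the input is a readable ZIP that contains word/document.xml. The helper name looks_like_docx and the in-memory sample archive are illustrative assumptions.

```python
import io
import zipfile

DOCUMENT_PART = "word/document.xml"  # the part the shell command reads


def looks_like_docx(source):
    """Return True if source (a path or file-like object) is a ZIP
    archive that contains word/document.xml. Hypothetical helper."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        return DOCUMENT_PART in archive.namelist()


# Build a tiny in-memory archive so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(DOCUMENT_PART, "<w:document/>")

print(looks_like_docx(buffer))
print(looks_like_docx(io.BytesIO(b"plain text, definitely not a zip archive")))
```

With a real file, the same helper can be called with a path such as "document.docx".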
Try the shell-based approach:
unzip -p document.docx word/document.xml 2>/dev/null | sed -e 's/<[^>]*>//g'
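To see what the sed expression does, the sketch below applies the same pattern with Python's re module to a hypothetical two-paragraph fragment. It also shows a side effect worth knowing: removing every tag leaves no separator between paragraphs. The second function is an assumed variant, not part of the original command, that turns each closing w:p tag into a newline first.

```python
import re

TAG = re.compile(r"<[^>]*>")  # same pattern as the sed expression


def strip_tags(xml):
    """Remove every XML tag, as the sed command does."""
    return TAG.sub("", xml)


def strip_tags_keep_paragraphs(xml):
    """Assumed variant: end each paragraph with a newline before stripping."""
    return TAG.sub("", xml.replace("</w:p>", "\n"))


# Hypothetical fragment of word/document.xml with two paragraphs.
sample = (
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>World</w:t></w:r></w:p>"
)

print(repr(strip_tags(sample)))                  # 'HelloWorld'
print(repr(strip_tags_keep_paragraphs(sample)))  # 'Hello\nWorld\n'
```

Neither version decodes XML entities such as &amp;amp;, so those appear verbatim in the output.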
Verify the shell method returned content:
content=$(unzip -p document.docx word/document.xml 2>/dev/null | sed -e 's/<[^>]*>//g')
if [ -z "$content" ]; then
    echo "Shell extraction returned no output, trying Python fallback..."
fi
When shell commands fail or return empty output, use Python's standard zipfile module:
python3 -c "
import zipfile
import sys
import re