Extract text from DOCX files using shell or Python zipfile, with environment-aware fallback
Extract text from Microsoft Word (.docx) files using either shell commands or Python's zipfile module, automatically selecting the most reliable method for your environment.
Use this skill when the environment lacks python-docx but has standard library access, or when shell tools (unzip, sed) may be unavailable or restricted.
DOCX files are ZIP archives containing XML files. This skill provides two extraction methods:
Method A (Shell): unzip -p + sed for tag stripping
Method B (Python): zipfile module for archive access + string parsing
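To make the "ZIP archive containing XML files" point concrete, the following minimal sketch lists the members of an archive with the standard zipfile module. The helper names list_docx_parts and make_sample_docx are hypothetical, and the sample archive is a tiny in-memory stand-in, not a complete DOCX that Word would open.

```python
import io
import zipfile


def list_docx_parts(source):
    """Return the member names of a DOCX archive (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def make_sample_docx():
    """Build a tiny in-memory stand-in for a DOCX file (demonstration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
            "</w:body></w:document>",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # The body text of a real document lives in word/document.xml.
    for name in list_docx_parts(make_sample_docx()):
        print(name)
```

On a real file, pass its path instead, for example list_docx_parts("document.docx"). Both methods in this skill read only the word/document.xml member.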
Before extraction, detect which method will work:
# Quick shell method test
if unzip -v >/dev/null 2>&1; then
    echo "Shell method available"
else
    echo "Shell method unavailable, try Python"
fi
# Quick Python method test
python3 -c "import zipfile; print('Python method available')" 2>/dev/null
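The same detection can be sketched from Python, which may be convenient when the shell itself is restricted. This is an illustrative helper, not part of the skill: the name pick_extraction_method and its injectable which parameter are assumptions, and unlike the shell test above it also checks for sed, since Method A needs both tools.

```python
import shutil


def pick_extraction_method(which=shutil.which):
    """Return "shell" if both unzip and sed are on PATH, otherwise "python".

    The `which` parameter exists only so the lookup can be replaced in tests.
    """
    if which("unzip") and which("sed"):
        return "shell"
    return "python"


if __name__ == "__main__":
    print("Selected method:", pick_extraction_method())
```

Finding a tool on PATH does not guarantee the environment allows it to run, so a failed Method A run should still fall back to Method B.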
Use when unzip and sed are available and the environment allows shell operations.
1. Verify the DOCX file exists
ls -la document.docx
2. Extract raw XML content
unzip -p document.docx word/document.xml
3. Strip XML tags from content
unzip -p document.docx word/document.xml | sed -e 's/<[^>]*>//g'
4. Clean up whitespace (optional)
unzip -p document.docx word/document.xml | \
    sed -e 's/<[^>]*>//g' | \
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
    sed -e '/^$/d'
5. Save extracted text to file
unzip -p document.docx word/document.xml | \
    sed -e 's/<[^>]*>//g' > output.txt
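One thing the tag-stripping step does not do is decode XML entities, so an ampersand in the document comes out as &amp;amp; in the text. The sketch below mirrors the sed expression in Python on a hypothetical fragment and adds a decoding pass with html.unescape. The function names and the SAMPLE string are illustrative assumptions.

```python
import html
import re

# Hypothetical fragment in the style of word/document.xml.
SAMPLE = "<w:p><w:r><w:t>Terms &amp; conditions &lt;draft&gt;</w:t></w:r></w:p>"


def strip_tags(xml_text):
    """Python equivalent of: sed -e 's/<[^>]*>//g'."""
    return re.sub(r"<[^>]*>", "", xml_text)


def strip_tags_and_decode(xml_text):
    """Strip tags first, then decode entities such as &amp; and &lt;."""
    return html.unescape(strip_tags(xml_text))


if __name__ == "__main__":
    print(strip_tags(SAMPLE))
    print(strip_tags_and_decode(SAMPLE))
```

The order matters: decoding before stripping would turn &amp;lt;draft&amp;gt; into a literal tag-like string that the regex then removes.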
parse_docx_shell() {
    local file="$1"

    # Fail early if the input file is missing
    if [ ! -f "$file" ]; then
        echo "Error: File not found: $file" >&2
        return 1
    fi

    # Fail early if unzip is not installed
    if ! command -v unzip >/dev/null 2>&1; then
        echo "Error: unzip not available" >&2
        return 1
    fi

    # Stream the main document part, strip tags, trim lines, drop blank lines
    unzip -p "$file" word/document.xml 2>/dev/null | \
        sed -e 's/<[^>]*>//g' | \
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
        sed -e '/^$/d'
}
# Usage: parse_docx_shell document.docx
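The function above strips tags with a regular expression, and word/document.xml usually has few or no line breaks, so paragraphs can run together in the output. Where the Python standard library is available, a paragraph-aware alternative can be sketched with xml.etree.ElementTree. This is a swapped-in technique rather than the method this skill describes. The names extract_paragraphs and build_sample_docx are hypothetical, and the sample archive is an in-memory stand-in rather than a complete DOCX.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(source):
    """Return one string per non-empty w:p element (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join every w:t text run inside this paragraph.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text.strip())
    return paragraphs


def build_sample_docx():
    """Build a tiny in-memory stand-in for a DOCX file (demonstration only)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        "<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>paragraph</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Fish &amp; chips</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for line in extract_paragraphs(build_sample_docx()):
        print(line)
```

The XML parser also decodes entities. Paragraphs nested inside other paragraphs (for example in text boxes) may be reported twice by this sketch, so treat it as a starting point.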
Use when the shell method fails or the Python environment is more reliable than the shell.
1. Verify the DOCX file exists
ls -la document.docx
2. Run Python extraction via run_shell
run_shell 'python3 -c "
import zipfile
import re

# Read the main document part from the archive
with zipfile.ZipFile(\"document.docx\", \"r\") as z:
    content = z.read(\"word/document.xml\").decode(\"utf-8\")

# Strip XML tags, then trim lines and drop empty ones
text = re.sub(r\"<[^>]*>\", \"\", content)
lines = [l.strip() for l in text.splitlines() if l.strip()]
for line in lines:
    print(line)
"'
3. Save to file by redirecting output
run_shell 'python3 -c "
import zipfile
import re

# Read the main document part from the archive
with zipfile.ZipFile(\"document.docx\", \"r\") as z:
    content = z.read(\"word/document.xml\").decode(\"utf-8\")

# Strip XML tags, then trim lines and drop empty ones
text = re.sub(r\"<[^>]*>\", \"\", content)
lines = [l.strip() for l in text.splitlines() if l.strip()]

# Write the result as UTF-8, one line per entry
with open(\"output.txt\", \"w\", encoding=\"utf-8\") as f:
    for line in lines:
        f.write(line + \"\\n\")
"'
parse_docx_python() {
    local file="$1"
    local output="$2"

    # Fail early if the input file is missing
    if [ ! -f "$file" ]; then
        echo "Error: File not found: $file" >&2
        return 1
    fi

    run_shell "python3 -c \"
import zipfile
import re
import sys