Resilient DOCX Text Extraction

Extract text from Microsoft Word (.docx) files using a two-tier approach: shell-based extraction as the primary method, with a Python zipfile fallback when the shell command fails or returns no output.

When to Use

The Python environment may lack python-docx, but the standard-library zipfile module is available
You are working in constrained or inconsistent environments (containers, minimal images, CI/CD)
The shell unzip command returns errors or no output
You need reliable extraction with an automatic fallback

Core Technique

DOCX files are ZIP archives containing XML files; the main body text is stored in the archive member word/document.xml. This skill provides two extraction methods:

Primary (Shell): unzip -p + sed for fast extraction
Fallback (Python): zipfile module for reliable extraction when shell fails
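
To make the ZIP-of-XML structure concrete, here is a minimal sketch that uses only the standard library. It builds a tiny stand-in archive in memory (not a complete, Word-valid document) and lists its members. The helper name `list_docx_parts` is hypothetical and only for illustration.

```python
import io
import zipfile


def list_docx_parts(docx_bytes: bytes) -> list[str]:
    """Return the member names of a DOCX archive given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


# Build a tiny stand-in archive for illustration; a real DOCX has more parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<w:document/>")

parts = list_docx_parts(buffer.getvalue())
print(parts)
```

Both extraction methods read the same member, `word/document.xml`; they differ only in the tool that opens the archive.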

Step-by-Step Instructions

1. Verify the DOCX file exists
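
One possible check is sketched below. It goes a step beyond a plain existence test by also confirming that the file is a ZIP archive containing `word/document.xml`. The function name `is_readable_docx` is an assumption for illustration, not part of the original skill.

```python
import zipfile
from pathlib import Path


def is_readable_docx(path: str) -> bool:
    """Return True if path is an existing ZIP archive with word/document.xml."""
    candidate = Path(path)
    if not candidate.is_file():
        return False  # missing, or a directory
    if not zipfile.is_zipfile(candidate):
        return False  # exists but is not a ZIP archive
    with zipfile.ZipFile(candidate) as archive:
        return "word/document.xml" in archive.namelist()
```

Failing early here keeps later error messages meaningful: a missing file and a corrupt archive are different problems from an extraction that returned nothing.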

2. Test shell extraction first (recommended)
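
The original command is not preserved here. A typical form of this approach is `unzip -p file.docx word/document.xml | sed 's/<[^>]*>//g'`, which is an assumption rather than the skill's exact command. The sketch below drives the same `unzip -p` call from Python and swaps the `sed` step for an equivalent regular expression; `extract_with_unzip` is a hypothetical name.

```python
import html
import re
import shutil
import subprocess
from typing import Optional


def extract_with_unzip(path: str) -> Optional[str]:
    """Run `unzip -p` on word/document.xml and strip tags; None on failure."""
    if shutil.which("unzip") is None:
        return None  # unzip is not installed in this environment
    result = subprocess.run(
        ["unzip", "-p", path, "word/document.xml"],
        capture_output=True,
    )
    if result.returncode != 0:
        return None  # unzip reported an error
    xml = result.stdout.decode("utf-8", errors="replace")
    # Turn paragraph ends into newlines, then drop every remaining tag.
    xml = xml.replace("</w:p>", "\n")
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode XML entities such as &amp; that tag stripping leaves behind.
    return html.unescape(text).strip()
```

Tag stripping is fast but crude: it ignores document structure and may keep text from fields or other elements that a proper XML parse would treat differently.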

3. Check if shell extraction produced output
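
A minimal sketch of this check follows, assuming the shell step hands back either a string or `None` on failure. The name `needs_fallback` is hypothetical.

```python
from typing import Optional


def needs_fallback(shell_text: Optional[str]) -> bool:
    """Return True when shell output is missing or contains only whitespace."""
    return shell_text is None or not shell_text.strip()
```

Treating whitespace-only output as empty matters because a command can exit successfully and still produce nothing useful.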

4. Use Python zipfile fallback if needed
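
The fallback can be sketched with the standard library alone. This version parses `word/document.xml` with `xml.etree.ElementTree` and joins the text runs (`w:t`) of each paragraph (`w:p`). The function name `extract_with_zipfile` is an assumption, and headers, footers, footnotes, and comments are out of scope because they live in other archive members.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by the w:p and w:t elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_with_zipfile(path: str) -> str:
    """Return paragraph text from word/document.xml, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join them in order.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

One known limitation of this sketch: paragraphs nested inside another paragraph, such as text boxes, are visited twice, so their text may appear more than once in the output.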
