Extract text, tables, metadata, and assets from PDF, Word (.docx), and Excel (.xlsx) files. Use when the user wants to read, parse, or extract content from documents.
Unified skill for extracting content from common document formats. Identify the file type, then follow the corresponding reference.
Identify the file type from its extension (.pdf, .docx, .xlsx / .xls).

| Extension | Format | Reference |
|---|---|---|
| .pdf | PDF (digital or scanned) | pdf.md |
| .docx | Word (Office Open XML) | docx.md |
| .xlsx | Excel (Office Open XML) | xlsx.md |
| .doc | Legacy Word (binary) | Convert to .docx first; see docx.md |
| .xls | Legacy Excel (binary) | Convert to .xlsx or use xlrd; see xlsx.md |
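
The routing in the table can be sketched as a small lookup. This is only an illustration: the `REFERENCES` dictionary and the `reference_for` function are hypothetical names, not part of the skill, and the sketch assumes the extension alone is enough to identify the format.

```python
from pathlib import Path

# Extension-to-reference mapping copied from the table.
# REFERENCES and reference_for are hypothetical names for illustration.
REFERENCES = {
    ".pdf": "pdf.md",
    ".docx": "docx.md",
    ".xlsx": "xlsx.md",
    ".doc": "docx.md",  # legacy binary Word: convert to .docx first
    ".xls": "xlsx.md",  # legacy binary Excel: convert to .xlsx or use xlrd
}


def reference_for(path: str) -> str:
    """Return the reference file to follow for the given document path."""
    ext = Path(path).suffix.lower()  # case-insensitive match on the extension
    if ext not in REFERENCES:
        raise ValueError(f"Unsupported document type: {ext or '(no extension)'}")
    return REFERENCES[ext]
```

For example, `reference_for("Report.PDF")` resolves to the PDF reference, while a path such as `notes.txt` raises `ValueError` instead of guessing a workflow.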
For legacy .doc / .xls files, convert with LibreOffice (`soffice --headless --convert-to <target>`) before applying the OOXML workflow.
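
The conversion step could be wrapped roughly as follows. This is a sketch under assumptions: the helper names `build_convert_command` and `convert_legacy` are hypothetical, `soffice` is assumed to be on `PATH`, and the `--outdir` option (a LibreOffice flag not mentioned above) is added so the output location is predictable.

```python
import shutil
import subprocess
from pathlib import Path

# Target format for each legacy extension (hypothetical helper mapping).
LEGACY_TARGETS = {".doc": "docx", ".xls": "xlsx"}


def build_convert_command(src: str, outdir: str = ".") -> list[str]:
    """Build the soffice argument list for a legacy .doc or .xls file."""
    ext = Path(src).suffix.lower()
    if ext not in LEGACY_TARGETS:
        raise ValueError(f"Not a legacy format: {ext or '(no extension)'}")
    return [
        "soffice", "--headless",
        "--convert-to", LEGACY_TARGETS[ext],
        "--outdir", outdir,  # assumption: write the result into outdir
        src,
    ]


def convert_legacy(src: str, outdir: str = ".") -> Path:
    """Run the conversion and return the expected path of the converted file."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH; is LibreOffice installed?")
    subprocess.run(build_convert_command(src, outdir), check=True)
    target = LEGACY_TARGETS[Path(src).suffix.lower()]
    return Path(outdir) / f"{Path(src).stem}.{target}"
```

Building the argument list separately from running it keeps the command easy to inspect or log before LibreOffice is invoked. The returned path is where the converted file is expected to appear, so it is worth checking that it exists before moving on to the OOXML workflow.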