Convert non-Markdown documents (docx, xlsx, pptx, html, pdf) to Markdown before AI processing
Purpose: Ensure agents always receive plain-text Markdown regardless of the source format the user provides. This skill is a pre-processing gate — it runs before any story-generator, prd-gap-analyzer, or other skill that expects Markdown input.
Invoke this skill automatically (without asking) whenever the user provides or references a file whose extension is NOT .md:
| Extension | Source |
|---|---|
| .docx | Word document: PRD, spec, meeting notes |
| .xlsx / .xls | Excel: requirements matrix, test plan, roadmap |
| .pptx | PowerPoint: design deck, stakeholder presentation |
| .html / .htm | Web page, exported Confluence/Notion page |
| .pdf | Design brief, contract, scanned spec |
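The trigger condition can be sketched as a plain extension check. This is a minimal illustration rather than the skill's actual logic: the names `CONVERTIBLE_EXTENSIONS` and `needs_conversion` are hypothetical, and the extension set is copied from the table above.

```python
from pathlib import Path

# Extensions copied from the table above; anything else is outside this sketch.
CONVERTIBLE_EXTENSIONS = {".docx", ".xlsx", ".xls", ".pptx", ".html", ".htm", ".pdf"}


def needs_conversion(path: str) -> bool:
    """Return True when the file has one of the listed non-Markdown extensions."""
    # Compare case-insensitively so that "DECK.PPTX" is treated like "deck.pptx".
    suffix = Path(path).suffix.lower()
    return suffix in CONVERTIBLE_EXTENSIONS
```

A `.md` path returns False here, so it would go straight to the downstream skill.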
# Single file
sdlc doc convert ./docs/prd-draft.docx
# Directory (converts all supported formats inside)
sdlc doc convert ./uploads/
# Custom output dir
sdlc doc convert ./prd.pdf --output-dir ./stories/source/
Output is written to .sdlc/import/<filename>.extracted.md (default).
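As a rough sketch, the default output location could be computed as follows. It assumes that `<filename>` means the source name without its extension, which the text does not confirm; `extracted_path` and `DEFAULT_IMPORT_DIR` are hypothetical names.

```python
from pathlib import Path
from typing import Optional

# Default location stated above.
DEFAULT_IMPORT_DIR = Path(".sdlc/import")


def extracted_path(source: str, output_dir: Optional[str] = None) -> Path:
    """Return where the extracted Markdown for `source` is expected to land."""
    # Assumption: <filename> is the source file name without its extension.
    target_dir = Path(output_dir) if output_dir else DEFAULT_IMPORT_DIR
    return target_dir / f"{Path(source).stem}.extracted.md"
```

If the real script keeps the original extension in the output name, only the `stem` call would need to change.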
When a user uploads or pastes a path to a non-Markdown file:
1. Run sdlc doc convert <path> (or python3 scripts/doc-to-md.py <path>).
2. Read the resulting .extracted.md file.
3. Continue with the skill that needs the Markdown (prd-gap-analyzer, story-generator).

Do not attempt to parse binary file content directly from a chat attachment. Always extract first.
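The extract-first rule could be wrapped in a small helper that assembles the CLI call before anything else touches the file. This is only a sketch: `build_convert_command` and `convert` are hypothetical names, and the only command and flag used are the `sdlc doc convert` invocation and `--output-dir` option shown earlier.

```python
import subprocess
from typing import List, Optional


def build_convert_command(path: str, output_dir: Optional[str] = None) -> List[str]:
    """Assemble the `sdlc doc convert` invocation for a file or directory."""
    command = ["sdlc", "doc", "convert", path]
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    return command


def convert(path: str, output_dir: Optional[str] = None) -> None:
    """Run the conversion; requires the sdlc CLI to be on PATH."""
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_convert_command(path, output_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.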
Libraries are installed automatically by ./setup.sh and bootstrap-sdlc-features.sh (best-effort, non-fatal). No manual step is required after initial setup.
To install manually (e.g. on a new machine without re-running setup):
pip install pypdf pdfplumber mammoth python-docx openpyxl python-pptx \
beautifulsoup4 html2text trafilatura
All libraries are optional per format; the script degrades gracefully if one is missing.
To skip the automatic install during setup: set SDL_SKIP_DOC_LIBS=1 before running ./setup.sh.
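One way to picture the graceful degradation is a check of which libraries can be imported for each format. The sketch below is not taken from scripts/doc-to-md.py: the grouping of libraries by format is an assumption, and `FORMAT_LIBRARIES`, `is_installed`, and `usable_formats` are hypothetical names. The module names are the import names of the pip packages listed above.

```python
from importlib.util import find_spec
from typing import Callable, Dict, List

# Import names for the pip packages above (python-docx -> docx,
# python-pptx -> pptx, beautifulsoup4 -> bs4).
# Assumption: this format-to-library grouping is illustrative only.
FORMAT_LIBRARIES: Dict[str, List[str]] = {
    "pdf": ["pypdf", "pdfplumber"],
    "docx": ["mammoth", "docx"],
    "xlsx": ["openpyxl"],
    "pptx": ["pptx"],
    "html": ["bs4", "html2text", "trafilatura"],
}


def is_installed(module: str) -> bool:
    """Return True if the top-level module can be found without importing it."""
    return find_spec(module) is not None


def usable_formats(installed: Callable[[str], bool] = is_installed) -> Dict[str, List[str]]:
    """Map each format to the subset of its libraries that are available."""
    return {
        fmt: [module for module in modules if installed(module)]
        for fmt, modules in FORMAT_LIBRARIES.items()
    }
```

A format whose list comes back empty has no usable library in this sketch; how the real script reacts in that case is not described here.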
The extracted .md file:
- Starts with a <!-- source: <filename> ... --> provenance comment.
- Uses Markdown headings (#, ##, …) where the source had structure.
- Contains one ## Slide N: <title> section per slide.
- Contains one <!-- page N --> comment per page boundary.

Macro-enabled files (.xlsm, .docm) are treated as read-only data; macros are never executed.

… .md.

Related files:

- skills/shared/prd-gap-analyzer/: consumes the extracted Markdown.
- skills/shared/story-generator/: ingests the normalized text to build Master Stories.
- rules/ask-first-protocol.md: ask before overwriting an existing .extracted.md.
- scripts/doc-to-md.py: implementation.
- cli/lib/executor.sh (cmd_doc): sdlc doc convert entry point.
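The output conventions for the extracted file (provenance comment, one section per slide, one comment per page boundary) might look like the sketch below. `render_slides` and `render_pages` are hypothetical helpers, not functions from scripts/doc-to-md.py. Two assumptions are made: the provenance comment carries only the source name, because the remaining fields are elided in the description, and a page comment precedes each page's text.

```python
from typing import List, Tuple


def render_slides(filename: str, slides: List[Tuple[str, str]]) -> str:
    """Render (title, body) pairs as one '## Slide N: <title>' section each."""
    # Only the source name is emitted; other provenance fields are not specified.
    parts = [f"<!-- source: {filename} -->"]
    for number, (title, body) in enumerate(slides, start=1):
        parts.append(f"## Slide {number}: {title}")
        parts.append(body)
    return "\n\n".join(parts) + "\n"


def render_pages(filename: str, pages: List[str]) -> str:
    """Render page texts with a '<!-- page N -->' marker before each page."""
    parts = [f"<!-- source: {filename} -->"]
    for number, text in enumerate(pages, start=1):
        # Assumption: the marker is placed at the start of page N.
        parts.append(f"<!-- page {number} -->")
        parts.append(text)
    return "\n\n".join(parts) + "\n"
```

Keeping the markers as HTML comments leaves them invisible in rendered Markdown while staying searchable in the raw text.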