Convert heavy document formats (PDF, Word, Excel, PowerPoint, and 10+ others) to token-efficient Markdown/CSV with structurally-aware digest compression. Use when Claude needs to read documents without burning excessive context budget. Triggers on /distill, 'distill this', 'convert to markdown', 'make this readable'.
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
Convert heavy document formats to token-efficient representations (Markdown, CSV) for LLM consumption. The core deliverable is the .digest.md — a structurally-aware compression at 20-30% of the original's token count.
Skill type: Rigid — follow exactly, no shortcuts.
Models:
Announce at start: "I'm using the distill skill to convert documents to token-efficient formats."
/distill <path> [path2 ...]
/distill <directory>
Examples:
- /distill docs/report.pdf — convert one file
- /distill docs/report.pdf data/sheet.xlsx slides/deck.pptx — convert multiple files
- /distill docs/ — convert all supported files in directory (single-level, not recursive)

Mixed mode is supported: /distill docs/ extra/report.pdf
Execute phases in this order. Each phase completes for all files before the next begins.
At skill start, before processing any files, check for required tools:
| Check | Command | If Missing |
|---|---|---|
| Tier 1 | which pandoc | "pandoc not found. Install: apt install pandoc (Debian/Ubuntu) or brew install pandoc (macOS). Tier 1 formats will be skipped." |
| Tier 2 | which pdftotext | "pdftotext not found. Install: apt install poppler-utils (Debian/Ubuntu) or brew install poppler (macOS). PDF conversion will be skipped." |
| Tier 3 | which python3 | "python3 not found. PPTX and XLSX conversion will be skipped." |
| Pre-flight | which unzip | Skip zip bomb detection with note. Not a conversion blocker. |
| Pre-flight | which pdfdetach | Skip PDF attachment detection with note. Not a conversion blocker. |
Build a set of available tiers. Route files only to available tiers. Files targeting unavailable tiers get routed to unsupported-with-guidance (Phase 1b).
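The tier-availability bookkeeping might be sketched as follows. `tier_available` is a hypothetical helper, not part of the skill's files, and `command -v` is used in place of `which` for portability:

```shell
# Probe once at startup; record which tiers have their tool installed.
TIERS=""
if command -v pandoc    >/dev/null 2>&1; then TIERS="$TIERS 1"; fi
if command -v pdftotext >/dev/null 2>&1; then TIERS="$TIERS 2"; fi
if command -v python3   >/dev/null 2>&1; then TIERS="$TIERS 3"; fi

# Returns 0 iff tier $1 was recorded as available.
tier_available() {
  case " $TIERS " in *" $1 "*) return 0 ;; *) return 1 ;; esac
}
```

Routing then checks `tier_available "$TIER"` before dispatching each file, sending misses to the Phase 1b guidance path.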
Individual file paths: Use directly. Verify each file exists.
Directory paths: Single-level glob for files with supported extensions (not recursive). Build file list sorted alphabetically. Report: "Found {N} convertible files in {directory}: {list}."
Supported extensions for glob: .pdf, .docx, .rtf, .html, .htm, .odt, .epub, .rst, .org, .tex, .ipynb, .pptx, .xlsx
Mixed mode: Process both directory globs and individual paths. Deduplicate by absolute path.
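One way to sketch the combined glob-and-dedupe step (`collect_files` is a hypothetical helper; it assumes GNU `realpath` is available for the absolute-path dedupe key):

```shell
# Extensions from the routing table below.
EXTS="pdf docx rtf html htm odt epub rst org tex ipynb pptx xlsx"

# Accepts any mix of files and directories; emits one absolute path
# per line, deduplicated and sorted alphabetically.
collect_files() {
  for arg in "$@"; do
    if [ -d "$arg" ]; then
      # Single-level glob only — no recursion.
      for ext in $EXTS; do
        for f in "$arg"/*."$ext"; do
          [ -e "$f" ] && printf '%s\n' "$f"
        done
      done
    elif [ -f "$arg" ]; then
      printf '%s\n' "$arg"
    fi
  done | while read -r f; do realpath "$f"; done | sort -u
}
```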
For each file, determine the conversion tier by extension:
| Extension | Tier | Format Flag |
|---|---|---|
| .docx | 1 | docx |
| .rtf | 1 | rtf |
| .html | 1 | html |
| .htm | 1 | html |
| .odt | 1 | odt |
| .epub | 1 | epub |
| .rst | 1 | rst |
| .org | 1 | org |
| .tex | 1 | latex |
| .ipynb | 1 | ipynb |
| .pdf | 2 | — |
| .pptx | 3 | — |
| .xlsx | 3 | — |
Unsupported formats: Output actionable guidance per this table, then continue with remaining files:
| Extension | Guidance |
|---|---|
.xls | "Legacy Excel format. Export as .xlsx from Excel/LibreOffice, then re-run /distill." |
.ods | "OpenDocument Spreadsheet. Export as .csv (single-sheet) or .xlsx (multi-sheet), then re-run /distill." |
.odp | "OpenDocument Presentation. Export as .pptx, then re-run /distill." |
.key | "Apple Keynote. Export as .pptx from Keynote, then re-run /distill." |
.numbers | "Apple Numbers. Export as .xlsx from Numbers, then re-run /distill." |
.pages | "Apple Pages. Export as .docx from Pages, then re-run /distill." |
Unknown extensions: "Unsupported format: {ext}. Supported formats: docx, rtf, html, odt, epub, rst, org, tex, ipynb, pdf, pptx, xlsx."
Unavailable tier: If a file's tier is unavailable (tool missing from Phase 0), report: "{file}: requires {tool} (not installed). Skipping."
Run per-file safety checks before conversion. Failures are per-file — do not halt the batch.
Office formats are ZIP archives. If unzip is available:
UNCOMPRESSED=$(unzip -l "$INPUT_PATH" 2>/dev/null | tail -1 | awk '{print $1}')
If uncompressed size exceeds 500MB (524288000 bytes), abort this file: "File uncompressed size ({size}) exceeds 500MB safety limit. Skipping."
If unzip is not available, skip this check (noted in Phase 0).
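The size guard can be sketched as below. `under_limit` and `check_zip_bomb` are hypothetical helpers; an unreadable or missing size is treated as unknown and allowed through, since this check is advisory rather than a conversion blocker:

```shell
LIMIT=524288000  # 500 MiB

# Returns 0 (safe) unless $1 is a number strictly above LIMIT.
under_limit() {
  case "$1" in ''|*[!0-9]*) return 0 ;; esac  # non-numeric: don't block
  [ "$1" -le "$LIMIT" ]
}

check_zip_bomb() {
  # unzip -l's final line is "  <total-uncompressed-bytes>  <n> files"
  local total
  total=$(unzip -l "$1" 2>/dev/null | tail -1 | awk '{print $1}')
  if ! under_limit "$total"; then
    echo "File uncompressed size ($total) exceeds 500MB safety limit. Skipping."
    return 1
  fi
}
```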
For PDF files, if pdfdetach is available:
# Count only the "N: name" entry lines, not the "N embedded files" header
ATTACHMENTS=$(pdfdetach -list "$INPUT_PATH" 2>/dev/null | grep -cE "^ *[0-9]+:")
If attachments found, warn: "PDF contains {N} embedded attachments. These are not extracted — only text content is converted." Continue with conversion.
After conversion (not before), verify output is valid UTF-8:
file --mime-encoding "$OUTPUT_PATH"
If not UTF-8, attempt re-encoding: iconv -f <detected-charset> -t UTF-8 "$OUTPUT_PATH" -o "$OUTPUT_PATH.tmp" && mv "$OUTPUT_PATH.tmp" "$OUTPUT_PATH". If re-encoding fails, report and skip.
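The whole post-conversion encoding step might be sketched as follows. `ensure_utf8` is a hypothetical helper; it assumes GNU `iconv`, which supports the `-o` output flag used above:

```shell
# Re-encode $1 in place to UTF-8 if needed; returns 1 on failure.
ensure_utf8() {
  local out="$1" enc
  enc=$(file --mime-encoding -b "$out")
  case "$enc" in
    utf-8|us-ascii) return 0 ;;  # ASCII is a UTF-8 subset; nothing to do
  esac
  if iconv -f "$enc" -t UTF-8 "$out" -o "$out.tmp"; then
    mv "$out.tmp" "$out"
  else
    rm -f "$out.tmp"
    echo "Re-encoding failed for $out; skipping." >&2
    return 1
  fi
}
```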
Process files sequentially. For each file:
INPUT_PATH="$1"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
FORMAT="$2" # from routing table
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
Shell safety: All file paths via quoted shell variables. Never inline interpolation. Never use unquoted $() or backtick interpolation of file paths.
Error handling: If pandoc exits non-zero or the output file is empty, report per-file and continue with the remaining files (see Failure Modes).
Idempotency: Overwrites existing output files without warning.
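Putting the per-file step and its error handling together, a sketch (`convert_tier1` is a hypothetical wrapper, not a file shipped with the skill):

```shell
# Convert one Tier 1 file; returns non-zero on failure so the
# caller can report and move on to the next file.
convert_tier1() {
  local INPUT_PATH="$1" FORMAT="$2"
  local OUTPUT_PATH="${INPUT_PATH%.*}.md"
  if ! pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"; then
    echo "$INPUT_PATH: pandoc conversion failed. Skipping." >&2
    return 1
  fi
  if [ ! -s "$OUTPUT_PATH" ]; then
    echo "$INPUT_PATH: conversion produced empty output. Skipping." >&2
    return 1
  fi
}
```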
Step 1 — Extract:
INPUT_PATH="$1"
TEXT_PATH="${INPUT_PATH%.*}.txt"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
pdftotext -layout "$INPUT_PATH" "$TEXT_PATH"
Scanned PDF detection: Count total characters and pages:
CHARS=$(wc -c < "$TEXT_PATH")
PAGES=$(pdfinfo "$INPUT_PATH" 2>/dev/null | grep "^Pages:" | awk '{print $2}')
If pdfinfo is unavailable, estimate pages from pdftotext output (count form-feed characters). If average chars/page < 50, report: "This PDF appears to be scanned/image-based. Text extraction produced minimal content. Consider OCR processing externally before distilling." Skip structuring pass. Clean up temp .txt file.
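The heuristic can be sketched as below. `is_scanned` is a hypothetical helper; the fallback relies on pdftotext emitting one form feed (`\f`) per page:

```shell
# $1 = extracted .txt path, $2 = page count from pdfinfo ("" if unavailable).
# Returns 0 if the PDF looks scanned (avg < 50 chars/page).
is_scanned() {
  local text_path="$1" pages="$2" chars
  chars=$(wc -c < "$text_path")
  if [ -z "$pages" ] || [ "$pages" -eq 0 ]; then
    pages=$(tr -cd '\f' < "$text_path" | wc -c)  # form feeds ≈ pages
    [ "$pages" -eq 0 ] && pages=1                # avoid division by zero
  fi
  [ $((chars / pages)) -lt 50 ]
}
```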
Step 2 — Structure: Dispatch a Sonnet agent using skills/distill/pdf-structurer-prompt.md to transform the raw pdftotext output into clean Markdown with recovered headings, lists, tables, and code blocks. Write result to OUTPUT_PATH. Clean up temp .txt file.
Venv setup (once per invocation, only if Tier 3 files exist):
VENV="/tmp/crucible-distill-venv"
# Health check
if [ -d "$VENV" ]; then
"$VENV/bin/python3" -c "import sys" 2>/dev/null || rm -rf "$VENV"
fi
# Create if missing
if [ ! -d "$VENV" ]; then
echo "Installing Python dependencies (one-time setup, ~15 seconds)..."
python3 -m venv "$VENV"
"$VENV/bin/pip" install --quiet python-pptx==1.0.2 openpyxl==3.1.5
if [ $? -ne 0 ]; then
echo "Failed to install Python dependencies."
echo "Manual install: pip install python-pptx==1.0.2 openpyxl==3.1.5"
echo "PPTX and XLSX conversion will be skipped."
# Route remaining Tier 3 files to unsupported
return
fi
fi
PPTX conversion:
"$VENV/bin/python3" skills/distill/convert_pptx.py --input "$INPUT_PATH" --output "$OUTPUT_PATH"
XLSX conversion:
"$VENV/bin/python3" skills/distill/convert_xlsx.py --input "$INPUT_PATH" --output-dir "$(dirname "$INPUT_PATH")"
Output: one CSV per sheet at {basename}-{sheetname}.csv. Sheetnames sanitized (spaces → hyphens, special chars stripped).
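The sanitization rule, sketched in shell for illustration only (the actual logic is assumed to live in convert_xlsx.py, whose internals are not shown here):

```shell
# Spaces become hyphens; anything outside [A-Za-z0-9_-] is stripped.
sanitize_sheetname() {
  printf '%s' "$1" | tr ' ' '-' | tr -cd 'A-Za-z0-9_-'
}
```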
After all conversions complete, run the digest pass on eligible files.
Eligibility:
- .md files (not .csv)

Dispatch: For each eligible file, dispatch a Sonnet digest agent using skills/distill/digest-prompt.md. Before dispatching, fill template placeholders: replace {{ORIGINAL_WORDS}} with the converted file's word count and {{TARGET_WORDS}} with 25% of that count. The raw pdftotext output (for pdf-structurer-prompt.md) or converted .md content (for digest-prompt.md) is included as a content block below the prompt template in the dispatch file.
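Placeholder filling might look like this sketch. `fill_digest_prompt` is a hypothetical helper; the template and dispatch paths follow the disk-mediated dispatch convention:

```shell
# $1 = converted .md file, $2 = prompt template, $3 = dispatch file to write.
fill_digest_prompt() {
  local converted="$1" template="$2" dispatch="$3"
  local orig target
  orig=$(( $(wc -w < "$converted") ))   # arithmetic strips any padding
  target=$(( orig * 25 / 100 ))
  sed -e "s/{{ORIGINAL_WORDS}}/$orig/g" \
      -e "s/{{TARGET_WORDS}}/$target/g" \
      "$template" > "$dispatch"
}
```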
Quality check: After the digest agent returns, count words in the digest (wc -w). If the count falls outside 20-30% of the converted file's word count, dispatch one retry; accept the second result regardless (see Failure Modes).
Output: Write digest to {original-path-without-ext}.digest.md.
Word count is a proxy for token count. These diverge for code-heavy or CJK content, but word count is sufficient for v1.
After all conversions and digests complete, output:
## Distill Summary
| File | Format | Tier | Converted | Digest | Token Savings |
|---|---|---|---|---|---|
| {file} | {format} | {tier} | {output} ({words} words) | {digest} ({words} words) | ~{pct}% |
**Total:** {N} files converted, {M} digests produced, ~{pct}% average token savings on digestible content.
Generated files can be added to .gitignore if not needed in version control.
Token savings per file = 1 - (digest words / converted words) expressed as percentage.
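As a worked example of the integer arithmetic (`savings_pct` is a hypothetical helper): a 1000-word conversion with a 250-word digest saves 100 - 25 = 75%.

```shell
# $1 = converted word count, $2 = digest word count.
savings_pct() {
  echo $(( 100 - ($2 * 100 / $1) ))
}
```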
Files that were skipped (unsupported, tool missing, pre-flight failure) are listed separately:
**Skipped:** {N} files
- {file}: {reason}
Every Bash command that touches file paths MUST use quoted shell variables:
# CORRECT
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
# WRONG — never do this
pandoc -f $FORMAT -t markdown --wrap=none $INPUT_PATH -o $OUTPUT_PATH
- "$VAR", never bare $VAR
- Never $() or backtick interpolation of paths

| Failure | Behavior |
|---|---|
| Tool not installed | Skip tier, report with install guidance, continue |
| Conversion fails (non-zero exit) | Report per-file, continue with remaining files |
| Empty conversion output | Report per-file, continue |
| Zip bomb detected | Skip file, report, continue |
| Scanned PDF | Report, skip digest, continue |
| Venv/pip failure | Skip Tier 3, report with manual install instructions |
| Digest out of range | One retry, accept second result regardless |
| File not found | Report, continue with remaining files |
| Permission denied | Report, continue |
| Encoding error | Attempt re-encode, skip on failure, continue |
Principle: Never halt the batch for a single file failure. Report and continue.
Standalone usage:
- /distill <path> — convert one or more files
- /distill <directory> — convert all supported files in directory

Called by:
Dispatches:
- skills/distill/pdf-structurer-prompt.md
- skills/distill/digest-prompt.md

Does not dispatch: No quality gate, no red-team, no review loop. Distill is a utility skill — it converts and compresses. Quality is ensured by the digest quality metric (word count check + one retry).