Convert heavy document formats (PDF, Word, Excel, PowerPoint, and 10+ others) to token-efficient Markdown/CSV with structurally-aware digest compression. Use when Claude needs to read documents without burning excessive context budget. Triggers on /distill, 'distill this', 'convert to markdown', 'make this readable'.
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
Convert heavy document formats to token-efficient representations (Markdown, CSV) for LLM consumption. The core deliverable is the .digest.md — a structurally-aware compression at 20-30% of the original's token count.
Skill type: Rigid — follow exactly, no shortcuts.
Models:
Announce at start: "I'm using the distill skill to convert documents to token-efficient formats."
/distill <path> [path2 ...]
/distill <directory>
Examples:
- /distill docs/report.pdf — convert one file
- /distill docs/report.pdf data/sheet.xlsx slides/deck.pptx — convert multiple files
- /distill docs/ — convert all supported files in directory (single-level, not recursive)

Mixed mode is supported: /distill docs/ extra/report.pdf
Execute phases in this order. Each phase completes for all files before the next begins.
At skill start, before processing any files, check for required tools:
| Check | Command | If Missing |
|---|---|---|
| Tier 1 | which pandoc | "pandoc not found. Install: apt install pandoc (Debian/Ubuntu) or brew install pandoc (macOS). Tier 1 formats will be skipped." |
| Tier 2 | which pdftotext | "pdftotext not found. Install: apt install poppler-utils (Debian/Ubuntu) or brew install poppler (macOS). PDF conversion will be skipped." |
| Tier 3 | which python3 | "python3 not found. PPTX and XLSX conversion will be skipped." |
| Pre-flight | which unzip | Skip zip bomb detection with note. Not a conversion blocker. |
| Pre-flight | which pdfdetach | Skip PDF attachment detection with note. Not a conversion blocker. |
Build a set of available tiers. Route files only to available tiers. Files targeting unavailable tiers get routed to unsupported-with-guidance (Phase 1b).
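The tier-availability bookkeeping might be sketched as follows. `tier_available` is a hypothetical helper, not part of the skill's files, and `command -v` is used in place of `which` for portability:

```shell
# Probe once at startup; record which tiers have their tool installed.
TIERS=""
if command -v pandoc    >/dev/null 2>&1; then TIERS="$TIERS 1"; fi
if command -v pdftotext >/dev/null 2>&1; then TIERS="$TIERS 2"; fi
if command -v python3   >/dev/null 2>&1; then TIERS="$TIERS 3"; fi

# Returns 0 iff tier $1 was recorded as available.
tier_available() {
  case " $TIERS " in *" $1 "*) return 0 ;; *) return 1 ;; esac
}
```

Routing then checks `tier_available "$TIER"` before dispatching each file, sending misses to the Phase 1b guidance path.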
Individual file paths: Use directly. Verify each file exists.
Directory paths: Single-level glob for files with supported extensions (not recursive). Build file list sorted alphabetically. Report: "Found {N} convertible files in {directory}: {list}."
Supported extensions for glob: .pdf, .docx, .rtf, .html, .htm, .odt, .epub, .rst, .org, .tex, .ipynb, .pptx, .xlsx
Mixed mode: Process both directory globs and individual paths. Deduplicate by absolute path.
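One way to sketch the combined glob-and-dedupe step (`collect_files` is a hypothetical helper; it assumes GNU `realpath` is available for the absolute-path dedupe key):

```shell
# Extensions from the routing table below.
EXTS="pdf docx rtf html htm odt epub rst org tex ipynb pptx xlsx"

# Accepts any mix of files and directories; emits one absolute path
# per line, deduplicated and sorted alphabetically.
collect_files() {
  for arg in "$@"; do
    if [ -d "$arg" ]; then
      # Single-level glob only — no recursion.
      for ext in $EXTS; do
        for f in "$arg"/*."$ext"; do
          [ -e "$f" ] && printf '%s\n' "$f"
        done
      done
    elif [ -f "$arg" ]; then
      printf '%s\n' "$arg"
    fi
  done | while read -r f; do realpath "$f"; done | sort -u
}
```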
For each file, determine the conversion tier by extension:
| Extension | Tier | Format Flag |
|---|---|---|
| .docx | 1 | docx |
| .rtf | 1 | rtf |
| .html | 1 | html |
| .htm | 1 | html |
| .odt | 1 | odt |
| .epub | 1 | epub |
| .rst | 1 | rst |
| .org | 1 | org |
| .tex | 1 | latex |
| .ipynb | 1 | ipynb |
| .pdf | 2 | — |
| .pptx | 3 | — |
| .xlsx | 3 | — |
Unsupported formats: Output actionable guidance per this table, then continue with remaining files:
| Extension | Guidance |
|---|---|
.xls | "Legacy Excel format. Export as .xlsx from Excel/LibreOffice, then re-run /distill." |
.ods | "OpenDocument Spreadsheet. Export as .csv (single-sheet) or .xlsx (multi-sheet), then re-run /distill." |
.odp | "OpenDocument Presentation. Export as .pptx, then re-run /distill." |
.key | "Apple Keynote. Export as .pptx from Keynote, then re-run /distill." |
.numbers | "Apple Numbers. Export as .xlsx from Numbers, then re-run /distill." |
.pages | "Apple Pages. Export as .docx from Pages, then re-run /distill." |
Unknown extensions: "Unsupported format: {ext}. Supported formats: docx, rtf, html, odt, epub, rst, org, tex, ipynb, pdf, pptx, xlsx."
Unavailable tier: If a file's tier is unavailable (tool missing from Phase 0), report: "{file}: requires {tool} (not installed). Skipping."
Run per-file safety checks before conversion. Failures are per-file — do not halt the batch.
Office formats are ZIP archives. If unzip is available:
UNCOMPRESSED=$(unzip -l "$INPUT_PATH" 2>/dev/null | tail -1 | awk '{print $1}')
If uncompressed size exceeds 500MB (524288000 bytes), abort this file: "File uncompressed size ({size}) exceeds 500MB safety limit. Skipping."
If unzip is not available, skip this check (noted in Phase 0).
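The size guard can be sketched as below. `under_limit` and `check_zip_bomb` are hypothetical helpers; an unreadable or missing size is treated as unknown and allowed through, since this check is advisory rather than a conversion blocker:

```shell
LIMIT=524288000  # 500 MiB

# Returns 0 (safe) unless $1 is a number strictly above LIMIT.
under_limit() {
  case "$1" in ''|*[!0-9]*) return 0 ;; esac  # non-numeric: don't block
  [ "$1" -le "$LIMIT" ]
}

check_zip_bomb() {
  # unzip -l's final line is "  <total-uncompressed-bytes>  <n> files"
  local total
  total=$(unzip -l "$1" 2>/dev/null | tail -1 | awk '{print $1}')
  if ! under_limit "$total"; then
    echo "File uncompressed size ($total) exceeds 500MB safety limit. Skipping."
    return 1
  fi
}
```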
For PDF files, if pdfdetach is available:
# Count only the "N: name" entry lines, not the "N embedded files" header
ATTACHMENTS=$(pdfdetach -list "$INPUT_PATH" 2>/dev/null | grep -cE "^ *[0-9]+:")
If attachments found, warn: "PDF contains {N} embedded attachments. These are not extracted — only text content is converted." Continue with conversion.
After conversion (not before), verify output is valid UTF-8:
file --mime-encoding "$OUTPUT_PATH"
If not UTF-8, attempt re-encoding: iconv -f <detected-charset> -t UTF-8 "$OUTPUT_PATH" -o "$OUTPUT_PATH.tmp" && mv "$OUTPUT_PATH.tmp" "$OUTPUT_PATH". If re-encoding fails, report and skip.
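The whole post-conversion encoding step might be sketched as follows. `ensure_utf8` is a hypothetical helper; it assumes GNU `iconv`, which supports the `-o` output flag used above:

```shell
# Re-encode $1 in place to UTF-8 if needed; returns 1 on failure.
ensure_utf8() {
  local out="$1" enc
  enc=$(file --mime-encoding -b "$out")
  case "$enc" in
    utf-8|us-ascii) return 0 ;;  # ASCII is a UTF-8 subset; nothing to do
  esac
  if iconv -f "$enc" -t UTF-8 "$out" -o "$out.tmp"; then
    mv "$out.tmp" "$out"
  else
    rm -f "$out.tmp"
    echo "Re-encoding failed for $out; skipping." >&2
    return 1
  fi
}
```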
Process files sequentially. For each file:
INPUT_PATH="$1"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
FORMAT="$2" # from routing table
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
Shell safety: All file paths via quoted shell variables. Never inline interpolation. Never use unquoted $() or backtick interpolation of file paths.
Error handling: If pandoc exits non-zero or the output file is empty, report per-file and continue with the remaining files (see Failure Modes).
Idempotency: Overwrites existing output files without warning.
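Putting the per-file step and its error handling together, a sketch (`convert_tier1` is a hypothetical wrapper, not a file shipped with the skill):

```shell
# Convert one Tier 1 file; returns non-zero on failure so the
# caller can report and move on to the next file.
convert_tier1() {
  local INPUT_PATH="$1" FORMAT="$2"
  local OUTPUT_PATH="${INPUT_PATH%.*}.md"
  if ! pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"; then
    echo "$INPUT_PATH: pandoc conversion failed. Skipping." >&2
    return 1
  fi
  if [ ! -s "$OUTPUT_PATH" ]; then
    echo "$INPUT_PATH: conversion produced empty output. Skipping." >&2
    return 1
  fi
}
```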
Step 1 — Extract:
INPUT_PATH="$1"
TEXT_PATH="${INPUT_PATH%.*}.txt"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
pdftotext -layout "$INPUT_PATH" "$TEXT_PATH"
Scanned PDF detection: Count total characters and pages:
CHARS=$(wc -c < "$TEXT_PATH")
PAGES=$(pdfinfo "$INPUT_PATH" 2>/dev/null | grep "^Pages:" | awk '{print $2}')
If pdfinfo is unavailable, estimate pages from pdftotext output (count form-feed characters). If average chars/page < 50, report: "This PDF appears to be scanned/image-based. Text extraction produced minimal content. Consider OCR processing externally before distilling." Skip structuring pass. Clean up temp .txt file.
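The heuristic can be sketched as below. `is_scanned` is a hypothetical helper; the fallback relies on pdftotext emitting one form feed (`\f`) per page:

```shell
# $1 = extracted .txt path, $2 = page count from pdfinfo ("" if unavailable).
# Returns 0 if the PDF looks scanned (avg < 50 chars/page).
is_scanned() {
  local text_path="$1" pages="$2" chars
  chars=$(wc -c < "$text_path")
  if [ -z "$pages" ] || [ "$pages" -eq 0 ]; then
    pages=$(tr -cd '\f' < "$text_path" | wc -c)  # form feeds ≈ pages
    [ "$pages" -eq 0 ] && pages=1                # avoid division by zero
  fi
  [ $((chars / pages)) -lt 50 ]
}
```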
Step 2 — Structure: Dispatch a Sonnet agent using skills/distill/pdf-structurer-prompt.md to transform the raw pdftotext output into clean Markdown with recovered headings, lists, tables, and code blocks. Write result to OUTPUT_PATH. Clean up temp .txt file.
Venv setup (once per invocation, only if Tier 3 files exist):
VENV="/tmp/crucible-distill-venv"
# Health check
if [ -d "$VENV" ]; then
"$VENV/bin/python3" -c "import sys" 2>/dev/null || rm -rf "$VENV"
fi
# Create if missing
if [ ! -d "$VENV" ]; then
echo "Installing Python dependencies (one-time setup, ~15 seconds)..."
python3 -m venv "$VENV"
"$VENV/bin/pip" install --quiet python-pptx==1.0.2 openpyxl==3.1.5
if [ $? -ne 0 ]; then
echo "Failed to install Python dependencies."
echo "Manual install: pip install python-pptx==1.0.2 openpyxl==3.1.5"
echo "PPTX and XLSX conversion will be skipped."
# Route remaining Tier 3 files to unsupported
return
fi
fi
PPTX conversion:
"$VENV/bin/python3" skills/distill/convert_pptx.py --input "$INPUT_PATH" --output "$OUTPUT_PATH"
XLSX conversion:
"$VENV/bin/python3" skills/distill/convert_xlsx.py --input "$INPUT_PATH" --output-dir "$(dirname "$INPUT_PATH")"
Output: one CSV per sheet at {basename}-{sheetname}.csv. Sheetnames sanitized (spaces → hyphens, special chars stripped).
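The sanitization rule, sketched in shell for illustration only (the actual logic is assumed to live in convert_xlsx.py, whose internals are not shown here):

```shell
# Spaces become hyphens; anything outside [A-Za-z0-9_-] is stripped.
sanitize_sheetname() {
  printf '%s' "$1" | tr ' ' '-' | tr -cd 'A-Za-z0-9_-'
}
```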
After all conversions complete, run the digest pass on eligible files.
Eligibility:
- .md files (not .csv)

Dispatch: For each eligible file, dispatch a Sonnet digest agent using skills/distill/digest-prompt.md. Before dispatching, fill template placeholders: replace {{ORIGINAL_WORDS}} with the converted file's word count and {{TARGET_WORDS}} with 25% of that count. The raw pdftotext output (for pdf-structurer-prompt.md) or converted .md content (for digest-prompt.md) is included as a content block below the prompt template in the dispatch file.
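Placeholder filling might look like this sketch. `fill_digest_prompt` is a hypothetical helper; the template and dispatch paths follow the disk-mediated dispatch convention:

```shell
# $1 = converted .md file, $2 = prompt template, $3 = dispatch file to write.
fill_digest_prompt() {
  local converted="$1" template="$2" dispatch="$3"
  local orig target
  orig=$(( $(wc -w < "$converted") ))   # arithmetic strips any padding
  target=$(( orig * 25 / 100 ))
  sed -e "s/{{ORIGINAL_WORDS}}/$orig/g" \
      -e "s/{{TARGET_WORDS}}/$target/g" \
      "$template" > "$dispatch"
}
```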
Quality check: After the digest agent returns, count words in the digest (wc -w). If the count falls outside 20-30% of the converted file's word count, dispatch one retry; accept the second result regardless (see Failure Modes).
Output: Write digest to {original-path-without-ext}.digest.md.
Word count is a proxy for token count. These diverge for code-heavy or CJK content, but word count is sufficient for v1.
After all conversions and digests complete, output:
## Distill Summary
| File | Format | Tier | Converted | Digest | Token Savings |
|---|---|---|---|---|---|
| {file} | {format} | {tier} | {output} ({words} words) | {digest} ({words} words) | ~{pct}% |
**Total:** {N} files converted, {M} digests produced, ~{pct}% average token savings on digestible content.
Generated files can be added to .gitignore if not needed in version control.
Token savings per file = 1 - (digest words / converted words) expressed as percentage.
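As a worked example of the integer arithmetic (`savings_pct` is a hypothetical helper): a 1000-word conversion with a 250-word digest saves 100 - 25 = 75%.

```shell
# $1 = converted word count, $2 = digest word count.
savings_pct() {
  echo $(( 100 - ($2 * 100 / $1) ))
}
```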
Files that were skipped (unsupported, tool missing, pre-flight failure) are listed separately:
**Skipped:** {N} files
- {file}: {reason}
Every Bash command that touches file paths MUST use quoted shell variables:
# CORRECT
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
# WRONG — never do this
pandoc -f $FORMAT -t markdown --wrap=none $INPUT_PATH -o $OUTPUT_PATH
- "$VAR", never bare $VAR
- Never $() or backtick interpolation of paths

| Failure | Behavior |
|---|---|
| Tool not installed | Skip tier, report with install guidance, continue |
| Conversion fails (non-zero exit) | Report per-file, continue with remaining files |
| Empty conversion output | Report per-file, continue |
| Zip bomb detected | Skip file, report, continue |
| Scanned PDF | Report, skip digest, continue |
| Venv/pip failure | Skip Tier 3, report with manual install instructions |
| Digest out of range | One retry, accept second result regardless |
| File not found | Report, continue with remaining files |
| Permission denied | Report, continue |
| Encoding error | Attempt re-encode, skip on failure, continue |
Principle: Never halt the batch for a single file failure. Report and continue.
Standalone usage:
- /distill <path> — convert one or more files
- /distill <directory> — convert all supported files in directory

Called by:
Dispatches:
- skills/distill/pdf-structurer-prompt.md
- skills/distill/digest-prompt.md

Does not dispatch: No quality gate, no red-team, no review loop. Distill is a utility skill — it converts and compresses. Quality is ensured by the digest quality metric (word count check + one retry).