Use when converting EPUB ebooks to markdown format compatible with the books-to-skills workflow.
EPUB to markdown conversion using Pandoc (recommended) or native tools (fallback), with vision-based figure descriptions.
Use this skill when you have an EPUB that needs clean markdown output, when preparing a book for the converting-books-to-skills workflow, or when a book's figures need vision-based descriptions. The conversion runs in these phases:
| Phase | Purpose | Output |
|---|---|---|
| 0. Check Pandoc | Verify Pandoc installed, prompt if not | Decision on method |
| 1. Convert | Pandoc EPUB→markdown (or native fallback) | Raw markdown |
| 2. Extract Images | Unzip EPUB, count images, estimate tokens | Image manifest + stats |
| 3. Prompt for Images | Ask user if they want image descriptions | Decision on vision |
| 4. Describe Figures | Generate visual descriptions of images | Enhanced markdown |
| 5. Validate | Verify output quality | Ready for books-to-skills |
| 6. Move to Final | Copy to destination, cleanup temp files | Completed conversion |
Method comparison:

| Aspect | Pandoc | Native (sed/awk) |
|---|---|---|
| HTML entity decoding | Complete | Partial (misses many) |
| Chapter ordering | Correct (OPF spine) | Alphabetical (incorrect) |
| Cross-references | Preserved as links | Plain text |
| Figure references | Image links + IDs | Plain text |
| Code formatting | Semantic spans | Basic backticks |
| Dependencies | Requires Pandoc | None |
| Recommendation | Primary | Fallback only |
IMPORTANT: Before starting conversion, check if Pandoc is available.
# Check if Pandoc is installed
which pandoc && pandoc --version | head -1
If Pandoc is NOT found, use the AskUserQuestion tool to prompt the user:
AskUserQuestion(
questions: [{
question: "Pandoc is not installed. How would you like to proceed?",
header: "Pandoc",
options: [
{
label: "Install Pandoc (Recommended)",
description: "Run 'brew install pandoc' - provides best conversion quality with correct chapter ordering, full entity decoding, and preserved cross-references"
},
{
label: "Use native conversion",
description: "Continue with sed/awk fallback - may have incorrect chapter order, missing entities, and no cross-references"
}
],
multiSelect: false
}]
)
If user chooses "Install Pandoc":
brew install pandoc
# Verify installation
pandoc --version | head -1
If user chooses "Use native conversion": Skip to Step 1 (Alternative) below.
# Set paths
EPUB_FILE="path/to/book.epub"
WORK_DIR=".tmp/epub-convert"
mkdir -p "$WORK_DIR"
# Convert EPUB to markdown
pandoc "$EPUB_FILE" -t markdown -o "$WORK_DIR/book.md"
# Check result
wc -l "$WORK_DIR/book.md"
head -100 "$WORK_DIR/book.md"
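Optionally, Pandoc can extract embedded media and disable line wrapping in the same pass, which simplifies Step 2. A sketch using Pandoc's --extract-media and --wrap flags:
# Alternative: pull images out of the EPUB while converting
pandoc "$EPUB_FILE" -t markdown --wrap=none --extract-media="$WORK_DIR/media" -o "$WORK_DIR/book.md"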
Pandoc advantages: correct chapter ordering from the OPF spine, complete HTML entity decoding, preserved cross-references and figure IDs, and semantic code spans.
Use only if Pandoc is unavailable or user chose native conversion:
EPUB_FILE="path/to/book.epub"
WORK_DIR=".tmp/epub-convert"
rm -rf "$WORK_DIR"
mkdir -p "$WORK_DIR"
# Extract EPUB (it's a ZIP file)
unzip -q "$EPUB_FILE" -d "$WORK_DIR/extracted"
# Combine HTML files (WARNING: alphabetical order may be wrong)
find "$WORK_DIR/extracted" \( -name "*.xhtml" -o -name "*.html" \) | \
grep -v nav.xhtml | sort | xargs cat > "$WORK_DIR/combined.html"
# Multi-pass conversion
sed -e 's/<h1[^>]*>/\n# /g' -e 's/<\/h1>/\n/g' \
-e 's/<h2[^>]*>/\n## /g' -e 's/<\/h2>/\n/g' \
-e 's/<h3[^>]*>/\n### /g' -e 's/<\/h3>/\n/g' \
-e 's/<p[^>]*>/\n/g' -e 's/<\/p>/\n/g' \
-e 's/<li[^>]*>/- /g' -e 's/<\/li>/\n/g' \
-e 's/<strong[^>]*>/**/g' -e 's/<\/strong>/**/g' \
-e 's/<em[^>]*>/_/g' -e 's/<\/em>/_/g' \
-e 's/<code[^>]*>/`/g' -e 's/<\/code>/`/g' \
"$WORK_DIR/combined.html" > "$WORK_DIR/pass1.md"
sed -e 's/<[^>]*>//g' \
    -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' \
    -e 's/&amp;/\&/g' -e 's/&mdash;/—/g' -e 's/&ndash;/–/g' \
    -e 's/&#160;/ /g' -e 's/&#8212;/—/g' -e 's/&#8211;/–/g' \
    -e "s/&#8217;/'/g" -e "s/&#8216;/'/g" \
    -e 's/&#8220;/"/g' -e 's/&#8221;/"/g' \
    "$WORK_DIR/pass1.md" > "$WORK_DIR/pass2.md"
# Collapse runs of 3+ blank lines down to 2
awk 'BEGIN{blank=0} /^$/{blank++; if(blank<=2) print; next} {blank=0; print}' \
    "$WORK_DIR/pass2.md" > "$WORK_DIR/book.md"
Native limitations: chapters are concatenated alphabetically rather than in spine order, many HTML entities survive undecoded, cross-references collapse to plain text, and tables usually need manual cleanup.
# If not already extracted (Pandoc path)
unzip -q "$EPUB_FILE" -d "$WORK_DIR/extracted"
# Create image manifest
find "$WORK_DIR/extracted" -type f \( -name "*.jpg" -o -name "*.png" -o -name "*.gif" -o -name "*.webp" \) | sort > "$WORK_DIR/images.txt"
# Count images
IMAGE_COUNT=$(wc -l < "$WORK_DIR/images.txt")
echo "Found $IMAGE_COUNT images"
# Calculate total image size and estimate tokens
# (stat -f%z is macOS/BSD; on Linux use: stat -c%s)
TOTAL_BYTES=$(xargs -I{} stat -f%z "{}" < "$WORK_DIR/images.txt" 2>/dev/null | awk '{sum+=$1} END {print sum}')
TOTAL_MB=$(echo "scale=2; $TOTAL_BYTES / 1048576" | bc)
# Token estimate: ~1 token per 4 bytes for vision, plus ~100 tokens overhead per image for description
ESTIMATED_TOKENS=$(echo "scale=0; ($TOTAL_BYTES / 4) + ($IMAGE_COUNT * 100)" | bc)
echo "Total image size: ${TOTAL_MB}MB"
echo "Estimated vision tokens: $ESTIMATED_TOKENS"
Token estimation formula: roughly 1 token per 4 bytes of image data for vision input, plus ~100 tokens of overhead per image for the generated description.
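For example, a book with 150 images totaling 12MB: 12,582,912 bytes / 4 ≈ 3,145,728 vision tokens, plus 150 × 100 = 15,000 description-overhead tokens, or roughly 3.16M tokens in total.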
IMPORTANT: Before processing images, prompt the user to confirm. Image processing is time-intensive and uses significant tokens.
# Display stats for user decision
echo "Image Processing Summary:"
echo " - Images found: $IMAGE_COUNT"
echo " - Total size: ${TOTAL_MB}MB"
echo " - Estimated tokens: $ESTIMATED_TOKENS"
echo " - Estimated sub-agents needed: $((($IMAGE_COUNT + 59) / 60))"
Use the AskUserQuestion tool with the calculated stats:
AskUserQuestion(
questions: [{
question: "Found {IMAGE_COUNT} images ({TOTAL_MB}MB, ~{ESTIMATED_TOKENS} tokens). Would you like to generate inline figure descriptions?",
header: "Images",
options: [
{
label: "Yes, process all images (Recommended)",
description: "Generate descriptions for all {IMAGE_COUNT} images using vision. Takes ~{ESTIMATED_AGENTS} sub-agents. Provides complete coverage."
},
{
label: "Yes, referenced images only",
description: "Only process images referenced in the markdown. Faster but may miss some figures."
},
{
label: "No, skip image processing",
description: "Convert text only. Image references will remain as links without descriptions."
}
],
multiSelect: false
}]
)
If user chooses "Yes, process all images": Continue to Step 4.
If user chooses "Yes, referenced images only":
# Find only images referenced in markdown (these are relative paths, e.g. images/f0001.jpg)
grep -oh 'images/[^)]*' "$WORK_DIR/book.md" | sort -u > "$WORK_DIR/referenced-images.txt"
# Map relative references back to the full paths in the manifest
grep -F -f "$WORK_DIR/referenced-images.txt" "$WORK_DIR/images.txt" > "$WORK_DIR/referenced-full.txt"
REF_COUNT=$(wc -l < "$WORK_DIR/referenced-full.txt")
echo "Processing $REF_COUNT referenced images (of $IMAGE_COUNT total)"
# Use referenced-full.txt instead of images.txt in Step 4
If user chooses "No, skip image processing": Skip to Step 5 (Validate).
Technical books contain valuable visual information that neither Pandoc nor native conversion can extract: schematics, die photos, memory maps, code listings rendered as images.
IMPORTANT: Reading images fills context quickly. A book with 200+ images will exhaust any single agent's context. Use sub-agents for batch processing.
# Find all images and create an ordered manifest
find "$WORK_DIR/extracted" -type f \( -name "*.jpg" -o -name "*.png" -o -name "*.gif" -o -name "*.webp" \) | sort > "$WORK_DIR/images.txt"
wc -l "$WORK_DIR/images.txt"
# Create output directory for descriptions
mkdir -p "$WORK_DIR/descriptions"
Spawn sub-agents to process images in batches. Each agent processes as many as it can before context fills, then reports progress.
Parent agent workflow:
1. Read images.txt to get total count
2. Set START_INDEX=1
3. While START_INDEX <= TOTAL_IMAGES:
a. Spawn sub-agent with Task tool (see prompt below)
b. Agent returns: {"last_processed": N, "output_file": "descriptions/batch-N.md"}
c. Set START_INDEX = N + 1
4. Concatenate all batch files into final descriptions
Sub-agent prompt template:
You are processing images for EPUB figure descriptions.
IMAGE_DIR: {work_dir}/extracted/OEBPS/images/
MANIFEST: {work_dir}/images.txt
START_INDEX: {start_index}
OUTPUT_FILE: {work_dir}/descriptions/batch-{start_index}.md
TASK:
1. Read the manifest file to get the list of images
2. Starting from image #{start_index}, read each image using the Read tool
3. For each image, generate a description (50-100 words) in this format:
> **Figure {number}: {title}**
> {description focusing on technical details, components, attack points}
4. Write descriptions to OUTPUT_FILE as you go
5. Process as many images as you can before context limits
6. When stopping, report back:
- Last image index successfully processed
- Path to output file
- Count of images described in this batch
IMAGE TYPE GUIDANCE:
| Type | Focus on |
| -------------- | ---------------------------------------------- |
| Die photo | Memory arrays, logic blocks, bond pads, scale |
| Schematic | Key components, connections, attack points |
| Memory map | Address ranges, blocks, vulnerabilities |
| Code listing | Language, purpose, key operations (transcribe) |
| Oscilloscope | Signals, timing, glitch characteristics |
| Photo | Hardware setup, probe points, modifications |
CRITICAL: If you cannot process more images due to context, STOP and report
your progress. The parent will spawn another agent to continue.
Example Task tool invocation:
Task(
subagent_type: "general-purpose",
model: "sonnet", # Use Sonnet for cost efficiency - image description is straightforward
description: "Process images 1-50 for descriptions",
prompt: "[prompt above with variables filled in]"
)
Model selection: Use model: "sonnet" for image processing sub-agents. Sonnet provides excellent image description quality at lower token cost than Opus. Testing showed no quality degradation for this task.
The parent agent tracks progress and spawns new sub-agents as needed:
# Tracking state
TOTAL_IMAGES=$(wc -l < "$WORK_DIR/images.txt")
PROCESSED=0
# Loop until all processed
while [ $PROCESSED -lt $TOTAL_IMAGES ]; do
# Spawn sub-agent starting at PROCESSED+1
# Sub-agent returns last_processed
# Update PROCESSED = last_processed
done
# Combine all batch files in numeric order (lexical order would put batch-10 before batch-2;
# note also that a glob inside quotes does not expand)
for f in $(ls "$WORK_DIR"/descriptions/batch-*.md | sed 's/.*batch-//' | sort -n); do
  cat "$WORK_DIR/descriptions/batch-$f"
done > "$WORK_DIR/all-descriptions.md"
For very large books (500+ images), prioritize key figures:
# Find which images are referenced in markdown
grep -oh 'images/[^)]*' "$WORK_DIR/book.md" | sort -u > "$WORK_DIR/referenced-images.txt"
# Process only referenced images first
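A sketch of building the prioritized manifest (assumes the markdown references images by images/-relative paths that appear as substrings of the full paths in images.txt):
# Referenced images first, then the remainder
grep -F -f "$WORK_DIR/referenced-images.txt" "$WORK_DIR/images.txt" > "$WORK_DIR/prioritized.txt"
grep -vF -f "$WORK_DIR/referenced-images.txt" "$WORK_DIR/images.txt" >> "$WORK_DIR/prioritized.txt"
# Use prioritized.txt as the Step 4 manifest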
Key insight: Many technical books render code listings as images. Vision can transcribe these where text extraction fails completely.
# Check file size
wc -l "$WORK_DIR/book.md"
# Check for chapter markers
grep -n "^## " "$WORK_DIR/book.md" | head -20
# Check token count (approximate)
chars=$(wc -c < "$WORK_DIR/book.md")
tokens=$((chars / 4))
echo "Approximate tokens: $tokens"
# Check for remaining HTML artifacts
grep -c "<[a-zA-Z]" "$WORK_DIR/book.md"
# Preview content
head -200 "$WORK_DIR/book.md"
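A minimal pass/fail gate combining these checks (a sketch; the thresholds are assumptions to tune per book):
# Fail if HTML artifacts remain or no chapter headers were produced
HTML_ARTIFACTS=$(grep -c "<[a-zA-Z]" "$WORK_DIR/book.md")
CHAPTERS=$(grep -c "^## " "$WORK_DIR/book.md")
if [ "$HTML_ARTIFACTS" -gt 0 ] || [ "$CHAPTERS" -eq 0 ]; then
  echo "Validation FAILED: $HTML_ARTIFACTS HTML artifacts, $CHAPTERS chapter headers"
else
  echo "Validation passed: $CHAPTERS chapters, no HTML artifacts"
fi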
OUTPUT_FILE="path/to/BookName_full.md"
cp "$WORK_DIR/book.md" "$OUTPUT_FILE"
rm -rf "$WORK_DIR"
echo "Conversion complete: $OUTPUT_FILE"
| Content Type | Pandoc | Native | With Vision | Notes |
|---|---|---|---|---|
| Headers (h1-h6) | Excellent | Good | - | Pandoc preserves IDs |
| Paragraphs | Excellent | Good | - | Both work well |
| Lists (ul/ol) | Excellent | Fair | - | Pandoc handles nesting |
| Bold/Italic | Excellent | Good | - | Direct mapping |
| Code blocks | Excellent | Fair | - | Pandoc uses semantic spans |
| Tables | Good | Poor | - | Both may need cleanup |
| Cross-references | Excellent | None | - | Pandoc preserves links |
| HTML entities | Complete | Partial | - | Pandoc decodes all |
| Chapter order | Correct | Wrong | - | Pandoc follows OPF spine |
| Figures (images) | Reference | Reference | Excellent | Vision extracts semantic meaning |
| Code as images | None | None | Excellent | Vision can transcribe |
Bottom line: Use Pandoc + Vision for best results.
After conversion, use the converting-books-to-skills skill:
Read(".claude/skill-library/claude/skill-management/converting-books-to-skills/SKILL.md")
The markdown file produced by this skill is compatible with that workflow's chapter detection (## headers).
Problem: Pandoc not installed
brew install pandoc
# Or on Linux: apt install pandoc
Problem: Chapter order wrong (native conversion)
The native approach sorts files alphabetically. Check OPF spine for correct order:
grep -A 100 "<spine" "$WORK_DIR/extracted/OEBPS/content.opf"
Problem: HTML entities remain (native conversion)
The native approach misses many entities. Either use Pandoc or add more sed patterns:
sed -i '' -e 's/&Omega;/Ω/g' -e 's/&thinsp;/ /g' "$WORK_DIR/book.md"
Problem: Code blocks are images, not text
Many technical books render code as images. Use vision to transcribe:
Read("$WORK_DIR/extracted/OEBPS/images/f0148-01.jpg")
Related skills:
| Skill | Purpose |
|---|---|
| converting-books-to-skills | Takes this output and creates skills |
See .history/CHANGELOG for version history.