Convert narration audio plus a slide deck into a narrated video. Use when the user has audio-only `mp4/m4a/mp3/wav` and a `ppt/pptx/pdf` deck, and needs slide images, transcript extraction, slide timing planning, or final `mp4` rendering with `whisper-cpp` and `ffmpeg`.
Use this skill when the source video has narration audio but no usable slide visuals, and the final deliverable should be a slide-based lecture video.
Resolve bundled scripts relative to this skill directory. If the runtime has already opened this `SKILL.md`, prefer paths like `scripts/extract_slide_outline.py` and `scripts/render_from_timing_csv.py` instead of machine-specific absolute paths.
1. **Inventory inputs.**
   - Audio: `mp4`/`m4a`/`mp3`/`wav`.
   - Deck: `ppt`/`pptx`, `pdf`, and any pre-rendered slide images.
   - Prefer a `pdf` or image directory for rendering. Treat `pptx` as the source of slide text and as a fallback for export.
2. **Prepare tools.**
   - `ffmpeg`, `ffprobe`, `pdftoppm`.
   - `whisper-cli` from `whisper-cpp` plus a multilingual model such as `ggml-small.bin`.
   - If only a `pptx` exists and no `pdf`/images exist, prefer Keynote or PowerPoint export on macOS. Use `soffice` only as a fallback because profile or rendering issues are common.
3. **Produce slide images.**
   - If a `pdf` exists, render it to images:

     ```bash
     pdftoppm -png -r 200 "$PDF" "$OUTDIR/slide"
     ```

   - If only a `pptx` exists, export it to `pdf` or slide images with Keynote or PowerPoint, then continue from the `pdf`.
   - Name the images `slide-01.png`, `slide-02.png`, ...
4. **Extract slide text.**
   ```bash
   python3 scripts/extract_slide_outline.py \
     --pptx "$PPTX" \
     --out "$WORKDIR/slide_outline.csv"
   ```
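The outline CSV's exact columns are defined by the bundled script; a minimal reader sketch, assuming a hypothetical `slide,title,body` layout, shows how the per-slide text feeds timing analysis:

```python
import csv
import io

# Hypothetical slide_outline.csv layout; the bundled script defines the real columns.
sample = io.StringIO(
    "slide,title,body\n"
    "1,Welcome,Agenda and speaker intro\n"
    "2,Architecture,Services and data flow\n"
)

# Index rows by slide number; titles and body text give keywords
# to match against the transcript when planning slide timings.
outline = {int(row["slide"]): row for row in csv.DictReader(sample)}
print(outline[2]["title"])  # Architecture
```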
5. **Extract clean audio for ASR.**
   - If the source is `mp4`, extract a 16 kHz mono wav:

     ```bash
     ffmpeg -y -i "$AUDIO_MP4" -ar 16000 -ac 1 -c:a pcm_s16le "$WORKDIR/audio.wav"
     ```

   - If the source is `wav`/`mp3`/`m4a`, convert it to the same mono wav form if needed.
6. **Transcribe with `whisper-cli`.**
   ```bash
   whisper-cli -ng \
     -m "$MODEL" \
     -f "$WORKDIR/audio.wav" \
     -l zh \
     -ocsv -osrt -of "$WORKDIR/transcript"
   ```
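The resulting `transcript.csv` can be parsed for the timing step. A sketch, assuming the common `whisper-cpp` CSV layout of `start,end,text` with millisecond timestamps (check your build's actual header):

```python
import csv
import io

# Sample rows in the shape whisper-cpp's -ocsv output typically takes
# (times in milliseconds, text quoted); verify against your build.
sample = io.StringIO(
    "start,end,text\n"
    '0,14800," Welcome everyone, today we cover the system architecture."\n'
    '14800,99500," Let us start with an overview of the services."\n'
)

# Convert to (start_sec, end_sec, text); segment boundaries are the
# natural candidates for slide-change times in slide_timings.csv.
segments = [
    (int(row["start"]) / 1000.0, int(row["end"]) / 1000.0, row["text"].strip())
    for row in csv.DictReader(sample)
]
print(segments[1][0])  # 14.8
```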
   - Use `transcript.csv` for downstream parsing; `transcript.srt` is useful for manual review.
   - Use `-ng` to force CPU mode.
7. **Build `slide_timings.csv`.**
   - Use these columns:

     ```
     slide,start_sec,end_sec,duration_sec,reason
     1,0.000,15.000,15.000,opening title and agenda
     2,15.000,100.000,85.000,architecture overview starts here
     ```

   - Ensure `duration_sec = end_sec - start_sec` for every row.
   - Ensure the final `end_sec` matches the audio duration or is within a small tolerance.
8. **Render the final video.**
   ```bash
   python3 scripts/render_from_timing_csv.py \
     --images "$SLIDE_IMAGES_DIR" \
     --timings "$WORKDIR/slide_timings.csv" \
     --audio "$WORKDIR/audio.wav" \
     --output "$OUT_VIDEO"
   ```
   The script generates an ffconcat file, validates timing continuity, and calls `ffmpeg` to encode the final `mp4`.
9. **Verify and iterate.**
   - Check the output duration and audio sync with `ffprobe`.
   - If slide boundaries are off, adjust `slide_timings.csv` and rerun the render script.
   - Keep iterating on `slide_timings.csv` until the boundaries match the narration.

Install dependencies on macOS if missing:
```bash
brew install ffmpeg poppler whisper-cpp
```
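Whether the tools are already present can be checked before installing. A small sketch (the `missing_tools` helper is hypothetical; `whisper-cli` is the binary name the `whisper-cpp` formula installs):

```python
import shutil

def missing_tools(tools, which=shutil.which):
    """Return the subset of tools not found on PATH (which is injectable for testing)."""
    return [t for t in tools if which(t) is None]

# Binaries this workflow shells out to.
required = ["ffmpeg", "ffprobe", "pdftoppm", "whisper-cli"]
print(missing_tools(required))  # anything listed here still needs installing
```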
Typical multilingual model download:
```bash
mkdir -p .models
curl -L 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin' -o .models/ggml-small.bin
```
Bundled scripts:

- `scripts/extract_slide_outline.py`: extract slide text from `pptx` into CSV or JSON for timing analysis.
- `scripts/render_from_timing_csv.py`: validate a timing CSV, generate an ffconcat file, and render the final video with `ffmpeg`.
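The timing-continuity rules above (contiguous segments, `duration_sec = end_sec - start_sec`, final `end_sec` near the audio duration) can be sketched as follows; the bundled script's exact checks and tolerances may differ:

```python
import csv
import io

def check_timings(rows, audio_dur, tol=0.5):
    """Collect violations of the slide_timings.csv rules (hypothetical tolerances)."""
    problems = []
    prev_end = 0.0
    for r in rows:
        start, end, dur = (float(r[k]) for k in ("start_sec", "end_sec", "duration_sec"))
        if abs(start - prev_end) > 1e-3:
            problems.append(f"slide {r['slide']}: gap/overlap at {start}")
        if abs((end - start) - dur) > 1e-3:
            problems.append(f"slide {r['slide']}: duration_sec mismatch")
        prev_end = end
    if abs(prev_end - audio_dur) > tol:
        problems.append(f"final end_sec {prev_end} != audio duration {audio_dur}")
    return problems

# The example rows from the timing step above validate cleanly.
sample = io.StringIO(
    "slide,start_sec,end_sec,duration_sec,reason\n"
    "1,0.000,15.000,15.000,opening title and agenda\n"
    "2,15.000,100.000,85.000,architecture overview starts here\n"
)
print(check_timings(list(csv.DictReader(sample)), audio_dur=100.0))  # []
```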