Skill File

Generating Voiceover

Produce delivery-ready TTS audio for video tasks. Use when asked to generate narration or voiceover, choose a TTS engine, clean artifacts, normalize loudness, and export segment-ready audio files.

soasme · 0 stars · Mar 29, 2026

Occupation
Categories: Media

Skill Content

Generate Voiceover

Choose engine by need:

Prototyping: use espeak-ng for fast local iteration.
Cloud: use scripts/chatterbox_tts.js for Replicate Chatterbox.

Processing goals:

Keep narration intelligible and free of low-end rumble and digital fizz.
Keep tonal balance and loudness consistent across all segments in one video.
Avoid boundary clicks/pops with short fades on every cut.
Export a delivery-ready file that matches project specs.

Recommended workflow:

Generate raw TTS first. Do not normalize at this stage.
Split or assemble segments by sentence/phrase boundaries, then apply the same cleanup chain to every segment.
Do timing edits (trim, pacing, alignment) before final loudness normalization.
Normalize loudness as the final audio-processing step.
Export to the project target format and run a final listening pass.

Engine selection guidance:

Use espeak-ng for drafts, timing tests, and rapid iteration where quality is less important.

Use Chatterbox for cloud-quality output when naturalness and delivery quality matter.

Use --audio-ref in Chatterbox when you need style/voice steering; omit it for generic delivery.

Prototyping example: espeak-ng -v en-us -s 165 -w draft.wav "Your script text"

Cloud example (requires REPLICATE_API_TOKEN): REPLICATE_API_TOKEN=<token> node .agents/skills/generating-voiceover/scripts/chatterbox_tts.js --prompt "Your script text" --output narration.wav --audio-ref ref.wav
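The engine choice above can be wrapped in a small helper. This is a minimal sketch, not part of the skill: the function name `tts_command` and the mode names `draft` and `cloud` are hypothetical, and the function only prints the command so it can be reviewed before running. Script text containing double quotes would need extra escaping.

```shell
# Print the TTS command for a given mode instead of running it.
# tts_command and the mode names "draft"/"cloud" are hypothetical.
# Arguments: mode, script text, output path, optional reference audio.
tts_command() {
  mode=$1; text=$2; out=$3; ref=${4:-}
  case "$mode" in
    draft)
      # Fast local iteration with espeak-ng.
      printf 'espeak-ng -v en-us -s 165 -w %s "%s"\n' "$out" "$text"
      ;;
    cloud)
      # Replicate Chatterbox; REPLICATE_API_TOKEN must be set when the command runs.
      cmd="node .agents/skills/generating-voiceover/scripts/chatterbox_tts.js --prompt \"$text\" --output $out"
      # Add --audio-ref only when style/voice steering is wanted.
      if [ -n "$ref" ]; then
        cmd="$cmd --audio-ref $ref"
      fi
      printf '%s\n' "$cmd"
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1
      ;;
  esac
}
```

For example, `tts_command cloud "Your script text" narration.wav` prints the Chatterbox command without `--audio-ref`, which gives generic delivery.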

Cleanup guidance:

Always high-pass around 20 Hz to remove DC/rumble.
Apply low-pass around 16 kHz only when top-end fizz/harshness is audible.
Add short fade-in/out (~50 ms) at clip edges to prevent clicks and pops.
Keep one consistent filter chain across all segments to avoid tonal mismatch.

Recommended FFmpeg chain (the second afade sits between two areverse filters, so it fades out the end of the clip without needing its duration): ffmpeg -y -i raw.wav -af "highpass=f=20,lowpass=f=16000,afade=t=in:st=0:d=0.05,areverse,afade=t=in:st=0:d=0.05,areverse" cleaned.wav

If low-pass is not needed, remove lowpass=f=16000.
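One way to keep the chain identical across segments is to build the filter string in one place and reuse it. The sketch below assumes raw segments sit as .wav files in a single directory; the helper names `cleanup_chain` and `clean_segments` and that directory layout are hypothetical.

```shell
# Build the cleanup filter chain once so every segment gets the same one.
# cleanup_chain is a hypothetical helper; pass "lowpass" to include the 16 kHz low-pass.
cleanup_chain() {
  chain="highpass=f=20"
  if [ "${1:-}" = "lowpass" ]; then
    chain="$chain,lowpass=f=16000"
  fi
  # Fade in, reverse, fade in again, reverse back: the second fade lands
  # on the end of the clip without needing its duration.
  printf '%s\n' "$chain,afade=t=in:st=0:d=0.05,areverse,afade=t=in:st=0:d=0.05,areverse"
}

# Apply the same chain to every .wav in in_dir, writing to out_dir.
# The directory layout is an assumption, not something the skill prescribes.
clean_segments() {
  in_dir=$1; out_dir=$2
  chain=$(cleanup_chain "${3:-}")
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.wav; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    ffmpeg -nostdin -y -i "$f" -af "$chain" "$out_dir/$(basename "$f")"
  done
}
```

Calling `clean_segments raw cleaned lowpass` would process every file in `raw/` with the low-pass included; omitting the third argument leaves it out.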

Loudness guidance (BS.1770 / EBU R128 style):

Measure first, then normalize.
Use integrated loudness target -23 LUFS, true peak around -1.5 dBTP, and optional LRA=11.
Re-normalize if any later tempo/stretch/timing edit changes the signal.

Measure loudness (EBU R128 / BS.1770 style; peak=true adds true peak to the summary): ffmpeg -hide_banner -i cleaned.wav -filter_complex ebur128=peak=true -f null -

Normalize as the final step: ffmpeg -y -i cleaned.wav -af "loudnorm=I=-23:TP=-1.5:LRA=11:print_format=summary" normalized.wav
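The single-pass command above lets loudnorm adapt gain dynamically. One way to act on "measure first, then normalize" is loudnorm's two-pass mode, where a first pass with `print_format=json` reports `input_i`, `input_tp`, `input_lra`, `input_thresh`, and `target_offset`, and a second pass feeds those values back in. This is a sketch of that option, not the skill's stated method; the helper name `loudnorm_pass2` is hypothetical.

```shell
# First pass (measurement only), run separately:
#   ffmpeg -hide_banner -i cleaned.wav \
#     -af loudnorm=I=-23:TP=-1.5:LRA=11:print_format=json -f null -
#
# Build the second-pass filter string from the first-pass measurements.
# loudnorm_pass2 is a hypothetical helper.
# Arguments: input_i, input_tp, input_lra, input_thresh, target_offset.
loudnorm_pass2() {
  printf 'loudnorm=I=-23:TP=-1.5:LRA=11:measured_I=%s:measured_TP=%s:measured_LRA=%s:measured_thresh=%s:offset=%s:linear=true:print_format=summary\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

With placeholder measurements (illustrative numbers, not real results), the second pass would look like `ffmpeg -y -i cleaned.wav -af "$(loudnorm_pass2 -27.2 -4.1 5.3 -37.5 0.4)" normalized.wav`.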

Export delivery format (typical video-ready WAV): ffmpeg -y -i normalized.wav -ar 48000 -ac 1 -c:a pcm_s16le delivery.wav

Final QC checklist:

No audible clicks at segment starts/ends.
No clipping on peaks and no unexpected pumping/breathing.
Speech remains clear on small speakers and headphones.
All segments sound like the same voice in the same acoustic space.

If tempo or duration changes happen after normalization, run normalization again.

Logging

Log to {project_dir}/logs/production.jsonl. See skills/lib/logging-guide.md for schema.

On invocation, key inputs: script_path, voice_style
On completion, key outputs: audio_path, duration_s, lufs (normalized loudness)
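A completion record could be appended along these lines. The authoritative schema is in skills/lib/logging-guide.md, which is not reproduced here, so this is only a sketch: the `event` field and the helper names `json_escape` and `log_completion` are assumptions, and only `audio_path`, `duration_s`, and `lufs` come from the key outputs listed above.

```shell
# Escape backslashes and double quotes so a path is safe inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Append one completion record to {project_dir}/logs/production.jsonl.
# The "event" field is an assumption; check logging-guide.md for the real schema.
# Arguments: project_dir, audio_path, duration_s, lufs.
log_completion() {
  project_dir=$1; audio_path=$2; duration_s=$3; lufs=$4
  mkdir -p "$project_dir/logs"
  printf '{"event":"completion","audio_path":"%s","duration_s":%s,"lufs":%s}\n' \
    "$(json_escape "$audio_path")" "$duration_s" "$lufs" \
    >> "$project_dir/logs/production.jsonl"
}
```

For instance, `log_completion "$project_dir" out/delivery.wav 12.5 -23.0` appends a single JSON line; the numeric values here are placeholders.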
