Turn research or topics into podcast scripts and audio using ElevenLabs
Turn research, articles, or topics into podcast-ready scripts and audio content. Generate conversational scripts with host/guest dynamics and produce audio using ElevenLabs text-to-speech.
Gather the source material:
Choose the podcast format:
| Format | Description | Best For |
|---|---|---|
| Solo explainer | One host walks through the topic | Tutorials, news summaries, deep dives |
| Conversational duo | Two hosts discuss and riff | Making complex topics accessible, entertainment |
| Interview style | Host asks questions, expert answers | Technical topics, research papers |
| Debate | Two perspectives argue a topic | Controversial or nuanced subjects |
| Narrative | Storytelling with narration | Case studies, historical events |
The two-host format that works (reverse-engineered from Google's Audio Overviews):
Structure (target ~150 words per minute of audio):
Write for ears, not eyes: contractions always, no semicolons, no parentheticals. If you wouldn't say it out loud, rewrite it.
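The rules above are mechanical enough to check automatically before any audio is generated. The following is a minimal sketch, not an exhaustive linter: `lint_for_ears` and the `UNCONTRACTED` phrase list are hypothetical names introduced here for illustration, and the phrase list is deliberately short.

```python
import re

# Hypothetical, deliberately small list of phrases that tend to sound stiff aloud.
UNCONTRACTED = ["do not", "does not", "cannot", "it is", "that is",
                "we are", "you are", "will not"]


def lint_for_ears(text: str) -> list[str]:
    """Return warnings for one spoken line (empty list means it passes)."""
    warnings = []
    if ";" in text:
        warnings.append("semicolon")
    if re.search(r"\([^)]*\)", text):
        warnings.append("parenthetical")
    lowered = text.lower()
    for phrase in UNCONTRACTED:
        # Word boundaries keep "it is" from matching inside "exhibit is".
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            warnings.append(f"uncontracted: {phrase}")
    return warnings
```

Run it over each utterance and rewrite anything it flags; a clean line returns an empty list.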
Script format — one line per utterance, speaker tag in brackets, blank line between speakers. This is the unit you'll chunk for TTS:
```
[ALEX]: So today we're diving into something that honestly broke my brain a little.

[SAM]: Oh no. What now.

[ALEX]: Okay, you know how everyone says [common belief]? There's this paper from [source] that basically says... the opposite.

[SAM]: Wait. The *opposite* opposite?
```
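Turning a script in this format into `(speaker, text)` tuples might look like the sketch below. The function name `parse_script` and the choice to fold untagged lines into the previous utterance are assumptions made for illustration; only a leading `[NAME]:` tag is treated as a speaker, so inline placeholders such as `[source]` are left untouched.

```python
import re

# Matches a leading "[SPEAKER]:" tag; brackets later in the line are ignored.
TAG = re.compile(r"^\[([A-Z][A-Z0-9_]*)\]:\s*(.*)$")


def parse_script(script: str) -> list[tuple[str, str]]:
    """Parse a bracket-tagged script into (speaker, text) tuples."""
    lines: list[tuple[str, str]] = []
    for raw in script.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # blank separator between speakers
        match = TAG.match(raw)
        if match:
            lines.append((match.group(1), match.group(2)))
        elif lines:
            # Untagged text continues the previous speaker's utterance.
            speaker, text = lines[-1]
            lines[-1] = (speaker, f"{text} {raw}")
        else:
            raise ValueError(f"text before first speaker tag: {raw!r}")
    return lines
```

The resulting list is the unit of work for synthesis: one tuple per TTS call, or per group of calls if an utterance needs chunking.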
Install: pip install elevenlabs pydub (pydub also needs ffmpeg on your PATH to decode and export MP3)
Model choice: eleven_multilingual_v2 for quality (10K char limit per call); eleven_turbo_v2_5 for speed/cost (40K char limit, ~300ms latency, ~3x faster).
Voice IDs that work for duo podcasts (from the default library — verify with client.voices.search()):
- JBFqnCBsd6RMkjVDRZzb (George: warm, mid-range male)
- 21m00Tcm4TlvDq8ikWAM (Rachel: clear, measured female)
- pNInz6obpgDQGcFmaJgB (Adam: energetic narrator)
- EXAVITQu4vr4xnSDxMaL (Bella: conversational female)

Settings for conversational podcast delivery:
- stability: 0.45 (lower = more expressive; below 0.3 gets inconsistent)
- similarity_boost: 0.8 (keeps voice consistent across chunks)
- style: 0.3 (mild exaggeration for energy; 0 = flat)
- use_speaker_boost: True

```python
import io
import os

from elevenlabs.client import ElevenLabs
from pydub import AudioSegment

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

VOICES = {"ALEX": "JBFqnCBsd6RMkjVDRZzb", "SAM": "21m00Tcm4TlvDq8ikWAM"}


def render_line(speaker: str, text: str) -> AudioSegment:
    """Synthesize one utterance and return it as a pydub segment."""
    audio = client.text_to_speech.convert(
        voice_id=VOICES[speaker],
        text=text,
        model_id="eleven_multilingual_v2",
        output_format="mp3_44100_128",
        voice_settings={"stability": 0.45, "similarity_boost": 0.8,
                        "style": 0.3, "use_speaker_boost": True},
    )
    # convert() yields the MP3 as byte chunks; join them before decoding.
    return AudioSegment.from_mp3(io.BytesIO(b"".join(audio)))


# Parsed script as (speaker, text) tuples; two sample lines shown here.
lines = [
    ("ALEX", "So today we're diving into something that honestly broke my brain a little."),
    ("SAM", "Oh no. What now."),
]

gap = AudioSegment.silent(duration=350)  # 350 ms pause after each utterance
episode = sum((render_line(s, t) + gap for s, t in lines), AudioSegment.empty())
episode.export("episode_raw.mp3", format="mp3", bitrate="128k")
```
Chunking long utterances: split at sentence boundaries (., ?, !), keep under ~800 chars per call. Pass previous_text/next_text params to preserve prosody across chunk boundaries.
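A sentence-boundary chunker along these lines could be sketched as follows. `chunk_sentences` and `with_context` are hypothetical helper names; the 800-character default mirrors the guideline above. The regex split is naive: it also breaks after an ellipsis, which is usually harmless because neighbouring pieces are packed back together.

```python
import re


def chunk_sentences(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # A single over-long sentence is kept whole rather than cut mid-word.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def with_context(chunks: list[str]) -> list[dict]:
    """Pair each chunk with its neighbours for previous_text / next_text."""
    return [
        {
            "text": chunk,
            "previous_text": chunks[i - 1] if i > 0 else None,
            "next_text": chunks[i + 1] if i + 1 < len(chunks) else None,
        }
        for i, chunk in enumerate(chunks)
    ]
```

Each dict carries the values for one call; it is probably safest to drop the `None` entries before passing them on, so that the first and last chunks simply omit the missing neighbour.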
Podcast standard is -16 LUFS (stereo) per Apple/Spotify specs. pydub's normalize() is peak-only — not LUFS. Use ffmpeg's two-pass loudnorm via the ffmpeg-normalize wrapper:
```shell
pip install ffmpeg-normalize
ffmpeg-normalize episode_raw.mp3 -o episode.mp3 -c:a libmp3lame -b:a 128k \
  -t -16 -tp -1.5 -lrt 11 --normalization-type ebu
```

-t -16 = target LUFS, -tp -1.5 = true-peak ceiling (prevents clipping), -lrt 11 = loudness range target. This runs two passes automatically (analyze, then correct).
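To run the same normalization from the Python pipeline, one option is a thin subprocess wrapper. This is a sketch under assumptions: `build_normalize_cmd` and `normalize_episode` are hypothetical names, and the long-form option names are used for readability, so confirm them against `ffmpeg-normalize --help` for your installed version.

```python
import subprocess


def build_normalize_cmd(src: str, dst: str, target_lufs: float = -16.0,
                        true_peak: float = -1.5, loudness_range: float = 11.0,
                        bitrate: str = "128k") -> list[str]:
    """Build the ffmpeg-normalize argument list for EBU R128 normalization."""
    return [
        "ffmpeg-normalize", src,
        "--output", dst,
        "--audio-codec", "libmp3lame",
        "--audio-bitrate", bitrate,
        "--normalization-type", "ebu",
        "--target-level", str(target_lufs),
        "--true-peak", str(true_peak),
        "--loudness-range-target", str(loudness_range),
    ]


def normalize_episode(src: str, dst: str, **kwargs) -> None:
    """Run ffmpeg-normalize; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_normalize_cmd(src, dst, **kwargs), check=True)
```

Keeping the command builder separate from the call that executes it makes the arguments easy to inspect or log without needing ffmpeg installed.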
| Content Type | Target Length | Script Word Count |
|---|---|---|
| News summary | 5-10 min | 750-1,500 words |
| Topic explainer | 10-20 min | 1,500-3,000 words |
| Deep dive | 20-40 min | 3,000-6,000 words |
| Research paper review | 15-25 min | 2,250-3,750 words |
Rule of thumb: ~150 words per minute of audio.
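The arithmetic behind the table can be sketched in a few lines. `word_budget` and `estimate_minutes` are hypothetical helpers, and the 150 words-per-minute default is the rule of thumb above, not a measured rate for any particular voice.

```python
def word_budget(min_minutes: int, max_minutes: int,
                words_per_minute: int = 150) -> tuple[int, int]:
    """Convert a target length range in minutes to a script word-count range."""
    return (min_minutes * words_per_minute, max_minutes * words_per_minute)


def estimate_minutes(spoken_text: str, words_per_minute: int = 150) -> float:
    """Estimate audio length from whitespace-separated word count."""
    return len(spoken_text.split()) / words_per_minute
```

Pass only the spoken text to `estimate_minutes`; speaker tags would otherwise be counted as words and inflate the estimate slightly.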
- Set the ELEVENLABS_API_KEY env var
- Respell hard-to-pronounce terms phonetically ("Kubernetes" → "koo-ber-NET-eez") or use ElevenLabs pronunciation dictionaries
- eleven_multilingual_v2 has known issues with very long single calls (voice drift, occasional stutter): chunk at sentence boundaries, don't send 5K-char blobs
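Phonetic respelling of the "Kubernetes" kind can be applied mechanically just before synthesis. The sketch below assumes a hand-maintained table; `RESPELLINGS` and `respell` are hypothetical names, and the single entry is the example from the text. Matching is case-sensitive and whole-word only.

```python
import re

# Hypothetical respelling table; extend it with terms your voices mispronounce.
RESPELLINGS = {"Kubernetes": "koo-ber-NET-eez"}


def respell(text: str, table: dict[str, str] = RESPELLINGS) -> str:
    """Replace whole-word occurrences of each term with its phonetic spelling."""
    if not table:
        return text
    # Longest terms first so a longer term wins over a shorter one it contains.
    terms = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)
```

Apply it to the text sent to TTS only, and keep the original spelling in the script and show notes.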