Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis; it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance.
This skill covers two architectures: speech-to-speech (OpenAI Realtime API; lowest latency, most natural) and pipeline (STT→LLM→TTS; more control, easier to debug). Key insight: latency is the constraint. Humans expect responses within roughly 500ms, so every millisecond matters.
84% of organizations are increasing voice AI budgets in 2025. This is the year voice agents go mainstream.
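The latency constraint can be made concrete with a quick budget check using the typical per-stage figures cited later in this skill (assumptions; replace with your own measurements):

```javascript
// End-to-end latency budget. Stage figures are the typical numbers
// cited in this skill; substitute your own measured values.
const stageBudgetMs = { vad: 100, stt: 200, llm: 300, tts: 200 };

const totalMs = Object.values(stageBudgetMs).reduce((a, b) => a + b, 0);
console.log(`End-to-end: ${totalMs}ms`); // 800ms - right at the limit
```

Adding up the defaults lands exactly at the 800ms ceiling, which is why every optimization below targets individual stages.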
Direct audio-to-audio processing for lowest latency
When to use: Maximum naturalness, emotional preservation, real-time conversation
""" [User Audio] → [S2S Model] → [Agent Audio]
Advantages:
- Lowest latency (no intermediate text conversion)
- Preserves emotional nuance, tone, and prosody
- Most natural conversational feel

Disadvantages:
- Less control over exactly what is said
- Harder to debug (no intermediate text to inspect)
""" import { RealtimeClient } from '@openai/realtime-api-beta';
const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY, });
// Configure for voice conversation
client.updateSession({
modalities: ['text', 'audio'],
voice: 'alloy',
input_audio_format: 'pcm16',
output_audio_format: 'pcm16',
instructions: You are a helpful customer service agent. Be concise and friendly. If you don't know something, say so rather than making things up.,
turn_detection: {
type: 'server_vad', // or 'semantic_vad'
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 500,
},
});
// Handle audio streams client.on('conversation.item.input_audio_transcription', (event) => { console.log('User said:', event.transcript); });
client.on('response.audio.delta', (event) => { // Stream audio to speaker audioPlayer.write(Buffer.from(event.delta, 'base64')); });
// Send user audio client.appendInputAudio(audioBuffer); """
Separate STT → LLM → TTS for maximum control
When to use: Need to know/control exactly what's said, debugging, compliance
""" [Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]
Advantages:
- Full visibility and control over the text at each stage
- Easier to debug, log, and audit (compliance)
- Components can be swapped independently

Disadvantages:
- Higher latency (each stage adds delay)
- Emotional nuance is lost in the intermediate text
""" import { Deepgram } from '@deepgram/sdk'; import { ElevenLabsClient } from 'elevenlabs'; import OpenAI from 'openai';
// Initialize clients const deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY); const elevenlabs = new ElevenLabsClient(); const openai = new OpenAI();
async function processVoiceInput(audioStream) { // 1. Speech-to-Text (Deepgram Nova-3) const transcription = await deepgram.transcription.live({ model: 'nova-3', punctuate: true, endpointing: 300, // ms of silence before end });
transcription.on('transcript', async (data) => { if (data.is_final && data.speech_final) { const userText = data.channel.alternatives[0].transcript; console.log('User:', userText);
// 2. LLM Processing
const completion = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: 'You are a concise voice assistant.' },
{ role: 'user', content: userText }
],
max_tokens: 150, // Keep responses short for voice
});
const agentText = completion.choices[0].message.content;
console.log('Agent:', agentText);
// 3. Text-to-Speech (ElevenLabs)
const audioStream = await elevenlabs.textToSpeech.stream({
voice_id: 'voice_id_here',
text: agentText,
model_id: 'eleven_flash_v2_5', // Lowest latency
});
// Stream to user
playAudioStream(audioStream);
}
});
// Pipe audio to transcription audioStream.pipe(transcription); } """
Detect when user starts/stops speaking
When to use: All voice agents need VAD for turn-taking
""" VAD Types:
""" import { SileroVAD } from '@pipecat-ai/silero-vad';
const vad = new SileroVAD({ threshold: 0.5, // Speech probability threshold min_speech_duration: 250, // ms before speech confirmed min_silence_duration: 500, // ms of silence = end of turn });
vad.on('speech_start', () => { console.log('User started speaking'); // Stop any playing TTS (barge-in) audioPlayer.stop(); });
vad.on('speech_end', () => { console.log('User finished speaking'); // Trigger response generation processTranscript(); });
// Feed audio to VAD audioStream.on('data', (chunk) => { vad.process(chunk); }); """
""" // In Realtime API session config client.updateSession({ turn_detection: { type: 'semantic_vad', // Uses meaning, not just silence // Model waits longer after "ummm..." // Responds faster after "Yes, that's correct." }, }); """
""" // When user interrupts: function handleBargeIn() { // 1. Stop TTS immediately audioPlayer.stop();
// 2. Cancel pending LLM generation llmController.abort();
// 3. Reset state conversationState.checkpoint();
// 4. Listen to new input startListening(); }
// VAD triggers barge-in vad.on('speech_start', () => { if (audioPlayer.isPlaying) { handleBargeIn(); } }); """
Achieving <800ms end-to-end response time
When to use: Production voice agents
""" Target Metrics:
""" Typical breakdown:
Total: 425-900ms """
""" // Stream STT results as they come stt.on('partial_transcript', (text) => { // Start processing before final transcript llmPreprocessor.prepare(text); });
// Stream LLM output to TTS const llmStream = await openai.chat.completions.create({ stream: true, // ... });
for await (const chunk of llmStream) { tts.appendText(chunk.choices[0].delta.content); } """
""" // While user is speaking, predict and prepare stt.on('partial_transcript', async (text) => { // Pre-fetch relevant context const context = await retrieveContext(text);
// Pre-compute likely first sentence const firstSentence = await generateOpener(context); }); """
""" // STT: Deepgram Nova-3 (150ms TTFT) // LLM: gpt-4o-mini (fastest GPT-4 class) // TTS: ElevenLabs Flash (75ms) or Deepgram Aura-2 (184ms) """
""" // Run inference closer to user // - Cloud regions near user // - Edge computing for VAD/STT // - WebSocket over HTTP for lower overhead """
Designing natural voice conversations
When to use: Building voice UX
""" Voice is different from text:
"""
Bad: "I found several options. The first is... second is..." Good: "I found 3 options. Want me to go through them?"
Bad: "I'll transfer $500 to John." Good: "So that's $500 to John Smith. Should I proceed?" """
""" system_prompt = ''' You are a voice assistant. Follow these rules:
Good: "Got it. I'll set that reminder for three pm. Anything else?" Bad: "I have set a reminder for 3:00 PM. Is there anything else I can assist you with today?" ''' """
""" // Handle recognition errors gracefully const errorResponses = { no_speech: "I didn't catch that. Could you say it again?", unclear: "Sorry, I'm not sure I understood. You said [repeat]. Is that right?", timeout: "Still there? I'm here when you're ready.", };
// Always offer human fallback for complex issues if (confidenceScore < 0.6) { response = "I want to make sure I get this right. Would you like to speak with a human agent?"; } """
Severity: CRITICAL
Situation: Building a voice agent pipeline
Symptoms: Conversations feel awkward. Users repeat themselves. "Are you there?" questions. Users hang up or give up. Low satisfaction scores despite correct answers.
Why this breaks: In human conversation, responses typically arrive within 500ms. Anything over 800ms feels like the agent is slow or confused. Users lose confidence and patience. Every component adds latency: VAD (100ms) + STT (200ms) + LLM (300ms) + TTS (200ms) = 800ms.
Recommended fix:
Use low-latency models: Deepgram Nova-3 (STT), gpt-4o-mini (LLM), ElevenLabs Flash (TTS)
Stream everything: feed STT partials to the LLM and LLM tokens to TTS as they arrive
Pre-compute: retrieve context and draft openers while the user is still speaking
Edge deployment: run VAD/STT near the user and prefer WebSockets over HTTP
Log timestamps at each stage, track P50/P95 latency
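A minimal sketch of that stage timing (class and field names here are illustrative, not from any SDK; wire the marks into your own pipeline):

```javascript
// Minimal latency tracker: record per-turn timestamps, report P50/P95.
class LatencyTracker {
  constructor() {
    this.turns = []; // end-to-end latency per conversational turn, in ms
  }

  // Call with timing marks from one turn, e.g.
  // { speechEnd: 0, sttDone: 210, llmFirstToken: 480, ttsFirstByte: 620 }
  record(marks) {
    this.turns.push(marks.ttsFirstByte - marks.speechEnd);
  }

  percentile(p) {
    const sorted = [...this.turns].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[Math.max(0, idx)];
  }

  report() {
    return { p50: this.percentile(50), p95: this.percentile(95) };
  }
}
```

Recording the intermediate marks (sttDone, llmFirstToken) as well lets you attribute a latency regression to a specific stage rather than just the total.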
Severity: HIGH
Situation: Voice agent with inconsistent response times
Symptoms: Conversations feel unpredictable. User doesn't know when to speak. Sometimes agent responds immediately, sometimes after long pause. Users talk over agent. Agent talks over users.
Why this breaks: Jitter (variance in response time) disrupts conversational rhythm more than absolute latency. Consistent 800ms feels better than alternating 400ms and 1200ms. Users can't adapt to unpredictable timing.
Recommended fix:
Consistent model loading: keep models warm to avoid cold-start spikes
Buffer audio output: smooth playback so network jitter doesn't reach the user
Handle LLM variance: enforce a minimum response time so fast turns don't feel erratic
Monitor and alert: track latency variance, not just the average
"""
const MIN_RESPONSE_TIME = 400; // ms

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function respondWithConsistentTiming(text) {
  const startTime = Date.now();
  const audio = await generateSpeech(text);

  // Pad fast responses up to the minimum so timing feels consistent
  const elapsed = Date.now() - startTime;
  if (elapsed < MIN_RESPONSE_TIME) {
    await delay(MIN_RESPONSE_TIME - elapsed);
  }

  playAudio(audio);
}
"""
Severity: HIGH
Situation: Detecting when user finishes speaking
Symptoms: Agent interrupts user mid-thought. Or waits too long after user finishes. "Let me think..." triggers premature response. Short answers have awkward pause before response.
Why this breaks: Simple silence detection (e.g., "end turn after 500ms silence") doesn't understand conversation. Humans pause mid-sentence. "Yes." needs fast response, "Well, let me think about that..." needs patience. Fixed timeout fits neither.
Recommended fix:
"""
client.updateSession({
  turn_detection: {
    type: 'semantic_vad',
    // Waits longer after "umm..."
    // Responds faster after "Yes, that's correct."
  },
});
"""

"""
const pipeline = new Pipeline({
  vad: new SileroVAD(),
  turnDetection: new SmartTurn(),
});

// SmartTurn considers:
// - Speech content (complete sentence?)
// - Prosody (falling intonation?)
// - Context (question asked?)
"""

"""
function calculateSilenceThreshold(transcript) {
  const endsWithComplete = transcript.match(/[.!?]$/);
  const hasFillers = transcript.match(/\b(um|uh|like|well)\b/i);

  if (endsWithComplete && !hasFillers) {
    return 300; // Complete thought - fast response
  } else if (hasFillers) {
    return 1500; // Trailing filler - wait for continuation
  }
  return 700; // Default
}
"""
Severity: HIGH
Situation: User tries to interrupt agent mid-sentence
Symptoms: Agent talks over user. User has to wait for agent to finish. Frustrating experience. Users give up and abandon call. "STOP! STOP!" doesn't work.
Why this breaks: Without barge-in handling, the TTS plays to completion regardless of user input. This violates basic conversational norms - in human conversation, we stop when interrupted.
Recommended fix:
"""
vad.on('speech_start', () => {
  if (ttsPlayer.isPlaying) {
    // 1. Stop audio immediately
    ttsPlayer.stop();

    // 2. Cancel pending TTS generation
    ttsController.abort();

    // 3. Checkpoint conversation state
    conversationState.save();

    // 4. Listen to new input
    startTranscription();
  }
});
"""
"""
vad.on('speech_start', async () => {
  if (!ttsPlayer.isPlaying) return;

  // Wait 200ms to get first words
  await delay(200);
  const firstWords = getTranscriptSoFar();

  if (isBackchannel(firstWords)) {
    // "uh-huh", "yeah" - don't interrupt
    return;
  }

  if (isClarification(firstWords)) {
    // "What?", "Sorry?" - repeat last sentence
    repeatLastSentence();
  } else {
    // Real interruption - stop and listen
    handleFullInterruption();
  }
});
"""
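The `isBackchannel` and `isClarification` helpers are left undefined above; a minimal keyword-based sketch (the word lists are illustrative starting points, tune them against real transcripts):

```javascript
// Classify the first words of an interruption.
const BACKCHANNELS = ['uh-huh', 'mm-hmm', 'yeah', 'right', 'okay', 'ok', 'sure'];
const CLARIFICATIONS = ['what', 'sorry', 'pardon', 'huh', 'come again'];

function normalize(text) {
  return text.toLowerCase().replace(/[^a-z\s-]/g, '').trim();
}

function isBackchannel(text) {
  const words = normalize(text);
  // Short acknowledgment made up entirely of backchannel tokens
  return words.length > 0 && words.split(/\s+/).every((w) => BACKCHANNELS.includes(w));
}

function isClarification(text) {
  const words = normalize(text);
  return CLARIFICATIONS.some((c) => words === c || words.startsWith(c + ' '));
}
```

A production version would also use prosody and timing, but even this keyword filter prevents "uh-huh" from killing the agent's turn.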
Severity: MEDIUM
Situation: Prompting LLM for voice agent responses
Symptoms: Agent rambles. Users lose track of information. "Can you repeat that?" requests. Users interrupt to ask for shorter version. Low comprehension of conveyed information.
Why this breaks: Text can be scanned and re-read. Voice is linear and ephemeral. A 3-paragraph response that works in chat is overwhelming in voice. Users can only hold ~7 items in working memory.
Recommended fix:
system_prompt = ''' You are a voice assistant. Keep responses UNDER 30 WORDS. For complex information, break into chunks and confirm understanding between each.
Instead of: "Here are the three options. First, you could... Second... Third..."
Say: "I found 3 options. Want me to go through them?"
Never list more than 3 items without pausing for confirmation. '''
"""
const response = await openai.chat.completions.create({
  max_tokens: 100, // Hard limit
  // ...
});
"""
"""
if (information.length > 3) {
  response = `I have ${information.length} items. Let's go through them one at a time. First: ${information[0]}. Ready for the next?`;
}
"""
"I found your account. Want the balance, recent transactions, or something else?" // Don't dump all info at once
Severity: MEDIUM
Situation: Formatting LLM output for voice
Symptoms: "First bullet point: item one" read aloud. Numbers read as "one two three" instead of "one, two, three." Markdown artifacts in speech. Robotic, unnatural delivery.
Why this breaks: TTS models read what they're given. Text formatting intended for visual display sounds robotic when read aloud. Users can't "see" structure in audio.
Recommended fix:
system_prompt = ''' Format responses for SPOKEN delivery:
"""
function prepareForSpeech(text) {
  return text
    // Remove markdown
    .replace(/[*_#`]/g, '')
    // Convert digits to words (numToWords is your own helper)
    .replace(/\d+/g, numToWords)
    // Expand abbreviations
    .replace(/\betc\b/gi, 'et cetera')
    .replace(/\be\.g\./gi, 'for example')
    // Add pauses
    .replace(/\. /g, '... ')
    .replace(/, /g, '... ');
}
"""
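`numToWords` is assumed above; a minimal sketch covering 0-999, which handles times and small counts (extend or use a library for larger values):

```javascript
// Minimal number-to-words for 0-999; falls back to digits above that.
const ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
  'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
  'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen'];
const TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
  'seventy', 'eighty', 'ninety'];

function numToWords(numStr) {
  const n = parseInt(numStr, 10);
  if (n < 20) return ONES[n];
  if (n < 100) {
    const rest = n % 10;
    return TENS[Math.floor(n / 10)] + (rest ? '-' + ONES[rest] : '');
  }
  if (n < 1000) {
    const rest = n % 100;
    return ONES[Math.floor(n / 100)] + ' hundred' + (rest ? ' ' + numToWords(String(rest)) : '');
  }
  return numStr; // Fall back to digits for larger numbers
}
```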
Severity: MEDIUM
Situation: Users in cars, cafes, outdoors
Symptoms: "I didn't catch that" frequently. Background noise triggers false starts. Fan/AC causes continuous listening. Car engine noise confuses STT.
Why this breaks: Default VAD thresholds work for quiet environments. Real-world usage includes background noise that triggers false positives or masks speech, causing false negatives.
Recommended fix:
"""
const transcription = await deepgram.transcription.live({
  model: 'nova-3',
  noise_reduction: true, // or smart_format: true
});
"""

"""
// Measure ambient noise level
const ambientLevel = measureAmbientNoise(5000); // 5 sec sample
vad.setThreshold(ambientLevel * 1.5); // Above ambient
"""

"""
stt.on('transcript', (data) => {
  if (data.confidence < 0.7) {
    // Low confidence - probably noise
    askForRepeat();
    return;
  }
  processTranscript(data.transcript);
});
"""

"""
// Prevent agent's voice from being transcribed
const echoCanceller = new EchoCanceller();
echoCanceller.reference(ttsOutput);
const cleanedAudio = echoCanceller.process(userAudio);
"""
Severity: MEDIUM
Situation: Processing unclear or accented speech
Symptoms: Agent responds to something user didn't say. Names consistently wrong. Technical terms misheard. "I said X, not Y" frustration.
Why this breaks: STT models can hallucinate, especially on proper nouns, technical terms, or accented speech. These errors propagate through the pipeline and produce nonsensical responses.
Recommended fix:
"""
const transcription = await deepgram.transcription.live({
  keywords: ['Acme Corp', 'ProductName', 'John Smith'],
  keyword_boost: 'high',
});
"""

"""
if (containsNameOrNumber(transcript)) {
  response = `I heard "${name}". Is that correct?`;
}
"""

"""
if (confidence < 0.8) {
  response = `I think you said "${transcript}". Did I get that right?`;
}
"""

"""
// Some STT APIs return an n-best list
const alternatives = transcription.alternatives;
if (alternatives[0].confidence - alternatives[1].confidence < 0.1) {
  // Ambiguous - ask for clarification
}
"""

"""
promptPattern = `User may correct previous mistakes. If they say "no, I said X" or "not Y, Z", update your understanding accordingly.`;
"""
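Corrections like "no, I said X" can also be caught before they reach the LLM; a lightweight sketch (the regex patterns are illustrative, expand them for your domain):

```javascript
// Detect explicit user corrections like "no, I said Boston".
// Returns the corrected value, or null if no correction found.
function extractCorrection(utterance) {
  const patterns = [
    /no,?\s+i said\s+(.+)/i,
    /not\s+.+?,\s*(.+)/i,
    /i meant\s+(.+)/i,
  ];
  for (const p of patterns) {
    const m = utterance.match(p);
    if (m) return m[1].trim().replace(/[.!?]$/, '');
  }
  return null;
}
```

When a correction is detected, overwrite the previously captured slot value directly instead of asking the LLM to infer it from context.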
Severity: ERROR
Voice agents must track latency at each stage
Message: Voice pipeline without latency tracking. Add timestamps at each stage to measure performance.
Severity: WARNING
Streaming STT reduces latency significantly
Message: Using batch transcription. Consider streaming for lower latency in voice agents.
Severity: WARNING
Streaming TTS reduces time to first audio
Message: TTS without streaming. Stream audio to reduce time to first audio.
Severity: WARNING
Fixed silence thresholds don't adapt to conversation
Message: Fixed silence threshold. Consider semantic VAD or adaptive thresholds for better turn-taking.
Severity: WARNING
Voice agents should stop when user interrupts
Message: VAD without barge-in handling. Stop TTS when user starts speaking.
Severity: WARNING
Voice prompts should constrain response length
Message: Voice prompt without length constraints. Add 'Keep responses under 30 words' to system prompt.
Severity: WARNING
Markdown will be read literally by TTS
Message: Check for markdown in TTS input. Strip formatting before sending to TTS.
Severity: WARNING
STT can fail or return low confidence
Message: STT without error handling. Check confidence scores and handle failures.
Severity: WARNING
Realtime APIs need reconnection handling
Message: Realtime connection without reconnection logic. Handle disconnects gracefully.
Severity: INFO
Real-world audio includes background noise
Message: Consider adding noise handling for real-world audio quality.
Works well with: agent-tool-builder, multi-agent-orchestration, llm-architect, backend