Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis; it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance.
This skill covers two architectures: speech-to-speech (OpenAI Realtime API; lowest latency, most natural) and pipeline (STT→LLM→TTS; more control, easier to debug). Key insight: latency is the constraint. Humans expect responses within roughly 500ms, so every millisecond matters.
84% of organizations are increasing voice AI budgets in 2025. This is the year voice agents go mainstream.
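The latency constraint can be made concrete with a quick budget check using the typical per-stage figures cited later in this skill (assumptions; replace with your own measurements):

```javascript
// End-to-end latency budget. Stage figures are the typical numbers
// cited in this skill; substitute your own measured values.
const stageBudgetMs = { vad: 100, stt: 200, llm: 300, tts: 200 };

const totalMs = Object.values(stageBudgetMs).reduce((a, b) => a + b, 0);
console.log(`End-to-end: ${totalMs}ms`); // 800ms - right at the limit
```

Adding up the defaults lands exactly at the 800ms ceiling, which is why every optimization below targets individual stages.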
Direct audio-to-audio processing for lowest latency
When to use: Maximum naturalness, emotional preservation, real-time conversation
""" [User Audio] → [S2S Model] → [Agent Audio]
Advantages:
- Lowest latency (no intermediate text conversion)
- Preserves emotional nuance, tone, and prosody
- Most natural conversational feel

Disadvantages:
- Less control over exactly what is said
- Harder to debug (no intermediate text to inspect)
""" import { RealtimeClient } from '@openai/realtime-api-beta';
const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY, });
// Configure for voice conversation
client.updateSession({
modalities: ['text', 'audio'],
voice: 'alloy',
input_audio_format: 'pcm16',
output_audio_format: 'pcm16',
instructions: You are a helpful customer service agent. Be concise and friendly. If you don't know something, say so rather than making things up.,
turn_detection: {
type: 'server_vad', // or 'semantic_vad'
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 500,
},
});
// Handle audio streams client.on('conversation.item.input_audio_transcription', (event) => { console.log('User said:', event.transcript); });
client.on('response.audio.delta', (event) => { // Stream audio to speaker audioPlayer.write(Buffer.from(event.delta, 'base64')); });
// Send user audio client.appendInputAudio(audioBuffer); """
Separate STT → LLM → TTS for maximum control
When to use: Need to know/control exactly what's said, debugging, compliance
""" [Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]
Advantages:
- Full visibility and control over the text at each stage
- Easier to debug, log, and audit (compliance)
- Components can be swapped independently

Disadvantages:
- Higher latency (each stage adds delay)
- Emotional nuance is lost in the intermediate text
""" import { Deepgram } from '@deepgram/sdk'; import { ElevenLabsClient } from 'elevenlabs'; import OpenAI from 'openai';
// Initialize clients const deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY); const elevenlabs = new ElevenLabsClient(); const openai = new OpenAI();
async function processVoiceInput(audioStream) { // 1. Speech-to-Text (Deepgram Nova-3) const transcription = await deepgram.transcription.live({ model: 'nova-3', punctuate: true, endpointing: 300, // ms of silence before end });
transcription.on('transcript', async (data) => { if (data.is_final && data.speech_final) { const userText = data.channel.alternatives[0].transcript; console.log('User:', userText);
// 2. LLM Processing
const completion = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: 'You are a concise voice assistant.' },
{ role: 'user', content: userText }
],
max_tokens: 150, // Keep responses short for voice
});
const agentText = completion.choices[0].message.content;
console.log('Agent:', agentText);
// 3. Text-to-Speech (ElevenLabs)
const audioStream = await elevenlabs.textToSpeech.stream({
voice_id: 'voice_id_here',
text: agentText,
model_id: 'eleven_flash_v2_5', // Lowest latency
});
// Stream to user
playAudioStream(audioStream);
}
});
// Pipe audio to transcription audioStream.pipe(transcription); } """
Detect when user starts/stops speaking
When to use: All voice agents need VAD for turn-taking
""" VAD Types:
""" import { SileroVAD } from '@pipecat-ai/silero-vad';
const vad = new SileroVAD({ threshold: 0.5, // Speech probability threshold min_speech_duration: 250, // ms before speech confirmed min_silence_duration: 500, // ms of silence = end of turn });
vad.on('speech_start', () => { console.log('User started speaking'); // Stop any playing TTS (barge-in) audioPlayer.stop(); });
vad.on('speech_end', () => { console.log('User finished speaking'); // Trigger response generation processTranscript(); });
// Feed audio to VAD audioStream.on('data', (chunk) => { vad.process(chunk); }); """
""" // In Realtime API session config client.updateSession({ turn_detection: { type: 'semantic_vad', // Uses meaning, not just silence // Model waits longer after "ummm..." // Responds faster after "Yes, that's correct." }, }); """
""" // When user interrupts: function handleBargeIn() { // 1. Stop TTS immediately audioPlayer.stop();
// 2. Cancel pending LLM generation llmController.abort();
// 3. Reset state conversationState.checkpoint();
// 4. Listen to new input startListening(); }
// VAD triggers barge-in vad.on('speech_start', () => { if (audioPlayer.isPlaying) { handleBargeIn(); } }); """
Achieving <800ms end-to-end response time
When to use: Production voice agents
""" Target Metrics:
""" Typical breakdown:
Total: 425-900ms """
""" // Stream STT results as they come stt.on('partial_transcript', (text) => { // Start processing before final transcript llmPreprocessor.prepare(text); });
// Stream LLM output to TTS const llmStream = await openai.chat.completions.create({ stream: true, // ... });
for await (const chunk of llmStream) { tts.appendText(chunk.choices[0].delta.content); } """
""" // While user is speaking, predict and prepare stt.on('partial_transcript', async (text) => { // Pre-fetch relevant context const context = await retrieveContext(text);
// Pre-compute likely first sentence const firstSentence = await generateOpener(context); }); """
""" // STT: Deepgram Nova-3 (150ms TTFT) // LLM: gpt-4o-mini (fastest GPT-4 class) // TTS: ElevenLabs Flash (75ms) or Deepgram Aura-2 (184ms) """
""" // Run inference closer to user // - Cloud regions near user // - Edge computing for VAD/STT // - WebSocket over HTTP for lower overhead """
Designing natural voice conversations
When to use: Building voice UX
""" Voice is different from text:
"""
Bad: "I found several options. The first is... second is..." Good: "I found 3 options. Want me to go through them?"
Bad: "I'll transfer $500 to John." Good: "So that's $500 to John Smith. Should I proceed?" """
""" system_prompt = ''' You are a voice assistant. Follow these rules:
Good: "Got it. I'll set that reminder for three pm. Anything else?" Bad: "I have set a reminder for 3:00 PM. Is there anything else I can assist you with today?" ''' """
""" // Handle recognition errors gracefully const errorResponses = { no_speech: "I didn't catch that. Could you say it again?", unclear: "Sorry, I'm not sure I understood. You said [repeat]. Is that right?", timeout: "Still there? I'm here when you're ready.", };
// Always offer human fallback for complex issues if (confidenceScore < 0.6) { response = "I want to make sure I get this right. Would you like to speak with a human agent?"; } """
Severity: CRITICAL
Situation: Building a voice agent pipeline
Symptoms: Conversations feel awkward. Users repeat themselves. "Are you there?" questions. Users hang up or give up. Low satisfaction scores despite correct answers.
Why this breaks: In human conversation, responses typically arrive within 500ms. Anything over 800ms feels like the agent is slow or confused. Users lose confidence and patience. Every component adds latency: VAD (100ms) + STT (200ms) + LLM (300ms) + TTS (200ms) = 800ms.
Recommended fix:
Use low-latency models: Deepgram Nova-3 (STT), gpt-4o-mini (LLM), ElevenLabs Flash (TTS)
Stream everything: feed STT partials to the LLM and LLM tokens to TTS as they arrive
Pre-compute: retrieve context and draft openers while the user is still speaking
Edge deployment: run VAD/STT near the user and prefer WebSockets over HTTP
Log timestamps at each stage, track P50/P95 latency
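A minimal sketch of that stage timing (class and field names here are illustrative, not from any SDK; wire the marks into your own pipeline):

```javascript
// Minimal latency tracker: record per-turn timestamps, report P50/P95.
class LatencyTracker {
  constructor() {
    this.turns = []; // end-to-end latency per conversational turn, in ms
  }

  // Call with timing marks from one turn, e.g.
  // { speechEnd: 0, sttDone: 210, llmFirstToken: 480, ttsFirstByte: 620 }
  record(marks) {
    this.turns.push(marks.ttsFirstByte - marks.speechEnd);
  }

  percentile(p) {
    const sorted = [...this.turns].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[Math.max(0, idx)];
  }

  report() {
    return { p50: this.percentile(50), p95: this.percentile(95) };
  }
}
```

Recording the intermediate marks (sttDone, llmFirstToken) as well lets you attribute a latency regression to a specific stage rather than just the total.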
Severity: HIGH
Situation: Voice agent with inconsistent response times
Symptoms: Conversations feel unpredictable. User doesn't know when to speak. Sometimes agent responds immediately, sometimes after long pause. Users talk over agent. Agent talks over users.
Why this breaks: Jitter (variance in response time) disrupts conversational rhythm more than absolute latency. Consistent 800ms feels better than alternating 400ms and 1200ms. Users can't adapt to unpredictable timing.
Recommended fix:
Consistent model loading: keep models warm to avoid cold-start spikes
Buffer audio output: smooth playback so network jitter doesn't reach the user
Handle LLM variance: enforce a minimum response time so fast turns don't feel erratic
Monitor and alert: track latency variance, not just the average
"""
const MIN_RESPONSE_TIME = 400; // ms

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function respondWithConsistentTiming(text) {
  const startTime = Date.now();
  const audio = await generateSpeech(text);

  // Pad fast responses up to the minimum so timing feels consistent
  const elapsed = Date.now() - startTime;
  if (elapsed < MIN_RESPONSE_TIME) {
    await delay(MIN_RESPONSE_TIME - elapsed);
  }

  playAudio(audio);
}
"""
Severity: HIGH
Situation: Detecting when user finishes speaking
Symptoms: Agent interrupts user mid-thought. Or waits too long after user finishes. "Let me think..." triggers premature response. Short answers have awkward pause before response.
Why this breaks: Simple silence detection (e.g., "end turn after 500ms silence") doesn't understand conversation. Humans pause mid-sentence. "Yes." needs fast response, "Well, let me think about that..." needs patience. Fixed timeout fits neither.
Recommended fix:
"""
client.updateSession({
  turn_detection: {
    type: 'semantic_vad',
    // Waits longer after "umm..."
    // Responds faster after "Yes, that's correct."
  },
});
"""

"""
const pipeline = new Pipeline({
  vad: new SileroVAD(),
  turnDetection: new SmartTurn(),
});

// SmartTurn considers:
// - Speech content (complete sentence?)
// - Prosody (falling intonation?)
// - Context (question asked?)
"""

"""
function calculateSilenceThreshold(transcript) {
  const endsWithComplete = transcript.match(/[.!?]$/);
  const hasFillers = transcript.match(/\b(um|uh|like|well)\b/i);

  if (endsWithComplete && !hasFillers) {
    return 300; // Complete thought - fast response
  } else if (hasFillers) {
    return 1500; // Trailing filler - wait for continuation
  }
  return 700; // Default
}
"""
Severity: HIGH
Situation: User tries to interrupt agent mid-sentence
Symptoms: Agent talks over user. User has to wait for agent to finish. Frustrating experience. Users give up and abandon call. "STOP! STOP!" doesn't work.
Why this breaks: Without barge-in handling, the TTS plays to completion regardless of user input. This violates basic conversational norms - in human conversation, we stop when interrupted.
Recommended fix:
"""
vad.on('speech_start', () => {
  if (ttsPlayer.isPlaying) {
    // 1. Stop audio immediately
    ttsPlayer.stop();

    // 2. Cancel pending TTS generation
    ttsController.abort();

    // 3. Checkpoint conversation state
    conversationState.save();

    // 4. Listen to new input
    startTranscription();
  }
});
"""
"""
vad.on('speech_start', async () => {
  if (!ttsPlayer.isPlaying) return;

  // Wait 200ms to get first words
  await delay(200);
  const firstWords = getTranscriptSoFar();

  if (isBackchannel(firstWords)) {
    // "uh-huh", "yeah" - don't interrupt
    return;
  }

  if (isClarification(firstWords)) {
    // "What?", "Sorry?" - repeat last sentence
    repeatLastSentence();
  } else {
    // Real interruption - stop and listen
    handleFullInterruption();
  }
});
"""
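The `isBackchannel` and `isClarification` helpers are left undefined above; a minimal keyword-based sketch (the word lists are illustrative starting points, tune them against real transcripts):

```javascript
// Classify the first words of an interruption.
const BACKCHANNELS = ['uh-huh', 'mm-hmm', 'yeah', 'right', 'okay', 'ok', 'sure'];
const CLARIFICATIONS = ['what', 'sorry', 'pardon', 'huh', 'come again'];

function normalize(text) {
  return text.toLowerCase().replace(/[^a-z\s-]/g, '').trim();
}

function isBackchannel(text) {
  const words = normalize(text);
  // Short acknowledgment made up entirely of backchannel tokens
  return words.length > 0 && words.split(/\s+/).every((w) => BACKCHANNELS.includes(w));
}

function isClarification(text) {
  const words = normalize(text);
  return CLARIFICATIONS.some((c) => words === c || words.startsWith(c + ' '));
}
```

A production version would also use prosody and timing, but even this keyword filter prevents "uh-huh" from killing the agent's turn.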
Severity: MEDIUM
Situation: Prompting LLM for voice agent responses
Symptoms: Agent rambles. Users lose track of information. "Can you repeat that?" requests. Users interrupt to ask for shorter version. Low comprehension of conveyed information.
Why this breaks: Text can be scanned and re-read. Voice is linear and ephemeral. A 3-paragraph response that works in chat is overwhelming in voice. Users can only hold ~7 items in working memory.
Recommended fix:
system_prompt = ''' You are a voice assistant. Keep responses UNDER 30 WORDS. For complex information, break into chunks and confirm understanding between each.
Instead of: "Here are the three options. First, you could... Second... Third..."
Say: "I found 3 options. Want me to go through them?"
Never list more than 3 items without pausing for confirmation. '''
"""
const response = await openai.chat.completions.create({
  max_tokens: 100, // Hard limit
  // ...
});
"""
"""
if (information.length > 3) {
  response = `I have ${information.length} items. Let's go through them one at a time. First: ${information[0]}. Ready for the next?`;
}
"""
"I found your account. Want the balance, recent transactions, or something else?" // Don't dump all info at once
Severity: MEDIUM
Situation: Formatting LLM output for voice
Symptoms: "First bullet point: item one" read aloud. Numbers read as "one two three" instead of "one, two, three." Markdown artifacts in speech. Robotic, unnatural delivery.
Why this breaks: TTS models read what they're given. Text formatting intended for visual display sounds robotic when read aloud. Users can't "see" structure in audio.
Recommended fix:
system_prompt = ''' Format responses for SPOKEN delivery:
"""
function prepareForSpeech(text) {
  return text
    // Remove markdown
    .replace(/[*_#`]/g, '')
    // Convert digits to words (numToWords is your own helper)
    .replace(/\d+/g, numToWords)
    // Expand abbreviations
    .replace(/\betc\b/gi, 'et cetera')
    .replace(/\be\.g\./gi, 'for example')
    // Add pauses
    .replace(/\. /g, '... ')
    .replace(/, /g, '... ');
}
"""
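`numToWords` is assumed above; a minimal sketch covering 0-999, which handles times and small counts (extend or use a library for larger values):

```javascript
// Minimal number-to-words for 0-999; falls back to digits above that.
const ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
  'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
  'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen'];
const TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
  'seventy', 'eighty', 'ninety'];

function numToWords(numStr) {
  const n = parseInt(numStr, 10);
  if (n < 20) return ONES[n];
  if (n < 100) {
    const rest = n % 10;
    return TENS[Math.floor(n / 10)] + (rest ? '-' + ONES[rest] : '');
  }
  if (n < 1000) {
    const rest = n % 100;
    return ONES[Math.floor(n / 100)] + ' hundred' + (rest ? ' ' + numToWords(String(rest)) : '');
  }
  return numStr; // Fall back to digits for larger numbers
}
```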
Severity: MEDIUM
Situation: Users in cars, cafes, outdoors
Symptoms: "I didn't catch that" frequently. Background noise triggers false starts. Fan/AC causes continuous listening. Car engine noise confuses STT.
Why this breaks: Default VAD thresholds work for quiet environments. Real-world usage includes background noise that triggers false positives or masks speech, causing false negatives.
Recommended fix:
"""
const transcription = await deepgram.transcription.live({
  model: 'nova-3',
  noise_reduction: true, // or smart_format: true
});
"""

"""
// Measure ambient noise level
const ambientLevel = measureAmbientNoise(5000); // 5 sec sample
vad.setThreshold(ambientLevel * 1.5); // Above ambient
"""

"""
stt.on('transcript', (data) => {
  if (data.confidence < 0.7) {
    // Low confidence - probably noise
    askForRepeat();
    return;
  }
  processTranscript(data.transcript);
});
"""

"""
// Prevent agent's voice from being transcribed
const echoCanceller = new EchoCanceller();
echoCanceller.reference(ttsOutput);
const cleanedAudio = echoCanceller.process(userAudio);
"""
Severity: MEDIUM
Situation: Processing unclear or accented speech
Symptoms: Agent responds to something user didn't say. Names consistently wrong. Technical terms misheard. "I said X, not Y" frustration.
Why this breaks: STT models can hallucinate, especially on proper nouns, technical terms, or accented speech. These errors propagate through the pipeline and produce nonsensical responses.
Recommended fix:
"""
const transcription = await deepgram.transcription.live({
  keywords: ['Acme Corp', 'ProductName', 'John Smith'],
  keyword_boost: 'high',
});
"""

"""
if (containsNameOrNumber(transcript)) {
  response = `I heard "${name}". Is that correct?`;
}
"""

"""
if (confidence < 0.8) {
  response = `I think you said "${transcript}". Did I get that right?`;
}
"""

"""
// Some STT APIs return an n-best list
const alternatives = transcription.alternatives;
if (alternatives[0].confidence - alternatives[1].confidence < 0.1) {
  // Ambiguous - ask for clarification
}
"""

"""
promptPattern = `User may correct previous mistakes. If they say "no, I said X" or "not Y, Z", update your understanding accordingly.`;
"""
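Corrections like "no, I said X" can also be caught before they reach the LLM; a lightweight sketch (the regex patterns are illustrative, expand them for your domain):

```javascript
// Detect explicit user corrections like "no, I said Boston".
// Returns the corrected value, or null if no correction found.
function extractCorrection(utterance) {
  const patterns = [
    /no,?\s+i said\s+(.+)/i,
    /not\s+.+?,\s*(.+)/i,
    /i meant\s+(.+)/i,
  ];
  for (const p of patterns) {
    const m = utterance.match(p);
    if (m) return m[1].trim().replace(/[.!?]$/, '');
  }
  return null;
}
```

When a correction is detected, overwrite the previously captured slot value directly instead of asking the LLM to infer it from context.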
Severity: ERROR
Voice agents must track latency at each stage
Message: Voice pipeline without latency tracking. Add timestamps at each stage to measure performance.
Severity: WARNING
Streaming STT reduces latency significantly
Message: Using batch transcription. Consider streaming for lower latency in voice agents.
Severity: WARNING
Streaming TTS reduces time to first audio
Message: TTS without streaming. Stream audio to reduce time to first audio.
Severity: WARNING
Fixed silence thresholds don't adapt to conversation
Message: Fixed silence threshold. Consider semantic VAD or adaptive thresholds for better turn-taking.
Severity: WARNING
Voice agents should stop when user interrupts
Message: VAD without barge-in handling. Stop TTS when user starts speaking.
Severity: WARNING
Voice prompts should constrain response length
Message: Voice prompt without length constraints. Add 'Keep responses under 30 words' to system prompt.
Severity: WARNING
Markdown will be read literally by TTS
Message: Check for markdown in TTS input. Strip formatting before sending to TTS.
Severity: WARNING
STT can fail or return low confidence
Message: STT without error handling. Check confidence scores and handle failures.
Severity: WARNING
Realtime APIs need reconnection handling
Message: Realtime connection without reconnection logic. Handle disconnects gracefully.
Severity: INFO
Real-world audio includes background noise
Message: Consider adding noise handling for real-world audio quality.
Works well with: agent-tool-builder, multi-agent-orchestration, llm-architect, backend