Alex's voice synthesis capability for reading documents aloud
Domain: AI Accessibility & Communication
Inheritance: inheritable (promote to Master Alex for all heirs)
Version: 2.5.0
Last Updated: 2026-02-09
Author: Alex (Master Alex)
Status: ⭐ Flagship Skill - Core Alex capability
Text-to-Speech gives Alex a voice. This transforms Alex from a text-only assistant into a multimodal companion that can:
Zero cost, zero dependencies - uses Microsoft Edge TTS (free, no API key) with native TypeScript.
Keyboard shortcut (fastest):
Ctrl+Alt+R (Windows/Linux) or Cmd+Alt+R (macOS)
Command palette:
Ctrl+Shift+P → "Alex: Read Aloud"
The status bar shows real-time progress during TTS operations:
| State | Display | Click Action |
|---|---|---|
| Connecting | $(loading~spin) Connecting... | - |
| Synthesizing | $(loading~spin) Synthesizing... | - |
| Streaming | $(loading~spin) Receiving... 45KB | - |
| Playing | $(unmute) Playing 35% | Stop |
| Paused | $(unmute) Paused | Stop |
A sleek panel opens with full playback controls:
┌─────────────────────────────────────────────────────────┐
│ Alex TTS Player [×] │
├─────────────────────────────────────────────────────────┤
│ │
│ ▶️ ⏹️ ═══════════●══════════ 1:23 / 4:56 │
│ │
│ 🔊 ────────●──────── │
│ │
└─────────────────────────────────────────────────────────┘
Features:
Choose Alex's voice before reading:
Ctrl+Shift+P → "Alex: Read with Voice Selection"
| Voice | Character | Best For |
|---|---|---|
| Default (GuyNeural) | Professional, clear | Technical docs, code review |
| Warm (ChristopherNeural) | Friendly, conversational | Tutorials, READMEs |
| British (RyanNeural) | Authoritative | Formal documents, presentations |
| Friendly (DavisNeural) | Casual, approachable | Chat logs, informal content |
Export any document to audio file:
Ctrl+Shift+P → "Alex: Save as Audio"
Use cases:
Multiple ways to stop playback:
- Status bar: click the item ($(unmute) icon during playback)
- Keyboard: Escape when reading
- Command palette: Ctrl+Shift+P → "Alex: Stop Reading"

Alex automatically strips markdown formatting for natural speech:
| You Write | Alex Reads |
|---|---|
| `# Heading` | "Heading." (pause) |
| `**bold text**` | "bold text" (slight emphasis) |
| `[link text](url)` | "link text" |
| `` `code` `` | "code" |
| `> blockquote` | "Quote: ..." |
| `---` | (long pause) |
Symbol conversion:
| Symbol | Spoken As |
|---|---|
| `~5 minutes` | "about 5 minutes" |
| `50%` | "50 percent" |
| `A → B` | "A leads to B" |
| `±5%` | "plus or minus 5 percent" |
This skill gives Alex a voice. Version 2.0 uses native TypeScript WebSocket integration with Microsoft Edge TTS, eliminating external dependencies, and reads documents aloud with natural-sounding neural voices.
Version 2.0 Changes:
Why promote to Master:
Dependencies (v2.0):
- `ws` npm package (WebSocket client)

Alex's voice synthesis capability uses Microsoft Edge TTS via native TypeScript. It reads markdown documents, code files, and text aloud with natural-sounding voices, and it is fully integrated into the VS Code extension.
┌─────────────────────────────────────────────────────────────┐
│ Alex VS Code Extension │
├─────────────────────────────────────────────────────────────┤
│ │
│ Commands: │
│ • Alex: Read Aloud (Ctrl+Alt+R) │
│ • Alex: Read with Voice Selection │
│ • Alex: Save as Audio │
│ • Alex: Stop Reading │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ ttsService.ts │ │
│ │ Native WebSocket to Edge TTS │ │
│ │ • SSML generation │ │
│ │ • Markdown stripping │ │
│ │ • Progress callbacks │ │
│ └─────────────────┬───────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ audioPlayer.ts │ │
│ │ Webview-based playback │ │
│ │ • Cross-platform HTML5 Audio │ │
│ │ • Play/pause/stop controls │ │
│ │ • Progress tracking │ │
│ └─────────────────────────────────────────────┘ │
│ │
└──────────────────────┬──────────────────────────────────────┘
│ WebSocket (wss://)
▼
┌─────────────────────────────────────────────────────────────┐
│ Microsoft Edge TTS Endpoint │
│ wss://speech.platform.bing.com/consumer/speech/... │
├─────────────────────────────────────────────────────────────┤
│ • 400+ neural voices, 90+ languages │
│ • Free, no API key required │
│ • MP3 output (24kHz, 48kbps) │
│ • SSML support for prosody control │
└─────────────────────────────────────────────────────────────┘
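The "SSML generation" step in the diagram can be illustrated with a small sketch. This is not the code in ttsService.ts; `escapeXml`, `SsmlOptions`, and `buildSsml` are hypothetical names, and the default rate and pitch values are assumptions. It only shows the general shape of an SSML document with `voice` and `prosody` elements.

```typescript
// Hypothetical helper: escape XML-special characters so document text
// cannot break the surrounding SSML markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Assumed option shape; the real service may expose different settings.
interface SsmlOptions {
  voice: string; // a voice ID such as "en-US-GuyNeural"
  rate?: string; // SSML prosody rate, e.g. "+0%"
  pitch?: string; // SSML prosody pitch, e.g. "+0Hz"
}

// Wrap plain text in a minimal SSML envelope.
function buildSsml(text: string, options: SsmlOptions): string {
  const rate = options.rate === undefined ? "+0%" : options.rate;
  const pitch = options.pitch === undefined ? "+0Hz" : options.pitch;
  // Derive the language tag from the voice ID ("en-GB-RyanNeural" -> "en-GB").
  const lang = options.voice.split("-").slice(0, 2).join("-");
  return (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="' +
    escapeXml(lang) +
    '">' +
    '<voice name="' + escapeXml(options.voice) + '">' +
    '<prosody rate="' + escapeXml(rate) + '" pitch="' + escapeXml(pitch) + '">' +
    escapeXml(text) +
    "</prosody></voice></speak>"
  );
}
```

Escaping matters here because source documents routinely contain `<`, `>`, and `&`, which would otherwise produce malformed SSML.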
macOS ships 30+ built-in system voices via the say command: instant, offline, zero-cost. Use it as a fallback when Edge TTS is unavailable (no internet, WebSocket blocked) or for quick, lightweight reads.
# Basic usage
say "Hello from Alex"
# Read a file aloud
say -f document.txt
# Save to audio file (AIFF)
say -o output.aiff "Synthesis complete"
# Save as AAC (smaller file)
say -o output.m4a --data-format=aac "Dream state finished"
# Choose a voice (the macOS "Alex" voice is a fun coincidence)
say -v Alex "I am Alex, reading your documentation"
# List all installed voices
say -v '?'
| Feature | Edge TTS (primary) | macOS say (fallback) |
|---|---|---|
| Quality | 400+ neural voices | 30+ system voices |
| Cost | Free | Free |
| Offline | No (WebSocket) | Yes |
| Output formats | MP3 | AIFF, AAC, WAV |
| Languages | 90+ | ~20 |
| Best for | Production audio, MP3 export | Quick reads, notifications |
Notification after long operations: Add say calls to signal completion of brain-qa, dream-state, or VSIX packaging:
# After brain-qa completes
node .github/muscles/brain-qa.cjs --mode quick && say "Brain QA complete"
# After dream state
node .github/muscles/brain-qa.cjs --mode all && say "Dream maintenance finished"
Note: The macOS "Alex" voice name is a happy coincidence -- it's one of the highest-quality built-in voices.
| Preset | Voice ID | Character |
|---|---|---|
| Default | en-US-GuyNeural | Professional male, clear articulation |
| Warm | en-US-ChristopherNeural | Friendly, conversational |
| British | en-GB-RyanNeural | British accent, authoritative |
| Friendly | en-US-DavisNeural | Casual, approachable |
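As a rough illustration, the preset table could be represented as a typed lookup. The names `VoicePreset`, `VOICE_PRESETS`, and `resolveVoice` are hypothetical, and falling back to the Default preset for unknown input is an assumption rather than documented behavior.

```typescript
// Preset names and voice IDs taken from the table above.
type VoicePreset = "Default" | "Warm" | "British" | "Friendly";

const VOICE_PRESETS: Record<VoicePreset, string> = {
  Default: "en-US-GuyNeural",
  Warm: "en-US-ChristopherNeural",
  British: "en-GB-RyanNeural",
  Friendly: "en-US-DavisNeural",
};

// Resolve a preset name to a voice ID.
// Assumption: unknown or missing presets fall back to Default.
function resolveVoice(preset?: string): string {
  if (
    preset !== undefined &&
    Object.prototype.hasOwnProperty.call(VOICE_PRESETS, preset)
  ) {
    return VOICE_PRESETS[preset as VoicePreset];
  }
  return VOICE_PRESETS.Default;
}
```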
Alex's default voice (GuyNeural) was chosen for:
Command: alex.readAloud
Keybinding: Ctrl+Alt+R (Windows/Linux), Cmd+Alt+R (macOS)
Reads the current selection or entire document aloud using Alex's default voice.
Behavior:
Command: alex.readWithVoice
Quick pick to select a voice preset before reading.
Command: alex.saveAsAudio
Generate and save speech to an MP3 file. Opens a save dialog for output location.
Command: alex.stopReading
Keybinding: Escape (when reading)
Immediately stops current playback.
| File | Purpose |
|---|---|
| `ttsService.ts` | WebSocket connection, SSML generation, synthesis |
| `audioPlayer.ts` | Webview panel, playback controls, system fallback |
| `index.ts` | Module exports |
The prepareTextForSpeech() function strips markdown:
| Markdown | Speech Output |
|---|---|
| `# Heading` | "Heading." (pause) |
| `**bold**` | "bold" (emphasis via prosody) |
| `*italic*` | "italic" |
| `` `code` `` | "code" |
| `[link](url)` | "link" |
| `- item` | "Item." |
| `> quote` | "Quote: ..." |
| `---` | (long pause) |
```python
def hello():
    print("Hello")
```
Becomes: "Python code block. Definition hello. Print hello. End code block."
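The inline and block rules in the table can be sketched with a chain of regular-expression replacements. This is a minimal illustration, not the actual `prepareTextForSpeech()` implementation; `stripMarkdownSketch` is a hypothetical name, pauses and emphasis are not modeled, and fenced code blocks are not handled.

```typescript
// Sketch only: applies the table's rules in a fixed order.
// Horizontal rules are removed, headings and list items get a full stop,
// blockquotes get a "Quote: " prefix, and inline markup is unwrapped.
function stripMarkdownSketch(markdown: string): string {
  return markdown
    .replace(/^---+[ \t]*$/gm, "") // horizontal rule
    .replace(/^#{1,6}[ \t]+(.*)$/gm, "$1.") // "# Heading" -> "Heading."
    .replace(/^>[ \t]?(.*)$/gm, "Quote: $1") // "> quote" -> "Quote: quote"
    .replace(
      /^[ \t]*-[ \t]+(.*)$/gm, // "- item" -> "Item."
      (_match: string, item: string) =>
        item.charAt(0).toUpperCase() + item.slice(1) + "."
    )
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // [link](url) -> link
    .replace(/\*\*(.+?)\*\*/g, "$1") // **bold** -> bold
    .replace(/\*(.+?)\*/g, "$1") // *italic* -> italic
    .replace(/`([^`]+)`/g, "$1"); // `code` -> code
}
```

Order matters in a chain like this: bold is unwrapped before italic so that `**` is not consumed as two single asterisks.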
### Symbol-to-Speech Transformations
Symbols are converted to natural speech equivalents:
| Symbol | Spoken As | Example |
|--------|-----------|--------|
| `~` | "approximately" or "about" | ~2 min → "about 2 minutes" |
| `&` | "and" | A & B → "A and B" |
| `@` | "at" | user@email → "user at email" |
| `%` | "percent" | 50% → "50 percent" |
| `+` | "plus" | +10% → "plus 10 percent" |
| `→` | "leads to" or "becomes" | A → B → "A becomes B" |
| `—` | (pause) | word—word → "word (pause) word" |
| `#` | (context-dependent) | #1 → "number 1"; ## → (heading marker) |
| `<` / `>` | "less than" / "greater than" | x > 5 → "x greater than 5" |
| `≥` / `≤` | "greater than or equal" / "less than or equal" | |
| `µ` | "micro" | µg → "microgram" |
| `°` | "degrees" | 37°C → "37 degrees celsius" |
| `±` | "plus or minus" | ±5% → "plus or minus 5 percent" |
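A few rows of this table can be sketched as an ordered list of pattern and replacement pairs. `SYMBOL_RULES` and `speakSymbols` are hypothetical names, only a subset of the symbols is covered, and where the table allows two readings (for example "leads to" or "becomes") the sketch simply picks one.

```typescript
// Ordered rules: each pattern is applied to the output of the previous one.
const SYMBOL_RULES: Array<[RegExp, string]> = [
  [/±\s*/g, "plus or minus "],
  [/~\s*(?=\d)/g, "about "], // only when a number follows
  [/\+(?=\d)/g, "plus "], // only when a number follows
  [/(\d)\s*%/g, "$1 percent"],
  [/(\d)\s*°C\b/g, "$1 degrees celsius"],
  [/\s*→\s*/g, " leads to "],
  [/\s+&\s+/g, " and "],
];

function speakSymbols(text: string): string {
  let result = text;
  for (const [pattern, replacement] of SYMBOL_RULES) {
    result = result.replace(pattern, replacement);
  }
  return result;
}
```

Context-dependent symbols such as `#`, `<`, and `>` are left out on purpose, since they need to be distinguished from markdown syntax before any substitution is safe.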
### Time Duration Patterns (v2.1.0)
| Input | Spoken As |
|-------|----------|
| `4h` | "4 hours" |
| `30m` | "30 minutes" |
| `15s` | "15 seconds" |
| `2d` | "2 days" |
| `1w` | "1 week" |
| `90min` | "90 minutes" |
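One way to implement the duration patterns is a single regular expression with a unit lookup. The names `DURATION_UNITS` and `speakDurations` are hypothetical, and the sketch assumes the number and unit are written without a space, as in the table.

```typescript
// Unit abbreviations from the table, mapped to their singular names.
const DURATION_UNITS: { [unit: string]: string } = {
  h: "hour",
  m: "minute",
  min: "minute",
  s: "second",
  d: "day",
  w: "week",
};

function speakDurations(text: string): string {
  // "min" is listed before "m" so that "90min" is matched as one unit.
  // The word boundaries keep the rule from firing inside words like "5mb".
  return text.replace(
    /\b(\d+)(min|h|m|s|d|w)\b/g,
    (_match: string, count: string, unit: string) => {
      const plural = parseInt(count, 10) === 1 ? "" : "s";
      return count + " " + DURATION_UNITS[unit] + plural;
    }
  );
}
```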
### Emoji Pronunciation (v2.1.0)
| Emoji | Spoken As | Context |
|-------|-----------|--------|
| ✅ | "completed" | Status indicators |
| ❌ | "not done" | Status indicators |
| ⚠️ | "warning" | Alerts |
| 📋 | "planned" | Task status |
| 🔄 | "in progress" | Task status |
| ⏳ | "waiting" | Task status |
| 🔥 | "hot" or "high priority" | When followed by "High" |
| 🔓 | "unlocked" | Feature status |
| 💡 | "idea" | Suggestions |
| 🆕 | "new" | Version notes |
**Emoji-Text Deduplication**: When an emoji's meaning matches the text that follows it (e.g., `✅ Complete`), Alex says it only once ("completed", not "completed Complete").
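The deduplication rule could be sketched as follows. This is an assumption-heavy illustration: `EmojiRule`, `EMOJI_RULES`, and `speakEmoji` are hypothetical names, only four emoji are covered, and the `redundant` word lists (which words count as repeating an emoji's meaning) are guesses, not the extension's actual data.

```typescript
interface EmojiRule {
  emoji: string;
  spoken: string;
  // Lowercase words that repeat the emoji's meaning (assumed lists).
  redundant: string[];
}

const EMOJI_RULES: EmojiRule[] = [
  { emoji: "✅", spoken: "completed", redundant: ["completed", "complete", "done"] },
  { emoji: "❌", spoken: "not done", redundant: ["not done"] },
  { emoji: "🔄", spoken: "in progress", redundant: ["in progress", "active"] },
  { emoji: "💡", spoken: "idea", redundant: ["idea"] },
];

function speakEmoji(text: string): string {
  let result = text;
  for (const rule of EMOJI_RULES) {
    const parts = result.split(rule.emoji);
    if (parts.length === 1) continue; // emoji not present
    let rebuilt = parts[0];
    for (let i = 1; i < parts.length; i++) {
      let rest = parts[i].replace(/^\s+/, ""); // text following the emoji
      const lower = rest.toLowerCase();
      for (const word of rule.redundant) {
        // Drop the following word only if it is a whole-word match.
        const isPrefix = lower.indexOf(word) === 0;
        const endsAtBoundary = !/^[a-z]/.test(lower.slice(word.length));
        if (isPrefix && endsAtBoundary) {
          rest = rest.slice(word.length);
          break;
        }
      }
      const needsSpace = rest.length > 0 && !/^[\s.,;:!?]/.test(rest);
      rebuilt += rule.spoken + (needsSpace ? " " : "") + rest;
    }
    result = rebuilt;
  }
  return result;
}
```

Splitting on the emoji string rather than using a regular expression avoids surrogate-pair pitfalls with characters outside the Basic Multilingual Plane.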
### Table Reading (v2.1.0)
Markdown tables are converted to natural speech:
```markdown
| Name | Status |
|-------|----------|
| Alice | ✅ Done |
| Bob | 🔄 Active |
```
Becomes: "Table with 2 columns: Name, Status. Row 1: Name is Alice. Status is completed. Row 2: Name is Bob. Status is in progress."
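The table-to-speech conversion might be sketched like this. `splitRow`, `isSeparatorRow`, and `tableToSpeech` are hypothetical names, and the sketch speaks cell text as-is; emoji handling would be a separate pass and is left out here.

```typescript
// Split one markdown table row into trimmed cell strings.
function splitRow(line: string): string[] {
  let row = line.trim();
  if (row.charAt(0) === "|") row = row.slice(1);
  if (row.charAt(row.length - 1) === "|") row = row.slice(0, -1);
  return row.split("|").map((cell) => cell.trim());
}

// A separator row consists only of dashes with optional alignment colons.
function isSeparatorRow(cells: string[]): boolean {
  return cells.length > 0 && cells.every((cell) => /^:?-+:?$/.test(cell));
}

function tableToSpeech(markdown: string): string {
  const rows = markdown
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map(splitRow);
  if (rows.length === 0) return "";

  const headers = rows[0];
  const body = rows.slice(1).filter((cells) => !isSeparatorRow(cells));
  const sentences: string[] = [
    "Table with " + headers.length + " columns: " + headers.join(", ") + ".",
  ];
  body.forEach((cells, index) => {
    // Pair each header with the cell in the same column.
    const parts = headers.map(
      (header, col) => header + " is " + (cells[col] || "empty") + "."
    );
    sentences.push("Row " + (index + 1) + ": " + parts.join(" "));
  });
  return sentences.join(" ");
}
```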
Versions are spoken naturally with context awareness:
| Input | Spoken As | Why |
|---|---|---|
| `v4.2.9` | "version 4.2.9" | Standalone version |
| `Version: v4.2.9` | "Version: 4.2.9" | Already has "Version:" prefix |
Uses negative lookbehind to prevent redundant "version version".
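A minimal sketch of the lookbehind idea, assuming a two-pass approach (the actual pattern used by the extension is not shown in this document, and `speakVersions` is a hypothetical name):

```typescript
function speakVersions(text: string): string {
  return (
    text
      // Pass 1: expand "v1.2.3" to "version 1.2.3" unless the word
      // "version" (optionally followed by a colon) already precedes it.
      .replace(/(?<!version:?\s*)\bv(\d+(?:\.\d+)*)\b/gi, "version $1")
      // Pass 2: any remaining "v1.2.3" is preceded by "version",
      // so only the leading "v" is dropped.
      .replace(/\bv(\d+(?:\.\d+)*)\b/gi, "$1")
  );
}
```

Lookbehind assertions require an ES2018-capable JavaScript engine, which current Node.js and VS Code runtimes provide.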
Design Principle: Would a human reading this aloud say the symbol name, or translate it to meaning? Almost always the latter.
Edge TTS has undocumented size limits per WebSocket request. Documents over ~3000 characters (approximately 7 minutes of audio) can cause the connection to stall indefinitely, appearing to hang at "Synthesizing..." with no progress.
Chunking Strategy:
| Setting | Value | Rationale |
|---|---|---|
| `MAX_CHUNK_CHARS` | 3000 | Safe limit before Edge TTS stalls |
| `CHUNK_TIMEOUT_MS` | 60000 | 60 seconds per chunk |
| `MAX_RETRIES` | 3 | Retry failed chunks |
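A limit like `MAX_CHUNK_CHARS` could be applied with a paragraph-first, sentence-second splitter along these lines. `splitIntoChunks` is a hypothetical name, and the hard cut for a single sentence longer than the limit is an assumption added so the sketch always terminates with chunks under the limit.

```typescript
const MAX_CHUNK_CHARS = 3000;

function splitIntoChunks(
  text: string,
  maxChars: number = MAX_CHUNK_CHARS
): string[] {
  const chunks: string[] = [];
  let current = "";

  const flush = (): void => {
    if (current.length > 0) {
      chunks.push(current);
      current = "";
    }
  };
  // Add a piece to the current chunk, starting a new chunk if it would overflow.
  const append = (piece: string, joiner: string): void => {
    if (
      current.length > 0 &&
      current.length + joiner.length + piece.length > maxChars
    ) {
      flush();
    }
    current = current.length > 0 ? current + joiner + piece : piece;
  };

  for (const paragraph of text.split(/\n\n+/)) {
    const trimmed = paragraph.trim();
    if (trimmed.length === 0) continue;
    if (trimmed.length <= maxChars) {
      append(trimmed, "\n\n");
      continue;
    }
    // Oversized paragraph: fall back to sentence boundaries (. ! ?).
    const sentences = trimmed.match(/[^.!?]+[.!?]*\s*/g) || [trimmed];
    for (const sentence of sentences) {
      let rest = sentence.trim();
      // Assumption: a sentence longer than the limit is cut hard.
      while (rest.length > maxChars) {
        flush();
        chunks.push(rest.slice(0, maxChars));
        rest = rest.slice(maxChars);
      }
      if (rest.length > 0) append(rest, " ");
    }
  }
  flush();
  return chunks;
}
```

Each returned chunk would then be synthesized as its own request, which is where a per-chunk timeout and retry count come in.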
Chunk Splitting Logic:
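- Split on paragraph breaks (`\n\n`) first
- Split oversized paragraphs on sentence endings (`. ` or `! ` or `? `)
- Progress is reported as `Synthesizing speech [n/N]...`

Retry with Exponential Backoff: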
| Attempt | Delay | Formula |
|---|---|---|
| 1 | ~1s | 1000 + jitter |
| 2 | ~2s | 2000 + jitter |
| 3 | ~4s | 4000 + jitter |
Jitter (0-500ms random) prevents thundering herd on concurrent requests.
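The delay schedule in the table can be reproduced with a short function. The names `backoffDelayMs` and `withRetry` are hypothetical, as is the choice to inject the random source and the sleep function (which makes the logic easy to exercise without real waiting).

```typescript
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1000;
const MAX_JITTER_MS = 500;

// Delay before retry number `attempt` (1-based): 1000, 2000, 4000 ms plus jitter.
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random
): number {
  const exponential = BASE_DELAY_MS * Math.pow(2, attempt - 1);
  const jitter = Math.floor(random() * MAX_JITTER_MS);
  return exponential + jitter;
}

// Run an operation, retrying up to MAX_RETRIES times with backoff in between.
async function withRetry<T>(
  operation: () => Promise<T>,
  sleep: (ms: number) => Promise<void>
): Promise<T> {
  let lastError: unknown = undefined;
  for (let retry = 0; retry <= MAX_RETRIES; retry++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (retry < MAX_RETRIES) {
        await sleep(backoffDelayMs(retry + 1));
      }
    }
  }
  throw lastError;
}
```

In practice `sleep` could be as simple as `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`.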
For documents over 5 minutes (~750 words), Alex offers to summarize before reading:
This document is approximately 32 minutes long (~4800 words).
Would you like to:
- Read full content (~32 min)
- Summarize for speech (~3 min) ← Recommended
Summarization uses the VS Code Language Model API (GPT-4o preferred) with a target of ~450 words (~3 minutes).
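The figures above (750 words for 5 minutes, 4800 words for 32 minutes, 450 words for 3 minutes) all imply a speaking rate of about 150 words per minute. A sketch of the threshold check under that assumption, with hypothetical names throughout:

```typescript
// Assumed speaking rate, derived from the figures in the text.
const WORDS_PER_MINUTE = 150;
const SUMMARIZE_THRESHOLD_MINUTES = 5;

function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length;
}

function estimateMinutes(wordCount: number): number {
  return wordCount / WORDS_PER_MINUTE;
}

// Offer a summary only when the estimate exceeds the threshold.
function shouldOfferSummary(wordCount: number): boolean {
  return estimateMinutes(wordCount) > SUMMARIZE_THRESHOLD_MINUTES;
}
```

For example, `estimateMinutes(4800)` gives 32, matching the prompt shown above.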
Bluetooth and USB speakers often need time to "wake up" from power-saving mode. A 2-second delay before playback starts ensures the first words aren't clipped:
const SPEAKER_WARMUP_MS = 2000;
// Status shows "Preparing speakers..." during delay
TTS v2 is built into the Alex VS Code extension. No separate installation required.
The extension automatically includes:
- `ws` (WebSocket client for Edge TTS connection)
- `fs-extra` (file operations for audio saving)

After an extension update, verify that TTS works:
- Ctrl+Alt+R (Windows/Linux) or Cmd+Alt+R (macOS)

Press Ctrl+Alt+R to read document aloud
Select text first to read only selection
Command Palette → "Alex: Save as Audio"
Choose output location → MP3 saved
Command Palette → "Alex: Read with Voice Selection"
Choose: Default | Warm | British | Friendly