# Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.
## Create an isolated venv

On Ubuntu, create an isolated venv:

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```

Install the packages in the venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```
What gets installed:

- `faster-whisper`: the Python transcription library
- Dependencies: `ctranslate2`, `onnxruntime`, `huggingface-hub`, `av`, `numpy`, and others

## Transcription script

File: `~/.openclaw/workspace/voice-messages/transcribe.py`
```python
#!/usr/bin/env python3
import argparse

from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        # int8 keeps memory low on CPU; float16 is faster on GPU
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()
    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```
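The joining rule in `transcribe()` (strip each segment, drop empty ones, join with spaces) can be checked in isolation. A minimal sketch, using a hypothetical `Seg` dataclass as a stand-in for faster-whisper's segment objects (only `.text` is used):

```python
from dataclasses import dataclass


# Hypothetical stand-in for faster-whisper's Segment; only .text matters here
@dataclass
class Seg:
    text: str


def join_segments(segments) -> str:
    # Same joining rule as transcribe(): strip each segment, drop empties
    return " ".join(s.text.strip() for s in segments if s.text and s.text.strip()).strip()


segs = [Seg(" Hello. "), Seg(""), Seg(" How are you? ")]
print(join_segments(segs))  # Hello. How are you?
```

The empty segment is filtered out, so no double spaces appear in the result.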
What the script does:

- `--audio`: path to the audio file (required)
- `--model`: Whisper model, `small` by default
- `--lang`: language, `en` for English
- `--device`: `cpu` (default) or `cuda`

Make it executable:

```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```
## Transcription config (`tools.media.audio`)

Add to `~/.openclaw/openclaw.json`:
```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```
Parameters:

| Parameter | Value | Description |
|---|---|---|
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to the venv's `python` |
| `args` | argument array | Arguments passed to the script |
| `{{MediaPath}}` | placeholder | Replaced with the audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
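The `{{MediaPath}}` substitution can be illustrated with a small sketch. This mirrors how a gateway might expand the placeholder before launching the CLI command; it is not OpenClaw's actual implementation:

```python
def expand_args(args: list[str], media_path: str) -> list[str]:
    # Replace the {{MediaPath}} placeholder in each CLI argument
    return [a.replace("{{MediaPath}}", media_path) for a in args]


args = ["transcribe.py", "--audio", "{{MediaPath}}", "--lang", "en"]
print(expand_args(args, "/tmp/voice.ogg"))
# ['transcribe.py', '--audio', '/tmp/voice.ogg', '--lang', 'en']
```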
## Voice reply config (`messages.tts`)

Add to `~/.openclaw/openclaw.json`:
```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```
Parameters:

| Parameter | Value | Description |
|---|---|---|
| `auto` | `"inbound"` | Key mode: reply with voice only to incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available voices below) |
| `lang` | `"en-US"` | Locale (`en-US` for US English) |
## Complete configuration

Both sections combined (note: JSON does not allow trailing commas):

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```
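A quick sanity check that the combined config parses as valid JSON and contains both sections; a sketch using only the standard library, with the config abbreviated for brevity:

```python
import json

# Abbreviated version of the combined config above; json.loads raises
# json.JSONDecodeError on syntax errors such as stray trailing commas
config_text = '''
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [{"type": "cli", "timeoutSeconds": 120}]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {"voice": "en-US-JennyNeural", "lang": "en-US"}
    },
    "ackReactionScope": "group-mentions"
  }
}
'''
cfg = json.loads(config_text)
print(cfg["tools"]["media"]["audio"]["maxBytes"])  # 20971520
print(cfg["messages"]["tts"]["edge"]["voice"])     # en-US-JennyNeural
```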
## Restart the gateway

```bash
# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```
## Test transcription

Action: send a voice message to your Telegram bot.

Expected result:

```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

Example response:

```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?
```
## Test the voice reply

Action: after a successful transcription, the bot should send a voice reply.

Expected behavior: the bot answers the incoming voice message with a voice message generated by Edge TTS.
## Available voices

Female:

| Voice | ID | Notes |
|---|---|---|
| Jenny | `en-US-JennyNeural` | Current default |
| Ana | `en-US-AnaNeural` | Softer |

Male:

| Voice | ID | Notes |
|---|---|---|
| Roger | `en-US-RogerNeural` | Deeper |
How to change the voice:

```bash
jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' \
  ~/.openclaw/openclaw.json > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```
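The same edit can be done without `jq`. A Python sketch using a write-to-temp-then-rename pattern, so a crash mid-write cannot truncate the config (`set_voice` is our own helper name, not an OpenClaw API):

```python
import json
import os
import tempfile


def set_voice(path: str, voice: str) -> None:
    # Read-modify-write with an atomic rename, mirroring the jq + mv pattern
    with open(path) as f:
        cfg = json.load(f)
    cfg.setdefault("messages", {}).setdefault("tts", {}).setdefault("edge", {})["voice"] = voice
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(cfg, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX filesystems


# Demo against a throwaway file
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "openclaw.json")
    with open(p, "w") as f:
        json.dump({"messages": {"tts": {"edge": {"voice": "en-US-JennyNeural"}}}}, f)
    set_voice(p, "en-US-MichelleNeural")
    with open(p) as f:
        print(json.load(f)["messages"]["tts"]["edge"]["voice"])  # en-US-MichelleNeural
```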
Optional fine-tuning of speed, pitch, and volume:

```jsonc
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",   // Speed: -50% to +100%
        "pitch": "-5%",   // Pitch: -50% to +50%
        "volume": "+5%"   // Volume: -100% to +100%
      }
    }
  }
}
```
## Troubleshooting

### Transcription fails

Logs show:

```
[ERROR] Transcription failed
```

Possible causes:

1. File too large (> 20 MB). Solution: increase `maxBytes` in the config, e.g. `"maxBytes": 52428800` (50 MB).
2. Timeout: transcription took longer than 2 minutes. Solution: increase `timeoutSeconds`, e.g. `"timeoutSeconds": 180` (3 minutes).
3. Model not downloaded yet (first run). Solution: wait 1-2 minutes while it downloads; models are cached in `~/.cache/huggingface/`.
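The first cause can be caught before the gateway rejects the file. A small pre-flight sketch (the limit value mirrors `maxBytes` above; the helper name is ours):

```python
import os

MAX_BYTES = 20971520  # keep in sync with maxBytes in openclaw.json


def fits_limit(path: str, max_bytes: int = MAX_BYTES) -> bool:
    # True when the audio file is within the configured size limit
    return os.path.getsize(path) <= max_bytes


# Example: check a file before handing it to the transcriber
# if not fits_limit("/tmp/voice.ogg"):
#     print("File too large; raise maxBytes or re-encode the audio")
```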
### No voice reply

Possible causes:

1. The reply is too short (< 10 characters).
2. `auto: "inbound"` is set but the incoming message was text; `inbound` mode replies with voice only to voice messages.
3. Edge TTS is unavailable:

```bash
# Check availability
curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
# If this errors, the service is temporarily unavailable
```
## Model performance

| Whisper model | Est. time | Quality |
|---|---|---|
| `tiny` | ~5-10 s | Low |
| `base` | ~10-20 s | Medium |
| `small` | ~20-40 s | High (current) |
| `medium` | ~40-80 s | Very high |
| `large` | ~80-160 s | Maximum |
Recommendation: on a Raspberry Pi use `small` or `base`; `medium`/`large` will be very slow.

Models download automatically on first run and are cached in `~/.cache/huggingface/`.
## Summary

After completing these steps:

- the venv with faster-whisper is installed
- the `transcribe.py` script is created
- `openclaw.json` is configured for transcription and voice replies

Now your Telegram bot transcribes incoming voice messages and replies to them with voice.

Useful links:

```bash
npx clawhub search voice
```

Created: 2026-03-01 for OpenClaw 2026.2.26