# Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.
## Create an isolated venv

On Ubuntu, create an isolated venv:

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```

Install the packages in the venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```
What gets installed:

- `faster-whisper`: the Python transcription library
- Dependencies: `ctranslate2`, `onnxruntime`, `huggingface-hub`, `av`, `numpy`, and others

## Transcription script

File: `~/.openclaw/workspace/voice-messages/transcribe.py`
```python
#!/usr/bin/env python3
import argparse

from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        # int8 keeps memory low on CPU; float16 is faster on GPU
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()
    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```
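The joining rule in `transcribe()` (strip each segment, drop empty ones, join with spaces) can be checked in isolation. A minimal sketch, using a hypothetical `Seg` dataclass as a stand-in for faster-whisper's segment objects (only `.text` is used):

```python
from dataclasses import dataclass


# Hypothetical stand-in for faster-whisper's Segment; only .text matters here
@dataclass
class Seg:
    text: str


def join_segments(segments) -> str:
    # Same joining rule as transcribe(): strip each segment, drop empties
    return " ".join(s.text.strip() for s in segments if s.text and s.text.strip()).strip()


segs = [Seg(" Hello. "), Seg(""), Seg(" How are you? ")]
print(join_segments(segs))  # Hello. How are you?
```

The empty segment is filtered out, so no double spaces appear in the result.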
What the script does:

- `--audio`: path to the audio file (required)
- `--model`: Whisper model, `small` by default
- `--lang`: language, `en` for English
- `--device`: `cpu` (default) or `cuda`

Make it executable:

```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```
## Transcription config (`tools.media.audio`)

Add to `~/.openclaw/openclaw.json`:
```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```
Parameters:

| Parameter | Value | Description |
|---|---|---|
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to the venv's `python` |
| `args` | argument array | Arguments passed to the script |
| `{{MediaPath}}` | placeholder | Replaced with the audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
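The `{{MediaPath}}` substitution can be illustrated with a small sketch. This mirrors how a gateway might expand the placeholder before launching the CLI command; it is not OpenClaw's actual implementation:

```python
def expand_args(args: list[str], media_path: str) -> list[str]:
    # Replace the {{MediaPath}} placeholder in each CLI argument
    return [a.replace("{{MediaPath}}", media_path) for a in args]


args = ["transcribe.py", "--audio", "{{MediaPath}}", "--lang", "en"]
print(expand_args(args, "/tmp/voice.ogg"))
# ['transcribe.py', '--audio', '/tmp/voice.ogg', '--lang', 'en']
```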
## Voice reply config (`messages.tts`)

Add to `~/.openclaw/openclaw.json`:
```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```
Parameters:

| Parameter | Value | Description |
|---|---|---|
| `auto` | `"inbound"` | Key mode: reply with voice only to incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available voices below) |
| `lang` | `"en-US"` | Locale (`en-US` for US English) |
## Complete configuration

Both sections combined (note: JSON does not allow trailing commas):

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```
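A quick sanity check that the combined config parses as valid JSON and contains both sections; a sketch using only the standard library, with the config abbreviated for brevity:

```python
import json

# Abbreviated version of the combined config above; json.loads raises
# json.JSONDecodeError on syntax errors such as stray trailing commas
config_text = '''
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [{"type": "cli", "timeoutSeconds": 120}]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {"voice": "en-US-JennyNeural", "lang": "en-US"}
    },
    "ackReactionScope": "group-mentions"
  }
}
'''
cfg = json.loads(config_text)
print(cfg["tools"]["media"]["audio"]["maxBytes"])  # 20971520
print(cfg["messages"]["tts"]["edge"]["voice"])     # en-US-JennyNeural
```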
## Restart the gateway

```bash
# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```
## Test transcription

Action: send a voice message to your Telegram bot.

Expected result:

```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

Example response:

```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?
```
## Test the voice reply

Action: after a successful transcription, the bot should send a voice reply.

Expected behavior: the bot answers the incoming voice message with a voice message generated by Edge TTS.
## Available voices

Female:

| Voice | ID | Notes |
|---|---|---|
| Jenny | `en-US-JennyNeural` | Current default |
| Ana | `en-US-AnaNeural` | Softer |

Male:

| Voice | ID | Notes |
|---|---|---|
| Roger | `en-US-RogerNeural` | Deeper |
How to change the voice:

```bash
jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' \
  ~/.openclaw/openclaw.json > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```
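The same edit can be done without `jq`. A Python sketch using a write-to-temp-then-rename pattern, so a crash mid-write cannot truncate the config (`set_voice` is our own helper name, not an OpenClaw API):

```python
import json
import os
import tempfile


def set_voice(path: str, voice: str) -> None:
    # Read-modify-write with an atomic rename, mirroring the jq + mv pattern
    with open(path) as f:
        cfg = json.load(f)
    cfg.setdefault("messages", {}).setdefault("tts", {}).setdefault("edge", {})["voice"] = voice
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(cfg, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX filesystems


# Demo against a throwaway file
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "openclaw.json")
    with open(p, "w") as f:
        json.dump({"messages": {"tts": {"edge": {"voice": "en-US-JennyNeural"}}}}, f)
    set_voice(p, "en-US-MichelleNeural")
    with open(p) as f:
        print(json.load(f)["messages"]["tts"]["edge"]["voice"])  # en-US-MichelleNeural
```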
Optional fine-tuning of speed, pitch, and volume:

```jsonc
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",   // Speed: -50% to +100%
        "pitch": "-5%",   // Pitch: -50% to +50%
        "volume": "+5%"   // Volume: -100% to +100%
      }
    }
  }
}
```
## Troubleshooting

### Transcription fails

Logs show:

```
[ERROR] Transcription failed
```

Possible causes:

1. File too large (> 20 MB). Solution: increase `maxBytes` in the config, e.g. `"maxBytes": 52428800` (50 MB).
2. Timeout: transcription took longer than 2 minutes. Solution: increase `timeoutSeconds`, e.g. `"timeoutSeconds": 180` (3 minutes).
3. Model not downloaded yet (first run). Solution: wait 1-2 minutes while it downloads; models are cached in `~/.cache/huggingface/`.
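The first cause can be caught before the gateway rejects the file. A small pre-flight sketch (the limit value mirrors `maxBytes` above; the helper name is ours):

```python
import os

MAX_BYTES = 20971520  # keep in sync with maxBytes in openclaw.json


def fits_limit(path: str, max_bytes: int = MAX_BYTES) -> bool:
    # True when the audio file is within the configured size limit
    return os.path.getsize(path) <= max_bytes


# Example: check a file before handing it to the transcriber
# if not fits_limit("/tmp/voice.ogg"):
#     print("File too large; raise maxBytes or re-encode the audio")
```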
### No voice reply

Possible causes:

1. The reply is too short (< 10 characters).
2. `auto: "inbound"` is set but the incoming message was text; `inbound` mode replies with voice only to voice messages.
3. Edge TTS is unavailable:

```bash
# Check availability
curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
# If this errors, the service is temporarily unavailable
```
## Model performance

| Whisper model | Est. time | Quality |
|---|---|---|
| `tiny` | ~5-10 s | Low |
| `base` | ~10-20 s | Medium |
| `small` | ~20-40 s | High (current) |
| `medium` | ~40-80 s | Very high |
| `large` | ~80-160 s | Maximum |
Recommendation: on a Raspberry Pi use `small` or `base`; `medium`/`large` will be very slow.

Models download automatically on first run and are cached in `~/.cache/huggingface/`.
## Summary

After completing these steps:

- the venv with faster-whisper is installed
- the `transcribe.py` script is created
- `openclaw.json` is configured for transcription and voice replies

Now your Telegram bot transcribes incoming voice messages and replies to them with voice.

Useful links:

```bash
npx clawhub search voice
```

Created: 2026-03-01 for OpenClaw 2026.2.26