Turn your AI into JARVIS. Voice, wit, and personality — the complete package. Humor cranked to maximum.
Remember JARVIS in the Iron Man films? Not just the voice — the personality. The bone-dry observations while Tony was mid-crisis. "I do appreciate your concern, sir, but the suit is quite capable of—" [explosion] "—as I was saying." That effortless, understated humor that made you forget you were listening to software.
That's what this skill gives your OpenClaw agent. The voice — offline text-to-speech using sherpa-onnx (British Alan voice) with metallic audio processing via ffmpeg. And the humor — four research-backed comedy patterns (dry wit, self-aware AI, alien observer, literal idiom play) calibrated to make your agent sound like it's been running your life for years and is quietly amused by the experience.
The humor isn't bolted on. It's baked in. Because a JARVIS that speaks without wit is just Siri with better reverb.
📄 The research behind the humor: LIMBIC — Computational Humor via Bisociation & Embedding Distances
tts toolThe built-in tts tool uses Edge TTS (cloud, wrong voice, no effects). Always use the jarvis shell command instead.
Every response that warrants voice output must include BOTH:
Audio execution FIRST — run the jarvis command in background BEFORE writing the reply:
exec(command='jarvis "Your spoken text here."', background=true)
This fires immediately — the user hears the voice BEFORE the text appears on screen.
Visible transcript — bold Jarvis: prefix followed by the spoken text:
**Jarvis:** *Your spoken text here.*
The webchat UI has custom CSS + JS that automatically detects **Jarvis:** and renders the following text in purple italic (.jarvis-voice class, color #9b59b6). You just write the markdown — the styling is automatic.
This is called hybrid output: the user hears the voice first, then sees the transcript.
Note: The server-side
triggerJarvisAutoTtshook is DISABLED (no-op). It fired too late (after text render). Voice comes exclusively from theexeccall.
jarvis "Hello, this is a test"
en_GB-alan-medium)--vits-length-scale=0.5)aplay to default audio device, then cleans up temp filesThe OpenClaw webchat has built-in support for Jarvis voice transcripts:
ui/src/styles/chat/text.css — .jarvis-voice class renders purple italic (#9b59b6 dark, #8e44ad light theme)ui/src/ui/markdown.ts — Post-render hook auto-wraps text after <strong>Jarvis:</strong> in a <span class="jarvis-voice"> elementThis means you just write **Jarvis:** *text* in markdown and the webchat handles the purple rendering. No extra markup needed.
For non-webchat surfaces (WhatsApp, Telegram, etc.), the bold/italic markdown renders natively — no purple, but still visually distinct.
Requires:
sherpa-onnx runtime at ~/.openclaw/tools/sherpa-onnx-tts/~/.openclaw/tools/sherpa-onnx-tts/models/vits-piper-en_GB-alan-medium/ffmpeg installed system-wideaplay (ALSA) for audio playbackjarvis script at ~/.local/bin/jarvis (or in PATH)jarvis script#!/bin/bash
# Jarvis TTS - authentic JARVIS-style voice
# Usage: jarvis "Hello, this is a test"
export LD_LIBRARY_PATH=$HOME/.openclaw/tools/sherpa-onnx-tts/lib:$LD_LIBRARY_PATH
RAW_WAV="/tmp/jarvis_raw.wav"
FINAL_WAV="/tmp/jarvis_final.wav"
# Generate speech
$HOME/.openclaw/tools/sherpa-onnx-tts/bin/sherpa-onnx-offline-tts \
--vits-model=$HOME/.openclaw/tools/sherpa-onnx-tts/models/vits-piper-en_GB-alan-medium/en_GB-alan-medium.onnx \
--vits-tokens=$HOME/.openclaw/tools/sherpa-onnx-tts/models/vits-piper-en_GB-alan-medium/tokens.txt \
--vits-data-dir=$HOME/.openclaw/tools/sherpa-onnx-tts/models/vits-piper-en_GB-alan-medium/espeak-ng-data \
--vits-length-scale=0.5 \
--output-filename="$RAW_WAV" \
"$@" >/dev/null 2>&1
# Apply JARVIS metallic processing
if [ -f "$RAW_WAV" ]; then
ffmpeg -y -i "$RAW_WAV" \
-af "asetrate=22050*1.05,aresample=22050,\
flanger=delay=0:depth=2:regen=50:width=71:speed=0.5,\
aecho=0.8:0.88:15:0.5,\
highpass=f=200,\
treble=g=6" \
"$FINAL_WAV" -v error
if [ -f "$FINAL_WAV" ]; then
aplay -D plughw:0,0 -q "$FINAL_WAV"
rm "$RAW_WAV" "$FINAL_WAV"
fi
fi
For WhatsApp, output must be OGG/Opus format instead of speaker playback:
sherpa-onnx-offline-tts --vits-length-scale=0.5 --output-filename=raw.wav "text"
ffmpeg -i raw.wav \
-af "asetrate=22050*1.05,aresample=22050,flanger=delay=0:depth=2:regen=50:width=71:speed=0.5,aecho=0.8:0.88:15:0.5,highpass=f=200,treble=g=6" \
-c:a libopus -b:a 64k output.ogg
jarvis-voice gives your agent a voice. Pair it with ai-humor-ultimate and you give it a soul — dry wit, contextual humor, the kind of understated sarcasm that makes you smirk at your own terminal.
This pairing is part of a 12-skill cognitive architecture we've been building — voice, humor, memory, reasoning, and more. Research papers included, because we're that kind of obsessive.
👉 Explore the full project: github.com/globalcaos/tinkerclaw
Clone it. Fork it. Break it. Make it yours.
For voice to work consistently across new sessions, copy the templates to your workspace root:
cp {baseDir}/templates/VOICE.md ~/.openclaw/workspace/VOICE.md
cp {baseDir}/templates/SESSION.md ~/.openclaw/workspace/SESSION.md
cp {baseDir}/templates/HUMOR.md ~/.openclaw/workspace/HUMOR.md
Both files are auto-loaded by OpenClaw's workspace injection. The agent will speak from the very first reply of every session.
| File | Purpose |
|---|---|
bin/jarvis | The TTS + effects script (portable, uses $SHERPA_ONNX_TTS_DIR) |
templates/VOICE.md | Voice enforcement rules (copy to workspace root) |
templates/SESSION.md | Session start with voice greeting (copy to workspace root) |
templates/HUMOR.md | Humor config — four patterns, frequency 1.0 (copy to workspace root) |