<purpose>
Explain anything — turn ideas into podcasts, explainer videos, or voice narration. Use when the user wants to "make a podcast", "create an explainer video", "read this aloud", "generate an image", or share knowledge in audio/visual form. Supports: topic descriptions, YouTube links, article URLs, plain text, and image prompts.
Four modes, one entry point:
Users don't need to remember APIs, modes, or parameters. Just say what you want.
</purpose>
<instructions>The scripts are the ONLY interface. Period.
┌─────────────────────────────────────────────────────────┐
│ AI Agent ──▶ ./scripts/*.sh ──▶ ListenHub API           │
│                    ▲                                    │
│                    │                                    │
│         This is the ONLY path.                          │
│     Direct API calls are FORBIDDEN.                     │
└─────────────────────────────────────────────────────────┘
</instructions>
<examples>
<example name="podcast-request">
<user>Make a podcast about the latest AI developments</user>
<response>
→ Got it! Preparing two-person podcast...
Topic: Latest AI developments
<example name="explainer-request">
<user>Create an explainer video introducing Claude Code</user>
<response>
→ Got it! Preparing explainer video...
Topic: Claude Code introduction
<example name="tts-request">
<user>Convert this article to speech https://blog.example.com/article</user>
<response>
→ Got it! Parsing article...
<example name="image-generation-short-prompt">
<user>Generate an image: cyberpunk city at night</user>
<response>
→ Short prompt detected. Would you like help enriching it with style/lighting/composition details, or use it as-is?
</response>
</example>
<example name="image-generation-detailed-prompt">
<user>Generate an image: "Cyberpunk city at night, neon lights reflecting on wet streets, towering skyscrapers with holographic ads, flying vehicles, cinematic composition, highly detailed, 8K quality"</user>
<response>
→ Generating image...
<example name="image-with-reference">
<user>Generate an image in this style: https://example.com/style-ref.jpg, prompt: "a futuristic car"</user>
<response>
→ Generating image with reference...
<example name="status-check">
<user>Done yet?</user>
<response>
✓ Podcast generated!
</examples>
MUST:
- Invoke the shell scripts under **/skills/listenhub/scripts/ for every operation.
MUST NOT:
- Call the ListenHub API directly, guess endpoints, or hand-construct requests.
Why: The API is proprietary. Endpoints, parameters, and speakerIds are NOT publicly documented. Web searches will NOT find this information. Any attempt to bypass the scripts will produce incorrect, non-functional code.
Scripts are located at **/skills/listenhub/scripts/ relative to your working context.
Different AI clients use different dot-directories:
- Claude Code: .claude/skills/listenhub/scripts/
- Other clients: .cursor/, .windsurf/, etc.
Resolution: Use the glob pattern **/skills/listenhub/scripts/*.sh to locate scripts reliably, or resolve from the SKILL.md file's own path.
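That resolution step can be sketched as a small helper. This is a minimal sketch, assuming a POSIX shell with find available; the function name is illustrative, not part of the shipped scripts:

```shell
# Locate the listenhub scripts directory under any client dot-directory
# (.claude, .cursor, .windsurf, ...). Searches from the given root
# (default: current directory) and takes the first match.
resolve_scripts_dir() {
  find "${1:-.}" -type d -path '*/skills/listenhub/scripts' 2>/dev/null | head -n 1
}
```

Usage: `SCRIPTS=$(resolve_scripts_dir)` and verify `[ -n "$SCRIPTS" ]` before invoking any script.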
The following are internal implementation details that AI cannot reliably know:
| Category | Examples | How to Obtain |
|---|---|---|
| API Base URL | api.marswave.ai/... | ✗ Cannot — internal to scripts |
| Endpoints | podcast/episodes, etc. | ✗ Cannot — internal to scripts |
| Speaker IDs | cozy-man-english, etc. | ✓ Call get-speakers.sh |
| Request schemas | JSON body structure | ✗ Cannot — internal to scripts |
| Response formats | Episode ID, status codes | ✓ Documented per script |
Rule: If information is not in this SKILL.md or retrievable via a script (like get-speakers.sh), assume you don't know it.
Hide complexity, reveal magic.
Users don't need to know: Episode IDs, API structure, polling mechanisms, credits, endpoint differences. Users only need: Say idea → wait a moment → get the link.
API key stored in $LISTENHUB_API_KEY. Check on first use:
source ~/.zshrc 2>/dev/null; [ -n "$LISTENHUB_API_KEY" ] && echo "ready" || echo "need_setup"
If setup is needed, guide the user to obtain an API key (the lh_sk_... part) and export it as $LISTENHUB_API_KEY.
Image generation uses the same ListenHub API key stored in $LISTENHUB_API_KEY.
Image generation output path defaults to the user downloads directory, stored in $LISTENHUB_OUTPUT_DIR.
On first image generation, the script auto-guides configuration:
Security: Never expose full API keys in output.
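When confirming setup, show only a masked form of the key. A sketch; the helper name and masked format are illustrative:

```shell
# Print only the prefix and last four characters of a key, e.g.
# lh_sk_****mnop. Never echo $LISTENHUB_API_KEY verbatim.
mask_key() {
  k=$1
  printf '%s****%s\n' "$(printf '%s' "$k" | cut -c1-6)" "$(printf '%s' "$k" | tail -c 4)"
}
```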
Auto-detect mode from user input:
→ Podcast (1-2 speakers)
Supports single-speaker or dual-speaker podcasts. Debate mode requires 2 speakers.
Default mode: quick unless explicitly requested.
If speakers are not specified, call get-speakers.sh and select the first speakerId
matching the chosen language.
If reference materials are provided, pass them as --source-url or --source-text.
When the user only provides a topic (e.g., "I want a podcast about X"), proceed with:
- language: inferred from the user input
- mode: quick
- speakers: the first speakerId from get-speakers.sh matching the language
→ Explain (Explainer video)
→ TTS (Text-to-speech)
TTS defaults to FlowSpeech direct for single-pass text or URL narration.
Script arrays and multi-speaker dialogue belong to Speech as an advanced path, not the default TTS entry.
Text-to-speech input is limited to 10,000 characters; split or use a URL when longer.
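The character limit can be checked before submission. A minimal sketch; the helper name is illustrative, and wc -m assumes a UTF-8 locale for multibyte text:

```shell
TTS_MAX_CHARS=10000

# Returns success (0) when the text exceeds the TTS limit and must be
# split or submitted as a --type url request instead.
tts_too_long() {
  [ "$(printf '%s' "$1" | wc -m)" -gt "$TTS_MAX_CHARS" ]
}
```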
When the request is ambiguous (e.g., "convert to speech", "read aloud"), apply:
- mode: direct, to avoid altering content
- type: URLs use type=url, plain text uses type=text
- speakers: call get-speakers and pick the first speakerId matching the language
- per-line speaker assignment requires scripts (the Speech advanced path)
Example guidance:
"This request can use FlowSpeech with the default direct mode; switch to smart for grammar and punctuation fixes. For per-line speaker assignment, provide scripts and switch to Speech."
→ Image Generation
Reference Images via Image Hosts
When reference images are local files, upload to a known image host and use the direct image URL in --reference-images.
Recommended hosts: imgbb.com, sm.ms, postimages.org, imgur.com.
Direct image URLs should end with .jpg, .png, .webp, or .gif.
Default: If unclear, ask user which format they prefer.
Explicit override: User can say "make it a podcast" / "I want explainer video" / "just voice" / "generate image" to override auto-detection.
→ Got it! Preparing...
Mode: Two-person podcast
Topic: Latest developments in Manus AI
For URLs, identify type:
- youtu.be/XXX → convert to https://www.youtube.com/watch?v=XXX
→ Generation submitted
Estimated time:
• Podcast: 2-3 minutes
• Explain: 3-5 minutes
• TTS: 1-2 minutes
You can:
• Wait and ask "done yet?"
• Use check-status via scripts
• View outputs in product pages:
- Podcast: https://listenhub.ai/app/podcast
- Explain: https://listenhub.ai/app/explainer
- Text-to-Speech: https://listenhub.ai/app/text-to-speech
• Do other things, ask later
Internally remember Episode ID for status queries.
When user says "done yet?" / "ready?" / "check status":
Podcast result:
✓ Podcast generated!
"{title}"
Episode: https://listenhub.ai/app/episode/{episodeId}
Duration: ~{duration} minutes
Download audio: provide audioUrl or audioStreamUrl on request
One-stage podcast creation generates an online task. When status is success, the episode detail already includes scripts and audio URLs. Download uses the returned audioUrl or audioStreamUrl without a second create call. Two-stage creation is only for script review or manual edits before audio generation.
Explain result:
✓ Explainer video generated!
"{title}"
Watch: https://listenhub.ai/app/explainer
Duration: ~{duration} minutes
Need to download audio? Just say so.
Image result:
✓ Image generated!
~/Downloads/labnana-{timestamp}.jpg
Image results are file-only and not shown in the web UI.
Important: Prioritize the web experience. Only provide download URLs when the user explicitly requests them.
Scripts are shell-based. Locate via **/skills/listenhub/scripts/.
Dependencies: jq is required for request construction, and the scripts use curl for transport. Ensure both are installed before invoking any script.
⚠️ Long-running Tasks: Generation may take 1-5 minutes. Use your CLI client's native background execution feature:
For example, set run_in_background: true in the Bash tool.
Invocation pattern:
$SCRIPTS/script-name.sh [args]
Where $SCRIPTS = resolved path to **/skills/listenhub/scripts/
Default path. Use unless script review or manual editing is required.
$SCRIPTS/create-podcast.sh --query "The future of AI development" --language en --mode deep --speakers cozy-man-english
$SCRIPTS/create-podcast.sh --query "Analyze this article" --language en --mode deep --speakers cozy-man-english --source-url "https://example.com/article"
Multiple --source-url and --source-text arguments are supported to combine several references in one request.
Advanced path. Use only when script review or edits are explicitly requested.
The entire value of two-stage generation is human review between stages. Skipping review reduces it to one-stage with extra latency — never do this.
Stage 1: Generate text content.
$SCRIPTS/create-podcast-text.sh --query "AI history" --language en --mode deep --speakers cozy-man-english,travel-girl-english
Review Gate (mandatory): After text generation completes, the agent MUST:
1. Run check-status.sh --wait to poll until completion. On exit code 2 (timeout or rate-limited), wait briefly and retry.
2. Write ~/Downloads/podcast-draft-<episode-id>.md — a human-readable version assembled from the response fields (title, outline, sourceProcessResult.content, and the scripts array formatted as readable dialogue). This is for the user to review.
3. Write ~/Downloads/podcast-scripts-<episode-id>.json — the raw {"scripts": [...]} object extracted from the response, exactly in the format that create-podcast-audio.sh --scripts expects. This is the machine-readable source of truth for Stage 2.
4. Open the draft for the user (e.g., via the open command on macOS) and wait for approval.
5. If the user approves without changes, run create-podcast-audio.sh --episode <id> without --scripts (the server uses the original).
6. If the user provides edits, pass the edited JSON via --scripts.
The agent MUST NOT proceed to Stage 2 automatically. This is a hard constraint, not a suggestion.
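The two review files can be assembled with jq. A sketch under the assumption that the completed Stage 1 response carries episodeId, title, and the scripts array under a .data object; verify these field paths against the actual script output before relying on them:

```shell
# Write the human-readable draft and the machine-readable scripts file
# from a captured Stage 1 status response. Field paths are assumptions.
write_review_files() {
  resp=$1   # JSON file holding the completed Stage 1 response
  out=$2    # output directory (normally ~/Downloads)
  id=$(jq -r '.data.episodeId' "$resp")

  # Machine-readable source of truth for Stage 2 (--scripts input):
  jq '{scripts: .data.scripts}' "$resp" > "$out/podcast-scripts-$id.json"

  # Human-readable draft for the user to review:
  {
    jq -r '"# " + .data.title' "$resp"
    echo
    jq -r '.data.scripts[] | "\(.speakerId): \(.content)"' "$resp"
  } > "$out/podcast-draft-$id.md"
}
```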
Stage 2: Generate audio from reviewed/approved text.
# User approved without changes:
$SCRIPTS/create-podcast-audio.sh --episode "<episode-id>"
# User provided edits:
$SCRIPTS/create-podcast-audio.sh --episode "<episode-id>" --scripts modified-scripts.json
$SCRIPTS/create-speech.sh --scripts scripts.json
echo '{"scripts":[{"content":"Hello","speakerId":"cozy-man-english"}]}' | $SCRIPTS/create-speech.sh --scripts -
# scripts.json format:
# {
# "scripts": [
# {"content": "Script content here", "speakerId": "speaker-id"},
# ...
# ]
# }
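One safe way to produce that file is to let jq handle the quoting rather than hand-writing JSON. A sketch; the speaker and text are example values:

```shell
# Build a scripts.json payload; jq escapes quotes, newlines, etc.
jq -n --arg content "Hello, and welcome." --arg speaker "cozy-man-english" \
  '{scripts: [{content: $content, speakerId: $speaker}]}' > scripts.json
```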
$SCRIPTS/get-speakers.sh --language zh
$SCRIPTS/get-speakers.sh --language en
Guidance:
Call get-speakers.sh to fetch the available list. Use the first speakerId matching the requested language as the default voice.
Response structure (for AI parsing):
{
"code": 0,
"data": {
"items": [
{
"name": "Yuanye",
"speakerId": "cozy-man-english",
"gender": "male",
"language": "zh"
}
]
}
}
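Given that response shape, picking the default voice is a one-line jq filter. A sketch assuming jq is installed; the helper name is illustrative:

```shell
# Read a get-speakers.sh response on stdin and print the first speakerId
# whose language matches the argument (empty output if none match).
first_speaker_for_language() {
  jq -r --arg lang "$1" \
    '[.data.items[] | select(.language == $lang)][0].speakerId // empty'
}
```

Usage: `$SCRIPTS/get-speakers.sh --language en | first_speaker_for_language en`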
Usage: When user requests specific voice characteristics (gender, style), call this script first to discover available speakerId values. NEVER hardcode or assume speakerIds.
$SCRIPTS/create-explainer.sh --content "Introduce ListenHub" --language en --mode info --speakers cozy-man-english
$SCRIPTS/generate-video.sh --episode "<episode-id>"
$SCRIPTS/create-tts.sh --type text --content "Welcome to ListenHub" --language en --mode smart --speakers cozy-man-english
$SCRIPTS/generate-image.sh --prompt "sunset over mountains" --size 2K --ratio 16:9
$SCRIPTS/generate-image.sh --prompt "style reference" --reference-images "https://example.com/ref1.jpg,https://example.com/ref2.png"
Supported sizes: 1K | 2K | 4K (default: 2K).
Supported aspect ratios: 16:9 | 1:1 | 9:16 | 2:3 | 3:2 | 3:4 | 4:3 | 21:9 (default: 16:9).
Reference images: comma-separated URLs, maximum 14.
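Those constraints can be validated before invoking the script. A sketch; the helper name is illustrative and the extension list matches the direct-image-URL rule above:

```shell
# Check a comma-separated --reference-images value: every entry must be a
# direct image URL, and there may be at most 14 of them.
valid_reference_images() {
  n=0
  old_ifs=$IFS
  IFS=,
  for url in $1; do
    n=$((n + 1))
    case "$url" in
      *.jpg|*.png|*.webp|*.gif) ;;   # allowed direct-image extensions
      *) IFS=$old_ifs; return 1 ;;   # not a direct image URL
    esac
  done
  IFS=$old_ifs
  [ "$n" -le 14 ]                    # documented maximum
}
```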
# Single-shot query
$SCRIPTS/check-status.sh --episode "<episode-id>" --type podcast
# Wait mode (recommended for automated polling)
$SCRIPTS/check-status.sh --episode "<episode-id>" --type podcast --wait
$SCRIPTS/check-status.sh --episode "<episode-id>" --type flow-speech --wait --timeout 60
$SCRIPTS/check-status.sh --episode "<episode-id>" --type explainer --wait --timeout 600
tts is accepted as an alias for flow-speech.
--wait mode handles polling internally with configurable limits.
Agents SHOULD use --wait instead of manual polling loops. On exit code 2, wait briefly and retry the command.
| Option | Default | Description |
|---|---|---|
| --wait | off | Enable polling mode |
| --max-polls | 30 | Maximum poll attempts |
| --timeout | 300 | Maximum total wait (seconds) |
| --interval | 10 | Base poll interval (seconds) |
Exit codes: 0 = completed, 1 = failed, 2 = timeout or rate-limited (still pending, safe to retry after a short wait).
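The recommended retry behavior around --wait can be sketched as follows. The RETRY_SLEEP knob and the five-attempt cap are illustrative choices, and $SCRIPTS is the resolved scripts path described above:

```shell
# Wait for an episode, retrying on exit code 2 (timeout / rate-limited).
wait_for_episode() {
  attempts=0
  while [ "$attempts" -lt 5 ]; do
    rc=0
    "$SCRIPTS/check-status.sh" --episode "$1" --type "$2" --wait || rc=$?
    case $rc in
      0) return 0 ;;                      # completed
      2) attempts=$((attempts + 1))       # still pending: brief pause, retry
         sleep "${RETRY_SLEEP:-15}" ;;
      *) return "$rc" ;;                  # hard failure
    esac
  done
  return 2                                # still pending after all retries
}
```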
Automatic Language Detection: Adapt output language based on user input and context.
Detection Rules:
Application:
Example:
User (Chinese): "生成一个关于 AI 的播客"
AI (Chinese): "→ 收到!准备双人播客..."
User (English): "Make a podcast about AI"
AI (English): "→ Got it! Preparing two-person podcast..."
Principle: Language is interface, not barrier. Adapt seamlessly to user's natural expression.
You are a dispatcher, not an implementer.
Your job is to: detect the mode, invoke the right script, and relay progress and results.
Your job is NOT to: construct API calls, guess endpoints or parameters, or re-implement what the scripts already do.
ListenHub modes (passthrough):
Forward the user's topic or content to the scripts as-is; when voices matter, call get-speakers.sh first to list options.
Labnana mode (passthrough by default):
Default behavior: transparent forwarding. Pass the user's prompt directly to the script without modification.
When to offer optimization:
In this case, ask whether the user would like help enriching the prompt. Do not optimize without confirmation.
When to never modify:
If the user agrees to optimization, the following techniques are available as reference:
Style: "cyberpunk" → add "neon lights, futuristic, dystopian"; "ink painting" → add "Chinese ink painting, traditional art style"
Scene: time of day, lighting conditions, weather
Quality: "highly detailed", "8K quality", "cinematic composition"
Rules when optimizing:
→ Generation submitted, about 2-3 minutes
You can: • Wait and ask "done yet?" • Check listenhub.ai/app/library </response> </example>
→ Generation submitted, explainer videos take 3-5 minutes
Includes: Script + narration + AI visuals </response> </example>
→ TTS submitted, about 1-2 minutes
Wait a moment, or ask "done yet?" to check </response> </example>
Prompt: Cyberpunk city at night, neon lights reflecting on wet streets, towering skyscrapers with holographic ads, flying vehicles, cinematic composition, highly detailed, 8K quality
Resolution: 2K (16:9)
✓ Image generated! ~/Downloads/labnana-20260121-143145.jpg </response> </example>
Prompt: a futuristic car
Reference images: 1
Reference image URL: https://example.com/style-ref.jpg
Resolution: 2K (16:9)
✓ Image generated! ~/Downloads/labnana-20260122-154230.jpg </response> </example>
"AI Revolution: From GPT to AGI"
Listen: https://listenhub.ai/app/podcast
Duration: ~8 minutes
Need to download? Just say so. </response> </example>