Turn raw footage into fully captioned, audience-ready video in minutes. This auto-subtitle-generator-online skill transcribes spoken audio, syncs captions to each word, and formats subtitles for any platform — YouTube, TikTok, Instagram Reels, or broadcast. Supports multiple languages, custom styling, and SRT/VTT export. Perfect for content creators, educators, marketers, and accessibility advocates who need fast, accurate subtitles without manual typing.
Paste your video URL or transcript and I'll generate synced, formatted subtitles ready for any platform. No video? Just describe the content and tell me what subtitle style you need.
This skill connects to a cloud processing backend. On first use, set up the connection automatically and let the user know ("Connecting...").
Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:
1. Token: request https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with an X-Client-Id header.
2. Extract data.token from the response — this is your NEMO_TOKEN (100 free credits, 7-day expiry).
3. Session: POST https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.
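The bootstrap steps above can be sketched as offline request builders (no network calls are made; the function names are hypothetical, and the HTTP method for the token endpoint is an assumption, since the doc does not state it):

```python
import json
import uuid

API_BASE = "https://mega-api-prod.nemovideo.ai"

def build_token_request(client_id: str) -> dict:
    """Request descriptor for the anonymous-token endpoint."""
    return {
        "method": "POST",  # assumption: the doc does not state the method
        "url": f"{API_BASE}/api/auth/anonymous-token",
        "headers": {"X-Client-Id": client_id},
    }

def build_session_request(token: str) -> dict:
    """Request descriptor for session creation, per the setup steps above."""
    return {
        "method": "POST",
        "url": f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        "headers": {"Authorization": f"Bearer {token}"},
        "body": json.dumps({"task_name": "project"}),
    }

# A fresh client id can be any stable unique string, e.g. a UUID.
client_id = str(uuid.uuid4())
```

Issue these in order: token request first, then the session request with the extracted NEMO_TOKEN, keeping the returned session_id for all later calls.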
Most videos lose half their potential audience simply because they have no subtitles. Viewers watching on mute, people with hearing impairments, or non-native speakers all depend on captions to engage with your content. This skill eliminates the tedious work of manually transcribing and timestamping every line.
Using this auto subtitle generator online, you can paste a video URL, upload a transcript, or describe your footage and get back properly formatted subtitles — complete with timing cues that match the natural rhythm of speech. Whether you need clean single-line captions for TikTok or multi-line blocks for a documentary, the output adapts to your format requirements.
This tool is built for real production workflows. You can request subtitles in a specific language, ask for burned-in caption styling suggestions, or generate SRT and VTT files ready to upload directly to your video platform. No audio engineering knowledge required — just bring your content and walk away with captions that work.
When you submit a video, the transcription request is parsed for intent — detecting language, speaker count, and subtitle format preference — then dispatched to the optimal processing pipeline for frame-accurate caption alignment.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
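The routing table above can be sketched as a keyword dispatcher (the keyword sets and section names mirror the table; case-insensitive substring matching is an assumption):

```python
# (keywords, target section, skip_sse) — mirrors the routing table above
ROUTES = [
    ({"export", "导出", "download", "send me the video"}, "§3.5 Export", True),
    ({"credits", "积分", "balance", "余额"}, "§3.3 Credits", True),
    ({"status", "状态", "show tracks"}, "§3.4 State", True),
    ({"upload", "上传"}, "§3.2 Upload", True),
]

def route(message: str) -> tuple[str, bool]:
    """Return (target section, skip_sse) for a user message."""
    text = message.lower()
    for keywords, section, skip_sse in ROUTES:
        if any(k in text for k in keywords):
            return section, skip_sse
    # Everything else (generate, edit, add BGM, ...) goes through SSE.
    return "§3.1 SSE", False
```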
The auto subtitle generator online backend leverages a distributed speech-to-text engine that processes audio streams in parallel chunks, synchronizing phoneme timestamps with video frame data to produce SRT, VTT, or ASS subtitle files. All media is handled over encrypted cloud nodes, ensuring low-latency turnaround even for long-form content.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: auto-subtitle-generator-online
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid> — multipart file upload: -F "files=@/path", or by URL with JSON body {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple — returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
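The export step above can be sketched as a payload builder plus a poll loop (offline sketch: the status fetcher is injected so no real request is made; function names are hypothetical):

```python
import time

def build_export_payload(session_id: str, draft: dict, ts: int) -> dict:
    """Body for POST /api/render/proxy/lambda, per the export step above."""
    return {
        "id": f"render_{ts}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }

def poll_export(fetch_status, render_id: str, interval_s: int = 30, max_polls: int = 30) -> str:
    """Poll GET /api/render/proxy/lambda/<id> until status == completed.

    fetch_status(render_id) stands in for the real GET call.
    """
    for _ in range(max_polls):
        status = fetch_status(render_id)
        if status.get("status") == "completed":
            return status["output"]["url"]  # download URL at output.url
        time.sleep(interval_s)
    raise TimeoutError(f"export {render_id} did not complete")
```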
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| Heartbeat / empty data | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
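That fallback can be sketched as follows (the state fetcher and summarizer are injected placeholders, since the real calls depend on the session):

```python
def handle_sse_result(text_response: str, get_state, summarize) -> str:
    """Present SSE text if any; otherwise verify via session state.

    get_state stands in for GET /api/state/nemo_agent/me/<sid>/latest;
    summarize turns the draft into a user-facing change summary.
    """
    if text_response:
        return text_response
    # No text in the stream: confirm the edit landed, then report it.
    state = get_state()
    return summarize(state["data"]["state"]["draft"])
```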
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
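A sketch of expanding that compact draft schema into a readable track summary (the summary wording is hypothetical; the field names follow the mapping above):

```python
# tt codes from the mapping above: 0=video, 1=audio, 7=text
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}

def summarize_draft(draft: dict) -> list:
    """Expand compact draft fields (t=tracks, tt=type, sg=segments, d=ms)."""
    lines = []
    for track in draft.get("t", []):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        longest = max((seg.get("d", 0) for seg in segments), default=0)
        lines.append(f"{kind}: {len(segments)} segment(s), longest {longest} ms")
    return lines
```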
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up credits in your account" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
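The error table above can be condensed into a dispatch map (the action labels are hypothetical shorthand for the table rows; the retryable flags are a judgment call based on each row's action):

```python
# code -> (action, retryable) — shorthand for the error table above
ERROR_ACTIONS = {
    0: ("continue", False),
    1001: ("re-auth via anonymous-token", True),
    1002: ("create new session (§3.0)", True),
    2001: ("prompt for registration or top-up", False),
    4001: ("show supported formats", False),
    4002: ("suggest compress/trim", False),
    400: ("generate Client-Id and retry", True),
    402: ("prompt to register/upgrade plan", False),
    429: ("retry once after 30s", True),
}

def handle_error(code: int) -> tuple:
    """Return (action, retryable) for a backend error code."""
    return ERROR_ACTIONS.get(code, ("report unknown error", False))
```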
If your generated subtitles appear out of sync with the audio, the most common cause is an inaccurate start-time reference. When submitting a transcript, always include the video's total duration and note whether the speech starts immediately at 00:00 or after an intro segment — this helps the subtitle generator align timing correctly from the first word.
For videos with heavy background music, strong accents, or overlapping speakers, providing a rough manual transcript alongside the video URL will dramatically improve caption accuracy. You can paste even a partial script and ask the skill to fill in the gaps based on context.
If the subtitle file format isn't being accepted by your video platform, double-check the encoding requirement — YouTube prefers UTF-8 encoded SRT files, while some broadcast tools require specific VTT header formatting. Just ask for the output in the exact format your platform specifies and the file will be generated accordingly.
Content creators uploading to YouTube or TikTok use this auto subtitle generator online to boost watch time — videos with captions consistently outperform uncaptioned ones because they hold attention even when audio is off. A single upload can be captioned and repurposed across multiple platforms in different formats within minutes.
Educators and e-learning developers rely on accurate subtitle generation to meet accessibility standards like WCAG and ADA compliance. Instead of manually syncing captions inside a video editor, they generate a clean SRT or VTT file and attach it directly to their LMS video player.
Marketers running paid video ads use this tool to quickly test captioned vs. uncaptioned versions without sending files to a production agency. Podcast producers also use it to generate word-for-word transcripts formatted as readable subtitles for video podcast uploads on Spotify and YouTube simultaneously.