Generate or refine high-quality transcription subtitles from audio or video with ElevenLabs STT, word-level timestamps, token-range editing, ASR error correction, terminology consistency, optional user-provided glossaries, and SRT/JSON round-tripping. Use when the user needs audio/video-to-subtitle conversion, high-quality transcription subtitles, sensible semantic segmentation, accurate timing, wrong-word correction, proper-noun unification, glossary-driven review, punctuation or casing cleanup, or an agent-editable transcript that must render back to SRT without losing token coverage.
Use this skill as a quality-first subtitle workflow, not a raw ASR dump.
Workflow:
- Generate the editable draft as <stem>.review.json, and preserve raw artifacts.
- Review qa_flags, fix segmentation/timing/text issues inside the editable JSON, and save <stem>.corrected.json.
- Follow the pipeline raw json -> review json -> corrected json -> srt, and keep the raw JSON cache for later iterations.
- Save the reviewed file as <stem>.corrected.json, then render SRT.
- Name the draft <stem>.review.json; this keeps the generated review draft, the reviewed output, and the raw cache aligned.

Editing rules:
- Check subtitles[].qa_flags, review.normalization_diagnostics, glossary.entries, glossary.candidates, and glossary.collected before exporting.
- Treat each qa_flag as a must-fix item, not a suggestion.
- Keep every non-spacing token covered exactly once.
- Edit subtitles[].text for obvious ASR correction, punctuation cleanup, terminology unification, and line-break polish within the same token span.
- Promote confirmed terms into glossary.collected; keep uncertain terms in glossary.candidates.

CLI options:
- Pass --glossary when the user provides a term list or when terminology consistency matters.
- Use --from-raw-json when rerunning the same media with different segmentation or glossary settings.
- Tune --max-chars and --max-duration per language or density when the defaults do not fit the material.

Naming and invariants:
- Avoid generic names like transcript.json; reviewed outputs must end with .corrected.json.
- Never edit tokens[].id, tokens[].start, tokens[].end, tokens[].type, or tokens[].speaker_id.
- Do not hand-edit subtitles[].start, subtitles[].end, word_*, or speaker_ids; they are derived preview fields.
- Never treat glossary.candidates as locked truth before they are promoted into glossary.collected.

Environment:
- Ensure ffmpeg is available.
- Run cd skills/transcribe2sub && pnpm install before doing anything else; the install may need elevated permissions in the skill directory.
- Set ELEVENLABS_API_KEY; the script can fall back to unauthenticated mode when needed and now auto-enables diarization for that path.

References:
- Read references/subtitle-quality.md before regrouping or polishing subtitles.
- Read references/glossary-format.md before creating or loading a user glossary.
- Read references/elevenlabs-stt-api.md only when API field details matter.

Generate editable JSON instead of direct SRT.
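The field names above imply a particular shape for the editable review JSON. A minimal TypeScript sketch of that shape follows; the grouping and exact types are assumptions inferred from the field names in this document, and the script's real output is authoritative.

```typescript
// Sketch of the editable review JSON, assembled from the field names this
// skill references. Comments mark which fields are editable vs. locked.
interface Token {
  id: number;                    // never edit
  start: number;                 // seconds; never edit
  end: number;                   // seconds; never edit
  type: "word" | "spacing";      // never edit
  speaker_id?: string;           // never edit
}

interface Subtitle {
  token_start: number;           // editable: first token of the span
  token_end: number;             // editable: last token of the span
  text: string;                  // editable: corrected display text
  start?: number;                // derived preview field; do not hand-edit
  end?: number;                  // derived preview field; do not hand-edit
  qa_flags?: string[];           // e.g. "zero_duration", "ends_mid_word"
}

interface ReviewFile {
  tokens: Token[];
  subtitles: Subtitle[];
  glossary: {
    entries: string[];           // locked canonical terms from the user
    candidates: string[];        // review-stage staging, not final
    collected: string[];         // confirmed terms promoted during review
  };
}

// A trivially valid instance, just to show the nesting.
const empty: ReviewFile = {
  tokens: [],
  subtitles: [],
  glossary: { entries: [], candidates: [], collected: [] },
};
```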
Ownership:
Subagent 1 owns this step. It transcribes the audio, saves the raw response as <stem>.elevenlabs.json, and outputs <stem>.review.json.
Naming convention:
- Review draft: <stem>.review.json
- Raw ElevenLabs cache: <stem>.elevenlabs.json
- Reviewed output: <stem>.corrected.json
The script saves the raw ElevenLabs response alongside the main output by default as <output_basename>.elevenlabs.json.
CJK baseline:
cd skills/transcribe2sub
pnpm tsx scripts/transcribe2sub.ts <audio> --format json --max-chars 22 --max-duration 8.0 -o episode.review.json
Spaced-language baseline:
cd skills/transcribe2sub
pnpm tsx scripts/transcribe2sub.ts <audio> --format json --max-chars 38 --max-duration 8.0 -o episode.review.json
If the user provides a term list, pass it in at generation time:
cd skills/transcribe2sub
pnpm tsx scripts/transcribe2sub.ts <audio> --format json --glossary glossary.txt -o episode.review.json
Later, rebuild from the saved raw JSON without calling the API again:
cd skills/transcribe2sub
pnpm tsx scripts/transcribe2sub.ts --from-raw-json episode.elevenlabs.json --format json --glossary glossary.txt -o episode.review.json
Review subtitles[] against the quality rubric.
Save the reviewed file as <stem>.corrected.json, for example episode.corrected.json.
Ownership:
Subagent 2 owns this step. It reads <stem>.review.json, performs review and QA, and saves <stem>.corrected.json.
Subagent 2 must not re-run transcription or regenerate the draft unless the user explicitly asks to restart from raw audio or raw JSON.
Edit only subtitles[].token_start, subtitles[].token_end, subtitles[].text, glossary.candidates, and glossary.collected.
During correction, let the review LLM extract candidate terms into glossary.candidates.
Use subtitles[].text to correct obvious ASR misrecognitions within the same timed span.
Use glossary.entries as locked canonical terms from the user.
Use glossary.candidates as review-stage staging data only; do not treat them as final until they are copied into glossary.collected.
Add newly discovered people names, products, brands, organizations, or domain terms to glossary.collected.
Prioritize qa_flags and review.normalization_diagnostics before spending time on fine-grained polish.
For zero_duration, timing_span_mismatch, too_short, too_long, ends_mid_word, and starts_mid_word, adjust token boundaries first; do not try to polish text around a broken span.
Even when qa_flags are sparse, actively inspect for flash cues, short text hanging too long, unnatural joins across full sentences, and cross-speaker merges.
Treat subtitles[].start, subtitles[].end, word_*, and speaker_ids as derived preview fields.
Never edit tokens[].id, tokens[].start, tokens[].end, tokens[].type, or tokens[].speaker_id.
Ensure every non-spacing token belongs to exactly one subtitle.
After all edits, do a second pass over the whole file and confirm no obvious QA issue remains before exporting.
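The coverage invariant above can be checked mechanically. A minimal sketch follows; treating token_start/token_end as inclusive token-id bounds is an assumption about the span encoding.

```typescript
// Verify that every non-spacing token belongs to exactly one subtitle span.
type Tok = { id: number; type: "word" | "spacing" };
type Sub = { token_start: number; token_end: number };

function checkCoverage(tokens: Tok[], subtitles: Sub[]): string[] {
  // Count how many subtitle spans claim each token id.
  const counts = new Map<number, number>();
  for (const sub of subtitles) {
    for (let id = sub.token_start; id <= sub.token_end; id++) {
      counts.set(id, (counts.get(id) ?? 0) + 1);
    }
  }
  const problems: string[] = [];
  for (const tok of tokens) {
    if (tok.type === "spacing") continue; // spacing tokens are exempt
    const n = counts.get(tok.id) ?? 0;
    if (n !== 1) problems.push(`token ${tok.id} covered ${n} times`);
  }
  return problems;
}
```

An empty result means the edit preserved token coverage; anything else points at the exact token that was dropped or double-assigned.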
Render the corrected JSON back to SRT.
Ownership:
cd skills/transcribe2sub
pnpm tsx scripts/transcribe2sub.ts --from-json episode.corrected.json -o final.srt
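The script performs the SRT rendering itself; the sketch below only illustrates the timestamp format the export targets (SRT uses `HH:MM:SS,mmm` with a comma before the milliseconds), in case you need to spot-check the output by eye.

```typescript
// Convert a time in seconds to an SRT timestamp string ("HH:MM:SS,mmm").
function toSrtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const frac = ms % 1000;
  const pad = (n: number, w: number) => String(n).padStart(w, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(frac, 3)}`;
}
```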
Generate a direct draft only when the user prioritizes speed over review quality.
cd skills/transcribe2sub
pnpm tsx scripts/transcribe2sub.ts <audio> -o draft.srt
When rerunning the same audio with different glossary or segmentation settings, prefer --from-raw-json over re-uploading audio.
Final checklist:
- Edit subtitles[].text to fix wrong words, punctuation, casing, line breaks, and obvious ASR formatting issues.
- Promote glossary.candidates into glossary.collected or delete them during review.
- Keep terminology consistent with glossary.entries and glossary.collected.
- Confirm no blocking qa_flags remain unaddressed and that warning-level flags were consciously reviewed.
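A small sweep over the corrected file can surface any subtitles still carrying qa_flags before export. The split between blocking and warning-level flag names below is an assumption for illustration; the document does not define an official severity mapping.

```typescript
// List subtitles that still carry flags treated here as blocking, so the
// reviewer can confirm everything remaining is a consciously accepted warning.
type FlaggedSub = { text: string; qa_flags?: string[] };

const BLOCKING_FLAGS = new Set([
  "zero_duration",
  "timing_span_mismatch",
  "ends_mid_word",
  "starts_mid_word",
]);

function remainingBlockers(subtitles: FlaggedSub[]): string[] {
  return subtitles.flatMap((sub) =>
    (sub.qa_flags ?? [])
      .filter((flag) => BLOCKING_FLAGS.has(flag))
      .map((flag) => `${flag}: ${JSON.stringify(sub.text)}`)
  );
}
```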