Purpose: Evaluate completed podcast episodes across 10 quality dimensions to identify strengths and weaknesses. This is a diagnostic tool (not a before/after comparison) applied to every episode to understand its unique profile.
When to use: After any episode is complete (audio generated and published). Can be run on baseline episodes, improved episodes, or experimental formats.
Output: Detailed quality scorecard saved to podcast/episodes/EPISODE_PATH/logs/quality_scorecard.md
Workflow
Phase 1: Gather Episode Materials
Read the following files from the episode directory:
Required:
content_plan.md - Episode structure and NotebookLM guidance
report.md - Research synthesis
EPISODE_SLUG_chapters.json - Chapter structure
Audio transcript (one of):
EPISODE_SLUG_transcript.json (Whisper output, extract text field)
transcript.txt (plain text)
Optional but valuable:
research/p3-briefing.md - Research organization
sources.md - Source validation
logs/metadata.md - Publishing metadata (if exists)
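Loading the transcript can be sketched in Python. This assumes the Whisper JSON carries a top-level `text` field (standard openai-whisper output) and falls back to the plain-text transcript when the JSON is absent:

```python
import json
from pathlib import Path

def load_transcript(episode_dir: str, slug: str) -> str:
    """Load the episode transcript, preferring Whisper JSON.

    Assumes the Whisper output has a top-level "text" field
    (standard openai-whisper format); falls back to transcript.txt.
    """
    ep = Path(episode_dir)
    whisper_path = ep / f"{slug}_transcript.json"
    if whisper_path.exists():
        data = json.loads(whisper_path.read_text(encoding="utf-8"))
        return data["text"]
    return (ep / "transcript.txt").read_text(encoding="utf-8")
```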
Audio duration:
Extract from chapters JSON (last chapter startTime + estimated final chapter duration ~120s)
OR use `ffmpeg -i EPISODE.mp3 2>&1 | grep Duration`
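The chapters-based estimate above can be sketched as follows, assuming a hypothetical schema in which each chapter object carries a numeric `startTime` in seconds (adjust the field name to match your chapters file):

```python
import json

FINAL_CHAPTER_ESTIMATE_S = 120  # assumed length of the final chapter (~120s)

def estimate_duration(chapters_path: str) -> float:
    """Estimate total episode length from the chapters JSON.

    Assumes a list of chapter objects, each with a numeric
    "startTime" in seconds (hypothetical schema).
    """
    with open(chapters_path, encoding="utf-8") as f:
        chapters = json.load(f)
    last_start = max(ch["startTime"] for ch in chapters)
    return last_start + FINAL_CHAPTER_ESTIMATE_S
```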
Phase 2: Evaluate 10 Dimensions
For each dimension, provide:
Score (1-5) using the rating scale
Evidence from episode materials (quotes, examples, observations)
Why this score (not higher/lower)
Specific observations unique to this episode
Dimension 1: Structural Clarity (1-5)
What we're measuring: Can a listener follow the episode's structure and know where they are at any moment?
Rating Scale:
5 - Crystal Clear: Structure stated upfront, clear signposting at transitions, easy to summarize arc in one sentence
4 - Well Structured: Most transitions are clear, structure is followable, minor gaps
3 - Adequate: Structure exists but requires listener effort to discern, some unclear transitions
2 - Meandering: Structure is hard to follow, transitions feel random, listener may get lost
1 - Chaotic: No discernible structure, topics jump without warning
Evidence to gather:
Does opening preview structure?
Count signposting phrases: "we just covered X, now we're moving to Y"
Can you write one-sentence arc summary?
Compare actual episode flow (chapters) to planned structure (content_plan.md)
Document:
One-sentence arc summary
Examples of signposting (or lack thereof)
Structural preview (if present)
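Counting signposting phrases can be partially automated with a simple pattern scan. The patterns below are illustrative placeholders, not an exhaustive list; extend them to match your hosts' actual phrasing:

```python
import re

# Hypothetical signposting patterns -- extend for your show's style.
SIGNPOST_PATTERNS = [
    r"we just covered",
    r"now we're moving to",
    r"let's turn to",
    r"coming up next",
    r"to recap",
]

def count_signposts(transcript: str) -> dict:
    """Count occurrences of each signposting phrase (case-insensitive)."""
    return {
        pat: len(re.findall(pat, transcript, flags=re.IGNORECASE))
        for pat in SIGNPOST_PATTERNS
    }
```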
Dimension 2: Depth Distribution (1-5)
What we're measuring: Do all major themes get proportional depth, or do some feel rushed/underdeveloped?
Rating Scale:
5 - Perfectly Balanced: All major themes get depth proportional to importance, no theme feels rushed or over-explored
4 - Well Balanced: Minor depth variations, but all themes adequately covered
3 - Uneven: One theme clearly gets more depth than equally important themes
2 - Imbalanced: Important theme feels like an add-on or afterthought, significant depth disparity
1 - Severely Skewed: Major theme mentioned briefly while minor themes dominate
Evidence to gather:
List all major themes from content_plan.md
Calculate time allocation per theme (from chapters)
Identify themes that got <15% of time when they deserved more
Note if depth differences are intentional (primary vs. secondary) or accidental
Document:
Theme analysis table with time allocations and percentages
Critical imbalances identified
Comparison to content plan intentions
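The per-theme time allocation can be sketched as below, using simplified `(theme, startTime)` tuples in playback order as a stand-in for the real chapters JSON:

```python
from collections import defaultdict

def theme_allocation(chapters, total_duration_s):
    """Aggregate seconds and share of runtime per theme.

    `chapters` is a list of (theme, startTime_s) tuples in playback
    order -- a simplified stand-in for the actual chapters file.
    """
    totals = defaultdict(float)
    for i, (theme, start) in enumerate(chapters):
        # A chapter runs until the next chapter starts (or the episode ends).
        end = chapters[i + 1][1] if i + 1 < len(chapters) else total_duration_s
        totals[theme] += end - start
    return {t: (secs, secs / total_duration_s) for t, secs in totals.items()}
```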
Dimension 3: Mode-Switching Clarity (1-5)
What we're measuring: Are transitions between modes (philosophy, research, storytelling, practical, landing) intentional and smooth?
Rating Scale:
5 - Masterful: Modes are clearly defined, transitions feel purposeful, each mode serves its function
Header (episode title, date, evaluator, format, duration)
Scores table (summary)
Full 10-dimension evaluation (each dimension gets its own section with rating scale, evidence, assessment)
Summary (strengths, weaknesses, areas for improvement)
Workflow improvements (specific tasks from improvement plan)
Notes & observations (free-form insights, what worked, what needs work, ideas for next episode)
Quality Standards
Evidence-Based Evaluation
Quote from transcript to support claims about dialogue, signposting, etc.
Reference specific chapters when discussing structure or depth
Compare to content plan to assess execution vs. intention
Avoid vague assessments ("felt rushed") without evidence ("AI section: 90 seconds of a 32-minute episode, ~4.7% of total time")
Actionable Feedback
Not: "Dialogue needs improvement"
Instead: "Zero counterpoint moments. Founder Mode debate (Ch 9) presented a perfect opportunity: one speaker could have defended delegation while the other defended hands-on involvement. Instead, both agreed throughout."
Not: "Packaging could be better"
Instead: "Missing 'What You'll Learn' bullets. Current description doesn't entice. Add: 'Why the famous 70% rule has zero research backing' + 4 more bullets highlighting key frameworks."
Respectful Tone
This is diagnostic feedback for improvement, not criticism. Focus on:
Opportunities (not "failures")
Specific improvements (not vague "be better")
Strengths to leverage (not just weaknesses)
Example Usage
Invocation via Task Tool
Use the Task tool with subagent_type='general-purpose':
"Run the podcast-quality-scorecard skill on the episode at podcast/episodes/algorithms-for-life/ep3-how-to-delegate/.
Follow the workflow in .claude/skills/podcast-quality-scorecard/SKILL.md to:
1. Gather episode materials (content_plan.md, report.md, transcript, chapters)
2. Evaluate all 10 dimensions with evidence-based scoring
3. Generate summary with strengths, weaknesses, and workflow improvement recommendations
4. Write comprehensive scorecard to logs/quality_scorecard.md
Episode title: Algorithms for Life: Ep. 3, How to Delegate
Format: Standard workflow (baseline evaluation)"