Process raw Co-Design Program transcripts from the Google Drive Staging_Transcripts folder into Golden Tier format. Use when asked to clean transcripts, promote staging files, or update the Golden Tier collection.
Process raw Co-Design Program interview transcripts from Staging_Transcripts on Google Drive into the curated Golden Tier collection.
- gws CLI authenticated as [email protected] with Drive scope
- python3 available
- pandoc available (for .docx conversion)
- Golden Tier directory: ~/Documents/bella_assist/gws_copy/shared_drives/BellaAssist_-_Product/Co-Design_Program/Golden_Tier/
- Staging_Transcripts folder ID: 1W169BsPkexomhf6Qeyz-6dFOkTn-Z2UW
- references/golden-tier-spec.md: format spec (frontmatter schema, speaker format, naming convention, cleaning checklist)
- references/known-speakers.md: speaker name-to-role mapping, Gemini display names, common transcription errors

Read both references before starting any processing.
List the staging folder and cross-reference against the manifest.
# List staging files
gws drive files list --params '{"q": "\"1W169BsPkexomhf6Qeyz-6dFOkTn-Z2UW\" in parents and trashed=false", "includeItemsFromAllDrives": true, "supportsAllDrives": true, "fields": "files(id,name,modifiedTime,mimeType,size)", "pageSize": 100}'
For each file:
- Check manifest.yaml for an existing entry matching that person + stage

Present the inventory table to the user for approval before proceeding.
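The cross-reference step could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual logic: each staging entry is assumed to have been annotated already with hypothetical `participant` and `stage` keys (derived from the filename), and the manifest is assumed to be loaded as a list of dicts carrying the same two keys. The real manifest.yaml schema lives in references/golden-tier-spec.md and may differ.

```python
def build_inventory(staging_files, manifest_entries):
    """Label each staging file 'new' or 'in manifest' by (participant, stage)."""
    # 'participant' and 'stage' are hypothetical keys, not the confirmed schema.
    known = {
        (entry["participant"].strip().lower(), int(entry["stage"]))
        for entry in manifest_entries
    }
    rows = []
    for f in sorted(staging_files, key=lambda item: item["name"]):
        key = (f["participant"].strip().lower(), int(f["stage"]))
        rows.append({
            "name": f["name"],
            "participant": f["participant"],
            "stage": int(f["stage"]),
            "status": "in manifest" if key in known else "new",
        })
    return rows


def format_inventory(rows):
    """Render the inventory as a Markdown table for user approval."""
    lines = ["| File | Participant | Stage | Status |", "|---|---|---|---|"]
    for r in rows:
        lines.append(
            f"| {r['name']} | {r['participant']} | {r['stage']} | {r['status']} |"
        )
    return "\n".join(lines)
```

Matching is case-insensitive on the participant name so that small spelling differences in capitalisation do not produce duplicate entries; anything fuzzier than that is better left to manual review.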
For each approved file:
# Download .docx files
gws drive files get --params '{"fileId": "<ID>", "alt": "media", "supportsAllDrives": true}' --output "/tmp/staging/<filename>"
# Convert .docx to .md
pandoc -f docx -t markdown --wrap=none "/tmp/staging/<filename>.docx" -o "/tmp/staging/<filename>.md"
For files already in .md format, download directly — no conversion needed.
Identify the source type and extract the transcript content.
Gemini Notes exports (most common in staging):
- The .docx contains both AI summary notes AND an embedded transcript
- The embedded transcript starts at the Transcript heading (often preceded by a clipboard emoji)
- Transcript → _transcript.md, notes → _notes.md

Read AI exports (.txt files):
- TIMESTAMP - Speaker Name followed by dialogue on the next lines

Pre-cleaned files (already .md with some processing done):
Run the deterministic cleanup script first:
python3 scripts/clean_transcript.py "/tmp/staging/<file>.md" -o "/tmp/staging/<file>_cleaned.md"
This handles: pandoc artifacts, Unicode normalisation, empty spacer lines, Gemini disclaimer removal, timestamp anchor cleanup.
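To make the cleanup passes concrete, here is a minimal sketch of what such deterministic cleaning might look like. It is not the contents of scripts/clean_transcript.py: the function name `clean_text`, the choice of NFC normalisation, and the rule of dropping any line that contains a noise phrase are all assumptions. The artifact and noise strings are the ones named in the verification table.

```python
import re
import unicodedata

# Phrases taken from the verification table; the real script may match more.
GEMINI_NOISE = ("You should review", "Suggested next steps")


def clean_text(text):
    """Apply deterministic cleanup passes to pandoc-converted transcript text."""
    # Unicode normalisation (NFC is an assumption; another form may be used).
    text = unicodedata.normalize("NFC", text)
    # Pandoc span artifact: "[label]{.underline}" becomes "label".
    text = re.sub(r"\[([^\]]*)\]\{\.underline\}", r"\1", text)
    cleaned = []
    for line in text.splitlines():
        line = line.rstrip()
        # A trailing backslash is pandoc's hard line break marker.
        if line.endswith("\\"):
            line = line[:-1].rstrip()
        # Drop Gemini disclaimer and boilerplate lines.
        if any(phrase in line for phrase in GEMINI_NOISE):
            continue
        cleaned.append(line)
    # Collapse runs of empty spacer lines into a single blank line.
    result = re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned))
    return result.strip() + "\n"
```

Because every pass is a pure text transformation, running the function twice on the same input should give the same result, which keeps this step safe to repeat.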
Then apply AI judgment (Claude does this directly):
- Use references/known-speakers.md to map display names to canonical names and roles
- Resolve speaker display names (e.g. Bails18 Wills → keep as-is for Gemini format; resolve Unidentified Speaker where possible from context)
- Read AI transcripts: use the **Speaker Name** [Role] (H:MM:SS): format
- Gemini transcripts: use the Speaker Name: text format with standalone timestamp blocks

Generate the YAML frontmatter:
python3 scripts/generate_frontmatter.py \
--participant "Name" \
--role "Support Coordinator" \
--stage 4 \
--session-type "MVP Testing" \
--date "2026-04-09" \
--source "Gemini embedded transcript" \
--source-file "original.docx" \
--content-type transcript \
--has-companion-notes true \
"/tmp/staging/<file>_cleaned.md"
This prepends the frontmatter and document header, calculates word count, and writes the final file.
Place the output file:
- Transcript: Golden_Tier/{SC_or_Participants}/{name}/{session_id}_{date}_transcript.md
- Notes: Golden_Tier/{SC_or_Participants}/{name}/{session_id}_{date}_notes.md

# Lock processed files
chmod 444 "Golden_Tier/{path}_transcript.md"
chmod 444 "Golden_Tier/{path}_notes.md"
# Update manifest
python3 scripts/update_manifest.py \
--manifest "Golden_Tier/manifest.yaml" \
--transcript "Golden_Tier/{path}_transcript.md" \
--notes "Golden_Tier/{path}_notes.md"
Also update Golden_Tier/README.md:
IMPORTANT: Every file write to the Golden Tier is a curated corpus modification. Present the diff for user review.
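One way to produce the diff for review is Python's standard `difflib` module. The sketch below is an illustration only; the helper name `render_diff` is hypothetical and the workflow does not prescribe a particular diff tool.

```python
import difflib


def render_diff(before, after, path):
    """Return a unified diff of a proposed change, for the user to review."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"{path} (current)",
        tofile=f"{path} (proposed)",
    )
    return "".join(diff)
```

Calling `render_diff(old_text, new_text, "Golden_Tier/manifest.yaml")` returns an empty string when nothing changed, which doubles as a cheap check that a write is actually needed.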
Re-read each produced Golden Tier file and validate:
| Check | Method | Pass criteria |
|---|---|---|
| YAML frontmatter | Parse YAML block | All required fields present and correctly typed |
| Word count | Count body words, compare to frontmatter | Within +/- 5 of word_count field |
| Speaker format (Read AI) | Regex: ^\*\*[^*]+\*\* \[[^\]]+\] \(\d+:\d{2}:\d{2}\): | Every speaker turn matches |
| Speaker format (Gemini) | Regex: ^[A-Z][^:]+: after a timestamp block | Consistent speaker labels |
| No pandoc artifacts | Search for {.underline}, [~~, trailing \ | Zero matches |
| No Gemini noise | Search for "You should review", "Suggested next steps" | Zero matches in transcript files |
| No unmerged turns | Check for consecutive identical speaker labels | None found (or justified) |
| File permissions | stat -f %Lp or ls -la | 444 (read-only) |
| File location | Path check | Correct subfolder and naming convention |
| Manifest consistency | Parse manifest, check for duplicates, verify totals | Clean |
Present a verification summary table. Any failures get flagged for manual review.
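Several of the checks in the table can be automated. The following is a minimal sketch for a Read AI-format transcript body, assuming the frontmatter has already been stripped and its `word_count` value is passed in. The function names, the whitespace-based word count, and the rule that a speaker turn is any line starting with `**` are assumptions; the regex and search strings come from the table.

```python
import os
import re
import stat

READ_AI_TURN = re.compile(r"^\*\*[^*]+\*\* \[[^\]]+\] \(\d+:\d{2}:\d{2}\):")
PANDOC_ARTIFACTS = ("{.underline}", "[~~")
GEMINI_NOISE = ("You should review", "Suggested next steps")


def check_body(body, declared_word_count, tolerance=5):
    """Return a dict of check name -> bool for a Read AI-format body."""
    lines = body.splitlines()
    # Assumption: every speaker turn starts with a bold speaker name.
    turns = [ln for ln in lines if ln.startswith("**")]
    # The speaker label is the bold part of each well-formed turn line.
    labels = [ln.split("**")[1] for ln in turns if READ_AI_TURN.match(ln)]
    has_artifact = any(a in body for a in PANDOC_ARTIFACTS)
    has_trailing_backslash = any(ln.endswith("\\") for ln in lines)
    return {
        "word_count": abs(len(body.split()) - declared_word_count) <= tolerance,
        "speaker_format": bool(turns) and all(READ_AI_TURN.match(ln) for ln in turns),
        "no_pandoc_artifacts": not (has_artifact or has_trailing_backslash),
        "no_gemini_noise": not any(n in body for n in GEMINI_NOISE),
        "no_unmerged_turns": all(a != b for a, b in zip(labels, labels[1:])),
    }


def is_read_only(path):
    """True when the file's permission bits are exactly 444."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o444
```

Each failing key maps directly to a row of the verification summary, so the dict can be turned into the table without further interpretation. A consecutive-label hit may still be legitimate, so it is better reported for manual review than treated as a hard failure.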