Extract patterns, FAQs, and SOP topic map from clustered customer support data by reading full conversation transcripts. Stage 2 of the Userchat-to-SOP pipeline. **Language:** Auto-detects Korean (한국어) or Japanese (日本語) from user input.
Extract patterns, FAQs, and define SOP topics by reading full conversation transcripts from Stage 1 clustered data.
Language: Detect the language from the user's first message and respond in that language throughout. Support Korean (한국어) and Japanese (日本語). Default to Korean if language is unclear.
Input: Stage 1 results ({prefix}_clustered.xlsx, {prefix}_tags.xlsx, {prefix}_messages.csv, analysis_report.md)
Output: patterns.json (with sop_topic_map), faq.json, keywords.json, patterns_enriched.json
Core Principle: Read actual conversation turns, not summaries. Summaries lose customer tone, agent response patterns, and escalation moments.
Parameters:
- Stage 1 input directory, e.g. results/kamoa/01_clustering
- n_samples_per_cluster = max(25, ceil(min_total_samples / K))
- Cluster selection: "all", "top_10", or a list such as "0,2,5,7"

Read the tags and analysis report, then run enrichment to get full conversation transcripts.
Actions:
1. Read {prefix}_tags.xlsx and analysis_report.md
2. Compute n_samples = max(25, ceil(500 / K))
3. Bootstrap patterns.json from the tags, then run enrichment:

mkdir -p results/{company}/02_extraction
python3 -c "
import pandas as pd, json, math
tags = pd.read_excel('results/{company}/01_clustering/{prefix}_tags.xlsx')
K = len(tags)
n_samples = max(25, math.ceil(500 / K))
print(f'K={K}, n_samples_per_cluster={n_samples}, total={K * n_samples}')
data = {'metadata': {'company': '{company}', 'bootstrap': True}, 'clusters': []}
for _, r in tags.iterrows():
    data['clusters'].append({'cluster_id': int(r['cluster_id']), 'label': r['label'], 'category': r['category'], 'cluster_size': int(r['cluster_size'])})
with open('results/{company}/02_extraction/patterns.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
"
python3 scripts/enrich_patterns.py \
--patterns results/{company}/02_extraction/patterns.json \
--messages results/{company}/01_clustering/{prefix}_messages.csv \
--output results/{company}/02_extraction/conversations_by_cluster.json \
--n-samples {n_samples}
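As a quick sanity check on the sampling formula used in the bootstrap step, note that the floor of 25 dominates once K reaches 20:

```python
import math

# n_samples = max(25, ceil(500 / K)); the 25-sample floor wins for K >= 20.
for K in (10, 20, 40):
    print(K, "->", max(25, math.ceil(500 / K)))
```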
Fallback: If enrichment fails, fall back to enhanced_text from clustered.xlsx and mark "data_source": "summary_fallback".
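The fallback can be sketched as follows. The clustered.xlsx column names (cluster_id, enhanced_text) and the grouped output shape are assumptions for illustration; only the "data_source": "summary_fallback" flag comes from this stage's spec:

```python
import json

# Stand-in rows for clustered.xlsx (assumed columns: cluster_id, enhanced_text).
rows = [
    {"cluster_id": 0, "enhanced_text": "고객이 블루스크린 오류를 신고함"},
    {"cluster_id": 0, "enhanced_text": "택배 A/S 접수 방법 문의"},
    {"cluster_id": 6, "enhanced_text": "배송 지연 문의"},
]

# Group summaries per cluster and flag the degraded data source.
fallback = {"data_source": "summary_fallback", "clusters": {}}
for r in rows:
    fallback["clusters"].setdefault(str(r["cluster_id"]), []).append(r["enhanced_text"])

print(json.dumps(fallback, ensure_ascii=False, indent=2))
```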
For each cluster, read full conversation transcripts and extract:
- Intent type (정보_요청/문제_신고/프로세스_문의/불만_제기: information request / problem report / process inquiry / complaint)
- Verbatim customer phrases
- Frequency

Constraints:
- Use conversation turns from the enrichment output, NOT enhanced_text.

After analyzing ALL clusters, define SOP topics independent of cluster boundaries.
Handle these cases:
Output sop_topic_map (Stage 3 follows this exactly):
{
  "sop_topic_map": {
    "topics": [
      {
        "topic_id": "TS_HARDWARE_AS",
        "title": "A/S 접수 및 하드웨어 불량 처리",
        "type": "TS",
        "journey_stage": "사용 중",
        "source_clusters": [
          {"cluster_id": 0, "portion": "partial", "conversation_ids": [1,3,5], "reason": "하드웨어 관련만"},
          {"cluster_id": 6, "portion": "full"}
        ],
        "estimated_records": 500,
        "key_patterns": ["블루스크린", "택배_AS_접수"]
      }
    ],
    "merge_log": [...],
    "label_corrections": [...]
  }
}
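A lightweight validator can catch the most common sop_topic_map mistakes before Stage 3 runs. The checks (portion must be "full" or "partial", partial entries need conversation_ids, at most 15 topics) are drawn from this document's constraints; the helper name is hypothetical:

```python
# sop_topic_map trimmed to the fields the checks need.
topic_map = {
    "topics": [
        {
            "topic_id": "TS_HARDWARE_AS",
            "source_clusters": [
                {"cluster_id": 0, "portion": "partial", "conversation_ids": [1, 3, 5]},
                {"cluster_id": 6, "portion": "full"},
            ],
        }
    ]
}

def validate(tm):
    errors = []
    if len(tm["topics"]) > 15:
        errors.append("too many topics: merge related ones")
    for t in tm["topics"]:
        for sc in t["source_clusters"]:
            if sc["portion"] not in ("full", "partial"):
                errors.append(f"{t['topic_id']}: bad portion value")
            if sc["portion"] == "partial" and not sc.get("conversation_ids"):
                errors.append(f"{t['topic_id']}: partial cluster needs conversation_ids")
    return errors

print(validate(topic_map))  # [] when the map is well-formed
```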
Constraints:
FAQ (3-5 per SOP topic):
Keywords: Hierarchical taxonomy (category → subcategory → keywords) from actual conversations, including synonyms and common typos.
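A minimal sketch of what that taxonomy might look like on disk; the entries and field names (synonyms, typos) are illustrative, not prescribed by the pipeline:

```python
import json

# Hypothetical taxonomy entries following the category -> subcategory ->
# keywords hierarchy, with synonyms and common typos from conversations.
taxonomy = {
    "하드웨어": {
        "저장장치": {
            "keywords": ["SSD", "HDD", "블루스크린"],
            "synonyms": {"SSD": ["솔리드 스테이트", "에스에스디"]},
            "typos": {"블루스크린": ["불루스크린", "블루스크렌"]},
        }
    }
}

# Serialized the same way the pipeline writes its other JSON outputs.
serialized = json.dumps(taxonomy, ensure_ascii=False, indent=2)
print(serialized)
```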
Save to results/{company}/02_extraction/:
- patterns.json — patterns + sop_topic_map
- faq.json — FAQ pairs by SOP topic
- keywords.json — keyword taxonomy
- extraction_summary.md — human-readable summary

Then run final enrichment on the completed patterns:
python3 scripts/enrich_patterns.py \
--patterns results/{company}/02_extraction/patterns.json \
--messages results/{company}/01_clustering/{prefix}_messages.csv \
--output results/{company}/02_extraction/patterns_enriched.json \
--n-samples {n_samples}
| Issue | Solution |
|---|---|
| Enrichment fails (messages.csv missing) | Fall back to enhanced_text from clustered.xlsx, mark "data_source": "summary_fallback" |
| Patterns too generic | Re-read conversations, copy-paste exact customer phrases |
| Too many topics (>15) | Merge related topics (e.g., "SSD 인식" + "HDD 연결" → "저장장치 문제") |
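The topic-merge fix in the last row can be sketched as follows; the merge_log field names are assumptions, since this document only shows merge_log as a placeholder:

```python
# Two hypothetical over-specific topics merged into one storage topic.
topics = [
    {"topic_id": "TS_SSD", "title": "SSD 인식", "estimated_records": 120},
    {"topic_id": "TS_HDD", "title": "HDD 연결", "estimated_records": 80},
]

merged = {
    "topic_id": "TS_STORAGE",
    "title": "저장장치 문제",
    "estimated_records": sum(t["estimated_records"] for t in topics),
}
# A merge_log entry recording what was combined and why.
merge_log_entry = {
    "merged": [t["topic_id"] for t in topics],
    "into": merged["topic_id"],
    "reason": "related storage-device topics (15-topic cap)",
}
print(merged["estimated_records"])  # 200
```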
The sop_topic_map is defined here; Stage 3 follows it exactly and does not redefine topics.