This SOP guides real, sample-based LLM extraction of patterns, FAQs, and response strategies from clustered customer support data. It is Stage 2 of the Userchat-to-SOP pipeline, combining Python sample extraction with AI-agent natural-language analysis.
Language: Detect the language from the user's first message and respond in that language throughout. Support Korean (한국어) and Japanese (日本語). Default to Korean if language is unclear.
Stage Flow: Stage 1 (clustering) → Stage 2 (this SOP: pattern/FAQ extraction) → Stage 3 (SOP generation)
Critical Philosophy:
- Do NOT analyze from `enhanced_text` (summaries) — they lose customer expressions and agent responses
- Use full conversation transcripts (`turns` from enriched data) for all analysis

Parameters:
- `clustering_output_dir`: Directory containing Stage 1 results (e.g., `results/meliens`), which holds `{prefix}_clustered.xlsx`, `{prefix}_tags.xlsx`, `{prefix}_messages.csv`, and `analysis_report.md`
- `company`: Company name for context
- `n_samples_per_cluster`: Auto-calculated from the cluster count (K) as `n_samples_per_cluster = max(20, ceil(min_total_samples / K))`, using `min_total_samples` and K
- Cluster selection: `"all"` (all clusters), `"top_10"` (top 10 by size), or an explicit list such as `"0,2,5,7"`

Step 1: Read clustering results, then immediately run enrichment to extract full conversation transcripts. All subsequent steps use these transcripts, not summaries.
Constraints:
- Read `{prefix}_tags.xlsx`: Cluster summary (ID, label, category, keywords, count)
- Read `analysis_report.md`: Analysis insights and recommendations
- Do NOT read `{prefix}_clustered.xlsx` for samples — enrichment will provide better data
- Calculate `n_samples_per_cluster` dynamically (unless the user explicitly set it):
```python
import math

K = len(tags)    # number of clusters from tags.xlsx
min_total = 300  # min_total_samples parameter
n_samples = max(20, math.ceil(min_total / K))
```
```bash
mkdir -p results/{company}/02_extraction

# 1. Create minimal patterns.json from tags (cluster IDs only)
python3 -c "
import pandas as pd, json
tags = pd.read_excel('results/{company}/01_clustering/{prefix}_tags.xlsx')
data = {'metadata': {'company': '{company}', 'bootstrap': True}, 'clusters': []}
for _, r in tags.iterrows():
    data['clusters'].append({'cluster_id': int(r['cluster_id']), 'label': r['label'], 'category': r['category'], 'cluster_size': int(r['cluster_size'])})
with open('results/{company}/02_extraction/patterns.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
"

# 2. Run enrichment to extract full conversation transcripts
# n_samples is dynamically calculated: max(20, ceil(300 / K))
python3 scripts/enrich_patterns.py \
  --patterns results/{company}/02_extraction/patterns.json \
  --messages results/{company}/01_clustering/{prefix}_messages.csv \
  --output results/{company}/02_extraction/conversations_by_cluster.json \
  --n-samples {n_samples}
```
Output: `conversations_by_cluster.json` with {n_samples} conversation transcripts (full turns) per cluster.

Why enrichment first?
- Summaries (`enhanced_text`) lose: customer tone, agent response style, conversation flow, escalation moments, resolution steps

Fallback if enrichment fails:
- If `enrich_patterns.py` fails or `messages.csv` is missing, fall back to extracting samples from `clustered.xlsx` using `enhanced_text`
- Record `"data_source": "summary_fallback"` in metadata

Expected Output:
✅ Enrichment complete
- Clusters: 12 (K=12)
- Conversations per cluster: 25 (= max(20, ceil(300/12)))
- Total conversations analyzed: 300
- File: conversations_by_cluster.json
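The enrich-then-fall-back flow above can be sketched in Python. This is a minimal sketch only: the function name `extract_samples` and the `"enriched_turns"` marker are illustrative, not part of the pipeline's actual code.

```python
import subprocess

def extract_samples(company: str, prefix: str, n_samples: int) -> dict:
    """Run enrich_patterns.py; on any failure fall back to enhanced_text summaries."""
    base = f"results/{company}"
    out = f"{base}/02_extraction/conversations_by_cluster.json"
    cmd = [
        "python3", "scripts/enrich_patterns.py",
        "--patterns", f"{base}/02_extraction/patterns.json",
        "--messages", f"{base}/01_clustering/{prefix}_messages.csv",
        "--output", out,
        "--n-samples", str(n_samples),
    ]
    try:
        subprocess.run(cmd, check=True, capture_output=True)
        return {"data_source": "enriched_turns", "path": out}
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Fallback: sample enhanced_text from clustered.xlsx, flagged in metadata
        return {"data_source": "summary_fallback",
                "path": f"{base}/01_clustering/{prefix}_clustered.xlsx"}
```

Whatever wrapper is used, the key point is that the `"summary_fallback"` marker must survive into the metadata so Stage 3 knows the samples are summaries, not transcripts.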
For each cluster, read the full conversation transcripts, extract patterns, and classify HT vs TS.
Constraints:
- Use `turns` from the enrichment output (NOT `enhanced_text`)
- Category tags include: 정보_요청 (information request), 문제_신고 (problem report), 프로세스_문의 (process inquiry), 불만_제기 (complaint)
- Read all {n_samples} conversations

Per-Cluster Analysis Process:
For each cluster, read {n_samples} conversations and extract:
Patterns (3-8 per cluster):
HT vs TS Classification:
Mislabel Detection:
- Record an `actual_content` description

Company Tone (from agent messages in conversations):
Reading Conversations (important):
For Cluster X, read each conversation's turns:
Turn 1 (customer): "블루스크린이 계속 떠요. 어제부터 갑자기..." ("I keep getting a blue screen. It started suddenly yesterday...")
Turn 2 (agent): "안녕하세요, 고객님! ... 먼저 메모리 재장착을 시도해주세요" ("Hello! ... First, please try reseating the memory")
Turn 3 (customer): "재장착 했는데 아직도 같아요" ("I reseated it but it's still the same")
Turn 4 (agent): "그렇다면 택배 AS 접수를 도와드리겠습니다..." ("In that case, I'll help you file a courier A/S request...")
→ Pattern: blue screen → self-remediation → A/S request (TS, problem-resolution process)
→ Agent tone: uses emoji, empathy expressed first, step-by-step guidance
→ Real phrase: "블루스크린이 계속 떠요" (verbatim, not the paraphrase "블루스크린 문제 발생")
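Splitting a conversation's turns by speaker, as done informally above, can be sketched as follows. The field names `turns`, `role`, and `text` are assumptions about the enrichment output's shape, not a confirmed schema.

```python
def split_roles(conversation: dict) -> tuple[list[str], list[str]]:
    """Separate customer and agent utterances for tone/pattern analysis."""
    customer, agent = [], []
    for turn in conversation["turns"]:
        # Assumed role values: "customer" and "agent"
        (customer if turn["role"] == "customer" else agent).append(turn["text"])
    return customer, agent

conv = {"turns": [
    {"role": "customer", "text": "블루스크린이 계속 떠요"},
    {"role": "agent", "text": "먼저 메모리 재장착을 시도해주세요"},
]}
print(split_roles(conv))
# → (['블루스크린이 계속 떠요'], ['먼저 메모리 재장착을 시도해주세요'])
```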
Output per cluster:

```json
{
  "cluster_id": 0,
  "original_label": "오류 해결 문의",
  "actual_content": "A/S 접수 + 하드웨어 불량 + 윈도우 설치 (혼합)",
  "is_mixed": true,
  "mixed_topics": ["하드웨어_AS", "윈도우_소프트웨어", "케이블_연결"],
  "sop_type": "TS",
  "patterns": [...],
  "tone_observations": {...}
}
```
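Once every cluster has an entry of this shape, mixed clusters can be surfaced mechanically. A minimal sketch over a list of such entries (the second, non-mixed example entry is illustrative):

```python
def find_mixed_clusters(cluster_analyses: list[dict]) -> list[tuple]:
    """Return (cluster_id, original_label, mixed_topics) for clusters flagged as mixed."""
    return [
        (c["cluster_id"], c["original_label"], c.get("mixed_topics", []))
        for c in cluster_analyses
        if c.get("is_mixed")
    ]

analyses = [
    {"cluster_id": 0, "original_label": "오류 해결 문의", "is_mixed": True,
     "mixed_topics": ["하드웨어_AS", "윈도우_소프트웨어", "케이블_연결"]},
    {"cluster_id": 1, "original_label": "배송 문의", "is_mixed": False},
]
print(find_mixed_clusters(analyses))
# → [(0, '오류 해결 문의', ['하드웨어_AS', '윈도우_소프트웨어', '케이블_연결'])]
```

Every cluster returned here needs a split decision in Step 3's `sop_topic_map`.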
After analyzing ALL clusters, define the actual SOP topic list independent of cluster boundaries. This is the authoritative plan that Stage 3 must follow.
Constraints:
- Produce `sop_topic_map` — Stage 3 MUST follow this map

Process:
Output Structure (sop_topic_map in patterns.json):
```json
{
  "sop_topic_map": {
    "description": "Authoritative SOP topic list. Stage 3 MUST follow this map.",
    "topics": [
      {
        "topic_id": "TS_HARDWARE_AS",
        "title": "A/S 접수 및 하드웨어 불량 처리",
        "type": "TS",
        "journey_stage": "사용 중",
        "source_clusters": [
          {"cluster_id": 0, "portion": "partial", "conversation_ids": [1,3,5,6,8,10,12,14,15,17,19], "reason": "하드웨어 불량/AS 접수 관련 대화만"},
          {"cluster_id": 6, "portion": "full", "reason": "전체가 AS 문의"}
        ],
        "estimated_records": 500,
        "key_patterns": ["블루스크린", "쿨러_소음", "택배_AS_접수"]
      },
      {
        "topic_id": "TS_WINDOWS_SW",
        "title": "윈도우/소프트웨어 문제 해결",
        "type": "TS",
        "journey_stage": "설치",
        "source_clusters": [
          {"cluster_id": 0, "portion": "partial", "conversation_ids": [2,4,7,9,11,13,16,18,20], "reason": "윈도우 설치/드라이버 관련 대화만"}
        ],
        "estimated_records": 300,
        "key_patterns": ["윈도우_설치", "정품_인증", "드라이버_문제"]
      }
    ],
    "merge_log": [
      "Clusters 2+5+6 merged → HT_INITIAL_INTAKE (모두 초기 인사 메시지)"
    ],
    "label_corrections": [
      {"cluster_id": 7, "original_label": "현금영수증 문의", "actual_content": "80% 취소/환불 + 20% 서류 발행", "action": "split into TS_CANCEL + HT_DOCUMENTS"}
    ]
  }
}
```
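The map's structural invariants can be checked programmatically before handoff. A minimal sketch — the function name and the exact set of checks are illustrative, not a pipeline API:

```python
def validate_sop_topic_map(topic_map: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the map passes."""
    errors = []
    ids = [t["topic_id"] for t in topic_map["topics"]]
    if len(ids) != len(set(ids)):
        errors.append("duplicate topic_id values")
    for t in topic_map["topics"]:
        if t["type"] not in ("HT", "TS"):
            errors.append(f"{t['topic_id']}: type must be HT or TS")
        for src in t["source_clusters"]:
            # A partial split must name exactly which conversations belong to this topic
            if src["portion"] == "partial" and not src.get("conversation_ids"):
                errors.append(
                    f"{t['topic_id']}: partial cluster {src['cluster_id']} "
                    "has no conversation_ids"
                )
    return errors
```

Run it on the `sop_topic_map` object in `patterns.json` before handing off to Stage 3.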
Validation Checklist:
- `conversation_ids` assigned per topic

Create question-answer pairs based on actual conversations read in Step 2.
Constraints:
Critical Requirement:
Define response strategies and escalation rules per SOP topic.
Constraints:
Create keyword taxonomy from all analyzed conversations.
Constraints:
Save extraction results and run final enrichment.
Constraints:
results/{company}/02_extraction/:
- `patterns.json` — patterns + sop_topic_map + HT/TS classification
- `faq.json` — FAQ pairs organized by SOP topic
- `response_strategies.json` — strategies per SOP topic
- `keywords.json` — keyword taxonomy
- `extraction_summary.md` — human-readable summary

Run the final enrichment (with {n_samples} calculated in Step 1):
```bash
python3 scripts/enrich_patterns.py \
  --patterns results/{company}/02_extraction/patterns.json \
  --messages results/{company}/01_clustering/{company}_messages.csv \
  --output results/{company}/02_extraction/patterns_enriched.json \
  --n-samples {n_samples}
```
`patterns_enriched.json` contains conversation samples for each cluster.

Note: If enrichment was already run in Step 1 for analysis, this second run updates the file with the final patterns structure. The conversations are re-selected based on the finalized cluster assignments.
Parameters:
- `clustering_output_dir`: `results/assacom`

Execution:
Time: ~8-12 minutes
Output: 8 SOP topics, 47 patterns, 42 FAQ pairs, enriched conversations
Scenario: Cluster 0 has 799 records with hardware AS, Windows issues, and cable questions mixed together.
What happens in Step 2:
What happens in Step 3:
Solution:
- Extract samples using `enhanced_text` from `clustered.xlsx`
- Mark `"data_source": "summary_fallback"` in metadata

Root Cause: the LLM summarized instead of quoting verbatim
Solution: Re-read specific conversations and copy-paste exact customer phrases
Solution: Merge related topics. If "SSD 인식 문제" (SSD recognition) and "HDD 연결 문제" (HDD connection) are separate, combine them into "저장장치 문제" (storage-device issues)
| Aspect | Summary (enhanced_text) | Full Conversation (turns) |
|---|---|---|
| Customer tone | Lost | "블루스크린이 계속 떠요 ㅠㅠ" |
| Agent response pattern | Lost | "먼저 ~해주세요 → 안 되면 ~" |
| Escalation moment | Lost | "3번 시도 후 AS 접수 전환" |
| Resolution steps | Lost | Step-by-step troubleshooting flow |
| Conversation length | Lost | Short (3 turns) vs Complex (15+ turns) |
| Mixed topic detection | Hard (summary blends topics) | Clear (each turn has context) |
With min_total_samples=300, n_samples_per_cluster = max(20, ceil(300/K)):
| K (Clusters) | Samples/Cluster | Total Samples | Estimated Time |
|---|---|---|---|
| 8 | 38 | 304 | ~12-18 min |
| 10 | 30 | 300 | ~12-16 min |
| 12 | 25 | 300 | ~12-16 min |
| 15 | 20 | 300 | ~12-16 min |
| 20 | 20 | 400 | ~15-20 min |
| 25 | 20 | 500 | ~18-25 min |
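The table rows follow directly from the formula and can be reproduced with a short check (function name is illustrative):

```python
import math

def samples_per_cluster(k: int, min_total: int = 300, floor: int = 20) -> int:
    """n_samples_per_cluster = max(floor, ceil(min_total / K))."""
    return max(floor, math.ceil(min_total / k))

for k in (8, 10, 12, 15, 20, 25):
    n = samples_per_cluster(k)
    print(f"K={k}: {n} samples/cluster, {k * n} total")
# → K=8: 38 samples/cluster, 304 total
# → K=10: 30 samples/cluster, 300 total
# ... and so on, matching the table
```

Note that for K ≥ 15 the floor of 20 dominates, so total samples grow linearly with K.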
- Stage 3 must follow the `sop_topic_map`
- The conversation samples in `patterns_enriched.json` are the primary input to Stage 3