This SOP guides real, sample-based LLM extraction of patterns, FAQs, and response strategies from clustered customer support data. It is Stage 2 of the Userchat-to-SOP pipeline, combining Python sample extraction with AI-agent natural-language analysis.
Language: Detect the language from the user's first message and respond in that language throughout. Support Korean (한국어) and Japanese (日本語). Default to Korean if language is unclear.
Stage Flow: Stage 1 (clustering) → Stage 2 (this SOP: pattern/FAQ extraction) → Stage 3 (SOP generation)
Critical Philosophy:
- Do NOT analyze from `enhanced_text` (summaries) — they lose customer expressions and agent responses
- Use full conversation transcripts (`turns` from enriched data) for all analysis

Parameters:
- `clustering_output_dir`: Directory containing Stage 1 results (e.g., `results/meliens`), which holds `{prefix}_clustered.xlsx`, `{prefix}_tags.xlsx`, `{prefix}_messages.csv`, and `analysis_report.md`
- `company`: Company name for context
- `n_samples_per_cluster`: Auto-calculated from the cluster count (K) as `n_samples_per_cluster = max(20, ceil(min_total_samples / K))`, using `min_total_samples` and K
- Cluster selection: `"all"` (all clusters), `"top_10"` (top 10 by size), or an explicit list such as `"0,2,5,7"`

Step 1: Read clustering results, then immediately run enrichment to extract full conversation transcripts. All subsequent steps use these transcripts, not summaries.
Constraints:
- Read `{prefix}_tags.xlsx`: Cluster summary (ID, label, category, keywords, count)
- Read `analysis_report.md`: Analysis insights and recommendations
- Do NOT read `{prefix}_clustered.xlsx` for samples — enrichment will provide better data
- Calculate `n_samples_per_cluster` dynamically (unless the user explicitly set it):
```python
import math

K = len(tags)    # number of clusters from tags.xlsx
min_total = 300  # min_total_samples parameter
n_samples = max(20, math.ceil(min_total / K))
```
```bash
mkdir -p results/{company}/02_extraction

# 1. Create minimal patterns.json from tags (cluster IDs only)
python3 -c "
import pandas as pd, json
tags = pd.read_excel('results/{company}/01_clustering/{prefix}_tags.xlsx')
data = {'metadata': {'company': '{company}', 'bootstrap': True}, 'clusters': []}
for _, r in tags.iterrows():
    data['clusters'].append({'cluster_id': int(r['cluster_id']), 'label': r['label'], 'category': r['category'], 'cluster_size': int(r['cluster_size'])})
with open('results/{company}/02_extraction/patterns.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
"

# 2. Run enrichment to extract full conversation transcripts
# n_samples is dynamically calculated: max(20, ceil(300 / K))
python3 scripts/enrich_patterns.py \
  --patterns results/{company}/02_extraction/patterns.json \
  --messages results/{company}/01_clustering/{prefix}_messages.csv \
  --output results/{company}/02_extraction/conversations_by_cluster.json \
  --n-samples {n_samples}
```
Output: `conversations_by_cluster.json` with {n_samples} conversation transcripts (full turns) per cluster.

Why enrichment first?
- Summaries (`enhanced_text`) lose: customer tone, agent response style, conversation flow, escalation moments, resolution steps

Fallback if enrichment fails:
- If `enrich_patterns.py` fails or `messages.csv` is missing, fall back to extracting samples from `clustered.xlsx` using `enhanced_text`
- Record `"data_source": "summary_fallback"` in metadata

Expected Output:
✅ Enrichment complete
- Clusters: 12 (K=12)
- Conversations per cluster: 25 (= max(20, ceil(300/12)))
- Total conversations analyzed: 300
- File: conversations_by_cluster.json
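The enrich-then-fall-back flow above can be sketched in Python. This is a minimal sketch only: the function name `extract_samples` and the `"enriched_turns"` marker are illustrative, not part of the pipeline's actual code.

```python
import subprocess

def extract_samples(company: str, prefix: str, n_samples: int) -> dict:
    """Run enrich_patterns.py; on any failure fall back to enhanced_text summaries."""
    base = f"results/{company}"
    out = f"{base}/02_extraction/conversations_by_cluster.json"
    cmd = [
        "python3", "scripts/enrich_patterns.py",
        "--patterns", f"{base}/02_extraction/patterns.json",
        "--messages", f"{base}/01_clustering/{prefix}_messages.csv",
        "--output", out,
        "--n-samples", str(n_samples),
    ]
    try:
        subprocess.run(cmd, check=True, capture_output=True)
        return {"data_source": "enriched_turns", "path": out}
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Fallback: sample enhanced_text from clustered.xlsx, flagged in metadata
        return {"data_source": "summary_fallback",
                "path": f"{base}/01_clustering/{prefix}_clustered.xlsx"}
```

Whatever wrapper is used, the key point is that the `"summary_fallback"` marker must survive into the metadata so Stage 3 knows the samples are summaries, not transcripts.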
For each cluster, read the full conversation transcripts, extract patterns, and classify HT vs TS.
Constraints:
- Use `turns` from the enrichment output (NOT `enhanced_text`)
- Category tags include: 정보_요청 (information request), 문제_신고 (problem report), 프로세스_문의 (process inquiry), 불만_제기 (complaint)
- Read all {n_samples} conversations

Per-Cluster Analysis Process:
For each cluster, read {n_samples} conversations and extract:
Patterns (3-8 per cluster):
HT vs TS Classification:
Mislabel Detection:
- Record an `actual_content` description

Company Tone (from agent messages in conversations):
Reading Conversations (important):
For Cluster X, read each conversation's turns:
Turn 1 (customer): "블루스크린이 계속 떠요. 어제부터 갑자기..." ("I keep getting a blue screen. It started suddenly yesterday...")
Turn 2 (agent): "안녕하세요, 고객님! ... 먼저 메모리 재장착을 시도해주세요" ("Hello! ... First, please try reseating the memory")
Turn 3 (customer): "재장착 했는데 아직도 같아요" ("I reseated it but it's still the same")
Turn 4 (agent): "그렇다면 택배 AS 접수를 도와드리겠습니다..." ("In that case, I'll help you file a courier A/S request...")
→ Pattern: blue screen → self-remediation → A/S request (TS, problem-resolution process)
→ Agent tone: uses emoji, empathy expressed first, step-by-step guidance
→ Real phrase: "블루스크린이 계속 떠요" (verbatim, not the paraphrase "블루스크린 문제 발생")
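Splitting a conversation's turns by speaker, as done informally above, can be sketched as follows. The field names `turns`, `role`, and `text` are assumptions about the enrichment output's shape, not a confirmed schema.

```python
def split_roles(conversation: dict) -> tuple[list[str], list[str]]:
    """Separate customer and agent utterances for tone/pattern analysis."""
    customer, agent = [], []
    for turn in conversation["turns"]:
        # Assumed role values: "customer" and "agent"
        (customer if turn["role"] == "customer" else agent).append(turn["text"])
    return customer, agent

conv = {"turns": [
    {"role": "customer", "text": "블루스크린이 계속 떠요"},
    {"role": "agent", "text": "먼저 메모리 재장착을 시도해주세요"},
]}
print(split_roles(conv))
# → (['블루스크린이 계속 떠요'], ['먼저 메모리 재장착을 시도해주세요'])
```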
Output per cluster:

```json
{
  "cluster_id": 0,
  "original_label": "오류 해결 문의",
  "actual_content": "A/S 접수 + 하드웨어 불량 + 윈도우 설치 (혼합)",
  "is_mixed": true,
  "mixed_topics": ["하드웨어_AS", "윈도우_소프트웨어", "케이블_연결"],
  "sop_type": "TS",
  "patterns": [...],
  "tone_observations": {...}
}
```
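Once every cluster has an entry of this shape, mixed clusters can be surfaced mechanically. A minimal sketch over a list of such entries (the second, non-mixed example entry is illustrative):

```python
def find_mixed_clusters(cluster_analyses: list[dict]) -> list[tuple]:
    """Return (cluster_id, original_label, mixed_topics) for clusters flagged as mixed."""
    return [
        (c["cluster_id"], c["original_label"], c.get("mixed_topics", []))
        for c in cluster_analyses
        if c.get("is_mixed")
    ]

analyses = [
    {"cluster_id": 0, "original_label": "오류 해결 문의", "is_mixed": True,
     "mixed_topics": ["하드웨어_AS", "윈도우_소프트웨어", "케이블_연결"]},
    {"cluster_id": 1, "original_label": "배송 문의", "is_mixed": False},
]
print(find_mixed_clusters(analyses))
# → [(0, '오류 해결 문의', ['하드웨어_AS', '윈도우_소프트웨어', '케이블_연결'])]
```

Every cluster returned here needs a split decision in Step 3's `sop_topic_map`.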
After analyzing ALL clusters, define the actual SOP topic list independent of cluster boundaries. This is the authoritative plan that Stage 3 must follow.
Constraints:
- Produce `sop_topic_map` — Stage 3 MUST follow this map

Process:
Output Structure (sop_topic_map in patterns.json):
```json
{
  "sop_topic_map": {
    "description": "Authoritative SOP topic list. Stage 3 MUST follow this map.",
    "topics": [
      {
        "topic_id": "TS_HARDWARE_AS",
        "title": "A/S 접수 및 하드웨어 불량 처리",
        "type": "TS",
        "journey_stage": "사용 중",
        "source_clusters": [
          {"cluster_id": 0, "portion": "partial", "conversation_ids": [1,3,5,6,8,10,12,14,15,17,19], "reason": "하드웨어 불량/AS 접수 관련 대화만"},
          {"cluster_id": 6, "portion": "full", "reason": "전체가 AS 문의"}
        ],
        "estimated_records": 500,
        "key_patterns": ["블루스크린", "쿨러_소음", "택배_AS_접수"]
      },
      {
        "topic_id": "TS_WINDOWS_SW",
        "title": "윈도우/소프트웨어 문제 해결",
        "type": "TS",
        "journey_stage": "설치",
        "source_clusters": [
          {"cluster_id": 0, "portion": "partial", "conversation_ids": [2,4,7,9,11,13,16,18,20], "reason": "윈도우 설치/드라이버 관련 대화만"}
        ],
        "estimated_records": 300,
        "key_patterns": ["윈도우_설치", "정품_인증", "드라이버_문제"]
      }
    ],
    "merge_log": [
      "Clusters 2+5+6 merged → HT_INITIAL_INTAKE (모두 초기 인사 메시지)"
    ],
    "label_corrections": [
      {"cluster_id": 7, "original_label": "현금영수증 문의", "actual_content": "80% 취소/환불 + 20% 서류 발행", "action": "split into TS_CANCEL + HT_DOCUMENTS"}
    ]
  }
}
```
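The map's structural invariants can be checked programmatically before handoff. A minimal sketch — the function name and the exact set of checks are illustrative, not a pipeline API:

```python
def validate_sop_topic_map(topic_map: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the map passes."""
    errors = []
    ids = [t["topic_id"] for t in topic_map["topics"]]
    if len(ids) != len(set(ids)):
        errors.append("duplicate topic_id values")
    for t in topic_map["topics"]:
        if t["type"] not in ("HT", "TS"):
            errors.append(f"{t['topic_id']}: type must be HT or TS")
        for src in t["source_clusters"]:
            # A partial split must name exactly which conversations belong to this topic
            if src["portion"] == "partial" and not src.get("conversation_ids"):
                errors.append(
                    f"{t['topic_id']}: partial cluster {src['cluster_id']} "
                    "has no conversation_ids"
                )
    return errors
```

Run it on the `sop_topic_map` object in `patterns.json` before handing off to Stage 3.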
Validation Checklist:
- `conversation_ids` assigned per topic

Create question-answer pairs based on actual conversations read in Step 2.
Constraints:
Critical Requirement:
Define response strategies and escalation rules per SOP topic.
Constraints:
Create keyword taxonomy from all analyzed conversations.
Constraints:
Save extraction results and run final enrichment.
Constraints:
results/{company}/02_extraction/:
- `patterns.json` — patterns + sop_topic_map + HT/TS classification
- `faq.json` — FAQ pairs organized by SOP topic
- `response_strategies.json` — strategies per SOP topic
- `keywords.json` — keyword taxonomy
- `extraction_summary.md` — human-readable summary

Run the final enrichment (with {n_samples} calculated in Step 1):
```bash
python3 scripts/enrich_patterns.py \
  --patterns results/{company}/02_extraction/patterns.json \
  --messages results/{company}/01_clustering/{company}_messages.csv \
  --output results/{company}/02_extraction/patterns_enriched.json \
  --n-samples {n_samples}
```
`patterns_enriched.json` contains conversation samples for each cluster.

Note: If enrichment was already run in Step 1 for analysis, this second run updates the file with the final patterns structure. The conversations are re-selected based on the finalized cluster assignments.
Parameters:
- `clustering_output_dir`: `results/assacom`

Execution:
Time: ~8-12 minutes
Output: 8 SOP topics, 47 patterns, 42 FAQ pairs, enriched conversations
Scenario: Cluster 0 has 799 records with hardware AS, Windows issues, and cable questions mixed together.
What happens in Step 2:
What happens in Step 3:
Solution:
- Extract samples using `enhanced_text` from `clustered.xlsx`
- Mark `"data_source": "summary_fallback"` in metadata

Root Cause: the LLM summarized instead of quoting verbatim
Solution: Re-read specific conversations and copy-paste exact customer phrases
Solution: Merge related topics. If "SSD 인식 문제" (SSD recognition) and "HDD 연결 문제" (HDD connection) are separate, combine them into "저장장치 문제" (storage-device issues)
| Aspect | Summary (enhanced_text) | Full Conversation (turns) |
|---|---|---|
| Customer tone | Lost | "블루스크린이 계속 떠요 ㅠㅠ" |
| Agent response pattern | Lost | "먼저 ~해주세요 → 안 되면 ~" |
| Escalation moment | Lost | "3번 시도 후 AS 접수 전환" |
| Resolution steps | Lost | Step-by-step troubleshooting flow |
| Conversation length | Lost | Short (3 turns) vs Complex (15+ turns) |
| Mixed topic detection | Hard (summary blends topics) | Clear (each turn has context) |
With min_total_samples=300, n_samples_per_cluster = max(20, ceil(300/K)):
| K (Clusters) | Samples/Cluster | Total Samples | Estimated Time |
|---|---|---|---|
| 8 | 38 | 304 | ~12-18 min |
| 10 | 30 | 300 | ~12-16 min |
| 12 | 25 | 300 | ~12-16 min |
| 15 | 20 | 300 | ~12-16 min |
| 20 | 20 | 400 | ~15-20 min |
| 25 | 20 | 500 | ~18-25 min |
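The table rows follow directly from the formula and can be reproduced with a short check (function name is illustrative):

```python
import math

def samples_per_cluster(k: int, min_total: int = 300, floor: int = 20) -> int:
    """n_samples_per_cluster = max(floor, ceil(min_total / K))."""
    return max(floor, math.ceil(min_total / k))

for k in (8, 10, 12, 15, 20, 25):
    n = samples_per_cluster(k)
    print(f"K={k}: {n} samples/cluster, {k * n} total")
# → K=8: 38 samples/cluster, 304 total
# → K=10: 30 samples/cluster, 300 total
# ... and so on, matching the table
```

Note that for K ≥ 15 the floor of 20 dominates, so total samples grow linearly with K.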
- Stage 3 must follow the `sop_topic_map`
- The conversation samples in `patterns_enriched.json` are the primary input to Stage 3