A professional analysis workflow that leverages the Python ecosystem (Jupyter, Pandas, Scikit-learn) to extract deep insights from data. It follows the OSEMN methodology and the standards in SKILL.md.
1. Load this document and review the **'Core Principles'**.
2. Scan the **'Methodology Master List'** in SKILL.md.
3. Draft docs/plans/ANALYSIS_[주제].md from resources/plan-template.md.
4. Create or open docs/notebooks/[Topic]_Analysis.ipynb.
5. Keep statistical outputs (describe, p-value) and graphs in the notebook's output cells (No Hiding).

"The Right Tool for the Job"
Select the technique best suited to the data's characteristics from the Methodology Master List in SKILL.md, and state the choice in the Plan. For example:

- Outliers present: choose RobustScaler.
- High-cardinality categoricals: consider Target Encoding.
- Imbalanced classes: decide whether to apply SMOTE or Class Weights.
- Model choice: CatBoost (Categorical), XGBoost/LightGBM (Large Data), Isolation Forest (Anomaly), etc.
- Time-series data: use Time Series Split.
- Imbalanced classification: Stratified K-Fold is mandatory.

This stage secures data reliability to prevent Garbage In, Garbage Out.
- Excel: use openpyxl, and take care not to hardcode calculated values.
- CSV: use auto-detection for encoding (utf-8, cp949) and delimiters.
- Inspect summary statistics (describe()), missing values (isnull().sum()), and data types (dtypes).
- Consult the 'Logical Failures' section of SKILL.md for a detailed audit.
- Avoid a mere parade of graphs; keep the flow Question -> Viz -> Finding.
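The loading and inspection steps above can be sketched as follows. This is a minimal illustration, not part of SKILL.md; the helper names `load_csv_safely` and `quality_report` are hypothetical.

```python
import pandas as pd

def load_csv_safely(path: str) -> pd.DataFrame:
    """Try the common encodings (utf-8 first, then cp949) before giving up."""
    for enc in ("utf-8", "cp949"):
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with utf-8 or cp949")

def quality_report(df: pd.DataFrame) -> None:
    """Print the basic quality checks; keep these in the notebook output cell."""
    print(df.describe(include="all"))  # summary statistics
    print(df.isnull().sum())           # missing values per column
    print(df.dtypes)                   # data types
```

Keeping the report in a notebook cell (rather than suppressing it) follows the "No Hiding" principle below.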
- Apply techniques from the SKILL.md list (Encoding, Scaling, PCA).
- Tune hyperparameters efficiently with tools such as Optuna.
- Detect overfitting by comparing the Cross Validation score against the Hold-out Test score (warn when the gap exceeds 5%).
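The CV-versus-hold-out gap check might look like this minimal sketch on toy data; the 5% threshold is the one stated above, and the model choice is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Toy data; stratify the split to preserve the class ratio.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(random_state=0)
cv_score = cross_val_score(model, X_tr, y_tr, cv=5).mean()  # cross-validation
model.fit(X_tr, y_tr)
test_score = model.score(X_te, y_te)                        # hold-out test

gap = abs(cv_score - test_score)
if gap > 0.05:  # the 5% warning threshold from the workflow
    print(f"WARNING: possible overfitting (gap={gap:.3f})")
```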
- Determine the optimal number of clusters (K) with the Elbow Method or Silhouette Score.
- Explain the model's reasoning with SHAP, Permutation Importance, and similar tools.

To transform raw data into actionable business insights using a rigorous, hypothesis-driven approach.
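Choosing K by Silhouette Score can be sketched as below; the blob data is a toy assumption, and in practice you would cross-check with the Elbow Method.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with a known structure (4 blobs); real data is rarely this clean.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher = better-separated clusters

best_k = max(scores, key=scores.get)
```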
"No Hiding"
AI Readability: All statistical outputs (describe, corr, p-value) and graphs must remain in the notebook output. This ensures tools like NotebookLM can contextually understand the analysis.
"Garbage In, Garbage Out"
- openpyxl for editing, pandas for reading. Zero hardcoding of calculated values.
- Auto-detect delimiters (csv.Sniffer). Handle encoding errors (utf-8 vs cp949) explicitly.

"Ask, Don't just Plot"
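Delimiter sniffing with the standard library's `csv.Sniffer` can be illustrated as follows; the sample string is invented for the sketch.

```python
import csv

# A semicolon-delimited sample (e.g. the first few KB of an unknown CSV file).
sample = "id;name;score\n1;kim;90\n2;lee;85\n"

dialect = csv.Sniffer().sniff(sample)
# dialect.delimiter now holds ';' and can be passed on, e.g.:
# pd.read_csv(path, sep=dialect.delimiter, encoding="utf-8")
```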
"Trust but Verify"
"Why did it predict that?"
Before finalizing:
Scan these tables to select the most appropriate methodology for your data and goal.
**Preprocessing & Feature Engineering**

| Methodology | Usage / Purpose | Data Constraints |
|---|---|---|
| Simple Imputation | Missing Value Imputation (Simple Replacement) | Mean/Median (Numeric), Mode (Categorical) |
| KNN Imputation | Missing Value Imputation (Similarity-based) | Mainly Numeric, useful when correlations exist |
| Iterative Imputation | Missing Value Imputation (Model-based) | High variable correlation, assumes MAR |
| One-Hot Encoding | Categorical to Numeric | Nominal data, Low Cardinality |
| Label Encoding | Categorical to Numeric | Ordinal data |
| Target Encoding | Categorical to Numeric | High Cardinality features, Risk of Overfitting |
| Standard Scaler | Scaling (Standardization) | Sensitive to outliers, assumes Gaussian distribution |
| MinMax Scaler | Scaling (Normalization) | Bounded data, distribution agnostic |
| Robust Scaler | Scaling (Robust to Outliers) | Data with many outliers (Uses Median/IQR) |
| SMOTE | Oversampling (Imbalanced Data) | Synthesize minority class samples (Training set ONLY) |
| PCA | Dimensionality Reduction, Multicollinearity Removal | Continuous variables, assumes linear relationships |
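The preprocessing techniques in the table compose naturally in a scikit-learn `ColumnTransformer`. A sketch under assumed toy data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy frame: numeric columns with missing values, one low-cardinality categorical.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [3000, 5200, np.nan, 4100],
    "city": ["Seoul", "Busan", "Seoul", np.nan],
})

preprocess = ColumnTransformer([
    # Numeric: median imputation, then outlier-robust scaling (median/IQR).
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ]), ["age", "income"]),
    # Categorical: mode imputation, then one-hot (nominal, low cardinality).
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

X = preprocess.fit_transform(df)  # 2 scaled numeric + 2 one-hot columns
```

Fitting the transformer only on training data (then applying it to test data) avoids the leakage the SMOTE row warns about.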
**Models & Algorithms**

| Methodology | Type | Usage / Purpose | Constraints / Notes |
|---|---|---|---|
| Linear Regression | Regression | Baseline for regression | Linear relationship assumption |
| Logistic Regression | Classification | Baseline for classification | Linear separation assumption, large sparse data OK |
| SVM / SVR | Class/Reg | High accuracy in high dimensional spaces | Computationally expensive (O(n^3)), Scale-sensitive |
| K-Nearest Neighbors | Class/Reg | Instance-based learning, Simple | Scale-sensitive, Small data |
| Random Forest | Ensemble | Robust Classification/Regression | Handles Mixed types, Robust to outliers/missing values |
| XGBoost / LightGBM | Ensemble | High Performance | Large datasets, handles missing values internally |
| CatBoost | Ensemble | Best for Categorical Features | Handles categories automatically, Slower training |
| Isolation Forest | Anomaly Detection | Outlier/Anomaly Detection | High dimensional data, efficiency |
| K-Means | Clustering | Partitioning into K clusters | Spherical Clusters, Sensitive to outliers, Scale-sensitive |
| DBSCAN | Clustering | Density-based clustering, Detects Outliers | Arbitrary shapes, Scale-sensitive, finding epsilon is hard |
| Hierarchical | Clustering | Dendrogram visualization | Computationally expensive for large data |
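One way to use the table: cross-validate a few candidates on the same data before committing. The data and the model shortlist below are illustrative, not prescribed by SKILL.md.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, random_state=0)

# Scale-sensitive models (per the table) get a scaler in their pipeline.
candidates = {
    "logreg (baseline)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn (scale-sensitive)": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random forest": RandomForestClassifier(random_state=0),
}

results = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}
```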
**Deep Learning**

| Methodology | Usage / Purpose | Data Constraints |
|---|---|---|
| CNN | Image/Pattern Recognition | Grid-like data (Images, etc.) |
| RNN / LSTM | Sequence/Time-Series Prediction | Sequential data |
| Transformer | NLP, Complex Pattern Matching | Long sequences, Large-scale data |
**Validation, Tuning & Regularization**

| Methodology | Type | Usage / Purpose | Notes |
|---|---|---|---|
| Stratified K-Fold | Validation | Cross Validation (Generalization) | Essential for Imbalanced Class distribution |
| K-Fold CV | Validation | Cross Validation | Sufficient data, Balanced classes |
| Time Series Split | Validation | Cross Validation (Temporal) | No future data leakage (essential for time-series) |
| Grid Search | Tuning | Hyperparameter Optimization | Small search space (Exhaustive) |
| Bayesian Optimization | Tuning | Hyperparameter Optimization | Large search space, High evaluation cost |
| Optuna | Tuning | Next-gen Hyperparameter Optimization | Efficient, Define-by-run, Pruning capabilities |
| L1 (Lasso) | Regularization | Sparse Model, Feature Selection | When sparse solution is needed |
| L2 (Ridge) | Regularization | Prevent Overfitting, Weight Decay | When high multicollinearity exists |
| ElasticNet | Regularization | Combination of L1 and L2 | When both feature selection and regularization needed |
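A sketch combining Stratified K-Fold with an exhaustive Grid Search on an imbalanced toy set; the parameter grid is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced toy data -> Stratified K-Fold keeps the class ratio in every fold.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},  # small space: Grid Search is fine
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",  # imbalance-aware metric (see the metrics tables below)
)
grid.fit(X, y)
best_C = grid.best_params_["logisticregression__C"]
```

For larger search spaces, the table's Bayesian Optimization or Optuna row applies instead.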
**Explainability**

| Methodology | Usage / Purpose | Notes |
|---|---|---|
| SHAP | Explain Model Predictions | Model-agnostic; the fast TreeExplainer variant targets tree-based models |
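Since the `shap` package may not be installed everywhere, here is a sketch using scikit-learn's built-in Permutation Importance (also named in the workflow above) on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the score drop:
# a large drop means the model relied on that feature.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
importances = result.importances_mean  # one value per feature
```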
Select metrics based on your problem type and business goal.
**Classification Metrics**

| Metric | Focus | When to use |
|---|---|---|
| Accuracy | Overall Correctness | Balanced datasets only. Misleading for imbalanced data. |
| Precision | False Positive Reduction | When FP is costly (e.g., Spam Filter). |
| Recall | False Negative Reduction | When FN is critical (e.g., Cancer Diagnosis, Fraud). |
| F1 Score | Balance | When you need a balance between Precision and Recall. |
| ROC-AUC | Ranking Quality | When you need robust performance across thresholds. |
| Log Loss | Probability Confidence | When the predicted probability value itself matters. |
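A worked toy example of why accuracy alone misleads on imbalanced data, as the Accuracy row warns:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # imbalanced: 8 negatives, 2 positives
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # misses one of the two positives

acc = accuracy_score(y_true, y_pred)    # 0.9 — looks fine
prec = precision_score(y_true, y_pred)  # 1.0 — no false positives
rec = recall_score(y_true, y_pred)      # 0.5 — half the positives missed
f1 = f1_score(y_true, y_pred)           # ~0.667 — exposes the imbalance
```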
**Regression Metrics**

| Metric | Focus | When to use |
|---|---|---|
| MSE | Large Error Penalty | When outliers/large errors should be heavily penalized. |
| RMSE | Interpretability | When you need error in the same unit as the target. |
| MAE | Robustness | When you want to be robust against outliers. |
| R2 Score | Explainability | To see how much variance is explained by the model. |
| MAPE | Business Interpretability | Error in Percentage (%). Easy for stakeholders. |
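A worked toy example of the regression metrics; MAPE is computed by hand here for clarity (scikit-learn also ships `mean_absolute_percentage_error`).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mse = mean_squared_error(y_true, y_pred)           # squares large errors
rmse = np.sqrt(mse)                                # same unit as the target
mae = mean_absolute_error(y_true, y_pred)          # robust to outliers
r2 = r2_score(y_true, y_pred)                      # variance explained
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percent error
```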
**Clustering Metrics**

| Metric | Focus | When to use |
|---|---|---|
| Silhouette Score | Cluster Separation | To measure how similar an object is to its own cluster compared to other clusters. |
| Davies-Bouldin | Cluster Compactness | Lower is better. Good for comparing clustering algorithms. |
| Elbow Method | Optimal K | To find the "elbow" in the inertia curve, where adding clusters stops paying off (optimal K in K-Means). |
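A sketch contrasting a sensible K with an over-clustered K via Davies-Bouldin (toy blobs; lower is better, as the table notes):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Toy data with 3 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=7)

good = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
bad = KMeans(n_clusters=8, n_init=10, random_state=7).fit_predict(X)

# Lower Davies-Bouldin = compact, well-separated clusters;
# over-clustering splits blobs into adjacent fragments and inflates the score.
dbi_good = davies_bouldin_score(X, good)
dbi_bad = davies_bouldin_score(X, bad)
```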