Name: Data Prep
Author: smouj

## Data Preparation Report

### Dataset Summary
- **Original Rows:** 100,000
- **Original Columns:** 45
- **Missing Values:** 12.3%
- **Duplicates:** 234
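
The summary figures above can be reproduced with a few pandas calls. The sketch below is a minimal illustration on a small made-up frame; `summarize` is a hypothetical helper, and it assumes the missing-value percentage is measured over all cells, which the report does not specify.

```python
import numpy as np
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return the headline figures used in a dataset summary."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        # Assumption: share of all cells that are missing, as a percentage.
        "missing_pct": round(float(df.isna().to_numpy().mean()) * 100, 1),
        # Rows that exactly repeat an earlier row.
        "duplicates": int(df.duplicated().sum()),
    }

# Tiny hypothetical frame, for illustration only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 25],
    "income": [50_000, 62_000, np.nan, 50_000],
    "city": ["Lima", "Quito", None, "Lima"],
})
summary = summarize(df)
print(summary)
```

If the percentage is meant per row instead, `df.isna().any(axis=1).mean()` would be the corresponding expression.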

### Data Quality Issues Found
| Issue | Count | Affected Columns |
|-------|-------|-------------------|
| Missing Values | 15,200 | age, income, address |
| Outliers | 2,340 | salary, price, score |
| Invalid Types | 450 | phone, date |
| Duplicates | 234 | All |
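
A count like the Invalid Types row can be gathered by attempting a strict parse and flagging values that are present but fail it. This sketch assumes ISO-formatted dates and a deliberately loose phone pattern; both rules and the sample values are illustrative assumptions, not the checks that produced the table.

```python
import pandas as pd

# Hypothetical sample; only the column names come from the report.
df = pd.DataFrame({
    "phone": ["555-0100", "n/a", "5550101", None],
    "date": ["2024-01-05", "not a date", "2024-02-30", "2024-03-01"],
})

# Unparseable dates become NaT; a value is invalid if present but unparsed.
parsed = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
invalid_date = df["date"].notna() & parsed.isna()

# Assumed phone rule: at least 7 characters, digits and hyphens only.
phone_ok = df["phone"].str.fullmatch(r"[\d-]{7,}", na=False).astype(bool)
invalid_phone = df["phone"].notna() & ~phone_ok

# Missing values are counted separately, so they are excluded here.
invalid_types = int(invalid_date.sum() + invalid_phone.sum())
print(invalid_types)
```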

### Transformations Applied
1. ✅ Removed 234 duplicate rows
2. ✅ Imputed missing values:
   - Numerical: Median imputation (age, income)
   - Categorical: Mode imputation (city, category)
3. ✅ Removed outliers (IQR method): 2,340 rows
4. ✅ Encoded categorical variables:
   - One-hot: city, category
   - Label: status, priority
5. ✅ Scaled numerical features (StandardScaler):
   - age, salary, score, price
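
The five steps above could be sketched with pandas and scikit-learn roughly as follows. The helper names `clean` and `encode_and_scale`, the fence multiplier `k=1.5`, the use of category codes for label encoding, and the sample data are all assumptions made for illustration; the report does not state how each step was implemented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def clean(df, num_cols, cat_cols, outlier_cols, k=1.5):
    """Deduplicate, impute, then drop rows outside the IQR fences."""
    out = df.drop_duplicates().copy()
    for col in num_cols:  # median for numerical columns
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:  # mode for categorical columns
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    keep = pd.Series(True, index=out.index)
    for col in outlier_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= out[col].between(q1 - k * iqr, q3 + k * iqr)
    return out[keep].reset_index(drop=True)

def encode_and_scale(df, onehot_cols, label_cols, scale_cols):
    """One-hot and label encode, then standardize the numeric columns."""
    out = pd.get_dummies(df, columns=onehot_cols, dtype=int)
    for col in label_cols:  # one integer code per category
        out[col] = out[col].astype("category").cat.codes
    scaled = StandardScaler().fit_transform(out[scale_cols])
    for i, col in enumerate(scale_cols):
        out[col] = scaled[:, i]
    return out

# Hypothetical rows: one exact duplicate, two gaps, one salary outlier.
raw = pd.DataFrame({
    "age": [25.0, 25.0, np.nan, 40.0, 31.0, 58.0],
    "salary": [30.0, 30.0, 32.0, 35.0, 31.0, 900.0],
    "city": ["Lima", "Lima", None, "Quito", "Lima", "Lima"],
    "status": ["new", "new", "open", "open", "new", "closed"],
})
cleaned = clean(raw, ["age"], ["city"], ["salary"])
prepared = encode_and_scale(cleaned, ["city"], ["status"], ["age", "salary"])
print(prepared)
```

In a modelling workflow the imputer and scaler are usually fit on the training split only and then applied to the test split, so that test statistics do not leak into the transformation.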

### Feature Engineering
| Feature | Type | Description |
|---------|------|-------------|
| age_group | Categorical | Binned age (0-18, 19-35, etc.) |
| income_percentile | Numerical | Percentile rank of income |
| has_phone | Boolean | Derived from phone field |
| price_per_unit | Numerical | price / quantity |
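
The four derived features in the table map onto short pandas expressions. The sketch below is illustrative only: the bin edges beyond 35, the rule that an empty string counts as no phone, and the handling of zero quantities are assumptions, since the table does not define them.

```python
import numpy as np
import pandas as pd

# Hypothetical input rows using the column names from the report.
df = pd.DataFrame({
    "age": [12, 27, 44, 70],
    "income": [0.0, 42_000.0, 58_000.0, 31_000.0],
    "phone": ["555-0100", None, "", "555-0199"],
    "price": [20.0, 9.0, 15.0, 8.0],
    "quantity": [4, 3, 0, 2],
})

# age_group: the first two bins follow the table; the rest are assumed.
bins = [0, 18, 35, 60, np.inf]
labels = ["0-18", "19-35", "36-60", "61+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, include_lowest=True)

# income_percentile: rank scaled into (0, 1].
df["income_percentile"] = df["income"].rank(pct=True)

# has_phone: True when the phone field holds a non-empty value.
df["has_phone"] = df["phone"].fillna("").str.strip().ne("")

# price_per_unit: zero quantities become NaN instead of dividing by zero.
df["price_per_unit"] = df["price"] / df["quantity"].where(df["quantity"] > 0)

print(df[["age_group", "income_percentile", "has_phone", "price_per_unit"]])
```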

### Final Dataset
- **Final Rows:** 97,426
- **Final Features:** 68
- **Train Size:** 77,941 (80%)
- **Test Size:** 19,485 (20%)
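
An 80/20 split like this one is commonly produced with scikit-learn's `train_test_split`. The sketch below assumes a stratified split on a hypothetical `target` column with a fixed seed; the report does not say whether stratification was used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with roughly the same 15/85 class balance as the report.
df = pd.DataFrame({"feature": range(100), "target": [1] * 15 + [0] * 85})

# stratify keeps the class proportions equal in both splits.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)
print(len(train), len(test))
```

Stratifying matters most when one class is rare, as in the class distribution reported here, because a plain random split can leave the test set with too few positive examples.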

### Class Distribution
| Class | Count | Percentage |
|-------|-------|------------|
| Positive | 15,234 | 15.6% |
| Negative | 82,192 | 84.4% |
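
The percentages in this table follow directly from the counts. As a quick check, this sketch rebuilds the table from the two reported counts; the variable names are illustrative.

```python
import pandas as pd

# Class counts taken from the table above.
counts = pd.Series({"Positive": 15_234, "Negative": 82_192}, name="Count")

dist = pd.DataFrame({
    "Count": counts,
    "Percentage": (counts / counts.sum() * 100).round(1),
})
# Ratio of majority to minority class, a common imbalance measure.
imbalance_ratio = counts.max() / counts.min()
print(dist)
```

The counts sum to 97,426, matching the final row count, and the majority class is a little over five times the size of the minority class.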

## Saved Files
- train.csv (77,941 rows)
- test.csv (19,485 rows)
- preprocessing_pipeline.pkl
- feature_definitions.json
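
One way these four artifacts might be written is sketched below, using a temporary directory so the example leaves nothing behind. The pipeline contents, the stand-in data, and the JSON schema for `feature_definitions.json` are assumptions; only the file names come from the list above.

```python
import json
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the real splits.
nan = float("nan")
train = pd.DataFrame({"age": [25.0, nan, 40.0], "salary": [30.0, 32.0, 35.0]})
test = pd.DataFrame({"age": [31.0], "salary": [nan]})

# Assumed pipeline: median imputation followed by standard scaling.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
]).fit(train)

# Assumed schema: one entry per engineered feature.
feature_definitions = {
    "age_group": {"type": "categorical", "description": "Binned age"},
    "price_per_unit": {"type": "numerical", "description": "price / quantity"},
}

out_dir = Path(tempfile.mkdtemp())
train.to_csv(out_dir / "train.csv", index=False)
test.to_csv(out_dir / "test.csv", index=False)
with open(out_dir / "preprocessing_pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)
(out_dir / "feature_definitions.json").write_text(
    json.dumps(feature_definitions, indent=2)
)

# Reload the fitted pipeline and apply it to unseen rows.
with open(out_dir / "preprocessing_pipeline.pkl", "rb") as f:
    restored = pickle.load(f)
transformed = restored.transform(test)
```

Pickle files execute code when loaded, so a saved pipeline should only be loaded from a trusted source and with a compatible scikit-learn version.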

## Next Steps
- [ ] Balance classes (SMOTE or undersampling)
- [ ] Feature selection to reduce dimensionality
- [ ] Try dimensionality reduction (PCA)
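
Two of these follow-ups can be sketched briefly: random undersampling of the majority class and PCA that keeps 95% of the variance. The synthetic data, the `target` column name, and the 95% threshold are assumptions for illustration. SMOTE is not shown because it lives in the separate imbalanced-learn package.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic, imbalanced data with one deliberately redundant feature.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
X["f5"] = X["f0"] * 2 + X["f1"]  # linear combination of f0 and f1
y = pd.Series([1] * 30 + [0] * 170, name="target")

# Random undersampling: draw as many rows per class as the minority has.
data = X.assign(target=y)
minority_n = int(y.value_counts().min())
balanced = data.groupby("target").sample(n=minority_n, random_state=0)
X_bal = balanced.drop(columns="target")
y_bal = balanced["target"]

# PCA: a float n_components keeps enough components for that variance share.
pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X_bal)
print(X_bal.shape, X_red.shape)
```

Resampling is normally applied to the training split only, so that the test set keeps the real-world class balance.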

## Supported Tools

| Task | Tool | Language |
|------|------|----------|
| Data Manipulation | Pandas, Polars | Python |
| Preprocessing | Scikit-learn | Python |
| Feature Selection | Feature-engine | Python |
| Deep Learning | TensorFlow Data | Python |
