Prepare and clean data for machine learning with feature engineering and preprocessing. This skill handles missing values, normalization, encoding, and ML-ready dataset creation.
You are an expert in data preparation and feature engineering for machine learning.
| Task | Tool | Language |
|---|---|---|
| Data Manipulation | Pandas, Polars | Python |
| Preprocessing | Scikit-learn | Python |
| Feature Selection | Feature-engine | Python |
| Deep Learning | TensorFlow (`tf.data`) | Python |
## Data Preparation Report
### Dataset Summary
- **Original Rows:** 100,000
- **Original Columns:** 45
- **Missing Values:** 12.3%
- **Duplicates:** 234
### Data Quality Issues Found
| Issue | Count | Affected Columns |
|-------|-------|-------------------|
| Missing Values | 15,200 | age, income, address |
| Outliers | 2,340 | salary, price, score |
| Invalid Types | 450 | phone, date |
| Duplicates | 234 | All |
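
Counts like the ones in this table can be gathered with a few pandas calls. The following is a minimal sketch, not a definitive audit: the `audit` helper, the toy columns, and the 1.5 × IQR outlier rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def audit(df: pd.DataFrame) -> dict:
    """Count common data quality issues (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    # IQR fences per numeric column; NaN comparisons evaluate to False,
    # so missing values are not counted as outliers.
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    missing = df.isna().sum()
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(missing.sum()),
        "missing_by_column": missing[missing > 0].to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outlier_rows": int(outside.any(axis=1).sum()),
    }


# Toy data: one duplicate row, two missing cells, one extreme salary.
sample = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 25, 38],
    "salary": [50_000, 52_000, 51_000, 900_000, 50_000, 49_000],
    "city": ["Oslo", "Rome", None, "Rome", "Oslo", "Oslo"],
})
report = audit(sample)
```

Invalid types (for example malformed phone numbers or dates) need column-specific validation and are left out of this sketch.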
### Transformations Applied
1. ✅ Removed 234 duplicate rows
2. ✅ Imputed missing values:
   - Numerical: Median imputation (age, income)
   - Categorical: Mode imputation (city, category)
3. ✅ Removed outliers (IQR method): 2,340 rows
4. ✅ Encoded categorical variables:
   - One-hot: city, category
   - Label: status, priority
5. ✅ Scaled numerical features (StandardScaler):
   - age, salary, score, price
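
The five steps above can be sketched in order with pandas and scikit-learn. This is an illustration under stated assumptions: the `prepare` function, the toy columns, and the ordered levels for `priority` are hypothetical, and label encoding is done with pandas categorical codes rather than a specific encoder class. For brevity the medians, IQR fences, and scaler statistics are computed on the whole table; in a train/test workflow they would normally be fitted on the training rows only to avoid leakage.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare(df, numeric_cols, onehot_cols, label_cols):
    """Apply the report's five steps. `label_cols` maps column -> ordered levels."""
    # 1. Remove exact duplicate rows.
    out = df.drop_duplicates().copy()
    out = out.astype({col: float for col in numeric_cols})

    # 2. Impute: median for numerical columns, mode for categorical ones.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    for col in list(onehot_cols) + list(label_cols):
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # 3. Drop rows outside the 1.5 * IQR fences in any numerical column.
    q1 = out[numeric_cols].quantile(0.25)
    q3 = out[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    inside = (out[numeric_cols] >= q1 - 1.5 * iqr) & (out[numeric_cols] <= q3 + 1.5 * iqr)
    out = out[inside.all(axis=1)].copy()

    # 4. Encode: integer codes for label columns (unknown values become -1),
    #    indicator columns for one-hot columns.
    for col, levels in label_cols.items():
        out[col] = pd.Categorical(out[col], categories=levels).codes
    out = pd.get_dummies(out, columns=list(onehot_cols), dtype=int)

    # 5. Scale numerical columns to zero mean and unit variance.
    out[numeric_cols] = StandardScaler().fit_transform(out[numeric_cols])
    return out


# Toy data: row 4 duplicates row 0, row 3 has an extreme salary.
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 25, 38, 29],
    "salary": [50_000.0, 52_000.0, 51_000.0, 900_000.0, 50_000.0, 49_000.0, 53_000.0],
    "city": ["Oslo", "Rome", None, "Rome", "Oslo", "Oslo", "Rome"],
    "priority": ["low", "high", "medium", "low", "low", "low", "high"],
})
clean = prepare(
    raw,
    numeric_cols=["age", "salary"],
    onehot_cols=["city"],
    label_cols={"priority": ["low", "medium", "high"]},
)
```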
### Feature Engineering
| Feature | Type | Description |
|---------|------|-------------|
| age_group | Categorical | Binned age (0-18, 19-35, etc.) |
| income_percentile | Numerical | Percentile rank of income |
| has_phone | Boolean | Derived from phone field |
| price_per_unit | Numerical | price / quantity |
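
The four derived features could be computed roughly as follows. This is a sketch under assumptions: the table only gives the first two age bins, so the remaining bin edges are invented for illustration, `has_phone` is taken to mean "non-empty phone value", and a zero `quantity` is mapped to a missing `price_per_unit` instead of infinity.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered columns listed in the table (hypothetical helper)."""
    out = df.copy()
    # Bin edges after 35 are assumptions; the source only lists 0-18 and 19-35.
    out["age_group"] = pd.cut(
        out["age"],
        bins=[0, 18, 35, 60, np.inf],
        labels=["0-18", "19-35", "36-60", "61+"],
        include_lowest=True,
    )
    # Percentile rank of each income within the column, on a 0-100 scale.
    out["income_percentile"] = out["income"].rank(pct=True) * 100
    # True when the phone field holds a non-blank value.
    out["has_phone"] = out["phone"].fillna("").astype(str).str.strip().ne("")
    # Avoid division by zero by treating zero quantity as missing.
    out["price_per_unit"] = out["price"] / out["quantity"].replace(0, np.nan)
    return out


people = pd.DataFrame({
    "age": [17, 19, 35, 62],
    "income": [30_000, 60_000, 45_000, 90_000],
    "phone": ["555-0100", None, "", "555-0199"],
    "price": [10.0, 20.0, 9.0, 8.0],
    "quantity": [2, 4, 0, 1],
})
features = add_features(people)
```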
### Final Dataset
- **Final Rows:** 97,426
- **Final Features:** 68
- **Train Size:** 77,941 (80%)
- **Test Size:** 19,485 (20%)
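
An 80/20 split with these row counts can be sketched with scikit-learn's `train_test_split`. The `target` column, the synthetic frame, and the `random_state` value are assumptions for illustration. The training size is passed as an explicit row count (80% of 97,426 rounded to the nearest row) because a fractional `test_size` is rounded up by scikit-learn, which would shift one row from train to test.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 97_426
n_train = round(0.8 * n_rows)  # nearest whole row to 80%

# Synthetic stand-in for the prepared dataset; 15,234 rows are labelled positive.
data = pd.DataFrame({
    "feature": np.arange(n_rows),
    "target": (np.arange(n_rows) < 15_234).astype(int),
})

# Stratifying on the label keeps the class ratio the same in both parts.
train, test = train_test_split(
    data,
    train_size=n_train,
    stratify=data["target"],
    random_state=42,
)
```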
### Class Distribution
| Class | Count | Percentage |
|-------|-------|------------|
| Positive | 15,234 | 15.6% |
| Negative | 82,192 | 84.4% |
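
The percentages follow directly from the counts. As a small worked check, assuming the labels live in a single pandas Series (the `labels` name is illustrative):

```python
import pandas as pd

# Rebuild a label column with the counts from the table above.
labels = pd.Series(["Positive"] * 15_234 + ["Negative"] * 82_192, name="class")

counts = labels.value_counts()
distribution = pd.DataFrame({
    "count": counts,
    "percentage": (counts / counts.sum() * 100).round(1),
})

# Ratio of majority to minority class, a quick measure of imbalance.
imbalance_ratio = counts.max() / counts.min()
```

A ratio above 5 to 1, as here, is the reason class balancing appears under the next steps.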
## Saved Files
- train.csv (77,941 rows)
- test.csv (19,485 rows)
- preprocessing_pipeline.pkl
- feature_definitions.json
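
Writing these four files might look like the sketch below. The `save_artifacts` helper, the temporary output directory, the use of `pickle` for the `.pkl` file, and the layout of the feature definitions are all assumptions; a fitted `StandardScaler` stands in for the full preprocessing pipeline.

```python
import json
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import StandardScaler


def save_artifacts(train, test, pipeline, feature_definitions, out_dir):
    """Write the datasets, fitted pipeline, and feature metadata to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    train.to_csv(out / "train.csv", index=False)
    test.to_csv(out / "test.csv", index=False)
    with open(out / "preprocessing_pipeline.pkl", "wb") as fh:
        pickle.dump(pipeline, fh)
    with open(out / "feature_definitions.json", "w", encoding="utf-8") as fh:
        json.dump(feature_definitions, fh, indent=2)
    return sorted(p.name for p in out.iterdir())


train_df = pd.DataFrame({"age": [25.0, 32.0, 41.0], "target": [0, 1, 0]})
test_df = pd.DataFrame({"age": [29.0], "target": [0]})
scaler = StandardScaler().fit(train_df[["age"]])  # fitted on training rows only
definitions = {"age": {"type": "numerical", "description": "Scaled age"}}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_artifacts(train_df, test_df, scaler, definitions, tmp)
    # Reload to confirm the artifacts round-trip.
    with open(Path(tmp) / "preprocessing_pipeline.pkl", "rb") as fh:
        restored = pickle.load(fh)
    reloaded_rows = len(pd.read_csv(Path(tmp) / "train.csv"))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.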
## Next Steps
- [ ] Balance classes (SMOTE or undersampling)
- [ ] Feature selection to reduce dimensionality
- [ ] Try dimensionality reduction (PCA)
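
Two of these follow-ups can be sketched briefly: random undersampling of the majority class and PCA. The `undersample` helper, the synthetic features, and the 95% variance threshold are assumptions for illustration. SMOTE is not shown because it relies on a separate package. Resampling would normally be applied to the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def undersample(df, target, random_state=0):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[target].value_counts().min()
    return df.groupby(target).sample(n=n_min, random_state=random_state)


# Synthetic training data: five features driven by two underlying signals,
# with a 30 / 170 class imbalance.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
train = pd.DataFrame({
    "f0": base[:, 0],
    "f1": base[:, 1],
    "f2": base[:, 0] + base[:, 1],
    "f3": 2 * base[:, 0],
    "f4": base[:, 1] - base[:, 0],
    "target": [1] * 30 + [0] * 170,
})

balanced = undersample(train, "target")

# Keep the smallest number of components that explains at least 95% of variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(balanced.drop(columns="target"))
```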