Enforces structured, highly documented storage for code and data projects. Use when working on machine learning scripts, data processing, code creation, or script modification that should preserve clear structure and documentation.
Ensures all created or processed content follows strict organizational and documentation standards with structured storage, comprehensive comments, and complete project documentation.
Use this skill for tasks like:
- Creating machine learning or model training scripts
- Writing data processing or cleaning pipelines
- Creating new code projects
- Modifying existing scripts while preserving their structure and documentation

Required inputs: when modifying an existing project, first read and understand its original structure.
## 1. Structured Directory Layout

```
project-name/
├── README.md              # Project overview and directory guide
├── src/                   # Source code with detailed comments
│   ├── main.py            # Main entry point
│   └── utils.py           # Utility functions
├── data/                  # Data files
│   ├── raw/               # Original data
│   ├── processed/         # Cleaned/transformed data
│   └── DATA_DICTIONARY.md # Data field descriptions
├── docs/                  # Documentation
│   ├── PROCESS.md         # Step-by-step process description
│   └── CHANGELOG.md       # Modification history
├── outputs/               # Results, models, reports
└── requirements.txt       # Dependencies
```
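The layout above can be scaffolded programmatically. A minimal Python sketch (the `scaffold` helper and the `STRUCTURE` table are illustrative, not part of the skill itself):

```python
from pathlib import Path

# Directories and stub files of the standard layout (mirrors the tree above).
STRUCTURE = {
    "dirs": ["src", "data/raw", "data/processed", "docs", "outputs"],
    "files": [
        "README.md",
        "src/main.py",
        "src/utils.py",
        "data/DATA_DICTIONARY.md",
        "docs/PROCESS.md",
        "docs/CHANGELOG.md",
        "requirements.txt",
    ],
}


def scaffold(root):
    """Create the standard directory layout (empty stubs) under `root`."""
    base = Path(root)
    for d in STRUCTURE["dirs"]:
        (base / d).mkdir(parents=True, exist_ok=True)
    for f in STRUCTURE["files"]:
        path = base / f
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)
    return base
```

The stub files still need real content; the scaffold only guarantees nothing required is forgotten.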
## 2. Code Documentation Standards

Every module needs a top-level docstring, every function needs a docstring covering its arguments, return value, and process, and non-obvious logic needs inline comments (see Pattern 3 below for a template).
## 3. Required Documentation Files

`README.md` must include:
- Project overview and purpose
- Directory structure guide
- Usage instructions

`PROCESS.md` must include:
- A step-by-step description of the process and the rationale for each step

`DATA_DICTIONARY.md` (for data projects) must include:
- Descriptions of every data field, before and after processing

`CHANGELOG.md` (for modifications) must include:
- What changed, where (files and line ranges), and why
- The list of files affected
## 4. Modification Protocol

When modifying existing structured projects:
1. Read the existing code and documentation to understand the current structure
2. Preserve the directory layout, naming, and commenting style
3. Record every change in `docs/CHANGELOG.md` (what, where, and why)
4. Update any documentation files affected by the change
## Pattern 1: ML Training Project Structure

```
ml-training-project/
├── README.md                 # Project overview
├── src/
│   ├── train.py              # Training script with detailed comments
│   ├── model.py              # Model architecture
│   ├── data_loader.py        # Data loading utilities
│   └── evaluate.py           # Evaluation metrics
├── data/
│   ├── raw/                  # Original datasets
│   ├── processed/            # Preprocessed data
│   └── DATA_DICTIONARY.md    # Feature descriptions
├── models/                   # Saved model checkpoints
├── logs/                     # Training logs
├── docs/
│   ├── TRAINING_PROCESS.md   # Training methodology
│   └── MODEL_ARCHITECTURE.md # Model design decisions
└── requirements.txt
```
## Pattern 2: Data Cleaning Project Structure

```
data-cleaning-project/
├── README.md
├── src/
│   ├── clean.py            # Main cleaning script
│   ├── validators.py       # Data validation functions
│   └── transformers.py     # Transformation utilities
├── data/
│   ├── raw/                # Original data
│   ├── processed/          # Cleaned data
│   ├── DATA_DICTIONARY.md  # Field descriptions
│   └── QUALITY_REPORT.md   # Data quality metrics
├── docs/
│   └── CLEANING_PROCESS.md # Cleaning steps and rationale
└── requirements.txt
```
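The metrics in `QUALITY_REPORT.md` can be generated directly from the data. A minimal sketch (the `quality_report` helper is hypothetical; real validity checks such as type, range, and duplicate tests would be project-specific):

```python
import pandas as pd


def quality_report(df):
    """Render per-column completeness as a Markdown table for QUALITY_REPORT.md."""
    lines = [
        "# Data Quality Report",
        "",
        "| Column | Completeness | Dtype |",
        "|--------|--------------|-------|",
    ]
    for col in df.columns:
        # Completeness = share of non-null values in the column
        completeness = df[col].notna().mean()
        lines.append(f"| {col} | {completeness:.1%} | {df[col].dtype} |")
    return "\n".join(lines)
```

Writing the report from `clean.py` keeps the quality metrics in sync with the cleaned data instead of relying on manual updates.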
## Pattern 3: Code Comment Template

```python
"""
Module: data_processor.py
Purpose: Process and transform raw sensor data into analysis-ready format.

Main components:
- DataLoader: Reads raw CSV files
- DataCleaner: Handles missing values and outliers
- DataTransformer: Applies normalization and feature engineering
"""
import pandas as pd


def clean_sensor_data(df, threshold=0.95):
    """
    Clean sensor data by removing outliers and handling missing values.

    Args:
        df (pd.DataFrame): Raw sensor data with columns [timestamp, sensor_id, value]
        threshold (float): Completeness threshold (0-1) for keeping sensors

    Returns:
        pd.DataFrame: Cleaned data with outliers removed and missing values imputed

    Process:
        1. Remove sensors below the completeness threshold (default 0.95, i.e. >5% missing)
        2. Detect outliers using the IQR method (1.5 * IQR)
        3. Impute remaining missing values with forward fill
    """
    # Remove sensors with insufficient data.
    # Threshold of 0.95 means a sensor must have 95% valid readings;
    # completeness is measured per sensor (valid readings / that sensor's rows).
    completeness = df.groupby('sensor_id')['value'].apply(lambda s: s.notna().mean())
    valid_sensors = completeness[completeness >= threshold].index
    df = df[df['sensor_id'].isin(valid_sensors)]

    # Detect and remove outliers using the IQR method.
    # Missing values are kept here so they can be imputed in the next step.
    Q1 = df['value'].quantile(0.25)
    Q3 = df['value'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR  # Standard outlier detection threshold
    upper_bound = Q3 + 1.5 * IQR
    df = df[df['value'].between(lower_bound, upper_bound) | df['value'].isna()]

    # Forward fill remaining missing values within each sensor's series.
    # Assumes temporal continuity in sensor readings.
    df = df.sort_values(['sensor_id', 'timestamp'])
    df['value'] = df.groupby('sensor_id')['value'].ffill()
    return df
```
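To see the IQR rule from step 2 on concrete numbers (a toy series, not real sensor data):

```python
import pandas as pd

# Toy readings: 100 is an obvious outlier among values near 10-13
values = pd.Series([10, 12, 11, 13, 12, 11, 100])
q1, q3 = values.quantile(0.25), values.quantile(0.75)  # 11.0 and 12.5
iqr = q3 - q1                                          # 1.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # 8.75 and 14.75
kept = values[values.between(lower, upper)]            # drops only 100
```

Because the bounds are derived from quartiles, a single extreme reading does not widen them the way it would inflate a mean-and-standard-deviation rule.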
## Pattern 4: CHANGELOG.md Entry Template

```markdown
## [Version 1.2.0] - 2026-01-19

### Changed
- Modified `train.py:45-67` to add early stopping mechanism
  - Reason: Prevent overfitting on small validation sets
  - Added `patience` parameter (default=10 epochs)
  - Monitors validation loss instead of training loss

### Added
- New function `evaluate.py:calculate_confusion_matrix()`
  - Provides detailed classification metrics
  - Outputs confusion matrix visualization

### Fixed
- Fixed data loader bug in `data_loader.py:123`
  - Issue: Incorrect handling of missing timestamps
  - Solution: Added explicit timestamp validation and interpolation

### Files Affected
- `src/train.py` (lines 45-67, 89-92)
- `src/evaluate.py` (new function added)
- `src/data_loader.py` (line 123)
- `docs/TRAINING_PROCESS.md` (updated early stopping section)
```
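Entries in this format can also be prepended mechanically, keeping the newest version on top. A sketch assuming a hypothetical `add_changelog_entry` helper:

```python
from datetime import date
from pathlib import Path


def add_changelog_entry(path, version, changes):
    """Prepend a new entry to CHANGELOG.md.

    `changes` maps section names ("Added", "Changed", "Fixed") to bullet lists.
    """
    body = [f"## [Version {version}] - {date.today().isoformat()}"]
    for section, items in changes.items():
        body.append(f"### {section}")
        body.extend(f"- {item}" for item in items)
    entry = "\n".join(body) + "\n\n"
    p = Path(path)
    existing = p.read_text() if p.exists() else ""
    p.write_text(entry + existing)  # newest entry first
```

The per-bullet details (reason, line ranges, affected files) still need to be written by hand; the helper only enforces the entry skeleton and ordering.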
Input: "Create a script to train a neural network for image classification"

Steps:
1. `src/train.py` with comprehensive docstrings and inline comments
2. `README.md` with project overview and directory structure
3. `docs/TRAINING_PROCESS.md` describing training methodology
4. `docs/MODEL_ARCHITECTURE.md` explaining model design
5. `requirements.txt` with all dependencies

Expected output: Complete project structure with all documentation files, heavily commented code, and clear organization.
Input: "Write a script to clean customer transaction data"

Steps:
1. `src/clean.py` with detailed comments explaining each cleaning step
2. `data/DATA_DICTIONARY.md` describing all fields before and after cleaning
3. `docs/CLEANING_PROCESS.md` with step-by-step cleaning methodology
4. `data/QUALITY_REPORT.md` with data quality metrics (completeness, validity)
5. `README.md` with usage instructions and directory guide
6. `requirements.txt`

Expected output: Structured project with comprehensive documentation of data transformations and quality metrics.
Input: "Update the training script to add learning rate scheduling"

Steps:
1. Read `src/train.py` to understand the current implementation
2. Update `docs/TRAINING_PROCESS.md` with a new scheduling section

Expected output: Modified code with preserved structure, updated documentation, and a comprehensive change log.
- `references/documentation-standards.md`: Detailed documentation requirements
- `references/directory-templates.md`: Standard directory structures for different project types
- `references/comment-guidelines.md`: Code commenting best practices
- `assets/templates/`: Ready-to-use project templates