CRISP-DM 4.2 — Generate Test Design. Defines train/validation/test splitting strategy, cross-validation approach, evaluation metrics, baseline definition, and experiment tracking plan. Produces a structured test design document in docs/crisp-dm/4-modeling/.
Phase: 4. Modeling | Task: 4.2 Generate Test Design
"Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity. The test design also needs to consider the appropriate evaluation criteria."
This skill defines the experimental framework for model building: how data is split, how models are evaluated, what metrics matter, and how experiments are tracked. It ensures rigorous, reproducible evaluation that prevents data leakage and aligns technical metrics with business success criteria.
This skill produces two artifacts:
- `notebooks/4.2-test-design.ipynb` — contains splitting strategy implementation, split validation code, distribution checks across splits, inline outputs, and markdown narrative. This is the working artifact where the test design is implemented and validated.
- `docs/crisp-dm/4-modeling/4.2-test-design.md` — a structured summary of the test design extracted from the notebook. This is the CRISP-DM documentation artifact.

Before starting, check whether the output artifacts already exist:

- `notebooks/4.2-test-design.ipynb` (the primary notebook)
- `docs/crisp-dm/4-modeling/4.2-test-design.md` (the summary document)

Also check whether the prerequisite documents exist:
- `docs/crisp-dm/1-business-understanding/1.3-data-mining-goals.md`
- `docs/crisp-dm/4-modeling/4.1-modeling-techniques.md`
- `docs/crisp-dm/2-data-understanding/2.3-data-exploration.md`
- `docs/crisp-dm/2-data-understanding/2.2-data-description.md`
From the ingested documents, extract:
Present what was extracted and what is missing. Ask the user to fill gaps.
Based on the problem type and data characteristics, design the data splitting approach:
For time series problems:
For non-time-series problems:
Document:
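For the time-series case, the splitting approach above can be sketched as a simple date-cutoff split. This is a minimal sketch, not the required implementation; the column name and cutoff dates are placeholders to adapt to the actual dataset:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, date_col: str, val_start: str, test_start: str):
    """Split a time-indexed frame into train/val/test by date cutoffs.

    Rows strictly before val_start go to train, before test_start to
    validation, and the rest to test. No shuffling: time order is preserved.
    """
    dates = pd.to_datetime(df[date_col])
    train = df[dates < val_start]
    val = df[(dates >= val_start) & (dates < test_start)]
    test = df[dates >= test_start]
    return train, val, test

# Toy frame: 10 daily records
df = pd.DataFrame({"date": pd.date_range("2026-01-01", periods=10, freq="D"),
                   "y": range(10)})
train, val, test = temporal_split(df, "date", "2026-01-07", "2026-01-09")
```

Because the cutoffs are explicit dates rather than row fractions, the split is deterministic and easy to document in the split-definition table.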
Map data mining success criteria from 1.3 to specific evaluation metrics:
For each metric, document:
Also define:
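When a metric is not available off the shelf, the test design should pin down its exact formula in code. As an illustration only (the choice of wMAPE here is an assumption, not a recommendation), a custom metric can be implemented and sanity-checked in a few lines:

```python
import numpy as np

def wmape(y_true, y_pred) -> float:
    """Weighted MAPE: total absolute error divided by total actuals.

    Unlike plain MAPE it stays defined when individual actuals are zero,
    as long as the actuals do not all sum to zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum())

# Sanity check against a hand-computed value:
# errors |10-12| + |0-1| + |5-4| = 4; actuals sum to 15 -> 4/15
score = wmape([10, 0, 5], [12, 1, 4])
```

Pinning the formula down like this is what makes the "Implementation" row of the metric table verifiable rather than aspirational.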
Document how experiments will be logged in MLflow:
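The tracking plan can be made concrete as one structured record per run. The categories below mirror MLflow's params/metrics/tags grouping and the naming convention defined later in this document, but the specific keys and values are illustrative assumptions:

```python
from datetime import date

technique, variant = "lgbm", "default"  # illustrative values
run_record = {
    # Run name follows the [technique]-[variant]-[timestamp] convention
    "run_name": f"{technique}-{variant}-{date(2026, 3, 30):%Y%m%d}",
    "params": {
        "data_version": "dvc:<hash>",          # placeholder for the DVC hash
        "split_config": "temporal-70/15/15",
        "random_seed": 42,
    },
    "metrics": {"val_wmape": None, "test_wmape": None},  # filled after evaluation
    "tags": {"phase": "4.modeling", "technique_family": "gbm"},
}
```

Each field of this record maps onto one `mlflow.log_*` call family, so the plan doubles as a checklist for the actual logging code.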
Present the full test design to the user and ask:
Wait for the user's response before finalizing.
After gathering all information, first create the Jupyter notebook at notebooks/4.2-test-design.ipynb using the NotebookEdit tool. The notebook is the primary artifact — the splitting strategy implementation and validation code live here.
Notebook structure:
Use the NotebookEdit tool to create and populate the notebook cell by cell. Run code cells to generate outputs inline.
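One validation cell the notebook should contain is a leakage check confirming the splits are disjoint and ordered in time. A minimal sketch (the helper names and toy frames are illustrative; the real cell would receive the actual splits from earlier cells):

```python
import pandas as pd

def check_splits(train, val, test, date_col="date"):
    """Fail loudly if the splits overlap in time, else print proportions."""
    assert train[date_col].max() < val[date_col].min(), "train/val overlap"
    assert val[date_col].max() < test[date_col].min(), "val/test overlap"
    total = len(train) + len(val) + len(test)
    print(f"OK: {total} records ({len(train)/total:.0%} train / "
          f"{len(val)/total:.0%} val / {len(test)/total:.0%} test)")

def toy(start, n):
    # Stand-in frame for a real split, carrying only the date column.
    return pd.DataFrame({"date": pd.date_range(start, periods=n)})

check_splits(toy("2026-01-01", 7), toy("2026-01-08", 2), toy("2026-01-10", 1))
```

Running a cell like this inline gives the notebook a visible, reproducible proof that the split definition in the summary document actually holds for the data.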
Then create the summary document.
mkdir -p docs/crisp-dm/4-modeling
Write the file docs/crisp-dm/4-modeling/4.2-test-design.md using this template:
# 4.2 Test Design
> **Project:** [project name]
> **Date:** [current date]
> **CRISP-DM Phase:** 4. Modeling
> **Status:** Draft | Review | Approved
---
## Design Overview
- **Problem Type:** [time series forecasting / regression / classification]
- **Data Date Range:** [start] to [end]
- **Total Records:** [count]
- **Granularity:** [per store, per day, per section, etc.]
- **Forecast Horizon:** [N days/weeks ahead]
- **Selected Techniques:** [from 4.1]
---
## Data Splitting Strategy
### Approach: [Temporal Split / Expanding Window CV / Stratified K-Fold / etc.]
**Rationale:** [why this approach was chosen — must reference the problem type and data characteristics]
### Split Definition
| Set | Date Range / Criteria | Records | Proportion | Purpose |
|-----|----------------------|---------|------------|---------|
| Training | [range/criteria] | [count] | [%] | Model fitting |
| Validation | [range/criteria] | [count] | [%] | Hyperparameter tuning & model selection |
| Test | [range/criteria] | [count] | [%] | Final unbiased evaluation |
### Cross-Validation Design
- **Strategy:** [expanding window / sliding window / grouped K-fold / etc.]
- **Number of folds/windows:** [N]
- **Window sizes:** Training = [size], Validation = [size], Gap = [size if applicable]
- **Group handling:** [how store/section groups are handled]
[ASCII diagram showing the CV windows/folds]
### Data Leakage Prevention
| Risk | Mitigation |
|------|-----------|
| Future information in features | [how temporal features are computed — always using only past data] |
| Target leakage | [features excluded or lagged appropriately] |
| Preprocessing leakage | [fit on training only, transform validation/test] |
| Group leakage | [all records for an entity in same split] |
---
## Evaluation Metrics
### Primary Metric
| Attribute | Value |
|-----------|-------|
| **Name** | [e.g., MAE, RMSE, MAPE, wMAPE] |
| **Formula** | [mathematical formula] |
| **Why chosen** | [connection to business objective from 1.1/1.3] |
| **Success threshold** | [from 1.3] |
| **Aggregation** | [overall / per store / per section] |
| **Implementation** | [library.function or custom code reference] |
### Secondary Metrics
| Metric | Formula | Why Included | Threshold |
|--------|---------|-------------|-----------|
| [metric] | [formula] | [rationale] | [threshold if any] |
### Business Translation
| Technical Metric | Business Meaning | Example |
|-----------------|-----------------|---------|
| [e.g., MAE of 5 units] | [e.g., workforce misallocated by ~30 min/day] | [concrete scenario] |
---
## Baseline Definition
| Attribute | Value |
|-----------|-------|
| **Technique** | [e.g., seasonal naive — last week's same day] |
| **Expected performance** | [estimated metric values based on EDA] |
| **Minimum improvement** | [how much better a model must be to justify deployment] |
| **Justification** | [why this is the right baseline — not too easy, not too hard] |
---
## Experiment Tracking Plan (MLflow)
### Naming Conventions
- **Experiment name:** `[project-name]-[phase]` (e.g., `store-capacity-modeling`)
- **Run name:** `[technique]-[variant]-[timestamp]` (e.g., `lgbm-default-20260330`)
### What to Log
| Category | Items |
|----------|-------|
| **Parameters** | [list: hyperparameters, data version, split config, feature set version] |
| **Metrics** | [list: all evaluation metrics at all aggregation levels] |
| **Artifacts** | [list: trained model, feature importance, residual plots, prediction samples] |
| **Tags** | [list: technique family, phase, data version, author] |
### Experiment Comparison
- Models are compared on the **validation set** during development
- The **test set** is used only for the final selected model(s)
- All comparisons must include the baseline run as reference
---
## Reproducibility Requirements
- [ ] Random seed: [value, e.g., 42] — used consistently across all experiments
- [ ] Data version: tracked via DVC hash
- [ ] Code version: tracked via git commit SHA
- [ ] Environment: tracked via `requirements.txt` or `conda.yml`
- [ ] Split logic: deterministic given the above
---
## To Be Clarified
[List any items that need further investigation or domain expert input. Remove this section if everything is complete.]
---
## Source Documents
- 1.3 Data Mining Goals: `docs/crisp-dm/1-business-understanding/1.3-data-mining-goals.md`
- 4.1 Modeling Techniques: `docs/crisp-dm/4-modeling/4.1-modeling-techniques.md`
- 2.3 Data Exploration: `docs/crisp-dm/2-data-understanding/2.3-data-exploration.md`
- 2.2 Data Description: `docs/crisp-dm/2-data-understanding/2.2-data-description.md`
---
## Sign-off
| Role | Name | Date | Status |
|------|------|------|--------|
| Lead Data Scientist | | | Pending |
| Domain Expert | | | Pending |
After writing both artifacts, present a summary:
Test Design complete. Two artifacts created:
- Notebook: `notebooks/4.2-test-design.ipynb` — splitting and validation code with inline outputs
- Summary: `docs/crisp-dm/4-modeling/4.2-test-design.md` — structured report

Summary:
- Splitting strategy: [approach — e.g., temporal split with expanding window CV]
- Training period: [range], Validation: [range], Test: [range]
- Primary metric: [metric] (threshold: [value])
- Baseline: [technique name]
- [N] data leakage mitigations documented
- [N] items still to be clarified (if any)
Next step in CRISP-DM: Run `/build-model` to start building the baseline and candidate models according to this test design (Task 4.3).
Also update the CRISP-DM phase tracker in .claude/CLAUDE.md to add the 4.2 artifact link.
Before finalizing, verify:
- `notebooks/4.2-test-design.ipynb` exists with splitting and validation code and inline outputs