CRISP-DM 4.2 — Generate Test Design. Defines train/validation/test splitting strategy, cross-validation approach, evaluation metrics, baseline definition, and experiment tracking plan. Produces a structured test design document in docs/crisp-dm/4-modeling/.
Phase: 4. Modeling | Task: 4.2 Generate Test Design
"Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity. The test design also needs to consider the appropriate evaluation criteria."
This skill defines the experimental framework for model building: how data is split, how models are evaluated, what metrics matter, and how experiments are tracked. It ensures rigorous, reproducible evaluation that prevents data leakage and aligns technical metrics with business success criteria.
This skill produces two artifacts:
- `notebooks/4.2-test-design.ipynb` — contains splitting strategy implementation, split validation code, distribution checks across splits, inline outputs, and markdown narrative. This is the working artifact where the test design is implemented and validated.
- `docs/crisp-dm/4-modeling/4.2-test-design.md` — a structured summary of the test design extracted from the notebook. This is the CRISP-DM documentation artifact.

Before starting, check whether the output artifacts already exist:

- `notebooks/4.2-test-design.ipynb` (the primary notebook)
- `docs/crisp-dm/4-modeling/4.2-test-design.md` (the summary document)

Also check whether the prerequisite documents exist:
- `docs/crisp-dm/1-business-understanding/1.3-data-mining-goals.md`
- `docs/crisp-dm/4-modeling/4.1-modeling-techniques.md`
- `docs/crisp-dm/2-data-understanding/2.3-data-exploration.md`
- `docs/crisp-dm/2-data-understanding/2.2-data-description.md`
From the ingested documents, extract:
Present what was extracted and what is missing. Ask the user to fill gaps.
Based on the problem type and data characteristics, design the data splitting approach:
For time series problems:
For non-time-series problems:
Document:
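For the time-series case, the splitting approach above can be sketched as a simple date-cutoff split. This is a minimal sketch, not the required implementation; the column name and cutoff dates are placeholders to adapt to the actual dataset:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, date_col: str, val_start: str, test_start: str):
    """Split a time-indexed frame into train/val/test by date cutoffs.

    Rows strictly before val_start go to train, before test_start to
    validation, and the rest to test. No shuffling: time order is preserved.
    """
    dates = pd.to_datetime(df[date_col])
    train = df[dates < val_start]
    val = df[(dates >= val_start) & (dates < test_start)]
    test = df[dates >= test_start]
    return train, val, test

# Toy frame: 10 daily records
df = pd.DataFrame({"date": pd.date_range("2026-01-01", periods=10, freq="D"),
                   "y": range(10)})
train, val, test = temporal_split(df, "date", "2026-01-07", "2026-01-09")
```

Because the cutoffs are explicit dates rather than row fractions, the split is deterministic and easy to document in the split-definition table.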
Map data mining success criteria from 1.3 to specific evaluation metrics:
For each metric, document:
Also define:
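When a metric is not available off the shelf, the test design should pin down its exact formula in code. As an illustration only (the choice of wMAPE here is an assumption, not a recommendation), a custom metric can be implemented and sanity-checked in a few lines:

```python
import numpy as np

def wmape(y_true, y_pred) -> float:
    """Weighted MAPE: total absolute error divided by total actuals.

    Unlike plain MAPE it stays defined when individual actuals are zero,
    as long as the actuals do not all sum to zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum())

# Sanity check against a hand-computed value:
# errors |10-12| + |0-1| + |5-4| = 4; actuals sum to 15 -> 4/15
score = wmape([10, 0, 5], [12, 1, 4])
```

Pinning the formula down like this is what makes the "Implementation" row of the metric table verifiable rather than aspirational.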
Document how experiments will be logged in MLflow:
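The tracking plan can be made concrete as one structured record per run. The categories below mirror MLflow's params/metrics/tags grouping and the naming convention defined later in this document, but the specific keys and values are illustrative assumptions:

```python
from datetime import date

technique, variant = "lgbm", "default"  # illustrative values
run_record = {
    # Run name follows the [technique]-[variant]-[timestamp] convention
    "run_name": f"{technique}-{variant}-{date(2026, 3, 30):%Y%m%d}",
    "params": {
        "data_version": "dvc:<hash>",          # placeholder for the DVC hash
        "split_config": "temporal-70/15/15",
        "random_seed": 42,
    },
    "metrics": {"val_wmape": None, "test_wmape": None},  # filled after evaluation
    "tags": {"phase": "4.modeling", "technique_family": "gbm"},
}
```

Each field of this record maps onto one `mlflow.log_*` call family, so the plan doubles as a checklist for the actual logging code.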
Present the full test design to the user and ask:
Wait for the user's response before finalizing.
After gathering all information, first create the Jupyter notebook at notebooks/4.2-test-design.ipynb using the NotebookEdit tool. The notebook is the primary artifact — the splitting strategy implementation and validation code live here.
Notebook structure:
Use the NotebookEdit tool to create and populate the notebook cell by cell. Run code cells to generate outputs inline.
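One validation cell the notebook should contain is a leakage check confirming the splits are disjoint and ordered in time. A minimal sketch (the helper names and toy frames are illustrative; the real cell would receive the actual splits from earlier cells):

```python
import pandas as pd

def check_splits(train, val, test, date_col="date"):
    """Fail loudly if the splits overlap in time, else print proportions."""
    assert train[date_col].max() < val[date_col].min(), "train/val overlap"
    assert val[date_col].max() < test[date_col].min(), "val/test overlap"
    total = len(train) + len(val) + len(test)
    print(f"OK: {total} records ({len(train)/total:.0%} train / "
          f"{len(val)/total:.0%} val / {len(test)/total:.0%} test)")

def toy(start, n):
    # Stand-in frame for a real split, carrying only the date column.
    return pd.DataFrame({"date": pd.date_range(start, periods=n)})

check_splits(toy("2026-01-01", 7), toy("2026-01-08", 2), toy("2026-01-10", 1))
```

Running a cell like this inline gives the notebook a visible, reproducible proof that the split definition in the summary document actually holds for the data.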
Then create the summary document.
mkdir -p docs/crisp-dm/4-modeling
Write the file docs/crisp-dm/4-modeling/4.2-test-design.md using this template:
# 4.2 Test Design
> **Project:** [project name]
> **Date:** [current date]
> **CRISP-DM Phase:** 4. Modeling
> **Status:** Draft | Review | Approved
---
## Design Overview
- **Problem Type:** [time series forecasting / regression / classification]
- **Data Date Range:** [start] to [end]
- **Total Records:** [count]
- **Granularity:** [per store, per day, per section, etc.]
- **Forecast Horizon:** [N days/weeks ahead]
- **Selected Techniques:** [from 4.1]
---
## Data Splitting Strategy
### Approach: [Temporal Split / Expanding Window CV / Stratified K-Fold / etc.]
**Rationale:** [why this approach was chosen — must reference the problem type and data characteristics]
### Split Definition
| Set | Date Range / Criteria | Records | Proportion | Purpose |
|-----|----------------------|---------|------------|---------|
| Training | [range/criteria] | [count] | [%] | Model fitting |
| Validation | [range/criteria] | [count] | [%] | Hyperparameter tuning & model selection |
| Test | [range/criteria] | [count] | [%] | Final unbiased evaluation |
### Cross-Validation Design
- **Strategy:** [expanding window / sliding window / grouped K-fold / etc.]
- **Number of folds/windows:** [N]
- **Window sizes:** Training = [size], Validation = [size], Gap = [size if applicable]
- **Group handling:** [how store/section groups are handled]
[ASCII diagram showing the CV windows/folds]
### Data Leakage Prevention
| Risk | Mitigation |
|------|-----------|
| Future information in features | [how temporal features are computed — always using only past data] |
| Target leakage | [features excluded or lagged appropriately] |
| Preprocessing leakage | [fit on training only, transform validation/test] |
| Group leakage | [all records for an entity in same split] |
---
## Evaluation Metrics
### Primary Metric
| Attribute | Value |
|-----------|-------|
| **Name** | [e.g., MAE, RMSE, MAPE, wMAPE] |
| **Formula** | [mathematical formula] |
| **Why chosen** | [connection to business objective from 1.1/1.3] |
| **Success threshold** | [from 1.3] |
| **Aggregation** | [overall / per store / per section] |
| **Implementation** | [library.function or custom code reference] |
### Secondary Metrics
| Metric | Formula | Why Included | Threshold |
|--------|---------|-------------|-----------|
| [metric] | [formula] | [rationale] | [threshold if any] |
### Business Translation
| Technical Metric | Business Meaning | Example |
|-----------------|-----------------|---------|
| [e.g., MAE of 5 units] | [e.g., workforce misallocated by ~30 min/day] | [concrete scenario] |
---
## Baseline Definition
| Attribute | Value |
|-----------|-------|
| **Technique** | [e.g., seasonal naive — last week's same day] |
| **Expected performance** | [estimated metric values based on EDA] |
| **Minimum improvement** | [how much better a model must be to justify deployment] |
| **Justification** | [why this is the right baseline — not too easy, not too hard] |
---
## Experiment Tracking Plan (MLflow)
### Naming Conventions
- **Experiment name:** `[project-name]-[phase]` (e.g., `store-capacity-modeling`)
- **Run name:** `[technique]-[variant]-[timestamp]` (e.g., `lgbm-default-20260330`)
### What to Log
| Category | Items |
|----------|-------|
| **Parameters** | [list: hyperparameters, data version, split config, feature set version] |
| **Metrics** | [list: all evaluation metrics at all aggregation levels] |
| **Artifacts** | [list: trained model, feature importance, residual plots, prediction samples] |
| **Tags** | [list: technique family, phase, data version, author] |
### Experiment Comparison
- Models are compared on the **validation set** during development
- The **test set** is used only for the final selected model(s)
- All comparisons must include the baseline run as reference
---
## Reproducibility Requirements
- [ ] Random seed: [value, e.g., 42] — used consistently across all experiments
- [ ] Data version: tracked via DVC hash
- [ ] Code version: tracked via git commit SHA
- [ ] Environment: tracked via `requirements.txt` or `conda.yml`
- [ ] Split logic: deterministic given the above
---
## To Be Clarified
[List any items that need further investigation or domain expert input. Remove this section if everything is complete.]
---
## Source Documents
- 1.3 Data Mining Goals: `docs/crisp-dm/1-business-understanding/1.3-data-mining-goals.md`
- 4.1 Modeling Techniques: `docs/crisp-dm/4-modeling/4.1-modeling-techniques.md`
- 2.3 Data Exploration: `docs/crisp-dm/2-data-understanding/2.3-data-exploration.md`
- 2.2 Data Description: `docs/crisp-dm/2-data-understanding/2.2-data-description.md`
---
## Sign-off
| Role | Name | Date | Status |
|------|------|------|--------|
| Lead Data Scientist | | | Pending |
| Domain Expert | | | Pending |
After writing both artifacts, present a summary:
Test Design complete. Two artifacts created:
- Notebook: `notebooks/4.2-test-design.ipynb` — splitting and validation code with inline outputs
- Summary: `docs/crisp-dm/4-modeling/4.2-test-design.md` — structured report

Summary:
- Splitting strategy: [approach — e.g., temporal split with expanding window CV]
- Training period: [range], Validation: [range], Test: [range]
- Primary metric: [metric] (threshold: [value])
- Baseline: [technique name]
- [N] data leakage mitigations documented
- [N] items still to be clarified (if any)
Next step in CRISP-DM: Run `/build-model` to start building the baseline and candidate models according to this test design (Task 4.3).
Also update the CRISP-DM phase tracker in .claude/CLAUDE.md to add the 4.2 artifact link.
Before finalizing, verify:
- `notebooks/4.2-test-design.ipynb` exists with splitting and validation code and inline outputs