CRISP-DM 3.1 — Select Data. Decides which datasets, tables, columns, and rows to include for analysis/modeling. Documents inclusion/exclusion rationale, relevance to data mining goals, and any data constraints. Produces a structured data selection report in docs/crisp-dm/3-data-preparation/.
Phase: 3. Data Preparation | Task: 3.1 Select Data
"Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types."
This skill decides which datasets, fields, and records from the collected data (Phase 2) will be carried forward into the modeling dataset. It documents the rationale for every inclusion and exclusion decision.
It produces two artifacts:
- notebooks/3.1-select-data.ipynb: contains all data selection analysis code, coverage computations, inline outputs, and markdown narrative. This is the working artifact where selection decisions are analyzed.
- docs/crisp-dm/3-data-preparation/3.1-select-data.md: a structured summary of the data selection report extracted from the notebook. This is the CRISP-DM documentation artifact.

Before starting, check if output artifacts already exist:
- notebooks/3.1-select-data.ipynb (the primary notebook)
- docs/crisp-dm/3-data-preparation/3.1-select-data.md (the summary document)

Also check prerequisite documents:
- docs/crisp-dm/1-business-understanding/1.3-data-mining-goals.md
- docs/crisp-dm/2-data-understanding/2.1-data-collection.md (if missing, ask the user to run /collect-initial-data first)
- docs/crisp-dm/2-data-understanding/2.2-data-description.md
- docs/crisp-dm/2-data-understanding/2.3-data-exploration.md
- docs/crisp-dm/2-data-understanding/2.4-data-quality.md
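The existence checks above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the helper names `check_paths` and `report` are hypothetical, and the sketch assumes it is run from the project root.

```python
from pathlib import Path

# Paths taken from the lists above.
OUTPUTS = [
    "notebooks/3.1-select-data.ipynb",
    "docs/crisp-dm/3-data-preparation/3.1-select-data.md",
]
PREREQUISITES = [
    "docs/crisp-dm/1-business-understanding/1.3-data-mining-goals.md",
    "docs/crisp-dm/2-data-understanding/2.1-data-collection.md",
    "docs/crisp-dm/2-data-understanding/2.2-data-description.md",
    "docs/crisp-dm/2-data-understanding/2.3-data-exploration.md",
    "docs/crisp-dm/2-data-understanding/2.4-data-quality.md",
]


def check_paths(root, paths):
    """Return {relative_path: exists} for each path under root (hypothetical helper)."""
    root = Path(root)
    return {p: (root / p).is_file() for p in paths}


def report(root="."):
    """Return (outputs that already exist, prerequisites that are missing)."""
    existing_outputs = [p for p, ok in check_paths(root, OUTPUTS).items() if ok]
    missing_prereqs = [p for p, ok in check_paths(root, PREREQUISITES).items() if not ok]
    return existing_outputs, missing_prereqs


existing, missing = report()
print("Existing outputs:", existing)
print("Missing prerequisites:", missing)
```

An empty `existing` list means the task starts fresh; a non-empty `missing` list points to Phase 1 or Phase 2 documents that should be produced first.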
Check if the user provided a reference (via $ARGUMENTS or in conversation):
Read ALL provided sources before proceeding.
Based on the Phase 2 documents and any user input, analyze the available data and propose selections:
Dataset-level analysis:
Field-level analysis:
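One possible field-level check is a first pass for data leakage, feeding the Data Leakage Risks list in the summary. The sketch below is an assumption-laden illustration: the `FieldInfo` metadata, the example field names, and the name-matching heuristic are all hypothetical, and real leakage review still needs domain judgment.

```python
from dataclasses import dataclass


@dataclass
class FieldInfo:
    name: str
    # Hypothetical, hand-filled metadata: is the value available when a prediction is made?
    known_at_prediction_time: bool


def flag_leakage_risks(fields, target):
    """Return (field_name, reason) pairs for fields that may leak target information."""
    risks = []
    for f in fields:
        if f.name == target:
            continue  # the target itself is not a feature
        if not f.known_at_prediction_time:
            risks.append((f.name, "value only known after the target is observed"))
        elif target.lower() in f.name.lower():
            risks.append((f.name, "name suggests it is derived from the target"))
    return risks


# Illustrative field catalog (not real project data).
fields = [
    FieldInfo("sales", True),
    FieldInfo("store_id", True),
    FieldInfo("promo_flag", True),
    FieldInfo("returns_next_week", False),
    FieldInfo("sales_rolling_mean_incl_today", True),
]
risks = flag_leakage_risks(fields, target="sales")
for name, reason in risks:
    print(f"{name}: {reason}")
```

Flagged fields are candidates for review, not automatic exclusions; a rolling mean that stops before the prediction date, for example, may be safe to keep.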
Record-level analysis:
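Record-level criteria such as a date range, a store filter, and a completeness threshold can be applied and counted as in the following sketch. The record layout (`date`, `store_id`, and so on), the threshold value, and the function name `select_records` are assumptions made for illustration; in the notebook the same logic would typically run against the project's actual tables.

```python
from datetime import date


def select_records(records, start, end, stores, min_non_null):
    """Apply the criteria in order and count how many records each one removes."""
    kept = []
    removed = {"date_range": 0, "store_filter": 0, "completeness": 0}
    for r in records:
        if not (start <= r["date"] <= end):
            removed["date_range"] += 1
        elif r["store_id"] not in stores:
            removed["store_filter"] += 1
        elif sum(v is not None for v in r.values()) < min_non_null:
            removed["completeness"] += 1
        else:
            kept.append(r)
    return kept, removed


# Illustrative records (not real project data).
records = [
    {"date": date(2023, 1, 5), "store_id": 1, "sales": 120.0, "promo": 0},
    {"date": date(2022, 12, 30), "store_id": 1, "sales": 90.0, "promo": 1},   # before start
    {"date": date(2023, 2, 1), "store_id": 9, "sales": 50.0, "promo": 0},     # store out of scope
    {"date": date(2023, 3, 1), "store_id": 2, "sales": None, "promo": None},  # too sparse
    {"date": date(2023, 3, 2), "store_id": 2, "sales": 75.0, "promo": None},
]
selected, removed = select_records(
    records, date(2023, 1, 1), date(2023, 12, 31), stores={1, 2}, min_non_null=3
)
print(len(selected), removed)
```

Because the criteria are applied in sequence, each record is attributed to the first criterion it fails, which keeps the "Records Affected" counts from double counting.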
Present the proposed selection in a structured summary:
Dataset Selection:
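| Dataset | Decision | Reason | Data Mining Goal |
|---------|----------|--------|------------------|
| [name] | Include / Exclude | [reason] | [goal from 1.3] |

Field Selection (per included dataset):

| Field | Decision | Role | Reason |
|-------|----------|------|--------|
| [name] | Include / Exclude / Transform | Target / Feature / ID / Context | [reason] |

Record Selection:

| Criterion | Value | Reason | Records Affected |
|-----------|-------|--------|------------------|
| Date range | [start] to [end] | [reason] | [count or %] |
| Store filter | [criteria] | [reason] | [count or %] |

Data Leakage Risks: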
- [list any fields or patterns that could leak target information]
Coverage Analysis:
- Total records after selection: [count] ([%] of original)
- Total fields after selection: [count] ([%] of original)
- Date range: [start] to [end]
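The coverage figures in this summary are simple ratios of selected to original counts. A minimal sketch, assuming records are dictionaries with a `date` key and that the field lists are plain lists of names (both assumptions for illustration):

```python
from datetime import date


def coverage(original, selected, original_fields, selected_fields, date_key="date"):
    """Compute the counts, retention percentages, and date range for the summary."""
    dates = [r[date_key] for r in selected]
    return {
        "records_after": len(selected),
        "records_pct": round(100 * len(selected) / len(original), 1) if original else 0.0,
        "fields_after": len(selected_fields),
        "fields_pct": round(100 * len(selected_fields) / len(original_fields), 1)
        if original_fields
        else 0.0,
        "date_range": (min(dates), max(dates)) if dates else None,
    }


# Illustrative data (not real project data).
original = [{"date": date(2023, 1, d)} for d in (1, 2, 3, 4)]
selected = original[1:]
stats = coverage(
    original,
    selected,
    original_fields=["date", "store_id", "sales", "promo", "notes"],
    selected_fields=["date", "store_id", "sales", "promo"],
)
print(stats)
```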
Ask the user to confirm or adjust the selection.
Wait for the user's response before continuing.
If the user's response raises new questions:
First create the Jupyter notebook at notebooks/3.1-select-data.ipynb using the NotebookEdit tool. The notebook is the primary artifact — all selection analysis code happens here.
Notebook structure:
Use the NotebookEdit tool to create and populate the notebook cell by cell. Run code cells to generate outputs inline.
Then create the summary document. Create the output directory and write the document.
mkdir -p docs/crisp-dm/3-data-preparation
Write the file docs/crisp-dm/3-data-preparation/3.1-select-data.md using this template:
# 3.1 Data Selection Report
> **Project:** [project name]
> **Date:** [current date]
> **CRISP-DM Phase:** 3. Data Preparation
> **Status:** Draft | Review | Approved
---
## Selection Overview
- **Purpose:** [what the selected data will be used for — reference data mining goals]
- **Selection approach:** [top-down from goals / bottom-up from available data / hybrid]
- **Key decisions:** [1-3 sentence summary of the most impactful selection choices]
---
## Dataset Selection
| # | Dataset | Source (from 2.1) | Decision | Rationale | Data Mining Goal (from 1.3) |
|---|---------|-------------------|----------|-----------|---------------------------|
| 1 | [name] | [source] | Include / Exclude | [reason] | [goal] |
### Excluded Datasets
[For each excluded dataset, explain why it was excluded and whether it might be reconsidered later.]
---
## Field Selection
### [Dataset 1 Name]
| # | Field | Data Type | Decision | Role | Rationale |
|---|-------|-----------|----------|------|-----------|
| 1 | [field] | [type] | Include / Exclude / Defer | Target / Feature / ID / Context / Excluded | [reason] |
**Summary:** [N] of [M] fields selected ([%])
[Repeat for each included dataset]
### Data Leakage Assessment
| # | Field / Pattern | Risk | Mitigation |
|---|----------------|------|------------|
| 1 | [field or pattern] | [how it could leak target information] | [exclusion / temporal guard / other] |
---
## Record Selection
### Inclusion Criteria
| # | Criterion | Value | Rationale | Records Affected |
|---|-----------|-------|-----------|-----------------|
| 1 | Date range | [start] to [end] | [reason] | [count or %] |
| 2 | Store filter | [criteria] | [reason] | [count or %] |
| 3 | Completeness threshold | [min fields non-null] | [reason] | [count or %] |
### Exclusion Criteria
| # | Criterion | Value | Rationale | Records Removed |
|---|-----------|-------|-----------|----------------|
| 1 | [anomalous periods] | [dates/conditions] | [reason] | [count or %] |
### Coverage Analysis
- **Total records before selection:** [count]
- **Total records after selection:** [count] ([%] retained)
- **Total fields after selection:** [count] ([%] retained)
- **Date range:** [start] to [end]
- **Stores covered:** [count] out of [total]
- **Sections covered:** [list]
---
## Selection Dependencies
| # | Decision | Depends On | Notes |
|---|----------|-----------|-------|
| 1 | [selection decision] | [what it depends on — e.g., quality remediation in 3.2] | [notes] |
---
## To Be Clarified
[List any selection decisions that could not be finalized. Remove this section if everything is complete.]
---
## Source Documents
- 1.3 Data Mining Goals: `docs/crisp-dm/1-business-understanding/1.3-data-mining-goals.md`
- 2.1 Data Collection: `docs/crisp-dm/2-data-understanding/2.1-data-collection.md`
- 2.2 Data Description: `docs/crisp-dm/2-data-understanding/2.2-data-description.md`
- 2.3 Data Exploration: `docs/crisp-dm/2-data-understanding/2.3-data-exploration.md`
- 2.4 Data Quality: `docs/crisp-dm/2-data-understanding/2.4-data-quality.md`
---
## Sign-off
| Role | Name | Date | Status |
|------|------|------|--------|
| Domain Expert | | | Pending |
| Data Scientist | | | Pending |
After writing both artifacts, present a summary:
Data Selection complete. Two artifacts created:
- Notebook: notebooks/3.1-select-data.ipynb (full selection analysis code with inline outputs)
- Summary: docs/crisp-dm/3-data-preparation/3.1-select-data.md (structured report)

Summary:
- [N] datasets selected out of [M] available
- [N] fields selected out of [M] total
- [N] records retained ([%] of original)
- Date range: [start] to [end]
- [N] data leakage risks identified and mitigated
- [N] items still to be clarified (if any)
Next step in CRISP-DM: Run /clean-data to handle missing values, outliers, and quality issues in the selected data (Task 3.2).
Also update the CRISP-DM phase tracker in .claude/CLAUDE.md to mark "Data Preparation" as "In Progress" and add the 3.1 artifact link.
Before finalizing, verify:
notebooks/3.1-select-data.ipynb with all analysis code and inline outputs