/select-data — CRISP-DM 3.1: Select Data

Phase: 3. Data Preparation | Task: 3.1 Select Data

"Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types."

Purpose

This skill decides which datasets, fields, and records from the collected data (Phase 2) will be carried forward into the modeling dataset. It documents the rationale for every inclusion and exclusion decision. It produces three outputs:

Dataset Selection — which datasets from 2.1 are included/excluded, with reasons
Field Selection — which columns from each dataset are included/excluded, mapped to data mining goals
Record Selection — which rows are included/excluded (date ranges, store subsets, filters), with coverage analysis
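
The three outputs share one rule: every inclusion or exclusion carries a recorded reason. The sketch below shows one way that rule could be enforced in Python. The `Decision` record, the `selected` helper, and the field names (`sales_amount`, `store_id`, `refund_flag`) are hypothetical and used only for illustration; they are not defined by this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One inclusion/exclusion decision (hypothetical structure)."""
    item: str       # dataset, field, or record criterion being decided
    include: bool   # True = carried forward, False = excluded
    rationale: str  # why the item is kept or dropped


def selected(decisions):
    """Return the included item names, rejecting any undocumented decision."""
    for d in decisions:
        if not d.rationale.strip():
            raise ValueError(f"missing rationale for {d.item!r}")
    return [d.item for d in decisions if d.include]


# Illustrative field names only; real ones come from the 2.2 Data Description.
field_decisions = [
    Decision("sales_amount", True, "target variable for the forecasting goal"),
    Decision("store_id", True, "identifier needed to join datasets"),
    Decision("refund_flag", False, "recorded after the sale; leakage risk"),
]

print(selected(field_decisions))
```

Checking the rationale before filtering means an undocumented decision fails loudly instead of silently shaping the modeling dataset.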

Output Location

# 3.1 Data Selection Report

> **Project:** [project name]
> **Date:** [current date]
> **CRISP-DM Phase:** 3. Data Preparation
> **Status:** Draft | Review | Approved

---

## Selection Overview

- **Purpose:** [what the selected data will be used for; reference data mining goals]
- **Selection approach:** [top-down from goals / bottom-up from available data / hybrid]
- **Key decisions:** [1-3 sentence summary of the most impactful selection choices]

---

## Dataset Selection

| # | Dataset | Source (from 2.1) | Decision | Rationale | Data Mining Goal (from 1.3) |
|---|---------|-------------------|----------|-----------|-----------------------------|
| 1 | [name] | [source] | Include / Exclude | [reason] | [goal] |

### Excluded Datasets

[For each excluded dataset, explain why it was excluded and whether it might be reconsidered later.]

---

## Field Selection

### [Dataset 1 Name]

| # | Field | Data Type | Decision | Role | Rationale |
|---|-------|-----------|----------|------|-----------|
| 1 | [field] | [type] | Include / Exclude / Defer | Target / Feature / ID / Context / Excluded | [reason] |

**Summary:** [N] of [M] fields selected ([%])

[Repeat for each included dataset]

### Data Leakage Assessment

| # | Field / Pattern | Risk | Mitigation |
|---|-----------------|------|------------|
| 1 | [field or pattern] | [how it could leak target information] | [exclusion / temporal guard / other] |

---

## Record Selection

### Inclusion Criteria

| # | Criterion | Value | Rationale | Records Affected |
|---|-----------|-------|-----------|------------------|
| 1 | Date range | [start] to [end] | [reason] | [count or %] |
| 2 | Store filter | [criteria] | [reason] | [count or %] |
| 3 | Completeness threshold | [min fields non-null] | [reason] | [count or %] |

### Exclusion Criteria

| # | Criterion | Value | Rationale | Records Removed |
|---|-----------|-------|-----------|-----------------|
| 1 | [anomalous periods] | [dates/conditions] | [reason] | [count or %] |

### Coverage Analysis

- **Total records before selection:** [count]
- **Total records after selection:** [count] ([%] retained)
- **Total fields after selection:** [count] ([%] retained)
- **Date range:** [start] to [end]
- **Stores covered:** [count] out of [total]
- **Sections covered:** [list]

---

## Selection Dependencies

| # | Decision | Depends On | Notes |
|---|----------|------------|-------|
| 1 | [selection decision] | [what it depends on, e.g., quality remediation in 3.2] | [notes] |

---

## To Be Clarified

[List any selection decisions that could not be finalized. Remove this section if everything is complete.]

---

## Source Documents

- 1.3 Data Mining Goals: `docs/crisp-dm/1-business-understanding/1.3-data-mining-goals.md`
- 2.1 Data Collection: `docs/crisp-dm/2-data-understanding/2.1-data-collection.md`
- 2.2 Data Description: `docs/crisp-dm/2-data-understanding/2.2-data-description.md`
- 2.3 Data Exploration: `docs/crisp-dm/2-data-understanding/2.3-data-exploration.md`
- 2.4 Data Quality: `docs/crisp-dm/2-data-understanding/2.4-data-quality.md`

---

## Sign-off

| Role | Name | Date | Status |
|------|------|------|--------|
| Domain Expert | | | Pending |
| Data Scientist | | | Pending |
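
The Coverage Analysis figures in the template (records before and after selection, percentage retained, stores covered) can be derived directly from the inclusion criteria. The following is a rough sketch only, assuming records are plain dictionaries with hypothetical `date` and `store_id` keys; the `coverage` function and the sample rows are invented for illustration and do not come from this skill.

```python
from datetime import date


def coverage(records, start, end, stores):
    """Apply a date-range and store filter, then summarize what was retained."""
    kept = [
        r for r in records
        if start <= r["date"] <= end and r["store_id"] in stores
    ]
    total = len(records)
    return {
        "before": total,
        "after": len(kept),
        # Guard against division by zero on an empty input
        "retained_pct": round(100 * len(kept) / total, 1) if total else 0.0,
        "stores_covered": len({r["store_id"] for r in kept}),
        "stores_total": len({r["store_id"] for r in records}),
    }


# Made-up sample rows; real data would come from the sources listed in 2.1.
records = [
    {"date": date(2023, 1, 5), "store_id": "A"},
    {"date": date(2023, 2, 1), "store_id": "B"},
    {"date": date(2022, 12, 31), "store_id": "A"},  # outside the date range
    {"date": date(2023, 3, 10), "store_id": "C"},   # store not in the filter
]

report = coverage(records, date(2023, 1, 1), date(2023, 12, 31), {"A", "B"})
print(report)
```

Reporting both the filtered count and the original total keeps the effect of each criterion visible to reviewers at sign-off.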

Workflow

Step 1: Check for Existing Artifacts

Step 2: Ingest Source Information

Step 3: Analyze and Propose Selection

Step 4: Present Selection Plan and Ask for Confirmation

Step 5: Clarification Round (if needed)

Step 6: Create the Notebook and Generate the Output Document

Step 7: Summary and Next Steps

Quality Checks
