Use when `analysis/research-question.md` exists but `analysis/data-preparation.md` does not
The student has an approved research question (analysis/research-question.md exists) and is ready to prepare the analytic dataset.
Read analysis/research-question.md. If it does not exist, tell the student:
"Before preparing your data, you need to define your research question. Please use the
research-question skill first." Do not proceed.
If analysis/data-preparation.md already exists, acknowledge it, read its contents, and ask which part the student wants to revisit. Overwrite the existing file at Gate Out.
Check whether analysis/research-question.md contains the dataset schema. If it does, use it and skip this step. Otherwise, ask:
"To help you prepare your analytic dataset, I need to see the variables in each dataset you'll be using. Please paste the output of
str() (R) or PROC CONTENTS (SAS) for each dataset."
Wait for the schema. Do not proceed until it is provided.
Q1 — Unit of observation:
"What is your unit of observation — what does one row represent in your raw data? For example: one person, one clinical visit, one geographic region. And is the unit consistent across all the datasets you're using?"
Good answer: Names the entity and its identifier ("Each row is one participant, identified by PATID. Both datasets use the same structure").
Weak answer: "People" or "patients" — too vague, doesn't address whether the unit is consistent across datasets.
Probing: "What does one row represent in each of your raw datasets? Could a single patient appear in more than one row — for example, one row per visit?"
Concept Block trigger: Student confuses the unit of observation with the unit of analysis (e.g., says "patients" when the raw data has one row per visit).
★ Concept: The unit of observation is what one row represents in the raw data. The unit of analysis is the entity you draw conclusions about. If your raw data has one row per hospital visit but you're studying patients, you'll need to collapse visits to one row per patient before analysis. Mismatching these inflates your sample size and introduces statistical dependence.
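The collapse described above can be sketched in a few lines. The course materials use R or SAS; this pandas sketch is purely illustrative, and the column names (PATID, cancer_dx) are hypothetical.

```python
import pandas as pd

# Hypothetical visit-level data: one row per hospital visit, not per patient.
visits = pd.DataFrame({
    "PATID":     [1, 1, 2, 3, 3, 3],
    "cancer_dx": [0, 1, 0, 0, 0, 1],
})

# Unit of observation (visit) != unit of analysis (patient):
# a naive analysis would count 6 "subjects" instead of 3.
print(len(visits))    # 6 rows = visits

# Collapse to one row per patient before analysis, e.g. "ever diagnosed".
patients = visits.groupby("PATID", as_index=False)["cancer_dx"].max()
print(len(patients))  # 3 rows = patients
```

The aggregation rule (here, `max` for "ever diagnosed") is itself an analytic decision the student should state explicitly in Q4.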
Q2 — Variable sources and merge keys:
"Which variables in your research question come from which datasets? And what is the unique identifier that links your datasets together?"
Good answer: Lists specific column names from each dataset, names the join key, and states whether the key is unique in both datasets ("Exposure smoking_status is in cohort.csv, outcome cancer_dx is in claims.csv, joined on PATID which is unique in both").
Weak answer: "I'll merge them by patient ID" — doesn't confirm which datasets, which columns, or whether the key is unique.
Probing: "Is the join key guaranteed to be unique in both datasets? What do you expect the cardinality to be — one row per patient in each dataset, or could one patient have multiple rows in either?"
Concept Block trigger: Student assumes the merge will produce the right number of rows without checking.
★ Concept: Before merging, always verify whether the join key is unique in each dataset. A one-to-many join (one row in dataset A matches multiple rows in dataset B) silently multiplies rows and inflates your sample size. Check for duplicates on the join key before merging, then check the row count after — it should match your expectation.
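The pre-merge check above can be made concrete. This is an illustrative pandas sketch (students using R/SAS would do the analogous check); the datasets and key are hypothetical, and pandas' `validate` argument enforces the expected cardinality instead of failing silently.

```python
import pandas as pd

cohort = pd.DataFrame({"PATID": [1, 2, 3], "smoking_status": [1, 0, 1]})
claims = pd.DataFrame({"PATID": [1, 1, 2, 3], "cancer_dx": [0, 1, 0, 0]})  # PATID 1 has two rows

# 1. Check uniqueness of the join key in each dataset BEFORE merging.
print(cohort["PATID"].is_unique)  # True
print(claims["PATID"].is_unique)  # False -> a one-to-one merge assumption is wrong

# 2. State the expected cardinality; validate raises MergeError if it is violated,
#    instead of silently multiplying rows.
merged = cohort.merge(claims, on="PATID", how="left", validate="one_to_many")

# 3. Check the row count after the merge: 4 rows, not 3 — the duplicate key
#    inflated the sample, which must be resolved before analysis.
print(len(merged))
```

Had the student asserted `validate="one_to_one"` here, the merge would have raised an error at the duplicate key — exactly the failure mode worth catching early.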
Q3 — Inclusion and exclusion criteria:
"What inclusion and exclusion criteria define your analytic sample? For each criterion, which specific variable captures it, and what value or range defines the cutoff?"
Good answer: Lists criteria with specific variable names and values, in the order they'll be applied ("Include adults: age >= 18. Exclude if missing outcome: !is.na(cancer_dx). Exclude prior cancer: prior_cancer == 0").
Weak answer: "I'll include adults with the disease" — no variable names, no values, no order.
Probing: "Which variable in your dataset captures that criterion? What value or range defines it?"
Concept Block trigger: Student lists criteria without thinking about order or tracking counts.
★ Concept: Apply exclusion criteria in a consistent, pre-specified order and count how many observations are removed at each step. This CONSORT-style flow table is required for your Gate Out summary and for reproducing your analytic sample. The order you apply criteria can affect your final N — document it before you run code.
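A CONSORT-style flow can be produced mechanically once the criteria are pre-specified. A minimal pandas sketch, with hypothetical data and criteria matching the examples above:

```python
import pandas as pd

# Hypothetical raw data; column names are illustrative.
df = pd.DataFrame({
    "age":          [25, 17, 40, 33, 60],
    "cancer_dx":    [0, 1, None, 1, 0],
    "prior_cancer": [0, 0, 0, 1, 0],
})

# Criteria applied in a pre-specified order, counting removals at each step.
steps = [
    ("Starting N",           None),
    ("Adults (age >= 18)",   lambda d: d[d["age"] >= 18]),
    ("Non-missing outcome",  lambda d: d[d["cancer_dx"].notna()]),
    ("No prior cancer",      lambda d: d[d["prior_cancer"] == 0]),
]

flow = []
for label, rule in steps:
    before = len(df)
    if rule is not None:
        df = rule(df)
    flow.append((label, before - len(df), len(df)))

for label, excluded, remaining in flow:
    print(f"{label}: excluded {excluded}, remaining {remaining}")
```

The `flow` list maps directly onto the exclusion-counts table in the Gate Out artifact.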
Ask Q4 for each variable named in analysis/research-question.md (outcome, exposure, confounders, etc.). Work through them one at a time.
Q4 — Variable measurement and transformation:
"For [variable name]: how is it measured in the raw data — what column, what are the possible values, and what is the coding? Does it need any transformation to operationalize the concept as you defined it in your research question?"
Good answer: Names the raw column, shows its coding, and specifies any transformation rule ("Age is in age_days as a numeric integer — I'll divide by 365.25 to get years. Smoking is in smk_status coded 1=never, 2=former, 3=current — I'll dichotomize to 0=never, 1=ever").
Weak answer: "It looks fine" or "it's already coded correctly" without showing the raw values.
Probing: "What are the possible values of that column in the raw data? Does the raw coding match how you described the variable in your research question?"
Concept Block trigger: Student proposes a recode without committing to a rule before seeing the data.
★ Concept: Define all transformation rules before you inspect your results. Changing variable definitions after seeing estimates — even with good intentions — is a form of p-hacking. Write the rule down first: what values map to what, what the cutoffs are, how you handle edge cases.
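Writing the rule down first can literally mean putting the mapping in code before touching the data. An illustrative pandas sketch using the hypothetical coding from the good answer above (`smk_status` 1=never, 2=former, 3=current; `age_days` in days):

```python
import pandas as pd

# The rule, committed before any results are inspected:
# dichotomize smoking to 0=never, 1=ever.
SMOKING_EVER = {1: 0, 2: 1, 3: 1}

df = pd.DataFrame({
    "smk_status": [1, 3, 2, 1],
    "age_days":   [9131, 14610, 23011, 6574],
})

df["smoking_ever"] = df["smk_status"].map(SMOKING_EVER)
df["age_years"] = df["age_days"] / 365.25

# Unmapped raw codes become NaN — surface them instead of recoding silently.
assert df["smoking_ever"].notna().all(), "unexpected raw smoking code"
```

An unexpected raw value (say, a 9 used for "unknown") would trip the assertion, forcing an explicit decision rather than a quiet post-hoc recode.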
Q5 — Missing data assumption and approach:
"Will you be investigating the potential bias from missing data in this analysis? If yes, what assumption will you be making about why data is missing — MCAR, MAR, or MNAR — and how will you handle it?"
Good answer: States the assumption, justifies it with reasoning ("I think missingness in income is related to age and education, both of which I measure — so I'll assume MAR and use multiple imputation"), names the method.
Weak answer: "I'll just drop missing rows" with no justification, or "I don't think there's much missingness" without checking.
Probing: "Why do you think data is missing on that variable? Is the probability of missingness likely to be related to the variable's own value, or only to other measured variables in your dataset?"
Concept Block trigger: Student chooses complete-case analysis without stating the MCAR assumption.
★ Concept: Complete-case analysis (dropping rows with missing data) is only unbiased if data is Missing Completely At Random (MCAR) — meaning missingness is unrelated to any variable, observed or unobserved. If missingness is related to other measured variables (MAR), multiple imputation gives less biased estimates. If missingness is related to the unobserved value itself (MNAR), neither approach fully corrects for bias; sensitivity analyses are needed.
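The probing question above — is missingness related to other measured variables? — has a simple diagnostic. A hedged pandas sketch with fabricated illustrative numbers, where income is missing more often for older participants (an MAR-like pattern):

```python
import pandas as pd

# Hypothetical data: income tends to be missing for older participants.
df = pd.DataFrame({
    "age":    [25, 30, 35, 60, 65, 70],
    "income": [40, 42, 45, None, None, 80],
})

# Quick diagnostic: compare a measured variable across missing/non-missing rows.
df["income_missing"] = df["income"].isna()
print(df.groupby("income_missing")["age"].mean())
# Rows with missing income are much older on average -> MCAR is implausible,
# so complete-case analysis would over-represent younger participants.

complete = df.dropna(subset=["income"])
print(len(complete))  # sample shrinks from 6 to 4
```

This check cannot distinguish MAR from MNAR (that depends on the unobserved values themselves), but it can rule out MCAR, which is exactly the assumption complete-case analysis requires.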
Based on the student's answers, present this structured plan and wait for their explicit approval before writing any code:
Unit of observation: [entity and identifier, e.g. "one row per participant (PATID)"]
Datasets used: [list with file/library names]
Merge strategy: [join type, key, expected cardinality — e.g. "left join on PATID, one-to-one expected"]
Inclusion/exclusion criteria (in order):
1. [criterion] — variable: [name], rule: [value/range]
2. ...
Variable transformations:
- [varname]: [raw coding/values] → [derived coding]
- (none) if no transformations needed
Missing data: [assumption (MCAR/MAR/MNAR) + method (complete case / imputation / sensitivity)]
Do not save the artifact or write any code until the student says "yes" or otherwise explicitly confirms.
Have the student save the frozen analytic dataset under a versioned filename (e.g., `analytic_v2.rds`), then ask the student to confirm it is saved.
Save these to analysis/data-preparation.md:
# Data Preparation Summary
**Frozen analytic file:** [filename and path]
**Final sample size:** [N]
**Exclusion counts:**
| Criterion | N excluded | N remaining |
|---|---|---|
| Starting N | — | [N] |
| [criterion 1] | [N] | [N] |
| [criterion 2] | [N] | [N] |
**Variables created:**
- [variable name]: [derivation description]
**Missing data approach:** [assumption + method]
The next skill (descriptive-analysis) will not begin until this file exists.