Use when a student needs to define their epidemiological research question and `analysis/research-question.md` does not exist
The student has a dataset and needs to formalize their research question before any analysis begins. No prior skill output is required.
Check whether `analysis/research-question.md` exists in the project directory.
Ask the student to paste their dataset schema before any other questions:
"To help you define your research question, I need to see what variables you're working with. Please paste your variable list — names, types, and a few sample values for each."
Wait for the schema. Do not proceed until it is provided.
Q1 — Question type:
"What type of research question is this — descriptive (what is the prevalence/distribution?), predictive (what predicts my outcome?), associational (is X associated with Y?), or causal (does X cause Y)?"
Record the answer. It determines which Phase 3 branch to follow.
Good answer: Names a type and frames it using their variables ("I want to know whether smoking is associated with lung cancer").
Weak answer: "I want to look at the relationship between X and Y" — doesn't commit to a type.
Probing: "Relationship can mean different things — are you asking whether X predicts Y, whether X is associated with Y, or whether X causes Y?"
Concept Block trigger: Student says "causes" but the dataset is cross-sectional.
★ Concept: Causal claims require ruling out reverse causality and all confounding — typically not possible with cross-sectional data. With a single time-point dataset, an associational question is usually more defensible.
Q2 — Primary outcome:
"What is your primary outcome of interest? Looking at the variables you shared, is your outcome included as a single variable in this dataset, or will you need to modify or combine variables to create it?"
Good answer: Names a specific column from the schema and notes whether it needs transformation.
Weak answer: "The health outcome" or a concept name without a variable ("their disease status").
Probing: "Looking at the variables you shared, which specific column would you use to measure that?"
Concept Block trigger: Student wants to combine multiple columns without explaining why.
★ Concept: Composite outcomes must be defined before analysis, not derived from the results. If you combine variables, write down the rule — what counts as "positive" — before you look at any numbers.
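If the student needs a concrete picture of "write down the rule first," a minimal pandas sketch can help. The column names (`mi`, `stroke`) and data are hypothetical, not from the student's dataset:

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative assumptions.
df = pd.DataFrame({
    "mi": [0, 1, 0, 0],      # myocardial infarction (0/1)
    "stroke": [0, 0, 1, 0],  # stroke (0/1)
})

# Pre-specified rule, written down BEFORE any results are examined:
# composite event = MI OR stroke.
df["composite_event"] = ((df["mi"] == 1) | (df["stroke"] == 1)).astype(int)
```

The point of the sketch is that the combination rule is a single, stated line of code, fixed in advance rather than tuned after seeing results.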
Q3 — Outcome measurement type:
"Is your primary outcome of interest a continuous variable, a dichotomous (binary) variable, an ordinal categorical variable, an unordered categorical variable, or something else?"
Good answer: Correctly classifies the variable AND explains why ("It's dichotomous — coded 0/1 for absent/present").
Weak answer: "It's a number" or "it's yes or no" without naming the type.
Probing: "What are the possible values? Is there a meaningful order to them?"
Concept Block trigger: Student confuses ordinal with continuous.
★ Concept: An ordinal variable has ordered categories without equal spacing (e.g., none/mild/severe). A continuous variable has equal intervals and can take any value in a range (e.g., BMI). The distinction determines which statistical models are appropriate.
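To make the distinction concrete, a short pandas sketch (variable names and values are illustrative) contrasts an ordered categorical with a continuous measure:

```python
import pandas as pd

# Ordinal: ordered categories, no numeric spacing between them.
severity = pd.Series(
    pd.Categorical(["none", "severe", "mild"],
                   categories=["none", "mild", "severe"],
                   ordered=True)
)

# Continuous: numeric, any value within a range, equal intervals.
bmi = pd.Series([22.4, 31.0, 27.5])

# Ordered categories support rank comparisons...
worse_than_none = severity > "none"
# ...but arithmetic on them is undefined, whereas a mean of BMI is meaningful.
mean_bmi = bmi.mean()
```

An ordered categorical answers "is this worse than that?" but not "how much worse?", which is exactly why it needs different models than a continuous outcome.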
After Q3, follow the path that matches the student's Q1 answer.
Descriptive path: no additional questions. Proceed to Plan Gate.
Predictive path:
Q4 — Predictors:
"Which variables will you include to help predict your outcome, and why? (Do not frame them as confounders — that is causal thinking. In a predictive model, we select variables based on their predictive value.)"
Good answer: Lists specific column names and explains each one's predictive relevance.
Weak answer: Lists variables without justification, or justifies them in causal terms ("to control for age").
Probing: "Why do you expect [variable] to improve prediction of [outcome]?"
Concept Block trigger: Student justifies predictor selection using causal language ("to adjust for confounding").
★ Concept: In predictive modeling, we include variables because they improve prediction accuracy — not because of their causal relationship to the outcome. Confounder adjustment is a causal concept; for prediction, ask whether adding a variable helps the model predict better.
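If the student needs evidence rather than assertion, the "does it predict better?" criterion can be demonstrated on simulated data. A hedged sketch assuming scikit-learn is available; all variables are synthetic: a genuinely predictive variable yields real cross-validated discrimination, while an unrelated variable adds essentially nothing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 10, n)             # genuinely predictive variable
noise = rng.normal(size=n)              # variable unrelated to the outcome
p = 1 / (1 + np.exp(-(age - 50) / 10))  # outcome depends on age only
y = (rng.random(n) < p).astype(int)

# Cross-validated AUC with age alone, and with the unrelated variable added.
base = cross_val_score(LogisticRegression(), age.reshape(-1, 1), y,
                       cv=5, scoring="roc_auc").mean()
with_noise = cross_val_score(LogisticRegression(),
                             np.column_stack([age, noise]), y,
                             cv=5, scoring="roc_auc").mean()
```

The selection question is answered by comparing `base` and `with_noise`, not by asking whether `noise` is a confounder.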
Proceed to Plan Gate.
Associational / causal path:
Q4 — Primary exposure:
"What is your primary exposure of interest? Looking at the variables you shared, is your exposure included as a single variable in this dataset, or will you need to modify or combine variables to create it?"
Good answer: Names a specific column and notes whether it needs transformation.
Weak answer: A concept name without a variable ("their smoking history").
Probing: "Which specific column in the dataset captures that? Is it already coded the way you'd want?"
Q5 — Exposure measurement type:
"Is your primary exposure of interest a continuous variable, a dichotomous variable, an ordinal categorical variable, an unordered categorical variable, or something else?"
Good answer: Names the type and the coding ("Dichotomous — 1 for exposed, 0 for unexposed").
Weak answer: "It's a category" without specifying ordered vs. unordered.
Probing: "Does the order of the categories carry meaning? For example, is 'high' more than 'medium'?"
Q6 — DAG: Follow the draw-a-dag auxiliary skill to guide the student through building their causal diagram. This is a multi-turn process — do not condense it into a single question. When the student has confirmed the DAG, continue to Q7.
Q7 — Effect modifiers and mediators:
"Are there effect modifiers or mediators you want to examine? An effect modifier is a variable that changes the magnitude of the exposure-outcome association in different subgroups. A mediator is a variable on the causal path from exposure to outcome."
Good answer: Correctly identifies a candidate variable and explains which role it plays and why.
Weak answer: "I don't think so" without reasoning, or names a confounder.
Probing: "Do you expect the association between [exposure] and [outcome] to be stronger or weaker in any particular subgroup?"
Concept Block trigger: Student confuses mediator with confounder.
★ Concept: A mediator is ON the causal path from exposure to outcome — it explains how the exposure has its effect. A confounder is OUTSIDE the causal path and causes both the exposure and the outcome. Adjusting for a mediator when you want the total effect is a methodological error.
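A hypothetical mermaid sketch (illustrative variables, not the student's dataset) can make the contrast visible in the same notation the DAG uses:

```mermaid
graph LR
    smoking["Smoking (exposure)"] --> tar["Tar deposits (mediator: ON the path)"]
    tar --> cancer["Lung cancer (outcome)"]
    smoking --> cancer
    age["Age (confounder: OUTSIDE the path)"] --> smoking
    age --> cancer
```

Adjusting for age is required; adjusting for tar deposits would block part of the very effect the student wants to estimate.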
Proceed to Plan Gate.
Plan Gate: based on the student's answers, present this summary and wait for explicit approval before saving:
Question type: [type]
Research question: [one sentence]
Outcome: [variable name — measurement type]
Exposure: [variable name — measurement type, or N/A]
Confounders: [list, or N/A]
Effect modifiers / mediators: [list, or N/A]
DAG: [will be embedded in the saved file]
Do not save the artifact until the student says "yes" or otherwise explicitly confirms.
Save the approved content to `analysis/research-question.md`.
For descriptive or predictive questions, use this template:
# Research Question
**Question type:** [descriptive | predictive]
**Research question:** [one sentence]
**Variables:**
- Outcome: [variable name and how it's measured]
- Predictors: [list, or N/A]
For associational or causal questions, include the DAG section:
# Research Question
**Question type:** [associational | causal]
**Research question:** [one sentence]
**Variables:**
- Outcome: [variable name and how it's measured]
- Exposure: [variable name and how it's measured]
- Confounders: [list, or N/A]
- Effect modifiers / mediators: [list, or N/A]
## DAG
```mermaid
graph LR
exposure["[Exposure]"] --> outcome["[Outcome]"]
confounder1["[Confounder]"] --> exposure
confounder1 --> outcome
```
Arrows represent hypothesized causal relationships. Confounders cause both exposure and outcome.
The next skill (`data-preparation`) will not begin until this file exists.
---
## Common Mistakes
- Answering the Socratic questions instead of asking them
- Proceeding without the dataset schema — leads to generic, useless variable questions
- Asking multiple questions at once
- Accepting vague answers and moving on — probe until the answer is specific enough to put in the artifact
- Skipping the DAG for associational or causal questions
- Letting data availability drive the research question instead of scientific reasoning
- Student says "just run the analysis" — decline: "Defining the question first prevents wasted effort. What's your instinct about [current question]?"