Statistical reasoning, data interpretation, visualization principles, insight extraction, and discriminating signal from noise
Invoke when: interpreting data, evaluating statistical claims, choosing appropriate analysis methods, assessing chart/visualization quality, or turning raw numbers into actionable understanding.
Data analysis is not arithmetic. It is structured reasoning about what a dataset can and cannot tell you, combined with careful communication of that reasoning. Most mistakes in data analysis are not calculation errors — they are interpretation errors.
Before touching numbers, ask:
Know what each measure captures:
| Statistic | What it measures | Caveat |
|---|---|---|
| Mean | Central tendency — sensitive to outliers | Can be misleading with skewed distributions |
| Median | Central tendency — robust to outliers | Better for income, housing prices, latency |
| Mode | Most frequent value | Mainly useful for categorical data |
| Range | Spread — max minus min | Sensitive to extremes; hides internal distribution |
| Variance / Std Dev | Spread — variance is the average squared deviation from the mean; std dev is its square root | Sensitive to outliers; std dev is in the original units, variance in squared units |
| IQR | Middle 50% spread | Robust spread metric; use with median |
| Percentiles | Position in distribution | More informative than min/max for skewed data |
Always ask: Is the mean an appropriate summary, or does the distribution shape make it misleading? (Income distributions, system latency, and anything with long tails: use median + IQR.)
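The mean-versus-median point is easy to see with Python's stdlib. The latency values below are made up for illustration:

```python
import statistics

# Hypothetical latency samples (ms) with one long-tail outlier
latencies = [12, 14, 15, 15, 16, 17, 18, 20, 22, 250]

mean = statistics.mean(latencies)      # dragged upward by the outlier
median = statistics.median(latencies)  # robust to it
q1, _, q3 = statistics.quantiles(latencies, n=4)
iqr = q3 - q1

print(f"mean={mean:.1f}  median={median}  IQR={iqr}")
```

The mean (39.9) lands above every typical observation; median + IQR describes the bulk of the data far better.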
Real data is not normally distributed by default. Recognize:
Before fitting a model, look at the actual distribution. Plot a histogram. Checking distributional assumptions is not optional.
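The look need not even require a plotting library. A minimal sketch (stdlib only; bin width is an arbitrary choice) that makes skew visible in a terminal:

```python
from collections import Counter

def ascii_hist(values, bin_width=10):
    """Crude text histogram: enough to spot skew or a long tail before modeling."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for lo in sorted(bins):
        lines.append(f"{lo:>4}-{lo + bin_width - 1:<4} {'#' * bins[lo]}")
    return "\n".join(lines)

print(ascii_hist([12, 14, 15, 15, 16, 17, 18, 20, 22, 250]))
```

One isolated bar far to the right is exactly the shape that makes the mean misleading.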
Base rate neglect is one of the most common reasoning errors. Always ask: how common is this thing in the underlying population before applying any conditional probability?
Bayes' theorem (intuition form): a positive test for a rare disease is probably a false positive, even with a "highly accurate" test, because the base rate of disease is low.
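The intuition is worth verifying numerically. The prevalence, sensitivity, and specificity below are illustrative assumptions, not data:

```python
# Assumed rates for a "highly accurate" test on a rare condition
prevalence = 0.01    # P(disease): 1% of the population
sensitivity = 0.99   # P(positive | disease)
specificity = 0.95   # P(negative | no disease)

# Bayes' theorem: P(disease | positive) = P(pos | disease) P(disease) / P(pos)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive) = {p_disease_given_positive:.1%}")
```

Even with 99% sensitivity, a positive result here means only about a one-in-six chance of disease, because false positives from the large healthy population swamp true positives from the small diseased one.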
| Trap | Description |
|---|---|
| Simpson's Paradox | A trend appears in groups but reverses when groups are combined or vice versa. Always check for lurking variables. |
| Survivorship bias | Analyzing only outcomes that "survived" to be observed misses the full distribution (e.g., studying successful companies, not all startups) |
| Selection bias | The sample is not representative of the population of interest |
| Confounding | A third variable explains the relationship between two measured variables |
| Multiple comparisons | Running many tests means some will be significant by chance; p-values must be adjusted (Bonferroni or similar) |
| Overfitting | A model that fits training data extremely well may generalize poorly |
| Ecological fallacy | Conclusions about groups applied incorrectly to individuals |
Name the applicable trap whenever interpreting a statistical claim.
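The multiple-comparisons row can be made concrete. Under the null hypothesis, p-values are uniform on [0, 1]; an evenly spaced grid is a deterministic stand-in for that, and the numbers below are illustrative:

```python
n_tests = 100
alpha = 0.05

# Deterministic stand-in for 100 null p-values (uniform on [0, 1])
p_values = [(i + 0.5) / n_tests for i in range(n_tests)]

naive_hits = sum(p < alpha for p in p_values)                 # unadjusted threshold
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)  # Bonferroni: alpha / m

print(naive_hits, bonferroni_hits)  # → 5 0
```

With no real effects anywhere, the unadjusted threshold still yields five "discoveries"; the Bonferroni-adjusted threshold yields none.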
Correlation is necessary but not sufficient for causation. Establishing causation requires:
When a claim implies causation from observational data, name this explicitly: "This shows a correlation; it does not establish causation because [confounders / no experimental design / etc.]."
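A confounder is easy to simulate. In this sketch (all variables and magnitudes invented), a hidden common cause produces a strong correlation between two variables that do not affect each other at all:

```python
import random

random.seed(1)

# Hypothetical confounder: heat drives both ice-cream sales and
# drowning incidents; neither causes the other.
heat = [random.gauss(0, 1) for _ in range(1000)]
ice_cream = [h + random.gauss(0, 0.5) for h in heat]
drownings = [h + random.gauss(0, 0.5) for h in heat]

def corr(xs, ys):
    """Pearson correlation coefficient, stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

print(f"corr(ice_cream, drownings) = {corr(ice_cream, drownings):.2f}")
```

The correlation comes out near 0.8 even though, by construction, neither variable influences the other; conditioning on the confounder (heat) would make it vanish.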
| Data type | Appropriate chart |
|---|---|
| Distribution (continuous) | Histogram, density plot, box plot |
| Comparison across categories | Bar chart (vertical or horizontal) |
| Change over time (few series) | Line chart |
| Correlation between two variables | Scatter plot |
| Composition (parts of a whole) | Stacked bar chart; pie chart only when values sum to a meaningful whole and there are few slices |
| Relationship + third variable | Scatter plot with color/size encoding |
Pie charts are generally poor unless: there are ≤4 slices, values add to 100%, and the point is about one slice being dominant or negligible. Otherwise: bar chart.
For any substantive data analysis:
A table of numbers without interpretation is not analysis. Interpretation without numbers is speculation. Show both, and connect them.