| Question Type | What You're Asking | Go To |
|---|---|---|
| Describe | What does the data look like? | Descriptive Statistics |
| Compare | Are these groups different? | Comparison Tests |
| Relate | Do these variables move together? | Correlation & Regression |
| Predict | What will happen next? | Regression & Modeling (beyond this skill — use ML) |
| Test fit | Does the observed distribution differ from expected? | Goodness-of-Fit Tests |
Once you know the question type, these data-structure factors determine the specific test:
| Factor | Options | Why It Matters |
|---|---|---|
| Number of groups | 1, 2, or 3+ | Determines which test family to use |
| Paired or independent? | Same subjects measured twice vs. different subjects | Paired tests are more powerful but require matched data |
| Variable type | Continuous (interval/ratio) vs. categorical (nominal/ordinal) | Determines parametric vs. non-parametric |
| Sample size | Small (< 30) vs. large (≥ 30) | Normality is hard to verify in small samples; prefer non-parametric or exact tests |
| Distribution shape | Normal vs. skewed vs. unknown | Parametric tests assume normality |
Use these to summarize and understand data before testing anything.
| Measure | When to Use | Sensitive To |
|---|---|---|
| Mean | Symmetric, roughly normal data | Outliers (a single extreme value shifts it) |
| Median | Skewed data or when outliers are present | Nothing — robust by design |
| Mode | Categorical data or multimodal distributions | Ties, small samples |
Rule: If mean ≠ median by more than 10–15%, the data is skewed — prefer median.
| Measure | When to Use | Notes |
|---|---|---|
| Standard deviation | Normal-ish data; pair with mean | Same units as data |
| IQR (Q3 − Q1) | Skewed data; pair with median | Robust to outliers |
| Range | Quick sense of spread | Useless with outliers |
| Coefficient of variation (CV) | Comparing spread across datasets with different scales | CV = std / mean |
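The mean-vs-median rule above can be checked directly. A minimal sketch with NumPy on a made-up skewed sample (the numbers are illustrative only):

```python
import numpy as np

# Skewed sample: one outlier pulls the mean away from the median.
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], dtype=float)

mean = data.mean()
median = np.median(data)
std = data.std(ddof=1)                    # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                             # robust spread measure
cv = std / mean                           # coefficient of variation

print(f"mean={mean:.1f} median={median:.1f}")  # mean (24.1) far above median (16.5): skewed
print(f"std={std:.1f} IQR={iqr:.2f} CV={cv:.2f}")
```

Here the mean and median differ by far more than 10-15%, so the median and IQR are the honest summary.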
Before choosing a test, check the distribution:
| Check | Method | Interpretation |
|---|---|---|
| Normality | Shapiro-Wilk (n < 5000), Anderson-Darling, or Q-Q plot | p < 0.05 → not normal |
| Skewness | scipy.stats.skew | \|skew\| > 1 = substantially skewed |
| Kurtosis | scipy.stats.kurtosis | > 0 = heavy tails, < 0 = light tails |
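A quick way to run all three checks with scipy.stats, here on synthetic exponential data (clearly non-normal; the seed and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=2.0, size=500)   # right-skewed by construction

w, p = stats.shapiro(skewed)          # Shapiro-Wilk; appropriate for n < 5000
print(f"Shapiro-Wilk p={p:.2e}")      # p < 0.05 -> reject normality

print(f"skew={stats.skew(skewed):.2f}")         # exponential has theoretical skew = 2
print(f"kurtosis={stats.kurtosis(skewed):.2f}") # excess kurtosis; > 0 = heavy tails
```

With p far below 0.05 and skew above 1, this sample fails the normality check, so the tree below routes to rank-based tests.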
```
How many groups?
├── 1 group vs. known value
│   ├── Normal → One-sample t-test
│   └── Not normal → Wilcoxon signed-rank
├── 2 groups
│   ├── Paired (same subjects)?
│   │   ├── Normal → Paired t-test
│   │   └── Not normal → Wilcoxon signed-rank
│   └── Independent?
│       ├── Normal, equal variance → Independent t-test
│       ├── Normal, unequal variance → Welch's t-test
│       └── Not normal → Mann-Whitney U
└── 3+ groups
    ├── Normal, equal variance → One-way ANOVA
    ├── Not normal → Kruskal-Wallis
    └── Repeated measures → Repeated measures ANOVA or Friedman test
```
| Test | Compares | Assumptions | Non-Parametric Alternative |
|---|---|---|---|
| One-sample t-test | Sample mean vs. known value | Normal, continuous | Wilcoxon signed-rank |
| Independent t-test | Means of 2 independent groups | Normal, equal variance, independent | Mann-Whitney U |
| Welch's t-test | Means of 2 independent groups | Normal, independent (no equal variance needed) | Mann-Whitney U |
| Paired t-test | Means of 2 paired measurements | Normal differences, continuous | Wilcoxon signed-rank |
| One-way ANOVA | Means of 3+ independent groups | Normal, equal variance, independent | Kruskal-Wallis |
| Chi-squared test | Observed vs. expected frequencies | Expected count ≥ 5 in each cell | Fisher's exact test |
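Following the tree: two independent groups with unequal variances call for Welch's t-test, with Mann-Whitney U as the rank-based fallback. A sketch on synthetic data (group means, spreads, and sizes are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=50, scale=10, size=200)
variant = rng.normal(loc=60, scale=15, size=200)   # different mean AND variance

# Unequal variances -> Welch's t-test (equal_var=False)
t, p_welch = stats.ttest_ind(control, variant, equal_var=False)

# Rank-based alternative if normality is doubtful
u, p_mwu = stats.mannwhitneyu(control, variant, alternative="two-sided")

print(f"Welch t={t:.2f} p={p_welch:.2e}")
print(f"Mann-Whitney U={u:.0f} p={p_mwu:.2e}")
```

Both tests agree here because the shift is large; they can disagree when the difference is in spread or shape rather than location.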
If the omnibus test is significant (groups differ), use post-hoc tests to find which groups differ:
| Test | When to Use |
|---|---|
| Tukey HSD | All pairwise comparisons (the Tukey-Kramer variant handles unequal group sizes) |
| Bonferroni correction | Few planned comparisons |
| Dunn's test | Post-hoc for Kruskal-Wallis |
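A sketch of Tukey HSD via statsmodels' `pairwise_tukeyhsd`, on three synthetic groups where one mean is deliberately shifted (all numbers are illustrative):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
# Three groups; C has a clearly higher mean than A and B.
values = np.concatenate([
    rng.normal(10.0, 2, 50),   # A
    rng.normal(10.5, 2, 50),   # B
    rng.normal(14.0, 2, 50),   # C
])
groups = np.repeat(["A", "B", "C"], 50)

res = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(res.summary())   # one row per pair: mean diff, adjusted p, reject yes/no
```

Run this only after a significant omnibus ANOVA; the adjusted p-values already account for the three pairwise comparisons.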
For relationship questions, choose the method by variable type:
| Method | Variable Types | Measures | Assumptions |
|---|---|---|---|
| Pearson's r | Both continuous | Linear relationship strength | Normal, linear, no outliers |
| Spearman's ρ | Ordinal or continuous | Monotonic relationship strength | None (rank-based) |
| Kendall's τ | Ordinal, small samples | Monotonic relationship strength | None (rank-based, more robust than Spearman for small n) |
| Chi-squared test of independence | Both categorical | Whether variables are associated | Expected counts ≥ 5 |
| Point-biserial | One binary, one continuous | Correlation | Normal continuous variable |
Correlation ≠ causation. Always state this explicitly when reporting. A correlation is a starting point for investigation, not a conclusion.
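Pearson and Spearman side by side with scipy.stats, on a synthetic noisy-linear relationship (slope and noise level are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y = 3 * x + rng.normal(0, 5, 100)   # linear signal plus noise

r, p_r = stats.pearsonr(x, y)       # linear relationship strength
rho, p_rho = stats.spearmanr(x, y)  # monotonic relationship strength

print(f"Pearson r={r:.2f} (p={p_r:.1e})")
print(f"Spearman rho={rho:.2f} (p={p_rho:.1e})")
```

For a clean linear relationship the two agree closely; a large gap between them (Spearman high, Pearson low) suggests a monotonic but non-linear relationship or outliers.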
Always report effect size alongside significance. A tiny p-value with a tiny effect is not actionable.
| Context | Measure | Small | Medium | Large |
|---|---|---|---|---|
| 2-group comparison | Cohen's d | 0.2 | 0.5 | 0.8 |
| ANOVA | Eta-squared (η²) | 0.01 | 0.06 | 0.14 |
| Correlation | Pearson's r | 0.1 | 0.3 | 0.5 |
| Chi-squared | Cramér's V | 0.1 | 0.3 | 0.5 |
| Binary outcome | Odds ratio | 1.5 | 2.5 | 4.0 |
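Cohen's d is not built into scipy.stats, so a small helper is common. This sketch uses the pooled-standard-deviation form on synthetic groups (the group parameters are made up):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
a = rng.normal(105, 15, 200)
b = rng.normal(100, 15, 200)

d = cohens_d(a, b)
print(f"d = {d:.2f}")   # population d = 5/15 ≈ 0.33: small-to-medium
```

Report d next to the p-value: the same p can correspond to a trivial or a large d depending on sample size.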
Before running an experiment, calculate the required sample size:
## Power Analysis
**Test**: <e.g., "Independent t-test">
**Desired power**: 0.80 (standard) or 0.90 (high confidence)
**Significance level (α)**: 0.05
**Minimum detectable effect (MDE)**: <e.g., "Cohen's d = 0.3" or "5% conversion lift">
**Required sample size per group**: <computed value>
**Expected duration to collect**: <e.g., "2 weeks at current traffic">
Rules of thumb: halving the minimum detectable effect roughly quadruples the required sample size (n scales with 1/MDE²), and an underpowered test that comes back non-significant tells you almost nothing.
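The required n can be computed with statsmodels' power module. A sketch for the template values above (d = 0.3, power 0.80, α = 0.05, two-sided):

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group to detect Cohen's d = 0.3
# with 80% power at alpha = 0.05 (two-sided independent t-test).
n = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"Required n per group: {int(round(n))}")   # ~175 per group
```

Leaving any one of `effect_size`, `nobs1`, `alpha`, or `power` as `None` makes `solve_power` solve for that quantity instead, so the same call answers "what power do I have at my current n?"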
A/B tests are the most common applied statistics scenario. Follow this workflow:
## A/B Test Results
**Test**: <name>
**Duration**: <start – end>
**Sample size**: Control: <n>, Variant: <n>
**Primary metric**: <metric name>
| Group | Value | 95% CI |
|-------|-------|--------|
| Control | <X> | [<lo>, <hi>] |
| Variant | <X> | [<lo>, <hi>] |
**Relative lift**: <X%> [<CI lo>, <CI hi>]
**p-value**: <X>
**Effect size**: <measure and value>
**Guardrail metrics**: <all stable / degradation in X>
**Decision**: <Ship variant | Keep control | Inconclusive — extend or redesign>
**Reasoning**: <why this decision, referencing effect size and practical significance>
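For a conversion-rate A/B test, a two-proportion z-test plus Wilson confidence intervals covers the report fields above. The counts here are hypothetical:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Hypothetical A/B test: conversions out of visitors (made-up numbers).
conv = np.array([420, 480])      # control, variant conversions
n = np.array([10000, 10000])     # visitors per group

z, p = proportions_ztest(conv, n)
ci_lo, ci_hi = proportion_confint(conv, n, alpha=0.05, method="wilson")

print(f"control: {conv[0]/n[0]:.2%} [{ci_lo[0]:.2%}, {ci_hi[0]:.2%}]")
print(f"variant: {conv[1]/n[1]:.2%} [{ci_lo[1]:.2%}, {ci_hi[1]:.2%}]")
print(f"z={z:.2f} p={p:.4f}")
```

A significant p here still needs the practical-significance check: a 0.6-point absolute lift may or may not clear the threshold you set before the test.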
Before applying any parametric test, verify:
| Assumption | How to Check | If Violated |
|---|---|---|
| Normality | Shapiro-Wilk test, Q-Q plot, histogram | Use non-parametric alternative |
| Equal variance | Levene's test, F-test | Use Welch's t-test or rank-based test |
| Independence | Study design (not a statistical test) | Use paired/repeated measures test |
| Linearity (regression) | Scatter plot of residuals vs. fitted | Transform variables or use non-linear model |
| No multicollinearity (regression) | VIF < 5 for each predictor | Remove or combine correlated predictors |
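Levene's test for the equal-variance check, sketched with scipy on two synthetic groups with deliberately different spreads:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
g1 = rng.normal(0, 1.0, 100)
g2 = rng.normal(0, 3.0, 100)   # three times the spread of g1

stat, p = stats.levene(g1, g2)   # default center="median" is robust to skew
print(f"Levene p={p:.2e}")
# p < 0.05 -> variances differ; use Welch's t-test instead of Student's t-test
```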
## Statistical Question
**Question**: <What are you trying to learn? Be specific.>
**Population**: <Who or what does this apply to?>
**Hypothesis** (if testing):
- H₀: <null hypothesis — no effect, no difference>
- H₁: <alternative hypothesis — the effect you expect>
**Practical significance threshold**: <What size effect would actually matter?>
## Data Profile
**Source**: <where the data comes from>
**Sample size**: <n>
**Variables**:
- <var1>: <type> — <description>
- <var2>: <type> — <description>
**Missing data**: <count, percentage, pattern>
**Distribution**: <per variable — normal, skewed, categorical frequencies>
**Assumption checks**: <normality, variance, independence — pass/fail>
Use the selection guide above. Document:
## Method Selection
**Test**: <name>
**Why**: <reasoning tied to question type, data structure, and assumption checks>
**Alternative considered**: <what else could work and why it wasn't chosen>
**Correction applied**: <e.g., "Bonferroni for 3 pairwise comparisons" or "none — single test">
## Results
**Test statistic**: <name> = <value>
**p-value**: <value>
**Effect size**: <measure> = <value> (<small | medium | large>)
**Confidence interval**: [<lower>, <upper>] at <confidence level>
**Interpretation**: <Plain-language statement of what this means. Not just "p < 0.05 so we reject H₀" — state the practical implication.>
**Limitations**: <What this result does NOT tell us. Confounders, generalizability, assumptions that were borderline.>
Every statistical analysis should produce the elements in the Results template above: a test statistic, a p-value, an effect size, a confidence interval, a plain-language interpretation, and stated limitations.
Regression is the most common applied statistics method. Use this decision guide:
| Outcome Variable | Predictor(s) | Method | Use When |
|---|---|---|---|
| Continuous | 1 continuous | Simple linear regression | Predicting one variable from another (e.g., revenue from ad spend) |
| Continuous | Multiple | Multiple linear regression | Predicting with several factors; controlling for confounders |
| Binary (0/1) | Any | Logistic regression | Predicting yes/no outcomes (e.g., churn, conversion) |
| Count (0, 1, 2...) | Any | Poisson regression | Predicting event counts (e.g., support tickets per day) |
Before running regression, check the assumptions above: linearity (residuals vs. fitted plot), independence of errors, constant residual variance, and, for multiple regression, multicollinearity (VIF < 5 per predictor).
Key outputs to report: R², adjusted R², coefficients with CIs, residual plots, and F-test p-value.
Frequentist methods (everything above) are the default. Consider Bayesian approaches when:
| Signal | Why Bayesian Helps |
|---|---|
| Small sample size (n < 30) | Priors regularize unstable estimates |
| You have strong prior knowledge | Incorporating domain expertise improves estimates |
| You need P(hypothesis \| data), not P(data \| hypothesis) | Credible intervals answer "what's the probability the effect is > X?" directly |
| Sequential testing / continuous monitoring | Bayesian A/B tests allow peeking without inflating error rates |
| Stakeholders struggle with p-values | "95% probability the effect is between 2% and 8%" is more intuitive |
Practical tools: pymc, arviz for general Bayesian analysis; Bayesian A/B testing via bayesian-testing or built-in platform features (Optimizely, LaunchDarkly).
Default stance: Use frequentist methods unless one of the above signals is present. Don't switch to Bayesian for complexity's sake.
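For the common conversion-rate case, a conjugate Beta-Binomial model gives the Bayesian answers without pymc. A sketch with uniform Beta(1, 1) priors on hypothetical counts:

```python
import numpy as np
from scipy import stats

# Hypothetical conversion data; conjugate model, so posteriors are exact Betas.
control = {"conv": 420, "n": 10000}
variant = {"conv": 480, "n": 10000}

post_c = stats.beta(1 + control["conv"], 1 + control["n"] - control["conv"])
post_v = stats.beta(1 + variant["conv"], 1 + variant["n"] - variant["conv"])

# Monte Carlo draws from each posterior to compare them.
rng = np.random.default_rng(0)
draws_c = post_c.rvs(100_000, random_state=rng)
draws_v = post_v.rvs(100_000, random_state=rng)

p_better = (draws_v > draws_c).mean()
lift = draws_v / draws_c - 1
lo, hi = np.percentile(lift, [2.5, 97.5])

print(f"P(variant > control) = {p_better:.3f}")
print(f"95% credible interval for relative lift: [{lo:.1%}, {hi:.1%}]")
```

This yields exactly the stakeholder-friendly statement the table mentions: a direct probability that the variant is better, plus a credible interval for the lift.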
| Library | Language | Best For |
|---|---|---|
| scipy.stats | Python | Hypothesis tests, distributions, descriptive stats |
| statsmodels | Python | Regression, ANOVA, time series, assumption diagnostics |
| pingouin | Python | Clean API for t-tests, ANOVA, correlation, effect sizes |
| scikit-learn | Python | Train/test splits, cross-validation, preprocessing |
| pymc | Python | Bayesian modeling and inference |
| statsmodels.stats.power | Python | Sample size and power calculations |
Avoid these common pitfalls:
| Pitfall | Why It Fails |
|---|---|
| p-hacking | Running multiple tests and reporting only significant results inflates false positives. Pre-register your hypothesis. |
| Ignoring effect size | p = 0.001 with Cohen's d = 0.05 means "we're very sure about a trivially small difference." Not actionable. |
| Small sample, big claims | A study with n = 12 that finds p = 0.04 is fragile. One outlier changes the conclusion. |
| Violating independence | Using the same users in both groups, or multiple measurements without paired tests. Results are invalid. |
| Peeking at A/B tests | Checking daily and stopping when p < 0.05 dramatically inflates false positive rate. Use sequential testing if you must peek. |
| Treating non-significant as "no effect" | Absence of evidence ≠ evidence of absence. You may be underpowered. Report power. |
| Applying parametric tests to ordinal data | A Likert scale (1–5) is not continuous. Use non-parametric methods. |
| Confusing correlation with causation | Pearson r = 0.8 does not mean X causes Y. It means they move together. |
| Cherry-picking subgroups | "It wasn't significant overall, but it was significant for users aged 25–30 on Tuesdays." This is noise. |
| Reporting without uncertainty | "Conversion rate is 4.2%" is less useful than "4.2% ± 0.8% (95% CI)." Always show the interval. |