Statistical modeling: OLS/WLS/GLS, GLM (logit, probit, Poisson), time series (ARIMA, VAR), mixed effects, diagnostics. Formula API. Use for regressions without fixed effects, GLMs, or time series. For FE/DiD use pyfixest; panel/IV use linearmodels.
statsmodels is a general-purpose statistical modeling library for Python. It covers OLS/WLS/GLS, GLM (logit, probit, Poisson, negative binomial), discrete choice models, time series (ARIMA, SARIMAX, VAR), mixed effects (MixedLM), robust regression, hypothesis tests, and comprehensive diagnostics, and it supports an R-style formula API. Use it when fitting regressions without fixed effects, running GLMs or logit/probit, analyzing time series, or using formula syntax. For fixed effects or DiD, use pyfixest; for panel/IV/system models, use linearmodels.
Comprehensive skill for statistical modeling with statsmodels. Use the decision trees below to find the right guidance, then load the detailed references.
What is statsmodels?
statsmodels is the general-purpose statistical modeling library for Python:
Two APIs: Formula API (smf.ols("y ~ x1 + x2", data=df)) for R-style modeling, and array API (sm.OLS(y, X)) for programmatic control
New to statsmodels? Start with quickstart.md then linear-models.md
Need GLM or logit/probit? Read quickstart.md then glm-discrete.md
Time series analysis? Read quickstart.md then time-series.md
Checking model assumptions? Read diagnostics.md
Coming from R? Read quickstart.md (formula API mirrors R syntax)
Related Skills
pyfixest: Use instead of statsmodels when your model needs absorbed fixed effects, IV with FE, or difference-in-differences. pyfixest is faster for FE models; statsmodels is broader for everything else
linearmodels: Use for panel data models (FE, RE, between, first difference, Fama-MacBeth), IV/GMM without FE (2SLS, LIML, GMM), system estimation (SUR, 3SLS), and asset pricing. Built on top of statsmodels; extends it for structured data
svy: Use for survey-weighted regression and estimation with complex survey designs. Important: statsmodels WLS is NOT equivalent to survey-weighted regression — WLS handles heteroscedastic errors but does not account for stratification, clustering, or finite population corrections. If your data comes from a complex probability survey (NHANES, ACS PUMS, CPS, ECLS-K, etc.), load the svy skill instead
data-scientist: Provides methodology guidance (when to use which model, assumption checking protocol, interpretation). Load alongside statsmodels for the "why"; statsmodels provides the "how"
polars: Data manipulation before modeling. statsmodels accepts pandas DataFrames; convert with df.to_pandas() if using Polars
plotnine: Publication-quality visualization of model results and diagnostics
What time series task?
├─ Forecast a single series
│ ├─ ARIMA / SARIMAX → ./references/time-series.md
│ └─ Exponential smoothing → ./references/time-series.md
├─ Multiple interrelated series
│ └─ VAR / VECM → ./references/time-series.md
├─ Test for stationarity
│ ├─ ADF test → ./references/time-series.md
│ └─ KPSS test → ./references/time-series.md
├─ Examine autocorrelation
│ └─ ACF / PACF → ./references/time-series.md
└─ Structural time series
   └─ Unobserved components → ./references/time-series.md
"I need to check model assumptions"
What assumption to check?
├─ Heteroskedasticity → ./references/diagnostics.md
│ ├─ Breusch-Pagan test
│ └─ White test
├─ Normality of residuals → ./references/diagnostics.md
│ ├─ Jarque-Bera test
│ └─ Shapiro-Wilk test
├─ Specification / functional form → ./references/diagnostics.md
│ └─ RESET test
├─ Multicollinearity → ./references/diagnostics.md
│ ├─ VIF
│ └─ Condition number
├─ Influential observations → ./references/diagnostics.md
│ ├─ Cook's distance
│ └─ Leverage / DFFITS
├─ Serial correlation → ./references/diagnostics.md
│ └─ Durbin-Watson / Breusch-Godfrey
└─ All of the above → ./references/diagnostics.md
"I need to test hypotheses"
What kind of test?
├─ Single coefficient significance → ./references/hypothesis-testing.md
├─ Joint significance (F-test) → ./references/hypothesis-testing.md
├─ Linear restrictions (Wald) → ./references/hypothesis-testing.md
├─ Compare nested models (LR test) → ./references/hypothesis-testing.md
├─ Multiple comparisons correction → ./references/hypothesis-testing.md
└─ Chi-squared test → ./references/hypothesis-testing.md
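The sketch below shows one plausible way to run the first five kinds of test on a fitted OLS model. The data, the restrictions being tested, and the choice of the Holm correction are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Simulated data: x3 has no true effect (illustrative only)
rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = 1.0 + 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
restricted = smf.ols("y ~ x1", data=df).fit()

# Single coefficient significance
single = full.t_test("x1 = 0")

# Joint significance (F-test) of x2 and x3
joint = full.f_test("x2 = 0, x3 = 0")

# Linear restriction via a Wald (chi-squared) test
wald = full.wald_test("x1 = 2", use_f=False)

# Nested model comparison: returns LR statistic, p-value, difference in df
lr_stat, lr_pvalue, lr_df = full.compare_lr_test(restricted)

# Multiple comparisons correction applied to the coefficient p-values
reject, p_adjusted, _, _ = multipletests(full.pvalues, method="holm")

print(single)
print(joint)
print(wald)
print(lr_stat, lr_pvalue, lr_df)
print(p_adjusted)
```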
Important: In data research pipelines (see CLAUDE.md), statsmodels analyses are executed through script files, not interactively. This ensures auditability and reproducibility.
The pattern:
Write model code to scripts/stage8_analysis/{step}_{model-name}.py
Execute via Bash using the wrapper script that captures output automatically
Validation results are embedded in the script automatically as comments
If execution fails, create a versioned copy for the fix
Closely read agent_reference/SCRIPT_EXECUTION_REFERENCE.md for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules.
See:
agent_reference/SCRIPT_EXECUTION_REFERENCE.md — Script execution protocol and format with validation
The examples below show statsmodels syntax. In research workflows, wrap them in scripts following the file-first pattern.
Quick Reference
Essential Imports
import statsmodels.api as sm # Array API
import statsmodels.formula.api as smf # Formula API (R-style)
from statsmodels.stats.outliers_influence import variance_inflation_factor
Breusch-Pagan
sm.stats.diagnostic.het_breuschpagan(resid, exog)
Formula Syntax
# Additive terms
"y ~ x1 + x2 + x3"
# Interaction (with main effects)
"y ~ x1 * x2" # equivalent to x1 + x2 + x1:x2
# Interaction only (no main effects)
"y ~ x1 : x2"
# Categorical variable
"y ~ C(region)" # treatment coding (default)
"y ~ C(region, Treatment(reference='West'))" # explicit reference
# Suppress intercept
"y ~ x1 + x2 - 1"
# Polynomial
"y ~ x1 + I(x1**2)" # I() protects Python operators
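# The formula pieces above can be combined in a single model. A minimal sketch
# on simulated data; the column names and region labels are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a four-level categorical variable
rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "region": rng.choice(["East", "North", "South", "West"], size=n),
})
df["y"] = (
    1.0
    + 0.5 * df["x1"]
    + 0.3 * df["x1"] ** 2
    + 0.2 * df["x1"] * df["x2"]
    + rng.normal(size=n)
)

# Interaction with main effects, a squared term, and a categorical with 'West' as baseline
formula = "y ~ x1 * x2 + I(x1**2) + C(region, Treatment(reference='West'))"
fit = smf.ols(formula, data=df).fit()

# Intercept, three region dummies (West omitted), x1, x2, x1:x2, and the squared term
print(fit.params.index.tolist())
```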
Topic Index
Topic → Reference File
Installation → ./references/quickstart.md
Formula vs array API → ./references/quickstart.md
Reading summary output → ./references/quickstart.md
Comparison to pyfixest → ./references/quickstart.md
OLS regression → ./references/linear-models.md
Weighted least squares → ./references/linear-models.md
GLS → ./references/linear-models.md
Robust regression (RLM) → ./references/linear-models.md
Quantile regression → ./references/linear-models.md
Interactions and polynomials → ./references/linear-models.md
GLM framework → ./references/glm-discrete.md
Logit / probit → ./references/glm-discrete.md
Multinomial logit → ./references/glm-discrete.md
Poisson / negative binomial → ./references/glm-discrete.md
Zero-inflated models → ./references/glm-discrete.md
Marginal effects → ./references/glm-discrete.md
Exposure / offset → ./references/glm-discrete.md
ARIMA / SARIMAX → ./references/time-series.md
VAR / VECM → ./references/time-series.md
Exponential smoothing → ./references/time-series.md
Unit root tests → ./references/time-series.md
ACF / PACF → ./references/time-series.md
Forecasting → ./references/time-series.md
State space models → ./references/time-series.md
Heteroskedasticity tests
Citation
When this library is used as a primary analytical tool, include in the report's Software & Tools references:
Seabold, S. & Perktold, J. (2010). "Statsmodels: Econometric and Statistical Modeling with Python." Proceedings of the 9th Python in Science Conference.
Cite when: statsmodels is used for GLM estimation, time series modeling, or statistical hypothesis testing central to the analysis.
Do not cite when: Only used for post-estimation diagnostics supporting another library's primary estimation.
For method-specific citations (e.g., individual estimators or techniques), consult the reference files in this skill and agent_reference/CITATION_REFERENCE.md.