Statistical modeling: OLS/WLS/GLS, GLM (logit, probit, Poisson), time series (ARIMA, VAR), mixed effects, diagnostics. Formula API. Use for regressions without fixed effects, GLMs, or time series. For FE/DiD use pyfixest; panel/IV use linearmodels.
statsmodels is a general-purpose statistical modeling library for Python. It covers OLS/WLS/GLS, GLM (logit, probit, Poisson, negative binomial), discrete choice models, time series (ARIMA, SARIMAX, VAR), mixed effects (MixedLM), robust regression, hypothesis tests, and comprehensive diagnostics, and it supports an R-style formula API. Use it when fitting regressions without fixed effects, running GLMs or logit/probit, analyzing time series, or using formula syntax. For fixed effects or DiD, use pyfixest; for panel/IV/system models, use linearmodels.
Comprehensive skill for statistical modeling with statsmodels. Use the decision trees below to find the right guidance, then load the detailed reference files.
statsmodels is the general-purpose statistical modeling library for Python:
- Formula API (smf.ols("y ~ x1 + x2", data=df)) for R-style modeling, and array API (sm.OLS(y, X)) for programmatic control

| File | Purpose | When to Read |
|---|---|---|
| quickstart.md | Installation, formula vs array API, first model | Starting with statsmodels |
| linear-models.md | OLS, WLS, GLS, robust regression, quantile regression | Fitting linear models |
| glm-discrete.md | GLM families, logit/probit, count models, zero-inflated | Non-linear models, binary/count outcomes |
| time-series.md | ARIMA, SARIMAX, VAR, exponential smoothing, unit root tests | Analyzing temporal data |
| diagnostics.md | Heteroskedasticity, normality, VIF, influence, residuals | Checking model assumptions |
| hypothesis-testing.md | t-tests, F-tests, Wald tests, multiple comparisons | Testing coefficients and comparing models |
| gotchas.md | Constant term, convergence, predict pitfalls, pyfixest boundary | Debugging issues |
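- Fitting linear models: quickstart.md, then linear-models.md
- Binary/count outcomes and GLMs: quickstart.md, then glm-discrete.md
- Analyzing temporal data: quickstart.md, then time-series.md
- Checking model assumptions: diagnostics.md
- Coming from R: quickstart.md (formula API mirrors R syntax)
- Complex survey data: use the `svy` skill instead
- Polars DataFrames: convert with df.to_pandas() if using Polars

What kind of regression?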
├─ Linear (continuous outcome)
│ ├─ Basic OLS → ./references/linear-models.md
│ ├─ Weighted least squares → ./references/linear-models.md
│ │ (⚠ WLS ≠ survey-weighted regression — for complex surveys, use `svy` skill)
│ ├─ Correlated errors (GLS) → ./references/linear-models.md
│ ├─ Robust to outliers (M-estimator) → ./references/linear-models.md
│ └─ Quantile regression → ./references/linear-models.md
├─ Binary outcome (0/1)
│ ├─ Logit → ./references/glm-discrete.md
│ └─ Probit → ./references/glm-discrete.md
├─ Count outcome (0, 1, 2, ...)
│ ├─ Poisson → ./references/glm-discrete.md
│ ├─ Negative binomial → ./references/glm-discrete.md
│ └─ Zero-inflated → ./references/glm-discrete.md
├─ Multinomial (3+ categories)
│ └─ Multinomial logit → ./references/glm-discrete.md
├─ GLM (custom family/link)
│ └─ GLM framework → ./references/glm-discrete.md
└─ Need fixed effects?
  └─ Use pyfixest instead (faster FE absorption)
What time series task?
├─ Forecast a single series
│ ├─ ARIMA / SARIMAX → ./references/time-series.md
│ └─ Exponential smoothing → ./references/time-series.md
├─ Multiple interrelated series
│ └─ VAR / VECM → ./references/time-series.md
├─ Test for stationarity
│ ├─ ADF test → ./references/time-series.md
│ └─ KPSS test → ./references/time-series.md
├─ Examine autocorrelation
│ └─ ACF / PACF → ./references/time-series.md
└─ Structural time series
  └─ Unobserved components → ./references/time-series.md
What assumption to check?
├─ Heteroskedasticity → ./references/diagnostics.md
│ ├─ Breusch-Pagan test
│ └─ White test
├─ Normality of residuals → ./references/diagnostics.md
│ ├─ Jarque-Bera test
│ └─ Shapiro-Wilk test
├─ Specification / functional form → ./references/diagnostics.md
│ └─ RESET test
├─ Multicollinearity → ./references/diagnostics.md
│ ├─ VIF
│ └─ Condition number
├─ Influential observations → ./references/diagnostics.md
│ ├─ Cook's distance
│ └─ Leverage / DFFITS
├─ Serial correlation → ./references/diagnostics.md
│ └─ Durbin-Watson / Breusch-Godfrey
└─ All of the above → ./references/diagnostics.md
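
A few of the checks in this tree can be sketched in one short script. The data below are synthetic, with an error variance that deliberately grows with `x1`; the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Synthetic data (hypothetical): error scale increases with x1.
rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] - df["x2"] + rng.normal(size=n) * np.exp(0.5 * df["x1"])

res = smf.ols("y ~ x1 + x2", data=df).fit()
exog = res.model.exog  # design matrix, including the intercept column

# Breusch-Pagan: the null hypothesis is homoskedastic errors.
bp_lm, bp_lm_pvalue, bp_f, bp_f_pvalue = het_breuschpagan(res.resid, exog)

# VIF per regressor (the intercept column is skipped).
vif = {
    name: variance_inflation_factor(exog, i)
    for i, name in enumerate(res.model.exog_names)
    if name != "Intercept"
}

# Durbin-Watson: values near 2 suggest little first-order serial correlation.
dw = durbin_watson(res.resid)

print(f"Breusch-Pagan LM p-value: {bp_lm_pvalue:.4g}")
print("VIF:", vif)
print(f"Durbin-Watson: {dw:.3f}")
```

Cook's distance and leverage are available from `res.get_influence()` on the same fitted result.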
What kind of test?
├─ Single coefficient significance → ./references/hypothesis-testing.md
├─ Joint significance (F-test) → ./references/hypothesis-testing.md
├─ Linear restrictions (Wald) → ./references/hypothesis-testing.md
├─ Compare nested models (LR test) → ./references/hypothesis-testing.md
├─ Multiple comparisons correction → ./references/hypothesis-testing.md
└─ Chi-squared test → ./references/hypothesis-testing.md
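
The coefficient tests listed above map onto methods of a fitted OLS result. The following is a sketch under assumed column names (`x1`, `x2`, `x3`) and synthetic data in which `x3` has no true effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Synthetic data (hypothetical): x3 does not enter the true model.
rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({f"x{i}": rng.normal(size=n) for i in (1, 2, 3)})
df["y"] = 1.0 + 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

res_full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
res_small = smf.ols("y ~ x1", data=df).fit()  # nested in res_full

# Single coefficient significance.
t_single = res_full.t_test("x1 = 0")

# Joint significance of x2 and x3 (F-test).
f_joint = res_full.f_test("x2 = 0, x3 = 0")

# A linear restriction across coefficients.
t_equal = res_full.t_test("x2 = x3")

# Likelihood ratio test of the nested model against the full model.
lr_stat, lr_pvalue, df_diff = res_full.compare_lr_test(res_small)

# Holm correction applied to the coefficient p-values.
reject, p_holm, _, _ = multipletests(res_full.pvalues, alpha=0.05, method="holm")

print(t_single)
print(f_joint)
print(t_equal)
print(f"LR stat={lr_stat:.3f}, p={lr_pvalue:.4g}, df={df_diff}")
print(dict(zip(res_full.params.index, p_holm)))
```

String restrictions such as `"x2 = x3"` refer to parameter names as they appear in `res_full.params.index`.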
Common issues?
├─ Missing constant / intercept → ./references/gotchas.md
├─ Convergence warnings → ./references/gotchas.md
├─ predict() errors → ./references/gotchas.md
├─ Formula parsing issues → ./references/gotchas.md
├─ summary() formatting → ./references/gotchas.md
├─ statsmodels vs pyfixest → ./references/gotchas.md
└─ General troubleshooting → ./references/gotchas.md
Important: In data research pipelines (see CLAUDE.md), statsmodels analyses are executed through script files, not interactively. This ensures auditability and reproducibility.
The pattern:
scripts/stage8_analysis/{step}_{model-name}.py

Closely read agent_reference/SCRIPT_EXECUTION_REFERENCE.md for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules.
See:
agent_reference/SCRIPT_EXECUTION_REFERENCE.md: Script execution protocol and format with validation

The examples below show statsmodels syntax. In research workflows, wrap them in scripts following the file-first pattern.
import statsmodels.api as sm # Array API
import statsmodels.formula.api as smf # Formula API (R-style)
| Operation | Code |
|---|---|
| OLS (formula) | smf.ols("y ~ x1 + x2", data=df).fit() |
| OLS (array) | sm.OLS(y, sm.add_constant(X)).fit() |
| Logit | smf.logit("y ~ x1 + x2", data=df).fit() |
| Probit | smf.probit("y ~ x1 + x2", data=df).fit() |
| Poisson | smf.poisson("y ~ x1 + x2", data=df).fit() |
| GLM (custom) | smf.glm("y ~ x1", data=df, family=sm.families.Binomial()).fit() |
| WLS | smf.wls("y ~ x1", data=df, weights=w).fit() |
| Robust standard errors (HC1) | smf.ols(...).fit(cov_type='HC1') |
| ARIMA | sm.tsa.ARIMA(y, order=(p,d,q)).fit() |
| Summary | results.summary() |
| Predict | results.predict(new_data) |
| Confidence intervals | results.conf_int(alpha=0.05) |
| Marginal effects | results.get_margeff(at='overall') |
| VIF | from statsmodels.stats.outliers_influence import variance_inflation_factor |
| Breusch-Pagan | sm.stats.diagnostic.het_breuschpagan(resid, exog) |
# Additive terms
"y ~ x1 + x2 + x3"
# Interaction (with main effects)
"y ~ x1 * x2" # equivalent to x1 + x2 + x1:x2
# Interaction only (no main effects)
"y ~ x1 : x2"
# Categorical variable
"y ~ C(region)" # treatment coding (default)
"y ~ C(region, Treatment(reference='West'))" # explicit reference
# Suppress intercept
"y ~ x1 + x2 - 1"
# Polynomial
"y ~ x1 + I(x1**2)" # I() protects Python operators
| Topic | Reference File |
|---|---|
| Installation | ./references/quickstart.md |
| Formula vs array API | ./references/quickstart.md |
| Reading summary output | ./references/quickstart.md |
| Comparison to pyfixest | ./references/quickstart.md |
| OLS regression | ./references/linear-models.md |
| Weighted least squares | ./references/linear-models.md |
| GLS | ./references/linear-models.md |
| Robust regression (RLM) | ./references/linear-models.md |
| Quantile regression | ./references/linear-models.md |
| Interactions and polynomials | ./references/linear-models.md |
| GLM framework | ./references/glm-discrete.md |
| Logit / probit | ./references/glm-discrete.md |
| Multinomial logit | ./references/glm-discrete.md |
| Poisson / negative binomial | ./references/glm-discrete.md |
| Zero-inflated models | ./references/glm-discrete.md |
| Marginal effects | ./references/glm-discrete.md |
| Exposure / offset | ./references/glm-discrete.md |
| ARIMA / SARIMAX | ./references/time-series.md |
| VAR / VECM | ./references/time-series.md |
| Exponential smoothing | ./references/time-series.md |
| Unit root tests | ./references/time-series.md |
| ACF / PACF | ./references/time-series.md |
| Forecasting | ./references/time-series.md |
| State space models | ./references/time-series.md |
| Heteroskedasticity tests | ./references/diagnostics.md |
| Normality tests | ./references/diagnostics.md |
| Specification tests (RESET) | ./references/diagnostics.md |
| VIF / multicollinearity | ./references/diagnostics.md |
| Influence measures | ./references/diagnostics.md |
| Residual analysis | ./references/diagnostics.md |
| Durbin-Watson | ./references/diagnostics.md |
| t-tests and F-tests | ./references/hypothesis-testing.md |
| Wald tests | ./references/hypothesis-testing.md |
| Likelihood ratio tests | ./references/hypothesis-testing.md |
| Multiple comparison corrections | ./references/hypothesis-testing.md |
| Comparing nested models | ./references/hypothesis-testing.md |
| Serial correlation tests | ./references/diagnostics.md |
| Diagnostic checklist | ./references/diagnostics.md |
| Chi-squared tests | ./references/hypothesis-testing.md |
| Joint significance tests | ./references/hypothesis-testing.md |
| Ordered logit / probit | ./references/glm-discrete.md |
| Mixed effects (MixedLM) | ./references/linear-models.md |
| Constant term pitfall | ./references/gotchas.md |
| Convergence warnings | ./references/gotchas.md |
| predict() issues | ./references/gotchas.md |
| Formula parsing (patsy) | ./references/gotchas.md |
| summary() vs summary2() | ./references/gotchas.md |
| NaN / missing data | ./references/gotchas.md |
| DataFrame index issues | ./references/gotchas.md |
| statsmodels vs pyfixest | ./references/gotchas.md |
When this library is used as a primary analytical tool, include in the report's Software & Tools references:
Seabold, S. & Perktold, J. (2010). "Statsmodels: Econometric and Statistical Modeling with Python." Proceedings of the 9th Python in Science Conference.
Cite when: statsmodels is used for GLM estimation, time series modeling, or statistical hypothesis testing central to the analysis.
Do not cite when: statsmodels is used only for post-estimation diagnostics supporting another library's primary estimation.
For method-specific citations (e.g., individual estimators or techniques),
consult the reference files in this skill and agent_reference/CITATION_REFERENCE.md.