Statistical models library for Python. Use when you need specific model classes (OLS, GLM, mixed models, ARIMA) with detailed diagnostics, residuals, and inference. Best for econometrics, time series, rigorous inference with coefficient tables. For guided statistical test selection with APA reporting use statistical-analysis.
Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
This skill should be used when:
- Fitting specific model classes (OLS, GLM, mixed models, ARIMA) with detailed diagnostics and residuals
- Running econometric or time series analyses
- Producing rigorous inference with coefficient tables and p-values
import statsmodels.api as sm
import numpy as np
import pandas as pd
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data)
# Fit OLS model
model = sm.OLS(y, X)
results = model.fit()
# View comprehensive results
print(results.summary())
# Key results
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\n{results.params}")
print(f"P-values:\n{results.pvalues}")
# Predictions with confidence intervals
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary) # includes mean, CI, prediction intervals
# Diagnostics
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")
# Visualize residuals
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
from statsmodels.discrete.discrete_model import Logit
# Add constant
X = sm.add_constant(X_data)
# Fit logit model
model = Logit(y_binary, X)
results = model.fit()
print(results.summary())
# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:\n", odds_ratios)
# Predicted probabilities
probs = results.predict(X)
# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)
# Model evaluation
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")
# Marginal effects
marginal = results.get_margeff()
print(marginal.summary())
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Check stationarity
from statsmodels.tsa.stattools import adfuller
adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")
# If non-stationary (ADF p > 0.05), difference the series; otherwise keep it as-is
y_diff = y_series.diff().dropna() if adf_result[1] > 0.05 else y_series
# Plot ACF/PACF to identify p, q
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(y_diff, lags=40, ax=ax1)
plot_pacf(y_diff, lags=40, ax=ax2)
plt.show()
# Fit ARIMA(p,d,q)
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()
print(results.summary())
# Forecast
forecast = results.forecast(steps=10)
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()
print(forecast_df) # includes mean and confidence intervals
# Residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
import statsmodels.api as sm
# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())
# Rate ratios (for Poisson with log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\n", rate_ratios)
# Check overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")
if overdispersion > 1.5:
    # Use Negative Binomial instead
    from statsmodels.discrete.discrete_model import NegativeBinomial
    nb_model = NegativeBinomial(y_counts, X)
    nb_results = nb_model.fit()
    print(nb_results.summary())
Comprehensive suite of linear models for continuous outcomes with various error structures.
Available models:
Key features:
When to use: Continuous outcome variable, want inference on coefficients, need diagnostics
Reference: See references/linear_models.md for detailed guidance on model selection, diagnostics, and best practices.
Flexible framework extending linear models to non-normal distributions.
Distribution families:
Link functions:
Key features:
When to use: Non-normal outcomes, need flexible variance and link specifications
Reference: See references/glm.md for family selection, link functions, interpretation, and diagnostics.
Models for categorical and count outcomes.
Binary models:
Multinomial models:
Count models:
Key features:
When to use: Binary, categorical, or count outcomes
Reference: See references/discrete_choice.md for model selection, interpretation, and evaluation.
Comprehensive time series modeling and forecasting capabilities.
Univariate models:
Multivariate models:
Advanced models:
Key features:
When to use: Time-ordered data, forecasting, understanding temporal dynamics
Reference: See references/time_series.md for model selection, diagnostics, and forecasting methods.
Extensive testing and diagnostic capabilities for model validation.
Residual diagnostics:
Influence and outliers:
Hypothesis testing:
Multiple comparisons:
Effect sizes and power:
Robust inference:
When to use: Validating assumptions, detecting problems, ensuring robust inference
Reference: See references/stats_diagnostics.md for comprehensive testing and diagnostic procedures.
Statsmodels supports R-style formulas for intuitive model specification:
import statsmodels.formula.api as smf
# OLS with formula
results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()
# Categorical variables (automatic dummy coding)
results = smf.ols('y ~ x1 + C(category)', data=df).fit()
# Interactions
results = smf.ols('y ~ x1 * x2', data=df).fit() # x1 + x2 + x1:x2
# Polynomial terms
results = smf.ols('y ~ x + I(x**2)', data=df).fit()
# Logit
results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()
# Poisson
results = smf.poisson('count ~ x1 + x2', data=df).fit()
# ARIMA (not available via formula, use regular API)
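Mixed models, mentioned at the top of this skill, are also reachable through the formula API via `mixedlm`. A random-intercept sketch on simulated grouped data (group structure and effect sizes are made up for the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
groups = np.repeat(np.arange(20), 10)          # 20 groups, 10 observations each
group_effect = rng.normal(0, 1, 20)[groups]    # random intercept per group
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + group_effect + rng.normal(0, 0.5, 200)
df = pd.DataFrame({"y": y, "x": x, "group": groups})

# Random-intercept linear mixed model: fixed effect for x, grouping by "group"
mm = smf.mixedlm("y ~ x", data=df, groups=df["group"]).fit()
print(mm.summary())
print(mm.params["x"])  # fixed-effect slope, near the simulated 0.5
```

The summary reports fixed effects alongside the estimated group-level variance ("Group Var").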
# Compare models using AIC/BIC
models = {
    'Model 1': model1_results,
    'Model 2': model2_results,
    'Model 3': model3_results,
}
comparison = pd.DataFrame({
    'AIC': {name: res.aic for name, res in models.items()},
    'BIC': {name: res.bic for name, res in models.items()},
    'Log-Likelihood': {name: res.llf for name, res in models.items()},
})
print(comparison.sort_values('AIC'))
# Lower AIC/BIC indicates better model
# For nested models (one is subset of the other)
from scipy import stats
lr_stat = 2 * (full_model.llf - reduced_model.llf)
df_diff = full_model.df_model - reduced_model.df_model
p_value = stats.chi2.sf(lr_stat, df_diff)  # survival function avoids precision loss for small p
print(f"LR statistic: {lr_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Full model significantly better")