Performs stratified and subgroup analyses for clinical trial data. Covers Mantel-Haenszel pooling, Breslow-Day homogeneity testing, interaction terms in regression, multiple comparisons correction, and forest plot visualization. Use when analyzing treatment effects across patient subgroups or controlling for stratification variables.
Reference examples tested with: statsmodels 0.14+, scipy 1.12+, numpy 1.26+, pandas 2.1+, matplotlib 3.8+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Analyze treatment effects across subgroups" -> Test whether treatment effects differ across patient subgroups using stratified analysis, interaction terms, and multiplicity-adjusted comparisons.
statsmodels.stats.contingency_tables.StratifiedTable(), statsmodels.formula.api.logit()Goal: Estimate a pooled treatment effect across strata while allowing for different baseline rates.
Approach: Construct per-stratum 2x2 tables and compute the Mantel-Haenszel weighted odds ratio.
from statsmodels.stats.contingency_tables import StratifiedTable
import pandas as pd
import numpy as np
tables = []
for stratum in df['subgroup'].unique():
sub = df[df['subgroup'] == stratum]
t = pd.crosstab(sub['treatment'], sub['outcome']).values
if t.shape == (2, 2):
tables.append(t)
st = StratifiedTable(tables)
print(st.summary())
print(st.oddsratio_pooled) # MH pooled OR
print(st.oddsratio_pooled_confint()) # 95% CI
print(st.test_null_odds()) # H0: common OR = 1
print(st.test_equal_odds()) # Breslow-Day: H0: all stratum ORs equal
The test_equal_odds() method tests whether stratum-specific odds ratios are equal. A significant result (p < 0.05) suggests effect modification across strata.
Breslow-Day has low power with few strata or small stratum sizes. A non-significant result does not prove homogeneity -- it may reflect insufficient power to detect heterogeneity. With many strata, visual assessment via forest plot should supplement the formal test before drawing conclusions.
Goal: Test whether the treatment effect varies by a subgroup variable.
Approach: Fit a single logistic model with an interaction term, where the interaction coefficient tests effect modification directly.
import statsmodels.formula.api as smf
# Single model with interaction -- the correct approach
model = smf.logit(
'outcome ~ C(treatment, Treatment(reference="Placebo")) * C(age_group)', data=df
).fit()
# The interaction coefficient tests whether treatment effect differs by age group
# Extract subgroup-specific ORs for reporting
for group in df['age_group'].unique():
sub_model = smf.logit(
'outcome ~ C(treatment, Treatment(reference="Placebo"))',
data=df[df['age_group'] == group]
).fit()
or_val = np.exp(sub_model.params.iloc[1])
ci = np.exp(sub_model.conf_int().iloc[1])
print(f'{group}: OR={or_val:.3f} ({ci[0]:.3f}-{ci[1]:.3f})')
The proper way to test for subgroup effects is via interaction terms in a single model, NOT by comparing p-values from separate per-subgroup models. Separate models have different power and comparing their p-values is statistically invalid.
Not every baseline variable warrants an interaction test. Variables should be tested for effect modification when there is prior scientific reason to expect the treatment effect to differ across levels. Common biologically motivated effect modifiers include: disease severity (treatment may work only in severe disease), genetic variants affecting drug metabolism (pharmacogenomic subgroups), biomarkers of the targeted pathway, and age when pharmacokinetics differ substantially. Testing every available demographic variable without scientific rationale inflates false positives and produces uninterpretable results, even with multiplicity correction.
Logistic regression tests multiplicative interaction (ratio of ORs). Null multiplicative interaction does not imply null additive interaction. For public health decisions, additive interaction is often more relevant.
RERI (Relative Excess Risk due to Interaction) measures additive interaction on the multiplicative scale:
# From a model: outcome ~ treatment + subgroup_indicator + treatment:subgroup_indicator
# OR_11 = OR for treated subjects in the subgroup
# OR_10 = OR for treated subjects not in the subgroup
# OR_01 = OR for untreated subjects in the subgroup
reri = or_11 - or_10 - or_01 + 1
RERI = 0 indicates no additive interaction. Positive RERI indicates synergism (combined effect exceeds sum of individual effects). RERI CIs require the delta method or bootstrap, since RERI is a nonlinear function of the ORs.
Quantitative interaction means the treatment effect varies in magnitude across subgroups but remains in the same direction. Qualitative (crossover) interaction means the effect reverses direction -- beneficial in one subgroup, harmful in another. This distinction is critical for regulatory decisions: qualitative interaction may warrant restricting the indication to the benefiting subgroup. The Gail-Simon test formally tests for qualitative interaction.
Detecting a treatment-by-subgroup interaction requires approximately 4 times the sample size needed to detect the main treatment effect. A trial powered to detect OR = 0.6 overall cannot reliably detect subgroup differences of similar magnitude. Non-significant interaction tests should be interpreted cautiously -- absence of evidence is not evidence of absence, especially in underpowered subgroup analyses.
Goal: Control error rates when testing treatment effects across multiple subgroups.
Approach: Apply FWER or FDR correction to the set of subgroup-specific p-values.
from statsmodels.stats.multitest import multipletests
subgroup_pvalues = [0.03, 0.15, 0.04, 0.22, 0.01]
# FWER control (appropriate for regulatory/confirmatory)
reject_fwer, adjusted_fwer, _, _ = multipletests(subgroup_pvalues, method='holm')
# FDR control (appropriate for exploratory)
reject_fdr, adjusted_fdr, _, _ = multipletests(subgroup_pvalues, method='fdr_bh')
| Method | Controls | Use case |
|---|---|---|
| Holm (step-down Bonferroni) | FWER | Confirmatory/regulatory subgroup tests |
| Hochberg (step-up) | FWER | Less conservative than Holm; valid only under independence or positive regression dependency (PRDS) |
| Benjamini-Hochberg | FDR | Exploratory subgroup screening |
| Aspect | Pre-specified | Post-hoc |
|---|---|---|
| Timing | Before unblinding, in SAP | After seeing data |
| Credibility | High (if biologically justified) | Low (hypothesis-generating only) |
| Regulatory weight | Can support labeling claims | Cannot support claims alone |
| Multiplicity adjustment | Required per SAP | Required + heavy skepticism |
EMA credibility criteria for subgroup claims: (1) pre-specified and biologically plausible, (2) significant interaction test, (3) consistent across related endpoints, (4) ideally replicated in an independent study.
A statistically significant interaction does not establish a real subgroup effect. Before concluding that treatment works differently in a subgroup, ask: (1) Is there a known biological mechanism? A drug targeting estrogen receptors showing differential efficacy by sex has mechanistic support. An age subgroup effect for an antibiotic does not. (2) Is the direction consistent with the mechanism? (3) Are related biomarkers concordant? Absent biological rationale, a "significant" subgroup finding among many tested subgroups is more likely a false positive than a real effect, regardless of the p-value.
Goal: Visualize point estimates and confidence intervals across subgroups on a common scale.
Approach: Plot subgroup-specific ORs with error bars on a log-scaled axis with a reference line at OR = 1.
import matplotlib.pyplot as plt
import numpy as np
def subgroup_forest_plot(labels, ors, lower_cis, upper_cis, overall_or=None, figsize=(8, 6)):
fig, ax = plt.subplots(figsize=figsize)
y_pos = range(len(labels))
ax.errorbar(ors, y_pos,
xerr=[np.array(ors) - np.array(lower_cis),
np.array(upper_cis) - np.array(ors)],
fmt='D', color='black', capsize=3, markersize=5)
ax.axvline(x=1.0, color='gray', linestyle='--', linewidth=0.8)
if overall_or is not None:
ax.axvline(x=overall_or, color='blue', linestyle=':', linewidth=0.8, alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xlabel('Odds Ratio (95% CI)')
ax.set_xscale('log')
plt.tight_layout()
return fig