Statistical methods for analyzing A/B test (controlled experiment) results. Covers hypothesis testing for different metric types, multiple comparison corrections, and effect size calculations. Use when analyzing experiment data with control and treatment groups.
This skill provides guidance on correctly analyzing A/B test results using appropriate statistical methods.
A/B testing compares a control group against a treatment group to determine if a change has a statistically significant effect. The key challenge is choosing the right statistical test based on the metric type.
For metrics that are 0/1 outcomes (did user convert?), use a two-proportion z-test:
```python
from statsmodels.stats.proportion import proportions_ztest

# counts: number of successes in each group
# nobs: total observations in each group
counts = [treatment_conversions, control_conversions]
nobs = [treatment_total, control_total]

# Two-sided test
stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')
```
Why not chi-squared? For a 2x2 table the two tests are mathematically equivalent (the squared z-statistic equals the chi-squared statistic), but the z-test gives you a signed z-statistic directly, which is convenient for one-sided tests and confidence intervals.
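A quick sanity check of this equivalence, using synthetic counts (not from any real experiment); the squared z-statistic matches the chi-squared statistic once Yates' continuity correction is disabled:

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Synthetic 2x2 outcome: 120/1000 treatment vs 100/1000 control conversions
counts = [120, 100]
nobs = [1000, 1000]

z_stat, p_z = proportions_ztest(counts, nobs, alternative='two-sided')

# Same data as a contingency table: [successes, failures] per group
table = np.array([[120, 1000 - 120],
                  [100, 1000 - 100]])
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)

print(z_stat**2, chi2)  # the two statistics agree
```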
For continuous measurements, use Welch's t-test (does not assume equal variances):
```python
from scipy import stats

# Two-sided Welch's t-test
stat, p_value = stats.ttest_ind(
    treatment_values,
    control_values,
    equal_var=False,  # Welch's t-test
)
```
Important: Use equal_var=False to get Welch's t-test, which is more robust than Student's t-test when sample sizes or variances differ between groups.
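A minimal illustration of why this matters, using made-up numbers with very different group sizes and variances; the pooled-variance Student's t-test and Welch's t-test give noticeably different p-values here:

```python
from scipy import stats

# Made-up data: small, high-variance treatment vs larger, low-variance control
control = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
treatment = [10, 20, 30]

_, p_student = stats.ttest_ind(treatment, control, equal_var=True)
_, p_welch = stats.ttest_ind(treatment, control, equal_var=False)

print(p_student, p_welch)  # Welch is much more conservative on this data
```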
| Metric Type | Examples | Test |
|---|---|---|
| Binary (0/1) | Conversion, Click, Purchase | Two-proportion z-test |
| Continuous | Revenue, Time, Page views | Welch's t-test |
| Count data | Number of items | Welch's t-test (if mean > 5) |
When testing multiple hypotheses, the probability of at least one false positive increases. Apply corrections:
The simplest and most conservative correction is Bonferroni:

```python
# If testing k hypotheses at significance level alpha:
adjusted_alpha = alpha / k

# A result is significant only if p_value < adjusted_alpha
significant = p_value < (0.05 / num_tests)
```
Example: testing 3 metrics with α = 0.05 gives an adjusted threshold of 0.05 / 3 ≈ 0.0167 per metric.
Apply Bonferroni when testing multiple metrics or variants within a single experiment, where any one false positive would drive a wrong decision. Do NOT apply it across independent experiments if you accept some false positives.
The most common way to express A/B test results:
```python
# Relative lift = (treatment - control) / control
relative_lift = (treatment_mean - control_mean) / control_mean
```
Interpretation: a lift of 0.15 means the treatment is 15% better than control in relative terms (e.g., a conversion rate moving from 10% to 11.5%).
```python
import pandas as pd

# For a dataframe with 'variant' and 'converted' columns
control_data = df[df['variant'] == 'control']
treatment_data = df[df['variant'] == 'treatment']

control_rate = control_data['converted'].mean()
treatment_rate = treatment_data['converted'].mean()
```
```python
import pandas as pd

df = pd.read_csv('experiment.csv')
control = df[df['variant'] == 'control']
treatment = df[df['variant'] == 'treatment']
```
```python
from statsmodels.stats.proportion import proportions_ztest

# Calculate rates
control_rate = control['converted'].mean()
treatment_rate = treatment['converted'].mean()

# Run test
counts = [treatment['converted'].sum(), control['converted'].sum()]
nobs = [len(treatment), len(control)]
_, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

# Calculate lift
lift = (treatment_rate - control_rate) / control_rate
```
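Alongside the p-value and lift, a confidence interval on the absolute difference in rates is often reported. A normal-approximation (Wald) interval can be sketched by hand; the counts below are illustrative, not from a real experiment:

```python
import numpy as np
from scipy.stats import norm

# Illustrative counts (not from a real experiment)
treatment_conv, treatment_n = 230, 2000
control_conv, control_n = 200, 2000

p_t = treatment_conv / treatment_n
p_c = control_conv / control_n
diff = p_t - p_c

# Wald standard error of the difference in proportions
se = np.sqrt(p_t * (1 - p_t) / treatment_n + p_c * (1 - p_c) / control_n)
z = norm.ppf(0.975)  # 95% two-sided
ci_low, ci_high = diff - z * se, diff + z * se

print(f"diff = {diff:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

If the interval contains zero, as it does for these numbers, the difference is not significant at the 95% level.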
```python
from scipy import stats

# Calculate means
control_mean = control['revenue'].mean()
treatment_mean = treatment['revenue'].mean()

# Run Welch's t-test
_, p_value = stats.ttest_ind(
    treatment['revenue'],
    control['revenue'],
    equal_var=False
)

# Calculate lift
lift = (treatment_mean - control_mean) / control_mean
```
```python
num_tests = 3  # e.g., conversion, revenue, duration
adjusted_alpha = 0.05 / num_tests  # ≈ 0.0167

# Determine significance
is_significant = p_value < adjusted_alpha
```
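For more than a handful of metrics, statsmodels can apply the same correction (and less conservative alternatives such as Holm) in one call; the p-values below are placeholders:

```python
from statsmodels.stats.multitest import multipletests

# Placeholder p-values for three metrics
p_values = [0.010, 0.030, 0.040]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')

print(reject)      # which metrics survive the correction
print(p_adjusted)  # Bonferroni-adjusted p-values (p * num_tests, capped at 1)
```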
Power analysis helps determine how many additional samples are needed to detect an effect. Use this when a result is not statistically significant but you want to know if more data could help.
```python
from statsmodels.stats.power import TTestIndPower
import numpy as np

def additional_samples_needed(control_data, treatment_data, alpha, power=0.8):
    """Calculate additional samples needed per group to reach the target power."""
    control_mean = control_data.mean()
    treatment_mean = treatment_data.mean()
    pooled_std = np.sqrt((control_data.var() + treatment_data.var()) / 2)
    if pooled_std == 0 or control_mean == treatment_mean:
        return 0

    # Cohen's d effect size
    effect_size = abs(treatment_mean - control_mean) / pooled_std

    power_analysis = TTestIndPower()
    required_n = power_analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    current_n = (len(control_data) + len(treatment_data)) / 2
    return max(0, int(np.ceil(required_n - current_n)))
```
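As a standalone sanity check of the machinery this function relies on (numbers chosen for illustration): for α = 0.05 and 80% power, `solve_power` reproduces the classic rule of thumb that a two-sample t-test needs roughly 16/d² subjects per group:

```python
from statsmodels.stats.power import TTestIndPower

# Per-group sample size to detect a small effect (Cohen's d = 0.1)
# at alpha = 0.05 with 80% power, two-sided
n_per_group = TTestIndPower().solve_power(
    effect_size=0.1,
    alpha=0.05,
    power=0.8,
    alternative='two-sided',
)
print(n_per_group)  # roughly 1571 per group; rule of thumb gives 16/0.01 = 1600
```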
```python
from statsmodels.stats.power import zt_ind_solve_power
import numpy as np

def additional_samples_proportion(control_prop, treatment_prop, n_control, n_treatment, alpha, power=0.8):
    """Calculate additional samples needed per group for a proportion test."""
    if control_prop == treatment_prop:
        return 0

    # Cohen's h effect size for proportions
    effect_size = 2 * (np.arcsin(np.sqrt(treatment_prop)) - np.arcsin(np.sqrt(control_prop)))

    required_n = zt_ind_solve_power(
        effect_size=abs(effect_size),
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    current_n = (n_control + n_treatment) / 2
    return max(0, int(np.ceil(required_n - current_n)))
```
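A quick standalone check of the Cohen's h step with illustrative rates (10% vs 12% conversion): the arcsine-transformed effect is small, so the required per-group sample size is large:

```python
import numpy as np
from statsmodels.stats.power import zt_ind_solve_power

p_control, p_treatment = 0.10, 0.12

# Cohen's h: arcsine-transformed difference in proportions
h = 2 * (np.arcsin(np.sqrt(p_treatment)) - np.arcsin(np.sqrt(p_control)))

n_per_group = zt_ind_solve_power(
    effect_size=abs(h),
    alpha=0.05,
    power=0.8,
    alternative='two-sided',
)
print(h, n_per_group)  # h ≈ 0.064, several thousand users per group
```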
Effect size formulas used above:

- Cohen's d (continuous metrics): `(mean1 - mean2) / pooled_std`
- Cohen's h (proportions): `2 * (arcsin(√p1) - arcsin(√p2))`

Installation:

```
pip install scipy statsmodels pandas numpy
```
Key imports:

- `scipy.stats.ttest_ind` - Welch's t-test (with `equal_var=False`)
- `statsmodels.stats.proportion.proportions_ztest` - Two-proportion z-test