A comprehensive guide for designing, analyzing, and decision-making in A/B testing (Online Controlled Experiments). Use this skill for sample size calculation, power analysis, statistical significance testing, Bayesian analysis, or when users need to interpret experiment results.
This skill provides a rigorous framework for Online Controlled Experiments (OCE), combining methodology based on Kohavi's principles with practical calculation tools.
from ab_test_calc import ABTestCalculator
calc = ABTestCalculator()
# Test significance
result = calc.test_significance(
control_visitors=10000,
control_conversions=500,
variant_visitors=10000,
variant_conversions=550
)
print(f"Significant: {result['significant']}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Lift: {result['lift']:.2%}")
# Test significance
python ab_test_calc.py --test 10000 500 10000 550
# Calculate sample size
python ab_test_calc.py --sample-size --baseline 0.05 --mde 0.10 --power 0.8
# Power analysis
python ab_test_calc.py --power-analysis --baseline 0.05 --mde 0.10 --samples 5000
# Bayesian analysis
python ab_test_calc.py --bayesian 10000 500 10000 550
Before running an experiment, determine if the design has sufficient statistical power.
The MDE is a business constraint, not a statistical output. It answers: What is the smallest lift that justifies the cost of this feature?
result = calc.calculate_sample_size(
baseline_rate=0.05, # Current 5% conversion
minimum_detectable_effect=0.10, # 10% relative improvement
power=0.8, # 80% power
alpha=0.05 # 5% significance level
)
print(f"Need {result['sample_size_per_variant']:,} visitors per variant")
print(f"Total: {result['total_sample_size']:,} visitors")
duration = calc.estimate_duration(
daily_visitors=5000,
baseline_rate=0.03,
minimum_detectable_effect=0.15
)
print(f"Test will take ~{duration['days']} days")
CRITICAL: Before interpreting any metric movements, validate experiment integrity.
Check if actual user ratio matches design (e.g., 50/50).
"Any figure that looks interesting or different is usually wrong."
If a metric moves by a massive amount (e.g., +50% conversion in a mature product), assume it is a data error until proven otherwise.
See references/PITFALLS_AND_CHECKS.md for complete checklist.
Best for rigorous binary decisions (Ship/No Ship) with controlled error rates.
result = calc.test_significance(
control_visitors=10000,
control_conversions=500,
variant_visitors=10000,
variant_conversions=550,
test="chi_square" # or "z_test"
)
# Returns:
{
"significant": True,
"p_value": 0.0234,
"control_rate": 0.05,
"variant_rate": 0.055,
"lift": 0.10,
"confidence_interval": {"lower": 0.02, "upper": 0.18},
"recommendation": "Variant shows significant improvement"
}
Best when you need "What is the probability B is better than A?"
result = calc.bayesian_analysis(
control_visitors=10000,
control_conversions=500,
variant_visitors=10000,
variant_conversions=550
)
# Returns:
{
"prob_variant_better": 0.9523,
"prob_control_better": 0.0477,
"expected_lift": 0.098,
"credible_interval_95": [0.02, 0.18]
}
| Framework | Best Used When | Key Metric |
|---|---|---|
| Frequentist | Rigorous binary decision, high-risk changes | P-value, Confidence Interval |
| Bayesian | "Probability B beats A?", optional stopping | Probability to be Best, Expected Loss |
See references/STATISTICAL_FRAMEWORKS.md for detailed methodology.
Test multiple variants with correction for multiple comparisons:
result = calc.test_multiple_variants(
control=(10000, 500),
variants=[
(10000, 550), # Variant A
(10000, 520), # Variant B
(10000, 480) # Variant C
],
correction="bonferroni" # or "holm", "none"
)
print(f"Winner: {result['winner']}")
Decision-making should not rely solely on statistical significance.
Check "Non-Negotiable" metrics (e.g., Latency, Error Rates). If these degrade beyond threshold, kill the experiment regardless of wins elsewhere.
class ABTestCalculator:
def __init__(self, alpha: float = 0.05)
# Significance testing
def test_significance(self, control_visitors, control_conversions,
variant_visitors, variant_conversions,
test="chi_square") -> dict
# Sample size calculation
def calculate_sample_size(self, baseline_rate, minimum_detectable_effect,
power=0.8, alpha=0.05) -> dict
# Power analysis
def calculate_power(self, baseline_rate, minimum_detectable_effect,
sample_size, alpha=0.05) -> dict
# Confidence interval
def confidence_interval(self, visitors, conversions,
confidence=0.95) -> dict
# Bayesian analysis
def bayesian_analysis(self, control_visitors, control_conversions,
variant_visitors, variant_conversions,
simulations=100000) -> dict
# Multiple variants
def test_multiple_variants(self, control, variants,
correction="bonferroni") -> dict
# Duration estimation
def estimate_duration(self, daily_visitors, baseline_rate,
minimum_detectable_effect, power=0.8) -> dict
calc = ABTestCalculator()
# 1. Estimate required sample size
sample = calc.calculate_sample_size(
baseline_rate=0.03,
minimum_detectable_effect=0.15,
power=0.8
)
print(f"Need {sample['sample_size_per_variant']:,} visitors per variant")
# 2. Estimate test duration
duration = calc.estimate_duration(
daily_visitors=5000,
baseline_rate=0.03,
minimum_detectable_effect=0.15
)
print(f"Test will take ~{duration['days']} days")
calc = ABTestCalculator()
# 1. Test significance
result = calc.test_significance(15000, 450, 15000, 525)
# 2. Get Bayesian probability
bayes = calc.bayesian_analysis(15000, 450, 15000, 525)
print(f"P-value: {result['p_value']:.4f}")
print(f"Lift: {result['lift']:.2%}")
print(f"Probability variant wins: {bayes['prob_variant_better']:.1%}")
scipy>=1.10.0
numpy>=1.24.0
Install: pip install scipy numpy