Name: A/B Test Setup
Author: sharkitect-solutions

A/B Test Setup

Use when the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking. For CRO strategy and test ideas, see page-cro. For statistical methods beyond experimentation, see statistical-analysis.

sharkitect-solutions · 0 stars · March 13, 2026

Profession
Category: Laboratory tools

File Index

| File | What It Contains | Load When |
|---|---|---|
| `SKILL.md` | Test design decisions, hypothesis framework, analysis methodology, anti-patterns | Always loaded (you are here) |
| `statistical-pitfalls.md` | SRM detection, multiple comparisons, Bayesian vs frequentist, sequential testing, when calculators lie | User asks about statistics, significance, sample size issues, or test results seem wrong |
| `platform-implementation.md` | Tool comparison (PostHog/Optimizely/VWO/LaunchDarkly/Statsig/Eppo), client-side vs server-side gotchas, CDN interference, feature flag migration | User asks about implementation, tool selection, or debugging test setup |
| `test-velocity-management.md` | Prioritization frameworks (ICE/PIE gotchas), test backlog management, velocity benchmarks, when NOT to test, rollout procedures | User asks about test prioritization, program management, or scaling experimentation |

Do NOT load companion files unless the user's question specifically requires that depth. Most test design questions are answerable from this file alone.

Scope Boundary

| Topic | This Skill | Not This Skill (Use Instead) |
|---|---|---|
| Designing experiments | YES | |
| Hypothesis formulation | YES | |
| Sample size decisions | YES | |
| Test analysis methodology | YES | |
| Statistical significance | YES | statistical-analysis (for general stats) |
| Bayesian A/B testing | YES | statistical-analysis (for Bayesian theory) |
| CRO test ideas | | page-cro, signup-flow-cro, popup-cro |
| Event tracking setup | | analytics-tracking |
| Multivariate copy testing | | copywriting |
| Feature flag management (non-experiment) | | Use engineering judgment |
| Pricing experiments | Hypothesis + measurement | pricing-strategy (for pricing decisions) |

Test Type Decision

| Signal | Test Type | Why |
|---|---|---|
| Major page redesign or new flow | Split URL test | Different URLs avoid client-side modification complexity. Cleaner implementation for structural changes |
| Multiple simultaneous element changes + >50K weekly visitors | Multivariate (MVT) | Tests interaction effects between changes. Requires 4-10x traffic of simple A/B. Most teams overestimate their traffic for MVT |
| Testing a new feature rollout | Feature flag with measurement | Not every rollout needs full A/B infrastructure. Binary on/off with pre/post comparison is sufficient when effect size is large (>30%) |
| Comparing 3+ creative approaches | A/B/n | Multiple variants, single change dimension. Requires Bonferroni or Holm correction for multiple comparisons (most tools don't do this automatically) |
| Single change hypothesis | A/B (50/50 split) | Default. Simple, fast to reach significance, easiest to analyze correctly |
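
Since the A/B/n row calls for a Bonferroni or Holm correction that most tools skip, here is a minimal sketch of both, applied to one p-value per variant-vs-control comparison. The function name `holm_reject` and the p-values are hypothetical, chosen only to show a case where Holm rejects more than Bonferroni.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: True where the null is rejected."""
    m = len(p_values)
    # Visit comparisons from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold relaxes as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first comparison that fails
    return reject


# Hypothetical p-values for variants B, C, D, each compared to control A.
p_vals = [0.012, 0.020, 0.20]
holm = holm_reject(p_vals)
# Bonferroni applies the strictest threshold (alpha / m) to every comparison.
bonferroni = [p <= 0.05 / len(p_vals) for p in p_vals]
print(holm, bonferroni)
```

Holm controls the same family-wise error rate as Bonferroni but is never less powerful, which is why it is usually the better default when the tool leaves the correction to you.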

Hypothesis Quality Ladder

| Level | Example | Problem |
|---|---|---|
| 1 (Weak) | "Let's test a new button color" | No reasoning, no prediction, no measurement criteria |
| 2 (Directional) | "A green button will increase clicks" | Has prediction but no reasoning or quantification |
| 3 (Reasoned) | "Because heatmaps show users miss the CTA, a higher-contrast button will increase clicks" | Has reasoning + prediction but no quantification |
| 4 (Quantified) | "Because heatmaps show 60% of users never scroll to the CTA, moving it above the fold will increase click-through rate by 15%+ for new visitors" | Has observation + specific change + quantified prediction + audience + metric |
| 5 (Falsifiable) | Level 4 + "We'll measure over 14 days with 5K visitors per variant. If CTR increase is <10%, we'll reject the hypothesis" | Adds pre-committed sample size, duration, and rejection criteria |
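
A Level 5 hypothesis needs a pre-committed sample size. The sketch below uses the standard normal-approximation formula for a two-sided two-proportion test. The 20% baseline CTR, 80% power, and the function name are assumptions for illustration; the ladder only specifies the 15% relative lift.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Assumed 20% baseline CTR, 15% relative lift as the minimum detectable effect.
n_15 = sample_size_per_variant(0.20, 0.15)
# Halving the detectable lift roughly quadruples the requirement.
n_075 = sample_size_per_variant(0.20, 0.075)
print(n_15, n_075)
```

Run this before the test starts and write the result into the hypothesis. A sample size chosen after seeing data is not pre-committed.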

Sample Size Reality Check

| Assumption | Reality | Impact |
|---|---|---|
| Fixed sample size, single analysis | Most teams peek at results | Inflates false positive rate from 5% to 20-30%. Use sequential testing or commit to NO peeking |
| Single primary metric | Teams track 5-10 metrics | Multiple comparisons: 5 metrics at 95% confidence = 23% chance of at least one false positive. Apply Bonferroni correction or pre-commit to ONE primary metric |
| Equal variance across segments | Mobile/desktop, new/returning have different baselines | Segment-level effects can be real but opposite in direction (Simpson's paradox). Pre-stratify or analyze segments independently |
| Stable baseline during test | Seasonality, promotions, PR events shift baselines | Run tests for full business cycles (minimum 1 week, ideally 2). Don't start tests on Black Friday |
| No interference between variants | Users may switch devices, share URLs, use multiple accounts | Cookie-based assignment leaks. Use user-ID assignment when possible. Accept ~5% contamination as unavoidable |
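
The 23% figure in the multiple-comparisons row follows from 1 - 0.95^5. A small sketch of that arithmetic and of the Bonferroni-adjusted threshold, assuming the metrics are independent (correlated metrics inflate less); the function names are illustrative.

```python
def family_wise_error_rate(num_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics, alpha=0.05):
    """Per-metric significance threshold under Bonferroni correction."""
    return alpha / num_metrics


fwer_uncorrected = family_wise_error_rate(5)             # about 0.226
per_metric_alpha = bonferroni_alpha(5)                   # about 0.01
fwer_corrected = family_wise_error_rate(5, per_metric_alpha)  # just under 0.05
print(round(fwer_uncorrected, 3), per_metric_alpha, round(fwer_corrected, 3))
```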

Sample Ratio Mismatch (SRM)

| Split Deviation | With 10K Users | Assessment |
|---|---|---|
| <0.5% | 50.2/49.8 | Normal random variation |
| 0.5-1.5% | 50.5/49.5 to 51.5/48.5 | Check SRM (chi-squared test, p<0.01 = problem) |
| >1.5% | 51.5/48.5+ | Almost certainly SRM. DO NOT trust results. Debug assignment logic |
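
The chi-squared check named in the table can be sketched with the standard library alone. For a two-arm test the statistic has one degree of freedom, so its p-value equals `erfc(sqrt(chi2 / 2))`. The function name and the example counts are illustrative.

```python
from math import erfc, sqrt


def srm_p_value(control_users, variant_users, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for the observed traffic split."""
    total = control_users + variant_users
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (variant_users - expected_variant) ** 2 / expected_variant)
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


p_small_gap = srm_p_value(5020, 4980)  # 50.2/49.8 split of 10K users
p_large_gap = srm_p_value(5150, 4850)  # 51.5/48.5 split of 10K users
print(round(p_small_gap, 3), round(p_large_gap, 4))
```

With 10K users the 51.5/48.5 split falls below the p<0.01 threshold while 50.2/49.8 does not, matching the table. The same percentage deviation is far more alarming at higher traffic, so rerun the check with the actual counts.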

Analysis Decision Framework

| Scenario | Action | Common Mistake |
|---|---|---|
| Reached sample size, p<0.05, meaningful effect | Ship the winner | Waiting for "more data" after reaching pre-committed sample size. You committed to a methodology -- honor it |
| Reached sample size, p<0.05, tiny effect (1-2%) | Consider implementation cost | A 1.5-percentage-point conversion lift on 10K monthly visitors = 150 more conversions per month (a 1.5% relative lift on a low baseline yields far fewer). Worth it if implementation is a CSS change. Not worth it if it's a 2-week refactor |
| Reached sample size, p>0.05 | Keep control, document learning | Calling it "inconclusive" and re-running. If you reached your pre-committed sample size and didn't detect an effect, the effect is likely smaller than your MDE. That IS a result |
| Haven't reached sample size, variant looks like it's winning | Keep running | Peeking + stopping = false positive inflation. The apparent winner at 40% of sample size reverts ~30% of the time |
| Haven't reached sample size, variant looks harmful | Check guardrail metrics | If guardrails are significantly negative (p<0.01), stop the test. This is the ONE exception to "don't stop early" |
| Conflicting segment results | Report the overall result, note segments | Cherry-picking the segment where the variant won. Post-hoc segments are hypotheses for the NEXT test, not conclusions |
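
One way to keep the framework honest is to encode it so that the only early exit is the guardrail exception. This is a rough sketch; `recommend_action` and its parameters are hypothetical, and the segment row is left out because it concerns reporting, not the ship decision.

```python
def recommend_action(reached_sample_size, p_value, guardrail_p_value=None,
                     alpha=0.05, guardrail_alpha=0.01):
    """Map test state to an action from the decision framework.

    guardrail_p_value: p-value for a guardrail metric moving in the harmful
    direction, or None if no guardrail looks harmed.
    """
    if not reached_sample_size:
        # The one allowed early stop: a significantly negative guardrail.
        if guardrail_p_value is not None and guardrail_p_value < guardrail_alpha:
            return "stop: guardrail significantly negative"
        # An apparent winner before the pre-committed sample size is ignored.
        return "keep running"
    if p_value < alpha:
        return "ship winner (weigh implementation cost if the effect is tiny)"
    return "keep control, document learning"


print(recommend_action(False, 0.01))                          # looks like a winner early
print(recommend_action(False, 0.40, guardrail_p_value=0.004))  # guardrail harmed
print(recommend_action(True, 0.20))                           # no effect detected
```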

The Peeking Problem (Quantified)

| When You Peek and Stop | Actual False Positive Rate (vs stated 5%) |
|---|---|
| After 25% of sample | ~26% |
| After 50% of sample | ~16% |
| After 75% of sample | ~10% |
| At pre-committed sample size only | ~5% (as intended) |
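
The mechanism behind the inflation can be checked with a small A/A simulation: there is no true effect, yet stopping at the first "significant" look declares a winner far more often than 5%. This is an illustrative sketch under simplifying assumptions (equally spaced looks, a normally distributed test statistic, a fixed seed). It is not meant to reproduce the exact figures in the table, which depend on the peeking schedule.

```python
import random
from math import sqrt


def false_positive_rates(num_looks=4, num_sims=20000, z_crit=1.96, seed=0):
    """Return (rate when stopping at any significant look, rate at final look only)."""
    rng = random.Random(seed)
    peek_hits = 0
    final_hits = 0
    for _ in range(num_sims):
        running_sum = 0.0
        crossed = False
        for look in range(1, num_looks + 1):
            # Each equal-sized batch adds one standardized increment under the null.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(look)  # z-statistic on all data so far
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and shipped here
        peek_hits += crossed
        final_hits += abs(z) > z_crit  # z is now the final-look statistic
    return peek_hits / num_sims, final_hits / num_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"peek-and-stop: {peeking_rate:.3f}, single final analysis: {fixed_rate:.3f}")
```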

When NOT to Test

| Situation | Why Skip A/B Testing | Do Instead |
|---|---|---|
| <1K weekly visitors to test page | Will take months to reach significance for any reasonable MDE | Make the change, compare pre/post with caution. Or test on a higher-traffic page first |
| Fixing an obvious bug | You don't A/B test whether to fix broken things | Fix it. Monitor for regression |
| Legal/compliance requirement | No choice in implementation | Implement. Measure impact but don't delay for a test |
| Already 95%+ confident in direction | The "test everything" dogma wastes resources | Ship it. Save testing capacity for genuine uncertainty |
| Effect is binary (works/doesn't) | No spectrum of outcomes to measure | Feature flag with monitoring, not A/B test |
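
The low-traffic row is easy to sanity-check before committing to a test. A minimal sketch, assuming all page traffic is split evenly across variants; the function name and the 5,000-per-variant requirement are illustrative inputs, not recommendations.

```python
from math import ceil


def weeks_to_reach_sample(required_per_variant, weekly_visitors, num_variants=2):
    """Whole weeks needed for every variant to reach its required sample."""
    return ceil(required_per_variant * num_variants / weekly_visitors)


# Hypothetical requirement of 5,000 visitors per variant.
low_traffic_weeks = weeks_to_reach_sample(5000, weekly_visitors=800)
high_traffic_weeks = weeks_to_reach_sample(5000, weekly_visitors=50000)
print(low_traffic_weeks, high_traffic_weeks)
```

At 800 weekly visitors the test would run for about three months, which is the situation the table advises skipping. Even when the arithmetic says one week, the minimum duration of a full business cycle still applies.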