Use this skill when the user asks to "design an experiment", "set up an A/B test", "calculate sample size", "plan a test", "figure out how long to run a test", "prioritize my tests", "decide how to split users", "launch an experiment", "monitor test data", "call a test", "segment test results", "analyze experiment results", "iterate on a test", or needs help with the mechanics of running a rigorous experiment from test design through result analysis.
Design, run, and analyze rigorous experiments. Covers the full test lifecycle: defining test parameters, calculating statistical requirements, minimizing build complexity, launching cleanly, monitoring data, calling tests complete, evaluating outcomes across segments, and iterating based on learnings.
Before using this skill, the PM should have:
Before starting, ask one question about output format: what format should the final document be (docx, md, or pdf)? Default: docx.
For each test, establish the statistical foundation:
Baseline metric: The current performance of your primary metric for the control group. This is your starting point. Get this from historical data — the more data, the more accurate.
Example: If your primary metric is signup rate and historically 10% of homepage visitors sign up, your baseline metric is 10%.
Minimum Detectable Effect (MDE): The smallest relative improvement you'd consider worth implementing. This is a business decision, not a statistical one.
Considerations:
Example: With a 10% baseline signup rate and a 10% MDE, you're testing whether the solution can move the signup rate from 10% to at least 11%.
P-value threshold: The probability of a false positive (calling a test positive when there's actually no difference). Standard is 0.05 (5%). Rarely need to change this.
Statistical power: The probability of correctly detecting a real difference. Standard is 0.80 (80%), meaning a 20% chance of missing a real effect. Rarely need to change this.
Sample size calculation for percentage metrics (signup rate, conversion rate, retention rate):
Sample size per variation = (Z_α/2 + Z_β)² × [p₁(1-p₁) + p₂(1-p₂)] / (p₂ - p₁)²
Where:
- Z_α/2 = 1.96 (for p-value of 0.05)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline metric (e.g., 0.10)
- p₂ = baseline × (1 + MDE) (e.g., 0.11)
Sample size calculation for continuous metrics (average order value, time on page):
Sample size per variation = (Z_α/2 + Z_β)² × 2σ² / (MDE × baseline)²
Where:
- σ = standard deviation of the metric (from historical data)
Help the PM calculate this and understand the implications: total samples needed, daily traffic available, estimated test duration.
Output: Baseline metric, MDE, p-value threshold, power, required sample size per variation, estimated test duration.
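The two sample-size formulas above can be sketched as a small helper. This is an illustrative implementation (function names are mine), using Python's standard-library `statistics.NormalDist` to derive the z-values from any alpha and power rather than hardcoding 1.96 and 0.84:

```python
import math
from statistics import NormalDist


def z_values(alpha=0.05, power=0.80):
    """Two-sided critical value and power quantile of the standard normal."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(power)


def sample_size_proportion(baseline, mde, alpha=0.05, power=0.80):
    """Per-variation sample size for a percentage metric (e.g., signup rate)."""
    z_a, z_b = z_values(alpha, power)
    p1 = baseline
    p2 = baseline * (1 + mde)  # MDE is relative to the baseline
    n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return math.ceil(n)


def sample_size_continuous(baseline, mde, sigma, alpha=0.05, power=0.80):
    """Per-variation sample size for a continuous metric (e.g., average order value)."""
    z_a, z_b = z_values(alpha, power)
    delta = mde * baseline  # absolute effect to detect
    return math.ceil((z_a + z_b) ** 2 * 2 * sigma ** 2 / delta ** 2)


# 10% baseline signup rate, 10% relative MDE -> roughly 14,700 users per variation
print(sample_size_proportion(0.10, 0.10))
```

The running example (10% baseline, 10% MDE, so 10% → 11%) lands near 14,700 users per variation, which is why including irrelevant users in the split is so costly.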
When multiple tests are ready, prioritize using Return on Time Invested (ROTI):
ROTI = (Estimated opportunity size) / (Time to build + Time to get results)
Time to get results = Sample size per variation × number of variations / daily eligible traffic
Estimated opportunity size = (Current metric value) × (Expected improvement %) × (Revenue or user impact per unit of metric) × (Time horizon)
Time to build = Engineering + design + QA time in days
Compare ROTI across tests; the higher the ROTI, the earlier it should run. Also weigh how the tests will be scheduled:
Sequential vs. simultaneous testing: Running tests simultaneously against the same control requires fewer total samples (you reuse the control group). But it introduces interaction risk — if the tests affect similar parts of the experience, they can interfere with each other. Default to simultaneous when tests are independent, sequential when they overlap.
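The ROTI comparison above can be sketched as follows. All numbers are hypothetical, and the function names are illustrative:

```python
def time_to_results(sample_size_per_variation, num_variations, daily_eligible_traffic):
    """Days needed to collect the full sample across all variations."""
    return sample_size_per_variation * num_variations / daily_eligible_traffic


def roti(opportunity_size, time_to_build_days, time_to_results_days):
    """Return on Time Invested: opportunity per day of total effort."""
    return opportunity_size / (time_to_build_days + time_to_results_days)


# Hypothetical comparison of two candidate tests
days_a = time_to_results(15000, 2, 3000)  # 10 days to reach sample size
days_b = time_to_results(8000, 2, 1000)   # 16 days

print(roti(50000, 5, days_a))   # test A: smaller opportunity, fast to build and read
print(roti(80000, 20, days_b))  # test B: bigger opportunity, slow on both counts
```

In this made-up comparison, test A wins on ROTI despite the smaller opportunity, because it is much faster to build and to read.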
The purpose of a test is to learn fast. Over-building test solutions is one of the most common mistakes.
Four questions to reduce scope:
What platforms do I need to test on? If 80% of traffic is mobile web, maybe skip the desktop build for the test. Test where the data is.
What integrations are needed? Can you mock or manually handle integrations during the test period instead of building full integrations?
What resources are needed from other teams? Can you reduce dependencies by using simpler implementations? (e.g., static content instead of dynamic, manual process instead of automated)
How can we ship the test? Can you use a feature flag, an experimentation platform, or a simple code change? Don't build production-grade infrastructure for a test.
The test solution should be good enough to accurately represent the user experience being tested, but nothing more. Polish can come after the test validates the hypothesis.
How you split users between control and solution groups is critical to data quality.
Core principle: Split users as close to the test experience as possible. Only include users who will actually encounter the change. Users who are eligible for the test but never see it are "dead weight" — they dilute your sample and extend your timeline.
Example: If you're testing a new checkout page, don't include all site visitors in your test — only users who reach checkout. This dramatically reduces the sample size needed and speeds up results.
Randomization: Users must be randomly assigned. Any systematic pattern (first 50% get control, next 50% get solution) can introduce bias. Use proper randomization tools.
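A common way to get stable, unbiased assignment is to hash the user ID with an experiment-specific salt: the same user always lands in the same group, and the hash spreads users evenly. A minimal sketch (the salt string and variant names are illustrative):

```python
import hashlib


def assign_variant(user_id: str, experiment_salt: str,
                   variants=("control", "solution")):
    """Deterministically map a user to a variant. The salted SHA-256 digest
    is effectively uniform, so users split evenly across variants, and the
    same (user, salt) pair always yields the same assignment."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Same user, same experiment -> always the same group.
# A new salt for the next experiment re-randomizes everyone.
print(assign_variant("user-123", "checkout-test-v1"))
```

Using a fresh salt per experiment also prevents carryover bias, where the same users repeatedly land in the control group across experiments.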
Common assignment challenges:
B2B products with tiers: Should you split by individual user, team, or account?
Social/viral products: When users in the solution group can influence control group users (e.g., sharing features, messaging improvements), consider:
Marketplaces with fixed supply: When one side of the marketplace seeing the solution can affect the other side's results, you may need geographic or time-based splits rather than individual randomization.
Client-side vs. server-side testing:
Three categories of launch errors to prevent:
1. Data interference — external factors that bias your results
Four types to check:
Mitigation: Map interference periods on a calendar. Launch during clean windows. If a test spans a cycle, run it for full cycle intervals (whole weeks, whole months).
2. Launch authority — who can greenlight a test launch
Default toward "beg forgiveness" over "ask permission." Centralized launch authority from senior leaders slows everything down and kills the iteration speed that makes experimentation valuable.
Better approach: peer review before launch (fellow PMs or data analysts review the test plan), decentralized launch authority (the PM or experiment owner launches), and no reverting to "ask permission" culture after a mistake — instead, fix the root cause.
3. Infrastructure errors — bugs in the test implementation
Use ramping to catch these: launch to a small % of users first (e.g., 10%), verify data is tracking correctly, then ramp to full traffic over 1-2 days.
When to ramp:
When to skip ramping:
Ramping mechanics: Create three groups — solution variation, control variation, and a "not in test" group. The "not in test" group shrinks as you ramp. Don't change the split ratio between control and solution (always 50/50 between them).
A/A tests: Use only as a diagnostic tool when you suspect infrastructure problems — not as a routine practice. Running an A/A before every test is wasteful. If you need to validate infrastructure while also running a real test, use an A/A/B design (two control groups + one solution group).
The primary rule: Run the test until you reach your calculated sample size, then evaluate. Do not call a test early based on gut feeling.
Why peeking is dangerous: Every time you check the p-value before reaching full sample size, the measured p-value understates the actual probability of a Type 1 error. Continuously monitoring and stopping as soon as p < 0.05 gives you a 70-80% chance of a false positive — not 5%.
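The inflation is easy to demonstrate with an A/A simulation: both groups have the identical true rate, yet stopping at the first peek where the z-statistic crosses 1.96 rejects far more often than 5%. The exact rates depend on the peek schedule (the sketch below peeks 10 times; continuous monitoring over a long test inflates the rate much further, toward the 70-80% figure):

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # 1.96 for a two-sided 0.05 test


def z_stat(wins_a, wins_b, n):
    """Two-proportion z statistic for equal group sizes n."""
    p_a, p_b = wins_a / n, wins_b / n
    pooled = (wins_a + wins_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return 0.0 if se == 0 else (p_a - p_b) / se


def aa_simulation(true_rate=0.10, n_per_arm=1000, peeks=10, sims=1000, seed=7):
    """Fraction of A/A tests 'won' when peeking vs. waiting for full sample."""
    rng = random.Random(seed)
    step = n_per_arm // peeks
    peek_fp = final_fp = 0
    for _ in range(sims):
        a = b = 0
        called = False
        for i in range(1, n_per_arm + 1):
            a += rng.random() < true_rate
            b += rng.random() < true_rate
            if i % step == 0 and abs(z_stat(a, b, i)) > Z_CRIT:
                called = True  # a peeker would call the test here
                break
        peek_fp += called
        # No-peeking comparison: fresh A/A draw, evaluated once at full sample
        a2 = sum(rng.random() < true_rate for _ in range(n_per_arm))
        b2 = sum(rng.random() < true_rate for _ in range(n_per_arm))
        final_fp += abs(z_stat(a2, b2, n_per_arm)) > Z_CRIT
    return peek_fp / sims, final_fp / sims


peek_rate, final_rate = aa_simulation()
print(f"false positives with peeking: {peek_rate:.1%}, without: {final_rate:.1%}")
```

Even with only 10 peeks, the peeking false positive rate lands several times above the nominal 5%, despite there being no real difference to find.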
What to monitor during the test (without calling it):
When you can statistically justify calling early — Sequential Sampling:
After every sample, calculate the absolute difference in win rate between solution and control and compare it to the threshold:
Threshold = ± 2 × √(N) / N (equivalently, ± 2 / √N)
Where N = calculated sample size per variation
If the observed difference exceeds this threshold at any point, the test can be called early with statistical validity. This must be established as the calling method before the test starts. You cannot retroactively decide to use sequential sampling.
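The sequential check from the formula above can be sketched in a few lines (function names are illustrative; the metric is assumed to be a win rate tracked after every sample):

```python
import math


def sequential_threshold(n_planned):
    """Early-calling threshold on the absolute win-rate difference:
    2 * sqrt(N) / N, which simplifies to 2 / sqrt(N)."""
    return 2 * math.sqrt(n_planned) / n_planned


def can_call_early(control_rate, solution_rate, n_planned):
    """True if the observed absolute difference exceeds the threshold."""
    return abs(solution_rate - control_rate) > sequential_threshold(n_planned)


# With N = 10,000 planned per variation, the threshold is 2/100 = 0.02,
# i.e., a 2-point absolute gap in win rate at any point calls the test.
print(sequential_threshold(10_000))
print(can_call_early(0.100, 0.125, 10_000))
```

Note how the threshold shrinks as the planned sample grows: a bigger planned test demands a smaller observed gap before an early call is justified.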
Three reasons to extend a test beyond the calculated sample size:
Calling the test: After reaching the calculated sample size, evaluate:
Turning outcomes into action — use the outcome × tradeoff matrix:
| Primary metric result | Tradeoff positive/neutral | Tradeoff negative |
|---|---|---|
| Positive impact | Implement | Implement only if primary gain outweighs tradeoff loss. Consider iterating to preserve gain while mitigating tradeoff. |
| Null impact | Check for non-outcome benefits (performance, UX quality, behavior changes). Implement only if benefit justifies cost. | Don't implement. |
| Negative impact | Don't implement. Learn why. | Don't implement. Learn why. |
For null results with a positive secondary metric: investigate before claiming a win. Can you link the improvement to the test? Is there a better way to move that secondary metric directly?
Aggregate results hide important patterns. Segment your data to find them.
Pre-test vs. post-test segmentation:
Key segments to check: traffic source, geography, visit frequency, actions taken, engagement level, demographics, platform, browser.
What to do with conflicting segments:
Watch for Simpson's Paradox: When every subgroup shows one trend, but the aggregate shows the opposite. This happens when subgroups have very different sizes across variations. Always segment — if you only look at aggregates, you can make exactly the wrong decision.
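A concrete illustration with made-up counts: the solution wins on every platform, but because its traffic skews heavily toward the lower-converting platform, the aggregate favors control:

```python
# Hypothetical (users, conversions) per platform and variation
data = {
    "control":  {"desktop": (900, 90), "mobile": (100, 2)},
    "solution": {"desktop": (100, 11), "mobile": (900, 27)},
}


def rate(users, conversions):
    return conversions / users


for platform in ("desktop", "mobile"):
    c = rate(*data["control"][platform])
    s = rate(*data["solution"][platform])
    print(f"{platform}: control {c:.1%} vs solution {s:.1%}")  # solution wins both

# Aggregate across platforms: control wins, because the solution group
# is dominated by mobile users, who convert far less on either variant
agg = {v: rate(sum(u for u, _ in d.values()), sum(c for _, c in d.values()))
       for v, d in data.items()}
print(f"aggregate: control {agg['control']:.1%} vs solution {agg['solution']:.1%}")
```

Here the solution leads 11% vs 10% on desktop and 3% vs 2% on mobile, yet trails 3.8% vs 9.2% in aggregate. Looking only at the aggregate would reject a solution that helps every segment.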
Don't stop at "it worked" or "it didn't." Run through this learning checklist for every test:
Test outcomes (did the solution work as expected?):
Customer problem (was the problem correctly defined?):
5. How should the "what" of the problem be updated?
6. How should the "who" be updated?
7. How should the "where" be updated?
8. How should the "why" be updated?
Test parameters (was the test well-designed?):
9. Did sample size and time to results match expectations? If not, why?
10. Did time to build match expectations? If not, why?
11. Does the estimated opportunity size match what we observed? If not, why?
12. Whose predictions were most/least accurate? How were their assumptions different?
Three areas to iterate on — evaluate bottom-up (problem first, not test design first):
1. Customer problem (check first): Do the learnings suggest the problem was mis-defined? Is the "who," "what," "where," or "why" wrong or incomplete? If yes → redefine the problem, generate new solutions, redesign the test.
2. Solution (check second): Is the problem still valid, but the solution approach is wrong? If yes → generate alternative solutions for the same problem, redesign the test.
3. Test design (check last): Is the solution approach sound, but the test implementation had issues? Design flaws, infrastructure bugs, UX problems in the test version? If yes → fix the implementation and retest.
Most teams default to top-down iteration (fix the test design first) because it's fastest. This is a trap — it leads to repeatedly tweaking implementations while the underlying problem definition remains wrong.
Iterating on wins (don't skip this):
Produce the document in the user's chosen format (default: docx). If docx, use the docx skill. If pdf, use the pdf skill. If md, write as markdown. Use the following structure:
# Experiment: [Name]
Date: [date]
Author: [PM name]
Strategic opportunity: [link to experiment-strategy doc]
Status: [Design / Live / Complete / Iterating]
## Hypothesis
If we [change], then [who] will [behavior], resulting in [metric improvement] because [rationale].
## Test Parameters
- Baseline metric: [value] ([metric name])
- MDE: [%]
- P-value threshold: [0.05]
- Statistical power: [0.80]
- Sample size per variation: [calculated]
- Number of variations: [2 or more]
- Estimated daily eligible traffic: [number]
- Estimated test duration: [days]
- ROTI: [calculated]
## Metrics
- **Primary**: [metric — this determines win/loss/null]
- **Secondary**: [metrics — additional positive signals]
- **Tradeoff**: [metrics — watch for negative impact]
- **Leading indicators**: [metrics — early signals]
## User Assignment
- Assignment level: [individual / team / account / geographic]
- Split: [50/50 / other]
- Eligibility criteria: [which users are included/excluded]
- Client-side or server-side: [choice + rationale]
## Launch Plan
- Ramp schedule: [immediate 100% / 10% → 50% → 100% over X days]
- Interference check: [seasonal, promotional, competitive factors]
- Anti-metrics to monitor during ramp: [metrics + thresholds for pausing]
- Calling method: [primary rule / sequential sampling]
## Build Plan
- Engineering effort: [days]
- Design effort: [days]
- Dependencies: [teams, tools, approvals]
- Scope cuts for test: [what's excluded from the test version]
## Pre-Mortem
| Risk | Preparation | Mitigation if triggered |
|------|------------|----------------------|
| [risk 1] | [prep] | [mitigation] |
| [risk 2] | [prep] | [mitigation] |
## Results (fill after test)
- Total samples: [control / solution]
- Duration: [days]
- Primary metric: [control value] → [solution value] (p = [value])
- Secondary metrics: [results]
- Tradeoff metrics: [results]
- Outcome: [Positive / Negative / Null]
## Segment Analysis
| Segment | Control | Solution | Difference | Significant? |
|---------|---------|----------|------------|-------------|
| [segment] | [value] | [value] | [%] | [yes/no] |
## Learning Checklist
[Answers to the 12 questions from Step 9]
## Decision
- **Action**: [Implement / Don't implement / Iterate]
- **Rationale**: [why]
- **Iteration plan**: [if iterating — what changes and why]
- **Next experiment**: [what follows from this learning]
No baseline metric: Testing without knowing the current performance. You can't calculate sample size, MDE, or interpret results without a baseline.
MDE too small: Setting a 1% MDE "to detect any improvement" requires enormous sample sizes. Be realistic about what improvement justifies implementation.
Peeking and stopping early: The #1 statistical sin. If you continuously monitor p-value and stop at 0.05, your actual false positive rate is 70-80%. Use the primary rule or establish sequential sampling upfront.
Including irrelevant users: Every user in the test who never sees the change is noise that extends your timeline. Split as close to the experience as possible.
No tradeoff metrics: A positive primary result that destroys a tradeoff metric is worse than a null result. Always define what could go wrong.
Top-down iteration: After a loss, defaulting to "let's tweak the design" instead of questioning whether the customer problem was wrong. Always check problem → solution → test design, in that order.
Ignoring segment differences: Aggregate results that hide opposite effects in different user groups (Simpson's Paradox). Always segment.
Not learning from wins: Declaring victory and moving on without extracting learnings, refining the solution, or identifying the next bottleneck.
Not documenting: If the test isn't documented, the learning doesn't persist. Future teams will repeat the same experiments.
The value of an experiment is not in its outcome — it's in the learning. A well-designed experiment that produces a null result teaches you something real about your users. A poorly designed experiment that produces a "win" teaches you nothing you can trust. Invest in the rigor of the design, and the learning will follow.