Turn product hunches into rigorous experiments — write testable hypotheses, define success metrics, estimate sample sizes, and establish stopping rules to reduce bias and accelerate learning. Trigger on: product experiment, A/B test, hypothesis, experiment design, test this idea, how do we validate, sample size, success metric.
Product teams constantly face decisions shaped by anecdotes, debates, and intuition. This skill turns those moments (a customer call where three users mentioned friction, a sprint review where the team split on a feature direction, a support-ticket spike on a specific flow) into structured experiments. By grounding hypotheses in observation, defining what "success" looks like before launch, and sizing the experiment for statistical confidence up front, teams reduce the risk of shipping changes that look good in meetings but flop in production. The skill mines conversation data from customer calls, sprint discussions, and team reviews to surface experiment candidates, then guides you through a rigorous design process so you ship with confidence.
1. Identify the trigger from recent conversations
2. Write a testable hypothesis in If/Then/Because format
3. Define three levels of metrics
4. Estimate sample size and duration
5. Score and prioritize using ICE
6. Set a decision rule upfront
7. Create a launch checklist
8. Document statistical guardrails
9. Get alignment before launch
10. Launch, monitor, and interpret
# Experiment: [Experiment Name]
## Hypothesis
- **If**: [Specific change]
- **Then**: [Expected outcome]
- **Because**: [User behavior or mechanism]
## Metrics
### Primary Metric
- **Name**: [e.g., Time-to-Export (seconds)]
- **Baseline**: [Current value; e.g., 47 seconds]
- **Target Lift**: [e.g., 25% reduction → 35 seconds]
- **How to Measure**: [e.g., time elapsed from button click to download complete]
### Guardrail Metrics
1. **[Metric Name]**: Baseline [X], must not drop below [acceptable threshold]
2. **[Metric Name]**: Baseline [X], must not drop below [acceptable threshold]
### Secondary Metrics
1. **[Metric Name]**: [What this tells us]
2. **[Metric Name]**: [What this tells us]
## Sample Size & Duration
- **Daily Active Users in Experiment Group**: [e.g., 500 at 10% rollout]
- **Baseline Primary Metric Rate**: [e.g., 68% completion]
- **Target Effect Size**: [e.g., 25% improvement → 85% completion]
- **Statistical Power**: 80% (probability of detecting true effect if it exists)
- **Significance Level (α)**: 0.05 (5% false-positive risk)
- **Estimated Runtime**: [e.g., 14 days at 10% rollout, 7 days at 25% rollout]
- **Key Assumptions**: [e.g., traffic stable, no seasonal events during experiment window]
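Sample-size and runtime estimates like the ones above can be sanity-checked with a standard two-proportion power calculation. A minimal sketch using only the Python standard library; the 68% baseline, 85% target, and 500-user traffic figures are this template's placeholder values, not real data:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_baseline, p_target, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided test comparing two
    proportions (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for α = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84 for 80% power
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Template placeholders: 68% baseline completion, 85% target,
# 500 daily users in the experiment group split evenly across two arms.
n = sample_size_per_arm(0.68, 0.85)
days = ceil(n / (500 / 2))  # days needed to fill each arm at this traffic
print(n, days)              # 94 users per arm, reached in 1 day
```

Note that a 17-point lift is a very large effect, so the math yields a short runtime; in practice teams usually run at least one full weekly cycle anyway to average out day-of-week effects.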
## Prioritization (ICE Score)
- **Impact**: [1–10] — [Reasoning]
- **Confidence**: [1–10] — [Reasoning]
- **Ease**: [1–10] — [Reasoning]
- **ICE Score**: [Impact] × [Confidence] × [Ease] = **[Score]**
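The ICE score is simply the product of the three 1–10 ratings, so higher scores surface the experiments to run first. A quick sketch with illustrative (made-up) ratings:

```python
def ice_score(impact, confidence, ease):
    """ICE score: product of three 1-10 ratings; higher = prioritize sooner."""
    for rating in (impact, confidence, ease):
        assert 1 <= rating <= 10, "each rating must be on a 1-10 scale"
    return impact * confidence * ease

# Hypothetical ratings for an export-button experiment:
print(ice_score(impact=7, confidence=6, ease=8))  # 336 out of a possible 1000
```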
## Decision Rules
- **Primary Success**: If [metric] lifts [X]% with p < 0.05, then [action: roll out to 100%, extend the experiment, or iterate]
- **Primary Failure**: If [metric] drops [X]% or stays flat after [duration], then [action: revert or iterate]
- **Guardrail Trigger**: If [guardrail metric] declines [X]% at any point, [action: stop immediately and revert]
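The p < 0.05 check in the primary-success rule corresponds to a two-proportion z-test when the primary metric is a rate. A minimal sketch, with made-up counts for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test p-value for a difference in proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: control 340/500 completions vs. treatment 400/500.
p = two_proportion_p_value(340, 500, 400, 500)
print(p < 0.05)  # True → primary-success rule fires
```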
## Launch Checklist
- [ ] Logging in place for all metrics; validated in staging
- [ ] Dashboard created and shared with team (real-time updates)
- [ ] Customer success & support briefed on experiment intent
- [ ] Rollout plan defined (start at 5–10%, ramp if healthy)
- [ ] Decision-maker identified; communication cadence set
- [ ] Monitoring alert configured for guardrail metrics
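The guardrail alert in the checklist can be as simple as comparing each guardrail metric's current value against its baseline threshold. A minimal sketch with hypothetical numbers:

```python
def guardrail_breached(baseline, current, max_decline=0.05):
    """True if the metric fell from baseline by more than the allowed
    fraction (default 5%), i.e. the stop-and-revert trigger fired."""
    return (baseline - current) / baseline > max_decline

# Hypothetical guardrail: 7-day retention, baseline 92%, currently 85%.
print(guardrail_breached(0.92, 0.85))  # True → stop immediately and revert
```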
## Context & Motivation
- **Trigger**: [Where did this idea come from? Customer calls, support spike, sprint debate, past data?]
- **Expected Impact**: [Why this matters to the business or users]
- **Risk**: [What could go wrong? Unintended side effects?]
## Notes
[Any additional context, alternative hypotheses considered, or open questions]
Context: Headwind (a task management tool for remote teams) noticed that export feature adoption was low. In customer calls this month, two enterprise clients and one mid-market prospect mentioned the export button was hard to find. The product team debated whether moving it would help or whether the real problem was missing use cases.