Turn product hunches into rigorous experiments — write testable hypotheses, define success metrics, estimate sample sizes, and establish stopping rules to reduce bias and accelerate learning. Trigger on: product experiment, A/B test, hypothesis, experiment design, test this idea, how do we validate, sample size, success metric.
Product teams constantly face decisions shaped by anecdotes, debates, and intuition. This skill turns those moments (a customer call where three users mentioned friction, a sprint review where the team split on a feature direction, a support-ticket spike on a specific flow) into structured experiments. By grounding hypotheses in observation, defining what "success" looks like before launch, and sizing the experiment for statistical confidence up front, teams reduce the risk of shipping changes that look good in meetings but flop in production. The skill mines conversation data from customer calls, sprint discussions, and team reviews to surface experiment candidates, then guides you through a rigorous design process so you ship with confidence.
1. Identify the trigger from recent conversations
2. Write a testable hypothesis in If/Then/Because format
3. Define three levels of metrics
4. Estimate sample size and duration
5. Score and prioritize using ICE
6. Set a decision rule upfront
7. Create a launch checklist
8. Document statistical guardrails
9. Get alignment before launch
10. Launch, monitor, and interpret
# Experiment: [Experiment Name]
## Hypothesis
- **If**: [Specific change]
- **Then**: [Expected outcome]
- **Because**: [User behavior or mechanism]
## Metrics
### Primary Metric
- **Name**: [e.g., Time-to-Export (seconds)]
- **Baseline**: [Current value; e.g., 47 seconds]
- **Target Lift**: [e.g., 25% reduction → 35 seconds]
- **How to Measure**: [e.g., time elapsed from button click to download complete]
### Guardrail Metrics
1. **[Metric Name]**: Baseline [X], must not drop below [acceptable threshold]
2. **[Metric Name]**: Baseline [X], must not drop below [acceptable threshold]
### Secondary Metrics
1. **[Metric Name]**: [What this tells us]
2. **[Metric Name]**: [What this tells us]
## Sample Size & Duration
- **Daily Active Users in Experiment Group**: [e.g., 500 at 10% rollout]
- **Baseline Primary Metric Rate**: [e.g., 68% completion]
- **Target Effect Size**: [e.g., 25% improvement → 85% completion]
- **Statistical Power**: 80% (probability of detecting true effect if it exists)
- **Significance Level (α)**: 0.05 (5% false-positive risk)
- **Estimated Runtime**: [e.g., 14 days at 10% rollout, 7 days at 25% rollout]
- **Key Assumptions**: [e.g., traffic stable, no seasonal events during experiment window]
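Sample-size and runtime estimates like the ones above can be sanity-checked with a standard two-proportion power calculation. A minimal sketch using only the Python standard library; the 68% baseline, 85% target, and 500-user traffic figures are this template's placeholder values, not real data:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_baseline, p_target, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided test comparing two
    proportions (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for α = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84 for 80% power
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Template placeholders: 68% baseline completion, 85% target,
# 500 daily users in the experiment group split evenly across two arms.
n = sample_size_per_arm(0.68, 0.85)
days = ceil(n / (500 / 2))  # days needed to fill each arm at this traffic
print(n, days)              # 94 users per arm, reached in 1 day
```

Note that a 17-point lift is a very large effect, so the math yields a short runtime; in practice teams usually run at least one full weekly cycle anyway to average out day-of-week effects.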
## Prioritization (ICE Score)
- **Impact**: [1–10] — [Reasoning]
- **Confidence**: [1–10] — [Reasoning]
- **Ease**: [1–10] — [Reasoning]
- **ICE Score**: [Impact] × [Confidence] × [Ease] = **[Score]**
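The ICE score is simply the product of the three 1–10 ratings, so higher scores surface the experiments to run first. A quick sketch with illustrative (made-up) ratings:

```python
def ice_score(impact, confidence, ease):
    """ICE score: product of three 1-10 ratings; higher = prioritize sooner."""
    for rating in (impact, confidence, ease):
        assert 1 <= rating <= 10, "each rating must be on a 1-10 scale"
    return impact * confidence * ease

# Hypothetical ratings for an export-button experiment:
print(ice_score(impact=7, confidence=6, ease=8))  # 336 out of a possible 1000
```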
## Decision Rules
- **Primary Success**: If [metric] lifts [X]% with p < 0.05, then [action: roll out to 100%, extend the experiment, or iterate]
- **Primary Failure**: If [metric] drops [X]% or stays flat after [duration], then [action: revert or iterate]
- **Guardrail Trigger**: If [guardrail metric] declines [X]% at any point, [action: stop immediately and revert]
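The p < 0.05 check in the primary-success rule corresponds to a two-proportion z-test when the primary metric is a rate. A minimal sketch, with made-up counts for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test p-value for a difference in proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: control 340/500 completions vs. treatment 400/500.
p = two_proportion_p_value(340, 500, 400, 500)
print(p < 0.05)  # True → primary-success rule fires
```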
## Launch Checklist
- [ ] Logging in place for all metrics; validated in staging
- [ ] Dashboard created and shared with team (real-time updates)
- [ ] Customer success & support briefed on experiment intent
- [ ] Rollout plan defined (start at 5–10%, ramp if healthy)
- [ ] Decision-maker identified; communication cadence set
- [ ] Monitoring alert configured for guardrail metrics
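The guardrail alert in the checklist can be as simple as comparing each guardrail metric's current value against its baseline threshold. A minimal sketch with hypothetical numbers:

```python
def guardrail_breached(baseline, current, max_decline=0.05):
    """True if the metric fell from baseline by more than the allowed
    fraction (default 5%), i.e. the stop-and-revert trigger fired."""
    return (baseline - current) / baseline > max_decline

# Hypothetical guardrail: 7-day retention, baseline 92%, currently 85%.
print(guardrail_breached(0.92, 0.85))  # True → stop immediately and revert
```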
## Context & Motivation
- **Trigger**: [Where did this idea come from? Customer calls, support spike, sprint debate, past data?]
- **Expected Impact**: [Why this matters to the business or users]
- **Risk**: [What could go wrong? Unintended side effects?]
## Notes
[Any additional context, alternative hypotheses considered, or open questions]
Context: Headwind (a task management tool for remote teams) noticed that export feature adoption was low. In customer calls this month, two enterprise clients and one mid-market prospect mentioned the export button was hard to find. The product team debated whether moving it would help or whether the real problem was missing use cases.