When to use: Before launching any experiment, when metrics feel unreliable, or when experiment results are confusing
Framework source: Aakash Gupta's "How to Choose the Right Metrics to Evaluate Experiments"
The STEDII Framework
Choose experiment metrics that are:
Sensitive
Timely
Efficient
Debuggable
Interpretable
Isolated
1. Sensitive (Detects Small But Meaningful Changes)
What it means: The metric moves when your feature actually improves the experience
Bad example:
Metric: Monthly Active Users (MAU)
Problem: Too coarse. A good onboarding improvement might not move MAU for months.
Good example:
Metric: Day 7 activation rate
Why: Sensitive enough to detect onboarding improvements within a week
How to check:
Ask: "If this experiment succeeds, will this metric move within the experiment window?"
Common mistake: Using metrics that are too aggregated (MAU, total revenue) when you need something more granular (daily activation, conversion rate by cohort).
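As a rough illustration of the granular alternative, here is a minimal sketch of computing a day 7 activation rate per signup cohort. The event-log shape and the numbers are hypothetical, not from the source:

```python
from datetime import date, timedelta

# Hypothetical event log: user_id -> (signup_date, set of dates with a key action)
events = {
    "u1": (date(2024, 1, 1), {date(2024, 1, 3)}),
    "u2": (date(2024, 1, 1), set()),
    "u3": (date(2024, 1, 2), {date(2024, 1, 8)}),
}

def day7_activation_rate(events):
    """Share of users who performed a key action within 7 days of signup."""
    activated = sum(
        1 for signup, actions in events.values()
        if any(d <= signup + timedelta(days=7) for d in actions)
    )
    return activated / len(events)

print(day7_activation_rate(events))  # 2 of 3 users in this toy cohort activated
```

Because the metric is computed per cohort over a 7-day window, an onboarding change shows up within the experiment window instead of being averaged away in MAU.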
2. Timely (Results Available Quickly)
What it means: You get signal fast enough to make decisions
Bad example:
Metric: 90-day retention
Problem: Takes 90 days to know if your experiment worked
Good example:
Metric: Day 7 retention + leading indicators
Why: Faster feedback, correlates with long-term retention
Tradeoff alert: Sometimes you NEED slow metrics (LTV, annual retention). In those cases:
Use leading indicators to get fast signal
Run smaller experiments to validate
Accept longer experiment duration for critical decisions
How to check:
Ask: "Can I get actionable results within [1 week / 2 weeks / 1 month]?"
3. Efficient (High Statistical Power)
What it means: You can detect the effect with reasonable sample size and time
Bad example:
Metric: Revenue per user
Problem: High variance, need massive sample sizes
Good example:
Metric: Conversion rate
Why: Lower variance, reaches significance faster
Statistical power explained:
Power = ability to detect a real effect
Higher variance metrics = lower power = longer experiments
What I look for: Variance of potential metrics, time-to-signal data
How I use it: Validate metrics are Sensitive and Timely with real data
Example: "Metric X has 12% variance historically, so needs N=5000 sample size"
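The variance-to-sample-size link above can be made concrete with a back-of-envelope per-arm estimate for a two-sample test under a normal approximation. The standard deviation and minimum detectable effect below are placeholders, not values from the source:

```python
from statistics import NormalDist

def sample_size_per_arm(sd, mde, alpha=0.05, power=0.8):
    """Normal-approximation estimate: n = 2 * (z_{1-a/2} + z_{power})^2 * sd^2 / mde^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mde ** 2

# Placeholder: a ~30% conversion rate (sd ~0.46) and a 2-percentage-point lift
print(round(sample_size_per_arm(sd=0.46, mde=0.02)))
```

The sd term is squared, which is why a high-variance metric like revenue per user needs far larger samples than a bounded conversion rate to reach the same power.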
3. Check for Metric Conflicts with Guardrails
Source: context-library/metrics/, company guardrails
What I look for: Metrics that must not decline, company KPIs
How I use it: Ensure secondary metrics include guardrails
Example: "NPS is a company guardrail, must include in secondary metrics"
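Verifying guardrail coverage is a simple set check. A sketch, with hypothetical guardrail and metric names:

```python
# Placeholder names for illustration only
company_guardrails = {"NPS", "page_load_time", "support_ticket_rate"}
secondary_metrics = {"NPS", "day7_retention"}

missing = company_guardrails - secondary_metrics
if missing:
    print(f"Guardrails missing from secondary metrics: {sorted(missing)}")
```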
4. Reference Past Experiments for Benchmarks
Source: context-library/metrics/, A/B test results
What I look for: What worked in past experiments, surprising metric learnings
How I use it: Suggest metrics that detected real impacts before
Example: "In past experiments, page load time scored poorly on Sensitivity; don't use it"
5. Route to Experiment Decision Framework
Source: Connection to /experiment-decision skill
What I look for: Is testing even the right call?
How I use it: If the change should ship without testing, auto-flag this before selecting metrics
Example: "CSS changes are reversible and don't need the full STEDII analysis"
Output Quality Self-Check
Before presenting output to the PM, verify:
Context was checked: Reviewed context-library/metrics/ for existing experiments and baselines, and context-library/prds/ for pre-defined success metrics
Each metric evaluated against all 6 STEDII dimensions: Every candidate metric has a score (0-3) for Sensitive, Timely, Efficient, Debuggable, Interpretable, and Isolated, with reasoning for each score
Sample size requirements calculated: The output includes a minimum sample size estimate for the primary metric based on expected effect size and variance
Metric sensitivity analysis included: The output states whether the expected change is detectable given current traffic, variance, and experiment duration
Guardrail metrics identified: At least 3 guardrail metrics are defined with acceptable ranges to prevent unintended harm
No vanity metrics without justification: If any metric could be considered a vanity metric (e.g., page views, total signups), the output explains why it is valid for this specific experiment