AI-first usability testing: rapid iteration, Maze AI, heuristic evaluation, and A/B testing for AI features.
Core loop: test real prototypes immediately, let AI assist the analysis, and fix issues the same day.
See also: agents/ux-research/SKILL.md, agents/analytics/SKILL.md
Before testing, decide where your study sits on each dimension:

| Dimension | Spectrum |
|---|---|
| Speed | Rapid (hours) ←→ Rigorous (weeks) |
| Depth | Surface heuristics ←→ Deep qualitative |
| Scale | Few participants ←→ Statistical sample |
| Assistance | AI-automated ←→ Human-moderated |
| Fidelity | Paper prototype ←→ Production code |

| If Context Is... | Then Consider... |
|---|---|
| Quick validation needed | Unmoderated test in Maze, 5 users |
| Deep insights needed | Moderated sessions with follow-ups |
| No prototype yet | Heuristic evaluation + Attention Insight |
| High traffic product | A/B test with statistical significance |
| Testing AI features | Include output variability + trust metrics |
| Same-day iteration | Morning build → afternoon test → evening fix |

| Method | Speed | Depth | AI Assist |
|---|---|---|---|
| Unmoderated Testing | Fast | Medium | High |
| Moderated Testing | Slow | Deep | Medium |
| Heuristic Eval | Very Fast | Surface | High |
| A/B Testing | Medium | Quantitative | High |

| Traditional | AI-First 2025 |
|---|---|
| Plan extensively | Build → Test immediately |
| Recruit for weeks | Test within hours |
| Analyze manually | AI summarizes findings |
| Report after weeks | Fix same day |
| Research team only | Whole team participates |
AI handles:
✅ Real-time transcription
✅ Automatic summarization
✅ Friction pattern detection
✅ Fix suggestions
Humans focus on:
✅ Strategy and interpretation
✅ Nuanced insights
✅ Product decisions
✅ Stakeholder communication

Remote testing platforms:

| Tool | AI Features | Best For | Price |
|---|---|---|---|
| Maze AI | Auto-summarize, friction detection | Quick prototype tests | $99+/mo |
| UserTesting | AI Insights, video highlights | Enterprise | $$$$ |
| PlaybookUX | Transcription, sentiment | Mixed methods | $99/mo |
| Sprig | In-product surveys, replays | Live product | $175+/mo |

Heuristic evaluation tools:

| Tool | AI Features | Best For |
|---|---|---|
| Attention Insight | Heatmap prediction | Visual design |
| UXtweak | Behavioral ML | Live site analysis |
| Claude/ChatGPT | Heuristic review | Quick checks |

A/B testing tools:

| Tool | AI Features | Best For |
|---|---|---|
| PostHog | Session replay + analytics | Product teams |
| Statsig | Auto-analysis | Feature flags |
| Amplitude | AI insights | Enterprise |
| Optimizely | Full experimentation | Enterprise |

Unmoderated testing:

✅ Good for:
- Task completion testing
- Navigation and findability
- First-click testing
- Preference testing
- Quick feedback on prototypes
❌ Not for:
- Complex emotional insights
- "Why" questions
- Exploratory research

Workflow:

1. Upload prototype (Figma, URL, or images)
2. Define tasks and success paths
3. Write clear task prompts
4. Set completion criteria
5. Launch (results in 1-24 hours)
6. AI summarizes findings
7. Fix issues same day
8. Retest
Task Prompt Template:
```text
You are booking a flight for next Friday.
Starting from the homepage, find a flight from
New York to Los Angeles and complete the booking.
When you're done, click "Task Complete."
```

Key metrics:

| Metric | What It Means |
|---|---|
| Task Completion | % who finished successfully |
| Time on Task | How long it took |
| Misclick Rate | Wrong clicks before right one |
| Drop-off Points | Where users gave up |
| Direct Success | First-try success rate |
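
The metrics above can be computed directly from per-participant results. A minimal sketch, assuming a hypothetical `SessionResult` shape (real testing-tool exports differ):

```typescript
// Hypothetical per-participant result; actual export formats vary by tool.
interface SessionResult {
  completed: boolean      // reached the goal at all
  directSuccess: boolean  // reached it on the expected path, first try
  seconds: number         // time on task
  misclicks: number       // clicks outside the expected path
  totalClicks: number
}

function summarizeTask(results: SessionResult[]) {
  const n = results.length
  const pct = (k: number) => Math.round((k / n) * 100)
  const clicks = results.reduce((s, r) => s + r.totalClicks, 0)
  const misses = results.reduce((s, r) => s + r.misclicks, 0)
  return {
    taskCompletion: pct(results.filter(r => r.completed).length),    // %
    directSuccess: pct(results.filter(r => r.directSuccess).length), // %
    avgTimeOnTask: results.reduce((s, r) => s + r.seconds, 0) / n,   // seconds
    misclickRate: Math.round((misses / clicks) * 100),               // %
  }
}
```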

Moderated testing:

✅ Good for:
- Complex flows
- Sensitive topics
- Need follow-up questions
- Exploratory research
- Early concept testing

AI assistance at each stage of a session:

| Before | During | After |
|---|---|---|
| Generate discussion guide | Real-time transcription | Auto-summarize |
| Screen participant | AI note suggestions | Theme extraction |
| Schedule sessions | Auto-tagging moments | Highlight reel |

Discussion guide template:

## Intro (2 min)
- Thanks for joining
- Recording notice + consent
- No right/wrong answers
- Think aloud
## Warm-up (3 min)
- Tell me about your experience with [domain]
- How do you currently [solve problem]?
## Tasks (20-30 min)
- Task 1: [Scenario and goal]
- Probing: What did you expect to happen?
- Follow-up: What would make this easier?
- Task 2: [Scenario and goal]
- ...
## Wrap-up (5 min)
- Overall impression?
- Anything confusing?
- What would you change?
- Any questions for me?

Nielsen's 10 usability heuristics:

| # | Heuristic | Question to Ask |
|---|---|---|
| 1 | Visibility of status | Does user know what's happening? |
| 2 | Match real world | Does language match expectations? |
| 3 | User control | Can user undo/escape? |
| 4 | Consistency | Same patterns throughout? |
| 5 | Error prevention | Are mistakes prevented? |
| 6 | Recognition over recall | Is everything visible/discoverable? |
| 7 | Flexibility | Shortcuts for experts? |
| 8 | Aesthetic minimalism | Is it cluttered? |
| 9 | Error recovery | Clear error messages? |
| 10 | Help | Documentation available? |

AI-assisted heuristic evaluation workflow:

1. Screenshot prototype
2. Run through Attention Insight (heatmap prediction)
3. Identify attention gaps
4. Ask Claude for heuristic review:
"Review this UI screenshot against Nielsen's 10
usability heuristics. For each issue found, rate
severity (1-4) and suggest a fix."
5. Prioritize fixes by severity
6. Test with real users to validate
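
Step 5, prioritizing by severity, is a one-liner once the AI's findings are structured. The `Issue` shape below is an assumption for illustration, not a defined Claude output format:

```typescript
// Hypothetical structured finding from the AI heuristic review
interface Issue {
  heuristic: string
  severity: 1 | 2 | 3 | 4  // Nielsen scale: 1 = cosmetic, 4 = catastrophic
  fix: string
}

// Fix severity 3-4 before retesting; queue severity 1-2 for the backlog
function triage(issues: Issue[]) {
  const sorted = [...issues].sort((a, b) => b.severity - a.severity)
  return {
    fixNow: sorted.filter(i => i.severity >= 3),
    backlog: sorted.filter(i => i.severity <= 2),
  }
}
```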

A/B testing:

✅ Good for:
- Quantitative validation
- High-traffic features
- Before major launches
- Comparing two clear options
❌ Not for:
- Low traffic (no statistical power)
- Radically different concepts
- When you need "why" insights
```tsx
// Feature flag setup
import posthog from 'posthog-js'
import { usePostHog } from 'posthog-js/react'

function CheckoutButton() {
  const client = usePostHog()

  // Variant is undefined until flags finish loading
  const variant = client.getFeatureFlag('checkout-button-experiment')

  if (variant === 'treatment') {
    return <button className="bg-green-500">Complete Purchase →</button>
  }
  // Control, and the not-yet-loaded state, fall back to the original button
  return <button>Checkout</button>
}

// Track conversion (pass the cart in rather than reading a global)
function handlePurchase(cart) {
  posthog.capture('purchase_completed', {
    revenue: cart.total,
    items: cart.items.length,
  })
}
```
Sample size rule of thumb:

Minimum sample size per variant ≈ 16 × p × (1 − p) / MDE²

Where:
- p = baseline conversion rate
- MDE = minimum detectable effect (absolute)
- The factor of 16 assumes a two-sided test at roughly 95% confidence and 80% power

Example:
- 5% baseline conversion
- Want to detect a 10% relative improvement (0.5% absolute)
- Need ≈ 30,400 per variant (16 × 0.05 × 0.95 / 0.005²)
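
The rule of thumb translates directly to code (the factor of 16 bakes in a two-sided test at roughly 95% confidence and 80% power):

```typescript
// Rule-of-thumb minimum sample size per variant: 16 * p * (1 - p) / MDE^2
// p = baseline conversion rate, mde = minimum detectable effect (absolute)
function minSamplePerVariant(p: number, mde: number): number {
  return Math.ceil((16 * p * (1 - p)) / (mde * mde))
}

// 5% baseline, detect a 0.5%-absolute (10% relative) lift:
minSamplePerVariant(0.05, 0.005) // ≈ 30,400 per variant
```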

Testing AI features:

| Challenge | How to Test |
|---|---|
| Output variability | Test same prompt multiple times |
| Hallucinations | Verify claims, show sources |
| Speed | Measure time to first response |
| Trust | Ask trust/confidence ratings |
| Error recovery | Test when AI fails |
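
"Test same prompt multiple times" can be automated. A sketch, where `generate` is a stand-in for whatever model call your product actually makes (no specific AI API is assumed):

```typescript
// Run the same prompt several times and score how often outputs differ.
// Returns 0 for a fully deterministic feature, 1 if every run differed.
async function outputVariability(
  generate: (prompt: string) => Promise<string>,
  prompt: string,
  runs = 5,
): Promise<number> {
  const outputs: string[] = []
  for (let i = 0; i < runs; i++) outputs.push(await generate(prompt))
  return (new Set(outputs).size - 1) / (runs - 1)
}
```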
```ts
interface AIFeatureMetrics {
  // Task success
  taskCompletion: number     // % who achieved goal
  timeToComplete: number     // Seconds
  aiAssistUsage: number      // % who used the AI feature

  // Quality
  outputAccuracy: number     // % correct outputs
  userSatisfaction: number   // 1-5 rating
  trustScore: number         // "I trust this AI", 1-5

  // Error handling
  errorRate: number          // % of sessions that hit errors
  errorRecoveryRate: number  // % who recovered after an error

  // Comparison
  aiVsManual: {
    speed: number            // AI time / manual time
    accuracy: number         // AI accuracy / manual accuracy
  }
}
```
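
The `aiVsManual` ratios come from running the same task with and without the AI feature and timing both. A minimal sketch:

```typescript
// Derive the aiVsManual comparison from two measured runs of the same task.
// speed < 1 means the AI path was faster; accuracy > 1 means it was more accurate.
function compareToManual(
  ai: { seconds: number; accuracy: number },
  manual: { seconds: number; accuracy: number },
) {
  return {
    speed: ai.seconds / manual.seconds,
    accuracy: ai.accuracy / manual.accuracy,
  }
}
```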

Test scenarios for AI features:

1. Happy Path
   - AI gives the correct answer on the first try
   - User achieves the goal quickly
2. Partial Success
   - AI gives a mostly correct answer
   - User needs to edit or refine it
3. AI Failure
   - AI gives a wrong answer, or none at all
   - Can the user recover?
4. Edge Cases
   - Unusual inputs
   - Very long or very short prompts
   - Ambiguous requests
Morning: