Scientifically validate Desk changes that affect Reeve's behavior. Use BEFORE committing non-trivial changes to CLAUDE.md, skills, Goals/, or Responsibilities/. Runs isolated test pulses with positive/negative cases to verify the change produces desired behavior under realistic conditions.
Scientific validation for any Desk change that affects Reeve's behavior.
"I want Reeve to behave differently. How do I know it actually will?"
DO use when:
DON'T use for:
| Skill | When to Use |
|---|
/context-engineering | Before designing changes - understand where information should live |
/session-analyzer | After tests complete - analyze session metrics, tool usage, feedback signals |
Key Insight: Behavior must be:
CRITICAL: Ensure no pulses are active before testing
list_upcoming_pulses(limit=10)
If pulses exist:
cancel_pulse(pulse_id=X)Why: Active pulses may modify Desk, contaminating results.
git stash # If uncommitted work
BASELINE=$(git log -1 --format="%H")
echo "$BASELINE" > /tmp/behavior-validation-baseline.txt
echo "Baseline: $BASELINE"
Articulate before writing tests:
**Change**: [What am I modifying?]
**Desired Behavior**: [What should happen?]
**Trigger Conditions**: [When should it activate?]
**Boundary Conditions**: [When should it NOT activate?]
**Observable Evidence**: [How do I verify it worked?]
Requirements:
Hard Test Design:
| Test Type | Design Goal |
|---|---|
| Hard Positive | Should trigger despite: no explicit keywords, casual tone, mixed content |
| Hard Negative | Should NOT trigger despite: action words, urgency language, false patterns |
CRITICAL: No Leakage Rule
Test prompts must NOT:
Only instruction: "RESPOND TO THIS MESSAGE NATURALLY."
5a. Apply change:
# Make changes, commit so test pulses see them
git add -A && git commit -m "VALIDATION: [description]"
5b. Schedule staggered pulses:
schedule_pulse(scheduled_at="now", prompt=test_1, tags=["validation", "pos_1", "run_1"])
schedule_pulse(scheduled_at="in 3 minutes", prompt=test_1, tags=["validation", "pos_1", "run_2"])
schedule_pulse(scheduled_at="in 6 minutes", prompt=test_2, tags=["validation", "neg_1", "run_1"])
5c. Analyze with subagents (protect context):
Use /session-analyzer patterns to check what the test pulse actually did:
Task(
subagent_type="Explore",
model="haiku",
prompt="""Analyze validation test {test_id}, run {run}.
Expected: {evidence}
Check file changes:
- Tasks/Open.md, Knowledge/Diary/, Responsibilities/
Use session-analyzer to check:
- Session JSONL for tool_use calls (Write, Edit)
- Feedback signals in user messages
Report: PASS or FAIL with evidence."""
)
5d. Fail-fast on failure:
cancel_pulse(pulse_id=X)5e. Success criteria:
ALWAYS reset, regardless of outcome:
BASELINE=$(cat /tmp/behavior-validation-baseline.txt)
git reset --hard $BASELINE
git status
rm /tmp/behavior-validation-baseline.txt
If tests passed → Re-apply with proper commit message
test_id: "pos_implied_action"