Guides systematic root-cause investigation. Use any time something isn't working and you're about to change code — whether it looks like a bug, a test failure, or just something that needs a quick fix. Load before investigating or changing anything.
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed the Root Cause Investigation step, you cannot propose fixes. Complete each phase before proceeding to the next.
| Phase | Key Activities | Output |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |
Use for ANY technical issue: test failures, bugs, unexpected behavior, performance problems, build failures, integration issues.
Use ESPECIALLY when:
BEFORE attempting ANY fix:
Read Error Messages Carefully
Reproduce Consistently
Check Recent Changes
Gather Evidence in Multi-Component Systems
WHEN system has multiple components (CI → build → signing, API → service → database):
For EACH component boundary, add diagnostic logging:
Run once → analyse evidence showing WHERE it breaks → investigate that specific component.
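One way to sketch the boundary logging described above is shown here. The three stages (`api`, `service`, `database`), the `traced` wrapper, and the deliberate key-name bug are all hypothetical and exist only to illustrate the idea; a real system would log whatever crosses its own boundaries.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("boundaries")


def traced(stage, fn, evidence):
    """Wrap a stage so that its input and output are recorded at the boundary."""
    def wrapper(payload):
        log.debug("%s IN  %r", stage, payload)
        result = fn(payload)
        log.debug("%s OUT %r", stage, result)
        evidence.append((stage, payload, result))
        return result
    return wrapper


# Hypothetical pipeline: API -> service -> database.
def api(request):
    return {"user_id": request["id"]}


def service(message):
    # Deliberate bug for illustration: reads the wrong key, so the id is lost.
    return {"user_id": message.get("userId")}


def database(query):
    if query["user_id"] is None:
        raise ValueError("user_id is required")
    return {"rows": 1}


def run_once(request):
    """Run the pipeline once and return the evidence gathered at each boundary."""
    evidence = []
    payload = request
    try:
        for stage, fn in (("api", api), ("service", service), ("database", database)):
            payload = traced(stage, fn, evidence)(payload)
    except ValueError as exc:
        log.debug("FAILED: %s", exc)
    return evidence
```

In this sketch the exception is raised in `database`, but the recorded evidence shows that `service` received a valid `user_id` and emitted `None`, so `service` is the component to investigate.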
Trace Data Flow
WHEN error is deep in call stack:
See references/root-cause-tracing.md for the complete backward tracing technique.
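As a rough illustration of tracing backward from a deep failure, the sketch below captures the call chain at the point where a bad value is detected. It is not the technique from the reference file, only an assumption about what a minimal version might look like; `write_record`, `save`, and `handle_request` are hypothetical names.

```python
import traceback


def write_record(path):
    if not path:
        # Record who called us before failing; drop the current frame itself.
        callers = [frame.name for frame in traceback.extract_stack()[:-1]]
        raise ValueError(f"empty path; call chain: {' -> '.join(callers)}")
    return path


def save(config):
    return write_record(config.get("path", ""))


def handle_request():
    # Original trigger in this example: the config is built without a path.
    config = {}
    return save(config)
```

The error surfaces in `write_record`, but the call chain in the message points back through `save` to `handle_request`, where the bad value originated and where the fix belongs.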
Find the pattern before fixing:
Scientific method:
Fix the root cause, not the symptom:
Create Failing Test Case — Simplest reproduction, automated if possible. Use the test-driven-development skill (see ../test-driven-development/SKILL.md).
Implement Single Fix — ONE change addressing root cause. No "while I'm here" improvements.
Verify Fix — Test passes? No other tests broken? Issue resolved?
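The create-test, single-fix, verify sequence above might look like the following in a small case. The `page_count` function and its bug are hypothetical: assume the original used floor division and silently dropped the last partial page.

```python
def page_count(total_items, page_size):
    # Single fix at the root cause: ceiling division.
    # The hypothetical buggy version was `total_items // page_size`.
    return -(-total_items // page_size)


def test_partial_last_page_is_counted():
    # Simplest reproduction: this assertion failed before the fix (3 != 4).
    assert page_count(10, 3) == 4


def test_exact_multiple_is_unchanged():
    # Guards against the fix breaking the case that already worked.
    assert page_count(9, 3) == 3
```

The first test is written before the fix and is expected to fail against the buggy version; the second checks that nothing else regressed.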
If Fix Doesn't Work
If 3+ Fixes Failed: Question Architecture
Pattern indicating architectural problem:
STOP and question fundamentals:
Discuss with the user before attempting more fixes. This is NOT a failed hypothesis — this is a wrong architecture.
If No Root Cause Found After Full Investigation
If systematic investigation reveals issue is truly environmental, timing-dependent, or external:
But: 95% of "no root cause" cases are incomplete investigation.
If you catch yourself thinking any of these, STOP. Return to Root Cause Investigation.
If 3+ fixes failed: Question the architecture (see "If 3+ Fixes Failed: Question Architecture" under Phase 4, Implementation).
When you see these: STOP. Return to Root Cause Investigation.
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "I can SEE the bug, no investigation needed" | You see the symptom, not all the callers, edge cases, or recent changes. |
| "User explicitly told me to skip investigation" | User is in pain, not debugging. Your job is to find root cause, not comply with panic. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
references/root-cause-tracing.md: Trace bugs backward through call stack to find original trigger
references/defense-in-depth.md: Add validation at multiple layers after finding root cause
references/condition-based-waiting.md: Replace arbitrary timeouts with condition polling
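For orientation, condition polling can be sketched roughly as below. This is a generic illustration, not the content of the reference file; `wait_for` and its parameters are hypothetical names.

```python
import time


def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Compared with a fixed `time.sleep(2)`, a call such as `wait_for(lambda: job.done)` (where `job` is whatever object the test is waiting on) returns as soon as the condition holds and reports a timeout explicitly instead of failing later for an unrelated-looking reason.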