Causal reasoning methods — Causal analysis (cause-effect mechanisms, causal graphs, confounders) and Counterfactual reasoning (what-if scenarios, alternative histories, intervention analysis). Use when the user invokes `/think causal` or `/think counterfactual`, or asks about cause-and-effect relationships, confounders, or "what would have happened if..." questions.
$ARGUMENTS
Parse these arguments. The first word should be `causal` or `counterfactual`. The rest is the problem to reason about. If invoked via the think router, `$ARGUMENTS` is the same string the user originally typed after `/think`.
This category skill contains two closely related but distinct methods: Causal Analysis (building causal graphs, identifying mechanisms and confounders) and Counterfactual Reasoning (what-if scenarios, alternative histories, intervention analysis).
Causal analysis builds an explicit model of cause-effect relationships between variables. It goes beyond identifying correlations to explaining the mechanism by which one variable produces changes in another. Central concerns are (1) distinguishing correlation from causation, (2) identifying confounders that create spurious associations, and (3) tracing the causal chain — or chains — from a root cause to an observed effect.
Do not use Causal when:
Correlation means two variables move together. Causation means one variable produces a change in the other through a mechanism. The key tests:
Confounders are the most common source of misleading correlations. A confounder C is a variable that affects two other nodes A and B in the causal graph, making them appear causally linked when the real arrows both run outward from C rather than between A and B. Identifying and adjusting for confounders is essential before concluding that A causes B.
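The confounder pattern can be made concrete with a small simulation sketch (the variable names are illustrative, not from any real system): C drives both A and B, there is no A → B arrow, yet A and B correlate until we hold C fixed.

```python
import random

random.seed(0)

# Confounder C drives both A and B; there is no A -> B arrow.
n = 10_000
samples = []
for _ in range(n):
    c = random.gauss(0, 1)             # confounder
    a = 0.8 * c + random.gauss(0, 1)   # A depends only on C
    b = 0.8 * c + random.gauss(0, 1)   # B depends only on C
    samples.append((a, b, c))

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

a_all = [s[0] for s in samples]
b_all = [s[1] for s in samples]
print(f"corr(A, B)         = {corr(a_all, b_all):.2f}")  # clearly nonzero

# "Adjusting for C": within a narrow slice of C, the A-B correlation vanishes.
sliced = [(a, b) for a, b, c in samples if abs(c) < 0.1]
a_s = [p[0] for p in sliced]
b_s = [p[1] for p in sliced]
print(f"corr(A, B | C ~ 0) = {corr(a_s, b_s):.2f}")  # near zero
```

Conditioning on C is the simplest form of adjustment; the spurious A-B association disappears once the common cause is held fixed.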
Causal chains vs. common causes:
See reference/output-formats/causal.md for the authoritative JSON schema.
{
"mode": "causal",
"causalGraph": {
"nodes": [
{ "id": "<id>", "name": "<name>", "type": "cause|effect|mediator|confounder", "description": "<description>" }
],
"edges": [
{ "from": "<id>", "to": "<id>", "strength": 0.0, "confidence": 0.0, "mechanism": "<how A produces B>" }
]
},
"mechanisms": [
{ "from": "<id>", "to": "<id>", "description": "<full path description>", "type": "direct|indirect|feedback" }
],
"confounders": [
{ "nodeId": "<id>", "affects": ["<nodeId>", "<nodeId>"], "description": "<how this confounder creates spurious correlation>" }
],
"interventions": [
{
"nodeId": "<id>",
"action": "<what we would do>",
"expectedEffects": [{ "nodeId": "<id>", "expectedChange": "<description>", "confidence": 0.0 }]
}
]
}
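A minimal structural validator for this output shape, sketched in Python (field names come from the schema above; the rule set mirrors the checklist that follows):

```python
def validate_causal(output: dict) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    if output.get("mode") != "causal":
        errors.append('mode must be exactly "causal"')
    graph = output.get("causalGraph", {})
    nodes = graph.get("nodes", [])
    edges = graph.get("edges", [])
    if len(nodes) < 2:
        errors.append("causalGraph.nodes needs at least two entries")
    if len(edges) < 1:
        errors.append("causalGraph.edges needs at least one entry")
    ids = {n.get("id") for n in nodes}
    for e in edges:
        label = f"{e.get('from')} -> {e.get('to')}"
        if e.get("from") not in ids or e.get("to") not in ids:
            errors.append(f"edge {label} references an unknown node id")
        if not e.get("mechanism"):
            errors.append(f"edge {label} has a blank mechanism")
        if not -1.0 <= e.get("strength", 0.0) <= 1.0:
            errors.append(f"edge {label}: strength must be in [-1, 1]")
        if not 0.0 <= e.get("confidence", 0.0) <= 1.0:
            errors.append(f"edge {label}: confidence must be in [0, 1]")
    if len(output.get("mechanisms", [])) < 1:
        errors.append("mechanisms must document at least the primary causal path")
    return errors
```

Returning a list rather than raising keeps the check composable: a caller can surface every violation at once instead of stopping at the first.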
Validation rules:
- mode is exactly "causal"
- causalGraph.nodes has at least two entries
- causalGraph.edges has at least one entry
- every from and to references a valid node id
- every edge has a non-blank mechanism — a blank mechanism is a sign the causal claim is not yet justified
- confounders is present whenever two variables are correlated but the direct arrow is uncertain
- each strength is in [−1, 1]; each confidence is in [0, 1]
- the mechanisms array documents at least the primary causal path

Input: "Did the cache eviction policy change cause the p99 latency spike last week? Database query counts and server memory also changed at the same time."
Analysis: Server memory is a confounder — the same deployment that changed the eviction policy also changed memory limits, and both independently affect latency. Simply correlating eviction policy change with latency spike would conflate the two causes.
Output:
{
"mode": "causal",
"causalGraph": {
"nodes": [
{ "id": "eviction", "name": "Cache Eviction Policy", "type": "cause", "description": "Changed from LRU to LFU in deployment v4.2.1" },
{ "id": "hitrate", "name": "Cache Hit Rate", "type": "mediator", "description": "Percentage of requests served from cache" },
{ "id": "dbqueries", "name": "Database Query Count", "type": "mediator", "description": "Number of queries reaching the database per second" },
{ "id": "memlimit", "name": "Server Memory Limit", "type": "confounder", "description": "Reduced from 16 GB to 12 GB in same deployment v4.2.1" },
{ "id": "latency", "name": "p99 Latency", "type": "effect", "description": "99th-percentile response time in ms" }
],
"edges": [
{ "from": "eviction", "to": "hitrate", "strength": -0.6, "confidence": 0.85, "mechanism": "LFU evicts recently-added but infrequent keys; for bursty workloads this reduces hit rate vs LRU" },
{ "from": "hitrate", "to": "dbqueries", "strength": -0.9, "confidence": 0.95, "mechanism": "Each cache miss results in exactly one additional database query" },
{ "from": "dbqueries", "to": "latency", "strength": 0.8, "confidence": 0.9, "mechanism": "Each additional DB query adds ~40 ms median; under load, connection pool contention amplifies tail latency" },
{ "from": "memlimit", "to": "hitrate", "strength": -0.5, "confidence": 0.8, "mechanism": "Smaller memory limit reduces cache capacity, directly increasing eviction rate regardless of policy" },
{ "from": "memlimit", "to": "latency", "strength": 0.4, "confidence": 0.7, "mechanism": "Reduced memory triggers OS-level swapping under high load, adding I/O wait to request processing" }
]
},
"mechanisms": [
{
"from": "eviction",
"to": "latency",
"description": "Eviction policy change → lower cache hit rate → more DB queries → higher p99 latency (indirect, through hitrate and dbqueries)",
"type": "indirect"
},
{
"from": "memlimit",
"to": "latency",
"description": "Memory limit reduction independently raises latency through two paths: reduced cache capacity and OS swapping",
"type": "indirect"
}
],
"confounders": [
{
"nodeId": "memlimit",
"affects": ["hitrate", "latency"],
"description": "The same deployment v4.2.1 changed both eviction policy and memory limits. Without controlling for memory, naive correlation overstates the causal effect of the eviction policy alone."
}
]
}
Natural-language summary: "The eviction policy change does causally contribute to the latency spike — through the hitrate → dbqueries path — but it is not the sole cause. The memory limit reduction is a confounder: it was changed in the same deployment and independently raises latency through two mechanisms. Attributing the entire spike to the eviction policy would be incorrect. To isolate the policy's effect, run a controlled experiment: revert the policy with the new memory limit held fixed."
Counterfactual reasoning asks: "What would have happened if some past condition had been different?" It is distinct from causal analysis in direction and purpose — causal analysis explains what did happen and why; counterfactual reasoning imagines an alternative history by varying one or more conditions while holding others fixed, then traces through consequences.
Do not use Counterfactual when:
The critical technique in counterfactual reasoning is holding other variables fixed while varying only the counterfactual condition. This prevents confabulation — if you vary multiple conditions simultaneously, you cannot isolate which change would have mattered.
Intervention analysis: A counterfactual introduces a hypothetical do-operator: "what if we had done X?" This is not the same as observing that X occurred. Intervening on X severs its incoming causal arrows and lets downstream effects propagate from the new value.
See reference/output-formats/counterfactual.md for the authoritative JSON schema.
{
"mode": "counterfactual",
"actual": {
"id": "actual",
"name": "<name of actual scenario>",
"description": "<what actually happened>",
"conditions": [
{ "factor": "<condition name>", "value": "<actual value>" }
],
"outcomes": [
{ "description": "<what resulted>", "impact": "positive|negative|neutral", "magnitude": 0.0 }
]
},
"counterfactuals": [
{
"id": "cf1",
"name": "<name of alternative scenario>",
"description": "<the hypothetical>",
"conditions": [
{ "factor": "<same condition>", "value": "<changed value>", "isIntervention": true },
{ "factor": "<fixed condition>", "value": "<same as actual>" }
],
"outcomes": [
{ "description": "<what would have resulted>", "impact": "positive|negative|neutral", "magnitude": 0.0 }
],
"likelihood": 0.0
}
],
"comparison": {
"differences": [
{ "aspect": "<what differs>", "actual": "<actual value>", "counterfactual": "<hypothetical value>" }
],
"insights": ["<insight 1>"],
"lessons": ["<lesson for future decisions>"]
},
"interventionPoint": {
"description": "<what action would have needed to be taken>",
"timing": "<when it would have been needed>",
"feasibility": 0.0,
"expectedImpact": 0.0
}
}
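The single-intervention discipline described above is mechanically checkable. A sketch in Python, using the condition field names from the schema above:

```python
def single_intervention_ok(counterfactual: dict) -> bool:
    """True when exactly one condition carries isIntervention: true.

    Zero interventions means nothing was varied; two or more means the
    analysis can no longer isolate which change mattered.
    """
    flags = [c for c in counterfactual.get("conditions", [])
             if c.get("isIntervention")]
    return len(flags) == 1

# Example: a counterfactual that varies only the rollback decision.
cf = {
    "conditions": [
        {"factor": "Deployment version", "value": "v4.3.0 (bad)"},
        {"factor": "Rollback decision",
         "value": "Rollback initiated at 01:45 AM",
         "isIntervention": True},
    ]
}
print(single_intervention_ok(cf))  # True
```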
Validation rules:
- mode is exactly "counterfactual"
- actual is fully populated with at least one condition and one outcome
- counterfactuals has at least one entry
- exactly one condition per counterfactual is marked isIntervention: true — if multiple conditions are intervened on simultaneously, the analysis is no longer isolating a single counterfactual
- comparison.lessons is non-empty — counterfactual analysis that produces no lessons has not been completed
- interventionPoint.feasibility is in [0, 1]; a value near 0 indicates the counterfactual was practically impossible even if logically sound
- likelihood in each counterfactual is in [0, 1]

Input: "Would we have avoided the outage if we had rolled back before 2 AM? The outage started at 2:47 AM after a bad deployment at midnight. We had alerts at 1:30 AM but waited."
Key technique: only the rollback timing is varied. All other conditions — the bad deployment, the system architecture, the alert threshold — remain fixed. This isolates the causal effect of the rollback decision.
Output:
{
"mode": "counterfactual",
"actual": {
"id": "actual",
"name": "Midnight Deployment Outage",
"description": "Bad deployment at 00:00; first alerts at 01:30 AM; team waited; full outage began at 02:47 AM and lasted 3.5 hours",
"conditions": [
{ "factor": "Deployment version", "value": "v4.3.0 (bad)" },
{ "factor": "First alert time", "value": "01:30 AM" },
{ "factor": "Rollback decision", "value": "Not taken; team chose to investigate" },
{ "factor": "Outage start", "value": "02:47 AM" }
],
"outcomes": [
{ "description": "Full service outage lasting 3.5 hours, affecting all users", "impact": "negative", "magnitude": 0.9 }
]
},
"counterfactuals": [
{
"id": "cf1",
"name": "Rollback at 01:45 AM",
"description": "Team initiates rollback to v4.2.0 at 01:45 AM, 15 minutes after first alert and before the cascade failure threshold",
"conditions": [
{ "factor": "Deployment version", "value": "v4.3.0 (bad)" },
{ "factor": "First alert time", "value": "01:30 AM" },
{ "factor": "Rollback decision", "value": "Rollback initiated at 01:45 AM", "isIntervention": true },
{ "factor": "Outage start", "value": "N/A — rollback completes at 02:10 AM before cascade" }
],
"outcomes": [
{ "description": "15-minute service degradation during rollback; no full outage; cascade failure never reached", "impact": "negative", "magnitude": 0.15 }
],
"likelihood": 0.9
}
],
"comparison": {
"differences": [
{ "aspect": "Rollback timing", "actual": "Never initiated", "counterfactual": "01:45 AM (before cascade threshold)" },
{ "aspect": "Outage duration", "actual": "3.5 hours full outage", "counterfactual": "~25 minutes degraded, no full outage" },
{ "aspect": "User impact", "actual": "100% of users affected for 3.5 hours", "counterfactual": "Partial degradation, no complete service loss" }
],
"insights": [
"The rollback was available and technically feasible at 01:45 AM — the window existed",
"The cascade failure that caused full outage did not trigger until 02:47 AM, leaving a 77-minute decision window after first alert",
"The degradation cost of an early rollback (25 min) is an order of magnitude smaller than the actual outage cost (3.5 hours)"
],
"lessons": [
"Establish a pre-agreed rollback trigger: if alerts fire within 90 minutes of a deployment and root cause is not identified in 15 minutes, initiate rollback by default",
"The bias toward investigation over rollback is rational when rollback is costly; reduce rollback cost (faster procedure, better automation) to lower the threshold",
"Document the cascade failure threshold — knowing at what error rate full outage becomes inevitable turns a judgment call into a measurable tripwire"
]
},
"interventionPoint": {
"description": "Initiate rollback to v4.2.0 immediately after first alert confirmation at 01:30–01:45 AM",
"timing": "01:30–02:00 AM window (before cascade failure threshold at ~02:30 AM)",
"feasibility": 0.9,
"expectedImpact": 0.85
}
}
Natural-language summary: "Yes — a rollback before 2 AM would very likely have avoided the full outage. The counterfactual holds everything else fixed (same bad deployment, same alert timing, same architecture) and varies only the rollback decision. With a 77-minute window between the first alert and the cascade failure, the intervention was technically feasible (feasibility 0.9). The key lesson is structural: when rollback cost is low relative to outage cost, the default should be rollback-on-alert rather than investigate-first."