Skill content

Most root cause analysis is qualitative — you interview people, draw diagrams, write narratives. That's necessary but insufficient for systems where you have rich observational data and need defensible, reproducible, quantitative causal claims.

This skill teaches the mathematical techniques developed by Pearl, Spirtes, Glymour, and others to reason formally about cause from data: when you can identify a causal effect from observations alone, when you need an intervention, and how to compute what would have happened under a counterfactual.

The Ladder of Causation

Judea Pearl (The Book of Why, 2018) defines three levels of causal reasoning, each requiring strictly more machinery than the last:

Rung	Activity	Example question	Required
1	Association	"How strongly does latency correlate with CPU?"	Joint distribution P(X, Y)
2	Intervention	"What happens to latency if we force CPU=50%?"	do-calculus, do(X) operator
3	Counterfactual

U = {u_traffic, u_cpu_noise, u_lat_noise}
V = {Traffic, CPU, Latency}
F: Traffic := u_traffic
   CPU     := 0.7 * Traffic + u_cpu_noise
   Latency := 0.3 * CPU + 0.1 * Traffic + u_lat_noise
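The structural equations above can be simulated directly, which makes the gap between rung 1 and rung 2 concrete. The sketch below is a minimal illustration, not a definitive implementation: the Gaussian noise distributions, the sample size, and the helper names `sample` and `slope` are assumptions that do not come from the model definition. It compares the naive regression slope of Latency on CPU with the effect of forcing CPU through an intervention.

```python
import random

def sample(n, do_cpu=None, seed=0):
    """Draw n samples from the SCM; do_cpu replaces the CPU equation (an intervention)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Exogenous noise terms; the Gaussian scales are assumed for illustration.
        u_traffic = rng.gauss(0, 1)
        u_cpu_noise = rng.gauss(0, 0.1)
        u_lat_noise = rng.gauss(0, 0.1)
        traffic = u_traffic
        cpu = 0.7 * traffic + u_cpu_noise if do_cpu is None else do_cpu
        latency = 0.3 * cpu + 0.1 * traffic + u_lat_noise
        rows.append((traffic, cpu, latency))
    return rows

def mean(xs):
    return sum(xs) / len(xs)

def slope(xs, ys):
    """Simple least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Rung 1: observational association between CPU and Latency.
obs = sample(20000)
naive = slope([r[1] for r in obs], [r[2] for r in obs])

# Rung 2: mean Latency under do(CPU=1) minus mean Latency under do(CPU=0).
lat0 = mean([r[2] for r in sample(20000, do_cpu=0.0)])
lat1 = mean([r[2] for r in sample(20000, do_cpu=1.0)])
causal = lat1 - lat0

print(f"naive slope: {naive:.3f}, interventional effect: {causal:.3f}")
```

Under these assumed noise scales the naive slope lands near 0.44 because Traffic drives both CPU and Latency, while the interventional contrast recovers the structural coefficient 0.3.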

Traffic → CPU → Latency
       ↓________↑

P(Y | do(X=x)) = Σ_z P(Y | X=x, Z=z) P(Z=z)
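The backdoor adjustment formula above can be evaluated by hand on a small discrete example. The sketch below assumes a single binary confounder Z that satisfies the backdoor criterion for (X, Y); all probability tables are made-up illustrative numbers, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical probability tables (assumptions for illustration only).
p_z = {0: 0.7, 1: 0.3}                    # P(Z=z), e.g. Z = high traffic
p_x1_given_z = {0: 0.2, 1: 0.8}           # P(X=1 | Z=z), e.g. X = high CPU
p_y1_given_xz = {(0, 0): 0.1, (1, 0): 0.3,
                 (0, 1): 0.4, (1, 1): 0.6}  # P(Y=1 | X=x, Z=z)

def joint(x, y, z):
    """Joint P(x, y, z) implied by the tables above."""
    px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
    py = p_y1_given_xz[(x, z)] if y == 1 else 1 - p_y1_given_xz[(x, z)]
    return p_z[z] * px * py

def p_y1_given_x(x):
    """Observational P(Y=1 | X=x), confounded by Z."""
    num = sum(joint(x, 1, z) for z in (0, 1))
    den = sum(joint(x, y, z) for y, z in product((0, 1), repeat=2))
    return num / den

def p_y1_do_x(x):
    """Backdoor adjustment: sum_z P(Y=1 | X=x, Z=z) P(Z=z)."""
    total = 0.0
    for z in (0, 1):
        p_zv = sum(joint(xx, yy, z) for xx, yy in product((0, 1), repeat=2))
        p_y1_xz = joint(x, 1, z) / sum(joint(x, yy, z) for yy in (0, 1))
        total += p_y1_xz * p_zv
    return total

naive_diff = p_y1_given_x(1) - p_y1_given_x(0)
adjusted_diff = p_y1_do_x(1) - p_y1_do_x(0)
print(f"naive: {naive_diff:.3f}, adjusted: {adjusted_diff:.3f}")
```

With these tables the naive contrast is about 0.36 while the adjusted effect is 0.20; the difference is the part of the association that flows through Z.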

P(Y | do(X=x)) = Σ_m P(m | x) Σ_{x'} P(Y | m, x') P(x')
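The frontdoor formula above can be checked numerically on a toy model where the confounder is hidden. The sketch below assumes a graph X → M → Y with an unobserved U affecting both X and Y; the probability tables and helper names are hypothetical. It applies the frontdoor formula using only the observed joint over (X, M, Y) and compares the result with the true interventional probability, which is computable here only because the toy model exposes U.

```python
from itertools import product

# Hypothetical model with hidden confounder U (illustrative numbers only).
p_u1 = 0.4                                  # P(U=1)
p_x1_u = {0: 0.2, 1: 0.9}                   # P(X=1 | U=u)
p_m1_x = {0: 0.1, 1: 0.8}                   # P(M=1 | X=x)
p_y1_mu = {(0, 0): 0.1, (0, 1): 0.5,
           (1, 0): 0.4, (1, 1): 0.8}        # P(Y=1 | M=m, U=u)

def bern(p1, v):
    return p1 if v == 1 else 1 - p1

def full_joint(u, x, m, y):
    return (bern(p_u1, u) * bern(p_x1_u[u], x)
            * bern(p_m1_x[x], m) * bern(p_y1_mu[(m, u)], y))

# What an analyst actually sees: P(x, m, y) with U marginalized out.
obs = {(x, m, y): sum(full_joint(u, x, m, y) for u in (0, 1))
       for x, m, y in product((0, 1), repeat=3)}

def p(**fixed):
    """Marginal probability of a partial assignment over observed variables."""
    idx = {"x": 0, "m": 1, "y": 2}
    return sum(pr for key, pr in obs.items()
               if all(key[idx[k]] == v for k, v in fixed.items()))

def frontdoor_y1(x):
    """sum_m P(m | x) sum_x' P(Y=1 | m, x') P(x'), from observed data only."""
    total = 0.0
    for m in (0, 1):
        p_m_x = p(x=x, m=m) / p(x=x)
        inner = sum(p(x=xp, m=m, y=1) / p(x=xp, m=m) * p(x=xp) for xp in (0, 1))
        total += p_m_x * inner
    return total

def truth_y1(x):
    """Ground truth P(Y=1 | do(X=x)), using the hidden U."""
    return sum(bern(p_m1_x[x], m)
               * sum(p_y1_mu[(m, u)] * bern(p_u1, u) for u in (0, 1))
               for m in (0, 1))

print(frontdoor_y1(1), truth_y1(1))
```

The two quantities agree because M fully mediates the effect of X and is shielded from U given X, which is what the frontdoor criterion requires.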

Measure	Intuition	Best for
Degree centrality	How many neighbors does a node have?	Identifying hubs
Betweenness	How often does a node lie on shortest paths?	Identifying bottlenecks
Closeness	How far is a node from all others?	Identifying influential nodes
Eigenvector / PageRank	How connected are a node's neighbors?	Identifying prestige/importance
Katz	Weighted walks attenuated by distance	Weighted influence
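Three of the measures in the table can be computed with nothing more than breadth-first search. The sketch below uses a small, made-up undirected service graph (the node names and edges are assumptions) and assumes the graph is connected. It computes normalized degree, closeness, and unnormalized shortest-path betweenness; a graph library would normally be used instead of this hand-rolled version.

```python
from collections import deque

# Hypothetical undirected service dependency graph (illustrative only).
graph = {
    "gateway": ["auth", "orders"],
    "auth": ["gateway"],
    "orders": ["gateway", "payments", "db"],
    "payments": ["orders", "db"],
    "db": ["orders", "payments"],
}

def bfs(src):
    """Return (distance, number of shortest paths) from src to every node."""
    dist, sigma = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma

paths = {v: bfs(v) for v in graph}
n = len(graph)

# Degree: fraction of other nodes that are direct neighbors.
degree = {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}
# Closeness: inverse of the average distance to all other nodes.
closeness = {v: (n - 1) / sum(paths[v][0].values()) for v in graph}

def betweenness(v):
    """Sum over pairs (s, t) of the fraction of shortest s-t paths through v."""
    total = 0.0
    others = [u for u in graph if u != v]
    for i, s in enumerate(others):
        for t in others[i + 1:]:
            dist_s, sigma_s = paths[s]
            dist_v, sigma_v = paths[v]
            if dist_s[v] + dist_v[t] == dist_s[t]:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total

between = {v: betweenness(v) for v in graph}
for v in graph:
    print(v, round(degree[v], 2), round(closeness[v], 2), between[v])
```

In this toy graph `orders` ranks first on all three measures, which matches the intuition that it is both a hub and a bottleneck.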

TE(X → Y) = H(Y_{t+1} | Y_t^k) - H(Y_{t+1} | Y_t^k, X_t^l)
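For discrete series and history lengths k = l = 1, the transfer entropy above reduces to a sum over observed (Y_{t+1}, Y_t, X_t) triples. The sketch below is a plug-in (count-based) estimator under those assumptions; the function name and the synthetic series are hypothetical, and plug-in estimates are biased upward on short series.

```python
import math
import random
from collections import Counter

def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, with k = l = 1."""
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    n = sum(triples.values())
    pairs_yy, pairs_yx, singles = Counter(), Counter(), Counter()
    for (y1, y0, x0), c in triples.items():
        pairs_yy[(y1, y0)] += c
        pairs_yx[(y0, x0)] += c
        singles[y0] += c
    te = 0.0
    for (y1, y0, x0), c in triples.items():
        p_full = c / pairs_yx[(y0, x0)]           # P(y1 | y0, x0)
        p_hist = pairs_yy[(y1, y0)] / singles[y0]  # P(y1 | y0)
        te += (c / n) * math.log2(p_full / p_hist)
    return te

# Synthetic example: y copies x with a one-step delay.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(5000)]
y = [0] + x[:-1]

te_xy = transfer_entropy(x, y)
te_yx = transfer_entropy(y, x)
print(f"TE(X->Y) = {te_xy:.3f} bits, TE(Y->X) = {te_yx:.3f} bits")
```

Because Y is a delayed copy of X, knowing X_t removes nearly all uncertainty about Y_{t+1} (close to 1 bit), while the reverse direction stays near zero.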

I(X; Y) = Σ_{x,y} P(x,y) log [P(x,y) / (P(x)P(y))]
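The mutual information sum above can be estimated from paired discrete samples by replacing each probability with an observed frequency. The sketch below is a minimal plug-in estimator; it uses log base 2 so the result is in bits (a choice made here, since the formula does not fix the base), and the bucket labels are made up.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical bucketed telemetry: CPU level vs. latency class.
cpu = ["lo", "lo", "hi", "hi"]
latency_dependent = ["ok", "ok", "slow", "slow"]    # fully determined by cpu
latency_independent = ["ok", "slow", "ok", "slow"]  # unrelated to cpu

mi_dep = mutual_information(cpu, latency_dependent)
mi_ind = mutual_information(cpu, latency_independent)
print(mi_dep, mi_ind)
```

Note that mutual information is symmetric, so unlike transfer entropy it says nothing about direction.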

Y_t = Σ_i a_i Y_{t-i} + Σ_j b_j X_{t-j} + ε_t
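A Granger test compares the model above against a restricted model with all b_j set to zero. The sketch below is one way to do that with ordinary least squares and an F-statistic, assuming zero-mean series (so no intercept, matching the equation) and using only the standard library; the function names, lag choice, and synthetic data are assumptions. In practice a statistics package would also report a p-value.

```python
import random

def ols_rss(rows, targets):
    """Fit least squares via the normal equations; return the residual sum of squares."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [v - f * w for v, w in zip(a[row], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(x, y, lags=2):
    """F-statistic for 'lagged x improves prediction of y beyond lagged y'."""
    ts = range(lags, len(y))
    targets = [y[t] for t in ts]
    restricted = [[y[t - i] for i in range(1, lags + 1)] for t in ts]
    full = [[y[t - i] for i in range(1, lags + 1)]
            + [x[t - j] for j in range(1, lags + 1)] for t in ts]
    rss_r = ols_rss(restricted, targets)
    rss_f = ols_rss(full, targets)
    dof = len(targets) - 2 * lags
    return ((rss_r - rss_f) / lags) / (rss_f / dof)

# Synthetic data in which x drives y with a one-step delay.
rng = random.Random(0)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0, 0.5))

f_xy = granger_f(x, y)
f_yx = granger_f(y, x)
print(f"F(X->Y) = {f_xy:.1f}, F(Y->X) = {f_yx:.1f}")
```

A large F in one direction and a small one in the other indicates predictive precedence only; it does not rule out a hidden common driver.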

Input: an incident with rich telemetry (metrics, traces, logs, events)

Step 1. Scope the question as a causal query.
        "Did X cause Y?" or "What happens to Y under do(X=x)?"

Step 2. Construct (or learn) a DAG of the system.
        — Start from known topology (service map, code dependencies)
        — Augment with structure-learning over telemetry time series
        — Validate against domain experts

Step 3. Check identifiability.
        — Is there a valid backdoor adjustment set for (X, Y)?
        — Is there a valid frontdoor set?
        — If neither exists and no other identification strategy applies, the effect cannot be estimated from observational data alone; run an experiment.

Step 4. Estimate the causal effect.
        — Fit CPTs / structural equations from historical data
        — Use doubly-robust estimators when unsure about model form
        — Report uncertainty bands, not just point estimates

Step 5. Compute counterfactuals for each candidate root cause.
        — Abduct → Act → Predict
        — Rank candidates by the counterfactual effect on the outcome

Step 6. Validate with an intervention if possible.
        — Canary, chaos experiment, or A/B test
        — Compare observed effect to the SCM's prediction
        — If they disagree, the DAG or the fitted structural equations are wrong; revisit Step 2
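Step 5 (Abduct → Act → Predict) can be sketched for the linear Traffic/CPU/Latency model defined earlier. In the code below the structural coefficients come from that model, while the observed incident values, the "normal" baseline values used as counterfactual settings, and the function names are hypothetical. Ranking by latency reduction is one possible scoring choice, not the only one.

```python
def abduct(traffic, cpu, latency):
    """Abduction: recover the noise terms consistent with one observation."""
    return {
        "u_traffic": traffic,
        "u_cpu_noise": cpu - 0.7 * traffic,
        "u_lat_noise": latency - 0.3 * cpu - 0.1 * traffic,
    }

def predict(u, do=None):
    """Action + prediction: replay the SCM with fixed noise and optional interventions."""
    do = do or {}
    traffic = do.get("Traffic", u["u_traffic"])
    cpu = do.get("CPU", 0.7 * traffic + u["u_cpu_noise"])
    return 0.3 * cpu + 0.1 * traffic + u["u_lat_noise"]

# Hypothetical incident observation and hypothetical "normal" baselines.
observed = {"traffic": 2.0, "cpu": 1.6, "latency": 0.75}
baseline = {"Traffic": 1.0, "CPU": 0.9}

u = abduct(**observed)

# Counterfactual latency reduction had each candidate been at its baseline.
effects = {
    name: observed["latency"] - predict(u, {name: value})
    for name, value in baseline.items()
}
ranked = sorted(effects, key=effects.get, reverse=True)

for name in ranked:
    print(f"do({name}={baseline[name]}): latency would drop by {effects[name]:.2f}")
```

Here resetting Traffic lowers latency more than resetting CPU alone, because Traffic acts both directly and through CPU.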

Tool	Purpose	Language
DoWhy (Microsoft)	End-to-end causal inference pipeline	Python
CausalNex (QuantumBlack)	Bayesian network learning and inference	Python
pgmpy	Probabilistic graphical models	Python
causal-learn (CMU)	Causal discovery algorithms	Python
EconML (Microsoft)	Heterogeneous treatment effects	Python
CausalImpact (Google)	Bayesian structural time series causal inference	R
Tetrad (CMU)	Graphical causal modeling and discovery	Java
bnlearn	Bayesian network learning	R


Structural Causal Models (SCMs)

Why the DAG matters

Backdoor criterion

Frontdoor criterion

Counterfactual computation

Bayesian Networks for fault diagnosis

Fault-diagnosis workflow

Empirical performance

Structure learning

Graph-theoretic fault localization

Centrality measures

Paper 3 finding (our research)

Information-theoretic methods

Transfer entropy

Mutual information

Paper 4 finding

Granger causality

Caveats

Putting it all together — a causal-inference workflow for RCA

Tools and libraries

Common pitfalls

Conditioning on a collider

Mistaking a mediator for a confounder

Data dredging for DAGs

Temporal ordering violations

When causal inference is the wrong tool

Checklist before closing a causal-inference RCA

References
