Machine Learning for Causal Inference Skill | Skills Pool

技能档案

Machine Learning for Causal Inference Skill

Econometrics skill for machine learning methods in causal inference. Activates when the user asks about: "causal forest", "generalized random forest", "GRF", "double machine learning", "DML", "debiased machine learning", "LASSO for variable selection", "post-LASSO", "heterogeneous treatment effects", "CATE", "conditional average treatment effect", "BLP analysis", "CLAN analysis", "causal tree", "honest estimation", "因果森林", "双重机器学习", "异质性处理效应", "条件平均处理效应", "LASSO变量选择", "机器学习因果推断", "去偏机器学习"

sheehe0 星标2026年4月16日

分类: 机器学习

技能内容

若当前会话由 Coase 研究工作流触发（/idea-discovery / /experiment-bridge / /paper-writing），本 skill 的输出必须按以下规则落入阶段文件，不得自行新建目录或脱离工作流上下文：

/idea-discovery Phase 2 Step 3 (Baseline Design Lock): 返回"模型设定 + 识别假设 + 主要识别风险"三段，由 planner 填入 planner/stage_7_baseline_design.md。此阶段不执行代码。
/experiment-bridge Phase 4 (Run Baseline): 生成并执行主回归代码，主回归表走 table skill 规范化，结果写入 executor/stage_1_run_baseline.md。
/experiment-bridge Phase 5 (Robustness): 提供本方法特有的替代估计量、识别诊断或敏感性检验，写入 executor/stage_2_explanation_robustness.md 对应条目。
/paper-writing Phase 6: 不直接参与。writer 从 executor/ 目录摘录方法描述，不得补跑回归。

若用户未指定工作流（直接提问使用本方法），忽略本节，按下方正文自由执行。

Machine Learning for Causal Inference Skill

This skill covers modern ML-based causal inference methods: Causal Forests (GRF) for heterogeneous treatment effects, Double/Debiased Machine Learning (DML) for partially linear models, and LASSO-based variable selection. These methods combine the flexibility of ML with the rigor of econometric identification.

相关技能

Goal	Method
Estimate average treatment effect with many controls	Double ML (DML)
Discover treatment effect heterogeneity	Causal Forest (GRF)
Variable selection for high-dimensional controls	Post-LASSO
Best linear predictor of CATE	BLP analysis
Subgroup with largest/smallest effects	CLAN analysis

Y = θ·D + g(X) + ε     (structural equation)
D = m(X) + v           (treatment equation)

θ = causal parameter of interest
g(X), m(X) = unknown nuisance functions estimated by ML

# R — DoubleML package
library(DoubleML)
library(mlr3)
library(mlr3learners)

# Define data
dml_data <- DoubleMLData$new(
  data = df,
  y_col = "outcome",
  d_cols = "treatment",
  x_cols = c("x1", "x2", "x3", "x4", "x5")
)

# Choose ML methods for nuisance estimation
ml_g <- lrn("regr.ranger", num.trees = 500)   # for E[Y|X]
ml_m <- lrn("classif.ranger", num.trees = 500) # for E[D|X]

# Fit DML (partially linear model)
dml_plr <- DoubleMLPLR$new(dml_data, ml_g, ml_m,
                            n_folds = 5, n_rep = 10)
dml_plr$fit()
print(dml_plr)
# Reports: coefficient, SE, t-stat, p-value, CI

# Interactive model (for CATE via DML)
dml_irm <- DoubleMLIRM$new(dml_data, ml_g, ml_m,
                            n_folds = 5, score = "ATE")
dml_irm$fit()
print(dml_irm)

# Python — DoubleML
import doubleml as dml
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Define data
dml_data = dml.DoubleMLData(
    df, y_col='outcome', d_cols='treatment',
    x_cols=['x1', 'x2', 'x3', 'x4', 'x5']
)

# Nuisance learners
ml_g = RandomForestRegressor(n_estimators=500, max_depth=5)
ml_m = RandomForestClassifier(n_estimators=500, max_depth=5)

# Partially Linear Regression
dml_plr = dml.DoubleMLPLR(dml_data, ml_g, ml_m,
                            n_folds=5, n_rep=10)
dml_plr.fit()
print(dml_plr.summary)

# Confidence interval
print(dml_plr.confint())

* Stata — ddml (Ahrens et al. 2024)
ssc install ddml
ssc install pystacked

* Partially linear model with cross-fitting
ddml init partial, kfolds(5) reps(10)
ddml E[outcome]: pystacked outcome x1 x2 x3 x4 x5, ///
    type(reg) methods(rf gradboost lassocv)
ddml E[treatment]: pystacked treatment x1 x2 x3 x4 x5, ///
    type(class) methods(rf gradboost lassocv)
ddml crossfit
ddml estimate, robust

# R — Generalized Random Forest
library(grf)

# Prepare data
X <- as.matrix(df[, c("x1", "x2", "x3", "x4", "x5")])
Y <- df$outcome
W <- df$treatment

# Fit causal forest
cf <- causal_forest(X, Y, W,
                    num.trees = 4000,
                    honesty = TRUE,          # honest estimation
                    tune.parameters = "all") # auto-tune

# Average treatment effect (ATE)
ate <- average_treatment_effect(cf, target.sample = "all")
cat("ATE:", ate["estimate"], "SE:", ate["std.err"], "\n")

# ATT
att <- average_treatment_effect(cf, target.sample = "treated")
cat("ATT:", att["estimate"], "SE:", att["std.err"], "\n")

# Individual-level CATE predictions
cate <- predict(cf, estimate.variance = TRUE)
df$cate_hat <- cate$predictions
df$cate_se  <- sqrt(cate$variance.estimates)

# Variable importance
varimp <- variable_importance(cf)
names(varimp) <- colnames(X)
sort(varimp, decreasing = TRUE)

# Python — EconML (Microsoft)
from econml.dml import CausalForestDML

# Fit causal forest via DML
cf = CausalForestDML(
    model_y=RandomForestRegressor(n_estimators=500),
    model_t=RandomForestClassifier(n_estimators=500),
    n_estimators=4000,
    cv=5,
    random_state=42
)
cf.fit(Y=df['outcome'].values,
       T=df['treatment'].values,
       X=df[['x1', 'x2', 'x3', 'x4', 'x5']].values)

# ATE
ate = cf.ate_inference()
print(f"ATE: {ate.mean_point:.4f} (SE: {ate.stderr_mean:.4f})")

# CATE predictions
cate = cf.effect(df[['x1', 'x2', 'x3', 'x4', 'x5']].values)
df['cate_hat'] = cate

# R — BLP of CATE
library(grf)

# After fitting causal_forest cf:
blp <- best_linear_projection(cf, A = X)
print(blp)

# Interpretation:
# - Intercept: average effect
# - Coefficients: how CATE varies with each covariate
# - If all coefficients ≈ 0 → homogeneous treatment effect

# Sorted Group Average Treatment Effects (GATES)
# Split sample by predicted CATE quartiles
df$cate_quartile <- cut(df$cate_hat,
                         breaks = quantile(df$cate_hat, c(0, 0.25, 0.5, 0.75, 1)),
                         labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"),
                         include.lowest = TRUE)

# GATES regression
library(fixest)
gates <- feols(outcome ~ i(cate_quartile, treatment),
               data = df, vcov = "HC1")
summary(gates)
# Significant difference between Q4 and Q1 → heterogeneity exists

# CLAN: compare characteristics across CATE quartiles
clan_table <- df %>%
  group_by(cate_quartile) %>%
  summarise(across(c(x1, x2, x3, age, income), mean))
print(clan_table)

τ_AIPW = E[ μ₁(X) − μ₀(X) + D(Y − μ₁(X))/e(X) − (1−D)(Y − μ₀(X))/(1−e(X)) ]

# Python — AIPW / DR Learner (EconML)
from econml.dr import DRLearner
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
import numpy as np

# Data preparation
Y = df['outcome'].values
T = df['treatment'].values          # binary: 0/1
X = df[['x1', 'x2', 'x3']].values

# DRLearner implements AIPW with cross-fitting for CATE estimation
dr_learner = DRLearner(
    model_propensity=LogisticRegressionCV(cv=5),   # propensity score e(X)
    model_regression=RandomForestRegressor(n_estimators=200),  # outcome μ_d(X)
    model_final=RandomForestRegressor(n_estimators=200),       # CATE model
    cv=5,
    random_state=42
)
dr_learner.fit(Y, T, X=X)

# ATE via AIPW
ate = dr_learner.ate_inference(X=X)
print(f"AIPW ATE: {ate.mean_point:.4f} (SE: {ate.stderr_mean:.4f})")
print(f"95% CI: {ate.conf_int_mean()}")

# CATE predictions
cate = dr_learner.effect(X)

# R — AIPW using AIPW package
# install.packages("AIPW")
library(AIPW)
library(SuperLearner)

aipw_obj <- AIPW$new(
  Y = df$outcome,
  A = df$treatment,
  W = df[, c("x1", "x2", "x3")],
  Q.SL.library = c("SL.ranger", "SL.glm"),   # outcome model
  g.SL.library = c("SL.ranger", "SL.glm"),   # propensity model
  k_split = 5,       # cross-fitting folds
  verbose = FALSE
)
aipw_obj$fit()
aipw_obj$summary()
# Reports: ATE, RR, OR with 95% CI

Learner	Approach	Best When
S-Learner	Single model: fit μ(X, D), then CATE = μ(X,1) − μ(X,0)	Simple baseline; may shrink CATE to zero if D is weak signal
T-Learner	Two separate models: μ₁(X) and μ₀(X)	Unequal sample sizes; less regularization shrinkage on treatment
X-Learner	Imputes counterfactuals, then fits CATE on imputed residuals	Unbalanced treatment (very few treated or controls)
R-Learner	Residualizes Y and D, then fits CATE on residuals	High confounding; closely related to DML

# Python — Meta-Learners via EconML
from econml.metalearners import TLearner, SLearner, XLearner
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import numpy as np

Y = df['outcome'].values
T = df['treatment'].values
X = df[['x1', 'x2', 'x3']].values

# --- T-Learner ---
t_learner = TLearner(models=RandomForestRegressor(n_estimators=500))
t_learner.fit(Y, T, X=X)
cate_t = t_learner.effect(X)
print(f"T-Learner ATE: {np.mean(cate_t):.4f}")

# --- S-Learner ---
s_learner = SLearner(overall_model=RandomForestRegressor(n_estimators=500))
s_learner.fit(Y, T, X=X)
cate_s = s_learner.effect(X)

# --- X-Learner ---
x_learner = XLearner(
    models=RandomForestRegressor(n_estimators=500),
    propensity_model=RandomForestClassifier(n_estimators=500)
)
x_learner.fit(Y, T, X=X)
cate_x = x_learner.effect(X)

# --- R-Learner (via LinearDML with non-parametric final stage) ---
r_learner = LinearDML(
    model_y=RandomForestRegressor(n_estimators=500),
    model_t=RandomForestClassifier(n_estimators=500),
    cv=5, random_state=42
)
r_learner.fit(Y, T, X=X, W=None)
cate_r = r_learner.effect(X)

# Compare ATE across meta-learners
print(f"S: {np.mean(cate_s):.4f} | T: {np.mean(cate_t):.4f} | "
      f"X: {np.mean(cate_x):.4f} | R: {np.mean(cate_r):.4f}")

# R — X-Learner using grf building blocks
library(grf)

# Step 1: T-Learner stage
X_mat <- as.matrix(df[, c("x1", "x2", "x3")])
Y <- df$outcome; W <- df$treatment

rf1 <- regression_forest(X_mat[W==1, ], Y[W==1])  # treated
rf0 <- regression_forest(X_mat[W==0, ], Y[W==0])  # control

# Step 2: Impute counterfactuals
mu1 <- predict(rf1, X_mat)$predictions
mu0 <- predict(rf0, X_mat)$predictions

# Step 3: X-Learner imputed effects
D1 <- Y[W==1] - predict(rf0, X_mat[W==1,])$predictions  # treated: Y(1) - mu0
D0 <- predict(rf1, X_mat[W==0,])$predictions - Y[W==0]  # control: mu1 - Y(0)

# Step 4: Fit CATE models on imputed effects
tau1 <- regression_forest(X_mat[W==1,], D1)
tau0 <- regression_forest(X_mat[W==0,], D0)

# Step 5: Combine using propensity score
e_hat <- regression_forest(X_mat, W)$predictions  # propensity
cate_x <- e_hat * predict(tau0, X_mat)$predictions +
          (1 - e_hat) * predict(tau1, X_mat)$predictions
cat("X-Learner ATE:", mean(cate_x), "\n")

# R — Post-LASSO
library(hdm)

# Post-double-selection LASSO for ATE
pds <- rlassoEffect(x = X, y = Y, d = W, method = "double selection")
summary(pds)
# Reports: coefficient, SE, t-stat, CI

# Python — Post-LASSO
from sklearn.linear_model import LassoCV
import statsmodels.api as sm

# Step 1: LASSO on Y ~ X to select controls
lasso_y = LassoCV(cv=5).fit(X, Y)
selected_y = np.where(lasso_y.coef_ != 0)[0]

# Step 2: LASSO on D ~ X to select controls
lasso_d = LassoCV(cv=5).fit(X, W)
selected_d = np.where(lasso_d.coef_ != 0)[0]

# Step 3: Union of selected variables
selected = np.union1d(selected_y, selected_d)

# Step 4: OLS with selected controls
X_selected = sm.add_constant(np.column_stack([W, X[:, selected]]))
ols_result = sm.OLS(Y, X_selected).fit(cov_type='HC1')
print(f"Post-LASSO ATE: {ols_result.params[1]:.4f} (SE: {ols_result.bse[1]:.4f})")

* Stata — Post-double-selection LASSO
ssc install lassopack

* pdslasso: post-double-selection
pdslasso outcome treatment (x1-x50), robust

# Test forest calibration: does the forest detect heterogeneity?
calibration <- test_calibration(cf)
print(calibration)
# Row 1 (mean forest prediction): should be significant → forest detects an effect
# Row 2 (differential forest prediction): significant → heterogeneity exists

# Out-of-bag predictions (built into grf)
oob_predictions <- predict(cf)$predictions  # uses OOB by default
cor(oob_predictions, df$true_cate)  # if true CATE known (simulation)