M-estimation, influence functions, and semiparametric efficiency theory for causal inference
Rigorous framework for statistical inference and efficiency in modern methodology
Use this skill when working on: asymptotic properties of estimators, influence functions, semiparametric efficiency, double robustness, variance estimation, confidence intervals, hypothesis testing, M-estimation, or deriving limiting distributions.
Cramér-Rao Lower Bound: For any unbiased estimator, $$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$
where $I(\theta)$ is the Fisher information.
Semiparametric Efficiency Bound: The variance of the efficient influence function: $$V_{eff} = E[\phi^*(\theta_0)^2]$$
where $\phi^*$ is the efficient influence function (EIF).
Influence Function Notation: $IF(O; \theta, P)$ represents the influence of observation $O$ on parameter $\theta$ under distribution $P$: $$IF(O; \theta, P) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)P + \epsilon \delta_O) - T(P)}{\epsilon}$$
Semiparametric Variance: For RAL estimators, $$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, E[IF(O)^2])$$
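To make the Gateaux-derivative definition concrete, here is a minimal numerical sketch (names and seed are illustrative) for the mean functional $T(P) = E[X]$, whose influence function is $x - E[X]$:

```r
# Gateaux derivative of the mean functional T(P) = E[X] at the
# empirical distribution, contaminated with a point mass at x0
set.seed(1)
x <- rnorm(500)
x0 <- 2.5
eps <- 1e-6

T_Pn <- mean(x)
# The mean is linear in P: T((1 - eps) P_n + eps * delta_x0)
T_contam <- (1 - eps) * T_Pn + eps * x0
if_numeric <- (T_contam - T_Pn) / eps

# Closed form: IF(x0; mean, P_n) = x0 - mean(x)
if_closed <- x0 - T_Pn
```

The RAL variance formula then gives $\text{Var}(\bar{X}_n) \approx E[IF^2]/n = \text{Var}(X)/n$, recovering the textbook result.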
Estimating Equations: M-estimators solve $\sum_{i=1}^n \psi(O_i; \theta) = 0$, with asymptotic variance: $$V = \left(\frac{\partial}{\partial \theta} E[\psi(O; \theta)]\right)^{-1} E[\psi(O; \theta)\psi(O; \theta)^T] \left(\frac{\partial}{\partial \theta} E[\psi(O; \theta)]\right)^{-T}$$
| Estimand | Efficient Influence Function | Efficiency Bound |
|---|---|---|
| ATE | $\phi_{ATE} = \frac{A}{\pi}(Y-\mu_1) - \frac{1-A}{1-\pi}(Y-\mu_0) + \mu_1 - \mu_0 - \psi$ | $V_{ATE} = E[\phi_{ATE}^2]$ |
| NDE | No simple closed form; see VanderWeele & Tchetgen Tchetgen (2014) | Higher than $V_{ATE}$ |
| NIE | No simple closed form; see VanderWeele & Tchetgen Tchetgen (2014) | Higher than $V_{ATE}$ |
```r
# Compute the semiparametric efficiency bound via the estimated
# efficient influence function (AIPW form for the ATE)
compute_efficiency_bound <- function(data, estimand = "ATE") {
  if (estimand != "ATE") stop("Only estimand = 'ATE' is implemented.")
  n <- nrow(data)

  # Estimate nuisance functions: propensity score and outcome regressions
  ps_model <- glm(A ~ X, data = data, family = binomial)
  pi_hat <- predict(ps_model, type = "response")
  mu1_model <- lm(Y ~ X, data = subset(data, A == 1))
  mu0_model <- lm(Y ~ X, data = subset(data, A == 0))
  mu1_hat <- predict(mu1_model, newdata = data)
  mu0_hat <- predict(mu0_model, newdata = data)

  # Efficient influence function, centered at the plug-in estimate
  psi_hat <- mean(mu1_hat - mu0_hat)
  phi <- with(data, {
    A / pi_hat * (Y - mu1_hat) -
      (1 - A) / (1 - pi_hat) * (Y - mu0_hat) +
      mu1_hat - mu0_hat - psi_hat
  })

  # Efficiency bound = variance of the EIF
  list(
    efficiency_bound = var(phi),
    standard_error = sqrt(var(phi) / n),
    eif_values = phi
  )
}
```
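As a usage sketch, the same EIF computation can be run end to end on simulated data (the data-generating process and seed below are illustrative; the nuisance steps are rewritten inline so the snippet runs standalone):

```r
set.seed(42)
n <- 2000
X <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * X))
Y <- 1 + 2 * A + X + rnorm(n)                    # true ATE = 2
dat <- data.frame(X = X, A = A, Y = Y)

# Nuisance estimates (same steps as compute_efficiency_bound)
pi_hat  <- fitted(glm(A ~ X, family = binomial, data = dat))
mu1_hat <- predict(lm(Y ~ X, data = dat, subset = A == 1), newdata = dat)
mu0_hat <- predict(lm(Y ~ X, data = dat, subset = A == 0), newdata = dat)

# Centered EIF and the one-step (AIPW) estimate
psi_plugin <- mean(mu1_hat - mu0_hat)
phi <- with(dat, A / pi_hat * (Y - mu1_hat) -
                 (1 - A) / (1 - pi_hat) * (Y - mu0_hat) +
                 mu1_hat - mu0_hat - psi_plugin)
psi_aipw <- psi_plugin + mean(phi)
se <- sd(phi) / sqrt(n)
c(estimate = psi_aipw, se = se)    # 95% CI: psi_aipw +/- 1.96 * se
```

The EIF-based standard error is exactly the plug-in for $\sqrt{V_{eff}/n}$ from the efficiency bound above.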
Empirical Process: $\mathbb{G}_n(f) = \sqrt{n}(\mathbb{P}_n - P)f = \frac{1}{\sqrt{n}}\sum_{i=1}^n (f(O_i) - Pf)$
Uniform Convergence: For a Donsker class $\mathcal{F}$, $$\sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)| \xrightarrow{d} \sup_{f \in \mathcal{F}} |\mathbb{G}(f)|$$
where $\mathbb{G}$ is a Gaussian process.
| Measure | Definition | Use |
|---|---|---|
| VC dimension | Max shattered set size | Classification |
| Covering number | $N(\epsilon, \mathcal{F}, \lVert\cdot\rVert)$ | General classes |
| Bracketing number | $N_{[\,]}(\epsilon, \mathcal{F}, L_2)$ | Entropy bounds |
| Rademacher complexity | $\mathcal{R}_n(\mathcal{F}) = E[\sup_{f \in \mathcal{F}} \lvert \frac{1}{n}\sum_i \epsilon_i f(X_i) \rvert]$ | Generalization bounds |
```r
# Estimate Rademacher complexity via Monte Carlo
# f_class: a list of functions, each mapping the data to a numeric vector
estimate_rademacher <- function(f_class, data, n_reps = 1000) {
  n <- nrow(data)
  sup_values <- replicate(n_reps, {
    # Draw i.i.d. Rademacher signs
    epsilon <- sample(c(-1, 1), n, replace = TRUE)
    # Supremum of |n^-1 sum_i epsilon_i f(O_i)| over the function class
    max(sapply(f_class, function(f) abs(mean(epsilon * f(data)))))
  })
  mean(sup_values)
}
```
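A quick sanity check, written self-contained (it inlines the same Monte Carlo loop): for a finite class of functions bounded by 1, Massart's lemma bounds the Rademacher complexity by $\sqrt{2\log(2|\mathcal{F}|)/n}$ (the factor 2 accounts for the absolute value). The threshold class and seed are illustrative:

```r
set.seed(7)
n <- 200
x <- rnorm(n)
thresholds <- seq(-2, 2, by = 0.5)

# Monte Carlo Rademacher complexity of the finite class f_t(x) = 1{x <= t}
r_hat <- mean(replicate(500, {
  eps <- sample(c(-1, 1), n, replace = TRUE)
  max(sapply(thresholds, function(t) abs(mean(eps * (x <= t)))))
}))

# Massart finite-class bound (functions bounded by 1)
massart <- sqrt(2 * log(2 * length(thresholds)) / n)
c(estimate = r_hat, bound = massart)
```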
A function class $\mathcal{F}$ is Donsker if $\mathbb{G}_n \rightsquigarrow \mathbb{G}$ in $\ell^\infty(\mathcal{F})$, where $\mathbb{G}$ is a tight Gaussian process.
| Class | Description | Application |
|---|---|---|
| VC classes | Finite VC dimension | Classification functions |
| Smooth functions | Bounded derivatives | Regression estimators |
| Monotone functions | Uniformly bounded, monotone | Distribution functions |
| Lipschitz functions | Bounded Lipschitz constant | M-estimators |
For M-estimation: If $\psi(O, \theta)$ belongs to a Donsker class, then $$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, V)$$
where $V = (\partial_\theta E[\psi])^{-1} \text{Var}(\psi) (\partial_\theta E[\psi])^{-T}$
```r
# Heuristic numerical check of the Donsker entropy condition
#   int_0^1 sqrt(log N_[](eps, F, L_2)) d eps < Inf.
# `estimate_bracketing_number` must be supplied by the user, and a numerical
# integral over a finite grid is always finite, so treat this as a
# diagnostic of entropy growth rather than a proof of the Donsker property.
check_donsker_conditions <- function(psi_class, data) {
  epsilon_grid <- seq(0.01, 1, by = 0.01)
  bracket_numbers <- sapply(epsilon_grid, function(eps) {
    estimate_bracketing_number(psi_class, data, eps)  # N_[](eps, F, L_2)
  })
  # Interpolate the root log-entropy and integrate over the observed grid
  entropy_fn <- approxfun(epsilon_grid, sqrt(log(bracket_numbers)), rule = 2)
  entropy_integral <- integrate(entropy_fn,
                                lower = min(epsilon_grid), upper = 1)
  list(
    entropy_integral = entropy_integral$value,
    bracket_numbers = data.frame(epsilon = epsilon_grid, N = bracket_numbers)
  )
}
```
```
Estimator θ̂ₙ → Consistency → Asymptotic Normality → Efficiency → Inference
                    ↓                  ↓                  ↓            ↓
               θ̂ₙ →ᵖ θ₀    √n(θ̂ₙ−θ₀) →ᵈ N(0,V)      V = V_eff   CIs, tests
```
Convergence in probability: $X_n \xrightarrow{p} X$ if $\forall \epsilon > 0$: $P(|X_n - X| > \epsilon) \to 0$
Consistency: $\hat{\theta}_n \xrightarrow{p} \theta_0$
Convergence in distribution: $X_n \xrightarrow{d} X$ if $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$
Asymptotic normality: $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$
Almost sure convergence: $X_n \xrightarrow{a.s.} X$ if $P(\lim_{n\to\infty} X_n = X) = 1$
Relationship: $\xrightarrow{a.s.} \Rightarrow \xrightarrow{p} \Rightarrow \xrightarrow{d}$
| Notation | Meaning | Example |
|---|---|---|
| $O_p(1)$ | Bounded in probability | $\hat{\theta}_n = O_p(1)$ |
| $o_p(1)$ | Converges to 0 in probability | $\hat{\theta}_n - \theta_0 = o_p(1)$ |
| $O_p(a_n)$ | $X_n/a_n = O_p(1)$ | $\hat{\theta}_n - \theta_0 = O_p(n^{-1/2})$ |
| $o_p(a_n)$ | $X_n/a_n = o_p(1)$ | Remainder terms |
Weak LLN: If $X_1, \ldots, X_n$ iid with $E|X| < \infty$: $$\bar{X}_n \xrightarrow{p} E[X]$$
Strong LLN: If $X_1, \ldots, X_n$ iid with $E|X| < \infty$: $$\bar{X}_n \xrightarrow{a.s.} E[X]$$
Uniform LLN: For $\sup_{\theta \in \Theta}$ convergence, need additional conditions (compactness, envelope).
Classical CLT: If $X_1, \ldots, X_n$ iid with $E[X] = \mu$, $Var(X) = \sigma^2 < \infty$: $$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$$
Lindeberg-Feller CLT: For a triangular array $\{X_{ni}\}$ with $E[X_{ni}] = 0$ and $s_n^2 = \sum_{i=1}^n \text{Var}(X_{ni})$, if the Lindeberg condition holds, $$\frac{1}{s_n^2}\sum_{i=1}^n E[X_{ni}^2 \mathbf{1}(|X_{ni}| > \epsilon s_n)] \to 0 \quad \forall \epsilon > 0,$$ then $s_n^{-1}\sum_{i=1}^n X_{ni} \xrightarrow{d} N(0, 1)$.
Multivariate CLT: $$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \Sigma)$$
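The classical CLT is easy to verify by simulation; a minimal sketch with skewed Exp(1) draws (so normality of the mean is not inherited from the draws themselves):

```r
set.seed(123)
n <- 500
reps <- 2000

# Standardized sample means of Exp(1) draws (mu = 1, sigma = 1)
z <- replicate(reps, sqrt(n) * (mean(rexp(n)) - 1))

c(mean = mean(z), sd = sd(z))   # approximately 0 and 1
```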
Slutsky's Theorem: If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (constant), then $X_n + Y_n \xrightarrow{d} X + c$ and $Y_n X_n \xrightarrow{d} cX$.
Continuous Mapping Theorem: If $X_n \xrightarrow{d} X$ and $g$ continuous: $$g(X_n) \xrightarrow{d} g(X)$$
Delta Method: If $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$ and $g$ differentiable at $\theta_0$: $$\sqrt{n}(g(\hat{\theta}_n) - g(\theta_0)) \xrightarrow{d} N(0, g'(\theta_0)^\top V g'(\theta_0))$$
Multivariate: Replace $g'(\theta_0)$ with Jacobian matrix.
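A worked delta-method example (illustrative choice of $g$ and distribution): for $g(\mu) = \log(\mu)$ with $X \sim \text{Exp}(1)$, $g'(\mu) = 1/\mu$, so the delta-method SE of $\log(\bar{X}_n)$ is $s_n/(\sqrt{n}\,\bar{X}_n)$, which can be checked against a Monte Carlo estimate:

```r
set.seed(99)
n <- 400

# Delta method: SE(log(xbar)) ~ g'(mu) * SE(xbar) = sd(x) / (sqrt(n) * xbar)
x <- rexp(n)
delta_se <- sd(x) / (sqrt(n) * mean(x))

# Monte Carlo check: sd of log(sample mean) across fresh replications
mc_se <- sd(replicate(2000, log(mean(rexp(n)))))
c(delta = delta_se, monte_carlo = mc_se)
```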
Estimator $\hat{\theta}_n$ solves: $$\hat{\theta}_n = \arg\max_{\theta \in \Theta} M_n(\theta)$$
where $M_n(\theta) = n^{-1} \sum_{i=1}^n m(O_i; \theta)$
Result: under identifiability of $\theta_0$ and uniform convergence $\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \xrightarrow{p} 0$: $\hat{\theta}_n \xrightarrow{p} \theta_0$
Result: $$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, [-\ddot{M}(\theta_0)]^{-1} V [-\ddot{M}(\theta_0)]^{-1})$$
Sandwich estimator: $$\hat{V} = \hat{A}^{-1} \hat{B} \hat{A}^{-T}, \qquad \hat{A} = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial \theta^T} \psi(O_i; \hat{\theta}), \qquad \hat{B} = \frac{1}{n}\sum_{i=1}^n \psi(O_i; \hat{\theta})\,\psi(O_i; \hat{\theta})^T$$
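A hand-rolled sandwich for logistic regression, where the score is $\psi_i = x_i(y_i - p_i)$; the simulated design is illustrative, and since the model is correctly specified the robust SEs should roughly agree with the model-based ones:

```r
set.seed(11)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)
X <- model.matrix(fit)
p_hat <- fitted(fit)

# Bread: A_hat = -n^-1 X' diag(p(1-p)) X  (derivative of the score)
A_hat <- -crossprod(X * (p_hat * (1 - p_hat)), X) / n
# Meat: B_hat = n^-1 sum psi_i psi_i' with psi_i = x_i (y_i - p_i)
psi <- X * (y - p_hat)
B_hat <- crossprod(psi) / n

V_sandwich <- solve(A_hat) %*% B_hat %*% solve(A_hat) / n
robust_se <- sqrt(diag(V_sandwich))
model_se  <- sqrt(diag(vcov(fit)))
rbind(robust = robust_se, model = model_se)
```

Under misspecification (e.g., clustered or heteroscedastic data) the two rows diverge, and the sandwich form remains valid.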