Estimators and the Delta Method
Why Estimators Deserve Your Attention
In statistics and machine learning, we almost never observe the quantity we truly care about.
Instead, we estimate it.
Whether it’s a population mean, a regression coefficient, a risk metric, or a model performance score, the object of interest is typically an unknown parameter $\theta$. An estimator is a rule—usually a function of data—that produces an approximation of this unknown quantity.
This article focuses on one of the most powerful tools for studying estimators:
The Delta Method
It allows us to approximate the variance (and distribution) of functions of estimators, using little more than calculus and asymptotics.
Estimators: A Quick Refresher
Let $X_1, \dots, X_n \sim P_\theta$ be i.i.d. data from a distribution indexed by an unknown parameter $\theta$.
An estimator is a function $\hat{\theta}_n = g(X_1, \dots, X_n)$ designed to approximate $\theta$.
Common properties we care about:
- Consistency: $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$
- Bias: $\mathbb{E}[\hat{\theta}_n] - \theta$
- Variance: $\mathrm{Var}(\hat{\theta}_n)$
- Asymptotic distribution: how $\hat{\theta}_n$ behaves as $n \to \infty$
Many classical estimators satisfy
$$ \sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2) $$
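To make this concrete, here is a minimal simulation sketch, assuming Exponential(1) data (so $\theta = 1$ and $\sigma^2 = 1$): the empirical variance of $\sqrt{n}(\hat{\theta}_n - \theta)$ for the sample mean should be close to $\sigma^2$.

```python
# Minimal sketch: simulate sqrt(n) * (theta_hat - theta) for the sample mean
# of Exponential(1) data and compare its spread to the N(0, sigma^2) limit.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
theta, sigma2 = 1.0, 1.0  # mean and variance of Exponential(1)

samples = rng.exponential(scale=1.0, size=(reps, n))
theta_hat = samples.mean(axis=1)          # one sample mean per replication
z = np.sqrt(n) * (theta_hat - theta)      # rescaled estimation error

print(f"empirical variance: {z.var():.3f} (theory: {sigma2})")
```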
But what happens if we care about a function of $\theta$?
The Core Problem the Delta Method Solves
Suppose:
- $\hat{\theta}_n$ estimates $\theta$
- You care about $h(\theta)$ for some smooth function $h$
Examples:
- $\theta = \sigma^2$, but you want $\sigma$
- $\theta = p$, but you want $\log(p / (1-p))$
- $\theta$ is a vector, but you want a nonlinear risk or performance metric
You compute: $h(\hat{\theta}_n)$
Question:
What is the variance (or distribution) of this transformed estimator?
Intuition: Everything Is a Taylor Expansion
The Delta Method is nothing more than a first-order Taylor approximation.
Expand $h(\hat{\theta}_n)$ around the true value $\theta$: $$ h(\hat{\theta}_n) \approx h(\theta) + h'(\theta)(\hat{\theta}_n - \theta) $$
Subtract $h(\theta)$ and rescale: $$ \sqrt{n}\left(h(\hat{\theta}_n) - h(\theta)\right) \approx h'(\theta)\sqrt{n}(\hat{\theta}_n - \theta) $$
If $$ \sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2), $$
then $$ \sqrt{n}\left(h(\hat{\theta}_n) - h(\theta)\right) \xrightarrow{d} \mathcal{N}\left(0, [h'(\theta)]^2 \sigma^2\right) $$
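A quick numerical check of this linearization, under the illustrative assumption $h(x) = e^x$ with $\hat{\theta}_n$ the sample mean of standard normal data (so $\theta = 0$, $\sigma^2 = 1$, and the predicted limiting variance is $[h'(0)]^2 \cdot 1 = 1$):

```python
# Numerical check of the linearization with h(x) = exp(x), theta = 0, sigma^2 = 1.
# The delta method predicts Var -> [h'(0)]^2 * sigma^2 = 1.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1_000, 10_000

theta_hat = rng.normal(size=(reps, n)).mean(axis=1)   # sample means of N(0, 1) data
lhs = np.sqrt(n) * (np.exp(theta_hat) - np.exp(0.0))  # sqrt(n) * (h(theta_hat) - h(theta))

print(f"empirical variance: {lhs.var():.3f} (delta-method prediction: 1.0)")
```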
The Delta Method (Formal Statement)
Let:
- $\hat{\theta}_n \xrightarrow{p} \theta$
- $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$
- $h$ is differentiable at $\theta$ with $h'(\theta) \neq 0$
Then: $$ \sqrt{n}\big(h(\hat{\theta}_n) - h(\theta)\big) \xrightarrow{d} \mathcal{N}\left(0, [h'(\theta)]^2 \sigma^2\right) $$
In practice, this yields the variance approximation: $$ \mathrm{Var}\big(h(\hat{\theta}_n)\big) \approx \frac{[h'(\theta)]^2 \sigma^2}{n} $$
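In code, the variance approximation is one line plus a derivative. The helper below is a sketch, not a library API: `delta_method_var` is a hypothetical name, the central-difference derivative is one convenient stand-in for $h'$, and the unknown $\theta$ is replaced by $\hat{\theta}_n$ as usual in practice.

```python
# Sketch of the one-dimensional delta method as a reusable helper.
# 'delta_method_var' is a hypothetical name; the central-difference
# derivative is one convenient stand-in for h'.
import math

def delta_method_var(h, theta_hat, sigma2, n, eps=1e-6):
    """Approximate Var(h(theta_hat)) as [h'(theta)]^2 * sigma2 / n,
    plugging in theta_hat for the unknown theta."""
    h_prime = (h(theta_hat + eps) - h(theta_hat - eps)) / (2 * eps)
    return (h_prime ** 2) * sigma2 / n

# Example: Var(log(theta_hat)) when theta_hat ~ 2.0 with sigma^2 = 4, n = 100
print(delta_method_var(math.log, theta_hat=2.0, sigma2=4.0, n=100))  # ~0.01
```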
Example 1: Estimating the Standard Deviation
Suppose the sample variance satisfies
$$ \hat{\sigma}^2 \approx \mathcal{N}\left(\sigma^2, \frac{2\sigma^4}{n}\right), $$
as it does asymptotically for Gaussian data.
You want $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$.
Define:
$$ h(x) = \sqrt{x}, \quad h'(x) = \frac{1}{2\sqrt{x}} $$
Apply the Delta Method:
$$ \mathrm{Var}(\hat{\sigma}) \approx \left(\frac{1}{2\sigma}\right)^2 \cdot \frac{2\sigma^4}{n} =\frac{\sigma^2}{2n} $$
Even though $\hat{\sigma}$ is nonlinear, its uncertainty is tractable.
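A simulation sketch confirming the $\sigma^2 / (2n)$ approximation, again assuming Gaussian data (where the $2\sigma^4/n$ variance for $\hat{\sigma}^2$ holds):

```python
# Simulation check of Var(sigma_hat) ~ sigma^2 / (2n), assuming Gaussian data
# (for which Var(sigma_hat^2) ~ 2 sigma^4 / n holds).
import numpy as np

rng = np.random.default_rng(2)
sigma, n, reps = 2.0, 500, 20_000

x = rng.normal(scale=sigma, size=(reps, n))
sigma_hat = x.std(axis=1, ddof=1)   # square root of the sample variance

print(f"empirical Var(sigma_hat): {sigma_hat.var():.5f}")
print(f"delta-method prediction : {sigma**2 / (2 * n):.5f}")
```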
Example 2: Log-Odds Transformation
Let $\hat{p}$ estimate a Bernoulli probability:
$$ \sqrt{n}(\hat{p} - p) \xrightarrow{d} \mathcal{N}(0, p(1-p)) $$
Define:
$$ h(p) = \log\left(\frac{p}{1-p}\right) \quad\Rightarrow\quad h'(p) = \frac{1}{p(1-p)} $$
Then:
$$ \mathrm{Var}(h(\hat{p})) \approx \frac{1}{n\,p(1-p)} $$
This underlies logistic regression inference and confidence intervals.
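As a sketch, here is a Wald-style confidence interval for the log-odds built from this delta-method variance (`log_odds_ci` is an illustrative helper, not a standard library API):

```python
# Wald-style confidence interval for the log-odds via the delta method.
# 'log_odds_ci' is an illustrative helper, not a standard library API.
import math

def log_odds_ci(successes, n, z=1.96):
    p_hat = successes / n
    log_odds = math.log(p_hat / (1 - p_hat))
    se = math.sqrt(1 / (n * p_hat * (1 - p_hat)))  # delta-method standard error
    return log_odds - z * se, log_odds + z * se

print(log_odds_ci(successes=40, n=100))  # e.g. 40 successes in 100 trials
```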
Multivariate Delta Method (Briefly)
If $\hat{\theta} \in \mathbb{R}^k$ and
$$ \sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma), $$
and $h:\mathbb{R}^k \to \mathbb{R}$ is differentiable at $\theta$,
then:
$$ \mathrm{Var}(h(\hat{\theta})) \approx \frac{1}{n} \nabla h(\theta)^\top \Sigma \nabla h(\theta) $$
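A minimal numerical sketch, using the illustrative choice $h(\theta) = \theta_1 / \theta_2$ (a ratio metric) and an assumed covariance matrix $\Sigma$:

```python
# Multivariate sketch with the illustrative choice h(theta) = theta_1 / theta_2
# and an assumed covariance matrix Sigma.
import numpy as np

def mv_delta_var(grad, Sigma, n):
    """Var(h(theta_hat)) ~ grad^T Sigma grad / n."""
    return grad @ Sigma @ grad / n

theta = np.array([2.0, 4.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
grad = np.array([1 / theta[1], -theta[0] / theta[1] ** 2])  # gradient of theta_1 / theta_2

print(mv_delta_var(grad, Sigma, n=500))
```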
This multivariate form is essential for:
- Risk metrics
- Composite performance scores
- Safety or reliability functions
- Post-model transformations
Connection to Test-Set Variance
Many evaluation metrics are functions of sample averages:
- Accuracy
- Error rates
- Mean loss
- Calibration metrics
Example:
$$ \hat{R} = \frac{1}{n}\sum_{i=1}^n \ell(Y_i, \hat{f}(X_i)) $$
Often we then apply:
- Logs
- Ratios
- Square roots
- Normalizations
The Delta Method tells us how the uncertainty in $\hat{R}$ propagates through each of these transformations, as the sketch below illustrates. This is especially important when:
- Comparing models
- Setting thresholds
- Reporting confidence intervals
- Making deployment decisions
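Putting these pieces together, here is a sketch of test-set uncertainty for a mean loss and its log transform; the per-example losses are simulated stand-ins, not real model outputs:

```python
# Sketch: standard error of a mean test loss R_hat and, via the delta method,
# of log(R_hat). The losses are simulated stand-ins for per-example losses.
import numpy as np

rng = np.random.default_rng(3)
losses = rng.gamma(shape=2.0, scale=0.1, size=5_000)  # hypothetical losses

n = losses.size
r_hat = losses.mean()
var_r = losses.var(ddof=1) / n        # variance of a sample average
var_log_r = var_r / r_hat ** 2        # delta method with h(x) = log x, h'(x) = 1/x

print(f"R_hat = {r_hat:.4f} +/- {np.sqrt(var_r):.4f}")
print(f"standard error of log(R_hat): {np.sqrt(var_log_r):.4f}")
```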
When the Delta Method Works (and When It Doesn’t)
Works well when:
- $n$ is large
- $h$ is smooth
- The estimator is asymptotically normal
Be careful when:
- $h'(\theta) = 0$ (the first-order term vanishes and the normal limit degenerates)
- The estimator is biased or unstable
- The distribution is heavy-tailed
- You are near boundaries (e.g. $p \approx 0$ or $1$)
In those cases:
- Higher-order Delta Methods
- Bootstrap
- Subsampling
may be more appropriate.
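For reference, a minimal nonparametric bootstrap sketch, one common alternative when the assumptions above are shaky (here, a proportion near the boundary):

```python
# Minimal nonparametric bootstrap sketch, one alternative when delta-method
# assumptions are shaky (here, a proportion near the boundary p ~ 0.03).
import numpy as np

def bootstrap_var(data, statistic, n_boot=2_000, seed=0):
    rng = np.random.default_rng(seed)
    stats = np.array([
        statistic(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    return stats.var(ddof=1)

x = np.random.default_rng(4).binomial(1, 0.03, size=400)  # rare-event indicator
print(bootstrap_var(x, lambda s: s.mean()))
```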
Why This Matters
The Delta Method is the bridge between:
- Estimation and inference
- Calculus and probability
- Raw metrics and decision-making
It teaches a deep lesson:
To first order, uncertainty propagates through a transformation in proportion to its sensitivity, i.e. its derivative.
Once you see that, you start to think differently about estimators, metrics, and confidence.
Key Takeaways
- Estimators are random variables, not just numbers
- The Delta Method approximates the variance of transformed estimators
- It is derived from a first-order Taylor expansion
- It underpins confidence intervals for nonlinear quantities
- It explains variance in test-set metrics and derived scores
If you understand the Delta Method, you understand how uncertainty flows through your entire modeling pipeline.