Estimators and the Delta Method
Why Estimators Deserve Your Attention
In statistics and machine learning, we almost never observe the quantity we truly care about.
Instead, we estimate it.
Whether it’s a population mean, a regression coefficient, a risk metric, or a model performance score, the object of interest is typically an unknown parameter $\theta$. An estimator is a rule—usually a function of data—that produces an approximation of this unknown quantity.
This article focuses on one of the most powerful tools for studying estimators:
The Delta Method
It allows us to approximate the variance (and distribution) of functions of estimators, using little more than calculus and asymptotics.
Estimators: A Quick Refresher
Let $X_1, \dots, X_n \sim P_\theta$ be i.i.d. data from a distribution indexed by an unknown parameter $\theta$.
An estimator is a function $\hat{\theta}_n = g(X_1, \dots, X_n)$ designed to approximate $\theta$.
Common properties we care about:
- Consistency: $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$
- Bias: $\mathbb{E}[\hat{\theta}_n] - \theta$
- Variance: $\mathrm{Var}(\hat{\theta}_n)$
- Asymptotic distribution: how $\hat{\theta}_n$ behaves as $n \to \infty$
Many classical estimators satisfy
$$ \sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2) $$
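To make this concrete, here is a minimal simulation sketch, assuming Exponential(1) data (so $\theta = 1$ and $\sigma^2 = 1$): the empirical variance of $\sqrt{n}(\hat{\theta}_n - \theta)$ for the sample mean should be close to $\sigma^2$.

```python
# Minimal sketch: simulate sqrt(n) * (theta_hat - theta) for the sample mean
# of Exponential(1) data and compare its spread to the N(0, sigma^2) limit.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
theta, sigma2 = 1.0, 1.0  # mean and variance of Exponential(1)

samples = rng.exponential(scale=1.0, size=(reps, n))
theta_hat = samples.mean(axis=1)          # one sample mean per replication
z = np.sqrt(n) * (theta_hat - theta)      # rescaled estimation error

print(f"empirical variance: {z.var():.3f} (theory: {sigma2})")
```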
But what happens if we care about a function of $\theta$?
The Core Problem the Delta Method Solves
Suppose:
- $\hat{\theta}_n$ estimates $\theta$
- You care about $h(\theta)$ for some smooth function $h$
Examples:
- $\theta = \sigma^2$, but you want $\sigma$
- $\theta = p$, but you want $\log(p / (1-p))$
- $\theta$ is a vector, but you want a nonlinear risk or performance metric
You compute: $h(\hat{\theta}_n)$
Question:
What is the variance (or distribution) of this transformed estimator?
Intuition: Everything Is a Taylor Expansion
The Delta Method is nothing more than a first-order Taylor approximation.
Expand $h(\hat{\theta}_n)$ around the true value $\theta$: $$ h(\hat{\theta}_n) \approx h(\theta) + h'(\theta)(\hat{\theta}_n - \theta) $$
Subtract $h(\theta)$ and rescale: $$ \sqrt{n}\left(h(\hat{\theta}_n) - h(\theta)\right) \approx h'(\theta)\sqrt{n}(\hat{\theta}_n - \theta) $$
If $$ \sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2), $$
then $$ \sqrt{n}\left(h(\hat{\theta}_n) - h(\theta)\right) \xrightarrow{d} \mathcal{N}\left(0, [h'(\theta)]^2 \sigma^2\right) $$
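A quick numerical check of this linearization, under the illustrative assumption $h(x) = e^x$ with $\hat{\theta}_n$ the sample mean of standard normal data (so $\theta = 0$, $\sigma^2 = 1$, and the predicted limiting variance is $[h'(0)]^2 \cdot 1 = 1$):

```python
# Numerical check of the linearization with h(x) = exp(x), theta = 0, sigma^2 = 1.
# The delta method predicts Var -> [h'(0)]^2 * sigma^2 = 1.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1_000, 10_000

theta_hat = rng.normal(size=(reps, n)).mean(axis=1)   # sample means of N(0, 1) data
lhs = np.sqrt(n) * (np.exp(theta_hat) - np.exp(0.0))  # sqrt(n) * (h(theta_hat) - h(theta))

print(f"empirical variance: {lhs.var():.3f} (delta-method prediction: 1.0)")
```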
The Delta Method (Formal Statement)
Let:
- $\hat{\theta}_n \xrightarrow{p} \theta$
- $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$
- $h$ is differentiable at $\theta$ with $h'(\theta) \neq 0$
Then: $$ \sqrt{n}\big(h(\hat{\theta}_n) - h(\theta)\big) \xrightarrow{d} \mathcal{N}\left(0, [h'(\theta)]^2 \sigma^2\right) $$
In practice, this yields the variance approximation: $$ \mathrm{Var}\big(h(\hat{\theta}_n)\big) \approx \frac{[h'(\theta)]^2 \sigma^2}{n} $$
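In code, the variance approximation is one line plus a derivative. The helper below is a sketch, not a library API: `delta_method_var` is a hypothetical name, the central-difference derivative is one convenient stand-in for $h'$, and the unknown $\theta$ is replaced by $\hat{\theta}_n$ as usual in practice.

```python
# Sketch of the one-dimensional delta method as a reusable helper.
# 'delta_method_var' is a hypothetical name; the central-difference
# derivative is one convenient stand-in for h'.
import math

def delta_method_var(h, theta_hat, sigma2, n, eps=1e-6):
    """Approximate Var(h(theta_hat)) as [h'(theta)]^2 * sigma2 / n,
    plugging in theta_hat for the unknown theta."""
    h_prime = (h(theta_hat + eps) - h(theta_hat - eps)) / (2 * eps)
    return (h_prime ** 2) * sigma2 / n

# Example: Var(log(theta_hat)) when theta_hat ~ 2.0 with sigma^2 = 4, n = 100
print(delta_method_var(math.log, theta_hat=2.0, sigma2=4.0, n=100))  # ~0.01
```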
Example 1: Estimating the Standard Deviation
Suppose the sample variance satisfies
$$ \hat{\sigma}^2 \approx \mathcal{N}\left(\sigma^2, \frac{2\sigma^4}{n}\right), $$
as it does asymptotically for Gaussian data.
You want $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$.
Define:
$$ h(x) = \sqrt{x}, \quad h'(x) = \frac{1}{2\sqrt{x}} $$
Apply the Delta Method:
$$ \mathrm{Var}(\hat{\sigma}) \approx \left(\frac{1}{2\sigma}\right)^2 \cdot \frac{2\sigma^4}{n} =\frac{\sigma^2}{2n} $$
Even though $\hat{\sigma}$ is nonlinear, its uncertainty is tractable.
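A simulation sketch confirming the $\sigma^2 / (2n)$ approximation, again assuming Gaussian data (where the $2\sigma^4/n$ variance for $\hat{\sigma}^2$ holds):

```python
# Simulation check of Var(sigma_hat) ~ sigma^2 / (2n), assuming Gaussian data
# (for which Var(sigma_hat^2) ~ 2 sigma^4 / n holds).
import numpy as np

rng = np.random.default_rng(2)
sigma, n, reps = 2.0, 500, 20_000

x = rng.normal(scale=sigma, size=(reps, n))
sigma_hat = x.std(axis=1, ddof=1)   # square root of the sample variance

print(f"empirical Var(sigma_hat): {sigma_hat.var():.5f}")
print(f"delta-method prediction : {sigma**2 / (2 * n):.5f}")
```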
Example 2: Log-Odds Transformation
Let $\hat{p}$ estimate a Bernoulli probability:
$$ \sqrt{n}(\hat{p} - p) \xrightarrow{d} \mathcal{N}(0, p(1-p)) $$
Define:
$$ h(p) = \log\left(\frac{p}{1-p}\right) \quad\Rightarrow\quad h'(p) = \frac{1}{p(1-p)} $$
Then:
$$ \mathrm{Var}(h(\hat{p})) \approx \frac{1}{n\,p(1-p)} $$
This underlies logistic regression inference and confidence intervals.
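As a sketch, here is a Wald-style confidence interval for the log-odds built from this delta-method variance (`log_odds_ci` is an illustrative helper, not a standard library API):

```python
# Wald-style confidence interval for the log-odds via the delta method.
# 'log_odds_ci' is an illustrative helper, not a standard library API.
import math

def log_odds_ci(successes, n, z=1.96):
    p_hat = successes / n
    log_odds = math.log(p_hat / (1 - p_hat))
    se = math.sqrt(1 / (n * p_hat * (1 - p_hat)))  # delta-method standard error
    return log_odds - z * se, log_odds + z * se

print(log_odds_ci(successes=40, n=100))  # e.g. 40 successes in 100 trials
```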
Multivariate Delta Method (Briefly)
If $\hat{\theta} \in \mathbb{R}^k$ and
$$ \sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma), $$
and $h:\mathbb{R}^k \to \mathbb{R}$ is differentiable at $\theta$,
then:
$$ \mathrm{Var}(h(\hat{\theta})) \approx \frac{1}{n} \nabla h(\theta)^\top \Sigma \nabla h(\theta) $$
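A minimal numerical sketch, using the illustrative choice $h(\theta) = \theta_1 / \theta_2$ (a ratio metric) and an assumed covariance matrix $\Sigma$:

```python
# Multivariate sketch with the illustrative choice h(theta) = theta_1 / theta_2
# and an assumed covariance matrix Sigma.
import numpy as np

def mv_delta_var(grad, Sigma, n):
    """Var(h(theta_hat)) ~ grad^T Sigma grad / n."""
    return grad @ Sigma @ grad / n

theta = np.array([2.0, 4.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
grad = np.array([1 / theta[1], -theta[0] / theta[1] ** 2])  # gradient of theta_1 / theta_2

print(mv_delta_var(grad, Sigma, n=500))
```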
This multivariate form is essential for:
- Risk metrics
- Composite performance scores
- Safety or reliability functions
- Post-model transformations
Connection to Test-Set Variance
Many evaluation metrics are functions of sample averages:
- Accuracy
- Error rates
- Mean loss
- Calibration metrics
Example:
$$ \hat{R} = \frac{1}{n}\sum_{i=1}^n \ell(Y_i, \hat{f}(X_i)) $$
Often we then apply:
- Logs
- Ratios
- Square roots
- Normalizations
The Delta Method tells us how the uncertainty in $\hat{R}$ propagates through each of these transformations, as the sketch below illustrates. This is especially important when:
- Comparing models
- Setting thresholds
- Reporting confidence intervals
- Making deployment decisions
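Putting these pieces together, here is a sketch of test-set uncertainty for a mean loss and its log transform; the per-example losses are simulated stand-ins, not real model outputs:

```python
# Sketch: standard error of a mean test loss R_hat and, via the delta method,
# of log(R_hat). The losses are simulated stand-ins for per-example losses.
import numpy as np

rng = np.random.default_rng(3)
losses = rng.gamma(shape=2.0, scale=0.1, size=5_000)  # hypothetical losses

n = losses.size
r_hat = losses.mean()
var_r = losses.var(ddof=1) / n        # variance of a sample average
var_log_r = var_r / r_hat ** 2        # delta method with h(x) = log x, h'(x) = 1/x

print(f"R_hat = {r_hat:.4f} +/- {np.sqrt(var_r):.4f}")
print(f"standard error of log(R_hat): {np.sqrt(var_log_r):.4f}")
```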
When the Delta Method Works (and When It Doesn’t)
Works well when:
- $n$ is large
- $h$ is smooth
- The estimator is asymptotically normal
Be careful when:
- $h'(\theta) = 0$ (the first-order term vanishes and the normal limit degenerates)
- The estimator is biased or unstable
- The distribution is heavy-tailed
- You are near boundaries (e.g. $p \approx 0$ or $1$)
In those cases:
- Higher-order Delta Methods
- Bootstrap
- Subsampling
may be more appropriate.
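For reference, a minimal nonparametric bootstrap sketch, one common alternative when the assumptions above are shaky (here, a proportion near the boundary):

```python
# Minimal nonparametric bootstrap sketch, one alternative when delta-method
# assumptions are shaky (here, a proportion near the boundary p ~ 0.03).
import numpy as np

def bootstrap_var(data, statistic, n_boot=2_000, seed=0):
    rng = np.random.default_rng(seed)
    stats = np.array([
        statistic(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    return stats.var(ddof=1)

x = np.random.default_rng(4).binomial(1, 0.03, size=400)  # rare-event indicator
print(bootstrap_var(x, lambda s: s.mean()))
```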
Why This Matters
The Delta Method is the bridge between:
- Estimation and inference
- Calculus and probability
- Raw metrics and decision-making
It teaches a deep lesson:
To first order, uncertainty propagates through a transformation in proportion to its sensitivity, i.e. its derivative.
Once you see that, you start to think differently about estimators, metrics, and confidence.
Key Takeaways
- Estimators are random variables, not just numbers
- The Delta Method approximates the variance of transformed estimators
- It is derived from a first-order Taylor expansion
- It underpins confidence intervals for nonlinear quantities
- It explains variance in test-set metrics and derived scores
If you understand the Delta Method, you understand how uncertainty flows through your entire modeling pipeline.