Stein's Paradox
Problem
You observe a single sample $X \sim \mathcal{N}(\mu, I_d)$, where $\mu \in \mathbb{R}^d$ is unknown. You want to estimate $\mu$ under squared error loss:

$$R(\hat\mu, \mu) = E_\mu\,\|\hat\mu(X) - \mu\|^2.$$

The obvious estimator is the MLE: $\hat\mu_{\mathrm{MLE}}(X) = X$, which has risk $R(\hat\mu_{\mathrm{MLE}}, \mu) = d$ for every $\mu$.

For $d = 1$ and $d = 2$: the MLE is admissible — no estimator can uniformly beat it.

For $d \ge 3$: show that the MLE is inadmissible, i.e., exhibit an estimator $\hat\mu$ such that

$$R(\hat\mu, \mu) < R(\hat\mu_{\mathrm{MLE}}, \mu) = d \quad \text{for all } \mu \in \mathbb{R}^d.$$

The James–Stein estimator is:

$$\hat\mu_{\mathrm{JS}}(X) = \left(1 - \frac{d-2}{\|X\|^2}\right) X.$$

Show that $R(\hat\mu_{\mathrm{JS}}, \mu) < d$ for all $\mu$ when $d \ge 3$.
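A quick Monte Carlo sketch of the phenomenon (assuming numpy; the dimension $d = 10$, the true mean, and the trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_trials = 10, 100_000
mu = rng.normal(size=d)                            # arbitrary true mean

x = mu + rng.normal(size=(n_trials, d))            # one draw of X ~ N(mu, I_d) per trial
mle_err = np.sum((x - mu) ** 2, axis=1)            # squared error of the MLE, X itself
shrink = 1 - (d - 2) / np.sum(x ** 2, axis=1)      # James-Stein factor 1 - (d-2)/||x||^2
js_err = np.sum((shrink[:, None] * x - mu) ** 2, axis=1)

print(f"MLE risk (theory: {d}): {mle_err.mean():.3f}")
print(f"JS  risk (must be < {d}): {js_err.mean():.3f}")
```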
Field
Statistics / Machine Learning
Why It's Beautiful
This is one of the most startling results in all of statistics. It says: even if you are estimating completely unrelated quantities (e.g., the temperature in Toronto, the GDP of Peru, and the mass of Jupiter), you should shrink all your estimates toward zero together — and this provably beats treating each problem independently.
The result shattered the intuition that "optimal estimation of independent quantities should be done independently." It led directly to the development of empirical Bayes methods, regularization (ridge regression shrinks toward zero for exactly this reason), and shrinkage estimators throughout modern statistics and ML.
Efron called it "the most striking result in post-war mathematical statistics."
Key Idea / Trick
Use Stein's identity: for $X \sim \mathcal{N}(\mu, I_d)$ and any weakly differentiable $g : \mathbb{R}^d \to \mathbb{R}^d$ with $E\left|\frac{\partial g_i}{\partial x_i}(X)\right| < \infty$:

$$E\,\langle X - \mu,\, g(X)\rangle = \sum_{i=1}^d E\left[\frac{\partial g_i}{\partial x_i}(X)\right].$$

Write $\hat\mu_{\mathrm{JS}}(X) = X + g(X)$ with $g(x) = -\dfrac{(d-2)\,x}{\|x\|^2}$, expand the squared loss, and apply the identity to evaluate the cross-term. The risk drops below $d$ precisely because the negative cross-term, $2E[\nabla \cdot g(X)] = -2(d-2)^2\, E[1/\|X\|^2]$, outweighs the penalty $E\|g(X)\|^2 = (d-2)^2\, E[1/\|X\|^2]$.
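As a sanity check on this plan, the cross-term can be estimated two ways and compared: directly, and via the divergence that Stein's identity produces. A minimal sketch, assuming numpy, with $d$ and $\mu$ chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 1_000_000
mu = np.full(d, 2.0)                         # arbitrary true mean
x = mu + rng.normal(size=(n, d))

sq = np.sum(x ** 2, axis=1)                  # ||X||^2 per sample
g = -(d - 2) * x / sq[:, None]               # g(x) = -(d-2) x / ||x||^2

lhs = np.mean(np.sum((x - mu) * g, axis=1))  # E<X - mu, g(X)> directly
rhs = -(d - 2) ** 2 * np.mean(1.0 / sq)      # E[div g(X)] = -(d-2)^2 E[1/||X||^2]
print(f"direct:   {lhs:+.4f}")
print(f"identity: {rhs:+.4f}")               # should agree up to Monte Carlo error
```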
Difficulty
4 / 5
Tags
Statistics, Estimation, Admissibility, James-Stein, Shrinkage, Stein's identity, Empirical Bayes, Regularization, MSE
Stein's Paradox — Answer
Setup
Let $X \sim \mathcal{N}(\mu, I_d)$ with $d \ge 3$. Write the James–Stein estimator as:

$$\hat\mu_{\mathrm{JS}}(X) = X + g(X), \qquad g(x) = -\frac{(d-2)\,x}{\|x\|^2}.$$

The risk of any estimator of the form $X + g(X)$ expands as:

$$E\|X + g(X) - \mu\|^2 = E\|X - \mu\|^2 + 2E\langle X - \mu,\, g(X)\rangle + E\|g(X)\|^2 = d + 2E\langle X - \mu,\, g(X)\rangle + E\|g(X)\|^2 \tag{1}$$
Stein's Identity
Lemma (Stein, 1981). If $X \sim \mathcal{N}(\mu, I_d)$ and $g : \mathbb{R}^d \to \mathbb{R}^d$ is weakly differentiable with $E\left|\frac{\partial g_i}{\partial x_i}(X)\right| < \infty$ for each $i$, then:

$$E\,\langle X - \mu,\, g(X)\rangle = \sum_{i=1}^d E\left[\frac{\partial g_i}{\partial x_i}(X)\right].$$

Proof sketch for $d = 1$: let $\phi_\mu(x) = \frac{1}{\sqrt{2\pi}} e^{-(x-\mu)^2/2}$ be the Gaussian density. Since $\phi_\mu'(x) = -(x - \mu)\,\phi_\mu(x)$, we get $(x - \mu)\,\phi_\mu(x) = -\phi_\mu'(x)$. Integrate by parts:

$$E[(X - \mu)\, g(X)] = -\int g(x)\, \phi_\mu'(x)\, dx = \int g'(x)\, \phi_\mu(x)\, dx = E[g'(X)],$$

where the boundary term vanishes because $g(x)\,\phi_\mu(x) \to 0$ as $x \to \pm\infty$. The multivariate case follows coordinate by coordinate.
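A quick numerical check of the one-dimensional identity, using the smooth bounded test function $g(x) = \tanh(x)$ (an arbitrary choice, with $g'(x) = 1 - \tanh^2(x)$) and assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, n = 0.7, 2_000_000
x = mu + rng.normal(size=n)

lhs = np.mean((x - mu) * np.tanh(x))   # E[(X - mu) g(X)]
rhs = np.mean(1 - np.tanh(x) ** 2)     # E[g'(X)], since tanh'(x) = 1 - tanh(x)^2
print(f"{lhs:.4f} vs {rhs:.4f}")       # should match up to Monte Carlo error
```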
Computing the Cross-Term
For $g(x) = -\dfrac{(d-2)\,x}{\|x\|^2}$, compute the divergence:

$$\frac{\partial g_i}{\partial x_i}(x) = -(d-2)\left(\frac{1}{\|x\|^2} - \frac{2x_i^2}{\|x\|^4}\right).$$

Using $\sum_i x_i^2 = \|x\|^2$ and summing over $i$:

$$\nabla \cdot g(x) = -(d-2)\left(\frac{d}{\|x\|^2} - \frac{2}{\|x\|^2}\right) = -\frac{(d-2)^2}{\|x\|^2}.$$

So by Stein's identity:

$$2E\langle X - \mu,\, g(X)\rangle = 2E[\nabla \cdot g(X)] = -2(d-2)^2\, E\!\left[\frac{1}{\|X\|^2}\right].$$
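The divergence algebra above can also be verified symbolically; a sketch using sympy, with $d = 4$ as an arbitrary concrete dimension:

```python
import sympy as sp

d = 4                                          # arbitrary concrete dimension
xs = sp.symbols(f"x1:{d + 1}")                 # x1, x2, x3, x4
r2 = sum(xi ** 2 for xi in xs)                 # ||x||^2
g = [-(d - 2) * xi / r2 for xi in xs]          # g_i(x) = -(d-2) x_i / ||x||^2

div_g = sp.simplify(sum(sp.diff(g[i], xs[i]) for i in range(d)))
print(div_g)   # -4/(x1**2 + x2**2 + x3**2 + x4**2), i.e. -(d-2)^2/||x||^2
```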
Computing the Squared Norm Term

$$E\|g(X)\|^2 = E\!\left[\frac{(d-2)^2\,\|X\|^2}{\|X\|^4}\right] = (d-2)^2\, E\!\left[\frac{1}{\|X\|^2}\right].$$
Putting It Together
Substituting both terms into $(1)$:

$$R(\hat\mu_{\mathrm{JS}}, \mu) = d - 2(d-2)^2\, E\!\left[\frac{1}{\|X\|^2}\right] + (d-2)^2\, E\!\left[\frac{1}{\|X\|^2}\right] = d - (d-2)^2\, E\!\left[\frac{1}{\|X\|^2}\right].$$

Since $\|X\|^2 > 0$ a.s. (and $E[1/\|X\|^2] < \infty$ for $d \ge 3$), we have $(d-2)^2\, E[1/\|X\|^2] > 0$. Therefore, when $d \ge 3$:

$$R(\hat\mu_{\mathrm{JS}}, \mu) < d = R(\hat\mu_{\mathrm{MLE}}, \mu)$$

for all $\mu \in \mathbb{R}^d$. The MLE is inadmissible.
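A concrete consequence: at $\mu = 0$, $\|X\|^2 \sim \chi^2_d$ and $E[1/\chi^2_d] = 1/(d-2)$, so the formula gives $R(\hat\mu_{\mathrm{JS}}, 0) = d - (d-2) = 2$ for every $d \ge 3$. A Monte Carlo check of this (a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
for d in (3, 5, 20, 100):
    x = rng.normal(size=(200_000, d))               # X ~ N(0, I_d)
    shrink = 1 - (d - 2) / np.sum(x ** 2, axis=1)   # James-Stein factor
    js_err = np.sum((shrink[:, None] * x) ** 2, axis=1)
    print(f"d={d:3d}: JS risk at mu=0 ~ {js_err.mean():.3f} (theory: 2)")
```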
Why $d \le 2$ Fails
For $d = 2$: the shrinkage factor $d - 2 = 0$, so the estimator reduces to the MLE and nothing is gained. For $d = 1$: $g(x) = 1/x$ blows up near the origin and $E[1/X^2] = \infty$, so the integrability hypothesis of Stein's identity fails — the argument doesn't yield a uniform improvement, and shrinkage estimators of this form can be worse than the MLE for some $\mu$. In fact for $d \le 2$ the MLE is admissible.
The phase transition at $d = 3$ is sharp and still not fully "intuitively explained" — it's one of those results that is mathematically clear but conceptually mysterious.
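One way to see the breakdown numerically: the key quantity $E[1/\|X\|^2]$ is infinite for $d \le 2$, so Monte Carlo estimates of it never stabilize, while for $d \ge 3$ they settle near $1/(d-2)$ at $\mu = 0$. A sketch, assuming numpy and taking $\mu = 0$ for concreteness:

```python
import numpy as np

rng = np.random.default_rng(4)
for d in (1, 2, 3, 5):
    x = rng.normal(size=(1_000_000, d))            # X ~ N(0, I_d)
    est = np.mean(1.0 / np.sum(x ** 2, axis=1))    # MC estimate of E[1/||X||^2]
    # For d <= 2 the expectation is infinite: estimates grow with the sample
    # size and swing wildly across seeds. For d >= 3 they settle near 1/(d-2).
    theory = f"theory {1 / (d - 2):.2f}" if d > 2 else "diverges"
    print(f"d={d}: estimate {est:8.2f} ({theory})")
```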
Connection to Ridge Regression
Ridge regression estimates $\hat\beta_{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$, which shrinks coefficients toward zero. This is exactly Stein shrinkage in disguise — ridge is justified not just as regularization against overfitting, but as a provably better estimator in the MSE sense when the number of parameters is at least $3$.
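A small simulation of that claim (a sketch, assuming numpy; the design, noise level, true $\beta$, and the fixed $\lambda$ are all arbitrary choices, and ridge's advantage depends on picking a suitable $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 30, 20, 10.0                 # fixed design; lambda chosen by hand
X = rng.normal(size=(n, p))
beta = 0.3 * rng.normal(size=p)          # smallish true coefficients

ols_err, ridge_err = [], []
for _ in range(2_000):
    y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)                     # OLS / MLE
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y) # ridge
    ols_err.append(np.sum((b_ols - beta) ** 2))
    ridge_err.append(np.sum((b_ridge - beta) ** 2))

print(f"OLS   MSE: {np.mean(ols_err):.4f}")
print(f"Ridge MSE: {np.mean(ridge_err):.4f}")   # smaller here; a suitable lambda always exists
```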