Why Does LASSO Produce Sparse Solutions but Ridge Does Not?
Problem
Consider minimizing a convex differentiable loss $L(\beta)$ (e.g. least squares) subject to a norm constraint on $\beta$:

$$\min_{\beta} \; L(\beta) \quad \text{subject to} \quad \|\beta\|_1 \le t \;\; (\text{LASSO}) \qquad \text{or} \qquad \|\beta\|_2 \le t \;\; (\text{Ridge}).$$
Question: Give a geometric argument for why the LASSO ($\ell_1$) constraint tends to produce solutions with exact zeros (sparse $\beta$), while the Ridge ($\ell_2$) constraint does not, even when the unconstrained minimizer is the same for both.
Why It's Interesting
This is one of the most fundamental and beautiful insights in modern ML. Sparsity is not an assumption baked in; it emerges purely from the geometry of the $\ell_1$ ball. The same loss, the same data, a different shape of constraint: one gives you feature selection for free, the other never does.
Answer
Key Idea
The $\ell_1$ ball has corners on the coordinate axes. The $\ell_2$ ball is a smooth sphere with no corners. The optimal constrained solution is found where the loss level set first touches the constraint region, and touching a corner forces a coordinate to be exactly zero.
Geometric Argument
The constrained problem is equivalent (by Lagrange duality) to the penalized form:

$$\min_{\beta} \; L(\beta) + \lambda \|\beta\|_1 \;\; (\text{LASSO}) \qquad \text{or} \qquad \min_{\beta} \; L(\beta) + \lambda \|\beta\|_2^2 \;\; (\text{Ridge}).$$
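As a numerical sanity check of this equivalence, here is a minimal sketch (the synthetic data, penalty strength, solver settings, and the `project_l1` helper are illustrative assumptions, not from the source): fit the penalized LASSO, set $t$ equal to the $\ell_1$ norm of that solution, and recover the same coefficients by projected gradient descent on the constrained problem.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative synthetic data (all values are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)
n = len(y)

# Penalized form: min_b (1/(2n))||y - Xb||^2 + alpha*||b||_1  (scikit-learn's Lasso objective).
beta_pen = Lasso(alpha=0.1, fit_intercept=False, tol=1e-10, max_iter=10000).fit(X, y).coef_
t = np.abs(beta_pen).sum()                 # matching constraint radius

def project_l1(v, radius):
    # Euclidean projection onto {b : ||b||_1 <= radius}, standard sort-based algorithm.
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Constrained form: min_b (1/(2n))||y - Xb||^2  s.t.  ||b||_1 <= t, via projected gradient descent.
step = n / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
beta_con = np.zeros(X.shape[1])
for _ in range(5000):
    grad = X.T @ (X @ beta_con - y) / n
    beta_con = project_l1(beta_con - step * grad, t)

print("penalized  :", np.round(beta_pen, 4))
print("constrained:", np.round(beta_con, 4))   # essentially identical coefficient vectors
```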
Think of "inflating" the level sets of outward from the unconstrained minimizer until they first touch the constraint set.
LASSO ($\ell_1$ ball)
In $\mathbb{R}^2$, the $\ell_1$ ball $\{\beta : |\beta_1| + |\beta_2| \le t\}$ is a diamond (rotated square) with vertices at $(\pm t, 0)$ and $(0, \pm t)$.
The expanding elliptical level sets of $L$ generically hit one of these corners first: points of the form $(\beta_1, 0)$ or $(0, \beta_2)$.
At a corner, at least one coordinate is exactly zero. That is sparsity.
Ridge ($\ell_2$ ball)
The $\ell_2$ ball $\{\beta : \|\beta\|_2 \le t\}$ is a sphere (a disk in $\mathbb{R}^2$), which is strictly convex and smooth, with no corners anywhere.
A level set ellipse touches the sphere at an interior point of the boundary, generically where both coordinates are nonzero.
There is no mechanism to "snap" a coordinate to zero. (A small numerical sketch contrasting the two cases follows below.)
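The following minimal sketch makes the "first touch" picture concrete. The 2D quadratic loss, its unconstrained minimizer `beta_hat`, the curvatures `q`, and the radius `t` are illustrative assumptions, not values from the source: sample each constraint boundary densely and find where the elliptical loss attains its minimum.

```python
import numpy as np

# Illustrative 2D setup (all values are assumptions): an axis-aligned quadratic
# loss with unconstrained minimizer beta_hat outside both balls, and radius t.
t = 1.0
beta_hat = np.array([2.0, 0.3])     # unconstrained minimizer
q = np.array([1.0, 2.0])            # per-coordinate curvatures -> elliptical level sets

theta = np.linspace(0.0, 2.0 * np.pi, 100001)
u = np.column_stack([np.cos(theta), np.sin(theta)])
l2_boundary = t * u                                         # circle: ||b||_2 = t
l1_boundary = t * u / np.abs(u).sum(axis=1, keepdims=True)  # diamond: |b1| + |b2| = t

def losses(B):
    # quadratic loss q1*(b1 - bh1)^2 + q2*(b2 - bh2)^2, evaluated per row of B
    return ((B - beta_hat) ** 2 * q).sum(axis=1)

b_l1 = l1_boundary[np.argmin(losses(l1_boundary))]
b_l2 = l2_boundary[np.argmin(losses(l2_boundary))]

print("l1-constrained minimizer:", np.round(b_l1, 3))  # lands on the corner (1, 0): exact zero
print("l2-constrained minimizer:", np.round(b_l2, 3))  # smooth tangency point: both coordinates nonzero
```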
Why Corners Are Decisive
A corner of the $\ell_1$ ball is a point where the subdifferential of $\|\beta\|_1$ is large (it contains an entire interval, not just a single gradient direction). This gives the KKT conditions "room" to be satisfied even when the gradient of $L$ is not pointing exactly along a coordinate axis, so a wide range of loss functions get pinned to the corner.
Formally, at a corner like $\beta = (\beta_1, 0)$ with $\beta_1 \neq 0$, the KKT stationarity condition for LASSO on the zero coordinate is:

$$-\frac{\partial L}{\partial \beta_2}(\beta) \;\in\; \lambda \, \partial |\beta_2| \Big|_{\beta_2 = 0} = [-\lambda, \lambda], \qquad \text{i.e.} \qquad \left| \frac{\partial L}{\partial \beta_2}(\beta) \right| \le \lambda.$$

This is satisfiable for any gradient component with magnitude up to $\lambda$, a non-trivial range. Ridge has no such mechanism because the subdifferential of $\beta_2^2$ at zero is the singleton $\{0\}$, giving no slack: the zero coordinate can be stationary only if $\partial L / \partial \beta_2 = 0$ exactly.
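In one dimension this slack is visible in closed form. The sketch below (the scalar loss $\tfrac12(b - z)^2$ and the value of $\lambda$ are illustrative assumptions) compares the LASSO minimizer, soft-thresholding, with the Ridge minimizer, proportional shrinkage.

```python
import numpy as np

# Scalar loss L(b) = 0.5*(b - z)^2; lambda and the z values are illustrative assumptions.
lam = 1.0
z = np.array([-3.0, -0.8, -0.1, 0.0, 0.4, 2.5])   # unconstrained minimizers of the smooth loss

lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # argmin 0.5*(b - z)^2 + lam*|b|  (soft-thresholding)
ridge = z / (1.0 + 2.0 * lam)                          # argmin 0.5*(b - z)^2 + lam*b^2  (proportional shrinkage)

print("z     :", z)
print("lasso :", lasso)   # every |z| <= lam is pinned exactly to 0 by the KKT interval [-lam, lam]
print("ridge :", ridge)   # shrinks toward 0 but is never exactly 0 unless z already is
```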
Summary
| | LASSO | Ridge |
|--|-------|-------|
| Constraint region | $\ell_1$ ball (diamond, has corners) | $\ell_2$ ball (sphere, smooth) |
| Level sets touch at | Corner (a coordinate $= 0$) | Smooth boundary point (all coordinates nonzero) |
| Sparsity | Yes, exact zeros | No, only shrinkage toward 0 |
| Use case | Feature selection | Coefficient shrinkage |
The punchline: LASSO does feature selection not because we asked it to, but because sharp corners make it geometrically inevitable.
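A quick empirical check of the punchline (the synthetic data, penalty strengths, and random seed are illustrative assumptions, not from the source): fit LASSO and Ridge on the same data and count exact zeros.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative setup: 20 features, only 3 truly relevant (assumption).
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)     # l1-penalized least squares
ridge = Ridge(alpha=10.0).fit(X, y)    # l2-penalized least squares

print("LASSO exact zeros:", int(np.sum(lasso.coef_ == 0.0)))  # typically many of the 17 irrelevant coordinates
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0.0)))  # typically 0: shrunken, but never exactly zero
```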