🧮 Brain Teaser

Why Does LASSO Produce Sparse Solutions but Ridge Does Not?


Problem

Consider minimizing a convex differentiable loss $L(\beta)$ (e.g. least squares) subject to a norm constraint on $\beta \in \mathbb{R}^d$:

$$\min_{\beta} L(\beta) \quad \text{subject to} \quad \|\beta\|_1 \leq t \qquad \text{(LASSO)}$$

$$\min_{\beta} L(\beta) \quad \text{subject to} \quad \|\beta\|_2^2 \leq t \qquad \text{(Ridge)}$$

Question: Give a geometric argument for why the LASSO constraint tends to produce solutions with exact zeros (sparse $\hat\beta$), while Ridge does not — even when the unconstrained minimizer $\hat\beta^{\text{OLS}}$ is the same for both.
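Before the geometric argument, the phenomenon itself is easy to reproduce numerically. The sketch below solves the penalized forms of both problems with plain NumPy — LASSO by cyclic coordinate descent (soft-thresholding updates), Ridge by its closed form. The data, seed, and $\lambda$ value are illustrative assumptions, not part of the original problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=0)            # unit-norm columns
beta_true = np.array([3.0, 0, 0, 0, 0])   # only the first feature matters
y = X @ beta_true + 0.05 * rng.standard_normal(n)

lam = 1.0

def soft(z, t):
    """Soft-thresholding operator: argmin_b 0.5*(z - b)**2 + t*|b|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# LASSO: cyclic coordinate descent on 0.5*||y - Xb||^2 + lam*||b||_1
beta_lasso = np.zeros(d)
for _ in range(500):
    for j in range(d):
        # partial residual with feature j's current contribution added back
        r = y - X @ beta_lasso + X[:, j] * beta_lasso[j]
        beta_lasso[j] = soft(X[:, j] @ r, lam)   # unit-norm column => no rescaling

# Ridge: closed form for 0.5*||y - Xb||^2 + lam*||b||_2^2
beta_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(d), X.T @ y)

print(beta_lasso)   # null coordinates land at exactly 0.0
print(beta_ridge)   # every coordinate shrunk, but none exactly zero
```

The LASSO coefficients for the four irrelevant features come out as exact floating-point zeros, while Ridge merely shrinks them toward (but never onto) zero.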


Why It's Interesting

This is one of the most fundamental and beautiful insights in modern ML. Sparsity is not an assumption baked in — it emerges purely from the geometry of the $L^1$ ball. The same loss, the same data, a different shape of constraint: one gives you feature selection for free, the other never does.



Tags: LASSO · Ridge · Sparsity · Regularization · Convex geometry

Answer

Key Idea

The $L^1$ ball has corners on the coordinate axes. The $L^2$ ball is a smooth sphere with no corners. The optimal constrained solution is found where the loss level set first touches the constraint region — and touching at a corner forces a coordinate to be exactly zero.


Geometric Argument

The constrained problem is equivalent (by Lagrange duality) to the penalized form:

$$\min_\beta \, L(\beta) + \lambda \|\beta\|_1 \quad \text{vs.} \quad \min_\beta \, L(\beta) + \lambda \|\beta\|_2^2$$

Think of "inflating" the level sets of $L$ outward from the unconstrained minimizer $\hat\beta^{\text{OLS}}$ until they first touch the constraint set.

LASSO ($L^1$ ball)

In $\mathbb{R}^2$, the $L^1$ ball $\|\beta\|_1 \leq t$ is a diamond (a rotated square) with vertices at $(\pm t, 0)$ and $(0, \pm t)$.

The expanding elliptical level sets of $L$ generically hit one of these corners first — points of the form $(\beta_1, 0)$ or $(0, \beta_2)$.

At a corner, one or more coordinates are exactly zero. That is sparsity.

Ridge ($L^2$ ball)

The $L^2$ ball $\|\beta\|_2^2 \leq t$ is a Euclidean ball, which is strictly convex with a smooth boundary — no corners anywhere.

A level-set ellipse touches the ball's smooth boundary at a single tangency point, generically one where every coordinate is nonzero.

There is no mechanism to "snap" a coordinate to zero.
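The contrast is already visible in one dimension, where both penalized problems have closed-form minimizers (a minimal sketch; the scalar $z$ plays the role of the unconstrained minimizer, and the function names are mine, not from the original):

```python
def lasso_1d(z, lam):
    """argmin_b 0.5*(z - b)**2 + lam*|b|  — soft-thresholding."""
    if abs(z) <= lam:
        return 0.0                      # snapped exactly to the corner
    return z - lam if z > 0 else z + lam

def ridge_1d(z, lam):
    """argmin_b 0.5*(z - b)**2 + lam*b**2  — pure rescaling."""
    return z / (1 + 2 * lam)

print(lasso_1d(0.3, 0.5))   # 0.0  — exact zero once |z| <= lam
print(ridge_1d(0.3, 0.5))   # 0.15 — shrunk, but never zero for z != 0
```

The LASSO prox has a dead zone of width $2\lambda$ that maps a whole interval of inputs to exactly zero; the Ridge prox is a linear rescaling, so it sends zero to zero and nothing else.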


Why Corners Are Decisive

A corner of the $L^1$ ball is a point where the subdifferential of $\|\cdot\|_1$ is large (it contains an entire interval, not just a single gradient direction). This gives the KKT conditions "room" to be satisfied even when the gradient of $L$ is not pointing exactly along a coordinate axis — so a wide range of loss gradients pin the solution to the corner.

Formally, at a corner like $\hat\beta = (t, 0)$, the KKT condition for LASSO along the zero coordinate $\beta_2$ is:

$$-\nabla_{\beta_2} L(\hat\beta) \in \lambda \, \partial |\beta_2| \Big|_{\beta_2 = 0} = [-\lambda, \lambda]$$

This is satisfiable for any $\nabla_{\beta_2} L$ with magnitude $\leq \lambda$ — a non-trivial range. Ridge has no such mechanism because the subdifferential of $\|\beta\|_2^2$ is the singleton $\{2\beta\}$, giving no slack.
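These KKT conditions can be checked numerically. For an orthonormal design ($X^\top X = I$) the LASSO solution is exactly soft-thresholding of $z = X^\top y$, so we can compute it in closed form and verify that zero coordinates have loss gradients inside $[-\lambda, \lambda]$ while active coordinates sit pinned at $\pm\lambda$. The design, response, and $\lambda$ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Orthonormal design: reduced QR of a random matrix gives X with X^T X = I
X, _ = np.linalg.qr(rng.standard_normal((50, 4)))
y = X @ np.array([2.0, 0.3, 0.0, -1.5]) + 0.1 * rng.standard_normal(50)
lam = 0.5

z = X.T @ y
beta_hat = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # closed-form LASSO fit

grad = X.T @ (X @ beta_hat - y)   # gradient of L(beta) = 0.5*||y - X beta||^2
for j in range(4):
    if beta_hat[j] == 0.0:
        # zero coordinate: -grad_j only needs to lie in the interval [-lam, lam]
        assert abs(grad[j]) <= lam + 1e-10
    else:
        # active coordinate: subgradient is pinned, -grad_j = lam * sign(beta_j)
        assert np.isclose(-grad[j], lam * np.sign(beta_hat[j]))
print("KKT conditions hold at the LASSO solution")
```

The interval condition on the zero coordinates is exactly the "slack" the text describes: many different gradients of $L$ are compatible with a coordinate staying at zero.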


Summary

| | LASSO | Ridge |
|--|-------|-------|
| Constraint region | $L^1$ ball (diamond, has corners) | $L^2$ ball (sphere, smooth) |
| Level sets touch at | Corner $\Rightarrow$ coordinate $= 0$ | Smooth boundary $\Rightarrow$ all nonzero |
| Sparsity | Yes, exact zeros | No, only shrinkage toward 0 |
| Use case | Feature selection | Coefficient shrinkage |

The punchline: LASSO does feature selection not because we asked it to, but because sharp corners make it geometrically inevitable.

Type: ML/Stats