This short note compares the effects of $\ell_1$ (LASSO) and $\ell_2$ (Ridge) regularization on a simple linear regression problem. In particular, it illustrates why $\ell_1$ minimization is more efficient at performing feature selection (i.e., at returning coefficients that are exactly zero).
We start by displaying a set of $(\beta_0, \beta_1)$ pairs, each the solution to the problem $$\min_{\boldsymbol \beta} \sum_{i=1}^N \left(t^{(i)} - (\beta_0 + \beta_1 x^{(i)})\right)^2 + \lambda \sum_{j=0}^1 |\beta_j|^2$$ where the targets $t^{(i)}$ are generated as $t^{(i)} = \beta_0+\beta_1x^{(i)} + \varepsilon^{(i)}$ from true pairs $(\beta_0, \beta_1)$ placed on a circle around the origin. For the Ridge penalty this problem has the closed-form solution $\hat{\boldsymbol\beta} = (\tilde{X}^T\tilde{X} + \lambda I)^{-1}\tilde{X}^T \mathbf{t}$, which is what the code below computes. You can play with the values of $\lambda$ and $\varepsilon$ below to see how they affect the distribution of the $(\beta_0, \beta_1)$ solutions.
import numpy as np

x = np.linspace(0, 1, 10)
# design matrix with a column of ones for the intercept
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
ridgeSolution = np.zeros((2, len(anglevec)))
# change the value of lbda to control the distortion of the circle below
lbda = 0.6
for i, angle in enumerate(anglevec):
    beta0 = radius*np.cos(angle)
    beta1 = radius*np.sin(angle)
    # generate the targets t = beta0 + beta1*x ...
    target = beta0 + beta1*x
    # ... and add some noise
    noise = np.random.normal(0, .01, target.shape)
    target_noisy = target + noise
    # closed-form ridge solution (X^T X + lambda*I)^{-1} X^T t
    tmp = np.linalg.inv(
        np.dot(XTildematrix.T, XTildematrix) + lbda*np.identity(2))
    ridgeSolution[:, i] = np.dot(tmp, np.dot(XTildematrix.T, target_noisy))
import matplotlib.pyplot as plt
# Note how the circle gets distorted by the choice of the regularization
# parameter lbda
fig = plt.figure()
ax = fig.add_subplot(111)
plt.scatter(ridgeSolution[0,:], ridgeSolution[1,:], alpha=0.5)
plt.axis("equal")
plt.show()
We do the same with the LASSO penalty. Here we generate the pairs $(\beta_0, \beta_1)$ around the origin as solutions to the problem
$$\min_{\boldsymbol \beta} \sum_{i=1}^N \left(t^{(i)} - (\beta_0 + \beta_1 x^{(i)})\right)^2 + \lambda \sum_{j=0}^1 |\beta_j|$$ where the targets $t^{(i)}$ are generated as $t^{(i)} = \beta_0+\beta_1x^{(i)} + \varepsilon^{(i)}$ from true pairs placed on a circle around the origin. Again, you can play with the values of $\lambda$ and $\varepsilon$ below to see how they affect the distribution of the $(\beta_0, \beta_1)$ solutions.
import numpy as np
from sklearn import linear_model

# same idea as with ridge, but now with the l1 (LASSO) penalty
x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 100)
radius = 1
lassoSolution = np.zeros((2, len(anglevec)))
lbda = 0.06
for i, angle in enumerate(anglevec):
    beta0 = radius*np.cos(angle)
    beta1 = radius*np.sin(angle)
    target = beta0 + beta1*x
    # adding some noise
    noise = np.random.normal(0, .01, target.shape)
    target_noisy = target + noise
    clf = linear_model.Lasso(alpha=lbda)
    clf.fit(x.reshape(-1, 1), target_noisy)
    lassoSolution[:, i] = np.array([clf.intercept_, clf.coef_[0]])
From the distribution of those solutions, you can already see that, unlike in the Ridge case, the pairs $(\beta_0, \beta_1)$ are not spread smoothly around the origin: many of them (as we require the pairs to be solutions to the LASSO objective) have a component set exactly to $0$. (Note that scikit-learn's `Lasso` does not penalize the intercept, so the exact zeros appear in the slope $\beta_1$.)
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
plt.scatter(lassoSolution[0,:], lassoSolution[1,:], alpha=0.5)
plt.axis("equal")
plt.show()
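To quantify this, we can count how many of the fitted slopes come out exactly zero under each penalty. The check below is not part of the original cells; it reuses the same data-generation scheme and the same values $\lambda = 0.6$ (Ridge) and $\alpha = 0.06$ (LASSO), with an arbitrary seed:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
XTilde = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 100)

ridge_slopes, lasso_slopes = [], []
for angle in anglevec:
    target = np.cos(angle) + np.sin(angle)*x + rng.normal(0, .01, len(x))
    # ridge slope, closed form with lambda = 0.6
    b_ridge = np.linalg.solve(XTilde.T @ XTilde + 0.6*np.identity(2),
                              XTilde.T @ target)
    ridge_slopes.append(b_ridge[1])
    # lasso slope, alpha = 0.06 as above
    clf = linear_model.Lasso(alpha=0.06)
    clf.fit(x.reshape(-1, 1), target)
    lasso_slopes.append(clf.coef_[0])

ridge_zeros = np.sum(np.abs(ridge_slopes) < 1e-10)
lasso_zeros = np.sum(np.abs(lasso_slopes) < 1e-10)
print(ridge_zeros, lasso_zeros)
```

Ridge never returns an exact zero, while a sizable fraction of the LASSO slopes are pinned at exactly $0$ (all the angles whose true slope is small enough).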
So far we have seen how the distribution of the pairs $(\beta_0,\beta_1)$ varies from one regularization approach to the other, but we haven't provided much intuition for why there are so many more 'sparse' pairs when switching from Ridge to LASSO. To get some intuition, we can draw the level lines of the data misfit $\sum_{i=1}^N \left(t^{(i)} - (\beta_0 + \beta_1 x^{(i)})\right)^2$ on the one hand, and the level lines of the Ridge (resp. LASSO) penalty on the other. The solutions of the minimization problems
$$\min_{\boldsymbol \beta} \sum_{i=1}^N \left(t^{(i)} - (\beta_0 + \beta_1 x^{(i)})\right)^2 + \lambda \sum_{j=0}^1 |\beta_j|^2$$and
$$\min_{\boldsymbol \beta} \sum_{i=1}^N \left(t^{(i)} - (\beta_0 + \beta_1 x^{(i)})\right)^2 + \lambda \sum_{j=0}^1 |\beta_j|$$lie at the point where a level line of the data misfit first touches (i.e., is tangent to) the corresponding penalty ball.
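For the Ridge problem, the relationship between the two families of level lines at the solution can be verified directly from the first-order condition $2\tilde{X}^T(\tilde{X}\boldsymbol\beta - \mathbf{t}) + 2\lambda\boldsymbol\beta = 0$: at the optimum, the gradient of the misfit is antiparallel to $\boldsymbol\beta$, the outward normal of the $\ell_2$ ball through $\boldsymbol\beta$. A minimal numerical check (the target below is an arbitrary choice for illustration):

```python
import numpy as np

x = np.linspace(0, 1, 10)
XTilde = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
t = 0.8 + 0.3*x          # arbitrary noiseless target, for illustration only
lbda = 0.6

# closed-form ridge solution
beta = np.linalg.solve(XTilde.T @ XTilde + lbda*np.identity(2), XTilde.T @ t)

grad_misfit = 2*XTilde.T @ (XTilde @ beta - t)   # gradient of the data misfit
grad_penalty = 2*beta                            # gradient of the l2 penalty
# at the optimum the two gradients cancel: grad_misfit = -lbda * grad_penalty
print(np.allclose(grad_misfit, -lbda*grad_penalty))
```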
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
# grid of (beta0, beta1) values on which to evaluate misfit and penalty
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
for angle in anglevec:
    beta0 = radius*np.cos(angle)
    beta1 = radius*np.sin(angle)
    target = beta0 + beta1*x   # noiseless targets
    # squared misfit at every grid point (beta0, beta1)
    costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
    plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.2], colors='red', alpha=0.1)
# level set of the l1 (LASSO) penalty
l1ballfunction = np.abs(xx) + np.abs(yy)
plt.contour(xx, yy, l1ballfunction, [0.6], colors='blue', alpha=0.9)
plt.axis("equal")
plt.show()
To understand why the LASSO formulation is better at feature selection, we can focus on a couple of these misfit level lines (which are ellipses, even though they may look like parabolas in the window shown) and look at how their intersection with the $\ell_1$ ball evolves as a function of the angle.
# extending the misfit level lines (ellipses)
# selecting the angles corresponding to zero coefficients
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
i = 0
for angle in anglevec:
    # keep only the angles near the beta0 axis
    if (angle < np.pi/8) or (angle > 2*np.pi - np.pi/8):
        beta0 = radius*np.cos(angle)
        beta1 = radius*np.sin(angle)
        target = beta0 + beta1*x   # noiseless targets
        costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
        plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.4], colors='red', alpha=0.1)
        if i == 4:
            # highlight one particular level line
            plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.4], colors='red')
        i += 1
l1ballfunction = np.abs(xx) + np.abs(yy)
plt.contour(xx, yy, l1ballfunction, [0.6], colors='blue')
plt.axis("equal")
plt.show()
For the first ellipse, you can see that the intersection occurs at the (spiky) tip of the $\ell_1$ ball. The solution returned in this case will thus have one weight set to $0$. Let us now look at the second ellipse.
# extending the misfit level lines (ellipses)
# selecting the angles corresponding to zero coefficients
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
i = 0
for angle in anglevec:
    if (angle < np.pi/8) or (angle > 2*np.pi - np.pi/8):
        beta0 = radius*np.cos(angle)
        beta1 = radius*np.sin(angle)
        target = beta0 + beta1*x   # noiseless targets
        costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
        plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.65], colors='red', alpha=0.1)
        if i == 5:
            # highlight the second level line
            plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.65], colors='red')
        i += 1
l1ballfunction = np.abs(xx) + np.abs(yy)
plt.contour(xx, yy, l1ballfunction, [0.6], colors='blue')
plt.axis("equal")
plt.show()
For this second ellipse, we see once again that, because of the structure of the LASSO penalty, the intersection occurs on the axis. The solution will again have a zero coefficient.
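This corner-hitting behaviour has a simple analytic counterpart in one dimension. For the scalar problems $\tfrac12(z-b)^2 + \tfrac{\lambda}{2}b^2$ (Ridge) and $\tfrac12(z-b)^2 + \lambda|b|$ (LASSO) — a deliberate simplification of the two-parameter setup above, used only for illustration — Ridge shrinks $z$ multiplicatively while LASSO soft-thresholds it, returning an exact zero whenever $|z| \le \lambda$:

```python
import numpy as np

def ridge_1d(z, lam):
    # ridge shrinks multiplicatively: never exactly zero for z != 0
    return z / (1 + lam)

def lasso_1d(z, lam):
    # soft-thresholding: exact zero as soon as |z| <= lam
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-1.0, -0.3, 0.1, 0.5, 2.0])
print(ridge_1d(z, 0.6))   # all entries remain nonzero
print(lasso_1d(z, 0.6))   # entries with |z| <= 0.6 are exactly 0
```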
# extending the misfit level lines (ellipses)
# selecting the angles corresponding to zero coefficients
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
xx, yy = np.meshgrid(np.linspace(-1.7, 1.7, 100), np.linspace(-1.7, 1.7, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
i = 0
for angle in anglevec:
    if (angle < np.pi/8) or (angle > 2*np.pi - np.pi/8):
        beta0 = radius*np.cos(angle)
        beta1 = radius*np.sin(angle)
        target = beta0 + beta1*x   # noiseless targets
        costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
        plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [1.1], colors='red', alpha=0.1)
        if i == 6:
            # highlight the third level line
            plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [1.1], colors='red')
        i += 1
l1ballfunction = np.abs(xx) + np.abs(yy)
plt.contour(xx, yy, l1ballfunction, [0.6], colors='blue')
plt.axis("equal")
plt.show()
For the third ellipse, we observe the same phenomenon: because of the "spiky" structure of the $\ell_1$ ball, the level line hits the tip of the blue ball first.
# extending the misfit level lines (ellipses)
# selecting the angles corresponding to zero coefficients
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
i = 0
for angle in anglevec:
    if (angle < np.pi/8) or (angle > 2*np.pi - np.pi/8):
        beta0 = radius*np.cos(angle)
        beta1 = radius*np.sin(angle)
        target = beta0 + beta1*x   # noiseless targets
        costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
        plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [1.1], colors='red', alpha=0.1)
        if i == 6:
            # highlight one particular level line
            plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [1.1], colors='red', alpha=0.8)
        i += 1
l1ballfunction = np.abs(xx) + np.abs(yy)
plt.contour(xx, yy, l1ballfunction, [0.6], colors='blue')
plt.axis("equal")
plt.show()
For this last example above, we see that the level line again hits the tip of the ball first.
We now apply the same approach to the $\ell_2$ ball and study where an ellipse generated at a particular angle around the origin first touches the $\ell_2$ ball corresponding to the Ridge penalty.
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
for angle in anglevec:
    beta0 = radius*np.cos(angle)
    beta1 = radius*np.sin(angle)
    target = beta0 + beta1*x   # noiseless targets
    costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
    plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.2], colors='red', alpha=0.1)
# level set of the l2 (Ridge) penalty
l2ballfunction = np.sqrt(xx**2 + yy**2)
plt.contour(xx, yy, l2ballfunction, [0.6], colors='blue')
plt.axis("equal")
plt.show()
# extending the misfit level lines (ellipses)
# selecting the angles corresponding to zero coefficients
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
i = 0
for angle in anglevec:
    if (angle < np.pi/8) or (angle > 2*np.pi - np.pi/8):
        beta0 = radius*np.cos(angle)
        beta1 = radius*np.sin(angle)
        target = beta0 + beta1*x   # noiseless targets
        costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
        plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.32], colors='red', alpha=0.1)
        if i == 4:
            plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.32], colors='red')
            tt0 = target   # keep the targets of the highlighted level line
        i += 1
l2ballfunction = np.sqrt(xx**2 + yy**2)
plt.contour(xx, yy, l2ballfunction, [0.6], colors='blue')
plt.axis("equal")
plt.show()

# minimize the misfit over the upper half of the l2 circle of radius 0.6
# (the first coordinate plays the role of the intercept here)
beta1 = np.linspace(-0.6, 0.6, 100)
beta0 = np.sqrt(np.abs(0.6**2 - beta1**2))
betaTest = np.hstack((beta1.reshape(-1, 1), beta0.reshape(-1, 1)))
objective = np.sum((np.dot(XTildematrix, betaTest.T) - tt0.reshape(-1, 1))**2, axis=0)
minBeta1_1 = beta1[np.argmin(objective)]
minBeta0_1 = np.sqrt(np.abs(0.6**2 - minBeta1_1**2))
# extending the misfit level lines (ellipses)
# selecting the angles corresponding to zero coefficients
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
i = 0
for angle in anglevec:
    if (angle < np.pi/8) or (angle > 2*np.pi - np.pi/8):
        beta0 = radius*np.cos(angle)
        beta1 = radius*np.sin(angle)
        target = beta0 + beta1*x   # noiseless targets
        costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
        plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.45], colors='red', alpha=0.1)
        if i == 5:
            plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.45], colors='red')
            tt1 = target   # keep the targets of the highlighted level line
        i += 1
l2ballfunction = np.sqrt(xx**2 + yy**2)
plt.contour(xx, yy, l2ballfunction, [0.6], colors='blue')
plt.axis("equal")
plt.show()

# minimize the misfit over the upper half of the l2 circle of radius 0.6
beta1 = np.linspace(-0.6, 0.6, 100)
beta0 = np.sqrt(np.abs(0.6**2 - beta1**2))
betaTest = np.hstack((beta1.reshape(-1, 1), beta0.reshape(-1, 1)))
objective = np.sum((np.dot(XTildematrix, betaTest.T) - tt1.reshape(-1, 1))**2, axis=0)
minBeta1_2 = beta1[np.argmin(objective)]
minBeta0_2 = np.sqrt(np.abs(0.6**2 - minBeta1_2**2))
# extending the misfit level lines (ellipses)
# selecting the angles corresponding to zero coefficients
# plotting the level lines
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 10)
XTildematrix = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))
anglevec = np.linspace(0, 2*np.pi, 50)
radius = 1
xx, yy = np.meshgrid(np.linspace(-1.7, 1.7, 100), np.linspace(-1.7, 1.7, 100))
betaMat = np.hstack((xx.flatten().reshape(-1, 1), yy.flatten().reshape(-1, 1)))
fig = plt.figure()
ax = fig.add_subplot(111)
i = 0
for angle in anglevec:
    if (angle < np.pi/8) or (angle > 2*np.pi - np.pi/8):
        beta0 = radius*np.cos(angle)
        beta1 = radius*np.sin(angle)
        target = beta0 + beta1*x   # noiseless targets
        costfun = np.sum((np.dot(betaMat, XTildematrix.T) - target)**2, axis=1)
        plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.68], colors='red', alpha=0.1)
        if i == 6:
            plt.contour(xx, yy, np.reshape(costfun, np.shape(xx)), [0.68], colors='red')
            tt2 = target   # keep the targets of the highlighted level line
        i += 1
l2ballfunction = np.sqrt(xx**2 + yy**2)
plt.contour(xx, yy, l2ballfunction, [0.6], colors='blue')
plt.axis("equal")
plt.show()

# minimize the misfit over the upper half of the l2 circle of radius 0.6
beta1 = np.linspace(-0.6, 0.6, 100)
beta0 = np.sqrt(np.abs(0.6**2 - beta1**2))
betaTest = np.hstack((beta1.reshape(-1, 1), beta0.reshape(-1, 1)))
objective = np.sum((np.dot(XTildematrix, betaTest.T) - tt2.reshape(-1, 1))**2, axis=0)
minBeta1_3 = beta1[np.argmin(objective)]
minBeta0_3 = np.sqrt(np.abs(0.6**2 - minBeta1_3**2))
We can now plot each of the intersections on the circle to study which of the pairs $(\beta_0, \beta_1)$ will have a zero weight.
plt.scatter(minBeta1_1, minBeta0_1, marker='x')
plt.scatter(minBeta1_2, minBeta0_2, marker='x')
plt.scatter(minBeta1_3, minBeta0_3, marker='x')
plt.contour(xx, yy, l2ballfunction, [0.6], colors='blue');
plt.axis("equal")
plt.show()
plt.scatter(minBeta1_1, minBeta0_1,marker='x', s= 120,linewidths=2.5)
plt.scatter(minBeta1_2, minBeta0_2,marker='x', s= 120,linewidths=2.5)
plt.scatter(minBeta1_3, minBeta0_3,marker='x', s= 120,linewidths=2.5)
xx, yy = np.meshgrid(np.linspace(-1.7,1.7,100),np.linspace(-1.7,1.7,100))
l2ballfunction = np.sqrt(np.abs(xx)**2 + np.abs(yy)**2)
plt.contour(xx, yy, l2ballfunction, [0.6], colors='blue');
plt.xlim((0.4,0.7))
plt.ylim((0,0.3))
plt.show()
In this case, you see that because of the smooth structure of the $\ell_2$ ball (versus the $\ell_1$ ball, which is much spikier near the axes), a small change in the angle generating the ellipse leads to a small change in the point where the ellipse touches the circle, and hence to a small displacement of the solution along the circle. Even though the first ellipse touches the circle essentially on the axis, the next one already has a distinct, off-axis contact point: with Ridge, the solutions slide continuously around the circle instead of sticking at exactly zero.
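To make this contrast concrete, we can compare how the two fitted slopes respond to a small change in the generating angle, reusing the parameter values from above ($\lambda = 0.6$ for Ridge, $\alpha = 0.06$ for LASSO); the two angles below are arbitrary choices near the $\beta_0$ axis:

```python
import numpy as np
from sklearn import linear_model

x = np.linspace(0, 1, 10)
XTilde = np.hstack((np.ones((len(x), 1)), x.reshape(-1, 1)))

def slopes(angle):
    target = np.cos(angle) + np.sin(angle)*x   # noiseless targets
    # ridge slope, closed form with lambda = 0.6
    b_ridge = np.linalg.solve(XTilde.T @ XTilde + 0.6*np.identity(2),
                              XTilde.T @ target)
    # lasso slope, alpha = 0.06
    clf = linear_model.Lasso(alpha=0.06)
    clf.fit(x.reshape(-1, 1), target)
    return b_ridge[1], clf.coef_[0]

# two nearby angles close to the beta0 axis
r1, s1 = slopes(0.05)
r2, s2 = slopes(0.10)
# ridge: the slope moves a little with the angle;
# lasso: the slope stays pinned at exactly 0 for both angles
print(r1, r2, s1, s2)
```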