import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
Today we'll use the "scikit-learn" (aka `sklearn`) implementation of these models. scikit-learn is a machine learning-focused library that has a ton of helper functions for data scientists to quickly and efficiently try a lot of different models.

Since this notebook is mainly to show you how to implement these models in Python, you should watch this Statquest playlist to get the intuition behind these models:
## `sklearn` workflow

The most basic `sklearn` model workflow goes something like this:
| Step | Action | Example |
|---|---|---|
| 0 | Import the model from `sklearn` | `from sklearn.linear_model import LinearRegression` |
| 1 | Instantiate the model class | `model = LinearRegression()` |
| 2 | Fit the model | `model.fit(X, y)` |
| 3 | Use the fitted model to make a prediction | `model.predict(X)` |
This standard interface is what makes `sklearn` so powerful: with `.fit()` and `.predict()` and a huge variety of models in the library, you can use this common language to try all sorts of machine learning models.

And that's just the start. From this basic 3-step pipeline you can add all kinds of additional steps to the workflow that make experimentation very convenient for the user; a quick sketch of one such extension follows.
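For example, here's a minimal sketch, assuming a feature table `X` and target `y` like the ones we build later in this notebook, of chaining a preprocessing step onto the same interface with a `Pipeline`:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Bundle a scaling step and a model together; the combined object
# still exposes the familiar .fit() / .predict() interface
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
preds = pipe.predict(X)
```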
## Linear regression in `sklearn`

Before we cover Lasso, Ridge, and ElasticNet in `sklearn`, it might be helpful to demonstrate some basic functionality by fitting an ordinary linear regression.

Reference: Linear Regression
# Import the model
from sklearn.linear_model import LinearRegression
Next we'll load in our data. We'll be using `seaborn`, a visualization library that has some 'toy' datasets to play with, and we'll also use it briefly for some specialized plots.

The `mpg` dataset is a famous dataset with car engine attributes, like engine displacement (engine size), that can be used to estimate the car's fuel efficiency, measured in miles per gallon (`mpg`).

We can load it in using the `sns.load_dataset()` function:
# load mpg example dataset
mpg = sns.load_dataset("mpg")
# we'll filter on just a few columns and drop na/empty values
mpg = mpg[['mpg', 'displacement', 'horsepower', 'weight']].dropna()
X = mpg[['displacement', 'horsepower', 'weight']]
y = mpg['mpg']
One helpful plot here is the `sns.pairplot()` function, which plots scatterplots of each pair of variables against each other and also displays the distribution of each variable as a histogram:
# Convenience method to plot pairs of variables together all in one go!
sns.pairplot(mpg)
Next, we'll follow the `sklearn` workflow.

Let's model the relationship between average MPG and engine displacement and get a quick $R^2$ value:
# Filter on just the 'displacement' variable in the X dataframe
x = X[["displacement"]]
model = LinearRegression()
model.fit(x, y)
# model.score() returns the model R^2 in this case
R2 = model.score(x, y)
print("Model R^2: ", R2)
Model R^2: 0.6482294003193044
# Show the fitted values
yhat = model.predict(x)
plt.scatter(x, y, label="Engine Displacement")
plt.plot(x, model.predict(x), label="Fitted model", c="r")
plt.title("MPG ~ Displacement")
plt.legend()
plt.show()
With Linear Regression and similar models, we can extract the fitted coefficients and the intercept with two attributes:

- `model.coef_` will return the coefficients
- `model.intercept_` will return the intercept

print("model coefficient(s):", model.coef_)
print("model intercept:", model.intercept_)

model coefficient(s): [-0.06005143]
model intercept: 35.12063593840391
Notice that this has fewer "batteries included" than statsmodels, but it gets the job done. If you want other summary statistics like p-values, you will need to compute them manually when using `sklearn`.
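If you do want those statistics, one option (a minimal sketch, not something the rest of this notebook relies on) is to refit the same data with statsmodels, which reports standard errors and p-values out of the box:

```python
# Assumes statsmodels is installed; refit the same one-variable model
# purely to get a summary table that includes p-values
import statsmodels.api as sm

X_sm = sm.add_constant(x)    # add an intercept column to the displacement data
ols = sm.OLS(y, X_sm).fit()  # ordinary least squares fit
print(ols.summary())         # coefficients, standard errors, p-values, etc.
```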
But all that aside, in math terms, we've just fit the model:
$$ \hat{Y} = -0.06x + 35.12 $$

where $x$ is the engine displacement and $\hat{Y}$ is the estimated average value of mpg. In other words, for every unit increase in engine displacement, average mpg is expected to decrease by about 0.06.
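As a quick sanity check (the 200 cubic-inch displacement below is an arbitrary value chosen just for illustration), the fitted model's prediction should match plugging that value into the equation above:

```python
# Hypothetical example value: a 200 cubic-inch engine
new_x = pd.DataFrame({"displacement": [200]})
print(model.predict(new_x))           # the model's prediction
print(-0.06005143 * 200 + 35.12064)   # same calculation by hand, roughly 23.11
```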
from sklearn.linear_model import Lasso, Ridge, ElasticNet
# For this exercise we'll split the data into training and test sets
from sklearn.model_selection import train_test_split
# And we'll need to scale the data
from sklearn.preprocessing import StandardScaler
## Lasso, Ridge, and ElasticNet in `sklearn`

Reference:

With regularization models you must scale your data!

Since we're dealing with penalties, we need to standardize the scale of the data so the regularization penalty doesn't over-penalize features that are orders of magnitude larger than others.

The `StandardScaler()` class in `sklearn` helps us with this, and using it is simple:
- Instantiate the `StandardScaler()` class
- `fit` the scaler on the data with the `.fit()` method
- `transform` the data with `.transform()`
- Or combine the `.fit` and `.transform` steps with a single `.fit_transform()` call

This particular scaler simply takes the X data and standardizes it into Z-scores by subtracting the mean and dividing by the standard deviation, i.e.:

$$ \text{scaled horsepower} = \frac{\text{horsepower} - \text{average horsepower}}{\text{horsepower standard deviation}} $$

# Here's the data again for clarity
X = mpg[['displacement', 'horsepower', 'weight']]
y = mpg['mpg']
# Instantiate the Scaler class
scaler = StandardScaler()
# Fit and then transform the X-data
scaled_X = scaler.fit_transform(X)
# The transformed data is now an array, we want to convert it back
# to a dataframe for easier parsing (optional)
scaled_X = pd.DataFrame(scaled_X, columns=['displacement', 'horsepower', 'weight'])
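If you want to convince yourself that the formula above is what `fit_transform` actually did, here's a small check (just an illustration); note that `StandardScaler` uses the population standard deviation, i.e. `ddof=0`:

```python
# Manual Z-score of one column, using ddof=0 to match StandardScaler
manual = (X["horsepower"] - X["horsepower"].mean()) / X["horsepower"].std(ddof=0)
print(np.allclose(manual.values, scaled_X["horsepower"].values))  # expect True
```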
Note: This transformation is a little messy because I'm scaling only the X values (i.e. the engine displacement, horsepower, and weight) and leaving the y unscaled to help a little with the interpretation piece. This code is normally much simpler when you scale the y values as well.
# We'll take our scaled X values and combine it with the unscaled y
df = scaled_X.copy()
df['mpg'] = y
df = df.dropna()
## `train_test_split()`

The next step we will perform is to divide the data into two parts: a training set and a test set.
# split the new dataframe into train/test
mpg_train, mpg_test = train_test_split(df, test_size=0.20, random_state=42)
X_train = mpg_train[['displacement', 'horsepower', 'weight']]
X_test = mpg_test[['displacement', 'horsepower', 'weight']]
y_train = mpg_train['mpg']
y_test = mpg_test['mpg']
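As an aside, a more common pattern (sketched here, but not used in the rest of this notebook) is to pass the features and target to `train_test_split` directly and get four pieces back:

```python
# Equivalent split done directly on the feature columns and the target
X_tr, X_te, y_tr, y_te = train_test_split(
    df[['displacement', 'horsepower', 'weight']], df['mpg'],
    test_size=0.20, random_state=42)
```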
Statquest video: Regularization Part 2: Lasso (L1) Regression

The first model we'll demonstrate is a single Lasso fit with the value of the `alpha` parameter set manually.

Lasso penalizes the less important features by reducing their coefficients (possibly to zero), unlike Ridge Regression, which merely shrinks the coefficients of the less important features but cannot set them to zero entirely.
model = Lasso(alpha=0.01, random_state=42)
model.fit(X_train, y_train)
R_2 = model.score(X_test, y_test)
# print the score rounded to two digits
print("R^2 Value: ", np.round(R_2,2))
R^2 Value: 0.27
So our simple Lasso model, fit on the training data, resulted in an $R^2$ of 0.27 on the test data. That is not so good! But the important piece is that we should look at the model coefficients:
print("intercept: ", model.intercept_)
pd.DataFrame({"feature": model.feature_names_in_,
"coefficients":model.coef_})
intercept: 23.348832405776655
|   | feature | coefficients |
|---|---|---|
| 0 | displacement | -2.317209 |
| 1 | horsepower | -1.852412 |
| 2 | weight | -0.818235 |
So, in math terms, here's what the above model is saying:
$$ \text{expected mpg} = -2.32(\text{displacement}) - 1.85(\text{horsepower}) - 0.82(\text{weight}) + 23.35 $$

So for every unit increase in `displacement` we expect `mpg` to decrease by 2.32 miles per gallon. That's a lot! But note here that we scaled the displacement, horsepower, and weight numbers, so a "unit" is one standard deviation of each feature. If we want to convert these coefficients back to the original scale, we'll need to do a little math; a quick sketch of that conversion is shown below.
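Here's one way that back-conversion could look; since `StandardScaler` stores the column means and standard deviations in `scaler.mean_` and `scaler.scale_`, the scaled coefficients can be mapped back to the original units (a sketch for illustration only, nothing later depends on it):

```python
# If z = (x - mean) / std, then beta_original = beta_scaled / std,
# and the intercept absorbs the means
beta_orig = model.coef_ / scaler.scale_
intercept_orig = model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)
print(pd.DataFrame({"feature": X.columns, "original_scale_coef": beta_orig}))
print("original-scale intercept:", intercept_orig)
```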
But this is a post on feature importance, so how would you do this with Lasso, Ridge, or ElasticNet?
So we notice that with the `alpha` parameter at a low value of 0.01, Lasso didn't send any of these variables to zero. If we were to increase the `alpha` parameter, we would see that some of these coefficients would go to zero.

So one way of interpreting feature importance with Lasso or Ridge is to try different values of `alpha` and see what happens to the coefficients.
The code below shows an example:
coefs = []
alphas = np.arange(0.01,10, 0.01)
for a in alphas:
model = Lasso(alpha=a, random_state=42)
model.fit(X_train, y_train)
coefs.append(model.coef_)
Then we'll plot the coefficient values against `alpha`.
ax = plt.gca()
ax.plot(alphas, coefs, label=['displacement', 'horsepower', 'weight'])
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel("log(alpha)")
plt.ylabel("coefficients")
plt.title("Lasso coefficients as a function of regularization strength")
plt.axis("tight")
plt.legend()
plt.show()
So you might notice a few funky things about this plot. I reversed the x-axis so that alpha, and therefore the regularization strength, decreases from left to right. And the coefficients are negative, so they shrink upward toward zero at higher values of alpha. A little confusing, but it's a common convention for these plots.
But notice how there is an order in which the lines go to zero. In the above example, `weight` is the first coefficient to go to zero as alpha increases, followed by `horsepower`, and then `displacement`. This gives you a relative idea of the feature importance, which might be helpful in your understanding of the problem.
For another example with Ridge Regression, check this out.

Note: Another term for this kind of plot is a 'path' plot, and `sklearn` has some helper functions, see example.
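As a rough illustration of those helpers (using the same alpha grid we built above), `sklearn.linear_model.lasso_path` computes the whole coefficient path in one call instead of looping manually:

```python
from sklearn.linear_model import lasso_path

# One call returns the alphas used and the coefficients along the path
path_alphas, path_coefs, _ = lasso_path(X_train, y_train, alphas=alphas)
print(path_coefs.shape)  # (n_features, n_alphas): one row of coefficients per feature
```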
Now, another twist on the problem: if your goal isn't learning feature importance but prediction, and you want to let the model do the shrinkage for you, then you can try many different values of `alpha` and find the value that gives you the least error on the test set. In the example below we'll use root mean squared error (RMSE) as the error metric we're optimizing for:
# RMSE of the most recently fitted model from the loop above (the largest alpha)
yhat = model.predict(X_test)
RMSE = (np.mean((yhat - y_test)**2))**0.5
print(RMSE)
7.019453139121642
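As an aside, `sklearn.metrics` can compute the same quantity; a roughly equivalent version (assuming you'd rather not hand-roll the formula) is:

```python
from sklearn.metrics import mean_squared_error

# Same RMSE, computed via sklearn's metrics module
print(np.sqrt(mean_squared_error(y_test, yhat)))
```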
Then we iterate over several values of alpha:
errors = []
alphas = np.arange(0.01,10, 0.01)
for a in alphas:
model = Lasso(alpha=a, random_state=42)
model.fit(X_train, y_train)
yhat = model.predict(X_test)
RMSE = (np.mean((yhat - y_test)**2))**0.5
errors.append(RMSE)
ax = plt.gca()
plt.scatter(alphas, errors)
ax.set_xscale("log")
plt.xlabel("log(alpha)")
plt.ylabel("RMSE")
plt.title("Lasso Test RMSE at various values of alpha")
plt.show()
best_alpha = alphas[np.argmin(errors)]
print("best_alpha = ", best_alpha)
best_alpha = 1.07
We can then plug that best alpha back into our model:
model = Lasso(alpha=best_alpha, random_state=42)
model.fit(X_train, y_train)
R_2 = model.score(X_test, y_test)
# print the score rounded to two digits
print("Test R^2 Value: ", np.round(R_2,2))
Test R^2 Value: 0.3
So, we get a minor improvement!
pd.DataFrame({"feature": model.feature_names_in_,
"coefficients":model.coef_})
|   | feature | coefficients |
|---|---|---|
| 0 | displacement | -2.084907 |
| 1 | horsepower | -1.353816 |
| 2 | weight | -0.377263 |
And notice here that since we were optimizing for RMSE on the test set, we found a value of `alpha` that shrunk the coefficients but didn't set any of them to zero. If we had increased `alpha` further, we would eventually see `weight` fall off.
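A quick check of that claim (the `alpha=3` below is an arbitrary value chosen only because it is well above `best_alpha`):

```python
# Refit with a deliberately larger alpha and inspect which coefficients
# have been driven to zero
check = Lasso(alpha=3, random_state=42).fit(X_train, y_train)
print(dict(zip(check.feature_names_in_, check.coef_)))
```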
Since we are dealing with a multivariable regression, we can't make a 4-D plot to show the 4-D hyperplane that this model fit. But we can get some intuition by plotting a single predictor variable (e.g. `horsepower`) against `mpg`:
y_fitted = model.predict(X_train)
plt.scatter(X_train[["horsepower"]], y_train, label="Training Data")
plt.scatter(X_train[["horsepower"]], y_fitted, label="Fitted Values", c="r")
plt.title("Lasso ")
plt.xlabel("scaled horsepower")
plt.ylabel("mpg")
plt.legend()
plt.show()
But, can we do better?

## What's the best `alpha`?

In practice you could iteratively try many different values of alpha until you arrived at the best answer. You could do this manually (as we just did above), but `sklearn` has helper functions that already do this for you.
Here we'll use 5-fold cross validation to find the ideal value of alpha from the training set, and then use that fitted model on the test set.
`sklearn` has `LassoCV`, `RidgeCV`, and `ElasticNetCV` implementations for our convenience. Here we'll repeat the analysis from above as a demo with these three.
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
model = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
print(model.score(X_test,y_test))
y_fitted = model.predict(X_train)
plt.scatter(X_train[["horsepower"]], y_train, label="X")
plt.scatter(X_train[["horsepower"]], y_fitted, label="Fitted", c="r")
plt.legend()
plt.show()
0.26601042348436355
# The alpha chosen by 5-fold cross-validation
model.alpha_
0.004619906312005148
pd.DataFrame({"feature": model.feature_names_in_,
"coefficients":model.coef_})
|   | feature | coefficients |
|---|---|---|
| 0 | displacement | -2.318341 |
| 1 | horsepower | -1.854957 |
| 2 | weight | -0.820504 |
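To round out the comparison, here's a sketch of the same pattern with `RidgeCV` and `ElasticNetCV` (the alpha grid passed to `RidgeCV` is an arbitrary choice for illustration):

```python
# RidgeCV searches over an explicit alpha grid; ElasticNetCV, like LassoCV,
# builds its own grid and additionally exposes an l1_ratio parameter to tune
ridge = RidgeCV(alphas=np.logspace(-3, 2, 50), cv=5).fit(X_train, y_train)
enet = ElasticNetCV(cv=5, random_state=0).fit(X_train, y_train)

print("Ridge test R^2:     ", ridge.score(X_test, y_test))
print("ElasticNet test R^2:", enet.score(X_test, y_test))
print("ElasticNet alpha_:  ", enet.alpha_)
```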