Linear regression is one of the simplest supervised learning algorithms in our toolkit. If you have ever taken an introductory statistics course in college, the final topic you covered was likely linear regression. In fact, it is so simple that it is sometimes not considered machine learning at all!
Whatever you believe, the fact is that linear regression, along with its extensions, continues to be a common and useful method of making predictions when the target vector is a quantitative value (e.g., home price, age).
You want to train a model that represents a linear relationship between the feature and target vector.
Use a linear regression (LinearRegression in scikit-learn):
# Load libraries
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Load the Boston housing data with only two features
boston = load_boston()
features = boston.data[:, 0:2]
target = boston.target

# Create linear regression
regression = LinearRegression()

# Fit the linear regression
model = regression.fit(features, target)
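After training, the fitted model exposes the usual scikit-learn attributes; here is a brief sketch of inspecting it and making a prediction (the indexing of the first observation is just for illustration):

# View the intercept (bias term)
model.intercept_

# View the feature coefficients
model.coef_

# Predict the target value of the first observation
model.predict(features)[0]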
You want to reduce the variance of your linear regression model.
Use a learning algorithm that includes a shrinkage penalty (also called regularization), such as ridge regression:

# Load libraries
from sklearn.linear_model import Ridge
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

# Load data
boston = load_boston()
features = boston.data
target = boston.target

# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Create ridge regression with an alpha value
regression = Ridge(alpha=0.5)

# Fit the ridge regression
model = regression.fit(features_standardized, target)
In standard linear regression the model trains to minimize the sum of squared error between the true ($y_i$) and predicted ($\hat y_i$) target values, also called the residual sum of squares (RSS): $$ RSS = \sum_{i=1}^n{(y_i - \hat y_i)^2} $$
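To make the objective concrete, here is a minimal sketch computing the RSS of the model fit above by hand; it reuses the features_standardized and target arrays from the solution, and the predictions and rss variable names are ours:

import numpy as np

# Residual sum of squares for the fitted model above
predictions = model.predict(features_standardized)
rss = np.sum((target - predictions) ** 2)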
Regularized regression learners are similar, except they attempt to minimize RSS and some penalty for the total size of the coefficient values, called a shrinkage penalty because it attempts to "shrink" the model. There are two common types of regularized learners for linear regression: ridge regression and the lasso. The only formal difference is the type of shrinkage penalty used. In ridge regression, the shrinkage penalty is a tuning hyperparameter multiplied by the squared sum of all coefficients: $$ RSS+\alpha \sum_{j=1}^p{\hat \beta_j^2} $$
where $\hat \beta_j$ is the coefficient of the jth of p features and $\alpha$ is a hyperparameter (discussed next). The lasso is similar, except the shrinkage penalty is a tuning hyperparameter multiplied by the sum of the absolute values of all coefficients: $$ \frac{1}{2n} RSS + \alpha \sum_{j=1}^p{|\hat \beta_j|} $$
where n is the number of observations. So which one should we use? As a very general rule of thumb, ridge regression often produces slightly better predictions than lasso, but lasso (for reasons we will discuss in Recipe 13.5) produces more interpretable models. If we want a balance between ridge's and lasso's penalty functions, we can use elastic net, which is simply a regression model with both penalties included. Regardless of which one we use, both ridge and lasso regressions can penalize large or complex models by including coefficient values in the loss function we are trying to minimize.
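Elastic net is available in scikit-learn as ElasticNet; a minimal sketch follows (the alpha and l1_ratio values below are arbitrary choices for illustration, and the variable names are ours):

# Load library
from sklearn.linear_model import ElasticNet

# Elastic net combines the lasso (L1) and ridge (L2) penalties;
# l1_ratio controls the mix (1.0 = pure lasso, 0.0 = pure ridge)
elastic_net = ElasticNet(alpha=0.5, l1_ratio=0.5)
model_enet = elastic_net.fit(features_standardized, target)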
The hyperparameter $\alpha$ lets us control how much we penalize the coefficients, with higher values of $\alpha$ creating simpler models. The ideal value of $\alpha$ should be tuned like any other hyperparameter. In scikit-learn, $\alpha$ is set using the alpha parameter.
scikit-learn includes a RidgeCV method that allows us to select the ideal value for $\alpha$:
# Load library
from sklearn.linear_model import RidgeCV

# Create ridge regression with three candidate alpha values
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])

# Fit the ridge regression
model_cv = regr_cv.fit(features_standardized, target)

# View coefficients
model_cv.coef_
array([-0.91215884, 1.0658758 , 0.11942614, 0.68558782, -2.03231631, 2.67922108, 0.01477326, -3.0777265 , 2.58814315, -2.00973173, -2.05390717, 0.85614763, -3.73565106])
# View alpha
model_cv.alpha_
1.0
One final note: because in linear regression the value of each coefficient is partially determined by the scale of its feature, and in regularized models all coefficients are summed together, we must make sure to standardize the features prior to training.
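One convenient way to guarantee standardization happens before every fit is to chain the scaler and the learner in a scikit-learn Pipeline; a sketch (the step names and the model_pipeline variable below are arbitrary choices of ours):

# Load library
from sklearn.pipeline import Pipeline

# Standardize, then fit the ridge regression, in one step
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=0.5)),
])
model_pipeline = pipeline.fit(features, target)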
You want to simplify your linear regression model by reducing the number of features.
Use a lasso regression:
# Load libraries
from sklearn.linear_model import Lasso
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

# Load data
boston = load_boston()
features = boston.data
target = boston.target

# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Create lasso regression with an alpha value
regression = Lasso(alpha=0.5)

# Fit the lasso regression
model = regression.fit(features_standardized, target)
One interesting characteristic of lasso regression's penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model. For example, in our solution we set alpha to 0.5, and we can see that many of the coefficients are 0, meaning their corresponding features are not used in the model:
model.coef_
array([-0.10697735, 0. , -0. , 0.39739898, -0. , 2.97332316, -0. , -0.16937793, -0. , -0. , -1.59957374, 0.54571511, -3.66888402])
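If we want to see exactly which features survive, we can count and name the nonzero coefficients; a short sketch, assuming feature_names is the array of feature names returned by the dataset loader:

import numpy as np

# Number of features with nonzero coefficients
np.sum(model.coef_ != 0)

# Names of the features the model actually uses
boston.feature_names[model.coef_ != 0]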
However, if we increase $\alpha$ to a much higher value, we see that literally none of the features are being used:
# Create lasso regression with a high alpha
regression_a10 = Lasso(alpha=10)

# Fit the lasso regression
model_a10 = regression_a10.fit(features_standardized, target)

# View coefficients
model_a10.coef_
array([-0., 0., -0., 0., -0., 0., -0., 0., -0., -0., -0., 0., -0.])
The practical benefit of this effect is that it means we could include 100 features in our feature matrix and then, by adjusting lasso's $\alpha$ hyperparameter, produce a model that uses only 10 (for instance) of the most important features. This lets us reduce variance while improving the interpretability of our model (since a model with fewer features is easier to explain).
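Just as RidgeCV selects $\alpha$ for ridge regression, scikit-learn's LassoCV can pick it by cross-validation; a sketch with arbitrary candidate values (the lasso_cv and model_lasso_cv names are ours):

# Load library
from sklearn.linear_model import LassoCV

# Cross-validated search over candidate alpha values
lasso_cv = LassoCV(alphas=[0.01, 0.1, 0.5, 1.0, 10.0])
model_lasso_cv = lasso_cv.fit(features_standardized, target)

# View the selected alpha and the resulting coefficients
model_lasso_cv.alpha_
model_lasso_cv.coef_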