This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.
We will generate one-dimensional data with a simple model (including some noise), and we will try to fit a function to this data. With this function, we can predict values on new data points. This is a curve-fitting regression problem.
import numpy as np
import scipy.stats as st
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
%matplotlib inline
f = lambda x: np.exp(3 * x)
x_tr = np.linspace(0., 2, 200)
y_tr = f(x_tr)
x = np.array([0, .1, .2, .5, .8, .9, 1])
y = f(x) + np.random.randn(len(x))
plt.figure(figsize=(6,3));
plt.plot(x_tr[:100], y_tr[:100], '--k');
plt.plot(x, y, 'ok', ms=10);
First, we create a linear regression model (an instance of scikit-learn's LinearRegression class). Then we fit the model to our data. Finally, we predict values from our trained model.
# We create the model.
lr = lm.LinearRegression()
# We train the model on our training dataset.
lr.fit(x[:, np.newaxis], y);
# Now, we predict points with our trained model.
y_lr = lr.predict(x_tr[:, np.newaxis])
We need to convert x and x_tr to column vectors, following the general scikit-learn convention that observations are rows while features are columns. Here, we have 7 observations with 1 feature.
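As a quick illustration of this convention, a 1D array can be promoted to a column vector in two equivalent ways (np.newaxis indexing, as in the code above, or reshape):

```python
import numpy as np

x = np.array([0, .1, .2, .5, .8, .9, 1])

# scikit-learn expects a 2D array of shape (n_samples, n_features).
X = x[:, np.newaxis]      # shape (7, 1): 7 observations, 1 feature
X_alt = x.reshape(-1, 1)  # equivalent reshape-based form

print(x.shape, X.shape)   # (7,) versus (7, 1)
```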
plt.figure(figsize=(6,3));
plt.plot(x_tr, y_tr, '--k');
plt.plot(x_tr, y_lr, 'g');
plt.plot(x, y, 'ok', ms=10);
plt.xlim(0, 1);
plt.ylim(y.min()-1, y.max()+1);
plt.title("Linear regression");
We now fit a polynomial to the data, by applying linear regression to the polynomial features generated with the np.vander function. We will explain this trick in more detail in How it works....
lrp = lm.LinearRegression()
plt.figure(figsize=(6,3));
plt.plot(x_tr, y_tr, '--k');
for deg, s in zip([2, 5], ['-', '.']):
lrp.fit(np.vander(x, deg + 1), y);
y_lrp = lrp.predict(np.vander(x_tr, deg + 1))
plt.plot(x_tr, y_lrp, s, label='degree ' + str(deg));
plt.legend(loc=2);
plt.xlim(0, 1.4);
plt.ylim(-10, 40);
# Print the model's coefficients.
print(' '.join(['%.2f' % c for c in lrp.coef_]))
plt.plot(x, y, 'ok', ms=10);
plt.title("Linear regression");
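To make the np.vander trick concrete, here is a minimal sketch (on a tiny array unrelated to the data above) of the matrix it produces: each row contains the decreasing powers of one sample, so fitting a degree-d polynomial reduces to a linear regression on d+1 columns.

```python
import numpy as np

# Vandermonde matrix of [2, 3] with 3 columns: x**2, x**1, x**0.
V = np.vander(np.array([2., 3.]), 3)
print(V)
# [[4. 2. 1.]
#  [9. 3. 1.]]
```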
We have fitted two polynomial models, of degrees 2 and 5. The degree 2 polynomial fits the data points less precisely than the degree 5 polynomial. However, it seems more robust: the degree 5 polynomial is really bad at predicting values outside the data points (look, for example, at the portion $x \geq 1$). This is what we call overfitting: by using too complex a model, we obtain a better fit on the training dataset, but a less robust model outside this set.
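One way to see the overfitting numerically is to compare the training error of both polynomial fits with their error on held-out points. This is a sketch only: the test points and the random seed below are hypothetical additions, drawn from the same noisy model as the training data.

```python
import numpy as np
import sklearn.linear_model as lm

np.random.seed(0)
f = lambda x: np.exp(3 * x)
x = np.array([0, .1, .2, .5, .8, .9, 1])
y = f(x) + np.random.randn(len(x))

# Hypothetical held-out points from the same noisy model.
x_test = np.array([.05, .3, .6, .95])
y_test = f(x_test) + np.random.randn(len(x_test))

for deg in (2, 5):
    lrp = lm.LinearRegression()
    lrp.fit(np.vander(x, deg + 1), y)
    # Mean squared error on the training and test points.
    train_err = np.mean((lrp.predict(np.vander(x, deg + 1)) - y) ** 2)
    test_err = np.mean((lrp.predict(np.vander(x_test, deg + 1)) - y_test) ** 2)
    print("degree %d: train MSE %.3f, test MSE %.3f" % (deg, train_err, test_err))
```

The degree 5 model always achieves the lower training error (its feature space contains the degree 2 one), which is precisely why training error alone cannot detect overfitting.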
The ridge regression model has a meta-parameter which represents the weight of the regularization term. We could try different values by trial and error, using the Ridge class. However, scikit-learn includes another model called RidgeCV, which includes a parameter search with cross-validation. In practice, this means that you don't have to tweak this parameter by hand: scikit-learn does it for you. Since the models of scikit-learn always follow the fit-predict API, all we have to do is replace lm.LinearRegression by lm.RidgeCV in the code above. We will give more details in the next section.
ridge = lm.RidgeCV()
plt.figure(figsize=(6,3));
plt.plot(x_tr, y_tr, '--k');
for deg, s in zip([2, 5], ['-', '.']):
ridge.fit(np.vander(x, deg + 1), y);
y_ridge = ridge.predict(np.vander(x_tr, deg + 1))
plt.plot(x_tr, y_ridge, s, label='degree ' + str(deg));
plt.legend(loc=2);
plt.xlim(0, 1.5);
plt.ylim(-5, 80);
# Print the model's coefficients.
print(' '.join(['%.2f' % c for c in ridge.coef_]))
plt.plot(x, y, 'ok', ms=10);
plt.title("Ridge regression");
This time, the degree 5 polynomial seems better than the simpler degree 2 polynomial (which now causes underfitting). The ridge regression reduces the overfitting issue here.
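The shrinkage effect of the L2 penalty can also be seen directly on the coefficients: for the degree 5 design matrix, the ridge coefficients should be much smaller in magnitude than the plain least-squares ones. A minimal sketch, using the same toy model as above (the random seed is a hypothetical addition for reproducibility):

```python
import numpy as np
import sklearn.linear_model as lm

np.random.seed(0)
f = lambda x: np.exp(3 * x)
x = np.array([0, .1, .2, .5, .8, .9, 1])
y = f(x) + np.random.randn(len(x))

V = np.vander(x, 6)  # degree-5 design matrix

lr = lm.LinearRegression().fit(V, y)
ridge = lm.RidgeCV().fit(V, y)

# The L2 penalty pulls the coefficients toward zero.
print("max |coef|, OLS:  %.1f" % np.abs(lr.coef_).max())
print("max |coef|, ridge: %.1f" % np.abs(ridge.coef_).max())
```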
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).