from sklearn import datasets

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
The dataset consists of data and targets. The target is the desired output for a specific example from the data:
X = diabetes.data
y = diabetes.target
print(X.shape)
print(y.shape)
(442, 10) (442,)
We want to split the data into a train set and a test set. We fit the linear model on the train set, and we show that it performs well on the test set.
Before splitting the data, we shuffle (mix) the examples, because for some datasets the examples are ordered. If we didn't shuffle, the train set and the test set could be very different, so a linear model fitted on the train set wouldn't be valid for the test set. Now we shuffle:
from sklearn.utils import shuffle

X, y = shuffle(X, y, random_state=1)
print(X.shape)
print(y.shape)
(442, 10) (442,)
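Note that `shuffle` reorders `X` and `y` with the same permutation, so each example stays paired with its own target. A minimal illustration on toy arrays (the `_toy` names are just for this example):

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data: each target is 10 times its example, so pairing is easy to check
X_toy = np.arange(5).reshape(5, 1)
y_toy = 10 * np.arange(5)

X_s, y_s = shuffle(X_toy, y_toy, random_state=1)

# The order changed, but every example still maps to its own target
assert (y_s == 10 * X_s.ravel()).all()
```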
Each example has 10 features (columns) in total. We want to work with 1-dimensional data because it is simple to visualize, so we select only one column, e.g. column 2, and fit the linear model on it:
# Use only one column from the data
print(X.shape)
X = X[:, 2:3]
print(X.shape)
(442, 10) (442, 1)
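The slice `2:3` (rather than plain `2`) is deliberate: it keeps the column axis, giving the 2-D shape `(n_samples, 1)` that scikit-learn estimators expect as input. A quick sketch of the difference:

```python
from sklearn import datasets

X_full = datasets.load_diabetes().data

# Slicing with 2:3 keeps the column axis -> 2-D array, ready for fit()
print(X_full[:, 2:3].shape)  # (442, 1)

# Plain integer indexing drops the axis -> 1-D array, would need reshaping
print(X_full[:, 2].shape)    # (442,)
```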
Split the data into training/testing sets
train_set_size = 250
X_train = X[:train_set_size]  # first 250 rows (examples) for the train set
X_test = X[train_set_size:]   # from row 250 to the last one for the test set
print(X_train.shape)
print(X_test.shape)
(250, 1) (192, 1)
Split the targets into training/testing sets
y_train = y[:train_set_size]  # first 250 targets for the train set
y_test = y[train_set_size:]   # from row 250 to the last one for the test set
print(y_train.shape)
print(y_test.shape)
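As an aside, scikit-learn can shuffle and split in one step with `train_test_split`. The sketch below reproduces the 250/192 split sizes, although the exact rows assigned to each set differ from the manual slicing above, because the internal split logic is different:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X, y = diabetes.data[:, 2:3], diabetes.target

# shuffle=True is the default; train_size=250 gives the same 250/192 sizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=250, random_state=1)

print(X_train.shape, X_test.shape)  # (250, 1) (192, 1)
```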
Now we can look at our train data. We can see that the examples have an approximately linear relation. Therefore, a linear model should make good predictions for our examples (note that this is a regression problem, not classification).
import matplotlib.pyplot as plt

plt.scatter(X_train, y_train)
plt.scatter(X_test, y_test)
plt.xlabel('Data')
plt.ylabel('Target');
Create a linear regression object, which we will use to fit a linear model to the data:
from sklearn import linear_model

regr = linear_model.LinearRegression()
Fit the model using the training set
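The code cell for this step appears to have been lost in conversion; based on the surrounding text and the output shown below, it would look like the following sketch (rebuilding the same training set so the snippet runs standalone):

```python
from sklearn import datasets, linear_model
from sklearn.utils import shuffle

# Rebuild the same training set as above
diabetes = datasets.load_diabetes()
X, y = shuffle(diabetes.data, diabetes.target, random_state=1)
X = X[:, 2:3]
X_train, y_train = X[:250], y[:250]

# Fit the linear model on the training set
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

print(regr.coef_)
print(regr.intercept_)
```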
We found the coefficients and the bias (the intercept)
[ 865.04619508] 151.179169728
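For a single feature, these fitted parameters can be cross-checked against the closed-form least-squares solution: the slope is cov(x, y) / var(x) and the intercept is mean(y) − slope · mean(x). A sketch of the check, rebuilding the same training set:

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.utils import shuffle

# Rebuild the same training set as above
diabetes = datasets.load_diabetes()
X, y = shuffle(diabetes.data, diabetes.target, random_state=1)
X_train, y_train = X[:250, 2:3], y[:250]

regr = linear_model.LinearRegression().fit(X_train, y_train)

# Closed-form least squares for one feature: slope = cov(x, y) / var(x)
x = X_train.ravel()
slope = np.cov(x, y_train, bias=True)[0, 1] / np.var(x)
intercept = y_train.mean() - slope * x.mean()

assert np.isclose(slope, regr.coef_[0])
assert np.isclose(intercept, regr.intercept_)
```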
Now we calculate the mean squared error on the train and test sets:
import numpy as np

# The mean squared error
print("Training error:", np.mean((regr.predict(X_train) - y_train) ** 2))
print("Test error:", np.mean((regr.predict(X_test) - y_test) ** 2))
Training error: 3800.1408249628944
Test error: 4047.2429967010571
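The same quantity is available as `mean_squared_error` in `sklearn.metrics`, which avoids the manual `np.mean(...)` expression. A sketch, rebuilding the same split:

```python
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle

# Same data preparation as above
diabetes = datasets.load_diabetes()
X, y = shuffle(diabetes.data, diabetes.target, random_state=1)
X = X[:, 2:3]
X_train, X_test = X[:250], X[250:]
y_train, y_test = y[:250], y[250:]

regr = linear_model.LinearRegression().fit(X_train, y_train)

# Equivalent to np.mean((regr.predict(X_test) - y_test) ** 2)
print("Test error:", mean_squared_error(y_test, regr.predict(X_test)))
```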
Now we plot the train data and targets (marked as dots). The line represents the predictions of the linear model that we found:
# Visualise the dots, where each dot represents a data example and its target
plt.scatter(X_train, y_train, color='black')
# Plot the linear model
plt.plot(X_train, regr.predict(X_train), color='blue', linewidth=3)
plt.xlabel('Data')
plt.ylabel('Target');
We do the same with the test data, showing that the linear model is also valid for the test set:
# Visualise the dots, where each dot represents a data example and its target
plt.scatter(X_test, y_test, color='black')
# Plot the linear model
plt.plot(X_test, regr.predict(X_test), color='blue', linewidth=3)
plt.xlabel('Data')
plt.ylabel('Target');