Introduction to Python for Data Sciences
Franck Iutzeler
Now that we have explored the data structures provided by the Pandas library, we will investigate how to learn from them using Scikit-learn.
Scikit-learn is one of the most celebrated and widely used machine learning libraries. It features a complete set of efficiently implemented machine learning algorithms for classification, regression, and clustering. Scikit-learn is designed to operate on NumPy, SciPy, and Pandas data structures.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Machine learning is the task of predicting properties from data. A dataset consists of several examples (or samples). The associated target properties can be fully available, partially available, or not available at all; these settings are respectively called supervised, semi-supervised, and unsupervised. Each example is made of one or several features (or attributes), which can be of different types (real numbers, discrete values, strings, booleans, etc.).
Learning problems can be broadly divided into a few categories:
The following flowchart can be found on the Scikit-learn website:
The process of learning and predicting with Scikit-learn follows three main steps:
1. Selecting and adjusting a model
2. Fitting the model to the data
3. Predicting from this fitted model
We will illustrate this process on a simple linear model $$ y = a x + b + \nu$$ where $a$ is the slope, $b$ is the intercept, and $\nu$ is a random noise term.
a = np.random.randn()*5 # slope, drawn at random
b = np.random.rand()*10 # intercept (value at the origin), drawn at random
m = 50 # number of points
x = np.random.rand(m,1)*10 # abscissas, drawn uniformly in [0, 10)
y = a*x + b + np.random.randn(m,1) # y = ax + b + noise
plt.scatter(x, y)
As we want to fit a linear model $y=ax+b$ through the data, we will import the `LinearRegression` class from Scikit-learn with `from sklearn.linear_model import LinearRegression`.
As the data has a nonzero value at the origin ($b \neq 0$), the model needs an intercept. This is controlled by the `fit_intercept` parameter, which can be set along with several others; see the documentation of Scikit-learn's `linear_model` module.
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
print(model)
LinearRegression()
This completes the model setup. Notice that we have only described the model; no learning or fitting has been done yet.
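To check that an estimator is configured but not yet trained, one can inspect its parameters and look for its learned attributes. The sketch below assumes a fresh `LinearRegression` instance (the name `unfitted` is only illustrative); `get_params` is the generic Scikit-learn way to list an estimator's settings.

```python
from sklearn.linear_model import LinearRegression

# A fresh estimator: configured, but not trained (illustrative name)
unfitted = LinearRegression(fit_intercept=True)

# get_params() returns the constructor parameters as a dictionary
params = unfitted.get_params()
print(params["fit_intercept"])

# Learned attributes such as coef_ are only created by fit
has_coef = hasattr(unfitted, "coef_")
print(has_coef)
```

Attributes ending with an underscore, such as `coef_`, are the ones produced by fitting.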
Applying our model to the data $(x,y)$ is done using the `fit` method.
model.fit(x,y)
LinearRegression()
Once the model is fitted, one can observe the learned coefficients:
- `coef_` for the model coefficients ($a$ here)
- `intercept_` for the intercept ($b$ here)
print("Learned coefficients: a = {:.6f} \t b = {:.6f}".format(model.coef_.item(),model.intercept_.item()))
print("True coefficients: a = {:.6f} \t b = {:.6f}".format(a,b))
Learned coefficients: a = 3.637057 	 b = 2.559807
True coefficients: a = 3.674435 	 b = 2.385000
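The learned values are close to the true ones but not equal, because the fit minimizes the squared error on noisy samples. As a sanity check, the sketch below compares `LinearRegression` with an ordinary least-squares solve done by hand with `np.linalg.lstsq`. The synthetic data, the fixed seed, and the names `x_demo`, `y_demo`, `fitted` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)            # fixed seed, for reproducibility only
x_demo = rng.random((50, 1)) * 10         # synthetic abscissas (illustrative)
y_demo = 3.0 * x_demo + 2.0 + rng.standard_normal((50, 1))

fitted = LinearRegression(fit_intercept=True).fit(x_demo, y_demo)

# Least squares by hand: minimize ||A [a, b]^T - y|| with A = [x, 1]
A = np.hstack([x_demo, np.ones_like(x_demo)])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
a_ls, b_ls = solution.ravel()

same_slope = np.isclose(fitted.coef_.item(), a_ls)
same_intercept = np.isclose(fitted.intercept_.item(), b_ls)
print(same_slope, same_intercept)
```

Both approaches solve the same least-squares problem, so they should agree up to numerical precision.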
From a feature matrix, the `predict` method returns the predicted output from the fitted model.
xFit = np.linspace(-2,12,21).reshape(-1, 1)
yFit = model.predict(xFit)
plt.scatter(x, y , label="data")
plt.plot(xFit, yFit , label="model",color="r")
plt.legend()
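For a linear model, `predict` simply evaluates the fitted affine function on each row of the feature matrix. The following sketch checks this on synthetic data; the data and the names `x_new`, `y_manual` are illustrative assumptions, not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)            # fixed seed, for reproducibility only
x_demo = rng.random((30, 1)) * 10
y_demo = -1.5 * x_demo + 4.0 + rng.standard_normal((30, 1))
fitted = LinearRegression().fit(x_demo, y_demo)

# predict expects a 2D feature matrix: here 5 samples with 1 feature each
x_new = np.linspace(-2, 12, 5).reshape(-1, 1)
y_pred = fitted.predict(x_new)

# The same values computed directly from the learned attributes
y_manual = x_new @ fitted.coef_.T + fitted.intercept_

print(y_pred.shape)
print(np.allclose(y_pred, y_manual))
```

The `reshape(-1, 1)` call matters: a flat vector of abscissas has to be turned into a one-column matrix before it is passed to `predict`.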
Scikit-learn accepts several input formats (i.e. passed to `fit` and `predict`), including NumPy arrays, SciPy sparse matrices, and Pandas DataFrames.
The examples/samples of the datasets are stored as rows.
The features are the columns.
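As an illustration of this row/column convention, here is a minimal sketch that fits a model directly from a Pandas DataFrame. The column names `x1`, `x2`, `target` and the noise-free synthetic data are hypothetical, chosen so that the learned coefficients are easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)            # fixed seed, for reproducibility only
df = pd.DataFrame({
    "x1": rng.random(20),                 # first feature (hypothetical)
    "x2": rng.random(20),                 # second feature (hypothetical)
})
df["target"] = 2.0 * df["x1"] - 1.0 * df["x2"] + 0.5   # noise-free target

X = df[["x1", "x2"]]                      # 20 rows (samples), 2 columns (features)
y_target = df["target"]                   # one target value per sample

model_df = LinearRegression().fit(X, y_target)
print(X.shape)
print(model_df.coef_, model_df.intercept_)
```

Since the target is an exact affine function of the features, the fit should recover the coefficients 2 and -1 and the intercept 0.5 up to rounding.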
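To validate our model on data it has not seen, it is customary to split the dataset into training and testing subsets. This can be done manually, but Scikit-learn also provides a dedicated function, `train_test_split`, which by default holds out 25% of the samples for testing.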
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(x,y)
print(xTrain.shape,yTrain.shape)
print(xTest.shape,yTest.shape)
(37, 1) (37, 1)
(13, 1) (13, 1)
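The split is random, so the subsets change from run to run. The sketch below shows two parameters of `train_test_split` that are often useful: `test_size` to choose the held-out fraction and `random_state` to make the split reproducible. The toy arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x_demo = np.arange(50).reshape(-1, 1)     # 50 samples, 1 feature (toy data)
y_demo = 2 * x_demo + 1

# Hold out 20% of the samples, with a fixed seed
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(x_tr.shape, x_te.shape)

# The same seed yields the same split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(x_te, x_te2))

# Features and targets are shuffled together, so rows stay paired
print(np.array_equal(y_te, 2 * x_te + 1))
```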
Let us use this train/test split to compare a linear model with an intercept and one without.
from sklearn.linear_model import LinearRegression
model1 = LinearRegression(fit_intercept=True)
model2 = LinearRegression(fit_intercept=False)
model1.fit(xTrain,yTrain)
yPre1 = model1.predict(xTest)
error1 = np.linalg.norm(yTest-yPre1)
model2.fit(xTrain,yTrain)
yPre2 = model2.predict(xTest)
error2 = np.linalg.norm(yTest-yPre2)
print("Testing Error with intercept:", error1, "\t without intercept:" ,error2)
Testing Error with intercept: 3.3147332805841767 without intercept: 7.139630719389703
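The error reported here is the Euclidean norm of the residual vector, which grows with the number of test samples. A size-independent alternative is the mean squared error, which is the squared norm divided by the number of samples. The sketch below checks this relation with `sklearn.metrics.mean_squared_error` on a small hand-made example (the values are illustrative).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([[1.0], [2.0], [3.0], [4.0]])   # illustrative targets
y_hat = np.array([[1.5], [2.0], [2.0], [4.5]])    # illustrative predictions

norm_error = np.linalg.norm(y_true - y_hat)        # sqrt of the sum of squares
mse = mean_squared_error(y_true, y_hat)            # mean of the squares

# MSE = ||y_true - y_hat||^2 / n
print(np.isclose(mse, norm_error**2 / len(y_true)))
```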
plt.scatter(xTrain, yTrain , label="Train data")
plt.scatter(xTest, yTest , color= 'k' , label="Test data")
plt.plot(xTest, yPre1 , color='r', label="model w/ intercept (err = {:.1f})".format(error1))
plt.plot(xTest, yPre2 , color='m', label="model w/o intercept (err = {:.1f})".format(error2))
plt.legend()
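The comparison above relies on a single random split, so the errors depend on which points end up in the test set. As a complement, k-fold cross-validation averages the test error over several splits. Here is a minimal sketch using `cross_val_score` on synthetic data; the data, the seed, and the choice of 5 folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)            # fixed seed, for reproducibility only
x_demo = rng.random((50, 1)) * 10
y_demo = 3.0 * x_demo.ravel() + 5.0 + rng.standard_normal(50)

cv_mse = {}
for fit_intercept in (True, False):
    # Scores are negated MSE values, one per fold (higher is better)
    scores = cross_val_score(
        LinearRegression(fit_intercept=fit_intercept),
        x_demo, y_demo, cv=5, scoring="neg_mean_squared_error",
    )
    cv_mse[fit_intercept] = -scores.mean()

print("CV mean squared error with intercept:", cv_mse[True])
print("CV mean squared error without intercept:", cv_mse[False])
```

Because the synthetic data has a nonzero intercept, the model fitted without one should show a clearly larger cross-validated error.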