From the video series: Introduction to machine learning with scikit-learn
Pandas: popular Python library for data exploration, manipulation, and analysis
# conventional way to import pandas
import pandas as pd
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# display the first 5 rows
data.head()
| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |
Primary object types:

- DataFrame: rows and columns (like a spreadsheet)
- Series: a single column
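Either type can be confirmed with Python's built-in type function; a quick check using the data object loaded above:

# the full dataset is a DataFrame; selecting a single column yields a Series
print(type(data))
print(type(data['Sales']))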
# display the last 5 rows
data.tail()
| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 196 | 38.2 | 3.7 | 13.8 | 7.6 |
| 197 | 94.2 | 4.9 | 8.1 | 9.7 |
| 198 | 177.0 | 9.3 | 6.4 | 12.8 |
| 199 | 283.6 | 42.0 | 66.2 | 25.5 |
| 200 | 232.1 | 8.6 | 8.7 | 13.4 |
# check the shape of the DataFrame (rows, columns)
data.shape
(200, 4)
What are the features?

- TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- Radio: advertising dollars spent on Radio
- Newspaper: advertising dollars spent on Newspaper

What is the response?

- Sales: sales of a single product in a given market (in thousands of widgets)

What else do we know?

- Because the response variable is continuous, this is a regression problem.
- There are 200 observations (represented by the rows), and each observation is a single market.
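These answers can be double-checked directly against the DataFrame; a quick sketch using the data object from above:

# list the column names: three features plus the response
print(data.columns.tolist())

# the response is stored as floats (continuous), confirming a regression problem
print(data['Sales'].dtype)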
Seaborn: Python library for statistical data visualization built on top of Matplotlib
To install Seaborn, run conda install seaborn from the command line.

# conventional way to import seaborn
import seaborn as sns
# allow plots to appear within the notebook
%matplotlib inline
# visualize the relationship between the features and the response using scatterplots
# (newer versions of Seaborn use height in place of the deprecated size parameter)
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')
<seaborn.axisgrid.PairGrid at 0xaaf8fd0>
Linear regression

Pros: fast, no tuning required, highly interpretable, well-understood
Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)
Form of linear regression:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

In this formula:

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)
In this case:
$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$
The $\beta$ values are called the model coefficients. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions!
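To make the least squares criterion concrete, the coefficients can also be computed directly with NumPy's least squares solver. This is an illustrative sketch only (scikit-learn uses its own solver, but both minimize the same sum of squared residuals), and it fits on the full dataset, whereas the model below is fit on a training subset:

import numpy as np

# design matrix: a column of ones (for the intercept) followed by the three features
X_mat = np.column_stack([np.ones(len(data)), data[['TV', 'Radio', 'Newspaper']]])
y_vec = data['Sales'].values

# solve for the beta vector that minimizes the sum of squared residuals ||X_mat @ beta - y_vec||^2
beta, *_ = np.linalg.lstsq(X_mat, y_vec, rcond=None)
print(beta)  # [intercept, beta_TV, beta_Radio, beta_Newspaper]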
# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# equivalent command to do this in one line
X = data[['TV', 'Radio', 'Newspaper']]
# print the first 5 rows
X.head()
| | TV | Radio | Newspaper |
|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 |
| 2 | 44.5 | 39.3 | 45.1 |
| 3 | 17.2 | 45.9 | 69.3 |
| 4 | 151.5 | 41.3 | 58.5 |
| 5 | 180.8 | 10.8 | 58.4 |
# check the type and shape of X
print(type(X))
print(X.shape)
<class 'pandas.core.frame.DataFrame'>
(200, 3)
# select a Series from the DataFrame
y = data['Sales']
# equivalent command that works if there are no spaces in the column name
y = data.Sales
# print the first 5 values
y.head()
1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: Sales, dtype: float64
# check the type and shape of y
print(type(y))
print(y.shape)
<class 'pandas.core.series.Series'>
(200,)
# import train_test_split (in current versions of scikit-learn it lives in sklearn.model_selection, which replaced the old sklearn.cross_validation module)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(150, 3)
(150,)
(50, 3)
(50,)
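The split proportion can also be requested explicitly with the test_size parameter; this sketch is equivalent to the default behavior:

# explicitly request a 25% testing set (the same as the default)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)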
# import model
from sklearn.linear_model import LinearRegression
# instantiate
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
2.87696662232
[ 0.04656457  0.17915812  0.00345046]
# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))
[('TV', 0.046564567874150288), ('Radio', 0.17915812245088836), ('Newspaper', 0.0034504647111804065)]
How do we interpret the TV coefficient (0.0466)?

- For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
- Or more clearly: for a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 widgets.
Important notes:

- This is a statement of association, not causation.
- If an increase in TV ad spending was associated with a decrease in sales, $\beta_1$ would be negative.
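One way to see this interpretation in action is to predict on two markets that differ only by one unit of TV spending; the predictions differ by exactly the TV coefficient. The spending values below are hypothetical, chosen just for illustration:

# two hypothetical markets, identical except for 1 extra unit of TV spending
new_markets = pd.DataFrame({'TV': [100, 101], 'Radio': [25, 25], 'Newspaper': [25, 25]})
preds = linreg.predict(new_markets)

# the difference equals the TV coefficient (about 0.0466)
print(preds[1] - preds[0])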
# make predictions on the testing set
y_pred = linreg.predict(X_test)
We need an evaluation metric in order to compare our predictions with the actual values!
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.
Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

# calculate MAE by hand
print((10 + 0 + 20 + 10)/4.)
# calculate MAE using scikit-learn
from sklearn import metrics
print(metrics.mean_absolute_error(true, pred))
10.0
10.0
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

# calculate MSE by hand
print((10**2 + 0**2 + 20**2 + 10**2)/4.)
# calculate MSE using scikit-learn
print(metrics.mean_squared_error(true, pred))
150.0
150.0
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

# calculate RMSE by hand
import numpy as np
print(np.sqrt((10**2 + 0**2 + 20**2 + 10**2)/4.))
# calculate RMSE using scikit-learn
print(np.sqrt(metrics.mean_squared_error(true, pred)))
12.2474487139
12.2474487139
Comparing these metrics:

- MAE is the easiest to understand, because it is the average error.
- MSE is more popular than MAE, because MSE "punishes" larger errors (see the sketch below).
- RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
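To see the "punishing" effect, here is a sketch that reuses the toy values from above but makes the last error larger (40 instead of 10): the MAE grows from 10.0 to 17.5, while the RMSE jumps from about 12.25 to about 22.91.

# same true values as above, but the last prediction is now off by 40 instead of 10
true = [100, 50, 30, 20]
pred = [90, 50, 50, 60]

print(metrics.mean_absolute_error(true, pred))
print(np.sqrt(metrics.mean_squared_error(true, pred)))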
# compute the RMSE for our Sales predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
1.40465142303
Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions?
Let's remove it from the model and check the RMSE!
# create a Python list of feature names
feature_cols = ['TV', 'Radio']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# select a Series from the DataFrame
y = data.Sales
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
# make predictions on the testing set
y_pred = linreg.predict(X_test)
# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
1.38790346994
The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower number for RMSE is better.) Thus, it is unlikely that this feature is useful for predicting Sales, and it should be removed from the model.
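This kind of check can be wrapped in a small helper so that any candidate feature set is evaluated the same way. The train_test_rmse function below is a hypothetical convenience built from the pieces above, not part of scikit-learn:

# helper: compute the testing RMSE for a given list of feature columns
def train_test_rmse(cols):
    X = data[cols]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print(train_test_rmse(['TV', 'Radio', 'Newspaper']))  # ~1.405
print(train_test_rmse(['TV', 'Radio']))               # ~1.388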
Resources for further learning:

- Linear regression
- Pandas
- Seaborn