A Summary of lecture "Supervised Learning with scikit-learn", via datacamp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
The Gapminder dataset that you worked with in previous chapters also contained a categorical 'Region'
feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!
Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.
df = pd.read_csv('./dataset/gm_2008_region.csv')
df.head()
population | fertility | HIV | CO2 | BMI_male | GDP | BMI_female | life | child_mortality | Region | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 34811059.0 | 2.73 | 0.1 | 3.328945 | 24.59620 | 12314.0 | 129.9049 | 75.3 | 29.5 | Middle East & North Africa |
1 | 19842251.0 | 6.43 | 2.0 | 1.474353 | 22.25083 | 7103.0 | 130.1247 | 58.3 | 192.0 | Sub-Saharan Africa |
2 | 40381860.0 | 2.24 | 0.5 | 4.785170 | 27.50170 | 14646.0 | 118.8915 | 75.5 | 15.4 | America |
3 | 2975029.0 | 1.40 | 0.1 | 1.804106 | 25.35542 | 7383.0 | 132.8108 | 72.5 | 20.0 | Europe & Central Asia |
4 | 21370348.0 | 1.96 | 0.1 | 18.016313 | 27.56373 | 41312.0 | 117.3755 | 81.5 | 5.2 | East Asia & Pacific |
# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)
plt.savefig('../images/gm-boxplot.png', dpi=100)
Scikit-learn does not accept non-numerical features. You saw in the previous exercise that the 'Region'
feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the 'Region'
feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.
df.columns
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP', 'BMI_female', 'life', 'child_mortality', 'Region'], dtype='object')
# Create dummy variables: df_region
df_region = pd.get_dummies(df)
# Print the columns of df_regions
print(df_region.columns)
# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)
# Print the new columns of df_region
print(df_region.columns)
Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP', 'BMI_female', 'life', 'child_mortality', 'Region_America', 'Region_East Asia & Pacific', 'Region_Europe & Central Asia', 'Region_Middle East & North Africa', 'Region_South Asia', 'Region_Sub-Saharan Africa'], dtype='object') Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP', 'BMI_female', 'life', 'child_mortality', 'Region_East Asia & Pacific', 'Region_Europe & Central Asia', 'Region_Middle East & North Africa', 'Region_South Asia', 'Region_Sub-Saharan Africa'], dtype='object')
Having created the dummy variables from the 'Region' feature, you can build regression models as you did before. Here, you'll use ridge regression to perform 5-fold cross-validation.
X = df_region.drop('life', axis='columns')
y = df_region['life']
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)
# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)
# Print the cross-validated scores
print(ridge_cv)
[0.86808336 0.80623545 0.84004203 0.7754344 0.87503712]
You will see that there are certain data points labeled with a '?'
. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a '9999'
, other times a 0
- real-world data can be very messy! If you're lucky, the missing values will already be encoded as NaN
. We use NaN
because it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as .dropna()
and .fillna()
, as well as scikit-learn's Imputation transformer Imputer()
.
In this exercise, your job is to convert the '?'
s to NaNs, and then drop the rows that contain them from the DataFrame.
df = pd.read_csv('./dataset/house-votes-84.csv', header=None)
df.columns = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
df.replace({'n':0, 'y':1}, inplace=True)
df.head()
party | infants | water | budget | physician | salvador | religious | satellite | aid | missile | immigration | synfuels | education | superfund | crime | duty_free_exports | eaa_rsa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | republican | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | ? | 1 | 1 | 1 | 0 | 1 |
1 | republican | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | ? |
2 | democrat | ? | 1 | 1 | ? | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
3 | democrat | 0 | 1 | 1 | 0 | ? | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
4 | democrat | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | ? | 1 | 1 | 1 | 1 |
# Convert '?' to NaN
df[df == '?'] = np.nan
# Print the number of NaNs
print(df.isnull().sum())
# Print shape of original DataFrame
print('Shape of Original DataFrame: {}'.format(df.shape))
# Drop missing values and print shape fo new DataFrame
df = df.dropna()
# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))
party 0 infants 12 water 48 budget 11 physician 11 salvador 15 religious 11 satellite 14 aid 15 missile 22 immigration 7 synfuels 21 education 31 superfund 25 crime 17 duty_free_exports 28 eaa_rsa 104 dtype: int64 Shape of Original DataFrame: (435, 17) Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)
As you've come to appreciate, there are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.
You'll now practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. You've seen three classifiers in this course so far: k-NN, logistic regression, and the decision tree. You will now be introduced to a fourth one - the Support Vector Machine, or SVM. For now, do not worry about how it works under the hood. It works exactly as you would expect of the scikit-learn estimators that you have worked with previously, in that it has the same .fit()
and .predict()
methods as before.
Note:
sklearn.preprocessing.Imputer
is replaced withsklearn.imputer.SimpleImputer
from version 0.20
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
# Setup the Imputation transformer: imp
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Instantiate the SVC classifier: clf
clf = SVC()
# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
('SVM', clf)]
Having setup the steps of the pipeline in the previous exercise, you will now use it on the voting dataset to classify a Congressman's party affiliation. What makes pipelines so incredibly useful is the simple interface that they provide. You can use the .fit()
and .predict()
methods on pipelines just as you did with your classifiers and regressors!
X = df.drop('party', axis='columns')
y = df['party']
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
# Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
('SVM', SVC())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = pipeline.predict(X_test)
# Compute metrics
print(classification_report(y_test, y_pred))
precision recall f1-score support democrat 0.97 0.97 0.97 36 republican 0.97 0.97 0.97 34 accuracy 0.97 70 macro avg 0.97 0.97 0.97 70 weighted avg 0.97 0.97 0.97 70
You will now explore scaling for yourself on a new dataset - White Wine Quality! Hugo used the Red Wine Quality dataset in the video. We have used the 'quality'
feature of the wine to create a binary target variable: If 'quality'
is less than 5, the target variable is 1, and otherwise, it is 0.
Notice how some features seem to have different units of measurement. 'density'
, for instance, takes values between 0.98 and 1.04, while 'total sulfur dioxide'
ranges from 9 to 440. As a result, it may be worth scaling the features here. Your job in this exercise is to scale the features and compute the mean and standard deviation of the unscaled features compared to the scaled features.
df = pd.read_csv('./dataset/white-wine.csv')
df.head()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
df['quality'] = df['quality'] < 5
X = df.drop('quality', axis='columns').values
y = df['quality'].values
df.head()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | False |
1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | False |
2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | False |
3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | False |
4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | False |
from sklearn.preprocessing import scale
# Scale the features: X_scaled
X_scaled = scale(X)
# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X)))
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))
# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled)))
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))
Mean of Unscaled Features: 18.432687072460002 Standard Deviation of Unscaled Features: 41.54494764094571 Mean of Scaled Features: 2.7314972981668206e-15 Standard Deviation of Scaled Features: 0.9999999999999999
With regard to whether or not scaling is effective, the proof is in the pudding! See for yourself whether or not scaling the features of the White Wine Quality dataset has any impact on its performance. You will use a k-NN classifier as part of a pipeline that includes scaling, and for the purposes of comparison, a k-NN classifier trained on the unscaled data has been provided.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
('knn', KNeighborsClassifier())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)
# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
Accuracy with Scaling: 0.964625850340136 Accuracy without Scaling: 0.9666666666666667
It is time now to piece together everything you have learned so far into a pipeline for classification! Your job in this exercise is to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.
You'll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are C
and gamma
. C
controls the regularization strength. It is analogous to the C
you tuned for logistic regression in Chapter 3, while gamma
controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Setup the pipeline
steps = [('scaler', StandardScaler()),
('SVM', SVC())]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
'SVM__gamma':[0.1, 0.01]}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
# Fit to the training set
cv.fit(X_train, y_train)
# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)
# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))
Accuracy: 0.9693877551020408 precision recall f1-score support False 0.97 1.00 0.98 951 True 0.43 0.10 0.17 29 accuracy 0.97 980 macro avg 0.70 0.55 0.58 980 weighted avg 0.96 0.97 0.96 980 Tuned Model Parameters: {'SVM__C': 100, 'SVM__gamma': 0.01}
For this final exercise, you will return to the Gapminder dataset. Guess what? Even this dataset has missing values that we dealt with for you in earlier chapters! Now, you have all the tools to take care of them yourself!
Your job is to build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the l1_ratio
of your ElasticNet using GridSearchCV
.
df = pd.read_csv('./dataset/gm_2008_region.csv')
df.head()
population | fertility | HIV | CO2 | BMI_male | GDP | BMI_female | life | child_mortality | Region | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 34811059.0 | 2.73 | 0.1 | 3.328945 | 24.59620 | 12314.0 | 129.9049 | 75.3 | 29.5 | Middle East & North Africa |
1 | 19842251.0 | 6.43 | 2.0 | 1.474353 | 22.25083 | 7103.0 | 130.1247 | 58.3 | 192.0 | Sub-Saharan Africa |
2 | 40381860.0 | 2.24 | 0.5 | 4.785170 | 27.50170 | 14646.0 | 118.8915 | 75.5 | 15.4 | America |
3 | 2975029.0 | 1.40 | 0.1 | 1.804106 | 25.35542 | 7383.0 | 132.8108 | 72.5 | 20.0 | Europe & Central Asia |
4 | 21370348.0 | 1.96 | 0.1 | 18.016313 | 27.56373 | 41312.0 | 117.3755 | 81.5 | 5.2 | East Asia & Pacific |
df.drop(['Region'], axis='columns', inplace=True)
X = df.drop('life', axis='columns').values
y = df['life'].values
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
# Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
('scaler', StandardScaler()),
('elasticnet', ElasticNet(tol=0.6))]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio': np.linspace(0, 1, 30)}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
# Fit to the training set
gm_cv.fit(X_train, y_train)
# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
Tuned ElasticNet Alpha: {'elasticnet__l1_ratio': 0.6206896551724138} Tuned ElasticNet R squared: 0.855644773878311