This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance. This is a summary of the lecture "Preprocessing for Machine Learning in Python" from DataCamp.
import pandas as pd
import numpy as np
Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.
The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (`knn`) as well as the `X` and `y` sets you need to fit and score on.
wine = pd.read_csv('./dataset/wine_types.csv')
wine.head()
|   | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
y = wine['Type']
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)
# Score the model on the test data
print(knn.score(X_test, y_test))
0.6888888888888889
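One caveat the original exercise leaves out: `train_test_split` shuffles the data randomly, so the accuracy you see will vary from run to run. A minimal sketch of how to make the split reproducible (the `random_state` value and the use of `stratify` are assumptions, not part of the lecture):

# Not in the original exercise: fix the random seed and stratify on the labels so
# repeated runs produce the same split (and therefore the same score).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)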
Check the variance of the columns in the `wine` dataset.
wine.describe()
|   | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 |
| mean | 1.938202 | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.029270 | 0.361854 | 1.590899 | 5.058090 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.775035 | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.709990 | 314.907474 |
| min | 1.000000 | 11.030000 | 0.740000 | 1.360000 | 10.600000 | 70.000000 | 0.980000 | 0.340000 | 0.130000 | 0.410000 | 1.280000 | 0.480000 | 1.270000 | 278.000000 |
| 25% | 1.000000 | 12.362500 | 1.602500 | 2.210000 | 17.200000 | 88.000000 | 1.742500 | 1.205000 | 0.270000 | 1.250000 | 3.220000 | 0.782500 | 1.937500 | 500.500000 |
| 50% | 2.000000 | 13.050000 | 1.865000 | 2.360000 | 19.500000 | 98.000000 | 2.355000 | 2.135000 | 0.340000 | 1.555000 | 4.690000 | 0.965000 | 2.780000 | 673.500000 |
| 75% | 3.000000 | 13.677500 | 3.082500 | 2.557500 | 21.500000 | 107.000000 | 2.800000 | 2.875000 | 0.437500 | 1.950000 | 6.200000 | 1.120000 | 3.170000 | 985.000000 |
| max | 3.000000 | 14.830000 | 5.800000 | 3.230000 | 30.000000 | 162.000000 | 3.880000 | 5.080000 | 0.660000 | 3.580000 | 13.000000 | 1.710000 | 4.000000 | 1680.000000 |
The `Proline` column has an extremely high variance.
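A quick way to confirm this, as a sketch that goes slightly beyond the original exercise, is to rank every column by its variance and check that `Proline` dominates:

# Hypothetical check (not in the lecture): sort the column variances in descending order.
# Proline's variance is orders of magnitude larger than that of any other column.
print(wine.var().sort_values(ascending=False).head())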
Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it.
# Print out the variance of the Proline column
print(wine['Proline'].var())
# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])
# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())
99166.71735542428
0.17231366191842018
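As a side note, and as an assumption rather than something covered in the lecture, `np.log` is undefined for zero or negative values; `np.log1p`, which computes log(1 + x), is a common safer variant when a feature can contain zeros, and it has nearly the same effect on a strictly positive column like `Proline`:

# Sketch of an alternative transform (not from the lecture): log1p handles zeros gracefully.
wine['Proline_log1p'] = np.log1p(wine['Proline'])
print(wine['Proline_log1p'].var())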
We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Use `describe()` to return descriptive statistics for these columns and examine the scale of the data in each one.
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()
|   | Ash | Alcalinity of ash | Magnesium |
|---|---|---|---|
| count | 178.000000 | 178.000000 | 178.000000 |
| mean | 2.366517 | 19.494944 | 99.741573 |
| std | 0.274344 | 3.339564 | 14.282484 |
| min | 1.360000 | 10.600000 | 70.000000 |
| 25% | 2.210000 | 17.200000 | 88.000000 |
| 50% | 2.360000 | 19.500000 | 98.000000 |
| 75% | 2.557500 | 21.500000 | 107.000000 |
| max | 3.230000 | 30.000000 | 162.000000 |
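To make the scale difference concrete, here is a small sketch that is not part of the original exercise (the `subset` variable is introduced only for illustration); the standard deviation of `Magnesium` is roughly fifty times that of `Ash`:

# Hypothetical comparison: express each column's standard deviation relative to the smallest one.
subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]
print(subset.std())
print(subset.std() / subset.std().min())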
Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.
from sklearn.preprocessing import StandardScaler
# Create the scaler
ss = StandardScaler()
# Take a subset of the DataFrame you want to scale
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]
print(wine_subset.iloc[:3])
# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)
print(wine_subset_scaled[:3])
    Ash  Alcalinity of ash  Magnesium
0  2.43               15.6        127
1  2.14               11.2        100
2  2.67               18.6        101
[[ 0.23205254 -1.16959318  1.91390522]
 [-0.82799632 -2.49084714  0.01814502]
 [ 1.10933436 -0.2687382   0.08835836]]
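A small sanity check that is not in the original exercise: `fit_transform` returns a plain NumPy array, so you can wrap it back into a DataFrame to keep the column names and verify that each standardized column now has a mean of roughly 0 and a standard deviation of roughly 1.

# Sketch (assumption): rebuild a labeled DataFrame from the scaled array and inspect it.
wine_subset_scaled_df = pd.DataFrame(wine_subset_scaled, columns=wine_subset.columns)
print(wine_subset_scaled_df.mean().round(2))   # each column should be ~0
print(wine_subset_scaled_df.std().round(2))    # each column should be ~1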
Let's first take a look at the accuracy of a k-nearest neighbors model on the `wine` dataset without standardizing the data. The `knn` model as well as the `X` features and `y` labels have been created already. Most of this process of creating models in scikit-learn should look familiar to you.
wine = pd.read_csv('./dataset/wine_types.csv')
X = wine.drop('Type', axis=1)
y = wine['Type']
knn = KNeighborsClassifier()
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)
# Score the model on the test data
print(knn.score(X_test, y_test))
0.7555555555555555
The accuracy score on the unscaled wine dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data.
knn = KNeighborsClassifier()
# Create the scaling method
ss = StandardScaler()
# Apply the scaling method to the dataset used for modeling
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)
# Score the model on the test data
print(knn.score(X_test, y_test))
0.9555555555555556
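One refinement that goes beyond the lecture: here the scaler is fit on the full dataset before splitting, which leaks a little information from the test set into the scaling step. Below is a minimal sketch of the same workflow with a `Pipeline`, so the scaler's mean and standard deviation are learned from the training split only (the fixed `random_state` is an arbitrary choice for reproducibility, not from the original exercise).

from sklearn.pipeline import Pipeline

# Sketch (not from the lecture): chain the scaler and the classifier so the scaling
# parameters are learned only from the training data during fit().
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))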