Normalization makes data more meaningful by converting absolute values into comparisons with related values. Chris Vallier has produced this demonstration of normalization using PyJanitor.
pyjanitor functions demonstrated here: rename_column, min_max_scale, and transform_column.
import janitor
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style="whitegrid")
We'll use a dataset with fuel efficiency in miles per gallon ("mpg"), engine displacement in cubic centimeters ("disp"), and horsepower ("hp") for a variety of car models. It's a crazy, but customary, mix of units.
csv_file = 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
cars_df = pd.read_csv(csv_file)
Quantities without units are dangerous, so let's use pyjanitor's rename_column...
cars_df = cars_df.rename_column('disp', 'disp_cc')
cars_df.head()
 | model | mpg | cyl | disp_cc | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
cars_df = cars_df.sort_values('mpg', ascending=False)
sns.barplot(y='model', x='mpg', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of mpg by model, sorted in descending order]
cars_df = cars_df.sort_values('disp_cc', ascending=False)
sns.barplot(y='model', x='disp_cc', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of disp_cc by model, sorted in descending order]
cars_df = cars_df.sort_values('hp', ascending=False)
sns.barplot(y='model', x='hp', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of hp by model, sorted in descending order]
First we'll use pyjanitor's min_max_scale to rescale the mpg, disp_cc, and hp columns so that the values in each column range from 0 to 1.
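# Assign the result so the scaled columns are kept even if min_max_scale returns a copy
cars_df = (
    cars_df.min_max_scale(col_name='mpg', new_max=1, new_min=0)
    .min_max_scale(col_name='disp_cc', new_max=1, new_min=0)
    .min_max_scale(col_name='hp', new_max=1, new_min=0)
)
cars_df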
 | model | mpg | cyl | disp_cc | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | Maserati Bora | 0.195745 | 8 | 0.573460 | 1.000000 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
28 | Ford Pantera L | 0.229787 | 8 | 0.698179 | 0.749117 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
6 | Duster 360 | 0.165957 | 8 | 0.720629 | 0.681979 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
23 | Camaro Z28 | 0.123404 | 8 | 0.695685 | 0.681979 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
16 | Chrysler Imperial | 0.182979 | 8 | 0.920180 | 0.628975 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
15 | Lincoln Continental | 0.000000 | 8 | 0.970067 | 0.575972 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
14 | Cadillac Fleetwood | 0.000000 | 8 | 1.000000 | 0.540636 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
13 | Merc 450SLC | 0.204255 | 8 | 0.510601 | 0.452297 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
11 | Merc 450SE | 0.255319 | 8 | 0.510601 | 0.452297 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
12 | Merc 450SL | 0.293617 | 8 | 0.510601 | 0.452297 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
24 | Pontiac Firebird | 0.374468 | 8 | 0.820404 | 0.434629 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
4 | Hornet Sportabout | 0.353191 | 8 | 0.720629 | 0.434629 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
29 | Ferrari Dino | 0.395745 | 6 | 0.184335 | 0.434629 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
21 | Dodge Challenger | 0.217021 | 8 | 0.615864 | 0.346290 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
22 | AMC Javelin | 0.204255 | 8 | 0.580943 | 0.346290 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
10 | Merc 280C | 0.314894 | 6 | 0.240708 | 0.250883 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
9 | Merc 280 | 0.374468 | 6 | 0.240708 | 0.250883 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
27 | Lotus Europa | 0.851064 | 4 | 0.059865 | 0.215548 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
0 | Mazda RX4 | 0.451064 | 6 | 0.221751 | 0.204947 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 0.451064 | 6 | 0.221751 | 0.204947 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
3 | Hornet 4 Drive | 0.468085 | 6 | 0.466201 | 0.204947 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
31 | Volvo 142E | 0.468085 | 4 | 0.124470 | 0.201413 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
5 | Valiant | 0.327660 | 6 | 0.383886 | 0.187279 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
20 | Toyota Corona | 0.472340 | 4 | 0.122225 | 0.159011 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
8 | Merc 230 | 0.527660 | 4 | 0.173859 | 0.151943 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
2 | Datsun 710 | 0.527660 | 4 | 0.092043 | 0.144876 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
26 | Porsche 914-2 | 0.663830 | 4 | 0.122724 | 0.137809 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
25 | Fiat X1-9 | 0.719149 | 4 | 0.019706 | 0.049470 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
17 | Fiat 128 | 0.936170 | 4 | 0.018957 | 0.049470 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
19 | Toyota Corolla | 1.000000 | 4 | 0.000000 | 0.045936 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
7 | Merc 240D | 0.595745 | 4 | 0.188576 | 0.035336 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
18 | Honda Civic | 0.851064 | 4 | 0.011474 | 0.000000 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
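For readers who want to see the arithmetic behind min-max scaling, here is a minimal pure-Python sketch. The helper name rescale_min_max is hypothetical and is not pyjanitor's implementation. The sample is only the five mpg values from the head() output above, so the results differ from the table, which is scaled against all 32 models.

```python
def rescale_min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # A constant column has no range to stretch
        raise ValueError("cannot rescale a constant column")
    return [new_min + (new_max - new_min) * (v - old_min) / span for v in values]


# mpg for the five models shown by cars_df.head() (illustrative sample only)
mpg_head = [21.0, 21.0, 22.8, 21.4, 18.7]
scaled = rescale_min_max(mpg_head)

for raw, s in zip(mpg_head, scaled):
    print(f"{raw:5.1f} -> {s:.3f}")
```

The ordering of the values and their relative spacing are preserved; only the units change. That is why the bar charts keep their shape.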
The shapes of the bar graphs remain the same, but the horizontal axes show the new scale.
cars_df = cars_df.sort_values('mpg', ascending=False)
sns.barplot(y='model', x='mpg', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of min-max scaled mpg by model, sorted in descending order]
cars_df = cars_df.sort_values('disp_cc', ascending=False)
sns.barplot(y='model', x='disp_cc', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of min-max scaled disp_cc by model, sorted in descending order]
cars_df = cars_df.sort_values('hp', ascending=False)
sns.barplot(y='model', x='hp', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of min-max scaled hp by model, sorted in descending order]
Next we'll convert to standard scores. A standard score expresses each value as the number of standard deviations it lies above or below the mean, which shows where each model stands in relation to the others.
We'll use pyjanitor's transform_column to apply the standard score calculation, (x - x.mean()) / x.std(), to each value in each of the columns we're evaluating.
cars_df = cars_df.transform_column(['mpg', 'disp_cc', 'hp'], lambda x: (x - x.mean()) / x.std())
cars_df
 | model | mpg | cyl | disp_cc | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | Maserati Bora | -0.844644 | 8 | 0.567039 | 2.746567 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
28 | Ford Pantera L | -0.711907 | 8 | 0.970465 | 1.711021 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
6 | Duster 360 | -0.960789 | 8 | 1.043081 | 1.433903 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
23 | Camaro Z28 | -1.126710 | 8 | 0.962396 | 1.433903 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
16 | Chrysler Imperial | -0.894420 | 8 | 1.688562 | 1.215126 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
15 | Lincoln Continental | -1.607883 | 8 | 1.849932 | 0.996348 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
14 | Cadillac Fleetwood | -1.607883 | 8 | 1.946754 | 0.850497 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
13 | Merc 450SLC | -0.811460 | 8 | 0.363713 | 0.485868 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
11 | Merc 450SE | -0.612354 | 8 | 0.363713 | 0.485868 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
12 | Merc 450SL | -0.463025 | 8 | 0.363713 | 0.485868 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
24 | Pontiac Firebird | -0.147774 | 8 | 1.365821 | 0.412942 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
4 | Hornet Sportabout | -0.230735 | 8 | 1.043081 | 0.412942 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
29 | Ferrari Dino | -0.064813 | 6 | -0.691647 | 0.412942 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
21 | Dodge Challenger | -0.761683 | 8 | 0.704204 | 0.048313 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
22 | AMC Javelin | -0.811460 | 8 | 0.591245 | 0.048313 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
10 | Merc 280C | -0.380064 | 6 | -0.509299 | -0.345486 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
9 | Merc 280 | -0.147774 | 6 | -0.509299 | -0.345486 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
27 | Lotus Europa | 1.710547 | 4 | -1.094266 | -0.491337 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
0 | Mazda RX4 | 0.150885 | 6 | -0.570620 | -0.535093 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 0.150885 | 6 | -0.570620 | -0.535093 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
3 | Hornet 4 Drive | 0.217253 | 6 | 0.220094 | -0.535093 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
31 | Volvo 142E | 0.217253 | 4 | -0.885292 | -0.549678 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
5 | Valiant | -0.330287 | 6 | -0.046167 | -0.608019 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
20 | Toyota Corona | 0.233846 | 4 | -0.892553 | -0.724700 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
8 | Merc 230 | 0.449543 | 4 | -0.725535 | -0.753870 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
2 | Datsun 710 | 0.449543 | 4 | -0.990182 | -0.783040 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
26 | Porsche 914-2 | 0.980492 | 4 | -0.890939 | -0.812211 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
25 | Fiat X1-9 | 1.196190 | 4 | -1.224169 | -1.176840 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
17 | Fiat 128 | 2.042389 | 4 | -1.226589 | -1.176840 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
19 | Toyota Corolla | 2.291272 | 4 | -1.287910 | -1.191425 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
7 | Merc 240D | 0.715018 | 4 | -0.677931 | -1.235180 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
18 | Honda Civic | 1.710547 | 4 | -1.250795 | -1.381032 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
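The standard score calculation can be sketched with the standard library alone. This is an illustrative sketch rather than pyjanitor's code: standard_scores is a hypothetical helper, and the sample is just the five hp values from the head() output. It uses the sample standard deviation (n - 1 denominator), which is also the default for pandas' std(). It also checks that running the calculation on min-max scaled values gives the same scores as running it on the raw values, since min-max scaling is only a linear change of units.

```python
from statistics import mean, stdev


def standard_scores(values):
    """Return (x - mean) / sample standard deviation for every x in values."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, n - 1 in the denominator
    return [(v - mu) / sigma for v in values]


# hp for the five models shown by cars_df.head() (illustrative sample only)
hp_head = [110, 110, 93, 110, 175]
z_raw = standard_scores(hp_head)

# Min-max scale the same values to the range 0..1, then standardize again
lo, hi = min(hp_head), max(hp_head)
hp_scaled = [(v - lo) / (hi - lo) for v in hp_head]
z_scaled = standard_scores(hp_scaled)

# The two sets of scores agree up to floating-point rounding
for a, b in zip(z_raw, z_scaled):
    print(f"{a:+.4f}  {b:+.4f}")
```

By construction the scores have mean 0 and standard deviation 1, so a value of 2 always means "two standard deviations above average", whatever the original units were.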
cars_df = cars_df.sort_values('mpg', ascending=False)
sns.barplot(y='model', x='mpg', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of mpg standard scores by model, sorted in descending order]
cars_df = cars_df.sort_values('disp_cc', ascending=False)
sns.barplot(y='model', x='disp_cc', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of disp_cc standard scores by model, sorted in descending order]
cars_df = cars_df.sort_values('hp', ascending=False)
sns.barplot(y='model', x='hp', data=cars_df, color='b', orient='h')
[Figure: horizontal bar chart of hp standard scores by model, sorted in descending order]