# import required packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# read data
df = pd.read_csv('./data/gdp_china_encoded.csv')
# show the first 5 rows
df.head()
index | year | gdp | pop | finv | trade | fexpen | uinc | prov_hn | prov_js | prov_sd | prov_zj |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1.074125 | 8.650000 | 0.314513 | 1.408147 | 0.108032 | 0.976157 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 2001 | 1.203925 | 8.733000 | 0.348443 | 1.501391 | 0.132133 | 1.041519 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 2002 | 1.350242 | 8.842000 | 0.385078 | 1.830169 | 0.152108 | 1.113720 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 2003 | 1.584464 | 8.963000 | 0.481320 | 2.346735 | 0.169563 | 1.238043 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 2004 | 1.886462 | 9.052298 | 0.587002 | 2.955899 | 0.185295 | 1.362765 | 0.0 | 0.0 | 0.0 | 0.0 |
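The prov_* columns are one-hot encoded province indicators (presumably hn, js, sd, and zj for Henan, Jiangsu, Shandong, and Zhejiang, with a fifth, baseline province encoded as all zeros). Since they are already binary, they will be left out of the scaling below.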
# separate the features from the target variable
X = df.drop(['gdp'], axis=1)
y = df['gdp']
# hold out 30% of the rows for testing, fixing the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
X_train
index | year | pop | finv | trade | fexpen | uinc | prov_hn | prov_js | prov_sd | prov_zj |
---|---|---|---|---|---|---|---|---|---|---|
66 | 2009 | 5.276 | 1.074232 | 1.282390 | 0.265335 | 2.461081 | 0.0 | 0.0 | 0.0 | 1.0 |
54 | 2016 | 9.947 | 5.332294 | 1.547657 | 0.875521 | 3.401208 | 0.0 | 0.0 | 1.0 | 0.0 |
36 | 2017 | 7.656 | 5.327700 | 3.999750 | 1.062103 | 4.362180 | 0.0 | 1.0 | 0.0 | 0.0 |
45 | 2007 | 9.367 | 1.253770 | 0.931296 | 0.226185 | 1.426470 | 0.0 | 0.0 | 1.0 | 0.0 |
52 | 2014 | 9.789 | 4.249555 | 1.701122 | 0.717731 | 2.922194 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
75 | 2018 | 5.155 | 3.169770 | 2.851160 | 0.862953 | 5.557430 | 0.0 | 0.0 | 0.0 | 1.0 |
9 | 2009 | 10.130 | 1.293312 | 4.174383 | 0.433437 | 2.157472 | 0.0 | 0.0 | 0.0 | 0.0 |
72 | 2015 | 5.539 | 2.732332 | 2.159908 | 0.664598 | 4.371448 | 0.0 | 0.0 | 0.0 | 1.0 |
12 | 2012 | 10.594 | 1.875150 | 6.211629 | 0.738786 | 3.022671 | 0.0 | 0.0 | 0.0 | 0.0 |
37 | 2018 | 7.723 | 5.327680 | 4.379350 | 1.165735 | 4.720000 | 0.0 | 1.0 | 0.0 | 0.0 |
66 rows × 10 columns
The terms standardize and normalize are often used interchangeably in data preprocessing, although in statistics the latter term also carries other connotations. Normalization transforms the data to a smaller, common range such as [−1, 1] or [0, 1]. A linear regression model does not strictly require it, but it can still help, for instance by putting the coefficients of differently scaled features on a comparable footing.
Note that scikit-learn's Normalizer is a different operation: it rescales each sample individually to unit norm, so it operates on the rows rather than the columns, and it applies l2 normalization by default.
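The distinction is easy to see on a toy array. A minimal sketch (added here for illustration; the array is made up) contrasting the two transformers:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# MinMaxScaler is column-wise: each feature is rescaled to [0, 1]
print(MinMaxScaler().fit_transform(A))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]

# Normalizer is row-wise: each sample is divided by its own l2 norm
print(Normalizer().fit_transform(A))
# every row now has unit Euclidean length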
Reference: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
MinMaxScaler
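MinMaxScaler rescales each feature column independently, using the column's minimum and maximum learned from the training data:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

so the smallest training value of a feature maps to 0 and the largest to 1.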
Example:
# slice the continuous features from the training data
X_train_continuous = X_train.loc[:,'year':'uinc']
X_train_continuous
index | year | pop | finv | trade | fexpen | uinc |
---|---|---|---|---|---|---|
66 | 2009 | 5.276 | 1.074232 | 1.282390 | 0.265335 | 2.461081 |
54 | 2016 | 9.947 | 5.332294 | 1.547657 | 0.875521 | 3.401208 |
36 | 2017 | 7.656 | 5.327700 | 3.999750 | 1.062103 | 4.362180 |
45 | 2007 | 9.367 | 1.253770 | 0.931296 | 0.226185 | 1.426470 |
52 | 2014 | 9.789 | 4.249555 | 1.701122 | 0.717731 | 2.922194 |
... | ... | ... | ... | ... | ... | ... |
75 | 2018 | 5.155 | 3.169770 | 2.851160 | 0.862953 | 5.557430 |
9 | 2009 | 10.130 | 1.293312 | 4.174383 | 0.433437 | 2.157472 |
72 | 2015 | 5.539 | 2.732332 | 2.159908 | 0.664598 | 4.371448 |
12 | 2012 | 10.594 | 1.875150 | 6.211629 | 0.738786 | 3.022671 |
37 | 2018 | 7.723 | 5.327680 | 4.379350 | 1.165735 | 4.720000 |
66 rows × 6 columns
# learn the scaling parameters (per-column min and max) from the training dataset
min_max_scaler = MinMaxScaler().fit(X_train_continuous)
# transform the training dataset to the range [0, 1]
X_train_continuous_scaled = min_max_scaler.transform(X_train_continuous)
# convert it into a dataframe, keeping the original index and column names
X_train_continuous_scaled = pd.DataFrame(X_train_continuous_scaled, index=X_train_continuous.index,
                                         columns=X_train_continuous.columns)
X_train_continuous_scaled
index | year | pop | finv | trade | fexpen | uinc |
---|---|---|---|---|---|---|
66 | 0.500000 | 0.094319 | 0.173947 | 0.176927 | 0.145251 | 0.390579 |
54 | 0.888889 | 0.833518 | 0.964883 | 0.214073 | 0.544119 | 0.575614 |
36 | 0.944444 | 0.470961 | 0.964029 | 0.557440 | 0.666084 | 0.764752 |
45 | 0.388889 | 0.741731 | 0.207296 | 0.127763 | 0.119660 | 0.186948 |
52 | 0.777778 | 0.808514 | 0.763764 | 0.235562 | 0.440974 | 0.481335 |
... | ... | ... | ... | ... | ... | ... |
75 | 1.000000 | 0.075170 | 0.563194 | 0.396602 | 0.535903 | 1.000000 |
9 | 0.500000 | 0.862478 | 0.214641 | 0.581894 | 0.255137 | 0.330823 |
72 | 0.833333 | 0.135939 | 0.481940 | 0.299806 | 0.406242 | 0.766576 |
12 | 0.666667 | 0.935908 | 0.322718 | 0.867170 | 0.454738 | 0.501111 |
37 | 1.000000 | 0.481564 | 0.964026 | 0.610595 | 0.733827 | 0.835178 |
66 rows × 6 columns
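As a quick sanity check (an addition for illustration, not a cell from the original notebook), the fitted scaler stores the per-column minima and maxima it learned, and every scaled training column should span exactly [0, 1]:
# parameters learned during fit, one entry per continuous column
print(min_max_scaler.data_min_)
print(min_max_scaler.data_max_)
# each scaled training column should now have min 0 and max 1
print(X_train_continuous_scaled.min().values)   # expected: all 0.0
print(X_train_continuous_scaled.max().values)   # expected: all 1.0
The fit and transform steps could be fused with fit_transform, but keeping the fitted scaler object around is exactly what lets us reuse the same parameters on the test set below.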
# display the full scaled training dataset: scaled continuous features plus the untouched province dummies
X_train_scaled = X_train.copy()
X_train_scaled.loc[:,'year':'uinc'] = X_train_continuous_scaled
X_train_scaled
index | year | pop | finv | trade | fexpen | uinc | prov_hn | prov_js | prov_sd | prov_zj |
---|---|---|---|---|---|---|---|---|---|---|
66 | 0.500000 | 0.094319 | 0.173947 | 0.176927 | 0.145251 | 0.390579 | 0.0 | 0.0 | 0.0 | 1.0 |
54 | 0.888889 | 0.833518 | 0.964883 | 0.214073 | 0.544119 | 0.575614 | 0.0 | 0.0 | 1.0 | 0.0 |
36 | 0.944444 | 0.470961 | 0.964029 | 0.557440 | 0.666084 | 0.764752 | 0.0 | 1.0 | 0.0 | 0.0 |
45 | 0.388889 | 0.741731 | 0.207296 | 0.127763 | 0.119660 | 0.186948 | 0.0 | 0.0 | 1.0 | 0.0 |
52 | 0.777778 | 0.808514 | 0.763764 | 0.235562 | 0.440974 | 0.481335 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
75 | 1.000000 | 0.075170 | 0.563194 | 0.396602 | 0.535903 | 1.000000 | 0.0 | 0.0 | 0.0 | 1.0 |
9 | 0.500000 | 0.862478 | 0.214641 | 0.581894 | 0.255137 | 0.330823 | 0.0 | 0.0 | 0.0 | 0.0 |
72 | 0.833333 | 0.135939 | 0.481940 | 0.299806 | 0.406242 | 0.766576 | 0.0 | 0.0 | 0.0 | 1.0 |
12 | 0.666667 | 0.935908 | 0.322718 | 0.867170 | 0.454738 | 0.501111 | 0.0 | 0.0 | 0.0 | 0.0 |
37 | 1.000000 | 0.481564 | 0.964026 | 0.610595 | 0.733827 | 0.835178 | 0.0 | 1.0 | 0.0 | 0.0 |
66 rows × 10 columns
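The slice, scale, and reassemble pattern above can also be written more compactly with scikit-learn's ColumnTransformer, which applies the scaler to the named columns and passes the dummy columns through untouched. A minimal sketch of the equivalent pipeline (an alternative shown for reference, not a step in this notebook):
from sklearn.compose import ColumnTransformer

continuous_cols = ['year', 'pop', 'finv', 'trade', 'fexpen', 'uinc']
column_scaler = ColumnTransformer(
    [('minmax', MinMaxScaler(), continuous_cols)],
    remainder='passthrough')   # leave the province dummies as-is
X_train_scaled_arr = column_scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled_arr = column_scaler.transform(X_test)        # reuse the fitted parameters
Note that ColumnTransformer returns a NumPy array (scaled columns first, passthrough columns after), so the dataframe would need to be rebuilt if column labels are required downstream.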
# slice the continuous features from the testing data
X_test_continuous = X_test.loc[:,'year':'uinc']
# transform the testing dataset to the range [0, 1] using the scaler fitted on the training data
X_test_continuous_scaled = min_max_scaler.transform(X_test_continuous)
# convert it into a dataframe, keeping the original index and column names
X_test_continuous_scaled = pd.DataFrame(X_test_continuous_scaled, index=X_test_continuous.index,
                                        columns=X_test_continuous.columns)
X_test_continuous_scaled
index | year | pop | finv | trade | fexpen | uinc |
---|---|---|---|---|---|---|
40 | 0.111111 | 0.696629 | 0.039111 | 0.036688 | 0.028066 | 0.056056 |
31 | 0.666667 | 0.512739 | 0.547526 | 0.481719 | 0.431193 | 0.490291 |
46 | 0.444444 | 0.749644 | 0.261131 | 0.151409 | 0.148605 | 0.227113 |
58 | 0.055556 | 0.007754 | 0.027068 | 0.035371 | 0.010851 | 0.112156 |
77 | 0.055556 | 0.771483 | 0.003089 | 0.000578 | 0.005052 | 0.009864 |
49 | 0.611111 | 0.784460 | 0.471284 | 0.210696 | 0.298783 | 0.354778 |
87 | 0.611111 | 0.745055 | 0.304467 | 0.026858 | 0.249544 | 0.264300 |
44 | 0.333333 | 0.732553 | 0.180803 | 0.103640 | 0.091655 | 0.146158 |
88 | 0.666667 | 0.747903 | 0.372843 | 0.043088 | 0.299066 | 0.308541 |
90 | 0.777778 | 0.752651 | 0.546188 | 0.053241 | 0.365891 | 0.372103 |
67 | 0.555556 | 0.121380 | 0.204294 | 0.237688 | 0.181500 | 0.444669 |
27 | 0.444444 | 0.487735 | 0.258616 | 0.378848 | 0.184089 | 0.273840 |
74 | 0.944444 | 0.062035 | 0.563162 | 0.355903 | 0.464050 | 0.915100 |
84 | 0.444444 | 0.751543 | 0.169272 | 0.014353 | 0.120951 | 0.166605 |
32 | 0.722222 | 0.515746 | 0.650043 | 0.475029 | 0.481579 | 0.527854 |
55 | 0.944444 | 0.732553 | 0.999799 | 0.248336 | 0.577012 | 0.630277 |
39 | 0.055556 | 0.690141 | 0.026208 | 0.030915 | 0.021080 | 0.045954 |
10 | 0.555556 | 0.911695 | 0.264619 | 0.741384 | 0.326203 | 0.376546 |
2 | 0.111111 | 0.658649 | 0.045937 | 0.253633 | 0.071237 | 0.125392 |
38 | 0.000000 | 0.683336 | 0.021424 | 0.026322 | 0.011883 | 0.033926 |
53 | 0.833333 | 0.817693 | 0.871813 | 0.207203 | 0.511095 | 0.527062 |
73 | 0.888889 | 0.144010 | 0.536787 | 0.308322 | 0.427701 | 0.835909 |
19 | 0.000000 | 0.418895 | 0.022146 | 0.050256 | 0.010458 | 0.040032 |
89 | 0.722222 | 0.749011 | 0.458983 | 0.049350 | 0.336712 | 0.334089 |
94 | 1.000000 | 0.740624 | 0.801175 | 0.074534 | 0.574353 | 0.533536 |
35 | 0.888889 | 0.525241 | 0.896903 | 0.468056 | 0.624309 | 0.696451 |
33 | 0.777778 | 0.519069 | 0.753419 | 0.482110 | 0.525635 | 0.582191 |
48 | 0.555556 | 0.776705 | 0.406844 | 0.176662 | 0.242760 | 0.298763 |
70 | 0.722222 | 0.129451 | 0.360436 | 0.288562 | 0.281029 | 0.635990 |
# display the full scaled testing dataset: scaled continuous features plus the untouched province dummies
X_test_scaled = X_test.copy()
X_test_scaled.loc[:,'year':'uinc'] = X_test_continuous_scaled
X_test_scaled
index | year | pop | finv | trade | fexpen | uinc | prov_hn | prov_js | prov_sd | prov_zj |
---|---|---|---|---|---|---|---|---|---|---|
40 | 0.111111 | 0.696629 | 0.039111 | 0.036688 | 0.028066 | 0.056056 | 0.0 | 0.0 | 1.0 | 0.0 |
31 | 0.666667 | 0.512739 | 0.547526 | 0.481719 | 0.431193 | 0.490291 | 0.0 | 1.0 | 0.0 | 0.0 |
46 | 0.444444 | 0.749644 | 0.261131 | 0.151409 | 0.148605 | 0.227113 | 0.0 | 0.0 | 1.0 | 0.0 |
58 | 0.055556 | 0.007754 | 0.027068 | 0.035371 | 0.010851 | 0.112156 | 0.0 | 0.0 | 0.0 | 1.0 |
77 | 0.055556 | 0.771483 | 0.003089 | 0.000578 | 0.005052 | 0.009864 | 1.0 | 0.0 | 0.0 | 0.0 |
49 | 0.611111 | 0.784460 | 0.471284 | 0.210696 | 0.298783 | 0.354778 | 0.0 | 0.0 | 1.0 | 0.0 |
87 | 0.611111 | 0.745055 | 0.304467 | 0.026858 | 0.249544 | 0.264300 | 1.0 | 0.0 | 0.0 | 0.0 |
44 | 0.333333 | 0.732553 | 0.180803 | 0.103640 | 0.091655 | 0.146158 | 0.0 | 0.0 | 1.0 | 0.0 |
88 | 0.666667 | 0.747903 | 0.372843 | 0.043088 | 0.299066 | 0.308541 | 1.0 | 0.0 | 0.0 | 0.0 |
90 | 0.777778 | 0.752651 | 0.546188 | 0.053241 | 0.365891 | 0.372103 | 1.0 | 0.0 | 0.0 | 0.0 |
67 | 0.555556 | 0.121380 | 0.204294 | 0.237688 | 0.181500 | 0.444669 | 0.0 | 0.0 | 0.0 | 1.0 |
27 | 0.444444 | 0.487735 | 0.258616 | 0.378848 | 0.184089 | 0.273840 | 0.0 | 1.0 | 0.0 | 0.0 |
74 | 0.944444 | 0.062035 | 0.563162 | 0.355903 | 0.464050 | 0.915100 | 0.0 | 0.0 | 0.0 | 1.0 |
84 | 0.444444 | 0.751543 | 0.169272 | 0.014353 | 0.120951 | 0.166605 | 1.0 | 0.0 | 0.0 | 0.0 |
32 | 0.722222 | 0.515746 | 0.650043 | 0.475029 | 0.481579 | 0.527854 | 0.0 | 1.0 | 0.0 | 0.0 |
55 | 0.944444 | 0.732553 | 0.999799 | 0.248336 | 0.577012 | 0.630277 | 0.0 | 0.0 | 1.0 | 0.0 |
39 | 0.055556 | 0.690141 | 0.026208 | 0.030915 | 0.021080 | 0.045954 | 0.0 | 0.0 | 1.0 | 0.0 |
10 | 0.555556 | 0.911695 | 0.264619 | 0.741384 | 0.326203 | 0.376546 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.111111 | 0.658649 | 0.045937 | 0.253633 | 0.071237 | 0.125392 | 0.0 | 0.0 | 0.0 | 0.0 |
38 | 0.000000 | 0.683336 | 0.021424 | 0.026322 | 0.011883 | 0.033926 | 0.0 | 0.0 | 1.0 | 0.0 |
53 | 0.833333 | 0.817693 | 0.871813 | 0.207203 | 0.511095 | 0.527062 | 0.0 | 0.0 | 1.0 | 0.0 |
73 | 0.888889 | 0.144010 | 0.536787 | 0.308322 | 0.427701 | 0.835909 | 0.0 | 0.0 | 0.0 | 1.0 |
19 | 0.000000 | 0.418895 | 0.022146 | 0.050256 | 0.010458 | 0.040032 | 0.0 | 1.0 | 0.0 | 0.0 |
89 | 0.722222 | 0.749011 | 0.458983 | 0.049350 | 0.336712 | 0.334089 | 1.0 | 0.0 | 0.0 | 0.0 |
94 | 1.000000 | 0.740624 | 0.801175 | 0.074534 | 0.574353 | 0.533536 | 1.0 | 0.0 | 0.0 | 0.0 |
35 | 0.888889 | 0.525241 | 0.896903 | 0.468056 | 0.624309 | 0.696451 | 0.0 | 1.0 | 0.0 | 0.0 |
33 | 0.777778 | 0.519069 | 0.753419 | 0.482110 | 0.525635 | 0.582191 | 0.0 | 1.0 | 0.0 | 0.0 |
48 | 0.555556 | 0.776705 | 0.406844 | 0.176662 | 0.242760 | 0.298763 | 0.0 | 0.0 | 1.0 | 0.0 |
70 | 0.722222 | 0.129451 | 0.360436 | 0.288562 | 0.281029 | 0.635990 | 0.0 | 0.0 | 0.0 | 1.0 |
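Because the scaling parameters come from the training set only, scaled test values are not guaranteed to land inside [0, 1]: a test row with, say, a larger finv than any training row would map above 1. If a downstream model requires strictly bounded inputs, one option (sketched here as an addition, assuming clipping is acceptable for the use case) is to clip after transforming:
import numpy as np
# force any out-of-range test values back into [0, 1]
X_test_continuous_clipped = np.clip(X_test_continuous_scaled, 0.0, 1.0)
Recent scikit-learn releases also accept clip=True directly in the MinMaxScaler constructor.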
import joblib
# persist the fitted scaler so the exact same transformation can be reapplied later
joblib.dump(min_max_scaler, 'mm_scaler')
['mm_scaler']
import joblib
# reload the saved scaler and apply it to the test data again
mm_scaler = joblib.load('mm_scaler')
X_test_continuous_scaled2 = mm_scaler.transform(X_test_continuous)
X_test_continuous_scaled2
array([[1.11111111e-01, 6.96629213e-01, 3.91109924e-02, 3.66880369e-02, 2.80658336e-02, 5.60560888e-02],
       [6.66666667e-01, 5.12739357e-01, 5.47525660e-01, 4.81719398e-01, 4.31192786e-01, 4.90290710e-01],
       [4.44444444e-01, 7.49643931e-01, 2.61131077e-01, 1.51408782e-01, 1.48605435e-01, 2.27112677e-01],
       [5.55555556e-02, 7.75439152e-03, 2.70675105e-02, 3.53714159e-02, 1.08511200e-02, 1.12155675e-01],
       [5.55555556e-02, 7.71482830e-01, 3.08939634e-03, 5.78018498e-04, 5.05165395e-03, 9.86379321e-03],
       [6.11111111e-01, 7.84459566e-01, 4.71284143e-01, 2.10695512e-01, 2.98782975e-01, 3.54778102e-01],
       [6.11111111e-01, 7.45054597e-01, 3.04466957e-01, 2.68583659e-02, 2.49544384e-01, 2.64299509e-01],
       [3.33333333e-01, 7.32552619e-01, 1.80803243e-01, 1.03640173e-01, 9.16553580e-02, 1.46157577e-01],
       [6.66666667e-01, 7.47903149e-01, 3.72842512e-01, 4.30876725e-02, 2.99066019e-01, 3.08540932e-01],
       [7.77777778e-01, 7.52650736e-01, 5.46187701e-01, 5.32412770e-02, 3.65891269e-01, 3.72102526e-01],
       [5.55555556e-01, 1.21379965e-01, 2.04293577e-01, 2.37688015e-01, 1.81500017e-01, 4.44668993e-01],
       [4.44444444e-01, 4.87735401e-01, 2.58616392e-01, 3.78847655e-01, 1.84089251e-01, 2.73839731e-01],
       [9.44444444e-01, 6.20351321e-02, 5.63162106e-01, 3.55902600e-01, 4.64050109e-01, 9.15100051e-01],
       [4.44444444e-01, 7.51542966e-01, 1.69272246e-01, 1.43526835e-02, 1.20951421e-01, 1.66604537e-01],
       [7.22222222e-01, 5.15746162e-01, 6.50043391e-01, 4.75028989e-01, 4.81578590e-01, 5.27853859e-01],
       [9.44444444e-01, 7.32552619e-01, 9.99799390e-01, 2.48335520e-01, 5.77011575e-01, 6.30277019e-01],
       [5.55555556e-02, 6.90140845e-01, 2.62082304e-02, 3.09146826e-02, 2.10799348e-02, 4.59537506e-02],
       [5.55555556e-01, 9.11694888e-01, 2.64618908e-01, 7.41384232e-01, 3.26202971e-01, 3.76545523e-01],
       [1.11111111e-01, 6.58648520e-01, 4.59367528e-02, 2.53632717e-01, 7.12369492e-02, 1.25392359e-01],
       [0.00000000e+00, 6.83335971e-01, 2.14236782e-02, 2.63224027e-02, 1.18826301e-02, 3.39259298e-02],
       [8.33333333e-01, 8.17692673e-01, 8.71812713e-01, 2.07203245e-01, 5.11094943e-01, 5.27062449e-01],
       [8.88888889e-01, 1.44010128e-01, 5.36786887e-01, 3.08322356e-01, 4.27701471e-01, 8.35909435e-01],
       [0.00000000e+00, 4.18895395e-01, 2.21456890e-02, 5.02564960e-02, 1.04576035e-02, 4.00324437e-02],
       [7.22222222e-01, 7.49010919e-01, 4.58983397e-01, 4.93503389e-02, 3.36712215e-01, 3.34089054e-01],
       [1.00000000e+00, 7.40623516e-01, 8.01174907e-01, 7.45341047e-02, 5.74353051e-01, 5.33536425e-01],
       [8.88888889e-01, 5.25241336e-01, 8.96903285e-01, 4.68055899e-01, 6.24309385e-01, 6.96451388e-01],
       [7.77777778e-01, 5.19069473e-01, 7.53418917e-01, 4.82109655e-01, 5.25635444e-01, 5.82191322e-01],
       [5.55555556e-01, 7.76705175e-01, 4.06844447e-01, 1.76661501e-01, 2.42759819e-01, 2.98763149e-01],
       [7.22222222e-01, 1.29450862e-01, 3.60436446e-01, 2.88561556e-01, 2.81028974e-01, 6.35990288e-01]])
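To confirm programmatically that the reloaded scaler reproduces the original transformation (a quick check added here, not a cell from the original notebook):
import numpy as np
# the reloaded scaler should give exactly the same result as before saving
print(np.allclose(X_test_continuous_scaled2, X_test_continuous_scaled.values))   # expected: True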