# import required packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# read data
df = pd.read_csv('./data/gdp_china_encoded.csv')
# show the first 5 rows
df.head()
index | year | gdp | pop | finv | trade | fexpen | uinc | prov_hn | prov_js | prov_sd | prov_zj |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1.074125 | 8.650000 | 0.314513 | 1.408147 | 0.108032 | 0.976157 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 2001 | 1.203925 | 8.733000 | 0.348443 | 1.501391 | 0.132133 | 1.041519 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 2002 | 1.350242 | 8.842000 | 0.385078 | 1.830169 | 0.152108 | 1.113720 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 2003 | 1.584464 | 8.963000 | 0.481320 | 2.346735 | 0.169563 | 1.238043 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 2004 | 1.886462 | 9.052298 | 0.587002 | 2.955899 | 0.185295 | 1.362765 | 0.0 | 0.0 | 0.0 | 0.0 |
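The prov_* columns are one-hot encoded province indicators (presumably hn, js, sd, and zj for Henan, Jiangsu, Shandong, and Zhejiang, with a fifth, baseline province encoded as all zeros). Since they are already binary, they will be left out of the scaling below.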
# separate the features from the target variable
X = df.drop(['gdp'], axis=1)
y = df['gdp']
# hold out 30% of the rows for testing, fixing the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
X_train
index | year | pop | finv | trade | fexpen | uinc | prov_hn | prov_js | prov_sd | prov_zj |
---|---|---|---|---|---|---|---|---|---|---|
66 | 2009 | 5.276 | 1.074232 | 1.282390 | 0.265335 | 2.461081 | 0.0 | 0.0 | 0.0 | 1.0 |
54 | 2016 | 9.947 | 5.332294 | 1.547657 | 0.875521 | 3.401208 | 0.0 | 0.0 | 1.0 | 0.0 |
36 | 2017 | 7.656 | 5.327700 | 3.999750 | 1.062103 | 4.362180 | 0.0 | 1.0 | 0.0 | 0.0 |
45 | 2007 | 9.367 | 1.253770 | 0.931296 | 0.226185 | 1.426470 | 0.0 | 0.0 | 1.0 | 0.0 |
52 | 2014 | 9.789 | 4.249555 | 1.701122 | 0.717731 | 2.922194 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
75 | 2018 | 5.155 | 3.169770 | 2.851160 | 0.862953 | 5.557430 | 0.0 | 0.0 | 0.0 | 1.0 |
9 | 2009 | 10.130 | 1.293312 | 4.174383 | 0.433437 | 2.157472 | 0.0 | 0.0 | 0.0 | 0.0 |
72 | 2015 | 5.539 | 2.732332 | 2.159908 | 0.664598 | 4.371448 | 0.0 | 0.0 | 0.0 | 1.0 |
12 | 2012 | 10.594 | 1.875150 | 6.211629 | 0.738786 | 3.022671 | 0.0 | 0.0 | 0.0 | 0.0 |
37 | 2018 | 7.723 | 5.327680 | 4.379350 | 1.165735 | 4.720000 | 0.0 | 1.0 | 0.0 | 0.0 |
66 rows × 10 columns
The terms standardize and normalize are often used interchangeably in data preprocessing, although in statistics the latter term also carries other connotations. Normalization transforms the data to a smaller, common range such as [−1, 1] or [0, 1]. A linear regression model does not strictly require it, but it can still help, for instance by putting the coefficients of differently scaled features on a comparable footing.
Note that scikit-learn's Normalizer is a different operation: it rescales each sample individually to unit norm, so it operates on the rows rather than the columns, and it applies l2 normalization by default.
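The distinction is easy to see on a toy array. A minimal sketch (added here for illustration; the array is made up) contrasting the two transformers:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# MinMaxScaler is column-wise: each feature is rescaled to [0, 1]
print(MinMaxScaler().fit_transform(A))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]

# Normalizer is row-wise: each sample is divided by its own l2 norm
print(Normalizer().fit_transform(A))
# every row now has unit Euclidean length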
Reference: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
MinMaxScaler
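MinMaxScaler rescales each feature column independently, using the column's minimum and maximum learned from the training data:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

so the smallest training value of a feature maps to 0 and the largest to 1.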
Example:
# slice the continuous features from the training data
X_train_continuous = X_train.loc[:,'year':'uinc']
X_train_continuous
index | year | pop | finv | trade | fexpen | uinc |
---|---|---|---|---|---|---|
66 | 2009 | 5.276 | 1.074232 | 1.282390 | 0.265335 | 2.461081 |
54 | 2016 | 9.947 | 5.332294 | 1.547657 | 0.875521 | 3.401208 |
36 | 2017 | 7.656 | 5.327700 | 3.999750 | 1.062103 | 4.362180 |
45 | 2007 | 9.367 | 1.253770 | 0.931296 | 0.226185 | 1.426470 |
52 | 2014 | 9.789 | 4.249555 | 1.701122 | 0.717731 | 2.922194 |
... | ... | ... | ... | ... | ... | ... |
75 | 2018 | 5.155 | 3.169770 | 2.851160 | 0.862953 | 5.557430 |
9 | 2009 | 10.130 | 1.293312 | 4.174383 | 0.433437 | 2.157472 |
72 | 2015 | 5.539 | 2.732332 | 2.159908 | 0.664598 | 4.371448 |
12 | 2012 | 10.594 | 1.875150 | 6.211629 | 0.738786 | 3.022671 |
37 | 2018 | 7.723 | 5.327680 | 4.379350 | 1.165735 | 4.720000 |
66 rows × 6 columns
# learn the scaling parameters (per-column min and max) from the training dataset
min_max_scaler = MinMaxScaler().fit(X_train_continuous)
# transform the training dataset to the range [0, 1]
X_train_continuous_scaled = min_max_scaler.transform(X_train_continuous)
# convert it into a dataframe, keeping the original index and column names
X_train_continuous_scaled = pd.DataFrame(X_train_continuous_scaled, index=X_train_continuous.index,
                                         columns=X_train_continuous.columns)
X_train_continuous_scaled
index | year | pop | finv | trade | fexpen | uinc |
---|---|---|---|---|---|---|
66 | 0.500000 | 0.094319 | 0.173947 | 0.176927 | 0.145251 | 0.390579 |
54 | 0.888889 | 0.833518 | 0.964883 | 0.214073 | 0.544119 | 0.575614 |
36 | 0.944444 | 0.470961 | 0.964029 | 0.557440 | 0.666084 | 0.764752 |
45 | 0.388889 | 0.741731 | 0.207296 | 0.127763 | 0.119660 | 0.186948 |
52 | 0.777778 | 0.808514 | 0.763764 | 0.235562 | 0.440974 | 0.481335 |
... | ... | ... | ... | ... | ... | ... |
75 | 1.000000 | 0.075170 | 0.563194 | 0.396602 | 0.535903 | 1.000000 |
9 | 0.500000 | 0.862478 | 0.214641 | 0.581894 | 0.255137 | 0.330823 |
72 | 0.833333 | 0.135939 | 0.481940 | 0.299806 | 0.406242 | 0.766576 |
12 | 0.666667 | 0.935908 | 0.322718 | 0.867170 | 0.454738 | 0.501111 |
37 | 1.000000 | 0.481564 | 0.964026 | 0.610595 | 0.733827 | 0.835178 |
66 rows × 6 columns
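As a quick sanity check (an addition for illustration, not a cell from the original notebook), the fitted scaler stores the per-column minima and maxima it learned, and every scaled training column should span exactly [0, 1]:
# parameters learned during fit, one entry per continuous column
print(min_max_scaler.data_min_)
print(min_max_scaler.data_max_)
# each scaled training column should now have min 0 and max 1
print(X_train_continuous_scaled.min().values)   # expected: all 0.0
print(X_train_continuous_scaled.max().values)   # expected: all 1.0
The fit and transform steps could be fused with fit_transform, but keeping the fitted scaler object around is exactly what lets us reuse the same parameters on the test set below.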
# display the full scaled training dataset: scaled continuous features plus the untouched province dummies
X_train_scaled = X_train.copy()
X_train_scaled.loc[:,'year':'uinc'] = X_train_continuous_scaled
X_train_scaled
index | year | pop | finv | trade | fexpen | uinc | prov_hn | prov_js | prov_sd | prov_zj |
---|---|---|---|---|---|---|---|---|---|---|
66 | 0.500000 | 0.094319 | 0.173947 | 0.176927 | 0.145251 | 0.390579 | 0.0 | 0.0 | 0.0 | 1.0 |
54 | 0.888889 | 0.833518 | 0.964883 | 0.214073 | 0.544119 | 0.575614 | 0.0 | 0.0 | 1.0 | 0.0 |
36 | 0.944444 | 0.470961 | 0.964029 | 0.557440 | 0.666084 | 0.764752 | 0.0 | 1.0 | 0.0 | 0.0 |
45 | 0.388889 | 0.741731 | 0.207296 | 0.127763 | 0.119660 | 0.186948 | 0.0 | 0.0 | 1.0 | 0.0 |
52 | 0.777778 | 0.808514 | 0.763764 | 0.235562 | 0.440974 | 0.481335 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
75 | 1.000000 | 0.075170 | 0.563194 | 0.396602 | 0.535903 | 1.000000 | 0.0 | 0.0 | 0.0 | 1.0 |
9 | 0.500000 | 0.862478 | 0.214641 | 0.581894 | 0.255137 | 0.330823 | 0.0 | 0.0 | 0.0 | 0.0 |
72 | 0.833333 | 0.135939 | 0.481940 | 0.299806 | 0.406242 | 0.766576 | 0.0 | 0.0 | 0.0 | 1.0 |
12 | 0.666667 | 0.935908 | 0.322718 | 0.867170 | 0.454738 | 0.501111 | 0.0 | 0.0 | 0.0 | 0.0 |
37 | 1.000000 | 0.481564 | 0.964026 | 0.610595 | 0.733827 | 0.835178 | 0.0 | 1.0 | 0.0 | 0.0 |
66 rows × 10 columns
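The slice, scale, and reassemble pattern above can also be written more compactly with scikit-learn's ColumnTransformer, which applies the scaler to the named columns and passes the dummy columns through untouched. A minimal sketch of the equivalent pipeline (an alternative shown for reference, not a step in this notebook):
from sklearn.compose import ColumnTransformer

continuous_cols = ['year', 'pop', 'finv', 'trade', 'fexpen', 'uinc']
column_scaler = ColumnTransformer(
    [('minmax', MinMaxScaler(), continuous_cols)],
    remainder='passthrough')   # leave the province dummies as-is
X_train_scaled_arr = column_scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled_arr = column_scaler.transform(X_test)        # reuse the fitted parameters
Note that ColumnTransformer returns a NumPy array (scaled columns first, passthrough columns after), so the dataframe would need to be rebuilt if column labels are required downstream.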
# slice the continuous features from the testing data
X_test_continuous = X_test.loc[:,'year':'uinc']
# transform the testing dataset to the range [0, 1] using the scaler fitted on the training data
X_test_continuous_scaled = min_max_scaler.transform(X_test_continuous)
# convert it into a dataframe, keeping the original index and column names
X_test_continuous_scaled = pd.DataFrame(X_test_continuous_scaled, index=X_test_continuous.index,
                                        columns=X_test_continuous.columns)
X_test_continuous_scaled
index | year | pop | finv | trade | fexpen | uinc |
---|---|---|---|---|---|---|
40 | 0.111111 | 0.696629 | 0.039111 | 0.036688 | 0.028066 | 0.056056 |
31 | 0.666667 | 0.512739 | 0.547526 | 0.481719 | 0.431193 | 0.490291 |
46 | 0.444444 | 0.749644 | 0.261131 | 0.151409 | 0.148605 | 0.227113 |
58 | 0.055556 | 0.007754 | 0.027068 | 0.035371 | 0.010851 | 0.112156 |
77 | 0.055556 | 0.771483 | 0.003089 | 0.000578 | 0.005052 | 0.009864 |
49 | 0.611111 | 0.784460 | 0.471284 | 0.210696 | 0.298783 | 0.354778 |
87 | 0.611111 | 0.745055 | 0.304467 | 0.026858 | 0.249544 | 0.264300 |
44 | 0.333333 | 0.732553 | 0.180803 | 0.103640 | 0.091655 | 0.146158 |
88 | 0.666667 | 0.747903 | 0.372843 | 0.043088 | 0.299066 | 0.308541 |
90 | 0.777778 | 0.752651 | 0.546188 | 0.053241 | 0.365891 | 0.372103 |
67 | 0.555556 | 0.121380 | 0.204294 | 0.237688 | 0.181500 | 0.444669 |
27 | 0.444444 | 0.487735 | 0.258616 | 0.378848 | 0.184089 | 0.273840 |
74 | 0.944444 | 0.062035 | 0.563162 | 0.355903 | 0.464050 | 0.915100 |
84 | 0.444444 | 0.751543 | 0.169272 | 0.014353 | 0.120951 | 0.166605 |
32 | 0.722222 | 0.515746 | 0.650043 | 0.475029 | 0.481579 | 0.527854 |
55 | 0.944444 | 0.732553 | 0.999799 | 0.248336 | 0.577012 | 0.630277 |
39 | 0.055556 | 0.690141 | 0.026208 | 0.030915 | 0.021080 | 0.045954 |
10 | 0.555556 | 0.911695 | 0.264619 | 0.741384 | 0.326203 | 0.376546 |
2 | 0.111111 | 0.658649 | 0.045937 | 0.253633 | 0.071237 | 0.125392 |
38 | 0.000000 | 0.683336 | 0.021424 | 0.026322 | 0.011883 | 0.033926 |
53 | 0.833333 | 0.817693 | 0.871813 | 0.207203 | 0.511095 | 0.527062 |
73 | 0.888889 | 0.144010 | 0.536787 | 0.308322 | 0.427701 | 0.835909 |
19 | 0.000000 | 0.418895 | 0.022146 | 0.050256 | 0.010458 | 0.040032 |
89 | 0.722222 | 0.749011 | 0.458983 | 0.049350 | 0.336712 | 0.334089 |
94 | 1.000000 | 0.740624 | 0.801175 | 0.074534 | 0.574353 | 0.533536 |
35 | 0.888889 | 0.525241 | 0.896903 | 0.468056 | 0.624309 | 0.696451 |
33 | 0.777778 | 0.519069 | 0.753419 | 0.482110 | 0.525635 | 0.582191 |
48 | 0.555556 | 0.776705 | 0.406844 | 0.176662 | 0.242760 | 0.298763 |
70 | 0.722222 | 0.129451 | 0.360436 | 0.288562 | 0.281029 | 0.635990 |
# display the full scaled testing dataset: scaled continuous features plus the untouched province dummies
X_test_scaled = X_test.copy()
X_test_scaled.loc[:,'year':'uinc'] = X_test_continuous_scaled
X_test_scaled
index | year | pop | finv | trade | fexpen | uinc | prov_hn | prov_js | prov_sd | prov_zj |
---|---|---|---|---|---|---|---|---|---|---|
40 | 0.111111 | 0.696629 | 0.039111 | 0.036688 | 0.028066 | 0.056056 | 0.0 | 0.0 | 1.0 | 0.0 |
31 | 0.666667 | 0.512739 | 0.547526 | 0.481719 | 0.431193 | 0.490291 | 0.0 | 1.0 | 0.0 | 0.0 |
46 | 0.444444 | 0.749644 | 0.261131 | 0.151409 | 0.148605 | 0.227113 | 0.0 | 0.0 | 1.0 | 0.0 |
58 | 0.055556 | 0.007754 | 0.027068 | 0.035371 | 0.010851 | 0.112156 | 0.0 | 0.0 | 0.0 | 1.0 |
77 | 0.055556 | 0.771483 | 0.003089 | 0.000578 | 0.005052 | 0.009864 | 1.0 | 0.0 | 0.0 | 0.0 |
49 | 0.611111 | 0.784460 | 0.471284 | 0.210696 | 0.298783 | 0.354778 | 0.0 | 0.0 | 1.0 | 0.0 |
87 | 0.611111 | 0.745055 | 0.304467 | 0.026858 | 0.249544 | 0.264300 | 1.0 | 0.0 | 0.0 | 0.0 |
44 | 0.333333 | 0.732553 | 0.180803 | 0.103640 | 0.091655 | 0.146158 | 0.0 | 0.0 | 1.0 | 0.0 |
88 | 0.666667 | 0.747903 | 0.372843 | 0.043088 | 0.299066 | 0.308541 | 1.0 | 0.0 | 0.0 | 0.0 |
90 | 0.777778 | 0.752651 | 0.546188 | 0.053241 | 0.365891 | 0.372103 | 1.0 | 0.0 | 0.0 | 0.0 |
67 | 0.555556 | 0.121380 | 0.204294 | 0.237688 | 0.181500 | 0.444669 | 0.0 | 0.0 | 0.0 | 1.0 |
27 | 0.444444 | 0.487735 | 0.258616 | 0.378848 | 0.184089 | 0.273840 | 0.0 | 1.0 | 0.0 | 0.0 |
74 | 0.944444 | 0.062035 | 0.563162 | 0.355903 | 0.464050 | 0.915100 | 0.0 | 0.0 | 0.0 | 1.0 |
84 | 0.444444 | 0.751543 | 0.169272 | 0.014353 | 0.120951 | 0.166605 | 1.0 | 0.0 | 0.0 | 0.0 |
32 | 0.722222 | 0.515746 | 0.650043 | 0.475029 | 0.481579 | 0.527854 | 0.0 | 1.0 | 0.0 | 0.0 |
55 | 0.944444 | 0.732553 | 0.999799 | 0.248336 | 0.577012 | 0.630277 | 0.0 | 0.0 | 1.0 | 0.0 |
39 | 0.055556 | 0.690141 | 0.026208 | 0.030915 | 0.021080 | 0.045954 | 0.0 | 0.0 | 1.0 | 0.0 |
10 | 0.555556 | 0.911695 | 0.264619 | 0.741384 | 0.326203 | 0.376546 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.111111 | 0.658649 | 0.045937 | 0.253633 | 0.071237 | 0.125392 | 0.0 | 0.0 | 0.0 | 0.0 |
38 | 0.000000 | 0.683336 | 0.021424 | 0.026322 | 0.011883 | 0.033926 | 0.0 | 0.0 | 1.0 | 0.0 |
53 | 0.833333 | 0.817693 | 0.871813 | 0.207203 | 0.511095 | 0.527062 | 0.0 | 0.0 | 1.0 | 0.0 |
73 | 0.888889 | 0.144010 | 0.536787 | 0.308322 | 0.427701 | 0.835909 | 0.0 | 0.0 | 0.0 | 1.0 |
19 | 0.000000 | 0.418895 | 0.022146 | 0.050256 | 0.010458 | 0.040032 | 0.0 | 1.0 | 0.0 | 0.0 |
89 | 0.722222 | 0.749011 | 0.458983 | 0.049350 | 0.336712 | 0.334089 | 1.0 | 0.0 | 0.0 | 0.0 |
94 | 1.000000 | 0.740624 | 0.801175 | 0.074534 | 0.574353 | 0.533536 | 1.0 | 0.0 | 0.0 | 0.0 |
35 | 0.888889 | 0.525241 | 0.896903 | 0.468056 | 0.624309 | 0.696451 | 0.0 | 1.0 | 0.0 | 0.0 |
33 | 0.777778 | 0.519069 | 0.753419 | 0.482110 | 0.525635 | 0.582191 | 0.0 | 1.0 | 0.0 | 0.0 |
48 | 0.555556 | 0.776705 | 0.406844 | 0.176662 | 0.242760 | 0.298763 | 0.0 | 0.0 | 1.0 | 0.0 |
70 | 0.722222 | 0.129451 | 0.360436 | 0.288562 | 0.281029 | 0.635990 | 0.0 | 0.0 | 0.0 | 1.0 |
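Because the scaling parameters come from the training set only, scaled test values are not guaranteed to land inside [0, 1]: a test row with, say, a larger finv than any training row would map above 1. If a downstream model requires strictly bounded inputs, one option (sketched here as an addition, assuming clipping is acceptable for the use case) is to clip after transforming:
import numpy as np
# force any out-of-range test values back into [0, 1]
X_test_continuous_clipped = np.clip(X_test_continuous_scaled, 0.0, 1.0)
Recent scikit-learn releases also accept clip=True directly in the MinMaxScaler constructor.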
import joblib
# persist the fitted scaler so the exact same transformation can be reapplied later
joblib.dump(min_max_scaler, 'mm_scaler')
['mm_scaler']
import joblib
# reload the saved scaler and apply it to the test data again
mm_scaler = joblib.load('mm_scaler')
X_test_continuous_scaled2 = mm_scaler.transform(X_test_continuous)
X_test_continuous_scaled2
array([[1.11111111e-01, 6.96629213e-01, 3.91109924e-02, 3.66880369e-02, 2.80658336e-02, 5.60560888e-02],
       [6.66666667e-01, 5.12739357e-01, 5.47525660e-01, 4.81719398e-01, 4.31192786e-01, 4.90290710e-01],
       [4.44444444e-01, 7.49643931e-01, 2.61131077e-01, 1.51408782e-01, 1.48605435e-01, 2.27112677e-01],
       [5.55555556e-02, 7.75439152e-03, 2.70675105e-02, 3.53714159e-02, 1.08511200e-02, 1.12155675e-01],
       [5.55555556e-02, 7.71482830e-01, 3.08939634e-03, 5.78018498e-04, 5.05165395e-03, 9.86379321e-03],
       [6.11111111e-01, 7.84459566e-01, 4.71284143e-01, 2.10695512e-01, 2.98782975e-01, 3.54778102e-01],
       [6.11111111e-01, 7.45054597e-01, 3.04466957e-01, 2.68583659e-02, 2.49544384e-01, 2.64299509e-01],
       [3.33333333e-01, 7.32552619e-01, 1.80803243e-01, 1.03640173e-01, 9.16553580e-02, 1.46157577e-01],
       [6.66666667e-01, 7.47903149e-01, 3.72842512e-01, 4.30876725e-02, 2.99066019e-01, 3.08540932e-01],
       [7.77777778e-01, 7.52650736e-01, 5.46187701e-01, 5.32412770e-02, 3.65891269e-01, 3.72102526e-01],
       [5.55555556e-01, 1.21379965e-01, 2.04293577e-01, 2.37688015e-01, 1.81500017e-01, 4.44668993e-01],
       [4.44444444e-01, 4.87735401e-01, 2.58616392e-01, 3.78847655e-01, 1.84089251e-01, 2.73839731e-01],
       [9.44444444e-01, 6.20351321e-02, 5.63162106e-01, 3.55902600e-01, 4.64050109e-01, 9.15100051e-01],
       [4.44444444e-01, 7.51542966e-01, 1.69272246e-01, 1.43526835e-02, 1.20951421e-01, 1.66604537e-01],
       [7.22222222e-01, 5.15746162e-01, 6.50043391e-01, 4.75028989e-01, 4.81578590e-01, 5.27853859e-01],
       [9.44444444e-01, 7.32552619e-01, 9.99799390e-01, 2.48335520e-01, 5.77011575e-01, 6.30277019e-01],
       [5.55555556e-02, 6.90140845e-01, 2.62082304e-02, 3.09146826e-02, 2.10799348e-02, 4.59537506e-02],
       [5.55555556e-01, 9.11694888e-01, 2.64618908e-01, 7.41384232e-01, 3.26202971e-01, 3.76545523e-01],
       [1.11111111e-01, 6.58648520e-01, 4.59367528e-02, 2.53632717e-01, 7.12369492e-02, 1.25392359e-01],
       [0.00000000e+00, 6.83335971e-01, 2.14236782e-02, 2.63224027e-02, 1.18826301e-02, 3.39259298e-02],
       [8.33333333e-01, 8.17692673e-01, 8.71812713e-01, 2.07203245e-01, 5.11094943e-01, 5.27062449e-01],
       [8.88888889e-01, 1.44010128e-01, 5.36786887e-01, 3.08322356e-01, 4.27701471e-01, 8.35909435e-01],
       [0.00000000e+00, 4.18895395e-01, 2.21456890e-02, 5.02564960e-02, 1.04576035e-02, 4.00324437e-02],
       [7.22222222e-01, 7.49010919e-01, 4.58983397e-01, 4.93503389e-02, 3.36712215e-01, 3.34089054e-01],
       [1.00000000e+00, 7.40623516e-01, 8.01174907e-01, 7.45341047e-02, 5.74353051e-01, 5.33536425e-01],
       [8.88888889e-01, 5.25241336e-01, 8.96903285e-01, 4.68055899e-01, 6.24309385e-01, 6.96451388e-01],
       [7.77777778e-01, 5.19069473e-01, 7.53418917e-01, 4.82109655e-01, 5.25635444e-01, 5.82191322e-01],
       [5.55555556e-01, 7.76705175e-01, 4.06844447e-01, 1.76661501e-01, 2.42759819e-01, 2.98763149e-01],
       [7.22222222e-01, 1.29450862e-01, 3.60436446e-01, 2.88561556e-01, 2.81028974e-01, 6.35990288e-01]])
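To confirm programmatically that the reloaded scaler reproduces the original transformation (a quick check added here, not a cell from the original notebook):
import numpy as np
# the reloaded scaler should give exactly the same result as before saving
print(np.allclose(X_test_continuous_scaled2, X_test_continuous_scaled.values))   # expected: True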