Normalization is a common technique in machine learning that rescales features of different magnitudes to a common range between 0 and 1.
Here we demonstrate how this is done with pandas and altair.
Original inspiration: [Jason Brownlee: Machine Learning Algorithms from Scratch](https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/)
import altair as alt
# alt.renderers.enable('default')
alt.renderers
RendererRegistry(active='default', registered=['colab', 'default', 'html', 'json', 'jupyterlab', 'kaggle', 'mimetype', 'notebook', 'nteract', 'png', 'svg', 'zeppelin'])
from vega_datasets import data
We use the Gapminder health and income dataset.
health_income = data('gapminder-health-income')
health_income.head()
| | country | income | health | population |
---|---|---|---|---|
0 | Afghanistan | 1925 | 57.63 | 32526562 |
1 | Albania | 10620 | 76.00 | 2896679 |
2 | Algeria | 13434 | 76.50 | 39666519 |
3 | Andorra | 46577 | 84.10 | 70473 |
4 | Angola | 7615 | 61.00 | 25021974 |
income_domain = [health_income['income'].min(), health_income['income'].max()]
health_domain = [health_income['health'].min(), health_income['health'].max()]
alt.Chart(health_income).mark_point().encode(
    alt.X('income:Q', scale=alt.Scale(domain=income_domain)),
    alt.Y('health:Q', scale=alt.Scale(domain=health_domain)),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)
The process:
$\text{scaled value} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}}$
Subtracting the minimum ensures that the smallest value becomes 0. Dividing the shifted values by the range (max - min) 'compresses' them so that the new maximum becomes 1.
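To make the two steps concrete, here is a minimal sketch in plain Python that applies the formula to a handful of made-up values. The numbers and the helper name `min_max_scale` are illustrative assumptions, not part of the dataset or of any library.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range (assumes max > min)."""
    lowest = min(values)
    highest = max(values)
    value_range = highest - lowest
    # Step 1: subtract the minimum so the smallest value becomes 0.
    # Step 2: divide by the range so the largest value becomes 1.
    return [(v - lowest) / value_range for v in values]


example = [50, 60, 75, 100]  # made-up values
scaled = min_max_scale(example)
print(scaled)  # [0.0, 0.2, 0.5, 1.0]
```

Note that the sketch assumes the values are not all identical; otherwise the range is 0 and the division fails.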
quantitative_columns = ['income', 'health', 'population']
The original minimum and maximum values
health_income.loc[health_income[quantitative_columns].idxmin(), :]
| | country | income | health | population |
---|---|---|---|---|
32 | Central African Republic | 599 | 53.8 | 4900274 |
93 | Lesotho | 2598 | 48.5 | 2135022 |
105 | Marshall Islands | 3661 | 65.1 | 52993 |
minimums = health_income[quantitative_columns].min()
minimums
income          599.0
health           48.5
population    52993.0
dtype: float64
health_income.loc[health_income[quantitative_columns].idxmax(), :]
| | country | income | health | population |
---|---|---|---|---|
134 | Qatar | 132877 | 82.0 | 2235355 |
3 | Andorra | 46577 | 84.1 | 70473 |
35 | China | 13334 | 76.9 | 1376048943 |
maximums = health_income[quantitative_columns].max()
maximums
income        1.328770e+05
health        8.410000e+01
population    1.376049e+09
dtype: float64
Difference of values from the column minimum
health_income[quantitative_columns] - minimums
| | income | health | population |
---|---|---|---|
0 | 1326.0 | 9.13 | 32473569.0 |
1 | 10021.0 | 27.50 | 2843686.0 |
2 | 12835.0 | 28.00 | 39613526.0 |
3 | 45978.0 | 35.60 | 17480.0 |
4 | 7016.0 | 12.50 | 24968981.0 |
... | ... | ... | ... |
182 | 5024.0 | 28.00 | 93394608.0 |
183 | 3720.0 | 26.70 | 4615473.0 |
184 | 3288.0 | 19.10 | 26779222.0 |
185 | 3435.0 | 10.46 | 16158774.0 |
186 | 1202.0 | 11.51 | 15549758.0 |
187 rows × 3 columns
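The subtraction above works column by column because pandas matches the index labels of the `minimums` Series against the column names of the DataFrame. A small sketch with a made-up frame (the `toy` data and variable names are assumptions for illustration only):

```python
import pandas as pd

# Made-up data, only to illustrate the label alignment.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 40]})

# Series indexed by column name: a -> 1, b -> 10
toy_minimums = toy.min()

# Each column has its own minimum subtracted, matched by label.
shifted = toy - toy_minimums
print(shifted)
```

Column `a` becomes 0, 1, 2 and column `b` becomes 0, 10, 30, so every column now starts at 0.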
Value ranges: the difference between the maximum and the minimum
maximums - minimums
income        1.322780e+05
health        3.560000e+01
population    1.375996e+09
dtype: float64
Let's normalize the dataset
def normalize_dataset(dataset, quantitative_columns):
    # Work on a copy so the original DataFrame is left untouched.
    dataset = dataset.copy()
    minimums = dataset[quantitative_columns].min()
    maximums = dataset[quantitative_columns].max()
    # Min-max scaling: (value - min) / (max - min), applied column by column.
    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)
    return dataset
normalized_health_income = normalize_dataset(health_income, quantitative_columns)
normalized_health_income
| | country | income | health | population |
---|---|---|---|---|
0 | Afghanistan | 0.010024 | 0.256461 | 0.023600 |
1 | Albania | 0.075757 | 0.772472 | 0.002067 |
2 | Algeria | 0.097030 | 0.786517 | 0.028789 |
3 | Andorra | 0.347586 | 1.000000 | 0.000013 |
4 | Angola | 0.053040 | 0.351124 | 0.018146 |
... | ... | ... | ... | ... |
182 | Vietnam | 0.037981 | 0.786517 | 0.067874 |
183 | West Bank and Gaza | 0.028123 | 0.750000 | 0.003354 |
184 | Yemen | 0.024857 | 0.536517 | 0.019462 |
185 | Zambia | 0.025968 | 0.293820 | 0.011743 |
186 | Zimbabwe | 0.009087 | 0.323315 | 0.011301 |
187 rows × 4 columns
The new minimum and maximum values
normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmin(), :]
| | country | income | health | population |
---|---|---|---|---|
32 | Central African Republic | 0.000000 | 0.148876 | 0.003523 |
93 | Lesotho | 0.015112 | 0.000000 | 0.001513 |
105 | Marshall Islands | 0.023148 | 0.466292 | 0.000000 |
normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmax(), :]
| | country | income | health | population |
---|---|---|---|---|
134 | Qatar | 1.000000 | 0.941011 | 0.001586 |
3 | Andorra | 0.347586 | 1.000000 | 0.000013 |
35 | China | 0.096275 | 0.797753 | 1.000000 |
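Rather than reading the extremes off the tables, the same property can be checked programmatically. The following is a minimal sketch on a made-up frame (the `toy` data and the `numeric` list are assumptions standing in for the real dataset and `quantitative_columns`):

```python
import pandas as pd

# Made-up frame standing in for the real dataset.
toy = pd.DataFrame({
    "name": ["x", "y", "z"],
    "a": [1.0, 2.0, 5.0],
    "b": [10.0, 30.0, 20.0],
})
numeric = ["a", "b"]

# Same min-max formula as above, applied to the numeric columns only.
normalized = toy.copy()
normalized[numeric] = (toy[numeric] - toy[numeric].min()) / (
    toy[numeric].max() - toy[numeric].min()
)

# Every normalized column should start at 0 and end at 1.
all_min_zero = bool((normalized[numeric].min() == 0).all())
all_max_one = bool((normalized[numeric].max() == 1).all())
print(all_min_zero, all_max_one)  # True True
```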
Plotting the normalized data gives the same picture as before, but with the income, health, and population scales all normalized to the [0, 1] range.
alt.Chart(normalized_health_income).mark_point().encode(
    alt.X('income:Q'),
    alt.Y('health:Q'),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)