# Data scale normalization

Normalization is a common technique in machine learning for rescaling features of very different magnitudes to a common range between 0 and 1.

Here we demonstrate how this is done with pandas and altair.

Original inspiration: [Jason Brownlee: Machine Learning Algorithms from Scratch](https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/)

In [1]:
import altair as alt

# alt.renderers.enable('default')
alt.renderers

Out[1]:
RendererRegistry(active='default', registered=['colab', 'default', 'html', 'json', 'jupyterlab', 'kaggle', 'mimetype', 'notebook', 'nteract', 'png', 'svg', 'zeppelin'])
In [2]:
from vega_datasets import data


We use the Gapminder health and income dataset.

In [3]:
health_income = data('gapminder-health-income')
# Show the first rows to get a feel for the columns
health_income.head()

Out[3]:
country income health population
0 Afghanistan 1925 57.63 32526562
1 Albania 10620 76.00 2896679
2 Algeria 13434 76.50 39666519
3 Andorra 46577 84.10 70473
4 Angola 7615 61.00 25021974
In [4]:
income_domain = [health_income['income'].min(), health_income['income'].max()]
health_domain = [health_income['health'].min(), health_income['health'].max()]

alt.Chart(health_income).mark_point().encode(
    alt.X('income:Q', scale=alt.Scale(domain=income_domain)),
    alt.Y('health:Q', scale=alt.Scale(domain=health_domain)),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)

Out[4]:

The process:

1. Subtract the smallest value from every value;
2. Take the value range, that is, the difference between the largest and smallest values;
3. Divide the reduced values by the range.

$\text{scaled value} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}}$

The first step ensures that the smallest value will become 0. Dividing the reduced values by the range 'compresses' the values so the new maximum becomes 1.
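As a quick check of the formula, here is a minimal sketch in plain Python. The helper name `min_max_scale` is hypothetical and not part of the original notebook. The three incomes come from the dataset used here: the smallest income (599), Afghanistan's income (1925), and the largest income (132877).

```python
def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range (min-max normalization)."""
    lowest = min(values)
    # Step 2: the value range is the largest value minus the smallest one.
    value_range = max(values) - lowest
    # Step 1 subtracts the minimum, step 3 divides by the range.
    return [(value - lowest) / value_range for value in values]


# Smallest income, Afghanistan's income, largest income
incomes = [599, 1925, 132877]
scaled = min_max_scale(incomes)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and Afghanistan's income lands at roughly 0.01, because 1326 / 132278 is about 0.010024.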

In [5]:
quantitative_columns = ['income', 'health', 'population']


The original minimum and maximum values

In [6]:
health_income.loc[health_income[quantitative_columns].idxmin(), :]

Out[6]:
country income health population
32 Central African Republic 599 53.8 4900274
93 Lesotho 2598 48.5 2135022
105 Marshall Islands 3661 65.1 52993
In [7]:
minimums = health_income[quantitative_columns].min()
minimums

Out[7]:
income          599.0
health           48.5
population    52993.0
dtype: float64
In [8]:
health_income.loc[health_income[quantitative_columns].idxmax(), :]

Out[8]:
country income health population
134 Qatar 132877 82.0 2235355
3 Andorra 46577 84.1 70473
35 China 13334 76.9 1376048943
In [9]:
maximums = health_income[quantitative_columns].max()
maximums

Out[9]:
income        1.328770e+05
health        8.410000e+01
population    1.376049e+09
dtype: float64

Difference of values from the column minimum

In [10]:
health_income[quantitative_columns] - minimums

Out[10]:
income health population
0 1326.0 9.13 32473569.0
1 10021.0 27.50 2843686.0
2 12835.0 28.00 39613526.0
3 45978.0 35.60 17480.0
4 7016.0 12.50 24968981.0
... ... ... ...
182 5024.0 28.00 93394608.0
183 3720.0 26.70 4615473.0
184 3288.0 19.10 26779222.0
185 3435.0 10.46 16158774.0
186 1202.0 11.51 15549758.0

187 rows × 3 columns
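The subtraction above works because pandas aligns the index of a Series with the column labels of a DataFrame, so each column has its own minimum subtracted. The sketch below illustrates this on a three-row sample; the name `frame` is hypothetical, and the values are the income and health figures of the Central African Republic, Afghanistan, and Qatar from the dataset.

```python
import pandas as pd

# Three-row sample taken from the dataset (illustration only)
frame = pd.DataFrame({
    "income": [599, 1925, 132877],
    "health": [53.8, 57.63, 82.0],
})

# min() returns a Series indexed by column name
minimums = frame.min()

# The Series index is matched against the DataFrame columns,
# so every column is reduced by its own minimum.
differences = frame - minimums
print(differences)
```

Note that the minimums here are those of the sample, not of the full dataset, so the health differences are not the same as in the output above.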

Value ranges: the difference between the maximum and the minimum

In [11]:
maximums - minimums

Out[11]:
income        1.322780e+05
health        3.560000e+01
population    1.375996e+09
dtype: float64

Let's normalize the dataset

In [12]:
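def normalize_dataset(dataset, quantitative_columns):
    """Return a copy of the dataset with the given columns scaled to [0, 1]."""
    # Work on a copy so the original dataset stays untouched
    dataset = dataset.copy()

    minimums = dataset[quantitative_columns].min()
    maximums = dataset[quantitative_columns].max()

    # Subtract the column minimum, then divide by the column range
    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)

    return dataset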

In [13]:
normalized_health_income = normalize_dataset(health_income, quantitative_columns)
normalized_health_income

Out[13]:
country income health population
0 Afghanistan 0.010024 0.256461 0.023600
1 Albania 0.075757 0.772472 0.002067
2 Algeria 0.097030 0.786517 0.028789
3 Andorra 0.347586 1.000000 0.000013
4 Angola 0.053040 0.351124 0.018146
... ... ... ... ...
182 Vietnam 0.037981 0.786517 0.067874
183 West Bank and Gaza 0.028123 0.750000 0.003354
184 Yemen 0.024857 0.536517 0.019462
185 Zambia 0.025968 0.293820 0.011743
186 Zimbabwe 0.009087 0.323315 0.011301

187 rows × 4 columns
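One edge case is worth keeping in mind: if a column holds a single repeated value, its range is 0 and the division yields NaN. That does not happen in this dataset. The variant below is only a sketch of one possible way to handle it; the function name `normalize_dataset_safe`, the toy data, and the choice to map constant columns to 0.0 are all assumptions, not part of the original notebook.

```python
import pandas as pd


def normalize_dataset_safe(dataset, quantitative_columns):
    """Min-max normalize the given columns; constant columns become 0.0."""
    dataset = dataset.copy()

    minimums = dataset[quantitative_columns].min()
    ranges = dataset[quantitative_columns].max() - minimums

    # Assumption: a zero range carries no scale information, so divide
    # by 1 instead of 0 and let all values of that column map to 0.0.
    ranges = ranges.replace(0, 1)

    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / ranges
    return dataset


# Toy data with one ordinary column and one constant column
toy = pd.DataFrame({"name": ["a", "b", "c"], "x": [2, 4, 6], "constant": [5, 5, 5]})
result = normalize_dataset_safe(toy, ["x", "constant"])
print(result)
```

Whether 0.0 is the right value for a constant column depends on the downstream model; dropping such columns is another reasonable option.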

The new minimum and maximum values

In [14]:
normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmin(), :]

Out[14]:
country income health population
32 Central African Republic 0.000000 0.148876 0.003523
93 Lesotho 0.015112 0.000000 0.001513
105 Marshall Islands 0.023148 0.466292 0.000000
In [15]:
normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmax(), :]

Out[15]:
country income health population
134 Qatar 1.000000 0.941011 0.001586
3 Andorra 0.347586 1.000000 0.000013
35 China 0.096275 0.797753 1.000000

Plotting the normalized data gives the same picture as before: the points keep their relative positions, but the income, health, and population scales are all normalized to the [0, 1] range.
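The plot looks the same because min-max scaling only shifts and stretches each column, so the ordering of the values within a column does not change. The sketch below checks this on the first five rows of the dataset. The names `sample`, `values`, and `normalized` are hypothetical, and because only five rows are used, the scaled numbers differ from those of the full dataset.

```python
import pandas as pd

# First five rows of the dataset, copied from the output shown earlier
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "income": [1925, 10620, 13434, 46577, 7615],
    "health": [57.63, 76.00, 76.50, 84.10, 61.00],
    "population": [32526562, 2896679, 39666519, 70473, 25021974],
})

columns = ["income", "health", "population"]
values = sample[columns]
normalized = (values - values.min()) / (values.max() - values.min())

# Ranks are identical before and after scaling: the ordering is preserved
same_order = normalized.rank().equals(values.rank())
print(same_order)

# Every column now spans exactly the [0, 1] range
print(normalized.min().tolist(), normalized.max().tolist())
```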

## Maximum values

In [16]:
alt.Chart(normalized_health_income).mark_point().encode(
    alt.X('income:Q'),
    alt.Y('health:Q'),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)

Out[16]: