Data scale normalization

Normalization is a common technique in machine learning that rescales features of different magnitudes to a common range between 0 and 1.

Here we demonstrate how this is done with pandas and altair.

Original inspiration: [Jason Brownlee: Machine Learning Algorithms from Scratch](https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/)

In [1]:

import altair as alt

# alt.renderers.enable('default')
alt.renderers

Out[1]:

RendererRegistry(active='default', registered=['colab', 'default', 'html', 'json', 'jupyterlab', 'kaggle', 'mimetype', 'notebook', 'nteract', 'png', 'svg', 'zeppelin'])

In [2]:

from vega_datasets import data

We use the Gapminder health and income dataset.

In [3]:

health_income = data('gapminder-health-income')
health_income.head()

Out[3]:

	country	income	health	population
0	Afghanistan	1925	57.63	32526562
1	Albania	10620	76.00	2896679
2	Algeria	13434	76.50	39666519
3	Andorra	46577	84.10	70473
4	Angola	7615	61.00	25021974

In [4]:

income_domain = [health_income['income'].min(), health_income['income'].max()]
health_domain = [health_income['health'].min(), health_income['health'].max()]

alt.Chart(health_income).mark_point().encode(
    alt.X('income:Q', scale=alt.Scale(domain=income_domain)),
    alt.Y('health:Q', scale=alt.Scale(domain=health_domain)),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)

Out[4]:

The process:

Subtract the smallest value from each value to get the reduced values;
Compute the value range, that is, the difference between the largest and smallest values;
Divide the reduced values by the range.

$ \text{scaled value} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}} $

The first step ensures that the smallest value will become 0. Dividing the reduced values by the range 'compresses' the values so the new maximum becomes 1.
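As a quick sanity check of the formula, here is a minimal sketch in plain Python. The helper name `min_max_scale` is hypothetical and not part of the notebook; the three income figures are taken from the tables shown here (the dataset minimum, Afghanistan, and the dataset maximum).

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (value - min) / (max - min)."""
    lowest = min(values)
    value_range = max(values) - lowest
    return [(value - lowest) / value_range for value in values]

# Income figures from this notebook: the minimum (599),
# Afghanistan (1925), and the maximum (132877).
incomes = [599, 1925, 132877]
scaled = min_max_scale(incomes)

# The smallest value maps to 0, the largest to 1, and Afghanistan
# lands at roughly 0.010024, matching the normalized table.
print([round(s, 6) for s in scaled])
```

The sketch assumes the values are not all identical; otherwise the range is zero and the division fails.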

In [5]:

quantitative_columns = ['income', 'health', 'population']

The original minimum and maximum values

In [6]:

health_income.loc[health_income[quantitative_columns].idxmin(), :]

Out[6]:

	country	income	health	population
32	Central African Republic	599	53.8	4900274
93	Lesotho	2598	48.5	2135022
105	Marshall Islands	3661	65.1	52993

In [7]:

minimums = health_income[quantitative_columns].min()
minimums

Out[7]:

income          599.0
health           48.5
population    52993.0
dtype: float64

In [8]:

health_income.loc[health_income[quantitative_columns].idxmax(), :]

Out[8]:

	country	income	health	population
134	Qatar	132877	82.0	2235355
3	Andorra	46577	84.1	70473
35	China	13334	76.9	1376048943

In [9]:

maximums = health_income[quantitative_columns].max()
maximums

Out[9]:

income        1.328770e+05
health        8.410000e+01
population    1.376049e+09
dtype: float64

Difference of values from the column minimum

In [10]:

health_income[quantitative_columns] - minimums

Out[10]:

	income	health	population
0	1326.0	9.13	32473569.0
1	10021.0	27.50	2843686.0
2	12835.0	28.00	39613526.0
3	45978.0	35.60	17480.0
4	7016.0	12.50	24968981.0
...	...	...	...
182	5024.0	28.00	93394608.0
183	3720.0	26.70	4615473.0
184	3288.0	19.10	26779222.0
185	3435.0	10.46	16158774.0
186	1202.0	11.51	15549758.0

187 rows × 3 columns
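The subtraction above works column by column because pandas matches the index of the `minimums` Series against the column labels of the DataFrame. A minimal sketch on a toy frame (the names `frame`, `column_minimums`, and `shifted` and the values are illustrative assumptions, not Gapminder data) might look like this:

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# min() returns a Series indexed by column name: a -> 1, b -> 10.
column_minimums = frame.min()

# The Series index is aligned with the DataFrame columns, so each
# column has its own minimum subtracted from every row.
shifted = frame - column_minimums
print(shifted)
```

After the subtraction, every column of `shifted` starts at 0, which is the first step of the process described earlier.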

Value ranges: the difference between the maximum and the minimum

In [11]:

maximums - minimums

Out[11]:

income        1.322780e+05
health        3.560000e+01
population    1.375996e+09
dtype: float64

Let's normalize the dataset

In [12]:

def normalize_dataset(dataset, quantitative_columns):
    dataset = dataset.copy()
    
    minimums = dataset[quantitative_columns].min()
    maximums = dataset[quantitative_columns].max()

    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)
    
    return dataset
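One edge case worth keeping in mind: if a column holds a single repeated value, its range is zero and the division yields NaN. The Gapminder columns do not have this problem, but a possible guard could look like the sketch below. The function name `normalize_dataset_safe` and the choice to map constant columns to 0 are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def normalize_dataset_safe(dataset, quantitative_columns):
    """Min-max normalize the given columns, tolerating constant columns."""
    dataset = dataset.copy()

    minimums = dataset[quantitative_columns].min()
    ranges = dataset[quantitative_columns].max() - minimums

    # Assumption: a constant column (range 0) is mapped to 0 rather than NaN,
    # so its zero range is replaced with 1 before dividing.
    ranges = ranges.replace(0, 1)

    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / ranges
    return dataset

# Hypothetical toy data with one varying and one constant column.
toy = pd.DataFrame({
    'name': ['x', 'y', 'z'],
    'varying': [2, 4, 6],
    'constant': [5, 5, 5],
})
result = normalize_dataset_safe(toy, ['varying', 'constant'])
print(result)
```

Whether a constant column should become 0, stay unchanged, or be dropped depends on the downstream model, so treat this as one option among several.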

In [13]:

normalized_health_income = normalize_dataset(health_income, quantitative_columns)
normalized_health_income

Out[13]:

	country	income	health	population
0	Afghanistan	0.010024	0.256461	0.023600
1	Albania	0.075757	0.772472	0.002067
2	Algeria	0.097030	0.786517	0.028789
3	Andorra	0.347586	1.000000	0.000013
4	Angola	0.053040	0.351124	0.018146
...	...	...	...	...
182	Vietnam	0.037981	0.786517	0.067874
183	West Bank and Gaza	0.028123	0.750000	0.003354
184	Yemen	0.024857	0.536517	0.019462
185	Zambia	0.025968	0.293820	0.011743
186	Zimbabwe	0.009087	0.323315	0.011301

187 rows × 4 columns

The new minimum and maximum values

In [14]:

normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmin(), :]

Out[14]:

	country	income	health	population
32	Central African Republic	0.000000	0.148876	0.003523
93	Lesotho	0.015112	0.000000	0.001513
105	Marshall Islands	0.023148	0.466292	0.000000

In [15]:

normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmax(), :]

Out[15]:

	country	income	health	population
134	Qatar	1.000000	0.941011	0.001586
3	Andorra	0.347586	1.000000	0.000013
35	China	0.096275	0.797753	1.000000

Plotting the normalized data gives the same picture as before, but with the income, health, and population scales all normalized to the [0, 1] range.

Maximum values

In [16]:

alt.Chart(normalized_health_income).mark_point().encode(
    alt.X('income:Q'),
    alt.Y('health:Q'),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)

Out[16]: