Nobel Prize Exploratory Data Analysis with Lets-Plot

The data is provided by Kaggle.

Preparation

In [1]:
from sys import executable
!{executable} -m pip install colorcet
In [2]:
import numpy as np
import pandas as pd
import colorcet as cc

from lets_plot import *
from lets_plot.geo_data import *
LetsPlot.setup_html()
The geodata is provided by © OpenStreetMap contributors and is made available here under the Open Database License (ODbL).
In [3]:
def get_counts_df(local_df, *, column, column_name=None):
    vc_df = local_df[column].value_counts().to_frame('count')
    vc_df.index.name = column_name if column_name else column
    vc_df = vc_df.reset_index()
    return vc_df
In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/nobel.csv")

all_countries = pd.Series(list(set(list(df.born_country_code) + \
                                   list(df.died_country_code) + \
                                   list(df.country_of_university))))
all_countries = all_countries[~all_countries.isna()]
geocoded_countries_df = geocode_countries(all_countries).ignore_not_found().get_geocodes()
geocoded_countries_dict = geocoded_countries_df.set_index('country').to_dict()['found name']
df = df.replace({
    'born_country_code':
        dict(geocoded_countries_dict, \
             **{str(c): np.nan for c in set(df.born_country_code) - set(geocoded_countries_dict.keys())}),
    'died_country_code':
        dict(geocoded_countries_dict, \
             **{str(c): np.nan for c in set(df.died_country_code) - set(geocoded_countries_dict.keys())}),
    'country_of_university':
        dict(geocoded_countries_dict, \
             **{str(c): np.nan for c in set(df.country_of_university) - set(geocoded_countries_dict.keys())})
})
df = df.rename(columns={'born_country_code': 'born_country', \
                        'died_country_code': 'died_country', \
                        'share': 'prize_share'})

df['decade'] = (df.year / 10).astype(int) * 10
df['fullname'] = df.firstname + ' ' + df.surname

df.head()
Out[4]:
  | firstname | surname | born_country | died_country | gender | year | category | prize_share | name_of_university | city_of_university | country_of_university | born_month | age | age_get_prize | decade | fullname
0 | Wilhelm Conrad | Röntgen | Deutschland | Deutschland | male | 1901 | physics | 1 | Munich University | Munich | Deutschland | Mar | 78 | 56 | 1900 | Wilhelm Conrad Röntgen
1 | Hendrik A. | Lorentz | Nederland | Nederland | male | 1902 | physics | 2 | Leiden University | Leiden | NaN | Jul | 75 | 49 | 1900 | Hendrik A. Lorentz
2 | Pieter | Zeeman | Nederland | Nederland | male | 1902 | physics | 2 | Amsterdam University | Amsterdam | NaN | May | 78 | 37 | 1900 | Pieter Zeeman
3 | Henri | Becquerel | France | France | male | 1903 | physics | 2 | École Polytechnique | Paris | France | Dec | 56 | 51 | 1900 | Henri Becquerel
4 | Pierre | Curie | France | France | male | 1903 | physics | 4 | École municipale de physique et de chimie indu... | Paris | France | May | 47 | 44 | 1900 | Pierre Curie
In [5]:
country_prizes_df = df[~df.country_of_university.isna()]\
                      .drop_duplicates(subset=['country_of_university', 'year', 'category'])
country_prizes_df = country_prizes_df.rename(columns={'country_of_university': 'country'})
laureates_df = df.drop_duplicates(subset=['fullname'])
not_migrated_laureates_df = df[(~df.died_country.isna())&(df.born_country == df.died_country)]\
                              .drop_duplicates(subset=['born_country', 'died_country', 'fullname'])
migrated_laureates_df = df[(~df.died_country.isna())&(df.born_country != df.died_country)]\
                          .drop_duplicates(subset=['born_country', 'died_country', 'fullname'])\
                          .copy()  # explicit copy: a 'migration' column is added to this frame later
In [6]:
decades = sorted(df.decade.unique())

Visualization

Explore Countries

In [7]:
N = 10
countries_colors = {country: cc.palette['glasbey_dark'][i] \
                    for i, country in enumerate(geocoded_countries_dict.values())}
plots = []
for d, column, counted_name in [(country_prizes_df, 'country', 'Nobel prizes'), \
                                (not_migrated_laureates_df, 'born_country', 'non-migrated laureates'), \
                                (migrated_laureates_df, 'died_country', 'immigrated laureates'), \
                                (migrated_laureates_df, 'born_country', 'emigrated laureates')]:
    local_df = get_counts_df(d, column=column, column_name='country')
    local_df['color'] = local_df.country.map(countries_colors)
    plots.append(ggplot(local_df) + \
        geom_bar(aes(x='country', y='count', color='color', fill='color'), \
                 stat='identity', sampling=sampling_pick(N), alpha=.75, show_legend=False, \
                 tooltips=layer_tooltips().line('@country').line('{0} number|@count'.format(counted_name))) + \
        scale_color_identity() + scale_fill_identity() + \
        ggtitle('Top {0} Countries by {1}'.format(N, counted_name.title())) + \
        theme(axis_text_x='blank', axis_ticks_x='blank'))

w, h = 400, 300
bunch = GGBunch()
bunch.add_plot(plots[0], 0, 0, w, h)
bunch.add_plot(plots[1], w, 0, w, h)
bunch.add_plot(plots[2], 0, h, w, h)
bunch.add_plot(plots[3], w, h, w, h)
bunch.show()

The US is by far the leader in the Nobel race, and a great deal of its success is due to immigrant scientists.

The charts also show that many Nobel laureates left Poland, Germany and the UK. Even so, the UK and Germany hold top positions alongside the US despite this brain drain.

In [8]:
migrated_laureates_df['migration'] = migrated_laureates_df.born_country + ' → ' + migrated_laureates_df.died_country
migration_df = get_counts_df(migrated_laureates_df, column='migration')

ggplot(migration_df[migration_df['count'] > 1]) + \
    geom_bar(aes(x='migration', y='count', fill='count'), stat='identity', show_legend=False, \
             tooltips=layer_tooltips().line('@migration').line('migrated laureates number|@count')) + \
    ggtitle('Popular Migration Directions for Nobel Laureates') + \
    theme(axis_text_x='blank', axis_ticks_x='blank')
Out[8]:

Migration here means that a laureate died in a country other than the one where they were born. The most popular direction is from the UK to the US.

Apart from moves to the US, another popular route is from Poland to Germany.
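
The migration counts rest on comparing `born_country` with `died_country`. The sketch below repeats that comparison on a few made-up rows (the names and countries are illustrative and not taken from the dataset). It also shows one caveat worth checking: in pandas, a missing value compared with `!=` evaluates to `True`, so a laureate with an unknown birth country would count as migrated unless both columns are tested for missing values.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the Nobel dataset.
toy = pd.DataFrame({
    'fullname': ['A', 'B', 'C', 'D'],
    'born_country': ['Poland', 'Germany', np.nan, 'Poland'],
    'died_country': ['Germany', 'Germany', 'France', 'Germany'],
})

# Both countries must be known before the comparison is meaningful.
known = toy.born_country.notna() & toy.died_country.notna()
migrated = toy[known & (toy.born_country != toy.died_country)].copy()

# Label each route and count how often it occurs.
migrated['migration'] = migrated.born_country + ' → ' + migrated.died_country
route_counts = migrated.migration.value_counts()
print(route_counts)
```

Row `C` is dropped because its birth country is unknown, and row `B` because both countries match, so only the Poland to Germany route remains.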

In [9]:
countries_df = get_counts_df(country_prizes_df, column='country')
top_countries = countries_df[countries_df['count'].cumsum() < 3 * country_prizes_df.shape[0] / 4].country.values
country_prizes_df['half'] = np.where(country_prizes_df.country.isin(top_countries), country_prizes_df.country, 'Other')
country_prizes_df.half = pd.Categorical(country_prizes_df.half, list(top_countries) + ['Other'])
ggplot(country_prizes_df.groupby(['decade', 'half']).count().iloc[:, 0].to_frame('count').groupby(level=0)\
               .apply(lambda x: x / float(x.sum()))\
               .sort_values(by='half').reset_index()) + \
    geom_bar(aes(x='decade', y='count', group='half', fill='half'), stat='identity') + \
    scale_x_continuous(breaks=decades, labels=[str(d) for d in decades]) + ylab('proportion of prizes') + \
    scale_fill_discrete(name='country') + \
    ggtitle('Prize Proportion between Top Countries and Others')
Out[9]:

The US, the UK and Germany together account for about three quarters of all Nobel prizes ever awarded. The balance shifts over time, though, mostly in favor of the US and at the expense of Germany.
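
The group of top countries is selected with a cumulative sum: countries are sorted by prize count and kept while the running total stays below three quarters of all prizes. A minimal sketch of that selection, using made-up counts and placeholder country labels rather than the real data:

```python
import pandas as pd

# Made-up prize counts for illustration only.
counts = pd.Series({'A': 50, 'B': 20, 'C': 10, 'D': 10, 'E': 10}, name='count')
counts = counts.sort_values(ascending=False)

threshold = 0.75 * counts.sum()   # three quarters of all prizes
running_total = counts.cumsum()   # 50, 70, 80, 90, 100
top = counts[running_total < threshold].index.tolist()

# Everything outside the top group is pooled into 'Other'.
other_share = counts.drop(top).sum() / counts.sum()
print(top, other_share)
```

Because the comparison is strict, the selected group always covers slightly less than the threshold. In this toy case it covers 70% of the prizes.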

In [10]:
country_boundaries_gdf = geocode_countries().get_boundaries()

ggplot() + \
    geom_map(aes(fill='count'), \
             data=laureates_df.groupby('born_country').count().iloc[:, 0].to_frame('count').reset_index(), \
             map=country_boundaries_gdf, \
             map_join=('born_country', 'country'),
             tooltips=layer_tooltips().line('@born_country').line('laureates number|@count')) + \
    scale_fill_gradient(low='#4575b4', high='#d73027') + \
    ggtitle('Distribution of Nobel Laureates in the World') + \
    theme_classic() + theme(axis='blank')
Out[10]:

The map shows that the prizes go overwhelmingly to laureates born in Western countries, while almost the whole of Africa is missing.

Explore Universities

In [11]:
N = 10

top_universities = country_prizes_df.name_of_university.value_counts().to_frame('count')[:N].index
ggplot(country_prizes_df[country_prizes_df.name_of_university.isin(top_universities)]) + \
    geom_bar(aes(x='name_of_university', group='category', fill='category'), \
             tooltips=layer_tooltips().line('^x').line('@|@category').line('prizes number|^y')) + \
    xlab('university') + \
    ggtitle('Top {0} Universities by Prize Number'.format(N)) + \
    theme(axis_text_x='blank', axis_ticks_x='blank')
Out[11]:

Most top universities pay attention to a wide range of scientific disciplines, but some specialize in particular areas.

Explore Gender

In [12]:
p1 = ggplot(laureates_df) + \
    geom_bar(aes(x='gender', fill='gender')) + \
    ggtitle('Gender Ratio')
p2 = ggplot(laureates_df) + \
    geom_bar(aes(x='category', group='gender', fill='gender')) + \
    ggtitle('Gender Ratio by Category')
p3 = ggplot(laureates_df) + \
    geom_bar(aes(x='decade', group='gender', fill='gender')) + \
    scale_x_discrete(labels=df.decade.unique().astype(str)) + \
    ggtitle('Gender Ratio by Decade')

w, h = 600, 300

bunch = GGBunch()
bunch.add_plot(p1, 0, 0, w, h)
bunch.add_plot(p2, 0, h, w, h)
bunch.add_plot(p3, 0, 2 * h, w, h)
bunch.show()

The charts show a strong gender imbalance that narrows only slowly over the years, with the 1910s and 1950s as exceptions.

The highest female-to-male ratio is in peace and literature.
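
The ratio per category can also be read as numbers instead of bar heights. The sketch below uses `pd.crosstab` with row normalization on a handful of made-up rows. Only the column names `category` and `gender` match the dataset, the values are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'category': ['peace', 'peace', 'peace',
                 'physics', 'physics', 'physics', 'physics'],
    'gender':   ['female', 'male', 'male',
                 'male', 'male', 'male', 'female'],
})

# Row-normalized cross-tabulation: each category row sums to 1.
gender_share = pd.crosstab(toy.category, toy.gender, normalize='index')
print(gender_share)
```

With the real data, the same call on `laureates_df` should give the share of women and men in each category.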

Explore Categories

In [13]:
ggplot(df) + geom_bar(aes(x='category', fill='category')) + ggtitle('Nobel Prizes by Categories')
Out[13]:

Not all categories feature the same number of laureates, mostly due to prize sharing in collective research.

In [14]:
breaks = sorted(df.prize_share.unique())
labels = ['1' if b == 1 else '1/{0}'.format(b) for b in breaks]
ggplot(df) + \
    geom_bar(aes(x='prize_share', group='category', fill='category')) + \
    scale_x_continuous(name='prize share', breaks=breaks, labels=labels) + scale_fill_discrete() + \
    ggtitle('Sharing Prizes')
Out[14]:

In most cases the winner gets the full prize or half of it. For peace and especially for literature, it is unusual to share your prize with someone.

In [15]:
ggplot(df.groupby(['year', 'category']).agg({'decade': 'count', 'age_get_prize': 'mean'}).reset_index()) + \
    geom_point(aes(x='year', y='category', size='decade', color='age_get_prize'), shape=15, \
               tooltips=layer_tooltips().line('laureates number|^size').line('laureates mean age|^color')\
                       .line('@|@year').line('@|@category')) + \
    scale_x_continuous(breaks=decades, labels=[str(d) for d in decades]) + \
    scale_size(name='', range=[1, 2]) + \
    scale_color_gradient(name='age get prize', low='#1a9850', high='#d73027') + \
    ggtitle('Nobel Prizes by Year and Category') + \
    ggsize(900, 200)
Out[15]:

There are gaps in some categories over the years, especially peace, and one large gap common to all categories during World War II.

Finally, the chart shows that the prize in economics is a late addition: it was first awarded in 1969.

Explore Ages

In [16]:
ggplot(laureates_df) + \
    geom_density(aes(x='age'), color='#3182bd', fill='#9ecae1') + \
    ggtitle('Death Age Distribution of Nobel Laureates')
Out[16]:

The mean age of death for Nobel laureates is 85 years.

Wouldn't be too bad to achieve the same life span!

In [17]:
ggplot(df) + \
    geom_histogram(aes(x='age_get_prize'), binwidth=5, boundary=22) + \
    ggtitle('Age Distribution of Nobel Prize Winners')
Out[17]:

The mean age of winning the Nobel prize is 60 years.

In [18]:
ggplot(df, aes(x='age', y='age_get_prize')) + \
    geom_bin2d(binwidth=[5, 5]) + \
    scale_fill_gradient(low='#edf8fb', high='#006d2c') + \
    facet_grid(x='gender') + \
    ggtitle('Joint Distribution of Age at Death and Age at Award')
Out[18]:

This graph confirms the previous conclusions: most laureates received the prize at around 60 and lived to around 85.

In [19]:
ggplot(df) + \
    geom_boxplot(aes(x='category', y='age_get_prize', fill='category')) + \
    facet_grid(x='gender') + \
    ggtitle('Aggregated Information About Age by Category and Gender')
Out[19]:

The typical age differs by category. Physicists were often relatively young when they won, while economists tended to win at a slightly older age. For women, the larger spread is likely explained by the smaller sample size.
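
The box plot can be backed by a small table that lists the median age and the group size side by side, which makes it easier to judge how far a small group can be trusted. This is a sketch on made-up rows. The column names follow the dataset, the values do not.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'category': ['physics'] * 4 + ['economics'] * 3,
    'gender': ['male', 'male', 'male', 'female', 'male', 'male', 'female'],
    'age_get_prize': [45, 50, 55, 36, 62, 68, 76],
})

# Median age and number of rows for every category/gender pair.
summary = (toy.groupby(['category', 'gender'])['age_get_prize']
              .agg(median='median', n='count')
              .reset_index())
print(summary)
```

Groups with a very small `n` deserve caution: a single laureate can shift the median noticeably.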

In [20]:
ggplot(df, aes(x='year', y='age_get_prize')) + \
    geom_point(aes(color='gender')) + geom_smooth(method='loess', color='black') + \
    scale_x_continuous(breaks=decades, labels=[str(d) for d in decades]) + \
    ggtitle('Distribution of Ages by Years')
Out[20]:

Here we also see that the mean age when Nobel laureates receive their prize is rising over time.
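
The smoothed curve gives a visual impression of the trend. A simple numeric check is to average the age per decade, as in the sketch below. The rows are made up for illustration, and the `decade` column is derived the same way as in the preparation step (year rounded down to a multiple of ten).

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'year': [1901, 1905, 1952, 1958, 2001, 2009],
    'age_get_prize': [50, 54, 58, 62, 66, 70],
})

# Round each year down to its decade, then average the age within each decade.
toy['decade'] = toy.year // 10 * 10
decade_mean = toy.groupby('decade')['age_get_prize'].mean()
print(decade_mean)
```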

In [21]:
ggplot(df, aes(x='year', y='age_get_prize')) + \
    geom_point(aes(color='gender')) + geom_smooth(method='loess', color='black') + \
    scale_x_continuous(breaks=decades, labels=[str(d) for d in decades]) + \
    facet_grid(y='category') + \
    ggtitle('Category Wise Distribution of Ages by Years')
Out[21]:

Broken down by category, the mean age does not always increase: it even decreases for the peace prize and stays roughly flat for literature and economics.

In [22]:
N = 20

p1 = ggplot(df.sort_values(by='age_get_prize', ascending=False)[:N]) + \
    geom_bar(aes(x='fullname', y='age_get_prize', fill='gender'), stat='identity') + \
    ggtitle('Top {0} Oldest Nobel Prize Laureates'.format(N))
p2 = ggplot(df.sort_values(by='age_get_prize', ascending=False)[-N:]) + \
    geom_bar(aes(x='fullname', y='age_get_prize', fill='gender'), stat='identity') + \
    ggtitle('Top {0} Youngest Nobel Prize Laureates'.format(N))

bunch = GGBunch()
bunch.add_plot(p1, 0, 0, 400, 300)
bunch.add_plot(p2, 400, 0, 400, 300)
bunch.show()

Finally we take a look at the oldest and youngest people who got the prize.

Multiple Laureates

In [23]:
# Names that appear more than once in the data, i.e. laureates with several prizes
multiple_laureates = df.fullname.value_counts().loc[lambda s: s > 1].index.tolist()
ggplot(df[df.fullname.isin(multiple_laureates)]) + \
    geom_point(aes(x='year', y='fullname', color='category', fill='category', \
                   shape='gender', size='age_get_prize'), \
               alpha=.5, tooltips=layer_tooltips().line('@fullname').line('year get prize|@year')\
                                                  .line('prize category|@category')\
                                                  .line('prize share|1/@prize_share')\
                                                  .line('university|@name_of_university')\
                                                  .line('@|@gender').line('prize winning age|@age_get_prize')\
                                                  .line('age at death|@age')\
                                                  .line('born country|@born_country')\
                                                  .line('died country|@died_country')) + \
    scale_x_continuous(breaks=decades, labels=[str(d) for d in decades]) + \
    scale_shape_manual(values=[24, 25]) + scale_size(name='prize winning age', range=[4, 8]) + \
    ggsize(600, 400) + ggtitle('Laureates Who Won Nobel Prize More Than Once') + \
    theme(legend_position='bottom', axis_title_y='blank', axis_tooltip='blank')
Out[23]:

So far, four people have received the prize more than once. One of them is a woman, and she is also the only one who moved to a different country. Two of the four won their prizes in two different categories.

The first case was in 1903 and the last one in 1980.
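
These statements can be derived with a single `groupby`. The sketch below uses a few hand-typed rows based on widely known prize history rather than rows read from `nobel.csv`, so the spelling of the names in the real `fullname` column may differ.

```python
import pandas as pd

# Hand-typed rows for illustration; not read from the dataset.
prizes = pd.DataFrame({
    'fullname': ['Marie Curie', 'Marie Curie', 'Linus Pauling', 'Linus Pauling',
                 'John Bardeen', 'John Bardeen', 'Frederick Sanger', 'Frederick Sanger',
                 'Pierre Curie'],
    'year': [1903, 1911, 1954, 1962, 1956, 1972, 1958, 1980, 1903],
    'category': ['physics', 'chemistry', 'chemistry', 'peace',
                 'physics', 'physics', 'chemistry', 'chemistry', 'physics'],
})

# One row per person: number of prizes, first and last year, distinct categories.
per_person = prizes.groupby('fullname').agg(
    n_prizes=('year', 'count'),
    first_year=('year', 'min'),
    last_year=('year', 'max'),
    n_categories=('category', 'nunique'),
)

multiple = per_person[per_person.n_prizes > 1]
changed_category = multiple[multiple.n_categories > 1].index.tolist()
print(multiple)
print(changed_category)
```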

Look at the graph and you will find out even more fascinating details about these people.