Setup and Context¶

Introduction¶

On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.

Alfred Nobel dictated that his entire remaining estate should be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.

Every year the Nobel Prize is given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace.

Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel prize and our world more generally?

Upgrade plotly (only Google Colab Notebook)¶

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, run the cell below and then restart your notebook runtime.

In [ ]:

%pip install --upgrade plotly

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: plotly in /usr/local/lib/python3.8/dist-packages (5.11.0)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.8/dist-packages (from plotly) (8.1.0)

Import Statements¶

In [ ]:

import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

In [ ]:

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Notebook Presentation¶

In [ ]:

pd.options.display.float_format = '{:,.2f}'.format

Read the Data¶

In [ ]:

df_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Nobel+Prize+Analysis+Start/nobel_prize_data.csv')

Caveats: The exact birth dates for Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with a mid-year estimate of July 2nd.

Data Exploration & Cleaning¶

Preliminary data exploration.

What is the shape of df_data? How many rows and columns?
What are the column names?
In which year was the Nobel prize first awarded?
Which year is the latest year included in the dataset?

In [ ]:

print(df_data.shape)
print(df_data.columns)
print("first year:", df_data.year.min())
print("latest year:", df_data.year.max())

(962, 16)
Index(['year', 'category', 'prize', 'motivation', 'prize_share',
       'laureate_type', 'full_name', 'birth_date', 'birth_city',
       'birth_country', 'birth_country_current', 'sex', 'organization_name',
       'organization_city', 'organization_country', 'ISO'],
      dtype='object')
first year: 1901
latest year: 2020

Are there any duplicate values in the dataset?
Are there NaN values in the dataset?
Which columns tend to have NaN values?
How many NaN values are there per column?
Why do these columns have NaN values?

Check for Duplicates¶

In [ ]:

print(f'Any duplicates? {df_data.duplicated().values.any()}')

Any duplicates? False

In [ ]:

print(f'Any NaN values among the data? {df_data.isna().values.any()}')

Any NaN values among the data? True

In [ ]:

df_data.isna().sum()

Out[ ]:

year                       0
category                   0
prize                      0
motivation                88
prize_share                0
laureate_type              0
full_name                  0
birth_date                28
birth_city                31
birth_country             28
birth_country_current     28
sex                       28
organization_name        255
organization_city        255
organization_country     254
ISO                       28
dtype: int64

In [ ]:

col_subset = ['year','category', 'laureate_type',
              'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset]

Out[ ]:

	year	category	laureate_type	birth_date	full_name	organization_name
24	1904	Peace	Organization	NaN	Institut de droit international (Institute of ...	NaN
60	1910	Peace	Organization	NaN	Bureau international permanent de la Paix (Per...	NaN
89	1917	Peace	Organization	NaN	Comité international de la Croix Rouge (Intern...	NaN
200	1938	Peace	Organization	NaN	Office international Nansen pour les Réfugiés ...	NaN
215	1944	Peace	Organization	NaN	Comité international de la Croix Rouge (Intern...	NaN
237	1947	Peace	Organization	NaN	American Friends Service Committee (The Quakers)	NaN
238	1947	Peace	Organization	NaN	Friends Service Council (The Quakers)	NaN
283	1954	Peace	Organization	NaN	Office of the United Nations High Commissioner...	NaN
348	1963	Peace	Organization	NaN	Comité international de la Croix Rouge (Intern...	NaN
349	1963	Peace	Organization	NaN	Ligue des Sociétés de la Croix-Rouge (League o...	NaN
366	1965	Peace	Organization	NaN	United Nations Children's Fund (UNICEF)	NaN
399	1969	Peace	Organization	NaN	International Labour Organization (I.L.O.)	NaN
479	1977	Peace	Organization	NaN	Amnesty International	NaN
523	1981	Peace	Organization	NaN	Office of the United Nations High Commissioner...	NaN
558	1985	Peace	Organization	NaN	International Physicians for the Prevention of...	NaN
588	1988	Peace	Organization	NaN	United Nations Peacekeeping Forces	NaN
659	1995	Peace	Organization	NaN	Pugwash Conferences on Science and World Affairs	NaN
682	1997	Peace	Organization	NaN	International Campaign to Ban Landmines (ICBL)	NaN
703	1999	Peace	Organization	NaN	Médecins Sans Frontières	NaN
730	2001	Peace	Organization	NaN	United Nations (U.N.)	NaN
778	2005	Peace	Organization	NaN	International Atomic Energy Agency (IAEA)	NaN
788	2006	Peace	Organization	NaN	Grameen Bank	NaN
801	2007	Peace	Organization	NaN	Intergovernmental Panel on Climate Change (IPCC)	NaN
860	2012	Peace	Organization	NaN	European Union (EU)	NaN
873	2013	Peace	Organization	NaN	Organisation for the Prohibition of Chemical W...	NaN
897	2015	Peace	Organization	NaN	National Dialogue Quartet	NaN
919	2017	Peace	Organization	NaN	International Campaign to Abolish Nuclear Weap...	NaN
958	2020	Peace	Organization	NaN	World Food Programme (WFP)	NaN

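All 28 rows with a missing birth_date are organisations rather than individuals, which explains the NaN values in the birth and sex columns. We also see that since the organisation's name is in the full_name column, the organization_name column contains NaN for these rows.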
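The claim that missing birth dates belong to organisations can be checked directly rather than by eye. The sketch below uses a small hypothetical DataFrame (the `toy` rows are made up to mimic the structure of df_data); the same two lines should work on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the structure of df_data
toy = pd.DataFrame({
    "full_name": ["Marie Curie", "Amnesty International", "Grameen Bank"],
    "laureate_type": ["Individual", "Organization", "Organization"],
    "birth_date": ["1867-11-07", np.nan, np.nan],
})

# Rows where the birth date is missing
missing_birth = toy[toy.birth_date.isna()]

# True only if every row with a missing birth date is an organisation
all_orgs = (missing_birth.laureate_type == "Organization").all()
print(all_orgs)
```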

Type Conversions¶

Convert the birth_date column to Pandas Datetime objects
Add a Column called share_pct which has the laureates' share as a percentage in the form of a floating-point number.

Convert Birth Date to Datetime¶

In [ ]:

df_data.birth_date = pd.to_datetime(df_data.birth_date)

In [ ]:

separated_values = df_data.prize_share.str.split('/', expand=True)
numerator = pd.to_numeric(separated_values[0])
denominator = pd.to_numeric(separated_values[1])
df_data['share_pct'] = numerator / denominator
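To make the conversion concrete, here is a minimal sketch on three sample share strings (the `shares` Series is hypothetical). Note that the division yields a fraction between 0 and 1, which is what share_pct holds in the output further down; multiplying by 100 would give a true percentage.

```python
import pandas as pd

# Hypothetical sample of prize_share strings
shares = pd.Series(["1/1", "1/2", "1/4"])

# Split "numerator/denominator" into two columns
parts = shares.str.split("/", expand=True)

# Convert both parts to numbers and divide
fraction = pd.to_numeric(parts[0]) / pd.to_numeric(parts[1])
percent = fraction * 100

print(fraction.tolist())
print(percent.tolist())
```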

In [ ]:

df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   year                   962 non-null    int64         
 1   category               962 non-null    object        
 2   prize                  962 non-null    object        
 3   motivation             874 non-null    object        
 4   prize_share            962 non-null    object        
 5   laureate_type          962 non-null    object        
 6   full_name              962 non-null    object        
 7   birth_date             934 non-null    datetime64[ns]
 8   birth_city             931 non-null    object        
 9   birth_country          934 non-null    object        
 10  birth_country_current  934 non-null    object        
 11  sex                    934 non-null    object        
 12  organization_name      707 non-null    object        
 13  organization_city      707 non-null    object        
 14  organization_country   708 non-null    object        
 15  ISO                    934 non-null    object        
 16  share_pct              962 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(14)
memory usage: 127.9+ KB

In [ ]:

df_data.head()

Out[ ]:

	year	category	prize	motivation	prize_share	laureate_type	full_name	birth_date	birth_city	birth_country	birth_country_current	sex	organization_name	organization_city	organization_country	ISO	share_pct
0	1901	Chemistry	The Nobel Prize in Chemistry 1901	"in recognition of the extraordinary services ...	1/1	Individual	Jacobus Henricus van 't Hoff	1852-08-30	Rotterdam	Netherlands	Netherlands	Male	Berlin University	Berlin	Germany	NLD	1.00
1	1901	Literature	The Nobel Prize in Literature 1901	"in special recognition of his poetic composit...	1/1	Individual	Sully Prudhomme	1839-03-16	Paris	France	France	Male	NaN	NaN	NaN	FRA	1.00
2	1901	Medicine	The Nobel Prize in Physiology or Medicine 1901	"for his work on serum therapy, especially its...	1/1	Individual	Emil Adolf von Behring	1854-03-15	Hansdorf (Lawice)	Prussia (Poland)	Poland	Male	Marburg University	Marburg	Germany	POL	1.00
3	1901	Peace	The Nobel Peace Prize 1901	NaN	1/2	Individual	Frédéric Passy	1822-05-20	Paris	France	France	Male	NaN	NaN	NaN	FRA	0.50
4	1901	Peace	The Nobel Peace Prize 1901	NaN	1/2	Individual	Jean Henry Dunant	1828-05-08	Geneva	Switzerland	Switzerland	Male	NaN	NaN	NaN	CHE	0.50

Plotly Donut Chart: Percentage of Male vs. Female Laureates¶

Challenge: Create a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?

In [ ]:

biology = df_data.groupby("sex").agg({"prize": pd.Series.count})

In [ ]:

fig = px.pie(labels=biology.index,
             values=biology["prize"],
             hole=.3,
             names=biology.index)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')
fig.show()
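The percentage shown in the donut chart can also be computed as a number. This is a rough sketch on a hypothetical `sex` Series; value_counts drops missing values by default, so organisations (which have no sex recorded) would not be counted.

```python
import pandas as pd

# Hypothetical sample; None stands in for an organisation
sex = pd.Series(["Male", "Male", "Female", "Male", None])

# normalize=True returns proportions instead of counts
pct = sex.value_counts(normalize=True) * 100
print(pct)
```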

Who were the first 3 Women to Win the Nobel Prize?¶

What are the names of the first 3 female Nobel laureates?
What did they win the prize for?
What do you see in their birth_country? Were they part of an organisation?

In [ ]:

df_data[df_data.sex == "Female"].sort_values("year")[:3]

Out[ ]:

	year	category	prize	motivation	prize_share	laureate_type	full_name	birth_date	birth_city	birth_country	birth_country_current	sex	organization_name	organization_city	organization_country	ISO	share_pct
18	1903	Physics	The Nobel Prize in Physics 1903	"in recognition of the extraordinary services ...	1/4	Individual	Marie Curie, née Sklodowska	1867-11-07	Warsaw	Russian Empire (Poland)	Poland	Female	NaN	NaN	NaN	POL	0.25
29	1905	Peace	The Nobel Peace Prize 1905	NaN	1/1	Individual	Baroness Bertha Sophie Felicita von Suttner, n...	1843-06-09	Prague	Austrian Empire (Czech Republic)	Czech Republic	Female	NaN	NaN	NaN	CZE	1.00
51	1909	Literature	The Nobel Prize in Literature 1909	"in appreciation of the lofty idealism, vivid ...	1/1	Individual	Selma Ottilia Lovisa Lagerlöf	1858-11-20	Mårbacka	Sweden	Sweden	Female	NaN	NaN	NaN	SWE	1.00

Even without looking at the data, you might have already guessed one of the famous names: Marie Curie.

Find the Repeat Winners¶

Challenge: Did some people get a Nobel Prize more than once? If so, who were they?

In [ ]:

is_winner = df_data.duplicated(subset=['full_name'], keep=False)
multiple_winners = df_data[is_winner]
print(f'There are {multiple_winners.full_name.nunique()}' \
      ' winners who were awarded the prize more than once.')

There are 6 winners who were awarded the prize more than once.
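The keep=False argument is what makes this work: by default, duplicated() leaves the first occurrence unmarked, so the earliest prize of each repeat winner would be dropped from the result. A small sketch with hypothetical names shows the difference.

```python
import pandas as pd

# Hypothetical names: "A" appears three times, "B" twice, "C" once
names = pd.DataFrame({"full_name": ["A", "B", "A", "C", "B", "A"]})

# Default (keep='first'): the first occurrence is NOT flagged
default_flags = names.duplicated(subset=["full_name"])

# keep=False: every row that has a duplicate is flagged
all_flags = names.duplicated(subset=["full_name"], keep=False)

repeat_names = names[all_flags].full_name.unique().tolist()
print(default_flags.tolist())
print(all_flags.tolist())
print(repeat_names)
```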

In [ ]:

col_subset = ['year', 'category', 'laureate_type', 'full_name']
multiple_winners[col_subset]

Out[ ]:

	year	category	laureate_type	full_name
18	1903	Physics	Individual	Marie Curie, née Sklodowska
62	1911	Chemistry	Individual	Marie Curie, née Sklodowska
89	1917	Peace	Organization	Comité international de la Croix Rouge (Intern...
215	1944	Peace	Organization	Comité international de la Croix Rouge (Intern...
278	1954	Chemistry	Individual	Linus Carl Pauling
283	1954	Peace	Organization	Office of the United Nations High Commissioner...
297	1956	Physics	Individual	John Bardeen
306	1958	Chemistry	Individual	Frederick Sanger
340	1962	Peace	Individual	Linus Carl Pauling
348	1963	Peace	Organization	Comité international de la Croix Rouge (Intern...
424	1972	Physics	Individual	John Bardeen
505	1980	Chemistry	Individual	Frederick Sanger
523	1981	Peace	Organization	Office of the United Nations High Commissioner...

We see that Marie Curie actually got the Nobel prize twice - once in physics and once in chemistry. Linus Carl Pauling got it first in chemistry and later for peace given his work in promoting nuclear disarmament. Also, the International Red Cross was awarded the Peace prize a total of 3 times. The first two times were both during the devastating World Wars.

Number of Prizes per Category¶

In how many categories are prizes awarded?
Create a plotly bar chart with the number of prizes awarded by category.
Use the color scale called Aggrnyl to colour the chart, but don't show a color axis.
Which category has the most number of prizes awarded?
Which category has the fewest number of prizes awarded?

In [ ]:

df_data.category.nunique()

Out[ ]:

In [ ]:

prizes_per_category = df_data.category.value_counts()
v_bar = px.bar(
        x = prizes_per_category.index,
        y = prizes_per_category.values,
        color = prizes_per_category.values,
        color_continuous_scale='Aggrnyl',
        title='Number of Prizes Awarded per Category')

v_bar.update_layout(xaxis_title='Nobel Prize Category',
                    coloraxis_showscale=False,
                    yaxis_title='Number of Prizes')
v_bar.show()

When was the first prize in the field of Economics awarded?
Who did the prize go to?

In [ ]:

df_data[df_data.category == 'Economics'].sort_values('year')[:3]

Out[ ]:

	year	category	prize	motivation	prize_share	laureate_type	full_name	birth_date	birth_city	birth_country	birth_country_current	sex	organization_name	organization_city	organization_country	ISO	share_pct
393	1969	Economics	The Sveriges Riksbank Prize in Economic Scienc...	"for having developed and applied dynamic mode...	1/2	Individual	Jan Tinbergen	1903-04-12	the Hague	Netherlands	Netherlands	Male	The Netherlands School of Economics	Rotterdam	Netherlands	NLD	0.50
394	1969	Economics	The Sveriges Riksbank Prize in Economic Scienc...	"for having developed and applied dynamic mode...	1/2	Individual	Ragnar Frisch	1895-03-03	Oslo	Norway	Norway	Male	University of Oslo	Oslo	Norway	NOR	0.50
402	1970	Economics	The Sveriges Riksbank Prize in Economic Scienc...	"for the scientific work through which he has ...	1/1	Individual	Paul A. Samuelson	1915-05-15	Gary, IN	United States of America	United States of America	Male	Massachusetts Institute of Technology (MIT)	Cambridge, MA	United States of America	USA	1.00

The first Economics prize was awarded in 1969. It was shared equally between Jan Tinbergen and Ragnar Frisch.

Male and Female Winners by Category¶

Create a plotly bar chart that shows the split between men and women by category.

Hover over the bar chart. How many prizes went to women in Literature compared to Physics?

In [ ]:

cat_men_women = df_data.groupby(['category', 'sex'],
                               as_index=False).agg({'prize': pd.Series.count})
cat_men_women.sort_values('prize', ascending=False, inplace=True)

In [ ]:

v_bar_split = px.bar(x = cat_men_women.category,
                     y = cat_men_women.prize,
                     color = cat_men_women.sex,
                     title='Number of Prizes Awarded per Category split by Men and Women')

v_bar_split.update_layout(xaxis_title='Nobel Prize Category',
                          yaxis_title='Number of Prizes')
v_bar_split.show()

We see that overall the imbalance is pretty large with physics, economics, and chemistry. Women are somewhat more represented in categories of Medicine, Literature and Peace. Splitting bar charts like this is an incredibly powerful way to show a more granular picture.

Number of Prizes Awarded Over Time¶

Challenge: Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually.

Count the number of prizes awarded every year.
Create a 5 year rolling average of the number of prizes (Hint: see previous lessons analysing Google Trends).
Using Matplotlib superimpose the rolling average on a scatter plot.
Show a tick mark on the x-axis for every 5 years from 1900 to 2020. (Hint: you'll need to use NumPy).
Use the named colours to draw the data points in dodgerblue while the rolling average is coloured in crimson.

Looking at the chart, did the first and second world wars have an impact on the number of prizes being given out?
What could be the reason for the trend in the chart?

In [ ]:

prize_per_year = df_data.groupby("year").agg({'prize': pd.Series.count})
prize_per_year.sort_values('year', inplace=True)

In [ ]:

prize_per_year

Out[ ]:

	prize
year
1901	6
1902	7
1903	7
1904	6
1905	5
...	...
2016	11
2017	12
2018	13
2019	14
2020	12

117 rows × 1 columns

In [ ]:

moving_average = prize_per_year.rolling(window=5).mean()
moving_average

Out[ ]:

	prize
year
1901	NaN
1902	NaN
1903	NaN
1904	NaN
1905	6.20
...	...
2016	11.60
2017	12.00
2018	12.00
2019	12.20
2020	12.40

117 rows × 1 columns
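The NaN values in the first four rows appear because a window of 5 needs five observations before it can produce a mean. The sketch below reuses the first five yearly counts from the output above; the min_periods variant is an optional alternative, not something the notebook uses.

```python
import pandas as pd

# First five yearly prize counts, taken from the output above
counts = pd.Series([6, 7, 7, 6, 5], index=range(1901, 1906))

# Strict window: NaN until five values are available
strict = counts.rolling(window=5).mean()

# Relaxed window: averages whatever is available so far
relaxed = counts.rolling(window=5, min_periods=1).mean()

print(strict)
print(relaxed)
```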

In [ ]:

plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
           fontsize=14,
           rotation=45)

ax = plt.gca() # get current axis
ax.set_xlim(1900, 2020)

ax.scatter(x=prize_per_year.index,
           y=prize_per_year.values,
           c='dodgerblue',
           alpha=0.7,
           s=100,)

ax.plot(prize_per_year.index,
        moving_average.values,
        c='crimson',
        linewidth=3,)

plt.show()

Are More Prizes Shared Than Before?¶

Investigate if more prizes are shared than before.

Calculate the average prize share of the winners on a year by year basis.
Calculate the 5 year rolling average of the percentage share.
Copy-paste the cell from the chart you created above.
Modify the code to add a secondary axis to your Matplotlib chart.
Plot the rolling average of the prize share on this chart.
See if you can invert the secondary y-axis to make the relationship even more clear.

In [ ]:

yearly_avg_share = df_data.groupby(by='year').agg({'share_pct': pd.Series.mean})
share_moving_average = yearly_avg_share.rolling(window=5).mean()

In [ ]:

plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
           fontsize=14,
           rotation=45)

ax1 = plt.gca()
ax2 = ax1.twinx() # create second y-axis
ax1.set_xlim(1900, 2020)

ax1.scatter(x=prize_per_year.index,
           y=prize_per_year.values,
           c='dodgerblue',
           alpha=0.7,
           s=100,)

ax1.plot(prize_per_year.index,
        moving_average.values,
        c='crimson',
        linewidth=3,)
# Can invert axis
ax2.invert_yaxis()

# Adding prize share plot on second axis
ax2.plot(prize_per_year.index,
        share_moving_average.values,
        c='grey',
        linewidth=3,)

plt.show()

There is clearly an upward trend in the number of prizes being given out as more and more prizes are shared. Also, more prizes are being awarded from 1969 onwards because of the addition of the economics category. We also see that very few prizes were awarded during the first and second world wars. Note that instead of there being a zero entry for those years, we see the effect of the wars as missing blue dots.

The Countries with the Most Nobel Prizes¶

Challenge:

Create a Pandas DataFrame called top20_countries that has two columns. The prize column should contain the total number of prizes won.

Is it best to use birth_country, birth_country_current or organization_country?
What are some potential problems when using birth_country or any of the others? Which column is the least problematic?
Then use plotly to create a horizontal bar chart showing the number of prizes won by each country. Here's what you're after:

What is the ranking for the top 20 countries in terms of the number of prizes?

If you look at the entries in birth_country, you'll see that some countries no longer exist! These include the Soviet Union and Czechoslovakia, for example. Hence, using birth_country_current is better, since it holds the name of the country that currently controls the city where the laureate was born. Note that this does not determine the laureate's nationality, since some globetrotting folks gave birth to their future Nobel laureate children while abroad. Also, people's nationalities can change as they emigrate and acquire a different citizenship, or get married and change citizenship. What this boils down to is that we have to be clear about the assumptions we make in the upcoming analysis.

In [ ]:

top_countries = df_data.groupby("birth_country_current", as_index = False).agg({"prize":pd.Series.count})
top_countries.sort_values(by='prize', inplace=True)
top20_countries = top_countries[-20:]

In [ ]:

h_bar = px.bar(x=top20_countries.prize,
               y=top20_countries.birth_country_current,
               orientation='h',
               color=top20_countries.prize,
               color_continuous_scale='Viridis',
               title='Top 20 Countries by Number of Prizes')

h_bar.update_layout(xaxis_title='Number of Prizes',
                    yaxis_title='Country',
                    coloraxis_showscale=False)
h_bar.show()

Use a Choropleth Map to Show the Number of Prizes Won by Country¶

Create this choropleth map using the plotly documentation:

Experiment with plotly's available colours. I quite like the sequential colour matter on this map.

Hint: You'll need to use a 3 letter country code for each country.

In [ ]:

df_countries = df_data.groupby(['birth_country_current', 'ISO'],
                               as_index=False).agg({'prize': pd.Series.count})
df_countries.sort_values('prize', ascending=False)

Out[ ]:

	birth_country_current	ISO	prize
74	United States of America	USA	281
73	United Kingdom	GBR	105
26	Germany	DEU	84
25	France	FRA	57
67	Sweden	SWE	29
...	...	...	...
32	Iceland	ISL	1
47	Madagascar	MDG	1
34	Indonesia	IDN	1
36	Iraq	IRQ	1
78	Zimbabwe	ZWE	1

79 rows × 3 columns

In [ ]:

fig = px.choropleth(df_countries, locations="ISO",
                    color="prize", # colour each country by its number of prizes
                    hover_name="birth_country_current", # column to add to hover information
                    color_continuous_scale=px.colors.sequential.matter)
fig.show()

In Which Categories are the Different Countries Winning Prizes?¶

Challenge: See if you can divide up the plotly bar chart you created above to show which categories made up the total number of prizes. Here's what you're aiming for:

In which category are Germany and Japan the weakest compared to the United States?
In which category does Germany have more prizes than the UK?
In which categories does France have more prizes than Germany?
Which category makes up most of Australia's Nobel prizes?
Which category makes up half of the prizes in the Netherlands?
Does the United States have more prizes in Economics than all of France? What about in Physics or Medicine?

The hard part is preparing the data for this chart!

Hint: Take a two-step approach. The first step is grouping the data by country and category. Then you can create a DataFrame that looks something like this:

In [ ]:

cat_prize_per_country = df_data.groupby(["birth_country_current", "category"], as_index = False).agg({"prize":pd.Series.count})
cat_prize_per_country.rename(columns={"prize": "cat_prize"}, inplace=True)
cat_prize_per_country

Out[ ]:

	birth_country_current	category	cat_prize
0	Algeria	Literature	1
1	Algeria	Physics	1
2	Argentina	Medicine	2
3	Argentina	Peace	2
4	Australia	Chemistry	1
...	...	...	...
206	United States of America	Physics	70
207	Venezuela	Medicine	1
208	Vietnam	Peace	1
209	Yemen	Peace	1
210	Zimbabwe	Peace	1

211 rows × 3 columns

Join cat_prize_per_country with top20_countries

In [ ]:

merged_df = pd.merge(cat_prize_per_country, top20_countries, on='birth_country_current')
# change column names
merged_df.columns = ['birth_country_current', 'category', 'cat_prize', 'total_prize']
merged_df.sort_values(by='total_prize', inplace=True)
merged_df

Out[ ]:

	birth_country_current	category	cat_prize	total_prize
12	Belgium	Peace	3	9
42	Hungary	Chemistry	3	9
43	Hungary	Economics	1	9
52	India	Physics	1	9
51	India	Peace	1	9
...	...	...	...	...
104	United States of America	Chemistry	55	281
105	United States of America	Economics	49	281
106	United States of America	Literature	10	281
107	United States of America	Medicine	78	281
109	United States of America	Physics	70	281

110 rows × 4 columns
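The merge above is an inner join, which is why the 211 country and category rows shrink to 110: only countries present in top20_countries survive. A minimal sketch with hypothetical countries X, Y and Z illustrates this. It also renames the total column by name instead of assigning merged_df.columns positionally, which is a slightly safer variant if the column order ever changes.

```python
import pandas as pd

# Hypothetical per-category counts for three countries
per_category = pd.DataFrame({
    "country": ["X", "X", "Y", "Z"],
    "category": ["Physics", "Peace", "Physics", "Peace"],
    "cat_prize": [3, 1, 2, 1],
})

# Hypothetical "top" table that only contains X and Y
top_totals = pd.DataFrame({"country": ["X", "Y"], "prize": [4, 2]})

# Default how='inner': rows for Z are dropped
merged = pd.merge(per_category, top_totals, on="country")
merged = merged.rename(columns={"prize": "total_prize"})
print(merged)
```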

In [ ]:

cat_cntry_bar = px.bar(x=merged_df.cat_prize,
                       y=merged_df.birth_country_current,
                       color=merged_df.category,
                       orientation='h',
                       title='Top 20 Countries by Number of Prizes and Category')

cat_cntry_bar.update_layout(xaxis_title='Number of Prizes',
                            yaxis_title='Country')
cat_cntry_bar.show()

What we see is that the US has won an incredible proportion of the prizes in the field of Economics. In comparison, Japan and Germany have won very few or no economics prizes at all. Also, the US has more prizes in physics or medicine alone than all of France's prizes combined. On the chart, we also see that Germany won more prizes in physics than the UK and that France has won more prizes in peace and literature than Germany, even though Germany has been awarded a higher total number of prizes than France.

Number of Prizes Won by Each Country Over Time¶

When did the United States eclipse every other country in terms of the number of prizes won?
Which country or countries were leading previously?
Calculate the cumulative number of prizes won by each country in every year. Again, use the birth_country_current of the winner to calculate this.
Create a plotly line chart where each country is a coloured line.

In [ ]:

prize_by_year = df_data.groupby(by=['birth_country_current', 'year'], as_index=False).count()
prize_by_year = prize_by_year.sort_values('year')[['year', 'birth_country_current', 'prize']]
prize_by_year

Out[ ]:

	year	birth_country_current	prize
118	1901	France	2
346	1901	Poland	1
159	1901	Germany	1
312	1901	Netherlands	1
440	1901	Switzerland	1
...	...	...	...
31	2019	Austria	1
221	2020	Germany	1
622	2020	United States of America	7
533	2020	United Kingdom	2
158	2020	France	1

627 rows × 3 columns

In [ ]:

cumulative_prizes = prize_by_year.groupby(by=['birth_country_current',
                                              'year']).sum().groupby(level=[0]).cumsum()
cumulative_prizes.reset_index(inplace=True)
cumulative_prizes

Out[ ]:

	birth_country_current	year	prize
0	Algeria	1957	1
1	Algeria	1997	2
2	Argentina	1936	1
3	Argentina	1947	2
4	Argentina	1980	3
...	...	...	...
622	United States of America	2020	281
623	Venezuela	1980	1
624	Vietnam	1973	1
625	Yemen	2011	1
626	Zimbabwe	1960	1

627 rows × 3 columns
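The running total per country can also be written as a single grouped cumsum, which may be easier to read than the chained groupby above. This is a sketch on hypothetical data (countries X and Y, made-up counts), not a replacement for the notebook's cell.

```python
import pandas as pd

# Hypothetical prizes per country per year
yearly = pd.DataFrame({
    "country": ["X", "Y", "X", "X", "Y"],
    "year": [1901, 1901, 1903, 1910, 1912],
    "prize": [2, 1, 1, 3, 2],
})

# Sort so that the running total follows chronological order
yearly = yearly.sort_values(["country", "year"])

# Cumulative sum restarts for every country
yearly["cumulative"] = yearly.groupby("country")["prize"].cumsum()
print(yearly)
```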

In [ ]:

l_chart = px.line(cumulative_prizes,
                  x='year',
                  y='prize',
                  color='birth_country_current',
                  hover_name='birth_country_current')

l_chart.update_layout(xaxis_title='Year',
                      yaxis_title='Number of Prizes')

l_chart.show()

What we see is that the United States really started to take off after the Second World War which decimated Europe. Prior to that, the Nobel prize was pretty much a European affair. Very few laureates were chosen from other parts of the world. This has changed dramatically in the last 40 years or so. There are many more countries represented today than in the early days. Interestingly we also see that the UK and Germany traded places in the 70s and 90s on the total number of prizes won. Sweden being 5th place pretty consistently over many decades is quite interesting too. Perhaps this reflects a little bit of home bias?

What are the Top Research Organisations?¶

Challenge: Create a bar chart showing the organisations affiliated with the Nobel laureates. It should look something like this:

Which organisations make up the top 20?
How many Nobel prize winners are affiliated with the University of Chicago and Harvard University?

In [ ]:

sorted_organization_data = df_data.groupby("organization_name", as_index= False).agg({"prize": pd.Series.count})
sorted_organization_data.sort_values("prize", ascending = False, inplace = True)
sorted_organization_data.reset_index()

Out[ ]:

	index	organization_name	prize
0	196	University of California	40
1	68	Harvard University	29
2	167	Stanford University	23
3	117	Massachusetts Institute of Technology (MIT)	21
4	198	University of Chicago	20
...	...	...	...
259	110	Long Term Capital Management	1
260	112	Madrid University	1
261	113	Mainz University	1
262	114	Marburg University	1
263	263	École municipale de physique et de chimie indu...	1

264 rows × 3 columns

In [ ]:

sorted_organization_data = sorted_organization_data[:20]
sorted_organization_data

Out[ ]:

	organization_name	prize
196	University of California	40
68	Harvard University	29
167	Stanford University	23
117	Massachusetts Institute of Technology (MIT)	21
198	University of Chicago	20
197	University of Cambridge	18
26	California Institute of Technology (Caltech)	17
38	Columbia University	17
146	Princeton University	15
152	Rockefeller University	13
119	Max-Planck-Institut	13
222	University of Oxford	12
111	MRC Laboratory of Molecular Biology	10
258	Yale University	9
40	Cornell University	8
12	Bell Laboratories	8
109	London University	7
163	Sorbonne University	7
67	Harvard Medical School	7
192	University College London	7

In [ ]:

h_bar = px.bar(x=sorted_organization_data.prize,
               y=sorted_organization_data.organization_name,
               orientation='h',
               color=sorted_organization_data.prize,
               color_continuous_scale='Viridis',
               title='Top 20 Organizations by Number of Prizes')

h_bar.update_layout(xaxis_title='Number of Prizes',
                    yaxis_title='Organizations',
                    coloraxis_showscale=False)
h_bar.show()

Which Cities Make the Most Discoveries?¶

Where do major discoveries take place?

Create another plotly bar chart graphing the top 20 organisation cities of the research institutions associated with a Nobel laureate.
Where is the number one hotspot for discoveries in the world?
Which city in Europe has had the most discoveries?

In [ ]:

top20_org_cities = df_data.organization_city.value_counts()[:20]
top20_org_cities.sort_values(ascending=True, inplace=True)
city_bar2 = px.bar(x = top20_org_cities.values,
                  y = top20_org_cities.index,
                  orientation='h',
                  color=top20_org_cities.values,
                  color_continuous_scale=px.colors.sequential.Plasma,
                  title='Which Cities Do the Most Research?')

city_bar2.update_layout(xaxis_title='Number of Prizes',
                       yaxis_title='City',
                       coloraxis_showscale=False)
city_bar2.show()

Where are Nobel Laureates Born? Chart the Laureate Birth Cities¶

Create a plotly bar chart graphing the top 20 birth cities of Nobel laureates.
Use a named colour scale called Plasma for the chart.
What percentage of the United States prizes came from Nobel laureates born in New York?
How many Nobel laureates were born in London, Paris and Vienna?
Out of the top 5 cities, how many are in the United States?

In [ ]:

top20_cities = df_data.birth_city.value_counts()[:20]
top20_cities.sort_values(ascending=True, inplace=True)
city_bar = px.bar(x=top20_cities.values,
                  y=top20_cities.index,
                  orientation='h',
                  color=top20_cities.values,
                  color_continuous_scale=px.colors.sequential.Plasma,
                  title='Where were the Nobel Laureates Born?')

city_bar.update_layout(xaxis_title='Number of Prizes',
                       yaxis_title='City of Birth',
                       coloraxis_showscale=False)
city_bar.show()

A larger population means a higher chance that a Nobel laureate is born there: New York, Paris, and London are all very populous. Vienna and Budapest are considerably smaller, yet they still produced many prize winners. That said, much of the ground-breaking research does not take place in big population centres, so the list of birth cities is quite different from the list of organisation cities above. Cambridge (Massachusetts), Stanford, Berkeley, and Cambridge (UK) are all places where many discoveries are made, but they are not among the top birthplaces of laureates.
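One way to check this comparison is to intersect the two top-N lists directly. The following is a minimal sketch that uses a small invented DataFrame called toy in place of df_data; the rows and counts are illustrative only and do not come from the real dataset.

```python
import pandas as pd

# Toy stand-in for df_data; the rows are invented for illustration only.
toy = pd.DataFrame({
    'birth_city': ['New York, NY'] * 3 + ['Paris'] * 2 + ['London'],
    'organization_city': ['Cambridge, MA'] * 3 + ['Paris'] * 2 + ['Stanford, CA'],
})

top_n = 2  # the notebook itself looks at the top 20

# Most frequent cities in each column, as sets so they can be compared.
top_birth = set(toy.birth_city.value_counts().head(top_n).index)
top_org = set(toy.organization_city.value_counts().head(top_n).index)

in_both = top_birth & top_org        # cities that are birthplaces and research hubs
only_research = top_org - top_birth  # research hubs that are not top birthplaces

print(in_both, only_research)
```

With the real data, replacing toy with df_data and setting top_n to 20 should show which research hubs are missing from the birth-city ranking.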

Plotly Sunburst Chart: Combine Country, City, and Organisation¶

Create a DataFrame that groups the number of prizes by organisation.
Then use the plotly documentation to create a sunburst chart.
Click around in your chart. What do you notice about Germany and France?

In [ ]:

country_city_org = df_data.groupby(by=['organization_country',
                                       'organization_city',
                                       'organization_name'], as_index=False).agg({'prize': pd.Series.count})

country_city_org = country_city_org.sort_values('prize', ascending=False)
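To see what this grouping produces, here is a small sketch on an invented DataFrame (the organisation names and prize labels are placeholders, not real data). It uses named aggregation, which is equivalent in effect to the agg call above. It also shows that rows with a missing organisation are left out of the groups, since groupby drops missing keys by default.

```python
import pandas as pd

# Toy stand-in for df_data; all values are placeholders for illustration.
toy = pd.DataFrame({
    'organization_country': ['Germany', 'Germany', 'Germany', 'France', None],
    'organization_city': ['Berlin', 'Berlin', 'Munich', 'Paris', None],
    'organization_name': ['Berlin University', 'Berlin University',
                          'Munich University', 'Sorbonne University', None],
    'prize': ['p1', 'p2', 'p3', 'p4', 'p5'],
})

grouped = (toy.groupby(['organization_country',
                        'organization_city',
                        'organization_name'], as_index=False)
              .agg(prize=('prize', 'count'))       # one row per organisation
              .sort_values('prize', ascending=False))

print(grouped)
```

The laureate without an organisation (the last toy row) does not appear in grouped, so the prize counts in the sunburst chart add up to fewer prizes than there are rows in the data.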

In [ ]:

burst = px.sunburst(country_city_org,
                    path=['organization_country', 'organization_city', 'organization_name'],
                    values='prize',
                    title='Where do Discoveries Take Place?')

# A sunburst chart has no x/y axes and this one has no colour axis,
# so the axis titles and colour-scale settings used for the bar charts are not needed.

burst.show()

Patterns in the Laureate Age at the Time of the Award¶

How Old Are the Laureates When They Win the Prize? Calculate the age of the laureate in the year of the ceremony and add this as a column called winning_age to the df_data DataFrame. Hint: you can use this to help you.

In [ ]:

birth_years = df_data.birth_date.dt.year

In [ ]:

df_data['winning_age'] = df_data.year - birth_years
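The calculation above relies on birth_date already being a datetime column. As a minimal sketch of the same steps on a toy DataFrame: the two named laureates and their dates match the output shown in this notebook, while the third row is a hypothetical organisation with no birth date.

```python
import pandas as pd

toy = pd.DataFrame({
    'full_name': ['John Goodenough', 'Malala Yousafzai', 'Some Organization'],
    'year': [2019, 2014, 2000],
    'birth_date': ['1922-07-25', '1997-07-12', None],  # None: hypothetical organisation
})

# Parse the strings into datetimes; missing values become NaT.
toy['birth_date'] = pd.to_datetime(toy['birth_date'])

# Calendar-year difference: the age the laureate reaches during the award year.
toy['winning_age'] = toy['year'] - toy['birth_date'].dt.year

print(toy[['full_name', 'winning_age']])
```

Because of the missing birth date, the result is a float column with a NaN entry. This is why ages display as 97.00 and why the count of winning_age values can be lower than the number of rows.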

Who were the oldest and youngest winners?¶

Challenge:

What are the names of the youngest and oldest Nobel laureates?
What did they win the prize for?
What is the average age of a winner?
75% of laureates are younger than what age when they receive the prize?
Use Seaborn to create a histogram to visualise the distribution of laureate age at the time of winning. Experiment with the number of bins to see how the visualisation changes.

In [ ]:

display(df_data.nlargest(n=1, columns='winning_age'))
display(df_data.nsmallest(n=1, columns='winning_age'))

	year	category	prize	motivation	prize_share	laureate_type	full_name	birth_date	birth_city	birth_country	birth_country_current	sex	organization_name	organization_city	organization_country	ISO	share_pct	winning_age
937	2019	Chemistry	The Nobel Prize in Chemistry 2019	“for the development of lithium-ion batteries”	1/3	Individual	John Goodenough	1922-07-25	Jena	Germany	Germany	Male	University of Texas	Austin TX	United States of America	DEU	0.33	97.00

	year	category	prize	motivation	prize_share	laureate_type	full_name	birth_date	birth_city	birth_country	birth_country_current	sex	organization_name	organization_city	organization_country	ISO	share_pct	winning_age
885	2014	Peace	The Nobel Peace Prize 2014	"for their struggle against the suppression of...	1/2	Individual	Malala Yousafzai	1997-07-12	Mingora	Pakistan	Pakistan	Female	NaN	NaN	NaN	PAK	0.50	17.00

In [ ]:

df_data.winning_age.describe()

Out[ ]:

count   934.00
mean     59.95
std      12.62
min      17.00
25%      51.00
50%      60.00
75%      69.00
max      97.00
Name: winning_age, dtype: float64
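The 75% row of this summary answers the quartile question. The same number can also be requested directly with quantile. A small sketch on invented ages (not the real distribution):

```python
import pandas as pd

# Invented ages for illustration only.
ages = pd.Series([17, 45, 51, 60, 69, 72, 97], name='winning_age')

q75 = ages.quantile(0.75)         # age below which 75% of values fall (interpolated)
share_below = (ages < q75).mean()  # fraction of values strictly below that age

print(q75, round(share_below, 3))
```

On df_data, the equivalent call would be df_data.winning_age.quantile(0.75), which should match the 75% row of describe().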

Descriptive Statistics for the Laureate Age at Time of Award¶

Calculate the descriptive statistics for the age at the time of the award.
Then visualise the distribution in the form of a histogram using Seaborn's .histplot() function.
Experiment with the bin size. Try 10, 20, 30, and 50.

In [ ]:

plt.figure(figsize=(8, 4), dpi=200)
sns.histplot(data=df_data,
             x='winning_age',
             bins=30)
plt.xlabel('Age')
plt.title('Distribution of Age on Receipt of Prize')
plt.show()

Age at Time of Award throughout History¶

Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?

Use Seaborn to create a .regplot with a trendline.
Set the lowess parameter to True to fit a locally weighted regression (LOWESS) curve, which behaves like a moving average of the trend.
According to the best fit line, how old were Nobel laureates in the years 1900-1940 when they were awarded the prize?
According to the best fit line, what age would it predict for a Nobel laureate in 2020?

In [ ]:

plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.regplot(data=df_data,
                x='year',
                y='winning_age',
                lowess=True,
                scatter_kws = {'alpha': 0.4},
                line_kws={'color': 'black'})

plt.show()
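A LOWESS curve has no single formula, so the predictions asked about above have to be read off the chart. If a number is wanted, one option is to fit an ordinary straight line with NumPy. This is a sketch under that assumption, using invented points that lie exactly on a line. The helper predict_age is a hypothetical name introduced for this example.

```python
import numpy as np

# Invented (year, age) pairs lying exactly on a line, for illustration only.
years = np.array([1900, 1940, 1980, 2020])
ages = np.array([50.0, 55.0, 60.0, 65.0])

# Degree-1 polynomial fit returns the slope first, then the intercept.
slope, intercept = np.polyfit(years, ages, deg=1)


def predict_age(year):
    """Age predicted by the fitted straight line for a given year."""
    return slope * year + intercept


print(round(slope, 3), round(predict_age(2020), 1))
```

With the real data, rows without a winning_age would need to be dropped first, for example with df_data.dropna(subset=['winning_age']), because np.polyfit does not handle NaN values.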

Winning Age Across the Nobel Prize Categories¶

How does the age of laureates vary by category?

Use Seaborn's .boxplot() to show how the median, quartiles, maximum, and minimum values vary across categories. Which category has the longest "whiskers"?
In which prize category are the winners oldest on average?
In which prize category are the winners youngest on average?

In [ ]:

plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.boxplot(data=df_data,
                x='category',
                y='winning_age')

plt.show()
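The values that the box plot draws can also be tabulated per category, which makes the questions above easier to answer exactly. A minimal sketch on invented ages (two categories only, not real data):

```python
import pandas as pd

# Toy stand-in for df_data; the ages are invented for illustration.
toy = pd.DataFrame({
    'category': ['Physics', 'Physics', 'Physics', 'Peace', 'Peace', 'Peace'],
    'winning_age': [40, 55, 70, 17, 60, 85],
})

summary = (toy.groupby('category')['winning_age']
              .agg(['median', 'mean', 'min', 'max']))

# Spread between youngest and oldest winner; a rough proxy for whisker length.
summary['range'] = summary['max'] - summary['min']

oldest_category = summary['mean'].idxmax()

print(summary)
print(oldest_category)
```

Note that the whiskers of a box plot stop at 1.5 times the interquartile range by default, so the min-to-max range here is only an approximation of what the chart shows.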

Now use Seaborn's .lmplot() and the row parameter to create 6 separate charts, one for each prize category. Again set lowess to True.
What are the winning age trends in each category?
Which category has the age trending up and which category has the age trending down?
Is this .lmplot() telling a different story from the .boxplot()?
Create another chart with Seaborn. This time use .lmplot() to put all 6 categories on the same chart using the hue parameter.

In [ ]:

with sns.axes_style('whitegrid'):
    sns.lmplot(data=df_data,
               x='year',
               y='winning_age',
               row = 'category',
               lowess=True,
               aspect=2,
               scatter_kws = {'alpha': 0.6},
               line_kws={'color': 'black'})

plt.show()

We see that winners in physics, chemistry, and medicine have gotten older over time. The ageing trend is strongest for physics. The average age used to be below 50, but now it's over 70. Economics, the newest category, is much more stable in comparison. The peace prize shows the opposite trend where winners are getting younger! As such, our scatter plots showing the best fit lines over time and our box plot of the entire dataset can tell very different stories!
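The direction of each category's trend can also be expressed as a number. As a rough sketch, the code below fits a straight line per category with NumPy and reports the change in age per decade. It uses invented data for two categories, and the helper trend_per_decade is a hypothetical name introduced here. A linear fit is a simplification of the LOWESS curves in the charts.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_data; the values are invented for illustration.
toy = pd.DataFrame({
    'category': ['Physics'] * 3 + ['Peace'] * 3,
    'year': [1900, 1960, 2020, 1900, 1960, 2020],
    'winning_age': [45.0, 57.0, 69.0, 66.0, 60.0, 54.0],
})


def trend_per_decade(group):
    """Slope of a straight-line fit of age against year, in years of age per decade."""
    slope, _ = np.polyfit(group['year'], group['winning_age'], deg=1)
    return slope * 10


trends = {category: trend_per_decade(group)
          for category, group in toy.groupby('category')}

print({category: round(value, 2) for category, value in trends.items()})
```

A positive value means winners in that category are getting older over time, and a negative value means they are getting younger.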
