On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.
Nobel dictated that his entire remaining estate should be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.
Every year the Nobel Prize is given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace.
Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel prize and our world more generally?
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
%pip install --upgrade plotly
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Requirement already satisfied: plotly in /usr/local/lib/python3.8/dist-packages (5.11.0) Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.8/dist-packages (from plotly) (8.1.0)
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
pd.options.display.float_format = '{:,.2f}'.format
df_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Nobel+Prize+Analysis+Start/nobel_prize_data.csv')
Caveats: The exact birth dates for Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with a mid-year estimate of July 2nd.
Preliminary data exploration.
df_data
How many rows and columns?
print(df_data.shape)
print(df_data.columns)
print("first year:", df_data.year.min())
print("last year:", df_data.year.max())
(962, 16) Index(['year', 'category', 'prize', 'motivation', 'prize_share', 'laureate_type', 'full_name', 'birth_date', 'birth_city', 'birth_country', 'birth_country_current', 'sex', 'organization_name', 'organization_city', 'organization_country', 'ISO'], dtype='object') first year: 1901 last year: 2020
print(f'Any duplicates? {df_data.duplicated().values.any()}')
Any duplicates? False
print(f'Any NaN values among the data? {df_data.isna().values.any()}')
Any NaN values among the data? True
df_data.isna().sum()
year 0 category 0 prize 0 motivation 88 prize_share 0 laureate_type 0 full_name 0 birth_date 28 birth_city 31 birth_country 28 birth_country_current 28 sex 28 organization_name 255 organization_city 255 organization_country 254 ISO 28 dtype: int64
col_subset = ['year','category', 'laureate_type',
'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset]
year | category | laureate_type | birth_date | full_name | organization_name | |
---|---|---|---|---|---|---|
24 | 1904 | Peace | Organization | NaN | Institut de droit international (Institute of ... | NaN |
60 | 1910 | Peace | Organization | NaN | Bureau international permanent de la Paix (Per... | NaN |
89 | 1917 | Peace | Organization | NaN | Comité international de la Croix Rouge (Intern... | NaN |
200 | 1938 | Peace | Organization | NaN | Office international Nansen pour les Réfugiés ... | NaN |
215 | 1944 | Peace | Organization | NaN | Comité international de la Croix Rouge (Intern... | NaN |
237 | 1947 | Peace | Organization | NaN | American Friends Service Committee (The Quakers) | NaN |
238 | 1947 | Peace | Organization | NaN | Friends Service Council (The Quakers) | NaN |
283 | 1954 | Peace | Organization | NaN | Office of the United Nations High Commissioner... | NaN |
348 | 1963 | Peace | Organization | NaN | Comité international de la Croix Rouge (Intern... | NaN |
349 | 1963 | Peace | Organization | NaN | Ligue des Sociétés de la Croix-Rouge (League o... | NaN |
366 | 1965 | Peace | Organization | NaN | United Nations Children's Fund (UNICEF) | NaN |
399 | 1969 | Peace | Organization | NaN | International Labour Organization (I.L.O.) | NaN |
479 | 1977 | Peace | Organization | NaN | Amnesty International | NaN |
523 | 1981 | Peace | Organization | NaN | Office of the United Nations High Commissioner... | NaN |
558 | 1985 | Peace | Organization | NaN | International Physicians for the Prevention of... | NaN |
588 | 1988 | Peace | Organization | NaN | United Nations Peacekeeping Forces | NaN |
659 | 1995 | Peace | Organization | NaN | Pugwash Conferences on Science and World Affairs | NaN |
682 | 1997 | Peace | Organization | NaN | International Campaign to Ban Landmines (ICBL) | NaN |
703 | 1999 | Peace | Organization | NaN | Médecins Sans Frontières | NaN |
730 | 2001 | Peace | Organization | NaN | United Nations (U.N.) | NaN |
778 | 2005 | Peace | Organization | NaN | International Atomic Energy Agency (IAEA) | NaN |
788 | 2006 | Peace | Organization | NaN | Grameen Bank | NaN |
801 | 2007 | Peace | Organization | NaN | Intergovernmental Panel on Climate Change (IPCC) | NaN |
860 | 2012 | Peace | Organization | NaN | European Union (EU) | NaN |
873 | 2013 | Peace | Organization | NaN | Organisation for the Prohibition of Chemical W... | NaN |
897 | 2015 | Peace | Organization | NaN | National Dialogue Quartet | NaN |
919 | 2017 | Peace | Organization | NaN | International Campaign to Abolish Nuclear Weap... | NaN |
958 | 2020 | Peace | Organization | NaN | World Food Programme (WFP) | NaN |
We also see that since the organisation's name is in the full_name column, the organization_name column contains NaN.
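As a quick sanity check on this observation, the sketch below tests whether every row with a missing birth_date is an organization. It runs on a tiny hand-made DataFrame; the `toy` name and its rows are illustrative stand-ins, not the real dataset.

```python
import pandas as pd

# Toy stand-in for df_data; the rows are illustrative only.
toy = pd.DataFrame({
    "laureate_type": ["Individual", "Organization", "Individual", "Organization"],
    "full_name": ["Marie Curie", "Amnesty International", "John Bardeen", "Grameen Bank"],
    "birth_date": ["1867-11-07", None, "1908-05-23", None],
})

# Rows where the birth date is missing
missing_birth = toy[toy.birth_date.isna()]

# True only if every row without a birth date is an organization
all_orgs = (missing_birth.laureate_type == "Organization").all()
print(all_orgs)
```

The same two lines should work on df_data by swapping `toy` for `df_data`, assuming the column names shown in the notebook.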
Convert the birth_date column to Pandas Datetime objects.
Add a column called share_pct which has the laureates' share as a percentage in the form of a floating-point number.
df_data.birth_date = pd.to_datetime(df_data.birth_date)
separated_values = df_data.prize_share.str.split('/', expand=True)
numerator = pd.to_numeric(separated_values[0])
denominator = pd.to_numeric(separated_values[1])
df_data['share_pct'] = numerator / denominator
df_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 962 entries, 0 to 961 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 962 non-null int64 1 category 962 non-null object 2 prize 962 non-null object 3 motivation 874 non-null object 4 prize_share 962 non-null object 5 laureate_type 962 non-null object 6 full_name 962 non-null object 7 birth_date 934 non-null datetime64[ns] 8 birth_city 931 non-null object 9 birth_country 934 non-null object 10 birth_country_current 934 non-null object 11 sex 934 non-null object 12 organization_name 707 non-null object 13 organization_city 707 non-null object 14 organization_country 708 non-null object 15 ISO 934 non-null object 16 share_pct 962 non-null float64 dtypes: datetime64[ns](1), float64(1), int64(1), object(14) memory usage: 127.9+ KB
df_data.head()
year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD | 1.00 |
1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 1.00 |
2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL | 1.00 |
3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 0.50 |
4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE | 0.50 |
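To make the prize_share conversion easier to follow, here is a minimal sketch of the same split-and-divide steps on a handful of hand-written share strings. The `shares` Series is an illustrative assumption. Note that, despite its name, share_pct holds a fraction between 0 and 1 (as the 1.00 and 0.50 values above show); multiply by 100 if a true percentage is wanted.

```python
import pandas as pd

# Illustrative share strings in the same "numerator/denominator" format
shares = pd.Series(["1/1", "1/2", "1/4", "1/3"], name="prize_share")

# Split on "/" into two string columns (0 and 1)
parts = shares.str.split("/", expand=True)

# Convert both columns to numbers and divide element-wise
share_pct = pd.to_numeric(parts[0]) / pd.to_numeric(parts[1])
print(share_pct.round(2).tolist())
```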
Challenge: Create a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?
biology = df_data.groupby("sex").agg({"prize": pd.Series.count})
fig = px.pie(labels=biology.index,
values=biology["prize"],
hole=.3,
names=biology.index)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')
fig.show()
Challenge: Who were the first 3 female Nobel laureates? What do we see in their birth_country? Were they part of an organisation?
df_data[df_data.sex == "Female"].sort_values("year")[:3]
year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18 | 1903 | Physics | The Nobel Prize in Physics 1903 | "in recognition of the extraordinary services ... | 1/4 | Individual | Marie Curie, née Sklodowska | 1867-11-07 | Warsaw | Russian Empire (Poland) | Poland | Female | NaN | NaN | NaN | POL | 0.25 |
29 | 1905 | Peace | The Nobel Peace Prize 1905 | NaN | 1/1 | Individual | Baroness Bertha Sophie Felicita von Suttner, n... | 1843-06-09 | Prague | Austrian Empire (Czech Republic) | Czech Republic | Female | NaN | NaN | NaN | CZE | 1.00 |
51 | 1909 | Literature | The Nobel Prize in Literature 1909 | "in appreciation of the lofty idealism, vivid ... | 1/1 | Individual | Selma Ottilia Lovisa Lagerlöf | 1858-11-20 | Mårbacka | Sweden | Sweden | Female | NaN | NaN | NaN | SWE | 1.00 |
Even without looking at the data, you might have already guessed one of the famous names: Marie Curie.
Challenge: Did some people get a Nobel Prize more than once? If so, who were they?
is_winner = df_data.duplicated(subset=['full_name'], keep=False)
multiple_winners = df_data[is_winner]
print(f'There are {multiple_winners.full_name.nunique()}' \
' winners who were awarded the prize more than once.')
There are 6 winners who were awarded the prize more than once.
col_subset = ['year', 'category', 'laureate_type', 'full_name']
multiple_winners[col_subset]
year | category | laureate_type | full_name | |
---|---|---|---|---|
18 | 1903 | Physics | Individual | Marie Curie, née Sklodowska |
62 | 1911 | Chemistry | Individual | Marie Curie, née Sklodowska |
89 | 1917 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
215 | 1944 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
278 | 1954 | Chemistry | Individual | Linus Carl Pauling |
283 | 1954 | Peace | Organization | Office of the United Nations High Commissioner... |
297 | 1956 | Physics | Individual | John Bardeen |
306 | 1958 | Chemistry | Individual | Frederick Sanger |
340 | 1962 | Peace | Individual | Linus Carl Pauling |
348 | 1963 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
424 | 1972 | Physics | Individual | John Bardeen |
505 | 1980 | Chemistry | Individual | Frederick Sanger |
523 | 1981 | Peace | Organization | Office of the United Nations High Commissioner... |
We see that Marie Curie actually got the Nobel prize twice - once in physics and once in chemistry. Linus Carl Pauling got it first in chemistry and later for peace given his work in promoting nuclear disarmament. Also, the International Red Cross was awarded the Peace prize a total of 3 times. The first two times were both during the devastating World Wars.
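The repeat-winner lookup hinges on the keep argument of .duplicated(). The sketch below, on a small made-up table, contrasts the default (which flags only the second and later occurrences) with keep=False (which flags every row whose name appears more than once). The `toy` DataFrame is an assumption for illustration.

```python
import pandas as pd

# Small illustrative table: two repeat winners and one single winner
toy = pd.DataFrame({
    "year": [1903, 1911, 1954, 1962, 1921],
    "full_name": ["Marie Curie", "Marie Curie",
                  "Linus Pauling", "Linus Pauling",
                  "Albert Einstein"],
})

# Default keep='first': the first occurrence is NOT marked as a duplicate
first_only = toy.duplicated(subset=["full_name"])

# keep=False: every row belonging to a repeated name is marked
all_repeats = toy.duplicated(subset=["full_name"], keep=False)

print(first_only.tolist())
print(all_repeats.tolist())
```

Because the match is on the exact string in full_name, this approach would presumably miss a laureate whose name is spelled differently across rows.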
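Challenge: In how many categories are prizes awarded? Create a plotly bar chart of the number of prizes per category. Use the colour scale called Aggrnyl to colour the chart, but don't show a color axis.
df_data.category.nunique()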
6
prizes_per_category = df_data.category.value_counts()
v_bar = px.bar(
x = prizes_per_category.index,
y = prizes_per_category.values,
color = prizes_per_category.values,
color_continuous_scale='Aggrnyl',
title='Number of Prizes Awarded per Category')
v_bar.update_layout(xaxis_title='Nobel Prize Category',
coloraxis_showscale=False,
yaxis_title='Number of Prizes')
v_bar.show()
df_data[df_data.category == 'Economics'].sort_values('year')[:3]
year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
393 | 1969 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for having developed and applied dynamic mode... | 1/2 | Individual | Jan Tinbergen | 1903-04-12 | the Hague | Netherlands | Netherlands | Male | The Netherlands School of Economics | Rotterdam | Netherlands | NLD | 0.50 |
394 | 1969 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for having developed and applied dynamic mode... | 1/2 | Individual | Ragnar Frisch | 1895-03-03 | Oslo | Norway | Norway | Male | University of Oslo | Oslo | Norway | NOR | 0.50 |
402 | 1970 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for the scientific work through which he has ... | 1/1 | Individual | Paul A. Samuelson | 1915-05-15 | Gary, IN | United States of America | United States of America | Male | Massachusetts Institute of Technology (MIT) | Cambridge, MA | United States of America | USA | 1.00 |
The first Economics prize was awarded in 1969, shared equally between Jan Tinbergen and Ragnar Frisch.
Create a plotly bar chart that shows the split between men and women by category.
cat_men_women = df_data.groupby(['category', 'sex'],
as_index=False).agg({'prize': pd.Series.count})
cat_men_women.sort_values('prize', ascending=False, inplace=True)
v_bar_split = px.bar(x = cat_men_women.category,
y = cat_men_women.prize,
color = cat_men_women.sex,
title='Number of Prizes Awarded per Category split by Men and Women')
v_bar_split.update_layout(xaxis_title='Nobel Prize Category',
yaxis_title='Number of Prizes')
v_bar_split.show()
We see that overall the imbalance is pretty large with physics, economics, and chemistry. Women are somewhat more represented in categories of Medicine, Literature and Peace. Splitting bar charts like this is an incredibly powerful way to show a more granular picture.
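Raw counts make it hard to compare categories of different sizes, so one option is to look at proportions instead. The sketch below uses pd.crosstab with normalize="index" on a made-up table to get the share of men and women within each category. The `toy` data is purely illustrative and does not reflect the real numbers.

```python
import pandas as pd

# Illustrative rows only; not the real laureate data
toy = pd.DataFrame({
    "category": ["Physics", "Physics", "Physics", "Physics",
                 "Peace", "Peace", "Literature", "Literature"],
    "sex": ["Male", "Male", "Male", "Female",
            "Male", "Female", "Female", "Male"],
})

# normalize="index" turns each row (category) into fractions that sum to 1
share = pd.crosstab(toy.category, toy.sex, normalize="index")
print(share)
```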
Challenge: Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually.
Count the number of prizes awarded every year.
Create a 5 year rolling average of the number of prizes (Hint: see previous lessons analysing Google Trends).
Using Matplotlib superimpose the rolling average on a scatter plot.
Show a tick mark on the x-axis for every 5 years from 1900 to 2020. (Hint: you'll need to use NumPy).
Use the named colours to draw the data points in dodgerblue while the rolling average is coloured in crimson.
prize_per_year = df_data.groupby("year").agg({'prize': pd.Series.count})
prize_per_year.sort_values('year', inplace=True)
prize_per_year
prize | |
---|---|
year | |
1901 | 6 |
1902 | 7 |
1903 | 7 |
1904 | 6 |
1905 | 5 |
... | ... |
2016 | 11 |
2017 | 12 |
2018 | 13 |
2019 | 14 |
2020 | 12 |
117 rows × 1 columns
moving_average = prize_per_year.rolling(window=5).mean()
moving_average
prize | |
---|---|
year | |
1901 | NaN |
1902 | NaN |
1903 | NaN |
1904 | NaN |
1905 | 6.20 |
... | ... |
2016 | 11.60 |
2017 | 12.00 |
2018 | 12.00 |
2019 | 12.20 |
2020 | 12.40 |
117 rows × 1 columns
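The NaN values in the first four rows appear because a 5-year window needs five observations before it can produce a mean. The sketch below reproduces this with the first five yearly counts from the table above and, as an optional variation that the notebook itself does not use, shows how min_periods=1 would fill the early rows with partial-window averages.

```python
import pandas as pd

# First five yearly counts, copied from the prize_per_year output above
counts = pd.Series([6, 7, 7, 6, 5], index=[1901, 1902, 1903, 1904, 1905])

# Strict window: NaN until five values are available
strict = counts.rolling(window=5).mean()

# Variation (not in the notebook): average over whatever is available so far
lenient = counts.rolling(window=5, min_periods=1).mean()

print(strict)
print(lenient)
```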
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
fontsize=14,
rotation=45)
ax = plt.gca() # get current axis
ax.set_xlim(1900, 2020)
ax.scatter(x=prize_per_year.index,
y=prize_per_year.values,
c='dodgerblue',
alpha=0.7,
s=100,)
ax.plot(prize_per_year.index,
moving_average.values,
c='crimson',
linewidth=3,)
plt.show()
Investigate if more prizes are shared than before.
yearly_avg_share = df_data.groupby(by='year').agg({'share_pct': pd.Series.mean})
share_moving_average = yearly_avg_share.rolling(window=5).mean()
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
fontsize=14,
rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx() # create second y-axis
ax1.set_xlim(1900, 2020)
ax1.scatter(x=prize_per_year.index,
y=prize_per_year.values,
c='dodgerblue',
alpha=0.7,
s=100,)
ax1.plot(prize_per_year.index,
moving_average.values,
c='crimson',
linewidth=3,)
# Can invert axis
ax2.invert_yaxis()
# Adding prize share plot on second axis
ax2.plot(prize_per_year.index,
share_moving_average.values,
c='grey',
linewidth=3,)
plt.show()
There is clearly an upward trend in the number of prizes being given out as more and more prizes are shared. Also, more prizes are being awarded from 1969 onwards because of the addition of the economics category. We also see that very few prizes were awarded during the First and Second World Wars. Note that instead of there being a zero entry for those years, we see the effect of the wars as missing blue dots.
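Years without any award are simply absent from the grouped data, which is why they show up as gaps rather than zeros. A minimal sketch of how such gaps could be made explicit is to reindex the yearly counts over the full range of years. The counts below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented yearly counts with a gap; not real figures
counts = pd.Series([5, 3, 1, 6], index=[1938, 1939, 1943, 1944], name="prize")

# Every year between the first and last year, inclusive
full_range = range(counts.index.min(), counts.index.max() + 1)

# Years that have no row at all in the grouped data
missing_years = [year for year in full_range if year not in counts.index]

# Reindex so the missing years appear with an explicit zero
filled = counts.reindex(full_range, fill_value=0)

print(missing_years)
print(filled)
```

Whether zeros are desirable depends on the chart: a rolling average computed on `filled` would dip in those years, whereas the notebook's version skips them.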
Challenge: Create a DataFrame called top20_countries that has two columns. The prize column should contain the total number of prizes won.
Is it best to use birth_country, birth_country_current or organization_country?
What are some potential problems when using birth_country or any of the others? Which column is the least problematic?
If you look at the entries in birth_country, you'll see that some countries no longer exist, such as the Soviet Union or Czechoslovakia. Hence, using birth_country_current is better, since it has the name of the country which currently controls the city where the laureate was born. Note that this does not determine the laureates' nationality, since some globetrotting parents had their future Nobel laureate children while abroad. People's nationalities can also change as they emigrate and acquire a different citizenship, or get married and change citizenship. What this boils down to is that we will have to be clear about the assumptions that we make in the upcoming analysis.
top_countries = df_data.groupby("birth_country_current", as_index = False).agg({"prize":pd.Series.count})
top_countries.sort_values(by='prize', inplace=True)
top20_countries = top_countries[-20:]
h_bar = px.bar(x=top20_countries.prize,
y=top20_countries.birth_country_current,
orientation='h',
color=top20_countries.prize,
color_continuous_scale='Viridis',
title='Top 20 Countries by Number of Prizes')
h_bar.update_layout(xaxis_title='Number of Prizes',
yaxis_title='Country',
coloraxis_showscale=False)
h_bar.show()
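As an alternative to grouping, sorting and slicing, the same top-N ranking can be sketched with value_counts() and nlargest(). This is a variation rather than the approach used above, and the `toy` rows are illustrative. Rows with a missing country (such as organizations) are dropped by value_counts() by default.

```python
import pandas as pd

# Illustrative rows; None mimics a laureate without a birth country
toy = pd.DataFrame({
    "birth_country_current": ["USA", "USA", "USA", "France", "France", "Sweden", None]
})

# Count rows per country (NaN excluded), then keep the two largest
top2 = toy.birth_country_current.value_counts().nlargest(2)

# Ascending order suits a horizontal bar chart, where the last row is drawn on top
top2_for_chart = top2.sort_values(ascending=True)
print(top2_for_chart)
```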
Use the matter colour scale on this map.
Hint: You'll need to use a 3 letter country code for each country.
df_countries = df_data.groupby(['birth_country_current', 'ISO'],
as_index=False).agg({'prize': pd.Series.count})
df_countries.sort_values('prize', ascending=False)
birth_country_current | ISO | prize | |
---|---|---|---|
74 | United States of America | USA | 281 |
73 | United Kingdom | GBR | 105 |
26 | Germany | DEU | 84 |
25 | France | FRA | 57 |
67 | Sweden | SWE | 29 |
... | ... | ... | ... |
32 | Iceland | ISL | 1 |
47 | Madagascar | MDG | 1 |
34 | Indonesia | IDN | 1 |
36 | Iraq | IRQ | 1 |
78 | Zimbabwe | ZWE | 1 |
79 rows × 3 columns
fig = px.choropleth(df_countries, locations="ISO",
color="prize", # number of prizes per country
hover_name="birth_country_current", # column to add to hover information
color_continuous_scale=px.colors.sequential.matter)
fig.show()
Challenge: See if you can divide up the plotly bar chart you created above to show which categories made up the total number of prizes. Here's what you're aiming for:
The hard part is preparing the data for this chart!
Hint: Take a two-step approach. The first step is grouping the data by country and category. Then you can create a DataFrame that looks something like this:
cat_prize_per_country = df_data.groupby(["birth_country_current", "category"], as_index = False).agg({"prize":pd.Series.count})
cat_prize_per_country.rename(columns={"prize": "cat_prize"}, inplace=True)
cat_prize_per_country
birth_country_current | category | cat_prize | |
---|---|---|---|
0 | Algeria | Literature | 1 |
1 | Algeria | Physics | 1 |
2 | Argentina | Medicine | 2 |
3 | Argentina | Peace | 2 |
4 | Australia | Chemistry | 1 |
... | ... | ... | ... |
206 | United States of America | Physics | 70 |
207 | Venezuela | Medicine | 1 |
208 | Vietnam | Peace | 1 |
209 | Yemen | Peace | 1 |
210 | Zimbabwe | Peace | 1 |
211 rows × 3 columns
Join cat_prize_per_country with top20_countries
merged_df = pd.merge(cat_prize_per_country, top20_countries, on='birth_country_current')
# change column names
merged_df.columns = ['birth_country_current', 'category', 'cat_prize', 'total_prize']
merged_df.sort_values(by='total_prize', inplace=True)
merged_df
birth_country_current | category | cat_prize | total_prize | |
---|---|---|---|---|
12 | Belgium | Peace | 3 | 9 |
42 | Hungary | Chemistry | 3 | 9 |
43 | Hungary | Economics | 1 | 9 |
52 | India | Physics | 1 | 9 |
51 | India | Peace | 1 | 9 |
... | ... | ... | ... | ... |
104 | United States of America | Chemistry | 55 | 281 |
105 | United States of America | Economics | 49 | 281 |
106 | United States of America | Literature | 10 | 281 |
107 | United States of America | Medicine | 78 | 281 |
109 | United States of America | Physics | 70 | 281 |
110 rows × 4 columns
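The merge step does two things at once: it attaches each country's total to every one of its category rows, and, because pd.merge defaults to an inner join, it drops countries that are not in the top list. The sketch below shows both effects on two small made-up tables; the column name `country` and all values are illustrative assumptions.

```python
import pandas as pd

# Per-category counts for three hypothetical countries
per_category = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "category": ["Physics", "Peace", "Physics", "Peace"],
    "cat_prize": [3, 1, 2, 1],
})

# Totals for the "top" countries only; C is deliberately left out
totals = pd.DataFrame({"country": ["A", "B"], "total_prize": [4, 2]})

# Inner join (the default): only countries present in both tables survive
merged = pd.merge(per_category, totals, on="country")
print(merged)
```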
cat_cntry_bar = px.bar(x=merged_df.cat_prize,
y=merged_df.birth_country_current,
color=merged_df.category,
orientation='h',
title='Top 20 Countries by Number of Prizes and Category')
cat_cntry_bar.update_layout(xaxis_title='Number of Prizes',
yaxis_title='Country')
cat_cntry_bar.show()
What we see is that the US has won an incredible proportion of the prizes in the field of Economics. In comparison, Japan and Germany have won very few or no economics prizes at all. Also, the US has more prizes in physics or medicine alone than all of France's prizes combined. On the chart, we also see that Germany won more prizes in physics than the UK and that France has won more prizes in peace and literature than Germany, even though Germany has been awarded a higher total number of prizes than France.
Challenge: Calculate the cumulative number of prizes won by each country over time. Use the birth_country_current of the winner to calculate this.
prize_by_year = df_data.groupby(by=['birth_country_current', 'year'], as_index=False).count()
prize_by_year = prize_by_year.sort_values('year')[['year', 'birth_country_current', 'prize']]
prize_by_year
year | birth_country_current | prize | |
---|---|---|---|
118 | 1901 | France | 2 |
346 | 1901 | Poland | 1 |
159 | 1901 | Germany | 1 |
312 | 1901 | Netherlands | 1 |
440 | 1901 | Switzerland | 1 |
... | ... | ... | ... |
31 | 2019 | Austria | 1 |
221 | 2020 | Germany | 1 |
622 | 2020 | United States of America | 7 |
533 | 2020 | United Kingdom | 2 |
158 | 2020 | France | 1 |
627 rows × 3 columns
cumulative_prizes = prize_by_year.groupby(by=['birth_country_current',
'year']).sum().groupby(level=[0]).cumsum()
cumulative_prizes.reset_index(inplace=True)
cumulative_prizes
birth_country_current | year | prize | |
---|---|---|---|
0 | Algeria | 1957 | 1 |
1 | Algeria | 1997 | 2 |
2 | Argentina | 1936 | 1 |
3 | Argentina | 1947 | 2 |
4 | Argentina | 1980 | 3 |
... | ... | ... | ... |
622 | United States of America | 2020 | 281 |
623 | Venezuela | 1980 | 1 |
624 | Vietnam | 1973 | 1 |
625 | Yemen | 2011 | 1 |
626 | Zimbabwe | 1960 | 1 |
627 rows × 3 columns
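The cumulative totals come from a running sum that restarts for each country. A compact way to sketch the same idea is to sort by country and year and then call cumsum() on the grouped column. The `toy` table and its `country` column are illustrative assumptions.

```python
import pandas as pd

# Prizes per country per year for two hypothetical countries
toy = pd.DataFrame({
    "country": ["A", "A", "B", "A", "B"],
    "year": [1901, 1905, 1905, 1910, 1912],
    "prize": [1, 2, 1, 1, 3],
})

# Sort so that each country's rows are in chronological order
toy = toy.sort_values(["country", "year"])

# Running total that resets at the start of every country
toy["cumulative"] = toy.groupby("country")["prize"].cumsum()
print(toy)
```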
l_chart = px.line(cumulative_prizes,
x='year',
y='prize',
color='birth_country_current',
hover_name='birth_country_current')
l_chart.update_layout(xaxis_title='Year',
yaxis_title='Number of Prizes')
l_chart.show()
What we see is that the United States really started to take off after the Second World War which decimated Europe. Prior to that, the Nobel prize was pretty much a European affair. Very few laureates were chosen from other parts of the world. This has changed dramatically in the last 40 years or so. There are many more countries represented today than in the early days. Interestingly we also see that the UK and Germany traded places in the 70s and 90s on the total number of prizes won. Sweden being 5th place pretty consistently over many decades is quite interesting too. Perhaps this reflects a little bit of home bias?
Challenge: Create a bar chart showing the organisations affiliated with the Nobel laureates. It should look something like this:
sorted_organization_data = df_data.groupby("organization_name", as_index= False).agg({"prize": pd.Series.count})
sorted_organization_data.sort_values("prize", ascending = False, inplace = True)
sorted_organization_data.reset_index()
index | organization_name | prize | |
---|---|---|---|
0 | 196 | University of California | 40 |
1 | 68 | Harvard University | 29 |
2 | 167 | Stanford University | 23 |
3 | 117 | Massachusetts Institute of Technology (MIT) | 21 |
4 | 198 | University of Chicago | 20 |
... | ... | ... | ... |
259 | 110 | Long Term Capital Management | 1 |
260 | 112 | Madrid University | 1 |
261 | 113 | Mainz University | 1 |
262 | 114 | Marburg University | 1 |
263 | 263 | École municipale de physique et de chimie indu... | 1 |
264 rows × 3 columns
sorted_organization_data = sorted_organization_data[:20]
sorted_organization_data
organization_name | prize | |
---|---|---|
196 | University of California | 40 |
68 | Harvard University | 29 |
167 | Stanford University | 23 |
117 | Massachusetts Institute of Technology (MIT) | 21 |
198 | University of Chicago | 20 |
197 | University of Cambridge | 18 |
26 | California Institute of Technology (Caltech) | 17 |
38 | Columbia University | 17 |
146 | Princeton University | 15 |
152 | Rockefeller University | 13 |
119 | Max-Planck-Institut | 13 |
222 | University of Oxford | 12 |
111 | MRC Laboratory of Molecular Biology | 10 |
258 | Yale University | 9 |
40 | Cornell University | 8 |
12 | Bell Laboratories | 8 |
109 | London University | 7 |
163 | Sorbonne University | 7 |
67 | Harvard Medical School | 7 |
192 | University College London | 7 |
h_bar = px.bar(x=sorted_organization_data.prize,
y=sorted_organization_data.organization_name,
orientation='h',
color=sorted_organization_data.prize,
color_continuous_scale='Viridis',
title='Top 20 Organizations by Number of Prizes')
h_bar.update_layout(xaxis_title='Number of Prizes',
yaxis_title='Organizations',
coloraxis_showscale=False)
h_bar.show()
Where do major discoveries take place?
top20_org_cities = df_data.organization_city.value_counts()[:20]
top20_org_cities.sort_values(ascending=True, inplace=True)
city_bar2 = px.bar(x = top20_org_cities.values,
y = top20_org_cities.index,
orientation='h',
color=top20_org_cities.values,
color_continuous_scale=px.colors.sequential.Plasma,
title='Which Cities Do the Most Research?')
city_bar2.update_layout(xaxis_title='Number of Prizes',
yaxis_title='City',
coloraxis_showscale=False)
city_bar2.show()
Use the Plasma colour scale for the chart.
top20_cities = df_data.birth_city.value_counts()[:20]
top20_cities.sort_values(ascending=True, inplace=True)
city_bar = px.bar(x=top20_cities.values,
y=top20_cities.index,
orientation='h',
color=top20_cities.values,
color_continuous_scale=px.colors.sequential.Plasma,
title='Where were the Nobel Laureates Born?')
city_bar.update_layout(xaxis_title='Number of Prizes',
yaxis_title='City of Birth',
coloraxis_showscale=False)
city_bar.show()
A higher population definitely means that there's a higher chance of a Nobel laureate being born there. New York, Paris, and London are all very populous. However, Vienna and Budapest are not and still produced many prize winners. That said, much of the ground-breaking research does not take place in big population centres, so the list of birth cities is quite different from the list above. Cambridge (Massachusetts), Stanford, Berkeley and Cambridge (UK) are all places where many discoveries are made, but they are not the birthplaces of laureates.
Here's what you're aiming for:
country_city_org = df_data.groupby(by=['organization_country',
'organization_city',
'organization_name'], as_index=False).agg({'prize': pd.Series.count})
country_city_org = country_city_org.sort_values('prize', ascending=False)
burst = px.sunburst(country_city_org,
path=['organization_country', 'organization_city', 'organization_name'],
values='prize',
title='Where do Discoveries Take Place?',
)
burst.update_layout(xaxis_title='Number of Prizes',
yaxis_title='City',
coloraxis_showscale=False)
burst.show()
birth_years = df_data.birth_date.dt.year
df_data['winning_age'] = df_data.year - birth_years
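The sketch below applies the same subtraction to two laureates whose birth dates appear in this dataset, to show what the calculation yields. Because it uses only the year of birth, the result is the age the laureate reaches during the award year; on the actual day of the award the person could still be one year younger. The `toy` DataFrame is a hand-built stand-in for df_data.

```python
import pandas as pd

# Two laureates with birth dates as recorded in the dataset
toy = pd.DataFrame({
    "full_name": ["John Goodenough", "Malala Yousafzai"],
    "year": [2019, 2014],
    "birth_date": pd.to_datetime(["1922-07-25", "1997-07-12"]),
})

# Award year minus birth year (ignores month and day)
toy["winning_age"] = toy.year - toy.birth_date.dt.year
print(toy[["full_name", "winning_age"]])
```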
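Challenge: Who were the oldest and youngest winners? Visualise the distribution of winning ages as a histogram, and experiment with the number of bins to see how the visualisation changes.
display(df_data.nlargest(n=1, columns='winning_age'))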
display(df_data.nsmallest(n=1, columns='winning_age'))
year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
937 | 2019 | Chemistry | The Nobel Prize in Chemistry 2019 | “for the development of lithium-ion batteries” | 1/3 | Individual | John Goodenough | 1922-07-25 | Jena | Germany | Germany | Male | University of Texas | Austin TX | United States of America | DEU | 0.33 | 97.00 |
year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
885 | 2014 | Peace | The Nobel Peace Prize 2014 | "for their struggle against the suppression of... | 1/2 | Individual | Malala Yousafzai | 1997-07-12 | Mingora | Pakistan | Pakistan | Female | NaN | NaN | NaN | PAK | 0.50 | 17.00 |
df_data.winning_age.describe()
count 934.00 mean 59.95 std 12.62 min 17.00 25% 51.00 50% 60.00 75% 69.00 max 97.00 Name: winning_age, dtype: float64
Experiment with the bin size. Try 10, 20, 30, and 50.
plt.figure(figsize=(8, 4), dpi=200)
sns.histplot(data=df_data,
x=df_data.winning_age,
bins=30)
plt.xlabel('Age')
plt.title('Distribution of Age on Receipt of Prize')
plt.show()
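To see what changing the number of bins actually does, it can help to look at the counts without drawing anything. The sketch below uses np.histogram, a swapped-in NumPy function rather than Seaborn's histplot, on a made-up list of ages: the total count stays the same while the number of bin edges changes.

```python
import numpy as np

# Made-up ages for illustration only
ages = np.array([17, 25, 32, 45, 51, 55, 58, 60, 60, 62, 65, 69, 71, 80, 97])

for n_bins in (3, 5, 10):
    # counts has n_bins entries; edges has n_bins + 1 boundaries
    counts, edges = np.histogram(ages, bins=n_bins)
    print(n_bins, counts.tolist())
```

With few bins the shape looks smooth but coarse; with many bins individual gaps start to appear, which is the trade-off to watch for when trying 10, 20, 30, and 50.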
Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?
Use Seaborn's .regplot() and set the lowess parameter to True to show a moving average of the linear fit.
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.regplot(data=df_data,
                x='year',
                y='winning_age',
                lowess=True,
                scatter_kws = {'alpha': 0.4},
                line_kws={'color': 'black'})
plt.show()
How does the age of laureates vary by category?
Use Seaborn's .boxplot() to show how the median, quartiles, max, and minimum values vary across categories. Which category has the longest "whiskers"?
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.boxplot(data=df_data,
                x='category',
                y='winning_age')
plt.show()
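The numbers behind a box plot can also be computed directly. The sketch below derives the quartiles, the median and the interquartile range per category on a made-up table; a wider interquartile range generally goes along with a taller box. The `toy` data and the `stats` name are illustrative assumptions.

```python
import pandas as pd

# Made-up ages for two categories
toy = pd.DataFrame({
    "category": ["Peace"] * 5 + ["Physics"] * 5,
    "winning_age": [17, 40, 60, 70, 85, 45, 50, 55, 60, 65],
})

# One row per category, one column per quantile (0.25, 0.5, 0.75)
stats = toy.groupby("category").winning_age.quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range: the height of the box
stats["iqr"] = stats[0.75] - stats[0.25]
print(stats)
```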
Use Seaborn's .lmplot() and the row parameter to create 6 separate charts for each prize category. Again set lowess to True.
Is the .lmplot() telling a different story from the .boxplot()?
Then use .lmplot() to put all 6 categories on the same chart using the hue parameter.
with sns.axes_style('whitegrid'):
    sns.lmplot(data=df_data,
               x='year',
               y='winning_age',
               row = 'category',
               lowess=True,
               aspect=2,
               scatter_kws = {'alpha': 0.6},
               line_kws = {'color': 'black'},)
plt.show()
We see that winners in physics, chemistry, and medicine have gotten older over time. The ageing trend is strongest for physics. The average age used to be below 50, but now it's over 70. Economics, the newest category, is much more stable in comparison. The peace prize shows the opposite trend where winners are getting younger! As such, our scatter plots showing the best fit lines over time and our box plot of the entire dataset can tell very different stories!
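One rough way to put a number on these trends is to fit a straight line per category and compare the slopes. This is a cruder summary than the lowess curves in the charts and is offered only as a sketch: it uses np.polyfit on invented data, and the `slopes` dictionary is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Invented data: one category ageing, one getting younger
toy = pd.DataFrame({
    "category": ["Physics"] * 4 + ["Peace"] * 4,
    "year": [1900, 1940, 1980, 2020] * 2,
    "winning_age": [45, 55, 65, 75, 70, 65, 60, 55],
})

# Degree-1 polyfit returns [slope, intercept]; keep the slope (years of age per calendar year)
slopes = {
    name: np.polyfit(group.year, group.winning_age, deg=1)[0]
    for name, group in toy.groupby("category")
}
print(slopes)
```

A positive slope means laureates in that category tend to be older in later years; a negative slope means the opposite.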