Canada's Open Government Portal includes NSERC Awards Data from 1995 through 2016.
The awards data (in .csv format) were copied to an Amazon Web Services S3 bucket. This open Jupyter notebook shows the large trends over the data and compares entire selections with each other.
Acknowledgement: I thank Ian Allison and James Colliander of the Pacific Institute for the Mathematical Sciences for building the JupyterHub service and for help with this notebook. -- I. Heisz
import numpy as np
import pandas as pd
import sys
## Bring in a selection of the NSERC awards data starting with 1995 and ending with 2016.
## Throw away as much as you can to keep the DataFrame small enough to manipulate using a laptop.
df = pd.DataFrame()
startYear = 1995
endYear = 2017 # The last year is not included, so if it was 2017 it means we include the 2016 collection but not 2017.
for year in range(startYear, endYear):
file = 'https://s3.ca-central-1.amazonaws.com/open-data-ro/NSERC/NSERC_GRT_FYR' + str(year) + '_AWARD.csv.gz'
df = df.append(pd.read_csv(file,
compression='gzip',
usecols = [1, 2, 3, 4, 5, 7, 9, 11, 12, 17],
encoding='latin-1'
)
)
print(year)
## Rename columns for better readability.
df.columns = ['Name', 'Department', 'OrganizationID',
'Institution', 'ProvinceEN', 'CountryEN',
'FiscalYear', 'AwardAmount', 'ProgramID',
'Committee']
## Strip out any leading or trailing whitespace
df.columns.str.strip()
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
Index(['Name', 'Department', 'OrganizationID', 'Institution', 'ProvinceEN', 'CountryEN', 'FiscalYear', 'AwardAmount', 'ProgramID', 'Committee'], dtype='object')
## Quantify data stored in memory.
print("DataFrame: {:4.2f} Mb".format(sys.getsizeof(df) / (1024. * 1024)))
DataFrame: 207.20 Mb
# missing column info check
columnAmounts = df.count()
print(str(columnAmounts))
#
missingData = pd.DataFrame()
missingData.append(df.loc[(df['Name'].isnull())])
#missingData.append(df.loc[(df['Department'].isnull())])
#missingData.append(df.loc[(df['ProvinceEN'].isnull()) & (df['CountryEN'] == 'CANADA')])
#
# df.head()
Name 457894 Department 457677 OrganizationID 457896 Institution 457896 ProvinceEN 454115 CountryEN 457896 FiscalYear 457896 AwardAmount 457896 ProgramID 457896 Committee 457896 dtype: int64
Name | Department | OrganizationID | Institution | ProvinceEN | CountryEN | FiscalYear | AwardAmount | ProgramID | Committee | |
---|---|---|---|---|---|---|---|---|---|---|
16700 | NaN | Head Office | 11846 | Pacific Institute for the Mathematical Sciences | British Columbia | CANADA | 1999 | 40373 | MISPJ | 103 |
16701 | NaN | Head Office | 14687 | National Engineering Week | Ontario | CANADA | 1999 | 7500 | MISPJ | 103 |
# If you do more than one year at a time you risk crashing the kernel,
# the distribution is very similar year to year.
year = 2016 # Year you want to select for
selectedYearData = df.loc[(df['FiscalYear'] == year)] # Gets the data for the selected year
# The sum of all awards that each individual has received (during the selected year)
individualAwardSum = selectedYearData.groupby(['Name'])['AwardAmount'].transform('sum')
# Make a new column to show each researchers total awards received
selectedYearData = selectedYearData.assign(TotalAward = individualAwardSum)
# Drop the duplicate names, now that we have the award amount info from them.
deduplicatedData = selectedYearData.drop_duplicates(subset = 'Name')
# Sort by total award for the given year
byAwardAmount = deduplicatedData.sort_values(by=['TotalAward'], ascending=[True])
## Statistical information about a specific column in the data
# Change this value to get information about the column you are interested in
columnYouWantInformationOn = 'AwardAmount'
mean = df[columnYouWantInformationOn].mean()
print('The mean of ' + str(columnYouWantInformationOn) + ' is ' + str(mean))
median = df[columnYouWantInformationOn].median()
print('The median of ' + str(columnYouWantInformationOn) + ' is ' + str(median))
standardDeviation = df[columnYouWantInformationOn].std()
print('The standard deviation of ' + str(columnYouWantInformationOn) + ' is ' + str(standardDeviation))
The mean of AwardAmount is 38118.8182775 The median of AwardAmount is 21000.0 The standard deviation of AwardAmount is 141524.555172
## Imports For Plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
!pip3 install plotly --user: #Plotly for interactive graphing
import plotly.plotly as py
import plotly.graph_objs as go
Usage: pip install [options] <requirement specifier> [package-index-options] ... pip install [options] -r <requirements file> [package-index-options] ... pip install [options] [-e] <vcs project url> ... pip install [options] [-e] <local project path> ... pip install [options] <archive url/path> ... no such option: --user:
## Note: this is a static plot with hardcoded variables which are not meant to be changed.
## It is also processing a lot of data so it may take a minute.
axes = plt.gca() # creates the axes
## y axis
yAxis = 'AwardAmount'
y = byAwardAmount[yAxis]
y = y/10**6 # scales y
plt.ylabel(yAxis+' In Millions', fontsize=18) # y axis label
## x axis
x = range(byAwardAmount.shape[0])
plt.xlabel('IndividualResearchers', fontsize=16) # x axis label
axes.set_xticklabels([]) # make it so the x axis is not numbered
plt.title(str(year) + ' NSERC Funding Distribution') # plot title
plot = plt.scatter(x, y, s=1, alpha=0.3) # make the plot
plt.show() # draw the plot