In this study, we analyze data on college majors.
We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
There is a wealth of information in this dataset. We'll explore part of it through a variety of visualizations (scatter plots, histograms, scatter matrices, bar charts), aiming to answer questions such as:
Let's start with some preparations to enable data visualization in this notebook.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
# Enable display of plots inline
%matplotlib inline
Let's read in the data and do some initial exploration.
# Read in the data
recent_grads = pd.read_csv('recent-grads.csv')
# Show count of rows and columns
print('Row count, column count:', recent_grads.shape)
# Show the first row (as a listing of column names and values)
print('\n')
print(recent_grads.iloc[0])
# Show the first three and the last three rows
print('\n')
print(recent_grads.head(3))
print('\n')
print(recent_grads.tail(3))
# Show key statistics of all (numeric) columns
print('\n')
print(recent_grads.describe())
To enable data visualization using matplotlib, there should not be rows with missing values. Let's do the required cleaning (and check how much data was removed).
# Number of rows (before)
raw_data_count = recent_grads.shape[0]
print('Number of rows in the raw data: ', raw_data_count)
# Drop rows with missing values
recent_grads.dropna(inplace=True)
# Number of rows (after)
cleaned_data_count = recent_grads.shape[0]
print('Number of rows after removing rows with missing values: ', cleaned_data_count)
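Before dropping rows, it can help to see which columns actually contain the missing values. The following is a minimal sketch of that check. The `sample` DataFrame is a hypothetical stand-in for `recent_grads` with invented values, so its output says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
sample = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Total": [1200.0, np.nan, 560.0, 2300.0],
    "ShareWomen": [0.12, np.nan, 0.55, 0.71],
    "Median": [60000, 45000, 38000, 33000],
})

# Count missing values per column before dropping anything
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Show the rows that dropna() would remove
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)

# Drop them and confirm how many rows were removed
cleaned = sample.dropna()
rows_removed = len(sample) - len(cleaned)
print("Rows removed:", rows_removed)
```

Running the same three steps on `recent_grads` before the `dropna` call would show which major is dropped and why.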
One row was deleted, leaving 172 rows. We are now ready to explore this data using visualizations.
To understand the various plots below, take note of what the different columns in the data represent:
Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.
We'll start with creating some scatter plots to see if we can find answers to these questions:
# Scatter plot showing median salary vs total number of people with the major
recent_grads.plot(x='Total', y='Median', kind='scatter')
# Scatter plot showing the sample size for median salary figures vs total number of people with the major
recent_grads.plot(x='Total', y='Sample_size', kind='scatter')
# Scatter plot showing median salary vs the percentage of women with the major
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
# Scatter plot showing the number of full-time employed vs the median salary
recent_grads.plot(x='Median', y='Full_time', kind='scatter')
# Scatter plot showing the number of part-time employed vs the median salary
recent_grads.plot(x='Median', y='Part_time', kind='scatter')
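Scatter plots show the shape of a relationship, and a correlation coefficient can put a rough number on it. The sketch below computes Pearson correlations with `Series.corr`. The `sample` DataFrame is hypothetical and its values are invented, so the resulting numbers are not findings about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
sample = pd.DataFrame({
    "Total": [2000, 15000, 80000, 300000, 40000],
    "ShareWomen": [0.10, 0.25, 0.60, 0.75, 0.90],
    "Median": [75000, 60000, 40000, 35000, 30000],
})

# Pearson correlation (the default method) for two of the plotted pairs
corr_total_median = sample["Total"].corr(sample["Median"])
corr_share_median = sample["ShareWomen"].corr(sample["Median"])

print("Total vs Median:", round(corr_total_median, 2))
print("ShareWomen vs Median:", round(corr_share_median, 2))
```

A value near 0 suggests no linear relationship, while values near -1 or 1 suggest a strong one. Pearson correlation only captures linear trends, so it complements the plots rather than replacing them.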
Conclusions:
We'll continue with creating some histograms to find answers to this:
Let's create a histogram to show how the percentage of women is distributed. Conclusions are written directly below each graph.
fig, ax = plt.subplots()
selected_column = 'ShareWomen'
data_to_show = recent_grads[selected_column] * 100  # Multiply by 100 to show as a percentage
data_to_show.hist(bins=20, ax=ax)  # Draw on the axes created above
ax.set_title(selected_column)
ax.set_xlabel('Percentage of Women')
ax.set_ylabel('Frequency')
plt.show()
We can see that there are majors where the percentage of women is close to 0% and majors where it is almost 100%, with everything in between. To answer our question, we can (albeit somewhat clunkily) create the same histogram with just two bins.
fig, ax = plt.subplots()
selected_column = 'ShareWomen'
data_to_show = recent_grads[selected_column] * 100  # Multiply by 100 to show as a percentage
data_to_show.hist(bins=2, ax=ax)  # Two bins: lower half and upper half of the observed range
ax.set_title(selected_column)
ax.set_xlabel('Percentage of Women')
ax.set_ylabel('Frequency')
plt.show()
We can see that there are almost 80 majors where the percentage of women is below 50%, and almost 100 majors where the percentage of women is above 50%.
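A two-bin histogram without an explicit `range` splits at the midpoint between the smallest and largest observed value, which is not necessarily exactly 50%. Counting with a boolean comparison avoids that ambiguity. This is a minimal sketch on a hypothetical `sample` DataFrame with invented values, so the counts below are not those of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
sample = pd.DataFrame({"ShareWomen": [0.05, 0.32, 0.48, 0.50, 0.67, 0.93]})

# Count majors on each side of the 50% mark directly
mostly_men = (sample["ShareWomen"] < 0.5).sum()
mostly_women = (sample["ShareWomen"] > 0.5).sum()
exactly_half = (sample["ShareWomen"] == 0.5).sum()

print("Majors with fewer than 50% women:", mostly_men)
print("Majors with more than 50% women:", mostly_women)
print("Majors with exactly 50% women:", exactly_half)
```

Applied to `recent_grads`, the same comparisons would give exact counts to set against the approximate bar heights read from the histogram.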
Let's now look for common median salaries by plotting a histogram of them.
recent_grads['Median'].hist(bins=20)
It looks like median salaries between 25K and 50K are the most common. Let's zoom in further on this range.
recent_grads['Median'].hist(bins=25, range=(25000,50000))
It looks like the most common median salaries are 35-36K and 40-41K.
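Reading the tallest bars off a histogram can be cross-checked by counting values per bin. The sketch below uses `pd.cut` with 1K-wide bins between 25K and 50K, mirroring the `bins=25, range=(25000, 50000)` settings of the histogram above. The `median_salaries` Series is hypothetical and its values are invented.

```python
import pandas as pd

# Hypothetical median salaries; all values are invented for illustration.
median_salaries = pd.Series([31000, 35000, 35500, 36000, 40000, 40500, 52000])

# Bin edges at 25000, 26000, ..., 50000 give 25 bins of width 1000
bin_edges = list(range(25000, 51000, 1000))

# right=False makes each bin include its left edge and exclude its right edge.
# Values outside the edges (here 52000) become NaN and are not counted.
binned = pd.cut(median_salaries, bins=bin_edges, right=False)

# Count values per bin, most frequent bins first
counts = binned.value_counts()
print(counts.head(3))
```

One difference from the histogram: with `right=False`, a value of exactly 50000 falls outside the last bin, whereas the histogram's last bin includes its right edge.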
Let's look further into the relations between (1) the total number of people with the major, (2) the median salary, and (3) the percentage of women by creating a scatter matrix for these three.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Total','Median', 'ShareWomen']], figsize = (12,12))
What we can see from these plots is that:
Now let's also create bar plots for the first 10 and the last 10 majors in the list to see what we can learn from them.
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')
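The slices `[:10]` and `[-10:]` rely on the rows already being ordered by `Rank`. A sketch that makes the ordering explicit and compares the two groups numerically could look as follows. The `sample` DataFrame, its values, and the group size `n` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads, deliberately out of rank order;
# all values are invented for illustration.
sample = pd.DataFrame({
    "Rank": [3, 1, 4, 2],
    "Major": ["C", "A", "D", "B"],
    "ShareWomen": [0.60, 0.10, 0.80, 0.20],
})

n = 2  # The notebook uses 10; 2 keeps this toy example small

# Sort by rank explicitly instead of relying on the existing row order
ranked = sample.sort_values("Rank")
top_n = ranked.head(n)
bottom_n = ranked.tail(n)

# Compare the average share of women in the two groups
top_mean = top_n["ShareWomen"].mean()
bottom_mean = bottom_n["ShareWomen"].mean()
print("Top majors:", list(top_n["Major"]), "mean share of women:", round(top_mean, 2))
print("Bottom majors:", list(bottom_n["Major"]), "mean share of women:", round(bottom_mean, 2))
```

The same pattern works for `Unemployment_rate` by swapping the column name.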
What we see:
Let's wrap up by sharing some of the observations that we made above:
Clearly, we've only been scratching the surface. The dataset contains a wealth of interesting data to explore. Possibly to be continued on another occasion!