Visualizing Earnings Based on College Majors

In this study, we will analyze data related to college majors.

We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.

There is a wealth of information in this dataset. We'll explore a couple of things via a variety of visualizations (scatter plots, histograms, scatter matrices, bar charts) aiming to answer questions such as:

  • Do students in more popular majors make more money?
  • How many majors are predominantly male? Predominantly female?
  • Which category of majors has the most students?

Preparations and initial data exploration

Let's start with some preparations to enable data visualization in this notebook.

In [32]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Enable display of plots inline
%matplotlib inline

Let's read in the data and do some initial exploration.

In [33]:
# Read in the data
recent_grads = pd.read_csv('recent-grads.csv')

# Show count of rows and columns
print ('Row count, column count:', recent_grads.shape)

# Show the first row (in table format)
print ('\n')
print (recent_grads.iloc[0])

# Show the first three and the last three rows
print ('\n')
print (recent_grads.head(3))
print ('\n')
print (recent_grads.tail(3))

# Show key statistics of all (numeric) columns
print ('\n')
print (recent_grads.describe())
Row count, column count: (173, 21)


Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object


   Rank  Major_code                           Major   Total     Men  Women  \
0     1        2419           PETROLEUM ENGINEERING  2339.0  2057.0  282.0   
1     2        2416  MINING AND MINERAL ENGINEERING   756.0   679.0   77.0   
2     3        2415       METALLURGICAL ENGINEERING   856.0   725.0  131.0   

  Major_category  ShareWomen  Sample_size  Employed  ...  Part_time  \
0    Engineering    0.120564           36      1976  ...        270   
1    Engineering    0.101852            7       640  ...        170   
2    Engineering    0.153037            3       648  ...        133   

   Full_time_year_round  Unemployed  Unemployment_rate  Median  P25th   P75th  \
0                  1207          37           0.018381  110000  95000  125000   
1                   388          85           0.117241   75000  55000   90000   
2                   340          16           0.024096   73000  50000  105000   

   College_jobs  Non_college_jobs  Low_wage_jobs  
0          1534               364            193  
1           350               257             50  
2           456               176              0  

[3 rows x 21 columns]


     Rank  Major_code                  Major   Total    Men   Women  \
170   171        5202    CLINICAL PSYCHOLOGY  2838.0  568.0  2270.0   
171   172        5203  COUNSELING PSYCHOLOGY  4626.0  931.0  3695.0   
172   173        3501        LIBRARY SCIENCE  1098.0  134.0   964.0   

               Major_category  ShareWomen  Sample_size  Employed  ...  \
170  Psychology & Social Work    0.799859           13      2101  ...   
171  Psychology & Social Work    0.798746           21      3777  ...   
172                 Education    0.877960            2       742  ...   

     Part_time  Full_time_year_round  Unemployed  Unemployment_rate  Median  \
170        648                  1293         368           0.149048   25000   
171        965                  2738         214           0.053621   23400   
172        237                   410          87           0.104946   22000   

     P25th  P75th  College_jobs  Non_college_jobs  Low_wage_jobs  
170  25000  40000           986               870            622  
171  19200  26000          2403              1245            308  
172  20000  22000           288               338            192  

[3 rows x 21 columns]


             Rank   Major_code          Total            Men          Women  \
count  173.000000   173.000000     172.000000     172.000000     172.000000   
mean    87.000000  3879.815029   39370.081395   16723.406977   22646.674419   
std     50.084928  1687.753140   63483.491009   28122.433474   41057.330740   
min      1.000000  1100.000000     124.000000     119.000000       0.000000   
25%     44.000000  2403.000000    4549.750000    2177.500000    1778.250000   
50%     87.000000  3608.000000   15104.000000    5434.000000    8386.500000   
75%    130.000000  5503.000000   38909.750000   14631.000000   22553.750000   
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000   

       ShareWomen  Sample_size       Employed      Full_time      Part_time  \
count  172.000000   173.000000     173.000000     173.000000     173.000000   
mean     0.522223   356.080925   31192.763006   26029.306358    8832.398844   
std      0.231205   618.361022   50675.002241   42869.655092   14648.179473   
min      0.000000     2.000000       0.000000     111.000000       0.000000   
25%      0.336026    39.000000    3608.000000    3154.000000    1030.000000   
50%      0.534024   130.000000   11797.000000   10048.000000    3299.000000   
75%      0.703299   338.000000   31433.000000   25147.000000    9948.000000   
max      0.968954  4212.000000  307933.000000  251540.000000  115172.000000   

       Full_time_year_round    Unemployed  Unemployment_rate         Median  \
count            173.000000    173.000000         173.000000     173.000000   
mean           19694.427746   2416.329480           0.068191   40151.445087   
std            33160.941514   4112.803148           0.030331   11470.181802   
min              111.000000      0.000000           0.000000   22000.000000   
25%             2453.000000    304.000000           0.050306   33000.000000   
50%             7413.000000    893.000000           0.067961   36000.000000   
75%            16891.000000   2393.000000           0.087557   45000.000000   
max           199897.000000  28169.000000           0.177226  110000.000000   

              P25th          P75th   College_jobs  Non_college_jobs  \
count    173.000000     173.000000     173.000000        173.000000   
mean   29501.445087   51494.219653   12322.635838      13284.497110   
std     9166.005235   14906.279740   21299.868863      23789.655363   
min    18500.000000   22000.000000       0.000000          0.000000   
25%    24000.000000   42000.000000    1675.000000       1591.000000   
50%    27000.000000   47000.000000    4390.000000       4595.000000   
75%    33000.000000   60000.000000   14444.000000      11783.000000   
max    95000.000000  125000.000000  151643.000000     148395.000000   

       Low_wage_jobs  
count     173.000000  
mean     3859.017341  
std      6944.998579  
min         0.000000  
25%       340.000000  
50%      1231.000000  
75%      3466.000000  
max     48207.000000  

Rows with missing values can cause problems when plotting with matplotlib, so let's drop them (and check how much data is removed).

In [34]:
# Number of rows (before)
raw_data_count = recent_grads.shape[0]
print ('Number of rows in the raw data: ', raw_data_count)

# Drop rows with missing values
recent_grads.dropna(inplace = True)

# Number of rows (after)
cleaned_data_count = recent_grads.shape[0]
print ('Number of rows after removing rows with missing values: ', cleaned_data_count)
Number of rows in the raw data:  173
Number of rows after removing rows with missing values:  172

One row deleted, 172 rows remaining. We are now ready to explore this data using visualizations.
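
Before dropping rows, it can be worth checking which rows contain missing values and in which columns. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and all values are hypothetical, not taken from the dataset); the same calls should work on recent_grads when run before `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; values are made up for illustration
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Total': [1000.0, np.nan, 3000.0],
    'Median': [50000, 40000, 30000],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)

# Drop those rows into a new frame, leaving the original untouched
cleaned = toy.dropna()
print(len(toy), '->', len(cleaned))
```

Returning a new frame instead of using `inplace=True` keeps the raw data available for comparison.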

Visualizations

To understand the various plots below, take note of what the different columns in the data represent:

Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.

We'll start by creating some scatter plots to see if we can find answers to these questions:

  • Do students in more popular majors make more money?
  • Do students who majored in subjects that were majority female make more money?
  • Is there any link between the number of full-time employees and median salary?
In [35]:
# Scatter plot showing median salary vs total number of people with the major
recent_grads.plot(x='Total', y='Median', kind='scatter')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4e3260b8>
In [36]:
# Scatter plot showing the sample size for median salary figures vs total number of people with the major
recent_grads.plot(x='Total', y='Sample_size', kind='scatter')
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4e3b51d0>
In [37]:
# Scatter plot showing median salary vs the percentage of women with the major
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4e4e57b8>
In [38]:
# Scatter plot showing the number of full-time employed vs the median salary
recent_grads.plot(x='Median', y='Full_time', kind='scatter')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4e180240>
In [39]:
# Scatter plot showing the number of part-time employed vs the median salary
recent_grads.plot(x='Median', y='Part_time', kind='scatter')
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4ced2d68>

Conclusions:

  • Popularity vs (median) salary. The first plot suggests that more popular majors typically have a somewhat lower median salary. Put differently: the majors with the highest median salaries are not those with the highest number of graduates. Given the second plot, it must be noted, though, that the sample sizes used for deriving the salary information are very small compared to the total number of graduates, so we cannot be sure how representative these samples are.
  • Gender vs (median) salary. The third plot suggests that majors with a higher percentage of women tend to have lower median salaries.
  • Full-time employment vs (median) salary. Comparing the shapes of the fourth and fifth plots, one might conclude that full-time employment yields higher (median) salaries. The evidence is not strong, however, and this would require further investigation.
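
To put a number on how small the earnings samples are relative to the number of graduates, one could compute the sample size as a share of the total. This is only a sketch on made-up values (the frame `toy` and its numbers are hypothetical); on the real data the same expression would use recent_grads['Sample_size'] and recent_grads['Total'].

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    'Total': [2000.0, 10000.0, 50000.0],
    'Sample_size': [20, 150, 400],
})

# Share of graduates covered by the earnings sample, in percent
sample_share_pct = toy['Sample_size'] / toy['Total'] * 100
print(sample_share_pct.round(2))

# Summary statistics of that share across majors
print(sample_share_pct.describe())
```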

Next, we'll create some histograms to answer these questions:

  • What percent of majors are predominantly male? Predominantly female?
  • What's the most common median salary range?

Let's create a histogram to show how the percentage of women is distributed. Conclusions are written directly below each graph.

In [40]:
fig, ax = plt.subplots()
selected_column = 'ShareWomen'
data_to_show = recent_grads[selected_column] * 100  # Multiply by 100 to show as a percentage
data_to_show.hist(bins=20, ax=ax)  # Draw on the axes created above
ax.set_title(selected_column)
ax.set_xlabel('Percentage of Women')
ax.set_ylabel('Frequency')
plt.show()

What we can see is that there are majors where the percentage of women is (close to) 0%, majors where it is (almost) 100%, and everything in between. To answer our question, we can (albeit somewhat clunkily) create the same histogram with just two bins.

In [41]:
fig, ax = plt.subplots()
selected_column = 'ShareWomen'
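data_to_show = recent_grads[selected_column] * 100  # Multiply by 100 to show as a percentage
# Fix the range to 0-100 so the two bins split exactly at 50%
# (without it, the bins would span only the observed min and max)
data_to_show.hist(bins=2, range=(0, 100), ax=ax)
ax.set_title(selected_column)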
ax.set_xlabel('Percentage of Women')
ax.set_ylabel('Frequency')
plt.show()

We can see that there are almost 80 majors where the percentage of women is below 50%, and almost 100 majors where the percentage of women is above 50%.
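
Reading counts off a two-bin histogram is approximate; the same question can also be answered by counting directly. The sketch below uses a short made-up Series (the values are hypothetical); on the real data, `share_women` would be recent_grads['ShareWomen'].

```python
import pandas as pd

# Hypothetical ShareWomen values for illustration only
share_women = pd.Series([0.12, 0.35, 0.50, 0.62, 0.88])

# Boolean masks: True where the condition holds
mostly_male = share_women < 0.5
mostly_female = share_women > 0.5

# Summing a boolean Series counts the True values;
# taking the mean gives the fraction of True values
print('Predominantly male:', mostly_male.sum(), f'({mostly_male.mean():.0%})')
print('Predominantly female:', mostly_female.sum(), f'({mostly_female.mean():.0%})')
```

Note that a major with a share of exactly 50% falls in neither group here.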

Let's now look for the most common median salaries by plotting a histogram of the median salaries.

In [42]:
recent_grads['Median'].hist(bins=20)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4e71ea58>

It looks like median salaries between 25K and 50K are the most common. Let's zoom in further on this part.

In [43]:
recent_grads['Median'].hist(bins=25, range=(25000,50000))
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4e7ccc18>

The most common (median) salaries appear to be 35-36K and 40-41K.
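
The histogram bins can also be tabulated, which makes it easier to name the most common range exactly. The following is a sketch using `pd.cut` on made-up salaries (the values and the 5,000 bin width are assumptions chosen for illustration); on the real data, `median` would be recent_grads['Median'].

```python
import pandas as pd

# Hypothetical median salaries for illustration only
median = pd.Series([22000, 31000, 35000, 35500, 36000, 40000, 40500, 62000])

# Bin edges every 5,000 from 20,000 to 65,000 (bin width is an arbitrary choice)
edges = list(range(20000, 70000, 5000))

# Assign each salary to a left-closed bin [a, b), then count salaries per bin
binned = pd.cut(median, bins=edges, right=False)
counts = binned.value_counts().sort_index()
print(counts)

# The bin that holds the most values
most_common = counts.idxmax()
print(most_common)
```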

Let's look further into the relations between (1) the total number of graduates per major, (2) the median salary, and (3) the percentage of women, by creating a scatter matrix for these three.

In [44]:
from pandas.plotting import scatter_matrix 
scatter_matrix(recent_grads[['Total','Median', 'ShareWomen']], figsize = (12,12))
Out[44]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E86D710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E8A1B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E8DF0F0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E90E668>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E93FC18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E97F208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E9AC7B8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E9E0DA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AA4E9E0DD8>]],
      dtype=object)

What we can see from these plots is that:

  • majors that are more popular tend to have a somewhat lower median salary
  • majors that are more popular tend to attract a higher percentage of women

This is in line with what we observed earlier.
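
A correlation matrix is a numeric companion to a scatter matrix. The sketch below computes pairwise Pearson correlations on a made-up frame (the frame `toy` and its values are hypothetical); on the real data one could call `.corr()` on recent_grads[['Total', 'Median', 'ShareWomen']]. Pearson correlation only captures linear relations, so it should be read alongside the plots rather than instead of them.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    'Total': [1000, 5000, 20000, 80000],
    'Median': [70000, 50000, 40000, 35000],
    'ShareWomen': [0.15, 0.40, 0.60, 0.75],
})

# Pairwise Pearson correlation; values near -1 or 1 suggest a strong linear relation
corr = toy.corr()
print(corr.round(2))
```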

Now let's also create bar plots for the first 10 and the last 10 majors in the list (that is, the 10 highest-ranked and the 10 lowest-ranked by median earnings) to see what we can learn from that.

In [45]:
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4eb9d550>
In [46]:
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aa4e254710>

What we see:

  • 'Engineering' majors (well represented in the first ten) attract low percentages of women; unemployment rates seem generally low (with the exception of Nuclear Engineering)
  • Majors like psychology and language- or education-oriented topics attract higher percentages of women; unemployment rates are somewhat higher
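
The bar plots above look at individual majors. For the question from the introduction about which category of majors has the most students, one possible approach is to aggregate by category. This is only a sketch on a made-up frame (the frame `toy` and its values are hypothetical); on the real data the same `groupby` would be applied to recent_grads.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Business', 'Education'],
    'Total': [2000.0, 3000.0, 90000.0, 40000.0],
})

# Sum the number of graduates per category, largest first
totals_by_category = (
    toy.groupby('Major_category')['Total']
    .sum()
    .sort_values(ascending=False)
)
print(totals_by_category)
```

Calling `.plot.bar()` on the resulting Series would show the same totals as a bar chart.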

Wrapping up

Let's wrap up by sharing some of the observations that we made above:

  • The most popular majors are not those that result in the highest (median) salaries.
  • Majors with a higher percentage of women (and predominantly female majors are actually in the majority) tend to have lower (median) salaries.

Clearly, we've only been scratching the surface. The dataset contains a wealth of interesting data to explore. Possibly to be continued on another occasion!