Visualising Earnings based on College Majors

Matplotlib and the plotting functionality built into pandas are useful for descriptive analysis, which builds a basic understanding of the underlying data. In this project, I apply them to explore the job outcomes of students who graduated from college between 2010 and 2012. The data comes from the American Community Survey; a cleaned and aggregated subset is available at https://github.com/fivethirtyeight/data/tree/master/college-majors.

Throughout the analysis, I pose a question, visualise the relevant columns with a suitable plot, and then infer an answer from that plot.

Dataset Description

  • Rank - Rank by median earnings (the dataset is ordered by this column).
  • Major_code - Major code.
  • Major - Major description.
  • Major_category - Category of major.
  • Total - Total number of people with major.
  • Sample_size - Sample size (unweighted) of full-time.
  • Men - Male graduates.
  • Women - Female graduates.
  • ShareWomen - Women as share of total.
  • Employed - Number employed.
  • Median - Median salary of full-time, year-round workers.
  • Low_wage_jobs - Number in low-wage service jobs.
  • Full_time - Number employed 35 hours or more.
  • Part_time - Number employed less than 35 hours.

As a first step, I import the required libraries and set up inline plotting. I then inspect the structure of the dataset by printing a few rows and calling describe() on it.

In [1]:
# Importing relevant libraries
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
# Running Jupyter magic to display plots inline
%matplotlib inline
In [3]:
# Reading the CSV file into a pandas DataFrame
recent_grads = pd.read_csv('recent-grads.csv')
In [4]:
# Displaying the column names and values of the first row of the dataframe
recent_grads.iloc[0]
Out[4]:
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
In [23]:
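# Note: this cell was run after the cleaning cells below, so the output has
# 172 rows, lower-case column names and the derived share_full_time column
print(recent_grads.describe())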
recent_grads[:5]
             rank   major_code          total            men          women  \
count  172.000000   172.000000     172.000000     172.000000     172.000000   
mean    87.377907  3895.953488   39370.081395   16723.406977   22646.674419   
std     49.983181  1679.240095   63483.491009   28122.433474   41057.330740   
min      1.000000  1100.000000     124.000000     119.000000       0.000000   
25%     44.750000  2403.750000    4549.750000    2177.500000    1778.250000   
50%     87.500000  3608.500000   15104.000000    5434.000000    8386.500000   
75%    130.250000  5503.250000   38909.750000   14631.000000   22553.750000   
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000   

       sharewomen  sample_size      employed      full_time      part_time  \
count  172.000000   172.000000     172.00000     172.000000     172.000000   
mean     0.522223   357.941860   31355.80814   26165.767442    8877.232558   
std      0.231205   619.680419   50777.42865   42957.122320   14679.038729   
min      0.000000     2.000000       0.00000     111.000000       0.000000   
25%      0.336026    42.000000    3734.75000    3181.000000    1013.750000   
50%      0.534024   131.000000   12031.50000   10073.500000    3332.500000   
75%      0.703299   339.000000   31701.25000   25447.250000    9981.000000   
max      0.968954  4212.000000  307933.00000  251540.000000  115172.000000   

       full_time_year_round    unemployed  unemployment_rate         median  \
count            172.000000    172.000000         172.000000     172.000000   
mean           19798.843023   2428.412791           0.068024   40076.744186   
std            33229.227514   4121.730452           0.030340   11461.388773   
min              111.000000      0.000000           0.000000   22000.000000   
25%             2474.750000    299.500000           0.050261   33000.000000   
50%             7436.500000    905.000000           0.067544   36000.000000   
75%            17674.750000   2397.000000           0.087247   45000.000000   
max           199897.000000  28169.000000           0.177226  110000.000000   

              p25th          p75th   college_jobs  non_college_jobs  \
count    172.000000     172.000000     172.000000        172.000000   
mean   29486.918605   51386.627907   12387.401163      13354.325581   
std     9190.769927   14882.278650   21344.967522      23841.326605   
min    18500.000000   22000.000000       0.000000          0.000000   
25%    24000.000000   41750.000000    1744.750000       1594.000000   
50%    27000.000000   47000.000000    4467.500000       4603.500000   
75%    33250.000000   58500.000000   14595.750000      11791.750000   
max    95000.000000  125000.000000  151643.000000     148395.000000   

       low_wage_jobs  share_full_time  
count     172.000000       172.000000  
mean     3878.633721         0.666427  
std      6960.467621         0.102083  
min         0.000000         0.372872  
25%       336.750000         0.597190  
50%      1238.500000         0.673859  
75%      3496.000000         0.734996  
max     48207.000000         0.958949  
Out[23]:
   rank  major_code                                      major    total      men    women  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0   2057.0    282.0   
1     2        2416             MINING AND MINERAL ENGINEERING    756.0    679.0     77.0   
2     3        2415                  METALLURGICAL ENGINEERING    856.0    725.0    131.0   
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0   1123.0    135.0   
4     5        2405                       CHEMICAL ENGINEERING  32260.0  21239.0  11021.0   

   major_category  sharewomen  sample_size  employed  ...  \
0     Engineering    0.120564           36      1976  ...   
1     Engineering    0.101852            7       640  ...   
2     Engineering    0.153037            3       648  ...   
3     Engineering    0.107313           16       758  ...   
4     Engineering    0.341631          289     25694  ...   

   full_time_year_round  unemployed  unemployment_rate  median  p25th   p75th  \
0                  1207          37           0.018381  110000  95000  125000   
1                   388          85           0.117241   75000  55000   90000   
2                   340          16           0.024096   73000  50000  105000   
3                   692          40           0.050125   70000  43000   80000   
4                 16697        1672           0.061098   65000  50000   75000   

   college_jobs  non_college_jobs  low_wage_jobs  share_full_time  
0          1534               364            193         0.790509  
1           350               257             50         0.735450  
2           456               176              0         0.651869  
3           529               102              0         0.849762  
4         18314              4440            972         0.718227  

5 rows × 22 columns

In [6]:
# Dropping rows with null values from our data set
print(len(recent_grads))
recent_grads = recent_grads.dropna()
print(len(recent_grads))
173
172
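
Before dropping rows, it can help to check which rows actually hold the missing values. The following is a minimal sketch on a small made-up frame; the name toy and all of its values are hypothetical and are not taken from recent-grads.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for recent_grads; all values are invented.
toy = pd.DataFrame({
    "major": ["A", "B", "C"],
    "total": [100.0, np.nan, 300.0],
    "median": [50000, 40000, 30000],
})

# Count the missing values in each column.
null_counts = toy.isna().sum()

# Select the rows that contain at least one missing value.
rows_with_nulls = toy[toy.isna().any(axis=1)]

# dropna() removes exactly those rows.
cleaned = toy.dropna()

print(null_counts)
print(rows_with_nulls)
print(len(toy), len(cleaned))
```

Looking at rows_with_nulls first makes it clear what is lost by the drop, which matters when the removed row is a whole major.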
In [7]:
# Converting column names to lower case (because I don't like upper-case in my code)
recent_grads.columns = recent_grads.columns.str.lower()
recent_grads.columns
Out[7]:
Index(['rank', 'major_code', 'major', 'total', 'men', 'women',
       'major_category', 'sharewomen', 'sample_size', 'employed', 'full_time',
       'part_time', 'full_time_year_round', 'unemployed', 'unemployment_rate',
       'median', 'p25th', 'p75th', 'college_jobs', 'non_college_jobs',
       'low_wage_jobs'],
      dtype='object')

1. Searching for Correlations between our Column Variables

With no leads to start from, I draw scatterplots for pairs of variables that I suspect are correlated. I start with:

  1. sample_size and median
  2. sample_size and unemployment_rate
  3. full_time and median
  4. sharewomen and unemployment_rate
  5. men and median
  6. women and median
In [8]:
recent_grads.plot('sample_size','median', kind = 'scatter') #--> 1.1
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e7f0a1c8>
In [9]:
recent_grads.plot('sample_size','unemployment_rate', kind = 'scatter')# --> 1.2
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e86673c8>
In [10]:
recent_grads.plot('full_time','median', kind = 'scatter') # --> 1.3
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e86da048>
In [11]:
recent_grads.plot('sharewomen','unemployment_rate', kind = 'scatter') # --> 1.4
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e87443c8>
In [12]:
recent_grads.plot('men','median', kind = 'scatter') # --> 1.5
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e87b3c88>
In [13]:
recent_grads.plot('women','median', kind = 'scatter') # --> 1.6
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e87fba48>

1.1 Does a major with more students correlate with a higher median salary?

No. The scatterplot shows a slight negative correlation between the total number of students in a major and its median salary. Some of the highest median salaries also belong to medium-sized majors.

In [14]:
recent_grads.plot('median','total',kind = 'scatter', figsize = (7,5))
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e8871d08>
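
A scatterplot only suggests the direction of a relationship; a correlation coefficient puts a number on it. The sketch below is not part of the original analysis and uses invented values in a hypothetical frame called toy. It computes the Pearson coefficient with Series.corr, and a Spearman coefficient as the Pearson correlation of the ranks, which is less affected by a few very large majors.

```python
import pandas as pd

# Hypothetical toy data; the values are invented, not read from recent-grads.csv.
toy = pd.DataFrame({
    "total": [1000, 5000, 20000, 80000, 300000],
    "median": [70000, 52000, 45000, 38000, 36000],
})

# Pearson correlation measures linear association between the raw values.
pearson = toy["total"].corr(toy["median"])

# Spearman correlation is the Pearson correlation of the ranks, so it only
# depends on the ordering of the values, not on their magnitude.
spearman = toy["total"].rank().corr(toy["median"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

On the real data the same two lines could be run as recent_grads['total'].corr(recent_grads['median']); a value close to zero would match the "slight" relationship seen in the plot.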

1.2 Do Full Time Employees make more $$$?

Do majors with a higher percentage of full-time employed students have a higher median salary?

Yes, majors with a higher percentage of full-time employed students tend to have higher median salaries overall.

In [15]:
recent_grads['share_full_time'] = recent_grads['full_time']/recent_grads['total']
recent_grads.plot('share_full_time','median',kind = 'scatter')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e89375c8>

1.3 Do women make more $$$?

Do majors with a higher share of women have a higher median salary overall?

No, majors with a higher share of women tend to have a lower median salary.

In [16]:
recent_grads.plot('sharewomen','median',kind = 'scatter')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1e9156348>
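
One way to check the trend read off the scatterplot is to split sharewomen into bands and compare the typical salary in each band. This is a sketch under assumptions: the frame toy and its values are invented, and the four equal-width bands are an arbitrary choice, not something the original analysis uses.

```python
import pandas as pd

# Hypothetical toy data; the values are invented, not read from recent-grads.csv.
toy = pd.DataFrame({
    "sharewomen": [0.10, 0.20, 0.45, 0.55, 0.80, 0.95],
    "median": [75000, 60000, 45000, 40000, 33000, 30000],
})

# Split the 0-1 share into four equal-width bands.
# include_lowest=True keeps a share of exactly 0.0 in the first band.
bands = pd.cut(
    toy["sharewomen"], bins=[0, 0.25, 0.5, 0.75, 1.0], include_lowest=True
)

# Median of the per-major median salaries within each band.
salary_by_band = toy.groupby(bands, observed=True)["median"].median()

print(salary_by_band)
```

If the values fall steadily from the first band to the last, that supports the negative relationship without relying on the eye alone.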

1.4 Are majors predominantly Male or Female?

Do majors mostly consist of males or females?

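Females, but by a small margin. In the histogram below, the bars in the 0.5 to 1.0 range of sharewomen (female-majority majors) are visibly taller than those below 0.5, which is consistent with the median sharewomen of 0.534 in the describe() output above.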

In [17]:
recent_grads['sharewomen'].hist(bins = 20)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1ea18dc08>
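
The histogram answer can be cross-checked by counting majors on each side of 0.5 directly. The sketch below uses an invented Series named toy in place of recent_grads['sharewomen'].

```python
import pandas as pd

# Hypothetical shares of women per major; the values are invented.
toy = pd.Series([0.12, 0.34, 0.48, 0.53, 0.61, 0.70, 0.97], name="sharewomen")

# A boolean comparison gives True/False per major; sum() counts the Trues.
female_majority = (toy > 0.5).sum()
male_majority = (toy < 0.5).sum()

# mean() of a boolean Series is the fraction of True values.
share_female_majority = (toy > 0.5).mean()

print(female_majority, male_majority)
print(f"{share_female_majority:.2f}")
```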

1.5 What median salary range is Most Common?

The 30,000 to 40,000 range is the most common median salary range among the majors, as the histogram below shows.

In [18]:
recent_grads['median'].hist(bins = 10, range = (0,100000))
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1ea1c4608>
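
The most common bin can also be found numerically instead of being read off the plot. This sketch uses numpy.histogram with the same bins and range arguments as the cell above, applied to an invented Series named toy. Note that values outside the range are not counted, so a median of 110,000 would be left out of a histogram limited to (0, 100000).

```python
import numpy as np
import pandas as pd

# Hypothetical median salaries; the values are invented.
toy = pd.Series([22000, 31000, 33000, 36000, 38000, 45000, 62000, 110000])

# Same binning as the histogram cell: 10 equal bins between 0 and 100000.
# The value 110000 lies outside the range and is ignored.
counts, edges = np.histogram(toy, bins=10, range=(0, 100000))

# Index of the fullest bin, and its lower and upper edges.
top = counts.argmax()
low, high = edges[top], edges[top + 1]

print(counts)
print(low, high)
```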


1.6 Which major_category has the most (& least) male (& female) students on average?

Using the bar plot below, we can see that:

  • The Business major category has the highest average number of male students per major.
  • The Communication & Journalism major category has the highest average number of female students per major.
In [19]:
from numpy import arange

# Major categories in order of first appearance
categories = recent_grads['major_category'].unique()
avg_of_totals_men = []
avg_of_totals_women = []

# Average number of men and women per major within each category
for category in categories:
    avg_of_totals_men.append(recent_grads.loc[recent_grads['major_category']==category, 'men'].mean())
    avg_of_totals_women.append(recent_grads.loc[recent_grads['major_category']==category, 'women'].mean())

fig, ax = plt.subplots(figsize = (16,6))

# One bar position per category (instead of a hard-coded 16)
positions = arange(len(categories))
ax.bar(positions-0.2, avg_of_totals_men, 0.4, label = 'Men')
ax.bar(positions+0.2, avg_of_totals_women, 0.4, label = 'Women')
ax.set_xticks(positions)
ax.set_xticklabels(categories)
plt.xticks(rotation = 90)
plt.legend()
Out[19]:
<matplotlib.legend.Legend at 0xf1ea380948>
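
The per-category averages computed in the loop above can also be obtained with a single groupby call. This is an alternative sketch, not the method used in the notebook; the frame toy, its category names and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the categories and counts are invented.
toy = pd.DataFrame({
    "major_category": ["Business", "Business", "Arts", "Arts", "Engineering"],
    "men": [90000, 30000, 5000, 7000, 20000],
    "women": [70000, 40000, 60000, 70000, 4000],
})

# Mean number of men and women per major within each category.
# groupby sorts the categories alphabetically, unlike unique().
avg_by_category = toy.groupby("major_category")[["men", "women"]].mean()

# idxmax() returns the index label (the category) with the largest value.
top_men = avg_by_category["men"].idxmax()
top_women = avg_by_category["women"].idxmax()

print(avg_by_category)
print(top_men, top_women)
```

Calling avg_by_category.plot.bar() on the result would draw a grouped bar chart similar to the one above, with one pair of bars per category.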
In [20]:
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['sample_size','median']], figsize = (10,8))
Out[20]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA226D88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA5E7508>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA3DDAC8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA416BC8>]],
      dtype=object)
In [21]:
scatter_matrix(recent_grads[['sample_size','median', 'unemployment_rate']], figsize = (10,8))
Out[21]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA464488>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA55E708>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA594888>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA80D948>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA845A48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA87EB88>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA8B7C08>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA8F0D08>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000F1EA8FA908>]],
      dtype=object)

Bonus: HexBin plot

Hexbin plots can be particularly useful in place of dense scatterplots. Here, I redraw the scatterplot from section 1.2 as a hexbin plot, which leads to the same conclusion.

In [22]:
recent_grads.plot.hexbin(x = 'share_full_time',y = 'median', gridsize = 15, cmap='inferno')
plt.xlim(0.4,0.95)
plt.ylim(20000,80000)
plt.xlabel('share_full_time')
Out[22]:
Text(0.5, 0, 'share_full_time')
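
A hexbin plot colours each cell by how many points fall inside it. The counting step can be illustrated with numpy.histogram2d, which uses rectangular cells rather than hexagons, so this is an analogy for the idea and not what the hexbin call computes. The arrays below are invented; the axis limits are the ones set with xlim and ylim in the cell above, and the 3x3 grid is an arbitrary choice.

```python
import numpy as np

# Hypothetical (share_full_time, median) pairs; the values are invented.
share = np.array([0.55, 0.57, 0.62, 0.65, 0.66, 0.70, 0.72, 0.85])
median = np.array([30000, 32000, 35000, 36000, 34000, 45000, 41000, 70000])

# Count points in a rectangular 3x3 grid over the plotted axis limits.
# counts[i, j] is the number of points in x-bin i and y-bin j.
counts, x_edges, y_edges = np.histogram2d(
    share, median, bins=3, range=[[0.4, 0.95], [20000, 80000]]
)

# Locate the densest cell.
densest = np.unravel_index(counts.argmax(), counts.shape)

print(counts)
print(densest)
```

Where many majors overlap in a scatterplot, the cell counts make the dense region explicit, which is the same information the hexbin colour scale carries.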

Conclusion

  1. Less popular majors tend to have a higher median salary
  2. Majors with a higher share of full-time employed students tend to have higher median salaries
  3. Majors with a higher share of women tend to have lower median salaries
  4. Slightly more majors are female-majority than male-majority
  5. 30k to 40k is the most common median salary bracket across majors
  6. The Communication & Journalism category has the most female students per major on average, while the Business category has the most male students

-Author : Raghav_A