Degree...Yeah!, Job...Maybe?

A job is meant to be a given, especially after spending time, effort and, in most cases, a lot of money on earning a degree. For most graduates it not only validates the effort they have put in but also confirms their place in society as a contributing member.

For many, however, this is far from reality. The sad truth is that many industries are not looking for just that certificate. While it does validate that a graduate is familiar with the subject matter, it does not guarantee his or her skill in managing the various scenarios that may arise during the course of employment.

Some graduates keep those skills up to date even while they are studying, through part-time jobs or self-experimentation. Others choose to improve their skills after graduation. Depending on the course undertaken and the kind of skill required for employability, some graduates go on to take up full-time or part-time jobs in their own field of study, others take up jobs in an area unrelated to what they have studied, and still others are left without employment.

The American Community Survey conducts surveys and aggregates data from them. It surveyed students who graduated from college between 2010 and 2012 to understand how they fared with employment after graduation.

The Data

The data is a list of college majors ranked by median earnings. Details of the majors themselves include the total number of students that graduated, the classification of the major and the median salary earned by full-time employees that graduated. The remaining part of the data pertains to the students that undertook the major, including such details as the gender breakdown and the number of full-time, part-time and unemployed graduates.

The columns are defined below.

  • Rank - Rank by median earnings
  • Major_code - Major code
  • Major - Major description
  • Major_category - Category of major e.g. Engineering, Law
  • Total - Total number of people with major
  • Sample_size - Sample size (unweighted) of full-time employees, year-round ONLY (used for earnings)
  • Men - Male graduates
  • Women - Female graduates
  • ShareWomen - Women as share of total
  • Employed - Number employed
  • Full_time - Employed 35 hours or more
  • Part_time - Employed less than 35 hours
  • Full_time_year_round - Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
  • Unemployed - Number unemployed (ESR == 3)
  • Unemployment_rate - Unemployed / (Unemployed + Employed)
  • Median - Median earnings of full-time, year-round workers
  • P25th - 25th percentile of earnings
  • P75th - 75th percentile of earnings
  • College_jobs - Number with job requiring a college degree
  • Non_college_jobs - Number with job not requiring a college degree
  • Low_wage_jobs - Number in low-wage service jobs
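
Two of the columns above are derived from others, which allows a quick consistency check. The sketch below recomputes them in plain Python from the PETROLEUM ENGINEERING figures shown in the first row of the dataset printed further down. The helper names share_women and unemployment_rate are illustrative and not part of the dataset or the notebook.

```python
# Figures copied from the first row of recent-grads.csv (PETROLEUM ENGINEERING).
total, women = 2339, 282
employed, unemployed = 1976, 37


def share_women(women, total):
    """Women as a share of all graduates in the major (hypothetical helper)."""
    return women / total


def unemployment_rate(unemployed, employed):
    """Unemployed / (Unemployed + Employed), as defined in the column list."""
    return unemployed / (unemployed + employed)


print(round(share_women(women, total), 6))                # 0.120564
print(round(unemployment_rate(unemployed, employed), 6))  # 0.018381
```

Both results agree with the ShareWomen and Unemployment_rate values stored for that row.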

Extracting the Data

In the following cells, the data will be extracted from a cleaned version of the recent-grads.csv file containing the data necessary for analysis.

In [133]:
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
%matplotlib inline
In [110]:
# Function name: print_full(a_list)
# Input: A pandas Series or DataFrame (plain Python lists already print in full)
# Output: The full contents, printed without truncation
# Description: Jupyter notebooks by default do not display all the data for large pandas objects. They show a few rows and
# summarize the rest using an ellipsis. This function prints the full object

def print_full(a_list):
    pd.set_option('display.max_rows', len(a_list))
    print(a_list)
    pd.reset_option('display.max_rows')
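
As a possible alternative to setting and then resetting the option by hand, pandas provides a context manager, pd.option_context, that restores the previous value automatically. The sketch below is only an illustration: format_full is a hypothetical name, and it returns the text instead of printing it so the result is easy to inspect.

```python
import pandas as pd


def format_full(obj):
    """Return the untruncated text form of a pandas Series or DataFrame."""
    # option_context restores the previous setting on exit, even if an
    # exception is raised while the text is being built.
    with pd.option_context("display.max_rows", None):
        return repr(obj)


# A made-up Series long enough to be truncated under the default settings.
long_series = pd.Series(range(100))
print(format_full(long_series))
```

Only the row limit is lifted here; a wide DataFrame would also need display.max_columns adjusted in the same way.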
In [111]:
# Function name: column_details(the_dataframe, a_column)
# Input: A pandas DataFrame and the name of one of its columns
# Output: Details related to the column
# Description: This function prints summary information about a column of a pandas DataFrame.
# It handles only string and numerical columns

def column_details(the_dataframe, a_column):
    print("Number of unique values in the {0} column:".format(a_column), the_dataframe[a_column].unique().shape[0])
    print("---------------------")
    
    if the_dataframe[a_column].dtype == "object":
        print("Statistical data related to the {0} column:\n".format(a_column),the_dataframe[a_column].describe())
    else:
        print("Statistical data related to the {0} column:\n".format(a_column),the_dataframe[a_column].describe().map('{:,.2f}'.format))
    print("---------------------")
    
    print("Unique counts for each entry in the {0} column:\n".format(a_column))
    print(the_dataframe[a_column].value_counts().sort_index(ascending=True))
In [112]:
recent_grads = pd.read_csv("recent-grads.csv")
recent_grads.iloc[0]
Out[112]:
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
In [113]:
recent_grads.head(5)
Out[113]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING 2339.0 2057.0 282.0 Engineering 0.120564 36 1976 ... 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING 756.0 679.0 77.0 Engineering 0.101852 7 640 ... 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING 856.0 725.0 131.0 Engineering 0.153037 3 648 ... 133 340 16 0.024096 73000 50000 105000 456 176 0
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING 1258.0 1123.0 135.0 Engineering 0.107313 16 758 ... 150 692 40 0.050125 70000 43000 80000 529 102 0
4 5 2405 CHEMICAL ENGINEERING 32260.0 21239.0 11021.0 Engineering 0.341631 289 25694 ... 5180 16697 1672 0.061098 65000 50000 75000 18314 4440 972

5 rows × 21 columns

In [114]:
recent_grads.tail(5)
Out[114]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
168 169 3609 ZOOLOGY 8409.0 3050.0 5359.0 Biology & Life Science 0.637293 47 6259 ... 2190 3602 304 0.046320 26000 20000 39000 2771 2947 743
169 170 5201 EDUCATIONAL PSYCHOLOGY 2854.0 522.0 2332.0 Psychology & Social Work 0.817099 7 2125 ... 572 1211 148 0.065112 25000 24000 34000 1488 615 82
170 171 5202 CLINICAL PSYCHOLOGY 2838.0 568.0 2270.0 Psychology & Social Work 0.799859 13 2101 ... 648 1293 368 0.149048 25000 25000 40000 986 870 622
171 172 5203 COUNSELING PSYCHOLOGY 4626.0 931.0 3695.0 Psychology & Social Work 0.798746 21 3777 ... 965 2738 214 0.053621 23400 19200 26000 2403 1245 308
172 173 3501 LIBRARY SCIENCE 1098.0 134.0 964.0 Education 0.877960 2 742 ... 237 410 87 0.104946 22000 20000 22000 288 338 192

5 rows × 21 columns

The Goal

The goal of this project is to visualize the data and in the process extract insights.

Cleaning the Data

In the cells that follow, the data will be analysed to ensure that it can be used to generate visualizations.

The column names are in mixed case. Converting them all to lower case before going further makes them easier to type and reference consistently during analysis.

In [115]:
# Lower-case every column name in one vectorized step
recent_grads.columns = recent_grads.columns.str.lower()
recent_grads.columns
Out[115]:
Index(['rank', 'major_code', 'major', 'total', 'men', 'women',
       'major_category', 'sharewomen', 'sample_size', 'employed', 'full_time',
       'part_time', 'full_time_year_round', 'unemployed', 'unemployment_rate',
       'median', 'p25th', 'p75th', 'college_jobs', 'non_college_jobs',
       'low_wage_jobs'],
      dtype='object')

The following is a quick analysis of the numerical columns.

In [116]:
recent_grads.describe()
Out[116]:
rank major_code total men women sharewomen sample_size employed full_time part_time full_time_year_round unemployed unemployment_rate median p25th p75th college_jobs non_college_jobs low_wage_jobs
count 173.000000 173.000000 172.000000 172.000000 172.000000 172.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000
mean 87.000000 3879.815029 39370.081395 16723.406977 22646.674419 0.522223 356.080925 31192.763006 26029.306358 8832.398844 19694.427746 2416.329480 0.068191 40151.445087 29501.445087 51494.219653 12322.635838 13284.497110 3859.017341
std 50.084928 1687.753140 63483.491009 28122.433474 41057.330740 0.231205 618.361022 50675.002241 42869.655092 14648.179473 33160.941514 4112.803148 0.030331 11470.181802 9166.005235 14906.279740 21299.868863 23789.655363 6944.998579
min 1.000000 1100.000000 124.000000 119.000000 0.000000 0.000000 2.000000 0.000000 111.000000 0.000000 111.000000 0.000000 0.000000 22000.000000 18500.000000 22000.000000 0.000000 0.000000 0.000000
25% 44.000000 2403.000000 4549.750000 2177.500000 1778.250000 0.336026 39.000000 3608.000000 3154.000000 1030.000000 2453.000000 304.000000 0.050306 33000.000000 24000.000000 42000.000000 1675.000000 1591.000000 340.000000
50% 87.000000 3608.000000 15104.000000 5434.000000 8386.500000 0.534024 130.000000 11797.000000 10048.000000 3299.000000 7413.000000 893.000000 0.067961 36000.000000 27000.000000 47000.000000 4390.000000 4595.000000 1231.000000
75% 130.000000 5503.000000 38909.750000 14631.000000 22553.750000 0.703299 338.000000 31433.000000 25147.000000 9948.000000 16891.000000 2393.000000 0.087557 45000.000000 33000.000000 60000.000000 14444.000000 11783.000000 3466.000000
max 173.000000 6403.000000 393735.000000 173809.000000 307087.000000 0.968954 4212.000000 307933.000000 251540.000000 115172.000000 199897.000000 28169.000000 0.177226 110000.000000 95000.000000 125000.000000 151643.000000 148395.000000 48207.000000

Next is a more in-depth analysis of each column. The column-name field in the analysis cell further below can be replaced to view a more in-depth analysis of any given column.

Before the data can be visualized, rows with null or empty values must be removed. The Matplotlib library that is used to generate visualizations expects that the columns of values we pass in have matching lengths, so missing values will cause it to throw errors.

In [117]:
# Number of rows before dropping rows.
raw_data_count = recent_grads.shape[0]
print(raw_data_count)
173
In [118]:
recent_grads = recent_grads.dropna()
In [119]:
# Number of rows after dropping rows.
cleaned_data_count = recent_grads.shape[0]
print(cleaned_data_count)
172
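
Before dropping anything, it can be useful to see which rows and columns actually contain missing values. The sketch below uses a small made-up frame named toy in place of recent_grads, so its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for recent_grads; the values are made up.
toy = pd.DataFrame({
    "major": ["A", "B", "C"],
    "total": [100.0, np.nan, 300.0],
    "men": [60.0, np.nan, 120.0],
})

# Count missing values per column, then select the affected rows.
missing_per_column = toy.isna().sum()
rows_with_missing = toy[toy.isna().any(axis=1)]

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()

print(missing_per_column)
print(rows_with_missing)
print(len(toy) - len(cleaned), "row(s) dropped")
```

Applied to recent_grads before the dropna call, the same two expressions would show which major accounts for the single row removed above.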

The sharewomen column gives the share of graduates of each course who are women, expressed as a fraction between 0 and 1. A similar sharemen column is helpful for the analysis that follows. Since the values are fractions, subtracting sharewomen from 1 gives the value for every row.

In [120]:
recent_grads["sharemen"] = 1 - recent_grads["sharewomen"]
In [122]:
category_data = pd.DataFrame()
category_total = {}
category_men = {}
category_women = {}
category_employed = {}
category_unemployed = {}
category_median = {}

for a_category in recent_grads["major_category"]:
    category_total[a_category] = recent_grads[recent_grads["major_category"] == a_category]["total"].sum()
    category_men[a_category] = recent_grads[recent_grads["major_category"] == a_category]["men"].sum()
    category_women[a_category] = recent_grads[recent_grads["major_category"] == a_category]["women"].sum()
    category_employed[a_category] = recent_grads[recent_grads["major_category"] == a_category]["employed"].sum()
    category_unemployed[a_category] = recent_grads[recent_grads["major_category"] == a_category]["unemployed"].sum()
    category_median[a_category] = recent_grads[recent_grads["major_category"] == a_category]["median"].mean()

category_total_series = pd.Series(category_total)
category_men_series = pd.Series(category_men)
category_women_series = pd.Series(category_women)
category_employed_series = pd.Series(category_employed)
category_unemployed_series = pd.Series(category_unemployed)
category_median_series = pd.Series(category_median)

category_dataframe = pd.DataFrame(category_total_series, columns = ["totals"])
category_dataframe["men"] = category_men_series
category_dataframe["women"] = category_women_series
category_dataframe["employed"] = category_employed_series
category_dataframe["unemployed"] = category_unemployed_series
category_dataframe["median"] = category_median_series

print(category_dataframe)    
                                        totals       men     women  employed  \
Engineering                           537583.0  408307.0  129276.0    420372   
Business                             1302376.0  667852.0  634524.0   1088742   
Physical Sciences                     185479.0   95390.0   90089.0    139231   
Law & Public Policy                   179107.0   91129.0   87978.0    144790   
Computers & Mathematics               299008.0  208725.0   90283.0    237894   
Industrial Arts & Consumer Services   229792.0  103781.0  126011.0    189043   
Arts                                  357130.0  134390.0  222740.0    288114   
Health                                463230.0   75517.0  387713.0    372147   
Social Science                        529966.0  256834.0  273132.0    401493   
Biology & Life Science                453862.0  184919.0  268943.0    302797   
Education                             559129.0  103526.0  455603.0    479839   
Agriculture & Natural Resources        75620.0   40357.0   35263.0     63794   
Humanities & Liberal Arts             713468.0  272846.0  440622.0    544118   
Psychology & Social Work              481007.0   98115.0  382892.0    380344   
Communications & Journalism           392601.0  131921.0  260680.0    330660   
Interdisciplinary                      12296.0    2817.0    9479.0      9821   

                                     unemployed        median  
Engineering                               29817  57382.758621  
Business                                  79877  43538.461538  
Physical Sciences                          7880  41890.000000  
Law & Public Policy                       13495  42200.000000  
Computers & Mathematics                   18373  42745.454545  
Industrial Arts & Consumer Services       11526  36342.857143  
Arts                                      28228  33062.500000  
Health                                    22213  36825.000000  
Social Science                            42975  37344.444444  
Biology & Life Science                    22854  36421.428571  
Education                                 24969  32350.000000  
Agriculture & Natural Resources            3486  35111.111111  
Humanities & Liberal Arts                 51101  31913.333333  
Psychology & Social Work                  33292  30100.000000  
Communications & Journalism               26852  34500.000000  
Interdisciplinary                           749  35000.000000  
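
The per-category totals above can also be produced in a single step with a pandas groupby. The sketch below is one possible equivalent, not the method used in this notebook. It runs on a small made-up frame named toy, with column names mirroring those of recent_grads.

```python
import pandas as pd

# Toy rows standing in for recent_grads; the values are made up.
toy = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Arts"],
    "total": [100, 200, 50],
    "men": [80, 150, 20],
    "women": [20, 50, 30],
    "employed": [90, 160, 40],
    "unemployed": [5, 10, 8],
    "median": [60000, 70000, 30000],
})

# Sums for the counts and the mean of the per-major medians,
# matching what the loop above computes for each category.
by_category = toy.groupby("major_category").agg(
    totals=("total", "sum"),
    men=("men", "sum"),
    women=("women", "sum"),
    employed=("employed", "sum"),
    unemployed=("unemployed", "sum"),
    median=("median", "mean"),
)
print(by_category)
```

One difference to keep in mind: groupby sorts the categories alphabetically by default, while the loop keeps them in order of first appearance. Passing sort=False to groupby preserves the original order.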

Data, talk to me....

Insights from data in individual columns

Distribution of column data

Understanding the spread of the values in each of the columns would be an excellent start to gaining insights from the data. In order to do this each column is assessed individually. If certain columns display ranges that seem overpopulated, those ranges will be further assessed to understand the frequencies within them.

( You can analyze other columns by setting a column in the column-name field below and pressing Ctrl+Enter )

In [123]:
# Replace the value in the column_name for an in-depth analysis of the column
# This also helps with making assessments on what ranges could be used.
column_name = "sharemen"
column_details(recent_grads, column_name)

# Replace the number_of_bins value to see the histograms based on the number of bins set
# min_value and max_value define the range of the distribution. The default is the minimum and maximum of the column
number_of_bins = 15
min_value, max_value = recent_grads[column_name].min(),recent_grads[column_name].max()

recent_grads[column_name].hist(bins = (number_of_bins), range = (min_value, max_value), color = 'cyan')
print("-------------")
print("Bin break-up:")
recent_grads[column_name].value_counts(bins = number_of_bins).sort_index()
Number of unique values in the sharemen column: 172
---------------------
Statistical data related to the sharemen column:
 count    172.00
mean       0.48
std        0.23
min        0.03
25%        0.30
50%        0.47
75%        0.66
max        1.00
Name: sharemen, dtype: object
---------------------
Unique counts for each entry in the sharemen column:

0.031046    1
0.032002    1
0.072193    1
0.076255    1
0.089067    1
           ..
0.892687    1
0.898148    1
0.909287    1
0.922547    1
1.000000    1
Name: sharemen, Length: 172, dtype: int64
-------------
Bin break-up:
Out[123]:
(0.029099999999999997, 0.0956]     7
(0.0956, 0.16]                     8
(0.16, 0.225]                      8
(0.225, 0.289]                    18
(0.289, 0.354]                    19
(0.354, 0.419]                    17
(0.419, 0.483]                    13
(0.483, 0.548]                    12
(0.548, 0.612]                    15
(0.612, 0.677]                    14
(0.677, 0.742]                    12
(0.742, 0.806]                    12
(0.806, 0.871]                     8
(0.871, 0.935]                     8
(0.935, 1.0]                       1
Name: sharemen, dtype: int64
In [124]:
# This is to see a more detailed breakup of any of the ranges mentioned above
# Replace the number_of_bins value to see the histograms based on the number of bins set
# min_value and max_value helps to define the range of the distribution.
number_of_bins = 10
min_value, max_value = 0.483, 0.548

recent_grads[column_name].hist(bins = (number_of_bins), range = (min_value, max_value), color = 'grey')
print("----------------------")
print("Detailed Bin break-up for range: {0} - {1}:".format(min_value,max_value))
required_rows = recent_grads[(recent_grads[column_name]>=min_value) & (recent_grads[column_name]<=max_value)]
required_rows[column_name].value_counts(bins = number_of_bins).sort_index()
----------------------
Detailed Bin break-up for range: 0.483 - 0.548:
Out[124]:
(0.483, 0.489]    2
(0.489, 0.494]    2
(0.494, 0.498]    2
(0.498, 0.503]    0
(0.503, 0.507]    1
(0.507, 0.512]    0
(0.512, 0.516]    1
(0.516, 0.521]    0
(0.521, 0.525]    2
(0.525, 0.53]     2
Name: sharemen, dtype: int64
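
One detail worth noting about the breakdown above: value_counts(bins=...) derives its bin edges from the smallest and largest values present in the filtered rows, not from min_value and max_value. That would explain why the last interval printed ends at 0.53 and not at 0.548. The sketch below, on made-up values, counts against the requested range instead, using np.histogram, the routine that Matplotlib's hist relies on, so the counts line up with the plotted bars.

```python
import numpy as np

# Made-up shares standing in for recent_grads["sharemen"].
values = np.array([0.485, 0.49, 0.50, 0.53])
number_of_bins = 10
min_value, max_value = 0.483, 0.548

# The edges span the requested range (0.483 to 0.548) even though the
# largest value present is 0.53.
counts, edges = np.histogram(values, bins=number_of_bins, range=(min_value, max_value))

# Every bin is half-open [left, right) except the last, which includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.4f} to {right:.4f}: {count}")
```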

Insights

  • ShareMen - There are 76 courses in which male participation is dominant. Thus about 44% of the courses are male dominant.
  • ShareWomen - There are 96 courses in which female participation is dominant. Thus almost 56% of the courses are female dominant.
  • Unemployment_rate - The most common unemployment range is 5.5%-7.0%, covering about 32 courses. There are 11 courses that have an unemployment rate greater than 11%.
  • Median - The most common median salary range is 33733.33 - 39600.0. There are 46 courses that fall into this range.
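
Counts such as the number of male-dominant and female-dominant courses can be read directly from the column with a boolean comparison, without going through histogram bins. The following is a minimal sketch on made-up shares. On the real data, the sharemen Series would be recent_grads["sharemen"].

```python
import pandas as pd

# Made-up shares standing in for recent_grads["sharemen"].
sharemen = pd.Series([0.88, 0.35, 0.62, 0.20, 0.45])

# A comparison yields a boolean Series; summing it counts the True values.
male_majority = int((sharemen > 0.5).sum())
female_majority = int((sharemen < 0.5).sum())

print(male_majority, f"{male_majority / len(sharemen):.0%}")
print(female_majority, f"{female_majority / len(sharemen):.0%}")
```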

Look...I mean Luke, I am your father....

Understanding the relationships between the different columns

There may or may not be relationships between columns. The relationships between the data could reveal more insights depending on the questions being asked. The following cells will be looking at asking specific questions and looking at what the data has to reveal.

Question: Do students in more popular courses make more money?

It seems reasonable to assume that one reason a course might be popular is its earning potential. Considering the amount spent on tuition, many students take up courses that promise a higher return on investment (ROI).

In [125]:
recent_grads.plot(x = "total", y = "median", kind = 'scatter', title = "Total Graduates vs. Median Earnings", 
                  figsize = (7,7), rot = 90, color = 'blue')
Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x15996f85e08>

The hexbin plot below groups overlapping points into clusters, which should help identify the most common median salaries.

In [126]:
recent_grads.plot.hexbin(x = "total", y = "median", title = "Total Graduates vs. Median Earnings", 
                         gridsize = (15), figsize = (9,7), sharex=False, colormap = 'viridis', rot = 90)
Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x15996c4dc08>

Answer: It is clear from the data that the popularity of a course does not drive earnings. A significant number of graduates earn between 33733.33 and 39600.0 irrespective of the popularity of the course, across both genders. This range was identified in the earlier evaluation of the median column and is reaffirmed here.

This is understandable, as most new graduates cannot expect very high earnings irrespective of what they have studied. They need skills and experience, and must prove themselves, before they are paid more. As with all data there are outliers; these have been disregarded.

The picture would likely be quite different if the survey covered graduates who had been working for 4 to 5 years.
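
A scatter plot gives a visual impression. A correlation coefficient is one way to put a number on it. The sketch below uses made-up values in a frame named toy and computes both the Pearson coefficient and a rank-based variant, which is less sensitive to a few very large courses. It is an optional check under those assumptions and not part of the original analysis.

```python
import pandas as pd

# Made-up values standing in for the total and median columns of recent_grads.
toy = pd.DataFrame({
    "total": [500, 2000, 15000, 40000, 120000],
    "median": [60000, 35000, 42000, 33000, 38000],
})

# Pearson correlation measures the strength of a linear relationship.
pearson = toy["total"].corr(toy["median"])

# Correlating the ranks instead of the raw values gives Spearman's coefficient.
rank_based = toy["total"].rank().corr(toy["median"].rank())

print(round(pearson, 2), round(rank_based, 2))
```

Values close to 0 would support the reading that popularity and median earnings are largely unrelated.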


Question: Do students that majored in subjects that have a female majority make more money?

Gender bias is a well-known factor. Males are usually paid more than females across many industries. While the trend is changing, it is yet to be fully undone. This question aims to investigate whether majors that are majority female suffer the same bias.

In [127]:
recent_grads.plot(x = "sharewomen", y = "median", kind = 'scatter', title = "Share of women in  major vs. Median", 
                  figsize = (7,7), color = 'pink')
Out[127]:
<matplotlib.axes._subplots.AxesSubplot at 0x15996f62dc8>

In general, as the share of women increases the median salary decreases. The hexbin plot below helps to identify the clusters.

In [142]:
recent_grads.plot.hexbin(x = "sharewomen", y = "median", gridsize=(15), title = "Share of women in  major vs. Median", 
                         figsize = (9, 7), colormap = 'spring', sharex=False)
Out[142]:
<matplotlib.axes._subplots.AxesSubplot at 0x15997d12908>

Answer: No. It is evident that students in courses that had a female majority made less money. There may be a variety of factors behind this, including the type and nature of the course undertaken. More data is required to make a fair assessment.
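
The same comparison can be summarized numerically by splitting the majors into female-majority and male-majority groups and comparing their typical median earnings. The sketch below does this on made-up values. The frame toy and the label Series majority are illustrative names only.

```python
import pandas as pd

# Made-up values standing in for the sharewomen and median columns.
toy = pd.DataFrame({
    "sharewomen": [0.12, 0.34, 0.64, 0.80, 0.88],
    "median": [110000, 65000, 36000, 25000, 22000],
})

# Label each major by which gender is in the majority.
majority = (
    toy["sharewomen"]
    .gt(0.5)
    .map({True: "female majority", False: "male majority"})
    .rename("majority")
)

# Number of majors and the median of their median earnings, per group.
summary = toy.groupby(majority)["median"].agg(["count", "median"])
print(summary)
```

A summary like this describes an association only. It does not separate the effect of gender from the effect of the field of study.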


Question: Do courses with more full-time employed graduates yield higher median salaries?

The more full-time employees a course produces, the more likely it is that its graduates will have a job. This question attempts to discover whether the same success translates into their salaries.

In [141]:
recent_grads.plot(x = "full_time", y = "median", kind = 'scatter', title = "Full Time vs Median", figsize = (7,7), color = 'blue')
Out[141]:
<matplotlib.axes._subplots.AxesSubplot at 0x15997b74d48>

The hexbin plot below should help to identify the clusters.

In [140]:
recent_grads.plot.hexbin(x = 'full_time', y = 'median', title = 'Full time vs. Median', gridsize = (15), 
                         figsize = (9, 7), colormap = 'seismic', sharex = False)
Out[140]:
<matplotlib.axes._subplots.AxesSubplot at 0x15997ae8a88>

Answer: There is no specific relationship. What is revealing is that, for the majority of courses, full-time employed graduates earned the same common median salary identified earlier. This was found earlier with regard to the popularity of the course, and it also holds true for the number of graduates in full-time employment.

Neither the popularity of a course nor the number of full-time employed graduates it produces guarantees a higher ROI.

Look Deeper...

A more in-depth analysis of the data

Earlier plots helped to determine relationships between two columns. However, there may be multiple columns that are related. Being able to see these relationships would lead to more insights. The following plots attempt to give insights related to multiple columns.

Question: Can the popularity of a course predict the chances of being employed?

Almost everyone joins college to further their education and thereby increase their chances of being employed. Some students may pursue particular courses because the chances of being employed on completing them may be higher. This question investigates whether course popularity (determined by the total column) has any relation to employment status.

In [143]:
scatter_matrix(recent_grads[['total', 'employed', 'unemployed']], figsize = (12,12))
Out[143]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000015997C30D08>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000015997C70A48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000015997CAA148>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000015997D31B48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000015997D6FE88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000015998F92788>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000015998FCC848>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000015999003988>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001599900F588>]],
      dtype=object)

Answer: No, it can't. As the popularity of a course increases, so do the numbers of both employed and unemployed graduates.

Because both counts grow together with course size, a popular course does not by itself improve a student's odds of being employed after graduation. This trend is especially visible in the 6th plot (employed vs. unemployed).
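
Counts of employed and unemployed graduates rise with course size almost by construction, so a rate is a fairer measure of a student's chances. The sketch below, on made-up values, contrasts the two views. It reuses the Unemployed / (Unemployed + Employed) definition from the column list, and the frame name toy is an assumption for illustration.

```python
import pandas as pd

# Made-up values standing in for the total, employed and unemployed columns.
toy = pd.DataFrame({
    "total": [1000, 5000, 20000, 100000],
    "employed": [800, 3900, 16500, 79000],
    "unemployed": [60, 250, 1400, 5200],
})

# Raw counts track course size closely.
count_corr = toy["total"].corr(toy["employed"])

# The rate removes the size effect before comparing with popularity.
toy["unemployment_rate"] = toy["unemployed"] / (toy["unemployed"] + toy["employed"])
rate_corr = toy["total"].corr(toy["unemployment_rate"])

print(round(count_corr, 3), round(rate_corr, 3))
```

On the real data the unemployment_rate column already exists, so only the final correlation would be needed.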


Question: Does the popularity of a course have an impact on the gender mix of the students who take it?

A number of factors play a role in a student's choice of course. Popularity of the course is one of them. The question is, does the popularity of a course affect male and female participation?

In [146]:
scatter_matrix(recent_grads[['total', 'men', 'women']], figsize = (12,12))
Out[146]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000015994E3D388>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001599587ABC8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000159992272C8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001599925D3C8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000159992944C8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000159992CE608>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001599930DE88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001599933E788>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001599934A388>]],
      dtype=object)

Answer: Yes, but more for women than for men. Comparing the 2nd and 3rd plots makes this more obvious. For men, participation appears to level off even in some popular courses, whereas women's participation rises steadily with popularity.

However, there is a bias that must be noted. An earlier analysis showed that almost 56% of the courses had a female majority, which could explain the tilt toward the female side. The comparison would be fairer if male-majority and female-majority courses were equally represented.


Team Marvel, Team JLA, Team...Rugrats?!!

Categorizing data for grouped analysis

Until now, the data has been analyzed by individual course. However, the courses are also grouped into categories, recorded in the major_category column. Categorized data gives a new perspective. The following cells are a visual analysis of the categorized data.

Question: How is the gender distribution among the different major categories?

In [174]:
category_data = pd.DataFrame()
category_total = {}
category_men = {}
category_women = {}
category_employed = {}
category_unemployed = {}
category_median = {}
category_unemp_rate = {}

for a_category in recent_grads["major_category"]:
    category_total[a_category] = recent_grads[recent_grads["major_category"] == a_category]["total"].sum()
    category_men[a_category] = recent_grads[recent_grads["major_category"] == a_category]["men"].sum()
    category_women[a_category] = recent_grads[recent_grads["major_category"] == a_category]["women"].sum()
    category_employed[a_category] = recent_grads[recent_grads["major_category"] == a_category]["employed"].sum()
    category_unemployed[a_category] = recent_grads[recent_grads["major_category"] == a_category]["unemployed"].sum()
    category_median[a_category] = recent_grads[recent_grads["major_category"] == a_category]["median"].mean()
    category_unemp_rate[a_category] = category_unemployed[a_category] / (category_unemployed[a_category] + category_employed[a_category])

category_total_series = pd.Series(category_total)
category_men_series = pd.Series(category_men)
category_women_series = pd.Series(category_women)
category_employed_series = pd.Series(category_employed)
category_unemployed_series = pd.Series(category_unemployed)
category_median_series = pd.Series(category_median)
category_unemp_rate_series = pd.Series(category_unemp_rate)

category_dataframe = pd.DataFrame(category_total_series, columns = ["totals"])
category_dataframe["men"] = category_men_series
category_dataframe["women"] = category_women_series
category_dataframe["employed"] = category_employed_series
category_dataframe["unemployed"] = category_unemployed_series
category_dataframe["median"] = category_median_series
category_dataframe["unemp_rate"] = category_unemp_rate_series


category_dataframe[['men','women']].plot(kind='bar', title = "Major Categories participation by Gender", figsize = (10,10))
Out[174]:
<matplotlib.axes._subplots.AxesSubplot at 0x1599c627548>

Answer:

  • Female participation exceeds male participation in 10 out of 16 categories
  • Business is the most popular category for both genders, with near-equal participation
  • Males significantly outnumber females in courses related to Engineering
  • Females significantly outnumber males in multiple categories, such as Health, Education, and Psychology & Social Work
  • Females also outnumber males in courses that are interdisciplinary
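
The category-level comparisons can be re-derived from the men and women totals printed in the category table earlier. The sketch below copies those totals into a plain dictionary and lists the categories where women outnumber men. The dictionary name category_gender is illustrative.

```python
# (men, women) totals copied from the category table printed earlier.
category_gender = {
    "Engineering": (408307, 129276),
    "Business": (667852, 634524),
    "Physical Sciences": (95390, 90089),
    "Law & Public Policy": (91129, 87978),
    "Computers & Mathematics": (208725, 90283),
    "Industrial Arts & Consumer Services": (103781, 126011),
    "Arts": (134390, 222740),
    "Health": (75517, 387713),
    "Social Science": (256834, 273132),
    "Biology & Life Science": (184919, 268943),
    "Education": (103526, 455603),
    "Agriculture & Natural Resources": (40357, 35263),
    "Humanities & Liberal Arts": (272846, 440622),
    "Psychology & Social Work": (98115, 382892),
    "Communications & Journalism": (131921, 260680),
    "Interdisciplinary": (2817, 9479),
}

# Keep the categories in which the women total exceeds the men total.
female_majority = [
    name for name, (men, women) in category_gender.items() if women > men
]

print(len(female_majority), "of", len(category_gender), "categories")
for name in female_majority:
    print(" ", name)
```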

Question: How is the median distribution among the different major categories?

In [152]:
# for a_category in recent_grads["major_category"].unique():
#     recent_grads[recent_grads["major_category"] == "a_category"]["median"].plot(kind = 'box')

recent_grads.boxplot(by = "major_category", column = "median", rot = 90, figsize = (20,5))
Out[152]:
<matplotlib.axes._subplots.AxesSubplot at 0x1599a598b48>
In [182]:
category_dataframe["median"].plot(kind='bar', color = 'green')
Out[182]:
<matplotlib.axes._subplots.AxesSubplot at 0x1599e1d1388>

Answer:

  • Engineering pays the most. Its median salary averages roughly 57000 across majors.
  • Courses related to Psychology and Social Work yield the lowest pay.
  • Interdisciplinary courses yield a median salary of 35000. There are only a few interdisciplinary courses.

Question: How is the unemployment_rate distributed among the different major categories?

In [180]:
recent_grads.boxplot(by = "major_category", column = "unemployment_rate", rot = 90, figsize = (20,5))
Out[180]:
<matplotlib.axes._subplots.AxesSubplot at 0x1599d2e8c48>
In [181]:
category_dataframe["unemp_rate"].plot(kind='bar', color = 'red')
Out[181]:
<matplotlib.axes._subplots.AxesSubplot at 0x1599120dc88>

Answers:

  • Social Science is the category most affected by unemployment, so much so that in its fourth quartile the unemployment rate exceeds 11% and reaches beyond 17%
  • Courses in the Education category are, in comparison, less affected by unemployment. The median is about 5%, and in its fourth quartile the unemployment rate does not go beyond 7.5%
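
The quartile figures quoted above are read off the box plot. The numbers behind such a plot can be tabulated with groupby and describe, as in the sketch below. The values are made up and the frame name toy is illustrative, so the output does not reproduce the figures quoted for the real data.

```python
import pandas as pd

# Made-up unemployment rates for two categories.
toy = pd.DataFrame({
    "major_category": ["Education"] * 4 + ["Social Science"] * 4,
    "unemployment_rate": [0.03, 0.05, 0.06, 0.07, 0.06, 0.09, 0.11, 0.17],
})

# describe() reports the quartiles a box plot draws. The fourth quartile is
# the span between the 75% value and the maximum.
quartiles = (
    toy.groupby("major_category")["unemployment_rate"]
    .describe()[["25%", "50%", "75%", "max"]]
)
print(quartiles)
```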

Conclusion

Using the survey data provided by the American Community Survey, this report has attempted to glean interesting insights.

By initially going through the spread of data in each column, it was found that women have a greater share than men in the overall attendance of courses. Analyzing the relationships between certain columns showed that most graduates start their careers on almost the same pay irrespective of the course they have done. Further analysis of the relationships showed that the popularity of a course cannot guarantee employment on course completion. Finally, analysis of the categorized data showed that courses in the Engineering category generally tend to return a higher median salary and that unemployment rates are higher for courses related to Social Science.

What this data tells us is that the course alone has no bearing on how careers grow. Irrespective of which course you start off with, there is a lot of work before you get to the top. Given the median salaries in the data, there is nothing wrong with starting with a job that pays less. It gives you more opportunity to learn and develop before you hit the big leagues.