A job is meant to be a given, especially after spending time, effort and, in most cases, a lot of money on earning a degree. For most graduates it not only validates the effort they have put in but also confirms their place in society as a contributing member.
For many, however, this is far from reality. Many industries are looking for more than a certificate. While a degree does validate that a graduate is familiar with the subject matter, it does not guarantee the graduate's skill in managing the various scenarios that may arise during the course of employment.
Some graduates keep those skills up to date while they are studying, through part-time jobs or self-experimentation. Others choose to improve their skills after graduation. Depending on the course undertaken and the kind of skill required for employability, some graduates take up full-time or part-time jobs in their own field of study, others take up jobs in an area unrelated to what they studied, and still others are left without employment.
The American Community Survey conducts surveys and aggregates the data from them. One such survey covered students who graduated from college between 2010 and 2012, to understand how they fared in finding employment after graduation.
The data is a list of college majors by rank. Details of the majors themselves include the total number of students who graduated, the classification of the major and the median salary earned by full-time employees who graduated. The remaining part of the data pertains to the students who undertook the major, including such details as the gender breakup and the number of full-time, part-time and unemployed graduates.
The columns are defined below.
In the following cells, the data will be extracted from a cleaned version of the recent-grads.csv file containing the data necessary for analysis.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
%matplotlib inline
# Function name: print_full(a_list)
# Input: A python list
# Output: The full list
# Description: Jupyter notebooks by default do not display all the data for large datasets. It shows a few lines and summarizes
# the rest using ellipsis. This function helps to see the full python list
def print_full(a_list):
    pd.set_option('display.max_rows', len(a_list))
    print(a_list)
    pd.reset_option('display.max_rows')
# Function name: column_details(the_dataframe, a_column)
# Input: A python dataframe and one of its columns
# Output: Details related to the column
# Description: This function provides analysis information related to a column associated to a python dataframe.
# It handles only string and numerical columns
def column_details(the_dataframe, a_column):
    print("Number of unique values in the {0} column:".format(a_column), the_dataframe[a_column].unique().shape[0])
    print("---------------------")
    if the_dataframe[a_column].dtype == "object":
        print("Statistical data related to the {0} column:\n".format(a_column), the_dataframe[a_column].describe())
    else:
        print("Statistical data related to the {0} column:\n".format(a_column), the_dataframe[a_column].describe().map('{:,.2f}'.format))
    print("---------------------")
    print("Unique counts for each entry in the {0} column:\n".format(a_column))
    print(the_dataframe[a_column].value_counts().sort_index(ascending=True))
recent_grads = pd.read_csv("recent-grads.csv")
recent_grads.iloc[0]
recent_grads.head(5)
recent_grads.tail(5)
The goal of this project is to visualize the data and in the process extract insights.
In the cells that follow, the data will be analysed to ensure that it can be used to generate visualizations.
The column names are mixed case. Before going further, it is best to convert all column names to the same case, which makes the later analysis quicker to write.
recent_grads_columns_fixed = []
for a_column in recent_grads.columns:
    recent_grads_columns_fixed.append(a_column.lower())
recent_grads.columns = recent_grads_columns_fixed
recent_grads.columns
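The same renaming can be done without an explicit loop. The sketch below is an illustrative alternative, not the notebook's own approach; the frame `sample` and its values are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical miniature frame with mixed-case column names (not the survey data).
sample = pd.DataFrame({
    "Major": ["A", "B"],
    "ShareWomen": [0.4, 0.6],
    "Median": [40000, 35000],
})

# Index.str.lower() lowercases every column label in one vectorized call.
sample.columns = sample.columns.str.lower()
print(list(sample.columns))
```

Applied to the notebook's data, the equivalent line would be `recent_grads.columns = recent_grads.columns.str.lower()`.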
The following is a quick analysis of the numerical columns.
recent_grads.describe()
Next is a more in-depth analysis of each column. The column-name field can be replaced in the cell below to view a more in-depth analysis of the given column.
In order to visualize data, it must be ensured that rows with null or empty values are removed. The Matplotlib library that is used to generate visualizations expects that the columns of values we pass in have matching lengths, so missing values will cause it to throw errors.
# Number of rows before dropping rows.
raw_data_count = recent_grads.shape[0]
print(raw_data_count)
recent_grads = recent_grads.dropna()
# Number of rows after dropping rows.
cleaned_data_count = recent_grads.shape[0]
print(cleaned_data_count)
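Before dropping rows it can help to see where the missing values actually are. The following is a minimal sketch on a hypothetical three-row frame (`sample` and its values are invented for illustration); it counts missing values per column, lists the affected rows, and then drops them the same way the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value (illustrative, not the survey data).
sample = pd.DataFrame({
    "major": ["A", "B", "C"],
    "total": [100.0, np.nan, 300.0],
    "median": [40000, 35000, 30000],
})

# Count of missing values in each column.
missing_per_column = sample.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]

# Drop those rows and record how many were removed.
cleaned = sample.dropna()
dropped = len(sample) - len(cleaned)

print(missing_per_column)
print(rows_with_missing)
print("Rows dropped:", dropped)
```

On the real data, `recent_grads.isna().sum()` would show which columns account for the difference between the two row counts printed above.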
The column sharewomen gives the share of graduates from the course who are women, expressed as a fraction between 0 and 1. A matching sharemen column is therefore helpful for the analysis that follows. Since the values are fractions, a simple subtraction from 1 yields the value for every row.
recent_grads["sharemen"] = 1 - recent_grads["sharewomen"]
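As a small worked check of that subtraction, the sketch below builds a hypothetical two-row frame and compares `1 - sharewomen` with the share computed directly from the counts. It assumes that sharewomen equals women divided by the sum of men and women, which the source does not state explicitly; all values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical counts for two majors (illustrative values, not survey data).
sample = pd.DataFrame({"men": [60, 20], "women": [40, 80]})
sample["total"] = sample["men"] + sample["women"]

# Assumption: the share of women is women / total.
sample["sharewomen"] = sample["women"] / sample["total"]

# Same subtraction as in the notebook: the two shares sum to 1.
sample["sharemen"] = 1 - sample["sharewomen"]

# Cross-check against the share computed directly from the counts.
shares_agree = np.allclose(sample["sharemen"], sample["men"] / sample["total"])
print(shares_agree)
```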
category_data = pd.DataFrame()
category_total = {}
category_men = {}
category_women = {}
category_employed = {}
category_unemployed = {}
category_median = {}
for a_category in recent_grads["major_category"].unique():
    category_total[a_category] = recent_grads[recent_grads["major_category"] == a_category]["total"].sum()
    category_men[a_category] = recent_grads[recent_grads["major_category"] == a_category]["men"].sum()
    category_women[a_category] = recent_grads[recent_grads["major_category"] == a_category]["women"].sum()
    category_employed[a_category] = recent_grads[recent_grads["major_category"] == a_category]["employed"].sum()
    category_unemployed[a_category] = recent_grads[recent_grads["major_category"] == a_category]["unemployed"].sum()
    category_median[a_category] = recent_grads[recent_grads["major_category"] == a_category]["median"].mean()
category_total_series = pd.Series(category_total)
category_men_series = pd.Series(category_men)
category_women_series = pd.Series(category_women)
category_employed_series = pd.Series(category_employed)
category_unemployed_series = pd.Series(category_unemployed)
category_median_series = pd.Series(category_median)
category_dataframe = pd.DataFrame(category_total_series, columns = ["totals"])
category_dataframe["men"] = category_men_series
category_dataframe["women"] = category_women_series
category_dataframe["employed"] = category_employed_series
category_dataframe["unemployed"] = category_unemployed_series
category_dataframe["median"] = category_median_series
print(category_dataframe)
Insights from data in individual columns
Understanding the spread of the values in each of the columns would be an excellent start to gaining insights from the data. In order to do this each column is assessed individually. If certain columns display ranges that seem overpopulated, those ranges will be further assessed to understand the frequencies within them.
(You can analyze any column by setting its name in the column_name field below and pressing Ctrl+Enter.)
# Replace the value in the column_name for an in-depth analysis of the column
# This also helps with making assessments on what ranges could be used.
column_name = "sharemen"
column_details(recent_grads, column_name)
# Replace the number_of_bins value to see the histograms based on the number of bins set
# min_value and max_value helps to define the range of the distribution. The default is the minimum and maximum of each column
number_of_bins = 15
min_value, max_value = recent_grads[column_name].min(),recent_grads[column_name].max()
recent_grads[column_name].hist(bins = (number_of_bins), range = (min_value, max_value), color = 'cyan')
print("-------------")
print("Bin break-up:")
recent_grads[column_name].value_counts(bins = number_of_bins).sort_index()
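For readers unfamiliar with the `bins` argument: `value_counts(bins=n)` splits the range between the minimum and maximum into `n` equal-width intervals and counts the values in each. The sketch below illustrates this on a hypothetical six-value series and compares it with `pd.cut`, which performs the same kind of binning. The values are invented and deliberately kept away from the bin edges.

```python
import pandas as pd

# Hypothetical values (illustrative, not the survey data).
values = pd.Series([0.10, 0.15, 0.22, 0.48, 0.61, 0.90])
number_of_bins = 4

# Equal-width binning done directly by value_counts.
counts_direct = values.value_counts(bins=number_of_bins).sort_index()

# The same binning done explicitly with pd.cut, then counted.
counts_via_cut = pd.cut(values, bins=number_of_bins).value_counts().sort_index()

print(counts_direct)
print(counts_via_cut)
```

Both routes report the same count per interval, and the counts add up to the number of values.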
# This is to see a more detailed breakup of any of the ranges mentioned above
# Replace the number_of_bins value to see the histograms based on the number of bins set
# min_value and max_value helps to define the range of the distribution.
number_of_bins = 10
min_value, max_value = 0.483, 0.548
recent_grads[column_name].hist(bins = (number_of_bins), range = (min_value, max_value), color = 'grey')
print("----------------------")
print("Detailed Bin break-up for range: {0} - {1}:".format(min_value,max_value))
required_rows = recent_grads[(recent_grads[column_name]>=min_value) & (recent_grads[column_name]<=max_value)]
required_rows[column_name].value_counts(bins = number_of_bins).sort_index()
Understanding the relationships between the different columns
There may or may not be relationships between columns. The relationships between the data could reveal more insights depending on the questions being asked. The following cells will be looking at asking specific questions and looking at what the data has to reveal.
One plausible reason a course is popular is its earning potential. Considering the amount spent on tuition, many students take up courses that promise a higher return on investment (ROI).
recent_grads.plot(x = "total", y = "median", kind = 'scatter', title = "Total Graduates vs. Median Earnings",
figsize = (7,7), rot = 90, color = 'blue')
The hexbin plot below attempts to identify overlapping clusters, which should help pinpoint the most common median salaries.
recent_grads.plot.hexbin(x = "total", y = "median", title = "Total Graduates vs. Median Earnings",
gridsize = (15), figsize = (9,7), sharex=False, colormap = 'viridis', rot = 90)
Answer: The data suggests that the popularity of a course does not determine the money earned. A significant number of graduates earn between 33733.33 and 39600.0 irrespective of the popularity of the course, across both genders. This range was identified in the earlier analysis of the median column and is reaffirmed here.
This is understandable, as most students who graduate cannot expect very high earnings irrespective of what they have studied. They require skills and experience, and need to prove themselves before they are paid more. Like all data, this dataset has outliers; these have been disregarded.
The picture would likely be quite different if the survey had covered graduates who have been working for 4 to 5 years.
Gender bias is a well-known factor. Men are usually paid more than women across many industries. While the trend is changing, it is yet to be fully undone. This question investigates whether majors that are majority female suffer the same bias.
recent_grads.plot(x = "sharewomen", y = "median", kind = 'scatter', title = "Share of women in major vs. Median",
figsize = (7,7), color = 'pink')
In general, as the share of women increases the median salary decreases. The hexbin plot below helps to identify the clusters.
recent_grads.plot.hexbin(x = "sharewomen", y = "median", gridsize=(15), title = "Share of women in major vs. Median",
figsize = (9, 7), colormap = 'spring', sharex=False)
Answer: No. The plots show that students in courses with a female majority made less money. There may be a variety of factors behind this, including the type and nature of the course undertaken. More data is required to make a fair assessment.
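The downward trend read off the plots could also be summarized with a single number. The sketch below is an illustrative addition, not part of the original analysis: it computes the Pearson correlation coefficient on a hypothetical five-row frame whose values are invented to mimic a falling trend.

```python
import pandas as pd

# Hypothetical values that mimic a downward trend (not the survey data).
sample = pd.DataFrame({
    "sharewomen": [0.10, 0.30, 0.50, 0.70, 0.90],
    "median": [60000, 52000, 45000, 38000, 33000],
})

# Series.corr defaults to the Pearson coefficient, which ranges from -1 to 1.
# A value near -1 means median falls almost linearly as sharewomen rises.
pearson = sample["sharewomen"].corr(sample["median"])
print(round(pearson, 3))
```

On the notebook's data the equivalent call would be `recent_grads["sharewomen"].corr(recent_grads["median"])`. A correlation describes association only and says nothing about the cause.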
The more full-time employees a course produces, the more likely it is that students who graduate from the course will have a job. This question attempts to discover whether the same success translates into their salaries.
recent_grads.plot(x = "full_time", y = "median", kind = 'scatter', title = "Full Time vs Median", figsize = (7,7), color = 'blue')
The hexbin plot below should help to identify the clusters.
recent_grads.plot.hexbin(x = 'full_time', y = 'median', title = 'Full time vs. Median', gridsize = (15),
figsize = (9, 7), colormap = 'seismic', sharex = False)
Answer: There is no specific relation, but what is revealing is that for a majority of the courses, full-time employed graduates earned the same common median salary identified earlier. While this was found earlier with regard to the popularity of the course, it also holds true for the number of graduates who are employed full time.
This is revealing. Neither the popularity of the course nor the number of full-time employed graduates that it produces guarantees that it will give a higher ROI.
A more in-depth analysis of the data
Earlier plots helped to determine relationships between two columns. However, there may be multiple columns that are related. Being able to see these relationships would lead to more insights. The following plots attempt to give insights related to multiple columns.
Almost everyone joins college to further their education and thereby increase their chances of being employed. Some students may pursue certain courses because the chances of being employed on completing the course may be higher. This question investigates whether course popularity (determined by the total column) has any relation to employment status.
scatter_matrix(recent_grads[['total', 'employed', 'unemployed']], figsize = (12,12))
Answer: No, it can't. The trend shows that as the popularity of a course increases, so do the numbers of both employed and unemployed graduates.
A student undertaking a popular course has an almost equal chance of being employed or unemployed after graduation. This trend is especially visible in the 6th plot (employed vs. unemployed).
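Because both counts rise with the size of a course, a ratio can make courses of different sizes easier to compare. The sketch below is an illustrative addition on a hypothetical three-row frame with invented values; it uses the same ratio that this notebook computes per category further on, unemployed / (unemployed + employed).

```python
import pandas as pd

# Hypothetical majors of very different sizes (illustrative, not survey data).
sample = pd.DataFrame({
    "total": [1000, 5000, 20000],
    "employed": [800, 4000, 16000],
    "unemployed": [80, 400, 1600],
})

# Unemployment rate: unemployed graduates as a fraction of those in the labour force.
sample["unemp_rate"] = sample["unemployed"] / (sample["employed"] + sample["unemployed"])
print(sample)
```

In this invented example the raw counts of employed and unemployed graduates both grow with `total`, while the rate stays the same for every row.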
A number of factors play a role in a student's choice of course. Popularity of the course is one of them. The question is, does the popularity of a course affect male and female participation?
scatter_matrix(recent_grads[['total', 'men', 'women']], figsize = (12,12))
Answer: Yes, but more for women than men. Comparing the 2nd and 3rd plots makes this more obvious. For men, despite the popularity of some courses, participation looks almost withheld when compared against women, whose participation rises steadily.
However, there is a bias that must be noted. An earlier analysis showed that almost 56% of the courses had a female majority. This could be why there is a tilt to the female side. This could have been considered as unbiased if we had an equal majority of both male and female.
Categorizing data for grouped analysis
Until now, the data has been analyzed by individual course. However, the courses are also categorized, and the category for each course is stored in the major_category column. Categorized data gives a new perspective on the data. The following cells are a visual analysis of the categorized data.
category_data = pd.DataFrame()
category_total = {}
category_men = {}
category_women = {}
category_employed = {}
category_unemployed = {}
category_median = {}
category_unemp_rate = {}
for a_category in recent_grads["major_category"].unique():
    category_total[a_category] = recent_grads[recent_grads["major_category"] == a_category]["total"].sum()
    category_men[a_category] = recent_grads[recent_grads["major_category"] == a_category]["men"].sum()
    category_women[a_category] = recent_grads[recent_grads["major_category"] == a_category]["women"].sum()
    category_employed[a_category] = recent_grads[recent_grads["major_category"] == a_category]["employed"].sum()
    category_unemployed[a_category] = recent_grads[recent_grads["major_category"] == a_category]["unemployed"].sum()
    category_median[a_category] = recent_grads[recent_grads["major_category"] == a_category]["median"].mean()
    category_unemp_rate[a_category] = category_unemployed[a_category] / (category_unemployed[a_category] + category_employed[a_category])
category_total_series = pd.Series(category_total)
category_men_series = pd.Series(category_men)
category_women_series = pd.Series(category_women)
category_employed_series = pd.Series(category_employed)
category_unemployed_series = pd.Series(category_unemployed)
category_median_series = pd.Series(category_median)
category_unemp_rate_series = pd.Series(category_unemp_rate)
category_dataframe = pd.DataFrame(category_total_series, columns = ["totals"])
category_dataframe["men"] = category_men_series
category_dataframe["women"] = category_women_series
category_dataframe["employed"] = category_employed_series
category_dataframe["unemployed"] = category_unemployed_series
category_dataframe["median"] = category_median_series
category_dataframe["unemp_rate"] = category_unemp_rate_series
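The cell above filters the frame once per category and assembles the result from separate dictionaries. A more compact route, sketched here as an illustrative alternative on a hypothetical miniature frame, is `groupby` with named aggregation. The frame `sample`, its values, and the name `category_summary` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical rows standing in for recent_grads (illustrative values only).
sample = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Arts"],
    "total": [100, 200, 50],
    "men": [80, 150, 20],
    "women": [20, 50, 30],
    "employed": [90, 160, 30],
    "unemployed": [5, 10, 10],
    "median": [60000, 50000, 30000],
})

# One pass over the data: sums for the counts, mean for the median salary.
category_summary = sample.groupby("major_category").agg(
    totals=("total", "sum"),
    men=("men", "sum"),
    women=("women", "sum"),
    employed=("employed", "sum"),
    unemployed=("unemployed", "sum"),
    median=("median", "mean"),
)

# Same ratio as in the loop: unemployed / (unemployed + employed).
category_summary["unemp_rate"] = category_summary["unemployed"] / (
    category_summary["unemployed"] + category_summary["employed"]
)
print(category_summary)
```

The resulting frame has one row per category and the same column names as `category_dataframe`, so the plotting calls that follow would work on it unchanged.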
category_dataframe[['men','women']].plot(kind='bar', title = "Major Categories participation by Gender", figsize = (10,10))
Answer:
# for a_category in recent_grads["major_category"].unique():
# recent_grads[recent_grads["major_category"] == "a_category"]["median"].plot(kind = 'box')
recent_grads.boxplot(by = "major_category", column = "median", rot = 90, figsize = (20,5))
category_dataframe["median"].plot(kind='bar', color = 'green')
Answer:
recent_grads.boxplot(by = "major_category", column = "unemployment_rate", rot = 90, figsize = (20,5))
category_dataframe["unemp_rate"].plot(kind='bar', color = 'red')
Answers:
Using the survey data provided by the American Community Survey, this report has attempted to glean interesting insights.
By initially going through the data spread in each column it was found that women have a greater share than men in the overall attendance of courses. After analyzing the relationship between certain columns it has come to light that most graduates start off their careers with almost the same pay irrespective of the course they have done. Further analysis of the relationships yielded the fact that popularity of a course cannot guarantee employment on course completion. Finally after analysis of the categorical data, it was found that courses in the Engineering category generally tend to return a higher median salary and that unemployment rates are higher for courses related to Social Science.
What this data tells us is that the course alone has no bearing on how careers grow. Irrespective of which course you start off with, there is a lot of work to do before you get to the top. Given the median salaries in the data, there is nothing wrong with starting with a job that pays less. It gives you more opportunity to learn and develop before you hit the big leagues.