Visualizing Earnings Based on College Majors¶

Over the years, many parents and students have asked which college courses (majors) offer the highest probability of success. This often comes from wanting a good "Return on Investment (ROI)" on the time and resources spent in college. This is not to say that some majors are irrelevant, only that some are considered more valuable than others in society and the world at large.

This project uses a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data comes from the American Community Survey, which conducts surveys and aggregates the data. However, we will use a cleaned version of the data released by FiveThirtyEight on their GitHub repo.

Aim of the project¶

This project explores the following questions using several visualization techniques provided by the Matplotlib library:

Do students in more popular majors make more money?
- Using scatter plots
How many majors are predominantly male? Predominantly female?
- Using histograms
Which category of majors have the most students?
- Using bar plots

Lastly, below are the columns in the dataset and their respective definitions:

Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with majors.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as a share of total.
Employed - Number employed.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.
Full_time_year_round - Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35).
Unemployed - Number unemployed (ESR == 3)
Unemployment_rate - Unemployed / (Unemployed + Employed).
Median - Median earnings of full-time, year-round workers.
P25th - 25th percentile of earnings.
P75th - 75th percentile of earnings.
College_jobs - Number with job requiring a college degree.
Non_college_jobs - Number with job not requiring a college degree.
Low_wage_jobs - Number in low-wage service jobs.

Importing the libraries¶

Two libraries, pandas and matplotlib, are required for the data cleaning, exploration, analysis and visualization steps.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt

# jupyter magic function to display inline plots
%matplotlib inline 

Data exploration¶

We need to read in the dataset to examine and explore it and to identify its contents, e.g. patterns, outliers, values, and column names that may need changing.

In [2]:

# reading the dataset
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]    # returns the first row of the dataset

Out[2]:

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

In [3]:

recent_grads.head()    # returns the first 5 rows of the dataset

Out[3]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	133	340	16	0.024096	73000	50000	105000	456	176	0
3	4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258.0	1123.0	135.0	Engineering	0.107313	16	758	...	150	692	40	0.050125	70000	43000	80000	529	102	0
4	5	2405	CHEMICAL ENGINEERING	32260.0	21239.0	11021.0	Engineering	0.341631	289	25694	...	5180	16697	1672	0.061098	65000	50000	75000	18314	4440	972

5 rows × 21 columns

The dataframe displayed above shows the first five rows of the dataset, which gives a better understanding of the data being worked with.

In [4]:

recent_grads.tail()    # returns the last 5 rows of the dataset

Out[4]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
168	169	3609	ZOOLOGY	8409.0	3050.0	5359.0	Biology & Life Science	0.637293	47	6259	...	2190	3602	304	0.046320	26000	20000	39000	2771	2947	743
169	170	5201	EDUCATIONAL PSYCHOLOGY	2854.0	522.0	2332.0	Psychology & Social Work	0.817099	7	2125	...	572	1211	148	0.065112	25000	24000	34000	1488	615	82
170	171	5202	CLINICAL PSYCHOLOGY	2838.0	568.0	2270.0	Psychology & Social Work	0.799859	13	2101	...	648	1293	368	0.149048	25000	25000	40000	986	870	622
171	172	5203	COUNSELING PSYCHOLOGY	4626.0	931.0	3695.0	Psychology & Social Work	0.798746	21	3777	...	965	2738	214	0.053621	23400	19200	26000	2403	1245	308
172	173	3501	LIBRARY SCIENCE	1098.0	134.0	964.0	Education	0.877960	2	742	...	237	410	87	0.104946	22000	20000	22000	288	338	192

5 rows × 21 columns

The dataframe displayed above shows the last five rows of the dataset, to get a better intuition for the data being worked with.

Changing column names¶

Notice that the column names begin with a capital letter. This is not much of a problem, but converting all column names to lowercase gives a consistent format and makes exploration and analysis quicker.

Therefore, the column names would be converted to lowercase.

In [5]:

# returns the column names in the dataset
recent_grads.columns

Out[5]:

Index(['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women',
       'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time',
       'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate',
       'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs',
       'Low_wage_jobs'],
      dtype='object')

In [6]:

# converting all column names to lowercase
lowercase_recent_grad = []    # stores the lowercase column names

for name in recent_grads.columns:
    name = name.lower()
    lowercase_recent_grad.append(name)
    
recent_grads.columns = lowercase_recent_grad    # replaces the old columns with the new columns in the dataset

In [7]:

recent_grads.columns

Out[7]:

Index(['rank', 'major_code', 'major', 'total', 'men', 'women',
       'major_category', 'sharewomen', 'sample_size', 'employed', 'full_time',
       'part_time', 'full_time_year_round', 'unemployed', 'unemployment_rate',
       'median', 'p25th', 'p75th', 'college_jobs', 'non_college_jobs',
       'low_wage_jobs'],
      dtype='object')

All the column names have now been converted to lowercase, which brings about a consistent name format for the columns.
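The same renaming can be done without an explicit loop. The sketch below uses a tiny made-up frame (not the real dataset) that only mimics the mixed-case column names, and relies on the vectorized string accessor of the column index.

```python
import pandas as pd

# Tiny made-up frame; only the column names resemble the real dataset.
df = pd.DataFrame({
    "Rank": [1, 2],
    "Major_code": [2419, 2416],
    "ShareWomen": [0.12, 0.10],
})

# Index.str.lower() lowercases every column label in one step.
df.columns = df.columns.str.lower()

print(list(df.columns))
```

This is equivalent to the loop above for plain string column labels; which form to use is mostly a matter of taste.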

In [8]:

recent_grads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
rank                    173 non-null int64
major_code              173 non-null int64
major                   173 non-null object
total                   172 non-null float64
men                     172 non-null float64
women                   172 non-null float64
major_category          173 non-null object
sharewomen              172 non-null float64
sample_size             173 non-null int64
employed                173 non-null int64
full_time               173 non-null int64
part_time               173 non-null int64
full_time_year_round    173 non-null int64
unemployed              173 non-null int64
unemployment_rate       173 non-null float64
median                  173 non-null int64
p25th                   173 non-null int64
p75th                   173 non-null int64
college_jobs            173 non-null int64
non_college_jobs        173 non-null int64
low_wage_jobs           173 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB

The information above shows that most of the columns in the dataset contain numeric values (int64 and float64); only two columns, major_category and major, contain string values.

In [9]:

# returns summary statistics for all the columns
# in the dataset
recent_grads.describe(include='all')

Out[9]:

	rank	major_code	major	total	men	women	major_category	sharewomen	sample_size	employed	...	part_time	full_time_year_round	unemployed	unemployment_rate	median	p25th	p75th	college_jobs	non_college_jobs	low_wage_jobs
count	173.000000	173.000000	173	172.000000	172.000000	172.000000	173	172.000000	173.000000	173.000000	...	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000
unique	NaN	NaN	173	NaN	NaN	NaN	16	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
top	NaN	NaN	ADVERTISING AND PUBLIC RELATIONS	NaN	NaN	NaN	Engineering	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
freq	NaN	NaN	1	NaN	NaN	NaN	29	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
mean	87.000000	3879.815029	NaN	39370.081395	16723.406977	22646.674419	NaN	0.522223	356.080925	31192.763006	...	8832.398844	19694.427746	2416.329480	0.068191	40151.445087	29501.445087	51494.219653	12322.635838	13284.497110	3859.017341
std	50.084928	1687.753140	NaN	63483.491009	28122.433474	41057.330740	NaN	0.231205	618.361022	50675.002241	...	14648.179473	33160.941514	4112.803148	0.030331	11470.181802	9166.005235	14906.279740	21299.868863	23789.655363	6944.998579
min	1.000000	1100.000000	NaN	124.000000	119.000000	0.000000	NaN	0.000000	2.000000	0.000000	...	0.000000	111.000000	0.000000	0.000000	22000.000000	18500.000000	22000.000000	0.000000	0.000000	0.000000
25%	44.000000	2403.000000	NaN	4549.750000	2177.500000	1778.250000	NaN	0.336026	39.000000	3608.000000	...	1030.000000	2453.000000	304.000000	0.050306	33000.000000	24000.000000	42000.000000	1675.000000	1591.000000	340.000000
50%	87.000000	3608.000000	NaN	15104.000000	5434.000000	8386.500000	NaN	0.534024	130.000000	11797.000000	...	3299.000000	7413.000000	893.000000	0.067961	36000.000000	27000.000000	47000.000000	4390.000000	4595.000000	1231.000000
75%	130.000000	5503.000000	NaN	38909.750000	14631.000000	22553.750000	NaN	0.703299	338.000000	31433.000000	...	9948.000000	16891.000000	2393.000000	0.087557	45000.000000	33000.000000	60000.000000	14444.000000	11783.000000	3466.000000
max	173.000000	6403.000000	NaN	393735.000000	173809.000000	307087.000000	NaN	0.968954	4212.000000	307933.000000	...	115172.000000	199897.000000	28169.000000	0.177226	110000.000000	95000.000000	125000.000000	151643.000000	148395.000000	48207.000000

11 rows × 21 columns

The information above shows that some columns have missing data, specifically total, men, women and sharewomen (each has a count of 172 instead of 173).

Dropping rows with missing data¶

Matplotlib expects the columns passed to it to have matching lengths, and missing values can cause errors. Since some rows have been identified as having missing data, they need to be removed.

In [10]:

# displays the number of missing values in each column
print(recent_grads.isnull().sum())

rank                    0
major_code              0
major                   0
total                   1
men                     1
women                   1
major_category          0
sharewomen              1
sample_size             0
employed                0
full_time               0
part_time               0
full_time_year_round    0
unemployed              0
unemployment_rate       0
median                  0
p25th                   0
p75th                   0
college_jobs            0
non_college_jobs        0
low_wage_jobs           0
dtype: int64

Notice above that only 4 columns have missing values, and each of those columns is missing exactly one value.

In [11]:

# the index shows the total number of rows in the dataset
# (stop=173); alternatives are len(recent_grads) or DataFrame.count()
raw_data_count = recent_grads.index
raw_data_count

Out[11]:

RangeIndex(start=0, stop=173, step=1)

In [12]:

# dropping rows with missing values
recent_grads = recent_grads.dropna(axis='index')

In [13]:

cleaned_data_count = recent_grads.index
cleaned_data_count

Out[13]:

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            163, 164, 165, 166, 167, 168, 169, 170, 171, 172],
           dtype='int64', length=172)

Notice now that there's a difference between raw_data_count (length 173) and the new cleaned_data_count (length 172). This shows that only one row in the dataset contained missing values and was dropped.
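A more direct way to confirm how many rows were dropped is to compare row counts with len(). The following is a minimal sketch on a made-up three-row frame; the names rows_before, rows_after and rows_dropped are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data: the second row has missing values in two columns.
df = pd.DataFrame({
    "major": ["A", "B", "C"],
    "total": [100.0, np.nan, 300.0],
    "men": [60.0, np.nan, 120.0],
})

rows_before = len(df)

# Drop every row that contains at least one missing value.
cleaned = df.dropna(axis="index")

rows_after = len(cleaned)
rows_dropped = rows_before - rows_after
print(rows_before, rows_after, rows_dropped)
```

Because all the missing values sit in the same row, only one row is removed, which mirrors what happens with the real dataset.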

Analysing the Category of Students that tend to Have Higher Income -- Using Scatter plots¶

To determine which category of students earns more, scatter plots are created to compare some columns of the dataset and examine the correlation between them, such as sample_size, median and unemployment_rate.

pandas also has plotting functionality, DataFrame.plot(), which lets us create different types of plots quickly by passing in a few arguments, so that is what will be used to create the plots. E.g. recent_grads.plot(x='sample_size', y='employed', kind='scatter', title='Employed vs. Sample_size', figsize=(5,10))

The plots created help answer the following important questions:

Do students in more popular majors make more money?
Do students that majored in subjects that were majority female make more money?
Is there any link between the number of full-time employees and median salary?

In [14]:

# creating a scatter plot sample_size vs. median
ax = recent_grads.plot(x='sample_size', y='median', 
                  kind='scatter')
ax.set_title('Median Salary vs. Sample_size')
ax.set_xlim(0,4300)
ax.set_ylim(0,120000)

Out[14]:

(0, 120000)

Q. Do students in more popular majors make more money?

A. No. Based on the plot above, there is little or no significant correlation between the median earnings of graduates and the sample size of their major (used here as a stand-in for popularity).

However, there are two significant observations.

The 75th percentile of sample_size is 338, which means the majority of the sample sizes fall below 500. Therefore, it's preferable to zoom in to get a better view of the relationship. There are also outliers in the data collected.
Furthermore, the scatter plot uses earnings information for an unweighted sample of people. Therefore, this may not be a good representation of graduates with more popular majors.
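A visual impression of "little or no correlation" can be backed up with a number. The sketch below computes a Pearson correlation coefficient with Series.corr on ten rows copied from the head() and tail() outputs above. This is not the full dataset, so the value it prints is only illustrative; on the real data the same call would be made on recent_grads.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs above; this is
# NOT the full dataset, so the result is only illustrative.
sample = pd.DataFrame({
    "sample_size": [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    "median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = sample["sample_size"].corr(sample["median"])
print(round(r, 3))
```

Values near 0 indicate a weak linear relationship, while values near -1 or 1 indicate a strong one.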

In [15]:

# scatter plot with a sample size at the 75th percentile 
# and median values
ax = recent_grads.plot(x='sample_size', y='median',
                      kind='scatter')
ax.set_title('Median Salary vs. Sample_size')
ax.set_xlim(0,338) # 75th percentile value of sample_size
ax.set_ylim(0,120000)

Out[15]:

(0, 120000)

The diagram above still shows no correlation over the smaller range of sample sizes.

However, at sample sizes around 50 there seems to be an increase in the median earnings of graduates.

In [16]:

# scatter plot of sample_size vs. unemployment_rate
ax = recent_grads.plot(x='sample_size', y='unemployment_rate', 
                  kind='scatter')
ax.set_title('Unemployment_rate vs. Sample_size')
ax.set_xlim(0, 4300) # max value in sample_size
ax.set_ylim(-0.02, 0.2)

Out[16]:

(-0.02, 0.2)

Q. Do students in more popular majors make more money?

A. The diagram above also shows little or no correlation between unemployment_rate and sample_size.

However, for sample sizes up to about 500, an increase in the unemployment rate is observed.

For better intuition, it helps to analyse a smaller range of sample sizes, say up to 1000.

In [17]:

# creates a scatter plot restricted to sample sizes up to 1000
ax = recent_grads.plot(x='sample_size', y='unemployment_rate',
                      kind='scatter')
ax.set_title('Unemployment_rate vs. Sample_size')
ax.set_xlim(0,1000) # zoom in on sample sizes up to 1000

Out[17]:

(0, 1000)

The diagram above still shows a lot of variation (no correlation) between sample_size and unemployment_rate.

Exploring some of the rows in the dataset helps explain this variation.

There's a noticeable difference between the sample size and the number of employed and unemployed graduates in a given row. For example, for Petroleum Engineering (rank 1) the sample size is 36, while unemployed + employed is 2013 (1976 + 37).

In [18]:

# scatter plot: Full_time vs Median
ax = recent_grads.plot(x='full_time', y='median', kind='scatter')
ax.set_title('Median Salary vs. Full_time employed grads')
ax.set_xlim(0,)

Out[18]:

(0, 300000.0)

Q. Is there any link between the number of full-time employees and median salary?

A. The diagram above also shows a lot of variation (no correlation) between the number of graduates employed full time and median earnings, especially for majors with fewer than 50,000 full-time employed graduates.

Aside from the small sample sizes, other variation could result from differences between the companies/organizations that hire graduates, such as:

Big companies may employ few workers and be willing to pay more for fewer hours of work, because they target the best workers and focus on productivity.
Medium companies could do the same, or decide to employ many workers and pay less for more hours of work to achieve productivity. In doing so they target very good and average workers.
Small companies are willing to employ a large number of workers (grads) and pay less for more working hours due to a lack of sufficient resources. Fair or average grads also tend to end up at these companies since they're easier to hire.

However, in my opinion, one would expect that the more hours put into work, the higher the earnings.

In [19]:

# scatter plot: ShareWomen vs. Unemployment_rate
ax = recent_grads.plot(x='sharewomen', y='unemployment_rate',
                 kind='scatter')
ax.set_title('Unemployment_rate vs. Fraction of Graduate women')
ax.set_xlim(0,)

Out[19]:

(0, 1.2000000000000002)

The diagram above shows no correlation between sharewomen and unemployment_rate.

In [20]:

# scatter plot: Sharewomen vs. median
ax = recent_grads.plot(x='sharewomen', y='median',
                      kind='scatter')
ax.set_title('Median Salary vs. Fraction of graduate women')
ax.set_xlim(0,1.0)
ax.set_ylim(10000,)

Out[20]:

(10000, 120000.0)

Q. Do students that majored in subjects that were majority female make more money?

A. No. The scatter plot above shows a weak negative correlation.

This means graduates of majors with a smaller share of women tended to earn more: as shown in the diagram, majors where women make up 0 - 0.2 (0 - 20%) of graduates had the highest earnings, while majors with a higher share of women (above 0.2) had lower earnings.

In [21]:

# scatter plot: Men vs. Median
ax = recent_grads.plot(x='men', y='median', kind='scatter')
ax.set_title('Median Salary vs. Male graduates')
ax.set_xlim(0,)
ax.set_ylim(20000,)

Out[21]:

(20000, 120000.0)

There's no correlation between the number of male graduates in a major and its median earnings.

In [22]:

# scatter plot: Women vs. Median
ax = recent_grads.plot(x='women', y='median', kind='scatter')
ax.set_title('Median Salary vs. Female graduates')
ax.set_xlim(0,)
ax.set_ylim(20000,)

Out[22]:

(20000, 120000.0)

There's equally no correlation between the number of female graduates in a major and its median earnings.

Visualizing Graduate information using Histograms¶

This section focuses on answering two questions:

What percent of majors are predominantly male? Predominantly female?
What's the most common median salary range?

NB: dataFrame[col_name].plot(kind='hist') was not used to generate the histograms because it makes the binning strategy difficult to control. dataFrame[col_name].hist(bins=<digit>, range=(<digits>)) is preferable.

For a better understanding, see the Series.hist() documentation.
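To see what the bins and range arguments do without drawing anything, the same binning can be reproduced with numpy.histogram, which, as far as I understand, is what matplotlib uses internally to compute histogram bins. The sketch below uses the ten sample_size values from the head() and tail() outputs above rather than the full column.

```python
import numpy as np
import pandas as pd

# sample_size values copied from the head() and tail() outputs above.
sample_size = pd.Series([36, 7, 3, 16, 289, 47, 7, 13, 21, 2])

# 20 equal-width bins between 0 and 500, mirroring
# Series.hist(bins=20, range=(0, 500)).
counts, edges = np.histogram(sample_size, bins=20, range=(0, 500))

bin_width = edges[1] - edges[0]
print("bin width:", bin_width)
print("number of edges:", len(edges))
print("counts in the first two bins:", counts[:2])
```

With 20 bins over a range of 500, each bin is 25 wide, so narrowing the range while keeping the bin count gives a finer view of the crowded low end.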

In [23]:

# histogram exploring sample_size
ax = recent_grads['sample_size'].hist()
ax.set_title('Distribution of Sample size data')
ax.set_xlabel('Sample_size')
ax.set_ylabel('Frequency')

Out[23]:

<matplotlib.text.Text at 0x7fae82d700f0>

The diagram above shows that most of the sample_size values are below 500. A more detailed view can be obtained by restricting the histogram to the range 0 - 500.

In [24]:

# Histogram of sample_size of range 500
ax = recent_grads['sample_size'].hist(bins=20, range=(0,500))
ax.set_title('Distribution of Sample size data')
ax.set_xlabel('Sample_size')
ax.set_ylabel('Frequency')

Out[24]:

<matplotlib.text.Text at 0x7fae82c5cc18>

Zooming in further, most sample sizes fall below 100. With such small sample sizes, the median earnings of grads may not be very accurate.

In [25]:

# Histogram of Median earnings
ax = recent_grads['median'].hist(bins=50, range=(20000,80000))
ax.set_title('Median distribution')
ax.set_xlabel('Median')
ax.set_ylabel('Frequency')

Out[25]:

<matplotlib.text.Text at 0x7fae82d1c908>

Q. What's the most common median salary range?

A. The diagram shows the most common median salary range to be $30,000 - $40,000. Next is either $40,000 - $50,000 or $50,000 - $60,000; which of the two is hard to tell without further analysis or visualization.

In [26]:

# histogram for Employed grads
ax = recent_grads['employed'].hist(bins=25, range=(0,30000))
ax.set_title('Distribution of employed graduates')
ax.set_xlabel('Employed')
ax.set_ylabel('Frequency')

Out[26]:

<matplotlib.text.Text at 0x7fae82eb6940>

The diagram above shows the number of employed graduates per major (each observation is a major, not a company). Most majors have at least 5,000 employed graduates, which is consistent with the median of 11,797 in the summary statistics.

There's also a reasonable concentration in the 5,000 - 15,000 range, and larger majors can have 15,000 or more employed grads.

It would be fair to say that the number of students employed per major is affected by the number of students who have taken the major.

I'll examine the columns to determine whether such a relationship exists.

In [27]:

# E.g. the total number of people vs the number employed 
# for the largest majors
# Filter by majors with total > 39,000 (75th percentile)
largest_majors = recent_grads.loc[recent_grads['total'] > 39000, ['major_code', 'major', 'total', 'employed']]
largest_majors.sort_values(by='total', ascending=False).head(10)

Out[27]:

	major_code	major	total	employed
145	5200	PSYCHOLOGY	393735.0	307933
76	6203	BUSINESS MANAGEMENT AND ADMINISTRATION	329927.0	276234
123	3600	BIOLOGY	280709.0	182295
57	6200	GENERAL BUSINESS	234590.0	190183
93	1901	COMMUNICATIONS	213996.0	179633
34	6107	NURSING	209394.0	180903
77	6206	MARKETING AND MARKETING RESEARCH	205211.0	178862
40	6201	ACCOUNTING	198633.0	165527
137	3301	ENGLISH LANGUAGE AND LITERATURE	194673.0	149180
78	5506	POLITICAL SCIENCE AND GOVERNMENT	182621.0	133454

Above we notice an obvious relationship between the total number of grads in a major and the number employed.

To better illustrate this, a scatter plot is used to show the relationship.

In [28]:

# Scatter plot: Total vs. employed
ax = recent_grads.plot(x='total', y='employed', kind='scatter')
ax.set_title('Total grads with majors vs. Employed grads')
ax.set_xlim(0,)
ax.set_ylim(0,)
ax.set_xlabel('Total')
ax.set_ylabel('Employed')

Out[28]:

<matplotlib.text.Text at 0x7fae850175c0>
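The relationship between total and employed can also be expressed as a ratio per major. The sketch below uses the three largest majors from the table in Out[27]; the column name employed_share is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# The three largest majors, copied from the Out[27] table above.
largest = pd.DataFrame({
    "major": ["PSYCHOLOGY",
              "BUSINESS MANAGEMENT AND ADMINISTRATION",
              "BIOLOGY"],
    "total": [393735.0, 329927.0, 280709.0],
    "employed": [307933, 276234, 182295],
})

# Fraction of each major's graduates who are employed
# (employed_share is an illustrative column name).
largest["employed_share"] = largest["employed"] / largest["total"]

print(largest[["major", "employed_share"]].round(3))
```

A ratio makes majors of very different sizes comparable, which raw counts on a scatter plot do not.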

In [29]:

# histogram of full time employed grads
ax = recent_grads['full_time'].hist(bins=25, range=(0,250000))
ax.set_title('Distribution of Full time employees')
ax.set_xlabel('Full Time Employees')
ax.set_ylabel('Frequency')

Out[29]:

<matplotlib.text.Text at 0x7fae82cb4d30>

The diagram above suggests that most majors have fewer than about 50,000 graduates in full-time employment. This is plausible, since employed graduates also include part-time and temporary workers such as interns, who are not counted as full-time.

In [30]:

# histogram for sharewomen
ax = recent_grads['sharewomen'].hist(bins=20)
ax.set_title('Distribution of a Fraction of Graduate women')
ax.set_xlabel('Sharewomen')
ax.set_ylabel('Frequency')

Out[30]:

<matplotlib.text.Text at 0x7fae82ffd438>

It appears that just over 50% of all majors are predominantly female, with the highest frequency at 70 - 80% female.
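Reading a percentage off a histogram is approximate. The share of predominantly female majors can be computed directly from sharewomen by taking the mean of a boolean mask. The sketch below uses only the ten sharewomen values from the head() and tail() outputs above, so the even split it produces is an artifact of that selection and not a result for the full dataset.

```python
import pandas as pd

# sharewomen values copied from the head() and tail() outputs above;
# NOT the full dataset.
sharewomen = pd.Series([0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                        0.637293, 0.817099, 0.799859, 0.798746, 0.877960])

# The mean of a boolean mask is the fraction of True values.
share_mostly_female = (sharewomen > 0.5).mean()
share_mostly_male = (sharewomen < 0.5).mean()

print(share_mostly_female, share_mostly_male)
```

On the cleaned dataframe the same expression would be applied to recent_grads['sharewomen'].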

In [31]:

# Evaluating majors in the category with the highest
# share of females (0.6 - 0.8)
share_mask = (recent_grads['sharewomen'] > 0.6) & (recent_grads['sharewomen'] <= 0.8)
share_cols = ['major_code', 'major', 'total', 'men', 'women',
              'sharewomen', 'employed', 'unemployed']
largest_female_share = recent_grads.loc[share_mask, share_cols]
print(largest_female_share.shape)
largest_female_share.sort_values(by='sharewomen', ascending=False).head()

(54, 8)

Out[31]:

	major_code	major	total	men	women	sharewomen	employed	unemployed
170	5202	CLINICAL PSYCHOLOGY	2838.0	568.0	2270.0	0.799859	2101	368
155	5299	MISCELLANEOUS PSYCHOLOGY	9628.0	1936.0	7692.0	0.798920	7653	419
171	5203	COUNSELING PSYCHOLOGY	4626.0	931.0	3695.0	0.798746	3777	214
118	6110	COMMUNITY AND PUBLIC HEALTH	19735.0	4103.0	15632.0	0.792095	14512	1833
145	5200	PSYCHOLOGY	393735.0	86648.0	307087.0	0.779933	307933	28169

Comparing the men, women and total columns in the dataframe supports what the histogram shows: the majority of majors have more women than men.

In [32]:

# histogram of unemployment_rate
ax = recent_grads['unemployment_rate'].hist()
ax.set_title('Distribution of unemployment_rate')
ax.set_xlabel('Unemployment Rate')
ax.set_ylabel('Frequency')

Out[32]:

<matplotlib.text.Text at 0x7fae82bb5da0>

The most common unemployment rate among majors is 6 - 7% (the tallest bar), while very few majors have an unemployment rate around 14%.

I'll examine both cases below.

In [33]:

# Majors in the most common unemployment rate range (6 - 7%)
highest_majors_unemployed = recent_grads.loc[(recent_grads['unemployment_rate'] >= 0.06) &
                                         (recent_grads['unemployment_rate'] <= 0.07)][['major_code', 'major', 'major_category', 'unemployment_rate']]
highest_majors_unemployed.sort_values(by='unemployment_rate', ascending=False).head()

Out[33]:

	major_code	major	major_category	unemployment_rate
121	6106	HEALTH AND MEDICAL PREPARATORY PROGRAMS	Health	0.069780
40	6201	ACCOUNTING	Business	0.069749
96	1902	JOURNALISM	Communications & Journalism	0.069176
101	3608	PHYSIOLOGY	Biology & Life Science	0.069163
151	5404	SOCIAL WORK	Psychology & Social Work	0.068828

The table above shows the 5 majors with the highest unemployment_rate within the 6 - 7% range, with Health and Medical Preparatory Programs at the top.

In [34]:

# Majors with an unemployment rate between 12% and 14%
highest_majors_unemployed = recent_grads.loc[(recent_grads['unemployment_rate'] >= 0.12) &
                                         (recent_grads['unemployment_rate'] <= 0.14)][['major_code', 'major', 'major_category', 'median', 'unemployment_rate']]
highest_majors_unemployed.sort_values(by='unemployment_rate', ascending=False).head()

Out[34]:

	major_code	major	major_category	median	unemployment_rate
29	5402	PUBLIC POLICY	Law & Public Policy	50000	0.128426

PUBLIC POLICY is the only major in this range. With an unemployment rate of about 12.8% it is among the majors with higher unemployment, although its median salary of $50,000 is well above the overall median of $36,000.
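Filtering on a fixed range only finds majors inside that range. To rank majors by unemployment rate directly, DataFrame.nsmallest and DataFrame.nlargest can be used. The sketch below uses six rows whose values are copied from outputs shown earlier in this notebook; it is not the full dataset, so the ranking is only illustrative.

```python
import pandas as pd

# Rows copied from earlier outputs in this notebook (NOT the full dataset).
rates = pd.DataFrame({
    "major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "CHEMICAL ENGINEERING",
              "PUBLIC POLICY", "CLINICAL PSYCHOLOGY"],
    "unemployment_rate": [0.018381, 0.117241, 0.024096,
                          0.061098, 0.128426, 0.149048],
})

# Two majors with the lowest and the highest unemployment rates.
lowest = rates.nsmallest(2, "unemployment_rate")
highest = rates.nlargest(2, "unemployment_rate")

print(lowest)
print(highest)
```

In this small selection a low unemployment_rate means better employment prospects, so the nsmallest result is the one that answers "which majors are easiest to find work in".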

In [35]:

# histogram distribution of men
ax = recent_grads['men'].hist(bins=25, range=(0,200000))
ax.set_title('Distribution of Men')
ax.set_xlabel('Men')
ax.set_ylabel('Frequency')

Out[35]:

<matplotlib.text.Text at 0x7fae82b20828>

Q. What percent of majors are predominantly male?

A. The diagram above shows the number of male graduates per major, not a percentage. Most majors have relatively few male graduates (the 75th percentile of men is 14,631), while a few majors have far more.

However, it doesn't identify which majors are predominantly male.

In [36]:

# determining majors with the most male grads
male_dominated_majors = recent_grads.loc[recent_grads['men'] >= 0, ['major_code', 'major', 'major_category', 'median', 'men', 'women']]
male_dominated_majors.sort_values(by='men', ascending=False).head(5)

Out[36]:

	major_code	major	major_category	median	men	women
76	6203	BUSINESS MANAGEMENT AND ADMINISTRATION	Business	38000	173809.0	156118.0
57	6200	GENERAL BUSINESS	Business	40000	132238.0	102352.0
35	6207	FINANCE	Business	47000	115030.0	59476.0
123	3600	BIOLOGY	Biology & Life Science	33400	111762.0	168947.0
20	2102	COMPUTER SCIENCE	Computers & Mathematics	53000	99743.0	28576.0

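The majors with the largest numbers of male graduates are BUSINESS MANAGEMENT AND ADMINISTRATION, GENERAL BUSINESS and FINANCE. Note that this ranks majors by headcount rather than by share: BIOLOGY appears in the list even though it has more women than men.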

In [37]:

# histogram distribution of women
ax = recent_grads['women'].hist(bins=25, range=(0,200000))
ax.set_title('Distribution of women')
ax.set_xlabel('Women')
ax.set_ylabel('Frequency')
ax.set_ylim(0,120)

Out[37]:

(0, 120)

Q. What percent of majors are predominantly Female?

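A. The diagram above shows the number of female graduates per major, not a percentage. Most majors have relatively few female graduates (the 75th percentile of women is 22,553.75), while a few majors have far more. It doesn't identify which majors are predominantly female.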

In [38]:

# determining majors with the most female grads
female_dominated_majors = recent_grads.loc[recent_grads['women'] >= 0, ['major_code', 'major', 'major_category', 'median', 'men', 'women']]
female_dominated_majors.sort_values(by='women', ascending=False).head(5)

Out[38]:

	major_code	major	major_category	median	men	women
145	5200	PSYCHOLOGY	Psychology & Social Work	31500	86648.0	307087.0
34	6107	NURSING	Health	48000	21773.0	187621.0
123	3600	BIOLOGY	Biology & Life Science	33400	111762.0	168947.0
138	2304	ELEMENTARY EDUCATION	Education	32000	13029.0	157833.0
76	6203	BUSINESS MANAGEMENT AND ADMINISTRATION	Business	38000	173809.0	156118.0

The majors with the largest numbers of female graduates are PSYCHOLOGY, NURSING and BIOLOGY.

Exploring potential relationships and distributions of columns using Scatter matrix plot¶

In order to evaluate the relationship between multiple columns more efficiently, a scatter matrix plot is a good option.

A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us explore potential relationships and distributions simultaneously.

In [39]:

# importing scatter_matrix
from pandas.plotting import scatter_matrix

In [40]:

# A 2 by 2 scatter matrix plot of Sample_size 
# and Median Salary
scatter_matrix(recent_grads[['sample_size', 'median']],
              figsize=(10,10))

Out[40]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82a05438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82968438>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82930780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae828ed240>]],
      dtype=object)

The diagram above shows most sample sizes to be less than 1000 (top-left histogram). The scatter plot of Median vs. Sample_size (bottom-left) suggests that the median salary is mostly somewhere around $30,000 - $40,000.

The scatter plot of Sample_size vs. Median (top-right) suggests that an increase in sample size doesn't necessarily affect the median salary.

In [41]:

# A 3 x 3 scatter matrix plot of sample_size, median and
# unemployment columns
scatter_matrix(recent_grads[['sample_size', 'median', 'unemployment_rate']],
              figsize=(10,10))

Out[41]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae8280ff98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae8280ac50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82759668>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae827160f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae826df128>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae8269e8d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82666fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82626710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae825f3940>]],
      dtype=object)

The scatter matrix plot above shows no clear correlation between sample size, median salary and unemployment rate. A scatter matrix is also a quicker way to inspect relationships that were plotted one at a time in the cells above, e.g. the total number of students in a major vs. the number employed, or the total number of students in a major vs. the number unemployed.

In [42]:

# Total vs unemployed scatter matrix
scatter_matrix(recent_grads[['total', 'unemployed']], figsize=(10,10))

Out[42]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae826cdef0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82454fd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae824229b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae823de630>]],
      dtype=object)

This shows a weak positive correlation between the total number of students in a major and the number unemployed, meaning only a small fraction of students in a major are unemployed.

In [43]:

# Total vs employed scatter matrix
scatter_matrix(recent_grads[['total', 'employed']], figsize=(10,10))

Out[43]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82383fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae823010f0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae822c8a90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82285668>]],
      dtype=object)

The plot above shows a very strong positive correlation between the total number of students in a major and the number employed. This simply means that the majority of students in a major are employed.
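
Visual impressions from scatter matrices can be cross-checked numerically with pandas' `DataFrame.corr()`. A minimal sketch, assuming a hypothetical helper named `correlation_table` and using only the five rows printed earlier in this notebook (far too few to draw conclusions, they only illustrate the call):

```python
import pandas as pd

def correlation_table(df, columns):
    """Return the Pearson correlation between every pair of the given columns.

    Hypothetical helper for this sketch; it simply wraps DataFrame.corr().
    """
    return df[columns].corr(method='pearson')

# Rows copied from the Out[38] table (men, women and median salary).
sample = pd.DataFrame({
    'men': [86648.0, 21773.0, 111762.0, 13029.0, 173809.0],
    'women': [307087.0, 187621.0, 168947.0, 157833.0, 156118.0],
    'median': [31500, 48000, 33400, 32000, 38000],
})

corr = correlation_table(sample, ['men', 'women', 'median'])
print(corr.round(2))
```

On the full dataset the same helper could be called as `correlation_table(recent_grads, ['total', 'employed', 'unemployed'])`; values close to 1 or -1 indicate a strong linear relationship, values near 0 a weak one.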

Visualizing data using bar plot¶

Bar plots can be created from a Series object, df[range][col].plot(kind='bar'), or from a DataFrame object, df[range].plot.bar(x=labels, y=data for bars).

In [44]:

# Bar plot of sharewomen from first ten rows vs sharewomen
# from last ten rows
ax1 = recent_grads[:10].plot.bar(x='major', y='sharewomen', 
                                 title='Fraction of female grads from the top 10 courses with the highest median salary')

ax2 = recent_grads[-10:].plot.bar(x='major', y='sharewomen',
                                 title='Fraction of female grads from the bottom 10 courses with the least median salary')

Above we observe that the courses with the highest median salaries have a lower share of female grads than those with the lowest median salaries, where women are the majority (i.e. more than 50% of grads in the lowest-paying courses are female).

We can calculate how large the difference is below:

In [45]:

# Calculating the average proportion of female grads for the
# top and bottom 10 courses
# NB: unlike plain Python list slices, a .loc slice is label-based
# and includes both the start and the stop label
top_10_female_share = recent_grads.loc[:9, 'sharewomen'].mean()
bottom_10_female_share = recent_grads[-10:]['sharewomen'].mean()
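
The note about `.loc` slices in the cell above can be checked on a small frame. A minimal sketch, assuming a hypothetical toy DataFrame with made-up values (not the real dataset):

```python
import pandas as pd

# Hypothetical toy frame with a default integer index (labels 0..4).
df = pd.DataFrame({'sharewomen': [0.1, 0.2, 0.3, 0.4, 0.5]})

# Label-based slice: the stop label 2 is included, so 3 rows come back.
by_label = df.loc[:2, 'sharewomen']

# Position-based slices: the stop position is excluded, so 2 rows come back.
by_position = df.iloc[:2]['sharewomen']
plain_slice = df[:2]['sharewomen']

# After dropping a row the labels have a gap, and the two styles diverge further.
dropped = df.drop(index=1)
label_after_drop = dropped.loc[:2, 'sharewomen']   # labels 0 and 2 only
position_after_drop = dropped.iloc[:3]             # first three remaining rows

print(len(by_label), len(by_position), len(plain_slice))
print(len(label_after_drop), len(position_after_drop))
```

Because `.loc` works on labels, `recent_grads.loc[:9]` returns ten rows only while the labels 0 to 9 are all still present; `recent_grads.iloc[:10]` is a positional alternative that does not depend on the labels.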

In [46]:

top_10 = ('The 10 highest paying courses have an average '
          'female share of {:.2f}'.format(top_10_female_share))

bottom_10 = ('The 10 lowest paying courses have an average '
             'female share of {:.2f}'.format(bottom_10_female_share))

print(top_10)
print(bottom_10)

The 10 highest paying courses have an average female share of 0.23
The 10 lowest paying courses have an average female share of 0.79

There's an obvious difference in the average female share between the top and bottom 10 courses (ranked by median pay): 0.79 - 0.23 = 0.56, i.e. more than 50 percentage points.

Next, we check out the difference in the unemployment rate between the top and bottom 10 courses.

In [47]:

# Unemployment rate for top and bottom 10 courses
ax1 = recent_grads[:10].plot.bar(x='major', y='unemployment_rate',
                                title='Unemployment rate for top 10 courses.')

ax2 = recent_grads[-10:].plot.bar(x='major', y='unemployment_rate',
                                title='Unemployment rate for bottom 10 courses.')

For the top 10 courses the unemployment rate is generally low; however, two courses, NUCLEAR ENGINEERING and MINING AND MINERAL ENGINEERING, stand out with noticeably higher rates.

For the bottom 10 courses, the unemployment rate seems to be moderately high, with 3-5 courses showing elevated rates.

We can analyse this further by looking at the average unemployment rates.

In [48]:

# calculating the average unemployment rates for all majors
# and for the top and bottom 10 courses
# NB: unlike plain Python list slices, a .loc slice is label-based
# and includes both the start and the stop label
mean_unemp_rate = recent_grads['unemployment_rate'].mean()
top_10_unemp_rate = recent_grads.loc[:9, 'unemployment_rate'].mean()
bottom_10_unemp_rate = recent_grads[-10:]['unemployment_rate'].mean()

print(('The average unemployment rate for all majors is {:.2f}'
       .format(mean_unemp_rate)))
print(('The average unemployment rate for the top 10 '
       'majors is {:.2f}'.format(top_10_unemp_rate)))
print(('The average unemployment rate for the bottom 10 '
       'majors is {:.2f}'.format(bottom_10_unemp_rate)))

The average unemployment rate for all majors is 0.07
The average unemployment rate for the top 10 majors is 0.07
The average unemployment rate for the bottom 10 majors is 0.08

The average unemployment rate across all majors is similar to that of the top and bottom 10 courses. However, in the top 10 you'd see that only two courses have distinctly high rates, while in the bottom 10 there are about 3-5 such courses.

We could examine this further below:

In [49]:

# selecting the top 10 courses whose unemployment rate is above the
# overall mean, then adding a column with the difference from the mean
top_10_outliers = (recent_grads[:10]
                   .loc[recent_grads[:10]['unemployment_rate'] > mean_unemp_rate]
                   .copy())  # copy avoids SettingWithCopyWarning below
top_10_outliers['mean_difference'] = (
    top_10_outliers['unemployment_rate'] - mean_unemp_rate)

In [50]:

# selecting the bottom 10 courses whose unemployment rate is above the
# overall mean, then adding a column with the difference from the mean
bottom_10_outliers = (recent_grads[-10:]
                      .loc[recent_grads[-10:]['unemployment_rate'] > mean_unemp_rate]
                      .copy())  # copy avoids SettingWithCopyWarning below
bottom_10_outliers['mean_difference'] = (
    bottom_10_outliers['unemployment_rate'] - mean_unemp_rate)

In [51]:

# Plotting the courses from the top and bottom 10 with
# above-average unemployment rates and their mean_difference
top_10_outliers.plot.bar(x='major', y='mean_difference',
                        title='Majors in the top 10 above average unemployment rate.')

bottom_10_outliers.plot.bar(x='major', y='mean_difference',
                        title='Majors in the bottom 10 above average unemployment rate.')

Out[51]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fae820f1be0>

Among the top 10 courses, NUCLEAR ENGINEERING is the main contributor to the higher unemployment rate, while among the bottom 10 it is CLINICAL PSYCHOLOGY.

Further Analysis¶

Moving on, I'd generate visualizations to carry out a more in-depth analysis of the following:

Compare the number of men and women in each category of majors using a bar plot (grouped or stacked).
Explore the distributions of median salaries and unemployment rate using box plots.
Explore the denser scatter plots created above using hexagonal bin plots.

NB: For more plot types available in pandas, check out the plotting section of the pandas documentation.

In [52]:

# comparing the number of men and women in each major category
# NB: df.plot.bar(stacked=True) stacks the bars of the two columns
# on top of each other instead of placing them side by side
ax1 = (recent_grads.groupby('major_category')[['men','women']]
 .sum().plot.bar(stacked=True))
ax1.set_title('Category Majors vs. Number of Men and women')
ax1.set_ylabel('Total')

Out[52]:

<matplotlib.text.Text at 0x7fae8206fef0>

Q. How do the numbers of men and women compare in each category of majors (shown above as a stacked bar plot)?

A. The Business category has the highest number of male and female graduates combined, and the two groups are fairly evenly distributed.

Generally, there is a higher percentage of female grads than male grads across the various major categories, with some exceptions such as Engineering and Computers & Mathematics, which are dominated by male grads.

In [65]:

# box plot: exploring the distribution of median salaries
recent_grads['median'].plot.box(title='Distribution of Median salaries')

Out[65]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fae81846320>

The figure above shows that the most common median salaries for majors lie between $30,000 and $40,000. We also observe outliers in the median salary, ranging somewhere around $60,000 - $80,000.
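
The points drawn beyond the whiskers of a box plot are, with matplotlib's default settings, values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch of that rule, assuming a hypothetical helper `iqr_outliers` and made-up salary values used purely for illustration:

```python
import pandas as pd

def iqr_outliers(values, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles.

    Hypothetical helper for this sketch; whis=1.5 mirrors the default
    whisker length used by matplotlib box plots.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Made-up median salaries, NOT taken from the dataset.
salaries = pd.Series([30000, 32000, 33000, 35000, 36000, 38000, 40000, 75000])

outliers = iqr_outliers(salaries)
print(outliers.tolist())
```

On the real data the same helper could be applied as `iqr_outliers(recent_grads['median'])` to list the values that the box plot draws as individual points.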

In [66]:

# box plot: exploring the distribution of unemployment rate
ax = recent_grads['unemployment_rate'].plot.box(title='Distribution of unemployment rate')

In [68]:

# hexagonal bin plot: total vs employed
recent_grads.plot.hexbin(x='employed', y='total', gridsize=30)

Out[68]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fae81a4a908>

In [69]:

# hexagonal bin plot: median vs men
recent_grads.plot.hexbin(x='men', y='median', gridsize=30)

Out[69]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fae820af588>
