Visualizing Earnings Based on College Majors

Over the years, many parents and students have asked which college courses (majors) offer the best chance of success after graduation. The concern usually comes down to the "Return on Investment (ROI)" for the time and resources spent in college. This is not to say that some majors are irrelevant, only that some are considered more valuable than others in society and the world at large.

This project uses a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data comes from the American Community Survey, which conducts surveys and aggregates the data. However, we will use a cleaned version of the data released by FiveThirtyEight on their GitHub repo.

Aim of the project

This project explores the following questions using several visualization techniques provided by the Matplotlib library:

  • Do students in more popular majors make more money?
    • Using scatter plots
  • How many majors are predominantly male? Predominantly female?
    • Using histograms
  • Which category of majors have the most students?
    • Using bar plots

Lastly, below are the columns in the dataset and their respective definitions:

  • Rank - Rank by median earnings (the dataset is ordered by this column).
  • Major_code - Major code.
  • Major - Major description.
  • Major_category - Category of major.
  • Total - Total number of people with majors.
  • Sample_size - Sample size (unweighted) of full-time.
  • Men - Male graduates.
  • Women - Female graduates.
  • ShareWomen - Women as a share of total.
  • Employed - Number employed.
  • Full_time - Number employed 35 hours or more.
  • Part_time - Number employed less than 35 hours.
  • Full_time_year_round - Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35).
  • Unemployed - Number unemployed (ESR == 3)
  • Unemployment_rate - Unemployed / (Unemployed + Employed).
  • Median - Median earnings of full-time, year-round workers.
  • P25th - 25th percentile of earnings.
  • P75th - 75th percentile of earnings.
  • College_jobs - Number with job requiring a college degree.
  • Non_college_jobs - Number with job not requiring a college degree.
  • Low_wage_jobs - Number in low-wage service jobs.

Importing the libraries

The pandas and matplotlib libraries are required for the data cleaning, exploration, analysis and visualization steps that follow.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# jupyter magic function to display inline plots
%matplotlib inline 

Data exploration

We first read in the dataset so that we can examine it and identify its contents, e.g. patterns, outliers and value ranges, and decide whether any column names need changing.

In [2]:
# reading the dataset
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]    # returns the first row of the dataset
Out[2]:
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
In [3]:
recent_grads.head()    # returns the first 5 rows of the dataset
Out[3]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING 2339.0 2057.0 282.0 Engineering 0.120564 36 1976 ... 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING 756.0 679.0 77.0 Engineering 0.101852 7 640 ... 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING 856.0 725.0 131.0 Engineering 0.153037 3 648 ... 133 340 16 0.024096 73000 50000 105000 456 176 0
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING 1258.0 1123.0 135.0 Engineering 0.107313 16 758 ... 150 692 40 0.050125 70000 43000 80000 529 102 0
4 5 2405 CHEMICAL ENGINEERING 32260.0 21239.0 11021.0 Engineering 0.341631 289 25694 ... 5180 16697 1672 0.061098 65000 50000 75000 18314 4440 972

5 rows × 21 columns

The dataframe displayed above consists of the first five rows of the dataset, which gives a better understanding of the data being worked with.

In [4]:
recent_grads.tail()    # returns the last 5 rows of the dataset
Out[4]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
168 169 3609 ZOOLOGY 8409.0 3050.0 5359.0 Biology & Life Science 0.637293 47 6259 ... 2190 3602 304 0.046320 26000 20000 39000 2771 2947 743
169 170 5201 EDUCATIONAL PSYCHOLOGY 2854.0 522.0 2332.0 Psychology & Social Work 0.817099 7 2125 ... 572 1211 148 0.065112 25000 24000 34000 1488 615 82
170 171 5202 CLINICAL PSYCHOLOGY 2838.0 568.0 2270.0 Psychology & Social Work 0.799859 13 2101 ... 648 1293 368 0.149048 25000 25000 40000 986 870 622
171 172 5203 COUNSELING PSYCHOLOGY 4626.0 931.0 3695.0 Psychology & Social Work 0.798746 21 3777 ... 965 2738 214 0.053621 23400 19200 26000 2403 1245 308
172 173 3501 LIBRARY SCIENCE 1098.0 134.0 964.0 Education 0.877960 2 742 ... 237 410 87 0.104946 22000 20000 22000 288 338 192

5 rows × 21 columns

The dataframe displayed above consists of the last five rows of the dataset, which gives further intuition about the data being worked with.

Changing column names

Notice that the column names begin with a capital letter. This is not much of a problem, but converting all column names to lowercase gives a consistent naming format, which makes exploration and analysis quicker and less error-prone.

Therefore, the column names will be converted to lowercase.

In [5]:
# returns the column names in the dataset
recent_grads.columns
Out[5]:
Index(['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women',
       'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time',
       'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate',
       'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs',
       'Low_wage_jobs'],
      dtype='object')
In [6]:
# converting all column names to lowercase
lowercase_recent_grad = []    # stores the lowercase column names

for name in recent_grads.columns:
    name = name.lower()
    lowercase_recent_grad.append(name)
    
recent_grads.columns = lowercase_recent_grad    # replaces the old columns with the new columns in the dataset
In [7]:
recent_grads.columns
Out[7]:
Index(['rank', 'major_code', 'major', 'total', 'men', 'women',
       'major_category', 'sharewomen', 'sample_size', 'employed', 'full_time',
       'part_time', 'full_time_year_round', 'unemployed', 'unemployment_rate',
       'median', 'p25th', 'p75th', 'college_jobs', 'non_college_jobs',
       'low_wage_jobs'],
      dtype='object')

All the column names have now been converted to lowercase, which brings about a consistent name format for the columns.
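
The loop above can also be written as a single vectorized call. The following is a minimal sketch, assuming a small toy frame (`df` is a hypothetical name, and its values are just the first two rows shown earlier) rather than the full dataset:

```python
import pandas as pd

# Toy frame that reuses three column names from the dataset; the values are
# taken from the first two rows displayed earlier.
df = pd.DataFrame({
    "Rank": [1, 2],
    "Major_code": [2419, 2416],
    "ShareWomen": [0.120564, 0.101852],
})

# Index.str.lower() lowercases every column label in one call.
df.columns = df.columns.str.lower()
print(list(df.columns))
```

Both approaches give the same column names; the vectorized form simply avoids the temporary list.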

In [8]:
recent_grads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
rank                    173 non-null int64
major_code              173 non-null int64
major                   173 non-null object
total                   172 non-null float64
men                     172 non-null float64
women                   172 non-null float64
major_category          173 non-null object
sharewomen              172 non-null float64
sample_size             173 non-null int64
employed                173 non-null int64
full_time               173 non-null int64
part_time               173 non-null int64
full_time_year_round    173 non-null int64
unemployed              173 non-null int64
unemployment_rate       173 non-null float64
median                  173 non-null int64
p25th                   173 non-null int64
p75th                   173 non-null int64
college_jobs            173 non-null int64
non_college_jobs        173 non-null int64
low_wage_jobs           173 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB

The information above shows that most of the columns in the dataset contain numeric values (int64 and float64); only two columns, major_category and major, contain string values.

In [9]:
# returns summary statistics for all the columns
# in the dataset
recent_grads.describe(include='all')
Out[9]:
rank major_code major total men women major_category sharewomen sample_size employed ... part_time full_time_year_round unemployed unemployment_rate median p25th p75th college_jobs non_college_jobs low_wage_jobs
count 173.000000 173.000000 173 172.000000 172.000000 172.000000 173 172.000000 173.000000 173.000000 ... 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000 173.000000
unique NaN NaN 173 NaN NaN NaN 16 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top NaN NaN ADVERTISING AND PUBLIC RELATIONS NaN NaN NaN Engineering NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq NaN NaN 1 NaN NaN NaN 29 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 87.000000 3879.815029 NaN 39370.081395 16723.406977 22646.674419 NaN 0.522223 356.080925 31192.763006 ... 8832.398844 19694.427746 2416.329480 0.068191 40151.445087 29501.445087 51494.219653 12322.635838 13284.497110 3859.017341
std 50.084928 1687.753140 NaN 63483.491009 28122.433474 41057.330740 NaN 0.231205 618.361022 50675.002241 ... 14648.179473 33160.941514 4112.803148 0.030331 11470.181802 9166.005235 14906.279740 21299.868863 23789.655363 6944.998579
min 1.000000 1100.000000 NaN 124.000000 119.000000 0.000000 NaN 0.000000 2.000000 0.000000 ... 0.000000 111.000000 0.000000 0.000000 22000.000000 18500.000000 22000.000000 0.000000 0.000000 0.000000
25% 44.000000 2403.000000 NaN 4549.750000 2177.500000 1778.250000 NaN 0.336026 39.000000 3608.000000 ... 1030.000000 2453.000000 304.000000 0.050306 33000.000000 24000.000000 42000.000000 1675.000000 1591.000000 340.000000
50% 87.000000 3608.000000 NaN 15104.000000 5434.000000 8386.500000 NaN 0.534024 130.000000 11797.000000 ... 3299.000000 7413.000000 893.000000 0.067961 36000.000000 27000.000000 47000.000000 4390.000000 4595.000000 1231.000000
75% 130.000000 5503.000000 NaN 38909.750000 14631.000000 22553.750000 NaN 0.703299 338.000000 31433.000000 ... 9948.000000 16891.000000 2393.000000 0.087557 45000.000000 33000.000000 60000.000000 14444.000000 11783.000000 3466.000000
max 173.000000 6403.000000 NaN 393735.000000 173809.000000 307087.000000 NaN 0.968954 4212.000000 307933.000000 ... 115172.000000 199897.000000 28169.000000 0.177226 110000.000000 95000.000000 125000.000000 151643.000000 148395.000000 48207.000000

11 rows × 21 columns

The information above shows that the dataset contains some columns with missing data, specifically total, men, women and sharewomen (each has a count of 172 rather than 173).

Dropping rows with missing data

Matplotlib expects the columns of values passed to it to have matching lengths, and missing values can lead to errors. Since some rows have been identified as containing missing data, they need to be removed.

In [10]:
# displays the column with missing values 
print(recent_grads.isnull().sum())
rank                    0
major_code              0
major                   0
total                   1
men                     1
women                   1
major_category          0
sharewomen              1
sample_size             0
employed                0
full_time               0
part_time               0
full_time_year_round    0
unemployed              0
unemployment_rate       0
median                  0
p25th                   0
p75th                   0
college_jobs            0
non_college_jobs        0
low_wage_jobs           0
dtype: int64

Above, notice that only 4 columns have missing values, and each of those columns has exactly one missing value.

In [11]:
# returns the index of the dataset; its length is the number
# of rows (an alternative could be len(dataFrame))
raw_data_count = recent_grads.index
raw_data_count
Out[11]:
RangeIndex(start=0, stop=173, step=1)
In [12]:
# dropping rows with missing values
recent_grads = recent_grads.dropna(axis='index')
In [13]:
cleaned_data_count = recent_grads.index
cleaned_data_count
Out[13]:
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            163, 164, 165, 166, 167, 168, 169, 170, 171, 172],
           dtype='int64', length=172)

Notice now that there's a difference between raw_data_count (length=173) and the new cleaned_data_count (length=172). This shows that only one row in the dataset contained missing values and was dropped.
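
The same check can be made more directly by listing the incomplete rows and comparing row counts with `len()`. This is only a sketch on a hypothetical three-row frame (`toy`, with made-up major names "A", "B" and "C"), not the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the middle row mimics the single incomplete row.
toy = pd.DataFrame({
    "major": ["A", "B", "C"],
    "total": [2339.0, np.nan, 856.0],
    "men": [2057.0, np.nan, 725.0],
})

# Rows that have at least one missing value.
incomplete = toy[toy.isnull().any(axis=1)]

# Drop them and compare the row counts before and after.
cleaned = toy.dropna(axis="index")
rows_dropped = len(toy) - len(cleaned)
print(incomplete["major"].tolist(), rows_dropped)
```

Inspecting `incomplete` before dropping anything shows exactly which major would be lost.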

Analysing the Category of Students that tend to Have Higher Income -- Using Scatter plots

In order to determine which category of students earns more, scatter plots will be created to examine the relationships between some columns of our dataset, such as sample_size, median and unemployment_rate.

pandas has its own plotting method, DataFrame.plot(), which lets us create different types of plots quickly by passing in a few arguments, so that is what will be used to create the plots. E.g. recent_grads.plot(x='sample_size', y='employed', kind='scatter', title='Employed vs. Sample_size', figsize=(5,10))

The plots created help in answering the following important questions:

  • Do students in more popular majors make more money?
  • Do students that majored in subjects that were majority female make more money?
  • Is there any link between the number of full-time employees and median salary?
In [14]:
# creating a scatter plot sample_size vs. median
ax = recent_grads.plot(x='sample_size', y='median', 
                  kind='scatter')
ax.set_title('Median Salary vs. Sample_size')
ax.set_xlim(0,4300)
ax.set_ylim(0,120000)
Out[14]:
(0, 120000)

Q. Do students in more popular majors make more money?

A. NO! Based on the plot above, there is little or no significant correlation between the median earnings of graduates and the popularity of their major (using sample_size as a proxy).

However, there are two significant observations.

  1. The 75th percentile of sample_size is 338, which means the majority of the sample sizes fall below 500. Therefore, it's preferable to zoom in to get a better view of the relationship. There are also outliers in the data.
  2. Furthermore, the scatter plot uses earnings information from an unweighted sample of people. Therefore, sample_size may not be a good representation of how popular a major is.
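
A correlation coefficient can back up what the scatter plot suggests. The sketch below uses only the ten rows printed by head() and tail() earlier (`rows` is a hypothetical name); these are the five highest- and five lowest-ranked majors, so the number it prints is illustrative and should not be read as the result for the whole dataset:

```python
import pandas as pd

# sample_size and median for the ten rows printed by head() and tail() above.
# These are the extremes of the ranking, so the result is illustrative only.
rows = pd.DataFrame({
    "sample_size": [36, 7, 3, 16, 289, 47, 7, 13, 21, 2],
    "median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# Pearson correlation coefficient; values near 0 mean no linear relationship.
r = rows["sample_size"].corr(rows["median"])
print(round(r, 3))
```

Running the same call on recent_grads['sample_size'] and recent_grads['median'] would give the figure for all 172 majors.
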
In [15]:
# scatter plot with a sample size at the 75th percentile 
# and median values
ax = recent_grads.plot(x='sample_size', y='median',
                      kind='scatter')
ax.set_title('Median Salary vs. Sample_size')
ax.set_xlim(0,338) # 75th percentile value of sample_size
ax.set_ylim(0,120000)
Out[15]:
(0, 120000)

The diagram above still shows no correlation, even over a smaller range of sample sizes.

However, around a sample size of 50 there seems to be an increase in the median earnings of graduates.

In [16]:
# scatter plot of sample_size vs. unemployment_rate
ax = recent_grads.plot(x='sample_size', y='unemployment_rate', 
                  kind='scatter')
ax.set_title('Unemployment_rate vs. Sample_size')
ax.set_xlim(0, 4300) # max value in sample_size
ax.set_ylim(-0.02, 0.2)
Out[16]:
(-0.02, 0.2)

Q. Is the unemployment rate related to how popular a major is?

A. The diagram above also shows little or no correlation between unemployment_rate and sample_size.

However, for sample sizes within the range of 500, higher unemployment rates are observed.

For better intuition, it's better to analyse a smaller range of sample sizes, say 0 to 1000 (this seems appropriate).

In [17]:
# creates a scatter plot zoomed in on sample sizes up to 1000
ax = recent_grads.plot(x='sample_size', y='unemployment_rate',
                      kind='scatter')
ax.set_title('Unemployment_rate vs. Sample_size')
ax.set_xlim(0,1000) # upper limit chosen to zoom in on the smaller samples
Out[17]:
(0, 1000)

The diagram above still shows a lot of variation (no correlation) between sample_size and unemployment_rate.

Furthermore, exploring some of the rows in the dataset helps explain this variation.

  • There's a noticeable difference between the sample size and the number of employed and unemployed graduates in a given row. For example, for Petroleum Engineering (rank 1) the sample size is 36, while employed + unemployed is 1976 + 37 = 2013.
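
The gap can be put into numbers with a little arithmetic. This is a sketch using the Petroleum Engineering values from the first row shown earlier; the names `labor_force` and `coverage` are introduced here for illustration, and treating employed and unemployed as population estimates is an assumption based on the column definitions:

```python
# Values for Petroleum Engineering taken from the first row shown earlier.
sample_size = 36
employed = 1976
unemployed = 37

# Employed plus unemployed graduates for this major.
labor_force = employed + unemployed

# Fraction of that group that the sample would cover.
coverage = sample_size / labor_force
print(labor_force, round(coverage, 4))
```

A sample that covers under 2% of the group helps explain why the earnings and unemployment figures for small majors vary so much.
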
In [18]:
# scatter plot: Full_time vs Median
ax = recent_grads.plot(x='full_time', y='median', kind='scatter')
ax.set_title('Median Salary vs. Full_time employed grads')
ax.set_xlim(0,)
Out[18]:
(0, 300000.0)

Q. Is there any link between the number of full-time employees and median salary?

A. The diagram above also shows a lot of variation (no correlation) between the number of full-time employed graduates in a major and its median earnings, especially for majors with fewer than 50,000 full-time employees.

Aside from the small sample sizes, some of the variation could come from the kinds of companies or organizations that hire graduates of each major. For example:

  • Big companies may employ few workers and be willing to pay more for fewer hours of work, because they target the best workers and focus on productivity.
  • Medium companies could do the same, or decide to employ lots of workers and pay less for more hours of work in order to achieve productivity. In doing so they target very good and average workers.
  • Small companies may be willing to employ a large number of workers (grads) and pay less for more working hours due to a lack of resources. Fair or average grads may also fall into these companies since they're easier to hire.

However, in my opinion, it's expected that the more hours put into work, the higher the earnings.

In [19]:
# scatter plot: ShareWomen vs. Unemployment_rate
ax = recent_grads.plot(x='sharewomen', y='unemployment_rate',
                 kind='scatter')
ax.set_title('Unemployment_rate vs. Fraction of Graduate women')
ax.set_xlim(0,)
Out[19]:
(0, 1.2000000000000002)

The diagram above shows no correlation between sharewomen and unemployment_rate.

In [20]:
# scatter plot: Sharewomen vs. median
ax = recent_grads.plot(x='sharewomen', y='median',
                      kind='scatter')
ax.set_title('Median Salary vs. Fraction of graduate women')
ax.set_xlim(0,1.0)
ax.set_ylim(10000,)
Out[20]:
(10000, 120000.0)

Q. Do students that majored in subjects that were majority female make more money?

A. No! The scatter plot above shows a weak negative correlation.

This means graduates of majors with a smaller share of women tended to earn more. As shown in the diagram, majors where women make up 0 - 0.2 (0 - 20%) of graduates had the highest median earnings, while majors with a larger share of women (above 0.2) had lower earnings.

In [21]:
# scatter plot: Men vs. Median
ax = recent_grads.plot(x='men', y='median', kind='scatter')
ax.set_title('Median Salary vs. Male graduates')
ax.set_xlim(0,)
ax.set_ylim(20000,)
Out[21]:
(20000, 120000.0)

There's no correlation between the number of male graduates in a major and its median earnings.

In [22]:
# scatter plot: Women vs. Median
ax = recent_grads.plot(x='women', y='median', kind='scatter')
ax.set_title('Median Salary vs. Female graduates')
ax.set_xlim(0,)
ax.set_ylim(20000,)
Out[22]:
(20000, 120000.0)

There's equally no correlation between the number of female graduates in a major and its median earnings.

Visualizing Graduate information using Histograms

This section focuses on answering two questions:

  1. What percent of majors are predominantly male? Predominantly female?
  2. What's the most common median salary range?

NB: dataFrame[col_name].plot(kind='hist') was not used to generate the histograms because it's harder to control the binning strategy. Instead, dataFrame[col_name].hist(bins=<digit>, range=(<digits>)) is preferable.

For a better understanding, check out the documentation for Series.hist()
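
To see what bins and range actually do, the bin edges can be computed without drawing anything. This is a sketch using numpy.histogram, which is my choice for illustration and assumes the plotted histogram bins data the same way; `values` holds hypothetical sample sizes, not the real column:

```python
import numpy as np

# Hypothetical sample sizes, used only to show how the bins are laid out.
values = np.array([2, 7, 36, 47, 130, 289, 338, 480, 4212])

# bins=20 over range=(0, 500) gives 20 equal-width bins of 25 units each;
# values outside the range (4212 here) are ignored.
counts, edges = np.histogram(values, bins=20, range=(0, 500))
print(edges[:3], counts.sum())
```

Choosing the range first and then a bin count that divides it evenly keeps the bin edges on round numbers, which makes the histogram easier to read.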

In [23]:
# histogram exploring sample_size
ax = recent_grads['sample_size'].hist()
ax.set_title('Distribution of Sample size data')
ax.set_xlabel('Sample_size')
ax.set_ylabel('Frequency')
Out[23]:
<matplotlib.text.Text at 0x7fae82d700f0>

The diagram above shows that most of the sample_size values are below 500. A more detailed view can be obtained by restricting the histogram to the range 0 - 500.

In [24]:
# Histogram of sample_size of range 500
ax = recent_grads['sample_size'].hist(bins=20, range=(0,500))
ax.set_title('Distribution of Sample size data')
ax.set_xlabel('Sample_size')
ax.set_ylabel('Frequency')
Out[24]:
<matplotlib.text.Text at 0x7fae82c5cc18>

Looking closer, most of the sample sizes fall below 100. With such small sample sizes, the median earnings reported for those majors may not be very accurate.

In [25]:
# Histogram of Median earnings
ax = recent_grads['median'].hist(bins=50, range=(20000,80000))
ax.set_title('Median distribution')
ax.set_xlabel('Median')
ax.set_ylabel('Frequency')
Out[25]:
<matplotlib.text.Text at 0x7fae82d1c908>

Q. What's the most common median salary range?

A. The diagram shows the most common median salary range to be $30,000 - $40,000. The next most common is either $40,000 - $50,000 or $50,000 - $60,000, which is hard to tell without further analysis or visualization.
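
One way to settle which range is most common is to count the values per bin instead of reading bar heights. The sketch below assumes $10,000-wide bins and uses only the ten median values printed by head() and tail() earlier (the extremes of the ranking), so its answer differs from the full dataset; `medians`, `binned` and `most_common` are names introduced here:

```python
import pandas as pd

# Median salaries of the ten rows printed by head() and tail() above
# (the extremes of the ranking, so not representative of all majors).
medians = pd.Series([110000, 75000, 73000, 70000, 65000,
                     26000, 25000, 25000, 23400, 22000])

# Cut into $10,000-wide, left-closed bins from $20,000 to $120,000.
bins = list(range(20000, 130000, 10000))
binned = pd.cut(medians, bins=bins, right=False)

# The most frequent bin is the most common salary range.
most_common = binned.value_counts().idxmax()
print(most_common)
```

Applying the same two calls to recent_grads['median'] would rank every salary range by the number of majors in it.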

In [26]:
# histogram for Employed grads
ax = recent_grads['employed'].hist(bins=25, range=(0,30000))
ax.set_title('Distribution of employed graduates')
ax.set_xlabel('Employed')
ax.set_ylabel('Frequency')
Out[26]:
<matplotlib.text.Text at 0x7fae82eb6940>

Each row of the dataset is a major, so this histogram shows how many majors fall into each range of employed graduates. The describe() output above gives a median of 11,797 employed graduates per major, so about half of the majors have fewer than 12,000 employed graduates.

The 75th percentile is 31,433, so at least a quarter of the majors fall beyond the 30,000 upper limit used for this histogram.

It would be fair to say that the number of students employed per major is affected by the number of students that have taken the major.

I'll examine the columns to determine if any relationship exists.

In [27]:
# E.g. the total number of people vs the number employed 
# for the largest majors
# Filter by majors with total > 39,000 (75th percentile)
largest_majors = recent_grads.loc[recent_grads['total'] > 39000, ['major_code', 'major', 'total', 'employed']]
largest_majors.sort_values(by='total', ascending=False).head(10)
Out[27]:
     major_code  major                                   total     employed
145  5200        PSYCHOLOGY                              393735.0  307933
76   6203        BUSINESS MANAGEMENT AND ADMINISTRATION  329927.0  276234
123  3600        BIOLOGY                                 280709.0  182295
57   6200        GENERAL BUSINESS                        234590.0  190183
93   1901        COMMUNICATIONS                          213996.0  179633
34   6107        NURSING                                 209394.0  180903
77   6206        MARKETING AND MARKETING RESEARCH        205211.0  178862
40   6201        ACCOUNTING                              198633.0  165527
137  3301        ENGLISH LANGUAGE AND LITERATURE         194673.0  149180
78   5506        POLITICAL SCIENCE AND GOVERNMENT        182621.0  133454

Above, we notice an obvious relationship between the total number of grads in a major and the number employed.

To better illustrate this, a scatter plot will be used to show the relationship.
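
The relationship can also be quantified. This is a sketch that re-enters the ten largest majors listed above (`largest` and `employed_share` are names introduced here); since it covers only those ten rows, the correlation it prints is not the figure for the whole dataset:

```python
import pandas as pd

# total and employed for the ten largest majors listed in the table above.
largest = pd.DataFrame({
    "major": ["PSYCHOLOGY", "BUSINESS MANAGEMENT AND ADMINISTRATION", "BIOLOGY",
              "GENERAL BUSINESS", "COMMUNICATIONS", "NURSING",
              "MARKETING AND MARKETING RESEARCH", "ACCOUNTING",
              "ENGLISH LANGUAGE AND LITERATURE", "POLITICAL SCIENCE AND GOVERNMENT"],
    "total": [393735, 329927, 280709, 234590, 213996,
              209394, 205211, 198633, 194673, 182621],
    "employed": [307933, 276234, 182295, 190183, 179633,
                 180903, 178862, 165527, 149180, 133454],
})

# Share of each major's graduates who are employed.
largest["employed_share"] = largest["employed"] / largest["total"]

# Pearson correlation between the two counts for these ten rows.
r = largest["total"].corr(largest["employed"])
print(largest[["major", "employed_share"]].round(3))
print(round(r, 2))
```

Looking at the share alongside the raw counts shows which large majors employ a smaller fraction of their graduates, something the counts alone hide.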

In [28]:
# Scatter plot: Total vs. employed
ax = recent_grads.plot(x='total', y='employed', kind='scatter')
ax.set_title('Total grads with majors vs. Employed grads')
ax.set_xlim(0,)
ax.set_ylim(0,)
ax.set_xlabel('Total')
ax.set_ylabel('Employed')
Out[28]:
<matplotlib.text.Text at 0x7fae850175c0>
In [29]:
# histogram of full time employed grads
ax = recent_grads['full_time'].hist(bins=25, range=(0,250000))
ax.set_title('Distribution of Full time employees')
ax.set_xlabel('Full Time Employees')
ax.set_ylabel('Frequency')
Out[29]:
<matplotlib.text.Text at 0x7fae82cb4d30>

The diagram above shows that most majors have fewer than 50,000 graduates in full-time employment. This is consistent with the employed column, where the 75th percentile is 31,433 graduates per major, and full-time workers are only a subset of those employed; the rest work part-time.

In [30]:
# histogram for sharewomen
ax = recent_grads['sharewomen'].hist(bins=20)
ax.set_title('Distribution of a Fraction of Graduate women')
ax.set_xlabel('Sharewomen')
ax.set_ylabel('Frequency')
Out[30]:
<matplotlib.text.Text at 0x7fae82ffd438>

It appears that just over 50% of all majors are predominantly female, with the highest frequency at 70 - 80% female.
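
The percentage can be computed rather than estimated from the bars. The sketch below applies the idea to the ten sharewomen values printed by head() and tail() earlier; these are the ranking extremes, so the result is illustrative only, and the variable names are introduced here:

```python
import pandas as pd

# sharewomen for the ten rows printed by head() and tail() above; these are
# the ranking extremes, so the percentages are illustrative only.
sharewomen = pd.Series([0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                        0.637293, 0.817099, 0.799859, 0.798746, 0.877960])

# The mean of a boolean Series is the fraction of True values.
pct_female_majority = (sharewomen > 0.5).mean() * 100
pct_male_majority = (sharewomen < 0.5).mean() * 100
print(pct_female_majority, pct_male_majority)
```

The same expression on recent_grads['sharewomen'] would give the exact percentage of predominantly female majors.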

In [31]:
# Evaluating majors with the highest share
# of females (0.6 - 0.8)
columns = ['major_code', 'major', 'total', 'men', 'women',
           'sharewomen', 'employed', 'unemployed']
in_range = (recent_grads['sharewomen'] > 0.6) & (recent_grads['sharewomen'] <= 0.8)
largest_female_share = recent_grads.loc[in_range, columns]
print(largest_female_share.shape)
largest_female_share.sort_values(by='sharewomen', ascending=False).head()      
(54, 8)
Out[31]:
major_code major total men women sharewomen employed unemployed
170 5202 CLINICAL PSYCHOLOGY 2838.0 568.0 2270.0 0.799859 2101 368
155 5299 MISCELLANEOUS PSYCHOLOGY 9628.0 1936.0 7692.0 0.798920 7653 419
171 5203 COUNSELING PSYCHOLOGY 4626.0 931.0 3695.0 0.798746 3777 214
118 6110 COMMUNITY AND PUBLIC HEALTH 19735.0 4103.0 15632.0 0.792095 14512 1833
145 5200 PSYCHOLOGY 393735.0 86648.0 307087.0 0.779933 307933 28169

Looking at the DataFrame and comparing the men, women and total columns confirms what the histogram shows: the majority of majors have more women than men.

In [32]:
# histogram of unemployment_rate
ax = recent_grads['unemployment_rate'].hist()
ax.set_title('Distribution of unemployment_rate')
ax.set_xlabel('Unemployment Rate')
ax.set_ylabel('Frequency')
Out[32]:
<matplotlib.text.Text at 0x7fae82bb5da0>

Most majors have an unemployment rate of around 6-7%, which is where the histogram peaks, while very few majors have an unemployment rate between 12% and 14%.

I'll examine both cases below.

In [33]:
# Majors in the most common unemployment rate range (6-7%)
highest_majors_unemployed = recent_grads.loc[(recent_grads['unemployment_rate'] >= 0.06) &
                                         (recent_grads['unemployment_rate'] <= 0.07)][['major_code', 'major', 'major_category', 'unemployment_rate']]
highest_majors_unemployed.sort_values(by='unemployment_rate', ascending=False).head()
Out[33]:
     major_code  major                                    major_category               unemployment_rate
121  6106        HEALTH AND MEDICAL PREPARATORY PROGRAMS  Health                       0.069780
40   6201        ACCOUNTING                               Business                     0.069749
96   1902        JOURNALISM                               Communications & Journalism  0.069176
101  3608        PHYSIOLOGY                               Biology & Life Science       0.069163
151  5404        SOCIAL WORK                              Psychology & Social Work     0.068828

The table above shows the 5 majors with the highest unemployment_rate within the 6-7% range, with Health and Medical Preparatory Programs at the top.

In [34]:
# Majors with an unemployment rate between 12% and 14%
highest_majors_unemployed = recent_grads.loc[(recent_grads['unemployment_rate'] >= 0.12) &
                                         (recent_grads['unemployment_rate'] <= 0.14)][['major_code', 'major', 'major_category', 'median', 'unemployment_rate']]
highest_majors_unemployed.sort_values(by='unemployment_rate', ascending=False).head()
Out[34]:
major_code major major_category median unemployment_rate
29 5402 PUBLIC POLICY Law & Public Policy 50000 0.128426

PUBLIC POLICY is the only major in this range. Its median salary of $50,000 is good, but an unemployment rate of about 12.8% is well above the dataset mean of about 6.8%, so it is among the majors with weaker employment prospects.
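
To find the majors with the highest and lowest unemployment rates directly, the rows can be ranked by the column value instead of filtered by histogram bin. This is a sketch on a hypothetical frame (`rates`) holding only five majors whose rates appear in the outputs above, so the ranking applies to those five only:

```python
import pandas as pd

# unemployment_rate for a few majors that appear in the outputs above.
rates = pd.DataFrame({
    "major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING", "PUBLIC POLICY", "CLINICAL PSYCHOLOGY"],
    "unemployment_rate": [0.018381, 0.117241, 0.024096, 0.128426, 0.149048],
})

# nlargest / nsmallest rank rows by the column value itself,
# independent of how the histogram bins were drawn.
highest = rates.nlargest(2, "unemployment_rate")
lowest = rates.nsmallest(2, "unemployment_rate")
print(highest["major"].tolist())
print(lowest["major"].tolist())
```

A tall histogram bar marks the most common rate, not the highest one, which is why ranking by value is the safer way to answer this question.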

In [35]:
# histogram distribution of men
ax = recent_grads['men'].hist(bins=25, range=(0,200000))
ax.set_title('Distribution of Men')
ax.set_xlabel('Men')
ax.set_ylabel('Frequency')
Out[35]:
<matplotlib.text.Text at 0x7fae82b20828>

Q. What percent of majors are predominantly male?

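A. The histogram above shows the number of male graduates per major, not a percentage. Most majors have fewer than about 16,000 male graduates (the 75th percentile in the describe() output is 14,631), while the largest has 173,809. Based on the sharewomen histogram earlier, just under 50% of majors are predominantly male.

However, it doesn't show which majors are significantly dominated by males.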

In [36]:
# determining the majors with the largest number of male grads
male_dominated_majors = recent_grads.loc[recent_grads['men'] >= 0, ['major_code', 'major', 'major_category', 'median', 'men', 'women']]
male_dominated_majors.sort_values(by='men', ascending=False).head(5)
Out[36]:
     major_code  major                                   major_category           median  men       women
76   6203        BUSINESS MANAGEMENT AND ADMINISTRATION  Business                 38000   173809.0  156118.0
57   6200        GENERAL BUSINESS                        Business                 40000   132238.0  102352.0
35   6207        FINANCE                                 Business                 47000   115030.0  59476.0
123  3600        BIOLOGY                                 Biology & Life Science   33400   111762.0  168947.0
20   2102        COMPUTER SCIENCE                        Computers & Mathematics  53000   99743.0   28576.0

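The majors with the most male graduates are BUSINESS MANAGEMENT AND ADMINISTRATION, GENERAL BUSINESS and FINANCE. Note that this ranks majors by the number of men rather than by their share: BIOLOGY appears in the list even though its women outnumber its men.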
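
Ranking by the share of men rather than the count gives a clearer picture of which majors are male-dominated. The sketch below re-enters the five majors from the table above (`top_men`, `share_men` and `by_share` are names introduced here), so it only reorders those five:

```python
import pandas as pd

# men and women counts for the five majors in the table above.
top_men = pd.DataFrame({
    "major": ["BUSINESS MANAGEMENT AND ADMINISTRATION", "GENERAL BUSINESS",
              "FINANCE", "BIOLOGY", "COMPUTER SCIENCE"],
    "men": [173809, 132238, 115030, 111762, 99743],
    "women": [156118, 102352, 59476, 168947, 28576],
})

# Share of men in each major; above 0.5 means predominantly male.
top_men["share_men"] = top_men["men"] / (top_men["men"] + top_men["women"])
by_share = top_men.sort_values(by="share_men", ascending=False)
print(by_share[["major", "share_men"]].round(3))
```

On the full dataset, sorting by sharewomen in ascending order would achieve the same thing without recomputing the share.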

In [37]:
# histogram distribution of women
ax = recent_grads['women'].hist(bins=25, range=(0,200000))
ax.set_title('Distribution of women')
ax.set_xlabel('Women')
ax.set_ylabel('Frequency')
ax.set_ylim(0,120)
Out[37]:
(0, 120)

Q. What percent of majors are predominantly Female?

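A. As with men, the histogram above shows the number of female graduates per major rather than a percentage. Most majors have fewer than about 24,000 female graduates (the 75th percentile in the describe() output is about 22,554), while the largest has 307,087. Based on the sharewomen histogram earlier, just over 50% of majors are predominantly female.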

In [38]:
# determining the majors with the largest number of female grads
female_dominated_majors = recent_grads.loc[recent_grads['women'] >= 0, ['major_code', 'major', 'major_category', 'median', 'men', 'women']]
female_dominated_majors.sort_values(by='women', ascending=False).head(5)
Out[38]:
     major_code  major                                   major_category            median  men       women
145  5200        PSYCHOLOGY                              Psychology & Social Work  31500   86648.0   307087.0
34   6107        NURSING                                 Health                    48000   21773.0   187621.0
123  3600        BIOLOGY                                 Biology & Life Science    33400   111762.0  168947.0
138  2304        ELEMENTARY EDUCATION                    Education                 32000   13029.0   157833.0
76   6203        BUSINESS MANAGEMENT AND ADMINISTRATION  Business                  38000   173809.0  156118.0

The majors with the most female graduates are PSYCHOLOGY, NURSING and BIOLOGY.

Exploring potential relationships and distributions of columns using Scatter matrix plot

In order to evaluate the relationship between multiple columns more efficiently, a scatter matrix plot is a good option.

A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.

In [39]:
# importing scatter_matrix
from pandas.plotting import scatter_matrix
In [40]:
# A 2 by 2 scatter matrix plot of Sample_size 
# and Median Salary
scatter_matrix(recent_grads[['sample_size', 'median']],
              figsize=(10,10))
Out[40]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82a05438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82968438>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82930780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae828ed240>]],
      dtype=object)

The diagram above shows most sample sizes to be less than 1000 (top-left histogram). The scatter plot of Median vs. Sample_size (bottom-left) suggests that most median salaries are somewhere around $30,000 - $40,000.

However, the scatter plot of Sample_size vs. Median (top-right) suggests that an increase in sample size doesn't necessarily affect the median salary values.

In [41]:
# A 3 x 3 scatter matrix plot of sample_size, median and
# unemployment columns
scatter_matrix(recent_grads[['sample_size', 'median', 'unemployment_rate']],
              figsize=(10,10))
Out[41]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae8280ff98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae8280ac50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82759668>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae827160f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae826df128>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae8269e8d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82666fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82626710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae825f3940>]],
      dtype=object)

No clear correlation is visible in the scatter matrix plot above. Still, a scatter matrix is a quick way to view relationships between columns that were plotted one at a time in the cells above, e.g. the total students with a major vs. number employed, or total students with a major vs. number unemployed.

In [42]:
# Total vs unemployed scatter matrix
scatter_matrix(recent_grads[['total', 'unemployed']], figsize=(10,10))
Out[42]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae826cdef0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82454fd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae824229b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae823de630>]],
      dtype=object)

This shows a weak positive correlation between the total number of students in a major and the number unemployed, suggesting that only a small fraction of students in each major are unemployed.

In [43]:
# Total vs employed scatter matrix
scatter_matrix(recent_grads[['total', 'employed']], figsize=(10,10))
Out[43]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae82383fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae823010f0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae822c8a90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae82285668>]],
      dtype=object)

The plot above shows a very strong positive correlation between the total number of students in a major and the number employed. This suggests that the majority of students in each major are employed.
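The correlations read off the scatter matrices above can also be checked numerically. The sketch below is only an illustration: it uses a small made-up frame (the values are not taken from the dataset) and DataFrame.corr, which computes pairwise Pearson correlation coefficients by default. With the real data, the same call could be made on recent_grads[['total', 'employed', 'unemployed']].

```python
import pandas as pd

# Hypothetical toy data standing in for recent_grads (values are made up)
toy = pd.DataFrame({
    'total':      [1000, 2000, 3000, 4000, 5000],
    'employed':   [800, 1650, 2400, 3300, 4000],
    'unemployed': [90, 60, 200, 150, 320],
})

# Pairwise Pearson correlation coefficients, each between -1 and 1
corr = toy.corr()
print(corr.round(2))
```

A coefficient close to 1 corresponds to the tight upward band seen in a scatter plot, while a value near 0 corresponds to a shapeless cloud.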

Visualizing data using bar plot

Bar plots can be created from a Series object, df[range][col].plot(kind='bar'), or from a DataFrame object, df[range].plot.bar(x=labels, y=data for bars).
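As a minimal sketch of the two routes just described, the example below builds both kinds of bar plot from a tiny made-up frame. The frame and its values are hypothetical, and the 'Agg' backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import pandas as pd

# Hypothetical toy frame (values are made up for illustration)
df = pd.DataFrame({
    'major': ['A', 'B', 'C'],
    'sharewomen': [0.2, 0.5, 0.8],
})

# Series route: bars are labelled by the index (0, 1, 2 here)
ax_series = df[:3]['sharewomen'].plot(kind='bar')

# DataFrame route: x picks the label column, y the bar heights
ax_frame = df[:3].plot.bar(x='major', y='sharewomen')
```

The DataFrame route is used in the cells below because it lets the major names serve as bar labels directly.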

In [44]:
# Bar plot of sharewomen from first ten rows vs sharewomen
# from last ten rows
ax1 = recent_grads[:10].plot.bar(x='major', y='sharewomen', 
                                 title='Fraction of female grads from the top 10 courses with the highest median salary')

ax2 = recent_grads[-10:].plot.bar(x='major', y='sharewomen',
                                 title='Fraction of female grads from the bottom 10 courses with the least median salary')

Above we observe that the courses with the highest median salaries have a lower share of female grads than those with the lowest median salaries, which in this case are mostly female (i.e. more than 50% of grads in the lowest-paying courses are female).

We can calculate how large the difference is below:

In [45]:
# Calculating the average proportion of female grads for the 
# top and bottom 10 courses
# NB: slices using .loc  includes the index of both the start 
# and stop index contrary to using normal python lists
top_10_female_share = recent_grads.loc[:9, 'sharewomen'].mean()
bottom_10_female_share = recent_grads[-10:]['sharewomen'].mean()
In [46]:
top_10 = ('The 10 highest paying courses have an average '
          'female share of {:.2f}'.format(
          top_10_female_share))
          
bottom_10 = ('The 10 lowest paying courses have an average '
          'female share of {:.2f}'.format(
          bottom_10_female_share))
             
print(top_10)
print(bottom_10)
The 10 highest paying courses have an average female share of 0.23
The 10 lowest paying courses have an average female share of 0.79

There's an obvious difference between the average female share of the top 10 and bottom 10 courses (ranked by median pay): 0.79 against 0.23, a gap of about 56 percentage points.
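The note in the averaging cell above about .loc including both ends of a slice can be seen in a small sketch. The Series here is a hypothetical stand-in with the default integer index; it also shows that .loc slices by label rather than by position, which matters once rows have been reordered.

```python
import pandas as pd

# Hypothetical toy Series with the default integer index 0..19
s = pd.Series(range(20))

label_slice = s.loc[:9]             # label-based: label 9 is included -> 10 rows
position_slice = s.iloc[:10]        # position-based: position 10 is excluded -> 10 rows
plain_list = list(range(20))[:9]    # normal Python list: stop is excluded -> 9 items

# After reordering, .loc[:9] still follows labels, not positions:
# it returns every row from the start down to label 9 (labels 19..9)
reordered = s.sort_values(ascending=False)
label_slice_reordered = reordered.loc[:9]

print(len(label_slice), len(position_slice), len(plain_list),
      len(label_slice_reordered))
```

Using .iloc[:10] for the top rows and .iloc[-10:] for the bottom rows avoids depending on what the index labels happen to be.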

Next, we check out the difference in the unemployment rate between the top and bottom 10 courses.

In [47]:
# Unemployment rate for top and bottom 10 courses
ax1 = recent_grads[:10].plot.bar(x='major', y='unemployment_rate',
                                title='Unemployment rate for top 10 courses.')

ax2 = recent_grads[-10:].plot.bar(x='major', y='unemployment_rate',
                                title='Unemployment rate for bottom 10 courses.')

For the top 10 courses in general, the unemployment rate is relatively low; however, two courses, NUCLEAR ENGINEERING and MINING AND MINERAL ENGINEERING, stand out as notably high.

For the bottom 10 courses, the unemployment rate seems to be moderately high, with 3-5 courses showing it.

We can analyse this further by looking at the average unemployment rates.

In [48]:
# calculating the average unemployment rates for the
# top and bottom 10 courses
# NB: slices using .loc  includes the index of both the start 
# and stop index contrary to using normal python lists
mean_unemp_rate = recent_grads['unemployment_rate'].mean()
top_10_unemp_rate = recent_grads.loc[:9, 'unemployment_rate'].mean()
bottom_10_unemp_rate = recent_grads[-10:]['unemployment_rate'].mean()

print(('The average unemployment rate for all majors is {:.2f}'
       .format(mean_unemp_rate)))
print(('The average unemployment rate for the top 10 '
       'majors is {:.2f}'.format(top_10_unemp_rate)))
print(('The average unemployment rate for the bottom 10 '
       'majors is {:.2f}'.format(bottom_10_unemp_rate)))
The average unemployment rate for all majors is 0.07
The average unemployment rate for the top 10 majors is 0.07
The average unemployment rate for the bottom 10 majors is 0.08

The average unemployment rate for all majors is similar to that of the top and bottom 10 courses. However, in the top 10 you'd see that there are only two courses with distinctly high rates, while in the bottom 10 there are about 3-5 such courses.

We could examine this further below:

In [49]:
# selecting the top 10 courses whose unemployment rate is above
# the overall mean, then adding a column with the difference
# NB: .copy() avoids pandas' SettingWithCopyWarning when the
# new column is assigned
top_10_outliers = (recent_grads[:10]
                   .loc[recent_grads[:10]['unemployment_rate'] > mean_unemp_rate]
                   .copy())
top_10_outliers['mean_difference'] = (
    top_10_outliers['unemployment_rate'] - mean_unemp_rate)
In [50]:
# selecting the bottom 10 courses whose unemployment rate is above
# the overall mean, then adding a column with the difference
# NB: .copy() avoids pandas' SettingWithCopyWarning when the
# new column is assigned
bottom_10_outliers = (recent_grads[-10:]
                   .loc[recent_grads[-10:]['unemployment_rate'] > mean_unemp_rate]
                   .copy())
bottom_10_outliers['mean_difference'] = (
    bottom_10_outliers['unemployment_rate'] - mean_unemp_rate)
In [51]:
"""Plotting the courses from the top and bottom 10 with the
average unemployment rates and their mean_difference"""
top_10_outliers.plot.bar(x='major', y='mean_difference',
                        title='Majors in the top 10 above average unemployment rate.')

bottom_10_outliers.plot.bar(x='major', y='mean_difference',
                        title='Majors in the bottom 10 above average unemployment rate.')
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fae820f1be0>

In the top 10 courses, NUCLEAR ENGINEERING stands furthest above the average unemployment rate, while in the bottom 10 it is CLINICAL PSYCHOLOGY.
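The filtering done separately for the top and bottom 10 follows the same pattern, so it could be wrapped in one small function. The sketch below is one possible way to do that; above_mean is a hypothetical helper (not part of pandas or of the original notebook), and the toy frame holds made-up values.

```python
import pandas as pd

def above_mean(frame, column, reference_mean):
    """Return the rows whose `column` exceeds `reference_mean`,
    with the gap stored in a new 'mean_difference' column."""
    # .copy() gives an independent frame, so adding a column is safe
    subset = frame.loc[frame[column] > reference_mean].copy()
    subset['mean_difference'] = subset[column] - reference_mean
    return subset

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    'major': ['A', 'B', 'C', 'D'],
    'unemployment_rate': [0.02, 0.10, 0.05, 0.15],
})
toy_mean = toy['unemployment_rate'].mean()

result = above_mean(toy, 'unemployment_rate', toy_mean)
print(result)
```

With the notebook's data, the calls would be above_mean(recent_grads[:10], 'unemployment_rate', mean_unemp_rate) and the same with recent_grads[-10:].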

Further Analysis

Moving on, I'll generate visualizations to carry out a more in-depth analysis of the following:

  • Comparing the number of men and women in each category of majors using a grouped bar plot.
  • Explore the distributions of median salaries and unemployment rate using box plot
  • Explore columns with denser scatter plots using hexagonal bin plot, from the scatter plots created above.

NB: For more plot types available in pandas, check out the pandas plotting documentation.

In [52]:
# comparing the number of men and women in each category major
# NB: df.plot.bar(stacked=True) stacks the plot on top of each
# other    
ax1 = (recent_grads.groupby('major_category')[['men','women']]
 .sum().plot.bar(stacked=True))
ax1.set_title('Category Majors vs. Number of Men and women')
ax1.set_ylabel('Total')
Out[52]:
<matplotlib.text.Text at 0x7fae8206fef0>

Q. Comparing the number of men and women in each category of majors using a grouped bar plot.

A. The Business category has the highest combined number of male and female graduates, and the two groups are fairly evenly split.

Generally, there is a higher percentage of female grads than male grads across the various major categories, with some exceptions such as ENGINEERING and COMPUTER & MATHEMATICS, which are dominated by male grads.
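The plot above stacks men and women in one bar per category (stacked=True). A grouped layout, as named in the question, places the two bars side by side, which is the pandas default when stacked is left out. The sketch below shows both on a made-up frame (the categories and counts are hypothetical) and also computes the share of women per category, which makes the "dominated by" reading explicit.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import pandas as pd

# Hypothetical toy data (counts are made up for illustration)
toy = pd.DataFrame({
    'major_category': ['Engineering', 'Engineering', 'Health', 'Health'],
    'men':   [900, 700, 100, 200],
    'women': [300, 100, 800, 600],
})
totals = toy.groupby('major_category')[['men', 'women']].sum()

ax_grouped = totals.plot.bar()               # side-by-side bars (default)
ax_stacked = totals.plot.bar(stacked=True)   # bars stacked on top of each other

# Share of women in each category: women / (men + women)
share_women = totals['women'] / totals.sum(axis=1)
print(share_women.round(2))
```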

In [65]:
# box plot: exploring the distribution of median salaries
recent_grads['median'].plot.box(title='Distribution of Median salaries')
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fae81846320>

The figure above shows that the median salaries of most majors fall between $30,000 and $40,000. We also observe outliers in the median salary, which range somewhere around $60,000 - $80,000.
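The points drawn as outliers in a box plot follow a simple rule: by default, matplotlib extends the whiskers to the furthest data points within 1.5 times the interquartile range (IQR) of the box and marks anything beyond as an outlier. The sketch below applies that rule by hand to a made-up Series (the salaries are hypothetical, not from the dataset).

```python
import pandas as pd

# Hypothetical median salaries (made up for illustration)
medians = pd.Series([22000, 30000, 33000, 35000, 36000,
                     40000, 45000, 75000, 110000])

q1 = medians.quantile(0.25)   # lower edge of the box
q3 = medians.quantile(0.75)   # upper edge of the box
iqr = q3 - q1

# Fences beyond which points are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = medians[(medians < lower_fence) | (medians > upper_fence)]
print(q1, q3, lower_fence, upper_fence)
print(outliers.tolist())
```

Applied to recent_grads['median'], the same steps would list exactly which majors sit beyond the whiskers.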

In [66]:
# box plot: exploring the distribution of unemployment rate
ax = recent_grads['unemployment_rate'].plot.box(title='Distribution of unemployment rate')
In [68]:
# hexagonal bin plot: total vs employed
recent_grads.plot.hexbin(x='employed', y='total', gridsize=30)
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fae81a4a908>
In [69]:
recent_grads.plot.hexbin(x='men', y='median', gridsize=30)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fae820af588>