We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
Here are the columns in the dataset:
Using visualizations, we can start to explore questions from the dataset like:
Using scatter plots
Using histograms
Using bar plots
Let's first import the libraries we need and remove rows containing null values.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
We'll now read the dataset into a DataFrame and start exploring the data.
recent_grads = pd.read_csv("recent-grads.csv")
recent_grads.iloc[0]
Rank                    1
Major_code              2419
Major                   PETROLEUM ENGINEERING
Total                   2339
Men                     2057
Women                   282
Major_category          Engineering
ShareWomen              0.120564
Sample_size             36
Employed                1976
Full_time               1849
Part_time               270
Full_time_year_round    1207
Unemployed              37
Unemployment_rate       0.0183805
Median                  110000
P25th                   95000
P75th                   125000
College_jobs            1534
Non_college_jobs        364
Low_wage_jobs           193
Name: 0, dtype: object
recent_grads.head()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
recent_grads.tail()
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
As the initial data exploration shows, Engineering majors earn the highest salaries, while Psychology & Social Work and Education majors earn the lowest.
import numpy as np
recent_grads.describe(exclude=[np.number])
| | Major | Major_category |
---|---|---|
count | 173 | 173 |
unique | 173 | 16 |
top | COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY | Engineering |
freq | 1 | 29 |
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
raw_data_count = recent_grads.count()
raw_data_count
Rank                    173
Major_code              173
Major                   173
Total                   172
Men                     172
Women                   172
Major_category          173
ShareWomen              172
Sample_size             173
Employed                173
Full_time               173
Part_time               173
Full_time_year_round    173
Unemployed              173
Unemployment_rate       173
Median                  173
P25th                   173
P75th                   173
College_jobs            173
Non_college_jobs        173
Low_wage_jobs           173
dtype: int64
Initial data exploration shows a few missing values: the Total, Men, Women, and ShareWomen columns each have 172 non-null entries instead of 173. Missing values would cause errors in the plots we'll make, so we'll simply drop the rows that contain them. Only one row is affected, so little data is lost.
recent_grads = recent_grads.dropna()
cleaned_raw_data_count = recent_grads.count()
cleaned_raw_data_count
Rank                    172
Major_code              172
Major                   172
Total                   172
Men                     172
Women                   172
Major_category          172
ShareWomen              172
Sample_size             172
Employed                172
Full_time               172
Part_time               172
Full_time_year_round    172
Unemployed              172
Unemployment_rate       172
Median                  172
P25th                   172
P75th                   172
College_jobs            172
Non_college_jobs        172
Low_wage_jobs           172
dtype: int64
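Before dropping rows, it can help to see which ones actually contain missing values. The sketch below uses a small made-up DataFrame (the name `toy` and all of its values are illustrative, not taken from the dataset); the same calls should apply to `recent_grads`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are made up.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 120.0],
    "Median": [50000, 40000, 30000],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]

# Drop those rows and check how many were removed
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(rows_with_missing)
print(rows_dropped)
```

Inspecting `rows_with_missing` first makes it easier to judge whether dropping is harmless or whether an important major would disappear from the analysis.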
We'll now generate some scatter plots with pandas' plotting functionality, rather than calling matplotlib directly, to see whether there is any link between variables.
We'll explore the relations between:
#ScatterPlot1
recent_grads.plot(x="Sample_size", y="Median", kind="scatter")
Scatter Plot 1 does not show a strong correlation between the Median and Sample_size columns. However, most median salaries fall between $20K and $40K when Sample_size is below 1000.
#ScatterPlot2
recent_grads.plot(x="Sample_size", y="Unemployment_rate", kind="scatter")
Scatter Plot 2 does not show a strong correlation between the Unemployment_rate and Sample_size columns either. However, the unemployment rate falls between 5% and 10% most of the time.
#ScatterPlot3
recent_grads.plot(x="Full_time", y="Median", kind="scatter")
Based on Scatter Plot 3, there is no visible link between Median and Full_time.
#ScatterPlot4
recent_grads.plot(x="ShareWomen", y="Unemployment_rate", kind="scatter")
Scatter Plot 4 shows no correlation between Unemployment_rate and ShareWomen.
#ScatterPlot5-6
recent_grads.plot(x="Men", y="Median", kind="scatter")
recent_grads.plot(x="Women", y="Median", kind="scatter")
There appears to be no significant link between the variables compared in scatter plots 5 and 6 either. To reach stronger conclusions, we will draw some other visuals and see whether there are any links between our columns.
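One way to back up the visual impression from the scatter plots is to compute Pearson correlation coefficients for the same column pairs. This is a minimal sketch, not part of the original analysis: the helper name `pairwise_corr` and the `toy` frame are hypothetical, and the real call on `recent_grads` is shown only as a comment.

```python
import pandas as pd

def pairwise_corr(df, pairs):
    """Pearson correlation coefficient for each (x, y) column pair."""
    return {(x, y): df[x].corr(df[y]) for x, y in pairs}

# Column pairs plotted in the scatter plots above
scatter_pairs = [
    ("Sample_size", "Median"),
    ("Sample_size", "Unemployment_rate"),
    ("Full_time", "Median"),
    ("ShareWomen", "Unemployment_rate"),
    ("Men", "Median"),
    ("Women", "Median"),
]

# Tiny made-up frame to show the shape of the result; on the real data
# one would call pairwise_corr(recent_grads, scatter_pairs) instead.
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
toy_corr = pairwise_corr(toy, [("x", "y"), ("x", "z")])
print(toy_corr)
```

Coefficients close to 0 would agree with the "no visible link" reading, while values near 1 or -1 would suggest a linear relationship that the plot's scale may be hiding.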
We'll now create histograms to see the distribution of our columns.
cols = ["Sample_size", "Median", "Employed", "Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(7,30))
for i, col in enumerate(cols):
    # One subplot per column, stacked vertically
    ax = fig.add_subplot(len(cols), 1, i + 1)
    recent_grads[col].plot(kind="hist", ax=ax)
    ax.set_title(col)
The most common salary range is $30K to $40K, according to the Median histogram.
The ShareWomen histogram tells us that women make up around 70 percent of all graduates in 25 to 30 majors. Keep in mind that we have 172 majors in our dataset.
The Unemployment_rate histogram indicates that the unemployment rate is around 6% most of the time.
For the rest of the histograms, increasing the number of bins would allow a finer analysis. However, we'll now proceed to other visuals to save some time.
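As a rough illustration of what increasing the number of bins looks like, the sketch below uses `Series.hist` with the `bins` argument and a `range` that is forwarded to matplotlib. The values are randomly generated stand-ins for `recent_grads["Sample_size"]`, and the choices of 25 bins and a 0 to 5000 range are assumptions picked for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values standing in for recent_grads["Sample_size"]
rng = np.random.default_rng(0)
sample_size = pd.Series(rng.integers(2, 4213, size=172), name="Sample_size")

fig, ax = plt.subplots(figsize=(7, 4))
# bins sets the number of bars; range limits the span of values that is binned
sample_size.hist(bins=25, range=(0, 5000), ax=ax)
ax.set_title("Sample_size")
```

With heavily skewed columns, a narrower `range` can also help, since a few very large values otherwise squeeze most observations into the first bar.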
So far, we created scatter plots to explore relationships between columns and histograms to visualize the distributions of individual columns. We'll now use a scatter matrix to combine both. A scatter matrix plot combines scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Women', 'Men']], figsize=(10,10))
scatter_matrix(recent_grads[["Sample_size","Median"]], figsize=(10,10))
scatter_matrix(recent_grads[["Sample_size","Median","Unemployment_rate"]],
figsize=(10,10))
The 3*3 matrix shows that there is some positive correlation, albeit very weak, between
columns.
Bar plots are among our favorite plots in pandas, and the bars are self-explanatory. We'll use them to compare the first 10 majors (in terms of median salary) with the last 10 on the share of women as well as the unemployment rate.
recent_grads.head(10).plot.bar(x="Major",y="ShareWomen")
recent_grads.tail(10).plot.bar(x="Major", y="ShareWomen")
It is interesting to see that the high-income majors have low participation by women (first plot), while the low-income majors are dominated by women (second plot). This conclusion relies on the dataset being ranked by median salary in descending order.
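To put a number on the contrast between the two bar plots, one could compare the average `ShareWomen` in the first and last rows of the salary-ranked data. The helper `share_women_gap` and the `toy` frame below are hypothetical and only illustrate the idea; on the real data the call would be `share_women_gap(recent_grads, n=10)`.

```python
import pandas as pd

def share_women_gap(df, n=10):
    """Mean ShareWomen in the first n and last n rows of a salary-ranked frame."""
    top_mean = df["ShareWomen"].head(n).mean()
    bottom_mean = df["ShareWomen"].tail(n).mean()
    return top_mean, bottom_mean

# Made-up shares, ordered as if ranked by median salary (highest first)
toy = pd.DataFrame({"ShareWomen": [0.1, 0.2, 0.5, 0.8, 0.9]})
top_mean, bottom_mean = share_women_gap(toy, n=2)
print(top_mean, bottom_mean)
```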
recent_grads.head(10).plot.bar(x="Major", y="Unemployment_rate")
recent_grads.tail(10).plot.bar(x="Major", y="Unemployment_rate")
The two plots above show that high-income majors have relatively lower unemployment rates than those at the bottom of the list. However, it is also important to note that ___nuclear engineering majors___ have one of the highest unemployment rates ___(15%)___ even though they are in the high-income group based on the ranking.
Let's also create some other interesting visuals to explore the dataset further.
A grouped bar plot to compare the number of men and women in different categories of majors
A couple of box plots to see the distributions of the Median and Unemployment_rate columns
A hexagonal bin plot to visualize densely scattered columns
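# Select both columns with a list; tuple-style selection on a groupby fails in recent pandas versions
recent_grads.groupby("Major_category")[["Women", "Men"]].sum().plot(kind='bar')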
The grouped bar plot shows that the business category is the most popular area for students, while the interdisciplinary areas are the least preferred.
Women clearly outnumber men in many categories, including education, communications, health, arts, humanities, and psychology. Business seems to be evenly distributed between men and women.
Men dominate the engineering fields and computers & math.
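The same comparison can be made numerically by aggregating per category and computing the share of women. The sketch below runs on a few made-up rows (the `toy` frame and its numbers are hypothetical); the `groupby` and `sum` calls mirror the ones used for the grouped bar plot.

```python
import pandas as pd

# Hypothetical rows standing in for recent_grads; the numbers are made up.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education", "Education"],
    "Men": [900, 700, 100, 200],
    "Women": [100, 300, 400, 300],
})

# Total men and women per category (the quantities shown in the grouped bar plot)
totals = toy.groupby("Major_category")[["Women", "Men"]].sum()

# Share of women per category, sorted from highest to lowest
totals["ShareWomen"] = totals["Women"] / (totals["Women"] + totals["Men"])
ranked = totals.sort_values("ShareWomen", ascending=False)
print(ranked)
```

Sorting by the computed share makes the categories where one group dominates easy to read off, which is harder to do by eye when the bars differ greatly in height.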
recent_grads[["Median"]].boxplot()
recent_grads[["Unemployment_rate"]].boxplot()
The two box and whisker plots show us there are some outliers in our dataset where the unemployment rate could rise up to 18% for some majors. The median salary for college graduates might go up to $75,000 in some cases but these could be considered outliers as well.
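By default, matplotlib box plots draw whiskers up to 1.5 times the interquartile range (IQR) beyond the quartiles and mark points outside them as outliers. The sketch below applies that rule directly so the flagged values can be listed; the helper `iqr_outliers` and the sample salaries are hypothetical, and on the real data one might call `iqr_outliers(recent_grads["Median"])`.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up median salaries for illustration
toy_median = pd.Series([30000, 32000, 35000, 36000, 38000, 40000, 110000])
outliers = iqr_outliers(toy_median)
print(outliers)
```

Listing the outliers by value (and, on the real data, by major) makes it possible to check whether they are data problems or simply unusual majors.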
recent_grads.plot.hexbin(x='Men', y='Median', gridsize=30)
recent_grads.plot.hexbin(x='Women', y='Median', gridsize=30)
The hexagonal bin plots show us that women and men are similar in their median earnings; however, the plot for women has two dense regions: one around $35,000 and one around $40,000.
Median earnings for men are around $35,000 most of the time.