In this project, we'll explore various kinds of plots to visualize the job outcomes of students who graduated from college between 2010 and 2012. The dataset, which was released by the American Community Survey, was cleaned and uploaded to a [GitHub repo](https://github.com/fivethirtyeight/data/tree/master/college-majors) by FiveThirtyEight.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
print(recent_grads.iloc[0])
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.00000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 |
mean | 87.377907 | 3895.953488 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 357.941860 | 31355.80814 | 26165.767442 | 8877.232558 | 19798.843023 | 2428.412791 | 0.068024 | 40076.744186 | 29486.918605 | 51386.627907 | 12387.401163 | 13354.325581 | 3878.633721 |
std | 49.983181 | 1679.240095 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 619.680419 | 50777.42865 | 42957.122320 | 14679.038729 | 33229.227514 | 4121.730452 | 0.030340 | 11461.388773 | 9190.769927 | 14882.278650 | 21344.967522 | 23841.326605 | 6960.467621 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.00000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.750000 | 2403.750000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 42.000000 | 3734.75000 | 3181.000000 | 1013.750000 | 2474.750000 | 299.500000 | 0.050261 | 33000.000000 | 24000.000000 | 41750.000000 | 1744.750000 | 1594.000000 | 336.750000 |
50% | 87.500000 | 3608.500000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 131.000000 | 12031.50000 | 10073.500000 | 3332.500000 | 7436.500000 | 905.000000 | 0.067544 | 36000.000000 | 27000.000000 | 47000.000000 | 4467.500000 | 4603.500000 | 1238.500000 |
75% | 130.250000 | 5503.250000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 339.000000 | 31701.25000 | 25447.250000 | 9981.000000 | 17674.750000 | 2397.000000 | 0.087247 | 45000.000000 | 33250.000000 | 58500.000000 | 14595.750000 | 11791.750000 | 3496.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.00000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
recent_grads.head(5)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 rows × 21 columns
recent_grads.tail(5)
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
5 rows × 21 columns
recent_grads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
Rank                    173 non-null int64
Major_code              173 non-null int64
Major                   173 non-null object
Total                   172 non-null float64
Men                     172 non-null float64
Women                   172 non-null float64
Major_category          173 non-null object
ShareWomen              172 non-null float64
Sample_size             173 non-null int64
Employed                173 non-null int64
Full_time               173 non-null int64
Part_time               173 non-null int64
Full_time_year_round    173 non-null int64
Unemployed              173 non-null int64
Unemployment_rate       173 non-null float64
Median                  173 non-null int64
P25th                   173 non-null int64
P75th                   173 non-null int64
College_jobs            173 non-null int64
Non_college_jobs        173 non-null int64
Low_wage_jobs           173 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB
Observe that a few of our columns (Total, Men, Women, and ShareWomen) have missing values. Let's take a closer look at these missing values and decide how to handle them.
print(recent_grads.isnull().sum())
Rank                    0
Major_code              0
Major                   0
Total                   1
Men                     1
Women                   1
Major_category          0
ShareWomen              1
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64
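Before dropping anything, it can help to look at the affected rows directly. The sketch below is only an illustration: `sample` is a small made-up frame standing in for `recent_grads`, and its values are hypothetical. It selects every row with at least one missing value using `isnull().any(axis=1)` and then drops those rows with `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are illustrative only.
sample = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Total": [2339.0, np.nan, 856.0],
    "Men": [2057.0, np.nan, 725.0],
    "Women": [282.0, np.nan, 131.0],
})

# Boolean mask: True for rows that contain at least one missing value.
has_missing = sample.isnull().any(axis=1)
missing_rows = sample[has_missing]
print(missing_rows)

# Drop the rows that contain missing values.
cleaned = sample.dropna(axis=0)
print(len(sample), "->", len(cleaned))
```

Applied to the real data, `recent_grads[recent_grads.isnull().any(axis=1)]` would presumably show which major is affected. The row counts reported in this notebook (173 before cleaning, 172 after) indicate that all four missing values sit in a single row.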
#check number of rows before dropping missing values
raw_data_count = len(recent_grads)
raw_data_count
173
# Drop rows with missing values
recent_grads = recent_grads.dropna(axis=0)
# Despite its name, cleaned_data_count holds the cleaned DataFrame, not a count
cleaned_data_count = recent_grads
# Check the number of rows after the row-removal operation
len(cleaned_data_count)
172
# Check the number of missing values after the above operations
print(cleaned_data_count.isnull().sum(), '\n')
print("Number of rows before cleaning:", raw_data_count)
print("Number of rows after cleaning:", len(cleaned_data_count))
Rank                    0
Major_code              0
Major                   0
Total                   0
Men                     0
Women                   0
Major_category          0
ShareWomen              0
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64

Number of rows before cleaning: 173
Number of rows after cleaning: 172
The missing values in our data have been handled, so we can now move on to the visualizations.
We'll generate scatter plots of the following column pairs to explore the data further and answer a few questions:
ax = recent_grads.plot(x = 'Sample_size', y = 'Median', kind = 'scatter')
ax.set_title('Sample Size Vs Median')
From the above plot, most majors have a sample size below 1,000, with median salaries ranging from 20,000 to 45,000. A smaller group of majors has medians from 46,000 to 78,000, and the few majors with sample sizes of about 1,000 to 5,000 have medians of about 28,000 to 65,000.
ax = recent_grads.plot(x = 'Sample_size', y = 'Unemployment_rate', kind = 'scatter')
ax.set_title('Sample Size Vs Unemployment_rate')
The figure above plots each major's unemployment rate against its sample size for graduates from 2010 to 2012. The x-axis is sample size rather than time, so the plot cannot show a trend over time. Unemployment rates across majors range from 0 to about 0.18.
ax = recent_grads.plot(x = 'Full_time', y = 'Median', kind = 'scatter')
ax.set_title('Full_time Vs Median')
For most majors, full-time employed graduates have a median salary of about 20,000 to 40,000; a smaller number of majors have medians from 41,000 to about 79,000.
ax = recent_grads.plot(x = 'ShareWomen', y = 'Unemployment_rate', kind = 'scatter')
ax.set_title('ShareWomen Vs Unemployment_rate')
These two columns (Unemployment_rate and ShareWomen) have a very weak correlation, so the plot suggests no clear relationship between them.
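A visual impression of weak correlation can be checked with a number. As a minimal sketch, assuming the Pearson coefficient is an acceptable measure here, `Series.corr` returns a value between -1 and 1, where values near 0 suggest a weak linear relationship. The `sample` frame and its values are hypothetical stand-ins for the two `recent_grads` columns.

```python
import pandas as pd

# Hypothetical values standing in for two recent_grads columns.
sample = pd.DataFrame({
    "ShareWomen": [0.12, 0.34, 0.53, 0.70, 0.88],
    "Unemployment_rate": [0.018, 0.061, 0.117, 0.046, 0.105],
})

# Pearson correlation coefficient (the default method of Series.corr).
r = sample["ShareWomen"].corr(sample["Unemployment_rate"])
print(round(r, 3))
```

On the real data the equivalent call would be `recent_grads['ShareWomen'].corr(recent_grads['Unemployment_rate'])`.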
ax = recent_grads.plot(x = 'Men', y = 'Median', kind = 'scatter')
ax.set_title('Men Vs Median')
The above plot reveals that most majors have fewer than 50,000 men and a median salary between 20,000 and 45,000, while a few majors have medians from 46,000 to 79,000.
ax = recent_grads.plot(x = 'Women', y = 'Median', kind = 'scatter')
ax.set_title('Women Vs Median')
The above plot reveals that most majors have fewer than 50,000 women and a median salary between 20,000 and 45,000, while a few majors have medians from 46,000 to 79,000. Note that this plot contains a few outliers.
Let's generate histograms to answer some more questions about our data.
cols = ["Sample_size", "Median", "Employed", "Full_time",
        "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(5, 12))
# Histograms for the first four columns (list indices 0 to 3)
for plot in range(0, 4):
    ax = fig.add_subplot(4, 1, plot + 1)
    ax = recent_grads[cols[plot]].plot(kind='hist', rot=50, ax=ax)
    ax.set_title(cols[plot])
cols = ["Sample_size", "Median", "Employed", "Full_time",
        "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(5, 12))
# Histograms for the last four columns (list indices 4 to 7)
for plot in range(4, 8):
    ax = fig.add_subplot(4, 1, plot - 3)
    ax = recent_grads[cols[plot]].plot(kind='hist', rot=50, ax=ax)
    ax.set_title(cols[plot])
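Index arithmetic in subplot loops is easy to get wrong, so one alternative is to build the whole grid once and pair each column with an axis. The sketch below is one possible layout, not the approach used in this notebook: it assumes a 4x2 grid, uses random hypothetical data in `sample` in place of `recent_grads`, and selects the non-interactive `Agg` backend only so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

cols = ["Sample_size", "Median", "Employed", "Full_time",
        "ShareWomen", "Unemployment_rate", "Men", "Women"]

# Hypothetical random data standing in for recent_grads.
rng = np.random.default_rng(0)
sample = pd.DataFrame({col: rng.random(50) for col in cols})

# One 4x2 grid; zip pairs each column with exactly one subplot.
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(10, 12))
for ax, col in zip(axes.flat, cols):
    sample[col].plot(kind="hist", ax=ax, rot=50)
    ax.set_title(col)
fig.tight_layout()
```

Because `zip` walks the column list and the axes together, no column can be skipped or drawn twice.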
We generate a scatter matrix plot, which combines the functionality of scatter plots and histograms.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10, 10))
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10, 10))
Let's compare the percentages of some columns with bar plots
recent_grads[:10].plot.bar(x= 'Major', y= 'ShareWomen', legend =False)
recent_grads[-10:].plot.bar(x= 'Major', y= 'ShareWomen', legend =False)
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend=False)
recent_grads[-10:].plot.bar(x= 'Major', y= 'Unemployment_rate', legend =False)
Let's use a grouped bar plot to compare the number of men with the number of women in each category of majors.
recent_grads.groupby('Major_category')[['Men', 'Women']].sum().plot(kind='bar')
From the above figure, women outnumber men in most of the major categories.
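The same comparison can be made numerically instead of being read off the bars. The following sketch, built on a hypothetical `sample` frame with made-up values, sums men and women per category and flags the categories where women outnumber men. The added column names `ShareWomen` and `MoreWomen` on the grouped result are illustrative choices.

```python
import pandas as pd

# Hypothetical rows standing in for recent_grads; values are illustrative only.
sample = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education",
                       "Education", "Business"],
    "Men": [2057, 679, 134, 500, 900],
    "Women": [282, 77, 964, 1500, 1100],
})

# Sum men and women within each category (note the list inside the brackets).
totals = sample.groupby("Major_category")[["Men", "Women"]].sum()

# Share of women per category, and a flag where women outnumber men.
totals["ShareWomen"] = totals["Women"] / (totals["Men"] + totals["Women"])
totals["MoreWomen"] = totals["Women"] > totals["Men"]
print(totals)
print(int(totals["MoreWomen"].sum()), "of", len(totals),
      "categories have more women than men")
```

Counting the `True` values in such a flag column gives a direct check on how many categories the claim covers.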
recent_grads[['Median']].boxplot()
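The above figure shows that the lower half of the box spans median salaries from about 33,000 (the first quartile) to 36,000 (the median), and the upper half spans 36,000 to 45,000 (the third quartile). These values match the 25%, 50%, and 75% rows of the `describe()` output.
Next up, we use hexagonal bin plots to visualize the column pairs whose scatter plots are dense.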
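The parts of a box plot can be reproduced with `quantile`. The sketch below uses a short hypothetical series of salaries (not the real column) to compute the quartiles that define the box, the interquartile range, and the upper limit beyond which points are drawn as outliers, assuming matplotlib's default whisker length of 1.5 times the interquartile range.

```python
import pandas as pd

# Hypothetical median salaries; illustrative values only.
median = pd.Series([22000, 33000, 36000, 45000, 110000])

# Quartiles that define the bottom, middle line, and top of the box.
q1, q2, q3 = median.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# With the default whisker setting, points above this limit count as outliers.
upper_limit = q3 + 1.5 * iqr
outliers = median[median > upper_limit]
print(q1, q2, q3, iqr)
print(outliers.tolist())
```

On the real data, `recent_grads['Median'].quantile([0.25, 0.5, 0.75])` would return the three values that bound the box.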
recent_grads.plot.hexbin(x = 'Men', y='Median', gridsize=30)
recent_grads.plot.hexbin(x = 'Women', y='Median', gridsize=30)