This data is taken from FiveThirtyEight (GitHub).
The dataset covers the job outcomes of students who graduated from college between 2010 and 2012.
Rank - Rank by median earnings (the dataset is ordered by this column).
Major_code - Major code.
Major - Major description.
Major_category - Category of major.
Total - Total number of people with major.
Sample_size - Sample size (unweighted) of full-time.
Men - Male graduates.
Women - Female graduates.
ShareWomen - Women as share of total.
Employed - Number employed.
Median - Median salary of full-time, year-round workers.
Low_wage_jobs - Number in low-wage service jobs.
Full_time - Number employed 35 hours or more.
Part_time - Number employed less than 35 hours.
*We visualise the data with histograms, scatter plots, scatter matrix plots and bar charts to see whether useful insights can be drawn from them.*
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 100)
recent_grads = pd.read_csv("recent-grads.csv")
print("Recent-Grads First Five Rows\n")
recent_grads.head()
Recent-Grads First Five Rows
| | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | 1849 | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | 556 | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | 558 | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | 1069 | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | 23170 | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
recent_grads.describe()
| | Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
raw_data_count = len(recent_grads.index)
print(raw_data_count)
173
recent_grads = recent_grads.dropna()
cleaned_data_count = len(recent_grads.index)
print(cleaned_data_count)
172
*Only one row contained null values, and it has been removed.*
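Before dropping rows, it can help to check which columns and rows actually hold the missing values. The following is a minimal sketch on a small hand-made frame: the `toy` data and its "EXAMPLE MAJOR" row are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; the second row is invented to carry NaNs.
toy = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "EXAMPLE MAJOR", "CHEMICAL ENGINEERING"],
    "Total": [2339.0, np.nan, 32260.0],
    "ShareWomen": [0.120564, np.nan, 0.341631],
    "Median": [110000, 53000, 65000],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Rows that contain at least one missing value
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls)

# dropna() removes exactly those rows
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```

Run against the raw `recent_grads` frame (before `dropna()`), the same two expressions would show which major was removed and why.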
# Sample_size vs unemployment_rate
sample_vs_unemployment_rate = recent_grads.plot(x="Sample_size", y="Unemployment_rate", kind="scatter")
# ShareWomen vs Unemployment_rate
ShareWomen_vs_Unemployment_rate = recent_grads.plot(x="ShareWomen", y="Unemployment_rate", kind="scatter")
# Sample_size vs Median
sample_vs_median = recent_grads.plot(x="Sample_size", y="Median", kind="scatter")
# Full_time vs Median
Full_time_vs_Median = recent_grads.plot(x="Full_time", y="Median", kind="scatter")
# Men_vs_median
Men_vs_median = recent_grads.plot(x="Men", y="Median", kind="scatter")
# Women_vs_median
Women_vs_median = recent_grads.plot(x="Women", y="Median", kind="scatter")
*In the 2nd plot, 'ShareWomen' vs 'Unemployment_rate', there is no visible correlation between the two variables.
The other scatter plots do not reveal much useful information either.
Next, we will try to get some insights from histograms.*
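A scatter plot only suggests whether two columns move together; `DataFrame.corr()` puts a number on it. The sketch below uses only the five rows displayed in the `head()` output, so it illustrates the call and says nothing about the full dataset.

```python
import pandas as pd

# Values copied from the five rows shown in the head() output
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098],
    "Median": [110000, 75000, 73000, 70000, 65000],
})

# Pearson correlation matrix; every entry lies between -1 and 1
corr = sample.corr()
print(corr.round(2))

# Correlation for a single pair of columns
r = sample["ShareWomen"].corr(sample["Unemployment_rate"])
print(round(r, 2))
```

On the cleaned frame, the equivalent call would be `recent_grads[["ShareWomen", "Unemployment_rate", "Median"]].corr()`.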
cols = ["Sample_size", "Median", "Employed", "Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize = (8, 16))
for x in range(4):
    ax = fig.add_subplot(4, 1, x+1)
    ax = recent_grads[cols[x]].plot(kind='hist')
    ax.set_title(cols[x])
fig1 = plt.figure(figsize=(8, 16))
for x in range(4, 8):
    ax = fig1.add_subplot(4, 1, (x-3))
    ax = recent_grads[cols[x]].plot(kind='hist')
    ax.set_title(cols[x])
*From the 6th histogram (Unemployment_rate), we can observe that roughly 85% of the majors have an unemployment rate of less than 10%.
The rest of the histograms don't give many useful insights, so let's use a scatter matrix plot.*
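A percentage read off a histogram is approximate, because it depends on where the bin edges fall. The share of values below a threshold can be computed directly instead. The sketch below uses a made-up series (`rates` is invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical unemployment rates, invented for illustration only
rates = pd.Series([0.02, 0.05, 0.06, 0.07, 0.08, 0.09, 0.11, 0.12, 0.04, 0.03])

# The mean of a boolean mask is the fraction of True values
share_below_10pct = (rates < 0.10).mean()
print(f"{share_below_10pct:.0%} of values are below 10%")
```

With the notebook's frame, the same idea would read `(recent_grads["Unemployment_rate"] < 0.10).mean()`.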
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize=(10, 6))
scatter_matrix(recent_grads[["Median", "ShareWomen", "Unemployment_rate"]], figsize=(15, 9))
plt.show()
*Here we can observe a negative correlation between ShareWomen and Median, which means that majors with a higher median salary tend to have a lower share of women.
A possible explanation is that high-paying fields such as engineering tend to have a lower share of women.
Let's see whether bar plots support this theory.*
import numpy as np
import matplotlib.pyplot as plt
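# Preparing DataFrames for plotting.
# The file is ordered by Rank (median earnings), so sort by ShareWomen first;
# otherwise the slices below would pick the highest- and lowest-paid majors.
by_share = recent_grads.sort_values('ShareWomen')
Bottom_10_share_woman = by_share[['ShareWomen', 'Median']][:10]
Bottom_10_share_woman.set_index(pd.Series([x[:14] for x in by_share['Major'][:10]]), inplace=True)
Top_10_share_woman = by_share[['ShareWomen', 'Median']][-10:]
Top_10_share_woman.set_index(pd.Series([x[:14] for x in by_share['Major'][-10:]]), inplace=True)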
# Plotting Data
fig = plt.figure(figsize=(14, 6))
ax1= fig.add_subplot(1, 2, 1)
ax2= fig.add_subplot(1, 2, 2)
Bottom_10_share_woman.plot.bar(ax=ax1, secondary_y='ShareWomen', title= 'Least Share of Women')
Top_10_share_woman.plot.bar(ax=ax2, secondary_y='ShareWomen', ylim=(0, 115000), title= 'Most Share of Women')
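# Setting up axes: use the same scales on both charts so they are comparable
ax1.set_ylim(0, 115000)
ax2.set_ylim(0, 115000)
ax1.set_xlabel('')
ax2.set_xlabel('')
ax1.right_ax.set_ylim(0, 1.0)
ax2.right_ax.set_ylim(0, 1.0)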
plt.show()
*Here we can observe that majors with a higher share of women have a lower median salary.
Meanwhile, majors with a lower share of women have a higher median salary.*
*Both plots support our theory that high-paying fields such as engineering tend to have a lower share of women,
which in turn leads to the negative correlation between ShareWomen and Median.*
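One way to check the engineering explanation more directly would be to average both columns per `Major_category`. This is a minimal sketch on invented rows: the numbers in `toy` are hypothetical and only show the shape of the computation, not a result from the dataset.

```python
import pandas as pd

# Hypothetical rows, invented for illustration; the real frame has the same columns
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education", "Education"],
    "ShareWomen": [0.12, 0.34, 0.92, 0.80],
    "Median": [110000, 65000, 32000, 34000],
})

# Mean share of women and mean median salary per category, highest salary first
by_category = (
    toy.groupby("Major_category")[["ShareWomen", "Median"]]
    .mean()
    .sort_values("Median", ascending=False)
)
print(by_category)
```

Replacing `toy` with `recent_grads` would give one row per major category, which makes it easier to see whether the pattern holds beyond the ten majors at each end.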