This notebook was created as a guided project for the "Exploratory Data Visualization" course on DataQuest.io. It visualizes earnings by college major.
Data set can be downloaded from here
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.head()
recent_grads.describe()
recent_grads.shape
recent_grads.info()
The output of info() shows that some columns have only 172 non-null values. We drop the rows with missing values so that every column has the same number of values.
recent_grads.dropna(inplace=True)
recent_grads.shape
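Before dropping rows, it can help to see which columns and rows actually contain missing values. The sketch below uses a small made-up frame called `toy` (a hypothetical stand-in, not the real `recent_grads` data) to show one way to do that.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for recent_grads (values are made up).
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Total': [100.0, np.nan, 300.0],
    'Men': [60.0, np.nan, 120.0],
    'Women': [40.0, np.nan, 180.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# The rows that dropna() would remove.
rows_with_missing = toy[toy.isna().any(axis=1)]

# Drop them, returning a new frame instead of mutating in place.
cleaned = toy.dropna()

print(missing_per_column)
print(rows_with_missing)
print(cleaned.shape)
```

The same three calls should work on `recent_grads` directly, assuming it has been loaded as above.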
recent_grads['Major_category'].nunique()
# Select several columns with a list; the bare-tuple form fails in recent pandas versions
recent_grads.groupby('Major_category')[['Men', 'Women']].sum().plot(kind='bar')
We see that there is a significant gender gap in Education, Engineering, Health and Psychology & Social Work.
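The gap can also be expressed as a number per category, which is easier to compare than bar heights. This is a minimal sketch on made-up data; `toy` and `share_women` are hypothetical names, and the values do not come from the real data set.

```python
import pandas as pd

# Hypothetical toy frame standing in for recent_grads (values are made up).
toy = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Education', 'Education'],
    'Men': [800, 700, 100, 200],
    'Women': [200, 300, 400, 300],
})

# Total men and women per category.
totals = toy.groupby('Major_category')[['Men', 'Women']].sum()

# Share of women among all graduates in each category, sorted ascending.
share_women = (totals['Women'] / (totals['Men'] + totals['Women'])).sort_values()

print(share_women)
```

A share far from 0.5 in either direction indicates a large gap.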
# Select several columns with a list; the bare-tuple form fails in recent pandas versions
recent_grads.groupby('Major_category')[['Employed', 'Unemployed']].sum().plot(kind='bar')
The chart shows close to zero unemployment in Agriculture & Natural Resources.
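Raw counts on a shared axis can make a small category look like zero. One way to check is to compute a rate per category. The sketch below assumes the rate is defined as Unemployed / (Employed + Unemployed), and uses a made-up frame `toy` rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for recent_grads (values are made up).
toy = pd.DataFrame({
    'Major_category': ['Agriculture & Natural Resources', 'Business', 'Business'],
    'Employed': [950, 9000, 9000],
    'Unemployed': [50, 1000, 1000],
})

# Total employed and unemployed per category.
totals = toy.groupby('Major_category')[['Employed', 'Unemployed']].sum()

# Assumed definition of the rate: unemployed as a fraction of employed + unemployed.
unemployment_rate = totals['Unemployed'] / (totals['Employed'] + totals['Unemployed'])

print(unemployment_rate)
```

In this toy example the smaller category has only 50 unemployed graduates, which would be nearly invisible next to 2000, yet its rate is 5%.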
recent_grads.groupby('Major_category')[['College_jobs', 'Non_college_jobs',
                                        'Low_wage_jobs']].sum().plot(kind='bar')
recent_grads.groupby('Major_category')['Median'].sum().plot(kind='bar')
Engineering has the highest summed Median salary, with Business second.
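A sum of medians grows with the number of majors in a category, so a category with many majors can rank high even when each major pays modestly. As a cross-check, the sketch below compares the sum with the mean on a made-up frame; `toy` and `by_category` are hypothetical names, and the numbers are not from the real data.

```python
import pandas as pd

# Hypothetical toy frame: category X has three low-paying majors,
# category Y has one high-paying major (values are made up).
toy = pd.DataFrame({
    'Major_category': ['X', 'X', 'X', 'Y'],
    'Median': [30000, 30000, 30000, 60000],
})

# Sum, mean and number of majors per category, side by side.
by_category = toy.groupby('Major_category')['Median'].agg(['sum', 'mean', 'count'])

print(by_category)
print('Highest by sum: ', by_category['sum'].idxmax())
print('Highest by mean:', by_category['mean'].idxmax())
```

Here X wins on the sum only because it has more majors, while Y wins on the mean. Running the same aggregation on `recent_grads` would show whether the ranking above holds under both measures.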
recent_grads[recent_grads['Major_category']=='Engineering']['Median']
recent_grads['Median'].hist(bins=20, range=(recent_grads['Median'].min(),recent_grads['Median'].max()))
30k - 35k is the most common median salary range. Note that this histogram covers all majors, not only Engineering.
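To look at the distribution for Engineering alone, the rows can be filtered before binning. This is a minimal sketch on made-up data: `toy`, `engineering` and the bin edges are hypothetical choices, and `np.histogram` is used here only to show the counts without drawing a plot.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for recent_grads (values are made up).
toy = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Engineering', 'Arts', 'Arts'],
    'Median': [52000, 58000, 75000, 31000, 33000],
})

# Keep only the Median values of Engineering majors.
engineering = toy.loc[toy['Major_category'] == 'Engineering', 'Median']

# Count how many Engineering majors fall into each 10k-wide bin.
counts, edges = np.histogram(engineering, bins=[50000, 60000, 70000, 80000])

print(counts)
print(edges)
```

Calling `engineering.hist(bins=20)` on the filtered Series would draw the corresponding plot.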
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))