We'll explore how to use the pandas plotting functionality to analyse a dataset on the job outcomes of students who graduated from college between 2010 and 2012. We will first read recent-grads.csv into pandas and assign the resulting DataFrame to recent_grads, then use recent_grads.iloc[] to return the first row formatted as a table.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
print(recent_grads.iloc[0])
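As a small illustration of what iloc returns, the sketch below uses a made-up three-row DataFrame (the name toy and its values are hypothetical, not taken from recent-grads.csv). Selecting with a single integer gives a Series, while selecting with a list gives a one-row DataFrame, which is closer to a table layout.

```python
import pandas as pd

# Hypothetical stand-in for recent-grads.csv; the values are invented.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Median': [50000, 40000, 30000],
})

# A single integer returns a Series: one entry per column, printed vertically.
first_as_series = toy.iloc[0]

# A list of positions returns a DataFrame: here a one-row table.
first_as_frame = toy.iloc[[0]]

print(type(first_as_series).__name__)
print(type(first_as_frame).__name__)
```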
We use recent_grads.head() and recent_grads.tail() to become familiar with how the data is structured.
recent_grads.head()
recent_grads.tail()
We use recent_grads.describe() to generate summary statistics for all the numeric columns.
recent_grads.describe()
To clean up the dataset, we have to drop rows with missing values. Matplotlib expects the columns of values we pass in to have matching lengths, and missing values will cause it to throw errors.
raw_data_count = recent_grads.shape[0]  # number of rows before cleaning (173)
We use recent_grads.dropna() to drop rows containing missing values and assign the resulting DataFrame back to recent_grads.
recent_grads = recent_grads.dropna()
recent_grads.info()
cleaned_data_count = recent_grads.shape[0]  # number of rows after cleaning (172)
If you compare cleaned_data_count and raw_data_count, you'll notice that only one row contained missing values and was dropped.
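The row counts can also be checked in code, and the rows that dropna() removes can be inspected before they are dropped. The following is a minimal sketch on a made-up four-row DataFrame (toy and its values are hypothetical); the same calls should apply to recent_grads.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the 'Men' column.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'Men': [100.0, np.nan, 300.0, 400.0],
    'Women': [50.0, 60.0, 70.0, 80.0],
})

raw_count = len(toy)

# Rows where at least one column is missing; these are the ones dropna() removes.
rows_with_missing = toy[toy.isna().any(axis=1)]

cleaned = toy.dropna()
cleaned_count = len(cleaned)
dropped = raw_count - cleaned_count

print(rows_with_missing)
print(raw_count, cleaned_count, dropped)
```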
We will generate scatter plots to explore the following relations:
Sample_size and Median
Sample_size and Unemployment_rate
Full_time and Median
ShareWomen and Unemployment_rate
Men and Median
Women and Median
recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title='Median vs. Sample_size')
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. Sample_size')
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Median vs. Full_time')
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='Unemployment_rate vs. ShareWomen')
recent_grads.plot(x='Men', y='Median', kind='scatter', title='Median vs. Men')
recent_grads.plot(x='Women', y='Median', kind='scatter', title='Median vs. Women')
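Since the scatter plots above differ only in their column names, they could also be drawn in a loop onto one figure. This is a sketch under assumptions: toy is a made-up DataFrame that borrows some of the column names, and the grid layout and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Sample_size': [10, 50, 200, 1000],
    'Full_time': [800, 3000, 15000, 90000],
    'Median': [60000, 45000, 38000, 35000],
    'Unemployment_rate': [0.02, 0.05, 0.07, 0.06],
})

# (x, y) column pairs to plot.
pairs = [
    ('Sample_size', 'Median'),
    ('Sample_size', 'Unemployment_rate'),
    ('Full_time', 'Median'),
]

# One subplot per pair, stacked vertically.
fig, axes = plt.subplots(nrows=len(pairs), ncols=1, figsize=(6, 12))
for ax, (x_col, y_col) in zip(axes, pairs):
    toy.plot(x=x_col, y=y_col, kind='scatter', ax=ax,
             title=f'{y_col} vs. {x_col}')
fig.tight_layout()
```

Passing ax= tells pandas to draw on an existing subplot rather than creating a new figure for every call.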
To explore how the values in a column are distributed, we will generate histograms for the following columns:
Sample_size
Median
Employed
Full_time
ShareWomen
Unemployment_rate
Men
Women
recent_grads['Sample_size'].hist(bins=20, range=(0,5000)).set_title('Distribution of Sample_size')
recent_grads['Median'].hist(bins=20, range=(20000,120000)).set_title('Distribution of Median')
recent_grads['Employed'].hist(bins=25, range=(0,300000)).set_title('Distribution of Employed')
recent_grads['Full_time'].hist(bins=25, range=(0,30000)).set_title('Distribution of Full_time')
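# ShareWomen is a proportion, so the bins span 0 to 1
recent_grads['ShareWomen'].hist(bins=20, range=(0,1)).set_title('Distribution of ShareWomen')
# Unemployment_rate is also a proportion; the range assumes rates stay below 0.2, so widen it if they do not
recent_grads['Unemployment_rate'].hist(bins=20, range=(0,0.2)).set_title('Distribution of Unemployment_rate')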
recent_grads['Men'].hist(bins=20, range=(0,20000)).set_title('Distribution of Men')
recent_grads['Women'].hist(bins=20, range=(0,20000)).set_title('Distribution of Women')
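When several histograms are produced in a row, giving each one its own subplot keeps them from being drawn on top of each other. The sketch below assumes a made-up DataFrame (toy, with hypothetical values) and lets matplotlib pick the bin range from the data instead of fixing it by hand.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Sample_size': [10, 50, 200, 1000, 30, 75],
    'Median': [60000, 45000, 38000, 35000, 52000, 41000],
    'ShareWomen': [0.12, 0.35, 0.58, 0.77, 0.91, 0.44],
})

cols = ['Sample_size', 'Median', 'ShareWomen']

# One subplot per column, side by side.
fig, axes = plt.subplots(nrows=1, ncols=len(cols), figsize=(15, 4))
for ax, col in zip(axes, cols):
    # No range= argument: the bins span the observed minimum and maximum.
    toy[col].hist(bins=5, ax=ax)
    ax.set_title(f'Distribution of {col}')
fig.tight_layout()
```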
We will create a scatter matrix plot with the scatter_matrix() function, which lets us explore potential relationships and distributions simultaneously. First we have to import the function from the pandas.plotting module.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size','Median']], figsize=(8,8))
scatter_matrix(recent_grads[['Sample_size','Median', 'Unemployment_rate']], figsize=(12,12))
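To make the layout of a scatter matrix concrete: for n columns, scatter_matrix() returns an n by n array of Axes, with histograms on the diagonal by default and pairwise scatter plots elsewhere. A minimal sketch on made-up data (toy and its values are hypothetical):

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Sample_size': [10, 50, 200, 1000, 30, 75],
    'Median': [60000, 45000, 38000, 35000, 52000, 41000],
    'Unemployment_rate': [0.02, 0.05, 0.07, 0.06, 0.03, 0.08],
})

# Three columns give a 3 x 3 grid of subplots.
axes = scatter_matrix(
    toy[['Sample_size', 'Median', 'Unemployment_rate']],
    figsize=(8, 8),
)
print(axes.shape)
```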
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')
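The bar plots above compare the first ten and the last ten rows of the DataFrame, so what they show depends on the order of the rows in the file. As a sketch of how the two slices might be placed side by side for comparison, the example below uses a made-up six-row DataFrame (toy and its values are hypothetical) and slices of three rows instead of ten.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D', 'E', 'F'],
    'ShareWomen': [0.12, 0.20, 0.35, 0.58, 0.77, 0.91],
})

# sharey=True puts both panels on the same vertical scale.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4), sharey=True)

# First three rows on the left, last three rows on the right.
toy[:3].plot.bar(x='Major', y='ShareWomen', ax=axes[0],
                 legend=False, title='First three rows')
toy[-3:].plot.bar(x='Major', y='ShareWomen', ax=axes[1],
                  legend=False, title='Last three rows')
fig.tight_layout()
```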