In this project, I'll use the pandas plotting functionality along with the Jupyter notebook interface to explore and visualize data.
I'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012.
Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv("recent-grads.csv")
# Display the first row, formatted as a table
print(recent_grads.iloc[0])
# Look at the first and last rows to become familiar with how the data is structured
print(recent_grads.head())
print(recent_grads.tail())
# to generate summary statistics for all of the numeric columns
recent_grads.describe()
# Look up the number of rows before cleaning
raw_data_count = recent_grads.shape[0]
print(raw_data_count)
# Drop rows with missing values.
# dropna returns a new DataFrame; with inplace=True it returns None,
# so the result is assigned back without inplace.
recent_grads = recent_grads.dropna(axis=0)
# Look up the number of rows again to check whether any were dropped
cleaned_data_count = recent_grads.shape[0]
print(cleaned_data_count)
Comparing the two counts, the raw data has 173 rows while the cleaned data has 172. This means one row was removed because it contained a missing value.
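To see which rows a call like this removes, the check can be sketched on a small stand-in table. The `sample` frame and its values below are hypothetical and are not taken from recent-grads.csv; only `isna`, `any`, and `dropna` are the actual pandas calls being illustrated.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads: three majors, one with a missing value
sample = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Men": [100.0, None, 300.0],
    "Median": [50000, 40000, 30000],
})

rows_before = sample.shape[0]

# Select the rows that contain at least one missing value
rows_with_na = sample[sample.isna().any(axis=1)]

# dropna returns a new DataFrame with those rows removed
cleaned = sample.dropna(axis=0)
rows_after = cleaned.shape[0]

print(rows_before, rows_after)
print(rows_with_na)
```

Running the same `isna().any(axis=1)` filter on `recent_grads` before dropping would show which major accounts for the difference between the two row counts.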
recent_grads.plot(x="Sample_size", y="Median", kind="scatter", title="Sample_size VS Median")
recent_grads.plot(x="Sample_size", y="Unemployment_rate", kind="scatter", title="Sample_size VS Unemployment_rate")
recent_grads.plot(x="Full_time", y="Median", kind="scatter", title="Full_time VS Median")
recent_grads.plot(x="ShareWomen", y="Unemployment_rate", kind="scatter", title="ShareWomen VS Unemployment_rate")
recent_grads.plot(x="Men", y="Median", kind="scatter", title="Men VS Median")
recent_grads.plot(x="Women", y="Median", kind="scatter", title="Women VS Median")