#!/usr/bin/env python
# coding: utf-8

# # Visualizing Earnings Based on College Majors 

# Over the years, many parents and students have wondered which college majors offer the best chance of success after graduation. This usually comes down to the "return on investment" (ROI) on the time and resources spent in college. This is not to say that some majors are irrelevant, only that some are valued more highly than others in society and the world at large.
# 
# This project uses a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data comes from the [American Community Survey](https://www.census.gov/programs-surveys/acs/), which conducts surveys and aggregates the data.
# However, we use a cleaned version of the data released by FiveThirtyEight on their [GitHub repo](https://github.com/fivethirtyeight/data/tree/master/college-majors).
# 
# ## **Aim of the project**
# This project explores the following questions using several visualization techniques provided by the Matplotlib library:
# * Do students in more popular majors make more money?
#   * Using scatter plots
# * How many majors are predominantly male? Predominantly female?
#   * Using histograms
# * Which category of majors has the most students?
#   * Using bar plots
#   
# Lastly, below are the columns in the dataset and their respective definitions:
# * `Rank` - Rank by median earnings (the dataset is ordered by this column).
# * `Major_code` - Major code.
# * `Major` - Major description.
# * `Major_category` - Category of major.
# * `Total` - Total number of people with the major.
# * `Sample_size` - Sample size (unweighted) of full-time, year-round workers (used for earnings).
# * `Men` - Male graduates.
# * `Women` - Female graduates.
# * `ShareWomen` - Women as a share of total.
# * `Employed` - Number employed.
# * `Median` - Median salary of full-time, year-round workers.
# * `Full_time` - Number employed 35 hours or more.
# * `Part_time` - Number employed less than 35 hours.
# * `Full_time_year_round` - Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35).
# * `Unemployed` - Number unemployed (ESR == 3).
# * `Unemployment_rate` - Unemployed / (Unemployed + Employed).
# * `P25th` - 25th percentile of earnings.
# * `P75th` - 75th percentile of earnings.
# * `College_jobs` - Number with job requiring a college degree.
# * `Non_college_jobs` - Number with job not requiring a college degree.
# * `Low_wage_jobs` - Number in low-wage service jobs.

# ## Importing the libraries
# The various libraries (**pandas** and **matplotlib**) are required to enable proper data cleaning steps, exploration, analysis and visualization.

# In[1]:


import pandas as pd
import matplotlib.pyplot as plt

# jupyter magic function to display inline plots
get_ipython().run_line_magic('matplotlib', 'inline')


# ## Data exploration
# We need to read in the dataset to examine and explore it, and to identify its contents, e.g. patterns, outliers, values, and column names that may need changing.

# In[2]:


# reading the dataset
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]    # returns the first row of the dataset


# In[3]:


recent_grads.head()    # returns the first 5 rows of the dataset


# The dataframe displayed above consists of the first five rows of the dataset, which gives a better understanding of the data we are working with.

# In[4]:


recent_grads.tail()    # returns the last 5 rows of the dataset


# The dataframe displayed above consists of the last five rows of the dataset, which also helps build intuition for the data we are working with.

# ### Changing column names
# Notice that the column names begin with a capital letter. This is not much of a problem, but converting all column names to lowercase gives a consistent naming format and makes them quicker to type during exploration and analysis.
# 
# Therefore, the column names are converted to lowercase.

# In[5]:


# returns the column names in the dataset
recent_grads.columns


# In[6]:


# converting all column names to lowercase and replacing
# the old column names in the dataset
recent_grads.columns = recent_grads.columns.str.lower()


# In[7]:


recent_grads.columns


# All the column names have now been converted to lowercase, which brings about a consistent name format for the columns.

# In[8]:


recent_grads.info()


# The information above shows that most of the columns in the dataset contain numeric values of type *int64* or *float64*; only two columns, **major_category** and **major**, contain string values.

# In[9]:


# returns summary statistics for all the columns
# in the dataset
recent_grads.describe(include='all')


# The information above shows that some columns in the dataset contain missing data, specifically **total, men, women** and **sharewomen** (their `count` is lower than that of the other columns).

# ### Dropping rows with missing data
# Matplotlib expects the columns passed to it to have matching lengths with no missing values, otherwise it throws an error. Since some rows with missing data have been identified, they need to be removed.

# In[10]:


# counts the missing values in each column
print(recent_grads.isnull().sum())


# Above, notice that there are only 4 columns with missing values, and each of those columns contains only one missing value.

# In[11]:


# returns the total number of rows in the dataset; an
# alternative is recent_grads.shape[0]
raw_data_count = len(recent_grads)
raw_data_count


# In[12]:


# dropping rows with missing values
recent_grads = recent_grads.dropna(axis='index')


# In[13]:


cleaned_data_count = len(recent_grads)
cleaned_data_count


# Notice now that there's a difference between **raw_data_count** (173 rows) and the new **cleaned_data_count** (172 rows). This shows that only one row in the dataset contained missing values and was dropped.
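# 
# As a quick cross-check of this kind of row count, the sketch below applies the same `dropna` step to a small made-up frame (the values are illustrative and not taken from `recent-grads.csv`) and compares the number of rows before and after.

# In[ ]:


import pandas as pd

# toy data (hypothetical values): one row has a missing 'men' entry
toy = pd.DataFrame({'total': [100.0, 200.0, 300.0],
                    'men': [40.0, None, 120.0]})

rows_before = len(toy)
toy_clean = toy.dropna(axis='index')    # drops rows containing any NaN
rows_after = len(toy_clean)

print(rows_before - rows_after, 'row(s) dropped')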

# ## Analysing the Category of Students that tend to Have Higher Income -- Using Scatter plots

# In order to determine which category of students earns more, scatter plots are created to examine the relationships between some columns of our dataset, such as **sample_size, median** and **unemployment_rate**.
# 
# pandas has its own plotting functionality, [DataFrame.plot()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html), which lets us create different types of plots very quickly by passing in a few arguments, so that is what is used to create the plots.
# E.g. `recent_grads.plot(x='sample_size', y='employed', kind='scatter', title='Employed vs. Sample_size', figsize=(5,10))`
# 
# The plots created help answer the following important questions:
# * Do students in more popular majors make more money?
# * Do students that majored in subjects that were majority female make more money?
# * Is there any link between the number of full-time employees and median salary?

# In[14]:


# creating a scatter plot sample_size vs. median
ax = recent_grads.plot(x='sample_size', y='median', 
                  kind='scatter')
ax.set_title('Median Salary vs. Sample_size')
ax.set_xlim(0,4300)
ax.set_ylim(0,120000)


# Q. Do students in more popular majors make more money?
# 
# A. No. Based on the plot above, there is little or no significant correlation between the median earnings of graduates and the sample size of their major.
# 
# However, there are two significant observations.
# 
# 1. The 75th percentile of sample_size is 338, which means the majority of the samples collected fall below 500. Therefore, it's preferable to zoom in to get a better view of the relationship. There are also outliers in the data collected.
# 2. Furthermore, the scatter plot uses earnings information for an unweighted sample of people, so sample_size may not be a good representation of how popular a major is.
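# 
# A correlation coefficient can back up what the eye sees in a scatter plot. The sketch below uses `Series.corr` (Pearson by default) on made-up numbers; the column names mirror the dataset, but the values are purely illustrative.

# In[ ]:


import pandas as pd

# toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame({'sample_size': [10, 50, 200, 800, 3000],
                    'median': [40000, 36000, 52000, 33000, 41000]})

# Pearson correlation: values near 0 suggest no linear relationship
r = toy['sample_size'].corr(toy['median'])
print('Pearson r = {:.2f}'.format(r))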

# In[15]:


# scatter plot with a sample size at the 75th percentile 
# and median values
ax = recent_grads.plot(x='sample_size', y='median',
                      kind='scatter')
ax.set_title('Median Salary vs. Sample_size')
ax.set_xlim(0,338) # 75th percentile value of sample_size
ax.set_ylim(0,120000)


# The diagram above still shows no correlation over a smaller range of sample sizes.
# 
# However, around a sample size of 50 there seems to be an increase in the median earnings of graduates.

# In[16]:


# scatter plot of sample_size vs. unemployment_rate
ax = recent_grads.plot(x='sample_size', y='unemployment_rate', 
                  kind='scatter')
ax.set_title('Unemployment_rate vs. Sample_size')
ax.set_xlim(0, 4300) # max value in sample_size
ax.set_ylim(-0.02, 0.2)


# Q. Is the unemployment rate related to how popular a major is?
# 
# A. The diagram above also shows little or no correlation between **unemployment_rate** and **sample_size**.
# 
# However, for sample sizes up to about 500, an increase in unemployment rate is observed.
# 
# For better intuition it helps to analyse a smaller range of sample sizes, say up to 1000.

# In[17]:


# creates a scatter plot zoomed in to sample sizes up to 1000
ax = recent_grads.plot(x='sample_size', y='unemployment_rate',
                      kind='scatter')
ax.set_title('Unemployment_rate vs. Sample_size')
ax.set_xlim(0,1000) # zoomed-in range of sample_size


# The diagram above still shows a lot of variation (no correlation) between sample_size and unemployment_rate.
# 
# Exploring some of the rows in the dataset helps explain the variation.
# 
# * There's a noticeable difference between the sample size and the number of employed and unemployed graduates in a given row. For example, for Petroleum Engineering (rank 1) the sample size is 36, while unemployed + employed is 37 + 1976 = 2013.
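# 
# The gap can be made concrete with the Petroleum Engineering figures quoted above. This sketch assumes the column definition `Unemployed / (Unemployed + Employed)` listed at the top of the notebook; the variable names are only for illustration.

# In[ ]:


# figures quoted above for Petroleum Engineering (rank 1)
sample_size = 36
employed = 1976
unemployed = 37

labour_force = employed + unemployed
unemployment_rate = unemployed / labour_force
# size of the earnings sample relative to the labour force
sample_fraction = sample_size / labour_force

print('labour force: {}'.format(labour_force))
print('unemployment rate: {:.3f}'.format(unemployment_rate))
print('sample fraction: {:.3f}'.format(sample_fraction))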

# In[18]:


# scatter plot: Full_time vs Median
ax = recent_grads.plot(x='full_time', y='median', kind='scatter')
ax.set_title('Median Salary vs. Full_time employed grads')
ax.set_xlim(0,)


# Q. Is there any link between the number of full-time employees and median salary?
# 
# A. The diagram above also shows a lot of variation (no correlation) between the number of graduates employed full time in a major and the median salary, especially for majors with fewer than 50,000 full-time employed graduates.
# 
# Aside from the small sample sizes, other variation could come from the kinds of employers that graduates join. The dataset contains no employer information, so the following are only possible explanations:
# * Big companies may employ few workers and be willing to pay more for fewer hours of work, because they target the best workers and focus on productivity.
# * Medium companies could do the same, or decide to employ lots of workers and pay less for more hours of work in order to achieve productivity. In doing so they target very good and average workers.
# * Small companies may employ a large number of workers (grads) and pay less for more working hours due to a lack of sufficient resources. Fair or average grads may also end up in these companies since they're easier to hire.
# 
# However, in my opinion it's expected that the longer the hours put into work, the higher the earnings.

# In[19]:


# scatter plot: ShareWomen vs. Unemployment_rate
ax = recent_grads.plot(x='sharewomen', y='unemployment_rate',
                 kind='scatter')
ax.set_title('Unemployment_rate vs. Fraction of Graduate women')
ax.set_xlim(0,)


# The diagram above shows no correlation between sharewomen and unemployment_rate.

# In[20]:


# scatter plot: Sharewomen vs. median
ax = recent_grads.plot(x='sharewomen', y='median',
                      kind='scatter')
ax.set_title('Median Salary vs. Fraction of graduate women')
ax.set_xlim(0,1.0)
ax.set_ylim(10000,)


# Q. Do students that majored in subjects that were majority female make more money?
# 
# A. No. The scatter plot above shows a weak negative correlation.
# 
# This means majors with a smaller share of women tend to have higher median earnings. As shown in the diagram, majors where women make up 0 - 0.2 (0 - 20%) of graduates have the highest earnings, while majors with a larger share of women (above 0.2) have lower earnings.
# 
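# One way to make such a trend easier to read is to group majors into bands of `sharewomen` and compare the average median salary per band. This is a minimal sketch on made-up values; the band edges are an arbitrary choice.

# In[ ]:


import pandas as pd

# toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame({'sharewomen': [0.10, 0.15, 0.45, 0.55, 0.80, 0.90],
                    'median': [70000, 60000, 45000, 40000, 33000, 31000]})

# bands: (0, 0.2], (0.2, 0.6], (0.6, 1.0]
bands = pd.cut(toy['sharewomen'], bins=[0, 0.2, 0.6, 1.0])
mean_by_band = toy.groupby(bands, observed=True)['median'].mean()
print(mean_by_band)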

# In[21]:


# scatter plot: Men vs. Median
ax = recent_grads.plot(x='men', y='median', kind='scatter')
ax.set_title('Median Salary vs. Male graduates')
ax.set_xlim(0,)
ax.set_ylim(20000,)


# There's no correlation between the number of male graduates in a major and the median earnings.

# In[22]:


# scatter plot: Women vs. Median
ax = recent_grads.plot(x='women', y='median', kind='scatter')
ax.set_title('Median Salary vs. Female graduates')
ax.set_xlim(0,)
ax.set_ylim(20000,)


# There's equally no correlation between the number of female graduates in a major and the median earnings.

# ## Visualizing Graduate information using Histograms
# This section focuses on answering two questions:
# 
# 1. What percent of majors are predominantly male? Predominantly female?
# 2. What's the most common median salary range?
# 
# NB: `dataFrame[col_name].plot(kind='hist')` was not used to generate the histograms because it's harder to control the binning strategy. `dataFrame[col_name].hist(bins=<digit>, range=(<digits>))` is preferable.
# 
# For a better understanding, check out [Series.hist()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.hist.html).

# In[23]:


# histogram exploring sample_size
ax = recent_grads['sample_size'].hist()
ax.set_title('Distribution of Sample size data')
ax.set_xlabel('Sample_size')
ax.set_ylabel('Frequency')


# The diagram above shows that most of the sample sizes collected are below 500. A more detailed view can be obtained by restricting the histogram to the range 0 - 500.

# In[24]:


# Histogram of sample_size of range 500
ax = recent_grads['sample_size'].hist(bins=20, range=(0,500))
ax.set_title('Distribution of Sample size data')
ax.set_xlabel('Sample_size')
ax.set_ylabel('Frequency')


# Zooming in, it's observed that most of the sample sizes fall below 100. With such small sample sizes, the median earnings of grads may not be very accurate.

# In[25]:


# Histogram of Median earnings
ax = recent_grads['median'].hist(bins=50, range=(20000,80000))
ax.set_title('Median distribution')
ax.set_xlabel('Median')
ax.set_ylabel('Frequency')


# Q. What's the most common median salary range?
# 
# A. The diagram shows the most common median salary range to be $30,000 - $40,000. The next most common is either $40,000 - $50,000 or $50,000 - $60,000, which is quite hard to tell without further analysis or visualization.
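# 
# The most common range can also be read off numerically instead of by eye. The sketch below bins a made-up series of salaries with `pd.cut` and counts the values per bin; the $10,000 bin width is an assumption chosen to match the ranges discussed above.

# In[ ]:


import pandas as pd

# toy data (hypothetical median salaries, not from the dataset)
medians = pd.Series([28000, 32000, 35000, 36000, 38000, 45000, 52000, 75000])

# $10,000-wide bins from $20,000 to $80,000
edges = list(range(20000, 90000, 10000))
counts = pd.cut(medians, bins=edges).value_counts()
most_common = counts.idxmax()    # the interval holding the most values
print(most_common, counts.max())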

# In[26]:


# histogram for Employed grads
ax = recent_grads['employed'].hist(bins=25, range=(0,30000))
ax.set_title('Distribution of employed graduates')
ax.set_xlabel('Employed')
ax.set_ylabel('Frequency')


# Each row in the dataset is a major, so this histogram counts majors by their number of employed graduates. The tallest bars in the diagram above are at the low end, below about 5,000 employed graduates.
# 
# There's also a reasonable number of majors in the 5,000 - 15,000 range, and some larger majors have 15,000 or more employed grads.
# 
# It would be fair to say that the number of students employed per major is affected by the number of students that have taken the major.
# 
# I'll examine the columns to determine if any relationship exists.

# In[27]:


# E.g. the total number of people vs the number employed 
# for the largest majors
# Filter by majors with total > 39,000 (75th percentile)
largest_majors = recent_grads.loc[recent_grads['total'] > 39000, ['major_code', 'major', 'total', 'employed']]
largest_majors.sort_values(by='total', ascending=False).head(10)


# Above we notice an obvious relationship between the total number of grads in a major and the number employed.
# 
# To better illustrate this, a scatter plot is used to show the relationship.

# In[28]:


# Scatter plot: Total vs. employed
ax = recent_grads.plot(x='total', y='employed', kind='scatter')
ax.set_title('Total grads with majors vs. Employed grads')
ax.set_xlim(0,)
ax.set_ylim(0,)
ax.set_xlabel('Total')
ax.set_ylabel('Employed')


# In[29]:


# histogram of full time employed grads
ax = recent_grads['full_time'].hist(bins=25, range=(0,250000))
ax.set_title('Distribution of Full time employees')
ax.set_xlabel('Full Time Employees')
ax.set_ylabel('Frequency')


# The diagram above shows that most majors have fewer than about 50,000 graduates in full-time work, with only a few majors far above that. Note that each row is a major, so these are counts of graduates per major rather than employees of a single company.

# In[30]:


# histogram for sharewomen
ax = recent_grads['sharewomen'].hist(bins=20)
ax.set_title('Distribution of a Fraction of Graduate women')
ax.set_xlabel('Sharewomen')
ax.set_ylabel('Frequency')


# It appears that just over 50% of all majors are predominantly female, with the highest frequency at 70 - 80% female.
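# 
# The share of majors on each side of the 50% mark can also be computed directly. A minimal sketch on made-up `sharewomen` values (illustrative only), using the fact that the mean of a boolean mask is the fraction of `True` values:

# In[ ]:


import pandas as pd

# toy data (hypothetical shares of women, not from the dataset)
sharewomen = pd.Series([0.12, 0.35, 0.48, 0.55, 0.72, 0.78, 0.91, 0.97])

# a boolean mask's mean is the fraction of True values
pct_mostly_female = (sharewomen > 0.5).mean() * 100
pct_mostly_male = (sharewomen < 0.5).mean() * 100

print('{:.1f}% mostly female, {:.1f}% mostly male'.format(
    pct_mostly_female, pct_mostly_male))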

# In[31]:


# Evaluating majors in the band with the highest share
# of females (0.6 - 0.8)
female_share_cols = ['major_code', 'major', 'total', 'men', 'women',
                     'sharewomen', 'employed', 'unemployed']
in_band = ((recent_grads['sharewomen'] > 0.6) &
           (recent_grads['sharewomen'] <= 0.8))
largest_female_share = recent_grads.loc[in_band, female_share_cols]
print(largest_female_share.shape)
largest_female_share.sort_values(by='sharewomen', ascending=False).head()


# Comparing the columns **men**, **women** and **total** in the DataFrame above confirms what the histogram shows: these majors consist of more women than men.

# In[32]:


# histogram of unemployment_rate
ax = recent_grads['unemployment_rate'].hist()
ax.set_title('Distribution of unemployment_rate')
ax.set_xlabel('Unemployment Rate')
ax.set_ylabel('Frequency')


# The most common unemployment rate among majors is 6-7% (the tallest bar), while rates around 14% are among the least common.
# 
# I'll examine both cases below.

# In[33]:


# Majors in the most common unemployment-rate band (6-7%)
unemp_cols = ['major_code', 'major', 'major_category', 'unemployment_rate']
most_common_unemp = recent_grads.loc[
    recent_grads['unemployment_rate'].between(0.06, 0.07), unemp_cols]
most_common_unemp.sort_values(by='unemployment_rate', ascending=False).head()


# Displayed above are the five majors with the highest unemployment_rate within the 6-7% band, with *Health and Medical Preparatory Programs* at the top.

# In[34]:


# Majors in a sparsely populated, high unemployment-rate band (12-14%)
high_unemp_cols = ['major_code', 'major', 'major_category', 'median',
                   'unemployment_rate']
high_unemp_majors = recent_grads.loc[
    recent_grads['unemployment_rate'].between(0.12, 0.14), high_unemp_cols]
high_unemp_majors.sort_values(by='unemployment_rate', ascending=False).head()


# PUBLIC POLICY appears at the top of this band, so it has one of the higher unemployment rates among the majors, although it also has a very good median salary of $50,000.

# In[35]:


# histogram distribution of men
ax = recent_grads['men'].hist(bins=25, range=(0,200000))
ax.set_title('Distribution of Men')
ax.set_xlabel('Men')
ax.set_ylabel('Frequency')


# Q. What percent of majors are predominantly male?
# 
# A. The diagram above shows the number of male graduates per major, not a percentage: most majors have comparatively few male graduates and a handful have very many.
# 
# On its own it doesn't determine which majors are significantly dominated by males.

# In[36]:


# determining the majors with the largest number of male grads
male_dominated_majors = recent_grads.loc[recent_grads['men'] >= 0, ['major_code', 'major', 'major_category', 'median', 'men', 'women']]
male_dominated_majors.sort_values(by='men', ascending=False).head(5)


# The majors with the largest number of male graduates are BUSINESS MANAGEMENT AND ADMINISTRATION, GENERAL BUSINESS and FINANCE.
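# 
# Sorting by the raw `men` count favours large majors. To rank majors by how male-dominated they are, the share of men is arguably a better key. The sketch below contrasts the two orderings on four hypothetical majors (A - D) with made-up counts; `share_men` is a helper column introduced only for this illustration.

# In[ ]:


import pandas as pd

# toy data (hypothetical majors and counts, not from the dataset)
toy = pd.DataFrame({'major': ['A', 'B', 'C', 'D'],
                    'men': [90000, 1500, 400, 30000],
                    'women': [85000, 100, 3600, 70000]})

toy['share_men'] = toy['men'] / (toy['men'] + toy['women'])

# ordering by number of men vs. ordering by share of men
by_count = toy.sort_values(by='men', ascending=False)['major'].tolist()
by_share = toy.sort_values(by='share_men', ascending=False)['major'].tolist()

print('by count:', by_count)
print('by share:', by_share)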

# In[37]:


# histogram distribution of women
ax = recent_grads['women'].hist(bins=25, range=(0,200000))
ax.set_title('Distribution of women')
ax.set_xlabel('Women')
ax.set_ylabel('Frequency')
ax.set_ylim(0,120)


# Q. What percent of majors are predominantly female?
# 
# A. The diagram above shows the number of female graduates per major, not a percentage: again most majors have comparatively few female graduates and a handful have very many, so on its own it doesn't determine which majors are dominated by females.

# In[38]:


# determining the majors with the largest number of female grads
female_dominated_majors = recent_grads.loc[recent_grads['women'] >= 0, ['major_code', 'major', 'major_category', 'median', 'men', 'women']]
female_dominated_majors.sort_values(by='women', ascending=False).head(5)


# The majors with the largest number of female graduates are PSYCHOLOGY, NURSING and BIOLOGY.

# ## Exploring potential relationships and distributions of columns using Scatter matrix plot
# In order to evaluate the relationships between multiple columns more efficiently, a [Scatter Matrix plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html) is a good option.
# 
# A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us explore potential relationships and distributions simultaneously.

# In[39]:


# importing scatter_matrix
from pandas.plotting import scatter_matrix


# In[40]:


# A 2 by 2 scatter matrix plot of Sample_size 
# and Median Salary
scatter_matrix(recent_grads[['sample_size', 'median']],
              figsize=(10,10))


# The diagram above shows most sample sizes to be less than 1000 (top-left histogram). The scatter plot of Median vs. Sample_size (bottom-left) suggests that the median salary is mostly somewhere around $30,000 - $40,000.
# 
# However, the scatter plot of Sample_size vs. Median (top-right) suggests that an increase in sample size doesn't necessarily affect the median salary values.

# In[41]:


# A 3 x 3 scatter matrix plot of sample_size, median and
# unemployment columns
scatter_matrix(recent_grads[['sample_size', 'median', 'unemployment_rate']],
              figsize=(10,10))


# The scatter matrix above shows no correlation between any pair of these columns. A scatter matrix is also a quick way to check relationships that were plotted one at a time in the cells above, e.g. the total number of students in a major vs. the number employed, or vs. the number unemployed.

# In[42]:


# Total vs unemployed scatter matrix
scatter_matrix(recent_grads[['total', 'unemployed']], figsize=(10,10))


# This shows a weak positive correlation between the total number of students in a major and the number unemployed, suggesting that only a small fraction of students in each major are unemployed.

# In[43]:


# Total vs employed scatter matrix
scatter_matrix(recent_grads[['total', 'employed']], figsize=(10,10))


# Above there is a very strong positive correlation between the total number of students in a major and the number employed. This simply means that the majority of students in each major are employed.
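# 
# A scatter matrix has a numeric counterpart in the correlation matrix returned by `DataFrame.corr()`. The sketch below computes it for made-up values of three columns; the column names mirror the dataset, but the numbers are illustrative only.

# In[ ]:


import pandas as pd

# toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame({'total': [1000, 5000, 20000, 80000],
                    'employed': [800, 4100, 15500, 64000],
                    'unemployed': [60, 300, 1500, 4000]})

# pairwise Pearson correlations between all numeric columns
corr = toy.corr()
print(corr.round(2))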

# ## Visualizing data using bar plot
# Bar plots can be created from a Series object: `df[range][col].plot(kind='bar')` or from a DataFrame object: `df[range].plot.bar(x=labels, y=data for bars)`.

# In[44]:


# Bar plot of sharewomen from first ten rows vs sharewomen
# from last ten rows
ax1 = recent_grads[:10].plot.bar(x='major', y='sharewomen', 
                                 title='Fraction of female grads from the top 10 courses with the highest median salary')

ax2 = recent_grads[-10:].plot.bar(x='major', y='sharewomen',
                                 title='Fraction of female grads from the bottom 10 courses with the least median salary')


# Above we observe that courses with the highest median salaries have a lower share of female grads than those with the lowest median salaries, which in this case are mostly female (i.e. more than 50% of grads in each of the lowest-paying courses are female).
# 
# We can calculate how large the difference is below:

# In[45]:


# Calculating the average proportion of female grads for the 
# top and bottom 10 courses
# NB: slices using .loc  includes the index of both the start 
# and stop index contrary to using normal python lists
top_10_female_share = recent_grads.loc[:9, 'sharewomen'].mean()
bottom_10_female_share = recent_grads[-10:]['sharewomen'].mean()


# In[46]:


top_10 = ('The 10 highest paying courses have an average '
          'amount of female share to be: {:.2f}'.format(
          top_10_female_share))
          
bottom_10 = ('The 10 lowest paying courses have an average '
          'amount of female share to be: {:.2f}'.format(
          bottom_10_female_share))
             
print(top_10)
print(bottom_10)


# There's an obvious difference between the average proportion of female grads in the top 10 and bottom 10 courses (in terms of median pay), which is more than 50 percentage points.
# 
# Next, we check out the difference in the unemployment rate between the top and bottom 10 courses.
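# 
# Before that, a short aside on the slicing used above: `.loc` slices by index label and includes the stop label, while `.iloc` slices by position and excludes the stop. The sketch below shows the difference on a made-up frame whose label 2 is missing, as can happen after `dropna()`.

# In[ ]:


import pandas as pd

# toy frame (hypothetical values) with label 2 missing from the index
toy = pd.DataFrame({'x': [10, 11, 12, 13, 14]}, index=[0, 1, 3, 4, 5])

by_label = toy.loc[:1, 'x']        # labels 0 and 1 (stop label included)
by_position = toy['x'].iloc[:1]    # position 0 only (stop excluded)
up_to_label_3 = toy.loc[:3, 'x']   # labels 0, 1, 3 (label 2 does not exist)

print(len(by_label), len(by_position), len(up_to_label_3))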

# In[47]:


# Unemployment rate for top and bottom 10 courses
ax1 = recent_grads[:10].plot.bar(x='major', y='unemployment_rate',
                                title='Unemployment rate for top 10 courses.')

ax2 = recent_grads[-10:].plot.bar(x='major', y='unemployment_rate',
                                title='Unemployment rate for bottom 10 courses.')


# For the top 10 courses in general, the unemployment rate is relatively low; however, 2 courses, NUCLEAR ENGINEERING and MINING AND MINERAL ENGINEERING, stand out as noticeably high.
# 
# For the bottom 10 courses, the unemployment rate seems to be moderately high, with 3-5 courses confirming it.
# 
# We can analyse this further by looking at the average unemployment rates.

# In[48]:


# calculating the average unemployment rates for the
# top and bottom 10 courses
# NB: slices using .loc  includes the index of both the start 
# and stop index contrary to using normal python lists
mean_unemp_rate = recent_grads['unemployment_rate'].mean()
top_10_unemp_rate = recent_grads.loc[:9, 'unemployment_rate'].mean()
bottom_10_unemp_rate = recent_grads[-10:]['unemployment_rate'].mean()

print(('The average unemployment rate for all majors is {:.2f}'
       .format(mean_unemp_rate)))
print(('The average unemployment rate for the top 10 '
       'majors is {:.2f}'.format(top_10_unemp_rate)))
print(('The average unemployment rate for the bottom 10 '
       'majors is {:.2f}'.format(bottom_10_unemp_rate)))


# The average unemployment rate for all majors tends to be similar to that of the top and bottom 10 courses. However, in the top 10 you'd notice that there are only two courses which are distinctly high, while in the bottom 10 there are about 3-5 courses.
# 
# We can examine this further below:

# In[49]:


# selecting the top 10 courses with an above-average unemployment
# rate and adding a column with their difference from the mean
# (.copy() avoids pandas' SettingWithCopyWarning on assignment)
top_10_outliers = (recent_grads[:10]
                   .loc[recent_grads[:10]['unemployment_rate'] > mean_unemp_rate]
                   .copy())
top_10_outliers['mean_difference'] = (
    top_10_outliers['unemployment_rate'] - mean_unemp_rate)


# In[50]:


# selecting the bottom 10 courses with an above-average unemployment
# rate and adding a column with their difference from the mean
# (.copy() avoids pandas' SettingWithCopyWarning on assignment)
bottom_10_outliers = (recent_grads[-10:]
                   .loc[recent_grads[-10:]['unemployment_rate'] > mean_unemp_rate]
                   .copy())
bottom_10_outliers['mean_difference'] = (
    bottom_10_outliers['unemployment_rate'] - mean_unemp_rate)


# In[51]:


"""Plotting the courses from the top and bottom 10 with the
average unemployment rates and their mean_difference"""
top_10_outliers.plot.bar(x='major', y='mean_difference',
                        title='Majors in the top 10 above average unemployment rate.')

bottom_10_outliers.plot.bar(x='major', y='mean_difference',
                        title='Majors in the bottom 10 above average unemployment rate.')


# In the top 10 courses, NUCLEAR ENGINEERING is particularly responsible for pushing up the unemployment rate, while in the bottom 10 it is CLINICAL PSYCHOLOGY.

# ## Further Analysis
# Moving on, I'll generate visualizations to carry out a more in-depth analysis of the following:
# * Comparing the number of men and women in each category of majors using a stacked bar plot.
# * Exploring the distributions of median salaries and unemployment rate using box plots.
# * Exploring columns with denser scatter plots using hexagonal bin plots, based on the scatter plots created above.
# 
# NB: For more fantastic plots using pandas check out the documentation [plotting in pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

# In[52]:


# comparing the number of men and women in each category major
# NB: df.plot.bar(stacked=True) stacks the plot on top of each
# other    
ax1 = (recent_grads.groupby('major_category')[['men','women']]
 .sum().plot.bar(stacked=True))
ax1.set_title('Category Majors vs. Number of Men and women')
ax1.set_ylabel('Total')


# Q. How do the numbers of men and women compare in each category of majors?
# 
# A. Above it is noticed that the Business category has the highest number of male and female graduates combined, and they are somewhat evenly distributed.
# 
# Generally, there is a higher share of female grads than male grads across the various major categories, with some exceptions such as ENGINEERING and COMPUTERS & MATHEMATICS, which are dominated by male grads.
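# 
# The stacked bars show counts; the share of women per category can be derived from the same grouped sums. A minimal sketch on made-up counts for two categories (the numbers are illustrative, not from the dataset):

# In[ ]:


import pandas as pd

# toy data (hypothetical counts, not from the dataset)
toy = pd.DataFrame({'major_category': ['Engineering', 'Engineering',
                                       'Education', 'Education'],
                    'men': [900, 700, 100, 200],
                    'women': [100, 300, 400, 300]})

# total men and women per category, then women as a share of the total
totals = toy.groupby('major_category')[['men', 'women']].sum()
share_women = totals['women'] / (totals['men'] + totals['women'])
print(share_women)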

# In[65]:


# box plot: exploring the distribution of median salaries
recent_grads['median'].plot.box(title='Distribution of Median salaries')


# The above figure shows that the most common median salaries for majors are between $30,000 and $40,000. We also observe outliers on the high side, from roughly $60,000 - $80,000 upwards.
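# 
# The points drawn as outliers in a box plot follow a simple rule: by default Matplotlib places the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand to made-up salaries (illustrative values only).

# In[ ]:


import pandas as pd

# toy data (hypothetical median salaries, not from the dataset)
medians = pd.Series([30000, 32000, 35000, 36000, 40000, 45000, 110000])

q1 = medians.quantile(0.25)
q3 = medians.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr    # values above this are drawn as outliers
outliers = medians[medians > upper_fence]

print('IQR: {:.0f}, upper fence: {:.0f}'.format(iqr, upper_fence))
print(outliers.tolist())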

# In[66]:


# box plot: exploring the distribution of unemployment rate
ax = recent_grads['unemployment_rate'].plot.box(title='Distribution of unemployment rate')


# In[68]:


# hexagonal bin plot: total vs employed
recent_grads.plot.hexbin(x='employed', y='total', gridsize=30)


# In[69]:


recent_grads.plot.hexbin(x='men', y='median', gridsize=30)

