It is a fact that there is still a great disparity in how men and women are paid all over the world: women still earn less than men, with no reasonable explanation other than gender. According to an article released by Business Insider, some cities have a gender pay gap higher than 20%.

With this in mind, the main goal of the present project is to explore a dataset of job outcomes of students who graduated from college between 2010 and 2012. The data was originally released by the American Community Survey (ACS), which conducts surveys to help local officials, community leaders, and businesses understand the changes taking place in their communities. The data was then cleaned by FiveThirtyEight and released on their GitHub repository.

Our purpose is to explore the dataset in order to find patterns in earnings across majors, taking into account the share of men and women in each major. To do so, we are going to use the pandas and matplotlib libraries for data visualization, along with some basic exploration techniques.

Header | Description |
---|---|
Rank | Rank by median earnings |
Major_code | Major code, FO1DP in ACS PUMS |
Major | Major description |
Major_category | Category of major from Carnevale et al |
Total | Total number of people with major |
Sample_size | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
Men | Male graduates |
Women | Female graduates |
ShareWomen | Women as share of total |
Employed | Number employed (ESR == 1 or 2) |
Full_time | Employed 35 hours or more |
Part_time | Employed less than 35 hours |
Full_time_year_round | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
Unemployed | Number unemployed (ESR == 3) |
Unemployment_rate | Unemployed / (Unemployed + Employed) |
Median | Median earnings of full-time, year-round workers |
P25th | 25th percentile of earnings |
P75th | 75th percentile of earnings |
College_jobs | Number with job requiring a college degree |
Non_college_jobs | Number with job not requiring a college degree |
Low_wage_jobs | Number in low-wage service jobs |
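
Two of the columns described above are derived from others: **ShareWomen** is women as a share of the total, and **Unemployment_rate** is Unemployed / (Unemployed + Employed). As a minimal sketch, these definitions can be reproduced on a small made-up frame. The `toy` frame and its numbers are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "Total": [1000, 400],
    "Women": [250, 300],
    "Employed": [800, 360],
    "Unemployed": [200, 40],
})

# ShareWomen: women as a share of all graduates in the major.
toy["ShareWomen"] = toy["Women"] / toy["Total"]

# Unemployment_rate: Unemployed / (Unemployed + Employed).
toy["Unemployment_rate"] = toy["Unemployed"] / (toy["Unemployed"] + toy["Employed"])

print(toy[["ShareWomen", "Unemployment_rate"]])
```

Both derived columns are fractions between 0 and 1, which is why the scatter plots later in the analysis can start their axes at 0.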

To begin our analysis, we need to import some essential libraries for data analysis, such as **pandas** and **matplotlib**. It is also important to do a preliminary analysis of our dataset using basic exploration techniques.

When importing matplotlib, we also have to run the Jupyter magic **%matplotlib inline**. This command is important because it allows Jupyter to display our plots inline.

**Importing libraries**

In [1]:

```
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
```

**Reading the dataset**

In [2]:

```
recent_grads = pd.read_csv('recent-grads.csv')
```

**Exploring the dataset**

In [3]:

```
raw_data_count = recent_grads.shape
raw_data_count
```

Out[3]:

In [4]:

```
recent_grads.iloc[0]
```

Out[4]:

In [5]:

```
recent_grads.head()
```

Out[5]:

In [6]:

```
recent_grads.tail()
```

Out[6]:

In [7]:

```
recent_grads.describe(include = 'all')
```

Out[7]:

In [8]:

```
recent_grads.info()
```

From the cells above, we can make some observations:

- The **recent_grads** dataset contains 173 rows, each representing a different major, and 21 attributes represented by columns;
- From the statistical description and dataset information, it is possible to notice that there is one missing value in each of the **Total**, **Men**, **Women** and **ShareWomen** columns. In the cell below, we can see that these missing values all belong to the same row (the Food Science major). Thus, to make our analysis more precise, we are going to remove this row.

In [9]:

```
recent_grads[recent_grads['Women'].isnull()][['Major', 'Total', 'Men', 'Women', 'ShareWomen']]
```

Out[9]:

**Dropping row with null values**

In [10]:

```
recent_grads.dropna(axis = 0, inplace = True)
cleaned_data_count = recent_grads.shape[0]
cleaned_data_count
```

Out[10]:
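
The missing-value check and row removal above can also be written without `inplace = True`, which makes it easy to confirm how many rows were dropped. This is a minimal sketch on a made-up frame: the rows in `toy` are hypothetical, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; major "B" has missing counts, like the row removed above.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 50.0],
    "Men": [60.0, np.nan, 10.0],
    "Women": [40.0, np.nan, 40.0],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop rows with any missing value and keep the result in a new variable,
# so the original frame stays available for comparison.
cleaned = toy.dropna(axis=0)
rows_dropped = len(toy) - len(cleaned)
print(rows_dropped)
```

Keeping the original frame around is a design choice: it costs a little memory but lets us compare row counts before and after cleaning.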

In this step, we are going to create different kinds of plots (scatter plots, histograms and bar plots), aiming to find patterns in them. To do so, we are using **pandas for plotting the graphs**, since it simplifies their construction.

**GENERATING SCATTER PLOTS FOR DIFFERENT COLUMNS**

- Full_time vs Median

In [11]:

```
ax = recent_grads.plot(x = 'Full_time', y = 'Median', kind = 'scatter')
ax.set_title('Full Time x Median')
ax.set_xlabel('Full Time')
ax.set_xlim(0,)
ax.set_ylabel('Median')
```

Out[11]:

- ShareWomen vs Unemployment_rate

In [12]:

```
# ShareWomen and Unemployment_rate
ax = recent_grads.plot(x = 'ShareWomen', y = 'Unemployment_rate', kind = 'scatter')
ax.set_title('Share Women x Unemployment Rate')
ax.set_xlabel('Share Women')
ax.set_xlim(0,)
ax.set_ylabel('Unemployment Rate')
ax.set_ylim(0,)
```

Out[12]:

- Men vs Median

In [13]:

```
ax = recent_grads.plot(x = 'Men', y = 'Median', kind = 'scatter')
ax.set_title('Men x Median')
ax.set_xlabel('Men')
ax.set_xlim(0,)
ax.set_ylabel('Median')
```

Out[13]:

- Women vs Median

In [14]:

```
ax = recent_grads.plot(x = 'Women', y = 'Median', kind = 'scatter')
ax.set_title('Women x Median')
ax.set_xlabel('Women')
ax.set_xlim(0,)
ax.set_ylabel('Median')
```

Out[14]:

In [15]:

```
# answering to the question from DQ
'''
1) Do students in more popular majors make more money? (total x median)
2) Do students that majored in subjects that were majority female make more money? (sharewomen x median)
3) Is there any link between the number of full-time employees and median salary?
'''
```

Out[15]:

- Total vs Median

In [16]:

```
ax = recent_grads.plot(x = 'Total', y = 'Median', kind = 'scatter')
ax.set_title('Total x Median')
ax.set_xlabel('Total')
ax.set_xlim(0,)
ax.set_ylabel('Median')
```

Out[16]:

- ShareWomen vs Median

In [17]:

```
ax = recent_grads.plot(x = 'ShareWomen', y = 'Median', kind = 'scatter')
ax.set_title('Share Women x Median')
ax.set_xlabel('Share Women')
ax.set_xlim(0,)
ax.set_ylabel('Median')
ax.set_ylim(0,)
```

Out[17]:

1) There is no clear relationship between the popularity of a major and its salary according to the **Total x Median** plot. In fact, less popular majors show a wide range of median salaries, from \$ 20,000 to \$ 80,000 (with a single outlier above \$ 120,000). As the total number of students in a major increases, the median salary stabilizes close to \$ 40,000.

2) Analyzing the **Men x Median** and **Women x Median** plots, we can see that there is no significant difference between them. However, the **ShareWomen x Median** plot shows a weak negative correlation between these two parameters, which means that majors with more female than male students tend to have lower wages.

3) From the **Full Time x Median** scatter plot, we can conclude that there is no direct relationship between the number of full-time employees and the median salary. Instead, just like the plot from the first observation, we can observe a high variation in salary for majors with few full-time employees and some stability in salary as the number of full-time employees increases.
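
The visual impressions above can be backed by a number. One option, which the notebook itself does not use, is the Pearson correlation coefficient computed with `DataFrame.corr`. The sketch below runs on invented values so that it is self-contained. With the real data, the same call would be applied to the corresponding columns of `recent_grads`.

```python
import pandas as pd

# Hypothetical toy values, invented for illustration only.
toy = pd.DataFrame({
    "ShareWomen": [0.1, 0.3, 0.5, 0.7, 0.9],
    "Median": [60000, 52000, 45000, 38000, 33000],
    "Total": [2000, 15000, 4000, 90000, 30000],
})

# Pearson correlation of each column with Median.
# Values near 0 suggest no linear relationship; negative values a downward trend.
corr_with_median = toy.corr()["Median"].drop("Median")
print(corr_with_median)
```

A coefficient only measures linear association, so it complements the scatter plots rather than replacing them.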

**GENERATING HISTOGRAMS FOR "SHAREWOMEN" AND "MEDIAN" COLUMNS**

In [18]:

```
# working with histograms(Sample_size, Median, Employed, Full_time, ShareWomen, Unemployment_rate, Men, Women)
```

In [19]:

```
# answering to some questions
'''
What percent of majors are predominantly male? Predominantly female?
What's the most common median salary range?
'''
```

Out[19]:

- ShareWomen

In [20]:

```
ax = recent_grads['ShareWomen'].hist(grid = False, bins = 10)
ax.set_xlabel('Share Women')
ax.set_ylabel('Entries')
ax.set_title('Share Women Histogram', fontsize = 13)
print(recent_grads['ShareWomen'].value_counts(bins = 10).sort_index())
```

- Median

In [21]:

```
ax = recent_grads['Median'].hist(grid = False)
ax.set_xlabel('Median Salary')
ax.set_ylabel('Entries')
ax.set_title('Median Salary Histogram', fontsize = 13)
print(recent_grads['Median'].value_counts(bins = 10).sort_index())
```

1) According to the histogram and the counts above, about 44% of the majors are predominantly male, against 56% that are predominantly female. Despite that, we saw in the previous observations that majors with a higher share of women tend to have slightly lower median salaries.

2) The second histogram also shows that the most common median salary range is \$ 30,000 to \$ 40,000 (about 44% of majors).
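
The percentages quoted above can also be computed directly rather than read off the binned counts. This is a minimal sketch on invented values, assuming that "predominantly female" means `ShareWomen > 0.5` (that threshold is our reading, so a major at exactly 0.5 would count as not predominantly female) and that salaries are grouped into \$ 10,000-wide ranges.

```python
import pandas as pd

# Hypothetical toy values, invented for illustration only.
toy = pd.DataFrame({
    "ShareWomen": [0.12, 0.35, 0.48, 0.55, 0.62, 0.71, 0.80, 0.93],
    "Median": [62000, 48000, 36000, 35000, 33000, 41000, 31000, 28000],
})

# The mean of a boolean Series is the fraction of True values.
pct_female_majority = (toy["ShareWomen"] > 0.5).mean() * 100
pct_male_majority = 100 - pct_female_majority
print(pct_female_majority, pct_male_majority)

# Group medians into fixed 10,000-wide ranges and find the most frequent one.
edges = list(range(20000, 80000, 10000))
salary_ranges = pd.cut(toy["Median"], bins=edges)
most_common_range = salary_ranges.value_counts().idxmax()
print(most_common_range)
```

Using explicit bin edges with `pd.cut` gives round salary ranges, whereas `value_counts(bins = 10)` splits the observed range into ten equal-width intervals whose edges are rarely round numbers.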

**GENERATING SCATTER MATRIX FOR TOTAL VS MEDIAN, SHAREWOMEN VS MEDIAN AND FULL_TIME VS MEDIAN**

In order to work with scatter matrix, we have to import it from pandas.plotting, as shown below.

In [22]:

```
# importing scatter_matrix
from pandas.plotting import scatter_matrix
```


**Total vs Median**

In [27]:

```
scatter_matrix(recent_grads[['Total', 'Median']], alpha = 1, figsize = (10, 10))
```

Out[27]:

**ShareWomen vs Median**

In [24]:

```
scatter_matrix(recent_grads[['ShareWomen', 'Median']], alpha = 1, figsize = (10, 10))
```

Out[24]:

**Full_time vs Median**

In [25]:

```
scatter_matrix(recent_grads[['Full_time', 'Median']], alpha = 1, figsize = (10, 10))
```

Out[25]:

The scatter matrices above just emphasize our previous discussion by aggregating scatter plots and histograms side by side for better comprehension.
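
Since `scatter_matrix` accepts any number of columns, the three pairwise matrices above could also be drawn as a single grid. This is a sketch on invented values so that it runs on its own. In the notebook, the same call would take `recent_grads[['Total', 'ShareWomen', 'Full_time', 'Median']]`.

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical toy values, invented for illustration only.
toy = pd.DataFrame({
    "Total": [2000, 15000, 4000, 90000, 30000, 12000],
    "ShareWomen": [0.10, 0.30, 0.50, 0.70, 0.90, 0.60],
    "Full_time": [1500, 11000, 3000, 60000, 21000, 9000],
    "Median": [60000, 52000, 45000, 38000, 33000, 40000],
})

# One call draws every pairwise scatter plot.
# The result is a 4 x 4 array of matplotlib Axes; the diagonal holds histograms.
axes = scatter_matrix(toy, alpha=1, figsize=(10, 10))
```

The trade-off is readability: a single grid shows all relationships at once, but each panel is smaller than in the separate two-column matrices.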

**GENERATING BAR PLOTS**

To finish our data analysis, we are going to generate bar plots for the 10 majors with the highest median salaries and the 10 majors with the lowest median salaries using the **Major** and **ShareWomen** columns. Our main purpose is to compare whether the highest-paying majors are predominantly attended by men or by women.

In [26]:

```
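# rows are ordered by Rank (median earnings), so the first 10 are the highest-paying majors
ax1 = recent_grads[:10].plot.bar(x = 'Major', y = 'ShareWomen', legend = False)
ax1.set_ylabel('Share Women')
ax1.set_title('Major x Share Women (10 highest median salaries)', fontsize = 13)
# the last 10 rows are the lowest-paying majors
ax2 = recent_grads[-10:].plot.bar(x = 'Major', y = 'ShareWomen', legend = False)
ax2.set_ylabel('Share Women')
ax2.set_title('Major x Share Women (10 lowest median salaries)', fontsize = 13)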
```

Out[26]:

We can conclude from the plots above that most of the highest-paying majors are attended mostly by men, whereas female students are more likely to attend majors with lower salaries. Analyzing the reasons why this happens may be complex, and it would require looking at external factors that are not in our dataset. However, based on the Recent Grads dataset, it is possible to infer that majors with a larger share of women tend to pay somewhat less.
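
The slices `recent_grads[:10]` and `recent_grads[-10:]` rely on the rows already being ordered by median earnings. As a sketch, the selection can be made explicit with `nlargest` and `nsmallest`, and the two groups can be compared by their average share of women. The `toy` frame below is invented for illustration and uses only three majors per group.

```python
import pandas as pd

# Hypothetical toy values, invented for illustration only.
toy = pd.DataFrame({
    "Major": ["M1", "M2", "M3", "M4", "M5", "M6"],
    "Median": [110000, 75000, 70000, 30000, 28000, 25000],
    "ShareWomen": [0.12, 0.20, 0.34, 0.70, 0.81, 0.77],
})

n = 3  # the analysis above uses 10; 3 keeps the toy example small

# Select the highest- and lowest-paying majors regardless of row order.
top = toy.nlargest(n, "Median")
bottom = toy.nsmallest(n, "Median")

# Average share of women in each group.
mean_share_top = top["ShareWomen"].mean()
mean_share_bottom = bottom["ShareWomen"].mean()
print(round(mean_share_top, 2), round(mean_share_bottom, 2))
```

Selecting by value rather than by position keeps the comparison correct even if the frame is re-sorted or filtered earlier in the notebook.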