In this course, we've been creating plots using **pyplot** and **matplotlib** directly.

When we want to explore a new dataset by quickly creating visualizations, using these tools directly can be cumbersome.

Thankfully, **pandas** has many methods for quickly generating common plots from data in DataFrames. Like pyplot, the plotting functionality in pandas is a wrapper for matplotlib.

This means we can customize the plots when necessary by accessing the underlying Figure, Axes, and other matplotlib objects.

**In this guided project, we'll explore how using the pandas plotting functionality along with the Jupyter notebook interface allows us to explore data quickly using visualizations.**

We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012.

The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.

Each row in the dataset represents a different major in college and contains information on:

- **Gender diversity**
- **Employment rates**
- **Median salaries, and more**

Here are some of the columns in the dataset:

- `Rank`: *Rank by median earnings (the dataset is ordered by this column).*
- `Major_code`: *Major code.*
- `Major`: *Major description.*
- `Major_category`: *Category of major.*
- `Total`: *Total number of people with major.*
- `Sample_size`: *Sample size (unweighted) of full-time.*
- `Men`: *Male graduates.*
- `Women`: *Female graduates.*
- `ShareWomen`: *Women as share of total.*
- `Employed`: *Number employed.*
- `Unemployment_rate`: *Percentage of the work force that is unemployed at any given date.*
- `Median`: *Median salary of full-time, year-round workers.*
- `Low_wage_jobs`: *Number in low-wage service jobs.*
- `Full_time`: *Number employed 35 hours or more.*
- `Part_time`: *Number employed less than 35 hours.*

Using visualizations, we can start to explore questions from the dataset like:

Do students in more popular majors make more money?

- Using scatter plots

How many majors are predominantly male? Predominantly female?

- Using histograms

Which category of majors has the most students?

- Using bar plots

**Note**: The type of visualization depends on the question being asked. For example:

When the question refers to students as a group in general, a scatter plot is used.

When we want to separate by gender, histograms are used.

When the question is about categories, bar plots are used.

We'll explore how to do these and more while primarily working in pandas. Before we start creating data visualizations, let's import the libraries we need and remove rows containing null values.

- Let's set up the environment by importing the **libraries** we need and **running the necessary Jupyter magic** `%matplotlib inline` so that plots are displayed inline.

In [1]:

```
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Jupyter magic: display plots inline in the notebook
%matplotlib inline
```

In [2]:

```
recent_grads = pd.read_csv("recent-grads.csv")
```

- Use `DataFrame.iloc[]` to return the first row formatted as a table.

In [3]:

```
recent_grads.iloc[:1]
```

Out[3]:

- Use `DataFrame.head()` and `DataFrame.tail()` to see how the data is structured.

In [4]:

```
recent_grads.head(3)
```

Out[4]:

In [5]:

```
recent_grads.tail(3)
```

Out[5]:

- Generate summary statistics for all of the numeric columns using `DataFrame.describe()`.

In [6]:

```
recent_grads.describe()
```

Out[6]:

The `count` row of the summary shows that the columns do not all have the same number of values, so some of them contain missing values.

Therefore, I will start by dropping all rows with missing values.

Matplotlib expects that columns of values we pass in have matching lengths and missing values will cause matplotlib to throw errors.
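Before dropping anything, it can help to see which columns actually contain missing values. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its values are illustrative assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for recent_grads (values are made up)
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Men": [100.0, np.nan, 300.0],
    "Women": [150.0, 250.0, 350.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```

The same two calls could be applied to `recent_grads` to check which columns are affected before cleaning.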

Look up the number of rows in `recent_grads` and assign the value to `raw_data_count`.

In [7]:

```
raw_data_count = len(recent_grads)
raw_data_count
```

Out[7]:

Use `DataFrame.dropna()` to drop rows containing missing values and assign the resulting DataFrame back to `recent_grads`.

In [8]:

```
# Drop every row that has at least one missing value
recent_grads = recent_grads.dropna(axis=0)
```

In [9]:

```
cleaned_data_count = len(recent_grads)
cleaned_data_count
```

Out[9]:

The cell above looks up the number of rows in `recent_grads` again and assigns the value to `cleaned_data_count`.

If you compare `cleaned_data_count` and `raw_data_count`, you'll notice that only one row contained missing values and was dropped:

$173 - 172 = 1$
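The before-and-after comparison can be sketched on a small made-up DataFrame. The `toy` frame below is a hypothetical stand-in for `recent_grads`, with one incomplete row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values); exactly one row has a missing value
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Total": [1000.0, np.nan, 3000.0, 4000.0],
})

raw_count = len(toy)
cleaned = toy.dropna(axis=0)   # returns a new DataFrame without incomplete rows
cleaned_count = len(cleaned)

# Number of rows removed by the cleaning step
dropped = raw_count - cleaned_count
print(f"{raw_count} - {cleaned_count} = {dropped}")
```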

`Pandas` and the Scatter Plots

Most of the plotting functionality in pandas is contained within the `DataFrame.plot()` method.

When we call this method, we specify the data we want plotted as well as the type of plot.

We use **the `kind` parameter to specify the type of plot we want**. We use `x` and `y` to specify the data we want on each axis. You can read about the different parameters in the documentation.

`kind` types

- `'line'`: line plot (default)
- `'bar'`: vertical bar plot
- `'barh'`: horizontal bar plot
- `'hist'`: histogram
- `'box'`: boxplot
- `'kde'`: Kernel Density Estimation plot
- `'density'`: same as `'kde'`
- `'area'`: area plot
- `'pie'`: pie plot
- `'scatter'`: scatter plot (DataFrame only)
- `'hexbin'`: hexbin plot (DataFrame only)
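As a small illustration of the `kind` parameter, the sketch below draws the same scatter plot in two ways: through `kind='scatter'` and through the equivalent `DataFrame.plot.scatter()` accessor method. The `toy` DataFrame and its column names are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical toy data (made-up values)
toy = pd.DataFrame({"x_col": [1, 2, 3, 4], "y_col": [10, 20, 15, 30]})

# Option 1: choose the plot type with the kind parameter
ax_kind = toy.plot(x="x_col", y="y_col", kind="scatter")

# Option 2: the accessor method for the same plot type
ax_accessor = toy.plot.scatter(x="x_col", y="y_col")

# Both calls return a matplotlib Axes object
print(type(ax_kind).__name__, type(ax_accessor).__name__)
```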

`kind`.

As a learning exercise, I find it interesting to show the different types of graphs that the `kind` parameter can produce.

In this case **it is not important whether the visualization is appropriate or not**; the goal is only to show the different representations offered by the `kind` parameter.

In [10]:

```
kind_types = ['line', 'bar', 'barh', 'hist', 'box', 'kde', 'density', 'area', 'pie', 'scatter', 'hexbin']
# Draw the same two columns once per plot type
for kind in kind_types:
    recent_grads.plot(x='Sample_size', y='Employed', kind=kind)
```

`DataFrame.plot()`

The `DataFrame.plot()` method has a few parameters we can use for tweaking the scatter plot:

- `x`
- `y`
- `kind`
- `title`
- `figsize`

We can also store the Axes object that `DataFrame.plot()` returns and use that object's own methods, such as `set_title()`.

In [11]:

```
ax = recent_grads.plot(x='Sample_size', y='Employed', kind='scatter')
ax.set_title('Employed vs. Sample_size')
```

Out[11]:

The plot suggests a correlation between the sample size and the number of people employed.
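A visual impression of correlation can be backed up with a number. The sketch below computes the Pearson correlation coefficient with `Series.corr()` on made-up values; the `toy` DataFrame is a hypothetical stand-in, not the real `recent_grads` columns.

```python
import pandas as pd

# Hypothetical toy data (made-up values), not the real recent_grads columns
toy = pd.DataFrame({
    "Sample_size": [10, 20, 30, 40, 50],
    "Employed": [900, 2100, 2900, 4200, 5000],
})

# Pearson correlation coefficient between the two columns (ranges from -1 to 1)
r = toy["Sample_size"].corr(toy["Employed"])
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.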

To explore the following relations, we generate a series of scatter plots:

`Sample_size` vs. `Median`

- `Sample_size`: Sample size (unweighted) of full-time.
- `Median`: Median salary of full-time, year-round workers.

In [12]:

```
recent_grads.plot(x='Sample_size',
y='Median',
kind='scatter',
title='Sample size of full-time vs. median salary of full-time, year-round workers.')
```

Out[12]:

Majors with small full-time sample sizes show the widest range of median salaries. In contrast, as the sample size grows, the median salary tends to decrease.

`Sample_size` vs. `Unemployment_rate`

- `Sample_size`: Sample size (unweighted) of full-time.
- `Unemployment_rate`: The percentage of the work force that is unemployed at any given date.

In [13]:

```
recent_grads.plot(x='Sample_size',
y='Unemployment_rate',
kind='scatter',
title='Sample_size vs. Unemployment_rate')
```

Out[13]:

Similarly, when the sample size is small, the unemployment rate covers a fairly wide range. As the full-time sample size increases, the unemployment rate concentrates around the central values of the plot.

`Full_time` vs. `Median`

- `Full_time`: Number employed 35 hours or more.
- `Median`: Median salary of full-time, year-round workers.

In [14]:

```
recent_grads.plot(x='Full_time',
y='Median',
kind='scatter',
title='Full_time vs. Median')
```

Out[14]:

For majors with a small number of people working 35 hours or more (roughly 0 to 2,500, on an axis that extends to 250,000), the median salary ranges from about 20,000 to 80,000. As the number of people who work full time increases, the median salary tends to decrease.

`ShareWomen` vs. `Unemployment_rate`

- `ShareWomen`: Women as share of total.
- `Unemployment_rate`: The percentage of the work force that is unemployed at any given date.

In [15]:

```
recent_grads.plot(x='ShareWomen',
y='Unemployment_rate',
kind='scatter',
title='ShareWomen vs. Unemployment_rate')
```

Out[15]:

The unemployment rate is spread fairly broadly across the range of `ShareWomen` values.

`Men` vs. `Median`

- `Men`: Male graduates.
- `Median`: Median salary of full-time, year-round workers.

In [16]:

```
recent_grads.plot(x='Men',
y='Median',
kind='scatter',
title='Men vs. Median')
```

Out[16]:

As we can see, the majors with few male graduates have median salaries between about 20,000 and 80,000, and as the number of men increases, the median salary tends to decrease.

`Women` vs. `Median`

- `Women`: Female graduates.
- `Median`: Median salary of full-time, year-round workers.

In [17]:

```
recent_grads.plot(x='Women',
y='Median',
kind='scatter',
title='Women vs. Median')
```

Out[17]:

- The same trend occurs in the plot for women. However, we must bear in mind that the number of women here is greater than in the plot for men, which affects the scale of the visualization.

- Notably, although the top salaries can be equivalent, the difference is that the average for men is higher than for women.

`Matplotlib`, much more detailed plots.

Because the data we are going to handle is numerical rather than categorical, the ideal choice in this case is a histogram.

A histogram is a graph that uses rectangles to represent frequency distributions over given intervals. That is, it shows how often values fall within each range.

It is a way to do exploratory analysis and observe the distribution of the data to be analyzed.
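To make the idea of binning concrete, the sketch below counts how many values fall into each interval using `numpy.histogram()`, which is the kind of computation a histogram plot visualizes. The `values` array is made up for illustration.

```python
import numpy as np

# Hypothetical values (made up for illustration)
values = np.array([1, 2, 2, 3, 3, 3, 7, 8, 9, 10])

# Split the range [0, 10] into 5 equal-width bins and count the values per bin.
# Every bin is half-open [left, right) except the last, which includes its right edge.
counts, bin_edges = np.histogram(values, bins=5, range=(0, 10))

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Each count corresponds to the height of one rectangle in the histogram.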

In [18]:

```
cols = ["Sample_size",
"Median",
"Employed",
"Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(20,40))
r = 0
col_name = cols[r]
ax = fig.add_subplot(len(cols),1,r+1)
ax = recent_grads[col_name].plot(kind='hist', rot=0)
texto = recent_grads[col_name]
ax.set_xlabel(col_name, rotation = 0)
```

Out[18]:

The histogram shows that low values are the most frequent in the full-time `Sample_size` column.

In [19]:

```
cols = ["Sample_size",
"Median",
"Employed",
"Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(20,40))
r = 1
col_name = cols[r]
ax = fig.add_subplot(len(cols),1,r+1)
ax = recent_grads[col_name].plot(kind='hist', rot=0, bins = 50)
texto = recent_grads[col_name]
ax.set_xlabel(col_name, rotation = 0)
```

Out[19]:

In this case we see that the most frequent value of the `Median` salary of full-time, year-round workers is approximately 30,000 dollars. On its own, this histogram cannot confirm that **students in more popular majors make more money**; the scatter plots above suggest the opposite trend.

In [20]:

```
cols = ["Sample_size",
"Median",
"Employed",
"Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(20,40))
r = 2
col_name = cols[r]
ax = fig.add_subplot(len(cols),1,r+1)
ax = recent_grads[col_name].plot(kind='hist', rot=0, bins = 50)
texto = recent_grads[col_name]
ax.set_xlabel(col_name, rotation = 0)
```

Out[20]:

In a skewed distribution, we see the following:

- The values pile up toward the starting point of the range.
- Values decrease in frequency towards the opposite end, forming the **tail of the distribution**.

Therefore, the most frequent values of `Employed` are quite low.
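Skewness can also be checked numerically. The following sketch uses `Series.skew()` on made-up values with many small numbers and a few very large ones; the `employed` Series is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Hypothetical right-skewed values (made up): many small values, few large ones
employed = pd.Series([100, 120, 150, 200, 250, 300, 400, 5000, 20000])

# Positive skewness means the tail extends toward larger values
skewness = employed.skew()
print(skewness > 0)

# In a right-skewed distribution the mean is pulled above the median
mean_above_median = employed.mean() > employed.median()
print(mean_above_median)
```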

In [21]:

```
cols = ["Sample_size",
"Median",
"Employed",
"Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(20,40))
r = 3
col_name = cols[r]
ax = fig.add_subplot(len(cols),1,r+1)
ax = recent_grads[col_name].plot(kind='hist', rot=0, bins = 50)
texto = recent_grads[col_name]
ax.set_xlabel(col_name, rotation = 0)
```

Out[21]:

The same goes for `Full_time`: the pattern of the graph above is repeated.

In [22]:

```
cols = ["Sample_size",
"Median",
"Employed",
"Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(20,40))
r = 4
col_name = cols[r]
ax = fig.add_subplot(len(cols),1,r+1)
ax = recent_grads[col_name].plot(kind='hist', rot=0, bins = 50)
texto = recent_grads[col_name]
ax.set_xlabel(col_name, rotation = 0)
```

Out[22]:

We can see that the `ShareWomen` distribution is fairly homogeneous across the dataset.

In [23]:

```
cols = ["Sample_size",
"Median",
"Employed",
"Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
fig = plt.figure(figsize=(20,40))
r = 5
col_name = cols[r]
ax = fig.add_subplot(len(cols),1,r+1)
ax = recent_grads[col_name].plot(kind='hist', rot=0, bins = 50)
texto = recent_grads[col_name]
ax.set_xlabel(col_name, rotation = 0)
```

Out[23]:

The `Unemployment_rate` can be said to be low, since the values on the **x-axis** are low and the distribution covers a fairly narrow range.
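The six histogram cells above repeat the same steps and change only the index `r`. A loop could generate all the panels in a single figure. The sketch below shows one way to do this on a made-up DataFrame; the `toy` frame and its columns `a`, `b`, and `c` are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data (random values), standing in for recent_grads
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((50, 3)), columns=["a", "b", "c"])

cols = ["a", "b", "c"]
fig = plt.figure(figsize=(8, 9))
axes = []
for r, col_name in enumerate(cols):
    # One subplot per column, stacked vertically
    ax = fig.add_subplot(len(cols), 1, r + 1)
    # Pass the Axes explicitly so each histogram lands in its own panel
    toy[col_name].plot(kind="hist", bins=10, ax=ax)
    ax.set_xlabel(col_name)
    axes.append(ax)

fig.tight_layout()
```

Passing `ax=ax` makes the target panel explicit instead of relying on whichever Axes is currently active.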

In [24]:

```
plt.hist(recent_grads['Women'], label= "Women")
plt.hist(recent_grads['Men'], label= "Men")
plt.legend()
plt.show()
```

By overlaying the two histograms we can observe that the total number and the frequency are greater for women.

We already saw this on the x-axis of the scatter plots in which we compared the median salary for the two genders.
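When two histograms are overlaid, one can hide the other. A possible refinement is to draw both with shared bin edges and some transparency. This is a sketch on made-up random counts; the `men` and `women` arrays are hypothetical stand-ins for the real columns.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical counts (random values), standing in for the Men and Women columns
rng = np.random.default_rng(1)
men = rng.integers(0, 1000, size=100)
women = rng.integers(0, 1500, size=100)

# Shared bin edges so both histograms are directly comparable
bins = np.linspace(0, 1500, 16)

fig, ax = plt.subplots()
# alpha < 1 keeps the histogram underneath visible
ax.hist(women, bins=bins, alpha=0.5, label="Women")
ax.hist(men, bins=bins, alpha=0.5, label="Men")
ax.legend()
```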

For the `Major_category` column, which is categorical, it would be more advisable to work with bar plots instead of scatter plots.

In [25]:

```
plt.plot(recent_grads['Women'], label= "Women")
plt.plot(recent_grads['Men'], label= "Men")
plt.legend()
plt.show()
```

We quickly observe that **women outnumber men** and that women are concentrated in different fields than men.

In [26]:

```
max_value = recent_grads["Men"].max()
max_value # max number of Men.
```

Out[26]:

In [27]:

```
max_men_value_bool = recent_grads["Men"] == max_value
max_men_value_bool.unique()
```

Out[27]:

In [28]:

```
M_final_col = ["Major_category"]
M_final_col
```

Out[28]:

Through Boolean filtering we obtain the most chosen major category.

In [29]:

```
result = recent_grads.loc[max_men_value_bool,M_final_col]
result
```

Out[29]:

In [30]:

```
max_value = recent_grads["Women"].max()
max_value
```

Out[30]:

In [31]:

```
max_value_bool = recent_grads["Women"] == max_value
max_value_bool.unique()
```

Out[31]:

In [32]:

```
W_final_col = ["Major_category"]
W_final_col
```

Out[32]:

In [33]:

```
result = recent_grads.loc[max_value_bool,W_final_col]
result
```

Out[33]:

| Gender | Major category | # people |
|---|---|---|
| Men | Business | 173809.0 |
| Women | Psychology & Social Work | 307087.0 |

Comparing the most chosen major for each gender, the number of **women is almost double** the number of men.
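The lookup done above with a maximum value and a Boolean mask can also be sketched more compactly with `Series.idxmax()`, which returns the index label of the largest value. The `toy` DataFrame below uses made-up counts and is only an illustration.

```python
import pandas as pd

# Hypothetical toy data (made-up counts), not the real recent_grads values
toy = pd.DataFrame({
    "Major_category": ["Business", "Psychology & Social Work", "Engineering"],
    "Men": [300, 100, 200],
    "Women": [250, 400, 50],
})

# idxmax() gives the row label of the maximum; loc then picks the category
top_men = toy.loc[toy["Men"].idxmax(), "Major_category"]
top_women = toy.loc[toy["Women"].idxmax(), "Major_category"]

print(top_men)
print(top_women)
```

Note that `idxmax()` returns only the first occurrence if several rows share the maximum, whereas the Boolean mask returns all of them.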

`matplotlib` by type of major and gender.

- This is a representation that came to mind based on the major categories and the use of Boolean filters.

In [34]:

```
carreras = recent_grads['Major_category'].unique()
carreras
```

Out[34]:

In [46]:

```
# Draw one grouped bar chart (men vs. women per major) for each major category
for carrera in carreras:
    serie_bool = recent_grads["Major_category"] == carrera
    f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
    y1 = list(recent_grads.loc[serie_bool,"Men"])
    y2 = list(recent_grads.loc[serie_bool,"Women"])
    texto = recent_grads.loc[serie_bool,"Major"]
    x = np.arange(len(texto))
    width = 0.2
    # plot data in grouped manner of bar type
    plt.bar(x-0.2, y1, width, color='green')
    plt.bar(x, y2, width, color='orange')
    plt.xticks(x, texto, rotation = 90)
    plt.xlabel(carrera)
    plt.ylabel("People")
    # legend order must match plotting order: y1 (Men) first, then y2 (Women)
    plt.legend(['Men', 'Women'], fontsize = 14)
    plt.show()
```
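One of the opening questions asked which category of majors has the most students. A possible way to answer it is to aggregate with `DataFrame.groupby()` and draw a pandas bar plot. This is a sketch on made-up data; the `toy` frame and its totals are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data (made-up totals), standing in for recent_grads
toy = pd.DataFrame({
    "Major_category": ["Business", "Business", "Engineering", "Arts"],
    "Total": [500, 300, 400, 100],
})

# Sum the students of every major within each category, largest first
totals = toy.groupby("Major_category")["Total"].sum().sort_values(ascending=False)
print(totals)

# Bar plot of the totals; the call returns a matplotlib Axes
ax = totals.plot(kind="bar", title="Students per major category")
ax.set_ylabel("People")
```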