1. Introduction

In [36]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

recent_grads = pd.read_csv('recent-grads.csv')

print(recent_grads.iloc[0])
print(recent_grads.head())
print(recent_grads.tail())
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0   
1     2        2416             MINING AND MINERAL ENGINEERING    756.0   
2     3        2415                  METALLURGICAL ENGINEERING    856.0   
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0   
4     5        2405                       CHEMICAL ENGINEERING  32260.0   

       Men    Women Major_category  ShareWomen  Sample_size  Employed  \
0   2057.0    282.0    Engineering    0.120564           36      1976   
1    679.0     77.0    Engineering    0.101852            7       640   
2    725.0    131.0    Engineering    0.153037            3       648   
3   1123.0    135.0    Engineering    0.107313           16       758   
4  21239.0  11021.0    Engineering    0.341631          289     25694   

       ...        Part_time  Full_time_year_round  Unemployed  \
0      ...              270                  1207          37   
1      ...              170                   388          85   
2      ...              133                   340          16   
3      ...              150                   692          40   
4      ...             5180                 16697        1672   

   Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  \
0           0.018381  110000  95000  125000          1534               364   
1           0.117241   75000  55000   90000           350               257   
2           0.024096   73000  50000  105000           456               176   
3           0.050125   70000  43000   80000           529               102   
4           0.061098   65000  50000   75000         18314              4440   

   Low_wage_jobs  
0            193  
1             50  
2              0  
3              0  
4            972  

[5 rows x 21 columns]
     Rank  Major_code                   Major   Total     Men   Women  \
168   169        3609                 ZOOLOGY  8409.0  3050.0  5359.0   
169   170        5201  EDUCATIONAL PSYCHOLOGY  2854.0   522.0  2332.0   
170   171        5202     CLINICAL PSYCHOLOGY  2838.0   568.0  2270.0   
171   172        5203   COUNSELING PSYCHOLOGY  4626.0   931.0  3695.0   
172   173        3501         LIBRARY SCIENCE  1098.0   134.0   964.0   

               Major_category  ShareWomen  Sample_size  Employed  \
168    Biology & Life Science    0.637293           47      6259   
169  Psychology & Social Work    0.817099            7      2125   
170  Psychology & Social Work    0.799859           13      2101   
171  Psychology & Social Work    0.798746           21      3777   
172                 Education    0.877960            2       742   

         ...        Part_time  Full_time_year_round  Unemployed  \
168      ...             2190                  3602         304   
169      ...              572                  1211         148   
170      ...              648                  1293         368   
171      ...              965                  2738         214   
172      ...              237                   410          87   

     Unemployment_rate  Median  P25th  P75th  College_jobs  Non_college_jobs  \
168           0.046320   26000  20000  39000          2771              2947   
169           0.065112   25000  24000  34000          1488               615   
170           0.149048   25000  25000  40000           986               870   
171           0.053621   23400  19200  26000          2403              1245   
172           0.104946   22000  20000  22000           288               338   

     Low_wage_jobs  
168            743  
169             82  
170            622  
171            308  
172            192  

[5 rows x 21 columns]
In [37]:
recent_grads.head()
Out[37]:
   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0
1     2        2416             MINING AND MINERAL ENGINEERING    756.0
2     3        2415                  METALLURGICAL ENGINEERING    856.0
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0
4     5        2405                       CHEMICAL ENGINEERING  32260.0

       Men    Women Major_category  ShareWomen  Sample_size  Employed  \
0   2057.0    282.0    Engineering    0.120564           36      1976
1    679.0     77.0    Engineering    0.101852            7       640
2    725.0    131.0    Engineering    0.153037            3       648
3   1123.0    135.0    Engineering    0.107313           16       758
4  21239.0  11021.0    Engineering    0.341631          289     25694

       ...        Part_time  Full_time_year_round  Unemployed  \
0      ...              270                  1207          37
1      ...              170                   388          85
2      ...              133                   340          16
3      ...              150                   692          40
4      ...             5180                 16697        1672

   Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  \
0           0.018381  110000  95000  125000          1534               364
1           0.117241   75000  55000   90000           350               257
2           0.024096   73000  50000  105000           456               176
3           0.050125   70000  43000   80000           529               102
4           0.061098   65000  50000   75000         18314              4440

   Low_wage_jobs
0            193
1             50
2              0
3              0
4            972

5 rows × 21 columns

In [38]:
print(recent_grads.describe())
             Rank   Major_code          Total            Men          Women  \
count  173.000000   173.000000     172.000000     172.000000     172.000000   
mean    87.000000  3879.815029   39370.081395   16723.406977   22646.674419   
std     50.084928  1687.753140   63483.491009   28122.433474   41057.330740   
min      1.000000  1100.000000     124.000000     119.000000       0.000000   
25%     44.000000  2403.000000    4549.750000    2177.500000    1778.250000   
50%     87.000000  3608.000000   15104.000000    5434.000000    8386.500000   
75%    130.000000  5503.000000   38909.750000   14631.000000   22553.750000   
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000   

       ShareWomen  Sample_size       Employed      Full_time      Part_time  \
count  172.000000   173.000000     173.000000     173.000000     173.000000   
mean     0.522223   356.080925   31192.763006   26029.306358    8832.398844   
std      0.231205   618.361022   50675.002241   42869.655092   14648.179473   
min      0.000000     2.000000       0.000000     111.000000       0.000000   
25%      0.336026    39.000000    3608.000000    3154.000000    1030.000000   
50%      0.534024   130.000000   11797.000000   10048.000000    3299.000000   
75%      0.703299   338.000000   31433.000000   25147.000000    9948.000000   
max      0.968954  4212.000000  307933.000000  251540.000000  115172.000000   

       Full_time_year_round    Unemployed  Unemployment_rate         Median  \
count            173.000000    173.000000         173.000000     173.000000   
mean           19694.427746   2416.329480           0.068191   40151.445087   
std            33160.941514   4112.803148           0.030331   11470.181802   
min              111.000000      0.000000           0.000000   22000.000000   
25%             2453.000000    304.000000           0.050306   33000.000000   
50%             7413.000000    893.000000           0.067961   36000.000000   
75%            16891.000000   2393.000000           0.087557   45000.000000   
max           199897.000000  28169.000000           0.177226  110000.000000   

              P25th          P75th   College_jobs  Non_college_jobs  \
count    173.000000     173.000000     173.000000        173.000000   
mean   29501.445087   51494.219653   12322.635838      13284.497110   
std     9166.005235   14906.279740   21299.868863      23789.655363   
min    18500.000000   22000.000000       0.000000          0.000000   
25%    24000.000000   42000.000000    1675.000000       1591.000000   
50%    27000.000000   47000.000000    4390.000000       4595.000000   
75%    33000.000000   60000.000000   14444.000000      11783.000000   
max    95000.000000  125000.000000  151643.000000     148395.000000   

       Low_wage_jobs  
count     173.000000  
mean     3859.017341  
std      6944.998579  
min         0.000000  
25%       340.000000  
50%      1231.000000  
75%      3466.000000  
max     48207.000000  
In [39]:
raw_data_count = recent_grads.count()
print(raw_data_count)
Rank                    173
Major_code              173
Major                   173
Total                   172
Men                     172
Women                   172
Major_category          173
ShareWomen              172
Sample_size             173
Employed                173
Full_time               173
Part_time               173
Full_time_year_round    173
Unemployed              173
Unemployment_rate       173
Median                  173
P25th                   173
P75th                   173
College_jobs            173
Non_college_jobs        173
Low_wage_jobs           173
dtype: int64
In [40]:
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.count()
print(cleaned_data_count)
Rank                    172
Major_code              172
Major                   172
Total                   172
Men                     172
Women                   172
Major_category          172
ShareWomen              172
Sample_size             172
Employed                172
Full_time               172
Part_time               172
Full_time_year_round    172
Unemployed              172
Unemployment_rate       172
Median                  172
P25th                   172
P75th                   172
College_jobs            172
Non_college_jobs        172
Low_wage_jobs           172
dtype: int64
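
Comparing the two counts shows that dropna() removed one row. A minimal sketch of how the affected rows could be inspected before dropping them is given here. It uses a small made-up frame rather than recent-grads.csv; the name `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the middle row lacks Total, Men, and Women.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [2339.0, np.nan, 856.0],
    "Men": [2057.0, np.nan, 725.0],
    "Women": [282.0, np.nan, 131.0],
})

# Select the rows that have at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)

# dropna() removes exactly those rows.
cleaned = toy.dropna()
print(len(toy) - len(cleaned), "row(s) dropped")
```

On the real data, the same selection would be recent_grads[recent_grads.isnull().any(axis=1)], run before the dropna() cell.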

2. Pandas, Scatter Plots

Generate scatter plots in separate Jupyter notebook cells to explore the following relationships:

- Sample_size and Median
- Sample_size and Unemployment_rate
- Full_time and Median
- ShareWomen and Unemployment_rate
- Men and Median
- Women and Median

Use the plots to explore the following questions:

- Do students in more popular majors make more money?
- Do students who majored in majority-female subjects make more money?
- Is there any link between the number of full-time employees and median salary?
In [41]:
ax = recent_grads.plot(x='Sample_size', y='Employed', kind='scatter')
ax.set_title('Employed vs. Sample-size')
Out[41]:
<matplotlib.text.Text at 0x7fc4ffb7f320>
In [43]:
recent_grads.plot(x='Sample_size', y='Median', kind='scatter')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4ff635da0>
In [44]:
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4ffc7ee48>
In [45]:
recent_grads.plot(x='Full_time', y='Median', kind='scatter')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4ff84ce10>
In [46]:
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4ffda9f98>
In [47]:
recent_grads.plot(x='Men', y='Median', kind='scatter')
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4ff8734a8>
In [48]:
recent_grads.plot(x='Women', y='Median', kind='scatter')
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4ff774f28>
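
The scatter plots give a visual impression; a correlation coefficient can back it up with a number. The sketch below uses only the five rows printed by head() in section 1, so its result says nothing about the full dataset. The name `sample` is introduced here for illustration.

```python
import pandas as pd

# First five rows of recent_grads as printed in section 1 (illustration only).
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289],
    "Median": [110000, 75000, 73000, 70000, 65000],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098],
})

# Pearson correlation between two columns; values near 0 suggest no linear link.
r = sample["Sample_size"].corr(sample["Median"])
print(round(r, 3))

# Pairwise correlation matrix for all selected columns.
corr_matrix = sample.corr()
print(corr_matrix)
```

On the cleaned data, the equivalent call would be recent_grads['Sample_size'].corr(recent_grads['Median']).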

3. Pandas, Histograms

To explore the distribution of values in a column, we can select it from the DataFrame, call Series.plot(), and set the kind parameter to 'hist':

In [53]:
recent_grads['Sample_size'].plot(kind='hist')
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4ff0ff080>
In [54]:
recent_grads['Sample_size'].plot(kind='hist', bins=25, range=(0,5000))
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4ff04b6d8>

Series.hist() is an alternative that draws the same histogram (with gridlines shown by default):

In [55]:
recent_grads['Sample_size'].hist(bins=25, range=(0,5000))
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fef6dd68>
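
To make the bins and range arguments concrete: bins=25 over range=(0, 5000) splits the axis into 25 equal bins, each 200 wide. The sketch below reproduces that binning with numpy.histogram, which is assumed here to follow the same bins/range semantics as the plotting calls. The ten Sample_size values are the ones shown by head() and tail() in section 1, not the full column.

```python
import numpy as np

# Sample_size values from the first and last five rows printed in section 1.
sample_sizes = np.array([36, 7, 3, 16, 289, 47, 7, 13, 21, 2])

# 25 equal-width bins spanning 0 to 5000 gives 26 bin edges.
counts, edges = np.histogram(sample_sizes, bins=25, range=(0, 5000))

bin_width = edges[1] - edges[0]
print(bin_width)      # width of each bin
print(counts[:3])     # counts in the first three bins
```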

Generate histograms in separate Jupyter notebook cells to explore the distributions of the following columns:

- Sample_size
- Median
- Employed
- Full_time
- ShareWomen
- Unemployment_rate
- Men
- Women

Use the plots to explore the following questions:

- What percent of majors are predominantly male? Predominantly female?
- What's the most common median salary range?
In [72]:
recent_grads['Median'].hist(bins=10)
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fe82d240>
In [76]:
recent_grads['Employed'].hist(bins=15)
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fe547828>
In [77]:
recent_grads['Full_time'].hist()
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fe4cb0b8>
In [80]:
recent_grads['ShareWomen'].hist()
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fe2de358>
In [79]:
recent_grads['Unemployment_rate'].hist()
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fe37dba8>
In [81]:
recent_grads['Men'].hist()
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fe20db70>
In [82]:
recent_grads['Women'].hist()
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fe1af4a8>
In [66]:
cols = ['Sample_size', 'Median', 'Employed', 'Full_time', 'ShareWomen', 'Unemployment_rate', 'Men', 'Women']

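fig = plt.figure(figsize=(5, 15))
# Only the first four columns in cols are plotted here, one per row of the grid.
for i in range(4):
    ax = fig.add_subplot(4, 1, i + 1)
    # Pass ax explicitly so each histogram is drawn on its own subplot.
    recent_grads[cols[i]].plot(kind='hist', rot=30, ax=ax, title=cols[i])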
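
The two questions above can also be approached numerically. The following is a minimal sketch, assuming a 0.5 threshold on ShareWomen defines "predominantly" and that ten equal-width bins are a reasonable way to group salaries. It uses only the ten rows printed by head() and tail() in section 1, so the figures it prints do not describe the full dataset.

```python
import pandas as pd

# ShareWomen and Median for the ten rows printed by head() and tail() in section 1.
sample = pd.DataFrame({
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631,
                   0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
    "Median": [110000, 75000, 73000, 70000, 65000,
               26000, 25000, 25000, 23400, 22000],
})

# A boolean Series averages to the fraction of True values.
pct_female_majority = (sample["ShareWomen"] > 0.5).mean() * 100
pct_male_majority = (sample["ShareWomen"] < 0.5).mean() * 100
print(pct_female_majority, pct_male_majority)

# Bin the medians into 10 equal-width intervals and pick the fullest one.
salary_bins = pd.cut(sample["Median"], bins=10)
most_common_bin = salary_bins.value_counts().idxmax()
print(most_common_bin)
```

Replacing `sample` with recent_grads would apply the same calculation to all majors.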

4. Pandas, Scatter Matrix Plot

A scatter matrix plot combines scatter plots and histograms in one grid, so we can explore relationships between columns and the distribution of each column at the same time. For n columns, the grid holds n by n plots: the plots on the diagonal are histograms of each column, and the off-diagonal plots are scatter plots of each pair of columns.

In [85]:
from pandas.plotting import scatter_matrix 

scatter_matrix(recent_grads[['Women', 'Men']], figsize=(10,10))
Out[85]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fe118400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fe09b8d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fe0e8400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fe027080>]],
      dtype=object)
In [86]:
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
Out[86]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fc4feb0aa20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fe9f4dd8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fe98ff28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fea5a4e0>]],
      dtype=object)
In [87]:
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10,10))
Out[87]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fed26048>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fdf3bac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fdf0a358>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fdec0fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fde90160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fde4bb38>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fde1b828>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fde56668>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc4fdd9eba8>]],
      dtype=object)
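
The arrays printed above are the n by n grids of axes that scatter_matrix returns. A small sketch of how that grid can be inspected is given here, using the five rows printed by head() in section 1. The Agg backend line is an assumption made so the snippet runs outside a notebook; it should be dropped when working inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd
from pandas.plotting import scatter_matrix

# First five rows of three columns, as printed in section 1 (illustration only).
sample = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289],
    "Median": [110000, 75000, 73000, 70000, 65000],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098],
})

# Three columns produce a 3 x 3 array of axes.
axes = scatter_matrix(sample, figsize=(6, 6))
print(axes.shape)

# The bottom row carries the x-axis labels, one per column.
print([ax.get_xlabel() for ax in axes[-1]])
```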

5. Pandas, Bar Plots

The following code returns a bar plot of the first 5 values in the Women column:

In [88]:
recent_grads[:5]['Women'].plot(kind='bar')
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fdc310f0>

By default, pandas labels each bar on the x-axis with the row's index value (0 to 4 here). If we instead use the DataFrame.plot.bar() method, we can use the x parameter to specify the column that supplies the labels and the y parameter to specify the column that supplies the bar heights:

In [92]:
recent_grads[:5].plot.bar(x='Major', y='Women')
Out[92]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fd9afeb8>

Use bar plots to compare the share of women (ShareWomen) in the first ten rows and the last ten rows of the recent_grads DataFrame.

In [95]:
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', legend=False)
Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fc55a438>
In [96]:
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen', legend=False)
Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fc4e5ef0>

Use bar plots to compare the unemployment rate (Unemployment_rate) in the first ten rows and the last ten rows of the recent_grads DataFrame.

In [101]:
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend=False)
Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fc231cf8>
In [102]:
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', legend=False)
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4fc1bd780>
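
Separate cells give each bar plot its own y-axis scale, which makes the first and last rows harder to compare. One possible alternative is to draw both plots side by side on a shared y-axis. The sketch below uses the five first and five last rows printed in section 1 (not ten, since only five of each appear above). The names `top` and `bottom` and the Agg backend line are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# First five rows, as printed by head() in section 1.
top = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
              "METALLURGICAL ENGINEERING",
              "NAVAL ARCHITECTURE AND MARINE ENGINEERING",
              "CHEMICAL ENGINEERING"],
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631],
})

# Last five rows, as printed by tail() in section 1.
bottom = pd.DataFrame({
    "Major": ["ZOOLOGY", "EDUCATIONAL PSYCHOLOGY", "CLINICAL PSYCHOLOGY",
              "COUNSELING PSYCHOLOGY", "LIBRARY SCIENCE"],
    "ShareWomen": [0.637293, 0.817099, 0.799859, 0.798746, 0.877960],
})

# sharey=True keeps both panels on the same vertical scale.
fig, (ax_top, ax_bottom) = plt.subplots(1, 2, sharey=True, figsize=(12, 4))
top.plot.bar(x="Major", y="ShareWomen", legend=False, ax=ax_top,
             title="First five rows")
bottom.plot.bar(x="Major", y="ShareWomen", legend=False, ax=ax_bottom,
                title="Last five rows")
```

With the full DataFrame, recent_grads[:10] and recent_grads[-10:] would take the place of `top` and `bottom`.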