Visualizing Earnings Based on College Majors

In this project we'll analyse a dataset on recent college graduates (recent-grads.csv) and explore the following questions:

Do students in more popular majors make more money?

  • Using scatter plots

How many majors are predominantly male? Predominantly female?

  • Using histograms

Which category of majors has the most students?

  • Using bar plots
In [1]:
#import of libraries 
import pandas as pd
import matplotlib.pyplot as plt

#run Jupyter magic so that plots are displayed inline
%matplotlib inline
In [2]:
recent_grads = pd.read_csv("recent-grads.csv")

#exploring first row using .iloc[]
print(recent_grads.iloc[0:1])

#exploring using head() and tail() 
print(recent_grads.head())
print(recent_grads.tail())
   Rank  Major_code                  Major   Total     Men  Women  \
0     1        2419  PETROLEUM ENGINEERING  2339.0  2057.0  282.0   

  Major_category  ShareWomen  Sample_size  Employed      ...        Part_time  \
0    Engineering    0.120564           36      1976      ...              270   

   Full_time_year_round  Unemployed  Unemployment_rate  Median  P25th   P75th  \
0                  1207          37           0.018381  110000  95000  125000   

   College_jobs  Non_college_jobs  Low_wage_jobs  
0          1534               364            193  

[1 rows x 21 columns]
   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0   
1     2        2416             MINING AND MINERAL ENGINEERING    756.0   
2     3        2415                  METALLURGICAL ENGINEERING    856.0   
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0   
4     5        2405                       CHEMICAL ENGINEERING  32260.0   

       Men    Women Major_category  ShareWomen  Sample_size  Employed  \
0   2057.0    282.0    Engineering    0.120564           36      1976   
1    679.0     77.0    Engineering    0.101852            7       640   
2    725.0    131.0    Engineering    0.153037            3       648   
3   1123.0    135.0    Engineering    0.107313           16       758   
4  21239.0  11021.0    Engineering    0.341631          289     25694   

       ...        Part_time  Full_time_year_round  Unemployed  \
0      ...              270                  1207          37   
1      ...              170                   388          85   
2      ...              133                   340          16   
3      ...              150                   692          40   
4      ...             5180                 16697        1672   

   Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  \
0           0.018381  110000  95000  125000          1534               364   
1           0.117241   75000  55000   90000           350               257   
2           0.024096   73000  50000  105000           456               176   
3           0.050125   70000  43000   80000           529               102   
4           0.061098   65000  50000   75000         18314              4440   

   Low_wage_jobs  
0            193  
1             50  
2              0  
3              0  
4            972  

[5 rows x 21 columns]
     Rank  Major_code                   Major   Total     Men   Women  \
168   169        3609                 ZOOLOGY  8409.0  3050.0  5359.0   
169   170        5201  EDUCATIONAL PSYCHOLOGY  2854.0   522.0  2332.0   
170   171        5202     CLINICAL PSYCHOLOGY  2838.0   568.0  2270.0   
171   172        5203   COUNSELING PSYCHOLOGY  4626.0   931.0  3695.0   
172   173        3501         LIBRARY SCIENCE  1098.0   134.0   964.0   

               Major_category  ShareWomen  Sample_size  Employed  \
168    Biology & Life Science    0.637293           47      6259   
169  Psychology & Social Work    0.817099            7      2125   
170  Psychology & Social Work    0.799859           13      2101   
171  Psychology & Social Work    0.798746           21      3777   
172                 Education    0.877960            2       742   

         ...        Part_time  Full_time_year_round  Unemployed  \
168      ...             2190                  3602         304   
169      ...              572                  1211         148   
170      ...              648                  1293         368   
171      ...              965                  2738         214   
172      ...              237                   410          87   

     Unemployment_rate  Median  P25th  P75th  College_jobs  Non_college_jobs  \
168           0.046320   26000  20000  39000          2771              2947   
169           0.065112   25000  24000  34000          1488               615   
170           0.149048   25000  25000  40000           986               870   
171           0.053621   23400  19200  26000          2403              1245   
172           0.104946   22000  20000  22000           288               338   

     Low_wage_jobs  
168            743  
169             82  
170            622  
171            308  
172            192  

[5 rows x 21 columns]
In [3]:
#using describe() to generate a summary of statistics
print(recent_grads.describe())
             Rank   Major_code          Total            Men          Women  \
count  173.000000   173.000000     172.000000     172.000000     172.000000   
mean    87.000000  3879.815029   39370.081395   16723.406977   22646.674419   
std     50.084928  1687.753140   63483.491009   28122.433474   41057.330740   
min      1.000000  1100.000000     124.000000     119.000000       0.000000   
25%     44.000000  2403.000000    4549.750000    2177.500000    1778.250000   
50%     87.000000  3608.000000   15104.000000    5434.000000    8386.500000   
75%    130.000000  5503.000000   38909.750000   14631.000000   22553.750000   
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000   

       ShareWomen  Sample_size       Employed      Full_time      Part_time  \
count  172.000000   173.000000     173.000000     173.000000     173.000000   
mean     0.522223   356.080925   31192.763006   26029.306358    8832.398844   
std      0.231205   618.361022   50675.002241   42869.655092   14648.179473   
min      0.000000     2.000000       0.000000     111.000000       0.000000   
25%      0.336026    39.000000    3608.000000    3154.000000    1030.000000   
50%      0.534024   130.000000   11797.000000   10048.000000    3299.000000   
75%      0.703299   338.000000   31433.000000   25147.000000    9948.000000   
max      0.968954  4212.000000  307933.000000  251540.000000  115172.000000   

       Full_time_year_round    Unemployed  Unemployment_rate         Median  \
count            173.000000    173.000000         173.000000     173.000000   
mean           19694.427746   2416.329480           0.068191   40151.445087   
std            33160.941514   4112.803148           0.030331   11470.181802   
min              111.000000      0.000000           0.000000   22000.000000   
25%             2453.000000    304.000000           0.050306   33000.000000   
50%             7413.000000    893.000000           0.067961   36000.000000   
75%            16891.000000   2393.000000           0.087557   45000.000000   
max           199897.000000  28169.000000           0.177226  110000.000000   

              P25th          P75th   College_jobs  Non_college_jobs  \
count    173.000000     173.000000     173.000000        173.000000   
mean   29501.445087   51494.219653   12322.635838      13284.497110   
std     9166.005235   14906.279740   21299.868863      23789.655363   
min    18500.000000   22000.000000       0.000000          0.000000   
25%    24000.000000   42000.000000    1675.000000       1591.000000   
50%    27000.000000   47000.000000    4390.000000       4595.000000   
75%    33000.000000   60000.000000   14444.000000      11783.000000   
max    95000.000000  125000.000000  151643.000000     148395.000000   

       Low_wage_jobs  
count     173.000000  
mean     3859.017341  
std      6944.998579  
min         0.000000  
25%       340.000000  
50%      1231.000000  
75%      3466.000000  
max     48207.000000  

In the next cells we'll drop the rows that contain missing values. Before that, we'll look at two ways to count how many rows have missing values.

In [4]:
#using info() to see how many non-null values each column has
print(recent_grads.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
Rank                    173 non-null int64
Major_code              173 non-null int64
Major                   173 non-null object
Total                   172 non-null float64
Men                     172 non-null float64
Women                   172 non-null float64
Major_category          173 non-null object
ShareWomen              172 non-null float64
Sample_size             173 non-null int64
Employed                173 non-null int64
Full_time               173 non-null int64
Part_time               173 non-null int64
Full_time_year_round    173 non-null int64
Unemployed              173 non-null int64
Unemployment_rate       173 non-null float64
Median                  173 non-null int64
P25th                   173 non-null int64
P75th                   173 non-null int64
College_jobs            173 non-null int64
Non_college_jobs        173 non-null int64
Low_wage_jobs           173 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB
None
In [5]:
#counting number of rows
raw_data_count = len(recent_grads)
print(raw_data_count)
173
In [6]:
#dropping rows with missing values
recent_grads = recent_grads.dropna()
In [7]:
#counting cleaned dataset and analysing missing rows
cleaned_data_count = len(recent_grads)
print(cleaned_data_count)
172

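In the above code we looked at two ways to count rows with missing values. info() shows the non-null count for each column: Total, Men, Women and ShareWomen each have one missing value. Comparing the row count before and after dropna() shows that one row was dropped (173 rows down to 172).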
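
The counts above can also be read off directly with isnull(). The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from recent-grads.csv); the same calls should work on recent_grads before dropna() is applied.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for recent_grads (values are made up)
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Total": [100.0, np.nan, 250.0, 80.0],
    "Men": [60.0, np.nan, 100.0, 30.0],
    "Median": [40000, 35000, 50000, 30000],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# dropna() removes exactly those rows
cleaned = toy.dropna()

print(missing_per_column)
print(rows_with_missing, len(toy) - len(cleaned))
```

Counting with any(axis=1) counts each affected row once, even when several of its columns are missing.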

In the scatter plots below we'll explore several relationships. Each scatter plot is drawn in a separate cell.

  • Sample_size and Median
  • Sample_size and Unemployment_rate
  • Full_time and Median
  • ShareWomen and Unemployment_rate
  • Men and Median
  • Women and Median
In [8]:
#Sample_size and Median scatter plot
ax = recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title='Sample Size vs Median')
plt.show()
In [9]:
#Sample_size and Unemployment_rate scatter plot
ax = recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter', title='Sample Size vs Unemployment Rate')
plt.show()
In [10]:
#Full_time and Median scatter plot
ax = recent_grads.plot(x='Full_time', y='Median', kind='scatter', title='Full-time vs Median')
plt.show()
In [11]:
#ShareWomen and Unemployment_rate scatter plot
ax = recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title='Share Women vs Unemployment Rate')
plt.show()
In [12]:
#Men and Median scatter plot
ax = recent_grads.plot(x='Men', y='Median', kind='scatter', title='Men vs Median')
plt.show()
In [13]:
#Women and Median scatter plot
ax = recent_grads.plot(x='Women', y='Median', kind='scatter', title='Women vs Median')
plt.show()
  • Do students in more popular majors make more money?

    It seems that the fewer students a major has, the higher its median salary tends to be.

  • Do students that majored in subjects that were majority female make more money?

    There doesn't seem to be any strong correlation: majors with a majority of female students don't necessarily make more money.

  • Is there any link between the number of full-time employees and median salary?

    It seems that the lower the number of full-time employees, the higher the median salary.
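
Reading trends off scatter plots is subjective, so a correlation coefficient can serve as a rough check. Below is a minimal sketch, assuming a helper named `median_correlations` (hypothetical, not part of the original notebook) and a made-up toy DataFrame; on the real data it could be called with columns such as Total, Full_time and ShareWomen.

```python
import pandas as pd

def median_correlations(df, columns):
    """Return the Pearson correlation of each listed column with 'Median'."""
    corr_matrix = df[columns + ["Median"]].corr()
    return corr_matrix["Median"].drop("Median")

# Hypothetical toy data (made up for illustration only)
toy = pd.DataFrame({
    "Total": [1000, 5000, 20000, 80000],
    "ShareWomen": [0.1, 0.3, 0.6, 0.8],
    "Median": [90000, 60000, 40000, 30000],
})

corr = median_correlations(toy, ["Total", "ShareWomen"])
print(corr)
```

A value close to -1 or 1 suggests a strong linear relationship and a value near 0 suggests little linear relationship. Pearson correlation only measures linear association, so it complements the plots rather than replacing them.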

In the code below we'll explore the distributions of the following columns.

- Sample_size

- Median

- Employed

- Full_time

- ShareWomen

- Unemployment_rate

- Men

- Women

In [14]:
#histogram of Sample_size column
fig, ax = plt.subplots()
ax.hist(recent_grads['Sample_size'], bins=5, range=(0, 5000))
ax.set_title("Distribution of Sample Size Column")
plt.show()
In [15]:
#histogram of Median column
fig, ax = plt.subplots()
ax.hist(recent_grads['Median'], bins=12, range=(0, 120000))
ax.set_title("Distribution of Median Column")
plt.show()
In [16]:
#histogram of Employed column
fig, ax = plt.subplots()
ax.hist(recent_grads['Employed'], bins=5, range=(0, 32000))
ax.set_title("Distribution of Employed Column")
plt.show()
In [17]:
#histogram of Full_time column
fig, ax = plt.subplots()
ax.hist(recent_grads['Full_time'], bins=5)
ax.set_title("Distribution of Full time Column")
plt.show()
In [18]:
#histogram of ShareWomen column
fig, ax = plt.subplots()
ax.hist(recent_grads['ShareWomen'], bins=12, range=(0,1.2))
ax.set_title("Distribution of ShareWomen Column")
plt.show()
In [19]:
#histogram of Unemployment_rate column
fig, ax = plt.subplots()
ax.hist(recent_grads['Unemployment_rate'], bins=18)
ax.set_title("Distribution of Unemployment Column")
plt.show()
In [20]:
#histogram of Men column
xtick_number = [0, 20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000]
xtick_labels = ["0", "20K", "40K", "60K", "80K", "100K", "120K", "140K", "160K"]
fig, ax = plt.subplots()
ax.hist(recent_grads['Men'], bins=12)
ax.set_title("Distribution of Men Column")
ax.set_xticks(xtick_number)
ax.set_xticklabels(xtick_labels)
plt.show()
In [21]:
#histogram of Women column
xtick_number = [0, 50000, 100000, 150000, 200000, 250000, 300000, 350000]
xtick_labels = ["0", "50K", "100K", "150K", "200K", "250K", "300K", "350K"]
fig, ax = plt.subplots()
ax.hist(recent_grads['Women'], bins=8)
ax.set_title("Distribution of Women Column")
ax.set_xticks(xtick_number)
ax.set_xticklabels(xtick_labels)
plt.show()
  • How many majors are predominantly male? Predominantly female?

    Judging from the ShareWomen histogram above, roughly 95 majors appear to be predominantly female, while the rest are predominantly male.
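
Counts read from histogram bars are approximate. A more exact count can be obtained by comparing ShareWomen with 0.5. This is a minimal sketch, assuming a helper named `count_by_majority` (hypothetical) and a made-up Series; on the real data it would be called with recent_grads['ShareWomen'].

```python
import pandas as pd

def count_by_majority(share_women):
    """Count majors where women make up more / less than half of graduates."""
    female_majority = int((share_women > 0.5).sum())
    male_majority = int((share_women < 0.5).sum())
    return female_majority, male_majority

# Hypothetical shares (made up); a value of exactly 0.5 counts for neither side
toy_share = pd.Series([0.12, 0.34, 0.50, 0.64, 0.88])

female, male = count_by_majority(toy_share)
print(female, male)
```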

In the next cells we'll use scatter matrix plots to check whether there are any relationships between columns of the dataset.

In [22]:
#import scatter_matrix() function
from pandas.plotting import scatter_matrix 

#Sample_size and Median scatter matrix plot
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
plt.show()
In [23]:
#Sample_size, Median, and Unemployment_rate scatter matrix plot
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10,10))
plt.show()

In the scatter matrix above we can see the following relation:

  • The majors with a lower sample size have the highest rate of unemployment

In the bar plots below we'll compare:

  • the percentage of women in the first ten and last ten majors
  • the unemployment rate in the first ten and last ten majors
  • which majors have the most students
In [64]:
#barplot of ShareWomen for the first ten and last ten rows of the dataframe
#pd.concat is used because DataFrame.append was removed in pandas 2.0
y_ticks = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
y_tick_labels = ['0%', '20%', '40%', '60%', '80%', '100%']
pd.concat([recent_grads[:10], recent_grads[-10:]]).plot.bar(x='Major', y='ShareWomen')
plt.ylabel("Percentage of Women in Major")
plt.yticks(y_ticks, y_tick_labels)
plt.title("% of Women in First and Last Ten Majors")
plt.show()
In [55]:
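#barplot of Unemployment_rate for the first ten and last ten rows of the dataframe
#pd.concat is used because DataFrame.append was removed in pandas 2.0
pd.concat([recent_grads[:10], recent_grads[-10:]]).plot.bar(x='Major', y='Unemployment_rate')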
plt.ylabel("Unemployment Rate")
plt.title("Unemployment Rate of first and last ten rows")
plt.show()
In [54]:
#extracting the ten majors with the most students
sorted_totals = recent_grads.sort_values('Total', ascending=False)
top_ten_majors = sorted_totals[:10]

#barplot for the ten majors with the most students
top_ten_majors.plot.bar(x='Major', y='Total')
plt.ylabel("Number of People")
plt.title("Majors With Most Students")
plt.show()
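
The bar plot above ranks individual majors. To answer the question at the level of major categories, the totals could first be summed per Major_category. This is a minimal sketch, assuming a helper named `students_per_category` (hypothetical, not in the original notebook) and a made-up toy DataFrame with the same column names as recent_grads.

```python
import pandas as pd

def students_per_category(df):
    """Sum 'Total' within each 'Major_category', largest category first."""
    totals = df.groupby("Major_category")["Total"].sum()
    return totals.sort_values(ascending=False)

# Hypothetical toy data (made up for illustration only)
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Major_category": ["Engineering", "Business", "Engineering", "Education"],
    "Total": [2000.0, 9000.0, 3000.0, 1000.0],
})

per_category = students_per_category(toy)
print(per_category)
```

On the real data, calling per_category.plot.bar() on the result of students_per_category(recent_grads) should draw one bar per category.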

To conclude this analysis of the recent grads dataset, we explored each of our questions. Looking at the college majors data, more women than men graduated overall. The most studied majors are psychology, business management and biology. Our final conclusion on median salary is that the fewer people enrol in a major, the higher its median salary tends to be, and that the median salaries of most majors fall in the same bracket. This concludes our analysis of earnings based on college majors.
