Enrolments at Scottish Universities

Investigating trends in female versus male enrolment on undergraduate degree courses at Scottish universities. The focus is on Computing, but we are also interested in how Computing fares against other subjects.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import janitor
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

Import 2014-2019 dataset

In [2]:
# Read in 2014-2019 dataset
df1 = pd.read_csv(r"C:\Users\klc90\my_python_projects\project_women_in_STEM\data\HESA\student_numbers_by_subject_and_sex_scotland\2014-2019.csv", header=17)
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134380 entries, 0 to 134379
Data columns (total 8 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Subject Area            134380 non-null  object
 1   First year marker       134380 non-null  object
 2   Level of study          134380 non-null  object
 3   Mode of study           134380 non-null  object
 4   Country of HE provider  134380 non-null  object
 5   Sex                     134380 non-null  object
 6   Academic Year           134380 non-null  object
 7   Number                  134380 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 8.2+ MB
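The `header=17` argument above is tied to the exact number of preamble lines in this download. As a hedged alternative, the sketch below looks for the line that starts with the expected first column name instead. The helper `find_header_row` and the toy preamble text are hypothetical; only the column name "Subject Area" comes from the output above.

```python
import io
import pandas as pd


def find_header_row(lines, first_column="Subject Area"):
    """Return the 0-based index of the first line starting with the expected column name."""
    for i, line in enumerate(lines):
        # Tolerate a quoted header cell such as "Subject Area",...
        if line.lstrip('"').startswith(first_column):
            return i
    raise ValueError(f"No line starts with {first_column!r}")


# Toy stand-in for the real file: two made-up preamble lines, then the header and one row
raw = (
    "Title: toy example\n"
    "Source: toy example\n"
    "Subject Area,Sex,Number\n"
    "(8) Computer science,Female,480\n"
)

header_row = find_header_row(raw.splitlines())

# skiprows counts physical lines, so the header becomes the first line that is read
toy_df = pd.read_csv(io.StringIO(raw), skiprows=header_row)
print(toy_df)
```

For the real file, the lines could be read with `open(path).read().splitlines()` before calling the helper; whether the real preamble contains the column name only once is an assumption to check.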

View unique values in each column

This will allow us to filter on specific values later.

In [3]:
print(df1["Subject Area"].unique())
print(df1["First year marker"].unique())
print(df1["Level of study"].unique())
print(df1["Mode of study"].unique())
print(df1["Country of HE provider"].unique())
['(1) Medicine & dentistry' '(2) Subjects allied to medicine'
 '(3) Biological sciences' '(4) Veterinary science'
 '(5) Agriculture & related subjects' '(6) Physical sciences'
 '(7) Mathematical sciences' '(8) Computer science'
 '(9) Engineering & technology' '(A) Architecture, building & planning'
 'Total - Science subject areas' '(B) Social studies' '(C) Law'
 '(D) Business & administrative studies'
 '(E) Mass communications & documentation' '(F) Languages'
 '(G) Historical & philosophical studies' '(H) Creative arts & design'
 '(I) Education' '(J) Combined' 'Total - Non-science subject areas'
 'Total']
['All' 'First year' 'Other years']
['All' 'Postgraduate (research)' 'Postgraduate (taught)'
 'All postgraduate' 'First degree' 'Other undergraduate'
 'All undergraduate']
['Full-time' 'Part-time' 'All']
['England' 'Northern Ireland' 'Scotland' 'Wales' 'All']
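The repeated `print` calls above could also be written as a loop over the column names. This is a minimal sketch on a toy frame; the helper `unique_by_column` is a hypothetical name, and the toy values are a small subset of those printed above.

```python
import pandas as pd


def unique_by_column(df, columns):
    """Map each requested column name to the list of its unique values."""
    return {col: df[col].unique().tolist() for col in columns}


# Toy frame using a few of the category values shown above
toy_df = pd.DataFrame({
    "Mode of study": ["Full-time", "Part-time", "Full-time"],
    "Sex": ["Female", "Male", "Female"],
})

uniques = unique_by_column(toy_df, ["Mode of study", "Sex"])
for col, values in uniques.items():
    print(f"{col}: {values}")
```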

Cleaning and Prep

In [4]:
# Headers to snakecase
df1_clean = df1.clean_names()
In [5]:
# Drop total rows (Total, Total - Science subject areas, Total - Non-science subject areas) and the (J) Combined category
df1_clean = df1_clean[(df1_clean["subject_area"] != "Total") 
                                & (df1_clean["subject_area"] != "Total - Science subject areas")
                                & (df1_clean["subject_area"] != "Total - Non-science subject areas")
                                & (df1_clean["subject_area"] != "(J) Combined")].copy()
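The chain of `!=` comparisons above can be expressed more compactly with `Series.isin`. The sketch below shows the idea on a toy frame; the `EXCLUDED` list name and the toy numbers are illustrative, while the category labels are those from the 2014-2019 dataset.

```python
import pandas as pd

# Labels to drop, copied from the unique values of the 2014-2019 dataset
EXCLUDED = [
    "Total",
    "Total - Science subject areas",
    "Total - Non-science subject areas",
    "(J) Combined",
]

# Toy frame with made-up numbers, for illustration only
toy_df = pd.DataFrame({
    "subject_area": ["(8) Computer science", "(J) Combined", "Total",
                     "Total - Science subject areas"],
    "number": [1, 2, 3, 4],
})

# ~ negates the boolean mask, keeping only rows whose label is not excluded
kept = toy_df[~toy_df["subject_area"].isin(EXCLUDED)].copy()
print(kept)
```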
In [6]:
# Remove the brackets and number/letter at the start of the Subject Area values
df1_clean["subject_area"] = df1_clean["subject_area"].str[4:]
print(df1_clean["subject_area"].unique())
['Medicine & dentistry' 'Subjects allied to medicine'
 'Biological sciences' 'Veterinary science'
 'Agriculture & related subjects' 'Physical sciences'
 'Mathematical sciences' 'Computer science' 'Engineering & technology'
 'Architecture, building & planning' 'Social studies' 'Law'
 'Business & administrative studies' 'Mass communications & documentation'
 'Languages' 'Historical & philosophical studies' 'Creative arts & design'
 'Education']
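Slicing with `.str[4:]` relies on every value having a prefix of exactly four characters. A regular expression is a slightly more defensive option, since values without a prefix are left untouched. The pattern below is an assumption based on the two prefix styles seen in this notebook: "(8) " in the 2014-2019 data and "11 " in the 2019/20 data.

```python
import pandas as pd

# Matches either "(X) " (single letter or digit in brackets) or "NN " (two digits)
PREFIX_PATTERN = r"^\(\w\)\s+|^\d{2}\s+"

labels = pd.Series([
    "(8) Computer science",
    "(A) Architecture, building & planning",
    "11 Computing",
    "Law",  # no prefix, so it should pass through unchanged
])

stripped = labels.str.replace(PREFIX_PATTERN, "", regex=True)
print(stripped.tolist())
```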

What to investigate

We are interested in undergraduate study at Scottish universities. We'll focus on full-time study and first-year enrolments.

In [7]:
# Filter for Scotland, Undergraduate level of study, full-time study, first year enrolled
ug_scotland_df = df1_clean[(df1_clean["first_year_marker"] == "First year")
                     & (df1_clean["country_of_he_provider"] == "Scotland")
                     & (df1_clean["level_of_study"] == "All undergraduate")
                     & (df1_clean["mode_of_study"] == "Full-time")].copy()
ug_scotland_df.shape
Out[7]:
(360, 8)
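Because the same four conditions are applied again to the 2019/20 dataset later on, the filter could be wrapped in a small function. This is a sketch only: `select_first_year_ft_ug` is a hypothetical helper name, and the toy frame uses made-up numbers.

```python
import pandas as pd


def select_first_year_ft_ug(df, country="Scotland"):
    """Keep first-year, full-time, all-undergraduate rows for one country."""
    mask = (
        (df["first_year_marker"] == "First year")
        & (df["country_of_he_provider"] == country)
        & (df["level_of_study"] == "All undergraduate")
        & (df["mode_of_study"] == "Full-time")
    )
    return df[mask].copy()


# Toy frame: only the first row satisfies all four conditions
toy_df = pd.DataFrame({
    "first_year_marker": ["First year", "Other years", "First year"],
    "country_of_he_provider": ["Scotland", "Scotland", "England"],
    "level_of_study": ["All undergraduate"] * 3,
    "mode_of_study": ["Full-time"] * 3,
    "number": [10, 20, 30],  # made-up values
})

subset = select_first_year_ft_ug(toy_df)
print(subset)
```

The function assumes the snake-case column names produced by `clean_names()`, which both cleaned datasets share apart from the subject column.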
In [8]:
# Dataset per subject area
# Get Computer Science subject area
cs_ug_scotland_df = ug_scotland_df[ug_scotland_df["subject_area"] == "Computer science"].copy()
cs_ug_scotland_df
Out[8]:
subject_area first_year_marker level_of_study mode_of_study country_of_he_provider sex academic_year number
46480 Computer science First year All undergraduate Full-time Scotland Female 2014/15 480
46481 Computer science First year All undergraduate Full-time Scotland Male 2014/15 2335
46482 Computer science First year All undergraduate Full-time Scotland Other 2014/15 0
46483 Computer science First year All undergraduate Full-time Scotland Total 2014/15 2815
46484 Computer science First year All undergraduate Full-time Scotland Female 2015/16 470
46485 Computer science First year All undergraduate Full-time Scotland Male 2015/16 2210
46486 Computer science First year All undergraduate Full-time Scotland Other 2015/16 0
46487 Computer science First year All undergraduate Full-time Scotland Total 2015/16 2680
46488 Computer science First year All undergraduate Full-time Scotland Female 2016/17 530
46489 Computer science First year All undergraduate Full-time Scotland Male 2016/17 2460
46490 Computer science First year All undergraduate Full-time Scotland Other 2016/17 0
46491 Computer science First year All undergraduate Full-time Scotland Total 2016/17 2990
46492 Computer science First year All undergraduate Full-time Scotland Female 2017/18 530
46493 Computer science First year All undergraduate Full-time Scotland Male 2017/18 2505
46494 Computer science First year All undergraduate Full-time Scotland Other 2017/18 5
46495 Computer science First year All undergraduate Full-time Scotland Total 2017/18 3040
46496 Computer science First year All undergraduate Full-time Scotland Female 2018/19 600
46497 Computer science First year All undergraduate Full-time Scotland Male 2018/19 2920
46498 Computer science First year All undergraduate Full-time Scotland Other 2018/19 5
46499 Computer science First year All undergraduate Full-time Scotland Total 2018/19 3525

Import 2019/2020 dataset

In [9]:
# Import 2019-2020 dataset
df2 = pd.read_csv(r"C:\Users\klc90\my_python_projects\project_women_in_STEM\data\HESA\student_numbers_by_subject_and_sex_scotland\2019-2020.csv", header=14)
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32944 entries, 0 to 32943
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CAH level 1             32944 non-null  object
 1   First year marker       32944 non-null  object
 2   Level of study          32944 non-null  object
 3   Mode of study           32944 non-null  object
 4   Country of HE provider  32944 non-null  object
 5   Sex                     32944 non-null  object
 6   Academic Year           32944 non-null  object
 7   Number                  32944 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 2.0+ MB
In [10]:
df2.head()
Out[10]:
CAH level 1 First year marker Level of study Mode of study Country of HE provider Sex Academic Year Number
0 01 Medicine and dentistry All All All All Female 2019/20 42610
1 01 Medicine and dentistry All All All All Male 2019/20 27605
2 01 Medicine and dentistry All All All All Other 2019/20 150
3 01 Medicine and dentistry All All All All Total 2019/20 70370
4 01 Medicine and dentistry All All Full-time All Female 2019/20 36470
In [11]:
# View unique values in each column to see what can be filtered
print(df2["CAH level 1"].unique())
print(df2["First year marker"].unique())
print(df2["Level of study"].unique())
print(df2["Mode of study"].unique())
print(df2["Country of HE provider"].unique())
['01 Medicine and dentistry' '02 Subjects allied to medicine'
 '03 Biological and sport sciences' '04 Psychology'
 '05 Veterinary sciences' '06 Agriculture, food and related studies'
 '07 Physical sciences' '08 General and others in sciences'
 '09 Mathematical sciences' '10 Engineering and technology' '11 Computing'
 '12 Geographical and environmental studies (natural sciences)'
 '13 Architecture, building and planning' 'Total science CAH level 1'
 '12 Geographical and environmental studies (social sciences)'
 '14 Humanities and liberal arts (non-specific)' '15 Social sciences'
 '16 Law' '17 Business and management' '18 Communications and media'
 '19 Language and area studies'
 '20 Historical, philosophical and religious studies'
 '21 Creative arts and design' '22 Education and teaching'
 '23 Combined and general studies' 'Total non-science CAH level 1' 'Total']
['All' 'First year' 'Other years']
['All' 'Postgraduate (research)' 'Postgraduate (taught)'
 'All postgraduate' 'First degree' 'Other undergraduate'
 'All undergraduate']
['All' 'Full-time' 'Part-time']
['All' 'England' 'Northern Ireland' 'Scotland' 'Wales']

Clean and prep

In [12]:
# Headers to snakecase
df2_clean = df2.clean_names()
In [13]:
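# Drop total rows (Total, Total science CAH level 1, Total non-science CAH level 1)
# Note: values still carry their numeric prefix at this point, so the last condition below
# matches nothing and "23 Combined and general studies" is kept (see the output of In [14])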
df2_clean = df2_clean[(df2_clean["cah_level_1"] != "Total") 
                                & (df2_clean["cah_level_1"] != "Total science CAH level 1")
                                & (df2_clean["cah_level_1"] != "Total non-science CAH level 1")
                                & (df2_clean["cah_level_1"] != "Combined and general studies")].copy()
In [14]:
# Remove the two-digit code and space at the start of the CAH level 1 values
df2_clean["cah_level_1"] = df2_clean["cah_level_1"].str[3:]
print(df2_clean["cah_level_1"].unique())
['Medicine and dentistry' 'Subjects allied to medicine'
 'Biological and sport sciences' 'Psychology' 'Veterinary sciences'
 'Agriculture, food and related studies' 'Physical sciences'
 'General and others in sciences' 'Mathematical sciences'
 'Engineering and technology' 'Computing'
 'Geographical and environmental studies (natural sciences)'
 'Architecture, building and planning'
 'Geographical and environmental studies (social sciences)'
 'Humanities and liberal arts (non-specific)' 'Social sciences' 'Law'
 'Business and management' 'Communications and media'
 'Language and area studies'
 'Historical, philosophical and religious studies'
 'Creative arts and design' 'Education and teaching'
 'Combined and general studies']

Filter for subset

In [15]:
# Filter for Scotland, Undergraduate level of study, full-time study, first year enrolled
ug_scotland_df2 = df2_clean[(df2_clean["first_year_marker"] == "First year")
                     & (df2_clean["country_of_he_provider"] == "Scotland")
                     & (df2_clean["level_of_study"] == "All undergraduate")
                     & (df2_clean["mode_of_study"] == "Full-time")].copy()
ug_scotland_df2.shape
Out[15]:
(96, 8)
In [16]:
# Get Computing subject area
cs_ug_scotland_df2 = ug_scotland_df2[ug_scotland_df2["cah_level_1"] == "Computing"].copy()
cs_ug_scotland_df2.shape
Out[16]:
(4, 8)
In [17]:
# Append datasets
# Rename column
cs_ug_scotland_df2.rename(columns={"cah_level_1": "subject_area"}, inplace=True)

# Append (DataFrame.append was removed in pandas 2.0, so use pd.concat)
cs_ug_scotland_df3 = pd.concat([cs_ug_scotland_df, cs_ug_scotland_df2], ignore_index=True)

# Drop "Other"
cs_ug_scotland_df3 = cs_ug_scotland_df3[cs_ug_scotland_df3["sex"] != "Other"]
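After combining the two datasets, the `subject_area` column holds "Computer science" for 2014-2019 rows and "Computing" for 2019/20 rows, because the two files use different subject classifications. If a single label is wanted, one option is to map the older name onto the newer one. The sketch below uses toy frames; treating the two categories as equivalent is an assumption, not something the source data confirms.

```python
import pandas as pd

# Toy stand-ins for the two filtered frames (subject and year columns only)
old = pd.DataFrame({
    "subject_area": ["Computer science", "Computer science"],
    "academic_year": ["2017/18", "2018/19"],
})
new = pd.DataFrame({
    "subject_area": ["Computing"],
    "academic_year": ["2019/20"],
})

combined = pd.concat([old, new], ignore_index=True)

# Assumption: the older "Computer science" group is comparable to the newer "Computing" group
combined["subject_area"] = combined["subject_area"].replace(
    {"Computer science": "Computing"}
)
print(combined)
```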
In [18]:
### Line plot trend
# plt.figure(figsize=(16, 6))

# sns.lineplot(x="academic_year", y="number", hue="sex", data=cs_ug_scotland_df3)
# plt.title("Number of first year enrolments in UG Computer degrees at Scottish Universities")
# plt.ylabel("Enrolments")
# plt.xlabel("Academic year")
# plt.show()
In [19]:
# Drop total rows
cs_ug_scotland_sex = cs_ug_scotland_df3[cs_ug_scotland_df3["sex"] != "Total"].copy()

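# labels must be a dict mapping column names to display names (a set is silently ignored)
fig = px.bar(cs_ug_scotland_sex, x="academic_year", y="number", color="sex", orientation="v",
             labels={"academic_year": "Academic year", "number": "Enrolments", "sex": "Sex"},
             title="Proportion of males and females enrolling in UG Computing degrees at Scottish Universities")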
fig.show()
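The stacked bars show counts, so the proportion has to be read off by eye. As a complement, the sketch below computes the female share of first-year enrolments directly, using the Female and Male counts for 2014/15 to 2018/19 from the Computer science table above. The 2019/20 counts are not printed in this notebook, so they are left out. The share is taken over Female plus Male only, which is an assumption matching the removal of "Other" earlier.

```python
import pandas as pd

# Female and Male first-year counts copied from the Computer science table (Out[8])
counts = pd.DataFrame({
    "academic_year": ["2014/15", "2014/15", "2015/16", "2015/16", "2016/17",
                      "2016/17", "2017/18", "2017/18", "2018/19", "2018/19"],
    "sex": ["Female", "Male"] * 5,
    "number": [480, 2335, 470, 2210, 530, 2460, 530, 2505, 600, 2920],
})

# One row per academic year, one column per sex
wide = counts.pivot(index="academic_year", columns="sex", values="number")

# Female share of the Female + Male total for each year
wide["female_share"] = wide["Female"] / (wide["Female"] + wide["Male"])
print(wide["female_share"].round(3))
```

On these figures the female share stays between 17% and 18% in every year, even though the absolute number of female enrolments rises.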