One requirement for certification as a speech-language pathologist is the completion of a Master's degree in Speech-Language Pathology / Communication Sciences and Disorders (MA SLP). Because the career pays well and opportunities are plentiful across the country, competition for entry into Master's programs in the field is intense, and graduate programs often set a minimum GPA threshold for applicants to be considered.
To help students in our Pre-Professional Speech Language Pathology program understand the importance of maintaining a high GPA, I carried out the following analysis using data collected from the American Speech-Language-Hearing Association's EdFind app. (Note that the data used below was collected in 2019; the EdFind app and its HTML structure have changed since then.)
import folium, os, re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
Python Selenium was used to scrape data from ASHA's EdFind app for all MA SLP programs in the US, with the raw contents of the body section for each school saved as a separate file. The code in the following sections imports data from these saved HTML files.
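The scraping step itself is not reproduced here, but it might look roughly like the sketch below (written against the current Selenium 4 API). The `program_urls` list and the one-file-per-school naming are placeholders for illustration; only the `slp_programs` output directory is assumed by the later code.

from selenium import webdriver
from selenium.webdriver.common.by import By

program_urls = []  # hypothetical: (school_name, EdFind program URL) pairs gathered beforehand
os.makedirs('slp_programs', exist_ok=True)
driver = webdriver.Chrome()
for school, url in program_urls:
    driver.get(url)
    # Save the raw contents of the <body> element, one file per school
    body = driver.find_element(By.TAG_NAME, 'body').get_attribute('innerHTML')
    with open(os.path.join('slp_programs', school), 'w') as f:
        f.write(body)
driver.quit()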
Manual review of the data structure reveals the following:

- Each program section begins with a `<div class="headline">`, which contains the section heading only. The Master's programs we are interested in can be identified by an `<a name="CED-SLP"></a>` inside of the `<h2>...</h2>` element of this `div` element.
- The program data is found in the subsequent `div`s, which are all siblings with each other and the `div` above.
- Each of these `div`s contains a single description list `dl`, which in turn contains `dt` tags followed by one or more `dd` descriptions.
- Some `dd` elements consist of a simple value/description.
- Other `dd` descriptions consist of a key followed by one or more values. Such keys can be identified by the presence of a colon either in front of a numerical value on the same line, or at the end of a line when the subsequent `dd` elements represent its values.

Thus, we will create a dictionary with the following structure:
{school_name:
    {program_name:
        {section:  # General, Admissions, Enrollment, Graduation
            {dt_key:
                dd_value / [dd_value1, ...] / {dd_subkey: dd_value},  # depends on case
            },
        },
     city: city_name,
     state: state_abbreviation,
     zip: zip_code,
    },
}
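For a concrete illustration, a single school's entry might look like the following once extracted (the school and all values here are hypothetical):

# Hypothetical example of one school's entry after extraction
{'Example University': {
    'program': {'CED-SLP': {
        'name': 'Master of Arts (MA)',
        'Admissions': {'GPA Range for Applicants Offered Admission': '3.0-4.0'},
        # ...General, Enrollment, and Graduation sections omitted
    }},
    'city': 'Springfield',
    'state': 'IL',
    'zip': '62701',
}}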
# Read in the data for all programs
def get_raw_program_data():
schools = sorted(os.listdir('slp_programs')) # filenames here are equivalent to school names
response = dict()
for s in schools:
with open(os.path.join('slp_programs', s), 'r') as f:
data = f.read()
response[s] = data
    return response
def extract_to_dictionary(programs_html):
"""
Inputs:
programs_html (dict) - {school_name: raw_html}
Outputs:
dico (dict)
"""
dico = dict()
# Loop over each program
for school, val in programs_html.items():
# Parse the html for the program
soup = BeautifulSoup(val, 'html.parser')
# Locate the section for MA programs
divs = soup.find_all('div', class_ = "headline")
cur = None
for div in divs:
if div.find('a', attrs = {"name": "CED-SLP"}):
cur = div
program_name = cur.text
dico[school] = {'program': {
"CED-SLP": {
'name': program_name,
},
},
}
break
# If the section was not found, skip this school
if not cur:
continue
#Extract the city, state, and zip code
pattern = re.compile('(?P<city>[^,]+),? (?P<state>[A-Z]{2}) (?P<zip>[0-9]{5}).*')
location = {}
brs = soup.find('p').find_all('br')
for br in brs:
text = br.next.strip()
if text:
result = pattern.match(text)
if result:
location['city'], location['state'], location['zip'] = result.groups()
break
# Update the dictionary with the city, state and zip code
dico[school].update(location)
# Add the section data
for section in ["General", "Admissions", "Enrollment", "Graduation"]:
# Set cur to the current section
cur = cur.find_next_sibling('div')
# Check that section is the expected one
assert cur.h3.text == section
# Find all <dt> elements in current section
dts = cur.find_all('dt')
dts_dict = dict()
for dt in dts:
dds = []
dds_values_dict = {}
dds_values_list = []
# Make a list of all <dd> elements of current <dt>
dd = dt.find_next_sibling(['dt','dd'])
while dd and dd.name != 'dt':
dds.append(dd)
dd = dd.find_next_sibling(['dt','dd'])
# Loop through all <dd> elements
n = 0
while n < len(dds):
dd = dds[n]
# Lack of a colon indicates a simple value
if ':' not in dd.text:
dds_values_list.append(dd.text.strip())
n += 1
# If there is a colon, the value is a dictionary
else:
dd_text = dd.text.split(':')
dd_key = dd_text[0].strip()
dd_value = dd_text[1].strip()
# If there is text following the colon, the key: value
# pair is found in the same line
if dd_value != '':
dds_values_dict.update({dd_key: dd_value})
n += 1
# Otherwise, the value(s) is/are found in the following
# <dd> elements
else:
dd_value = []
n += 1
                            while n < len(dds) and not dds[n].find_all('b'):
dd_value.append(dds[n].text.strip())
n += 1
dds_values_dict.update({dd_key: dd_value})
# Remove the colon from the key
dt_key = dt.text.replace(':','').strip()
# If no values of the <dt> tag were {key: value} format...
if len(dds_values_dict) == 0:
# Convert lists of length 1 to simple strings and save as value
if len(dds_values_list) == 1:
dts_dict[dt_key] = dds_values_list[0]
# Otherwise save list as value
else:
dts_dict[dt_key] = dds_values_list
# If all values of the <dt> tag were in {key: value} format, save
# the dictionary as the value
elif len(dds_values_list) == 0:
dts_dict[dt_key] = dds_values_dict
# If the values of the <dt> tag were of mixed formats, convert the lists into
# dictionaries, merge it with the dictionary for the <dd> tags, and save this
# dictionary as the value for the <dt> element
else:
for val in dds_values_list:
dds_values_dict.update({val: True})
dts_dict[dt_key] = dds_values_dict
dico[school]['program']['CED-SLP'][section] = dts_dict
return dico
programs_html = get_raw_program_data()
programs = extract_to_dictionary(programs_html)
programs['University of Delaware']['program']['CED-SLP']['Admissions']
{'Admission Contact': {'Email Address': 'cscd-info@udel.edu'},
 'Application Deadline(s)': {'Application deadline for Fall admission': 'February 01'},
 'Application Process': ['CSD Central Application Service (CSDCAS)',
  'University online application'],
 'Application Requirements': {'Letters of recommendation': '3',
  'Writing sample/essay': '2',
  'Graduate Record Examination (GRE) score': True},
 'Average GRE Score for Applicants Offered Admission': {'Verbal reasoning': 'N/A',
  'Quantitative reasoning': 'N/A',
  'Analytical writing': '4.05'},
 'GPA Range for Applicants Offered Admission': '3.3-3.9',
 'Undergraduate Prerequisites Required': 'Yes',
 'Offer Prerequisite Courses': 'No',
 'Number of Applications Received': {'Full-time Students': '290',
  'Part-time Students': '0',
  'Total': '290'},
 'Number of Admission Offers': {'Full time': '86', 'Part time': '0', 'Total': '86'},
 'Number of Admission Offers with Funding': '86'}
We are only interested in the GPA and state data, so we'll extract this into a new dictionary and convert it into a dataframe.
# Extract the relevant data into a new dictionary
def get_gpa_info(program_dict):
gpa_range = program_dict['program']['CED-SLP']['Admissions']['GPA Range for Applicants Offered Admission']
gpa_info = {'low_gpa': None, 'high_gpa': None}
    # The GPA range is free text with no consistent format (e.g., spacing, number
    # of hyphens), so we simply extract the numeric characters and decimal points
    # on either side of the separator
if '-' in gpa_range:
        gpa_range = re.match(r'(?P<low>[\d\.]+)[^\d]+(?P<high>[\d\.]+).*', gpa_range)
low, high = gpa_range.groups()
gpa_info['low_gpa'] = float(low)
gpa_info['high_gpa'] = float(high)
    return gpa_info
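As a quick sanity check, the pattern tolerates the kinds of formatting variants we would expect to find in the raw text (the sample strings below are made up for illustration):

# Hypothetical examples of the inconsistent range formats
for s in ['3.3-3.9', '3.2 - 4.0', '2.75 -- 3.95']:
    m = re.match(r'(?P<low>[\d\.]+)[^\d]+(?P<high>[\d\.]+).*', s)
    print(s, '->', float(m.group('low')), float(m.group('high')))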
programs_min_dict = dict()
for key in programs:
programs_min_dict[key] = {'state': programs[key]['state'],}
programs_min_dict[key].update(get_gpa_info(programs[key]))
df = pd.DataFrame.from_dict(programs_min_dict, orient='index')
df.head()
 | state | low_gpa | high_gpa
---|---|---|---
Abilene Christian University | TX | 3.00 | 4.00 |
Adelphi University | NY | 2.80 | 4.00 |
Alabama A&M University | AL | 3.30 | 3.87 |
Appalachian State University | NC | 3.33 | 4.00 |
Arizona State University | AZ | 3.41 | 4.10 |
len(df)
259
Out of 259 schools, we lack GPA information for 10 of them, so we'll remove them from our dataset.
df.isna().sum()
state        0
low_gpa     10
high_gpa    10
dtype: int64
df.dropna(axis=0, inplace=True)
len(df)
249
Sorting the values by low GPA reveals that one row appears to have the low and high GPAs reversed. We'll want to swap the reversed values before continuing.
df.sort_values(by='low_gpa', ascending=False)[:10]
 | state | low_gpa | high_gpa
---|---|---|---
University of Missouri | MO | 4.00 | 3.09 |
University of Pittsburgh | PA | 3.92 | 4.00 |
Misericordia University | PA | 3.80 | 4.00 |
West Texas A&M University | TX | 3.70 | 4.00 |
California State University, Los Angeles | CA | 3.68 | 4.00 |
Duquesne University | PA | 3.68 | 4.00 |
Brigham Young University | UT | 3.66 | 3.98 |
California State University, Sacramento | CA | 3.64 | 4.00 |
Kansas State University | KS | 3.63 | 4.00 |
Oklahoma State University | OK | 3.61 | 4.00 |
df[df['low_gpa'] > df['high_gpa']]
 | state | low_gpa | high_gpa
---|---|---|---
University of Missouri | MO | 4.0 | 3.09 |
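# Assign through .values so the swap is positional: assigning a DataFrame
# directly would realign on the column labels and leave the values unchanged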
rows_to_swap = df['low_gpa'] > df['high_gpa']
df.loc[rows_to_swap, ['low_gpa', 'high_gpa']] = df.loc[rows_to_swap, ['high_gpa', 'low_gpa']].values
df.sort_values(by='low_gpa', ascending=False)[:10]
 | state | low_gpa | high_gpa
---|---|---|---
University of Pittsburgh | PA | 3.92 | 4.00 |
Misericordia University | PA | 3.80 | 4.00 |
West Texas A&M University | TX | 3.70 | 4.00 |
Duquesne University | PA | 3.68 | 4.00 |
California State University, Los Angeles | CA | 3.68 | 4.00 |
Brigham Young University | UT | 3.66 | 3.98 |
California State University, Sacramento | CA | 3.64 | 4.00 |
Kansas State University | KS | 3.63 | 4.00 |
University of Florida, Gainesville | FL | 3.61 | 4.00 |
Oklahoma State University | OK | 3.61 | 4.00 |
Finally, we look at how many schools from each state are represented in the data. Ten states have only one MA SLP program, while New York has the most at 26.
by_state = df.groupby(by='state').size().reset_index(name='count').sort_values('count')
by_state.T
 | 24 | 44 | 39 | 37 | 28 | 19 | 11 | 9 | 48 | 7 | ... | 16 | 20 | 8 | 22 | 32 | 12 | 3 | 35 | 41 | 31
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
state | MT | VT | SD | RI | NH | ME | ID | HI | WY | DE | ... | LA | MI | FL | MO | OH | IL | CA | PA | TX | NY |
count | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 7 | 7 | 8 | 9 | 10 | 12 | 14 | 15 | 18 | 26 |
2 rows × 49 columns
by_state[by_state['count'] == 1].count()
state    10
count    10
dtype: int64
A closer look at the data reveals that 3 states have no MA SLP programs represented in the data set, and that two of the 49 'states' are not actually states: Washington, D.C. (DC) and Puerto Rico (PR).
by_state.count()
state    49
count    49
dtype: int64
by_state.sort_index()
 | state | count
---|---|---
0 | AL | 6 |
1 | AR | 4 |
2 | AZ | 4 |
3 | CA | 14 |
4 | CO | 2 |
5 | CT | 3 |
6 | DC | 3 |
7 | DE | 1 |
8 | FL | 8 |
9 | HI | 1 |
10 | IA | 3 |
11 | ID | 1 |
12 | IL | 12 |
13 | IN | 5 |
14 | KS | 3 |
15 | KY | 5 |
16 | LA | 7 |
17 | MA | 6 |
18 | MD | 3 |
19 | ME | 1 |
20 | MI | 7 |
21 | MN | 3 |
22 | MO | 9 |
23 | MS | 4 |
24 | MT | 1 |
25 | NC | 6 |
26 | ND | 2 |
27 | NE | 3 |
28 | NH | 1 |
29 | NJ | 5 |
30 | NM | 3 |
31 | NY | 26 |
32 | OH | 10 |
33 | OK | 4 |
34 | OR | 3 |
35 | PA | 15 |
36 | PR | 3 |
37 | RI | 1 |
38 | SC | 3 |
39 | SD | 1 |
40 | TN | 5 |
41 | TX | 18 |
42 | UT | 4 |
43 | VA | 6 |
44 | VT | 1 |
45 | WA | 4 |
46 | WI | 6 |
47 | WV | 2 |
48 | WY | 1 |
We first look at some descriptive statistics of the dataset.
df.describe()
 | low_gpa | high_gpa
---|---|---
count | 249.000000 | 249.000000 |
mean | 3.193325 | 3.980763 |
std | 0.244902 | 0.075176 |
min | 2.350000 | 3.400000 |
25% | 3.020000 | 4.000000 |
50% | 3.200000 | 4.000000 |
75% | 3.340000 | 4.000000 |
max | 3.920000 | 4.260000 |
Unsurprisingly, we notice that there is almost no variation in the highest GPAs of candidates offered admission. The lowest admitted GPAs vary much more (standard deviation of 0.24 points), and at half of the SLP programs in the dataset the lowest admitted GPA is 3.2 or higher.
Plotting, for each GPA, the number of schools whose lowest admitted GPA is at or below it reveals a sharp initial jump at 3.0, suggesting that the large majority of MA SLP programs have a GPA threshold of 3.0 or higher.
# For each GPA value from 2.30 to 4.00, count the schools whose lowest admitted GPA is at or below it
low_gpa = [[x/100, len(df[df["low_gpa"] <= x/100])] for x in range(230,401)]
low_gpa = pd.DataFrame(low_gpa, columns=['Min_GPA', 'Num_Schools']).set_index('Min_GPA')
low_gpa.plot()
ax = plt.gca()
ax.get_legend().remove()
ax.set_xlabel('GPA')
ax.set_ylabel('Number of Schools')
plt.show()
Assuming that schools' GPA cut-offs are usually set at the level of a single decimal place, we can estimate each school's cut-off as its lowest admitted GPA floored to one decimal place. To plot this, we can simply bin the scores into 0.1-point ranges.
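Equivalently, the estimated cut-offs can be computed directly; the `est_cutoff` name below is just for illustration.

# Floor each school's lowest admitted GPA to one decimal place
# (the small epsilon guards against floating-point error, e.g. 2.9 * 10 == 28.999999999999996)
est_cutoff = np.floor(df['low_gpa'] * 10 + 1e-9) / 10
est_cutoff.value_counts().sort_index()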
plt.xticks(np.arange(2.3, 4.01, 0.1))
plt.hist(df['low_gpa'], bins = 17, range=(2.3, 4.), cumulative = False)
ax = plt.gca()
ax.set_xlabel('GPA Threshold')
ax.set_ylabel('Number of Schools')
plt.show()
In line with earlier observations, the graph above indicates that the most common (assumed) cut-offs are 3.0 and 3.2.
In my experience, most MA students in SLP prefer to attend a school close to where they (or their families) live. For students with such regional preferences, mapping the lowest and average admitted GPAs in each state can be helpful.
The first map shows the lowest GPA score accepted in each state.
# Find the minimum values by state
min_by_state = df.groupby(by='state').min()
# URL for US states mapping data
url = (
"https://raw.githubusercontent.com/python-visualization/folium/main/examples/data"
)
state_geo = f"{url}/us-states.json"
# Initialize the map
m = folium.Map(location=[48,-102], zoom_start=3, tiles="cartodb positron")
m.save('slp_usa.html')
# Plot the data on the map
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=min_by_state.reset_index(),
columns=["state", "low_gpa"],
key_on="feature.id",
fill_color="BuGn",
nan_fill_color="gray",
nan_fill_opacity=0.4,
fill_opacity=0.7,
line_opacity=0.2,
bins=9,
legend_name="Minimum GPA by State",
).add_to(m)
#m.save('slp_min_by_state.html')
m
The second map shows the average of the lowest admitted GPAs for each state.
# Find the average values by state
mean_by_state = df.groupby(by='state').mean()
# Initialize the map
m = folium.Map(location=[48,-102], zoom_start=3, tiles="cartodb positron")
# Plot the data on the map
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=mean_by_state.reset_index(),
columns=["state", "low_gpa"],
key_on="feature.id",
fill_color="BuGn",
nan_fill_color="gray",
nan_fill_opacity=0.4,
fill_opacity=0.7,
line_opacity=0.2,
bins=9,
legend_name="Mean Minimum GPA by State",
).add_to(m)
#m.save('slp_avg_by_state.html')
m