The SAT (Scholastic Aptitude Test) is administered during the senior year of high school. Colleges across the country rely on SAT scores as one of their key admission criteria.
We will use data published by the New York City Public schools to investigate some of these questions and explore some of the demographic factors that may be connected to variations in SAT scores. The initial dataset covers 435 highschools with over 100 pieces of information on school performance, programs, and student demographics for each school.
New York City has a significant immigrant population and is very diverse, so one of the aims of this project is to find correlations between SAT scores and demographic factors such as gender, race, income, and other variables.
After analyzing the different features that may influence the SAT score results, we arrived to the following conclusions:
In this project are used the following Python packages:
The final data set combines 7 different data sets freely published by New York City in order to find correlations between SAT scores, different demographic measures (for example, gender/ethnicity) or schools satisfaction scores (for example, school safety):
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
import pandas as pd
import numpy
import re
%matplotlib inline
from plotly.offline import iplot, init_notebook_mode
import plotly.express as px
import cufflinks as cf
from sklearn import preprocessing
cf.go_offline(connected=True)
init_notebook_mode(connected=True)
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
d = pd.read_csv("schools/{0}".format(f))
data[f.replace(".csv", "")] = d
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
string_representation = str(num)
if len(string_representation) > 1:
return string_representation
else:
return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
def find_lat(loc):
coords = re.findall("\(.+, .+\)", loc)
lat = coords[0].split(",")[0].replace("(", "")
return lat
def find_lon(loc):
coords = re.findall("\(.+, .+\)", loc)
lon = coords[0].split(",")[1].replace(")", "").strip()
return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
combined = combined.merge(data[m], on="DBN", how="inner")
plt.figure(figsize=(20, 10))
sns.heatmap(combined.isnull())
combined.isnull().sum()
After creating the combined dataset, the previous heatmap shows us that ther are som columns like priority09 which contain mainly null values. To solve this problem, data imputation technique will be applied.
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
combined.head(5)
def get_first_two_chars(dbn):
return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
plt.figure(figsize=(20, 10))
sns.set_style("darkgrid")
ax = sns.distplot(combined['sat_score'])
ax.set_yticks([])
ax.set_title("New York Schools Sat Scores Distributions", fontweight="bold", fontsize=15,)
ax.set_xlabel("Sat Score")
ax.set_ylabel("")
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)
The New York Schools Sat Scores Distributions histogram shows us that the majority of the schools have a sat score between 1100 and 1300. There are some schools over 1600 sat score, these may be considered as schools with a high sat score.
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
sns.set_style("darkgrid")
plt.figure(figsize=(20, 20))
ax = sns.barplot(y=survey_fields, x=correlations[survey_fields], palette='Blues')
ax.set_title("Sat score correlation with survey fields", fontweight="bold", fontsize=15,)
ax.set_xlabel("Correlation with SAT Scores",fontsize=12)
ax.set_ylabel("",fontsize=14)
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)
As is shown in the previous Sat score correlation with survey fields plot. There are high correlations between sat_score and N_s and N_p. It makes sense, since these columns are correlated with the total_enrollment column.
Follwed by the safety columns saf_t_11 and saf_s_11 which are high correlated with the sat_score column to.
Also, another interesting correlation but no with the same importance as the previous ones is the sat_score correlation with rr_s colum. Which shows the student response rate.
df = combined[['DBN'] + survey_fields]
df
df = df.melt(id_vars='DBN')
df
aux = pd.DataFrame(df['variable'].str.split("_").to_list())
df['question'] = aux[0]
df['respondent'] = aux[1]
df = df.drop(columns=['variable'])
df
df['question'].value_counts(dropna=False)
df['respondent'].value_counts(dropna=False)
df = df[df['respondent'] != 'tot']
df = df[df['question'] != 'N']
df = df[df['question'] != 'rr']
df['respondent'] = df['respondent'].map({
'p': 'Parent',
't':'Teacher',
's':'Student'})
df['question'] = df['question'].map({
'aca': 'Academic Expectations',
'eng': 'Engagement',
'com': 'Communication',
'saf': 'Safety'})
df = df.reset_index(drop=True)
df
plt.figure(figsize=(20, 10))
sns.set_style("darkgrid")
ax = sns.barplot(x='question',
y='value',
hue='respondent',
data=df,
palette=sns.color_palette("GnBu_d"))
ax.set_title("Question values by respondent", fontweight="bold", fontsize=15,)
ax.set_xlabel("Question")
ax.set_ylabel("Value")
ax.tick_params(bottom=True, top=False, left=True, right=False, labelbottom=True)
sns.despine(left=True)
From the Question values by respondent chart, we can see that all perceptions independently of the role of the participant are common. In general students are the ones that rate things slightly lower than teachers and parents. On the other hand, parents are the ones who rate things slightly higher. Teachers are in the middle quite near the student's ratings.
The only type of question where students and parents ratings are quite the same is Academic Expectations.
On the other hand, Safety rating seems to be the aspect where they differ the most. That's why in the next section the Safety scores will be deeply explored.
warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
x='saf_s_11',
y='sat_score',
categories='boro',
text="SCHOOL NAME",
vline= {"x":combined["saf_s_11"].mean(),
"color":"#000000",
"dash": "dash"
},
hline={"y":combined["sat_score"].mean(),
"color":"#000000",
"dash": "dash"
},
xTitle='Safety score 11',
yTitle='Satisfaction score avg',
title='Exploring safety score'
)
After seeing the Exploring safety score we can conclude that there is a correlation between the sat_score and sad_s_11.
The majority of the schools have an average saf_s_11 between 5.5 and 7.5 and an average sat_score over 1000 and 1400.
Also, we can see that there is a high correlation between schools with an extremely high SAT score and sad_s_11 score.
To sum up, there is a correlation between these indicators but it isn't so strong at may seem at first look.
fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'saf_tot_11', 'boro']].dropna(),
trendline='ols',
x="saf_tot_11",
y="sat_score",
size="saf_tot_11",
color="sat_score",
size_max=15,
hover_name="SCHOOL NAME",
hover_data=['boro', 'saf_tot_11','sat_score'],
color_continuous_scale=px.colors.sequential.Viridis,
title='Environment Safety and Respect Perception Correlation with SAT Scores')
fig.show()
In the Environment Safety and Respect Perception Correlation with SAT Scores plot we can see clearly the postivie correlation between sad_tot_11 and sat_scores with the number of outliers in the upper-right corner.
On the other hand, the previous plot also shows us that there is a reduced group of schools which have a high safety rank but low SAT scores. In the lower right part of the chart, this schools are mainly from Queens and Brooklyn.
safety_avg_by_district = combined.groupby('school_dist').agg(numpy.mean)
safety_avg_by_district
safety_avg_by_district.reset_index(inplace=True)
import folium
import branca.colormap as cm
# NY = [40.693943, -73.985880]
school_map = folium.Map(location=[40.693943, -73.985880],
#width=500,
#height=800,
zoom_start=10,
tiles='CartoDB positron')
map_title = "Safety average by district"
title_html = '''
<h3 align="center" style="font-size:20px"><b>{}</b></h3>
'''.format(map_title)
school_map.get_root().html.add_child(folium.Element(title_html))
size_parameter = "total_enrollment"
color_parameter = "saf_s_11"
start_col_val = safety_avg_by_district[color_parameter].min()
end_col_val = safety_avg_by_district[color_parameter].max()
start_col_hex = "#d01c8b" # used https://colorbrewer2.org/
end_col_hex = "#4dac26"
colormap = cm.LinearColormap(colors=[start_col_hex,
#"white",
end_col_hex],
vmin=start_col_val,
vmax=end_col_val
)
colormap.caption = "Safety average by district"
for index, row in safety_avg_by_district.iterrows():
folium.Circle(location=[row["lat"],row["lon"]],
color=colormap(row[color_parameter]),
fill_color=colormap(row[color_parameter]),
fill_opacity=0.75,
radius=row[size_parameter],
weight=2,
tooltip="District "+row["school_dist"]+
" <br> Avg. Safety / Respect Score, students response (color): {:.1f}".format(row[color_parameter])+
" <br> Avg. enrollment per school (size): {:.0f}".format(row[size_parameter])
).add_to(school_map)
school_map.add_child(colormap)
school_map
saf_pivot_df = combined.pivot_table(index=['boro', 'school_dist'], values='saf_s_11', margins='All')
index = saf_pivot_df.index
saf_pivot_df
saf_pivot_df = saf_pivot_df.reset_index(drop=True)
df = pd.DataFrame(index.tolist())
df['saf_s_11'] = saf_pivot_df['saf_s_11']
df
df = df.rename(columns={0: "boro", 1: "school_dist"})
df.drop(df.tail(1).index,inplace=True)
df
import plotly.express as px
fig = px.bar(df, x='school_dist', y='saf_s_11', color='boro', title='Saf_s_11 by school district')
fig.show()
From the Safety average by district interactive map, it looks like:
From the safety by district table printed and Saf_s_11 by school district interactive bar chart we can exctract more precise conclusions:
race_percentage_indicators = [
'white_per',
'asian_per',
'black_per',
'hispanic_per'
]
plt.figure(figsize=(20, 10))
ax = correlations[race_percentage_indicators].plot.bar()
ax.set_title("Racial correlation with SAT performance", fontweight="bold", fontsize=15,)
ax.set_xlabel("Race percentage")
ax.set_ylabel("Correlation")
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)
From the Racial correlation with SAT performance it seems like a higher percentage of white or asian students correlates positively with the SAT score averages.
On the other hand, seems looks like a higher percentage of black and hispanic students correlates negatively with the SAT score averages.
On this point we have two questions are this correlations due a lack of funding for schools in certain areas, with higher percentage of black and hispanic students or simply white and asian students are more fitting better in the current educational system.
warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
x='hispanic_per',
y='sat_score',
categories='boro',
text="SCHOOL NAME",
vline= {"x":combined["hispanic_per"].mean(),
"color":"#000000",
"dash": "dash"
},
hline={"y":combined["sat_score"].mean(),
"color":"#000000",
"dash": "dash"
},
xTitle='Hispanic %',
yTitle='Satisfaction score avg',
title='Hispanic per vs Sat score'
)
It tseems that there isn't a strong correlation between Hispanic per and Sat score but the previous plot shows clearly that independently of the percentage of the hispanic students, the Sat score is between 1000 and 1400, as we have seen previously.
What the previous scatter plot shows is that the schools with a percentage of hispanic students between 0% and 20% are the ones with a higher average Satisfaction score.
On the other hand the extreme cases of the schools with a a hispanic students percentage of 100% are the ones with lower SAT score. This may be due a lack of capital on this schools for hispanic students.
combined[combined['hispanic_per'] > 95]['SCHOOL NAME']
Schools info. After reading the descriptions of the listed schools, they appear to primarily be geared towards recent immigrants to the US. These schools students are starting the inclusion program by learning English, which would explain the lower SAT scores.
combined[(combined['hispanic_per'] < 10) & (combined['sat_score'] > 1800)]['SCHOOL NAME']
The schools listed above are specialized on science and technology, that generally receive extra founding to foment investigation and research. They only admin students after passing an entrance exam.
There is no explanation why this percentage of hispanic students is so low, but the admision process may explain why the SAT score of this students is higher.
plt.figure(figsize=(20, 10))
ax = correlations[['male_per', 'female_per']].plot.bar()
ax.set_title("Gender differences in SAT scores", fontweight="bold", fontsize=15,)
ax.set_xlabel("Gender",fontsize=12)
ax.set_ylabel("Correlation",fontsize=12)
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
In the Gender differences in SAT scores, we can see that females have a high positive correlation with SAT scores and males a high negative correlation.
warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
x='female_per',
y='sat_score',
categories='boro',
text="SCHOOL NAME",
vline= {"x":combined["female_per"].mean(),
"color":"#000000",
"dash": "dash"
},
hline={"y":combined["sat_score"].mean(),
"color":"#000000",
"dash": "dash"
},
xTitle='Female %',
yTitle='Satisfaction score avg',
title='Female per vs Sat score'
)
From the scatter above we can see a strong correlation between Satisfaction Score and Female percentage. But, there is a cluster of schools with a female percentage between 55% and 65% with a high SAT score.
combined[(
combined["female_per"] > 60) &
(combined["sat_score"] > 1700)]["SCHOOL NAME"]
The shwon cluster in the scatter plot which schools are listed aboved, are formed by schools with high academic standards and specialized on arts. School info example.
In the U.S., high school students take Advanced Placement (AP). exams to earn college credit. There are AP exams for many different subjects.
It makes sense that the number of students at a school who took AP exams would be highly correlated with the school's SAT scores. Let's explore this relationship. Because total_enrollment is highly correlated with sat_score, we don't want to bias our results. Instead, we'll look at the percentage of students in each school who took at least one AP exam.
combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]
fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'ap_per', 'boro']].dropna(),
trendline='ols',
x="ap_per",
y="sat_score",
size="ap_per",
color="sat_score",
size_max=15,
hover_name="SCHOOL NAME",
hover_data=['boro', 'ap_per','sat_score'],
color_continuous_scale=px.colors.sequential.Viridis,
title='AP per vs Sat score')
fig.show()
It looks like there is a relationship between the percentage of students in a school who take the AP exam, and their average SAT scores. It's not an extremely strong correlation, though. There are some cases with a high ap_per and low sat_score and some cases with low ap_per and low sat_score.
combined[['sat_score','SCHOOL NAME','AVERAGE CLASS SIZE', 'boro']]
fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'AVERAGE CLASS SIZE', 'boro']].dropna(),
trendline='ols',
x="AVERAGE CLASS SIZE",
y="sat_score",
size="AVERAGE CLASS SIZE",
color="sat_score",
size_max=15,
hover_name="SCHOOL NAME",
hover_data=['boro', 'AVERAGE CLASS SIZE','sat_score'],
color_continuous_scale=px.colors.sequential.Viridis,
title='Average Class Size vs SAT Scores')
fig.show()
From the previous scatter plot we can extract that there is a correlation and in general schools with higher average in terms of class size seem to have also a higher average in SAT scores.
combined.head(5)
combined.describe()
correlations = combined.corr()
correlations = correlations["sat_score"]
correlations
To compute the Schools Final Score from the correlations array, all the columns with a positive correlation bigger than the threshold (threshold=0.35) will be used.
positivie_correlations = correlations[correlations > 0.35]
positivie_correlations
Before that, we will remove the columns that contain repeted information or may affect on the final score, this columns are:
positivie_correlations = positivie_correlations.drop(labels=['asian_num', 'white_num'])
aux = combined[positivie_correlations.index]
aux
aux.describe()
Now we have a dataset with the most signigicant columns. But as it is shown in the previous description this columns values are in different ranges, that's why we previously importes Sklearn preprocessing, now all this dataset values will be scaled to values between 0 and 1.
x = aux.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
df
df.describe()
df['final_score'] = df.sum(axis=1)
df['final_score'] = (df['final_score'] * 10) / 17
df
df['final_score'].describe()
Now let's add the final_score column to the previous combined dataframe.
combined['final_score'] = df['final_score']
combined
combined.loc[combined['final_score'] > 5]
combined.loc[combined['final_score'].idxmax()]
As we can see following our sclae, only 8 schools have a final sat score bigger than 5. The school with the best final sat score is BROOKLYN TECHNICAL HIGH SCHOOL. The next chart shows all the final sat score results for each school.
fig = px.bar(combined, x='SCHOOL NAME', y='final_score', title='Final score by school')
fig.show()
combined['final_score'].describe()
This project combined and analyzed 7 data sets that contained information no only about SAT scores, also included schools of NY and student's demographic factors. The main objective of this project was to analyze and find which are the factors that influence the most performances in SATs.
From the New York Schools Sat Scores Distributions histogram we have seen that the majority of the schools have an SAT score between 1100 and 1300. Also, there is a small group with SAT scores over 1600, this is considered schools with a very high SAT score.
From the correlations plot we concluded that columns like total_enrollment have a good correlation with SAT score but the most important conclusion is that there is a high correlation between SAT score and the saf_t_11 and saf_s_11 columns which refer to the safety ranks of the schools.
After that, we studied the difference in the results depending on the respondent. In the Question values by respondent chart, we can see that all perceptions independently of the role of the participant are common. In general students are the ones that rate things slightly lower than teachers and parents. On the other hand, parents are the ones who rate things slightly higher. Teachers are in the middle quite near the student's ratings. Safety rating seems to be the aspect where students and parents differ the most, knowing that and the high correlation between safety and SAT scores we decided to explore deeply Safety Scores.
From the safety exploration and the Safety average by district interactive map, we have seen that safety influences the average Sat Score, there are no high SAT scores for schools with low safety scores. But there are schools with a low SAT score and a normal/high safety score and low values of other factors. In the Safety average by district interactive map we have seen that safest schools are located in Manhattan and in the nearby regions of Queens and Bronx.
In terms of ethnic groups, we have seen that Hispanic and black ethnic groups are very disadvantaged in comparison with White and Asian ethnic groups. In general, they form or formed part of schools with poor academic standards and low safety ranks.
We have seen that there is no strong evidence depending on students gender. Also, there is no evidence that AP exam preparation influences in the SAT scores.
In the last step, we have seen that there is a relationship between average class size and SAT scores, schools with bigger classes tend to have a better SAT score (also there are exceptions).
Finally, we have used the features correlations values with SAT_score and a threshold of 0.35 to determine which are the most important columns. After reducing this list of features we used the Scikit Learn Preprocessing MinMaxScaler to normalize all these values in a range of 0 and 1. After we summed all the results a divided it by the total num of columns. Last, the results were multiplied by 10 to convert it in a 0 to 10 scale. That's how we created the final_score for each school.