Analyzing NYC High School Data¶

Introduction¶

The SAT (Scholastic Aptitude Test) is administered during the senior year of high school. Colleges across the country rely on SAT scores as one of their key admission criteria.

We will use data published by the New York City Public schools to investigate some of these questions and explore some of the demographic factors that may be connected to variations in SAT scores. The initial dataset covers 435 highschools with over 100 pieces of information on school performance, programs, and student demographics for each school.

New York City has a significant immigrant population and is very diverse, so one of the aims of this project is to find correlations between SAT scores and demographic factors such as gender, race, income, and other variables.

Conclusions¶

After analyzing the different features that may influence the SAT score results, we arrived to the following conclusions:

From the New York Schools Sat Scores Distributions histogram we have seen that the majority of the schools have a SAT score between 1100 and 1300.

SAT score is highly related with total enrollment (N_s and N_p) and safety (saf_t_11 and saf_s_11).

Safety average by district interactive map shows us that safest schools which are located in Manhattan and in the nearby regions of Queens and Bronx have a hogher safety score.

Hispanic and Black ethnic groups are very disadvantaged in comparision with White and Asian ethnic groups.

There is no strong evidence of differences depending on students gender.

In general schools with bigger classes tend to have a better SAT score.

Prerequisites¶

In this project are used the following Python packages:

Numpy

Pandas

Seaborn

Matplotlib

Warnings

Plotly

Cufflinks

Sklearn

Read in the data¶

The final data set combines 7 different data sets freely published by New York City in order to find correlations between SAT scores, different demographic measures (for example, gender/ethnicity) or schools satisfaction scores (for example, school safety):

SAT scores by school: SAT scores for each high school in New York City.

School attendance: Attendance information for each school in New York City.

School survey: Surveys of parents, teachers, and students at each school.

Class size: Information on class size for each school.

AP test results: Advanced Placement (AP) exam.

Graduation outcomes: he percentage of students who graduated, and other outcome information.

Demographics: Demographic information for each school.

In [160]:

import matplotlib.pyplot as plt
import warnings
import seaborn as sns
import pandas as pd
import numpy
import re
%matplotlib inline

In [161]:

from plotly.offline import iplot, init_notebook_mode
import plotly.express as px
import cufflinks as cf

In [162]:

from sklearn import preprocessing

In [163]:

cf.go_offline(connected=True)
init_notebook_mode(connected=True)

In [164]:

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

Read in the surveys¶

In [165]:

all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey

Add DBN columns¶

In [166]:

data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

Convert columns to numeric¶

In [167]:

cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

Condense datasets¶

In [168]:

class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

Convert AP scores to numeric¶

In [169]:

cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

Combine the datasets¶

In [170]:

combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

In [171]:

plt.figure(figsize=(20, 10))
sns.heatmap(combined.isnull())

Out[171]:

<AxesSubplot:>

In [172]:

combined.isnull().sum()

Out[172]:

DBN                                  0
SCHOOL NAME                          0
Num of SAT Test Takers               0
SAT Critical Reading Avg. Score     28
SAT Math Avg. Score                 28
                                  ... 
priority09                         363
priority10                         363
Location 1                           0
lat                                  0
lon                                  0
Length: 159, dtype: int64

After creating the combined dataset, the previous heatmap shows us that ther are som columns like priority09 which contain mainly null values. To solve this problem, data imputation technique will be applied.

Data Imputation¶

In [173]:

combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
combined.head(5)

Out[173]:

	DBN	SCHOOL NAME	Num of SAT Test Takers	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	SchoolName	AP Test Takers	Total Exams Taken	...	priority04	priority05	Location 1	lat	lon
0	01M292	HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES	29	355.0	404.0	363.0	1122.0	0	129.028846	197.038462	...	Then to Manhattan students or residents	Then to New York City residents	220 Henry Street\nNew York, NY 10002\n(40.7137...	40.713764	-73.985260
1	01M448	UNIVERSITY NEIGHBORHOOD HIGH SCHOOL	91	383.0	423.0	366.0	1172.0	UNIVERSITY NEIGHBORHOOD H.S.	39.000000	49.000000	...	0	0	200 Monroe Street\nNew York, NY 10002\n(40.712...	40.712332	-73.984797
2	01M450	EAST SIDE COMMUNITY SCHOOL	70	377.0	402.0	370.0	1149.0	EAST SIDE COMMUNITY HS	19.000000	21.000000	...	0	0	420 East 12 Street\nNew York, NY 10009\n(40.72...	40.729783	-73.983041
3	01M509	MARTA VALLE HIGH SCHOOL	44	390.0	433.0	384.0	1207.0	0	129.028846	197.038462	...	0	0	145 Stanton Street\nNew York, NY 10002\n(40.72...	40.720569	-73.985673
4	01M539	NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ...	159	522.0	574.0	525.0	1621.0	NEW EXPLORATIONS SCI,TECH,MATH	255.000000	377.000000	...	0	0	111 Columbia Street\nNew York, NY 10002\n(40.7...	40.718725	-73.979426

5 rows × 159 columns

Add a school district column for mapping¶

In [174]:

def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

New York Schools sat scores distribution¶

In [175]:

plt.figure(figsize=(20, 10))
sns.set_style("darkgrid")
ax = sns.distplot(combined['sat_score'])
ax.set_yticks([]) 
ax.set_title("New York Schools Sat Scores Distributions", fontweight="bold", fontsize=15,)
ax.set_xlabel("Sat Score")
ax.set_ylabel("")
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)

The New York Schools Sat Scores Distributions histogram shows us that the majority of the schools have a sat score between 1100 and 1300. There are some schools over 1600 sat score, these may be considered as schools with a high sat score.

Find correlations¶

In [176]:

correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)

SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.523140
                                     ...   
priority08                              NaN
priority09                              NaN
priority10                              NaN
lat                               -0.121029
lon                               -0.132222
Name: sat_score, Length: 67, dtype: float64

Plotting survey correlations¶

In [177]:

# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")

In [178]:

sns.set_style("darkgrid")

In [179]:

plt.figure(figsize=(20, 20))
ax = sns.barplot(y=survey_fields, x=correlations[survey_fields], palette='Blues')
ax.set_title("Sat score correlation with survey fields", fontweight="bold", fontsize=15,)
ax.set_xlabel("Correlation with SAT Scores",fontsize=12)
ax.set_ylabel("",fontsize=14)
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)

As is shown in the previous Sat score correlation with survey fields plot. There are high correlations between sat_score and N_s and N_p. It makes sense, since these columns are correlated with the total_enrollment column.

Follwed by the safety columns saf_t_11 and saf_s_11 which are high correlated with the sat_score column to.

Also, another interesting correlation but no with the same importance as the previous ones is the sat_score correlation with rr_s colum. Which shows the student response rate.

Differences between parent, teacher and students responses¶

In [180]:

df = combined[['DBN'] + survey_fields]
df

Out[180]:

	DBN	rr_s	rr_t	rr_p	N_s	N_t	N_p	saf_p_11	com_p_11	eng_p_11	...	eng_t_11	aca_t_11	saf_s_11	com_s_11	eng_s_11	aca_s_11	saf_tot_11	com_tot_11	eng_tot_11	aca_tot_11
0	01M292	89.0	70	39	379.000000	26.0	151.0	7.8	7.7	7.4	...	6.1	6.5	6.000000	5.600000	6.100000	6.700000	6.7	6.2	6.6	7.0
1	01M448	84.0	95	10	385.000000	37.0	46.0	7.9	7.4	7.2	...	6.6	7.3	6.000000	5.700000	6.300000	7.000000	6.8	6.3	6.7	7.2
2	01M450	0.0	98	28	598.208333	42.0	150.0	8.7	8.2	8.1	...	8.0	8.8	6.611667	6.094722	6.620278	7.381389	7.9	7.9	7.9	8.4
3	01M509	90.0	100	21	306.000000	29.0	69.0	7.7	7.4	7.2	...	6.1	6.8	6.400000	5.900000	6.400000	7.000000	6.9	6.2	6.6	7.0
4	01M539	98.0	68	51	923.000000	67.0	736.0	8.5	7.9	7.9	...	5.9	7.3	7.300000	6.400000	7.000000	7.700000	7.8	6.7	6.9	7.8
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
358	32K545	80.0	57	34	504.000000	28.0	205.0	8.0	7.5	7.5	...	3.4	4.8	6.400000	5.800000	6.200000	7.000000	6.7	5.2	5.7	6.5
359	32K549	73.0	88	41	295.000000	30.0	160.0	8.2	8.1	7.9	...	8.1	8.5	6.500000	5.900000	6.600000	7.300000	7.3	7.2	7.5	8.1
360	32K552	59.0	49	31	256.000000	17.0	127.0	7.9	7.9	7.7	...	7.0	7.1	6.400000	5.700000	6.700000	7.300000	6.9	6.5	7.1	7.4
361	32K554	88.0	81	38	206.000000	13.0	80.0	9.4	8.6	8.5	...	9.1	9.6	8.900000	7.100000	8.400000	8.900000	9.3	8.1	8.7	9.1
362	32K556	67.0	71	35	323.000000	24.0	164.0	8.5	7.9	7.8	...	7.6	7.7	6.600000	6.000000	6.700000	7.300000	7.4	7.0	7.4	7.7

363 rows × 23 columns

In [181]:

df = df.melt(id_vars='DBN')
df

Out[181]:

	DBN	variable	value
0	01M292	rr_s	89.0
1	01M448	rr_s	84.0
2	01M450	rr_s	0.0
3	01M509	rr_s	90.0
4	01M539	rr_s	98.0
...	...	...	...
7981	32K545	aca_tot_11	6.5
7982	32K549	aca_tot_11	8.1
7983	32K552	aca_tot_11	7.4
7984	32K554	aca_tot_11	9.1
7985	32K556	aca_tot_11	7.7

7986 rows × 3 columns

In [182]:

aux = pd.DataFrame(df['variable'].str.split("_").to_list())
df['question'] = aux[0]
df['respondent'] = aux[1]
df = df.drop(columns=['variable'])
df

Out[182]:

	DBN	value	question	respondent
0	01M292	89.0	rr	s
1	01M448	84.0	rr	s
2	01M450	0.0	rr	s
3	01M509	90.0	rr	s
4	01M539	98.0	rr	s
...	...	...	...	...
7981	32K545	6.5	aca	tot
7982	32K549	8.1	aca	tot
7983	32K552	7.4	aca	tot
7984	32K554	9.1	aca	tot
7985	32K556	7.7	aca	tot

7986 rows × 4 columns

In [183]:

df['question'].value_counts(dropna=False)

Out[183]:

saf    1452
com    1452
eng    1452
aca    1452
rr     1089
N      1089
Name: question, dtype: int64

In [184]:

df['respondent'].value_counts(dropna=False)

Out[184]:

t      2178
s      2178
p      2178
tot    1452
Name: respondent, dtype: int64

In [185]:

df = df[df['respondent'] != 'tot']
df = df[df['question'] != 'N']
df = df[df['question'] != 'rr']
df['respondent'] = df['respondent'].map({
    'p': 'Parent', 
    't':'Teacher', 
    's':'Student'})
df['question'] = df['question'].map({
    'aca': 'Academic Expectations', 
    'eng': 'Engagement', 
    'com': 'Communication',
    'saf': 'Safety'})
df = df.reset_index(drop=True)
df

Out[185]:

	DBN	value	question	respondent
0	01M292	7.8	Safety	Parent
1	01M448	7.9	Safety	Parent
2	01M450	8.7	Safety	Parent
3	01M509	7.7	Safety	Parent
4	01M539	8.5	Safety	Parent
...	...	...	...	...
4351	32K545	7.0	Academic Expectations	Student
4352	32K549	7.3	Academic Expectations	Student
4353	32K552	7.3	Academic Expectations	Student
4354	32K554	8.9	Academic Expectations	Student
4355	32K556	7.3	Academic Expectations	Student

4356 rows × 4 columns

In [186]:

plt.figure(figsize=(20, 10))
sns.set_style("darkgrid")
ax = sns.barplot(x='question',
                 y='value',
                 hue='respondent',
                 data=df,
                 palette=sns.color_palette("GnBu_d"))
ax.set_title("Question values by respondent", fontweight="bold", fontsize=15,)
ax.set_xlabel("Question")
ax.set_ylabel("Value")
ax.tick_params(bottom=True, top=False, left=True, right=False, labelbottom=True)
sns.despine(left=True)

From the Question values by respondent chart, we can see that all perceptions independently of the role of the participant are common. In general students are the ones that rate things slightly lower than teachers and parents. On the other hand, parents are the ones who rate things slightly higher. Teachers are in the middle quite near the student's ratings.

The only type of question where students and parents ratings are quite the same is Academic Expectations.

On the other hand, Safety rating seems to be the aspect where they differ the most. That's why in the next section the Safety scores will be deeply explored.

Exploring safety scores¶

In [187]:

warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
    x='saf_s_11',
    y='sat_score',
    categories='boro',
    text="SCHOOL NAME",
    vline= {"x":combined["saf_s_11"].mean(),
            "color":"#000000",
            "dash": "dash"
           },
    hline={"y":combined["sat_score"].mean(),
           "color":"#000000",
           "dash": "dash"
          },
    xTitle='Safety score 11',
    yTitle='Satisfaction score avg',
    title='Exploring safety score'
)

After seeing the Exploring safety score we can conclude that there is a correlation between the sat_score and sad_s_11.

The majority of the schools have an average saf_s_11 between 5.5 and 7.5 and an average sat_score over 1000 and 1400.

Also, we can see that there is a high correlation between schools with an extremely high SAT score and sad_s_11 score.

To sum up, there is a correlation between these indicators but it isn't so strong at may seem at first look.

In [188]:

fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'saf_tot_11', 'boro']].dropna(),
                 trendline='ols',
                 x="saf_tot_11", 
                 y="sat_score", 
                 size="saf_tot_11", 
                 color="sat_score", 
                 size_max=15, 
                 hover_name="SCHOOL NAME", 
                 hover_data=['boro', 'saf_tot_11','sat_score'],
                 color_continuous_scale=px.colors.sequential.Viridis, 
                 title='Environment Safety and Respect Perception Correlation with SAT Scores')
fig.show()

In the Environment Safety and Respect Perception Correlation with SAT Scores plot we can see clearly the postivie correlation between sad_tot_11 and sat_scores with the number of outliers in the upper-right corner.

On the other hand, the previous plot also shows us that there is a reduced group of schools which have a high safety rank but low SAT scores. In the lower right part of the chart, this schools are mainly from Queens and Brooklyn.

Mapping safety scores¶

In [189]:

safety_avg_by_district = combined.groupby('school_dist').agg(numpy.mean)
safety_avg_by_district

Out[189]:

	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	AP Test Takers	Total Exams Taken	Number of Exams with scores 3 4 or 5	Total Cohort	CSD	NUMBER OF STUDENTS / SEATS FILLED	...	grade_span_max	expgrade_span_max	zip	total_students	number_programs	priority08	priority09	priority10	lat	lon
school_dist
01	441.833333	473.333333	439.333333	1354.500000	116.681090	173.019231	135.800000	93.500000	1.0	115.244241	...	12.0	12.0	10003.166667	659.500000	1.333333	0.0	0.0	0.0	40.719022	-73.982377
02	426.619092	444.186256	424.832836	1295.638184	128.908454	201.516827	157.495833	158.647849	2.0	149.818949	...	12.0	12.0	10023.770833	621.395833	1.416667	0.0	0.0	0.0	40.739699	-73.991386
03	428.529851	437.997512	426.915672	1293.443035	156.183494	244.522436	193.087500	183.384409	3.0	156.005994	...	12.0	12.0	10023.750000	717.916667	2.000000	0.0	0.0	0.0	40.781574	-73.977370
04	402.142857	416.285714	405.714286	1224.142857	129.016484	183.879121	151.035714	113.857143	4.0	132.362265	...	12.0	12.0	10029.857143	580.857143	1.142857	0.0	0.0	0.0	40.793449	-73.943215
05	427.159915	438.236674	419.666098	1285.062687	85.722527	115.725275	142.464286	143.677419	5.0	120.623901	...	12.0	12.0	10030.142857	609.857143	1.142857	0.0	0.0	0.0	40.817077	-73.949251
06	382.011940	400.565672	382.066269	1164.643881	108.711538	159.715385	105.425000	180.848387	6.0	139.041709	...	12.0	12.0	10036.200000	628.900000	1.300000	0.0	0.0	0.0	40.848970	-73.932502
07	376.461538	380.461538	371.923077	1128.846154	73.703402	112.476331	105.276923	105.605459	7.0	97.597416	...	12.0	12.0	10452.692308	465.846154	1.461538	0.0	0.0	0.0	40.816815	-73.919971
08	386.214383	395.542741	377.908005	1159.665129	118.379371	168.020979	144.731818	215.510264	8.0	129.765099	...	12.0	12.0	10467.000000	547.636364	1.272727	0.0	0.0	0.0	40.823803	-73.866087
09	373.755970	383.582836	374.633134	1131.971940	71.411538	104.265385	98.470000	113.330645	9.0	100.118588	...	12.0	12.0	10456.100000	449.700000	1.150000	0.0	0.0	0.0	40.836349	-73.906240
10	403.363636	418.000000	400.863636	1222.227273	132.231206	226.914336	191.618182	161.318182	10.0	168.876526	...	12.0	12.0	10463.181818	757.863636	1.500000	0.0	0.0	0.0	40.870345	-73.898360
11	389.866667	394.533333	380.600000	1165.000000	83.813462	122.484615	108.833333	122.866667	11.0	129.031031	...	12.0	12.0	10467.933333	563.666667	1.533333	0.0	0.0	0.0	40.873138	-73.856120
12	364.769900	379.109453	357.943781	1101.823134	93.102564	139.442308	153.450000	110.467742	12.0	91.684504	...	12.0	12.0	10463.166667	409.000000	1.083333	0.0	0.0	0.0	40.831412	-73.886946
13	409.393800	424.127440	403.666361	1237.187600	232.931953	382.704142	320.773077	224.595533	13.0	218.306055	...	12.0	12.0	11207.153846	895.153846	2.076923	0.0	0.0	0.0	40.692865	-73.977016
14	395.937100	398.189765	385.333049	1179.459915	77.798077	114.873626	123.282143	112.347926	14.0	123.643728	...	12.0	12.0	11210.785714	545.357143	2.000000	0.0	0.0	0.0	40.711599	-73.948360
15	395.679934	404.628524	390.295854	1190.604312	94.574786	141.581197	153.450000	104.207885	15.0	135.707319	...	12.0	12.0	11214.222222	573.111111	1.666667	0.0	0.0	0.0	40.675972	-73.989255
16	371.529851	379.164179	369.415672	1120.109701	82.264423	126.519231	153.450000	247.185484	16.0	177.501282	...	12.0	12.0	11219.000000	440.250000	1.750000	0.0	0.0	0.0	40.688008	-73.929686
17	386.571429	394.071429	380.785714	1161.428571	105.583791	163.087912	111.360714	121.357143	17.0	130.246192	...	12.0	12.0	11220.642857	547.071429	1.642857	0.0	0.0	0.0	40.660313	-73.955636
18	373.454545	373.090909	371.454545	1118.000000	129.028846	197.038462	153.450000	72.771261	18.0	72.209438	...	12.0	12.0	11224.000000	344.000000	1.090909	0.0	0.0	0.0	40.641863	-73.914726
19	367.083333	377.583333	359.166667	1103.833333	88.097756	124.769231	120.670833	114.322581	19.0	105.752625	...	12.0	12.0	11207.500000	440.416667	1.916667	0.0	0.0	0.0	40.676547	-73.882158
20	406.223881	465.731343	401.732537	1273.687761	227.805769	359.407692	177.690000	591.374194	20.0	420.029766	...	12.0	12.0	11210.200000	2521.400000	3.800000	0.0	0.0	0.0	40.626751	-74.006191
21	395.283582	421.786974	389.242062	1206.312619	135.467657	203.835664	142.377273	275.351906	21.0	224.702989	...	12.0	12.0	11221.000000	1098.272727	3.272727	0.0	0.0	0.0	40.593596	-73.978465
22	473.500000	502.750000	474.250000	1450.500000	391.007212	614.509615	370.362500	580.250000	22.0	495.279369	...	12.0	12.0	11223.000000	2149.000000	2.250000	0.0	0.0	0.0	40.618285	-73.952288
23	380.666667	398.666667	378.000000	1157.333333	29.000000	31.000000	153.450000	87.000000	23.0	120.113095	...	12.0	12.0	11219.000000	391.000000	1.333333	0.0	0.0	0.0	40.668586	-73.912298
24	405.846154	434.000000	402.153846	1242.000000	126.474852	179.094675	115.165385	234.682382	24.0	213.471903	...	12.0	12.0	11206.153846	962.461538	2.230769	0.0	0.0	0.0	40.740621	-73.911518
25	437.250000	483.500000	436.250000	1357.000000	205.260817	279.889423	174.793750	268.733871	25.0	280.576007	...	12.0	12.0	11361.000000	1288.875000	1.875000	0.0	0.0	0.0	40.745414	-73.815558
26	445.200000	487.600000	444.800000	1377.600000	410.605769	632.407692	392.090000	825.600000	26.0	595.953216	...	12.0	12.0	11388.600000	2837.400000	4.600000	0.0	0.0	0.0	40.748507	-73.759176
27	407.800000	422.200000	394.300000	1224.300000	100.611538	145.315385	95.125000	288.961290	27.0	249.324536	...	12.0	12.0	11556.300000	1072.000000	2.500000	0.0	0.0	0.0	40.638828	-73.807823
28	445.941655	465.997286	435.908005	1347.846947	182.010490	273.559441	175.336364	351.214076	28.0	255.381164	...	12.0	12.0	11422.000000	1304.272727	2.545455	0.0	0.0	0.0	40.709344	-73.806367
29	395.764925	399.457090	386.707836	1181.929851	63.385817	96.514423	135.268750	98.108871	29.0	88.372155	...	12.0	12.0	11413.625000	474.125000	1.250000	0.0	0.0	0.0	40.685276	-73.752740
30	430.679934	465.961857	429.740299	1326.382090	157.231838	252.123932	115.150000	310.526882	30.0	251.803744	...	12.0	12.0	11103.000000	1123.333333	2.555556	0.0	0.0	0.0	40.755398	-73.932306
31	457.500000	472.500000	452.500000	1382.500000	228.908654	355.111538	194.435000	450.787097	31.0	380.528319	...	12.0	12.0	10307.100000	1847.500000	5.000000	0.0	0.0	0.0	40.595680	-74.125726
32	371.500000	385.833333	362.166667	1119.500000	70.342949	100.179487	83.558333	105.333333	32.0	100.525613	...	12.0	12.0	11231.666667	381.500000	1.000000	0.0	0.0	0.0	40.696295	-73.917124

32 rows × 67 columns

In [190]:

safety_avg_by_district.reset_index(inplace=True)

In [191]:

import folium
import branca.colormap as cm

In [192]:

# NY = [40.693943, -73.985880]
school_map = folium.Map(location=[40.693943, -73.985880], 
                        #width=500,
                        #height=800,
                        zoom_start=10,
                        tiles='CartoDB positron')

map_title = "Safety average by district"
title_html = '''
             <h3 align="center" style="font-size:20px"><b>{}</b></h3>
             '''.format(map_title)
school_map.get_root().html.add_child(folium.Element(title_html))

size_parameter = "total_enrollment"
color_parameter = "saf_s_11"


start_col_val = safety_avg_by_district[color_parameter].min()
end_col_val = safety_avg_by_district[color_parameter].max()
start_col_hex = "#d01c8b" # used https://colorbrewer2.org/ 
end_col_hex = "#4dac26"
colormap = cm.LinearColormap(colors=[start_col_hex,
                                     #"white",
                                     end_col_hex],
                             vmin=start_col_val,
                             vmax=end_col_val
                            )
colormap.caption = "Safety average by district"

for index, row in safety_avg_by_district.iterrows():
    folium.Circle(location=[row["lat"],row["lon"]],
                  color=colormap(row[color_parameter]),
                  fill_color=colormap(row[color_parameter]),
                  fill_opacity=0.75,
                  radius=row[size_parameter],
                  weight=2,
                  tooltip="District "+row["school_dist"]+
                      " <br> Avg. Safety / Respect Score, students response (color): {:.1f}".format(row[color_parameter])+
                      " <br> Avg. enrollment per school (size): {:.0f}".format(row[size_parameter])
                 ).add_to(school_map)

school_map.add_child(colormap)
    

school_map

Out[192]:

Make this Notebook Trusted to load map: File -> Trust Notebook

In [193]:

saf_pivot_df = combined.pivot_table(index=['boro', 'school_dist'], values='saf_s_11', margins='All')
index = saf_pivot_df.index
saf_pivot_df

Out[193]:

		saf_s_11
boro	school_dist
Bronx	07	6.800000
	08	6.364697
	09	6.655000
	10	6.577273
	11	6.186667
	12	7.116667
Brooklyn	13	6.407692
	14	6.685714
	15	6.177778
	16	5.875000
	17	6.007143
	18	6.190909
	19	6.150000
	20	7.120000
	21	6.390909
	22	7.025000
	23	6.233333
	32	6.766667
Manhattan	01	6.768611
	02	6.910660
	03	6.716667
	04	6.885714
	05	6.314286
	06	6.950000
Queens	24	7.107692
	25	6.912500
	26	6.760000
	27	6.390000
	28	6.627273
	29	6.075000
	30	7.033333
Staten Island	31	6.530000
All		6.611667

In [194]:

saf_pivot_df = saf_pivot_df.reset_index(drop=True)

In [195]:

df = pd.DataFrame(index.tolist())
df['saf_s_11'] = saf_pivot_df['saf_s_11']

In [196]:

df

Out[196]:

	0	1	saf_s_11
0	Bronx	07	6.800000
1	Bronx	08	6.364697
2	Bronx	09	6.655000
3	Bronx	10	6.577273
4	Bronx	11	6.186667
5	Bronx	12	7.116667
6	Brooklyn	13	6.407692
7	Brooklyn	14	6.685714
8	Brooklyn	15	6.177778
9	Brooklyn	16	5.875000
10	Brooklyn	17	6.007143
11	Brooklyn	18	6.190909
12	Brooklyn	19	6.150000
13	Brooklyn	20	7.120000
14	Brooklyn	21	6.390909
15	Brooklyn	22	7.025000
16	Brooklyn	23	6.233333
17	Brooklyn	32	6.766667
18	Manhattan	01	6.768611
19	Manhattan	02	6.910660
20	Manhattan	03	6.716667
21	Manhattan	04	6.885714
22	Manhattan	05	6.314286
23	Manhattan	06	6.950000
24	Queens	24	7.107692
25	Queens	25	6.912500
26	Queens	26	6.760000
27	Queens	27	6.390000
28	Queens	28	6.627273
29	Queens	29	6.075000
30	Queens	30	7.033333
31	Staten Island	31	6.530000
32	All		6.611667

In [197]:

df = df.rename(columns={0: "boro", 1: "school_dist"})
df.drop(df.tail(1).index,inplace=True)
df

Out[197]:

	boro	school_dist	saf_s_11
0	Bronx	07	6.800000
1	Bronx	08	6.364697
2	Bronx	09	6.655000
3	Bronx	10	6.577273
4	Bronx	11	6.186667
5	Bronx	12	7.116667
6	Brooklyn	13	6.407692
7	Brooklyn	14	6.685714
8	Brooklyn	15	6.177778
9	Brooklyn	16	5.875000
10	Brooklyn	17	6.007143
11	Brooklyn	18	6.190909
12	Brooklyn	19	6.150000
13	Brooklyn	20	7.120000
14	Brooklyn	21	6.390909
15	Brooklyn	22	7.025000
16	Brooklyn	23	6.233333
17	Brooklyn	32	6.766667
18	Manhattan	01	6.768611
19	Manhattan	02	6.910660
20	Manhattan	03	6.716667
21	Manhattan	04	6.885714
22	Manhattan	05	6.314286
23	Manhattan	06	6.950000
24	Queens	24	7.107692
25	Queens	25	6.912500
26	Queens	26	6.760000
27	Queens	27	6.390000
28	Queens	28	6.627273
29	Queens	29	6.075000
30	Queens	30	7.033333
31	Staten Island	31	6.530000

In [198]:

import plotly.express as px
fig = px.bar(df, x='school_dist', y='saf_s_11', color='boro', title='Saf_s_11 by school district')
fig.show()

From the Safety average by district interactive map, it looks like:

High safety scores: Upper Manhattan and parts of Queens.

Low safety scores: Brooklyn.

From the safety by district table printed and Saf_s_11 by school district interactive bar chart we can exctract more precise conclusions:

saf_s_11 scores, range from a low of 5.87 for district 16 in Brooklyn, to a high of 7.12 for district 20, which is also in Brooklyn.

Queens and the Bronx have districts near both the bottom and top of the range.

The unique district of Staten Island's saf_s_11 score is under the mean of the city.

Racial differences in SAT preformance¶

In [199]:

race_percentage_indicators = [
    'white_per', 
    'asian_per', 
    'black_per', 
    'hispanic_per'
]

In [200]:

plt.figure(figsize=(20, 10))
ax = correlations[race_percentage_indicators].plot.bar()
ax.set_title("Racial correlation with SAT performance", fontweight="bold", fontsize=15,)
ax.set_xlabel("Race percentage")
ax.set_ylabel("Correlation")
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
sns.despine(left=True)

From the Racial correlation with SAT performance it seems like a higher percentage of white or asian students correlates positively with the SAT score averages.

On the other hand, seems looks like a higher percentage of black and hispanic students correlates negatively with the SAT score averages.

On this point we have two questions are this correlations due a lack of funding for schools in certain areas, with higher percentage of black and hispanic students or simply white and asian students are more fitting better in the current educational system.

Hispanic per vs Sat score¶

In [201]:

warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
    x='hispanic_per',
    y='sat_score',
    categories='boro',
    text="SCHOOL NAME",
    vline= {"x":combined["hispanic_per"].mean(),
            "color":"#000000",
            "dash": "dash"
           },
    hline={"y":combined["sat_score"].mean(),
           "color":"#000000",
           "dash": "dash"
          },
    xTitle='Hispanic %',
    yTitle='Satisfaction score avg',
    title='Hispanic per vs Sat score'
)

It tseems that there isn't a strong correlation between Hispanic per and Sat score but the previous plot shows clearly that independently of the percentage of the hispanic students, the Sat score is between 1000 and 1400, as we have seen previously.

What the previous scatter plot shows is that the schools with a percentage of hispanic students between 0% and 20% are the ones with a higher average Satisfaction score.

On the other hand the extreme cases of the schools with a a hispanic students percentage of 100% are the ones with lower SAT score. This may be due a lack of capital on this schools for hispanic students.

In [202]:

combined[combined['hispanic_per'] > 95]['SCHOOL NAME']

Out[202]:

44                         MANHATTAN BRIDGES HIGH SCHOOL
82      WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL
89     GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...
125                  ACADEMY FOR LANGUAGE AND TECHNOLOGY
141                INTERNATIONAL SCHOOL FOR LIBERAL ARTS
176     PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE
253                            MULTICULTURAL HIGH SCHOOL
286               PAN AMERICAN INTERNATIONAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

Schools info¶

Schools info. After reading the descriptions of the listed schools, they appear to primarily be geared towards recent immigrants to the US. These schools students are starting the inclusion program by learning English, which would explain the lower SAT scores.

In [203]:

combined[(combined['hispanic_per'] < 10) & (combined['sat_score'] > 1800)]['SCHOOL NAME']

Out[203]:

37                                STUYVESANT HIGH SCHOOL
151                         BRONX HIGH SCHOOL OF SCIENCE
187                       BROOKLYN TECHNICAL HIGH SCHOOL
327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
356                  STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

The schools listed above are specialized on science and technology, that generally receive extra founding to foment investigation and research. They only admin students after passing an entrance exam.

There is no explanation why this percentage of hispanic students is so low, but the admision process may explain why the SAT score of this students is higher.

Gender differences in SAT scores¶

In [204]:

plt.figure(figsize=(20, 10))
ax = correlations[['male_per', 'female_per']].plot.bar()
ax.set_title("Gender differences in SAT scores", fontweight="bold", fontsize=15,)
ax.set_xlabel("Gender",fontsize=12)
ax.set_ylabel("Correlation",fontsize=12)
ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)

In the Gender differences in SAT scores, we can see that females have a high positive correlation with SAT scores and males a high negative correlation.

Female gender and SAT score¶

In [205]:

warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
    x='female_per',
    y='sat_score',
    categories='boro',
    text="SCHOOL NAME",
    vline= {"x":combined["female_per"].mean(),
            "color":"#000000",
            "dash": "dash"
           },
    hline={"y":combined["sat_score"].mean(),
           "color":"#000000",
           "dash": "dash"
          },
    xTitle='Female %',
    yTitle='Satisfaction score avg',
    title='Female per vs Sat score'
)

From the scatter above we can see a strong correlation between Satisfaction Score and Female percentage. But, there is a cluster of schools with a female percentage between 55% and 65% with a high SAT score.

In [206]:

combined[(
    combined["female_per"] > 60) & 
    (combined["sat_score"] > 1700)]["SCHOOL NAME"]

Out[206]:

5                         BARD HIGH SCHOOL EARLY COLLEGE
26                         ELEANOR ROOSEVELT HIGH SCHOOL
60                                    BEACON HIGH SCHOOL
61     FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...
302                          TOWNSEND HARRIS HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

The shwon cluster in the scatter plot which schools are listed aboved, are formed by schools with high academic standards and specialized on arts. School info example.

AP Exam Scores and SAT Scores¶

In the U.S., high school students take Advanced Placement (AP). exams to earn college credit. There are AP exams for many different subjects.

It makes sense that the number of students at a school who took AP exams would be highly correlated with the school's SAT scores. Let's explore this relationship. Because total_enrollment is highly correlated with sat_score, we don't want to bias our results. Instead, we'll look at the percentage of students in each school who took at least one AP exam.

In [207]:

combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]

AP per and SAT score¶

In [208]:

fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'ap_per', 'boro']].dropna(),
                 trendline='ols',
                 x="ap_per", 
                 y="sat_score", 
                 size="ap_per", 
                 color="sat_score", 
                 size_max=15, 
                 hover_name="SCHOOL NAME", 
                 hover_data=['boro', 'ap_per','sat_score'],
                 color_continuous_scale=px.colors.sequential.Viridis, 
                 title='AP per vs Sat score')
fig.show()

It looks like there is a relationship between the percentage of students in a school who take the AP exam, and their average SAT scores. It's not an extremely strong correlation, though. There are some cases with a high ap_per and low sat_score and some cases with low ap_per and low sat_score.

Average Class Size and SAT score¶

In [209]:

combined[['sat_score','SCHOOL NAME','AVERAGE CLASS SIZE', 'boro']]

Out[209]:

	sat_score	SCHOOL NAME	AVERAGE CLASS SIZE	boro
0	1122.0	HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES	22.564286	Manhattan
1	1172.0	UNIVERSITY NEIGHBORHOOD HIGH SCHOOL	22.231250	Manhattan
2	1149.0	EAST SIDE COMMUNITY SCHOOL	21.200000	Manhattan
3	1207.0	MARTA VALLE HIGH SCHOOL	23.571429	Manhattan
4	1621.0	NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ...	25.510526	Manhattan
...	...	...	...	...
358	1154.0	EBC HIGH SCHOOL FOR PUBLIC SERVICE–BUSHWICK	22.958824	Brooklyn
359	1035.0	BUSHWICK SCHOOL FOR SOCIAL JUSTICE	22.760000	Brooklyn
360	1060.0	ACADEMY OF URBAN PLANNING	23.900000	Brooklyn
361	1315.0	ALL CITY LEADERSHIP SECONDARY SCHOOL	17.793750	Brooklyn
362	1055.0	BUSHWICK LEADERS HIGH SCHOOL FOR ACADEMIC EXCE...	25.060000	Brooklyn

363 rows × 4 columns

In [210]:

fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'AVERAGE CLASS SIZE', 'boro']].dropna(),
                 trendline='ols',
                 x="AVERAGE CLASS SIZE", 
                 y="sat_score", 
                 size="AVERAGE CLASS SIZE", 
                 color="sat_score", 
                 size_max=15, 
                 hover_name="SCHOOL NAME", 
                 hover_data=['boro', 'AVERAGE CLASS SIZE','sat_score'],
                 color_continuous_scale=px.colors.sequential.Viridis, 
                 title='Average Class Size vs SAT Scores')
fig.show()

From the previous scatter plot we can extract that there is a correlation and in general schools with higher average in terms of class size seem to have also a higher average in SAT scores.

Schools final scores¶

In [211]:

combined.head(5)

Out[211]:

	DBN	SCHOOL NAME	Num of SAT Test Takers	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	SchoolName	AP Test Takers	Total Exams Taken	...	Location 1	lat	lon	school_dist	ap_per
0	01M292	HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES	29	355.0	404.0	363.0	1122.0	0	129.028846	197.038462	...	220 Henry Street\nNew York, NY 10002\n(40.7137...	40.713764	-73.985260	01	0.305756
1	01M448	UNIVERSITY NEIGHBORHOOD HIGH SCHOOL	91	383.0	423.0	366.0	1172.0	UNIVERSITY NEIGHBORHOOD H.S.	39.000000	49.000000	...	200 Monroe Street\nNew York, NY 10002\n(40.712...	40.712332	-73.984797	01	0.098985
2	01M450	EAST SIDE COMMUNITY SCHOOL	70	377.0	402.0	370.0	1149.0	EAST SIDE COMMUNITY HS	19.000000	21.000000	...	420 East 12 Street\nNew York, NY 10009\n(40.72...	40.729783	-73.983041	01	0.031773
3	01M509	MARTA VALLE HIGH SCHOOL	44	390.0	433.0	384.0	1207.0	0	129.028846	197.038462	...	145 Stanton Street\nNew York, NY 10002\n(40.72...	40.720569	-73.985673	01	0.351577
4	01M539	NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ...	159	522.0	574.0	525.0	1621.0	NEW EXPLORATIONS SCI,TECH,MATH	255.000000	377.000000	...	111 Columbia Street\nNew York, NY 10002\n(40.7...	40.718725	-73.979426	01	0.158091

5 rows × 161 columns

In [212]:

combined.describe()

Out[212]:

	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	AP Test Takers	Total Exams Taken	Number of Exams with scores 3 4 or 5	Total Cohort	CSD	NUMBER OF STUDENTS / SEATS FILLED	...	expgrade_span_max	zip	total_students	number_programs	priority08	priority09	priority10	lat	lon	ap_per
count	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	...	363.0	363.000000	363.000000	363.000000	363.0	363.0	363.0	363.000000	363.000000	363.000000
mean	404.119403	419.656716	399.662687	1223.438806	129.028846	197.038462	153.450000	193.870968	14.019284	171.762429	...	12.0	10724.719008	777.556474	1.834711	0.0	0.0	0.0	40.743533	-73.925345	0.210503
std	57.486468	64.804541	59.187101	178.223775	175.816044	310.125423	239.117292	224.190938	9.251380	184.956860	...	0.0	536.599348	815.206633	1.680307	0.0	0.0	0.0	0.081881	0.071633	0.167423
min	279.000000	312.000000	286.000000	887.000000	6.000000	6.000000	6.000000	1.000000	1.000000	10.000000	...	12.0	10002.000000	110.000000	1.000000	0.0	0.0	0.0	40.528229	-74.192154	0.011765
25%	371.000000	378.000000	365.000000	1113.000000	42.000000	56.000000	76.000000	82.000000	7.000000	80.311189	...	12.0	10301.000000	395.000000	1.000000	0.0	0.0	0.0	40.687143	-73.979256	0.069868
50%	397.000000	408.000000	390.000000	1193.000000	129.028846	197.038462	153.450000	111.000000	12.000000	105.166667	...	12.0	10472.000000	477.000000	1.000000	0.0	0.0	0.0	40.741205	-73.929154	0.172805
75%	417.500000	439.000000	411.500000	1266.500000	129.028846	197.038462	153.450000	193.870968	21.000000	153.784188	...	12.0	11216.500000	648.000000	2.000000	0.0	0.0	0.0	40.819472	-73.890124	0.307213
max	679.000000	735.000000	682.000000	2096.000000	2117.000000	3692.000000	2687.000000	1209.000000	32.000000	1329.052632	...	12.0	11694.000000	5458.000000	10.000000	0.0	0.0	0.0	40.888373	-73.726910	0.877747

8 rows × 68 columns

In [213]:

correlations = combined.corr()
correlations = correlations["sat_score"]
correlations

Out[213]:

SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.523140
                                     ...   
priority09                              NaN
priority10                              NaN
lat                               -0.121029
lon                               -0.132222
ap_per                             0.057171
Name: sat_score, Length: 68, dtype: float64

To compute the Schools Final Score from the correlations array, all the columns with a positive correlation bigger than the threshold (threshold=0.35) will be used.

In [214]:

positivie_correlations = correlations[correlations > 0.35]
positivie_correlations

Out[214]:

SAT Critical Reading Avg. Score         0.986820
SAT Math Avg. Score                     0.972643
SAT Writing Avg. Score                  0.987771
sat_score                               1.000000
AP Test Takers                          0.523140
Total Exams Taken                       0.514333
Number of Exams with scores 3 4 or 5    0.463245
NUMBER OF STUDENTS / SEATS FILLED       0.394626
NUMBER OF SECTIONS                      0.362673
AVERAGE CLASS SIZE                      0.381014
total_enrollment                        0.367857
asian_num                               0.475445
asian_per                               0.570730
white_num                               0.449559
white_per                               0.620718
female_num                              0.388631
N_s                                     0.423463
N_p                                     0.421530
total_students                          0.407827
Name: sat_score, dtype: float64

Before that, we will remove the columns that contain repeted information or may affect on the final score, this columns are:

asian_num
white_num

In [215]:

positivie_correlations = positivie_correlations.drop(labels=['asian_num', 'white_num'])

In [216]:

aux = combined[positivie_correlations.index]
aux

Out[216]:

	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	AP Test Takers	Total Exams Taken	Number of Exams with scores 3 4 or 5	NUMBER OF STUDENTS / SEATS FILLED	NUMBER OF SECTIONS	AVERAGE CLASS SIZE	total_enrollment	asian_per	white_per	female_num	N_s	N_p	total_students
0	355.0	404.0	363.0	1122.0	129.028846	197.038462	153.45	88.000000	4.000000	22.564286	422	14.0	1.7	163.0	379.000000	151.0	323.0
1	383.0	423.0	366.0	1172.0	39.000000	49.000000	10.00	105.687500	4.750000	22.231250	394	29.2	2.3	168.0	385.000000	46.0	299.0
2	377.0	402.0	370.0	1149.0	19.000000	21.000000	153.45	57.600000	2.733333	21.200000	598	9.7	10.4	271.0	598.208333	150.0	649.0
3	390.0	433.0	384.0	1207.0	129.028846	197.038462	153.45	69.642857	3.000000	23.571429	367	9.3	1.6	197.0	306.000000	69.0	401.0
4	522.0	574.0	525.0	1621.0	255.000000	377.000000	191.00	156.368421	6.157895	25.510526	1613	27.8	44.9	819.0	923.000000	736.0	1725.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
358	384.0	409.0	361.0	1154.0	47.000000	64.000000	13.00	150.941176	6.411765	22.958824	606	1.7	0.7	290.0	504.000000	205.0	513.0
359	348.0	353.0	334.0	1035.0	129.028846	197.038462	153.45	71.066667	3.266667	22.760000	434	0.5	1.8	224.0	295.000000	160.0	416.0
360	342.0	364.0	354.0	1060.0	76.000000	100.000000	10.00	102.375000	4.312500	23.900000	398	1.5	1.0	181.0	256.000000	127.0	313.0
361	428.0	465.0	422.0	1315.0	7.000000	8.000000	153.45	66.937500	3.812500	17.793750	263	5.3	1.5	126.0	206.000000	80.0	333.0
362	347.0	358.0	350.0	1055.0	34.000000	35.000000	18.00	132.333333	5.400000	25.060000	462	0.9	0.0	216.0	323.000000	164.0	362.0

363 rows × 17 columns

In [217]:

aux.describe()

Out[217]:

	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	AP Test Takers	Total Exams Taken	Number of Exams with scores 3 4 or 5	NUMBER OF STUDENTS / SEATS FILLED	NUMBER OF SECTIONS	AVERAGE CLASS SIZE	total_enrollment	asian_per	white_per	female_num	N_s	N_p	total_students
count	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000
mean	404.119403	419.656716	399.662687	1223.438806	129.028846	197.038462	153.450000	171.762429	6.491468	24.794077	793.035813	10.149587	8.430028	392.983471	598.208333	247.721763	777.556474
std	57.486468	64.804541	59.187101	178.223775	175.816044	310.125423	239.117292	184.956860	5.956961	3.555077	865.298394	14.716465	14.030145	426.246744	667.764179	283.769637	815.206633
min	279.000000	312.000000	286.000000	887.000000	6.000000	6.000000	6.000000	10.000000	1.000000	10.000000	141.000000	0.000000	0.000000	0.000000	70.000000	12.000000	110.000000
25%	371.000000	378.000000	365.000000	1113.000000	42.000000	56.000000	76.000000	80.311189	3.366071	22.642708	386.500000	1.300000	1.000000	183.000000	291.000000	102.000000	395.000000
50%	397.000000	408.000000	390.000000	1193.000000	129.028846	197.038462	153.450000	105.166667	4.375000	24.966667	471.000000	3.800000	2.100000	244.000000	374.000000	175.000000	477.000000
75%	417.500000	439.000000	411.500000	1266.500000	129.028846	197.038462	153.450000	153.784188	6.296703	27.157222	669.000000	10.750000	8.900000	374.500000	552.500000	279.000000	648.000000
max	679.000000	735.000000	682.000000	2096.000000	2117.000000	3692.000000	2687.000000	1329.052632	41.157895	37.023077	5332.000000	89.500000	82.100000	2315.000000	4768.000000	2804.000000	5458.000000

Now we have a dataset with the most signigicant columns. But as it is shown in the previous description this columns values are in different ranges, that's why we previously importes Sklearn preprocessing, now all this dataset values will be scaled to values between 0 and 1.

In [218]:

x = aux.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

In [219]:

df

Out[219]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
0	0.1900	0.217494	0.194444	0.194376	0.058280	0.051828	0.054998	0.059133	0.074705	0.464947	0.054132	0.156425	0.020706	0.070410	0.065773	0.049785	0.039828
1	0.2600	0.262411	0.202020	0.235732	0.015632	0.011666	0.001492	0.072543	0.093381	0.452622	0.048738	0.326257	0.028015	0.072570	0.067050	0.012178	0.035340
2	0.2450	0.212766	0.212121	0.216708	0.006158	0.004069	0.054998	0.036087	0.043163	0.414461	0.088037	0.108380	0.126675	0.117063	0.112433	0.049427	0.100785
3	0.2775	0.286052	0.247475	0.264682	0.058280	0.051828	0.054998	0.045216	0.049803	0.502216	0.043537	0.103911	0.019488	0.085097	0.050234	0.020415	0.054413
4	0.6075	0.619385	0.603535	0.607113	0.117954	0.100651	0.069004	0.110965	0.128440	0.573973	0.283568	0.310615	0.546894	0.353780	0.181567	0.259312	0.301982
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
358	0.2625	0.229314	0.189394	0.220844	0.019422	0.015735	0.002611	0.106850	0.134762	0.479547	0.089578	0.018994	0.008526	0.125270	0.092380	0.069126	0.075355
359	0.1725	0.096927	0.121212	0.122415	0.058280	0.051828	0.054998	0.046296	0.056444	0.472189	0.056444	0.005587	0.021924	0.096760	0.047893	0.053009	0.057218
360	0.1575	0.122931	0.171717	0.143093	0.033160	0.025502	0.001492	0.070031	0.082487	0.514375	0.049509	0.016760	0.012180	0.078186	0.039591	0.041189	0.037958
361	0.3725	0.361702	0.343434	0.354012	0.000474	0.000543	0.054998	0.043165	0.070036	0.288411	0.023502	0.059218	0.018270	0.054428	0.028948	0.024355	0.041698
362	0.1700	0.108747	0.161616	0.138958	0.013264	0.007868	0.004476	0.092743	0.109567	0.557301	0.061838	0.010056	0.000000	0.093305	0.053853	0.054441	0.047120

363 rows × 17 columns

In [220]:

df.describe()

Out[220]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
count	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000	363.000000
mean	0.312799	0.254508	0.287027	0.278279	0.058280	0.051828	0.054998	0.122635	0.136747	0.547461	0.125609	0.113403	0.102680	0.169755	0.112433	0.084428	0.124824
std	0.143716	0.153202	0.149462	0.147414	0.083286	0.084136	0.089190	0.140219	0.148338	0.131557	0.166692	0.164430	0.170891	0.184124	0.142138	0.101637	0.152432
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.230000	0.156028	0.199495	0.186931	0.017054	0.013565	0.026110	0.053304	0.058919	0.467849	0.047293	0.014525	0.012180	0.079050	0.047041	0.032235	0.053291
50%	0.295000	0.226950	0.262626	0.253102	0.058280	0.051828	0.054998	0.072148	0.084043	0.553848	0.063572	0.042458	0.025579	0.105400	0.064708	0.058381	0.068624
75%	0.346250	0.300236	0.316919	0.313896	0.058280	0.051828	0.054998	0.109006	0.131897	0.634910	0.101715	0.120112	0.108404	0.161771	0.102703	0.095630	0.100598
max	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000

In [221]:

df['final_score'] = df.sum(axis=1)
df['final_score'] = (df['final_score'] * 10) / 17
df

Out[221]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	final_score
0	0.1900	0.217494	0.194444	0.194376	0.058280	0.051828	0.054998	0.059133	0.074705	0.464947	0.054132	0.156425	0.020706	0.070410	0.065773	0.049785	0.039828	1.186626
1	0.2600	0.262411	0.202020	0.235732	0.015632	0.011666	0.001492	0.072543	0.093381	0.452622	0.048738	0.326257	0.028015	0.072570	0.067050	0.012178	0.035340	1.292734
2	0.2450	0.212766	0.212121	0.216708	0.006158	0.004069	0.054998	0.036087	0.043163	0.414461	0.088037	0.108380	0.126675	0.117063	0.112433	0.049427	0.100785	1.263724
3	0.2775	0.286052	0.247475	0.264682	0.058280	0.051828	0.054998	0.045216	0.049803	0.502216	0.043537	0.103911	0.019488	0.085097	0.050234	0.020415	0.054413	1.303027
4	0.6075	0.619385	0.603535	0.607113	0.117954	0.100651	0.069004	0.110965	0.128440	0.573973	0.283568	0.310615	0.546894	0.353780	0.181567	0.259312	0.301982	3.397787
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
358	0.2625	0.229314	0.189394	0.220844	0.019422	0.015735	0.002611	0.106850	0.134762	0.479547	0.089578	0.018994	0.008526	0.125270	0.092380	0.069126	0.075355	1.258947
359	0.1725	0.096927	0.121212	0.122415	0.058280	0.051828	0.054998	0.046296	0.056444	0.472189	0.056444	0.005587	0.021924	0.096760	0.047893	0.053009	0.057218	0.936425
360	0.1575	0.122931	0.171717	0.143093	0.033160	0.025502	0.001492	0.070031	0.082487	0.514375	0.049509	0.016760	0.012180	0.078186	0.039591	0.041189	0.037958	0.939801
361	0.3725	0.361702	0.343434	0.354012	0.000474	0.000543	0.054998	0.043165	0.070036	0.288411	0.023502	0.059218	0.018270	0.054428	0.028948	0.024355	0.041698	1.258644
362	0.1700	0.108747	0.161616	0.138958	0.013264	0.007868	0.004476	0.092743	0.109567	0.557301	0.061838	0.010056	0.000000	0.093305	0.053853	0.054441	0.047120	0.991267

363 rows × 18 columns

In [222]:

df['final_score'].describe()

Out[222]:

count    363.000000
mean       1.728054
std        1.060487
min        0.589832
25%        1.116121
50%        1.331321
75%        1.903177
max        8.611356
Name: final_score, dtype: float64

Now let's add the final_score column to the previous combined dataframe.

In [223]:

combined['final_score'] = df['final_score']
combined

Out[223]:

	DBN	SCHOOL NAME	Num of SAT Test Takers	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	SchoolName	AP Test Takers	Total Exams Taken	...	priority07	priority08	priority09	priority10	Location 1	lat	lon	school_dist	ap_per	final_score
0	01M292	HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES	29	355.0	404.0	363.0	1122.0	0	129.028846	197.038462	...	0	0.0	0.0	0.0	220 Henry Street\nNew York, NY 10002\n(40.7137...	40.713764	-73.985260	01	0.305756	1.186626
1	01M448	UNIVERSITY NEIGHBORHOOD HIGH SCHOOL	91	383.0	423.0	366.0	1172.0	UNIVERSITY NEIGHBORHOOD H.S.	39.000000	49.000000	...	0	0.0	0.0	0.0	200 Monroe Street\nNew York, NY 10002\n(40.712...	40.712332	-73.984797	01	0.098985	1.292734
2	01M450	EAST SIDE COMMUNITY SCHOOL	70	377.0	402.0	370.0	1149.0	EAST SIDE COMMUNITY HS	19.000000	21.000000	...	0	0.0	0.0	0.0	420 East 12 Street\nNew York, NY 10009\n(40.72...	40.729783	-73.983041	01	0.031773	1.263724
3	01M509	MARTA VALLE HIGH SCHOOL	44	390.0	433.0	384.0	1207.0	0	129.028846	197.038462	...	0	0.0	0.0	0.0	145 Stanton Street\nNew York, NY 10002\n(40.72...	40.720569	-73.985673	01	0.351577	1.303027
4	01M539	NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ...	159	522.0	574.0	525.0	1621.0	NEW EXPLORATIONS SCI,TECH,MATH	255.000000	377.000000	...	0	0.0	0.0	0.0	111 Columbia Street\nNew York, NY 10002\n(40.7...	40.718725	-73.979426	01	0.158091	3.397787
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
358	32K545	EBC HIGH SCHOOL FOR PUBLIC SERVICE–BUSHWICK	88	384.0	409.0	361.0	1154.0	EBC-HS FOR PUB SERVICE (BUSH)	47.000000	64.000000	...	0	0.0	0.0	0.0	1155 Dekalb Avenue\nBrooklyn, NY 11221\n(40.69...	40.694482	-73.929154	32	0.077558	1.258947
359	32K549	BUSHWICK SCHOOL FOR SOCIAL JUSTICE	75	348.0	353.0	334.0	1035.0	0	129.028846	197.038462	...	0	0.0	0.0	0.0	400 Irving Avenue\nBrooklyn, NY 11237\n(40.696...	40.696962	-73.910816	32	0.297301	0.936425
360	32K552	ACADEMY OF URBAN PLANNING	67	342.0	364.0	354.0	1060.0	Academy of Urban Planning	76.000000	100.000000	...	0	0.0	0.0	0.0	400 Irving Avenue\nBrooklyn, NY 11237\n(40.696...	40.696962	-73.910816	32	0.190955	0.939801
361	32K554	ALL CITY LEADERSHIP SECONDARY SCHOOL	39	428.0	465.0	422.0	1315.0	All City Leadership Secondary School	7.000000	8.000000	...	Then to New York City residents	0.0	0.0	0.0	321 Palmetto Street\nBrooklyn, NY 11237\n(40.6...	40.697408	-73.913153	32	0.026616	1.258644
362	32K556	BUSHWICK LEADERS HIGH SCHOOL FOR ACADEMIC EXCE...	23	347.0	358.0	350.0	1055.0	Bushwick Leaders High School for Academic Exce...	34.000000	35.000000	...	0	0.0	0.0	0.0	797 Bushwick Avenue\nBrooklyn, NY 11221\n(40.6...	40.694996	-73.927986	32	0.073593	0.991267

363 rows × 162 columns

In [224]:

combined.loc[combined['final_score'] > 5]

Out[224]:

	DBN	SCHOOL NAME	Num of SAT Test Takers	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	SchoolName	AP Test Takers	Total Exams Taken	...	Location 1	lat	lon	school_dist	ap_per	final_score
37	02M475	STUYVESANT HIGH SCHOOL	832	679.0	735.0	682.0	2096.0	STUYVESANT HS	1510.0	2819.0	...	345 Chambers Street\nNew York, NY 10282\n(40.7...	40.717746	-74.014049	02	0.457992	7.327012
61	03M485	FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...	531	566.0	564.0	577.0	1707.0	FIORELLO H.LAGUARDIA HS	691.0	1236.0	...	100 Amsterdam Avenue\nNew York, NY 10023\n(40....	40.773671	-73.985269	03	0.265259	5.108604
151	10X445	BRONX HIGH SCHOOL OF SCIENCE	731	632.0	688.0	649.0	1969.0	BRONX HS OF SCIENCE	1190.0	2435.0	...	75 West 205 Street\nBronx, NY 10468\n(40.87705...	40.877056	-73.889780	10	0.394955	6.613325
187	13K430	BROOKLYN TECHNICAL HIGH SCHOOL	1277	587.0	659.0	587.0	1833.0	BROOKLYN TECHNICAL HS	2117.0	3692.0	...	29 Ft Greene Place\nBrooklyn, NY 11217\n(40.68...	40.688107	-73.976745	13	0.397037	8.611356
276	22K405	MIDWOOD HIGH SCHOOL	824	478.0	519.0	476.0	1473.0	MIDWOOD HS	745.0	1223.0	...	2839 Bedford Avenue\nBrooklyn, NY 11210\n(40.6...	40.633335	-73.952916	22	0.193909	5.718548
304	26Q415	BENJAMIN N. CARDOZO HIGH SCHOOL	888	480.0	545.0	489.0	1514.0	BENJAMIN N. CARDOZO	676.0	1145.0	...	57-00 223Rd Street\nOakland Gardens, NY 11364\...	40.752392	-73.756083	26	0.172802	5.427189
305	26Q430	FRANCIS LEWIS HIGH SCHOOL	934	468.0	539.0	467.0	1474.0	FRANCIS LEWIS HS	697.0	1033.0	...	58-20 Utopia Parkway\nFresh Meadows, NY 11365\...	40.740556	-73.792848	26	0.167871	5.008304
322	28Q440	FOREST HILLS HIGH SCHOOL	762	456.0	497.0	454.0	1407.0	FOREST HILLS HS	733.0	1116.0	...	67-01 110 Street\nForest Hills, NY 11375\n(40....	40.729432	-73.845629	28	0.191184	5.021383

8 rows × 162 columns

In [225]:

combined.loc[combined['final_score'].idxmax()]

Out[225]:

DBN                                                        13K430
SCHOOL NAME                        BROOKLYN TECHNICAL HIGH SCHOOL
Num of SAT Test Takers                                       1277
SAT Critical Reading Avg. Score                               587
SAT Math Avg. Score                                           659
                                                ...              
lat                                                       40.6881
lon                                                      -73.9767
school_dist                                                    13
ap_per                                                   0.397037
final_score                                               8.61136
Name: 187, Length: 162, dtype: object

As we can see following our sclae, only 8 schools have a final sat score bigger than 5. The school with the best final sat score is BROOKLYN TECHNICAL HIGH SCHOOL. The next chart shows all the final sat score results for each school.

In [226]:

fig = px.bar(combined, x='SCHOOL NAME', y='final_score', title='Final score by school')
fig.show()

In [227]:

combined['final_score'].describe()

Out[227]:

count    363.000000
mean       1.728054
std        1.060487
min        0.589832
25%        1.116121
50%        1.331321
75%        1.903177
max        8.611356
Name: final_score, dtype: float64

Conclusions¶

This project combined and analyzed 7 data sets that contained information no only about SAT scores, also included schools of NY and student's demographic factors. The main objective of this project was to analyze and find which are the factors that influence the most performances in SATs.

From the New York Schools Sat Scores Distributions histogram we have seen that the majority of the schools have an SAT score between 1100 and 1300. Also, there is a small group with SAT scores over 1600, this is considered schools with a very high SAT score.

From the correlations plot we concluded that columns like total_enrollment have a good correlation with SAT score but the most important conclusion is that there is a high correlation between SAT score and the saf_t_11 and saf_s_11 columns which refer to the safety ranks of the schools.

After that, we studied the difference in the results depending on the respondent. In the Question values by respondent chart, we can see that all perceptions independently of the role of the participant are common. In general students are the ones that rate things slightly lower than teachers and parents. On the other hand, parents are the ones who rate things slightly higher. Teachers are in the middle quite near the student's ratings. Safety rating seems to be the aspect where students and parents differ the most, knowing that and the high correlation between safety and SAT scores we decided to explore deeply Safety Scores.

From the safety exploration and the Safety average by district interactive map, we have seen that safety influences the average Sat Score, there are no high SAT scores for schools with low safety scores. But there are schools with a low SAT score and a normal/high safety score and low values of other factors. In the Safety average by district interactive map we have seen that safest schools are located in Manhattan and in the nearby regions of Queens and Bronx.

In terms of ethnic groups, we have seen that Hispanic and black ethnic groups are very disadvantaged in comparison with White and Asian ethnic groups. In general, they form or formed part of schools with poor academic standards and low safety ranks.

We have seen that there is no strong evidence depending on students gender. Also, there is no evidence that AP exam preparation influences in the SAT scores.

In the last step, we have seen that there is a relationship between average class size and SAT scores, schools with bigger classes tend to have a better SAT score (also there are exceptions).

Finally, we have used the features correlations values with SAT_score and a threshold of 0.35 to determine which are the most important columns. After reducing this list of features we used the Scikit Learn Preprocessing MinMaxScaler to normalize all these values in a range of 0 and 1. After we summed all the results a divided it by the total num of columns. Last, the results were multiplied by 10 to convert it in a 0 to 10 scale. That's how we created the final_score for each school.