Where Do the Best Students Come From?¶

Abstract¶

• The aim of the project is to determine what factors may maximize high school students' academic performance and, consequently, their chances of being admitted to highly ranked colleges and universities. A good middle school background is certainly an important factor, but is it the only one? What makes a decent high school, and what do such schools have in common? The answers to these questions may help us better understand primary and secondary education, including the major problems that exist there.
• For these purposes we use several datasets published by New York City on NYC Open Data, which we merge together in order to carry out our analysis. The main object of our investigation is the school-level SAT score (available for 2012); the SAT is a standardized test widely used in college admissions in the United States. These datasets also contain a number of school characteristics that may help explain the average performance of a school's graduates. The results of analyzing the combined datasets should be quite representative, since New York City has the largest public school system in the United States, with more than one million students taught in 1,200 separate public schools.
• At the end of our analysis we conclude by constructing a portrait of a typical high school student who is likely to be academically successful, as measured by the SAT score. Such a student grows up in an Asian cultural environment (perhaps tiger-parented), speaks fluent English, feels safe and not bullied, is highly motivated to study beyond the standardized high school program, attends a school where teachers are well paid and students are drawn from all over the city, and has a solid foundation acquired in middle school and earlier. At the same time, we argue that the SAT score is not a good predictor of students' subsequent success; rather, it reflects the sheer ability to solve problems quickly within a fixed time frame and can only partially describe a student's full potential. Moreover, the distribution of SAT scores among schools is indicative of large disparities related to socio-cultural differences and wealth/income, which are highly detrimental to our society.

Initial Data Preparation¶

We have a number of datasets from NYC Open Data that can be used to explore various characteristics of schools in the city. Since SAT scores are available only for 2012, all the other data cover roughly the same period. Overall, we have 8 distinct datasets, which we merge together based on school codes. The list of datasets is presented below:

In [1]:
import pandas as pd
import numpy as np
import re

data_files = [
"2010__AP__College_Board__School_Level_Results.csv",
"2010-2011_Class_Size_-_School-level_detail.csv",
"2006_-_2012_School_Demographics_and_Accountability_Snapshot.csv",
"2012_SAT_Results.csv",
'Archived_DOE_High_School_Directory_2014-2015.csv'
]

data = {}

# read each file and store it in a dictionary keyed by the file name
for f in data_files:
    d = pd.read_csv("Projects_data/{0}".format(f))
    data[f.replace(".csv", "")] = d

In [2]:
# load additional data on attendance and surveys of students, parents and teachers
School_Attendance_and_Enrollment = pd.read_csv("Projects_data/2010_-_2011_School_Attendance_and_Enrollment_Statistics_by_District.csv")
School_Attendance_and_Enrollment['CSD'] = School_Attendance_and_Enrollment.index + 1

# broadcast the schoolwide pupil-teacher ratio to every row of each school
class_size = data["2010-2011_Class_Size_-_School-level_detail"]
pupil_teacher_ratio = class_size[class_size['SCHOOLWIDE PUPIL-TEACHER RATIO'].notna()]
class_size = pd.merge(class_size, pupil_teacher_ratio[['SCHOOL CODE', 'SCHOOLWIDE PUPIL-TEACHER RATIO']], on='SCHOOL CODE', how='left')
class_size = class_size.rename(columns={'SCHOOLWIDE PUPIL-TEACHER RATIO_y': 'Student-Teacher Ratio'})
data["2010-2011_Class_Size_-_School-level_detail"] = class_size

all_survey = pd.read_csv("Projects_data/masterfile11_gened_final.csv")
d75_survey = pd.read_csv("Projects_data/masterfile11_d75_final.csv")
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

# create a list of survey topics to extract for further analysis
survey_fields = [
    "DBN",
    "rr_s",
    "rr_t",
    "rr_p",
    "N_s",
    "N_t",
    "N_p",
    "saf_p_11",
    "com_p_11",
    "eng_p_11",
    "aca_p_11",
    "saf_t_11",
    "com_t_11",
    "eng_t_11",
    "aca_t_11",
    "saf_s_11",
    "com_s_11",
    "eng_s_11",
    "aca_s_11",
    "saf_tot_11",
    "com_tot_11",
    "eng_tot_11",
    "aca_tot_11",
]

survey = survey.loc[:,survey_fields]
data["survey"] = survey

In [3]:
# make school codes comparable for further matching
data["Archived_DOE_High_School_Directory_2014-2015"]["DBN"] = data["Archived_DOE_High_School_Directory_2014-2015"]["dbn"]

# zero-pad single-digit district codes so they match the two-character DBN prefix
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation

data["2010-2011_Class_Size_-_School-level_detail"]["padded_csd"] = data["2010-2011_Class_Size_-_School-level_detail"]["CSD"].apply(pad_csd)
data["2010-2011_Class_Size_-_School-level_detail"]["DBN"] = data["2010-2011_Class_Size_-_School-level_detail"]["padded_csd"] + data["2010-2011_Class_Size_-_School-level_detail"]["SCHOOL CODE"]
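A more concise equivalent of the padding step above is pandas' vectorised string `zfill`; a quick sketch on toy values (the school codes here are invented for illustration):

```python
import pandas as pd

# toy class-size-style frame with numeric district codes (invented values)
df = pd.DataFrame({"CSD": [1, 13], "SCHOOL CODE": ["M015", "K419"]})

# zero-pad the district to two characters and prepend it to the school code
df["DBN"] = df["CSD"].astype(str).str.zfill(2) + df["SCHOOL CODE"]
print(df["DBN"].tolist())  # ['01M015', '13K419']
```

`str.zfill(2)` leaves two-character codes untouched, so it behaves exactly like the `pad_csd` helper.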


Next, we should also make SAT and AP scores numeric for further analysis and prepare schools' longitude and latitude columns to locate them on the New York City map.

In [4]:
# convert SAT subsections' scores into numeric values and calculate the total score
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["2012_SAT_Results"][c] = pd.to_numeric(data["2012_SAT_Results"][c], errors="coerce")
data['2012_SAT_Results']['sat_score'] = data['2012_SAT_Results'][cols[0]] + data['2012_SAT_Results'][cols[1]] + data['2012_SAT_Results'][cols[2]]

# convert AP scores into numeric values
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["2010__AP__College_Board__School_Level_Results"][col] = pd.to_numeric(data["2010__AP__College_Board__School_Level_Results"][col], errors="coerce")

# create functions to extract latitude and longitude values out of rows and apply them
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["Archived_DOE_High_School_Directory_2014-2015"]["lat"] = data["Archived_DOE_High_School_Directory_2014-2015"]["Location 1"].apply(find_lat)
data["Archived_DOE_High_School_Directory_2014-2015"]["lon"] = data["Archived_DOE_High_School_Directory_2014-2015"]["Location 1"].apply(find_lon)

data["Archived_DOE_High_School_Directory_2014-2015"]["lat"] = pd.to_numeric(data["Archived_DOE_High_School_Directory_2014-2015"]["lat"], errors="coerce")
data["Archived_DOE_High_School_Directory_2014-2015"]["lon"] = pd.to_numeric(data["Archived_DOE_High_School_Directory_2014-2015"]["lon"], errors="coerce")


In our further analysis we consider only students in grades 9-12 who are enrolled in the General Education program, as well as school demographics data for the 2011-12 school year. In addition, we select only the 2006 total cohort of students, who entered 9th grade in a given school in that year and graduated in 2010.

In [5]:
# choose only students in grades 9-12 enrolled in the General Education program
class_size = data["2010-2011_Class_Size_-_School-level_detail"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

# aggregate values in Class Size file by school codes
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["2010-2011_Class_Size_-_School-level_detail"] = class_size

# choose only values for 2011 and 2012 in School Demographics file
demographics = data["2006_-_2012_School_Demographics_and_Accountability_Snapshot"]
data["2006_-_2012_School_Demographics_and_Accountability_Snapshot"] = demographics[demographics["schoolyear"] == 20112012]

# choose only the most recent cohort of students
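The cohort filter itself is omitted above. A minimal sketch of what it might look like, assuming a hypothetical `cohort` column in the demographics file (the column name and all values here are invented for illustration):

```python
import pandas as pd

# toy demographics-style frame; the "cohort" column name is an assumption
demo = pd.DataFrame({
    "DBN": ["01M015", "01M015", "13K419"],
    "cohort": ["2005", "2006", "2006"],
})

# keep only students who entered 9th grade in 2006 (graduating in 2010)
demo_2006 = demo[demo["cohort"] == "2006"]
print(len(demo_2006))  # 2
```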


When all datasets are prepared, we combine them into a single one for further analysis.

In [6]:
# create a merged dataset
combined = data["2012_SAT_Results"]
combined = combined.merge(data["2010__AP__College_Board__School_Level_Results"], on="DBN", how="left").copy()

# create list of datasets for merging and loop through it
to_merge = ["2010-2011_Class_Size_-_School-level_detail", "2006_-_2012_School_Demographics_and_Accountability_Snapshot", "survey", "Archived_DOE_High_School_Directory_2014-2015"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner").copy()

combined = combined.merge(School_Attendance_and_Enrollment, on="CSD", how="left").copy()

# create function to extract school district codes
def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

print('Number of rows and columns in combined dataset:', combined.shape)

Number of rows and columns in combined dataset: (363, 163)


After compiling the dataset, let's check the variable of major interest: the distribution of SAT scores among New York schools.

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

fig = plt.figure(figsize=(17,8))

sns.set(palette='summer')
sns.distplot(combined['sat_score'], bins=60, hist=True)

plt.title("Distribution of SAT Scores among New York Schools",
fontweight="bold", fontsize=15, y=1.06)
plt.yticks([])
plt.show()


The histogram tells us that the distribution of SAT scores is unimodal and skewed to the right; relatively few schools score above the 1600 mark on average.
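The right skew can be quantified with the sample skewness; a self-contained sketch on simulated school averages (the data here are synthetic, not the notebook's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# simulate right-skewed school-average scores: a lognormal bulk above a floor
scores = pd.Series(1000 + rng.lognormal(mean=5, sigma=0.5, size=400))

# a positive sample skewness confirms the long right tail
print(scores.skew() > 0)  # True
```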

Environment Safety Perception Might Be Quite Important to Perform Well on SAT¶

Let's start our analysis by exploring correlations between the major variables. The NYC school survey contains data on teachers', students' and parents' perceptions of safety, communication, engagement and academic expectations. Investigating their relationship with SAT scores may help us better understand the determinants of school performance.

In [8]:
# create correlation matrix
correlations = combined.corr()
survey_fields.remove("DBN")
print('Variables correlation with the SAT score:\n\n',correlations["sat_score"])

Variables correlation with the SAT score:

SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.542641
...
total_students                     0.411785
number_programs                    0.118241
lat                               -0.124166
lon                               -0.137163
YTD % Attendance (Avg)             0.271784
Name: sat_score, Length: 74, dtype: float64

In [9]:
fig = plt.figure(figsize=(15,12))

sns.barplot(y=survey_fields, x=correlations["sat_score"][survey_fields], palette='summer')

# list of variables to show on the graph with their respective correlations
y_ticks = [
    'Student Response Rate',
    'Teacher Response Rate',
    'Parent Response Rate',
    'Number of student respondents',
    'Number of teacher respondents',
    'Number of parent respondents',
    'Safety and Respect score based on parent responses',
    'Communication score based on parent responses',
    'Engagement score based on parent responses',
    'Academic expectations score based on parent responses',
    'Safety and Respect score based on teacher responses',
    'Communication score based on teacher responses',
    'Engagement score based on teacher responses',
    'Academic expectations score based on teacher responses',
    'Safety and Respect score based on student responses',
    'Communication score based on student responses',
    'Engagement score based on student responses',
    'Academic expectations score based on student responses',
    'Safety and Respect total score',
    'Communication total score',
    'Engagement total score',
    'Academic expectations total score',
]

plt.title("School Survey Results' Correlation with SAT Scores" ,
fontweight="bold", fontsize = 20, y=1.03)
locs, labels=plt.yticks()
plt.yticks(locs, y_ticks)
plt.show()


There is a significant positive correlation between the number of survey responses (closely related to the total number of students in a school) and SAT scores: the best public schools usually have over 1,000 students, while the citywide average is around 500 students per school. This result is not surprising, as many such schools are magnet schools, i.e. they may draw students from outside defined school zones and provide a specialised curriculum, while many parents across the city, quite naturally, want a better future for their children and try to place them in such schools.

We may also observe that the students' response rate is positively correlated with the SAT score, which might be explained by self-selection: students confident in their test results may also be more responsive to surveys. In addition, teachers' and students' perception of safety and respect seems to be related to higher SAT scores, which is less the case for parents, who are not direct participants in the studying process. Therefore safety, or, say, the general well-being of the surroundings, matters: the total score for this indicator has the highest correlation among the totals, indicating its overall (i.e. across all survey participants) higher significance. Finally, it is noteworthy that students' academic expectations correlate with SAT scores more strongly than the analogous indicators for teachers or parents. This may mean that students are able to assess their abilities more precisely than even the teachers who teach them.
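The ranking described above can be read straight off the correlation series; a toy sketch (all values invented) of sorting survey fields by their correlation with the SAT score:

```python
import pandas as pd

# toy frame mimicking two survey totals and SAT scores (invented values)
df = pd.DataFrame({
    "sat_score":  [1100, 1250, 1400, 1500],
    "saf_tot_11": [6.0, 6.8, 7.5, 8.1],
    "com_tot_11": [6.5, 6.4, 6.9, 6.6],
})

# correlation of each survey field with the SAT score, strongest first
corr = df.corr()["sat_score"].drop("sat_score").sort_values(ascending=False)
print(corr.index[0])  # saf_tot_11
```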

Knowing that, among the totals considered, the Safety and Respect total score correlates most strongly with schools' SAT scores, let's explore the relationship in greater detail with a scatter plot.

In [10]:
fig = plt.figure(figsize=(7,5))
combined = combined.rename(columns={'saf_tot_11':'Safety and Respect Total Score', 'sat_score':'SAT Score'})

sns.regplot(x='Safety and Respect Total Score', y='SAT Score', data=combined)
plt.title("Environment Safety and Respect Perception Correlation with SAT Scores" ,
fontweight="bold", fontsize = 15, y=1.06)

plt.show()


We can see that there is some positive correlation between safety and SAT scores, with a number of outliers in the upper-right corner. At the same time, there is also a group of schools with a comparatively high safety level but lower SAT scores, in the lower-right part of the plot. Nonetheless, it doesn't work the other way around: no school with a Safety and Respect total score below 6 has an SAT score above 1400. Presumably, this may be explained by the fact that unsafe districts are usually populated by poorer people with lower-paid jobs; generally speaking, it is quite rare to find schools with well-performing students there.
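The "no low-safety school above 1400" observation is exactly the kind of claim a boolean filter can verify; a sketch on toy school-level data (values invented, column names follow the renames in the cell above):

```python
import pandas as pd

# toy school-level data (invented values) with the renamed columns
df = pd.DataFrame({
    "Safety and Respect Total Score": [5.2, 5.8, 6.5, 7.9],
    "SAT Score": [1050, 1320, 1450, 1810],
})

# count schools that are both low-safety (< 6) and high-scoring (> 1400)
violations = df[(df["Safety and Respect Total Score"] < 6)
                & (df["SAT Score"] > 1400)]
print(len(violations))  # 0
```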

To check this conjecture, let's inspect how the Safety and Respect Score is distributed in New York City by school districts.

In [11]:
from mpl_toolkits.basemap import Basemap

# aggregated data by school districts for mapping
combined_grouped = combined.groupby('school_dist').agg(np.mean)
combined_grouped.reset_index(inplace=True)

fig = plt.figure(figsize=(18, 18))

longitudes = list(combined_grouped['lon'])
latitudes = list(combined_grouped['lat'])

# first map: average Safety and Respect score by district
ax = fig.add_subplot(1, 2, 1)
ax.set_title("Safety and Respect Score by School Districts", fontweight="bold", fontsize=15, y=1.05)
# set map boundaries
map = Basemap(llcrnrlat=40.496044,
              urcrnrlat=40.915256,
              llcrnrlon=-74.255735,
              urcrnrlon=-73.700272, epsg=2263)

# set map background and locate school districts with their mean values
map.arcgisimage(service='World_Topo_Map', xpixels = 1500, verbose= True)
map.scatter(longitudes, latitudes, s=50, zorder=2, latlon=True, c=combined_grouped["Safety and Respect Total Score"], cmap='winter')
map.drawcounties(linewidth=0.5, linestyle='solid', color='k', antialiased=1, facecolor='none', ax=None, zorder=2, drawbounds=False)

# set borough names
plt.annotate('Brooklyn', xy=(0.45, 0.35), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Queens', xy=(0.75, 0.55), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Bronx', xy=(0.65, 0.85), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Manhattan', xy=(0.43, 0.65), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Staten Island', xy=(0.1, 0.25), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.colorbar(label='Safety and Respect Score',fraction=0.046, pad=0.04, orientation='horizontal')

# the same procedure is done for the second map: average SAT score by district
ax = fig.add_subplot(1, 2, 2)
ax.set_title('Average SAT Scores by School Districts', fontweight="bold", fontsize=15, y=1.05)
map = Basemap(llcrnrlat=40.496044,
              urcrnrlat=40.915256,
              llcrnrlon=-74.255735,
              urcrnrlon=-73.700272, epsg=2263)
#http://server.arcgisonline.com/arcgis/rest/services

map.arcgisimage(service='World_Topo_Map', xpixels = 1500, verbose= True)
map.scatter(longitudes, latitudes, s=50, zorder=2, latlon=True, c=combined_grouped["SAT Score"], cmap='winter')

map.drawcounties(linewidth=0.5, linestyle='solid', color='k', antialiased=1, facecolor='none', ax=None, zorder=2, drawbounds=False)
plt.annotate('Brooklyn', xy=(0.45, 0.35), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Queens', xy=(0.75, 0.55), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Bronx', xy=(0.65, 0.85), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Manhattan', xy=(0.43, 0.65), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Staten Island', xy=(0.1, 0.25), xycoords='axes fraction', fontweight="bold",fontsize = 15)

plt.show()



From the plots above we may note that, overall, Brooklyn school districts are perceived as the least safe, followed by the Bronx and Queens. Nonetheless, the two safest school districts are also in Brooklyn, indicating that this measure is not homogeneous within boroughs. On the other hand, the best-performing school district in terms of SAT scores is located in Brooklyn, closely followed by districts in Staten Island and Queens. Overall, these maps show some divergence in the geographical distribution of the two variables, even though the overall correlation is high. It is then plausible that richer districts are often safer and have higher education-related funding, so the best-performing schools are located there; i.e. the correlation between standard of living and academic performance should be even more pervasive than that with safety perception alone.

Income Effect Might Have Higher Explanatory Power for SAT Scores Than Wealth Effect¶

To confirm or reject the above proposition, we explore in greater detail the average price per square foot across New York City as a possible proxy for the wealth distribution at the neighborhood level. For that purpose we use 2012 annual property sales data from the NYC Department of Finance. To make the analysis more representative and accurate, we drop unrealistically small and extremely large values for sale prices and footage, so as to consider only average properties. In addition, we supplement the map with data on Adjusted Gross Income per Tax Return, reported by the Internal Revenue Service at the ZIP code level and defined as the gross individual income reported by taxpayers, adjusted for education expenses, alimony payments and contributions to retirement accounts.
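The fixed price and footage cut-offs used in the next cell are one option; an alternative is quantile-based trimming, which adapts to the data. A sketch on simulated sale prices (the 1%/99% thresholds are an illustrative choice, not the notebook's method):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# simulated, heavy-tailed sale prices
prices = pd.Series(rng.lognormal(mean=13, sigma=1.0, size=1000))

# keep the central 98% of the distribution, dropping both extreme tails
lo, hi = prices.quantile([0.01, 0.99])
trimmed = prices[(prices >= lo) & (prices <= hi)]
print(len(trimmed) < len(prices))  # True
```

Quantile trimming avoids hard-coding dollar thresholds, at the cost of always discarding a fixed share of observations even when they are legitimate.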

In [12]:
# download property data

# merge data and get rid of outliers and unrealistic values
property_data = pd.concat([manhattan, bronx, brooklyn, queens, staten_island], axis=0)
property_data = property_data.iloc[:,[10,14,19]].dropna()
property_data.columns = ['ZIP Code','Land Square Feet','Sale price']
property_data.to_csv('Projects_data/NYC_property_data')
property_data = property_data[(property_data['ZIP Code'] != 'ZIP CODE\n') & (property_data['ZIP Code'] != 0)]
property_data = property_data.astype(int)
property_data = property_data[(property_data['Sale price']>50000) & (property_data['Sale price']<5000000)]
property_data = property_data[(property_data['Land Square Feet']>500) & (property_data['Land Square Feet']<5000)]
property_data['Price Per Square Foot'] = property_data['Sale price']/property_data['Land Square Feet']
property_data_aggregated = round(property_data.groupby('ZIP Code').agg(np.mean),2)
property_data_aggregated.reset_index(inplace=True)

combined = pd.merge(combined, adjusted_gross_income, left_on='zip', right_on='ZIP Code', how='left')

# ZIP Code Definitions of New York City Neighborhoods (NYC Department of Health) webfile was manually
# adjusted to make a correspondence between ZIP codes and neighborhoods for mapping
NYC_zips_dist = pd.read_csv('Projects_data/NYC_zips_dist')
property_data_merged = pd.merge(NYC_zips_dist, property_data_aggregated, on='ZIP Code', how='left')
combined = pd.merge(combined, property_data_aggregated, left_on='zip', right_on='ZIP Code', how='left')

# download locations of all ZIP codes in New York
import json
from urllib.request import urlopen
with urlopen('https://raw.githubusercontent.com/gdobler/nycep/master/d3/data/nyc-zip-code.json') as f:
    geodata = json.load(f)

# create list with all New York ZIP codes for complete mapping
zips = []
for i in geodata['features']:
    zips.append(i['properties']['ZIP'])

# create dataset for mapping
zips = pd.DataFrame(zips).astype(int)
zips.columns = ['ZIP Code']
GEO_zips = zips.drop_duplicates()
property_data_for_graph = pd.merge(GEO_zips, property_data_merged, on='ZIP Code', how='left')
property_data_for_graph = pd.merge(property_data_for_graph, adjusted_gross_income, on='ZIP Code', how='left')

In [13]:
import plotly.express as px

# set variable to show on the map
property_data_for_graph['Neighborhood'] = property_data_for_graph['Neighborhood'].fillna('No data')
property_data_for_graph['color'] = property_data_for_graph['Price Per Square Foot'].fillna(-500) # set minimal color for colorscale
property_data_for_graph['Avg. Price Per Square Foot'] = property_data_for_graph['Price Per Square Foot'].fillna('No data')
property_data_for_graph['Gross Income per Tax Return'] = property_data_for_graph['Adjusted Gross Income per Tax Return'].fillna('No data')
combined['School Average SAT Score'] = combined['SAT Score'].fillna('No data')
fig = px.choropleth_mapbox(property_data_for_graph, geojson=geodata, color="